302 files changed, 739 insertions, 64789 deletions
diff --git a/contributors/design-proposals/Design_Proposal_TEMPLATE.md b/contributors/design-proposals/Design_Proposal_TEMPLATE.md index 9f3a683b..f0fbec72 100644 --- a/contributors/design-proposals/Design_Proposal_TEMPLATE.md +++ b/contributors/design-proposals/Design_Proposal_TEMPLATE.md @@ -1,38 +1,6 @@ -# <Title> +Design proposals have been archived. -Status: Pending +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Version: Alpha | Beta | GA - -Implementation Owner: TBD - -## Motivation - -<2-6 sentences about why this is needed> - -## Proposal - -<4-6 description of the proposed solution> - -## User Experience - -### Use Cases - -<enumerated list of use cases for this feature> - -<in depth description of user experience> - -<*include full examples*> - -## Implementation - -<in depth description of how the feature will be implemented. in some cases this may be very simple.> - -### Client/Server Backwards/Forwards compatibility - -<define behavior when using a kubectl client with an older or newer version of the apiserver (+-1 version)> - -## Alternatives considered - -<short description of alternative solutions to be considered> +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/OWNERS b/contributors/design-proposals/OWNERS deleted file mode 100644 index 7bda97c6..00000000 --- a/contributors/design-proposals/OWNERS +++ /dev/null @@ -1,22 +0,0 @@ -# See the OWNERS docs at https://go.k8s.io/owners - -reviewers: - - brendandburns - - dchen1107 - - jbeda - - lavalamp - - smarterclayton - - thockin - - wojtek-t - - bgrant0607 -approvers: - - brendandburns - - dchen1107 - - jbeda - - lavalamp - - smarterclayton - - thockin - - wojtek-t - - bgrant0607 -labels: - - kind/design diff --git a/contributors/design-proposals/README.md b/contributors/design-proposals/README.md index 617713b2..4abc4bab 100644 --- a/contributors/design-proposals/README.md +++ b/contributors/design-proposals/README.md @@ -1,16 +1,9 @@ -# Kubernetes Design Documents and Proposals +Design proposals have been archived. -**The Design Proposal process has been deprecated in favor of [Kubernetes Enhancement Proposals (KEP)][keps]. These documents are here for historical purposes only.** +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). ---- +For information on the replacement design proposal process, see the [Kubernetes Enhancement Proposals (KEP)](https://github.com/kubernetes/enhancements/tree/master/keps/sig-architecture/0000-kep-process) process. -This directory contains Kubernetes design documents and accepted design proposals. -For a design overview, please see [the architecture document](architecture/architecture.md). +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first. -Note that a number of these documents are historical and may be out of date or unimplemented. - -TODO: Add the current status to each document and clearly indicate which are up to date. - - -[keps]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-architecture/0000-kep-process diff --git a/contributors/design-proposals/api-machinery/OWNERS b/contributors/design-proposals/api-machinery/OWNERS deleted file mode 100644 index ef142b0f..00000000 --- a/contributors/design-proposals/api-machinery/OWNERS +++ /dev/null @@ -1,8 +0,0 @@ -# See the OWNERS docs at https://go.k8s.io/owners - -reviewers: - - sig-api-machinery-leads -approvers: - - sig-api-machinery-leads -labels: - - sig/api-machinery diff --git a/contributors/design-proposals/api-machinery/add-new-patchStrategy-to-clear-fields-not-present-in-patch.md b/contributors/design-proposals/api-machinery/add-new-patchStrategy-to-clear-fields-not-present-in-patch.md index d2b894d2..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/add-new-patchStrategy-to-clear-fields-not-present-in-patch.md +++ b/contributors/design-proposals/api-machinery/add-new-patchStrategy-to-clear-fields-not-present-in-patch.md @@ -1,405 +1,6 @@ -Add new patchStrategy to clear fields not present in the patch -============= +Design proposals have been archived. -We introduce a new struct tag `patchStrategy:"retainKeys"` and -a new optional directive `$retainKeys: <list of fields>` in the patch. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -The proposal of Full Union is in [kubernetes/community#388](https://github.com/kubernetes/community/pull/388). 
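For orientation before the capability comparison below, here is a minimal sketch of where the proposed tag sits on a Go API field. The `Example` and `UnionType` names are hypothetical; the proposal's own examples further down (ContainerStatus, DeploymentSpec, PodSpec.Volumes) show the real fields and the matching `$retainKeys` patches.

```go
// Hypothetical types for illustration only; see the Examples section below
// for the real API fields this proposal targets.
package example

// UnionType is a one-of: at most one member is expected to be set at a time.
type UnionType struct {
	Foo *FooOptions `json:"foo,omitempty"`
	Bar *BarOptions `json:"bar,omitempty"`
}

type FooOptions struct {
	Value string `json:"value,omitempty"`
}

type BarOptions struct {
	Value string `json:"value,omitempty"`
}

// With patchStrategy:"retainKeys", a patch carrying a $retainKeys list tells
// the server to merge the listed keys and clear every key of Union that is
// absent from that list.
type Example struct {
	Union UnionType `json:"union,omitempty" patchStrategy:"retainKeys"`
}
```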
-| Capability | Supported By This Proposal | Supported By Full Union | -|---|---|---| -| Auto clear missing fields on patch | X | X | -| Merge union fields on patch | X | X | -| Validate only 1 field set on type | | X | -| Validate discriminator field matches one-of field | | X | -| Support non-union patchKey | X | TBD | -| Support arbitrary combinations of set fields | X | | - -## Use cases - -- As a user patching a map, I want keys mutually exclusive with those that I am providing to automatically be cleared. - -- As a user running kubectl apply, when I update a field in my configuration file, -I want mutually exclusive fields never specified in my configuration to be cleared. - -## Examples: - -- General Example: Keys in a Union are mutually exclusive. Clear unspecified union values in a Union that contains a discriminator. - -- Specific Example: When patching a Deployment .spec.strategy, clear .spec.strategy.rollingUpdate -if it is not provided in the patch so that changing .spec.strategy.type will not fail. - -- General Example: Keys in a Union are mutually exclusive. Clear unspecified union values in a Union -that does not contain a discriminator. - -- Specific Example: When patching a Pod .spec.volume, clear all volume fields except the one specified in the patch. - -## Proposed Changes - -### APIs - -**Scope**: - -| Union Type | Supported | -|---|---| -| non-inlined non-discriminated union | Yes | -| non-inlined discriminated union | Yes | -| inlined union with [patchMergeKey](/contributors/devel/sig-architecture/api-conventions.md#strategic-merge-patch) only | Yes | -| other inlined union | No | - -For the inlined union with patchMergeKey, we move the tag to the parent struct's instead of -adding some logic to lookup the metadata in go struct of the inline union. -Because the limitation of the latter is that the metadata associated with -the inlined APIs will not be reflected in the OpenAPI schema. - -#### Tags - -old tags: - -1) `patchMergeKey`: -It is the key to distinguish the entries in the list of non-primitive types. It must always be -present to perform the merge on the list of non-primitive types, and will be preserved. - -2) `patchStrategy`: -It indicates how to generate and merge a patch for lists. It could be `merge` or `replace`. It is optional for lists. - -new tags: - -`patchStrategy: "retainKeys"`: - -We introduce a new optional directive `$retainKeys` to support the new patch strategy. - -`$retainKeys` directive has the following properties: -- It contains a list of strings. -- All fields needing to be preserved must be present in the `$retainKeys` list. -- The fields that are present will be merged with live object. -- All of the missing fields will be cleared when patching. -- All fields in the `$retainKeys` list must be a superset or the same as the fields present in the patch. - -A new patch will have the same content as the old patch and an additional new directive. -It will be backward compatible. - -#### When the patch doesn't have `$retainKeys` directive - -When the patch doesn't have `$retainKeys` directive, even for a type with `patchStrategy: "retainKeys"`, -the server won't treat the patch with the retainKeys logic. - -This will guarantee the backward compatibility: old patch behaves the same as before on the new server. - -#### When the patch has fields that not present in the `$retainKeys` list - -The server will reject the patch in this case. 
- -This is an invalid patch: - -```yaml -union: - $retainKeys: - - foo - foo: a - bar: x -``` - -#### When the `$retainKeys` list has fields that are not present in the patch - -The server will merge the change and clear the fields not present in the `$retainKeys` list - -This is a valid patch: -```yaml -union: - $retainKeys: - - foo - - bar - foo: a -``` - -#### Examples - -1) Non-inlined non-discriminated union: - -Type definition: -```go -type ContainerStatus struct { - ... - // Add patchStrategy:"retainKeys" - State ContainerState `json:"state,omitempty" protobuf:"bytes,2,opt,name=state" patchStrategy:"retainKeys"`` - ... -} -``` -Live object: -```yaml -state: - running: - startedAt: ... -``` -Local file config: -```yaml -state: - terminated: - exitCode: 0 - finishedAt: ... -``` -Patch: -```yaml -state: - $retainKeys: - - terminated - terminated: - exitCode: 0 - finishedAt: ... -``` -Result after merging -```yaml -state: - terminated: - exitCode: 0 - finishedAt: ... -``` - -2) Non-inlined discriminated union: - -Type definition: -```go -type DeploymentSpec struct { - ... - // Add patchStrategy:"retainKeys" - Strategy DeploymentStrategy `json:"strategy,omitempty" protobuf:"bytes,4,opt,name=strategy" patchStrategy:"retainKeys"` - ... -} -``` -Since there are no fields associated with `recreate` in `DeploymentSpec`, I will use a generic example. - -Live object: -```yaml -unionName: - discriminatorName: foo - fooField: - fooSubfield: val1 -``` -Local file config: -```yaml -unionName: - discriminatorName: bar - barField: - barSubfield: val2 -``` -Patch: -```yaml -unionName: - $retainKeys: - - discriminatorName - - barField - discriminatorName: bar - barField: - barSubfield: val2 -``` -Result after merging -```yaml -unionName: - discriminatorName: bar - barField: - barSubfield: val2 -``` - -3) Inlined union with `patchMergeKey` only. -This case is special, because `Volumes` already has a tag `patchStrategy:"merge"`. -We change the tag to `patchStrategy:"merge|retainKeys"` - -Type definition: -```go -type PodSpec struct { - ... - // Add another value "retainKeys" to patchStrategy - Volumes []Volume `json:"volumes,omitempty" patchStrategy:"merge|retainKeys" patchMergeKey:"name" protobuf:"bytes,1,rep,name=volumes"` - ... -} -``` -Live object: -```yaml -spec: - volumes: - - name: foo - emptyDir: - medium: - ... -``` -Local file config: -```yaml -spec: - volumes: - - name: foo - hostPath: - path: ... -``` -Patch: -```yaml -spec: - volumes: - - $retainKeys: - - name - - hostPath - name: foo - hostPath: - path: ... -``` -Result after merging -```yaml -spec: - volumes: - - name: foo - hostPath: - path: ... -``` - -**Impacted APIs** are listed in the [Appendix](#appendix). - -### API server - -No required change. -Auto clearing missing fields of a patch relies on package Strategic Merge Patch. -We don't validate only 1 field is set in union in a generic way. We don't validate discriminator -field matches one-of field. But we still rely on hardcoded per field based validation. - -### kubectl - -No required change. -Changes about how to generate the patch rely on package Strategic Merge Patch. - -### Strategic Merge Patch -**Background** -Strategic Merge Patch is a package used by both client and server. A typical usage is that a client -calls the function to calculate the patch and the API server calls another function to merge the patch. - -We need to make sure the new client always sends its patches with the `$retainKeys` directive. 
-When merging, auto clear missing fields of a patch if the patch has a directive `$retainKeys` - -### Open API - -Update OpenAPI schema. - -## Version Skew - -The changes are all backward compatible. - -Old kubectl vs New server: All behave the same as before, since no new directive in the patch. - -New kubectl vs Old server: All behave the same as before, since new directive will not be recognized -by the old server and it will be dropped in conversion. - -# Alternatives Considered - -# 1. Use directive `$patch: retainKeys` in the patch - -Add tags `patchStrategy:"retainKeys"`. -For a given type that has the tag, all keys/fields missing -from the request will be cleared when patching the object. -Each field present in the request will be merged with the live config. - -## Analysis - -There are 2 reasons of avoiding this logic: -- Using `$patch` as directive key will break backward compatibility. -But can easily be fixed by using a different key, e.g. `retainKeys: true`. -Reason is that `$patch` has been used in earlier releases. -If we add new value to this directive, -the old server will reject the new patch due to not knowing the new value. -- The patch has to include the entire struct to hold the place in a list with `replace` patch strategy, -even though there may be no changes at all. -This is less efficient compared to the approach above. - -The proposals below are not mutually exclusive with the proposal above, and maybe can be added at some point in the future. - -# 2. Add Discriminators in All Unions/OneOf APIs - -Original issue is described in kubernetes/kubernetes#35345 - -## Analysis - -### Behavior - -If the discriminator were set, we'd require that the field corresponding to its value were set and the APIServer (registry) could automatically clear the other fields. - -If the discriminator were unset, behavior would be as before -- exactly one of the fields in the union/oneof would be required to be set and the operation would otherwise fail validation. - -We should set discriminators by default. This means we need to change it accordingly when the corresponding union/oneof fields were set and unset. - -## Proposed Changes - -### API -Add a discriminator field in all unions/oneof APIs. The discriminator should be optional for backward compatibility. There is an example below, the field `Type` works as a discriminator. -```go -type PersistentVolumeSource struct { -... - // Discriminator for PersistentVolumeSource, it can be "gcePersistentDisk", "awsElasticBlockStore" and etc. - // +optional - Type *string `json:"type,omitempty" protobuf:"bytes,24,opt,name=type"` -} -``` - -### API Server - -We need to add defaulting logic described in the [Behavior](#behavior) section. - -### kubectl - -No change required on kubectl. - -## Summary - -Limitation: Server-side automatically clearing fields based on discriminator may be unsafe. - -# Appendix - -## List of Impacted APIs -In `pkg/api/v1/types.go`: -- [`VolumeSource`](https://github.com/kubernetes/kubernetes/blob/v1.5.2/pkg/api/v1/types.go#L235): -It is inlined. Besides `VolumeSource`. its parent [Volume](https://github.com/kubernetes/kubernetes/blob/v1.5.2/pkg/api/v1/types.go#L222) has `Name`. -- [`PersistentVolumeSource`](https://github.com/kubernetes/kubernetes/blob/v1.5.2/pkg/api/v1/types.go#L345): -It is inlined. 
Besides `PersistentVolumeSource`, its parent [PersistentVolumeSpec](https://github.com/kubernetes/kubernetes/blob/v1.5.2/pkg/api/v1/types.go#L442) has the following fields: -```go -Capacity ResourceList `json:"capacity,omitempty" protobuf:"bytes,1,rep,name=capacity,casttype=ResourceList,castkey=ResourceName"` -// +optional -AccessModes []PersistentVolumeAccessMode `json:"accessModes,omitempty" protobuf:"bytes,3,rep,name=accessModes,casttype=PersistentVolumeAccessMode"` -// +optional -ClaimRef *ObjectReference `json:"claimRef,omitempty" protobuf:"bytes,4,opt,name=claimRef"` -// +optional -PersistentVolumeReclaimPolicy PersistentVolumeReclaimPolicy `json:"persistentVolumeReclaimPolicy,omitempty" protobuf:"bytes,5,opt,name=persistentVolumeReclaimPolicy,casttype=PersistentVolumeReclaimPolicy"` -``` -- [`Handler`](https://github.com/kubernetes/kubernetes/blob/v1.5.2/pkg/api/v1/types.go#L1485): -It is inlined. Besides `Handler`, its parent struct [`Probe`](https://github.com/kubernetes/kubernetes/blob/v1.5.2/pkg/api/v1/types.go#L1297) also has the following fields: -```go -// +optional -InitialDelaySeconds int32 `json:"initialDelaySeconds,omitempty" protobuf:"varint,2,opt,name=initialDelaySeconds"` -// +optional -TimeoutSeconds int32 `json:"timeoutSeconds,omitempty" protobuf:"varint,3,opt,name=timeoutSeconds"` -// +optional -PeriodSeconds int32 `json:"periodSeconds,omitempty" protobuf:"varint,4,opt,name=periodSeconds"` -// +optional -SuccessThreshold int32 `json:"successThreshold,omitempty" protobuf:"varint,5,opt,name=successThreshold"` -// +optional -FailureThreshold int32 `json:"failureThreshold,omitempty" protobuf:"varint,6,opt,name=failureThreshold"` -```` -- [`ContainerState`](https://github.com/kubernetes/kubernetes/blob/v1.5.2/pkg/api/v1/types.go#L1576): -It is NOT inlined. -- [`PodSignature`](https://github.com/kubernetes/kubernetes/blob/v1.5.2/pkg/api/v1/types.go#L2953): -It has only one field, but the comment says "Exactly one field should be set". Maybe we will add more in the future? It is NOT inlined. -In `pkg/authorization/types.go`: -- [`SubjectAccessReviewSpec`](https://github.com/kubernetes/kubernetes/blob/v1.5.2/pkg/apis/authorization/types.go#L108): -Comments says: `Exactly one of ResourceAttributes and NonResourceAttributes must be set.` -But there are some other non-union fields in the struct. -So this is similar to INLINED struct. -- [`SelfSubjectAccessReviewSpec`](https://github.com/kubernetes/kubernetes/blob/v1.5.2/pkg/apis/authorization/types.go#L130): -It is NOT inlined. - -In `pkg/apis/extensions/v1beta1/types.go`: -- [`DeploymentStrategy`](https://github.com/kubernetes/kubernetes/blob/v1.5.2/pkg/apis/extensions/types.go#L249): -It is NOT inlined. -- [`NetworkPolicyPeer`](https://github.com/kubernetes/kubernetes/blob/v1.5.2/pkg/apis/extensions/v1beta1/types.go#L1340): -It is NOT inlined. -- [`IngressRuleValue`](https://github.com/kubernetes/kubernetes/blob/v1.5.2/pkg/apis/extensions/v1beta1/types.go#L876): -It says "exactly one of the following must be set". But it has only one field. -It is inlined. Its parent [`IngressRule`](https://github.com/kubernetes/kubernetes/blob/v1.5.2/pkg/apis/extensions/v1beta1/types.go#L848) also has the following fields: -```go -// +optional -Host string `json:"host,omitempty" protobuf:"bytes,1,opt,name=host"` -``` +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/api-machinery/admission-control-webhooks.md b/contributors/design-proposals/api-machinery/admission-control-webhooks.md index 100c27fa..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/admission-control-webhooks.md +++ b/contributors/design-proposals/api-machinery/admission-control-webhooks.md @@ -1,960 +1,6 @@ -# Webhooks Beta +Design proposals have been archived. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## PUBLIC -Authors: @erictune, @caesarxuchao, @enisoc -Thanks to: {@dbsmith, @smarterclayton, @deads2k, @cheftako, @jpbetz, @mbohlool, @mml, @janetkuo} for comments, data, prior designs, etc. - -[TOC] - - -# Summary - -This document proposes a detailed plan for bringing Webhooks to Beta. Highlights include (incomplete, see rest of doc for complete list) : - - - -* Adding the ability for webhooks to mutate. -* Bootstrapping -* Monitoring -* Versioned rather than Internal data sent on hook -* Ordering behavior within webhooks, and with other admission phases, is better defined - -This plan is compatible with the [original design doc](/contributors/design-proposals/api-machinery/admission_control_extension.md). - - -# Definitions - -**Mutating Webhook**: Webhook that can change a request as well as accept/reject. - -**Non-Mutating Webhook**: Webhook that cannot change request, but can accept or reject. - -**Webhook**: encompasses both Mutating Webhook and/or Non-mutating Webhook. - -**Validating Webhook**: synonym for Non-Mutating Webhook - -**Static Admission Controller**: Compiled-in Admission Controllers, (in plugin/pkg/admission). - -**Webhook Host**: a process / binary hosting a webhook. - -# Naming - -Many names were considered before settling on mutating. None of the names -considered were completely satisfactory. The following are the names which were -considered and a brief explanation of the perspectives on each. - -* Mutating: Well defined meaning related to mutable and immutable. Some - negative connotations related to genetic mutation. Might be too specifically - as CS term. -* Defaulting: Clearly indicates a create use case. However implies a lack of - functionality for the update use case. -* Modifying: Similar issues to mutating but not as well defined. -* Revising: Less clear what it does. Does it imply it works only on updates? -* Transforming: Some concern that it might have more to do with changing the - type or shape of the related object. -* Adjusting: Same general perspective as modifying. -* Updating: Nice clear meaning. However it seems to easy to confuse update with - the update operation and intuit it does not apply to the create operation. - - -# Development Plan - -Google able to staff development, test, review, and documentation. Community help welcome, too, esp. Reviewing. - -Intent is Beta of Webhooks (**both** kinds) in 1.9. - -Not in scope: - - - -* Initializers remains Alpha for 1.9. (See [Comparison of Webhooks and Initializers](#comparison-of-webhooks-and-initializers) section). No changes to it. Will revisit its status post-1.9. -* Converting static admission controllers is out of scope (but some investigation has been done, see Moving Built-in Admission Controllers section). - - -## Work Items - -* Add API for registering mutating webhooks. See [API Changes](#api-changes) -* Copy the non-mutating webhook admission controller code and rename it to be for mutating. 
(Splitting into two registration APIs make ordering clear.) Add changes to handle mutating responses. See [Responses for Mutations](#responses-for-mutations). -* Document recommended flag order for admission plugins. See [Order of Admission](#order-of-admission). -* In kube-up.sh and other installers, change flag per previous item. -* Ensure able to monitor latency and rejection from webhooks. See [Monitorability](#monitorability). -* Don't send internal objects. See [#49733](https://github.com/kubernetes/kubernetes/issues/49733) -* Serialize mutating Webhooks into order in the apiregistration. Leave non-mutating in parallel. -* Good Error Messages. See [Good Error Messages](#good-error-messages) -* Conversion logic in GenericWebhook to send converted resource to webhook. See [Conversion](#conversion) and [#49733](https://github.com/kubernetes/kubernetes/issues/49733). -* Schedule discussion around resiliency to down webhooks and bootstrapping -* Internal Go interface refactor (e.g. along the lines suggested #[1137](https://github.com/kubernetes/community/pull/1137)). - - -# Design Discussion - - -## Why Webhooks First - -We will do webhooks beta before initializers beta because: - - - -1. **Serves Most Use Cases**: We reviewed code of all current use cases, namely: Kubernetes Built-in Admission Controllers, OpenShift Admission Controllers, Istio & Service Catalog. (See also [Use Cases Detailed Descriptions](#use-cases-detailed-descriptions).) All of those use cases are well served by mutating and non-mutating webhooks. (See also [Comparison of Webhooks and Initializers](#comparison-of-webhooks-and-initializers)). -1. **Less Work**: An engineer quite experienced with both code bases estimated that it is less work to adding Mutating Webhooks and bring both kinds of webhooks to beta; than to bring non-mutating webhooks and initializers to Beta. Some open issues with Initializers with long expected development time include quota replenishment bug, and controller awareness of uninitialized objects. -1. **API Consistency**: Prefer completing one related pair of interfaces (both kinds of webhooks) at the same time. - - -## Why Support Mutation for Beta - -Based on experience and feedback from the alpha phase of both Webhooks and Initializers, we believe Webhooks Beta should support mutation because: - - - -1. We have lots of use cases to inform this (both from Initializers, and Admission Controllers) to ensure we have needed features -1. We have experience with Webhooks API already to give confidence in the API. The registration API will be quite similar except in the responses. -1. There is a strong community demand for something that satisfies a mutating case. - - -## Plan for Existing Initializer-Clients - -After the release of 1.9, we will advise users who currently use initializers to: - - - -* Move to Webhooks if their use case fits that model well. -* Provide SIG-API-Machinery with feedback if Initializers is a better fit. - -We will continue to support Initializers as an Alpha API in 1.9. - -We will make a user guide and extensively document these webhooks. We will update some existing examples, maybe https://github.com/caesarxuchao/example-webhook-admission-controller (since the initializer docs point to it, e.g. https://github.com/kelseyhightower/kubernetes-initializer-tutorial), or maybe https://github.com/openshift/generic-admission-server. - -We will clearly document the reasons for each and how users should decide which to use. 
- - -## Monitorability - -There should be prometheus variables to show: - - - -* API operation latency - * Overall - * By webhook name -* API response codes - * Overall - * By webhook name. - -Adding a webhook dynamically adds a key to a map-valued prometheus metric. Webhook host process authors should consider how to make their webhook host monitorable: while eventually we hope to offer a set of best practices around this, for the initial release we won't have requirements here. - - -## API Changes - -GenericAdmissionWebhook Admission Controller is split and renamed. - - - -* One is called `MutatingAdmissionWebhook` -* The other is called `ValidatingAdmissionWebhook` -* Splitting them allows them to appear in different places in the `--admission-control` flag's order. - -ExternalAdmissionHookConfiguration API is split and renamed. - - - -* One is called `MutatingAdmissionWebhookConfiguration` -* The other is called `ValidatingAdmissionWebhookConfiguration` -* Splitting them: - * makes it clear what the order is when some items don't have both flavors, - * enforces mutate-before-validate, - * better allows declarative update of the config than one big list with an implied partition point - -The `ValidatingAdmissionWebhookConfiguration` stays the same as `ExternalAdmissionHookConfiguration` except it moves to v1beta1. - -The `MutatingAdmissionWebhookConfiguration` is the same API as `ValidatingAdmissionWebhookConfiguration`. It is only visible via the v1beta1 version. - -We will change from having a Kubernetes service object to just accepting a DNS -name for the location of the webhook. - -The Group/Version called - -`admissionregistration.k8s.io/v1alpha1` with kinds - -InitializerConfiguration and ExternalAdmissionHookConfiguration. - -InitializerConfiguration will not join `admissionregistration.k8s.io/v1beta1` at this time. - -Any webhooks that register with v1alpha1 may or may not be surprised when they start getting versioned data. But we don't make any promises for Alpha, and this is a very important bug to fix. - - -## Order of Admission - -At kubernetes.io, we will document the ordering requirements or just recommend a particular order for `--admission-control`. A starting point might be `MutatingAdmissionWebhook,NamespaceLifecycle,LimitRanger,ServiceAccount,PersistentVolumeLabel,DefaultStorageClass,DefaultTolerationSeconds,ValidatingAdmissionWebhook,ResourceQuota`. - - There might be other ordering dependencies that we will document clearly, but some important properties of a valid ordering: - -* ResourceQuota comes last, so that if prior ones reject a request, it won't increment quota. -* All other Static ones are in the order recommended by [the docs](https://kubernetes.io/docs/admin/admission-controllers/#is-there-a-recommended-set-of-plug-ins-to-use). (which variously do mutation and validation) Preserves the behavior when there are no webhooks. -* Ensures dynamic mutations happen before all validations. -* Ensures dynamic validations happen after all mutations. -* Users don't need to reason about the static ones, just the ones they add. - -System administrators will likely need to know something about the webhooks they -intend to run in order to make the best ordering, but we will try to document a -good "first guess". - -Validation continues to happen after all the admission controllers (e.g. after mutating webhooks, static admission controllers, and non-mutating admission controllers.) - -**TODO**: we should move ResourceQuota after Validation, e.g. as described in #1137. 
However, this is a longstanding bug and likely a larger change than can be done in 1.9--a larger quota redesign is out of scope. But we will likely make an improvement in the current ordering. - - -## Parallel vs Serial - -The main reason for parallel is reducing latency due to round trip and conversion. We think this can often mitigated by consolidating multiple webhooks shared by the same project into one. - -Reasons not to allow parallel are complexity of reasoning about concurrent patches, and CRD not supporting PATCH. - -`ValidatingAdmissionWebhook `is already parallel, and there are no responses to merge. Therefore, it stays parallel. - -`MutatingAdmissionWebhook `will run in serial, to ensure conflicts are resolved deterministically. - -The order is the sort order of all the WebhookConfigs, by name, and by index within the Webhooks list. - -We don't plan to make mutating webhooks parallel at this time, but we will revisit the question in the future and decide before going to GA. - -## Good Error Messages - -When a webhook is persistently failing to allow e.g. pods to be created, then the error message from the apiserver must show which webhook failed. - -When a core controller, e.g. ReplicaSet, fails to make a resources, it must send a helpful event that is visible in `kubectl describe` for the controlling resources, saying the reason create failed. - -## Registering for all possible representations of the same object - -Some Kubernetes resources are mounted in the api type system at multiple places -(e.g., during a move between groups). Additionally, some resources have multiple -active versions. There's not currently a way to easily tell which of the exposed -resources map to the same "storage location". We will not try to solve that -problem at the moment: if the system administrator wishes to hook all -deployments, they must (e.g.) make sure their hook is registered for both -deployments.v1beta1.extensions AND deployments.v1.apps. - -This is likely to be error-prone, especially over upgrades. For GA, we may -consider mechanisms to make this easier. We expect to gather user feedback -before designing this. - - -## Conversion and Versioning - -Webhooks will receive the admission review subject in the exact version which -the user sent it to the control plane. This may require the webhook to -understand multiple versions of those types. - -All communication to webhooks will be JSON formatted, with a request body of -type admission.k8s.io/v1beta1. For GA, we will likely also allow proto, via a -TBD mechanism. - -We will not take any particular steps to make it possible to know whether an -apiserver is safe to upgrade, given the webhooks it is running. System -administrators must understand the stack of webhooks they are running, watch the -Kubernetes release notes, and look to the webhook authors for guidance about -whether the webhook supports Kubernetes version N. We may choose to address this -deficency in future betas. 
- -To follow the debate that got us to this position, you can look at this -potential design for the next steps: https://docs.google.com/document/d/1BT8mZaT42jVxtC6l14YMXpUq0vZc6V5MPf_jnzDMMcg/edit - - -## Mutations - -The Response for `MutatingAdmissionWebhook` must have content-type, and it must be one of: - -* `application/json` -* `application/protobuf` -* `application/strategic-merge-patch+json` -* `application/json-patch+json` -* `application/merge-json-patch+json` - -If the response is a patch, it is merged with the versioned response from the previous webhook, where possible without Conversion. - -We encourage the use of patch to avoid the "old clients dropping new fields" problem. - - -## Bootstrapping - -Bootstrapping (both turning on a cluster for the first time and making sure a -cluster can boot from a cold start) is made more difficult by having webhooks, -which are a dependency of the control plane. This is covered in its [own design -doc](./admission-webhook-bootstrapping.md). - -## Upgrading the control plane - -There are two categories of webhooks: security critical (e.g., scan images for -vulnerabilities) and nice-to-have (set labels). - -Security critical webhooks cannot work with Kubernetes types they don't have -built-in knowledge of, because they can't know if e.g. Kubernetes 1.11 adds a -backwards-compatible `v1.Pod.EvilField` which will defeat their functionality. - -They therefore need to be updated before any apiserver. It is the responsibility -of the author of such a webhook to release new versions in response to new -Kubernetes versions in a timely manner. Webhooks must support two consecutive -Kubernetes versions so that rollback/forward is possible. When/if Kubernetes -introduces LTS versions, webhook authors will have to also support two -consecutive LTS versions. - -Non-security-critical webhooks can either be turned off to perform an upgrade, -or can just continue running the old webhook version as long as a completely new -version of an object they want to hook is not added. If they are metadata-only -hooks, then they should be able to run until we deprecate meta/v1. Such webhooks -should document that they don't consider themselves security critical, aren't -obligated to follow the above requirements for security-critical webhooks, and -therefore do not guarantee to be updated for every Kubernetes release. - -It is expected that webhook authors will distribute config for each Kubernetes -version that registers their webhook for all the necessary types, since it would -be unreasonable to make system administrators understand all of the webhooks -they run to that level of detail. - -## Support for Custom Resources - -Webhooks should work with Custom Resources created by CRDs. - -They are particularly needed for Custom Resources, where they can supplement the validation and defaulting provided by OpenAPI. Therefore, the webhooks will be moved or copied to genericapiserver for 1.9. - - -## Support for Aggregated API Servers - -Webhooks should work with Custom Resources on Aggregated API Servers. - -Aggregated API Servers should watch apiregistraton on the main APIserver, and should identify webhooks with rules that match any of their resources, and call those webhooks. - -For example a user might install a Webhook that adds a certain annotation to every single object. Aggregated APIs need to support this use case. - -We will build the dynamic admission stack into the generic apiserver layer to support this use case. 
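Before turning to the built-in admission controllers, a compact sketch may help make the webhook-host side of the flow above concrete. This is an illustrative validating host only, built on the Go standard library with hand-rolled stand-ins for the admission.k8s.io/v1beta1 AdmissionReview body rather than the generated API types; a real host would terminate TLS, be registered through a ValidatingAdmissionWebhookConfiguration, and, if mutating, return a patch in one of the formats listed in the Mutations section.

```go
// Minimal, illustrative webhook host. The wire structs are simplified
// stand-ins for the AdmissionReview body described above, not the real
// generated types.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

type admissionReview struct {
	APIVersion string             `json:"apiVersion"`
	Kind       string             `json:"kind"`
	Request    *admissionRequest  `json:"request,omitempty"`
	Response   *admissionResponse `json:"response,omitempty"`
}

type admissionRequest struct {
	UID       string          `json:"uid"`
	Operation string          `json:"operation"` // CREATE, UPDATE, DELETE or CONNECT
	Object    json.RawMessage `json:"object,omitempty"`
}

type admissionResponse struct {
	UID     string          `json:"uid"`
	Allowed bool            `json:"allowed"`
	Status  *responseStatus `json:"status,omitempty"`
}

// responseStatus carries a human-readable message so the apiserver can surface
// which webhook rejected a request (see Good Error Messages above).
type responseStatus struct {
	Message string `json:"message,omitempty"`
}

func validate(w http.ResponseWriter, r *http.Request) {
	var review admissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil || review.Request == nil {
		http.Error(w, "malformed AdmissionReview", http.StatusBadRequest)
		return
	}

	resp := &admissionResponse{UID: review.Request.UID, Allowed: true}

	// Toy policy: reject any object whose top-level JSON carries a "forbidden" key.
	var obj map[string]json.RawMessage
	if json.Unmarshal(review.Request.Object, &obj) == nil {
		if _, bad := obj["forbidden"]; bad {
			resp.Allowed = false
			resp.Status = &responseStatus{Message: "objects with a 'forbidden' field are not admitted"}
		}
	}

	// Echo the review back with only the response populated.
	review.Request = nil
	review.Response = resp
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(review)
}

func main() {
	http.HandleFunc("/validate", validate)
	// A real webhook host must serve TLS; plain HTTP is used here only to keep the sketch short.
	log.Fatal(http.ListenAndServe(":8443", nil))
}
```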
- - -## Moving Built-in Admission Controllers - -This section summarizes recommendations for Posting static admission controllers to Webhooks. - -See also [Details of Porting Admission Controllers](#details-of-porting-admission-controllers) and this [Backup Document](https://docs.google.com/spreadsheets/d/1zyCABnIzE7GiGensn-KXneWrkSJ6zfeJWeLaUY-ZmM4/edit#gid=0). - -Here is an estimate of how each kind of admission controller would be moved (or not). This is to see if we can cover the use cases we currently have, not necessarily a promise that all of these will or should be move into another process. - - - -* Leave static: - * OwnerReferencesPermissionEnforcement - * GC is a core feature of Kubernetes. Move to required. - * ResourceQuota - * May [redesign](https://github.com/kubernetes/kubernetes/issues/51820) - * Original design doc says it remains static. -* Divide into Mutating and non-mutating Webhooks - * PodSecurityPolicy - * NamespaceLifecycle -* Use Mutating Webhook - * AlwaysPullImages - * ServiceAccount - * StorageClass -* Use non-mutating Webhook - * Eventratelimit - * DenyEscalatingExec - * ImagePolicy - * Need to standardize the webhook format - * NodeRestriction - * Needs to be admission to access User.Info - * PodNodeSelector - * PodTolerationRestriction -* Move to resource's validation or defaulting - * AntiAffinity - * DefaultTolerationSeconds - * PersistentVolumeClaimResize - * Initializers are reasonable to consider moving into the API machinery - -For "Divide", the backend may well be different port of same binary, sharing a SharedInformer, so data is not cached twice. - -For all Kubernetes built-in webhooks, the backend will likely be compiled into kube-controller-manager and share the SharedInformer. - - -# Use Case Analysis - - -## Use Cases Detailed Descriptions - -Mutating Webhooks, Non-mutating webhooks, Initializers, and Finalizers (collectively, Object Lifecycle Extensions) serve to: - - - -* allow policy and behavioral changes to be developed independently of the control loops for individual Resources. These might include company specific rules, or a PaaS that layers on top of Kubernetes. -* implement business logic for Custom Resource Definitions -* separate Kubernetes business logic from the core Apiserver logic, which increases reusability, security, and reliability of the core. - -Specific Use cases: - - - -* Kubernetes static Admission Controllers - * Documented [here](https://kubernetes.io/docs/admin/admission-controllers/) - * Discussed [here](/contributors/design-proposals/api-machinery/admission_control_extension.md) - * All are highly reliable. Most are simple. No external deps. - * Many need update checks. - * Can be separated into mutation and validate phases. -* OpenShift static Admission Controllers - * Discussed [here](/contributors/design-proposals/api-machinery/admission_control_extension.md) - * Similar to Kubernetes ones. -* Istio, Case 1: Add Container to all Pods. - * Currently uses Initializer but can use Mutating Webhook. - * Simple, can be highly reliable and fast. No external deps. - * No current use case for updates. -* Istio, Case 2: Validate Mixer CRDs - * Checking cached values from other CRD objects. - * No external deps. - * Must check updates. -* Service Catalog - * Watch PodPreset and edit Pods. - * Simple, can be highly reliable and fast. No external deps. - * No current use case for updates. 
- -Good further discussion of use cases [here](/contributors/design-proposals/api-machinery/admission_control_extension.md) - - -## Details of Porting Admission Controllers - -This section summarizes which Kubernetes static admission controllers can readily be ported to Object Lifecycle Extensions. - - -### Static Admission Controllers - - -<table> - <tr> - <td>Admission Controller - </td> - <td>How - </td> - <td>Why - </td> - </tr> - <tr> - <td>PodSecurityPolicy - </td> - <td>Use Mutating Webhook and Non-Mutating Webhook. - </td> - <td>Requires User.Info, so needs webhook. -<p> -Mutating will set SC from matching PSP. -<p> -Non-Mutating will check again in case any other mutators or initializers try to change it. - </td> - </tr> - <tr> - <td>ResourceQuota - </td> - <td>Leave static - </td> - <td>A Redesign for Resource Quota has been proposed, to allow at least object count quota for other objects as well. This suggests that Quota might need to remain compiled in like authn and authz are. - </td> - </tr> - <tr> - <td>AlwaysPullImages - </td> - <td>Use Mutating Webhook (could implement using initializer since the thing is it validating is forbidden to change by Update Validation of the object) - </td> - <td>Needs to - </td> - </tr> - <tr> - <td>AntiAffinity - </td> - <td>Move to pod validation - </td> - <td>Since this is provided by the core project, which also manages the pod business logic, it isn't clear why this is even an admission controller. Ask Scheduler people. - </td> - </tr> - <tr> - <td>DefaultTolerationSeconds - </td> - <td>Move to pod defaulting or use a Mutating Webhook. - </td> - <td>It is very simple. - </td> - </tr> - <tr> - <td>eventratelimit - </td> - <td>Non-mutating webhook - </td> - <td>Simple logic, does not mutate. Alternatively, have rate limit be a built-in of api server. - </td> - </tr> - <tr> - <td>DenyEscalatingExec - </td> - <td>Non-mutating Webhook. - </td> - <td>It is very simple. It is optional. - </td> - </tr> - <tr> - <td>OwnerReferences- PermissionEnforcement (gc) - </td> - <td>Leave compiled in - </td> - <td>Garbage collection is core to Kubernetes. Main and all aggregated apiservers should enforce it. - </td> - </tr> - <tr> - <td>ImagePolicy - </td> - <td>Non-mutating webhook - </td> - <td>Must use webhook since image can be updated on pod, and that needs to be checked. - </td> - </tr> - <tr> - <td>LimitRanger - </td> - <td>Mutating Webhook - </td> - <td>Fast - </td> - </tr> - <tr> - <td>NamespaceExists - </td> - <td>Leave compiled in - </td> - <td>This has been on by default for years, right? - </td> - </tr> - <tr> - <td>NamespaceLifecycle - </td> - <td>Split: -<p> - -<p> -Cleanup, leave compiled in. -<p> - -<p> -Protection of system namespaces: use non-mutating webhook - </td> - <td> - </td> - </tr> - <tr> - <td>NodeRestriction - </td> - <td>Use a non-mutating webhook - </td> - <td>Needs webhook so it can use User.Info. - </td> - </tr> - <tr> - <td>PersistentVolumeClaimResize - </td> - <td>Move to validation - </td> - <td>This should be in the validation logic for storage class. - </td> - </tr> - <tr> - <td>PodNodeSelector - </td> - <td>Move to non-mutating webhook - </td> - <td>Already compiled in, so fast enough to use webhook. Does not mutate. - </td> - </tr> - <tr> - <td>podtolerationrestriction - </td> - <td>Move to non-mutating webhook - </td> - <td>Already compiled in, so fast enough to use webhook. Does not mutate. - </td> - </tr> - <tr> - <td>serviceaccount - </td> - <td>Move to mutating webhook. 
- </td> - <td>Already compiled in, so fast enough to use webhook. Does mutate by defaulting the service account. - </td> - </tr> - <tr> - <td>storageclass - </td> - <td>Move to mutating webhook. - </td> - <td> - </td> - </tr> -</table> - - -[Backup Document](https://docs.google.com/spreadsheets/d/1zyCABnIzE7GiGensn-KXneWrkSJ6zfeJWeLaUY-ZmM4/edit#gid=0) - - -### OpenShift Admission Controllers - - -<table> - <tr> - <td>Admission Controller - </td> - <td>How - </td> - <td>Why - </td> - </tr> - <tr> - <td>pkg/authorization/admission/restrictusers" - </td> - <td>Non-mutating Webhook or leave static - </td> - <td>Verification only. But uses a few loopback clients to check other resources. - </td> - </tr> - <tr> - <td>pkg/build/admission/jenkinsbootstrapper - </td> - <td>Non-mutating Webhook or leave static - </td> - <td>Doesn't mutate Build or BuildConfig, but creates Jenkins instances. - </td> - </tr> - <tr> - <td>pkg/build/admission/secretinjector - </td> - <td>Mutating webhook or leave static - </td> - <td>uses a few loopback clients to check other resources. - </td> - </tr> - <tr> - <td>pkg/build/admission/strategyrestrictions - </td> - <td>Non-mutating Webhook or leave static - </td> - <td>Verifications only. But uses a few loopback clients, and calls subjectAccessReview - </td> - </tr> - <tr> - <td>pkg/image/admission - </td> - <td>Non-Mutating Webhook - </td> - <td>Fast, checks image size - </td> - </tr> - <tr> - <td>pkg/image/admission/imagepolicy - </td> - <td>Mutating and non-mutating webhooks - </td> - <td>Rewriting image pull spec is mutating. -<p> -acceptor.Accepts is non-Mutating - </td> - </tr> - <tr> - <td>pkg/ingress/admission - </td> - <td>Non-mutating webhook, or leave static. - </td> - <td>Simple, but calls to authorizer. - </td> - </tr> - <tr> - <td>pkg/project/admission/lifecycle - </td> - <td>Initializer or Non-mutating webhook? - </td> - <td>Needs to update another resource: Namespace - </td> - </tr> - <tr> - <td>pkg/project/admission/nodeenv - </td> - <td>Mutating webhook - </td> - <td>Fast - </td> - </tr> - <tr> - <td>pkg/project/admission/requestlimit - </td> - <td>Non-mutating webhook - </td> - <td>Fast, verification only - </td> - </tr> - <tr> - <td>pkg/quota/admission/clusterresourceoverride - </td> - <td>Mutating webhook - </td> - <td>Updates container resource request and limit - </td> - </tr> - <tr> - <td>pkg/quota/admission/clusterresourcequota - </td> - <td>Leave static. - </td> - <td>Refactor with the k8s quota - </td> - </tr> - <tr> - <td>pkg/quota/admission/runonceduration - </td> - <td>Mutating webhook - </td> - <td>Fast. Needs a ProjectCache though. Updates pod.Spec.ActiveDeadlineSeconds - </td> - </tr> - <tr> - <td>pkg/scheduler/admission/podnodeconstraints - </td> - <td>Non-mutating webhook or leave static - </td> - <td>Verification only. But calls to authorizer. - </td> - </tr> - <tr> - <td>pkg/security/admission - </td> - <td>Use Mutating Webhook and Non-Mutating Webhook. - </td> - <td>Similar to PSP in k8s - </td> - </tr> - <tr> - <td>pkg/service/admission/externalip - </td> - <td>Non-mutating webhook - </td> - <td>Fast and verification only - </td> - </tr> - <tr> - <td>pkg/service/admission/endpoints - </td> - <td>Non-mutating webhook or leave static - </td> - <td>Verification only. But calls to authorizer. 
- </td> - </tr> -</table> - - - -### Other Projects - -Istio Pod Injector: - - - -* Injects Sidecar Container, Init Container, adds a volume for Istio config, and changes the Security Context -* Source: - * https://github.com/istio/pilot/blob/master/platform/kube/inject/inject.go#L278 - * https://github.com/istio/pilot/blob/master/cmd/sidecar-initializer/main.go - -<table> - <tr> - <td> -Function - </td> - <td>How - </td> - <td>Why - </td> - </tr> - <tr> - <td>Istio Pod Injector - </td> - <td>Mutating Webhook - </td> - <td>Containers can only be added at pod creation time. -<p> -Because the change is complex, showing intermediate state may help debugging. -<p> -Fast, so could also use webhook. - </td> - </tr> - <tr> - <td>Istio Mixer CRD Validation - </td> - <td>Non-Mutating Webhook - </td> - <td> - </td> - </tr> - <tr> - <td>Service Catalog PodPreset - </td> - <td>Initializer - </td> - <td>Containers can only be added at pod creation time. -<p> -Because the change is complex, showing intermediate state may help debugging. -<p> -Fast, so could also use webhook. - </td> - </tr> - <tr> - <td>Allocate Cert for Service - </td> - <td>Initializer - </td> - <td>Longer duration operation which might fail, with external dependency, so don't use webhook. -<p> -Let user see initializing state. -<p> -Don't let controllers that depend on services see the service before it is ready. - </td> - </tr> -</table> - - - -## Comparison of Webhooks and Initializers - - -<table> - <tr> - <td>Mutating and Non-Mutating Webhooks - </td> - <td>Initializers (and Finalizers) - </td> - </tr> - <tr> - <td><ul> - -<li>Act on Create, update, or delete -<li>Reject Create, Update or delete</li></ul> - - </td> - <td><ul> - -<li>Act on Create and delete -<li>Reject Create.</li></ul> - - </td> - </tr> - <tr> - <td><ul> - -<li>Clients never see pre-created state. <ul> - - <li>Good for enforcement. - <li>Simple invariants.</li> </ul> -</li> </ul> - - </td> - <td><ul> - -<li>Clients can see pre-initialized state. <ul> - - <li>Let clients see progress - <li>Debuggable</li> </ul> -</li> </ul> - - </td> - </tr> - <tr> - <td><ul> - -<li>Admin cannot easily override broken webhook. <ul> - - <li>Must be highly reliable code - <li>Avoid deps on external systems.</li> </ul> -</li> </ul> - - </td> - <td><ul> - -<li>Admin can easily fix a "stuck" object by "manually" initializing (or finalizing). <ul> - - <li>Can be <em>slightly</em> less reliable. - <li>Prefer when there are deps on external systems.</li> </ul> -</li> </ul> - - </td> - </tr> - <tr> - <td><ul> - -<li>Synchronous <ul> - - <li>Apiserver uses a go routine - <li>TCP connection open - <li>Should be very low latency</li> </ul> -</li> </ul> - - </td> - <td><ul> - -<li>Asynchronous <ul> - - <li>Can be somewhat higher latency</li> </ul> -</li> </ul> - - </td> - </tr> - <tr> - <td><ul> - -<li>Does not persist intermediate state <ul> - - <li>Should happen very quickly. - <li>Does not increase etcd traffic.</li> </ul> -</li> </ul> - - </td> - <td><ul> - -<li>Persist intermediate state <ul> - - <li>Longer ops can persist across apiserver upgrades/failures - <li>Does increase etcd traffic.</li> </ul> -</li> </ul> - - </td> - </tr> - <tr> - <td><ul> - -<li>Webhook does not know if later webhooks fail <ul> - - <li>Must not have side effects, - <li>Or have a really good GC plan.</li> </ul> -</li> </ul> - - </td> - <td><ul> - -<li>Initializer does not know if later initializers fail, but if paired with a finalizer, it could see the resource again. 
<ul> - - <li>This is not implemented - <li>TODO: initializers: have a way to ensure finalizer runs even if later initializers reject?</li> </ul> -</li> </ul> - - </td> - </tr> - <tr> - <td> - Use Examples:<ul> - -<li>checking one field on an object, and setting another field on the same object</li></ul> - - </td> - <td> - Use Examples:<ul> - -<li>Allocate (and deallocate) external resource in parallel with a Kubernetes resource.</li></ul> - - </td> - </tr> -</table> - - -Another [Detailed Comparison of Initializers and Webhooks](https://docs.google.com/document/d/17P_XjXDpxDC5xSD0nMT1W18qE2AlCMkVJcV6jXKlNIs/edit?ts=59d5683b#heading=h.5irk4csrpu0y) +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/api-machinery/admission-webhook-bootstrapping.md b/contributors/design-proposals/api-machinery/admission-webhook-bootstrapping.md index 190a538a..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/admission-webhook-bootstrapping.md +++ b/contributors/design-proposals/api-machinery/admission-webhook-bootstrapping.md @@ -1,98 +1,6 @@ -# Webhook Bootstrapping +Design proposals have been archived. -## Background -[Admission webhook](./admission-control-webhooks.md) is a feature that -dynamically extends Kubernetes admission chain. Because the admission webhooks -are in the critical path of admitting REST requests, broken webhooks could block -the entire cluster, even blocking the reboot of the webhooks themselves. This -design presents a way to avoid such bootstrap deadlocks. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Objective -- If one or more webhooks are down, it should be able restart them automatically. -- If a core system component that supports webhooks is down, the component - should be able to restart. -## Design idea -We add a selector to the admission webhook configuration, which will be compared -to the labels of namespaces. Only objects in the matching namespaces are -subjected to the webhook admission. A cluster admin will want to exempt these -namespaces from webhooks: -- Namespaces where this webhook and other webhooks are deployed in; -- Namespaces where core system components are deployed in. - -## API Changes -`ExternalAdmissionHook` is the dynamic configuration API of an admission webhook. -We will add a new field `NamespaceSelector` to it: - -```golang -type ExternalAdmissionHook struct { - Name string - ClientConfig AdmissionHookClientConfig - Rules []RuleWithOperations - FailurePolicy *FailurePolicyType - // Only objects in matching namespaces are subjected to this webhook. - // LabelSelector.MatchExpressions allows exclusive as well as inclusive - // matching, so you can use this // selector as a whitelist or a blacklist. - // For example, to apply the webhook to all namespaces except for those have - // labels with key "runlevel" and value equal to "0" or "1": - // metav1.LabelSelctor{MatchExpressions: []LabelSelectorRequirement{ - // { - // Key: "runlevel", - // Operator: metav1.LabelSelectorOpNotIn, - // Value: []string{"0", "1"}, - // }, - // }} - // As another example, to only apply the webhook to the namespaces that have - // labels with key “environment” and value equal to “prod” and “staging”: - // metav1.LabelSelctor{MatchExpressions: []LabelSelectorRequirement{ - // { - // Key: "environment", - // Operator: metav1.LabelSelectorOpIn, - // Value: []string{"prod", "staging"}, - // }, - // }} - // See https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/ for more examples of label selectors. - NamespaceSelector *metav1.LabelSelector -} -``` - -## Guidelines on namespace labeling -The mechanism depends on cluster admin properly labelling the namespaces. We -will provide guidelines on the labelling scheme. One suggestion is labelling -namespaces with runlevels. The design of runlevels is out of the scope of this -document (tracked in -[#54522](https://github.com/kubernetes/kubernetes/issues/54522)), a strawman -runlevel scheme is: - -- runlevel 0: namespaces that host core system components, like kube-apiserver - and kube-controller-manager. 
-- runlevel 1: namespaces that host add-ons that are part of the webhook serving - stack, e.g., kube-dns. -- runlevel 2: namespaces that host webhooks deployments and services. - -`ExternalAdmissionHook.NamespaceSelector` should be configured to skip all the -above namespaces. In the case where some webhooks depend on features offered by -other webhooks, the system administrator could extend this concept further (run -level 3, 4, 5, …) to accommodate them. - -## Security implication -The mechanism depends on namespaces being properly labelled. We assume only -highly privileged users can modify namespace labels. Note that the system -already relies on correct namespace annotations, examples include the -podNodeSelector admission plugin, and the podTolerationRestriction admission -plugin etc. - -# Considered Alternatives -- Allow each webhook to exempt one namespace - - Doesn’t work: if there are two webhooks in two namespaces both blocking pods - startup, they will block each other. -- Put all webhooks in a single namespace and let webhooks exempt that namespace, - e.g., deploy webhooks in the “kube-system” namespace and exempt the namespace. - - It doesn’t provide sufficient isolation. Not all objects in the - “kube-system” namespace should bypass webhooks. -- Add namespace selector to webhook configuration, but use the selector to match - the name of namespaces - ([#1191](https://github.com/kubernetes/community/pull/1191)). - - Violates k8s convention. The matching label (key=name, value=<namespace’s - name>) is imaginary. - - Hard to manage. Namespace’s name is arbitrary. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/api-machinery/admission_control.md b/contributors/design-proposals/api-machinery/admission_control.md index dec92334..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/admission_control.md +++ b/contributors/design-proposals/api-machinery/admission_control.md @@ -1,101 +1,6 @@ -# Kubernetes Proposal - Admission Control +Design proposals have been archived. -**Related PR:** +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -| Topic | Link | -| ----- | ---- | -| Separate validation from RESTStorage | http://issue.k8s.io/2977 | -## Background - -High level goals: -* Enable an easy-to-use mechanism to provide admission control to cluster. -* Enable a provider to support multiple admission control strategies or author -their own. -* Ensure any rejected request can propagate errors back to the caller with why -the request failed. - -Authorization via policy is focused on answering if a user is authorized to -perform an action. - -Admission Control is focused on if the system will accept an authorized action. - -Kubernetes may choose to dismiss an authorized action based on any number of -admission control strategies. - -This proposal documents the basic design, and describes how any number of -admission control plug-ins could be injected. - -Implementation of specific admission control strategies are handled in separate -documents. - -## kube-apiserver - -The kube-apiserver takes the following OPTIONAL arguments to enable admission -control: - -| Option | Behavior | -| ------ | -------- | -| admission-control | Comma-delimited, ordered list of admission control choices to invoke prior to modifying or deleting an object. | -| admission-control-config-file | File with admission control configuration parameters to boot-strap plug-in. | - -An **AdmissionControl** plug-in is an implementation of the following interface: - -```go -package admission - -// Attributes is an interface used by a plug-in to make an admission decision -// on a individual request. -type Attributes interface { - GetNamespace() string - GetKind() string - GetOperation() string - GetObject() runtime.Object -} - -// Interface is an abstract, pluggable interface for Admission Control decisions. -type Interface interface { - // Admit makes an admission decision based on the request attributes - // An error is returned if it denies the request. - Admit(a Attributes) (err error) -} -``` - -A **plug-in** must be compiled with the binary, and is registered as an -available option by providing a name, and implementation of admission.Interface. - -```go -func init() { - admission.RegisterPlugin("AlwaysDeny", func(client client.Interface, config io.Reader) (admission.Interface, error) { return NewAlwaysDeny(), nil }) -} -``` - -A **plug-in** must be added to the imports in [plugins.go](../../cmd/kube-apiserver/app/plugins.go) - -```go - // Admission policies - _ "k8s.io/kubernetes/plugin/pkg/admission/admit" - _ "k8s.io/kubernetes/plugin/pkg/admission/alwayspullimages" - _ "k8s.io/kubernetes/plugin/pkg/admission/antiaffinity" - ... - _ "<YOUR NEW PLUGIN>" -``` - -Invocation of admission control is handled by the **APIServer** and not -individual **RESTStorage** implementations. - -This design assumes that **Issue 297** is adopted, and as a consequence, the -general framework of the APIServer request/response flow will ensure the -following: - -1. 
Incoming request -2. Authenticate user -3. Authorize user -4. If operation=create|update|delete|connect, then admission.Admit(requestAttributes) - - invoke each admission.Interface object in sequence -5. Case on the operation: - - If operation=create|update, then validate(object) and persist - - If operation=delete, delete the object - - If operation=connect, exec - -If at any step, there is an error, the request is canceled. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
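To make the plug-in contract above more tangible, here is a self-contained toy plug-in in the spirit of `admission.Interface`. The interface is re-declared locally and reduced to the methods the example needs so the sketch compiles on its own; a real plug-in would implement the actual `admission` package interface and register itself via `admission.RegisterPlugin` as shown earlier. The deny-deletes-in-kube-system policy is purely illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

// attributes mirrors a subset of the admission.Attributes interface shown
// above; an in-tree plug-in would use the real package instead of redefining it.
type attributes interface {
	GetNamespace() string
	GetKind() string
	GetOperation() string
}

// denyKubeSystemDeletes is a toy admission plug-in: it rejects delete
// operations in the kube-system namespace and admits everything else.
type denyKubeSystemDeletes struct{}

func (denyKubeSystemDeletes) Admit(a attributes) error {
	if a.GetOperation() == "delete" && a.GetNamespace() == "kube-system" {
		return errors.New("deletes in kube-system are not admitted")
	}
	return nil
}

// fakeRequest stands in for the request attributes the APIServer would build.
type fakeRequest struct{ namespace, kind, operation string }

func (f fakeRequest) GetNamespace() string { return f.namespace }
func (f fakeRequest) GetKind() string      { return f.kind }
func (f fakeRequest) GetOperation() string { return f.operation }

func main() {
	plugin := denyKubeSystemDeletes{}
	fmt.Println(plugin.Admit(fakeRequest{"kube-system", "Pod", "delete"})) // error
	fmt.Println(plugin.Admit(fakeRequest{"default", "Pod", "create"}))     // <nil>
}
```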
\ No newline at end of file diff --git a/contributors/design-proposals/api-machinery/admission_control_event_rate_limit.md b/contributors/design-proposals/api-machinery/admission_control_event_rate_limit.md index af25e4bf..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/admission_control_event_rate_limit.md +++ b/contributors/design-proposals/api-machinery/admission_control_event_rate_limit.md @@ -1,176 +1,6 @@ -# Admission control plugin: EventRateLimit +Design proposals have been archived. -## Background +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -This document proposes a system for using an admission control to enforce a limit -on the number of event requests that the API Server will accept in a given time -slice. In a large cluster with many namespaces managed by disparate administrators, -there may be a small percentage of namespaces that have pods that are always in -some type of error state, for which the kubelets and controllers in the cluster -are producing a steady stream of error event requests. Each individual namespace -may not be causing a large amount of event requests on its own, but taken -collectively the errors from this small percentage of namespaces can have a -significant impact on the performance of the cluster overall. -## Use cases - -1. Ability to protect the API Server from being flooded by event requests. -2. Ability to protect the API Server from being flooded by event requests for - a particular namespace. -3. Ability to protect the API Server from being flooded by event requests for - a particular user. -4. Ability to protect the API Server from being flooded by event requests from - a particular source+object. - -## Data Model - -### Configuration - -```go -// LimitType is the type of the limit (e.g., per-namespace) -type LimitType string - -const ( - // ServerLimitType is a type of limit where there is one bucket shared by - // all of the event queries received by the API Server. - ServerLimitType LimitType = "server" - // NamespaceLimitType is a type of limit where there is one bucket used by - // each namespace - NamespaceLimitType LimitType = "namespace" - // UserLimitType is a type of limit where there is one bucket used by each - // user - UserLimitType LimitType = "user" - // SourceAndObjectLimitType is a type of limit where there is one bucket used - // by each combination of source and involved object of the event. - SourceAndObjectLimitType LimitType = "sourceAndObject" -) - -// Configuration provides configuration for the EventRateLimit admission -// controller. -type Configuration struct { - metav1.TypeMeta `json:",inline"` - - // limits are the limits to place on event queries received. - // Limits can be placed on events received server-wide, per namespace, - // per user, and per source+object. - // At least one limit is required. - Limits []Limit `json:"limits"` -} - -// Limit is the configuration for a particular limit type -type Limit struct { - // type is the type of limit to which this configuration applies - Type LimitType `json:"type"` - - // qps is the number of event queries per second that are allowed for this - // type of limit. The qps and burst fields are used together to determine if - // a particular event query is accepted. The qps determines how many queries - // are accepted once the burst amount of queries has been exhausted. 
- QPS int32 `json:"qps"` - - // burst is the burst number of event queries that are allowed for this type - // of limit. The qps and burst fields are used together to determine if a - // particular event query is accepted. The burst determines the maximum size - // of the allowance granted for a particular bucket. For example, if the burst - // is 10 and the qps is 3, then the admission control will accept 10 queries - // before blocking any queries. Every second, 3 more queries will be allowed. - // If some of that allowance is not used, then it will roll over to the next - // second, until the maximum allowance of 10 is reached. - Burst int32 `json:"burst"` - - // cacheSize is the size of the LRU cache for this type of limit. If a bucket - // is evicted from the cache, then the allowance for that bucket is reset. If - // more queries are later received for an evicted bucket, then that bucket - // will re-enter the cache with a clean slate, giving that bucket a full - // allowance of burst queries. - // - // The default cache size is 4096. - // - // If limitType is 'server', then cacheSize is ignored. - // +optional - CacheSize int32 `json:"cacheSize,omitempty"` -} -``` - -### Validation - -Validation of a **Configuration** enforces that the following rules apply: - -* There is at least one item in **Limits**. -* Each item in **Limits** has a unique **Type**. - -Validation of a **Limit** enforces that the following rules apply: - -* **Type** is one of "server", "namespace", "user", and "source+object". -* **QPS** is positive. -* **Burst** is positive. -* **CacheSize** is non-negative. - -### Default Value Behavior - -If there is no item in **Limits** for a particular limit type, then no limits -will be enforced for that type of limit. - -## AdmissionControl plugin: EventRateLimit - -The **EventRateLimit** plug-in introspects all incoming event requests and -determines whether the event fits within the rate limits configured. - -To enable the plug-in and support for EventRateLimit, the kube-apiserver must -be configured as follows: - -```console -$ kube-apiserver --admission-control=EventRateLimit --admission-control-config-file=$ADMISSION_CONTROL_CONFIG_FILE -``` - -## Example - -An example EventRateLimit configuration: - -| Type | RequestBurst | RequestRefillRate | CacheSize | -| ---- | ------------ | ----------------- | --------- | -| Server | 1000 | 100 | | -| Namespace | 100 | 10 | 50 | - -The API Server starts with an allowance to accept 1000 event requests. Each -event request received counts against that allowance. The API Server refills -the allowance at a rate of 100 per second, up to a maximum allowance of 1000. -If the allowance is exhausted, then the API Server will respond to subsequent -event requests with 429 Too Many Requests, until the API Server adds more to -its allowance. - -For example, let us say that at time t the API Server has a full allowance to -accept 1000 event requests. At time t, the API Server receives 1500 event -requests. The first 1000 to be handled are accepted. The last 500 are rejected -with a 429 response. At time t + 1 second, the API Server has refilled its -allowance with 100 tokens. At time t + 1 second, the API Server receives -another 500 event requests. The first 100 to be handled are accepted. The last -400 are rejected. - -The API Server also starts with an allowance to accept 100 event requests from -each namespace. This allowance works in parallel with the server-wide -allowance. 
An accepted event request will count against both the server-side -allowance and the per-namespace allowance. An event request rejected by the -server-side allowance will still count against the per-namespace allowance, -and vice versa. The API Server tracks the allowances for at most 50 namespaces. -The API Server will stop tracking the allowance for the least-recently-used -namespace if event requests from more than 50 namespaces are received. If an -event request for namespace N is received after the API Server has stop -tracking the allowance for namespace N, then a new, full allowance will be -created for namespace N. - -In this example, the API Server will track any allowances for neither the user -nor the source+object in an event request because both the user and the -source+object details have been omitted from the configuration. The allowance -mechanisms for per-user and per-source+object rate limiting works identically -to the per-namespace rate limiting, with the exception that the former consider -the user of the event request or source+object of the event and the latter -considers the namespace of the event request. - -## Client Behavior - -Currently, the Client event recorder treats a 429 response as an http transport -type of error, which warrants retrying the event request. Instead, the event -recorder should abandon the event. Additionally, the event recorder should -abandon all future events for the period of time specified in the -Retry-After header of the 429 response. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
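The qps/burst allowance described above is a token bucket. The following sketch models the per-namespace limit from the example table with `golang.org/x/time/rate`; it is an illustration of the semantics rather than the plugin's actual code. The real admission controller additionally bounds the number of tracked buckets with an LRU cache of `cacheSize` entries and returns 429 responses, neither of which is shown here.

```go
package main

import (
	"fmt"

	"golang.org/x/time/rate"
)

// namespaceLimiter hands out one token bucket per namespace, mimicking the
// qps/burst semantics described above. A plain map keeps the sketch short;
// the real plugin caps tracked namespaces with an LRU cache of cacheSize entries.
type namespaceLimiter struct {
	qps     float64
	burst   int
	buckets map[string]*rate.Limiter
}

func newNamespaceLimiter(qps float64, burst int) *namespaceLimiter {
	return &namespaceLimiter{qps: qps, burst: burst, buckets: map[string]*rate.Limiter{}}
}

// allow reports whether one more event request from the namespace fits in its allowance.
func (n *namespaceLimiter) allow(namespace string) bool {
	b, ok := n.buckets[namespace]
	if !ok {
		// A namespace seen for the first time (or evicted from the LRU)
		// starts with a full burst allowance.
		b = rate.NewLimiter(rate.Limit(n.qps), n.burst)
		n.buckets[namespace] = b
	}
	return b.Allow()
}

func main() {
	// qps=10, burst=100, as in the per-namespace row of the example table.
	limiter := newNamespaceLimiter(10, 100)
	accepted := 0
	for i := 0; i < 150; i++ {
		if limiter.allow("noisy-namespace") {
			accepted++
		}
	}
	// Roughly the first 100 requests are accepted; the rest would be answered with 429.
	fmt.Println("accepted:", accepted)
}
```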
\ No newline at end of file diff --git a/contributors/design-proposals/api-machinery/admission_control_extension.md b/contributors/design-proposals/api-machinery/admission_control_extension.md index c7e7f1b7..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/admission_control_extension.md +++ b/contributors/design-proposals/api-machinery/admission_control_extension.md @@ -1,635 +1,6 @@ -# Extension of Admission Control via Initializers and External Admission Enforcement +Design proposals have been archived. -Admission control is the primary business-logic policy and enforcement subsystem in Kubernetes. It provides synchronous -hooks for all API operations and allows an integrator to impose additional controls on the system - rejecting, altering, -or reacting to changes to core objects. Today each of these plugins must be compiled into Kubernetes. As Kubernetes grows, -the requirement that all policy enforcement beyond coarse grained access control be done through in-tree compilation and -distribution becomes unwieldy and limits administrators and the growth of the ecosystem. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -This proposal covers changes to the admission control subsystem that allow extension of admission without recompilation -and dynamic admission control configuration in ways that resemble existing controller behavior. - -## Background - -The four core systems in Kubernetes are: - -1. API servers with persistent storage, providing basic object validation, defaulting, and CRUD operations -2. Authentication and authorization layers that identify an actor and constrain the coarse actions that actor can take on API objects -3. [Admission controller layers](admission_control.md) that can control and limit the CRUD operations clients perform synchronously. -4. Controllers which watch the API and react to changes made by other users asynchronously (scheduler, replication controller, kubelet, kube-proxy, and ingress are all examples of controllers). - -Admission control supports a wide range of policy and behavior enforcement for cluster administrators and integrators. - - -### Types of Admission Control - -In Kubernetes 1.5 and OpenShift 1.4, the following types of functionality have been implemented through admission -(all file references are relative to `plugin/pkg/admission`, or simply identified by name for OpenShift). Many of the -Kubernetes admission controllers originated in OpenShift and are listed in both for history. - -#### Resource Control - -These admission controllers take resource usage for pods into account to ensure namespaces cannot abuse the cluster -by consuming more than their fair share of resources. These perform security or defaulting type roles. - -##### Kubernetes - -Name | Code | Description ----- | ---- | ----------- -InitialResources | initialresources/admission.go | Default the resources for a container based on past usage -LimitRanger | limitranger/admission.go | Set defaults for container requests and limits, or enforce upper bounds on certain resources (no more than 2GB of memory, default to 512MB). Implements the behavior of a v1 API (LimitRange). -ResourceQuota | resourcequota/admission.go | Calculate and deny number of objects (pods, rc, service load balancers) or total consumed resources (cpu, memory, disk) in a namespace. Implements the behavior of a v1 API (ResourceQuota). 
- -##### OpenShift - -Name | Code | Description ----- | ---- | ----------- -ClusterResourceOverride | clusterresourceoverride/admission.go | Allows administrators to override the user's container request for CPU or memory as a percentage of their request (the administrator's target overcommit number), or to default a limit based on a request. Allows cluster administrators to control overcommit on a cluster. -ClusterResourceQuota | clusterresourcequota/admission.go | Performs quota calculations over a set of namespaces with a shared quota. Can be used in conjunction with resource quota for hard and soft limits. -ExternalIPRanger | externalip_admission.go | Prevents users from creating services with externalIPs inside of fixed CIDR ranges, including the pod network, service network, or node network CIDRs to prevent hijacking of connections. -ImageLimitRange | admission.go | Performs LimitRanging on images that are pushed into the integrated image registry -OriginResourceQuota | resourcequota/admission.go | Performs quota calculations for API resources exposed by OpenShift. Demonstrates how quota would be implemented for API extensions. -ProjectRequestLimit | requestlimit/admission.go | A quota on how many namespaces may be created by any individual user. Has a global default and also a per user override. -RunOnceDuration | runonceduration/admission.go | Enforces a maximum ActiveDeadlineSeconds value on all RestartNever pods in a namespace. This ensures that users are defaulted to have a deadline if they did not request it (which prevents pathological resource consumption) - -Quota is typically last in the admission chain, to give all other components a chance to reject or modify the resource. - - -#### Security - -These controllers defend against specific actions within a resource that might be dangerous that the authorization -system cannot enforce. - -##### Kubernetes - -Name | Code | Description ----- | ---- | ----------- -AlwaysPullImages | alwayspullimages/admission.go | Forces the Kubelet to pull images to prevent pods from accessing private images that another user with credentials has already pulled to the node. -LimitPodHardAntiAffinityTopology | antiaffinity/admission.go | Defended the cluster against abusive anti-affinity topology rules that might hang the scheduler. -DenyEscalatingExec | exec/admission.go | Prevent users from executing into pods that have higher privileges via their service account than allowed by their policy (regular users can't exec into admin pods). -DenyExecOnPrivileged | exec/admission.go | Blanket ban exec access to pods with host level security. Superseded by DenyEscalatingExec -OwnerReferencesPermissionEnforcement | gc/gc_admission.go | Require that a user who sets a owner reference (which could result in garbage collection) has permission to delete the object, to prevent abuse. -ImagePolicyWebhook | imagepolicy/admission.go | Invoke a remote API to determine whether an image is allowed to run on the cluster. -PodNodeSelector | podnodeselector/admission.go | Default and limit what node selectors may be used within a namespace by reading a namespace annotation and a global configuration. -PodSecurityPolicy | security/podsecuritypolicy/admission.go | Control what security features pods are allowed to run as based on the end user launching the pod or the service account. Sophisticated policy rules. -SecurityContextDeny | securitycontext/scdeny/admission.go | Blanket deny setting any security context settings on a pod. 
- -##### OpenShift - -Name | Code | Description ----- | ---- | ----------- -BuildByStrategy | strategyrestrictions/admission.go | Control which types of image builds a user can create by checking for a specific virtual authorization rule (field level authorization), since some build types have security implications. -OriginPodNodeEnvironment | nodeenv/admission.go | Predecessor to PodNodeSelector. -PodNodeConstraints | podnodeconstraints/admission.go | Prevent users from setting nodeName directly unless they can invoke the `bind` resource on pods (same as a scheduler). This prevents users from attacking nodes by repeatedly creating pods that target a specific node and forcing it to reject those pods. (field level authorization) -RestrictedEndpointsAdmission | endpoint_admission.go | In a multitenant network setup where namespaces are isolated like OpenShift SDN, service endpoints must not allow a user to probe other namespaces. If a user edits the endpoints object and sets IPs that fall within the pod network CIDR, the user must have `create` permission on a virtual resource `endpoints/restricted`. The service controller is granted this permission by default. -SecurityContextConstraint | admission.go | Predecessor to PodSecurityPolicy. -SCCExecRestrictions | scc_exec.go | Predecessor to DenyEscalatingExec. - -Many other controllers have been proposed, including but not limited to: - -* Control over what taints and tolerations a user can set on a pod -* Control over which labels and annotations can be set or changed -* Generic control over which fields certain users may set (field level access control) - - -#### Defaulting / Injection - -These controllers inject namespace or cluster context into pods and other resources at runtime to decouple -application config from runtime config (separate the user's pod settings from environmental controls) - -##### Kubernetes - -Name | Code | Description ----- | ---- | ----------- -ServiceAccount | serviceaccount/admission.go | Bind mount the service account token for a pod into the pod at a specific location. -PersistentVolumeLabel | persistentvolume/label/admission.go | Lazily bind persistent volume claims to a given zone when a pod is scheduled. -DefaultStorageClass | storageclass/default/admission.go | Set a default storage class on any PVC created without a storage class. - -Many other controllers have been proposed, including but not limited to: - -* ServiceInjectionPolicy to inject environment, configmaps, and secrets into pods that reference those services -* Namespace level environment injection (all pods in this namespace should have env var `ENV=PROD`) -* Label selector based resource defaults (all pods with these labels get these default resources) - - -#### Referential Consistency - -These controllers enforce that certain guarantees of the system related to integrity. - -##### Kubernetes - -Name | Code | Description ----- | ---- | ----------- -NamespaceAutoProvision | namespace/autoprovision/admission.go | When users create resources in a namespace that does not exist, ensure the namespace is created so it can be seen with `kubectl get namespaces` -NamespaceExists | namespace/exists/admission.go | Require that a namespace object exist prior to a resource being created. -NamespaceLifecycle | namespace/lifecycle/admission.go | More powerful and flexible version of NamespaceExists. 
- -##### OpenShift - -Name | Code | Description ----- | ---- | ----------- -JenkinsBootstrapper | jenkinsbootstrapper/admission.go | Spawn a Jenkins instance in any project where a Build is defined that references a Jenkins pipeline. Checks that the creating user has permission to act-as an editor in the project to prevent escalation within a namespace. -ImagePolicy | imagepolicy/imagepolicy.go | Performs policy functions like ImagePolicyWebhook, but also is able to mutate the image reference from a tag to a digest (fully qualified spec), look up additional information about the image from the OpenShift Image API and potentially enforce resource consumption or placement decisions based on the image. May also be used to deny images from being used that don't resolve to image metadata that OpenShift tracks. -OriginNamespaceLifecycle | lifecycle/admission.go | Controls accepting resources for namespaces. - - -### Patterns - -In a study of all known admission controllers, the following patterns were seen most often: - -1. Defaulting on creation -2. Synchronous validation on creation -3. Synchronous validation on update - side-effect free - -Other patterns seen less frequently include: - -1. Defaulting on update -2. Resolving / further specifying values on update (ImagePolicy) -3. Creating resources in response to user action with the correct permission check (JenkinsBootstrapper) -4. Policy decisions based on *who* is doing the action (OwnerReferencesPermissionEnforcement, PodSecurityPolicy, JenkinsBootstrapper) -5. Synchronous validation on update - with side effects (quota) - -While admission controllers can operate on all verbs, resources, and sub resource types, in practice they -mostly deal with create and update on primary resources. Most sub resources are highly privileged operations -and so are typically covered by authorization policy. Other controllers like quota tend to be per apiserver -and therefore are not required to be extensible. - - -### Building enforcement - -In order to implement custom admission, an admin, integrator, or distribution of Kubernetes must compile their -admission controller(s) into the Kubernetes `kube-apiserver` binary. As Kubernetes is intended to be a -modular layered system this means core components must be upgraded to effect policy changes and only a fixed -list of plugins can be used. It also prevents experimentation and prototyping of policy, or "quick fix" -solutions applied on site. As we add additional APIs that are not hosted in the main binary (either as third -party resources or API extension servers), these APIs have many of the same security and policy needs that -the core resources do, but must compile in their own subsets of admission. - -Further, distributions of Kubernetes like OpenShift that wish to offer complete solutions (such as OpenShift's -multi-tenancy model) have no mechanism for running on top of Kubernetes without recompilation of the core or -for extending the core with additional policy. This prevents the formation of an open ecosystem for tools -*around* Kubernetes, forcing all changes to policy to go through the Kubernetes codebase review gate (when -such review is unnecessary or disruptive to Kubernetes itself). - -### Ordering of admission - -Previous work has described a logical ordering for admission: - -1. defaulting (PodPreset) -2. mutation (ClusterResourceOverride) -3. validation (PodSecurityPolicy) -4. transactional (ResourceQuota) - -Most controllers fit cleanly into one of these buckets. 
Controllers that need to act in multiple phases -are often best split into separate admission controllers, although today we offer no code mechanism to -share a request local cache. Extension may need to occur at each of these phases. - - -## Design - -It should be possible to perform holistic policy enforcement in Kubernetes without the recompilation of the -core project as plugins that can be added and removed to a stock Kubernetes release. That extension -of admission control should leverage similar our existing controller patterns and codebase where possible. -Extension must be as performant and reliable as other core mechanisms. - - -### Requirements - -1. Easy Initialization - - Privileged components should be able to easily participate in the **initialization** of a new object. - -2. Synchronous Validation - - Synchronous rejection of initialized objects or mutations must be possible outside of the kube-apiserver binary - -3. Backwards Compatible - - Existing API clients must see no change in behavior to external admission other than increased latency - -4. Easy Installation - - Administrators should be able to easily write a new admission plugin and deploy it in the cluster - -5. Performant - - External admission must not significantly regress performance in large and dense clusters - -6. Reliable - - External admission should be capable of being "production-grade" for deployment in an extremely large and dense cluster - -7. Internally Consistent - - Developing an admission controller should reuse as much infrastructure and tools as possible from building custom controllers so as to reduce the cost of extension. - - -### Specification - -Based on observation of the actual admission control implementations the majority of mutation -occurs as part of creation, and a large chunk of the remaining controllers are for side-effect free -validation of creation and updates. Therefore we propose the following changes to Kubernetes: - -1. Allow some controllers to act as "initializers" - watching the API and mutating the object before it is visible to normal clients. - - This would reuse the majority of the infrastructure in place for controllers. Because creation is - one-way, the object can be "revealed" to regular clients once a set list of initializers is consumed. These - controllers could run on the cluster as pods. Because initialization is a non-backwards compatible API change, - some care must be taken to shield old clients from observing the scenario. - -2. Add a generic **external admission webhook** controller that is non-mutating (thus parallelizable) - - This generic webhook API would resemble `admission.Interface` and be given the input object (for create) and the - previous object (for update/patch). After initialization or on any update, these hooks would be invoked in parallel - against the remote servers and any rejection would reject the mutation. - -3. Make the registration of both initializers and admission webhooks dynamic via the API (a configmap or cluster scoped resource) - - Administrators should be able to dynamically add or remove hooks and initializers on demand to the cluster. - Configuration would be similar to registering new API group versions and include config like "fail open" or - "fail closed". - -Some admission controller types would not be possible for these extensions: - -* Mutating admission webhooks are not part of the initial implementation but are desired in the future. 
-* Admission controllers that need access to the acting user can receive that via the external webhook. -* Admission controllers that "react" to the acting user can couple the information received via a webhook and then act if they observe mutation succeed (tuple combining resource UID and resource generation). -* Quota will continue to be a core plugin per API server, so extension is not critical. - - -#### Implications: - -1. Initializers and generic admission controllers are highly privileged, so while some separation is valuable they are effectively cluster scoped -2. This mechanism would allow dedicated infrastructure to host admission for multiple clusters, and allow some expensive admission to be centralized (like quota which is hard to performantly distribute) -3. There is no way to build initializers for updates without a much more complicated model, but we anticipate initializers to work best on creation. -4. Ordering will probably be necessary on initializers because defaulting in the wild requires ordering. Non-mutating validation on the other hand can be fully parallel. -5. Some admission depends on knowing the identity of the actor - we will likely need to include the **creator** as information to initializers. -6. Quota must still run after all validators are invoked. We may need to make quota extensible in the future. - - -#### Initializers - -An initializer must allow some API clients to perceive creations prior to other apps. For backwards compatibility, -uninitialized objects must be invisible to legacy clients. In addition, initializers must be recognized as -participating in the initialization process and therefore a list of initializers must be populated onto each -new object. Like finalizers, each initializer should perform an update and remove itself from the object. - -Every API object will have the following field added: - -``` -type ObjectMeta struct { - ... - // Initializers is a list of initializers that must run prior to this object being visible to - // normal clients. Only highly privileged users may modify this field. If this field is set, - // then normal clients will receive a 202 Accepted and a Status object if directly retrieved - // by name, and it will not be visible via listing or watching. - Initializers *Initializers `json:"initializers"` - ... -} - -// Initializers tracks the progress of initialization. -type Initializers struct { - // Pending is a list of initializers that must execute in order before this object is visible. - // When the last pending initializer is removed, and no failing result is set, the initializers - // struct will be set to nil and the object is considered as initialized and visible to all - // clients. - Pending []Initializer `json:"pending"` - // If result is set with the Failure field, the object will be persisted to etcd and then deleted, - // ensuring that other clients can observe the deletion. - Result *metav1.Status `json:"status"` -} - -// Initializer records a single pending initializer. It is a struct for future extension. -type Initializer struct { - // Name is the name of the process that owns this initializer - Name string `json:"name"` -} -``` - -On creation, a compiled in admission controller defaults the initializers field (if nil) to a value from -the system configuration that may vary by resource type. If initializers is set the admission controller -will check whether the user has the ability to run the `initialize` verb on the current resource type, and -reject the entry with 403 if not. 
This allows a privileged user to bypass initialization by setting -`initializers` to the empty struct. - -Once created, an object is not visible to clients unless the following conditions are satisfied: - -1. The initializers field is null. -2. The client provides a special option to GET, LIST, or WATCH indicating that the user wishes to see uninitialized objects. - -The apiserver that accepts the incoming creation should hold the response until the object is -initialized or the timeout is exceeded. This increases latency, but allows clients to avoid breaking -semantics. If the apiserver reaches the timeout it must return an appropriate error that includes the -resource version of the object and UID so that clients can perform a watch. If an initializer reports -a `Result` with a failure, it must return that to the user (all failures result in deletion). - -Each initializer is a controller or other client agent that watches for new objects with an initializer -whose first position matches their assigned name (e.g. `PodAutoSizer`) and then operate on them. These -clients would use the `?includeUninitialized=true` query param (working name) and observe all -objects. - -The initializer would perform a normal update on the object to perform their function, and then -remove their entry from `initializers` (or adding more entries). If an error occurs during initialization -that must terminate initialization, the `Status` field on the initializer should be set instead of removing -the initializer entry and then the initializer should delete the object. The client would receive this -status as the response to their creation (as described below). - -During initialization, resources may have relaxed validation requirements, which means initializers must -handle incomplete objects. The create call will perform normal defaulting so that initializers are not providing -their own defaulting, including UID and creationTimestamp. At all phases the object must be valid, so -resources that wish to use initializers should consider how defaulting would complicate initializers. - -To allow naive clients to avoid having to deal with uninitialized objects, the API will automatically -filter uninitialized objects out LIST and WATCH. Explicit GETs to that object should return the -appropriate status code `202 Accepted` indicating that the resource is reserved and including a `Status` -response with the correct resource version, but not the object. Clients specifying -`includeUninitialized` will see all updates, but shared code like caches and informers may need -to implement layered filters to handle multiple clients requesting both variants. A CREATE to an -uninitialized object should report the same status as before, and DELETE is always allowed. - -There is no current error case for a timeout that exactly matches existing behavior except a 5xx -timeout if etcd does not respond quickly. We should return that error if CREATE exceeds the timeout, -but return an appropriate status cause that lets a client determine what the outcome was. - -Initializers are allowed to set other initializers or finalizers. 
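Before the example flows below, a small self-contained sketch of the bookkeeping an initializer performs on the `Pending` list may help. The structs are local mirrors of the ones quoted above rather than real API types, the `QuotaTagger` name is invented for illustration, and the watch/update machinery (including `?includeUninitialized=true`) is omitted; only the check-that-I-am-first-then-remove-myself logic is shown.

```go
package main

import "fmt"

// Local mirrors of the Initializers / Initializer structs quoted above, so the
// sketch is self-contained; a real controller would use the generated API types.
type initializer struct{ Name string }

type initializers struct {
	Pending []initializer
}

// isMyTurn reports whether the named initializer is first in the pending list
// and therefore allowed to act on the object now.
func isMyTurn(in *initializers, name string) bool {
	return in != nil && len(in.Pending) > 0 && in.Pending[0].Name == name
}

// markDone removes the named initializer from the front of the pending list.
// When the list becomes empty the whole struct is cleared (set to nil), which
// is what makes the object visible to ordinary clients.
func markDone(in *initializers, name string) *initializers {
	if !isMyTurn(in, name) {
		return in
	}
	in.Pending = in.Pending[1:]
	if len(in.Pending) == 0 {
		return nil
	}
	return in
}

func main() {
	obj := &initializers{Pending: []initializer{{Name: "PodAutoSizer"}, {Name: "QuotaTagger"}}}

	// The PodAutoSizer controller observes the object, mutates it as needed,
	// then clears its own entry as part of the same update.
	if isMyTurn(obj, "PodAutoSizer") {
		obj = markDone(obj, "PodAutoSizer")
	}
	fmt.Printf("%+v\n", obj) // &{Pending:[{Name:QuotaTagger}]}
}
```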
- - -##### Example flow: - -This flow shows the information moving across the system during a successful creation - -``` -Client APIServer (req) APIServer (other) Initializer 1 Initializer 2 ------------------ ---------------- ----------------- -------------------- ------------------- - - listen <--------- WATCH /pods?init=0 - listen <-------------|---------------- WATCH /pods?init=0 - | | -POST /pods -----> validate | | - admission(init): | | - default v | - save etcd -----------------------> observe | - WATCH /pods/1 v | - | change resources | - | clear initializer | - | validate <------- PUT /pods/1 | - | admission(init): | - | check authz v - | save etcd ---------------------------> observe - | change env vars - | clear initializer - | validate <---------------------------- PUT /pods - | admission(init): - v check authz - observe <------- save etcd -response <------- handle -``` - -An example flow where initialization fails: - -``` -Client APIServer (req) APIServer (other) Initializer 1 Initializer 2 ------------------ ---------------- ----------------- -------------------- ------------------- - - listen <--------- WATCH /pods?init=0 - listen <-------------|---------------- WATCH /pods?init=0 - | | -POST /pods -----> validate | | - admission(init): | | - default v | - save etcd -----------------------> observe | - WATCH /pods/1 v | - | change resources | - | clear initializer | - | validate <------- PUT /pods/1 | - | admission(init): | - | check authz v - | save etcd ---------------------------> observe - | failed to retrieve object - | set result to failure - | validate <---------------------------- PUT /pods - | admission(init): - | check authz - v signal failure - observe <------- save etcd -response <------- handle | - v - delete object -``` - -##### Failure - -If the apiserver crashes before the initialization is complete, it may be necessary for the -apiserver or another controller to complete deletion. - -1. Alternatively, upon receiving a failed status update during initialization, the apiserver could delete the object at that time. -2. The garbage collector controller could be responsible for cleaning up resources that failed initialization. -3. Read repair could be performed when listing resources from the apiserver. - - -##### Quota - -Quota consists of two distinct parts - object count, which prevents abuse for some limited -resource, and sub-object consumption which may not be known until after all initializers -are executed. - -Object count quota should be applied prior to the object being persisted to etcd (prior -to initialization). All other quota should be applied when the object completes initialization. -Compensation should run at the normal spot. - - -##### Bypassing initializers - -Initializers, like external admission hooks, raise the potential for a cluster that cannot -make progress or heal itself. An initializer on pods could block an emergency fallback -scheduler from launching a new scheduler pod. An initializer on the endpoints resource could -prevent masters from registering themselves, blocking an extension API server from observing -endpoints changes to allow it to watch endpoints. In general, new initializers must be -careful to not create circular dependencies with the masters. - -The ability to set an empty initializer list allows cluster level components to make progress -in the face of extension. Additional information may need to be returned with an object -creation error to indicate which component failed to initialize. 
- - -#### Generic external admission webhook - -Existing webhooks demonstrate specific admission callouts for image policy. In general, the `admission.Interface` -already defines a reasonable pattern for mutations. Admission is much less common for retrieval operations. - -Add a new `GenericAdmissionWebhook` that is a list of endpoints that will receive POST operations. The schema -of the object is modelled after `SubjectAccessReview` and `admission.Interface`. The caller posts the object -to the server and expects a 200, 204 or 400 response. If the response is 200 or 204, the caller proceeds. If -the response is 400, the client should interpret the response body to determine the "status" sub field and -then use that as the response. - -``` -POST [arbitrary_url] -{ - "kind": "AdmissionReview", - "apiVersion": "admission.k8s.io/v1", - "spec": { - "resource": "pods", - "subresource": "", - "operation": "Create", - "object": {...}, - "oldObject": nil, - "userInfo": { - "name": "user1", - "uid": "cn=user,dn=ou", - "groups": ["g=1"] - } - }, - "status": { - "status": "Failure", - "message": "...", - "reason": "Forbidden", - "code": 403, - ... - } -} - -400 Bad Request -{ - "kind": "AdmissionReview", - "apiVersion": "admission.k8s.io/v1", - "spec": { - ... - }, - "status": { - "status": "Failure", - "message": "...", - "reason": "Forbidden", - "code": 403, - ... - } -} -``` - -Clients may return 204 as an optimization. `oldObject` may be sent if this is an update operation. This -API explicitly does not allow mutations, but leaves open the possibility for that to be added in the future. - -Each webhook must be non-mutating and will be invoked in parallel. The first failure will terminate processing, -and the caller may choose to retry calls and so submission must be idempotent. - -Because admission is performance critical, the following considerations are taken: - -1. Protobuf will be the expected serialization format - JSON is shown above for readability, but the content-type wrapper will be used to encode bytes. -2. The admission object will be serialized once and sent to all callers -3. To minimize fan-out variance, future implementations may set strict timeouts and dispatch multiple requests. -4. Admission controllers MAY omit the spec field in returning a response to the caller, specifically the `object` and `oldObject` fields. - -Future work: - -1. It should be possible to bypass expensive parts of the serialization action by potentially passing the input bytes directly to the webhook, but mutation and defaulting may complicate this goal. - -Admission is a high security operation, so end-to-end TLS encryption is expected and the remote endpoint -should be authorized via strong signing, mutual-auth, or high security. - - -##### Upgrade of a cluster with external admission - -The current order of cluster upgrade is `apiserver` -> `controller` -> `nodes`. External admission controllers -would typically need to be upgraded first in order to ensure new semantic changes in objects are not ignored. -This would include fields like PodSecurityContext - adding that *prior* to admission is necessary because it -allows escalation that was previously impossible. - - -#### Dynamic configuration - -Trusted clients should be able to modify the initialization and external admission hooks on the fly and expect -that configuration is updated quickly. Extension API servers should also be able to leverage central -configuration, but may opt for alternate mechanisms. 
- -The initializer admission controller and the generic webhook admission controller should dynamically load config -from a `ConfigMap` or a net new API object holding the following configuration schema: - -``` -type AdmissionControlConfiguration struct { - TypeMeta // although this object could simply be serialized like ComponentConfig - - // ResourceInitializers is a list of resources and their default initializers - ResourceInitializers []ResourceDefaultInitializer - - ExternalAdmissionHooks []ExternalAdmissionHook -} - -type ResourceDefaultInitializer struct { - // Resource identifies the type of resource to be initialized that should be initialized - Resource GroupResource - // Initializers are the default names that will be registered to this resource - Initializers []string -} - -type ExternalAdmissionHook struct { - // Operations is the list of operations this hook will be invoked on - Create, Update, or * - // for all operations. Defaults to '*'. - Operations []string - // Resources are the resources this hook should be invoked on. '*' is all resources. - Resources []string - // Subresources are the list of subresources this hook should be invoked on. '*' is all resources. - Subresources []string - - // TODO define client configuration - - // FailurePolicy defines how unrecognized errors from the admission endpoint are handled - - // allowed values are Ignore, Retry, Fail. Default value is Fail - FailurePolicy FailurePolicyType -} -``` - -All changes to this config must be done in an safe matter - when adding a new hook or initializer, -first verify the new agent is online before allowing it to come into rotation. Removing a hook or -initializer must occur before disabling the remote endpoint, and all queued items must complete. - - -### Alternatives considered - -The following are all viable alternatives to this specification, but have some downsides against the requirements above. -There should be no reason these could not be implemented for specific use cases. - -1. Admission controller that can run shell commands inside its context to mutate objects. - * Limits on performance and reliability - * Requires the masters be updated (can't be done dynamically) -2. Admission controller that can run a scripting language like Lua or JavaScript in process. - * Limits on performance and reliability - * Not consistent with existing tools and infrastructure - * Requires that masters be updated and has limits on dynamic behavior -3. Direct external call outs for object mutation (RPC to initialize objects) - * Requires a new programming model - * Duplicates our create - watch - update logic from controllers -4. Make it easy to recompile Kubernetes to have new admission controllers - * Limits administrators to using Go - * Prevents easy installation and dynamic reconfiguration - - -## Future Work - -### Mutating admission controllers - -Allow webhook admission controllers to return a mutated object that is then sent to others. Requires some -ordering / dependency tree in order to control the set of changes. Is necessary for some forms of controller, -such as those that resolve fields into more specific values (converting an update of a pod image from a -tag reference to a digest reference, or pointing to a proxy server). - - -### Bypassing external admission hooks - -There may be scenarios where an external admission hook blocks a system critical loop in a non-obvious way - -by preventing node updates that prevents a new admission pod from being created, for instance. 
One option -is to allow administrators to request fail open on specific calls, or to require that certain special resource -paths (initializers dynamic config path) are always fail open. Alternatively, it may be desirable for -administrative users to bypass admission completely. - -Some options: - -1. Some namespaces are opted out of external admission (kube-system) -2. Certain service accounts can bypass external admission checks - - -### An easily accessible policy engine for moderately complex scenarios - -It should be easy for a novice Kubernetes administrator to apply simple policy rules to the cluster. In -the future it is desirable to have many such policy engines enabled via extension to enable quick policy -customization to meet specific needs.
\ No newline at end of file +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
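For readers who want to see the receiving side of the `GenericAdmissionWebhook` protocol described in the proposal above, here is a minimal HTTP handler sketch. It hand-rolls a reduced `AdmissionReview` type, uses JSON instead of the protobuf the proposal expects in production, skips TLS and `userInfo`, and enforces an arbitrary toy policy; it illustrates the 200/204/400 contract, not a reference implementation.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Reduced, hand-rolled mirror of the AdmissionReview wire format shown above;
// a production webhook would use published API types and protobuf.
type admissionReview struct {
	Kind       string                 `json:"kind"`
	APIVersion string                 `json:"apiVersion"`
	Spec       admissionReviewSpec    `json:"spec"`
	Status     *admissionReviewStatus `json:"status,omitempty"`
}

type admissionReviewSpec struct {
	Resource  string          `json:"resource"`
	Operation string          `json:"operation"`
	Object    json.RawMessage `json:"object,omitempty"`
}

type admissionReviewStatus struct {
	Status  string `json:"status"`
	Message string `json:"message,omitempty"`
	Reason  string `json:"reason,omitempty"`
	Code    int    `json:"code,omitempty"`
}

// admit rejects creation of services (an arbitrary toy policy) and allows
// everything else: 204 with no body means allowed, 400 carries the failure status.
func admit(w http.ResponseWriter, r *http.Request) {
	var review admissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	if review.Spec.Operation == "Create" && review.Spec.Resource == "services" {
		review.Status = &admissionReviewStatus{
			Status:  "Failure",
			Message: "creating services is not allowed by this toy policy",
			Reason:  "Forbidden",
			Code:    403,
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusBadRequest)
		json.NewEncoder(w).Encode(review)
		return
	}
	w.WriteHeader(http.StatusNoContent) // allowed, nothing further to report
}

func main() {
	// The proposal expects end-to-end TLS in practice; plain HTTP keeps the sketch short.
	// The apiserver would be pointed at this endpoint via the dynamic
	// ExternalAdmissionHooks configuration described above.
	http.HandleFunc("/admit", admit)
	log.Fatal(http.ListenAndServe(":8443", nil))
}
```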
\ No newline at end of file diff --git a/contributors/design-proposals/api-machinery/aggregated-api-servers.md b/contributors/design-proposals/api-machinery/aggregated-api-servers.md index fdddddee..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/aggregated-api-servers.md +++ b/contributors/design-proposals/api-machinery/aggregated-api-servers.md @@ -1,270 +1,6 @@ -# Aggregated API Servers +Design proposals have been archived. -## Abstract +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -We want to divide the single monolithic API server into multiple aggregated -servers. Anyone should be able to write their own aggregated API server to expose APIs they want. -Cluster admins should be able to expose new APIs at runtime by bringing up new -aggregated servers. - -## Motivation - -* Extensibility: We want to allow community members to write their own API - servers to expose APIs they want. Cluster admins should be able to use these - servers without having to require any change in the core kubernetes - repository. -* Unblock new APIs from core kubernetes team review: A lot of new API proposals - are currently blocked on review from the core kubernetes team. By allowing - developers to expose their APIs as a separate server and enabling the cluster - admin to use it without any change to the core kubernetes repository, we - unblock these APIs. -* Place for staging experimental APIs: New APIs can be developed in separate - aggregated servers, and installed only by those willing to take the risk of - installing an experimental API. Once they are stable, it is then easy to - package them up for installation in other clusters. -* Ensure that new APIs follow kubernetes conventions: Without the mechanism - proposed here, community members might be forced to roll their own thing which - may or may not follow kubernetes conventions. - -## Goal - -* Developers should be able to write their own API server and cluster admins - should be able to add them to their cluster, exposing new APIs at runtime. All - of this should not require any change to the core kubernetes API server. -* These new APIs should be seamless extensions of the core kubernetes APIs (ex: - they should be operated upon via kubectl). - -## Non Goals - -The following are related but are not the goals of this specific proposal: -* Make it easy to write a kubernetes API server. - -## High Level Architecture - -There will be a new component, `kube-aggregator`, which has these responsibilities: -* Provide an API for registering API servers. -* Summarize discovery information from all the servers. -* Proxy client requests to individual servers. - -The reverse proxy is provided for convenience. Clients can discover server URLs -using the summarized discovery information and contact them directly. Simple -clients can always use the proxy and don't need to know that under the hood -multiple apiservers are running. - -Wording note: When we say "API servers" we really mean groups of apiservers, -since any individual apiserver is horizontally replicable. Similarly, -kube-aggregator itself is horizontally replicable. - -## Operational configurations - -There are two configurations in which it makes sense to run `kube-aggregator`. - 1. In **test mode**/**single-user mode**. 
An individual developer who wants to test - their own apiserver could run their own private copy of `kube-aggregator`, - configured such that only they can interact with it. This allows for testing - both `kube-aggregator` and any custom apiservers without the potential for - causing any collateral damage in the rest of the cluster. Unfortunately, in - this configuration, `kube-aggregator`'s built in proxy will lack the client - cert that allows it to perform authentication that the rest of the cluster - will trust, so its functionality will be somewhat limited. - 2. In **gateway mode**. The `kube-apiserver` will embed the `kube-aggregator` component - and it will function as the official gateway to the cluster, where it aggregates - all of the apiservers the cluster administer wishes to provide. - -### Constraints - -* Unique API group versions across servers: Each API server (and groups of servers, in HA) - should expose unique API group versions. So, for example, you can serve - `api.mycompany.com/v1` from one apiserver and the replacement - `api.mycompany.com/v2` from another apiserver while you update clients. But - you can't serve `api.mycompany.com/v1/frobbers` and - `api.mycompany.com/v1/grobinators` from different apiservers. This restriction - allows us to limit the scope of `kube-aggregator` to a manageable level. -* Follow API conventions: APIs exposed by every API server should adhere to [kubernetes API - conventions](/contributors/devel/sig-architecture/api-conventions.md). -* Support discovery API: Each API server should support the kubernetes discovery API - (list the supported groupVersions at `/apis` and list the supported resources - at `/apis/<groupVersion>/`) -* No bootstrap problem: The core kubernetes apiserver must not depend on any - other aggregated server to come up. Non-core apiservers may use other non-core - apiservers, but must not fail in their absence. - -## Component Dependency Order - -`kube-aggregator` is not part of the core `kube-apiserver`. -The dependency order (for the cluster gateway configuration) looks like this: - 1. `etcd` - 2. `kube-apiserver` - 3. core scheduler, kubelet, service proxy (enough stuff to create a pod, run it on a node, and find it via service) - 4. `kubernetes-aggregator` as a pod/service - default summarizer and proxy - 5. controllers - 6. other API servers and their controllers - 7. clients, web consoles, etc - -Nothing below the `kubernetes-aggregator` can rely on the aggregator or proxy -being present. `kubernetes-aggregator` should be runnable as a pod backing a -service in a well-known location. Something like `api.kube-public.svc` or -similar seems appropriate since we'll want to allow network traffic to it from -every other namespace in the cluster. We recommend using a dedicated namespace, -since compromise of that namespace will expose the entire cluster: the -proxy has the power to act as any user against any API server. - -## Implementation Details - -### Summarizing discovery information - -We can have a very simple Go program to summarize discovery information from all -servers. Cluster admins will register each aggregated API server (its baseURL and swagger -spec path) with the proxy. The proxy will summarize the list of all group versions -exposed by all registered API servers with their individual URLs at `/apis`. - -### Reverse proxy - -We can use any standard reverse proxy server like nginx or extend the same Go program that -summarizes discovery information to act as reverse proxy for all aggregated servers. 
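As a sketch of the extend-the-same-Go-program option, the snippet below wires `net/http/httputil` reverse proxies keyed by API group prefix. The backend names and service URLs are hypothetical, the routing table is hard-coded rather than driven by the registration API, and the authentication, user-header forwarding, and connection-upgrade behavior of the real aggregator proxy are left out.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Hypothetical backend apiservers; in a real deployment these would come
	// from the registration API rather than being hard-coded.
	backends := map[string]string{
		"/apis/wardle.example.com/": "https://wardle.kube-public.svc",
		"/apis/metrics.k8s.io/":     "https://metrics.kube-public.svc",
	}

	mux := http.NewServeMux()
	for prefix, backend := range backends {
		target, err := url.Parse(backend)
		if err != nil {
			log.Fatal(err)
		}
		// Everything under the group's prefix is forwarded verbatim to the
		// apiserver that registered it.
		mux.Handle(prefix, httputil.NewSingleHostReverseProxy(target))
	}

	log.Fatal(http.ListenAndServe(":8443", mux))
}
```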
- -Cluster admins are also free to use any of the multiple open source API management tools -(for example, there is [Kong](https://getkong.org/), which is written in lua and there is -[Tyk](https://tyk.io/), which is written in Go). These API management tools -provide a lot more functionality like: rate-limiting, caching, logging, -transformations and authentication. -In future, we can also use ingress. That will give cluster admins the flexibility to -easily swap out the ingress controller by a Go reverse proxy, nginx, haproxy -or any other solution they might want. - -`kubernetes-aggregator` uses a simple proxy implementation alongside its discovery information -which supports connection upgrade (for `exec`, `attach`, etc) and runs with delegated -authentication and authorization against the core `kube-apiserver`. As a proxy, it adds -complete user information, including user, groups, and "extra" for backing API servers. - -### Storage - -Each API server is responsible for storing their resources. They can have their -own etcd or can use kubernetes server's etcd using [third party -resources](extending-api.md#adding-custom-resources-to-the-kubernetes-api-server). - -### Health check - -Kubernetes server's `/api/v1/componentstatuses` will continue to report status -of master components that it depends on (scheduler and various controllers). -Since clients have access to server URLs, they can use that to do -health check of individual servers. -In future, if a global health check is required, we can expose a health check -endpoint in the proxy that will report the status of all aggregated api servers -in the cluster. - -### Auth - -Since the actual server which serves client's request can be opaque to the client, -all API servers need to have homogeneous authentication and authorisation mechanisms. -All API servers will handle authn and authz for their resources themselves. -The current authentication infrastructure allows token authentication delegation to the -core `kube-apiserver` and trust of an authentication proxy, which can be fulfilled by -`kubernetes-aggregator`. - -#### Server Role Bootstrapping - -External API servers will often have to provide roles for the resources they -provide to other API servers in the cluster. This will usually be RBAC -clusterroles, RBAC clusterrolebindings, and apiaggregation types to describe -their API server. The external API server should *never* attempt to -self-register these since the power to mutate those resources provides the -power to destroy the cluster. Instead, there are two paths: - 1. the easy path - In this flow, the API server supports a `/bootstrap/<group>` endpoint - which provides the resources that can be piped to a `kubectl create -f` command a cluster-admin - can use those endpoints to prime other servers. - 2. the reliable path - In a production cluster, you generally want to know, audit, and - track the resources required to make your cluster work. In these scenarios, you want - to have the API resource list ahead of time. API server authors can provide a template. - -Nothing stops an external API server from supporting both. - -### kubectl - -kubectl will talk to `kube-aggregator`'s discovery endpoint and use the discovery API to -figure out the operations and resources supported in the cluster. -We will need to make kubectl truly generic. Right now, a lot of operations -(like get, describe) are hardcoded in the binary for all resources. A future -proposal will provide details on moving those operations to server. 
- -Note that it is possible for kubectl to talk to individual servers directly in -which case proxy will not be required at all, but this requires a bit more logic -in kubectl. We can do this in future, if desired. - -### Handling global policies - -Now that we have resources spread across multiple API servers, we need to -be careful to ensure that global policies (limit ranges, resource quotas, etc) are enforced. -Future proposals will improve how this is done across the cluster. - -#### Namespaces - -When a namespaced resource is created in any of the aggregated server, that -server first needs to check with the kubernetes server that: - -* The namespace exists. -* User has authorization to create resources in that namespace. -* Resource quota for the namespace is not exceeded. - -To prevent race conditions, the kubernetes server might need to expose an atomic -API for all these operations. - -While deleting a namespace, kubernetes server needs to ensure that resources in -that namespace maintained by other servers are deleted as well. We can do this -using resource [finalizers](/contributors/design-proposals/architecture/namespaces.md#finalizers). Each server -will add themselves in the set of finalizers before they create a resource in -the corresponding namespace and delete all their resources in that namespace, -whenever it is to be deleted (kubernetes API server already has this code, we -will refactor it into a library to enable reuse). - -Future proposal will talk about this in more detail and provide a better -mechanism. - -#### Limit ranges and resource quotas - -kubernetes server maintains [resource quotas](/contributors/design-proposals/resource-management/admission_control_resource_quota.md) and -[limit ranges](/contributors/design-proposals/resource-management/admission_control_limit_range.md) for all resources. -Aggregated servers will need to check with the kubernetes server before creating any -resource. - -## Methods for running on hosted kubernetes clusters - -Where "hosted" means the cluster users have very limited or no permissions to -change the control plane installation, for example on GKE, where it is managed -by Google. There are three ways of running on such a cluster:. - - 1. `kube-aggregator` will run in the single-user / test configuration on any - installation of Kubernetes, even if the user starting it only has permissions - in one namespace. - 2. Just like 1 above, if all of the users can agree on a location, then they - can make a public namespace and run a copy of `kube-aggregator` in that - namespace for everyone. The downside of running like this is that none of the - cluster components (controllers, nodes, etc) would be going through this - kube-aggregator. - 3. The hosted cluster provider can integrate `kube-aggregator` into the - cluster. This is the best configuration, but it may take a quarter or two after - `kube-aggregator` is ready to go for providers to complete this integration. - -## Alternatives - -There were other alternatives that we had discussed. - -* Instead of adding a proxy in front, let the core kubernetes server provide an - API for other servers to register themselves. It can also provide a discovery - API which the clients can use to discover other servers and then talk to them - directly. But this would have required another server API a lot of client logic as well. 
-* Validating aggregated servers: We can validate new servers when they are registered - with the proxy, or keep validating them at regular intervals, or validate - them only when explicitly requested, or not validate at all. - We decided that the proxy will just assume that all the servers are valid - (i.e., that they conform to our API conventions). In the future, we can provide conformance tests. - -## Future Work - -* Validate servers: We should have some conformance tests that validate that the - servers follow the Kubernetes API conventions. -* Provide a centralised auth service: It is very hard to ensure homogeneous auth - across multiple aggregated servers, especially in the case of hosted clusters - (where different people control the different servers). We can address this by - providing a centralised authentication and authorization service which all of - the servers can use. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/api-machinery/alternate-api-representations.md b/contributors/design-proposals/api-machinery/alternate-api-representations.md index d54709a5..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/alternate-api-representations.md +++ b/contributors/design-proposals/api-machinery/alternate-api-representations.md @@ -1,433 +1,6 @@ -# Alternate representations of API resources +Design proposals have been archived. -## Abstract +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Naive clients benefit from allowing the server to returning resource information in a form -that is easy to represent or is more efficient when dealing with resources in bulk. It -should be possible to ask an API server to return a representation of one or more resources -of the same type in a way useful for: -* Retrieving a subset of object metadata in a list or watch of a resource, such as the - metadata needed by the generic Garbage Collector or the Namespace Lifecycle Controller -* Dealing with generic operations like `Scale` correctly from a client across multiple API - groups, versions, or servers -* Return a simple tabular representation of an object or list of objects for naive - web or command-line clients to display (for `kubectl get`) -* Return a simple description of an object that can be displayed in a wide range of clients - (for `kubectl describe`) -* Return the object with fields set by the server cleared (as `kubectl export`) which - is dependent on the schema, not on user input. - -The server should allow a common mechanism for a client to request a resource be returned -in one of a number of possible forms. In general, many of these forms are simply alternate -versions of the existing content and are not intended to support arbitrary parameterization. - -Also, the server today contains a number of objects which are common across multiple groups, -but which clients must be able to deal with in a generic fashion. These objects - Status, -ListMeta, ObjectMeta, List, ListOptions, ExportOptions, and Scale - are embedded into each -group version but are actually part of a shared API group. It must be possible for a naive -client to translate the Scale response returned by two different API group versions. - - -## Motivation - -Currently it is difficult for a naive client (dealing only with the list of resources -presented by API discovery) to properly handle new and extended API groups, especially -as versions of those groups begin to evolve. It must be possible for a naive client to -perform a set of common operations across a wide range of groups and versions and leverage -a predictable schema. - -We also foresee increasing difficulty in building clients that must deal with extensions - -there are at least 6 known web-ui or CLI implementations that need to display some -information about third party resources or additional API groups registered with a server -without requiring each of them to change. Providing a server side implementation will -allow clients to retrieve meaningful information for the `get` and `describe` style -operations even for new API groups. 
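To ground the motivation, here is a purely illustrative sketch of the request a naive client would like to be able to make (the media-type parameters used here are defined in the Implementation section below; the proxy address and resource path are assumptions):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Assumes the API server is reachable locally, e.g. via `kubectl proxy`.
	req, err := http.NewRequest("GET", "http://127.0.0.1:8001/api/v1/namespaces", nil)
	if err != nil {
		panic(err)
	}
	// Preferred representation: metadata only, from the meta.k8s.io group.
	// Fallback: plain JSON for servers that ignore the extra parameters.
	req.Header.Set("Accept",
		"application/json;g=meta.k8s.io,v=v1,as=PartialObjectMetadata, application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	// The Content-Type header (or the Kind in the body) tells the client
	// which representation the server actually chose.
	fmt.Println(resp.Header.Get("Content-Type"))
	fmt.Println(string(body))
}
```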
- - -## Implementation - -The HTTP spec and the common REST paradigm provide mechanisms for clients to [negotiate -alternative representations of objects (RFC2616 14.1)](http://www.w3.org/Protocols/rfc2616/rfc2616.txt) -and for the server to correctly indicate a requested mechanism was chosen via the `Accept` -and `Content-Type` headers. This is a standard request response protocol intended to allow -clients to request the server choose a representation to return to the client based on the -server's capabilities. In RESTful terminology, a representation is simply a known schema that -the client is capable of handling - common schemas are HTML, JSON, XML, or protobuf, with the -possibility of the client and server further refining the requested output via either query -parameters or media type parameters. - -In order to ensure that generic clients can properly deal with many different group versions, -we introduce the `meta.k8s.io` group with version `v1` that grandfathers all existing resources -currently described as "unversioned". A generic client may request that responses be applied -in this version. The contents of a particular API group version would continue to be bound into -other group versions (`status.v1.meta.k8s.io` would be bound as `Status` into all existing -API groups). We would remove the `unversioned` package and properly home these resources in -a real API group. - - -### Considerations around choosing an implementation - -* We wish to avoid creating new resource *locations* (URLs) for existing resources - * New resource locations complicate access control, caching, and proxying - * We are still retrieving the same resource, just in an alternate representation, - which matches our current use of the protobuf, JSON, and YAML serializations - * We do not wish to alter the mechanism for authorization - a user with access - to a particular resource in a given namespace should be limited regardless of - the representation in use. - * Allowing "all namespaces" to be listed would require us to create "fake" resources - which would complicate authorization -* We wish to support retrieving object representations in multiple schemas - JSON for - simple clients and Protobuf for clients concerned with efficiency. -* Most clients will wish to retrieve a newer format, but for older servers will desire - to fall back to the implicit resource represented by the endpoint. - * Over time, clients may need to request results in multiple API group versions - because of breaking changes (when we introduce v2, clients that know v2 will want - to ask for v2, then v1) - * The Scale resource is an example - a generic client may know v1 Scale, but when - v2 Scale is introduced the generic client will still only request v1 Scale from - any given resource, and the server that no longer recognizes v1 Scale must - indicate that to the client. -* We wish to preserve the greatest possible query parameter space for sub resources - and special cases, which encourages us to avoid polluting the API with query - parameters that can be otherwise represented as alternate forms. -* We do not wish to allow deep orthogonal parameterization - a list of pods is a list - of pods regardless of the form, and the parameters passed to the JSON representation - should not vary significantly to the tabular representation. -* Because we expect not all extensions will implement protobuf, an efficient client - must continue to be able to "fall-back" to JSON, such as for third party - resources. 
-* We do not wish to create fake content-types like `application/json+kubernetes+v1+meta.k8s.io` - because the list of combinations is unbounded and our ability to encode specific values - (like slashes) into the value is limited. - -### Client negotiation of response representation - -When a client wishes to request an alternate representation of an object, it should form -a valid `Accept` header containing one or more accepted representations, where each -representation is represented by a media-type and [media-type parameters](https://tools.ietf.org/html/rfc6838#section-4.3). -The server should omit representations that are unrecognized or in error - if no representations -are left after omission the server should return a `406 Not Acceptable` HTTP response. - -The supported parameters are: - -| Name | Value | Default | Description | -| ---- | ----- | ------- | ----------- | -| g | The group name of the desired response | Current group | The group the response is expected in. | -| v | The version of the desired response | Current version | The version the response is expected in. Note that this is separate from Group because `/` is not a valid character in Accept headers. | -| as | Kind name | None | If specified, transform the resource into the following kind (including the group and version parameters). | -| sv | The server group (`meta.k8s.io`) version that should be applied to generic resources returned by this endpoint | Matching server version for the current group and version | If specified, the server should transform generic responses into this version of the server API group. | -| export | `1` | None | If specified, transform the resource prior to returning to omit defaulted fields. Additional arguments allowed in the query parameter. For legacy reasons, `?export=1` will continue to be supported on the request | -| pretty | `0`/`1` | `1` | If specified, apply formatting to the returned response that makes the serialization readable (for JSON, use indentation) | - -Examples: - -``` -# Request a PodList in an alternate form -GET /v1/pods -Accept: application/json;as=Table;g=meta.k8s.io;v=v1 - -# Request a PodList in an alternate form, with pretty JSON formatting -GET /v1/pods -Accept: application/json;as=Table;g=meta.k8s.io;v=v1;pretty=1 - -# Request that status messages be of the form meta.k8s.io/v2 on the response -GET /v1/pods -Accept: application/json;sv=v2 -{ - "kind": "Status", - "apiVersion": "meta.k8s.io/v2", - ... -} -``` - -For both export and the more complicated server side `kubectl get` cases, it's likely that -more parameters are required and should be specified as query parameters. However, the core -behavior is best represented as a variation on content-type. Supporting both is not limiting -in the short term as long as we can validate correctly. - -As a simplification for common use, we should create **media-type aliases** which may show up in lists of mime-types supported -and simplify use for clients. 
For example, the following aliases would be reasonable: - -* `application/json+vnd.kubernetes.export` would return the requested object in export form -* `application/json+vnd.kubernetes.as+meta.k8s.io+v1+TabularOutput` would return the requested object in a tabular form -* `text/csv` would return the requested object in a tabular form in the comma-separated-value (CSV) format - -### Example: Partial metadata retrieval - -The client may request to the server to return the list of namespaces as a -`PartialObjectMetadata` kind, which is an object containing only `ObjectMeta` and -can be serialized as protobuf or JSON. This is expected to be significantly more -performant when controllers like the Garbage collector retrieve multiple objects. - - GET /api/v1/namespaces - Accept: application/json;g=meta.k8s.io,v=v1,as=PartialObjectMetadata, application/json - -The server would respond with - - 200 OK - Content-Type: application/json;g=meta.k8s.io,v=v1,as=PartialObjectMetadata - { - "apiVersion": "meta.k8s.io/v1", - "kind": "PartialObjectMetadataList", - "items": [ - { - "apiVersion": "meta.k8s.io/v1", - "kind": "PartialObjectMetadata", - "metadata": { - "name": "foo", - "resourceVersion": "10", - ... - } - }, - ... - ] - } - -In this example PartialObjectMetadata is a real registered type, and each API group -provides an efficient transformation from their schema to the partial schema directly. -The client upon retrieving this type can act as a generic resource. - -Note that the `as` parameter indicates to the server the Kind of the resource, but -the Kubernetes API convention of returning a List with a known schema continues. An older -server could ignore the presence of the `as` parameter on the media type and merely return -a `NamespaceList` and the client would either use the content-type or the object Kind -to distinguish. Because all responses are expected to be self-describing, an existing -Kubernetes client would be expected to differentiate on Kind. - -An old server, not recognizing these parameters, would respond with: - - 200 OK - Content-Type: application/json - { - "apiVersion": "v1", - "kind": "NamespaceList", - "items": [ - { - "apiVersion": "v1", - "kind": "Namespace", - "metadata": { - "name": "foo", - "resourceVersion": "10", - ... - } - }, - ... - ] - } - - -### Example: Retrieving a known version of the Scale resource - -Each API group that supports resources that can be scaled must expose a subresource on -their object that accepts GET or PUT with a `Scale` kind resource. This subresource acts -as a generic interface that a client that knows nothing about the underlying object can -use to modify the scale value of that resource. However, clients *must* be able to understand -the response the server provides, and over time the response may change and should therefore -be versioned. Our current API provides no way for a client to discover whether a `Scale` -response returned by `batch/v2alpha1` is the same as the `Scale` resource returned by -`autoscaling/v1`. - -Under this proposal, to scale a generic resource a client would perform the following -operations: - - GET /api/v1/namespace/example/replicasets/test/scale - Accept: application/json;g=meta.k8s.io,v=v1,as=Scale, application/json - - 200 OK - Content-Type: application/json;g=meta.k8s.io,v=v1,as=Scale - { - "apiVersion": "meta.k8s.io/v1", - "kind": "Scale", - "spec": { - "replicas": 1 - } - ... 
- } - -The client, seeing that a generic response was returned (`meta.k8s.io/v1`), knows that -the server supports accepting that resource as well, and performs a PUT: - - PUT /apis/extensions/v1beta1/namespace/example/replicasets/test/scale - Accept: application/json;g=meta.k8s.io,v=v1,as=Scale, application/json - Content-Type: application/json - { - "apiVersion": "meta.k8s.io/v1", - "kind": "Scale", - "spec": { - "replicas": 2 - } - } - - 200 OK - Content-Type: application/json;g=meta.k8s.io,v=v1,as=Scale - { - "apiVersion": "meta.k8s.io/v1", - "kind": "Scale", - "spec": { - "replicas": 2 - } - ... - } - -Note that the client still asks for the common Scale as the response so that it -can access the value it wants. - - -### Example: Retrieving an alternative representation of the resource for use in `kubectl get` - -As new extension groups are added to the server, all clients must implement simple "view" logic -for each resource. However, these views are specific to the resource in question, which only -the server is aware of. To make clients more tolerant of extension and third party resources, -it should be possible for clients to ask the server to present a resource or list of resources -in a tabular / descriptive format rather than raw JSON. - -While the design of serverside tabular support is outside the scope of this proposal, a few -knows apply. The server must return a structured resource usable by both command line and -rich clients (web or IDE), which implies a schema, which implies JSON, and which means the -server should return a known Kind. For this example we will call that kind `TabularOutput` -to demonstrate the concept. - -A server side resource would implement a transformation from their resource to `TabularOutput` -and the API machinery would translate a single item or a list of items (or a watch) into -the tabular resource. - -A generic client wishing to display a tabular list for resources of type `v1.ReplicaSets` would -make the following call: - - GET /api/v1/namespaces/example/replicasets - Accept: application/json;g=meta.k8s.io,v=v1,as=TabularOutput, application/json - - 200 OK - Content-Type: application/json;g=meta.k8s.io,v=v1,as=TabularOutput - { - "apiVersion": "meta.k8s.io/v1", - "kind": "TabularOutput", - "columns": [ - {"name": "Name", "description": "The name of the resource"}, - {"name": "Resource Version", "description": "The version of the resource"}, - ... - ], - "items": [ - {"columns": ["name", "10", ...]}, - ... - ] - } - -The client can then present that information as necessary. If the server returns the -resource list `v1.ReplicaSetList` the client knows that the server does not support tabular -output and so must fall back to a generic output form (perhaps using the existing -compiled in listers). - -Note that `kubectl get` supports a number of parameters for modifying the response, -including whether to filter resources, whether to show a "wide" list, or whether to -turn certain labels into columns. Those options are best represented as query parameters -and transformed into a known type. - - -### Example: Versioning a ListOptions call to a generic API server - -When retrieving lists of resources, the server transforms input query parameters like -`labels` and `fields` into a `ListOptions` type. It should be possible for a generic -client dealing with the server to be able to specify the version of ListOptions it -is sending to detect version skew. 
- -Since this is an input and list is implemented with GET, it is not possible to send -a body and no Content-Type is possible. For this approach, we recommend that the kind -and API version be specifiable via the GET call for further clarification: - -New query parameters: - -| Name | Value | Default | Description | -| ---- | ----- | ------- | ----------- | -| kind | The kind of parameters being sent | `ListOptions` (GET), `DeleteOptions` (DELETE) | The kind of the serialized struct, defaults to ListOptions on GET and DeleteOptions on DELETE. | -| queryVersion / apiVersion | The API version of the parameter struct | `meta.k8s.io/v1` | May be altered to match the expected version. Because we have not yet versioned ListOptions, this is safe to alter. | - -To send ListOptions in the v2 future format, where the serialization of `resourceVersion` -is changed to `rv`, clients would provide: - - GET /api/v1/namespaces/example/replicasets?apiVersion=meta.k8s.io/v2&rv=10 - -Before we introduce a second API group version, we would have to ensure old servers -properly reject apiVersions they do not understand. - - -### Impact on web infrastructure - -In the past, web infrastructure and old browsers have coped poorly with the `Accept` -header. However, most modern caching infrastructure properly supports `Vary: Accept` -and caching of responses has not been a significant requirement for Kubernetes APIs -to this point. - - -### Considerations for discoverability - -To ensure clients can discover these endpoints, the Swagger and OpenAPI documents -should also include a set of example mime-types for each endpoint that are supported. -Specifically, the `produces` field on an individual operation can be used to list a -set of well known types. The description of the operation can include a stanza about -retrieving alternate representations. - - -## Alternatives considered - -* Implement only with query parameters - - To properly implement alternative resource versions must support multiple version - support (ask for v2, then v1). The Accept mechanism already handles this sort of - multi-version negotiation, while any approach based on query parameters would - have to implement this option as well. In addition, some serializations may not - be valid in all content types, so the client asking for TabularOutput in protobuf - may also ask for TabularOutput in JSON - if TabularOutput is not valid in protobuf - the server call fall back to JSON. - -* Use new resource paths - `/apis/autoscaling/v1/namespaces/example/horizontalpodautoscalermetadata` - - This leads to a proliferation of paths which will confuse automated tools and end - users. Authorization, logging, audit may all need a way to map the two resources - as equivalent, while clients would need a discovery mechanism that identifies a - "same underlying object" relationship that is different from subresources. - -* Use a special HTTP header to denote the alternative representation - - Given the need to support multiple versions, this would be reimplementing Accept - in a slightly different way, so we prefer to reuse Accept. - -* For partial object retrieval, support complex field selectors - - From an efficiency perspective, calculating subpaths and filtering out sub fields - from the underlying object is complex. In practice, almost all filtering falls into - a few limited subsets, and thus retrieving an object into a few known schemas can be made - much more efficient. 
In addition, arbitrary transformation of the object provides - opportunities for supporting forward "partial" migration - for instance, returning a - ReplicationController as a ReplicaSet to simplify a transition across resource types. - While this is not under explicit consideration, allowing a caller to move objects across - schemas will eventually be a required behavior when dramatic changes occur in an API - schema. - -## Backwards Compatibility - -### Old clients - -Old clients would not be affected by the new Accept path. - -If servers begin returning Status in version `meta.k8s.io/v1`, old clients would likely error -as that group has never been used. We would continue to return the group version of the calling -API group on server responses unless the `sv` mime-type parameter is set. - - -### Old servers - -Because old Kubernetes servers are not selective about the content type parameters they -accept, we may wish to patch server versions to explicitly bypass content -types they do not recognize the parameters to. As a special consideration, this would allow -new clients to more strictly handle Accept (so that the server returns errors if the content -type is not recognized). - -As part of introducing the new API group `meta.k8s.io`, some opaque calls where we assume the -empty API group-version for the resource (GET parameters) could be defaulted to this group. - - -## Future items - -* ??? +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/api-machinery/api-chunking.md b/contributors/design-proposals/api-machinery/api-chunking.md index a04c9ba4..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/api-chunking.md +++ b/contributors/design-proposals/api-machinery/api-chunking.md @@ -1,177 +1,6 @@ -# Allow clients to retrieve consistent API lists in chunks +Design proposals have been archived. -On large clusters, performing API queries that return all of the objects of a given resource type (GET /api/v1/pods, GET /api/v1/secrets) can lead to significant variations in peak memory use on the server and contribute substantially to long tail request latency. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -When loading very large sets of objects -- some clusters are now reaching 100k pods or equivalent numbers of supporting resources -- the system must: -* Construct the full range description in etcd in memory and serialize it as protobuf in the client - * Some clusters have reported over 500MB being stored in a single object type - * This data is read from the underlying datastore and converted to a protobuf response - * Large reads to etcd can block writes to the same range (https://github.com/coreos/etcd/issues/7719) -* The data from etcd has to be transferred to the apiserver in one large chunk -* The `kube-apiserver` also has to deserialize that response into a single object, and then re-serialize it back to the client - * Much of the decoded etcd memory is copied into the struct used to serialize to the client -* An API client like `kubectl get` will then decode the response from JSON or protobuf - * An API client with a slow connection may not be able to receive the entire response body within the default 60s timeout - * This may cause other failures downstream of that API client with their own timeouts - * The recently introduced client compression feature can assist - * The large response will also be loaded entirely into memory - -The standard solution for reducing the impact of large reads is to allow them to be broken into smaller reads via a technique commonly referred to as paging or chunking. By efficiently splitting large list ranges from etcd to clients into many smaller list ranges, we can reduce the peak memory allocation on etcd and the apiserver, without losing the consistent read invariant our clients depend on. - -This proposal does not cover general purpose ranging or paging for arbitrary clients, such as allowing web user interfaces to offer paged output, but does define some parameters for future extension. To that end, this proposal uses the phrase "chunking" to describe retrieving a consistent snapshot range read from the API server in distinct pieces. - -Our primary consistent store etcd3 offers support for efficient chunking with minimal overhead, and mechanisms exist for other potential future stores such as SQL databases or Consul to also implement a simple form of consistent chunking. - -Relevant issues: - -* https://github.com/kubernetes/kubernetes/issues/2349 - -## Terminology - -**Consistent list** - A snapshot of all resources at a particular moment in time that has a single `resourceVersion` that clients can begin watching from to receive updates. All Kubernetes controllers depend on this semantic. Allows a controller to refresh its internal state, and then receive a stream of changes from the initial state. 
- -**API paging** - API parameters designed to allow a human to view results in a series of "pages". - -**API chunking** - API parameters designed to allow a client to break one large request into multiple smaller requests without changing the semantics of the original request. - - -## Proposed change: - -Expose a simple chunking mechanism to allow large API responses to be broken into consistent partial responses. Clients would indicate a tolerance for chunking (opt-in) by specifying a desired maximum number of results to return in a `LIST` call. The server would return up to that amount of objects, and if more exist it would return a `continue` parameter that the client could pass to receive the next set of results. The server would be allowed to ignore the limit if it does not implement limiting (backward compatible), but it is not allowed to support limiting without supporting a way to continue the query past the limit (may not implement `limit` without `continue`). - -``` -GET /api/v1/pods?limit=500 -{ - "metadata": {"continue": "ABC...", "resourceVersion": "147"}, - "items": [ - // no more than 500 items - ] -} -GET /api/v1/pods?limit=500&continue=ABC... -{ - "metadata": {"continue": "DEF...", "resourceVersion": "147"}, - "items": [ - // no more than 500 items - ] -} -GET /api/v1/pods?limit=500&continue=DEF... -{ - "metadata": {"resourceVersion": "147"}, - "items": [ - // no more than 500 items - ] -} -``` - -The token returned by the server for `continue` would be an opaque serialized string that would contain a simple serialization of a version identifier (to allow future extension), and any additional data needed by the server storage to identify where to start the next range. - -The continue token is not required to encode other filtering parameters present on the initial request, and clients may alter their filter parameters on subsequent chunk reads. However, the server implementation **may** reject such changes with a `400 Bad Request` error, and clients should consider this behavior undefined and left to future clarification. Chunking is intended to return consistent lists, and clients **should not** alter their filter parameters on subsequent chunk reads. - -If the resource version parameter specified on the request is inconsistent with the `continue` token, the server **must** reject the request with a `400 Bad Request` error. - -The schema of the continue token is chosen by the storage layer and is not guaranteed to remain consistent for clients - clients **must** consider the continue token as opaque. Server implementations **should** ensure that continue tokens can persist across server restarts and across upgrades. - -Servers **may** return fewer results than `limit` if server side filtering returns no results such as when a `label` or `field` selector is used. If the entire result set is filtered, the server **may** return zero results with a valid `continue` token. A client **must** use the presence of a `continue` token in the response to determine whether more results are available, regardless of the number of results returned. A server that supports limits **must not** return more results than `limit` if a `continue` token is also returned. If the server does not return a `continue` token, the server **must** return all remaining results. The server **may** return zero results with no `continue` token on the last call. - -The server **may** limit the amount of time a continue token is valid for. Clients **should** assume continue tokens last only a few minutes. 
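For illustration, a minimal client-side sketch of this protocol (assuming a current client-go clientset and a placeholder kubeconfig path), reading a large pod list in chunks of 500 while the server preserves a single consistent snapshot:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	opts := metav1.ListOptions{Limit: 500}
	total := 0
	for {
		pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), opts)
		if err != nil {
			panic(err) // handling of 410 ResourceExpired is discussed in the sections below
		}
		total += len(pods.Items)

		// An empty continue token means the server has returned all remaining results.
		if pods.Continue == "" {
			break
		}
		opts.Continue = pods.Continue
	}
	fmt.Printf("listed %d pods in chunks of up to 500\n", total)
}
```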
- -The server **must** support `continue` tokens that are valid across multiple API servers. The server **must** support a mechanism for rolling restart such that continue tokens are valid after one or all API servers have been restarted. - - -### Proposed Implementations - -etcd3 is the primary Kubernetes store and has been designed to support consistent range reads in chunks for this use case. The etcd3 store is an ordered map of keys to values, and Kubernetes places all keys within a resource type under a common prefix, with namespaces being a further prefix of those keys. A read of all keys within a resource type is an in-order scan of the etcd3 map, and therefore we can retrieve in chunks by defining a start key for the next chunk that skips the last key read. - -etcd2 will not be supported as it has no option to perform a consistent read and is on track to be deprecated in Kubernetes. Other databases that might back Kubernetes could either choose to not implement limiting, or leverage their own transactional characteristics to return a consistent list. In the near term our primary store remains etcd3 which can provide this capability at low complexity. - -Implementations that cannot offer consistent ranging (returning a set of results that are logically equivalent to receiving all results in one response) must not allow continuation, because consistent listing is a requirement of the Kubernetes API list and watch pattern. - -#### etcd3 - -For etcd3 the continue token would contain a resource version (the snapshot that we are reading that is consistent across the entire LIST) and the start key for the next set of results. Upon receiving a valid continue token the apiserver would instruct etcd3 to retrieve the set of results at a given resource version, beginning at the provided start key, limited by the maximum number of requests provided by the continue token (or optionally, by a different limit specified by the client). If more results remain after reading up to the limit, the storage should calculate a continue token that would begin at the next possible key, and the continue token set on the returned list. - -The storage layer in the apiserver must apply consistency checking to the provided continue token to ensure that malicious users cannot trick the server into serving results outside of its range. The storage layer must perform defensive checking on the provided value, check for path traversal attacks, and have stable versioning for the continue token. - -#### Possible SQL database implementation - -A SQL database backing a Kubernetes server would need to implement a consistent snapshot read of an entire resource type, plus support changefeed style updates in order to implement the WATCH primitive. A likely implementation in SQL would be a table that stores multiple versions of each object, ordered by key and version, and filters out all historical versions of an object. A consistent paged list over such a table might be similar to: - - SELECT * FROM resource_type WHERE resourceVersion < ? AND deleted = false AND namespace > ? AND name > ? LIMIT ? ORDER BY namespace, name ASC - -where `namespace` and `name` are part of the continuation token and an index exists over `(namespace, name, resourceVersion, deleted)` that makes the range query performant. The highest returned resource version row for each `(namespace, name)` tuple would be returned. 
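For illustration only, and not the actual apiserver code, a continue token of the kind described above could be encoded roughly as follows; clients must still treat the resulting string as opaque:

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// continueToken is a hypothetical token layout: a format version (to allow
// future changes), the snapshot resourceVersion the list is served from, and
// the key the next chunk should begin at.
type continueToken struct {
	APIVersion      string `json:"v"`
	ResourceVersion int64  `json:"rv"`
	StartKey        string `json:"start"`
}

func encodeContinue(rv int64, nextKey string) string {
	t := continueToken{APIVersion: "meta.k8s.io/v1", ResourceVersion: rv, StartKey: nextKey}
	b, _ := json.Marshal(t)
	return base64.RawURLEncoding.EncodeToString(b)
}

func decodeContinue(token string) (continueToken, error) {
	var t continueToken
	b, err := base64.RawURLEncoding.DecodeString(token)
	if err != nil {
		return t, err
	}
	// A real implementation must also validate the token version, reject
	// unexpected values, and defend against path traversal in the start key.
	err = json.Unmarshal(b, &t)
	return t, err
}

func main() {
	tok := encodeContinue(147, "/registry/pods/kube-system/next-pod-name")
	fmt.Println(tok)
	fmt.Println(decodeContinue(tok))
}
```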
- - -### Security implications of returning last or next key in the continue token - -If the continue token encodes the next key in the range, that key may expose info that is considered security sensitive, whether simply the name or namespace of resources not under the current tenant's control, or more seriously the name of a resource which is also a shared secret (for example, an access token stored as a kubernetes resource). There are a number of approaches to mitigating this impact: - -1. Disable chunking on specific resources -2. Disable chunking when the user does not have permission to view all resources within a range -3. Encrypt the next key or the continue token using a shared secret across all API servers -4. When chunking, continue reading until the next visible start key is located after filtering, so that start keys are always keys the user has access to. - -In the short term we have no supported subset filtering (i.e. a user who can LIST can also LIST ?fields= and vice versa), so 1 is sufficient to address the sensitive key name issue. Because clients are required to proceed as if limiting is not possible, the server is always free to ignore a chunked request for other reasons. In the future, 4 may be the best option because we assume that most users starting a consistent read intend to finish it, unlike more general user interface paging where only a small fraction of requests continue to the next page. - - -### Handling expired resource versions - -If the required data to perform a consistent list is no longer available in the storage backend (by default, old versions of objects in etcd3 are removed after 5 minutes), the server **must** return a `410 Gone ResourceExpired` status response (the same as for watch), which means clients must start from the beginning. - -``` -# resourceVersion is expired -GET /api/v1/pods?limit=500&continue=DEF... -{ - "kind": "Status", - "code": 410, - "reason": "ResourceExpired" -} -``` - -Some clients may wish to follow a failed paged list with a full list attempt. - -The 5 minute default compaction interval for etcd3 bounds how long a list can run. Since clients may wish to perform processing over very large sets, increasing that timeout may make sense for large clusters. It should be possible to alter the interval at which compaction runs to accommodate larger clusters. - - -#### Types of clients and impact - -Some clients such as controllers, receiving a 410 error, may instead wish to perform a full LIST without chunking. - -* Controllers with full caches - * Any controller with a full in-memory cache of one or more resources almost certainly depends on having a consistent view of resources, and so will either need to perform a full list or a paged list, without dropping results -* `kubectl get` - * Most administrators would probably prefer to see a very large set with some inconsistency rather than no results (due to a timeout under load). They would likely be ok with handling `410 ResourceExpired` as "continue from the last key I processed" -* Migration style commands - * Assuming a migration command has to run on the full data set (to upgrade a resource from json to protobuf, or to check a large set of resources for errors) and is performing some expensive calculation on each, very large sets may not complete over the server expiration window. 
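A consistent client (for example, a controller refreshing a full cache) therefore needs a simple restart-on-expiry loop around its chunked list. A hedged sketch, assuming a paging helper like the one shown earlier and the apimachinery errors package:

```go
package chunking

import (
	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// listAllPodsConsistently retries the whole paged list whenever the continue
// token expires. listInChunks stands in for a paging loop like the one
// sketched earlier, which returns an error if any chunk fails.
func listAllPodsConsistently(listInChunks func() ([]corev1.Pod, error)) ([]corev1.Pod, error) {
	for {
		pods, err := listInChunks()
		if err == nil {
			return pods, nil
		}
		if apierrors.IsResourceExpired(err) {
			// The continue token outlived the compaction window; the only way to
			// obtain a consistent snapshot is to restart the list from the beginning.
			continue
		}
		return nil, err
	}
}
```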
- -For clients that do not care about consistency, the server **may** return a `continue` value on the `ResourceExpired` error that allows the client to restart from the same prefix key, but using the latest resource version. This would allow clients that do not require a fully consistent LIST to opt in to partially consistent LISTs but still be able to scan the entire working set. It is likely this could be a sub field (opaque data) of the `Status` response under `statusDetails`. - - -### Rate limiting - -Since the goal is to reduce spikiness of load, the standard API rate limiter might prefer to rate limit page requests differently from global lists, allowing full LISTs only slowly while smaller pages can proceed more quickly. - - -### Chunk by default? - -On a very large data set, chunking trades total memory allocated in etcd, the apiserver, and the client for higher overhead per request (request/response processing, authentication, authorization). Picking a sufficiently high chunk value like 500 or 1000 would not impact smaller clusters, but would reduce the peak memory load of a very large cluster (10k resources and up). In testing, no significant overhead was shown in etcd3 for a paged historical query which is expected since the etcd3 store is an MVCC store and must always filter some values to serve a list. - -For clients that must perform sequential processing of lists (kubectl get, migration commands) this change dramatically improves initial latency - clients got their first chunk of data in milliseconds, rather than seconds for the full set. It also improves user experience for web consoles that may be accessed by administrators with access to large parts of the system. - -It is recommended that most clients attempt to page by default at a large page size (500 or 1000) and gracefully degrade to not chunking. - - -### Other solutions - -Compression from the apiserver and between the apiserver and etcd can reduce total network bandwidth, but cannot reduce the peak CPU and memory used inside the client, apiserver, or etcd processes. - -Various optimizations exist that can and should be applied to minimizing the amount of data that is transferred from etcd to the client or number of allocations made in each location, but do not how response size scales with number of entries. - - -## Plan - -The initial chunking implementation would focus on consistent listing on server and client as well as measuring the impact of chunking on total system load, since chunking will slightly increase the cost to view large data sets because of the additional per page processing. The initial implementation should make the fewest assumptions possible in constraining future backend storage. - -For the initial alpha release, chunking would be behind a feature flag and attempts to provide the `continue` or `limit` flags should be ignored. While disabled, a `continue` token should never be returned by the server as part of a list. - -Future work might offer more options for clients to page in an inconsistent fashion, or allow clients to directly specify the parts of the namespace / name keyspace they wish to range over (paging). +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/api-machinery/api-group.md b/contributors/design-proposals/api-machinery/api-group.md index 442c9ca8..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/api-group.md +++ b/contributors/design-proposals/api-machinery/api-group.md @@ -1,115 +1,6 @@ -# Supporting multiple API groups +Design proposals have been archived. -## Goal +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -1. Breaking the monolithic v1 API into modular groups and allowing groups to be enabled/disabled individually. This allows us to break the monolithic API server to smaller components in the future. -2. Supporting different versions in different groups. This allows different groups to evolve at different speed. - -3. Supporting identically named kinds to exist in different groups. This is useful when we experiment new features of an API in the experimental group while supporting the stable API in the original group at the same time. - -4. Exposing the API groups and versions supported by the server. This is required to develop a dynamic client. - -5. Laying the basis for [API Plugin](./extending-api.md). - -6. Keeping the user interaction easy. For example, we should allow users to omit group name when using kubectl if there is no ambiguity. - - -## Bookkeeping for groups - -1. No changes to TypeMeta: - - Currently many internal structures, such as RESTMapper and Scheme, are indexed and retrieved by APIVersion. For a fast implementation targeting the v1.1 deadline, we will concatenate group with version, in the form of "group/version", and use it where a version string is expected, so that many code can be reused. This implies we will not add a new field to TypeMeta, we will use TypeMeta.APIVersion to hold "group/version". - - For backward compatibility, v1 objects belong to the group with an empty name, so existing v1 config files will remain valid. - -2. /pkg/conversion#Scheme: - - The key of /pkg/conversion#Scheme.versionMap for versioned types will be "group/version". For now, the internal version types of all groups will be registered to versionMap[""], as we don't have any identically named kinds in different groups yet. In the near future, internal version types will be registered to versionMap["group/"], and pkg/conversion#Scheme.InternalVersion will have type []string. - - We will need a mechanism to express if two kinds in different groups (e.g., compute/pods and experimental/pods) are convertible, and auto-generate the conversions if they are. - -3. meta.RESTMapper: - - Each group will have its own RESTMapper (of type DefaultRESTMapper), and these mappers will be registered to pkg/api#RESTMapper (of type MultiRESTMapper). - - To support identically named kinds in different groups, We need to expand the input of RESTMapper.VersionAndKindForResource from (resource string) to (group, resource string). If group is not specified and there is ambiguity (i.e., the resource exists in multiple groups), an error should be returned to force the user to specify the group. - -## Server-side implementation - -1. resource handlers' URL: - - We will force the URL to be in the form of prefix/group/version/... - - Prefix is used to differentiate API paths from other paths like /healthz. All groups will use the same prefix="apis", except when backward compatibility requires otherwise. No "/" is allowed in prefix, group, or version. 
Specifically, - - * for /api/v1, we set the prefix="api" (which is populated from cmd/kube-apiserver/app#APIServer.APIPrefix), group="", version="v1", so the URL remains to be /api/v1. - - * for new kube API groups, we will set the prefix="apis" (we will add a field in type APIServer to hold this prefix), group=GROUP_NAME, version=VERSION. For example, the URL of the experimental resources will be /apis/experimental/v1alpha1. - - * for OpenShift v1 API, because it's currently registered at /oapi/v1, to be backward compatible, OpenShift may set prefix="oapi", group="". - - * for other new third-party API, they should also use the prefix="apis" and choose the group and version. This can be done through the third-party API plugin mechanism in [13000](http://pr.k8s.io/13000). - -2. supporting API discovery: - - * At /prefix (e.g., /apis), API server will return the supported groups and their versions using pkg/api/unversioned#APIVersions type, setting the Versions field to "group/version". This is backward compatible, because currently API server does return "v1" encoded in pkg/api/unversioned#APIVersions at /api. (We will also rename the JSON field name from `versions` to `apiVersions`, to be consistent with pkg/api#TypeMeta.APIVersion field) - - * At /prefix/group, API server will return all supported versions of the group. We will create a new type VersionList (name is open to discussion) in pkg/api/unversioned as the API. - - * At /prefix/group/version, API server will return all supported resources in this group, and whether each resource is namespaced. We will create a new type APIResourceList (name is open to discussion) in pkg/api/unversioned as the API. - - We will design how to handle deeper path in other proposals. - - * At /swaggerapi/swagger-version/prefix/group/version, API server will return the Swagger spec of that group/version in `swagger-version` (e.g. we may support both Swagger v1.2 and v2.0). - -3. handling common API objects: - - * top-level common API objects: - - To handle the top-level API objects that are used by all groups, we either have to register them to all schemes, or we can choose not to encode them to a version. We plan to take the latter approach and place such types in a new package called `unversioned`, because many of the common top-level objects, such as APIVersions, VersionList, and APIResourceList, which are used in the API discovery, and pkg/api#Status, are part of the protocol between client and server, and do not belong to the domain-specific parts of the API, which will evolve independently over time. - - Types in the unversioned package will not have the APIVersion field, but may retain the Kind field. - - For backward compatibility, when handling the Status, the server will encode it to v1 if the client expects the Status to be encoded in v1, otherwise the server will send the unversioned#Status. If an error occurs before the version can be determined, the server will send the unversioned#Status. - - * non-top-level common API objects: - - Assuming object o belonging to group X is used as a field in an object belonging to group Y, currently genconversion will generate the conversion functions for o in package Y. Hence, we don't need any special treatment for non-top-level common API objects. - - TypeMeta is an exception, because it is a common object that is used by objects in all groups but does not logically belong to any group. We plan to move it to the package `unversioned`. - -## Client-side implementation - -1. 
clients: - - Currently we have structured (pkg/client/unversioned#ExperimentalClient, pkg/client/unversioned#Client) and unstructured (pkg/kubectl/resource#Helper) clients. The structured clients are not scalable because each of them implements a specific interface, e.g., [here](../../pkg/client/unversioned/client.go#L32). Only the unstructured clients are scalable. We should either auto-generate the code for structured clients or migrate to use the unstructured clients as much as possible. - - We should also move the unstructured client to pkg/client/. - -2. Spelling the URL: - - The URL is in the form of prefix/group/version/. The prefix is hard-coded in the client/unversioned.Config. The client should be able to figure out `group` and `version` using the RESTMapper. For a third-party client which does not have access to the RESTMapper, it should discover the mapping of `group`, `version` and `kind` by querying the server as described in point 2 of #server-side-implementation. - -3. kubectl: - - kubectl should accept arguments like `group/resource` and `group/resource/name`. Nevertheless, the user can omit the `group`; in that case kubectl shall rely on RESTMapper.VersionAndKindForResource() to figure out the default group/version of the resource. For example, for resources (like `node`) that exist in both the k8s v1 API and a modularized k8s API (like `infra/v2`), we should set the kubectl default to use one of them. If there is no default group, kubectl should return an error for the ambiguity. - - When kubectl is used with a single resource type, the --api-version and --output-version flags of kubectl should accept values in the form of `group/version`, and they should work as they do today. For multi-resource operations, we will disable these two flags initially. - - Currently, by setting pkg/client/unversioned/clientcmd/api/v1#Config.NamedCluster[x].Cluster.APIVersion ([here](../../pkg/client/unversioned/clientcmd/api/v1/types.go#L58)), the user can configure the default apiVersion used by kubectl to talk to the server. It does not make sense to set a global version used by kubectl when there are multiple groups, so we plan to deprecate this field. We may extend the version negotiation function to negotiate the preferred version of each group. Details will be in another proposal. - -## OpenShift integration - -OpenShift can take a similar approach to break up its monolithic v1 API: keep the v1 objects where they are, and gradually add groups. - -The v1 objects in OpenShift should keep doing what they do now: they should remain registered to the Scheme.versionMap["v1"] scheme, and they should keep being added to the originMapper. - -New OpenShift groups should do the same as native Kubernetes groups: each group should register to Scheme.versionMap["group/version"], and each should have a separate RESTMapper and register it with the MultiRESTMapper. - -To expose a list of the supported OpenShift groups to clients, OpenShift just has to call pkg/cmd/server/origin#initAPIVersionRoute() as it does now, passing in the supported "group/versions" instead of "versions". - - -## Future work - -1. Dependencies between groups: we need an interface to register the dependencies between groups. It is not our priority now as the use cases are not clear yet. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/api-machinery/apiserver-build-in-admission-plugins.md b/contributors/design-proposals/api-machinery/apiserver-build-in-admission-plugins.md index cefaf8fd..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/apiserver-build-in-admission-plugins.md +++ b/contributors/design-proposals/api-machinery/apiserver-build-in-admission-plugins.md @@ -1,80 +1,6 @@ -# Build some Admission Controllers into the Generic API server library +Design proposals have been archived. -**Related PR:** +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -| Topic | Link | -| ----- | ---- | -| Admission Control | https://git.k8s.io/community/contributors/design-proposals/api-machinery/admission_control.md | -## Introduction - -An admission controller is a piece of code that intercepts requests to the Kubernetes API - think a middleware. -The API server lets you have a whole chain of them. Each is run in sequence before a request is accepted -into the cluster. If any of the plugins in the sequence rejects the request, the entire request is rejected -immediately and an error is returned to the user. - -Many features in Kubernetes require an admission control plugin to be enabled in order to properly support the feature. -In fact in the [documentation](https://kubernetes.io/docs/admin/admission-controllers/#is-there-a-recommended-set-of-plug-ins-to-use) you will find -a recommended set of them to use. - -At the moment admission controllers are implemented as plugins and they have to be compiled into the -final binary in order to be used at a later time. Some even require an access to cache, an authorizer etc. -This is where an admission plugin initializer kicks in. An admission plugin initializer is used to pass additional -configuration and runtime references to a cache, a client and an authorizer. - -To streamline the process of adding new plugins especially for aggregated API servers we would like to build some plugins -into the generic API server library and provide a plugin initializer. While anyone can author and register one, having a known set of -provided references let's people focus on what they need their admission plugin to do instead of paying attention to wiring. - -## Implementation - -The first step would involve creating a "standard" plugin initializer that would be part of the -generic API server. It would use kubeconfig to populate -[external clients](https://git.k8s.io/kubernetes/pkg/kubeapiserver/admission/initializer.go#L29) -and [external informers](https://git.k8s.io/kubernetes/pkg/kubeapiserver/admission/initializer.go#L35). -By default for servers that would be run on the kubernetes cluster in-cluster config would be used. -The standard initializer would also provide a client config for connecting to the core kube-apiserver. -Some API servers might be started as static pods, which don't have in-cluster configs. -In that case the config could be easily populated form the file. - -The second step would be to move some plugins from [admission pkg](https://git.k8s.io/kubernetes/plugin/pkg/admission) -to the generic API server library. Some admission plugins are used to ensure consistent user expectations. -These plugins should be moved. One example is the Namespace Lifecycle plugin which prevents users -from creating resources in non-existent namespaces. 
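As a hedged sketch of the wiring described above (not the actual initializer code; the kubeconfig path and resync period are illustrative), the standard initializer essentially needs to do the following, handing the resulting clientset and informer factory to plugins through the initializer interfaces:

```go
package admissionwiring

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

// buildExternalDependencies constructs the external clientset and shared
// informer factory that a "standard" plugin initializer would pass to
// admission plugins.
func buildExternalDependencies(kubeconfigPath string) (kubernetes.Interface, informers.SharedInformerFactory, error) {
	// Prefer the in-cluster config; API servers run as static pods may not
	// have one, in which case the config is read from a file.
	config, err := rest.InClusterConfig()
	if err != nil {
		config, err = clientcmd.BuildConfigFromFlags("", kubeconfigPath)
		if err != nil {
			return nil, nil, err
		}
	}

	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		return nil, nil, err
	}

	// Plugins such as NamespaceLifecycle would use the informer factory to
	// look up namespaces without hitting the API server on every request.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	return client, factory, nil
}
```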
- -*Note*: -For loading in-cluster configuration [visit](https://git.k8s.io/kubernetes/staging/src/k8s.io/client-go/examples/in-cluster-client-configuration/main.go) - For loading the configuration directly from a file [visit](https://git.k8s.io/kubernetes/staging/src/k8s.io/client-go/examples/out-of-cluster-client-configuration/main.go) - -## How to add an admission plugin ? - At this point adding an admission plugin is very simple and boils down to performing the -following series of steps: - 1. Write an admission plugin - 2. Register the plugin - 3. Reference the plugin in the admission chain - -## An example -The sample apiserver provides an example admission plugin that makes meaningful use of the "standard" plugin initializer. -The admission plugin ensures that a resource name is not on the list of banned names. -The source code of the plugin can be found [here](https://github.com/kubernetes/kubernetes/blob/2f00e6d72c9d58fe3edc3488a91948cf4bfcc6d9/staging/src/k8s.io/sample-apiserver/pkg/admission/plugin/banflunder/admission.go). - -Having the plugin, the next step is the registration. [AdmissionOptions](https://github.com/kubernetes/kubernetes/blob/2f00e6d72c9d58fe3edc3488a91948cf4bfcc6d9/staging/src/k8s.io/apiserver/pkg/server/options/admission.go) -provides two important things. Firstly it exposes [a register](https://github.com/kubernetes/kubernetes/blob/2f00e6d72c9d58fe3edc3488a91948cf4bfcc6d9/staging/src/k8s.io/apiserver/pkg/server/options/admission.go#L43) -under which all admission plugins are registered. In fact, that's exactly what the [Register](https://github.com/kubernetes/kubernetes/blob/2f00e6d72c9d58fe3edc3488a91948cf4bfcc6d9/staging/src/k8s.io/sample-apiserver/pkg/admission/plugin/banflunder/admission.go#L33) -method does from our example admission plugin. It accepts a global registry as a parameter and then simply registers itself in that registry. -Secondly, it adds an admission chain to the server configuration via [ApplyTo](https://github.com/kubernetes/kubernetes/blob/2f00e6d72c9d58fe3edc3488a91948cf4bfcc6d9/staging/src/k8s.io/apiserver/pkg/server/options/admission.go#L66) method. -The method accepts optional parameters in the form of `pluginInitalizers`. This is useful when admission plugins need custom configuration that is not provided by the generic initializer. - -The following code has been extracted from the sample server and illustrates how to register and wire an admission plugin: - -```go - // register admission plugins - banflunder.Register(o.Admission.Plugins) - - // create custom plugin initializer - informerFactory := informers.NewSharedInformerFactory(client, serverConfig.LoopbackClientConfig.Timeout) - admissionInitializer, _ := wardleinitializer.New(informerFactory) - - // add admission chain to the server configuration - o.Admission.ApplyTo(serverConfig, admissionInitializer) -``` +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/api-machinery/apiserver-count-fix.md b/contributors/design-proposals/api-machinery/apiserver-count-fix.md index a2f312a6..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/apiserver-count-fix.md +++ b/contributors/design-proposals/api-machinery/apiserver-count-fix.md @@ -1,86 +1,6 @@ -# apiserver-count fix proposal +Design proposals have been archived. - -Authors: @rphillips +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). - -## Table of Contents -1. [Overview](#overview) -2. [Known Issues](#known-issues) -3. [Proposal](#proposal) -4. [Alternate Proposals](#alternate-proposals) - 1. [Custom Resource Definitions](#custom-resource-definitions) - 2. [Refactor Old Reconciler](#refactor-old-reconciler) - -## Overview - -Proposal to fix Issue [#22609](https://github.com/kubernetes/kubernetes/issues/22609) - -`kube-apiserver` currently has a command-line argument `--apiserver-count` -specifying the number of API servers. This masterCount is used in the -MasterCountEndpointReconciler on a 10 second interval to potentially clean up -stale API Endpoints. Problems arise when the number of kube-apiserver instances -drops below or rises above the masterCount: in the former case, the stale -instances within the Endpoints do not get cleaned up, and in the latter case -the endpoints start to flap. - -## Known Issues - -Each apiserver’s reconciler only cleans up for its own IP. If a new -server is spun up at a new IP, then the old IP in the Endpoints list is -only reclaimed if the number of apiservers becomes greater than or equal -to the masterCount. For example: - -* If the masterCount = 3, and there are 3 API servers running (named: A, B, and C) -* ‘B’ API server is terminated for any reason -* The IP for endpoint ‘B’ is not -removed from the Endpoints list - -There is logic within the -[MasterCountEndpointReconciler](https://github.com/kubernetes/kubernetes/blob/68814c0203c4b8abe59812b1093844a1f9bdac05/pkg/master/controller.go#L293) -to attempt to make the Endpoints eventually consistent, but the code relies on -the Endpoints count becoming equal to or greater than the masterCount. When the -number of apiservers becomes greater than the masterCount, the Endpoints tend to flap. - -If the number of endpoints were scaled down by automation, then the -Endpoints would never become consistent. - -## Proposal - -### Create New Reconciler - -| Kubernetes Release | Quality | Description | -| ------------- | ------------- | ----------- | -| 1.9 | alpha | <ul><li>Add a new reconciler</li><li>Add a command-line flag `--alpha-apiserver-endpoint-reconciler-type`<ul><li>storage</li><li>default</li></ul></li></ul> -| 1.10 | beta | <ul><li>Turn on the `storage` type by default</li></ul> -| 1.11 | stable | <ul><li>Remove code for old reconciler</li><li>Remove --apiserver-count</li></ul> - -The MasterCountEndpointReconciler does not meet the current needs for durability -of API Endpoint creation, deletion, or failure cases. - -Custom Resource Definitions were proposed, but they do not have clean layering. -Additionally, liveness and locking would be a nice-to-have feature for a long -term solution. - -ConfigMaps were proposed, but since they are watched globally, liveness -updates could be overly chatty.
- -By porting OpenShift's -[LeaseEndpointReconciler](https://github.com/openshift/origin/blob/master/pkg/cmd/server/election/lease_endpoint_reconciler.go) -to Kubernetes we can use the Storage API directly to store Endpoints -dynamically within the system. - -### Alternate Proposals - -#### Custom Resource Definitions and ConfigMaps - -CRD's and ConfigMaps were considered for this proposal. They were not adopted -for this proposal by the community due to technical issues explained earlier. - -#### Refactor Old Reconciler - -| Release | Quality | Description | -| ------- | ------- | ------------------------------------------------------------ | -| 1.9 | stable | Change the logic in the current reconciler - -We could potentially reuse the old reconciler by changing the reconciler to count -the endpoints and set the `masterCount` (with a RWLock) to the count. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
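The "refactor old reconciler" alternative above amounts to replacing the static masterCount with a value derived from the observed Endpoints, guarded by a read-write lock. A minimal sketch, with illustrative names:

```go
package reconcilers

import "sync"

// dynamicMasterCount replaces the static --apiserver-count value with a count
// derived from the Endpoints the reconciler last observed.
type dynamicMasterCount struct {
	mu    sync.RWMutex
	count int
}

// Set records the number of apiserver endpoints observed in this cycle.
func (d *dynamicMasterCount) Set(observed int) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.count = observed
}

// Get returns the value the reconciliation logic should treat as masterCount.
func (d *dynamicMasterCount) Get() int {
	d.mu.RLock()
	defer d.mu.RUnlock()
	return d.count
}
```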
\ No newline at end of file diff --git a/contributors/design-proposals/api-machinery/apiserver-watch.md b/contributors/design-proposals/api-machinery/apiserver-watch.md index 7d509e4d..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/apiserver-watch.md +++ b/contributors/design-proposals/api-machinery/apiserver-watch.md @@ -1,139 +1,6 @@ -## Abstract +Design proposals have been archived. -In the current system, most watch requests sent to apiserver are redirected to -etcd. This means that for every watch request the apiserver opens a watch on -etcd. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -The purpose of the proposal is to improve the overall performance of the system -by solving the following problems: -- having too many open watches on etcd -- avoiding deserializing/converting the same objects multiple times in different -watch results - -In the future, we would also like to add an indexing mechanism to the watch. -Although Indexer is not part of this proposal, it is supposed to be compatible -with it - in the future Indexer should be incorporated into the proposed new -watch solution in apiserver without requiring any redesign. - - -## High level design - -We are going to solve those problems by allowing many clients to watch the same -storage in the apiserver, without being redirected to etcd. - -At the high level, apiserver will have a single watch open to etcd, watching all -the objects (of a given type) without any filtering. The changes delivered from -etcd will then be stored in a cache in apiserver. This cache is in fact a -"rolling history window" that will support clients having some amount of latency -between their list and watch calls. Thus it will have a limited capacity and -whenever a new change comes from etcd when a cache is full, the oldest change -will be remove to make place for the new one. - -When a client sends a watch request to apiserver, instead of redirecting it to -etcd, it will cause: - - - registering a handler to receive all new changes coming from etcd - - iterating though a watch window, starting at the requested resourceVersion - to the head and sending filtered changes directory to the client, blocking - the above until this iteration has caught up - -This will be done be creating a go-routine per watcher that will be responsible -for performing the above. - -The following section describes the proposal in more details, analyzes some -corner cases and divides the whole design in more fine-grained steps. - - -## Proposal details - -We would like the cache to be __per-resource-type__ and __optional__. Thanks to -it we will be able to: - - have different cache sizes for different resources (e.g. bigger cache - [= longer history] for pods, which can significantly affect performance) - - avoid any overhead for objects that are watched very rarely (e.g. events - are almost not watched at all, but there are a lot of them) - - filter the cache for each watcher more effectively - -If we decide to support watches spanning different resources in the future and -we have an efficient indexing mechanisms, it should be relatively simple to unify -the cache to be common for all the resources. - -The rest of this section describes the concrete steps that need to be done -to implement the proposal. - -1. Since we want the watch in apiserver to be optional for different resource -types, this needs to be self-contained and hidden behind a well-defined API. 
-This should be a layer very close to etcd - in particular all registries: -"pkg/registry/generic/registry" should be built on top of it. -We will solve it by turning tools.EtcdHelper by extracting its interface -and treating this interface as this API - the whole watch mechanisms in -apiserver will be hidden behind that interface. -Thanks to it we will get an initial implementation for free and we will just -need to reimplement few relevant functions (probably just Watch and List). -Moreover, this will not require any changes in other parts of the code. -This step is about extracting the interface of tools.EtcdHelper. - -2. Create a FIFO cache with a given capacity. In its "rolling history window" -we will store two things: - - - the resourceVersion of the object (being an etcdIndex) - - the object watched from etcd itself (in a deserialized form) - - This should be as simple as having an array and treating it as a cyclic buffer. - Obviously resourceVersion of objects watched from etcd will be increasing, but - they are necessary for registering a new watcher that is interested in all the - changes since a given etcdIndex. - - Additionally, we should support LIST operation, otherwise clients can never - start watching at now. We may consider passing lists through etcd, however - this will not work once we have Indexer, so we will need that information - in memory anyway. - Thus, we should support LIST operation from the "end of the history" - i.e. - from the moment just after the newest cached watched event. It should be - pretty simple to do, because we can incrementally update this list whenever - the new watch event is watched from etcd. - We may consider reusing existing structures cache.Store or cache.Indexer - ("pkg/client/cache") but this is not a hard requirement. - -3. Create the new implementation of the API, that will internally have a -single watch open to etcd and will store the data received from etcd in -the FIFO cache - this includes implementing registration of a new watcher -which will start a new go-routine responsible for iterating over the cache -and sending all the objects watcher is interested in (by applying filtering -function) to the watcher. - -4. Add a support for processing "error too old" from etcd, which will require: - - disconnect all the watchers - - clear the internal cache and relist all objects from etcd - - start accepting watchers again - -5. Enable watch in apiserver for some of the existing resource types - this -should require only changes at the initialization level. - -6. The next step will be to incorporate some indexing mechanism, but details -of it are TBD. - - - -### Future optimizations: - -1. The implementation of watch in apiserver internally will open a single -watch to etcd, responsible for watching all the changes of objects of a given -resource type. However, this watch can potentially expire at any time and -reconnecting can return "too old resource version". In that case relisting is -necessary. In such case, to avoid LIST requests coming from all watchers at -the same time, we can introduce an additional etcd event type: EtcdResync - - Whenever relisting will be done to refresh the internal watch to etcd, - EtcdResync event will be send to all the watchers. It will contain the - full list of all the objects the watcher is interested in (appropriately - filtered) as the parameter of this watch event. 
- Thus, we need to create the EtcdResync event, extend watch.Interface and - its implementations to support it and handle those events appropriately - in places like - [Reflector](https://git.k8s.io/kubernetes/staging/src/k8s.io/client-go/tools/cache/reflector.go) - - However, this might turn out to be unnecessary optimization if apiserver - will always keep up (which is possible in the new design). We will work - out all necessary details at that point. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
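The "rolling history window" from step 2 of the proposal is essentially a fixed-capacity cyclic buffer of (resourceVersion, object) pairs, from which a new watcher can be replayed starting at its requested resourceVersion, or told that the version is "too old". A hedged sketch with illustrative names, assuming a positive capacity (the eventual watch cache implementation differs in detail):

```go
package watchcache

import (
	"sync"

	"k8s.io/apimachinery/pkg/runtime"
)

// event is one entry of the rolling history window.
type event struct {
	resourceVersion uint64
	object          runtime.Object
}

// historyWindow is a fixed-capacity cyclic buffer of watch events.
type historyWindow struct {
	mu     sync.RWMutex
	events []event // backing array; len(events) is the capacity
	start  int     // index of the oldest event
	size   int     // number of events currently stored
}

// newHistoryWindow allocates a window; capacity must be positive.
func newHistoryWindow(capacity int) *historyWindow {
	return &historyWindow{events: make([]event, capacity)}
}

// add appends a new event, evicting the oldest one when the window is full.
func (w *historyWindow) add(e event) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.size == len(w.events) {
		w.events[w.start] = e
		w.start = (w.start + 1) % len(w.events)
		return
	}
	w.events[(w.start+w.size)%len(w.events)] = e
	w.size++
}

// since returns all cached events newer than rv, or ok=false when rv has
// already fallen out of the window ("too old resource version").
func (w *historyWindow) since(rv uint64) (out []event, ok bool) {
	w.mu.RLock()
	defer w.mu.RUnlock()
	if w.size > 0 && w.events[w.start].resourceVersion > rv+1 {
		return nil, false
	}
	for i := 0; i < w.size; i++ {
		e := w.events[(w.start+i)%len(w.events)]
		if e.resourceVersion > rv {
			out = append(out, e)
		}
	}
	return out, true
}
```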
\ No newline at end of file diff --git a/contributors/design-proposals/api-machinery/auditing.md b/contributors/design-proposals/api-machinery/auditing.md index 2770f56d..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/auditing.md +++ b/contributors/design-proposals/api-machinery/auditing.md @@ -1,376 +1,6 @@ -# Auditing +Design proposals have been archived. -Maciej Szulik (@soltysh) -Dr. Stefan Schimanski (@sttts) -Tim St. Clair (@timstclair) +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Abstract - -This proposal aims at extending the auditing log capabilities of the apiserver. - -## Motivation and Goals - -With [#27087](https://github.com/kubernetes/kubernetes/pull/27087) basic audit logging was added to Kubernetes. It basically implements `access.log` like http handler based logging of all requests in the apiserver API. It does not do deeper inspection of the API calls or of their payloads. Moreover, it has no specific knowledge of the API objects which are modified. Hence, the log output does not answer the question how API objects actually change. - -The log output format in [#27087](https://github.com/kubernetes/kubernetes/pull/27087) is fixed. It is text based, unstructured (e.g. non-JSON) data which must be parsed to be usable in any advanced external system used to analyze audit logs. - -The log output format does not follow any public standard like e.g. https://www.dmtf.org/standards/cadf. - -With this proposal we describe how the auditing functionality can be extended in order: - -- to allow multiple output formats, e.g. access.log style or structured JSON output -- to allow deep payload inspection to allow - - either real differential JSON output (which field of an object have changed from which value to which value) - - or full object output of the new state (and optionally the old state) -- to be extensible in the future to fully comply with the Cloud Auditing Data Federation standard (https://www.dmtf.org/standards/cadf) -- to allow filtering of the output - - by kind, e.g. don't log endpoint objects - - by object path (JSON path), e.g. to ignore all `*.status` changes - - by user, e.g. to only log end user action, not those of the controller-manager and scheduler - - by level (request headers, request object, storage object) - -while - -- not degrading apiserver performance when auditing is enabled. - -## Constraints and Assumptions - -* it is not the goal to implement all output formats one can imagine. The main goal is to be extensible with a clear golang interface. Implementations of e.g. CADF must be possible, but won't be discussed here. -* dynamic loading of backends for new output formats are out of scope. - -## Use Cases - -1. As a cluster operator I want to enable audit logging of requests to the apiserver in order **to comply with given business regulations** regarding a subset of the 7 Ws of auditing: - - - **what** happened? - - **when** did it happen? - - **who** initiated it? - - **on what** did it happen (e.g. pod foo/bar)? - - **where** was it observed (e.g. apiserver hostname)? - - from **where** was it initiated? (e.g. kubectl IP) - - to **where** was it going? (e.g. node 1.2.3.4 for kubectl proxy, apiserver when logged at aggregator). - -1. Depending on the environment, as a cluster operator I want to **define the amount of audit logging**, balancing computational overhead for the apiserver with the detail and completeness of the log. 
- -1. As a cluster operator I want to **integrate with external systems**, which will have different requirements for the log format, network protocols and communication modes (e.g. pull vs. push). - -1. As a cluster operator I must be able to provide a **complete trace of changes to an object** to API objects. - -1. As a cluster operator I must be able to create a trace for **all accesses to a secret**. - -1. As a cluster operator I must be able to log non-CRUD access like **kubectl exec**, when it started, when it finished and with which initial parameters. - -### Out of scope use-cases - -1. As a cluster operator I must be able to get a trace of non-REST calls executed against components other than kube-apiserver, kube-aggregator and their counterparts in federation. This includes operations requiring HTTP upgrade requests to support multiplexed bidirectional streams (HTTP/2, SPDY), direct calls to kubelet endpoints, port forwarding, etc. - -## Community Work - -- Kubernetes basic audit log PR: https://github.com/kubernetes/kubernetes/pull/27087/ -- OpenStack's implementation of the CADF standard: https://www.dmtf.org/sites/default/files/standards/documents/DSP2038_1.1.0.pdf -- Cloud Auditing Data Federation standard: https://www.dmtf.org/standards/cadf -- Ceilometer audit blueprint: https://wiki.openstack.org/wiki/Ceilometer/blueprints/support-standard-audit-formats -- Talk from IBM: An Introduction to DMTF Cloud Auditing using -the CADF Event Model and Taxonomies https://wiki.openstack.org/w/images/e/e1/Introduction_to_Cloud_Auditing_using_CADF_Event_Model_and_Taxonomy_2013-10-22.pdf - -## Architecture - -When implementing audit logging there are basically two options: - -1. put a logging proxy in front of the apiserver -2. integrate audit logging into the apiserver itself - -Both approaches have advantages and disadvantages: -- **pro proxy**: - + keeps complexity out of the apiserver - + reuses existing solutions -- **contra proxy**: - + has no deeper insight into the Kubernetes api - + has no knowledge of authn, authz, admission - + has no access to the storage level for differential output - + has to terminate SSL and complicates client certificates based auth - -In the following, the second approach is described without a proxy. At which point there are a few possible places, inside the apiserver, where auditing could happen, namely: -1. as one of the REST handlers (as in [#27087](https://github.com/kubernetes/kubernetes/pull/27087)), -2. as an admission controller. - -The former approach (currently implemented) was picked over the other one, due to the need to be able to get information about both the user submitting the request and the impersonated user (and group), which is being overridden inside the [impersonation filter](https://git.k8s.io/kubernetes/staging/src/k8s.io/apiserver/pkg/endpoints/filters/impersonation.go). Additionally admission controller does not have access to the response and runs after authorization which will prevent logging failed authorization. All of that resulted in continuing the solution started in [#27087](https://github.com/kubernetes/kubernetes/pull/27087), which implements auditing as one of the REST handlers -after authentication, but before impersonation and authorization. - -## Proposed Design - -The main concepts are those of - -- an audit *event*, -- an audit *level*, -- an audit *filters*, -- an audit *output backend*. - -An audit event holds all the data necessary for an *output backend* to produce an audit log entry. 
The *event* is independent of the *output backend*. - -The audit event struct is passed through the apiserver layers as an `*audit.Event` pointer inside the apiserver's `Context` object. It is `nil` when auditing is disabled. - -If auditing is enabled, the http handler will attach an `audit.Event` to the context: - -```go -func WithAuditEvent(parent Context, e *audit.Event) Context -func AuditEventFrom(ctx Context) (*audit.Event, bool) -``` - -Depending on the audit level (see [below](#levels)), different layers of the apiserver (e.g. http handler, storage) will fill the `audit.Event` struct. Certain fields might stay empty or `nil` if given level does not require that field. E.g. in the case when only http headers are supposed to be audit logged, no `OldObject` or `NewObject` is to be retrieved on the storage layer. - -### Levels - -Proposed audit levels are: - -- `None` - don't audit the request. -- `Metadata` - reflects the current level of auditing, iow. provides following information about each request: timestamp, source IP, HTTP method, user info (including group as user and as group), namespace, URI and response code. -- `RequestBody` - additionally provides the unstructured request body. -- `ResponseBody` - additionally provides the unstructured response body. Equivalent to `RequestBody` for streaming requests. -- `StorageObject` - provides the object before and after modification. - -```go -package audit - -// AuditLevel defines the amount of information logged during auditing -type AuditLevel string - -// Valid audit levels -const ( - // AuditNone disables auditing - AuditNone AuditLevel = "None" - // AuditMetadata provides basic level of auditing, logging data at HTTP level - AuditMetadata AuditLevel = "Metadata" - // AuditRequestBody provides Header level of auditing, and additionally - // logs unstructured request body - AuditRequestBody AuditLevel = "RequestBody" - // AuditResponseBody provides Request level of auditing, and additionally - // logs unstructured response body - AuditResponseBody AuditLevel = "ResponseBody" - // AuditStorageobject provides Response level, and additionally - // logs object before and after saving in storage - AuditStorageObject AuditLevel = "StorageObject" -) -``` - -The audit level is determined by the policy, which maps a combination of requesting user, namespace, verb, API group, and resource. The policy is described in detail [below](#policy). - -In an [aggregated](aggregated-api-servers.md) deployment, the `kube-aggregator` is able to fill in -`Metadata` level audit events, but not above. For the higher audit levels, an audit event is -generated _both_ in the `kube-aggregator` and in the end-user apiserver. The events can be -de-duplicated in the audit backend based on the audit ID, which is generated from the `Audit-ID` -header. The event generated by the end-user apiserver may not have the full authentication information. - -**Note:** for service creation and deletion there is special REST code in the apiserver which takes care of service/node port (de)allocation and removal of endpoints on service deletion. Hence, these operations are not visible on the API layer and cannot be audit logged therefore. 
**No other resources** (with the exception of componentstatus which is not of interest here) **implement this kind of custom CRUD operations.** - -### Events - -The `Event` object contains the following data: - -```go -package audit - -type Event struct { - // AuditLevel at which event was generated - Level AuditLevel - - // below fields are filled at Metadata level and higher: - - // Unique ID of the request being audited, and able to de-dupe audit events. - // Set from the `Audit-ID` header. - ID string - // Time the event reached the apiserver - Timestamp Timestamp - // Source IPs, from where the request originates, with intermediate proxy IPs. - SourceIPs []string - // HTTP method sent by the client - HttpMethod string - // Verb is the kube verb associated with the request for API requests. - // For non-resource requests, this is identical to HttpMethod. - Verb string - // Authentication method used to allow users access the cluster - AuthMethod string - // RequestURI is the Request-Line as sent by the client to a server - RequestURI string - // User information - User UserInfo - // Impersonation information - Impersonate UserInfo - // Object reference this request is targeted at - Object ObjectReference - // Response status code are returned by the server - ResponseStatusCode int - // Error response, if ResponseStatusCode >= 400 - ResponseErrorMessage string - - // below fields are filled at RequestObject level and higher: - - // RequestObject logged before admission (json format) - RequestObject runtime.Unstructured - // Response object in json format - ResponseObject runtime.Unstructured - - // below fields are filled at StorageObject level and higher: - - // Object value before modification (will be empty when creating new object) - OldObject runtime.Object - // Object value after modification (will be empty when removing object) - NewObject runtime.Object -} -``` - -### Policy - -The audit policy determines what audit event is generated for a given request. The policy is configured by the cluster administrator. Here is a sketch of the policy API: - -```go -type Policy struct { - // Rules specify the audit Level a request should be recorded at. - // A request may match multiple rules, in which case the FIRST matching rule is used. - // The default audit level is None, but can be overridden by a catch-all rule at the end of the list. - Rules []PolicyRule - - // Discussed under Filters section. - Filters []Filter -} - -// Based off the RBAC PolicyRule -type PolicyRule struct { - // Required. The Level that requests matching this rule are recorded at. - Level Level - - // The users (by authenticated user name) this rule applies to. - // An empty list implies every user. - Users []string - // The user groups this rule applies to. If a user is considered matching - // if they are a member of any of these groups - // An empty list implies every user group. - UserGroups []string - - // The verbs that match this rule. - // An empty list implies every verb. - Verbs []string - - // Rules can apply to API resources (such as "pods" or "secrets"), - // non-resource URL paths (such as "/api"), or neither, but not both. - // If neither is specified, the rule is treated as a default for all URLs. - - // APIGroups is the name of the APIGroup that contains the resources ("" for core). - // If multiple API groups are specified, any action requested against one of the - // enumerated resources in any API group will be allowed. - // Any empty list implies every group. 
- APIGroups []string - // Namespaces that this rule matches. - // This field should be left empty if specifying non-namespaced resources. - // Any empty list implies every namespace. - Namespaces []string - // GroupResources is a list of GroupResource types this rule applies to. - // Any empty list implies every resource type. - GroupResources []GroupResource - // ResourceNames is an optional white list of names that the rule applies to. - // Any empty list implies everything. - ResourceNames []string - - // NonResourceURLs is a set of partial urls that should be audited. - // *s are allowed, but only as the full, final step in the path. - // If an action is not a resource API request, then the URL is split on '/' and - // is checked against the NonResourceURLs to look for a match. - NonResourceURLs []string -} -``` - -As an example, the administrator may decide that by default requests should be audited with the -response, except for get and list requests. On top of that, they wish to completely ignore the noisy -`kube-proxy` endpoint requests. This looks like: - -```yaml -rules: - - level: None - users: ["system:kube-proxy"] - apiGroups: [""] # The core API group - resources: ["endpoints"] - - level: RequestBody - verbs: ["get", "list"] - # The default for non-resource URLs - - level: Metadata - nonResourceURLs: ["*"] - # The default for everything else - - level: ResponseBody -``` - -The policy is checked immediately after authentication in the request handling, and determines how -the `audit.Event` is formed. - -In an [aggregated](aggregated-api-servers.md) deployment, each apiserver must be independently -configured for audit logging (including the aggregator). - -### Filters - -In addition to the high-level policy rules, auditing can be controlled at a more fine-grained level -with `Filters`. Unlike the policy, filters are applied _after_ the `audit.Event` is constructed, but -before it's passed to the output backend. - -TODO: Define how filters work. They should enable dropping sensitive fields from the -request/response/storage objects. - -### Output Backend Interface - -```go -package audit - -type OutputBackend interface { - // Thread-safe, blocking. - Log(e *Event) error -} -``` - -It is the responsibility of the OutputBackend to manage concurrency, but it is acceptable for the -method to block. Errors will be handled by recording a count in prometheus, and attempting to write -to the standard debug log. - -### Apiserver Command Line Flags - -Deprecate flags currently used for configuring audit: -* `--audit-log-path` - specifies the file where requests are logged, -* `--audit-log-maxage` - specifies the maximum number of days to retain old log files, -* `--audit-log-maxbackup` - specifies maximum number of old files to retain, -* `--audit-log-maxsize` - specifies maximum size in megabytes of the log file. - -Following new flags should be introduced in the apiserver: -* `--audit-output` - which specifies the backend and its configuration, example: - `--audit-output file:path=/var/log/apiserver-audit.log,rotate=1d,max=1024MB,format=json` - which will log to `/var/log/apiserver-audit.log`, and additionally defines rotation arguments (analogically to the deprecated ones) and output format. -* `--audit-policy` - which specifies a file with policy configuration, see [policy](#policy) for a sample file contents. - -### Audit Security - -Several parts of the audit system could be exposed to spoofing or tampering threats. This section -lists the threats, and how we will mitigate them. 
- -**Audit ID.** The audit ID is set from the "front door" server, which could be a federation apiserver, -a kube-aggregator, or end-user apiserver. Since the server can't currently know where in the serving -chain it falls, it is possible for the client to set a `Audit-ID` header that is -non-unique. For this reason, any aggregation that happens based on the audit ID must also sanity -check the known fields (e.g. URL, source IP, time window, etc.). With this additional check, an -attacker could generate a bit more noise in the logs, but no information would be lost. - -**Source IP.** Kubernetes requests may go through multiple hops before a response is generated -(e.g. federation apiserver -> kube-aggregator -> end-user apiserver). Each hop must append the -previous sender's IP address to the `X-Forwarded-For` header IP chain. If we simply audited the -original sender's IP, an attacker could send there request with a bogus IP at the front of the -`X-Forwarded-For` chain. To mitigate this, we will log the entire IP chain. This has the additional -benefit of supporting external proxies. - -## Sensible (not necessarily sequential) Milestones of Implementation - -1. Add `audit.Event` and `audit.OutputBackend` and implement [#27087](https://github.com/kubernetes/kubernetes/pull/27087)'s basic auditing using them, using a single global audit Level, up to `ResponseBody`. -1. Implement the full `audit.Policy` rule specification. -1. Add deep inspection on the storage level to the old and the new object. -1. Add filter support (after finishing the Filters section of this proposal). - -## Future evolution - -Below are the possible future extensions to the auditing mechanism: -* Define how filters work. They should enable dropping sensitive fields from the request/response/storage objects. -* Allow setting a unique identifier which allows matching audit events across apiserver and federated servers. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
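For illustration, the simplest backend satisfying the `OutputBackend` interface defined in the proposal above is one that serializes each event as a JSON line to a writer. This is only a hedged sketch with illustrative names; the real backends, their configuration, and the exact `Event` serialization were still to be defined at the time of this proposal:

```go
package audit

import (
	"encoding/json"
	"fmt"
	"io"
	"sync"
)

// jsonLineBackend writes one JSON document per audit event. It serializes
// writes with a mutex, so Log is safe for concurrent use and may block, as
// the OutputBackend contract above allows.
type jsonLineBackend struct {
	mu  sync.Mutex
	out io.Writer
}

func newJSONLineBackend(out io.Writer) *jsonLineBackend {
	return &jsonLineBackend{out: out}
}

// Log implements OutputBackend by appending one JSON line per event.
func (b *jsonLineBackend) Log(e *Event) error {
	line, err := json.Marshal(e)
	if err != nil {
		return err
	}
	b.mu.Lock()
	defer b.mu.Unlock()
	_, err = fmt.Fprintln(b.out, string(line))
	return err
}
```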
\ No newline at end of file diff --git a/contributors/design-proposals/api-machinery/bulk_watch.md b/contributors/design-proposals/api-machinery/bulk_watch.md index 1b50f685..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/bulk_watch.md +++ b/contributors/design-proposals/api-machinery/bulk_watch.md @@ -1,476 +1,6 @@ -# Background +Design proposals have been archived. - As part of increasing security of a cluster, we are planning to limit the -ability of a given Kubelet (in general: node), to be able to read only -resources associated with it. Those resources, in particular means: secrets, -configmaps & persistentvolumeclaims. This is needed to avoid situation when -compromising node de facto means compromising a cluster. For more details & -discussions see https://github.com/kubernetes/kubernetes/issues/40476. - - However, by some extension to this effort, we would like to improve scalability -of the system, by significantly reducing amount of api calls coming from -kubelets. As of now, to avoid situation that kubelet is watching all secrets/ -configmaps/... in the system, it is not using watch for this purpose. Instead of -that, it is retrieving individual objects, by sending individual GET requests. -However, to enable automatic updates of mounted secrets/configmaps/..., Kubelet -is sending those GET requests periodically. In large clusters, this is -generating huge unnecessary load, as this load in principle should be -watch-based. We would like to address this together with solving the -authorization issue. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -# Proposal - - In this proposal, we're not focusing on how exactly security should be done. -We're just sketching very high level approach and exact authorization mechanism -should be discussed separately. - - At the high level, what we would like to achieve is to enable LIST and WATCH -requests to support more sophisticated filtering (e.g. we would like to be able -to somehow ask for all Secrets attached to all pods bound to a given node in a -single LIST or WATCH request). However, the design has to be consistent with -authorization of other types of requests (in particular GETs). - - To solve this problem, we propose to introduce the idea of `bulk watch ` and -`bulk get ` (and in the future other bulk operations). This idea was already -appearing in the community and now we have good usecase to proceed with it. - - Once a bulk watch is set up, we also need to periodically verify its ACLs. -Whenever a user (in our case Kubelet) loses access to a given resource, the -watch should be closed within some bounded time. The rationale behind this -requirement is that by using (bulk) watch we still want to enforce ACLs -similarly to how we enforce them with get operations (and that Kubelet would -eventually be unable to access a secret no matter if it is watching or -polling it). - - That said, periodic verification of ACLs isn't specific to bulk watch and -needs to be solved also in `regular` watch (e.g. user watching just a single -secret may also lose access to it and such watch should also be closed in this -case). So this requirement is common for both regular and bulk watch. We -just need to solve this problem on low enough level that would allow us to -reuse the same mechanism in both cases - we will solve it by sending an error -event to the watcher and then just closing this particular watch. 
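A hedged sketch of that mechanism, which regular and bulk watch could share: a wrapper around `watch.Interface` that periodically re-runs an authorization callback and, on failure, delivers a single error event and closes the watch. The `aclCheck` callback and type names are illustrative assumptions, not the mechanism that was ultimately implemented:

```go
package bulkwatch

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
)

// aclCheckedWatch forwards events from an underlying watch until a periodic
// ACL re-check fails, at which point it sends one error event and stops.
type aclCheckedWatch struct {
	inner    watch.Interface
	aclCheck func() error // re-evaluates whether the caller may still watch
	interval time.Duration
	result   chan watch.Event
	stopCh   chan struct{}
}

func (w *aclCheckedWatch) ResultChan() <-chan watch.Event { return w.result }

func (w *aclCheckedWatch) Stop() {
	close(w.stopCh)
	w.inner.Stop()
}

// run forwards events and enforces the periodic ACL check; it is expected to
// be started in its own goroutine when the watch is created.
func (w *aclCheckedWatch) run() {
	defer close(w.result)
	ticker := time.NewTicker(w.interval)
	defer ticker.Stop()
	for {
		select {
		case <-w.stopCh:
			return
		case <-ticker.C:
			if err := w.aclCheck(); err != nil {
				// Send an error event to the watcher, then close this watch.
				w.result <- watch.Event{
					Type:   watch.Error,
					Object: &metav1.Status{Reason: metav1.StatusReasonForbidden, Message: err.Error()},
				}
				w.inner.Stop()
				return
			}
		case ev, ok := <-w.inner.ResultChan():
			if !ok {
				return
			}
			select {
			case w.result <- ev:
			case <-w.stopCh:
				return
			}
		}
	}
}
```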
- - At the high level, we would like the API to be generic enough so that it will -be useful for many different usecases, not just this particular one (another -example may be a controller that needs to deal with just subset of namespaces). -As a result, below we are describing requirements that the end solution has to -meet to satisfy our needs -- a single bulk requests has to support multiple resource types (e.g. get a -node and all pods associated with it) -- the wrappers for aggregating multiple objects (in case of list we can return -a number of objects of different kinds) should be `similar` to lists in core -API (by lists I mean e.g. `PodList` object) -- the API has to be implemented also in aggregator so that bulk operations -are supported also if different resource types are served by different -apiservers -- clients has to be able to alter their watch subscriptions incrementally (it -may not be implemented in the initial version though, but has to be designed) - - -# Detailed design - - As stated in above requirements, we need to make bulk operations work across -different resource types (e.g. watch pod P and secret S within a single watch -call). Spanning multiple resources, resource types or conditions will be more -and more important for large number of watches. As an example, federation will -be adding watches for every type it federates. With that in mind, bypassing -aggregation at the resource type level and going to aggregation over objects -with different resource types will allow us to more aggressively optimize in the -future (it doesn't mean you have to watch resources of different types in a -single watch, but we would like to make it possible). - - That means, that we need to implement the bulk operation at the aggregation -level. The implications of it are discussed below. - - Moreover, our current REST API doesn't even offer an easy way to handle -"multiple watches of a given type" within a single request. As a result, instead -of inventing new top level pattern per type, we can introduce a new resource -type that follows normal RESTful rules and solves even more generic problem -of spanning multiple different resource types. - - We will start with introducing a new dedicated API group: - ``` - /apis/bulk.k8s.io/ - ``` - that underneath will have a completely separate implementation. - - In all text below, we are assuming v1 version of the API, but it will obviously -go through alpha and beta stages before (it will start as v1alpha1). - - In this design, we will focus only on bulk get (list) and watch operations. -Later, we would like to introduce new resources to support bulk create, update -and delete operations, but that's not part of this design. - - We will start with introducing `bulkgetoperations` resource and supporting the -following operation: -``` -POST /apis/bulk.k8s.io/v1/bulkgetoperations <body defines filtering> -``` - We can't simply make this an http GET request, due to limitations of GET for -the size (length) of the url (in which we would have to pass filter options). - - We could consider adding `watch` operation using the same pattern with just -`?watch=1` parameter. However, the main drawback of this approach is that it -won't allow for dynamic altering of watch subscriptions (which we definitely -also need to support). 
-As a result, we need another API for watch that will also support incremental -subscriptions - it will look as following: -``` -websocket /apis/bulk.k8s.io/v1/bulkgetoperations?watch=1 -``` - -*Note: For consistency, we also considered introducing websocket API for -handling LIST requests, where first client sends a filter definition over the -channel and then server sends back the response, but we dropped this for now.* - -*Note: We also considered implementing the POST-based watch handler that doesn't -allow for altering subscriptions, which should be very simple once we have list -implemented. But since websocket API is needed anyway, we also dropped it.* - - -### Filtering definition - - We need to make our filtering mechanism to support different resource types at -the same time. On the other hand, we would like it to be as consistent with all -other Kubernetes APIs as possible. So we define the selector for bulk operations -as following: - -``` -type BulkGetOperation struct { - Operations []GetOperation -} - -type GetOperation struct { - Resource GroupVersionResource - - // We would like to reuse the same ListOptions definition as we are using - // in regular APIs. - // TODO: We may consider supporting multiple ListOptions for a single - // GetOperation. - Options ListOptions -} -``` - - We need to be able to detect whether a given user is allowed to get/list -objects requested by a given "GetOperation". For that purpose, we will create -some dedicated admission plugin (or potentially reuse already existing one). -That one will not support partial rejections and will simply allow or reject -the whole request. - -For watch operations, as described in requirements, we also need to periodically -(or maybe lazily) verify whether their ACLs didn't change (and user is still -allowed to watch requested objects). However, as also mentioned above, this -periodic checking isn't specific to bulk operations and we need to support the -same mechanism for regular watch too. We will just ensure that this mechanism -tracking and periodically verifying ACLs is implemented low enough in apiserver -machinery so that we will be able to reuse exactly the same one for the purpose -of bulk watch operations. -For watch request, we will support partial rejections. The exact details of it -will be described together with dynamic watch description below. - - -### Dynamic watch - - As mentioned in the Proposal section, we will implement bulk watch that will -allow for dynamic subscription/unsubscription for (sets of) objects on top of -websockets protocol. - - Note that we already support websockets in the regular Kubernetes API for -watch requests (in addition to regular http requests), so for the purpose of -bulk watch we will be extending websocket support. - - The high level, the protocol will look: -1. client opens a new websocket connection to a bulk watch endpoint to the -server via ghttp GET -1. this results in creating a single channel that is used only to handle -communication for subscribing/unsubscribing for watches; no watch events are -delivered via this particular channel. -*TODO: Consider switching to two channels, one for incoming and one for -outgoing communication* -1. to subscribe for a watch of a given (set of) objects, user sends `Watch` -object over the channel; in response a new channel is created and the message -with the channel identifier is send back to the user (we will be using integers -as channel identifiers). 
-*TODO: Check if integers are mandatory or if we can switch to something like -ch1, ch2 ... .* -1. once subscribed, all objects matching a given selector will be send over -the newly created channel -1. to stop watching for a given (set of) objects, user sends `CloseWatch` -object over the channel; in response the corresponding watch is broken and -corresponding channel within websocket is closed -1. once done, user should close the whole websocket connection (this results in -breaking all still opened channels and corresponding watches). - - With that high level protocol, there are still multiple details that needs -to be figured out. First, we need to define `Watch` and `CloseWatch` -message. We will solve it with the single `Request` object: -``` -type Request struct { - // Only one of those is set. - Watch *Watch - CloseWatch *CloseWatch -} - -type Identifier int64 - -// Watch request for objects matching a given selector. -type Watch struct { - Selector GetOperation -} - -// Request to stop watching objects from a watch identified by the channel. -type CloseWatch struct { - Channel Identifier -} - -// Depending on the request, channel that was created or deleted. -type Response struct { - Channel Identifier -} -``` -With the above structure we can guarantee that we only send and receive -objects of a single type over the channel. - -We should also introduce some way of correlating responses with requests -when a client is sending multiple of them at the same time. To achieve this -we will add a `request identified` field to the `Request` that user can set -and that will then be returned as part of `Response`. With this mechanism -user can set the identifier to increasing integers and then will be able -to correlate responses with requests he sent before. So the final structure -will be as following: -``` -type Request struct { - ID Identifier - // Only one of those is set. - Watch *Watch - CloseWatch *CloseWatch -} - -// Depending on the request, channel that was created or deleted. -type Response struct { - // Propagated from the Request. - RequestID Identifier - Channel Identifier -} -``` - -Another detail we need to point at is about semantic. If there are multiple -selectors selecting exactly the same object, the object will be send multiple -times (once for every channel interested in that object). -If we decide to also have http-POST-based watch, since there will be basically -a single channel there (as it is in watch in regular API), such an object will -be send once. This semantic difference needs to be explicitly described. - -*TODO: Since those channels for individual watches are kind of independent, -we need to decide whether they are supposed to be synchronized with each other. -In particular, if we have two channels watching for objects of the same type -are events guaranteed to be send in the increasing order of resource versions. -I think this isn't necessary, but it needs to be explicit.* - -Yet another thing to consider is what if the server wants to close the watch -even though user didn't request it (some potential reasons of it may be failed -periodic ACL check or some kind of timeout). -We will solve it by saying that in such situation we will send an error via -the corresponding channel and every error automatically closes the channel. -It is responsibility of a user to re-subscribe if that's possible. - -We will also reuse this mechanism for partial rejections. 
We will be able to -reject (or close failed periodic ACL check) any given channel separately from -all other existing channels. - - -### Watch semantics - - There are a lot of places in the code (including all our list/watch-related -frameworks like reflector) that rely on two crucial watch invariants: -1. watch events are delivered in increasing order of resource version -1. there is at most one watch event delivered for any resource version - - However, we have no guarantee that resource version series is shared between -different resource types (in fact in default GCE setup events are not sharing -the same series as they are stored in a separate etcd instance). That said, -to avoid introducing too many assumptions (that already aren't really met) -we can't guarantee exactly the same. - - With the above description of "dynamic watch", within a single channel you -are allowed to only watch for objects of a single type. So it is enough to -introduce only the following assumption: -1. within a single resource type, all objects are sharing the same resource -version series. - - This means, we can still shard etcd by resource types, but we can't really -shard by e.g. by namespaces. Note that this doesn't introduce significant -limitations compared to what we already have, because even now you can watch -all objects of a single type and there is no mechanism to inject multiple -resource versions into it. So this assumption is not making things worse. - - To support multi-resource-type watches, we can build another framework on -top of frameworks we already have as following: -- we will have a single reflector/informer/... per resource type -- we will create an aggregator/dispatcher in front of them that will be -responsible for aggregating requests from underlying frameworks into a single -one and then dispatching incoming watch events to correct reflect/informer. -This will obviously require changes to existing frameworks, but those should -be local changes. - - One more thing to mention is detecting resource type of object being send via -watch. With "dynamic watch" proposal, we already know it based on the channel -from which it came (only objects of a single type can be send over the single -channel). - - Note that this won't be true if we would decide for regular http-based watch -and as a result we would have to introduce a dedicated type for bulk watch -event containing object type. This is yet another argument to avoid implementing -http-based bulk watch at all. - - -### Implementation details - - As already mentioned above, we need to support API aggregation. The API -(endpoint) has to be implemented in kube-aggregator. For the implementation -we have two alternatives: -1. aggregator forwards the request to all apiservers and aggregates results -2. based on discovery information, aggregator knows which type is supported -by which apiserver so it is forwarding requests with just appropriate -resource types to corresponding apiservers and then aggregates results. - -Neither of those is difficult, so we should proceed with the second, which -has an advantage for watch, because for a given subrequest only a single -apiserver will be returning some events. However, no matter which one we -choose, client will not see any difference between contacting apiserver or -aggregator, which is crucial requirement here. - -NOTE: For watch requests, as an initial step we can consider implementing -this API only in aggregator and simply start an individual watch for any -subrequest. 
With http2 we shouldn't get rid of descriptors and it can be -enough as a proof of concept. However, with such approach there will be -difference between sending a given request to aggregator and apiserver -so we need to implement it properly in apiserver before entering alpha -anyway. This would just give us early results faster. - - The implementation of bulk get and bulk watch in a single apiserver will -also work as kind of aggregator. Whenever a request is coming, it will: -- check what resource type(s) are requested in this request -- for every resource type, combine only parts of the filter that are about -this particular resource type and send the request down the stack -- gather responses for all those resource types and combine them into -single response to the user. - - The only non-trivial operation above is sending the request for a single -resource type down the stack. In order to implement it, we will need to -slightly modify the interface of "Registry" in apiserver. The modification -will have to allow passing both what we are passing now and BulkListOptions -(in some format) (this may e.g. changing signature to accept BulkListOptions -and translating ListOptions to BulkListOptions in the current code). - - With those changes in place, there is a question of how to call this -code. There are two main options: -1. Make each registered resource type, register also in BulkAggregator -(or whatever we call it) and call those methods directly -1. Expose also per-resource bulk operation in the main API, e.g.: -``` -POST /api/v1/namespace/default/pods/_bulkwatch <body defines filtering> -``` -and use the official apiserver API for delegating requests. However, this -may collide with resource named `_bulkwatch ` and detecting whether -this is bulk operation or regular api operation doesn't seem to be worth -pursuing. - -As a result, we will proceed with option 1. - - -## Considered alternatives - - We considered introducing a different way of filtering that would basically be -"filter only objects that the node is allowed to see". However, to make this -kind of watch work correctly with our list/watch-related frameworks, it would -need to preserve the crucial invariants of watch. This in particular means: - -1. There is at most one watch event for a given resource version. -2. Watch events are delivered in the increasing order of resource versions. - - Ideally, whenever a new pod referencing object X is bound to a node, we send -"add" event for object X to the watcher. However, that would break the above -assumptions because: - -1. There can be more objects referenced by a given pod (so we can't send all -of them with the rv corresponding to that pod add/update/delete) -2. If we decide for sending those events with their original resource version, -then we could potentially go back in time. - - As a result, we considered the following alternatives to solve this problems: - -1. Don't set the event being result of pod creation/update/deletion. -It would be responsibility of a user to grab current version of all object that -are being referenced by this new pod. And only from that point, events for all -objects being referenced would be delivered to the watcher as long as the pod -existing on the node. - -This approach in a way it leaking the watch logic to the watcher, that needs -to duplicate the logic of tracking what objects are referenced. - -2. 
Whenever a new pod is bound to a node (or existing is deleted) we send -all add/delete events of attached object to the watcher with the same resource -version being the one of the modified pod. - -In this approach, we violate the first assumption (this is a problem in case -of breaking watch, as we don't really know where to resume) as well as we -send events with fake resource versions, which might be misleading to watchers. - -3. Another potential option is to change the watch api so that, instead of -sending a single object in a watch event, we would be sending a list of objects -as part of single watch event. - -This would solve all the problems from previous two solutions, but this is -change in the api (we would need to introduce a new API for it), and would also -require changes in our list/watch-related frameworks. - - In all above proposals, the tricky part is determining whether an object X -is referenced by any pods bound to a given node is to avoid race conditions and -do it in deterministic way. The crucial requirements are: - -1. Whenever "list" request returns a list of objects and a resource version "rv", -starting a watch from the returned "rv" will never drop any events. -2. For a given watch request (with resource version "rv"), the returned stream -of events is always the same (e.g. very slow lagging watch may not cause dropped -events). - -We can't really satisfy these conditions using the existing machinery. To solve -this problem reliably we need to be able to serialize events between different -object types in a deterministic way. -This could be done via resource version, but would require assumption that: - -1. all object types necessary to determine the in-memory mapping share the same -resource version series. - -With that assumption, we can have a "multi-object-type" watch that will serialize -events for different object types for us. Having exactly one watch responsible -for delivering all objects (pods, secrets, ...) will guarantee that if we are -currently at resource version "rv", we processed objects of all types up to rv -and nothing with resource version greater than rv. Which is exactly what we need. - - -## Other Notes - - We were seriously considering implementing http-POST-based approach as an -additional, simpler watch to implement watch (not supporting altering -subscriptions). In this approach whenever a user want to start watching for -another object (or set of objects) or drop one (or a set), he needs to break the -watch and initiated a new one with a different filter. This approach isn't -perfect, but can solve different usecase of naive clients and is much simpler -to implement. - - However, this has multiple drawbacks, including: -- this is a second mechanism for doing the same thing (we need "dynamic watch" -no matter if we implement it or not) -- there would be some semantic differences between those approaches (e.g. if -there are few selectors selecting the same object, in dynamic approach it will -be send multiple times, once over each channel, here it would be send once) -- we would have to introduce a dedicate "BulkWatchEvent" type to incorporate -resource type. This would make those two incompatible even at the output format. - - With all of those in mind, even though the implementation would be much -simpler (and could potentially be a first step and would probably solve the -original "kubelet watching secrets" problem good enough), we decided not to -proceed with it at all. 
+Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
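To make the proposed request shape concrete, here is a hedged sketch of a client issuing a bulk get against the proposed endpoint. The `bulk.k8s.io` group was never released, so the local struct mirrors, JSON field names, and server URL below are assumptions for illustration only:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// Local mirrors of the proposal's types; since the API never shipped, the
// JSON field names here are assumptions.
type GetOperation struct {
	Resource schema.GroupVersionResource `json:"resource"`
	Options  metav1.ListOptions          `json:"options"`
}

type BulkGetOperation struct {
	Operations []GetOperation `json:"operations"`
}

func main() {
	// Ask for one secret and a labelled set of configmaps in a single call.
	op := BulkGetOperation{Operations: []GetOperation{
		{
			Resource: schema.GroupVersionResource{Version: "v1", Resource: "secrets"},
			Options:  metav1.ListOptions{FieldSelector: "metadata.name=my-secret"},
		},
		{
			Resource: schema.GroupVersionResource{Version: "v1", Resource: "configmaps"},
			Options:  metav1.ListOptions{LabelSelector: "app=web"},
		},
	}}
	body, err := json.Marshal(op)
	if err != nil {
		panic(err)
	}
	// POST rather than GET, because the filter body can exceed URL length limits.
	resp, err := http.Post("https://apiserver.example/apis/bulk.k8s.io/v1/bulkgetoperations",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
```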
\ No newline at end of file diff --git a/contributors/design-proposals/api-machinery/client-package-structure.md b/contributors/design-proposals/api-machinery/client-package-structure.md index 1aa47b18..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/client-package-structure.md +++ b/contributors/design-proposals/api-machinery/client-package-structure.md @@ -1,309 +1,6 @@ -- [Client: layering and package structure](#client-layering-and-package-structure) - - [Desired layers](#desired-layers) - - [Transport](#transport) - - [RESTClient/request.go](#restclientrequestgo) - - [Mux layer](#mux-layer) - - [High-level: Individual typed](#high-level-individual-typed) - - [High-level, typed: Discovery](#high-level-typed-discovery) - - [High-level: Dynamic](#high-level-dynamic) - - [High-level: Client Sets](#high-level-client-sets) - - [Package Structure](#package-structure) - - [Client Guarantees (and testing)](#client-guarantees-and-testing) +Design proposals have been archived. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -# Client: layering and package structure - -## Desired layers - -### Transport - -The transport layer is concerned with round-tripping requests to an apiserver -somewhere. It consumes a Config object with options appropriate for this. -(That's most of the current client.Config structure.) - -Transport delivers an object that implements http's RoundTripper interface -and/or can be used in place of http.DefaultTransport to route requests. - -Transport objects are safe for concurrent use, and are cached and reused by -subsequent layers. - -Tentative name: "Transport". - -It's expected that the transport config will be general enough that third -parties (e.g., OpenShift) will not need their own implementation, rather they -can change the certs, token, etc., to be appropriate for their own servers, -etc.. - -Action items: -* Split out of current client package into a new package. (@krousey) - -### RESTClient/request.go - -RESTClient consumes a Transport and a Codec (and optionally a group/version), -and produces something that implements the interface currently in request.go. -That is, with a RESTClient, you can write chains of calls like: - -`c.Get().Path(p).Param("name", "value").Do()` - -RESTClient is generically usable by any client for servers exposing REST-like -semantics. It provides helpers that benefit those following api-conventions.md, -but does not mandate them. It provides a higher level http interface that -abstracts transport, wire serialization, retry logic, and error handling. -Kubernetes-like constructs that deviate from standard HTTP should be bypassable. -Every non-trivial call made to a remote restful API from Kubernetes code should -go through a rest client. - -The group and version may be empty when constructing a RESTClient. This is valid -for executing discovery commands. The group and version may be overridable with -a chained function call. - -Ideally, no semantic behavior is built into RESTClient, and RESTClient will use -the Codec it was constructed with for all semantic operations, including turning -options objects into URL query parameters. Unfortunately, that is not true of -today's RESTClient, which may have some semantic information built in. We will -remove this. - -RESTClient should not make assumptions about the format of data produced or -consumed by the Codec. 
Currently, it is JSON, but we want to support binary -protocols in the future. - -The Codec would look something like this: - -```go -type Codec interface { - Encode(runtime.Object) ([]byte, error) - Decode([]byte]) (runtime.Object, error) - - // Used to version-control query parameters - EncodeParameters(optionsObject runtime.Object) (url.Values, error) - - // Not included here since the client doesn't need it, but a corresponding - // DecodeParametersInto method would be available on the server. -} -``` - -There should be one codec per version. RESTClient is *not* responsible for -converting between versions; if a client wishes, they can supply a Codec that -does that. But RESTClient will make the assumption that it's talking to a single -group/version, and will not contain any conversion logic. (This is a slight -change from the current state.) - -As with Transport, it is expected that 3rd party providers following the api -conventions should be able to use RESTClient, and will not need to implement -their own. - -Action items: -* Split out of the current client package. (@krousey) -* Possibly, convert to an interface (currently, it's a struct). This will allow - extending the error-checking monad that's currently in request.go up an - additional layer. -* Switch from ParamX("x") functions to using types representing the collection - of parameters and the Codec for query parameter serialization. -* Any other Kubernetes group specific behavior should also be removed from - RESTClient. - -### Mux layer - -(See TODO at end; this can probably be merged with the "client set" concept.) - -The client muxer layer has a map of group/version to cached RESTClient, and -knows how to construct a new RESTClient in case of a cache miss (using the -discovery client mentioned below). The ClientMux may need to deal with multiple -transports pointing at differing destinations (e.g. OpenShift or other 3rd party -provider API may be at a different location). - -When constructing a RESTClient generically, the muxer will just use the Codec -the high-level dynamic client would use. Alternatively, the user should be able -to pass in a Codec-- for the case where the correct types are compiled in. - -Tentative name: ClientMux - -Action items: -* Move client cache out of kubectl libraries into a more general home. -* TODO: a mux layer may not be necessary, depending on what needs to be cached. - If transports are cached already, and RESTClients are extremely light-weight, - there may not need to be much code at all in this layer. - -### High-level: Individual typed - -Our current high-level client allows you to write things like -`c.Pods("namespace").Create(p)`; we will insert a level for the group. - -That is, the system will be: - -`clientset.GroupName().NamespaceSpecifier().Action()` - -Where: -* `clientset` is a thing that holds multiple individually typed clients (see - below). -* `GroupName()` returns the generated client that this section is about. -* `NamespaceSpecifier()` may take a namespace parameter or nothing. -* `Action` is one of Create/Get/Update/Delete/Watch, or appropriate actions - from the type's subresources. -* It is TBD how we'll represent subresources and their actions. This is - inconsistent in the current clients, so we'll need to define a consistent - format. Possible choices: - * Insert a `.Subresource()` before the `.Action()` - * Flatten subresources, such that they become special Actions on the parent - resource. - -The types returned/consumed by such functions will be e.g. 
api/v1, NOT the -current version in specific types. The current internal-versioned client is -inconvenient for users, as it does not protect them from having to recompile -their code with every minor update. (We may continue to generate an -internal-versioned client for our own use for a while, but even for our own -components it probably makes sense to switch to specifically versioned clients.) - -We will provide this structure for each version of each group. It is infeasible -to do this manually, so we will generate this. The generator will accept both -swagger and the ordinary go types. The generator should operate on out-of-tree -sources AND out-of-tree destinations, so it will be useful for consuming -out-of-tree APIs and for others to build custom clients into their own -repositories. - -Typed clients will be constructable given a ClientMux; the typed constructor will use -the ClientMux to find or construct an appropriate RESTClient. Alternatively, a -typed client should be constructable individually given a config, from which it -will be able to construct the appropriate RESTClient. - -Typed clients do not require any version negotiation. The server either supports -the client's group/version, or it does not. However, there are ways around this: -* If you want to use a typed client against a server's API endpoint and the - server's API version doesn't match the client's API version, you can construct - the client with a RESTClient using a Codec that does the conversion (this is - basically what our client does now). -* Alternatively, you could use the dynamic client. - -Action items: -* Move current typed clients into new directory structure (described below) -* Finish client generation logic. (@caesarxuchao, @lavalamp) - -#### High-level, typed: Discovery - -A `DiscoveryClient` is necessary to discover the api groups, versions, and -resources a server supports. It's constructable given a RESTClient. It is -consumed by both the ClientMux and users who want to iterate over groups, -versions, or resources. (Example: namespace controller.) - -The DiscoveryClient is *not* required if you already know the group/version of -the resource you want to use: you can simply try the operation without checking -first, which is lower-latency anyway as it avoids an extra round-trip. - -Action items: -* Refactor existing functions to present a sane interface, as close to that - offered by the other typed clients as possible. (@caeserxuchao) -* Use a RESTClient to make the necessary API calls. -* Make sure that no discovery happens unless it is explicitly requested. (Make - sure SetKubeDefaults doesn't call it, for example.) - -### High-level: Dynamic - -The dynamic client lets users consume apis which are not compiled into their -binary. It will provide the same interface as the typed client, but will take -and return `runtime.Object`s instead of typed objects. There is only one dynamic -client, so it's not necessary to generate it, although optionally we may do so -depending on whether the typed client generator makes it easy. - -A dynamic client is constructable given a config, group, and version. It will -use this to construct a RESTClient with a Codec which encodes/decodes to -'Unstructured' `runtime.Object`s. The group and version may be from a previous -invocation of a DiscoveryClient, or they may be known by other means. - -For now, the dynamic client will assume that a JSON encoding is allowed. 
In the -future, if we have binary-only APIs (unlikely?), we can add that to the -discovery information and construct an appropriate dynamic Codec. - -Action items: -* A rudimentary version of this exists in kubectl's builder. It needs to be - moved to a more general place. -* Produce a useful 'Unstructured' runtime.Object, which allows for easy - Object/ListMeta introspection. - -### High-level: Client Sets - -Because there will be multiple groups with multiple versions, we will provide an -aggregation layer that combines multiple typed clients in a single object. - -We do this to: -* Deliver a concrete thing for users to consume, construct, and pass around. We - don't want people making 10 typed clients and making a random system to keep - track of them. -* Constrain the testing matrix. Users can generate a client set at their whim - against their cluster, but we need to make guarantees that the clients we - shipped with v1.X.0 will work with v1.X+1.0, and vice versa. That's not - practical unless we "bless" a particular version of each API group and ship an - official client set with each release. (If the server supports 15 groups with - 2 versions each, that's 2^15 different possible client sets. We don't want to - test all of them.) - -A client set is generated into its own package. The generator will take the list -of group/versions to be included. Only one version from each group will be in -the client set. - -A client set is constructable at runtime from either a ClientMux or a transport -config (for easy one-stop-shopping). - -An example: - -```go -import ( - api_v1 "k8s.io/kubernetes/pkg/client/typed/generated/v1" - ext_v1beta1 "k8s.io/kubernetes/pkg/client/typed/generated/extensions/v1beta1" - net_v1beta1 "k8s.io/kubernetes/pkg/client/typed/generated/net/v1beta1" - "k8s.io/kubernetes/pkg/client/typed/dynamic" -) - -type Client interface { - API() api_v1.Client - Extensions() ext_v1beta1.Client - Net() net_v1beta1.Client - // ... other typed clients here. - - // Included in every set - Discovery() discovery.Client - GroupVersion(group, version string) dynamic.Client -} -``` - -Note that a particular version is chosen for each group. It is a general rule -for our API structure that no client need care about more than one version of -each group at a time. - -This is the primary deliverable that people would consume. It is also generated. - -Action items: -* This needs to be built. It will replace the ClientInterface that everyone - passes around right now. - -## Package Structure - -``` -pkg/client/ -----------/transport/ # transport & associated config -----------/restclient/ -----------/clientmux/ -----------/typed/ -----------------/discovery/ -----------------/generated/ ---------------------------/<group>/ -----------------------------------/<version>/ ---------------------------------------------/<resource>.go -----------------/dynamic/ -----------/clientsets/ ----------------------/release-1.1/ ----------------------/release-1.2/ ----------------------/the-test-set-you-just-generated/ -``` - -`/clientsets/` will retain their contents until they reach their expire date. -e.g., when we release v1.N, we'll remove clientset v1.(N-3). Clients from old -releases live on and continue to work (i.e., are tested) without any interface -changes for multiple releases, to give users time to transition. - -## Client Guarantees (and testing) - -Once we release a clientset, we will not make interface changes to it. 
Users of -that client will not have to change their code until they are deliberately -upgrading their import. We probably will want to generate some sort of stub test -with a clientset, to ensure that we don't change the interface. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/api-machinery/controller-ref.md b/contributors/design-proposals/api-machinery/controller-ref.md index eee9629a..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/controller-ref.md +++ b/contributors/design-proposals/api-machinery/controller-ref.md @@ -1,417 +1,6 @@ -# ControllerRef proposal +Design proposals have been archived. -* Authors: gmarek, enisoc -* Last edit: [2017-02-06](#history) -* Status: partially implemented +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Approvers: -* [ ] briangrant -* [ ] dbsmith -**Table of Contents** - -* [Goals](#goals) -* [Non-goals](#non-goals) -* [API](#api) -* [Behavior](#behavior) -* [Upgrading](#upgrading) -* [Implementation](#implementation) -* [Alternatives](#alternatives) -* [History](#history) - -# Goals - -* The main goal of ControllerRef (controller reference) is to solve the problem - of controllers that fight over controlled objects due to overlapping selectors - (e.g. a ReplicaSet fighting with a ReplicationController over Pods because - both controllers have label selectors that match those Pods). - Fighting controllers can [destabilize the apiserver](https://github.com/kubernetes/kubernetes/issues/24433), - [thrash objects back-and-forth](https://github.com/kubernetes/kubernetes/issues/24152), - or [cause controller operations to hang](https://github.com/kubernetes/kubernetes/issues/8598). - - We don't want to have just an in-memory solution because we don't want a - Controller Manager crash to cause a massive reshuffling of controlled objects. - We also want to expose the mapping so that controllers can be in multiple - processes (e.g. for HA of kube-controller-manager) and separate binaries - (e.g. for controllers that are API extensions). - Therefore, we will persist the mapping from each object to its controller in - the API object itself. - -* A secondary goal of ControllerRef is to provide back-links from a given object - to the controller that manages it, which can be used for: - * Efficient object->controller lookup, without having to list all controllers. - * Generic object grouping (e.g. in a UI), without having to know about all - third-party controller types in advance. - * Replacing certain uses of the `kubernetes.io/created-by` annotation, - and potentially enabling eventual deprecation of that annotation. - However, deprecation is not being proposed at this time, so any uses that - remain will be unaffected. - -# Non-goals - -* Overlapping selectors will continue to be considered user error. - - ControllerRef will prevent this user error from destabilizing the cluster or - causing endless back-and-forth fighting between controllers, but it will not - make it completely safe to create controllers with overlapping selectors. - - In particular, this proposal does not address cases such as Deployment or - StatefulSet, in which "families" of orphans may exist that ought to be adopted - as indivisible units. - Since multiple controllers may race to adopt orphans, the user must ensure - selectors do not overlap to avoid breaking up families. - Breaking up families of orphans could result in corruption or loss of - Deployment rollout state and history, and possibly also corruption or loss of - StatefulSet application data. 
- -* ControllerRef is not intended to replace [selector generation](selector-generation.md), - used by some controllers like Job to ensure all selectors are unique - and prevent overlapping selectors from occurring in the first place. - - However, ControllerRef will still provide extra protection and consistent - cross-controller semantics for controllers that already use selector - generation. For example, selector generation can be manually overridden, - which leaves open the possibility of overlapping selectors due to user error. - -* This proposal does not change how cascading deletion works. - - Although ControllerRef will extend OwnerReference and rely on its machinery, - the [Garbage Collector](garbage-collection.md) will continue to implement - cascading deletion as before. - That is, the GC will look at all OwnerReferences without caring whether a - given OwnerReference happens to be a ControllerRef or not. - -# API - -The `Controller` API field in OwnerReference marks whether a given owner is a -managing controller: - -```go -type OwnerReference struct { - … - // If true, this reference points to the managing controller. - // +optional - Controller *bool -} -``` - -A ControllerRef is thus defined as an OwnerReference with `Controller=true`. -Each object may have at most one ControllerRef in its list of OwnerReferences. -The validator for OwnerReferences lists will fail any update that would violate -this invariant. - -# Behavior - -This section summarizes the intended behavior for existing controllers. -It can also serve as a guide for respecting ControllerRef when writing new -controllers. - -## The Three Laws of Controllers - -All controllers that manage collections of objects should obey the following -rules. - -1. **Take ownership** - - A controller should claim *ownership* of any objects it creates by adding a - ControllerRef, and may also claim ownership of an object it didn't create, - as long as the object has no existing ControllerRef (i.e. it is an *orphan*). - -1. **Don't interfere** - - A controller should not take any action (e.g. edit/scale/delete) on an object - it does not own, except to [*adopt*](#adoption) the object if allowed by the - First Law. - -1. **Don't share** - - A controller should not count an object it does not own toward satisfying its - desired state (e.g. a certain number of replicas), although it may include - the object in plans to achieve its desired state (e.g. through adoption) - as long as such plans do not conflict with the First or Second Laws. - -## Adoption - -If a controller finds an orphaned object (an object with no ControllerRef) that -matches its selector, it may try to adopt the object by adding a ControllerRef. -Note that whether or not the controller *should* try to adopt the object depends -on the particular controller and object. - -Multiple controllers can race to adopt a given object, but only one can win -by being the first to add a ControllerRef to the object's OwnerReferences list. -The losers will see their adoptions fail due to a validation error as explained -[above](#api). - -If a controller has a non-nil `DeletionTimestamp`, it must not attempt adoption -or take any other actions except updating its `Status`. -This prevents readoption of objects orphaned by the [orphan finalizer](garbage-collection.md#part-ii-the-orphan-finalizer) -during deletion of the controller. 
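
To make the adoption rules above concrete, the following minimal sketch shows roughly how a controller might decide whether it is allowed to act on an object. It is illustrative only (not the actual ControllerRefManager code); it assumes the `GetControllerOf` helper from apimachinery and takes the selector-match result as an input.

```go
package controllerref

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// canAdopt reports whether the controller may act on obj under the rules
// above: an object it already owns is fair game; an orphan may be adopted
// only if the selector matches and the controller is not being deleted.
func canAdopt(controller, obj metav1.Object, selectorMatches bool) bool {
	if ref := metav1.GetControllerOf(obj); ref != nil {
		// Already owned: only the existing controller may act on it (Second Law).
		return ref.UID == controller.GetUID()
	}
	if controller.GetDeletionTimestamp() != nil {
		// A controller that is being deleted must not (re)adopt orphans.
		return false
	}
	// Orphan: adoption is allowed when the controller's selector matches.
	return selectorMatches
}
```
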
- -## Orphaning - -When a controller is deleted, the objects it owns will either be orphaned or -deleted according to the normal [Garbage Collection](garbage-collection.md) -behavior, based on OwnerReferences. - -In addition, if a controller finds that it owns an object that no longer matches -its selector, it should orphan the object by removing itself from the object's -OwnerReferences list. Since ControllerRef is just a special type of -OwnerReference, this also means the ControllerRef is removed. - -## Watches - -Many controllers use watches to *sync* each controller instance (prompting it to -reconcile desired and actual state) as soon as a relevant event occurs for one -of its controlled objects, as well as to let controllers wait for asynchronous -operations to complete on those objects. -The controller subscribes to a stream of events about controlled objects -and routes each event to a particular controller instance. - -Previously, the controller used only label selectors to decide which -controller to route an event to. If multiple controllers had overlapping -selectors, events might be misrouted, causing the wrong controllers to sync. -Controllers could also freeze because they keep waiting for an event that -already came but was misrouted, manifesting as `kubectl` commands that hang. - -Some controllers introduced a workaround to break ties. For example, they would -sort all controller instances with matching selectors, first by creation -timestamp and then by name, and always route the event to the first controller -in this list. However, that did not prevent misrouting if the overlapping -controllers were of different types. It also only worked while controllers -themselves assigned ownership over objects using the same tie-break rules. - -Now that controller ownership is defined in terms of ControllerRef, -controllers should use the following guidelines for responding to watch events: - -* If the object has a ControllerRef: - * Sync only the referenced controller. - * Update `expectations` counters for the referenced controller. - * If an *Update* event removes the ControllerRef, sync any controllers whose - selectors match to give each one a chance to adopt the object. -* If the object is an orphan: - * *Add* event - * Sync any controllers whose selectors match to give each one a chance to - adopt the object. - * Do *not* update counters on `expectations`. - Controllers should never be waiting for creation of an orphan because - anything they create should have a ControllerRef. - * *Delete* event - * Do *not* sync any controllers. - Controllers should never care about orphans disappearing. - * Do *not* update counters on `expectations`. - Controllers should never be waiting for deletion of an orphan because they - are not allowed to delete objects they don't own. - * *Update* event - * If labels changed, sync any controllers whose selectors match to give each - one a chance to adopt the object. - -## Default garbage collection policy - -Controllers that used to rely on client-side cascading deletion should set a -[`DefaultGarbageCollectionPolicy`](https://github.com/kubernetes/kubernetes/blob/dd22743b54f280f41e68f206449a13ca949aca4e/pkg/genericapiserver/registry/rest/delete.go#L43) -of `rest.OrphanDependents` when they are updated to implement ControllerRef. - -This ensures that deleting only the controller, without specifying the optional -`DeleteOptions.OrphanDependents` flag, remains a non-cascading delete. 
-Otherwise, the behavior would change to server-side cascading deletion by -default as soon as the controller manager is upgraded to a version that performs -adoption by setting ControllerRefs. - -Example from [ReplicationController](https://github.com/kubernetes/kubernetes/blob/9ae2dfacf196ca7dbee798ee9c3e1663a5f39473/pkg/registry/core/replicationcontroller/strategy.go#L49): - -```go -// DefaultGarbageCollectionPolicy returns Orphan because that was the default -// behavior before the server-side garbage collection was implemented. -func (rcStrategy) DefaultGarbageCollectionPolicy() rest.GarbageCollectionPolicy { - return rest.OrphanDependents -} -``` - -New controllers that don't have legacy behavior to preserve can omit this -controller-specific default to use the [global default](https://github.com/kubernetes/kubernetes/blob/2bb1e7581544b9bd059eafe6ac29775332e5a1d6/staging/src/k8s.io/apiserver/pkg/registry/generic/registry/store.go#L543), -which is to enable server-side cascading deletion. - -## Controller-specific behavior - -This section lists considerations specific to a given controller. - -* **ReplicaSet/ReplicationController** - - * These controllers currently only enable ControllerRef behavior when the - Garbage Collector is enabled. When ControllerRef was first added to these - controllers, the main purpose was to enable server-side cascading deletion - via the Garbage Collector, so it made sense to gate it behind the same flag. - - However, in order to achieve the [goals](#goals) of this proposal, it is - necessary to set ControllerRefs and perform adoption/orphaning regardless of - whether server-side cascading deletion (the Garbage Collector) is enabled. - For example, turning off the GC should not cause controllers to start - fighting again. Therefore, these controllers will be updated to always - enable ControllerRef. - -* **StatefulSet** - - * A StatefulSet will not adopt any Pod whose name does not match the template - it uses to create new Pods: `{statefulset name}-{ordinal}`. - This is because Pods in a given StatefulSet form a "family" that may use pod - names (via their generated DNS entries) to coordinate among themselves. - Adopting Pods with the wrong names would violate StatefulSet's semantics. - - Adoption is allowed when Pod names match, so it remains possible to orphan a - family of Pods (by deleting their StatefulSet without cascading) and then - create a new StatefulSet with the same name and selector to adopt them. - -* **CronJob** - - * CronJob [does not use watches](https://github.com/kubernetes/kubernetes/blob/9ae2dfacf196ca7dbee798ee9c3e1663a5f39473/pkg/controller/cronjob/cronjob_controller.go#L20), - so [that section](#watches) doesn't apply. - Instead, all CronJobs are processed together upon every "sync". - * CronJob applies a `created-by` annotation to link Jobs to the CronJob that - created them. - If a ControllerRef is found, it should be used instead to determine this - link. - -## Created-by annotation - -Aside from the change to CronJob mentioned above, several other uses of the -`kubernetes.io/created-by` annotation have been identified that would be better -served by ControllerRef because it tracks who *currently* controls an object, -not just who originally created it. - -As a first step, the specific uses identified in the [Implementation](#implementation) -section will be augmented to prefer ControllerRef if one is found. -If no ControllerRef is found, they will fall back to looking at `created-by`. 
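
As a sketch of that preference (not actual kubectl or e2e framework code), a lookup helper might resolve the managing controller as shown below. The ControllerRef branch uses the apimachinery `GetControllerOf` helper; the fallback parsing of the `kubernetes.io/created-by` annotation is deliberately simplified and its exact layout should be treated as an assumption.

```go
package controllerref

import (
	"encoding/json"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// managingController returns the kind and name of whoever manages obj,
// preferring a ControllerRef and falling back to the legacy
// kubernetes.io/created-by annotation (simplified parsing shown).
func managingController(obj metav1.Object) (kind, name string, ok bool) {
	if ref := metav1.GetControllerOf(obj); ref != nil {
		return ref.Kind, ref.Name, true
	}
	raw, found := obj.GetAnnotations()["kubernetes.io/created-by"]
	if !found {
		return "", "", false
	}
	// The annotation value is a JSON-serialized reference to the creator.
	var sr struct {
		Reference struct {
			Kind string `json:"kind"`
			Name string `json:"name"`
		} `json:"reference"`
	}
	if err := json.Unmarshal([]byte(raw), &sr); err != nil {
		return "", "", false
	}
	return sr.Reference.Kind, sr.Reference.Name, true
}
```
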
- -# Upgrading - -In the absence of controllers with overlapping selectors, upgrading or -downgrading the master to or from a version that introduces ControllerRef -should have no user-visible effects. -If no one is fighting, adoption should always succeed eventually, so ultimately -only the selectors matter on either side of the transition. - -If there are controllers with overlapping selectors at the time of an *upgrade*: - -* Back-and-forth thrashing should stop after the upgrade. -* The ownership of existing objects might change due to races during - [adoption](#adoption). As mentioned in the [non-goals](#non-goals) section, - this can include breaking up families of objects that should have stayed - together. -* Controllers might create additional objects because they start to respect the - ["Don't share"](#behavior) rule. - -If there are controllers with overlapping selectors at the time of a -*downgrade*: - -* Controllers may begin to fight and thrash objects. -* The ownership of existing objects might change due to ignoring ControllerRef. -* Controllers might delete objects because they stop respecting the - ["Don't share"](#behavior) rule. - -# Implementation - -Checked items had been completed at the time of the [last edit](#history) of -this proposal. - -* [x] Add API field for `Controller` to the `OwnerReference` type. -* [x] Add validator that prevents an object from having multiple ControllerRefs. -* [x] Add `ControllerRefManager` types to encapsulate ControllerRef manipulation - logic. -* [ ] Update all affected controllers to respect ControllerRef. - * [ ] ReplicationController - * [ ] Don't touch controlled objects if DeletionTimestamp is set. - * [x] Don't adopt/manage objects. - * [ ] Don't orphan objects. - * [x] Include ControllerRef on all created objects. - * [x] Set DefaultGarbageCollectionPolicy to OrphanDependents. - * [x] Use ControllerRefManager to adopt and orphan. - * [ ] Enable ControllerRef regardless of `--enable-garbage-collector` flag. - * [ ] Use ControllerRef to map watch events to controllers. - * [ ] ReplicaSet - * [ ] Don't touch controlled objects if DeletionTimestamp is set. - * [x] Don't adopt/manage objects. - * [ ] Don't orphan objects. - * [x] Include ControllerRef on all created objects. - * [x] Set DefaultGarbageCollectionPolicy to OrphanDependents. - * [x] Use ControllerRefManager to adopt and orphan. - * [ ] Enable ControllerRef regardless of `--enable-garbage-collector` flag. - * [ ] Use ControllerRef to map watch events to controllers. - * [ ] StatefulSet - * [ ] Don't touch controlled objects if DeletionTimestamp is set. - * [ ] Include ControllerRef on all created objects. - * [ ] Set DefaultGarbageCollectionPolicy to OrphanDependents. - * [ ] Use ControllerRefManager to adopt and orphan. - * [ ] Use ControllerRef to map watch events to controllers. - * [ ] DaemonSet - * [x] Don't touch controlled objects if DeletionTimestamp is set. - * [ ] Include ControllerRef on all created objects. - * [ ] Set DefaultGarbageCollectionPolicy to OrphanDependents. - * [ ] Use ControllerRefManager to adopt and orphan. - * [ ] Use ControllerRef to map watch events to controllers. - * [ ] Deployment - * [x] Don't touch controlled objects if DeletionTimestamp is set. - * [x] Include ControllerRef on all created objects. - * [x] Set DefaultGarbageCollectionPolicy to OrphanDependents. - * [x] Use ControllerRefManager to adopt and orphan. - * [ ] Use ControllerRef to map watch events to controllers. 
- * [ ] Job - * [x] Don't touch controlled objects if DeletionTimestamp is set. - * [ ] Include ControllerRef on all created objects. - * [ ] Set DefaultGarbageCollectionPolicy to OrphanDependents. - * [ ] Use ControllerRefManager to adopt and orphan. - * [ ] Use ControllerRef to map watch events to controllers. - * [ ] CronJob - * [ ] Don't touch controlled objects if DeletionTimestamp is set. - * [ ] Include ControllerRef on all created objects. - * [ ] Set DefaultGarbageCollectionPolicy to OrphanDependents. - * [ ] Use ControllerRefManager to adopt and orphan. - * [ ] Use ControllerRef to map Jobs to their parent CronJobs. -* [ ] Tests - * [ ] Update existing controller tests to use ControllerRef. - * [ ] Add test for overlapping controllers of different types. -* [ ] Replace or augment uses of `CreatedByAnnotation` with ControllerRef. - * [ ] `kubectl describe` list of controllers for an object. - * [ ] `kubectl drain` Pod filtering. - * [ ] Classifying failed Pods in e2e test framework. - -# Alternatives - -The following alternatives were considered: - -* Centralized "ReferenceController" component that manages adoption/orphaning. - - Not chosen because: - * Hard to make it work for all imaginable 3rd party objects. - * Adding hooks to framework makes it possible for users to write their own - logic. - -* Separate API field for `ControllerRef` in the ObjectMeta. - - Not chosen because: - * Complicated relationship between `ControllerRef` and `OwnerReference` - when it comes to deletion/adoption. - -# History - -Summary of significant revisions to this document: - -* 2017-02-06 (enisoc) - * [Controller-specific behavior](#controller-specific-behavior) - * Enable ControllerRef regardless of whether GC is enabled. - * [Implementation](#implementation) - * Audit whether existing controllers respect DeletionTimestamp. -* 2017-02-01 (enisoc) - * Clarify existing specifications and add details not previously specified. - * [Non-goals](#non-goals) - * Make explicit that overlapping selectors are still user error. - * [Behavior](#behavior) - * Summarize fundamental rules that all new controllers should follow. - * Explain how the validator prevents multiple ControllerRefs on an object. - * Specify how ControllerRef should affect the use of watches/expectations. - * Specify important controller-specific behavior for existing controllers. - * Specify necessary changes to default GC policy when adding ControllerRef. - * Propose changing certain uses of `created-by` annotation to ControllerRef. - * [Upgrading](#upgrading) - * Specify ControllerRef-related behavior changes upon upgrade/downgrade. - * [Implementation](#implementation) - * List all work to be done and mark items already completed as of this edit. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/api-machinery/csi-client-structure-proposal.md b/contributors/design-proposals/api-machinery/csi-client-structure-proposal.md index 4f46d32e..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/csi-client-structure-proposal.md +++ b/contributors/design-proposals/api-machinery/csi-client-structure-proposal.md @@ -1,192 +1,6 @@ -# Overall Kubernetes Client Structure +Design proposals have been archived. -**Status:** Approved by SIG API Machinery on March 29th, 2017 +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -**Authors:** @lavalamp, @mbohlool - -**last edit:** 2017-3-22 - -## Goals - -* Users can build production-grade programmatic use of Kubernetes-style APIs in their language of choice. - -## New Concept - -Today, Kubernetes has the concept of an API Group. Sometimes it makes sense to package multiple groups together in a client, for example, the core APIs we publish today. I’ll call this a "group collection" as it sounds a bit better than “group group.” Group collections will have names. In particular, this document uses “core” as the name for the current set of APIs. - -## Repositories - -We’ve decomposed the problem into several components. We’d like to make the following repositories under a new `kubernetes-client` github org. - -* github.com/kubernetes-client/gen - - * Contents: - - * OpenAPI preprocessing (shared among multiple languages) and client generator(s) scripts - - * The Kubernetes go language client generator, which currently takes as input a tree of types.go files. - - * Future work: Convert OpenAPI into a types.go tree, or modify the input half of the go language generator. - - * Make the client generation completely automated so it can be part of a build process. Less reason for people to create custom repos as many clients for single language (and different api extensions) could be confusing. - - * gen is intended to be used as part of build tool chains. - -* github.com/kubernetes-client/go-base - - * Contents: - - * All reusable components of the existing client-go library, including at least: - - * Transport - - * RESTClient - - * Workqueue - - * Informer (not the typed informers) - - * Dynamic and Discovery clients - - * Utility functions (if still necessary) - - * But omitting: - - * API types - - * current generated code. - - * go-base is usable as a client on its own (dynamic client, discovery client) - -* github.com/kubernetes-client/core-go - - * Contents: - - * The existing examples. - - * The output (including API types) resulting from running client-gen on the source api types of the core API Group collection. - - * Any hand-written *_expansion.go files. - -* github.com/kubernetes-client/python-base - - * Hand-tuned pieces (auth, watch support etc) for python language clients. - -* github.com/kubernetes-client/core-python - - * The output of kubernetes-client/gen for the python language. - - * Note: We should provide a packaging script to make sure this is backward compatible with current `pip` package. - -* github.com/kubernetes-client/core-{lang} and github.com/kubernetes-client/{lang}-base - - * The output of kubernetes-client/gen for language {lang}, and hand-tuned pieces for that language. [See here](https://docs.google.com/document/d/1hsJNlowIg-u_rz3JBw9hXh6rj2LgeAdMY7OleOL1srw/edit). 
- -Note that the word "core" in the above package names represents the API groups that will be included in the repository. “core” would indicate inclusion of the groups published in the client today. (One can imagine replacing it with “service-catalog” etc.) - -## Why this split? - -The division of each language into two repositories is intended to allow for composition: generated clients for [multiple different API sources](https://docs.google.com/document/d/1UZyb5sQc-G2Ix4YL6dA9f4xJWtz3VKCiFPNHVHDUq9Y/edit#) can be used together without any code duplication. To clarify further: - -* Some who run the generator don't need go-base. - - * I want to publish an API extension client. - -* Some who use the kubernetes-client/core-go package don't need the generator. - - * I don’t use any extensions, just vanilla k8s. - -* Some who use go-base need neither the generator nor the core-go package. - - * I want to use the dynamic client since I only care about metadata. - -* Those who write automation only for their extension need the generator and go-base but possibly not core-go. - -That is, there should only be one kubernetes-client/{lang}-base for any given language, but many different API providers may provide a kubernetes-client/{collection}-{lang} for the language (e.g., core Kubernetes, the API registration API, the cluster federation effort, service catalog, heapster/metrics API, OpenShift). - -It is preferred to use the generation as part of the user’s build process so we can have fewer kubernetes-client/{collection}-{lang} for custom APIs out in the wild. Users should only expect super popular extensions to host their own client, as there’s otherwise a combinatorial explosion of API Group collections x languages. - -Users may want to run the client generator themselves with the particular collection of APIs enabled in their particular cluster, or at a different version of the generator. - -## Versioning - -`kubernetes-client/gen` must be versioned, so that users can get deterministic, repeatable client interfaces. (Use case: adding a reference to a new API resource to existing slightly out-of-date code.) We will use semver. - -The versions of kubernetes-client/gen must correspond to the versions of all kubernetes-client/{lang}-base repositories. - -kubernetes-client/{collection}-{lang} repos have their own version. Each release of such repos must clearly state both: - -* the version of the OpenAPI source (or go types for the go client) - -* the version of kubernetes-client/gen used to generate the client - -This will allow users to regenerate any given kubernetes-client/{collection}-{lang} repo, adding custom API resources etc. - -## Clients for API Extensions - -Providers of API extensions (e.g., service catalog or cluster federation) may choose to publish a generated client for their API types, to make users’ lives easier (it’s not strictly necessary, since end users could run the generator themselves). Publishing a client like this should be as easy as importing the generator execution script from e.g. the main kubernetes-client/core-go repo, and providing a different input source. - -## Language-specific considerations - -### Go - -* The typed informer generator should be included with the client generator. - -* We need a "clientset" adaptor concept to make it easy to compose clients from disparate client repos. - -* Prior strategy doc [is here](https://docs.google.com/document/d/1h_IBGYPMa8FS0oih4NbVkAMAzM7YTHr76VBcKy1qFbg/edit#heading=h.ve95t2prztno). 
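
To illustrate the "clientset adaptor" bullet above, a user-assembled set could compose typed clients generated into separate repositories around a single shared config, roughly as sketched below. The import paths and client type names are assumptions based on the repository layout proposed in this document, not published packages.

```go
package myclientset

import (
	"k8s.io/client-go/rest"

	corev1 "github.com/kubernetes-client/core-go/typed/core/v1"             // assumed path
	catalogv1 "github.com/example/servicecatalog-go/typed/catalog/v1alpha1" // assumed path
)

// Clientset composes typed clients from two independently generated repos.
type Clientset struct {
	Core    *corev1.CoreV1Client
	Catalog *catalogv1.CatalogV1alpha1Client
}

// NewForConfig builds both typed clients from one rest.Config, so callers
// construct and pass around a single object.
func NewForConfig(cfg *rest.Config) (*Clientset, error) {
	core, err := corev1.NewForConfig(cfg)
	if err != nil {
		return nil, err
	}
	catalog, err := catalogv1.NewForConfig(cfg)
	if err != nil {
		return nil, err
	}
	return &Clientset{Core: core, Catalog: catalog}, nil
}
```
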
- -### {My Favorite Language} - -Read about the process for producing an official client library [here](https://docs.google.com/document/d/1hsJNlowIg-u_rz3JBw9hXh6rj2LgeAdMY7OleOL1srw/edit). - -## Remaining Design Work - -* Client Release Process - - * Needs definition, i.e., who does what to which repo when. - -* Client release note collection mechanism - - * For go, we’ve talked about amending the Kubernetes merge bot to require a client-relnote: `Blah` in PRs that touch a few pre-identified directories. - - * Once we are working on the generator in the generator repo, it becomes easier to assemble release notes: we can grab all changes to the interface by looking at the kubernetes-client/gen and kubernetes-client/{lang}-base repos, and we can switch the release note rule to start tracking client-visible API changes in the main repository. - -* Client library documentation - - * Ideally, generated - - * Ideally, both: - - * In the native form for the language, and - - * In a way that can easily be aggregated with the official Kubernetes API docs. - -## Timeline Guesstimate - -Rough order of changes that need to be made. - -1. Begin working towards collecting client release notes. - -2. Split client-go into kubernetes-client/core-go and kubernetes-client/go-base - -3. Move go client generator into kubernetes-client/gen - - 1. kubernetes-client/gen becomes the canonical location for this. It is vendored into the main repository (downloaded at a specific version & invoked directly would be even better). - - 2. The client generator is modified to make a copy of the go types specifically for the client. (Either from the source go types, or generated from an OpenAPI spec.) - -4. Split client-python into kubernetes-client/core-python and kubernetes-client/python-base - -5. Move OpenAPI generator into kubernetes-client/gen - -6. Declare 1.0.0 on kubernetes-client/gen, and all kubernetes-client/{lang}-base repositories. (This doesn’t mean they’re stable, but we need a functioning versioning system since the deliverable here is the entire process, not any one particular client.) - -7. Instead of publishing kubernetes-client/core-go from its location in the staging directory: - - 3. Add a script to it that downloads kubernetes-client/gen and the main repo and generates the client. - - 4. Switch the import direction, so that we really just vendor kubernetes-client/core-go in the main repo. (Alternative: main repo can just run the generator itself to avoid having to make multiple PRs) - -8. At this point we should have finished the bootstrapping process and we’ll be ready to automate and execute on whatever release process we’ve defined. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/api-machinery/csi-new-client-library-procedure.md b/contributors/design-proposals/api-machinery/csi-new-client-library-procedure.md index 7d209042..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/csi-new-client-library-procedure.md +++ b/contributors/design-proposals/api-machinery/csi-new-client-library-procedure.md @@ -1,98 +1,6 @@ -# Kubernetes: New Client Library Procedure +Design proposals have been archived. -**Status:** Approved by SIG API Machinery on March 29th, 2017 +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -**Authors:** @mbohlool, @lavalamp - -**Last Updated:** 2017-03-06 - -# Background - -Kubernetes currently officially supports both Go and [Python client](https://github.com/kubernetes-incubator/client-python) libraries. The go client is developed and extracted from main kubernetes repositories in a complex process. On the other hand, the python client is based on OpenAPI, and is mostly generated code (via [swagger-codegen](https://github.com/swagger-api/swagger-codegen)). By generating the API Operations and Data Models, updating the client and tracking changes from main repositories becomes much more sustainable. - -The python client development process can be repeated for other languages. Supporting a basic set of languages would help the community to build more tools and applications based on kubernetes. We may consider adjusting the go client library generation to match, but that is not the goal of this doc. - -More background information can be found [here](https://github.com/kubernetes/kubernetes/issues/22405). - -# Languages - -The proposal is to support *Java*, *PHP*, *Ruby*, *C#*, and *Javascript* in addition to the already supported libraries, Go and Python. There are good clients for each of these languages, but having a basic supported client would even help those client libraries to focus on their interface and delegate transport and config layer to this basic client. For community members willing to do some work producing a client for their favorite language, this doc establishes a procedure for going about this. - -# Development process - -Development would be based on a generated client using OpenAPI and [swagger-codegen](https://github.com/swagger-api/swagger-codegen). Some basic functionality such as loading config, watch, etc. would be added (i.e., hand-written) on top of this generated client. The idea is to develop transportation and configuration layer, and modify as few generated files (such as API and models) as possible. The clients would be in alpha, beta or stable stages, and may have either bronze, silver, or gold support according to these requirements: - -### Client Capabilities - -* Bronze Requirements [](/contributors/design-proposals/api-machinery/csi-new-client-library-procedure.md#client-capabilities) - - * Support loading config from kube config file - - * Basic Auth (username/password) (Add documentation to discourage this and only use for testing.) - - * X509 Client certificate (inline and referenced by file) - - * Bearer tokens (inline or referenced by a file that is reloaded at least once per minute) - - * encryption/TLS (inline, referenced by file, insecure) - - * Basic API calls such as list pods should work - - * Works from within the cluster environment. 
- -* Silver Requirements [](/contributors/design-proposals/api-machinery/csi-new-client-library-procedure.md#client-capabilities) - - * Support watch calls - -* Gold Requirements [](/contributors/design-proposals/api-machinery/csi-new-client-library-procedure.md#client-capabilities) - - * Support exec, attach, port-forward calls (these are not normally supported out of the box from [swagger-codegen](https://github.com/swagger-api/swagger-codegen)) - - * Proto encoding - -### Client Support Level - -* Alpha [](/contributors/design-proposals/api-machinery/csi-new-client-library-procedure.md#client-support-level) - - * Clients don’t even have to meet bronze requirements - -* Beta [](/contributors/design-proposals/api-machinery/csi-new-client-library-procedure.md#client-support-level) - - * Client at least meets bronze standards - - * Reasonably stable releases - - * Installation instructions - - * 2+ individual maintainers/owners of the repository - -* Stable [](/contributors/design-proposals/api-machinery/csi-new-client-library-procedure.md#client-support-level) - - * Support level documented per-platform - - * Library documentation - - * Deprecation policy (backwards compatibility guarantees documented) - - * How fast may the interface change? - - * Versioning procedure well documented - - * Release process well documented - - * N documented users of the library - -The API machinery SIG will somewhere (community repo?) host a page listing clients, including their stability and capability level from the above lists. - -# Kubernetes client repo - -New clients will start as repositories in the [kubernetes client](https://github.com/kubernetes-client/) organization. - -We propose to make a `gen` repository to house common functionality such as preprocessing the OpenAPI spec and running the generator, etc. - -For each client language, we’ll make a client-[lang]-base and client-[lang] repository (where lang is one of java, csharp, js, php, ruby). The base repo would have all utility and add-ons for the specified language and the main repo will have generated client and reference to base repo. - -# Support - -These clients will be supported by the Kubernetes [API Machinery special interest group](/sig-api-machinery); however, individual owner(s) will be needed for each client language for them to be considered stable; the SIG won’t be able to handle the support load otherwise. If the generated clients prove as easy to maintain as we hope, then a few individuals may be able to own multiple clients. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/api-machinery/customresource-conversion-webhook.md b/contributors/design-proposals/api-machinery/customresource-conversion-webhook.md index 54991fd6..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/customresource-conversion-webhook.md +++ b/contributors/design-proposals/api-machinery/customresource-conversion-webhook.md @@ -1,889 +1,6 @@ -# CRD Conversion Webhook +Design proposals have been archived. -Status: Approved +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Version: Alpha -Implementation Owner: @mbohlool - -Authors: @mbohlool, @erictune - -Thanks: @dbsmith, @deads2k, @sttts, @liggit, @enisoc - -### Summary - -This document proposes a detailed plan for adding support for version-conversion of Kubernetes resources defined via Custom Resource Definitions (CRD). The API Server is extended to call out to a webhook at appropriate parts of the handler stack for CRDs. - -No new resources are added; the [CRD resource](https://github.com/kubernetes/kubernetes/blob/34383aa0a49ab916d74ea897cebc79ce0acfc9dd/staging/src/k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/types.go#L187) is extended to include conversion information as well as multiple schema definitions, one for each apiVersion that is to be served. - - -## Definitions - -**Webhook Resource**: a Kubernetes resource (or portion of a resource) that informs the API Server that it should call out to a Webhook Host for certain operations. - -**Webhook Host**: a process / binary which accepts HTTP connections, intended to be called by the Kubernetes API Server as part of a Webhook. - -**Webhook**: In Kubernetes, refers to the idea of having the API server make an HTTP request to another service at a point in its request processing stack. Examples are [Authentication webhooks](https://kubernetes.io/docs/reference/access-authn-authz/webhook/) and [Admission Webhooks](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/). Usually refers to the system of Webhook Host and Webhook Resource together, but occasionally used to mean just Host or just Resource. - -**Conversion Webhook**: Webhook that can convert an object from one version to another. - -**Custom Resource**: In the context of this document, it refers to resources defined as Custom Resource Definition (in contrast with extension API server’s resources). - -**CRD Package**: CRD definition, plus associated controller deployment, RBAC roles, etc, which is released by a developer who uses CRDs to create new APIs. - - -## Motivation - -Version conversion is, in our experience, the most requested improvement to CRDs. Prospective CRD users want to be certain they can evolve their API before they start down the path of developing a CRD + controller. - - -## Requirements - -* As an existing author of a CRD, I can update my API's schema, without breaking existing clients. To that end, I can write a CRD(s) that supports one kind with two (or more) versions. Users of this API can access an object via either version (v1 or v2), and are accessing the same underlying storage (assuming that I have properly defined how to convert between v1 and v2.) - -* As a prospective user of CRDs, I don't know what schema changes I may need in the future, but I want to know that they will be possible before I chose CRDs (over EAS, or over a non-Kubernetes API). 
- -* As an author of a CRD Package, my users can upgrade to a new version of my package, and can downgrade to a prior version of my package (assuming that they follow proper upgrade and downgrade procedures; these should not require direct etcd access.) - -* As a user, I should be able to request CR in any supported version defined by CRD and get an object has been properly converted to the requested version (assuming the CRD Package Author has properly defined how to convert). - -* As an author of a CRD that does not use validation, I can still have different versions which undergo conversion. - -* As a user, when I request an object, and webhook-conversion fails, I get an error message that helps me understand the problem. - -* As an API machinery code maintainer, this change should not make the API machinery code harder to maintain - -* As a cluster owner, when I upgrade to the version of Kubernetes that supports CRD multiple versions, but I don't use the new feature, my existing CRDs work fine. I can roll back to the previous version without any special action. - - -## Summary of Changes - -1. A CRD object now represents a group/kind with one or more versions. - -2. The CRD API (CustomResourceDefinitionSpec) is extended as follows: - - 1. It has a place to register 1 webhook. - - 2. it holds multiple "versions". - - 3. Some fields which were part of the .spec are now per-version; namely Schema, Subresources, and AdditionalPrinterColumns. - -3. A Webhook Host is used to do conversion for a CRD. - - 4. CRD authors will need to write a Webhook Host that accepts any version and returns any version. - - 5. Toolkits like kube-builder and operator-sdk are expected to provide flows to assist users to generate Webhook Hosts. - - -## Detailed Design - - -### CRD API Changes - -The CustomResourceDefinitionSpec is extended to have a new section where webhooks are defined: - -```golang -// CustomResourceDefinitionSpec describes how a user wants their resource to appear -type CustomResourceDefinitionSpec struct { - Group string - Version string - Names CustomResourceDefinitionNames - Scope ResourceScope - // Optional, can only be provided if per-version schema is not provided. - Validation *CustomResourceValidation - // Optional, can only be provided if per-version subresource is not provided. - Subresources *CustomResourceSubresources - Versions []CustomResourceDefinitionVersion - // Optional, can only be provided if per-version additionalPrinterColumns is not provided. - AdditionalPrinterColumns []CustomResourceColumnDefinition - - Conversion *CustomResourceConversion -} - -type CustomResourceDefinitionVersion struct { - Name string - Served Boolean - Storage Boolean - // Optional, can only be provided if top level validation is not provided. - Schema *JSONSchemaProp - // Optional, can only be provided if top level subresource is not provided. - Subresources *CustomResourceSubresources - // Optional, can only be provided if top level additionalPrinterColumns is not provided. - AdditionalPrinterColumns []CustomResourceColumnDefinition -} - -Type CustomResourceConversion struct { - // Conversion strategy, either "nop” or "webhook”. If webhook is set, Webhook field is required. - Strategy string - - // Additional information for external conversion if strategy is set to external - // +optional - Webhook *CustomResourceConversionWebhook -} - -type CustomResourceConversionWebhook { - // ClientConfig defines how to communicate with the webhook. 
This is the same config used for validating/mutating webhooks. - ClientConfig WebhookClientConfig -} -``` - -### Top level fields to Per-Version fields - -In *CRD v1beta1* (apiextensions.k8s.io/v1beta1) there are per-version schema, additionalPrinterColumns or subresources (called X in this section) defined and these validation rules will be applied to them: - -* Either top level X or per-version X can be set, but not both. This rule applies to individual X’s not the whole set. E.g. top level schema can be set while per-version subresources are set. -* per-version X cannot be the same. E.g. if all per-version schema are the same, the CRD object will be rejected with an error message asking the user to use the top level schema. - -in *CRD v1* (apiextensions.k8s.io/v1), there will be only version list with no top level X. The second validation guarantees a clean moving to v1. These are conversion rules: - -*v1beta1->v1:* - -* If top level X is set in v1beta1, then it will be copied to all versions in v1. -* If per-version X are set in v1beta1, then they will be used for per-version X in v1. - -*v1->v1beta1:* - -* If all per-version X are the same in v1, they will be copied to top level X in v1beta1 -* Otherwise, they will be used as per-version X in v1beta1 - -#### Alternative approaches considered - -First a defaulting approach is considered which per-version fields would be defaulted to top level fields. but that breaks backward incompatible change; Quoting from API [guidelines](/contributors/devel/sig-architecture/api_changes.md#backward-compatibility-gotchas): - -> A single feature/property cannot be represented using multiple spec fields in the same API version simultaneously - -Hence the defaulting either implicit or explicit has the potential to break backward compatibility as we have two sets of fields representing the same feature. - -There are other solution considered that does not involved defaulting: - -* Field Discriminator: Use `Spec.Conversion.Strategy` as discriminator to decide which set of fields to use. This approach would work but the proposed solution is keeping the mutual excusivity in a broader sense and is preferred. -* Per-version override: If a per-version X is specified, use it otherwise use the top level X if provided. While with careful validation and feature gating, this solution is also backward compatible, the overriding behaviour need to be kept in CRD v1 and that looks too complicated and not clean to keep for a v1 API. - -Refer to [this document](http://bit.ly/k8s-crd-per-version-defaulting) for more details and discussions on those solutions. - -### Support Level - -The feature will be alpha in the first implementation and will have a feature gate that is defaulted to false. The roll-back story with a feature gate is much more clear. if we have the features as alpha in kubernetes release Y (>X where the feature is missing) and we make it beta in kubernetes release Z, it is not safe to use the feature and downgrade from Y to X but the feature is alpha in Y which is fine. It is safe to downgrade from Z to Y (given that we enable the feature gate in Y) and that is desirable as the feature is beta in Z. -On downgrading from a Z to Y, stored CRDs can have per-version fields set. While the feature gate can be off on Y (alpha cluster), it is dangerous to disable per-version Schema Validation or Status subresources as it makes the status field mutable and validation on CRs will be disabled. 
Thus the feature gate in Y only protects adding per-version fields not the actual behaviour. Thus if the feature gate is off in Y: - -* Per-version X cannot be set on CRD create (per-version fields are auto-cleared). -* Per-version X can only be set/changed on CRD update *if* the existing CRD object already has per-version X set. - -This way even if we downgrade from Z to Y, per-version validations and subresources will be honored. This will not be the case for webhook conversion itself. The feature gate will also protect the implementation of webhook conversion and alpha cluster with disabled feature gate will return error for CRDs with webhook conversion (that are created with a future version of the cluster). - -### Rollback - -Users that need to rollback to version X (but may currently be running version Y > X) of apiserver should not use CRD Webhook Conversion if X is not a version that supports these features. If a user were to create a CRD that uses CRD Webhook Conversion and then rolls back to version X that does not support conversion then the following would happen: - -1. The stored custom resources in etcd will not be deleted. - -2. Any clients that try to get the custom resources will get a 500 (internal server error). this is distinguishable from a deleted object for get and the list operation will also fail. That means the CRD is not served at all and Clients that try to garbage collect related resources to missing CRs should be aware of this. - -3. Any client (e.g. controller) that tries to list the resource (in preparation for watching it) will get a 500 (this is distinguishable from an empty list or a 404). - -4. If the user rolls forward again, then custom resources will be served again. - -If a user does not use the webhook feature but uses the versioned schema, additionalPrinterColumns, and/or subresources and rollback to a version that does not support them per-version, any value set per-version will be ignored and only values in top level spec.* will be honor. - -Please note that any of the fields added in this design that is not supported in previous kubernetes releases can be removed on an update operation (e.g. status update). The kubernetes release where defined the types but gate them with an alpha feature gate, however, can keep these fields but ignore there value. - -### Webhook Request/Response - -The Conversion request and response would be similar to [Admission webhooks](https://github.com/kubernetes/kubernetes/blob/951962512b9cfe15b25e9c715a5f33f088854f97/staging/src/k8s.io/api/admission/v1beta1/types.go#L29). The AdmissionReview seems to be redundant but used by other Webhook APIs and added here for consistency. - -```golang -// ConversionReview describes a conversion request/response. -type ConversionReview struct { - metav1.TypeMeta - // Request describes the attributes for the conversion request. - // +optional - Request *ConversionRequest - // Response describes the attributes for the conversion response. - // +optional - Response *ConversionResponse -} - -type ConversionRequest struct { - // UID is an identifier for the individual request/response. Useful for logging. - UID types.UID - // The version to convert given object to. E.g. "stable.example.com/v1" - APIVersion string - // Object is the CRD object to be converted. - Object runtime.RawExtension -} - -type ConversionResponse struct { - // UID is an identifier for the individual request/response. - // This should be copied over from the corresponding ConversionRequest. 
- UID types.UID - // ConvertedObject is the converted version of request.Object. - ConvertedObject runtime.RawExtension -} -``` - -If the conversion fails, the webhook should fail the HTTP request with a proper error code and message, which will be used to create a status error for the original API caller. - - -### Monitorability - -There should be Prometheus metrics to show: - -* CRD conversion latency - * Overall - * By webhook name - * By request (sum of all conversions in a request) - * By CRD -* Conversion failure count - * Overall - * By webhook name - * By CRD -* Timeout failure count - * Overall - * By webhook name - * By CRD - -Adding a webhook dynamically adds a key to a map-valued Prometheus metric. Webhook host process authors should consider how to make their webhook host monitorable: while eventually we hope to offer a set of best practices around this, for the initial release we won’t have requirements here. - - -### Error Messages - -When a conversion webhook fails, e.g. for a GET operation, the error message from the apiserver to its client should reflect that the conversion failed and include additional information to help debug the problem. The error message and HTTP error code returned by the webhook should be included in the error message the API server returns to the user. For example: - -```bash -$ kubectl get mykind somename -error on server: conversion from stored version v1 to requested version v2 for somename: "408 request timeout" while calling service "mywebhookhost.somens.cluster.local:443" -``` - - -For operations that need more than one conversion (e.g. LIST), no partial result will be returned. Instead the whole operation will fail the same way, with detailed error messages. To help debug these kinds of operations, the UID of the first failing conversion will also be included in the error message. - - -### Caching - -No new caching is planned as part of this work, but the API server may in the future cache webhook POST responses. - -Most API operations are reads. The most common kind of read is a watch. All watched objects are cached in memory. For CRDs, the cache is per-version. That is the result of having one [REST store object](https://github.com/kubernetes/kubernetes/blob/3cb771a8662ae7d1f79580e0ea9861fd6ab4ecc0/staging/src/k8s.io/apiextensions-apiserver/pkg/registry/customresource/etcd.go#L72) per version, which was an arbitrary design choice but would be required for better caching with webhook conversion. In this model, each GVK is cached, regardless of whether some GVKs share storage. Thus, watches do not cause conversion, so conversion webhooks will not add overhead to the watch path. The watch cache is per API server and eventually consistent. - -Non-watch reads are also cached (if the requested resourceVersion is 0, which is true for generated informers by default but not for calls like `kubectl get ...`, namespace cleanup, etc.). The cached objects are converted and per-version (TODO: fact check). So conversion webhooks will not add overhead here either. - -If this proves to be a performance problem in the future, we might need to add caching later. The authorization and authentication webhooks already use a simple scheme with apiserver-side caching and a single TTL for expiration. This has worked fine, so we can repeat that approach. It does not require webhook hosts to be aware of the caching.
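
The TTL scheme mentioned above can be made concrete with a short sketch. The following is a minimal, hypothetical illustration of apiserver-side caching of conversion results with a single TTL, in the style of the authentication/authorization webhook caches; it is not part of this proposal, and all names (package, types, key fields) are made up for the example.

```golang
// Package conversioncache is an illustrative sketch, not proposed code.
package conversioncache

import (
	"sync"
	"time"
)

// cacheKey identifies one conversion result. A real implementation would key
// on the serialized stored object plus the requested target version.
type cacheKey struct {
	objectHash    string // e.g. a hash of the stored object's bytes
	targetVersion string // e.g. "stable.example.com/v2"
}

type cacheEntry struct {
	converted []byte
	expiresAt time.Time
}

// TTLCache caches converted objects for a fixed TTL, similar in spirit to the
// authentication/authorization webhook caches described above.
type TTLCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[cacheKey]cacheEntry
}

func NewTTLCache(ttl time.Duration) *TTLCache {
	return &TTLCache{ttl: ttl, entries: map[cacheKey]cacheEntry{}}
}

// GetOrConvert returns a cached conversion if present and fresh; otherwise it
// calls convert (the webhook round trip) and caches the result.
func (c *TTLCache) GetOrConvert(objectHash, targetVersion string, convert func() ([]byte, error)) ([]byte, error) {
	key := cacheKey{objectHash: objectHash, targetVersion: targetVersion}

	c.mu.Lock()
	if e, ok := c.entries[key]; ok && time.Now().Before(e.expiresAt) {
		c.mu.Unlock()
		return e.converted, nil
	}
	c.mu.Unlock()

	converted, err := convert()
	if err != nil {
		return nil, err
	}

	c.mu.Lock()
	c.entries[key] = cacheEntry{converted: converted, expiresAt: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return converted, nil
}
```

Keying on the object content plus the target version only pays off if webhooks are deterministic, which matches the advice later in this document that a webhook should always give the same response for the same request.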
- - -## Examples - - -### Example of Writing Conversion Webhook - -Data model for v1: - -|data model for v1| -|-----------------| -```yaml -properties: - spec: - properties: - cronSpec: - type: string - image: - type: string -``` - -|data model for v2| -|-----------------| -```yaml -properties: - spec: - properties: - min: - type: string - hour: - type: string - dayOfMonth: - type: string - month: - type: string - dayOfWeek: - type: string - image: - type: string -``` - - -Both schemas can hold the same data (assuming the string format for V1 was a valid format). - -|crontab_conversion.go| -|---------------------| - -```golang -import .../types/v1 -import .../types/v2 - -// Actual conversion methods - -func convertCronV1toV2(cronV1 *v1.Crontab) (*v2.Crontab, error) { - items := strings.Split(cronV1.spec.cronSpec, " ") - if len(items) != 5 { - return nil, fmt.Errorf("invalid spec string, needs five parts: %s", cronV1.spec.cronSpec) - } - return &v2.Crontab{ - ObjectMeta: cronV1.ObjectMeta, - TypeMeta: metav1.TypeMeta{ - APIVersion: "stable.example.com/v2", - Kind: cronV1.Kind, - }, - spec: v2.CrontabSpec{ - image: cronV1.spec.image, - min: items[0], - hour: items[1], - dayOfMonth: items[2], - month: items[3], - dayOfWeek: items[4], - }, - }, nil - -} - -func convertCronV2toV1(cronV2 *v2.Crontab) (*v1.Crontab, error) { - cronspec := cronV2.spec.min + " " - cronspec += cronV2.spec.hour + " " - cronspec += cronV2.spec.dayOfMonth + " " - cronspec += cronV2.spec.month + " " - cronspec += cronV2.spec.dayOfWeek - return &v1.Crontab{ - ObjectMeta: cronV2.ObjectMeta, - TypeMeta: metav1.TypeMeta{ - APIVersion: "stable.example.com/v1", - Kind: cronV2.Kind, - }, - spec: v1.CrontabSpec{ - image: cronV2.spec.image, - cronSpec: cronspec, - }, - }, nil -} - -// The rest of the file can go into an auto generated framework - -func serveCronTabConversion(w http.ResponseWriter, r *http.Request) { - request, err := readConversionRequest(r) - if err != nil { - reportError(w, err) - } - response := ConversionResponse{} - response.UID = request.UID - converted, err := convert(request.Object, request.APIVersion) - if err != nil { - reportError(w, err) - } - response.ConvertedObject = *converted - writeConversionResponse(w, response) -} - -func convert(in runtime.RawExtension, version string) (*runtime.RawExtension, error) { - inApiVersion, err := extractAPIVersion(in) - if err != nil { - return nil, err - } - switch inApiVersion { - case "stable.example.com/v1": - var cronV1 v1Crontab - if err := json.Unmarshal(in.Raw, &cronV1); err != nil { - return nil, err - } - switch version { - case "stable.example.com/v1": - // This should not happened as API server will not call the webhook in this case - return &in, nil - case "stable.example.com/v2": - cronV2, err := convertCronV1toV2(&cronV1) - if err != nil { - return nil, err - } - raw, err := json.Marshal(cronV2) - if err != nil { - return nil, err - } - return &runtime.RawExtension{Raw: raw}, nil - } - case "stable.example.com/v2": - var cronV2 v2Crontab - if err := json.Unmarshal(in.Raw, &cronV2); err != nil { - return nil, err - } - switch version { - case "stable.example.com/v2": - // This should not happened as API server will not call the webhook in this case - return &in, nil - case "stable.example.com/v1": - cronV1, err := convertCronV2toV1(&cronV2) - if err != nil { - return nil, err - } - raw, err := json.Marshal(cronV1) - if err != nil { - return nil, err - } - return &runtime.RawExtension{Raw: raw}, nil - } - default: - return nil, fmt.Errorf("invalid 
conversion fromVersion requested: %s", inApiVersion) - } - return nil, fmt.Errorf("invalid conversion toVersion requested: %s", version) -} - -func extractAPIVersion(in runtime.RawExtension) (string, error) { - object := unstructured.Unstructured{} - if err := object.UnmarshalJSON(in.Raw); err != nil { - return "", err - } - return object.GetAPIVersion(), nil -} -``` - -Note: not all code is shown for running a web server. - -Note: some of this is boilerplate that we expect tools like Kubebuilder will handle for the user. - -Also some appropriate tests, most importantly round trip test: - -|crontab_conversion_test.go| -|-| - -```golang -func TestRoundTripFromV1ToV2(t *testing.T) { - testObj := v1.Crontab{ - ObjectMeta: metav1.ObjectMeta{ - Name: "my-new-cron-object", - }, - TypeMeta: metav1.TypeMeta{ - APIVersion: "stable.example.com/v1", - Kind: "CronTab", - }, - spec: v1.CrontabSpec{ - image: "my-awesome-cron-image", - cronSpec: "* * * * */5", - }, - } - testRoundTripFromV1(t, testObj) -} - -func testRoundTripFromV1(t *testing.T, v1Object v1.CronTab) { - v2Object, err := convertCronV1toV2(v1Object) - if err != nil { - t.Fatalf("failed to convert v1 crontab to v2: %v", err) - } - v1Object2, err := convertCronV2toV1(v2Object) - if err != nil { - t.Fatalf("failed to convert v2 crontab to v1: %v", err) - } - if !reflect.DeepEqual(v1Object, v1Object2) { - t.Errorf("round tripping failed for v1 crontab. v1Object: %v, v2Object: %v, v1ObjectConverted: %v", - v1Object, v2Object, v1Object2) - } -} -``` - -## Example of Updating CRD from one to two versions - -This example uses some files from previous section. - -**Step 1**: Start from a CRD with only one version - -|crd1.yaml| -|-| - -```yaml -apiVersion: apiextensions.k8s.io/v1beta1 -kind: CustomResourceDefinition -metadata: - name: crontabs.stable.example.com -spec: - group: stable.example.com - versions: - - name: v1 - served: true - storage: true - schema: - properties: - spec: - properties: - cronSpec: - type: string - image: - type: string - scope: Namespaced - names: - plural: crontabs - singular: crontab - kind: CronTab - shortNames: - - ct -``` - -And create it: - -```bash -Kubectl create -f crd1.yaml -``` - -(If you have an existing CRD installed prior to the version of Kubernetes that supports the "versions" field, then you may need to move version field to a single item in the list of versions or just try to touch the CRD after upgrading to the new Kubernetes version which will result in the versions list being defaulted to a single item equal to the top level spec values) - -**Step 2**: Create a CR within that one version: - -|cr1.yaml| -|-| -```yaml - -apiVersion: "stable.example.com/v1" -kind: CronTab -metadata: - name: my-new-cron-object -spec: - cronSpec: "* * * * */5" - image: my-awesome-cron-image -``` - -And create it: - -```bash -Kubectl create -f cr1.yaml -``` - -**Step 3**: Decide to introduce a new version of the API. - -**Step 3a**: Write a new OpenAPI data model for the new version (see previous section). Use of a data model is not required, but it is recommended. - -**Step 3b**: Write conversion webhook and deploy it as a service named `crontab_conversion` - -See the "crontab_conversion.go" file in the previous section. - -**Step 3c**: Update the CRD to add the second version. 
- -Do this by adding a new item to the "versions" list, containing the new data model: - -|crd2.yaml| -|-| -```yaml - -apiVersion: apiextensions.k8s.io/v1beta1 -kind: CustomResourceDefinition -metadata: - name: crontabs.stable.example.com -spec: - group: stable.example.com - versions: - - name: v1 - served: true - storage: false - schema: - properties: - spec: - properties: - cronSpec: - type: string - image: - type: string - - name: v2 - served: true - storage: true - schema: - properties: - spec: - properties: - min: - type: string - hour: - type: string - dayOfMonth: - type: string - month: - type: string - dayOfWeek: - type: string - image: - type: string - scope: Namespaced - names: - plural: crontabs - singular: crontab - kind: CronTab - shortNames: - - ct - conversion: - strategy: external - webhook: - client_config: - namespace: crontab - service: crontab_conversion - Path: /crontab_convert -``` - -And apply it: - -```bash -Kubectl apply -f crd2.yaml -``` - -**Step 4**: add a new CR in v2: - -|cr2.yaml| -|-| -```yaml - -apiVersion: "stable.example.com/v2" -kind: CronTab -metadata: - name: my-second-cron-object -spec: - min: "*" - hour: "*" - day_of_month: "*" - dayOfWeek: "*/5" - month: "*" - image: my-awesome-cron-image -``` - -And create it: - -```bash -Kubectl create -f cr2.yaml -``` - -**Step 5**: storage now has two custom resources in two different versions. To downgrade to previous CRD, one can apply crd1.yaml but that will fail as the status.storedVersions has both v1 and v2 and those cannot be removed from the spec.versions list. To downgrade, first create a crd2-b.yaml file that sets v1 as storage version and apply it, then follow "*Upgrade existing objects to a new stored version*“ in [this document](https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definition-versioning/). After all CRs in the storage has v1 version, you can apply crd1.yaml. - -**Step 5 alternative**: create a crd1-b.yaml that has v2 but not served. - -|crd1-b.yaml| -|-| -```yaml - -apiVersion: apiextensions.k8s.io/v1beta1 -kind: CustomResourceDefinition -metadata: - name: crontabs.stable.example.com -spec: - group: stable.example.com - versions: - - name: v1 - served: true - storage: true - schema: - properties: - spec: - properties: - cronSpec: - type: string - image: - type: string - - name: v2 - served: false - storage: false - scope: Namespaced - names: - plural: crontabs - singular: crontab - kind: CronTab - shortNames: - - ct - conversion: - strategy: external - webhook: - client_config: - namespace: crontab - service: crontab_conversion - Path: /crontab_convert -``` - -## Alternatives Considered - -Other than webhook conversion, a declarative conversion also considered and discussed. The main operator that being discussed was Rename/Move. This section explains why Webhooks are chosen over declarative conversion. This does not mean the declarative approach will not be supported by the webhook would be first conversion method kubernetes supports. - -### Webhooks vs Declarative - -The table below compares webhook vs declarative in details. - -<table> - <tr> - <td></td> - <td>Webhook</td> - <td>Declarative</td> - </tr> - <tr> - <td>1. Limitatisons</td> - <td>There is no limitation on the type of conversion CRD author can do.</td> - <td>Very limited set of conversions will be provided.</td> - </tr> - <tr> - <td>2. User Complexity</td> - <td>Harder to implement and the author needs to run an http server. 
This can be made simpler using tools such as kube-builder.</td> - <td>Easy to use as they are in yaml configuration file.</td> - </tr> - <tr> - <td>3. Design Complexity</td> - <td>Because the API server calls into an external webhook, there is no need to design a specific conversions.</td> - <td>Designing of declarative conversions can be tricky, especially if they are changing the value of fields. Challenges are: Meeting the round-trip-ability requirement, arguing the usefulness of the operator and keeping it simple enough for a declarative system.</td> - </tr> - <tr> - <td>4. Performance</td> - <td>Several calls to webhook for one operation (e.g. Apply) might hit performance issues. A monitoring metric helps measure this for later improvements that can be done through batch conversion.</td> - <td>Implemented in API Server directly thus there is no performance concerns.</td> - </tr> - <tr> - <td>5. User mistakes</td> - <td>Users have freedom to implement any kind of conversion which may not conform with our API convention (e.g. round-tripability. If the conversion is not revertible, old clients may fail and downgrade will also be at risk).</td> - <td>Keeping the conversion operators sane and sound would not be user’s problem. For things like rename/move there is already a design that keeps round-tripp-ability but that could be tricky for other operations.</td> - </tr> - <tr> - <td>6. Popularity</td> - <td>Because of the freedom in conversion of webhooks, they probably would be more popular</td> - <td>Limited set of declarative operators make it a safer but less popular choice at least in the early stages of CRD development</td> - </tr> - <tr> - <td>7. CRD Development Cycles</td> - <td>Fit well into the story of CRD development of starting with blob store CRDs, then add Schema, then Add webhook conversions for the freedom of conversion the move as much possible to declarative for safer production.</td> - <td>Comes after Webhooks in the development cycles of CRDs</td> - </tr> -</table> - - -Webhook conversion has less limitation for the authors of APIs using CRD which is desirable especially in the early stages of development. Although there is a chance of user mistakes and also it may look more complex to implement a webhook, those can be relieved using sets of good tools/libraries such as kube-builder. Overall, Webhook conversion is the clear winner here. Declarative approach may be considered at a later stage as an alternative but need to be carefully designed. - - -### Caching - -* use HTTP caching conventions with Cache-Control, Etags, and a unique URL for each different request). This requires more complexity for the webhook author. This change could be considered as part of an update to all 5 or so kinds of webhooks, but not justified for just this one kind of webhook. - -* The CRD object could have a "conversionWebhookVersion" field which the user can increment/change when upgrading/downgrading the webhook to force invalidation of cached objects. - - -## Advice to Users - -* A proper webhook host implementation should accept every supported version as input and as output version. - -* It should also be able to round trip between versions. E.g. converting an object from v1 to v2 and back to v1 should yield the same object. - -* Consider testing your conversion webhook with a fuzz tester that generates random valid objects. 
- -* The webhook should always give the same response with the same request that allows API server to potentially cache the responses in future (modulo bug fixes; when an update is pushed that fixes a bug in the conversion operation it might not take effect for a few minutes. - -* If you need to add a new field, just add it. You don't need new schema to add a field. - -* Webhook Hosts should be side-effect free. - -* Webhook Hosts should not expect to see every conversion operation. Some may be cached in the future. - -* Toolkits like KubeBuilder and OperatorKit may assist users in using this new feature by: - - * having a place in their file hierarchy to define multiple schemas for the same kind. - - * having a place in their code templates to define a conversion function. - - * generating a full Webhook Host from a conversion function. - - * helping users create tests by writing directories containing sample yamls of an object in various versions. - - * using fuzzing to generate random valid objects and checking if they convert. - -## Test and Documentation Plan - -* Test the upgrade/rollback scenario below. - -* Test conversion, refer to the test case section. - -* Document CRD conversion and best practices for webhook conversion - -* Document to CRD users how to upgrade and downgrade (changing storage version dance, and changes to CRD stored tags). - -### Upgrade/Rollback Scenarios - -Scenario 1: Upgrading an Operator to have more versions. - -* Detect if the cluster version supports webhook conversion - - * Helm chart can require e.g. v1.12 of a Kubernetes API Server. - -Scenario 2: Rolling back to a previous version of API Server that does not support CRD Conversions - -* I have a cluster - - * I use apiserver v1.11.x, which supports multiple no-conversion-versions of a CRD - -* I start to use CRDs - - * I install helm chart "Foo-Operator", which installs a CRD for resource Foo, with 1 version called v1beta1. - - * This uses the old "version" and " - - * I create some Foo resources. - -* I upgrade apiserver to v1.12.x - - * version-conversion now supported. - -* I upgrade the Foo-Operator chart. - - * This changes the CRD to have two versions, v1beta1 and v1beta2. - - * It installs a Webhook Host to convert them. - - * Assume: v1beta1 is still the storage version. - -* I start using multiple versions, so that the CRs are now stored in a mix of versions. - -* I downgrade kube-apiserver - - * Emergency happens, I need to downgrade to v1.11.x. Conversion won't be possible anymore. - - * Downgrade - - * Any call needs conversion should fail at this stage (we need to patch 1.11 for this, see issue [#65790](https://github.com/kubernetes/kubernetes/issues/65790) - -### Test Cases - -* Updating existing CRD to use multiple versions with conversion - - * Define a CRD with one version. - - * Create stored CRs. - - * Update the CRD object to add another (non-storage) version with a conversion webhook - - * Existing CRs are not harmed - - * Can get existing CRs via new api, conversion webhook should be called - - * Can create new CRs with new api, conversion webhook should be called - - * Access new CRs with new api, conversion webhook should not be called - - * Access new CRs with old api, conversion webhook should be called - -## Development Plan - -Google able to staff development, test, review, and documentation. Help welcome, too, esp. Reviewing. - -Not in scope for this work: - -* Including CRDs to aggregated OpenAPI spec (fka swagger.json). 
- -* Apply for CRDs - -* Make CRDs powerful enough to convert any or all core types to CRDs (this work is in line with that goal, but is only a step towards it). - -### Work items - -* Add APIs for conversion webhooks to the CustomResourceDefinition type. - -* Support multi-version schema (the field formerly called validation). - -* Support multi-version subresources and additionalPrinterColumns. - -* Add a webhook converter call as a CRD converter (refactor conversion code as needed). - -* Ensure webhook latency can be monitored; see the Monitorability section. - -* Add upgrade/downgrade tests. - -* Add public documentation. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/api-machinery/customresources-subresources.md b/contributors/design-proposals/api-machinery/customresources-subresources.md index 54c5f5bb..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/customresources-subresources.md +++ b/contributors/design-proposals/api-machinery/customresources-subresources.md @@ -1,203 +1,6 @@ -# Subresources for CustomResources +Design proposals have been archived. -Authors: @nikhita, @sttts +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Table of Contents -1. [Abstract](#abstract) -2. [Goals](#goals) -3. [Non-Goals](#non-goals) -4. [Proposed Extension of CustomResourceDefinition](#proposed-extension-of-customresourcedefinition) - 1. [API Types](#api-types) - 2. [Feature Gate](#feature-gate) -5. [Semantics](#semantics) - 1. [Validation Behavior](#validation-behavior) - 1. [Status](#status) - 2. [Scale](#scale) - 2. [Status Behavior](#status-behavior) - 3. [Scale Behavior](#scale-behavior) - 1. [Status Replicas Behavior](#status-replicas-behavior) - 2. [Selector Behavior](#selector-behavior) -4. [Implementation Plan](#implementation-plan) -5. [Alternatives](#alternatives) - 1. [Scope](#scope) - -## Abstract - -[CustomResourceDefinitions](https://github.com/kubernetes/community/pull/524) (CRDs) were introduced in 1.7. The objects defined by CRDs are called CustomResources (CRs). Currently, we do not provide subresources for CRs. - -However, it is one of the [most requested features](https://github.com/kubernetes/kubernetes/issues/38113) and this proposal seeks to add `/status` and `/scale` subresources for CustomResources. - -## Goals - -1. Support status/spec split for CustomResources: - 1. Status changes are ignored on the main resource endpoint. - 2. Support a `/status` subresource HTTP path for status changes. - 3. `metadata.Generation` is increased only on spec changes. -2. Support a `/scale` subresource for CustomResources. -3. Maintain backward compatibility by allowing CRDs to opt-in to enable subresources. -4. If a CustomResource is already structured using spec/status, allow it to easily transition to use the `/status` and `/scale` endpoint. -5. Work seamlessly with [JSON Schema validation](https://github.com/kubernetes/community/pull/708). - -## Non-Goals - -1. Allow defining arbitrary subresources i.e. subresources except `/status` and `/scale`. - -## Proposed Extension of CustomResourceDefinition - -### API Types - -The addition of the following external types in `apiextensions.k8s.io/v1beta1` is proposed: - -```go -type CustomResourceDefinitionSpec struct { - ... - // SubResources describes the subresources for CustomResources - // This field is alpha-level and should only be sent to servers that enable - // subresources via the CurstomResourceSubResources feature gate. - // +optional - SubResources *CustomResourceSubResources `json:"subResources,omitempty"` -} - -// CustomResourceSubResources defines the status and scale subresources for CustomResources. -type CustomResourceSubResources struct { - // Status denotes the status subresource for CustomResources - Status *CustomResourceSubResourceStatus `json:"status,omitempty"` - // Scale denotes the scale subresource for CustomResources - Scale *CustomResourceSubResourceScale `json:"scale,omitempty"` -} - -// CustomResourceSubResourceStatus defines how to serve the HTTP path <CR Name>/status. 
-type CustomResourceSubResourceStatus struct { -} - -// CustomResourceSubResourceScale defines how to serve the HTTP path <CR name>/scale. -type CustomResourceSubResourceScale struct { - // required, e.g. “.spec.replicas”. Must be under `.spec`. - // Only JSON paths without the array notation are allowed. - SpecReplicasPath string `json:"specReplicasPath"` - // optional, e.g. “.status.replicas”. Must be under `.status`. - // Only JSON paths without the array notation are allowed. - StatusReplicasPath string `json:"statusReplicasPath,omitempty"` - // optional, e.g. “.spec.labelSelector”. Must be under `.spec`. - // Only JSON paths without the array notation are allowed. - LabelSelectorPath string `json:"labelSelectorPath,omitempty"` - // ScaleGroupVersion denotes the GroupVersion of the Scale - // object sent as the payload for /scale. It allows transition - // to future versions easily. - // Today only autoscaling/v1 is allowed. - ScaleGroupVersion schema.GroupVersion `json:"groupVersion"` -} -``` - -### Feature Gate - -The `SubResources` field in `CustomResourceDefinitionSpec` will be gated under the `CustomResourceSubResources` alpha feature gate. -If the gate is not open, the value of the new field within `CustomResourceDefinitionSpec` is dropped on creation and updates of CRDs. - -### Scale type - -The `Scale` object is the payload sent over the wire for `/scale`. The [polymorphic `Scale` type](https://github.com/kubernetes/kubernetes/pull/53743) i.e. `autoscaling/v1.Scale` is used for the `Scale` object. - -Since the GroupVersion of the `Scale` object is specified in `CustomResourceSubResourceScale`, transition to future versions (eg `autoscaling/v2.Scale`) can be done easily. - -Note: If `autoscaling/v1.Scale` is deprecated, then it would be deprecated here as well. - -## Semantics - -### Validation Behavior - -#### Status - -The status endpoint of a CustomResource receives a full CR object. Changes outside of the `.status` subpath are ignored. -For validation, the JSON Schema present in the CRD is validated only against the `.status` subpath. - -To validate only against the schema for the `.status` subpath, `oneOf` and `anyOf` constructs are not allowed within the root of the schema, but only under a properties sub-schema (with this restriction, we can project a schema to a sub-path). The following is forbidden in the CRD spec: - -```yaml -validation: - openAPIV3Schema: - oneOf: - ... -``` - -**Note**: The restriction for `oneOf` and `anyOf` allows us to write a projection function `ProjectJSONSchema(schema *JSONSchemaProps, path []string) (*JSONSchemaProps, error)` that can be used to apply a given schema for the whole object to only the sub-path `.status` or `.spec`. - -#### Scale - -Moreover, if the scale subresource is enabled: - -On update, we copy the values from the `Scale` object into the specified paths in the CustomResource, if the path is set (`StatusReplicasPath` and `LabelSelectorPath` are optional). -If `StatusReplicasPath` or `LabelSelectorPath` is not set, we validate that the value in `Scale` is also not specified and return an error otherwise. - -On `get` and on `update` (after copying the values into the CustomResource as described above), we verify that: - -- The value at the specified JSON Path `SpecReplicasPath` (e.g. `.spec.replicas`) is a non-negative integer value and is not empty. - -- The value at the optional JSON Path `StatusReplicasPath` (e.g. `.status.replicas`) is an integer value if it exists (i.e. this can be empty). 
- -- The value at the optional JSON Path `LabelSelectorPath` (e.g. `.spec.labelSelector`) is a valid label selector if it exists (i.e. this can be empty). - -**Note**: The values at the JSON Paths specified by `SpecReplicasPath`, `LabelSelectorPath` and `StatusReplicasPath` are also validated with the same rules when the whole object or, in case the `/status` subresource is enabled, the `.status` sub-object is updated. - -### Status Behavior - -If the `/status` subresource is enabled, the following behaviors change: - -- The main resource endpoint will ignore all changes in the status subpath. -(note: it will **not** reject requests which try to change the status, following the existing semantics of other resources). - -- The `.metadata.generation` field is updated if and only if the value at the `.spec` subpath changes. -Additionally, if the spec does not change, `.metadata.generation` is not updated. - -- The `/status` subresource receives a full resource object, but only considers the value at the `.status` subpath for the update. -The value at the `.metadata` subpath is **not** considered for update as decided in https://github.com/kubernetes/kubernetes/issues/45539. - -Both the status and the spec (and everything else if there is anything) of the object share the same key in the storage layer, i.e. the value at `.metadata.resourceVersion` is increased for any kind of change. There is no split of status and spec in the storage layer. - -The `/status` endpoint supports both `get` and `update` verbs. - -### Scale Behavior - -The number of CustomResources can be easily scaled up or down depending on the replicas field present in the `.spec` subpath. - -Only `ScaleSpec.Replicas` can be written. All other values are read-only and changes will be ignored. i.e. upon updating the scale subresource, two fields are modified: - -1. The replicas field is copied back from the `Scale` object to the main resource as specified by `SpecReplicasPath` in the CRD, e.g. `.spec.replicas = scale.Spec.Replicas`. - -2. The resource version is copied back from the `Scale` object to the main resource before writing to the storage: `.metadata.resourceVersion = scale.ResourceVersion`. -In other words, the scale and the CustomResource share the resource version used for optimistic concurrency. -Updates with outdated resource versions are rejected with a conflict error, read requests will return the resource version of the CustomResource. - -The `/scale` endpoint supports both `get` and `update` verbs. - -#### Status Replicas Behavior - -As only the `scale.Spec.Replicas` field is to be written to by the CR user, the user-provided controller (not any generic CRD controller) counts its children and then updates the controlled object by writing to the `/status` subresource, i.e. the `scale.Status.Replicas` field is read-only. - -#### Selector Behavior - -`CustomResourceSubResourceScale.LabelSelectorPath` is the label selector over CustomResources that should match the replicas count. -The value in the `Scale` object is one-to-one the value from the CustomResource if the label selector is non-empty. -Intentionally we do not default it to another value from the CustomResource (e.g. `.spec.template.metadata.labels`) as this turned out to cause trouble (e.g. in `kubectl apply`) and it is generally seen as a wrong approach with existing resources. - -## Implementation Plan - -The `/scale` and `/status` subresources are mostly distinct. 
It is proposed to do the implementation in two phases (the order does not matter much): - -1. `/status` subresource -2. `/scale` subresource - -## Alternatives - -### Scope - -In this proposal we opted for an opinionated concept of subresources, i.e. we restrict the subresource spec to two very specific subresources: `/status` and `/scale`. -We do not aim for a more generic subresource concept. In Kubernetes there are a number of other subresources like `/log`, `/exec`, and `/bind`, but their semantics are much more specialized than those of `/status` and `/scale`. -Hence, we decided to leave those other subresources to the domain of user-provided API servers (UAS) instead of inventing a more complex subresource concept for CustomResourceDefinitions. - -**Note**: The types do not make the addition of other subresources impossible in the future. - -We also restrict the JSON paths for the status and the spec within the CustomResource. -We could make them definable by the user, and the proposed types allow us to open this up in the future. -For the time being we decided to be opinionated, as all status and spec subobjects in existing types live under `.status` and `.spec`. Keeping this pattern imposes consistency on user-provided CustomResources as well. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
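
As an illustration of the copy-back semantics described in the Scale Behavior section of the subresources proposal above, here is a minimal, hypothetical sketch using the unstructured helpers from apimachinery. The hard-coded `spec.replicas` path stands in for a parsed `SpecReplicasPath`, and the snippet is not the actual apiextensions-apiserver implementation.

```golang
package main

import (
	"fmt"

	autoscalingv1 "k8s.io/api/autoscaling/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// applyScale copies scale.Spec.Replicas and scale.ResourceVersion back into
// the custom resource, mirroring the two fields modified on a /scale update.
func applyScale(cr *unstructured.Unstructured, scale *autoscalingv1.Scale) error {
	// In the proposal the target path comes from SpecReplicasPath; here it is fixed.
	if err := unstructured.SetNestedField(cr.Object, int64(scale.Spec.Replicas), "spec", "replicas"); err != nil {
		return err
	}
	// The Scale object and the custom resource share the resource version
	// used for optimistic concurrency.
	cr.SetResourceVersion(scale.ResourceVersion)
	return nil
}

func main() {
	cr := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "mygroup.example.com/v1alpha1",
		"kind":       "Noxu",
		"metadata":   map[string]interface{}{"name": "example", "resourceVersion": "41"},
		"spec":       map[string]interface{}{"replicas": int64(1)},
	}}
	scale := &autoscalingv1.Scale{Spec: autoscalingv1.ScaleSpec{Replicas: 3}}
	scale.ResourceVersion = "42"

	if err := applyScale(cr, scale); err != nil {
		panic(err)
	}
	replicas, _, _ := unstructured.NestedInt64(cr.Object, "spec", "replicas")
	fmt.Println(replicas, cr.GetResourceVersion()) // 3 42
}
```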
\ No newline at end of file diff --git a/contributors/design-proposals/api-machinery/customresources-validation.md b/contributors/design-proposals/api-machinery/customresources-validation.md index 64448ee3..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/customresources-validation.md +++ b/contributors/design-proposals/api-machinery/customresources-validation.md @@ -1,525 +1,6 @@ -# Validation for CustomResources +Design proposals have been archived. -Authors: @nikhita, @sttts, some ideas integrated from @xiao-zhou’s proposal<sup id="f1">[1](#footnote1)</sup> +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Table of Contents - -1. [Overview](#overview) -2. [Background](#background) - 1. [Goals](#goals) - 2. [Non-Goals](#non-goals) -3. [Proposed Extension of CustomResourceDefinition](#proposed-extension-of-customresourcedefinition) - 1. [API Types](#api-types) - 2. [Examples](#examples) - 1. [JSON-Schema](#json-schema) - 2. [Error messages](#error-messages) -4. [Validation Behavior](#validation-behavior) - 1. [Metadata](#metadata) - 2. [Server-Side Validation](#server-side-validation) - 3. [Client-Side Validation](#client-side-validation) - 4. [Comparison between server-side and client-side Validation](#comparison-between-server-side-and-client-side-Validation) - 5. [Existing Instances and changing the Schema](#existing-instances-and-changing-the-schema) - 6. [Outlook to Status Sub-Resources](#outlook-to-status-sub-resources) - 7. [Outlook Admission Webhook](#outlook-admission-webhook) -5. [Implementation Plan](#implementation-plan) -6. [Appendix](#appendix) - 1. [Expressiveness of JSON-Schema](#expressiveness-of-json-schema) - 2. [JSON-Schema Validation Runtime Complexity](#json-schema-validation-runtime-complexity) - 3. [Alternatives](#alternatives) - 1. [Direct Embedding of the Schema into the Spec](#direct-embedding-of-the-schema-into-the-spec) - 2. [External CustomResourceSchema Type](#external-customresourceschema-type) - -## Overview - -This document proposes the design and describes a way to add JSON-Schema based validation for Custom Resources. - -## Background - -ThirdPartyResource (TPR) is deprecated and CustomResourceDefinition (CRD) is the successor which solves the fundamental [issues](https://github.com/kubernetes/features/issues/95) of TPRs to form a stable base for further features. - -Currently we do not provide validation for CustomResources (CR), i.e. the CR payload is free-form JSON. However, one of the most requested [[1](https://github.com/kubernetes/features/issues/95#issuecomment-296416969)][[2](https://github.com/kubernetes/features/issues/95#issuecomment-298791881)] features is validation and this proposal seeks to add it. - -## Goals - -1. To provide validation for CustomResources using a declarative specification language for JSON data. -2. To keep open the door to add other validation mechanisms later.<sup id="f2">[2](#footnote2)</sup> -3. To allow server-side validation. -4. To be able to integrate into the existing client-side validation of kubectl. -5. To be able to define defaults in the specification (at least in a follow-up after basic validation support). - -## Non-Goals - -1. The JSON-Schema specs can be used for creating OpenAPI documentation for CRs. The format is compatible but we won’t propose an implementation for that. -2. A turing-complete specification language is not proposed. 
Instead a declarative way is proposed to express the vast majority of validations. -3. For now, CRD only allows 1 version at a time. Supporting multiple versions of CRD and/or conversion of CRD is not within the scope of this proposal. - -## Proposed Extension of CustomResourceDefinition - -We propose to add a field `validation` to the spec of a CustomResourceDefinition. As a first validation format we propose to use [JSON-Schema](http://json-schema.org/) under `CRD.Spec.Validation.JSONSchema`. - -JSON-Schema is a [standardized](https://tools.ietf.org/html/draft-zyp-json-schema-04) declarative specification language. Different keywords may be utilized to put constraints on the data. Thus it provides ways to make assertions about what a valid document must look like. - -It is already used in Swagger/OpenAPI specs in Kubernetes and hence such a CRD specification integrates cleanly into the existing infrastructure of the API server which serves these specifications, -* into kubectl which is able to verify YAML and JSON objects against the returned specification. -* With the https://github.com/go-openapi/validate library, we have a powerful JSON-Schema validator which can be used client and server-side. - -## API Types - -The schema is referenced in [`CustomResourceDefinitionSpec`](https://github.com/kubernetes/kubernetes/commit/0304ef60a210758ab4ac43a468f8a5e19f39ff5a#diff-0e64a9ef2cf809a2a611b16fd44d22f8). `Validation` is of the type `CustomResourceValidation`. The JSON-Schema is stored in a field of `Validation`. This way we can make the validation generic and add other validations in the future as well. - -The schema types follow those of the OpenAPI library, but we decided to define them independently for the API to have full control over the serialization and versioning. Hence, it is easy to convert our types into those used for validation or to integrate them into an OpenAPI spec later. - -Reference http://json-schema.org is also used by OpenAPI. We propose this as there are implementations available in Go and with OpenAPI, we will also be able to serve OpenAPI specs for CustomResourceDefinitions. - -```go -// CustomResourceSpec describes how a user wants their resource to appear -type CustomResourceDefinitionSpec struct { - Group string `json:"group" protobuf:"bytes,1,opt,name=group"` - Version string `json:"version" protobuf:"bytes,2,opt,name=version"` - Names CustomResourceDefinitionNames `json:"names" protobuf:"bytes,3,opt,name=names"` - Scope ResourceScope `json:"scope" protobuf:"bytes,8,opt,name=scope,casttype=ResourceScope"` - // Validation describes the validation methods for CustomResources - Validation CustomResourceValidation `json:"validation,omitempty"` -} - -// CustomResourceValidation is a list of validation methods for CustomResources -type CustomResourceValidation struct { - // JSONSchema is the JSON Schema to be validated against. - // Can add other validation methods later if needed. - JSONSchema *JSONSchemaProps `json:"jsonSchema,omitempty"` -} - -// JSONSchemaProps is a JSON-Schema following Specification Draft 4 (http://json-schema.org/). 
-type JSONSchemaProps struct { - ID string `json:"id,omitempty"` - Schema JSONSchemaURL `json:"-,omitempty"` - Ref JSONSchemaRef `json:"-,omitempty"` - Description string `json:"description,omitempty"` - Type StringOrArray `json:"type,omitempty"` - Format string `json:"format,omitempty"` - Title string `json:"title,omitempty"` - Default interface{} `json:"default,omitempty"` - Maximum *float64 `json:"maximum,omitempty"` - ExclusiveMaximum bool `json:"exclusiveMaximum,omitempty"` - Minimum *float64 `json:"minimum,omitempty"` - ExclusiveMinimum bool `json:"exclusiveMinimum,omitempty"` - MaxLength *int64 `json:"maxLength,omitempty"` - MinLength *int64 `json:"minLength,omitempty"` - Pattern string `json:"pattern,omitempty"` - MaxItems *int64 `json:"maxItems,omitempty"` - MinItems *int64 `json:"minItems,omitempty"` - // disable uniqueItems for now because it can cause the validation runtime - // complexity to become quadratic. - UniqueItems bool `json:"uniqueItems,omitempty"` - MultipleOf *float64 `json:"multipleOf,omitempty"` - Enum []interface{} `json:"enum,omitempty"` - MaxProperties *int64 `json:"maxProperties,omitempty"` - MinProperties *int64 `json:"minProperties,omitempty"` - Required []string `json:"required,omitempty"` - Items *JSONSchemaPropsOrArray `json:"items,omitempty"` - AllOf []JSONSchemaProps `json:"allOf,omitempty"` - OneOf []JSONSchemaProps `json:"oneOf,omitempty"` - AnyOf []JSONSchemaProps `json:"anyOf,omitempty"` - Not *JSONSchemaProps `json:"not,omitempty"` - Properties map[string]JSONSchemaProps `json:"properties,omitempty"` - AdditionalProperties *JSONSchemaPropsOrBool `json:"additionalProperties,omitempty"` - PatternProperties map[string]JSONSchemaProps `json:"patternProperties,omitempty"` - Dependencies JSONSchemaDependencies `json:"dependencies,omitempty"` - AdditionalItems *JSONSchemaPropsOrBool `json:"additionalItems,omitempty"` - Definitions JSONSchemaDefinitions `json:"definitions,omitempty"` -} - -// JSONSchemaRef represents a JSON reference that is potentially resolved. -// It is marshaled into a string using a custom JSON marshaller. -type JSONSchemaRef struct { - ReferencePointer JSONSchemaPointer - HasFullURL bool - HasURLPathOnly bool - HasFragmentOnly bool - HasFileScheme bool - HasFullFilePath bool -} - -// JSONSchemaPointer is the JSON pointer representation. -type JSONSchemaPointer struct { - ReferenceTokens []string -} - -// JSONSchemaURL represents a schema url. Defaults to JSON Schema Specification Draft 4. -type JSONSchemaURL string - -const ( - // JSONSchemaDraft4URL is the url for JSON Schema Specification Draft 4. - JSONSchemaDraft4URL SchemaURL = "http://json-schema.org/draft-04/schema#" -) - -// StringOrArray represents a value that can either be a string or an array of strings. -// Mainly here for serialization purposes. -type StringOrArray []string - -// JSONSchemaPropsOrArray represents a value that can either be a JSONSchemaProps -// or an array of JSONSchemaProps. Mainly here for serialization purposes. -type JSONSchemaPropsOrArray struct { - Schema *JSONSchemaProps - JSONSchemas []JSONSchemaProps -} - -// JSONSchemaPropsOrBool represents JSONSchemaProps or a boolean value. -// Defaults to true for the boolean property. -type JSONSchemaPropsOrBool struct { - Allows bool - Schema *JSONSchemaProps -} - -// JSONSchemaDependencies represent a dependencies property. -type JSONSchemaDependencies map[string]JSONSchemaPropsOrStringArray - -// JSONSchemaPropsOrStringArray represents a JSONSchemaProps or a string array. 
-type JSONSchemaPropsOrStringArray struct { - Schema *JSONSchemaProps - Property []string -} - -// JSONSchemaDefinitions contains the models explicitly defined in this spec. -type JSONSchemaDefinitions map[string]JSONSchemaProps -``` - -Note: A reflective test to check for drift between the types here and the OpenAPI types for runtime usage will be added. - -## Examples - -### JSON-Schema - -The following example illustrates how a schema can be used in `CustomResourceDefinition`. It shows various restrictions that can be achieved for validation using JSON-Schema. - -```json -{ - "apiVersion": "apiextensions.k8s.io/v1beta1", - "kind": "CustomResourceDefinition", - "metadata": { - "name": "noxus.mygroup.example.com" - }, - "spec": { - "group": "mygroup.example.com", - "version": "v1alpha1", - "scope": "Namespaced", - "names": { - "plural": "noxus", - "singular": "noxu", - "kind": "Noxu", - "listKind": "NoxuList" - }, - "validation": { - "jsonSchema": { - "$schema": "http://json-schema.org/draft-04/schema#", - "type": "object", - "description": "Noxu is a kind of Custom Resource which has only fields that are specified", - "required": [ - "alpha", - "beta", - "gamma", - "delta", - "epsilon", - "zeta" - ], - "properties": { - "alpha": { - "description": "Alpha is an alphanumeric string with underscores which defaults to foo_123", - "type": "string", - "pattern": "^[a-zA-Z0-9_]*$", - "default": "foo_123" - }, - "beta": { - "description": "We need at least 10 betas. If not specified, it defaults to 10.", - "type": "number", - "minimum": 10, - "default": 10 - }, - "gamma": { - "description": "Gamma is restricted to foo, bar and baz", - "type": "string", - "enum": [ - "foo", - "bar", - "baz" - ] - }, - "delta": { - "description": "Delta is a string with a maximum length of 5 or a number with a minimum value of 0", - "anyOf": [ - { - "type": "string", - "maxLength": 5 - }, - { - "type": "number", - "minimum": 0 - } - ] - }, - "epsilon": { - "description": "Epsilon is either of type one zeta or two zeta", - "allOf": [ - { - "$ref": "#/definitions/zeta" - }, - { - "properties": { - "type": { - "enum": [ - "one", - "two" - ] - } - }, - "required": [ - "type" - ], - "additionalProperties": false - } - ] - }, - "additionalProperties": false, - "definitions": { - "zeta": { - "description": "Every zeta needs to have foo, bar and baz", - "type": "object", - "properties": { - "foo": { - "type": "string" - }, - "bar": { - "type": "number" - }, - "baz": { - "type": "boolean" - } - }, - "required": [ - "foo", - "bar", - "baz" - ], - "additionalProperties": false - } - } - } - } - } - } -} -``` - -### Error messages - -The following examples illustrate the type of validation errors generated by using the go-openapi validate library. - -The description is not taken into account, but a better error output can be easily [added](https://github.com/go-openapi/errors/blob/master/headers.go#L23) to go-openapi. - -1. `data.foo in body should be at least 4 chars long` -2. `data.foo in body should be greater than or equal to 10` -3. `data.foo in body should be one of [bar baz]` -4. `data.foo in body must be of type integer: "string"` -5. `data.foo in body should match '^[a-zA-Z0-9_]*$'` -6. `data.foo in body is required` -7. When foo validates if it is a multiple of 3 and 5: -``` -data.foo in body should be a multiple of 5 -data.foo in body should be a multiple of 3 -must validate all the schemas (allOf) -``` - -## Validation Behavior - -The schema will be described in the `CustomResourceDefinitionSpec`. 
The validation will be carried out using the [go-openapi validation library](https://github.com/go-openapi/validate). - -While creating/updating the CR, the metadata is first validated. To validate the CR against the spec in the CRD, we _must_ have server-side validation and we _can_ have client-side validation. - -### Metadata - -ObjectMeta and TypeMeta are implicitly specified. They do not have to be added to the JSON-Schema of a CRD. The validation already happens today as part of the apiextensions-apiserver REST handlers. - -### Server-Side Validation - -The server-side validation is carried out after sending the request to the apiextensions-apiserver, i.e. inside the CREATE and UPDATE handlers for CRs. - -We do a schema pass there using the https://github.com/go-openapi/validate validator with the provided schema in the corresponding CRD. Validation errors are returned to the caller as for native resources. - -JSON-Schema also allows us to reject additional fields that are not defined in the schema and only allow the fields that are specified. This can be achieved by using `"additionalProperties": false` in the schema. However, there is danger in allowing CRD authors to set `"additionalProperties": false` because it breaks version skew (new client can send new optional fields to the old server). So we should not allow CRD authors to set `"additionalProperties": false`. - -### Client-Side Validation - -The client-side validation is carried out before sending the request to the api-server, or even completely offline. This can be achieved while creating resources through the client i.e. kubectl using the --validate option. - -If the API type serves the JSON-Schema in the swagger spec, the existing kubectl code will already be able to also validate CRs. This will be achieved as a follow-up. - -### Comparison between server-side and client-side Validation - -The table below shows the cases when server-side and client-side validation methods are applicable. - -| Case | Server-Side | Client-Side | -|:--------------------------------------------------:|:------------:|:-------------:| -| Kubectl create/edit/replace with validity feedback | ✓ | ✓ | -| Custom controller creates/updates CRs | ✓ | ✗ | -| CRs are created by an untrusted party | ✓ | ✗ | -| Not making validation for CRs a special case | ✓ | ✗ | - -The above table is an evidence that we need server-side validation as well, next to the client-side validation we easily get, nearly for free, by serving Swagger/OpenAPI specs in apiextension-apiserver. - -This is especially true in situations when CRs are used by components that are out of the control of the admin. Example: A user can create a database CR for a Database-As-A-Service. In this case, only server-side validation can give confidence that the CRs are well formed. - -### Existing Instances and changing the Schema - -If the schema is made stricter later, the existing CustomResources might no longer comply with the spec. This will make them unchangeable and essentially read-only. - -To avoid this, it is the responsibility of the user to make sure that any changes made to the schema are such that the existing CustomResources remain validated. - -Note: - -1. This is the same behavior that we require for native resources. Validation cannot be made stricter in later Kubernetes versions without breaking compatibility. - -2. 
For migration of CRDs with no validation to CRDs with validation, we can create a controller that will validate and annotate invalid CRs once the spec changes, so that the custom controller can choose to delete them (this is also essentially the status condition of the CRD). This can be achieved, but it is not part of the proposal. - -### Outlook to Status Sub-Resources - -As another most-wanted feature, a Status sub-resource might be proposed and implemented for CRDs. The JSON-Schema proposed here might as well cover the Status field of a CR. For now this is not handled or validated in a particular way. - -When the Status sub-resource exists some day, the /status endpoint will receive a full CR object, but only the status field is to be validated. We propose to enforce the JSON-Schema structure to be of the shape: - -```json -{"type":"object", "properties":{"status": ..., "a": ..., "b": ...}} -``` -Then we can validate the status against the sub-schema easily. Hence, this proposal will be compatible with a later sub-resource extension. - -### Outlook Admission Webhook - -Apiextensions-apiserver uses the normal REST endpoint implementation and only customizes the registry and the codecs. The admission plugins are inherited from the kube-apiserver (when running inside of it via apiserver delegation) and therefore they are supposed to apply to CRs as well. - -It is [verified](https://github.com/kubernetes/kubernetes/pull/47252) that CRDs work well with initializers. It is also expected that webhook admission prototyped at https://github.com/kubernetes/kubernetes/pull/46316 will work with CRs out of the box. Hence, for more advanced validation webhook admission is an option as well (when it is merged). - -JSON-Schema based validation does not preclude implementation of other validation methods. Hence, advanced webhook-based validation can also be implemented in the future. - -## Implementation Plan - -The implementation is planned in the following steps: - -1. Add the proposed types to the v1beta1<sup id="f3">[3](#footnote3)</sup> version of the CRD type. -2. Add a validation step to the CREATE and UPDATE REST handlers of the apiextensions-apiserver. - -Independently, from 1. and 2. add defaulting support: - -3. [Add defaulting support to go-openapi](https://github.com/go-openapi/validate/pull/27). Before this PR, we will reject JSON-Schemas which define defaults. - -As an optional follow-up, we can implement the OpenAPI part and with that enable client-side validation: - -4. Export the JSON-Schema via a dynamically served OpenAPI spec. - -## Appendix - -### Expressiveness of JSON-Schema - -The following example properties cannot be expressed using JSON-Schema: -1. “In a PodSpec, for each `spec.Containers[*].volumeMounts[*].Name` there must be a `spec.Volumes[*].Name`” -2. “The volume names in `PodSpec.Volumes` are unique” (`uniqueItems` only compares the complete objects, it cannot compare by key) - -Different versions within one CRD with a custom version field (i.e. 
not the one in apiVersion) **can** be expressed: - -```json -{ - "$schema": "http://json-schema.org/draft-04/schema#", - "title": "child_schema", - "type": "object", - "anyOf": [ - { - "properties": { - "version": { - "type": "string", - "pattern": "^a$" - }, - "spec": { - "type": "object", - "properties": { - "foo": {} - }, - "additionalProperties": false, - }, - } - }, - { - "properties": { - "version": { - "type": "string", - "pattern": "^b$" - }, - "spec": { - "type": "object", - "properties": { - "bar": {} - }, - "additionalProperties": false, - }, - } - } - ], -} -``` - -This validates: -* `{"version": "a", "spec": {"foo": 42}}` -* `{"version": "b", "spec": {"bar": 42}}` - -but not: -* `{"version": "a", "spec": {"bar": 42}}`. - -Note: this is a workaround while we do not support multiple versions and conversion for custom resources. - -### JSON-Schema Validation Runtime Complexity - -Following “JSON: data model, query languages and schema specification<sup id="f4">[4](#footnote4)</sup>” and “Formal Specification, Expressiveness and Complexity analysis for JSON Schema<sup id="f5">[5](#footnote5)</sup>”, JSON-Schema validation -* without the uniqueItems operator and -* without recursion for the $ref operator -has linear runtime in the size of the JSON input and the size of the schema (Th. 1 plus Prop. 7). - -If we allow uniqueItems, the runtime complexity becomes quadratic in the size of the JSON input. Hence, we might want to consider forbidding the uniqueItems operator in order to avoid DDoS attacks, at least if the schema definitions of CRDs cannot be trusted. - -The CRD JSON-Schema will be validated to have neither recursion, nor `uniqueItems=true` being set. - -### Alternatives - -#### Direct Embedding of the Schema into the Spec - -An alternative approach to describe the schema in the spec can be as shown below. We directly specify the schema in the spec without the using a Validation field. While simpler, this will limit later extensions, e.g. with non-declarative validation. - -```go -// CustomResourceSpec describes how a user wants their resource to appear -type CustomResourceDefinitionSpec struct { - Group string `json:"group" protobuf:"bytes,1,opt,name=group"` - Version string `json:"version" protobuf:"bytes,2,opt,name=version"` - Names CustomResourceDefinitionNames `json:"names" protobuf:"bytes,3,opt,name=names"` - Scope ResourceScope `json:"scope" protobuf:"bytes,8,opt,name=scope,casttype=ResourceScope"` - // Schema is the JSON-Schema to be validated against. - Schema JSONSchema -} -``` - -#### External CustomResourceSchema Type - -In this proposal the JSON-Schema is directly stored in the CRD. Alternatively, one could create a separate top-level API type CustomResourceValidator and reference this from a CRD. Compare @xiao-zhou’s [proposal](https://docs.google.com/document/d/1lKJf9pYBNRcbM7il1VjSJNMDLaf3cFPnquIPPGbEjr4/) for a more detailed sketch of this idea. - -We do not follow the idea of separate API types in this proposal because CustomResourceDefinitions are highly coupled in practice with the validation of the instances. It doesn’t look like a common use-case to reference a schema from different CRDs and to modify the schema for all of them concurrently. - -Hence, the additional complexity for an extra type doesn’t look to be justified. 
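
To make the schema pass described in the Validation Behavior section above concrete, the following is a small, self-contained sketch that calls the go-openapi validate library directly on a fragment of the Noxu example schema. It is illustrative only, not the apiextensions-apiserver implementation, and the schema and document are made up.

```golang
package main

import (
	"encoding/json"
	"fmt"

	"github.com/go-openapi/spec"
	"github.com/go-openapi/strfmt"
	"github.com/go-openapi/validate"
)

func main() {
	// A fragment of the Noxu example schema from this document.
	schemaJSON := `{
		"type": "object",
		"required": ["alpha", "beta"],
		"properties": {
			"alpha": {"type": "string", "pattern": "^[a-zA-Z0-9_]*$"},
			"beta":  {"type": "number", "minimum": 10}
		}
	}`

	var schema spec.Schema
	if err := json.Unmarshal([]byte(schemaJSON), &schema); err != nil {
		panic(err)
	}

	// A custom resource payload (only the part covered by the schema).
	var doc interface{}
	if err := json.Unmarshal([]byte(`{"alpha": "foo_123", "beta": 5}`), &doc); err != nil {
		panic(err)
	}

	// AgainstSchema returns a composite error listing every violation
	// (compare the sample error messages earlier in this document).
	if err := validate.AgainstSchema(&schema, doc, strfmt.Default); err != nil {
		fmt.Println("validation failed:", err)
		return
	}
	fmt.Println("valid")
}
```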
- - -#### Footnotes - -<a name="footnote1">1</a>: https://docs.google.com/document/d/1lKJf9pYBNRcbM7il1VjSJNMDLaf3cFPnquIPPGbEjr4 [↩](#f1) - -<a name="footnote2">2</a>: Admission webhooks and embedded programming languages like JavaScript or LUA have been discussed. [↩](#f2) - -<a name="footnote3">3</a>: It is common to have alpha fields in beta objects in Kubernetes, compare: FlexVolume, component configs. [↩](#f3) - -<a name="footnote4">4</a>: https://arxiv.org/pdf/1701.02221.pdf [↩](#f4) - -<a name="footnote5">5</a>: https://repositorio.uc.cl/bitstream/handle/11534/16908/000676530.pdf [↩](#f5)
\ No newline at end of file +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/api-machinery/customresources-versioning.md b/contributors/design-proposals/api-machinery/customresources-versioning.md index 6c6d4391..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/customresources-versioning.md +++ b/contributors/design-proposals/api-machinery/customresources-versioning.md @@ -1,100 +1,6 @@ -CRD Versioning -============= +Design proposals have been archived. -The objective of this design document is to provide a machinery for Custom Resource Definition authors to define different resource version and a conversion mechanism between them. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -# **Background** - -Custom Resource Definitions ([CRDs](https://kubernetes.io/docs/concepts/api-extension/custom-resources/)) are a popular mechanism for extending Kubernetes, due to their ease of use compared with the main alternative of building an Aggregated API Server. They are, however, lacking a very important feature that all other kubernetes objects support: Versioning. Today, each CR can only have one version and there is no clear way for authors to advance their resources to a newer version other than creating a completely new CRD and converting everything manually in sync with their client software. - -This document proposes a mechanism to support multiple CRD versions. A few alternatives are also explored in [this document](https://docs.google.com/document/d/1Ucf7JwyHpy7QlgHIN2Rst_q6yT0eeN9euzUV6kte6aY). - -**Goals:** - -* Support versioning on API level - -* Support conversion mechanism between versions - -* Support ability to change storage version - -* Support Validation/OpenAPI schema for all versions: All versions should have a schema. This schema can be provided by user or derived from a single schema. - -**Non-Goals:** - -* Support cohabitation (i.e. no group/kind move) - -# **Proposed Design** - -The basis of the design is a system that supports versioning and no conversion. The APIs here, is designed in a way that can be extended with conversions later. - -The summary is to support a list of versions that will include current version. One of these versions can be flagged as the storage version and all versions ever marked as storage version will be listed in a stored_version field in the Status object to enable authors to plan a migration for their stored objects. - -The current `Version` field is planned to be deprecated in a later release and will be used to pre-populate the `Versions` field (The `Versions` field will be defaulted to a single version, constructed from top level `Version` field). The `Version` field will be also mutable to give a way to the authors to remove it from the list. - -```golang -// CustomResourceDefinitionSpec describes how a user wants their resource to appear -type CustomResourceDefinitionSpec struct { - // Group is the group this resource belongs in - Group string - // Version is the version this resource belongs in - // must be always the first item in Versions field if provided. - Version string - // Names are the names used to describe this custom resource - Names CustomResourceDefinitionNames - // Scope indicates whether this resource is cluster or namespace scoped. 
Default is namespaced - Scope ResourceScope - // Validation describes the validation methods for CustomResources - Validation *CustomResourceValidation - - // *************** - // ** New Field ** - // *************** - // Versions is the list of all supported versions for this resource. - // Validation: All versions must use the same validation schema for now. i.e., top - // level Validation field is applied to all of these versions. - // Order: The order of these versions is used to determine the order in discovery API - // (preferred version first). - // The versions in this list may not be removed if they are in - // CustomResourceDefinitionStatus.StoredVersions list. - Versions []CustomResourceDefinitionVersion -} - -// *************** -// ** New Type ** -// *************** -type CustomResourceDefinitionVersion { - // Name is the version name, e.g. "v1", “v2beta1”, etc. - Name string - // Served is a flag enabling/disabling this version from being served via REST APIs - Served Boolean - // Storage flags the release as a storage version. There must be exactly one version - // flagged as Storage. - Storage Boolean -} -``` - -The Status object will have a list of potential stored versions. This data is necessary to do a storage migration in future (the author can choose to do the migration themselves but there is [a plan](https://docs.google.com/document/d/1eoS1K40HLMl4zUyw5pnC05dEF3mzFLp5TPEEt4PFvsM/edit) to solve the problem of migration, potentially for both standard and custom types). - -```golang -// CustomResourceDefinitionStatus indicates the state of the CustomResourceDefinition -type CustomResourceDefinitionStatus struct { - ... - - // StoredVersions are all versions ever marked as storage in spec. Tracking these - // versions allow a migration path for stored version in etcd. The field is mutable - // so the migration controller can first make sure a version is certified (i.e. all - // stored objects is that version) then remove the rest of the versions from this list. - // None of the versions in this list can be removed from the spec.Versions field. - StoredVersions []string -} -``` - -# **Validation** - -Basic validations needed for the `version` field are: - -* `Spec.Version` field exists in `Spec.Versions` field. -* The version defined in `Spec.Version` field should point to a `Served` Version in `Spec.Versions` list except when we do not serve any version (i.e. all versions in `Spec.Versions` field are disabled by `Served` set to `False`). This is for backward compatibility. An old controller expect that version to be served but only the whole CRD is served. CRD Registration controller should unregister a CRD with no serving version. -* None of the `Status.StoredVersion` can be removed from `Spec.Versions` list. -* Only one of the versions in `spec.Versions` can flag as `Storage` version. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/api-machinery/dynamic-admission-control-configuration.md b/contributors/design-proposals/api-machinery/dynamic-admission-control-configuration.md index b1efb0db..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/dynamic-admission-control-configuration.md +++ b/contributors/design-proposals/api-machinery/dynamic-admission-control-configuration.md @@ -1,371 +1,6 @@ +Design proposals have been archived. -## Background +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -The extensible admission control -[proposal](admission_control_extension.md) -proposed making admission control extensible. In the proposal, the `initializer -admission controller` and the `generic webhook admission controller` are the two -controllers that set default initializers and external admission hooks for -resources newly created. These two admission controllers are in the same binary -as the apiserver. This -[section](admission_control_extension.md#dynamic-configuration) -gave a preliminary design of the dynamic configuration of the list of the -default admission controls. This document hashes out the implementation details. -## Goals - -* Admin is able to predict what initializers/webhooks will be applied to newly - created objects. - -* Admin needs to be able to ensure initializers/webhooks config will be applied within some bound - -* As a fallback, admin can always restart an apiserver and guarantee it sees the latest config - -* Do not block the entire cluster if the initializers/webhooks are not ready - after registration. - -## Specification - -We assume initializers could be "fail open". We need to update the extensible -admission control -[proposal](admission_control_extension.md) -if this is accepted. - -The schema is evolved from the prototype in -[#132](https://github.com/kubernetes/community/pull/132). - -```golang -// InitializerConfiguration describes the configuration of initializers. -type InitializerConfiguration struct { - metav1.TypeMeta - - v1.ObjectMeta - - // Initializers is a list of resources and their default initializers - // Order-sensitive. - // When merging multiple InitializerConfigurations, we sort the initializers - // from different InitializerConfigurations by the name of the - // InitializerConfigurations; the order of the initializers from the same - // InitializerConfiguration is preserved. - // +optional - Initializers []Initializer `json:"initializers,omitempty" patchStrategy:"merge" patchMergeKey:"name"` -} - -// Initializer describes the name and the failure policy of an initializer, and -// what resources it applies to. -type Initializer struct { - // Name is the identifier of the initializer. It will be added to the - // object that needs to be initialized. - // Name should be fully qualified, e.g., alwayspullimages.kubernetes.io, where - // "alwayspullimages" is the name of the webhook, and kubernetes.io is the name - // of the organization. - // Required - Name string `json:"name"` - - // Rules describes what resources/subresources the initializer cares about. - // The initializer cares about an operation if it matches _any_ Rule. - Rules []Rule `json:"rules,omitempty"` - - // FailurePolicy defines what happens if the responsible initializer controller - // fails to takes action. Allowed values are Ignore, or Fail. 
If "Ignore" is - // set, initializer is removed from the initializers list of an object if - // the timeout is reached; If "Fail" is set, apiserver returns timeout error - // if the timeout is reached. The default timeout for each initializer is - // 5s. - FailurePolicy *FailurePolicyType `json:"failurePolicy,omitempty"` -} - -// Rule is a tuple of APIGroups, APIVersion, and Resources.It is recommended -// to make sure that all the tuple expansions are valid. -type Rule struct { - // APIGroups is the API groups the resources belong to. '*' is all groups. - // If '*' is present, the length of the slice must be one. - // Required. - APIGroups []string `json:"apiGroups,omitempty"` - - // APIVersions is the API versions the resources belong to. '*' is all versions. - // If '*' is present, the length of the slice must be one. - // Required. - APIVersions []string `json:"apiVersions,omitempty"` - - // Resources is a list of resources this rule applies to. - // - // For example: - // 'pods' means pods. - // 'pods/log' means the log subresource of pods. - // '*' means all resources, but not subresources. - // 'pods/*' means all subresources of pods. - // '*/scale' means all scale subresources. - // '*/*' means all resources and their subresources. - // - // If '*' or '*/*' is present, the length of the slice must be one. - // Required. - Resources []string `json:"resources,omitempty"` -} - -type FailurePolicyType string - -const ( - // Ignore means the initializer is removed from the initializers list of an - // object if the initializer is timed out. - Ignore FailurePolicyType = "Ignore" - // For 1.7, only "Ignore" is allowed. "Fail" will be allowed when the - // extensible admission feature is beta. - Fail FailurePolicyType = "Fail" -) - -// ExternalAdmissionHookConfiguration describes the configuration of initializers. -type ExternalAdmissionHookConfiguration struct { - metav1.TypeMeta - - v1.ObjectMeta - // ExternalAdmissionHooks is a list of external admission webhooks and the - // affected resources and operations. - // +optional - ExternalAdmissionHooks []ExternalAdmissionHook `json:"externalAdmissionHooks,omitempty" patchStrategy:"merge" patchMergeKey:"name"` -} - -// ExternalAdmissionHook describes an external admission webhook and the -// resources and operations it applies to. -type ExternalAdmissionHook struct { - // The name of the external admission webhook. - // Name should be fully qualified, e.g., imagepolicy.kubernetes.io, where - // "imagepolicy" is the name of the webhook, and kubernetes.io is the name - // of the organization. - // Required. - Name string `json:"name"` - - // ClientConfig defines how to communicate with the hook. - // Required - ClientConfig AdmissionHookClientConfig `json:"clientConfig"` - - // Rules describes what operations on what resources/subresources the webhook cares about. - // The webhook cares about an operation if it matches _any_ Rule. - Rules []RuleWithVerbs `json:"rules,omitempty"` - - // FailurePolicy defines how unrecognized errors from the admission endpoint are handled - - // allowed values are Ignore or Fail. Defaults to Ignore. - // +optional - FailurePolicy *FailurePolicyType -} - -// RuleWithVerbs is a tuple of Verbs and Resources. It is recommended to make -// sure that all the tuple expansions are valid. -type RuleWithVerbs struct { - // Verbs is the verbs the admission hook cares about - CREATE, UPDATE, or * - // for all verbs. - // If '*' is present, the length of the slice must be one. - // Required. 
- Verbs []OperationType `json:"verbs,omitempty"` - // Rule is embedded, it describes other criteria of the rule, like - // APIGroups, APIVersions, Resources, etc. - Rule `json:",inline"` -} - -type OperationType string - -const ( - VerbAll OperationType = "*" - Create OperationType = "CREATE" - Update OperationType = "UPDATE" - Delete OperationType = "DELETE" - Connect OperationType = "CONNECT" -) - -// AdmissionHookClientConfig contains the information to make a TLS -// connection with the webhook -type AdmissionHookClientConfig struct { - // Service is a reference to the service for this webhook. If there is only - // one port open for the service, that port will be used. If there are multiple - // ports open, port 443 will be used if it is open, otherwise it is an error. - // Required - Service ServiceReference `json:"service"` - // CABundle is a PEM encoded CA bundle which will be used to validate webhook's server certificate. - // Required - CABundle []byte `json:"caBundle"` -} - -// ServiceReference holds a reference to Service.legacy.k8s.io -type ServiceReference struct { - // Namespace is the namespace of the service - // Required - Namespace string `json:"namespace"` - // Name is the name of the service - // Required - Name string `json:"name"` -} -``` - -Notes: -* There could be multiple InitializerConfiguration and - ExternalAdmissionHookConfiguration. Every service provider can define their - own. - -* This schema asserts a global order of initializers, that is, initializers are - applied to different resources in the *same* order, if they opt-in for the - resources. - -* The API will be placed at k8s.io/apiserver for 1.7. - -* We will figure out a more flexible way to represent the order of initializers - in the beta version. - -* We excluded `Retry` as a FailurePolicy, because we want to expose the - flakiness of an admission controller; and admission controllers like the quota - controller are not idempotent. - -* There are multiple ways to compose `Rules []Rule` to achieve the same effect. - It is recommended to compact to as few Rules as possible, but make sure all - expansions of the `<Verbs, APIGroups, APIVersions, Resource>` tuple in each - Rule are valid. We need to document the best practice. - -## Synchronization of admission control configurations - -If the `initializer admission controller` and the `generic webhook admission -controller` watch the admission control configurations and act upon deltas, their -cached version of the configuration might be arbitrarily delayed. This makes it -impossible to predict what initializer/hooks will be applied to newly created -objects. - -To make the behavior of `initializer admission controller` and the `generic -webhook admission controller` predictable, we let them do a consistent read (a -"LIST") of the InitializerConfiguration and ExternalAdmissionHookConfiguration -every 1s. If there isn't any successful read in the last 5s, the two admission -controllers block all incoming request. One consistent read per second isn't -going to cause performance issues. - -In the HA setup, apiservers must be configured with --etcd-quorum-read=true. - -See [Considered but REJECTED alternatives](#considered-but-rejected-alternatives) for considered alternatives. - -## Handling initializers/webhooks that are not ready but registered - -We only allow initializers/webhooks to be created as "fail open". This could be -enforced via validation. They can upgrade themselves to "fail closed" via the -normal Update operation. 
A human can also update them to "fail closed" later. - -See [Considered but REJECTED alternatives](#considered-but-rejected-alternatives) for considered alternatives. - -## Handling fail-open initializers - -The original [proposal](admission_control_extension.md) assumed initializers always failed closed. It is dangerous since crashed -initializers can block the whole cluster. We propose to allow initializers to -fail open, and in 1.7, let all initializers fail open. - -#### Implementation of fail open initializers. - -In the initializer prototype -[PR](https://github.com/kubernetes/kubernetes/pull/36721), the apiserver that -handles the CREATE request -[watches](https://github.com/kubernetes/kubernetes/pull/36721/files#diff-2c081fad5c858e67c96f75adac185093R349) -the uninitialized object. We can add a timer there and let the apiserver remove -the timed out initializer. - -If the apiserver crashes, then we fall back to a `read repair` mechanism. When -handling a GET request, the apiserver checks the objectMeta.CreationTimestamp of -the object, if a global initializer timeout (e.g., 10 mins) has reached, the -apiserver removes the first initializer in the object. - -In the HA setup, apiserver needs to take the clock drift into account as well. - -Note that the fallback is only invoked when the initializer and the apiserver -crashes, so it is rare. - -See [Considered but REJECTED alternatives](#considered-but-rejected-alternatives) for considered alternatives. - -## Future work - -1. Figuring out a better schema to represent the order among - initializers/webhooks, e.g., adding fields like lists of initializers that - must execute before/after the current one. - -2. #1 will allow parallel initializers as well. - -3. implement the fail closed initializers according to - [proposal](admission_control_extension.md#initializers). - -4. more efficient check of AdmissionControlConfiguration changes. Currently we - do periodic consistent read every second. - -5. block incoming requests if the `initializer admission controller` and the - `generic webhook admission controller` haven't acknowledged a recent change - to AdmissionControlConfiguration. Currently we only guarantee a change - becomes effective in 1s. - -## Considered but REJECTED alternatives: - -### synchronization mechanism - -#### Rejected 1. Always do consistent read - -Rejected because of inefficiency. - -The `initializer admission controller` and the `generic webhook admission -controller` always do consistent read of the `AdmissionControlConfiguration` -before applying the configuration to the incoming objects. This adds latency to -every CREATE request. Because the two admission controllers are in the same -process as the apiserver, the latency mainly consists of the consistent read -latency of the backend storage (etcd), and the proto unmarshalling. - - -#### Rejected 2. Don't synchronize, but report what is the cached version - -Rejected because it violates Goal 2 on the time bound. - -The main goal is *NOT* to always apply the latest -`AdmissionControlConfiguration`, but to make it predictable what -initializers/hooks will be applied. If we introduce the -`generation/observedGeneration` concept to the `AdmissionControlConfiguration`, -then a human (e.g., a cluster admin) can compare the generation with the -observedGeneration and predict if all the initializer/hooks listed in the -`AdmissionControlConfiguration` will be applied. 
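For illustration, the prediction described above boils down to comparing two counters. The sketch below assumes a `Status.ObservedGeneration` field on `AdmissionControlConfiguration`, which was never added because this alternative was rejected:

```go
// configObserved reports whether the in-process admission controllers have
// caught up with the latest AdmissionControlConfiguration. The Status field is
// hypothetical; it exists only in this rejected alternative.
func configObserved(cfg AdmissionControlConfiguration) bool {
	return cfg.Status.ObservedGeneration >= cfg.ObjectMeta.Generation
}
```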
- -In the HA setup, the `observedGeneration` reported by of every apiserver's -`initializer admission controller` and `generic webhook admission controller` -are different, so the API needs to record multiple `observedGeneration`. - -#### Rejected 3. Always do a consistent read of a smaller object - -Rejected because of the complexity. - -A consistent read of the AdmissionControlConfiguration object is expensive, we -cannot do it for every incoming request. - -Alternatively, we record the resource version of the AdmissionControlConfiguration -in a configmap. The apiserver that handles an update of the AdmissionControlConfiguration -updates the configmap with the updated resource version. In the HA setup, there -are multiple apiservers that update this configmap, they should only -update if the recorded resource version is lower than the local one. - -The `initializer admission controller` and the `generic webhook admission -controller` do a consistent read of the configmap *everytime* before applying -the configuration to an incoming request. If the configmap has changed, then -they do a consistent read of the `AdmissionControlConfiguration`. - -### Handling not ready initializers/webhook - -#### Rejected 1. - -add readiness check to initializer and webhooks, `initializer admission -controller` and `generic webhook admission controller` only apply those have -passed readiness check. Specifically, we add `readiness` fields to -`AdmissionControllerConfiguration`; then we either create yet another controller -to probe for the readiness and update the `AdmissionControllerConfiguration`, or -ask each initializer/webhook to update their readiness in the -`AdmissionControllerConfigure`. The former is complex. The latter is -essentially the same as the first approach, except that we need to introduce the -additional concept of "readiness". - -### Handling fail-open initializers - -#### Rejected 1. use a controller - -A `fail-open initializers controller` will remove the timed out fail-open -initializers from objects' initializers list. The controller uses shared -informers to track uninitialized objects. Every 30s, the controller - -* makes a snapshot of the uninitialized objects in the informers. -* indexes the objects by the name of the first initializer in the objectMeta.Initializers -* compares with the snapshot 30s ago, finds objects whose first initializers haven't changed -* does a consistent read of AdmissionControllerConfiguration, finds which initializers are fail-open -* spawns goroutines to send patches to remove fail-open initializers +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/api-machinery/event_compression.md b/contributors/design-proposals/api-machinery/event_compression.md index 258adbb3..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/event_compression.md +++ b/contributors/design-proposals/api-machinery/event_compression.md @@ -1,164 +1,6 @@ -# Kubernetes Event Compression +Design proposals have been archived. -This document captures the design of event compression. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Background -Kubernetes components can get into a state where they generate tons of events. - -The events can be categorized in one of two ways: - -1. same - The event is identical to previous events except it varies only on -timestamp. -2. similar - The event is identical to previous events except it varies on -timestamp and message. - -For example, when pulling a non-existing image, Kubelet will repeatedly generate -`image_not_existing` and `container_is_waiting` events until upstream components -correct the image. When this happens, the spam from the repeated events makes -the entire event mechanism useless. It also appears to cause memory pressure in -etcd (see [#3853](http://issue.k8s.io/3853)). - -The goal is introduce event counting to increment same events, and event -aggregation to collapse similar events. - -## Proposal - -Each binary that generates events (for example, `kubelet`) should keep track of -previously generated events so that it can collapse recurring events into a -single event instead of creating a new instance for each new event. In addition, -if many similar events are created, events should be aggregated into a single -event to reduce spam. - -Event compression should be best effort (not guaranteed). Meaning, in the worst -case, `n` identical (minus timestamp) events may still result in `n` event -entries. - -## Design - -Instead of a single Timestamp, each event object -[contains](http://releases.k8s.io/HEAD/pkg/api/types.go#L1111) the following -fields: - * `FirstTimestamp unversioned.Time` - * The date/time of the first occurrence of the event. - * `LastTimestamp unversioned.Time` - * The date/time of the most recent occurrence of the event. - * On first occurrence, this is equal to the FirstTimestamp. - * `Count int` - * The number of occurrences of this event between FirstTimestamp and -LastTimestamp. - * On first occurrence, this is 1. - -Each binary that generates events: - * Maintains a historical record of previously generated events: - * Implemented with -["Least Recently Used Cache"](https://github.com/golang/groupcache/blob/master/lru/lru.go) -in [`pkg/client/record/events_cache.go`](https://git.k8s.io/kubernetes/staging/src/k8s.io/client-go/tools/record/events_cache.go). - * Implemented behind an `EventCorrelator` that manages two subcomponents: -`EventAggregator` and `EventLogger`. - * The `EventCorrelator` observes all incoming events and lets each -subcomponent visit and modify the event in turn. - * The `EventAggregator` runs an aggregation function over each event. This -function buckets each event based on an `aggregateKey` and identifies the event -uniquely with a `localKey` in that bucket. - * The default aggregation function groups similar events that differ only by -`event.Message`. 
Its `localKey` is `event.Message` and its aggregate key is -produced by joining: - * `event.Source.Component` - * `event.Source.Host` - * `event.InvolvedObject.Kind` - * `event.InvolvedObject.Namespace` - * `event.InvolvedObject.Name` - * `event.InvolvedObject.UID` - * `event.InvolvedObject.APIVersion` - * `event.Reason` - * If the `EventAggregator` observes a similar event produced 10 times in a 10 -minute window, it drops the event that was provided as input and creates a new -event that differs only on the message. The message denotes that this event is -used to group similar events that matched on reason. This aggregated `Event` is -then used in the event processing sequence. - * The `EventLogger` observes the event out of `EventAggregation` and tracks -the number of times it has observed that event previously by incrementing a key -in a cache associated with that matching event. - * The key in the cache is generated from the event object minus -timestamps/count/transient fields, specifically the following events fields are -used to construct a unique key for an event: - * `event.Source.Component` - * `event.Source.Host` - * `event.InvolvedObject.Kind` - * `event.InvolvedObject.Namespace` - * `event.InvolvedObject.Name` - * `event.InvolvedObject.UID` - * `event.InvolvedObject.APIVersion` - * `event.Reason` - * `event.Message` - * The LRU cache is capped at 4096 events for both `EventAggregator` and -`EventLogger`. That means if a component (e.g. kubelet) runs for a long period -of time and generates tons of unique events, the previously generated events -cache will not grow unchecked in memory. Instead, after 4096 unique events are -generated, the oldest events are evicted from the cache. - * When an event is generated, the previously generated events cache is checked -(see [`pkg/client/unversioned/record/event.go`](https://git.k8s.io/kubernetes/staging/src/k8s.io/client-go/tools/record/event.go)). - * If the key for the new event matches the key for a previously generated -event (meaning all of the above fields match between the new event and some -previously generated event), then the event is considered to be a duplicate and -the existing event entry is updated in etcd: - * The new PUT (update) event API is called to update the existing event -entry in etcd with the new last seen timestamp and count. - * The event is also updated in the previously generated events cache with -an incremented count, updated last seen timestamp, name, and new resource -version (all required to issue a future event update). - * If the key for the new event does not match the key for any previously -generated event (meaning none of the above fields match between the new event -and any previously generated events), then the event is considered to be -new/unique and a new event entry is created in etcd: - * The usual POST/create event API is called to create a new event entry in -etcd. - * An entry for the event is also added to the previously generated events -cache. - -## Issues/Risks - - * Compression is not guaranteed, because each component keeps track of event - history in memory - * An application restart causes event history to be cleared, meaning event -history is not preserved across application restarts and compression will not -occur across component restarts. 
- * Because an LRU cache is used to keep track of previously generated events, -if too many unique events are generated, old events will be evicted from the -cache, so events will only be compressed until they age out of the events cache, -at which point any new instance of the event will cause a new entry to be -created in etcd. - -## Example - -Sample kubectl output: - -```console -FIRSTSEEN LASTSEEN COUNT NAME KIND SUBOBJECT REASON SOURCE MESSAGE -Thu, 12 Feb 2015 01:13:02 +0000 Thu, 12 Feb 2015 01:13:02 +0000 1 kubernetes-node-4.c.saad-dev-vms.internal Node starting {kubelet kubernetes-node-4.c.saad-dev-vms.internal} Starting kubelet. -Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-node-1.c.saad-dev-vms.internal Node starting {kubelet kubernetes-node-1.c.saad-dev-vms.internal} Starting kubelet. -Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-node-3.c.saad-dev-vms.internal Node starting {kubelet kubernetes-node-3.c.saad-dev-vms.internal} Starting kubelet. -Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-node-2.c.saad-dev-vms.internal Node starting {kubelet kubernetes-node-2.c.saad-dev-vms.internal} Starting kubelet. -Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 monitoring-influx-grafana-controller-0133o Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods -Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 elasticsearch-logging-controller-fplln Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods -Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 kibana-logging-controller-gziey Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods -Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 skydns-ls6k1 Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods -Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 monitoring-heapster-controller-oh43e Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods -Thu, 12 Feb 2015 01:13:20 +0000 Thu, 12 Feb 2015 01:13:20 +0000 1 kibana-logging-controller-gziey BoundPod implicitly required container POD pulled {kubelet kubernetes-node-4.c.saad-dev-vms.internal} Successfully pulled image "kubernetes/pause:latest" -Thu, 12 Feb 2015 01:13:20 +0000 Thu, 12 Feb 2015 01:13:20 +0000 1 kibana-logging-controller-gziey Pod scheduled {scheduler } Successfully assigned kibana-logging-controller-gziey to kubernetes-node-4.c.saad-dev-vms.internal -``` - -This demonstrates what would have been 20 separate entries (indicating -scheduling failure) collapsed/compressed down to 5 entries. - -## Related Pull Requests/Issues - - * Issue [#4073](http://issue.k8s.io/4073): Compress duplicate events. - * PR [#4157](http://issue.k8s.io/4157): Add "Update Event" to Kubernetes API. - * PR [#4206](http://issue.k8s.io/4206): Modify Event struct to allow -compressing multiple recurring events in to a single event. - * PR [#4306](http://issue.k8s.io/4306): Compress recurring events in to a -single event to optimize etcd storage. - * PR [#4444](http://pr.k8s.io/4444): Switch events history to use LRU cache -instead of map. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/api-machinery/extending-api.md b/contributors/design-proposals/api-machinery/extending-api.md index 9a0c9263..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/extending-api.md +++ b/contributors/design-proposals/api-machinery/extending-api.md @@ -1,198 +1,6 @@ -# Adding custom resources to the Kubernetes API server +Design proposals have been archived. -This document describes the design for implementing the storage of custom API -types in the Kubernetes API Server. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Resource Model - -### The ThirdPartyResource - -The `ThirdPartyResource` resource describes the multiple versions of a custom -resource that the user wants to add to the Kubernetes API. `ThirdPartyResource` -is a non-namespaced resource; attempting to place it in a namespace will return -an error. - -Each `ThirdPartyResource` resource has the following: - * Standard Kubernetes object metadata. - * ResourceKind - The kind of the resources described by this third party -resource. - * Description - A free text description of the resource. - * APIGroup - An API group that this resource should be placed into. - * Versions - One or more `Version` objects. - -### The `Version` Object - -The `Version` object describes a single concrete version of a custom resource. -The `Version` object currently only specifies: - * The `Name` of the version. - * The `APIGroup` this version should belong to. - -## Expectations about third party objects - -Every object that is added to a third-party Kubernetes object store is expected -to contain Kubernetes compatible [object metadata](/contributors/devel/sig-architecture/api-conventions.md#metadata). -This requirement enables the Kubernetes API server to provide the following -features: - * Filtering lists of objects via label queries. - * `resourceVersion`-based optimistic concurrency via compare-and-swap. - * Versioned storage. - * Event recording. - * Integration with basic `kubectl` command line tooling. - * Watch for resource changes. - -The `Kind` for an instance of a third-party object (e.g. CronTab) below is -expected to be programmatically convertible to the name of the resource using -the following conversion. Kinds are expected to be of the form -`<CamelCaseKind>`, and the `APIVersion` for the object is expected to be -`<api-group>/<api-version>`. To prevent collisions, it's expected that you'll -use a DNS name of at least three segments for the API group, e.g. `mygroup.example.com`. - -For example `mygroup.example.com/v1` - -'CamelCaseKind' is the specific type name. - -To convert this into the `metadata.name` for the `ThirdPartyResource` resource -instance, the `<domain-name>` is copied verbatim, the `CamelCaseKind` is then -converted using '-' instead of capitalization ('camel-case'), with the first -character being assumed to be capitalized. In pseudo code: - -```go -var result string -for ix := range kindName { - if isCapital(kindName[ix]) { - result = append(result, '-') - } - result = append(result, toLowerCase(kindName[ix]) -} -``` - -As a concrete example, the resource named `camel-case-kind.mygroup.example.com` defines -resources of Kind `CamelCaseKind`, in the APIGroup with the prefix -`mygroup.example.com/...`. - -The reason for this is to enable rapid lookup of a `ThirdPartyResource` object -given the kind information. 
This is also the reason why `ThirdPartyResource` is -not namespaced. - -## Usage - -When a user creates a new `ThirdPartyResource`, the Kubernetes API Server reacts -by creating a new, namespaced RESTful resource path. For now, non-namespaced -objects are not supported. As with existing built-in objects, deleting a -namespace deletes all third party resources in that namespace. - -For example, if a user creates: - -```yaml -metadata: - name: cron-tab.mygroup.example.com -apiVersion: extensions/v1beta1 -kind: ThirdPartyResource -description: "A specification of a Pod to run on a cron style schedule" -versions: -- name: v1 -- name: v2 -``` - -Then the API server will program in the new RESTful resource path: - * `/apis/mygroup.example.com/v1/namespaces/<namespace>/crontabs/...` - -**Note: This may take a while before RESTful resource path registration happen, please -always check this before you create resource instances.** - -Now that this schema has been created, a user can `POST`: - -```json -{ - "metadata": { - "name": "my-new-cron-object" - }, - "apiVersion": "mygroup.example.com/v1", - "kind": "CronTab", - "cronSpec": "* * * * /5", - "image": "my-awesome-cron-image" -} -``` - -to: `/apis/mygroup.example.com/v1/namespaces/default/crontabs` - -and the corresponding data will be stored into etcd by the APIServer, so that -when the user issues: - -``` -GET /apis/mygroup.example.com/v1/namespaces/default/crontabs/my-new-cron-object -``` - -And when they do that, they will get back the same data, but with additional -Kubernetes metadata (e.g. `resourceVersion`, `createdTimestamp`) filled in. - -Likewise, to list all resources, a user can issue: - -``` -GET /apis/mygroup.example.com/v1/namespaces/default/crontabs -``` - -and get back: - -```json -{ - "apiVersion": "mygroup.example.com/v1", - "kind": "CronTabList", - "items": [ - { - "metadata": { - "name": "my-new-cron-object" - }, - "apiVersion": "mygroup.example.com/v1", - "kind": "CronTab", - "cronSpec": "* * * * /5", - "image": "my-awesome-cron-image" - } - ] -} -``` - -Because all objects are expected to contain standard Kubernetes metadata fields, -these list operations can also use label queries to filter requests down to -specific subsets. - -Likewise, clients can use watch endpoints to watch for changes to stored -objects. - -## Storage - -In order to store custom user data in a versioned fashion inside of etcd, we -need to also introduce a `Codec`-compatible object for persistent storage in -etcd. This object is `ThirdPartyResourceData` and it contains: - * Standard API Metadata. - * `Data`: The raw JSON data for this custom object. 
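A minimal sketch of that wrapper type, following the description above (struct tags, codec registration, and the exact metadata packages are omitted or assumed):

```go
// ThirdPartyResourceData is the Codec-compatible object persisted in etcd for
// each custom object. Sketch based on the description above.
type ThirdPartyResourceData struct {
	metav1.TypeMeta
	metav1.ObjectMeta

	// Data holds the raw JSON body of the custom object as submitted by the user.
	Data []byte
}
```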
- -### Storage key specification - -Each custom object stored by the API server needs a custom key in storage, this -is described below: - -#### Definitions - - * `resource-namespace`: the namespace of the particular resource that is -being stored - * `resource-name`: the name of the particular resource being stored - * `third-party-resource-namespace`: the namespace of the `ThirdPartyResource` -resource that represents the type for the specific instance being stored - * `third-party-resource-name`: the name of the `ThirdPartyResource` resource -that represents the type for the specific instance being stored - -#### Key - -Given the definitions above, the key for a specific third-party object is: - -``` -${standard-k8s-prefix}/third-party-resources/${third-party-resource-namespace}/${third-party-resource-name}/${resource-namespace}/${resource-name} -``` - -Thus, listing a third-party resource can be achieved by listing the directory: - -``` -${standard-k8s-prefix}/third-party-resources/${third-party-resource-namespace}/${third-party-resource-name}/${resource-namespace}/ -``` +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/api-machinery/garbage-collection.md b/contributors/design-proposals/api-machinery/garbage-collection.md index 4ee1cabc..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/garbage-collection.md +++ b/contributors/design-proposals/api-machinery/garbage-collection.md @@ -1,351 +1,6 @@ -**Table of Contents** +Design proposals have been archived. -- [Overview](#overview) -- [Cascading deletion with Garbage Collector](#cascading-deletion-with-garbage-collector) -- [Orphaning the descendants with "orphan" finalizer](#orphaning-the-descendants-with-orphan-finalizer) - - [Part I. The finalizer framework](#part-i-the-finalizer-framework) - - [Part II. The "orphan" finalizer](#part-ii-the-orphan-finalizer) -- [Related issues](#related-issues) - - [Orphan adoption](#orphan-adoption) - - [Upgrading a cluster to support cascading deletion](#upgrading-a-cluster-to-support-cascading-deletion) -- [End-to-End Examples](#end-to-end-examples) - - [Life of a Deployment and its descendants](#life-of-a-deployment-and-its-descendants) -- [Open Questions](#open-questions) -- [Considered and Rejected Designs](#considered-and-rejected-designs) -- [1. Tombstone + GC](#1-tombstone--gc) -- [2. Recovering from abnormal cascading deletion](#2-recovering-from-abnormal-cascading-deletion) +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -# Overview - -Currently most cascading deletion logic is implemented at client-side. For example, when deleting a replica set, kubectl uses a reaper to delete the created pods and then delete the replica set. We plan to move the cascading deletion to the server to simplify the client-side logic. In this proposal, we present the garbage collector which implements cascading deletion for all API resources in a generic way; we also present the finalizer framework, particularly the "orphan" finalizer, to enable flexible alternation between cascading deletion and orphaning. - -Goals of the design include: -* Supporting cascading deletion at the server-side. -* Centralizing the cascading deletion logic, rather than spreading in controllers. -* Allowing optionally orphan the dependent objects. - -Non-goals include: -* Releasing the name of an object immediately, so it can be reused ASAP. -* Propagating the grace period in cascading deletion. - -# Cascading deletion with Garbage Collector - -## API Changes - -```go -type ObjectMeta struct { - ... - OwnerReferences []OwnerReference -} -``` - -**ObjectMeta.OwnerReferences**: -List of objects depended by this object. If ***all*** objects in the list have been deleted, this object will be garbage collected. For example, a replica set `R` created by a deployment `D` should have an entry in ObjectMeta.OwnerReferences pointing to `D`, set by the deployment controller when `R` is created. This field can be updated by any client that has the privilege to both update ***and*** delete the object. For safety reasons, we can add validation rules to restrict what resources could be set as owners. For example, Events will likely be banned from being owners. - -```go -type OwnerReference struct { - // Version of the referent. - APIVersion string - // Kind of the referent. - Kind string - // Name of the referent. - Name string - // UID of the referent. - UID types.UID -} -``` - -**OwnerReference struct**: OwnerReference contains enough information to let you identify an owning object. 
Please refer to the inline comments for the meaning of each field. Currently, an owning object must be in the same namespace as the dependent object, so there is no namespace field. - -## New components: the Garbage Collector - -The Garbage Collector is responsible to delete an object if none of the owners listed in the object's OwnerReferences exist. -The Garbage Collector consists of a scanner, a garbage processor, and a propagator. -* Scanner: - * Uses the discovery API to detect all the resources supported by the system. - * Periodically scans all resources in the system and adds each object to the *Dirty Queue*. - -* Garbage Processor: - * Consists of the *Dirty Queue* and workers. - * Each worker: - * Dequeues an item from *Dirty Queue*. - * If the item's OwnerReferences is empty, continues to process the next item in the *Dirty Queue*. - * Otherwise checks each entry in the OwnerReferences: - * If at least one owner exists, do nothing. - * If none of the owners exist, requests the API server to delete the item. - -* Propagator: - * The Propagator is for optimization, not for correctness. - * Consists of an *Event Queue*, a single worker, and a DAG of owner-dependent relations. - * The DAG stores only name/uid/orphan triplets, not the entire body of every item. - * Watches for create/update/delete events for all resources, enqueues the events to the *Event Queue*. - * Worker: - * Dequeues an item from the *Event Queue*. - * If the item is an creation or update, then updates the DAG accordingly. - * If the object has an owner and the owner doesn't exist in the DAG yet, then apart from adding the object to the DAG, also enqueues the object to the *Dirty Queue*. - * If the item is a deletion, then removes the object from the DAG, and enqueues all its dependent objects to the *Dirty Queue*. - * The propagator shouldn't need to do any RPCs, so a single worker should be sufficient. This makes locking easier. - * With the Propagator, we *only* need to run the Scanner when starting the GC to populate the DAG and the *Dirty Queue*. - -# Orphaning the descendants with "orphan" finalizer - -Users may want to delete an owning object (e.g., a replicaset) while orphaning the dependent object (e.g., pods), that is, leaving the dependent objects untouched. We support such use cases by introducing the "orphan" finalizer. Finalizer is a generic API that has uses other than supporting orphaning, so we first describe the generic finalizer framework, then describe the specific design of the "orphan" finalizer. - -## Part I. The finalizer framework - -## API changes - -```go -type ObjectMeta struct { - … - Finalizers []string -} -``` - -**ObjectMeta.Finalizers**: List of finalizers that need to run before deleting the object. This list must be empty before the object is deleted from the registry. Each string in the list is an identifier for the responsible component that will remove the entry from the list. If the deletionTimestamp of the object is non-nil, entries in this list can only be removed. For safety reasons, updating finalizers requires special privileges. To enforce the admission rules, we will expose finalizers as a subresource and disallow directly changing finalizers when updating the main resource. - -## New components - -* Finalizers: - * Like a controller, a finalizer is always running. - * A third party can develop and run their own finalizer in the cluster. A finalizer doesn't need to be registered with the API server. - * Watches for update events that meet two conditions: - 1. 
the updated object has the identifier of the finalizer in ObjectMeta.Finalizers; - 2. ObjectMeta.DeletionTimestamp is updated from nil to non-nil. - * Applies the finalizing logic to the object in the update event. - * After the finalizing logic is completed, removes itself from ObjectMeta.Finalizers. - * The API server deletes the object after the last finalizer removes itself from the ObjectMeta.Finalizers field. - * Because it's possible for the finalizing logic to be applied multiple times (e.g., the finalizer crashes after applying the finalizing logic but before being removed form ObjectMeta.Finalizers), the finalizing logic has to be idempotent. - * If a finalizer fails to act in a timely manner, users with proper privileges can manually remove the finalizer from ObjectMeta.Finalizers. We will provide a kubectl command to do this. - -## Changes to existing components - -* API server: - * Deletion handler: - * If the `ObjectMeta.Finalizers` of the object being deleted is non-empty, then updates the DeletionTimestamp, but does not delete the object. - * If the `ObjectMeta.Finalizers` is empty and the options.GracePeriod is zero, then deletes the object. If the options.GracePeriod is non-zero, then just updates the DeletionTimestamp. - * Update handler: - * If the update removes the last finalizer, and the DeletionTimestamp is non-nil, and the DeletionGracePeriodSeconds is zero, then deletes the object from the registry. - * If the update removes the last finalizer, and the DeletionTimestamp is non-nil, but the DeletionGracePeriodSeconds is non-zero, then just updates the object. - -## Part II. The "orphan" finalizer - -## API changes - -```go -type DeleteOptions struct { - … - OrphanDependents bool -} -``` - -**DeleteOptions.OrphanDependents**: allows a user to express whether the dependent objects should be orphaned. It defaults to true, because controllers before release 1.2 expect dependent objects to be orphaned. - -## Changes to existing components - -* API server: -When handling a deletion request, depending on if DeleteOptions.OrphanDependents is true, the API server updates the object to add/remove the "orphan" finalizer to/from the ObjectMeta.Finalizers map. - - -## New components - -Adding a fourth component to the Garbage Collector, the"orphan" finalizer: -* Watches for update events as described in [Part I](#part-i-the-finalizer-framework). -* Removes the object in the event from the `OwnerReferences` of its dependents. - * dependent objects can be found via the DAG kept by the GC, or by relisting the dependent resource and checking the OwnerReferences field of each potential dependent object. -* Also removes any dangling owner references the dependent objects have. -* At last, removes the itself from the `ObjectMeta.Finalizers` of the object. - -# Related issues - -## Orphan adoption - -Controllers are responsible for adopting orphaned dependent resources. To do so, controllers -* Checks a potential dependent object's OwnerReferences to determine if it is orphaned. -* Fills the OwnerReferences if the object matches the controller's selector and is orphaned. - -There is a potential race between the "orphan" finalizer removing an owner reference and the controllers adding it back during adoption. 
Imagining this case: a user deletes an owning object and intends to orphan the dependent objects, so the GC removes the owner from the dependent object's OwnerReferences list, but the controller of the owner resource hasn't observed the deletion yet, so it adopts the dependent again and adds the reference back, resulting in the mistaken deletion of the dependent object. This race can be avoided by implementing Status.ObservedGeneration in all resources. Before updating the dependent Object's OwnerReferences, the "orphan" finalizer checks Status.ObservedGeneration of the owning object to ensure its controller has already observed the deletion. - -## Upgrading a cluster to support cascading deletion - -For the master, after upgrading to a version that supports cascading deletion, the OwnerReferences of existing objects remain empty, so the controllers will regard them as orphaned and start the adoption procedures. After the adoptions are done, server-side cascading will be effective for these existing objects. - -For nodes, cascading deletion does not affect them. - -For kubectl, we will keep the kubectl's cascading deletion logic for one more release. - -# End-to-End Examples - -This section presents an example of all components working together to enforce the cascading deletion or orphaning. - -## Life of a Deployment and its descendants - -1. User creates a deployment `D1`. -2. The Propagator of the GC observes the creation. It creates an entry of `D1` in the DAG. -3. The deployment controller observes the creation of `D1`. It creates the replicaset `R1`, whose OwnerReferences field contains a reference to `D1`, and has the "orphan" finalizer in its ObjectMeta.Finalizers map. -4. The Propagator of the GC observes the creation of `R1`. It creates an entry of `R1` in the DAG, with `D1` as its owner. -5. The replicaset controller observes the creation of `R1` and creates Pods `P1`~`Pn`, all with `R1` in their OwnerReferences. -6. The Propagator of the GC observes the creation of `P1`~`Pn`. It creates entries for them in the DAG, with `R1` as their owner. - - ***In case the user wants to cascadingly delete `D1`'s descendants, then*** - -7. The user deletes the deployment `D1`, with `DeleteOptions.OrphanDependents=false`. API server checks if `D1` has "orphan" finalizer in its Finalizers map, if so, it updates `D1` to remove the "orphan" finalizer. Then API server deletes `D1`. -8. The "orphan" finalizer does *not* take any action, because the observed deletion shows `D1` has an empty Finalizers map. -9. The Propagator of the GC observes the deletion of `D1`. It deletes `D1` from the DAG. It adds its dependent object, replicaset `R1`, to the *dirty queue*. -10. The Garbage Processor of the GC dequeues `R1` from the *dirty queue*. It finds `R1` has an owner reference pointing to `D1`, and `D1` no longer exists, so it requests API server to delete `R1`, with `DeleteOptions.OrphanDependents=false`. (The Garbage Processor should always set this field to false.) -11. The API server updates `R1` to remove the "orphan" finalizer if it's in the `R1`'s Finalizers map. Then the API server deletes `R1`, as `R1` has an empty Finalizers map. -12. The Propagator of the GC observes the deletion of `R1`. It deletes `R1` from the DAG. It adds its dependent objects, Pods `P1`~`Pn`, to the *Dirty Queue*. -13. The Garbage Processor of the GC dequeues `Px` (1 <= x <= n) from the *Dirty Queue*. 
It finds that `Px` have an owner reference pointing to `D1`, and `D1` no longer exists, so it requests API server to delete `Px`, with `DeleteOptions.OrphanDependents=false`. -14. API server deletes the Pods. - - ***In case the user wants to orphan `D1`'s descendants, then*** - -7. The user deletes the deployment `D1`, with `DeleteOptions.OrphanDependents=true`. -8. The API server first updates `D1`, with DeletionTimestamp=now and DeletionGracePeriodSeconds=0, increments the Generation by 1, and add the "orphan" finalizer to ObjectMeta.Finalizers if it's not present yet. The API server does not delete `D1`, because its Finalizers map is not empty. -9. The deployment controller observes the update, and acknowledges by updating the `D1`'s ObservedGeneration. The deployment controller won't create more replicasets on `D1`'s behalf. -10. The "orphan" finalizer observes the update, and notes down the Generation. It waits until the ObservedGeneration becomes equal to or greater than the noted Generation. Then it updates `R1` to remove `D1` from its OwnerReferences. At last, it updates `D1`, removing itself from `D1`'s Finalizers map. -11. The API server handles the update of `D1`, because *i)* DeletionTimestamp is non-nil, *ii)* the DeletionGracePeriodSeconds is zero, and *iii)* the last finalizer is removed from the Finalizers map, API server deletes `D1`. -12. The Propagator of the GC observes the deletion of `D1`. It deletes `D1` from the DAG. It adds its dependent, replicaset `R1`, to the *Dirty Queue*. -13. The Garbage Processor of the GC dequeues `R1` from the *Dirty Queue* and skips it, because its OwnerReferences is empty. - -# Open Questions - -1. In case an object has multiple owners, some owners are deleted with DeleteOptions.OrphanDependents=true, and some are deleted with DeleteOptions.OrphanDependents=false, what should happen to the object? - - The presented design will respect the setting in the deletion request of last owner. - -2. How to propagate the grace period in a cascading deletion? For example, when deleting a ReplicaSet with grace period of 5s, a user may expect the same grace period to be applied to the deletion of the Pods controlled the ReplicaSet. - - Propagating grace period in a cascading deletion is a ***non-goal*** of this proposal. Nevertheless, the presented design can be extended to support it. A tentative solution is letting the garbage collector to propagate the grace period when deleting dependent object. To persist the grace period set by the user, the owning object should not be deleted from the registry until all its dependent objects are in the graceful deletion state. This could be ensured by introducing another finalizer, tentatively named as the "populating graceful deletion" finalizer. Upon receiving the graceful deletion request, the API server adds this finalizer to the finalizers list of the owning object. Later the GC will remove it when all dependents are in the graceful deletion state. - - [#25055](https://github.com/kubernetes/kubernetes/issues/25055) tracks this problem. - -3. How can a client know when the cascading deletion is completed? - - A tentative solution is introducing a "completing cascading deletion" finalizer, which will be added to the finalizers list of the owning object, and removed by the GC when all dependents are deleted. The user can watch for the deletion event of the owning object to ensure the cascading deletion process has completed. 
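To make that last point concrete, here is a hedged sketch of the client-side wait, assuming the tentative "completing cascading deletion" finalizer existed and using the generic `watch.Interface` from the Kubernetes client libraries (how the watch on the owning object is opened is not shown):

```go
// waitForCascadingDeletion blocks until the owning object disappears. Under the
// tentative finalizer above, the owner is only removed from the registry after
// all dependents are deleted, so observing its deletion event implies the
// cascading deletion has completed. Sketch only; error handling is minimal.
func waitForCascadingDeletion(w watch.Interface) error {
	defer w.Stop()
	for event := range w.ResultChan() {
		if event.Type == watch.Deleted {
			return nil
		}
	}
	return fmt.Errorf("watch closed before the owning object was deleted")
}
```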
- - ---- -***THE REST IS FOR ARCHIVAL PURPOSES*** ---- - -# Considered and Rejected Designs - -# 1. Tombstone + GC - -## Reasons of rejection - -* It likely would conflict with our plan in the future to use all resources as their own tombstones, once the registry supports multi-object transaction. -* The TTL of the tombstone is hand-waving, there is no guarantee that the value of the TTL is long enough. -* This design is essentially the same as the selected design, with the tombstone as an extra element. The benefit the extra complexity buys is that a parent object can be deleted immediately even if the user wants to orphan the children. The benefit doesn't justify the complexity. - - -## API Changes - -```go -type DeleteOptions struct { - … - OrphanChildren bool -} -``` - -**DeleteOptions.OrphanChildren**: allows a user to express whether the child objects should be orphaned. - -```go -type ObjectMeta struct { - ... - ParentReferences []ObjectReference -} -``` - -**ObjectMeta.ParentReferences**: links the resource to the parent resources. For example, a replica set `R` created by a deployment `D` should have an entry in ObjectMeta.ParentReferences pointing to `D`. The link should be set when the child object is created. It can be updated after the creation. - -```go -type Tombstone struct { - unversioned.TypeMeta - ObjectMeta - UID types.UID -} -``` - -**Tombstone**: a tombstone is created when an object is deleted and the user requires the children to be orphaned. -**Tombstone.UID**: the UID of the original object. - -## New components - -The only new component is the Garbage Collector, which consists of a scanner, a garbage processor, and a propagator. -* Scanner: - * Uses the discovery API to detect all the resources supported by the system. - * For performance reasons, resources can be marked as not participating cascading deletion in the discovery info, then the GC will not monitor them. - * Periodically scans all resources in the system and adds each object to the *Dirty Queue*. - -* Garbage Processor: - * Consists of the *Dirty Queue* and workers. - * Each worker: - * Dequeues an item from *Dirty Queue*. - * If the item's ParentReferences is empty, continues to process the next item in the *Dirty Queue*. - * Otherwise checks each entry in the ParentReferences: - * If a parent exists, continues to check the next parent. - * If a parent doesn't exist, checks if a tombstone standing for the parent exists. - * If the step above shows no parent nor tombstone exists, requests the API server to delete the item. That is, only if ***all*** parents are non-existent, and none of them have tombstones, the child object will be garbage collected. - * Otherwise removes the item's ParentReferences to non-existent parents. - -* Propagator: - * The Propagator is for optimization, not for correctness. - * Maintains a DAG of parent-child relations. This DAG stores only name/uid/orphan triplets, not the entire body of every item. - * Consists of an *Event Queue* and a single worker. - * Watches for create/update/delete events for all resources that participating cascading deletion, enqueues the events to the *Event Queue*. - * Worker: - * Dequeues an item from the *Event Queue*. - * If the item is an creation or update, then updates the DAG accordingly. - * If the object has a parent and the parent doesn't exist in the DAG yet, then apart from adding the object to the DAG, also enqueues the object to the *Dirty Queue*. 
- * If the item is a deletion, then removes the object from the DAG, and enqueues all its children to the *Dirty Queue*. - * The propagator shouldn't need to do any RPCs, so a single worker should be sufficient. This makes locking easier. - * With the Propagator, we *only* need to run the Scanner when starting the Propagator to populate the DAG and the *Dirty Queue*. - -## Changes to existing components - -* Storage: we should add a REST storage for Tombstones. The index should be UID rather than namespace/name. - -* API Server: when handling a deletion request, if DeleteOptions.OrphanChildren is true, then the API Server either creates a tombstone with TTL if the tombstone doesn't exist yet, or updates the TTL of the existing tombstone. The API Server deletes the object after the tombstone is created. - -* Controllers: when creating child objects, controllers need to fill up their ObjectMeta.ParentReferences field. Objects that don't have a parent should have the namespace object as the parent. - -## Comparison with the selected design - -The main difference between the two designs is when to update the ParentReferences. In design #1, because a tombstone is created to indicate "orphaning" is desired, the updates to ParentReferences can be deferred until the deletion of the tombstone. In design #2, the updates need to be done before the parent object is deleted from the registry. - -* Advantages of "Tombstone + GC" design - * Faster to free the resource name compared to using finalizers. The original object can be deleted to free the resource name once the tombstone is created, rather than waiting for the finalizers to update all children's ObjectMeta.ParentReferences. -* Advantages of "Finalizer Framework + GC" - * The finalizer framework is needed for other purposes as well. - - -# 2. Recovering from abnormal cascading deletion - -## Reasons of rejection - -* Not a goal -* Tons of work, not feasible in the near future - -In case the garbage collector is mistakenly deleting objects, we should provide mechanism to stop the garbage collector and restore the objects. - -* Stopping the garbage collector - - We will add a "--enable-garbage-collector" flag to the controller manager binary to indicate if the garbage collector should be enabled. Admin can stop the garbage collector in a running cluster by restarting the kube-controller-manager with --enable-garbage-collector=false. - -* Restoring mistakenly deleted objects - * Guidelines - * The restoration should be implemented as a roll-forward rather than a roll-back, because likely the state of the cluster (e.g., available resources on a node) has changed since the object was deleted. - * Need to archive the complete specs of the deleted objects. - * The content of the archive is sensitive, so the access to the archive subjects to the same authorization policy enforced on the original resource. - * States should be stored in etcd. All components should remain stateless. - - * A preliminary design - - This is a generic design for “undoing a deletion”, not specific to undoing cascading deletion. - * Add a `/archive` sub-resource to every resource, it's used to store the spec of the deleted objects. - * Before an object is deleted from the registry, the API server clears fields like DeletionTimestamp, then creates the object in /archive and sets a TTL. - * Add a `kubectl restore` command, which takes a resource/name pair as input, creates the object with the spec stored in the /archive, and deletes the archived object. 
- +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/api-machinery/metadata-policy.md b/contributors/design-proposals/api-machinery/metadata-policy.md index b9a78e36..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/metadata-policy.md +++ b/contributors/design-proposals/api-machinery/metadata-policy.md @@ -1,134 +1,6 @@ -# MetadataPolicy and its use in choosing the scheduler in a multi-scheduler system +Design proposals have been archived. -Status: Not Implemented +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Introduction -This document describes a new API resource, `MetadataPolicy`, that configures an -admission controller to take one or more actions based on an object's metadata. -Initially the metadata fields that the predicates can examine are labels and -annotations, and the actions are to add one or more labels and/or annotations, -or to reject creation/update of the object. In the future other actions might be -supported, such as applying an initializer. - -The first use of `MetadataPolicy` will be to decide which scheduler should -schedule a pod in a [multi-scheduler](./multiple-schedulers.md) -Kubernetes system. In particular, the policy will add the scheduler name -annotation to a pod based on an annotation that is already on the pod that -indicates the QoS of the pod. (That annotation was presumably set by a simpler -admission controller that uses code, rather than configuration, to map the -resource requests and limits of a pod to QoS, and attaches the corresponding -annotation.) - -We anticipate a number of other uses for `MetadataPolicy`, such as defaulting -for labels and annotations, prohibiting/requiring particular labels or -annotations, or choosing a scheduling policy within a scheduler. We do not -discuss them in this doc. - - -## API - -```go -// MetadataPolicySpec defines the configuration of the MetadataPolicy API resource. -// Every rule is applied, in an unspecified order, but if the action for any rule -// that matches is to reject the object, then the object is rejected without being mutated. -type MetadataPolicySpec struct { - Rules []MetadataPolicyRule `json:"rules,omitempty"` -} - -// If the PolicyPredicate is met, then the PolicyAction is applied. -// Example rules: -// reject object if label with key X is present (i.e. require X) -// reject object if label with key X is not present (i.e. forbid X) -// add label X=Y if label with key X is not present (i.e. default X) -// add annotation A=B if object has annotation C=D or E=F -type MetadataPolicyRule struct { - PolicyPredicate PolicyPredicate `json:"policyPredicate"` - PolicyAction PolicyAction `json:policyAction"` -} - -// All criteria must be met for the PolicyPredicate to be considered met. -type PolicyPredicate struct { - // Note that Namespace is not listed here because MetadataPolicy is per-Namespace. - LabelSelector *LabelSelector `json:"labelSelector,omitempty"` - AnnotationSelector *LabelSelector `json:"annotationSelector,omitempty"` -} - -// Apply the indicated Labels and/or Annotations (if present), unless Reject is set -// to true, in which case reject the object without mutating it. -type PolicyAction struct { - // If true, the object will be rejected and not mutated. - Reject bool `json:"reject"` - // The labels to add or update, if any. - UpdatedLabels *map[string]string `json:"updatedLabels,omitempty"` - // The annotations to add or update, if any. 
- UpdatedAnnotations *map[string]string `json:"updatedAnnotations,omitempty"` -} - -// MetadataPolicy describes the MetadataPolicy API resource, which is used for specifying -// policies that should be applied to objects based on the objects' metadata. All MetadataPolicy's -// are applied to all objects in the namespace; the order of evaluation is not guaranteed, -// but if any of the matching policies have an action of rejecting the object, then the object -// will be rejected without being mutated. -type MetadataPolicy struct { - unversioned.TypeMeta `json:",inline"` - // Standard object's metadata. - // More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata - ObjectMeta `json:"metadata,omitempty"` - - // Spec defines the metadata policy that should be enforced. - // http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status - Spec MetadataPolicySpec `json:"spec,omitempty"` -} - -// MetadataPolicyList is a list of MetadataPolicy items. -type MetadataPolicyList struct { - unversioned.TypeMeta `json:",inline"` - // Standard list metadata. - // More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#types-kinds - unversioned.ListMeta `json:"metadata,omitempty"` - - // Items is a list of MetadataPolicy objects. - // More info: http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota - Items []MetadataPolicy `json:"items"` -} -``` - -## Implementation plan - -1. Create `MetadataPolicy` API resource -1. Create admission controller that implements policies defined in -`MetadataPolicy` -1. Create admission controller that sets annotation -`scheduler.alpha.kubernetes.io/qos: <QoS>` -(where `QOS` is one of `Guaranteed, Burstable, BestEffort`) -based on pod's resource request and limit. - -## Future work - -Longer-term we will have QoS be set on create and update by the registry, -similar to `Pending` phase today, instead of having an admission controller -(that runs before the one that takes `MetadataPolicy` as input) do it. - -We plan to eventually move from having an admission controller set the scheduler -name as a pod annotation, to using the initializer concept. In particular, the -scheduler will be an initializer, and the admission controller that decides -which scheduler to use will add the scheduler's name to the list of initializers -for the pod (presumably the scheduler will be the last initializer to run on -each pod). The admission controller would still be configured using the -`MetadataPolicy` described here, only the mechanism the admission controller -uses to record its decision of which scheduler to use would change. - -## Related issues - -The main issue for multiple schedulers is #11793. There was also a lot of -discussion in PRs #17197 and #17865. - -We could use the approach described here to choose a scheduling policy within a -single scheduler, as opposed to choosing a scheduler, a desire mentioned in - -# 9920. Issue #17097 describes a scenario unrelated to scheduler-choosing where - -`MetadataPolicy` could be used. Issue #17324 proposes to create a generalized -API for matching "claims" to "service classes"; matching a pod to a scheduler -would be one use for such an API. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/api-machinery/protobuf.md b/contributors/design-proposals/api-machinery/protobuf.md index 455cc955..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/protobuf.md +++ b/contributors/design-proposals/api-machinery/protobuf.md @@ -1,475 +1,6 @@ -# Protobuf serialization and internal storage +Design proposals have been archived. -@smarterclayton +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -March 2016 -## Proposal and Motivation - -The Kubernetes API server is a "dumb server" which offers storage, versioning, -validation, update, and watch semantics on API resources. In a large cluster -the API server must efficiently retrieve, store, and deliver large numbers -of coarse-grained objects to many clients. In addition, Kubernetes traffic is -heavily biased towards intra-cluster traffic - as much as 90% of the requests -served by the APIs are for internal cluster components like nodes, controllers, -and proxies. The primary format for intercluster API communication is JSON -today for ease of client construction. - -At the current time, the latency of reaction to change in the cluster is -dominated by the time required to load objects from persistent store (etcd), -convert them to an output version, serialize them JSON over the network, and -then perform the reverse operation in clients. The cost of -serialization/deserialization and the size of the bytes on the wire, as well -as the memory garbage created during those operations, dominate the CPU and -network usage of the API servers. - -In order to reach clusters of 10k nodes, we need roughly an order of magnitude -efficiency improvement in a number of areas of the cluster, starting with the -masters but also including API clients like controllers, kubelets, and node -proxies. - -We propose to introduce a Protobuf serialization for all common API objects -that can optionally be used by intra-cluster components. Experiments have -demonstrated a 10x reduction in CPU use during serialization and deserialization, -a 2x reduction in size in bytes on the wire, and a 6-9x reduction in the amount -of objects created on the heap during serialization. The Protobuf schema -for each object will be automatically generated from the external API Go structs -we use to serialize to JSON. - -Benchmarking showed that the time spent on the server in a typical GET -resembles: - - etcd -> decode -> defaulting -> convert to internal -> - JSON 50us 5us 15us - Proto 5us - JSON 150allocs 80allocs - Proto 100allocs - - process -> convert to external -> encode -> client - JSON 15us 40us - Proto 5us - JSON 80allocs 100allocs - Proto 4allocs - - Protobuf has a huge benefit on encoding because it does not need to allocate - temporary objects, just one large buffer. Changing to protobuf moves our - hotspot back to conversion, not serialization. - - -## Design Points - -* Generate Protobuf schema from Go structs (like we do for JSON) to avoid - manual schema update and drift -* Generate Protobuf schema that is field equivalent to the JSON fields (no - special types or enumerations), reducing drift for clients across formats. -* Follow our existing API versioning rules (backwards compatible in major - API versions, breaking changes across major versions) by creating one - Protobuf schema per API type. 
-* Continue to use the existing REST API patterns but offer an alternative - serialization, which means existing client and server tooling can remain - the same while benefiting from faster decoding. -* Protobuf objects on disk or in etcd will need to be self identifying at - rest, like JSON, in order for backwards compatibility in storage to work, - so we must add an envelope with apiVersion and kind to wrap the nested - object, and make the data format recognizable to clients. -* Use the [gogo-protobuf](https://github.com/gogo/protobuf) Golang library to generate marshal/unmarshal - operations, allowing us to bypass the expensive reflection used by the - golang JSOn operation - - -## Alternatives - -* We considered JSON compression to reduce size on wire, but that does not - reduce the amount of memory garbage created during serialization and - deserialization. -* More efficient formats like Msgpack were considered, but they only offer - 2x speed up vs. the 10x observed for Protobuf -* gRPC was considered, but is a larger change that requires more core - refactoring. This approach does not eliminate the possibility of switching - to gRPC in the future. -* We considered attempting to improve JSON serialization, but the cost of - implementing a more efficient serializer library than ugorji is - significantly higher than creating a protobuf schema from our Go structs. - - -## Schema - -The Protobuf schema for each API group and version will be generated from -the objects in that API group and version. The schema will be named using -the package identifier of the Go package, i.e. - - k8s.io/kubernetes/pkg/api/v1 - -Each top level object will be generated as a Protobuf message, i.e.: - - type Pod struct { ... } - - message Pod {} - -Since the Go structs are designed to be serialized to JSON (with only the -int, string, bool, map, and array primitive types), we will use the -canonical JSON serialization as the protobuf field type wherever possible, -i.e.: - - JSON Protobuf - string -> string - int -> varint - bool -> bool - array -> repeating message|primitive - -We disallow the use of the Go `int` type in external fields because it is -ambiguous depending on compiler platform, and instead always use `int32` or -`int64`. - -We will use maps (a protobuf 3 extension that can serialize to protobuf 2) -to represent JSON maps: - - JSON Protobuf Wire (proto2) - map -> map<string, ...> -> repeated Message { key string; value bytes } - -We will not convert known string constants to enumerations, since that -would require extra logic we do not already have in JSOn. - -To begin with, we will use Protobuf 3 to generate a Protobuf 2 schema, and -in the future investigate a Protobuf 3 serialization. We will introduce -abstractions that let us have more than a single protobuf serialization if -necessary. Protobuf 3 would require us to support message types for -pointer primitive (nullable) fields, which is more complex than Protobuf 2's -support for pointers. - -### Example of generated proto IDL - -Without gogo extensions: - -``` -syntax = 'proto2'; - -package k8s.io.kubernetes.pkg.api.v1; - -import "k8s.io/kubernetes/pkg/api/resource/generated.proto"; -import "k8s.io/kubernetes/pkg/api/unversioned/generated.proto"; -import "k8s.io/kubernetes/pkg/runtime/generated.proto"; -import "k8s.io/kubernetes/pkg/util/intstr/generated.proto"; - -// Package-wide variables from generator "generated". -option go_package = "v1"; - -// Represents a Persistent Disk resource in AWS. 
-// -// An AWS EBS disk must exist before mounting to a container. The disk -// must also be in the same AWS zone as the kubelet. An AWS EBS disk -// can only be mounted as read/write once. AWS EBS volumes support -// ownership management and SELinux relabeling. -message AWSElasticBlockStoreVolumeSource { - // Unique ID of the persistent disk resource in AWS (Amazon EBS volume). - // More info: http://kubernetes.io/docs/user-guide/volumes#awselasticblockstore - optional string volumeID = 1; - - // Filesystem type of the volume that you want to mount. - // Tip: Ensure that the filesystem type is supported by the host operating system. - // Examples: "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified. - // More info: http://kubernetes.io/docs/user-guide/volumes#awselasticblockstore - // TODO: how do we prevent errors in the filesystem from compromising the machine - optional string fsType = 2; - - // The partition in the volume that you want to mount. - // If omitted, the default is to mount by volume name. - // Examples: For volume /dev/sda1, you specify the partition as "1". - // Similarly, the volume partition for /dev/sda is "0" (or you can leave the property empty). - optional int32 partition = 3; - - // Specify "true" to force and set the ReadOnly property in VolumeMounts to "true". - // If omitted, the default is "false". - // More info: http://kubernetes.io/docs/user-guide/volumes#awselasticblockstore - optional bool readOnly = 4; -} - -// Affinity is a group of affinity scheduling rules, currently -// only node affinity, but in the future also inter-pod affinity. -message Affinity { - // Describes node affinity scheduling rules for the pod. - optional NodeAffinity nodeAffinity = 1; -} -``` - -With extensions: - -``` -syntax = 'proto2'; - -package k8s.io.kubernetes.pkg.api.v1; - -import "github.com/gogo/protobuf/gogoproto/gogo.proto"; -import "k8s.io/kubernetes/pkg/api/resource/generated.proto"; -import "k8s.io/kubernetes/pkg/api/unversioned/generated.proto"; -import "k8s.io/kubernetes/pkg/runtime/generated.proto"; -import "k8s.io/kubernetes/pkg/util/intstr/generated.proto"; - -// Package-wide variables from generator "generated". -option (gogoproto.marshaler_all) = true; -option (gogoproto.sizer_all) = true; -option (gogoproto.unmarshaler_all) = true; -option (gogoproto.goproto_unrecognized_all) = false; -option (gogoproto.goproto_enum_prefix_all) = false; -option (gogoproto.goproto_getters_all) = false; -option go_package = "v1"; - -// Represents a Persistent Disk resource in AWS. -// -// An AWS EBS disk must exist before mounting to a container. The disk -// must also be in the same AWS zone as the kubelet. An AWS EBS disk -// can only be mounted as read/write once. AWS EBS volumes support -// ownership management and SELinux relabeling. -message AWSElasticBlockStoreVolumeSource { - // Unique ID of the persistent disk resource in AWS (Amazon EBS volume). - // More info: http://kubernetes.io/docs/user-guide/volumes#awselasticblockstore - optional string volumeID = 1 [(gogoproto.customname) = "VolumeID", (gogoproto.nullable) = false]; - - // Filesystem type of the volume that you want to mount. - // Tip: Ensure that the filesystem type is supported by the host operating system. - // Examples: "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified. 
- // More info: http://kubernetes.io/docs/user-guide/volumes#awselasticblockstore - // TODO: how do we prevent errors in the filesystem from compromising the machine - optional string fsType = 2 [(gogoproto.customname) = "FSType", (gogoproto.nullable) = false]; - - // The partition in the volume that you want to mount. - // If omitted, the default is to mount by volume name. - // Examples: For volume /dev/sda1, you specify the partition as "1". - // Similarly, the volume partition for /dev/sda is "0" (or you can leave the property empty). - optional int32 partition = 3 [(gogoproto.customname) = "Partition", (gogoproto.nullable) = false]; - - // Specify "true" to force and set the ReadOnly property in VolumeMounts to "true". - // If omitted, the default is "false". - // More info: http://kubernetes.io/docs/user-guide/volumes#awselasticblockstore - optional bool readOnly = 4 [(gogoproto.customname) = "ReadOnly", (gogoproto.nullable) = false]; -} - -// Affinity is a group of affinity scheduling rules, currently -// only node affinity, but in the future also inter-pod affinity. -message Affinity { - // Describes node affinity scheduling rules for the pod. - optional NodeAffinity nodeAffinity = 1 [(gogoproto.customname) = "NodeAffinity"]; -} -``` - -## Wire format - -In order to make Protobuf serialized objects recognizable in a binary form, -the encoded object must be prefixed by a magic number, and then wrap the -non-self-describing Protobuf object in a Protobuf object that contains -schema information. The protobuf object is referred to as the `raw` object -and the encapsulation is referred to as `wrapper` object. - -The simplest serialization is the raw Protobuf object with no identifying -information. In some use cases, we may wish to have the server identify the -raw object type on the wire using a protocol dependent format (gRPC uses -a type HTTP header). This works when all objects are of the same type, but -we occasionally have reasons to encode different object types in the same -context (watches, lists of objects on disk, and API calls that may return -errors). - -To identify the type of a wrapped Protobuf object, we wrap it in a message -in package `k8s.io/kubernetes/pkg/runtime` with message name `Unknown` -having the following schema: - - message Unknown { - optional TypeMeta typeMeta = 1; - optional bytes value = 2; - optional string contentEncoding = 3; - optional string contentType = 4; - } - - message TypeMeta { - optional string apiVersion = 1; - optional string kind = 2; - } - -The `value` field is an encoded protobuf object that matches the schema -defined in `typeMeta` and has optional `contentType` and `contentEncoding` -fields. `contentType` and `contentEncoding` have the same meaning as in -HTTP, if unspecified `contentType` means "raw protobuf object", and -`contentEncoding` defaults to no encoding. If `contentEncoding` is -specified, the defined transformation should be applied to `value` before -attempting to decode the value. - -The `contentType` field is required to support objects without a defined -protobuf schema, like the ThirdPartyResource or templates. Those objects -would have to be encoded as JSON or another structure compatible form -when used with Protobuf. Generic clients must deal with the possibility -that the returned value is not in the known type. - -We add the `contentEncoding` field here to preserve room for future -optimizations like encryption-at-rest or compression of the nested content. 
-Clients should error when receiving an encoding they do not support. -Negotiating encoding is not defined here, but introducing new encodings -is similar to introducing a schema change or new API version. - -A client should use the `kind` and `apiVersion` fields to identify the -correct protobuf IDL for that message and version, and then decode the -`bytes` field into that Protobuf message. - -Any Unknown value written to stable storage will be given a 4 byte prefix -`0x6b, 0x38, 0x73, 0x00`, which correspond to `k8s` followed by a zero byte. -The content-type `application/vnd.kubernetes.protobuf` is defined as -representing the following schema: - - MESSAGE = '0x6b 0x38 0x73 0x00' UNKNOWN - UNKNOWN = <protobuf serialization of k8s.io/kubernetes/pkg/runtime#Unknown> - -A client should check for the first four bytes, then perform a protobuf -deserialization of the remaining bytes into the `runtime.Unknown` type. - -## Streaming wire format - -While the majority of Kubernetes APIs return single objects that can vary -in type (Pod vs. Status, PodList vs. Status), the watch APIs return a stream -of identical objects (Events). At the time of this writing, this is the only -current or anticipated streaming RESTful protocol (logging, port-forwarding, -and exec protocols use a binary protocol over Websockets or SPDY). - -In JSON, this API is implemented as a stream of JSON objects that are -separated by their syntax (the closing `}` brace is followed by whitespace -and the opening `{` brace starts the next object). There is no formal -specification covering this pattern, nor a unique content-type. Each object -is expected to be of type `watch.Event`, and is currently not self describing. - -For expediency and consistency, we define a format for Protobuf watch Events -that is similar. Since protobuf messages are not self describing, we must -identify the boundaries between Events (a `frame`). We do that by prefixing -each frame of N bytes with a 4-byte, big-endian, unsigned integer with the -value N. - - frame = length body - length = 32-bit unsigned integer in big-endian order, denoting length of - bytes of body - body = <bytes> - - # frame containing a single byte 0a - frame = 01 00 00 00 0a - - # equivalent JSON - frame = {"type": "added", ...} - -The body of each frame is a serialized Protobuf message `Event` in package -`k8s.io/kubernetes/pkg/watch/versioned`. The content type used for this -format is `application/vnd.kubernetes.protobuf;type=watch`. - -## Negotiation - -To allow clients to request protobuf serialization optionally, the `Accept` -HTTP header is used by callers to indicate which serialization they wish -returned in the response, and the `Content-Type` header is used to tell the -server how to decode the bytes sent in the request (for DELETE/POST/PUT/PATCH -requests). The server will return 406 if the `Accept` header is not -recognized or 415 if the `Content-Type` is not recognized (as defined in -RFC2616). - -To be backwards compatible, clients must consider that the server does not -support protobuf serialization. A number of options are possible: - -### Preconfigured - -Clients can have a configuration setting that instructs them which version -to use. This is the simplest option, but requires intervention when the -component upgrades to protobuf. - -### Include serialization information in api-discovery - -Servers can define the list of content types they accept and return in -their API discovery docs, and clients can use protobuf if they support it. 
-Allows dynamic configuration during upgrade if the client is already using -API-discovery. - -### Optimistically attempt to send and receive requests using protobuf - -Using multiple `Accept` values: - - Accept: application/vnd.kubernetes.protobuf, application/json - -clients can indicate their preferences and handle the returned -`Content-Type` using whatever the server responds. On update operations, -clients can try protobuf and if they receive a 415 error, record that and -fall back to JSON. Allows the client to be backwards compatible with -any server, but comes at the cost of some implementation complexity. - - -## Generation process - -Generation proceeds in five phases: - -1. Generate a gogo-protobuf annotated IDL from the source Go struct. -2. Generate temporary Go structs from the IDL using gogo-protobuf. -3. Generate marshaller/unmarshallers based on the IDL using gogo-protobuf. -4. Take all tag numbers generated for the IDL and apply them as struct tags - to the original Go types. -5. Generate a final IDL without gogo-protobuf annotations as the canonical IDL. - -The output is a `generated.proto` file in each package containing a standard -proto2 IDL, and a `generated.pb.go` file in each package that contains the -generated marshal/unmarshallers. - -The Go struct generated by gogo-protobuf from the first IDL must be identical -to the origin struct - a number of changes have been made to gogo-protobuf -to ensure exact 1-1 conversion. A small number of additions may be necessary -in the future if we introduce more exotic field types (Go type aliases, maps -with aliased Go types, and embedded fields were fixed). If they are identical, -the output marshallers/unmarshallers can then work on the origin struct. - -Whenever a new field is added, generation will assign that field a unique tag -and the 4th phase will write that tag back to the origin Go struct as a `protobuf` -struct tag. This ensures subsequent generation passes are stable, even in the -face of internal refactors. The first time a field is added, the author will -need to check in both the new IDL AND the protobuf struct tag changes. - -The second IDL is generated without gogo-protobuf annotations to allow clients -in other languages to generate easily. - -Any errors in the generation process are considered fatal and must be resolved -early (being unable to identify a field type for conversion, duplicate fields, -duplicate tags, protoc errors, etc). The conversion fuzzer is used to ensure -that a Go struct can be round-tripped to protobuf and back, as we do for JSON -and conversion testing. - - -## Changes to development process - -All existing API change rules would still apply. New fields added would be -automatically assigned a tag by the generation process. New API versions will -have a new proto IDL, and field name and changes across API versions would be -handled using our existing API change rules. Tags cannot change within an -API version. - -Generation would be done by developers and then checked into source control, -like conversions and ugorji JSON codecs. - -Because protoc is not packaged well across all platforms, we will add it to -the `kube-cross` Docker image and developers can use that to generate -updated protobufs. Protobuf 3 beta is required. - -The generated protobuf will be checked with a verify script before merging. - - -## Implications - -* The generated marshal code is large and will increase build times and binary - size. 
We may be able to remove ugorji after protobuf is added, since the - bulk of our decoding would switch to protobuf. -* The protobuf schema is naive, which means it may not be as a minimal as - possible. -* Debugging of protobuf related errors is harder due to the binary nature of - the format. -* Migrating API object storage from JSON to protobuf will require that all - API servers are upgraded before beginning to write protobuf to disk, since - old servers won't recognize protobuf. -* Transport of protobuf between etcd and the api server will be less efficient - in etcd2 than etcd3 (since etcd2 must encode binary values returned as JSON). - Should still be smaller than current JSON request. -* Third-party API objects must be stored as JSON inside of a protobuf wrapper - in etcd, and the API endpoints will not benefit from clients that speak - protobuf. Clients will have to deal with some API objects not supporting - protobuf. - - -## Open Questions - -* Is supporting stored protobuf files on disk in the kubectl client worth it? +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/api-machinery/server-get.md b/contributors/design-proposals/api-machinery/server-get.md index 576a1916..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/server-get.md +++ b/contributors/design-proposals/api-machinery/server-get.md @@ -1,179 +1,6 @@ -# Expose `get` output from the server +Design proposals have been archived. -Today, all clients must reproduce the tabular and describe output implemented in `kubectl` to perform simple lists -of objects. This logic in many cases is non-trivial and condenses multiple fields into succinct output. It also requires -that every client provide rendering logic for every possible type, including those provided by API aggregation or third -party resources which may not be known at compile time. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -This proposal covers moving `get` and `describe` to the server to reduce total work for all clients and centralize these -core display options for better reuse and extension. - - -## Background - -`kubectl get` is a simple tabular representation of one or more instances of a particular resource type. It is the primary -listing mechanism and so must be implemented for each type. Today, we have a single generic implementation for unrecognized -types that attempts to load information from the `metadata` field (assuming the object follows the metav1 Kubernetes API -schema). `get` supports a `wide` mode that includes additional columns. Users can add additional columns for labels via a -flag. Headers corresponding to the columns are optionally displayed. - -`kubectl describe` shows a textual representation of individual objects that describes individual fields as subsequent -lines and uses indentation and nested tables to convey deeper structure on the resource (such as events for a pod or -each container). It sometimes retrieves related objects like events, pods for a replication controller, or autoscalers -for a deployment. It supports no significant flags. - -The implementation of both is modeled as a registered function that takes an object or list of objects and outputs -semi-structured text. - -## Goals - -* Make it easy for a simple client to get a list of resources for a web UI or CLI output -* Support all existing options, leave open the door for future extension and experimentation -* Allow new API extensions and third party resources to be implemented server side, removing the need to version - schemes for retrieving data from the server -* Keep implementation of `get` and `describe` simple -* Ease internationalization of `get` and `describe` output for all clients - -## Non-Goals - -* Deep customization of the returned output by the client - - -## Specification of server-side `get` - -The server would return a `Table` object (working-name) that contains metadata for columns and one or more -rows composed of cells for each column. Some additional data may be relevant for each row and returned by the -server. Since every object should have a `Table` representation, treat this as part of content negotiation -as described in [the alternative representations of objects proposal](alternate-api-representations.md). 
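
For concreteness, here is a minimal Go sketch of types a generic client could decode such a response into. The field and type names follow the draft JSON in the examples below; they are assumptions of this sketch, not a finalized API definition.

```go
// Sketch of Go types mirroring the draft Table wire format.
package tablesketch

import "encoding/json"

// TableColumnHeader describes a single column returned by the server.
type TableColumnHeader struct {
	Name        string `json:"name"`
	Type        string `json:"type"`
	Description string `json:"description,omitempty"`
	// Priority marks "additional" (wide) columns; 0 means always shown.
	Priority int `json:"priority,omitempty"`
}

// TableRow holds one cell per column, plus the optional per-row object
// (PartialObjectMetadata or the full object, depending on includeObject).
type TableRow struct {
	Cells  []interface{}   `json:"cells"`
	Object json.RawMessage `json:"object,omitempty"`
}

// Table is the top-level object returned when the client negotiates the
// Table representation via the Accept header.
type Table struct {
	Kind       string              `json:"kind,omitempty"`
	APIVersion string              `json:"apiVersion,omitempty"`
	Headers    []TableColumnHeader `json:"headers,omitempty"`
	Items      []TableRow          `json:"items,omitempty"`
}
```

Decoding into a structure like this is all a generic client needs; no per-resource printing logic is required.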
- -Example request: - -``` -$ curl https://localhost:8443/api/v1/pods -H "Accept: application/json+vnd.kubernetes.as+meta.k8s.io+v1alpha1+Table" -{ - "kind": "Table", - "apiVersion": "meta.k8s.io/v1alpha1", - "headers": [ - {"name": "Name", "type": "string", "description": "The name of the pod, must be unique ..."}, - {"name": "Status", "type": "string", "description": "Describes the current state of the pod"}, - ... - ], - "items": [ - {"cells": ["pod1", "Failed - unable to start", ...]}, - {"cells": ["pod2", "Init 0/2", ...]}, - ... - ] -} -``` - -This representation is also possible to return from a watch. The watch can omit headers on subsequent queries. - -``` -$ curl https://localhost:8443/api/v1/pods?watch=1 -H "Accept: application/json+vnd.kubernetes.as+meta.k8s.io+v1alpha1+Table" -{ - "kind": "Table", - "apiVersion": "meta.k8s.io/v1alpha1", - // headers are printed first, in case the watch holds - "headers": [ - {"name": "Name", "type": "string", "description": "The name of the pod, must be unique ..."}, - {"name": "Status", "type": "string", "description": "Describes the current state of the pod"}, - ... - ] -} -{ - "kind": "Table", - "apiVersion": "meta.k8s.io/v1alpha1", - // headers are not returned here - "items": [ - {"cells": ["pod1", "Failed - unable to start", ...]}, - ... - ] -} -``` - -It can also be returned in CSV form: - -``` -$ curl https://localhost:8443/api/v1/pods -H "Accept: text/csv+vnd.kubernetes.as+meta.k8s.io+v1alpha1+Table" -Name,Status,... -pod1,"Failed - unable to start",... -pod2,"Init 0/2",... -... -``` - -To support "wide" format, columns may be marked with an optional priority field of increasing integers (default -priority 0): - -``` -{ - "kind": "Table", - "apiVersion": "meta.k8s.io/v1alpha1", - "headers": [ - ... - {"name": "Node Name", "type": "string", "description": "The node the pod is scheduled on, empty if the pod is not yet scheduled", "priority": 1}, - ... - ], - ... -} -``` - -To allow label columns, and to enable integrators to build effective UIs, each row may contain an `object` field that -is either `PartialObjectMetadata` (a standard object containing only ObjectMeta) or the object itself. Clients may request -this field be set by specifying `?includeObject=None|Metadata|Self` on the query parameter. - -``` -GET ...?includeObject=Metadata -{ - "kind": "Table", - "apiVersion": "meta.k8s.io/v1alpha1", - "items": [ - ... - {"cells": [...], "object": {"kind": "PartialObjectMetadata", "apiVersion":"meta.k8s.io/v1alpha1", "metadata": {"name": "pod1", "namespace": "pod2", "labels": {"a": "1"}, ...}}, - ... - ] -} -``` - -The `Metadata` value would be the default. Clients that wish to print in an advanced manner may use `Self` to get the full -object and perform arbitrary transformations. - -All fields on the server side are candidates for translation and localization changes can be delivered more -quickly and to all clients. - -Third-party resources can more easily implement `get` in this fashion - instead of web dashboards and -`kubectl` both implementing their own logic to parse a particular version of Swagger or OpenAPI, the server -component performs the transformation. The server encapsulates the details of printing. Aggregated resources -automatically provide this behavior when possible. 
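
To make the client side concrete, the following is a rough sketch of a consumer that requests the tabular form and prints only the default-priority columns. The Accept header, the `includeObject` parameter, and the priority semantics come from this proposal; the server URL, the absence of authentication, and the local struct shape are illustrative assumptions.

```go
// Sketch: fetch the server-rendered table for pods and print it.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type tableResponse struct {
	Headers []struct {
		Name     string `json:"name"`
		Priority int    `json:"priority"`
	} `json:"headers"`
	Items []struct {
		Cells []interface{} `json:"cells"`
	} `json:"items"`
}

func main() {
	req, err := http.NewRequest("GET", "https://localhost:8443/api/v1/pods?includeObject=Metadata", nil)
	if err != nil {
		panic(err)
	}
	// Ask for the server-rendered table; a real client should also inspect the
	// response Content-Type and fall back to decoding a regular PodList if the
	// server does not support the Table representation.
	req.Header.Set("Accept", "application/json+vnd.kubernetes.as+meta.k8s.io+v1alpha1+Table, application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var t tableResponse
	if err := json.NewDecoder(resp.Body).Decode(&t); err != nil {
		panic(err)
	}

	// Print only default-priority columns, mimicking plain `kubectl get`;
	// higher-priority columns correspond to wide output.
	for _, h := range t.Headers {
		if h.Priority == 0 {
			fmt.Printf("%s\t", h.Name)
		}
	}
	fmt.Println()
	for _, row := range t.Items {
		for i, cell := range row.Cells {
			if i < len(t.Headers) && t.Headers[i].Priority == 0 {
				fmt.Printf("%v\t", cell)
			}
		}
		fmt.Println()
	}
}
```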
- - -### Specific features in `kubectl get` - -Feature | Implementation ---- | --- -sort-by | Continue to implement client-side (no server side sort planned) -custom-column (jsonpath) | Implement client-side by requesting object `?includeObject=Self` and parsing -custom-column (label) | Implement client-side by getting labels from metadata returned with each row -show-kind | Implement client-side by using the discovery info associated with the object (rather than being returned by server) -template | Implement client-side, bypass receiving table output and get raw objects -watch | Request Table output via the watch endpoint -export | Implement client-side, bypass receiving table output and get exported object -wide | Server should indicate which columns are "additional" via a field on the header column - client then shows those columns if it wants to -color (proposed) | Rows which should be highlighted should have a semantic field on the row - e.g. `alert: [{type: Warning, message: "This pod has been deleted"}]`. Cells could be selected by adding an additional field `alert: [{type: Warning, ..., cells: [0, 1]}]`. - - -## Future considerations - -* When we introduce server side paging, Table would be paged similar to how PodList or other types are paged. https://issues.k8s.io/2349 -* More advanced output could in the future be provided by an external call-out or an aggregation API on the server side. -* `describe` could be managed on the server as well, with a similar generic format, and external outbound links used to reference other objects. - - -## Migration - -Old clients will continue retrieving the primary representation. Clients can begin using the optional `Accept` -header to indicate they want the simpler version, and if they receive a Table perform the new path, otherwise -fall back to client side functions. - -Server side code would reuse the existing display functions but replace TabWriter with either a structured writer -or the tabular form. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/api-machinery/synchronous-garbage-collection.md b/contributors/design-proposals/api-machinery/synchronous-garbage-collection.md index c7ab2249..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/synchronous-garbage-collection.md +++ b/contributors/design-proposals/api-machinery/synchronous-garbage-collection.md @@ -1,169 +1,6 @@ -**Table of Contents** +Design proposals have been archived. -- [Overview](#overview) -- [API Design](#api-design) - - [Standard Finalizers](#standard-finalizers) - - [OwnerReference](#ownerreference) - - [DeleteOptions](#deleteoptions) -- [Components changes](#components-changes) - - [API Server](#api-server) - - [Garbage Collector](#garbage-collector) - - [Controllers](#controllers) -- [Handling circular dependencies](#handling-circular-dependencies) -- [Unhandled cases](#unhandled-cases) -- [Implications to existing clients](#implications-to-existing-clients) +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -# Overview - -Users of the server-side garbage collection need to determine if the garbage collection is done. For example: -* Currently `kubectl delete rc` blocks until all the pods are terminating. To convert to use server-side garbage collection, kubectl has to be able to determine if the garbage collection is done. -* [#19701](https://github.com/kubernetes/kubernetes/issues/19701#issuecomment-236997077) is a use case where the user needs to wait for all service dependencies garbage collected and their names released, before she recreates the dependencies. - -We define the garbage collection as "done" when all the dependents are deleted from the key-value store, rather than merely in the terminating state. There are two reasons: *i)* for `Pod`s, the most usual garbage, only when they are deleted from the key-value store, we know kubelet has released resources they occupy; *ii)* some users need to recreate objects with the same names, they need to wait for the old objects to be deleted from the key-value store. (This limitation is because we index objects by their names in the key-value store today.) - -Synchronous Garbage Collection is a best-effort (see [unhandled cases](#unhandled-cases)) mechanism that allows user to determine if the garbage collection is done: after the API server receives a deletion request of an owning object, the object keeps existing in the key-value store until all its dependents are deleted from the key-value store by the garbage collector. - -Tracking issue: https://github.com/kubernetes/kubernetes/issues/29891 - -# API Design - -## Standard Finalizers - -We will introduce a new standard finalizer: - -```go -const GCFinalizer string = “DeletingDependents” -``` - -This finalizer indicates the object is terminating and is waiting for its dependents whose `OwnerReference.BlockOwnerDeletion` is true get deleted. - -## OwnerReference - -```go -OwnerReference { - ... - // If true, AND if the owner has the "DeletingDependents" finalizer, then the owner cannot be deleted from the key-value store until this reference is removed. - // Defaults to false. - // To set this field, a user needs "delete" permission of the owner, otherwise 422 (Unprocessable Entity) will be returned. 
- BlockOwnerDeletion *bool -} -``` - -The initial draft of the proposal did not include this field and it had a security loophole: a user who is only authorized to update one resource can set ownerReference to block the synchronous GC of other resources. Requiring users to explicitly set `BlockOwnerDeletion` allows the master to properly authorize the request. - -## DeleteOptions - -```go -DeleteOptions { - … - // Whether and how garbage collection will be performed. - // Defaults to DeletePropagationDefault - // Either this field or OrphanDependents may be set, but not both. - PropagationPolicy *DeletePropagationPolicy -} - -type DeletePropagationPolicy string - -const ( - // The default depends on the existing finalizers on the object and the type of the object. - DeletePropagationDefault DeletePropagationPolicy = "DeletePropagationDefault" - // Orphans the dependents - DeletePropagationOrphan DeletePropagationPolicy = "DeletePropagationOrphan" - // Deletes the object from the key-value store, the garbage collector will delete the dependents in the background. - DeletePropagationBackground DeletePropagationPolicy = "DeletePropagationBackground" - // The object exists in the key-value store until the garbage collector deletes all the dependents whose ownerReference.blockOwnerDeletion=true from the key-value store. - // API server will put the "DeletingDependents" finalizer on the object, and sets its deletionTimestamp. - // This policy is cascading, i.e., the dependents will be deleted with GarbageCollectionSynchronous. - DeletePropagationForeground DeletePropagationPolicy = "DeletePropagationForeground" -) -``` - -The `DeletePropagationForeground` policy represents the synchronous GC mode. - -`DeleteOptions.OrphanDependents *bool` will be marked as deprecated and will be removed in 1.7. Validation code will make sure only one of `OrphanDependents` and `PropagationPolicy` may be set. We decided not to add another `DeleteAfterDependentsDeleted *bool`, because together with `OrphanDependents`, it will result in 9 possible combinations and is thus confusing. - -The conversion rules are described in the following table: - -| 1.5 | pre 1.4/1.4 | -|------------------------------------------|--------------------------| -| DeletePropagationDefault | OrphanDependents==nil | -| DeletePropagationOrphan | *OrphanDependents==true | -| DeletePropagationBackground | *OrphanDependents==false | -| DeletePropagationForeground | N/A | - -# Components changes - -## API Server - -`Delete()` function checks `DeleteOptions.PropagationPolicy`. If the policy is `DeletePropagationForeground`, the API server will update the object instead of deleting it, add the "DeletingDependents" finalizer, remove the "OrphanDependents" finalizer if it's present, and set the `ObjectMeta.DeletionTimestamp`. - -When validating the ownerReference, API server needs to query the `Authorizer` to check if the user has "delete" permission of the owner object. It returns 422 if the user does not have the permissions but intends to set `OwnerReference.BlockOwnerDeletion` to true. - -## Garbage Collector - -**Modifications to processEvent()** - -Currently `processEvent()` manages GC's internal owner-dependency relationship graph, `uidToNode`. It updates `uidToNode` according to the Add/Update/Delete events in the cluster. To support synchronous GC, it has to: - -* handle Add or Update events where `obj.Finalizers.Has(GCFinalizer) && obj.DeletionTimestamp != nil`. The object will be added into the `dirtyQueue`. 
The object will be marked as “GC in progress” in `uidToNode`. -* Upon receiving the deletion event of an object, put its owner into the `dirtyQueue` if the owner node is marked as "GC in progress". This is to force the `processItem()` (described next) to re-check if all dependents of the owner is deleted. - -**Modifications to processItem()** - -Currently `processItem()` consumes the `dirtyQueue`, requests the API server to delete an item if all of its owners do not exist. To support synchronous GC, it has to: - -* treat an owner as "not exist" if `owner.DeletionTimestamp != nil && !owner.Finalizers.Has(OrphanFinalizer)`, otherwise synchronous GC will not progress because the owner keeps existing in the key-value store. -* when deleting dependents, if the owner's finalizers include `DeletingDependents`, it should use the `GarbageCollectionSynchronous` as GC policy. -* if an object has multiple owners, some owners still exist while other owners are in the synchronous GC stage, then according to the existing logic of GC, the object wouldn't be deleted. To unblock the synchronous GC of owners, `processItem()` has to remove the ownerReferences pointing to them. - -In addition, if an object popped from `dirtyQueue` is marked as "GC in progress", `processItem()` treats it specially: - -* To avoid racing with another controller, it requeues the object if `observedGeneration < Generation`. This is best-effort, see [unhandled cases](#unhandled-cases). -* Checks if the object has dependents - * If not, send a PUT request to remove the `GCFinalizer`; - * If so, then add all dependents to the `dirtyQueue`; we need bookkeeping to avoid adding the dependents repeatedly if the owner gets in the `synchronousGC queue` multiple times. - -## Controllers - -To utilize the synchronous garbage collection feature, controllers (e.g., the replicaset controller) need to set `OwnerReference.BlockOwnerDeletion` when creating dependent objects (e.g. pods). - -# Handling circular dependencies - -SynchronousGC will enter a deadlock in the presence of circular dependencies. The garbage collector can break the circle by lazily breaking circular dependencies: when `processItem()` processes an object, if it finds the object and all of its owners have the `GCFinalizer`, it removes the `GCFinalizer` from the object. - -Note that the approach is not rigorous and thus having false positives. For example, if a user first sends a SynchronousGC delete request for an object, then sends the delete request for its owner, then `processItem()` will be fooled to believe there is a circle. We expect user not to do this. We can make the circle detection more rigorous if needed. - -Circular dependencies are regarded as user error. If needed, we can add more guarantees to handle such cases later. - -# Unhandled cases - -* If the GC observes the owning object with the `GCFinalizer` before it observes the creation of all the dependents, GC will remove the finalizer from the owning object before all dependents are gone. Hence, synchronous GC is best-effort, though we guarantee that the dependents will be deleted eventually. We face a similar case when handling OrphanFinalizer, see [GC known issues](https://github.com/kubernetes/kubernetes/issues/26120). - -# Implications to existing clients - -Finalizer breaks an assumption that many Kubernetes components have: a deletion request with `grace period=0` will immediately remove the object from the key-value store. 
This is not true if an object has pending finalizers, the object will continue to exist, and currently the API server will not return an error in this case. - -**Namespace controller** suffered from this [problem](https://github.com/kubernetes/kubernetes/issues/32519) and was fixed in [#32524](https://github.com/kubernetes/kubernetes/pull/32524) by retrying every 15s if there are objects with pending finalizers to be removed from the key-value store. Object with pending `GCFinalizer` might take arbitrary long time be deleted, so namespace deletion might time out. - -**kubelet** deletes the pod from the key-value store after all its containers are terminated ([code](../../pkg/kubelet/status/status_manager.go#L441-L443)). It also assumes that if the API server does not return an error, the pod is removed from the key-value store. Breaking the assumption will not break `kubelet` though, because the `pod` must have already been in the terminated phase, `kubelet` will not care to manage it. - -**Node controller** forcefully deletes pod if the pod is scheduled to a node that does not exist ([code](../../pkg/controller/node/nodecontroller.go#L474)). The pod will continue to exist if it has pending finalizers. The node controller will futilely retry the deletion. Also, the `node controller` forcefully deletes pods before deleting the node ([code](../../pkg/controller/node/nodecontroller.go#L592)). If the pods have pending finalizers, the `node controller` will go ahead deleting the node, leaving those pods behind. These pods will be deleted from the key-value store when the pending finalizers are removed. - -**Podgc** deletes terminated pods if there are too many of them in the cluster. We need to make sure finalizers on Pods are taken off quickly enough so that the progress of `Podgc` is not affected. - -**Deployment controller** adopts existing `ReplicaSet` (RS) if its template matches. If a matching RS has a pending `GCFinalizer`, deployment should adopt it, take its pods into account, but shouldn't try to mutate it, because the RS controller will ignore a RS that's being deleted. Hence, `deployment controller` should wait for the RS to be deleted, and then create a new one. - -**Replication controller manager**, **Job controller**, and **ReplicaSet controller** ignore pods in terminated phase, so pods with pending finalizers will not block these controllers. - -**StatefulSet controller** will be blocked by a pod with pending finalizers, so synchronous GC might slow down its progress. - -**kubectl**: synchronous GC can simplify the **kubectl delete** reapers. Let's take the `deployment reaper` as an example, since it's the most complicated one. Currently, the reaper finds all `RS` with matching labels, scales them down, polls until `RS.Status.Replica` reaches 0, deletes the `RS`es, and finally deletes the `deployment`. If using synchronous GC, `kubectl delete deployment` is as easy as sending a synchronous GC delete request for the deployment, and polls until the deployment is deleted from the key-value store. - -Note that this **changes the behavior** of `kubectl delete`. The command will be blocked until all pods are deleted from the key-value store, instead of being blocked until pods are in the terminating state. This means `kubectl delete` blocks for longer time, but it has the benefit that the resources used by the pods are released when the `kubectl delete` returns. To allow kubectl user not waiting for the cleanup, we will add a `--wait` flag. 
It defaults to true; if it's set to `false`, `kubectl delete` will send the delete request with `PropagationPolicy=DeletePropagationBackground` and return immediately. - -To keep the new kubectl compatible with 1.4 and earlier masters, kubectl needs to fall back to the old reaper logic when it detects that the master does not support synchronous GC. - -1.4 `kubectl delete rc/rs` uses `DeleteOptions.OrphanDependents=true`, which will be converted to `DeletePropagationBackground` (see [API Design](#api-changes)) by a 1.5 master, so its behavior stays the same. - -Pre-1.4 `kubectl delete` uses `DeleteOptions.OrphanDependents=nil`, as does the 1.4 `kubectl delete` for resources other than rc and rs. That option will be converted to `DeletePropagationDefault` (see [API Design](#api-changes)) by a 1.5 master, so these commands behave the same as when working with a 1.4 master. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
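The conversion rules above can be summarized in a short, hedged sketch. Only the two cases spelled out in the text (`OrphanDependents=nil` and `OrphanDependents=true`) are mapped; the constant names are placeholders for whatever the API-design section finally settles on.

```go
package main

import "fmt"

// Placeholder names; the authoritative set of DeletePropagation values lives in
// the proposal's API-design section.
type deletePropagation string

const (
	deletePropagationDefault    deletePropagation = "DeletePropagationDefault"
	deletePropagationBackground deletePropagation = "DeletePropagationBackground"
)

// convertLegacyDeleteOptions sketches the two conversions called out above for
// a 1.5 master: OrphanDependents=nil -> Default, OrphanDependents=true ->
// Background. The false case is not covered by the quoted text, so it is left
// unmapped here.
func convertLegacyDeleteOptions(orphanDependents *bool) (deletePropagation, bool) {
	switch {
	case orphanDependents == nil:
		return deletePropagationDefault, true
	case *orphanDependents:
		return deletePropagationBackground, true
	default:
		return "", false // not specified in the quoted text
	}
}

func main() {
	t := true
	for _, in := range []*bool{nil, &t} {
		policy, ok := convertLegacyDeleteOptions(in)
		fmt.Println(policy, ok)
	}
}
```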
\ No newline at end of file diff --git a/contributors/design-proposals/api-machinery/thirdpartyresources.md b/contributors/design-proposals/api-machinery/thirdpartyresources.md index 05dfff76..f0fbec72 100644 --- a/contributors/design-proposals/api-machinery/thirdpartyresources.md +++ b/contributors/design-proposals/api-machinery/thirdpartyresources.md @@ -1,253 +1,6 @@ -# Moving ThirdPartyResources to beta +Design proposals have been archived. -## Background -There are a number of important issues with the alpha version of -ThirdPartyResources that we wish to address to move TPR to beta. The list is -tracked [here](https://github.com/kubernetes/features/issues/95), and also -includes feedback from existing Kubernetes ThirdPartyResource users. This -proposal covers the steps we believe are necessary to move TPR to beta and to -prevent future challenges in upgrading. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Goals -1. Ensure ThirdPartyResource APIs operate consistently with first party -Kubernetes APIs. -2. Enable ThirdPartyResources to specify how they will appear in API -discovery to be consistent with other resources and avoid naming conflicts -3. Move TPR into their own API group to allow the extensions group to be -[removed](https://github.com/kubernetes/kubernetes/issues/43214) -4. Support cluster scoped TPR resources -5. Identify other features required for TPR to become beta -6. Minimize the impact to alpha ThirdPartyResources consumers and define a -process for how TPR migrations / breaking changes can be accomplished (for -both the cluster and for end users) - -Non-goals -1. Solve automatic conversion of TPR between versions or automatic migration of -existing TPR - -### Desired API Semantics -TPRs are intended to look like normal kube-like resources to external clients. -In order to do that effectively, they should respect the normal get, list, -watch, create, patch, update, and delete semantics. - -In "normal" Kubernetes APIs, if I have a persisted resource in the same group -with the same name in v1 and v2, they are backed by the same underlying object. -A change made to one is reflected in the other. API clients, garbage collection, -namespace cleanup, version negotiation, and controllers all build on this. - -The convertibility of Kubernetes APIs provides a seamless interaction between -versions. A TPR does not have the ability to convert between versions, which -focuses on the primary role of TPR as an easily extensible and simple mechanism -for adding new APIs. Conversion primarily allows structural, but not backwards -incompatible, changes. By not supporting conversion, all TPR use cases are -preserved, but a large amount of complexity is avoided for consumers of TPR. - -Allowing a single, user specified version for a given TPR will provide this -semantic by preventing server-side versioning altogether. All instances of a -single TPR must have the same version or the Kubernetes API semantic of always -returning a resource encoded to the matching version will not be maintained. -Since conversions (even native Kubernetes conversions) cannot be used to handle -behavioral changes, the same effect can be achieved for TPRs client-side with -overlapping serialization changes. - - -### Avoiding Naming Problems -There are several identifiers that a Kubernetes API resource has which share -value-spaces within an API group and must not conflict. They are: -1. 
Resource-type value space - 1. plural resource-type name - like "configmaps" - 2. singular resource-type name - like "configmap" - 3. short names - like "cm" -2. Kind-type value space - for group "example.com" - 1. Kind name - like "ConfigMap" - 2. ListKind name - like "ConfigMapList" -If these values conflict within their value-spaces then no client will be able -to properly distinguish intent. - -The actual name of the TPR-registration (resource that describes the TPR to -create) resource can only protect one of these values from conflict. Since -Kubernetes API types are accessed via a URL that looks like `/apis/<group>/<version>/namespaces/<namespace-name>/<plural-resource-type>`, -the name of the TPR-registration object will be `<plural-resource-type>.<group>`. - -Conflicts with other parts of the value-space can not be detected with static -validation, so there will be a spec/status split with `status.conditions` that -reflect the acceptance status of a TPR-registration. For instance, you cannot -determine whether two TPRs in the same group have the same short name without -inspecting the current state of existing TPRs. - -Parts of the value-space will be "claimed" by making an entry in TPR.status to -include the accepted names which will be served. This prevents a new TPR from -disabling an existing TPR's name. - - -## New API -In order to: -1. eliminate opaquely derived information - deriving camel-cased kind names -from lower-case dash-delimited values as for instance. -1. allow the expression of complex transformations - not all plurals are easily -determined (ox and oxen) and not all are English. Fields for complete -specification eliminates ambiguity. -1. handle TPR-registration value-space conflicts -1. [stop using the extensions API group](https://github.com/kubernetes/kubernetes/issues/43214) - -We can create a type `ThirdPartyResource.apiextension.k8s.io`. -```go -// ThirdPartyResourceSpec describe how a user wants their resource to appear -type ThirdPartyResourceSpec struct { - // Group is the group this resource belongs in - Group string `json:"group" protobuf:"bytes,1,opt,name=group"` - // Version is the version this resource belongs in - Version string `json:"version" protobuf:"bytes,2,opt,name=version"` - // Names holds the information about the resource and kind you have chosen which is - // surfaced through discovery. - Names ThirdPartyResourceNames - - // Scope indicates whether this resource is cluster or namespace scoped. Default is namespaced - Scope ResourceScope `json:"scope" protobuf:"bytes,8,opt,name=scope,casttype=ResourceScope"` -} - -type ThirdPartyResourceNames struct { - // Plural is the plural name of the resource to serve. It must match the name of the TPR-registration - // too: plural.group - Plural string `json:"plural" protobuf:"bytes,3,opt,name=plural"` - // Singular is the singular name of the resource. Defaults to lowercased <kind> - Singular string `json:"singular,omitempty" protobuf:"bytes,4,opt,name=singular"` - // ShortNames are short names for the resource. - ShortNames []string `json:"shortNames,omitempty" protobuf:"bytes,5,opt,name=shortNames"` - // Kind is the serialized kind of the resource - Kind string `json:"kind" protobuf:"bytes,6,opt,name=kind"` - // ListKind is the serialized kind of the list for this resource. 
Defaults to <kind>List - ListKind string `json:"listKind,omitempty" protobuf:"bytes,7,opt,name=listKind"` -} - -type ResourceScope string - -const ( - ClusterScoped ResourceScope = "Cluster" - NamespaceScoped ResourceScope = "Namespaced" -) - -type ConditionStatus string - -// These are valid condition statuses. "ConditionTrue" means a resource is in the condition. -// "ConditionFalse" means a resource is not in the condition. "ConditionUnknown" means kubernetes -// can't decide if a resource is in the condition or not. In the future, we could add other -// intermediate conditions, e.g. ConditionDegraded. -const ( - ConditionTrue ConditionStatus = "True" - ConditionFalse ConditionStatus = "False" - ConditionUnknown ConditionStatus = "Unknown" -) - -// ThirdPartyResourceConditionType is a valid value for ThirdPartyResourceCondition.Type -type ThirdPartyResourceConditionType string - -const ( - // NameConflict means the resource or kind names chosen for this ThirdPartyResource conflict with others in the group. - // The first TPR in the group to have the name reflected in status "wins" the name. - NameConflict ThirdPartyResourceConditionType = "NameConflict" - // Terminating means that the ThirdPartyResource has been deleted and is cleaning up. - Terminating ThirdPartyResourceConditionType = "Terminating" -) - -// ThirdPartyResourceCondition contains details for the current condition of this ThirdPartyResource. -type ThirdPartyResourceCondition struct { - // Type is the type of the condition. - Type ThirdPartyResourceConditionType `json:"type" protobuf:"bytes,1,opt,name=type,casttype=ThirdPartyResourceConditionType"` - // Status is the status of the condition. - // Can be True, False, Unknown. - Status ConditionStatus `json:"status" protobuf:"bytes,2,opt,name=status,casttype=ConditionStatus"` - // Last time the condition transitioned from one status to another. - // +optional - LastTransitionTime metav1.Time `json:"lastTransitionTime,omitempty" protobuf:"bytes,4,opt,name=lastTransitionTime"` - // Unique, one-word, CamelCase reason for the condition's last transition. - // +optional - Reason string `json:"reason,omitempty" protobuf:"bytes,5,opt,name=reason"` - // Human-readable message indicating details about last transition. - // +optional - Message string `json:"message,omitempty" protobuf:"bytes,6,opt,name=message"` -} - -// ThirdPartyResourceStatus indicates the state of the ThirdPartyResource -type ThirdPartyResourceStatus struct { - // Conditions indicate state for particular aspects of a ThirdPartyResource - Conditions []ThirdPartyResourceCondition `json:"conditions" protobuf:"bytes,1,opt,name=conditions"` - - // AcceptedNames are the names that are actually being used to serve discovery - // They may not be the same as names in spec. - AcceptedNames ThirdPartyResourceNames -} - -// +genclient=true - -// ThirdPartyResource represents a resource that should be exposed on the API server. Its name MUST be in the format -// <.spec.plural>.<.spec.group>. -type ThirdPartyResource struct { - metav1.TypeMeta `json:",inline"` - metav1.ObjectMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"` - - // Spec describes how the user wants the resources to appear - Spec ThirdPartyResourceSpec `json:"spec,omitempty" protobuf:"bytes,2,opt,name=spec"` - // Status indicates the actual state of the ThirdPartyResource - Status ThirdPartyResourceStatus `json:"status,omitempty" protobuf:"bytes,3,opt,name=status"` -} - -// ThirdPartyResourceList is a list of ThirdPartyResource objects. 
-type ThirdPartyResourceList struct { - metav1.TypeMeta `json:",inline"` - metav1.ListMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"` - - // Items individual ThirdParties - Items []ThirdPartyResource `json:"items" protobuf:"bytes,2,rep,name=items"` -} -``` - - -## Behavior -### Create -When a new TPR is created, no synchronous action is taken. -A controller will run to confirm that value-space of the reserved names doesn't -collide and sets the "KindNameConflict" condition to `false`. - -A custom `http.Handler` will look at request and use the parsed out -GroupVersionResource information to match it to a ThirdPartyResource. The ThirdPartyResource -will be checked to make sure its valid enough in .Status to serve and will -response appropriated. If there is no ThirdPartyResource defined, it will delegate -to the next handler in the chain. - -### Delete -When a TPR-registration is deleted, it will be handled as a finalizer like a -namespace is done today. The `Terminating` condition will be updated (like -namespaces) and that will cause mutating requests to be rejected by the REST -handler (see above). The finalizer will remove all the associated storage. -Once the finalizer is done, it will delete the TPR-registration itself. - - -## Migration from existing TPR -Because of the changes required to meet the goals, there is not a silent -auto-migration from the existing TPR to the new TPR. It will be possible, but -it will be manual. At a high level, you simply: - 1. Stop all clients from writing to TPR (revoke edit rights for all users) and - stop controllers. - 2. Get all your TPR-data. - `$ kubectl get TPR --all-namespaces -o yaml > data.yaml` - 3. Delete the old TPR-data. Be sure you orphan! - `$ kubectl delete TPR --all --all-namespaces --cascade=false` - 4. Delete the old TPR-registration. - `$ kubectl delete TPR/name` - 5. Create a new TPR-registration with the same GroupVersionKind as before. - `$ kubectl create -f new_tpr.name` - 6. Recreate your new TPR-data. - `$ kubectl create -f data.yaml` - 7. Restart controllers. - -There are a couple things that you'll need to consider: - 1. Garbage collection. You may have created links that weren't respected by - the GC collector in 1.6. Since you orphaned your dependents, you'll probably - want to re-adopt them like the Kubernetes controllers do with their resources. - 2. Controllers will observe deletes. Part of this migration actually deletes - the resource. Your controller will see the delete. You ought to shut down - your TPR controller while you migrate your data. If you do this, your - controller will never see a delete. - +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
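The delegating handler described under "Create" might look roughly like the sketch below. It is an assumption-laden simplification: real request parsing, status checks (accepted names, `NameConflict`, `Terminating`), and storage access are reduced to a `tprLister` interface and naive URL splitting.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// Simplified stand-ins: a real implementation would use the apiserver's
// request-info parsing and consult the ThirdPartyResource's status before serving.
type groupVersionResource struct{ group, version, resource string }

type tprLister interface {
	// serves reports whether a ThirdPartyResource is registered for the GVR and
	// is valid enough in .Status to be served.
	serves(gvr groupVersionResource) bool
}

type tprHandler struct {
	lister tprLister
	next   http.Handler // the rest of the apiserver handler chain
}

func (h *tprHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	// Handles only the two list-style URL forms for brevity:
	//   /apis/<group>/<version>/<resource>
	//   /apis/<group>/<version>/namespaces/<namespace>/<resource>
	parts := strings.Split(strings.Trim(r.URL.Path, "/"), "/")
	if len(parts) >= 4 && parts[0] == "apis" {
		gvr := groupVersionResource{group: parts[1], version: parts[2], resource: parts[3]}
		if len(parts) >= 6 && parts[3] == "namespaces" {
			gvr.resource = parts[5]
		}
		if h.lister.serves(gvr) {
			fmt.Fprintf(w, "served by ThirdPartyResource %s.%s\n", gvr.resource, gvr.group)
			return
		}
	}
	// No matching ThirdPartyResource: delegate to the next handler in the chain.
	h.next.ServeHTTP(w, r)
}

type fakeLister struct{}

func (fakeLister) serves(gvr groupVersionResource) bool {
	return gvr.group == "example.com" && gvr.resource == "foos"
}

func main() {
	h := &tprHandler{lister: fakeLister{}, next: http.NotFoundHandler()}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest("GET", "/apis/example.com/v1/namespaces/default/foos", nil))
	fmt.Print(rec.Body.String())
}
```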
\ No newline at end of file diff --git a/contributors/design-proposals/apps/OBSOLETE_templates.md b/contributors/design-proposals/apps/OBSOLETE_templates.md index a1213830..f0fbec72 100644 --- a/contributors/design-proposals/apps/OBSOLETE_templates.md +++ b/contributors/design-proposals/apps/OBSOLETE_templates.md @@ -1,564 +1,6 @@ -# Templates+Parameterization: Repeatedly instantiating user-customized application topologies. +Design proposals have been archived. -## Motivation +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Addresses https://github.com/kubernetes/kubernetes/issues/11492 -There are two main motivators for Template functionality in Kubernetes: Controller Instantiation and Application Definition - -### Controller Instantiation - -Today the replication controller defines a PodTemplate which allows it to instantiate multiple pods with identical characteristics. -This is useful but limited. Stateful applications have a need to instantiate multiple instances of a more sophisticated topology -than just a single pod (e.g. they also need Volume definitions). A Template concept would allow a Controller to stamp out multiple -instances of a given Template definition. This capability would be immediately useful to the [StatefulSet](https://github.com/kubernetes/kubernetes/pull/18016) proposal. - -Similarly the [Service Catalog proposal](https://github.com/kubernetes/kubernetes/pull/17543) could leverage template instantiation as a mechanism for claiming service instances. - - -### Application Definition - -Kubernetes gives developers a platform on which to run images and many configuration objects to control those images, but -constructing a cohesive application made up of images and configuration objects is currently difficult. Applications -require: - -* Information sharing between images (e.g. one image provides a DB service, another consumes it) -* Configuration/tuning settings (memory sizes, queue limits) -* Unique/customizable identifiers (service names, routes) - -Application authors know which values should be tunable and what information must be shared, but there is currently no -consistent way for an application author to define that set of information so that application consumers can easily deploy -an application and make appropriate decisions about the tunable parameters the author intended to expose. - -Furthermore, even if an application author provides consumers with a set of API object definitions (e.g. a set of yaml files) -it is difficult to build a UI around those objects that would allow the deployer to modify names in one place without -potentially breaking assumed linkages to other pieces. There is also no prescriptive way to define which configuration -values are appropriate for a deployer to tune or what the parameters control. - -## Use Cases - -### Use cases for templates in general - -* Providing a full baked application experience in a single portable object that can be repeatably deployed in different environments. - * e.g. 
Wordpress deployment with separate database pod/replica controller - * Complex service/replication controller/volume topologies -* Bulk object creation -* Provide a management mechanism for deleting/uninstalling an entire set of components related to a single deployed application -* Providing a library of predefined application definitions that users can select from -* Enabling the creation of user interfaces that can guide an application deployer through the deployment process with descriptive help about the configuration value decisions they are making, and useful default values where appropriate -* Exporting a set of objects in a namespace as a template so the topology can be inspected/visualized or recreated in another environment -* Controllers that need to instantiate multiple instances of identical objects (e.g. StatefulSets). - - -### Use cases for parameters within templates - -* Share passwords between components (parameter value is provided to each component as an environment variable or as a Secret reference, with the Secret value being parameterized or produced by an [initializer](https://github.com/kubernetes/kubernetes/issues/3585)) -* Allow for simple deployment-time customization of “app” configuration via environment values or api objects, e.g. memory - tuning parameters to a MySQL image, Docker image registry prefix for image strings, pod resource requests and limits, default - scale size. -* Allow simple, declarative defaulting of parameter values and expose them to end users in an approachable way - a parameter - like “MySQL table space” can be parameterized in images as an env var - the template parameters declare the parameter, give - it a friendly name, give it a reasonable default, and informs the user what tuning options are available. -* Customization of component names to avoid collisions and ensure matched labeling (e.g. replica selector value and pod label are - user provided and in sync). -* Customize cross-component references (e.g. user provides the name of a secret that already exists in their namespace, to use in - a pod as a TLS cert). -* Provide guidance to users for parameters such as default values, descriptions, and whether or not a particular parameter value - is required or can be left blank. -* Parameterize the replica count of a deployment or [StatefulSet](https://github.com/kubernetes/kubernetes/pull/18016) -* Parameterize part of the labels and selector for a DaemonSet -* Parameterize quota/limit values for a pod -* Parameterize a secret value so a user can provide a custom password or other secret at deployment time - - -## Design Assumptions - -The goal for this proposal is a simple schema which addresses a few basic challenges: - -* Allow application authors to expose configuration knobs for application deployers, with suggested defaults and -descriptions of the purpose of each knob -* Allow application deployers to easily customize exposed values like object names while maintaining referential integrity - between dependent pieces (for example ensuring a pod's labels always match the corresponding selector definition of the service) -* Support maintaining a library of templates within Kubernetes that can be accessed and instantiated by end users -* Allow users to quickly and repeatedly deploy instances of well-defined application patterns produced by the community -* Follow established Kubernetes API patterns by defining new template related APIs which consume+return first class Kubernetes - API (and therefore json conformant) objects. 
- -We do not wish to invent a new Turing-complete templating language. There are good options available -(e.g. https://github.com/mustache/mustache) for developers who want a completely flexible and powerful solution for creating -arbitrarily complex templates with parameters, and tooling can be built around such schemes. - -This desire for simplicity also intentionally excludes template composability/embedding as a supported use case. - -Allowing templates to reference other templates presents versioning+consistency challenges along with making the template -no longer a self-contained portable object. Scenarios necessitating multiple templates can be handled in one of several -alternate ways: - -* Explicitly constructing a new template that merges the existing templates (tooling can easily be constructed to perform this - operation since the templates are first class api objects). -* Manually instantiating each template and utilizing [service linking](https://github.com/kubernetes/kubernetes/pull/17543) to share - any necessary configuration data. - -This document will also refrain from proposing server APIs or client implementations. This has been a point of debate, and it makes -more sense to focus on the template/parameter specification/syntax than to worry about the tooling that will process or manage the -template objects. However since there is a desire to at least be able to support a server side implementation, this proposal -does assume the specification will be k8s API friendly. - -## Desired characteristics - -* Fully k8s object json-compliant syntax. This allows server side apis that align with existing k8s apis to be constructed - which consume templates and existing k8s tooling to work with them. It also allows for api versioning/migration to be managed by - the existing k8s codec scheme rather than having to define/introduce a new syntax evolution mechanism. - * (Even if they are not part of the k8s core, it would still be good if a server side template processing+managing api supplied - as an ApiGroup consumed the same k8s object schema as the peer k8s apis rather than introducing a new one) -* Self-contained parameter definitions. This allows a template to be a portable object which includes metadata that describe - the inputs it expects, making it easy to wrapper a user interface around the parameterization flow. -* Object field primitive types include string, int, boolean, byte[]. The substitution scheme should support all of those types. - * complex types (struct/map/list) can be defined in terms of the available primitives, so it's preferred to avoid the complexity - of allowing for full complex-type substitution. -* Parameter metadata. Parameters should include at a minimum, information describing the purpose of the parameter, whether it is - required/optional, and a default/suggested value. Type information could also be required to enable more intelligent client interfaces. -* Template metadata. Templates should be able to include metadata describing their purpose or links to further documentation and - versioning information. Annotations on the Template's metadata field can fulfill this requirement. - - -## Proposed Implementation - -### Overview - -We began by looking at the List object which allows a user to easily group a set of objects together for easy creation via a -single CLI invocation. It also provides a portable format which requires only a single file to represent an application. 
- -From that starting point, we propose a Template API object which can encapsulate the definition of all components of an -application to be created. The application definition is encapsulated in the form of an array of API objects (identical to -List), plus a parameterization section. Components reference the parameter by name and the value of the parameter is -substituted during a processing step, prior to submitting each component to the appropriate API endpoint for creation. - -The primary capability provided is that parameter values can easily be shared between components, such as a database password -that is provided by the user once, but then attached as an environment variable to both a database pod and a web frontend pod. - -In addition, the template can be repeatedly instantiated for a consistent application deployment experience in different -namespaces or Kubernetes clusters. - -Lastly, we propose the Template API object include a “Labels” section in which the template author can define a set of labels -to be applied to all objects created from the template. This will give the template deployer an easy way to manage all the -components created from a given template. These labels will also be applied to selectors defined by Objects within the template, -allowing a combination of templates and labels to be used to scope resources within a namespace. That is, a given template -can be instantiated multiple times within the same namespace, as long as a different label value is used each for each -instantiation. The resulting objects will be independent from a replica/load-balancing perspective. - -Generation of parameter values for fields such as Secrets will be delegated to an [admission controller/initializer/finalizer](https://github.com/kubernetes/kubernetes/issues/3585) rather than being solved by the template processor. Some discussion about a generation -service is occurring [here](https://github.com/kubernetes/kubernetes/issues/12732) - -Labels to be assigned to all objects could also be generated in addition to, or instead of, allowing labels to be supplied in the -Template definition. - -### API Objects - -**Template Object** - -```go -// Template contains the inputs needed to produce a Config. -type Template struct { - unversioned.TypeMeta - kapi.ObjectMeta - - // Optional: Parameters is an array of Parameters used during the - // Template to Config transformation. - Parameters []Parameter - - // Required: A list of resources to create - Objects []runtime.Object - - // Optional: ObjectLabels is a set of labels that are applied to every - // object during the Template to Config transformation - // These labels are also be applied to selectors defined by objects in the template - ObjectLabels map[string]string -} -``` - -**Parameter Object** - -```go -// Parameter defines a name/value variable that is to be processed during -// the Template to Config transformation. -type Parameter struct { - // Required: Parameter name must be set and it can be referenced in Template - // Items using $(PARAMETER_NAME) - Name string - - // Optional: The name that will show in UI instead of parameter 'Name' - DisplayName string - - // Optional: Parameter can have description - Description string - - // Optional: Value holds the Parameter data. - // The value replaces all occurrences of the Parameter $(Name) or - // $((Name)) expression during the Template to Config transformation. 
- Value string - - // Optional: Indicates the parameter must have a non-empty value either provided by the user or provided by a default. Defaults to false. - Required bool - - // Optional: Type-value of the parameter (one of string, int, bool, or base64) - // Used by clients to provide validation of user input and guide users. - Type ParameterType -} -``` - -As seen above, parameters allow for metadata which can be fed into client implementations to display information about the -parameter's purpose and whether a value is required. In lieu of type information, two reference styles are offered: `$(PARAM)` -and `$((PARAM))`. When the single parens option is used, the result of the substitution will remain quoted. When the double -parens option is used, the result of the substitution will not be quoted. For example, given a parameter defined with a value -of "BAR", the following behavior will be observed: - -```go -somefield: "$(FOO)" -> somefield: "BAR" -somefield: "$((FOO))" -> somefield: BAR -``` - -for concatenation, the result value reflects the type of substitution (quoted or unquoted): - -```go -somefield: "prefix_$(FOO)_suffix" -> somefield: "prefix_BAR_suffix" -somefield: "prefix_$((FOO))_suffix" -> somefield: prefix_BAR_suffix -``` - -if both types of substitution exist, quoting is performed: - -```go -somefield: "prefix_$((FOO))_$(FOO)_suffix" -> somefield: "prefix_BAR_BAR_suffix" -``` - -This mechanism allows for integer/boolean values to be substituted properly. - -The value of the parameter can be explicitly defined in template. This should be considered a default value for the parameter, clients -which process templates are free to override this value based on user input. - - -**Example Template** - -Illustration of a template which defines a service and replication controller with parameters to specialized -the name of the top level objects, the number of replicas, and several environment variables defined on the -pod template. 
- -```json -{ - "kind": "Template", - "apiVersion": "v1", - "metadata": { - "name": "mongodb-ephemeral", - "annotations": { - "description": "Provides a MongoDB database service" - } - }, - "labels": { - "template": "mongodb-ephemeral-template" - }, - "objects": [ - { - "kind": "Service", - "apiVersion": "v1", - "metadata": { - "name": "$(DATABASE_SERVICE_NAME)" - }, - "spec": { - "ports": [ - { - "name": "mongo", - "protocol": "TCP", - "targetPort": 27017 - } - ], - "selector": { - "name": "$(DATABASE_SERVICE_NAME)" - } - } - }, - { - "kind": "ReplicationController", - "apiVersion": "v1", - "metadata": { - "name": "$(DATABASE_SERVICE_NAME)" - }, - "spec": { - "replicas": "$((REPLICA_COUNT))", - "selector": { - "name": "$(DATABASE_SERVICE_NAME)" - }, - "template": { - "metadata": { - "creationTimestamp": null, - "labels": { - "name": "$(DATABASE_SERVICE_NAME)" - } - }, - "spec": { - "containers": [ - { - "name": "mongodb", - "image": "docker.io/centos/mongodb-26-centos7", - "ports": [ - { - "containerPort": 27017, - "protocol": "TCP" - } - ], - "env": [ - { - "name": "MONGODB_USER", - "value": "$(MONGODB_USER)" - }, - { - "name": "MONGODB_PASSWORD", - "value": "$(MONGODB_PASSWORD)" - }, - { - "name": "MONGODB_DATABASE", - "value": "$(MONGODB_DATABASE)" - } - ] - } - ] - } - } - } - } - ], - "parameters": [ - { - "name": "DATABASE_SERVICE_NAME", - "description": "Database service name", - "value": "mongodb", - "required": true - }, - { - "name": "MONGODB_USER", - "description": "Username for MongoDB user that will be used for accessing the database", - "value": "username", - "required": true - }, - { - "name": "MONGODB_PASSWORD", - "description": "Password for the MongoDB user", - "required": true - }, - { - "name": "MONGODB_DATABASE", - "description": "Database name", - "value": "sampledb", - "required": true - }, - { - "name": "REPLICA_COUNT", - "description": "Number of mongo replicas to run", - "value": "1", - "required": true - } - ] -} -``` - -### API Endpoints - -* **/processedtemplates** - when a template is POSTed to this endpoint, all parameters in the template are processed and -substituted into appropriate locations in the object definitions. Validation is performed to ensure required parameters have -a value supplied. In addition labels defined in the template are applied to the object definitions. Finally the customized -template (still a `Template` object) is returned to the caller. (The possibility of returning a List instead has -also been discussed and will be considered for implementation). - -The client is then responsible for iterating the objects returned and POSTing them to the appropriate resource api endpoint to -create each object, if that is the desired end goal for the client. - -Performing parameter substitution on the server side has the benefit of centralizing the processing so that new clients of -k8s, such as IDEs, CI systems, Web consoles, etc, do not need to reimplement template processing or embed the k8s binary. -Instead they can invoke the k8s api directly. - -* **/templates** - the REST storage resource for storing and retrieving template objects, scoped within a namespace. - -Storing templates within k8s has the benefit of enabling template sharing and securing via the same roles/resources -that are used to provide access control to other cluster resources. It also enables sophisticated service catalog -flows in which selecting a service from a catalog results in a new instantiation of that service. 
(This is not the -only way to implement such a flow, but it does provide a useful level of integration). - -Creating a new template (POST to the /templates api endpoint) simply stores the template definition, it has no side -effects(no other objects are created). - -This resource can also support a subresource "/templates/templatename/processed". This resource would accept just a -Parameters object and would process the template stored in the cluster as "templatename". The processed result would be -returned in the same form as `/processedtemplates` - -### Workflow - -#### Template Instantiation - -Given a well-formed template, a client will - -1. Optionally set an explicit `value` for any parameter values the user wishes to explicitly set -2. Submit the new template object to the `/processedtemplates` api endpoint - -The api endpoint will then: - -1. Validate the template including confirming “required” parameters have an explicit value. -2. Walk each api object in the template. -3. Adding all labels defined in the template's ObjectLabels field. -4. For each field, check if the value matches a parameter name and if so, set the value of the field to the value of the parameter. - * Partial substitutions are accepted, such as `SOME_$(PARAM)` which would be transformed into `SOME_XXXX` where `XXXX` is the value - of the `$(PARAM)` parameter. - * If a given $(VAL) could be resolved to either a parameter or an environment variable/downward api reference, an error will be - returned. -5. Return the processed template object. (or List, depending on the choice made when this is implemented) - -The client can now either return the processed template to the user in a desired form (e.g. json or yaml), or directly iterate the -api objects within the template, invoking the appropriate object creation api endpoint for each element. (If the api returns -a List, the client would simply iterate the list to create the objects). - -The result is a consistently recreatable application configuration, including well-defined labels for grouping objects created by -the template, with end-user customizations as enabled by the template author. - -#### Template Authoring - -To aid application authors in the creation of new templates, it should be possible to export existing objects from a project -in template form. A user should be able to export all or a filtered subset of objects from a namespace, wrappered into a -Template API object. The user will still need to customize the resulting object to enable parameterization and labeling, -though sophisticated export logic could attempt to auto-parameterize well understood api fields. Such logic is not considered -in this proposal. - -#### Tooling - -As described above, templates can be instantiated by posting them to a template processing endpoint. CLI tools should -exist which can input parameter values from the user as part of the template instantiation flow. - -More sophisticated UI implementations should also guide the user through which parameters the template expects, the description -of those templates, and the collection of user provided values. - -In addition, as described above, existing objects in a namespace can be exported in template form, making it easy to recreate a -set of objects in a new namespace or a new cluster. 
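A naive sketch of the substitution step (step 4 of the instantiation workflow above), operating on the serialized form of the template's objects. It illustrates the `$(PARAM)`/`$((PARAM))` quoting rules and partial substitution, but deliberately skips the mixed-substitution rule and the parameter/downward-API collision check; it is not the proposed implementation.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// substituteParameters is intentionally naive (regexp over JSON text) and only
// meant to illustrate the quoting rules described earlier in this proposal.
func substituteParameters(doc string, params map[string]string) string {
	for name, value := range params {
		// $((NAME)) inside a quoted field: substitute and drop the quotes so
		// integers/booleans survive, e.g. "replicas": "$((REPLICA_COUNT))" -> "replicas": 1
		unquoted := regexp.MustCompile(`"([^"]*)\$\(\(` + regexp.QuoteMeta(name) + `\)\)([^"]*)"`)
		doc = unquoted.ReplaceAllString(doc, "${1}"+value+"${2}")
		// $(NAME): plain in-place substitution; quoting (if any) is preserved,
		// and partial substitutions like "prefix_$(NAME)_suffix" work too.
		doc = strings.ReplaceAll(doc, "$("+name+")", value)
	}
	return doc
}

func main() {
	in := `{"replicas": "$((REPLICA_COUNT))", "selector": {"name": "$(DATABASE_SERVICE_NAME)"}}`
	out := substituteParameters(in, map[string]string{
		"REPLICA_COUNT":         "1",
		"DATABASE_SERVICE_NAME": "mongodb",
	})
	fmt.Println(out)
	// {"replicas": 1, "selector": {"name": "mongodb"}}
}
```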
- - -## Examples - -### Example Templates - -These examples reflect the current OpenShift template schema, not the exact schema proposed in this document, however this -proposal, if accepted, provides sufficient capability to support the examples defined here, with the exception of -automatic generation of passwords. - -* [Jenkins template](https://github.com/openshift/origin/blob/master/examples/jenkins/jenkins-persistent-template.json) -* [MySQL DB service template](https://github.com/openshift/origin/blob/master/examples/db-templates/mysql-persistent-template.json) - -### Examples of OpenShift Parameter Usage - -(mapped to use cases described above) - -* [Share passwords](https://github.com/jboss-openshift/application-templates/blob/master/eap/eap64-mongodb-s2i.json#L146-L152) -* [Simple deployment-time customization of “app” configuration via environment values](https://github.com/jboss-openshift/application-templates/blob/master/eap/eap64-mongodb-s2i.json#L108-L126) (e.g. memory tuning, resource limits, etc) -* [Customization of component names with referential integrity](https://github.com/jboss-openshift/application-templates/blob/master/eap/eap64-mongodb-s2i.json#L199-L207) -* [Customize cross-component references](https://github.com/jboss-openshift/application-templates/blob/master/eap/eap64-mongodb-s2i.json#L78-L83) (e.g. user provides the name of a secret that already exists in their namespace, to use in a pod as a TLS cert) - -## Requirements analysis - -There has been some discussion of desired goals for a templating/parameterization solution [here](https://github.com/kubernetes/kubernetes/issues/11492#issuecomment-160853594). This section will attempt to address each of those points. - -*The primary goal is that parameterization should facilitate reuse of declarative configuration templates in different environments in - a "significant number" of common cases without further expansion, substitution, or other static preprocessing.* - -* This solution provides for templates that can be reused as is (assuming parameters are not used or provide sane default values) across - different environments, they are a self-contained description of a topology. - -*Parameterization should not impede the ability to use kubectl commands with concrete resource specifications.* - -* The parameterization proposal here does not extend beyond Template objects. That is both a strength and limitation of this proposal. - Parameterizable objects must be wrapped into a Template object, rather than existing on their own. - -*Parameterization should work with all kubectl commands that accept --filename, and should work on templates comprised of multiple resources.* - -* Same as above. - -*The parameterization mechanism should not prevent the ability to wrap kubectl with workflow/orchestration tools, such as Deployment manager.* - -* Since this proposal uses standard API objects, a DM or Helm flow could still be constructed around a set of templates, just as those flows are - constructed around other API objects today. - -*Any parameterization mechanism we add should not preclude the use of a different parameterization mechanism, it should be possible -to use different mechanisms for different resources, and, ideally, the transformation should be composable with other -substitution/decoration passes.* - -* This templating scheme does not preclude layering an additional templating mechanism over top of it. 
For example, it would be - possible to write a Mustache template which, after Mustache processing, resulted in a Template which could then be instantiated - through the normal template instantiating process. - -*Parameterization should not compromise reproducibility. For instance, it should be possible to manage template arguments as well as -templates under version control.* - -* Templates are a single file, including default or chosen values for parameters. They can easily be managed under version control. - -*It should be possible to specify template arguments (i.e., parameter values) declaratively, in a way that is "self-describing" -(i.e., naming the parameters and the template to which they correspond). It should be possible to write generic commands to -process templates.* - -* Parameter definitions include metadata which describes the purpose of the parameter. Since parameter definitions are part of the template, - there is no need to indicate which template they correspond to. - -*It should be possible to validate templates and template parameters, both values and the schema.* - -* Template objects are subject to standard api validation. - -*It should also be possible to validate and view the output of the substitution process.* - -* The `/processedtemplates` api returns the result of the substitution process, which is itself a Template object that can be validated. - -*It should be possible to generate forms for parameterized templates, as discussed in #4210 and #6487.* - -* Parameter definitions provide metadata that allows for the construction of form-based UIs to gather parameter values from users. - -*It shouldn't be inordinately difficult to evolve templates. Thus, strategies such as versioning and encapsulation should be -encouraged, at least by convention.* - -* Templates can be versioned via annotations on the template object. - -## Key discussion points - -The preceding document is opinionated about each of these topics, however they have been popular topics of discussion so they are called out explicitly below. - -### Where to define parameters - -There has been some discussion around where to define parameters that are being injected into a Template - -1. In a separate standalone file -2. Within the Template itself - -This proposal suggests including the parameter definitions within the Template, which provides a self-contained structure that -can be easily versioned, transported, and instantiated without risk of mismatching content. In addition, a Template can easily -be validated to confirm that all parameter references are resolveable. - -Separating the parameter definitions makes for a more complex process with respect to -* Editing a template (if/when first class editing tools are created) -* Storing/retrieving template objects with a central store - -Note that the `/templates/sometemplate/processed` subresource would accept a standalone set of parameters to be applied to `sometemplate`. - -### How to define parameters - -There has also been debate about how a parameter should be referenced from within a template. This proposal suggests that -fields to be substituted by a parameter value use the "$(parameter)" syntax which is already used elsewhere within k8s. The -value of `parameter` should be matched to a parameter with that name, and the value of the matched parameter substituted into -the field value. - -Other suggestions include a path/map approach in which a list of field paths (e.g. json path expressions) and corresponding -parameter names are provided. 
The substitution process would walk the map, replacing fields with the appropriate -parameter value. This approach makes templates more fragile from the perspective of editing/refactoring as field paths -may change, thus breaking the map. There is of course also risk of breaking references with the previous scheme, but -renaming parameters seems less likely than changing field paths. - -### Storing templates in k8s - -Openshift defines templates as a first class resource so they can be created/retrieved/etc via standard tools. This allows client tools to list available templates (available in the openshift cluster), allows existing resource security controls to be applied to templates, and generally provides a more integrated feel to templates. However there is no explicit requirement that for k8s to adopt templates, it must also adopt storing them in the cluster. - -### Processing templates (server vs. client) - -Openshift handles template processing via a server endpoint which consumes a template object from the client and returns the list of objects -produced by processing the template. It is also possible to handle the entire template processing flow via the client, but this was deemed -undesirable as it would force each client tool to reimplement template processing (e.g. the standard CLI tool, an eclipse plugin, a plugin for a CI system like Jenkins, etc). The assumption in this proposal is that server side template processing is the preferred implementation approach for -this reason. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
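For completeness, the client side of server-side processing might look like the sketch below: POST a Template object to the `/processedtemplates` endpoint and read back the processed Template. The base URL, file name, content type, and error handling are assumptions for illustration only; the proposal does not pin them down.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"os"
)

// processTemplate POSTs a serialized Template to the /processedtemplates
// endpoint and returns the processed Template as returned by the server.
func processTemplate(baseURL string, templateJSON []byte) ([]byte, error) {
	resp, err := http.Post(baseURL+"/processedtemplates", "application/json",
		bytes.NewReader(templateJSON))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusCreated {
		return nil, fmt.Errorf("processing failed: %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	// Assumed inputs: a local API proxy and the MongoDB example saved to a file.
	tmpl, err := os.ReadFile("mongodb-template.json")
	if err != nil {
		panic(err)
	}
	processed, err := processTemplate("http://localhost:8080", tmpl)
	if err != nil {
		panic(err)
	}
	// The caller would then POST each object in the processed template to its
	// own resource endpoint to actually create it.
	fmt.Println(string(processed))
}
```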
\ No newline at end of file diff --git a/contributors/design-proposals/apps/OWNERS b/contributors/design-proposals/apps/OWNERS deleted file mode 100644 index f36b2fcd..00000000 --- a/contributors/design-proposals/apps/OWNERS +++ /dev/null @@ -1,8 +0,0 @@ -# See the OWNERS docs at https://go.k8s.io/owners - -reviewers: - - sig-apps-leads -approvers: - - sig-apps-leads -labels: - - sig/apps diff --git a/contributors/design-proposals/apps/configmap.md b/contributors/design-proposals/apps/configmap.md index 55571448..f0fbec72 100644 --- a/contributors/design-proposals/apps/configmap.md +++ b/contributors/design-proposals/apps/configmap.md @@ -1,296 +1,6 @@ -# Generic Configuration Object +Design proposals have been archived. -## Abstract +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -The `ConfigMap` API resource stores data used for the configuration of -applications deployed on Kubernetes. -The main focus of this resource is to: - -* Provide dynamic distribution of configuration data to deployed applications. -* Encapsulate configuration information and simplify `Kubernetes` deployments. -* Create a flexible configuration model for `Kubernetes`. - -## Motivation - -A `Secret`-like API resource is needed to store configuration data that pods can -consume. - -Goals of this design: - -1. Describe a `ConfigMap` API resource. -2. Describe the semantics of consuming `ConfigMap` as environment variables. -3. Describe the semantics of consuming `ConfigMap` as files in a volume. - -## Use Cases - -1. As a user, I want to be able to consume configuration data as environment -variables. -2. As a user, I want to be able to consume configuration data as files in a -volume. -3. As a user, I want my view of configuration data in files to be eventually -consistent with changes to the data. - -### Consuming `ConfigMap` as Environment Variables - -A series of events for consuming `ConfigMap` as environment variables: - -1. Create a `ConfigMap` object. -2. Create a pod to consume the configuration data via environment variables. -3. The pod is scheduled onto a node. -4. The Kubelet retrieves the `ConfigMap` resource(s) referenced by the pod and -starts the container processes with the appropriate configuration data from -environment variables. - -### Consuming `ConfigMap` in Volumes - -A series of events for consuming `ConfigMap` as configuration files in a volume: - -1. Create a `ConfigMap` object. -2. Create a new pod using the `ConfigMap` via a volume plugin. -3. The pod is scheduled onto a node. -4. The Kubelet creates an instance of the volume plugin and calls its `Setup()` -method. -5. The volume plugin retrieves the `ConfigMap` resource(s) referenced by the pod -and projects the appropriate configuration data into the volume. - -### Consuming `ConfigMap` Updates - -Any long-running system has configuration that is mutated over time. Changes -made to configuration data must be made visible to pods consuming data in -volumes so that they can respond to those changes. - -The `resourceVersion` of the `ConfigMap` object will be updated by the API -server every time the object is modified. After an update, modifications will be -made visible to the consumer container: - -1. Create a `ConfigMap` object. -2. Create a new pod using the `ConfigMap` via the volume plugin. -3. The pod is scheduled onto a node. -4. During the sync loop, the Kubelet creates an instance of the volume plugin -and calls its `Setup()` method. 
-5. The volume plugin retrieves the `ConfigMap` resource(s) referenced by the pod -and projects the appropriate data into the volume. -6. The `ConfigMap` referenced by the pod is updated. -7. During the next iteration of the `syncLoop`, the Kubelet creates an instance -of the volume plugin and calls its `Setup()` method. -8. The volume plugin projects the updated data into the volume atomically. - -It is the consuming pod's responsibility to make use of the updated data once it -is made visible. - -Because environment variables cannot be updated without restarting a container, -configuration data consumed in environment variables will not be updated. - -### Advantages - -* Easy to consume in pods; consumer-agnostic -* Configuration data is persistent and versioned -* Consumers of configuration data in volumes can respond to changes in the data - -## Proposed Design - -### API Resource - -The `ConfigMap` resource will be added to the main API: - -```go -package api - -// ConfigMap holds configuration data for pods to consume. -type ConfigMap struct { - TypeMeta `json:",inline"` - ObjectMeta `json:"metadata,omitempty"` - - // Data contains the configuration data. Each key must be a valid - // DNS_SUBDOMAIN or leading dot followed by valid DNS_SUBDOMAIN. - Data map[string]string `json:"data,omitempty"` -} - -type ConfigMapList struct { - TypeMeta `json:",inline"` - ListMeta `json:"metadata,omitempty"` - - Items []ConfigMap `json:"items"` -} -``` - -A `Registry` implementation for `ConfigMap` will be added to -`pkg/registry/configmap`. - -### Environment Variables - -The `EnvVarSource` will be extended with a new selector for `ConfigMap`: - -```go -package api - -// EnvVarSource represents a source for the value of an EnvVar. -type EnvVarSource struct { - // other fields omitted - - // Selects a key of a ConfigMap. - ConfigMapKeyRef *ConfigMapKeySelector `json:"configMapKeyRef,omitempty"` -} - -// Selects a key from a ConfigMap. -type ConfigMapKeySelector struct { - // The ConfigMap to select from. - LocalObjectReference `json:",inline"` - // The key to select. - Key string `json:"key"` -} -``` - -### Volume Source - -A new `ConfigMapVolumeSource` type of volume source containing the `ConfigMap` -object will be added to the `VolumeSource` struct in the API: - -```go -package api - -type VolumeSource struct { - // other fields omitted - ConfigMap *ConfigMapVolumeSource `json:"configMap,omitempty"` -} - -// Represents a volume that holds configuration data. -type ConfigMapVolumeSource struct { - LocalObjectReference `json:",inline"` - // A list of keys to project into the volume. - // If unspecified, each key-value pair in the Data field of the - // referenced ConfigMap will be projected into the volume as a file whose name - // is the key and content is the value. - // If specified, the listed keys will be project into the specified paths, and - // unlisted keys will not be present. - Items []KeyToPath `json:"items,omitempty"` -} - -// Represents a mapping of a key to a relative path. -type KeyToPath struct { - // The name of the key to select - Key string `json:"key"` - - // The relative path name of the file to be created. - // Must not be absolute or contain the '..' path. Must be utf-8 encoded. - // The first item of the relative path must not start with '..' - Path string `json:"path"` -} -``` - -**Note:** The update logic used in the downward API volume plug-in will be -extracted and re-used in the volume plug-in for `ConfigMap`. 
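Since it is the consuming pod's responsibility to react once updated data is projected into the volume, an application might do something as simple as the following sketch: poll the mounted file and reload when its contents change. The path matches the Redis volume example below; the polling interval and reload hook are arbitrary choices for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"time"
)

// watchConfigFile polls the projected file and invokes reload() whenever the
// bytes differ from the last observed contents.
func watchConfigFile(path string, interval time.Duration, reload func([]byte)) {
	var last []byte
	for {
		data, err := os.ReadFile(path)
		if err == nil && !bytes.Equal(data, last) {
			last = data
			reload(data)
		}
		time.Sleep(interval)
	}
}

func main() {
	watchConfigFile("/mnt/config-map/etc/redis.conf", 10*time.Second, func(b []byte) {
		fmt.Printf("config changed (%d bytes); re-reading settings\n", len(b))
	})
}
```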
- -### Changes to Secret - -We will update the Secret volume plugin to have a similar API to the new -`ConfigMap` volume plugin. The secret volume plugin will also begin updating -secret content in the volume when secrets change. - -## Examples - -#### Consuming `ConfigMap` as Environment Variables - -```yaml -apiVersion: v1 -kind: ConfigMap -metadata: - name: etcd-env-config -data: - number-of-members: "1" - initial-cluster-state: new - initial-cluster-token: DUMMY_ETCD_INITIAL_CLUSTER_TOKEN - discovery-token: DUMMY_ETCD_DISCOVERY_TOKEN - discovery-url: http://etcd-discovery:2379 - etcdctl-peers: http://etcd:2379 -``` - -This pod consumes the `ConfigMap` as environment variables: - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: config-env-example -spec: - containers: - - name: etcd - image: openshift/etcd-20-centos7 - ports: - - containerPort: 2379 - protocol: TCP - - containerPort: 2380 - protocol: TCP - env: - - name: ETCD_NUM_MEMBERS - valueFrom: - configMapKeyRef: - name: etcd-env-config - key: number-of-members - - name: ETCD_INITIAL_CLUSTER_STATE - valueFrom: - configMapKeyRef: - name: etcd-env-config - key: initial-cluster-state - - name: ETCD_DISCOVERY_TOKEN - valueFrom: - configMapKeyRef: - name: etcd-env-config - key: discovery-token - - name: ETCD_DISCOVERY_URL - valueFrom: - configMapKeyRef: - name: etcd-env-config - key: discovery-url - - name: ETCDCTL_PEERS - valueFrom: - configMapKeyRef: - name: etcd-env-config - key: etcdctl-peers -``` - -#### Consuming `ConfigMap` as Volumes - -`redis-volume-config` is intended to be used as a volume containing a config -file: - -```yaml -apiVersion: v1 -kind: ConfigMap -metadata: - name: redis-volume-config -data: - redis.conf: "pidfile /var/run/redis.pid\nport 6379\ntcp-backlog 511\ndatabases 1\ntimeout 0\n" -``` - -The following pod consumes the `redis-volume-config` in a volume: - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: config-volume-example -spec: - containers: - - name: redis - image: kubernetes/redis - command: ["redis-server", "/mnt/config-map/etc/redis.conf"] - ports: - - containerPort: 6379 - volumeMounts: - - name: config-map-volume - mountPath: /mnt/config-map - volumes: - - name: config-map-volume - configMap: - name: redis-volume-config - items: - - path: "etc/redis.conf" - key: redis.conf -``` - -## Future Improvements - -In the future, we may add the ability to specify an init-container that can -watch the volume contents for updates and respond to changes when they occur. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
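From inside the container in the environment-variable example above, consumption is ordinary environment-variable access; a trivial sketch:

```go
package main

import (
	"fmt"
	"os"
)

// The ConfigMap keys arrive as ordinary environment variables (names taken from
// the etcd example above), so the process reads them like any other setting.
func main() {
	for _, name := range []string{
		"ETCD_NUM_MEMBERS",
		"ETCD_INITIAL_CLUSTER_STATE",
		"ETCD_DISCOVERY_TOKEN",
		"ETCD_DISCOVERY_URL",
		"ETCDCTL_PEERS",
	} {
		fmt.Printf("%s=%s\n", name, os.Getenv(name))
	}
}
```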
\ No newline at end of file diff --git a/contributors/design-proposals/apps/controller_history.md b/contributors/design-proposals/apps/controller_history.md index e9384379..f0fbec72 100644 --- a/contributors/design-proposals/apps/controller_history.md +++ b/contributors/design-proposals/apps/controller_history.md @@ -1,462 +1,6 @@ -# Controller History - -**Author**: kow3ns@ - -**Status**: Proposal - -## Abstract -In Kubernetes, in order to update and rollback the configuration and binary -images of controller managed Pods, users mutate DaemonSet, StatefulSet, -and Deployment Objects, and the corresponding controllers attempt to transition -the current state of the system to the new declared target state. - -To facilitate update and rollback for these controllers, and to provide a -primitive that third-party controllers can build on, we propose a mechanism -that allows controllers to manage a bounded history of revisions to the declared -target state of their generated Objects. - -## Affected Components - -1. API Machinery -1. API Server -1. Kubectl -1. Controllers that utilize the feature - -## Requirements - -1. History is a collection of points in time, and each point in time must be -represented by its own Object. While it is tempting to aggregate all of an -Object's history into a single container Object, experience with Borg and Mesos -has taught us that this inevitably leads to exhausting the single Object size -limit of the system's storage backend. -1. We must be able to select the Objects that contain point in time snapshots -of versions of an Object to reconstruct the Object's history. -1. History respects causality. The Object type used to store point in time -snapshots must be strictly ordered with respect to creation. CreationTimestamp -should not be used, as this is susceptible to clock skew. -1. History must not be revisionist. Once an Object corresponding to a version -of a controllers target state is created, it can not be mutated. -1. Controller history requires only current events. Storing an exhaustive -history of all revisions to all controllers is out of scope for our purposes, -and it can be solved by applying a version control system to manifests. Internal -revision history must only store revisions to the controller's target state that -correspond to live Objects and (potentially) a small, configurable number of -prior revisions. -1. History is scale invariant. A revision to a controller is a modification -that changes the specification of the Objects it generates. Changing the -cardinality of those Objects is a scaling operation and does not constitute a -revision. - -## Terminology -The following terminology is used throughout the rest of this proposal. We -make its meaning explicit here. -- The specification type of a controller is the type that contains the -specification for the Objects generated by the controller. - - For example, the specification types for the ReplicaSet, DaemonSet, - and StatefulSet controllers are ReplicaSetSpec, DaemonSetSpec, - and StatefulSetSpec respectively. -- The generated type(s) for a controller is/are the type of the Object(s) -generated by the controller. - - Pod is a generated type for the ReplicaSet, DaemonSet, and StatefulSet - controllers. - - PersistentVolumeClaim is also a generated type for the StatefulSet - controller. -- The current state of a controller is the union of the states of its generated -Objects along with its status. 
- - For ReplicaSet, DaemonSet, and StatefulSet, the current state of the - corresponding controllers can be derived from Pods they contain and the - ReplicasSetStatus, DaemonSetStatus, and StatefulSetStatus objects - respectively. -- For all specification type Objects for controllers, the target state is the -set of fields in the Object that determine the state to which the controller -attempts to evolve the system. - - This may not necessarily be all fields of the Object. - - For example, for the StatefulSet controller `.Spec.Template`, - `.Spec.Replicas`, and `.Spec.VolumeClaims` determine the target state. The - controller "wants" to create `.Spec.Replicas` Pods generated from - `.Spec.Template` and `.Spec.VolumeClaims`. -- The target Object state is the subset of the target state necessary to create -Objects of the generated type(s). - - To make this concrete, for the StatefulSet controller `.Spec.Template` - and `.Spec.VolumeClaims` are the target Object state. This is enough - information for the controller to generate Pods and corresponding PVCs. -- If a version of the target Object state was used to generate an Object that -has not yet been deleted, we refer to the version, and any snapshots of the -version, as live. - -## API Objects - -Kubernetes controllers already persist their current and target states to the -API Server. In order to maintain a history of revisions to specification type -Objects, we only need to persist snapshots of the target Object states -contained in the specification type when they are revised. - -One approach would be to, for every specification type, have a -corresponding History type. For example, we could introduce a StatefulSetHistory -object that aggregates a PodTemplateSpec and a slice of PersistentVolumeClaims. -The StatefulSet controller could use this object to store point in time -snapshots of versions of StatefulSetSpecs. However, this requires that we -introduce a new History Kind for all current and future controllers. It has the -benefit of type safety, but, for this benefit, we trade generality. - -Another approach would be to use PodTemplate objects. This mechanisms provides -the desired generality, but it only provides for the recording of versions of -PodTemplateSpecs (e.g. For StatefulSet, we can not use PodTemplates to -record revisions to PersistentVolumeClaims). Also, it introduces the potential -for overlapping histories for two Objects of different Kinds, with the same -`.Name` in the same Namespace. Lastly, it constrains the PodTemplate Kind from -evolving to fulfill its original intention. - -We propose an approach that has analogs with the approach taken by the -[Mesos](http://mesos.apache.org/) community. Mesos frameworks, which are in some -ways like Kubernetes controllers, are responsible for check pointing, -persisting, and recovering their own state. This problem is so common that -Mesos provides a ["State Abstraction"](https://github.com/apache/mesos/blob/master/include/mesos/state/state.hpp) -that allows frameworks to persist their state in either ZooKeeper or the -Mesos Replicate Log (A Multi-Paxos based state machine used by the Mesos -Masters). This State Abstraction is a mutable, durable dictionary where keys -and values are opaque strings. As controllers only need the capability to -persist an immutable point in time snapshot of target Object states to -implement a revision history, we propose to use the ControllerRevision object -for this purpose. 
- -``` golang -// ControllerRevision implements an immutable snapshot of state data. Clients -// are responsible for serializing and deserializing the objects that contain -// their internal state. -// Once a ControllerRevision has been successfully created, it can not be updated. -// The API Server will fail validation of all requests that attempt to mutate -// the Data field. ControllerRevisions may, however, be deleted. -type ControllerRevision struct { - metav1.TypeMeta - // +optional - metav1.ObjectMeta - // Data contains the serialized state. - Data runtime.RawExtension - // Revision indicates the revision of the state represented by Data. - Revision int64 -} -``` - -## API Server -The API Server must support the creation and deletion of ControllerRevision -objects. As we have no mechanism for declarative immutability, the API server -must fail any update request that updates the `.Data` field of a -ControllerRevision Object. - -## Controllers -This section is presented as a generalization of how an arbitrary controller -can use ControllerRevision to persist a history of revisions to its -specification type Objects. The technique is applicable, without loss of -generality, to the existing Kubernetes controllers that have Pod as a generated -type. - -When a controller detects a revision to the target Object state of a -specification type Object it will do the following. - -1. The controller will [create a snapshot](#version-snapshot-creation) of the -current target Object state. -1. The controller will [reconstruct the history](#history-reconstruction) of -revisions to the Object's target Object state. -1. The controller will test the current target Object state for -[equivalence](#version-equivalence) with all other versions in the Object's -revision history. - - If the current version is semantically equivalent to its immediate - predecessor no update to the Object's target state has been performed. - - If the current version is equivalent to a version prior to its immediate - predecessor, this indicates a rollback. - - If the current version is not equivalent to any prior version, this - indicates an update or a roll forward. - - Controllers should use their status objects for book keeping with respect - to current and prior revisions. -1. The controller will -[reconcile its generated Objects](#target-object-state-reconciliation) -with the new target Object state. -1. The controller will [maintain the length of its history](#history-maintenance) -to be less than the configured limit. - -### Version Snapshot Creation -To take a snapshot of the target Object state contained in a specification type -Object, a controller will do the following. - -1. The controller will serialize all the Object's target object state and store -the serialized representation in the ControllerRevision's `.Data`. -1. The controller will store a unique, monotonically increasing -[revision number](#revision-number-selection) in the Revision field. -1. The controller will compute the [hash](#hashing) of the -ControllerRevision's `.Data`. -1. The controller will attach a label to the ControllerRevision so that it is -selectable with a low probability of overlap. - - ControllerRefs will be used as the authoritative test for ownership. - - The specification type Object's `.Selector` should be used where - applicable. - - Alternatively, a Kind unique label may be set to the `.Name` of the - specification type Object. -1. 
The controller will add a ControllerRef indicating the specification type -Object as the owner of the ControllerRevision in the ControllerRevision's -`.OwnerReferences`. -1. The controller will use the hash from above, along with a user identifiable -prefix, to [generate a unique `.Name`](#unique-name-generation) for the -ControllerRevision. - - The controller should, where possible, use the `.Name` of the - specification type Object. -1. The controller will persist the ControllerRevision via the API Server. - - Note that, in practice, creation occurs concurrently with - [collision resolution](#collision-resolution). - -### Revision Number Selection -We propose two methods for selecting the `.Revision` used to order a -specification type Object's revision history. - -1. Set the `.Revision` field to the `.Generation` field. - - This approach has the benefit of leveraging the existing monotonically - increasing sequence generated by `.Generation` field. - - The downside of this approach is that history will not survive the - destruction of an Object. -1. Use an approach analogous to Deployment. - 1. Reconstruct the Object's revision history. - 1. If the history is empty, use a `.Revision` of `0`. - 1. If the history is not empty, set the `.Revision` to a value greater than - the maximum value of all previous `.Revisions`. - -### History Reconstruction -To reconstruct the history of a specification type Object, a controller will do -the following. - -1. Select all ControllerRevision Objects labeled as described -[above](#version-snapshot-creation). -1. Filter any ControllerRevisions that do not have a ControllerRef in their -`.OwnerReferences` indicating ownership by the Object. -1. Sort the ControllerRevisions by the `.Revision` field. -1. This produces a strictly ordered set of ControllerRevisions that comprises -the ordered revision history of the specification type Object. - -### History Maintenance -Controllers should be configured, either globally or on a per specification type -Object basis, to have a `RevisionHistoryLimit`. This field will indicate the -number of non-live revisions the controller should maintain in its history -for each specification type Object. Every time a controller observes a -specification type Object it will do the following. - -1. The controller will -[reconstruct the Object's revision history](#history-reconstruction). - - Note that the process of reconstructing the Object's history filters any - ControllerRevisions not owned by the Object. -1. The controller will filter any ControllerRevisions that represent a live -version. -1. If the number of remaining ControllerRevisions is greater than the configured -`RevisionHistoryLimit`, the controller will delete them, in order with respect -to the value mapped to their `.Revisions`, until the number -of remaining ControllerRevisions is equal to the `RevisionHistoryLimit`. - -This ensures that the number of recorded, non-live revisions is less than or -equal to the configured `RevisionHistoryLimit`. - -### Version Tracking -Controllers must track the version of the target Object state that corresponds -to their generated Objects. This information is necessary to determine which -versions are live, and to track which Objects need to be updated during a -target state update or rollback. We propose two methods that controllers may -use to track live versions and their association with generated Objects. - -1. The most straightforward method is labeling. 
In this method the generated -Objects are labeled with the `.Name` of the ControllerRevision object that -corresponds to the version of the target Object state that was used to generate -them. As we have taken care to ensure the uniqueness of the `.Names` of the -ControllerRevisions, this approach is reasonable. - - A revision is considered to be live while any generated Object labeled - with its `.Name` is live. - - This method has the benefit of providing visibility, via the label, to - users with respect to the historical provenance of a generated Object. - - The primary drawback is the lack of support for using garbage collection - to ensure that only non-live version snapshots are collected. -1. Controllers may also use the `OwnerReferences` field of the -ControllerRevision to record all Objects that are generated from target Object -state version represented by the ControllerRevision as its owners. - - A revision is considered to be live while any generated Object that owns - it is live. - - This method allows for the implementation of generic garbage collection. - - The primary drawback of this method is that the book keeping is complex, - and deciding if a generated Object corresponds to a particular revision - will require testing each Object for membership in the `OwnerReferences` - of all ControllerRevisions. - -Note that, since we are labeling the generated Objects to indicate their -provenance with respect to the version of the controller's target Object state, -we are susceptible to downstream mutations by other controllers changing the -controller's product. The best we can do is guarantee that our product meets -the specification at the time of creation. If a third-party mutates the product -downstream (as long as it does so in a consistent and intentional way), we -don't want to recall it and make it conform to the original specification. This -would cause the controllers to "fight" indefinitely. - -At the cost of the complexity of implementing both labeling and ownership, -controllers may use a combination of both approaches to mitigate the -deficiencies of each. - -### Version Equivalence -When the target Object state of a specification type Object is revised, we wish -to minimize the number of mutations to generated Objects as the controller seeks -to conform the system to its target state. That is, if a generated Object -already conforms to the revised target Object state, it is imperative that we -do not mutate it. - -Failure to implement this correctly could result in the simultaneous rolling -restart of every Pod in every StatefulSet and DaemonSet in the system when -additions are made to PodTemplateSpec during a master upgrade. It is therefore -necessary to determine if the current target Object state is equivalent to a -prior version. - -Since we [track the version of](#version-tracking) of generated Objects, this -reduces to deciding if the version of the target Object state associated with -the generated Object is equivalent to the current target Object state. -Even though [hashing](#hashing) is used to generate the `.Name` of the -ControllerRevisions used to encapsulate versions of the target Object state, as -we do not require cryptographically strong collision resistance, and given we -use a [collision resolution](#collision-resolution) technique, we can't use the -[generated names](#unique-name-generation) of ControllerRevisions to decide -equality. 
- -We propose that two ControllerRevisions can be considered equal if their -`.Data` is equivalent, but that it is not sufficient to compare the serialized -representation of their `.Data`. Consider that the addition of new fields -to the Objects that represent the target Object state may cause the serialized -representation of those Objects to be unequal even when they are semantically -equivalent. - -The controller should deserialize the values of the ControllerRevisions -representing their target Object state and perform a deep, semantic equality -test. Here all differences that do not constitute a mutation to the target -Object state is disregarded during the equivalence test. - -### Target Object State Reconciliation -There are three ways for a controller to reconcile a generated Object with the -declared target Object state. - -1. If the target Object state is [equivalent](#version-equivalence) to the -target Object state associated with the generated Object, the controller will -update the associated [version tracking information](#version-tracking). -1. If the Object can be updated in place to reconcile its state with the -current target Object state, a controller may update the Object in place -provided that the associated version tracking information is updated as well. -1. Otherwise, the controller must destroy the Object and recreate it from the -current target Object state. - -### Kubernetes Upgrades -During the upgrade process from a version of Kubernetes that does not support -controller history to a version that does, controllers that implement history -based update mechanisms may find that they have specification type Objects with -no history and with generated Objects. For instance, a StatefulSet may exist -with several Pods and no history. We defer requirements for handling history -initialization to the individual proposals pertaining to those controller's -update mechanisms. However, implementors should take note of the following. - -1. If the history of an Object is not initialized, controllers should -continue to (re)create generated Objects based on the current target Object -state. -1. The history should be initialized on the first mutation to the specification -type Object for which the history will be generated. -1. After the history has been initialized, any generated Objects that have no -indication of the revision from which they were generated may be treated as if -they have a nil revision. That is, without respect to the method of -[version tracking](#version-tracking) used, the generated Objects may be -treated as if they have a version that corresponds to no revision, and the -controller may proceed to -[reconcile their state](#target-object-state-reconciliation) as appropriate to -the internal implementation. - -## Kubectl - -Modifications to kubectl to leverage controller history are an optional -extension. Users can trigger rolling updates and rollbacks by modifying their -manifests and using `kubectl apply`. Controllers will be able to detect -revisions to their target Object state and perform -[reconciliation](#target-object-state-reconciliation) as necessary. - -### Viewing History - -Users can view a controller's revision history with the following command. - -```bash -> kubectl rollout history -``` - -To view the details of the revision indicated by `<revision>`. Users can use -the following command. 
- -```bash -> kubectl rollout history --revision <revision> -``` - -### Rollback - -For future work, `kubectl rollout undo` can be implemented in the general case -as an extension of the [above](#viewing-history ). - -```bash -> kubectl rollout undo -``` - -Here `kubectl undo` simply uses strategic merge patch to apply the state -contained at a particular revision. - -## Tests - -1. Controllers can create a ControllerRevision containing a revision of their -target Object state. -1. Controllers can reconstruct their revision history. -1. Controllers can't update a ControllerRevision's `.Data`. -1. Controllers can delete a ControllerRevision to maintain their history with -respect to the configured `RevisionHistoryLimit`. - -## Appendix - -### Hashing -We will require a CRHF (collision resistant hash function), but, as we expect -no adversaries, such a function need not be resistant to pre-image and -secondary pre-image attacks. -As the property of interest is primarily collision resistance, and as we -provide a method of [collision resolution](#collision-resolution), both -cryptographically strong functions, such as Secure Hash Algorithm 2 (SHA-2), -and non-cryptographic functions, such as Fowler-Noll-Vo (FNV) are applicable. - -### Collision Resolution -As the function selected for hashing may not be cryptographically strong and may -produce collisions, we need a method for collision resolution. To demonstrate -its feasibility, we construct such a scheme here. However, this proposal does -not mandate its use. - -Given a hash function with output size `HashSize` defined -as `func H(s string) [HashSize] byte`, in order to resolve collisions we -define a new function `func H'(s string, n int) [HashSize]byte` where `H'` -returns the result of invoking `H` on the concatenation of `s` with the string -value of `n`. We define a third function -`func H''(s string, exists func (string) bool)(int,[HashSize]byte)`. `H''` -will start with `n := 0` and compute `s' := H'(s,n)`, incrementing `n` when -`exists(s')` returns true, until `exists(s')` returns false. After this it will -return `n,s'`. - -For our purposes, the implementation of the `exists` function will attempt to -create a `.Named` ControllerRevision via the API Server using a -[unique name generation](#unique-name-generation). If creation fails, due to a -conflict, the method returns false. - -### Unique Name Generation -We can use our [hash function](#hashing) and -[collision resolution](#collision-resolution) scheme to generate a system -wide unique identifier for an Object based on a deterministic non-unique prefix -and a serialized representation of the Object. Kubernetes Object's `.Name` -fields must conform to a DNS subdomain. Therefore, the total length of the -unique identifier must not exceed 255, and in practice 253, characters. We can -generate a unique identifier that meets this constraint by selecting a hash -function such that the output length is equal to `253-len(prefix)` and applying -our [hash](#hashing) function and [collision-resolution](#collision-resolution) -scheme to the serialized representation of the Object's data. The unique hash -and integer can be combined to produce a unique suffix for the Object's `.Name`. - -1. We must also ensure that unique name does not contain any bad words. -1. We may also wish to spend additional characters to prettify the generated -name for readability. +Design proposals have been archived. 
+To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/).
+
+Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
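The Appendix above defines the hashing and collision-resolution scheme (H, H', H'') only abstractly. The Go sketch below is one possible shape of it and is illustrative rather than normative: FNV-1a, the 32-bit width, and the name layout are assumptions, and the `exists` callback stands in for the attempted ControllerRevision create that fails on a name conflict.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// H hashes a serialized target Object state. FNV-1a is one of the
// non-cryptographic functions the Appendix lists as acceptable; the 32-bit
// width is only for brevity here.
func H(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// HPrime is H applied to the concatenation of s with the string value of n,
// matching the H'(s, n) definition above.
func HPrime(s string, n int) uint32 {
	return H(fmt.Sprintf("%s%d", s, n))
}

// uniqueName plays the role of H'': it probes candidate names until the
// exists callback reports a free one. In the proposal, exists corresponds to
// attempting to create the ControllerRevision and treating a conflict as taken.
func uniqueName(prefix, serializedState string, exists func(string) bool) string {
	for n := 0; ; n++ {
		candidate := fmt.Sprintf("%s-%x", prefix, HPrime(serializedState, n))
		if !exists(candidate) {
			return candidate
		}
	}
}

func main() {
	taken := map[string]bool{}                  // stands in for names already present in the API server
	state := `{"spec":{"template":"v1"}}`       // illustrative serialized target Object state
	fmt.Println(uniqueName("web", state, func(name string) bool { return taken[name] }))
}
```

A real implementation would size the hash output against the 253-character name limit described above rather than using a fixed 32-bit hash.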
\ No newline at end of file diff --git a/contributors/design-proposals/apps/cronjob.md b/contributors/design-proposals/apps/cronjob.md index bbc50f96..f0fbec72 100644 --- a/contributors/design-proposals/apps/cronjob.md +++ b/contributors/design-proposals/apps/cronjob.md @@ -1,335 +1,6 @@ -# CronJob Controller (previously ScheduledJob) +Design proposals have been archived. -## Abstract +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -A proposal for implementing a new controller - CronJob controller - which -will be responsible for managing time based jobs, namely: -* once at a specified point in time, -* repeatedly at a specified point in time. -There is already a discussion regarding this subject: -* Distributed CRON jobs [#2156](https://issues.k8s.io/2156) - -There are also similar solutions available, already: -* [Mesos Chronos](https://github.com/mesos/chronos) -* [Quartz](http://quartz-scheduler.org/) - - -## Use Cases - -1. Be able to schedule a job execution at a given point in time. -1. Be able to create a periodic job, e.g. database backup, sending emails. - - -## Motivation - -CronJobs are needed for performing all time-related actions, namely backups, -report generation and the like. Each of these tasks should be allowed to run -repeatedly (once a day/month, etc.) or once at a given point in time. - - -## Design Overview - -Users create a CronJob object. One CronJob object -is like one line of a crontab file. It has a schedule of when to run, -in [Cron](https://en.wikipedia.org/wiki/Cron) format. - - -The CronJob controller creates a Job object [Job](job.md) -about once per execution time of the schedule (e.g. once per -day for a daily schedule.) We say "about" because there are certain -circumstances where two jobs might be created, or no job might be -created. We attempt to make these rare, but do not completely prevent -them. Therefore, Jobs should be idempotent. - -The Job object is responsible for any retrying of Pods, and any parallelism -among pods it creates, and determining the success or failure of the set of -pods. The CronJob does not examine pods at all. - - -### CronJob resource - -The new `CronJob` object will have the following contents: - -```go -// CronJob represents the configuration of a single cron job. -type CronJob struct { - TypeMeta - ObjectMeta - - // Spec is a structure defining the expected behavior of a job, including the schedule. - Spec CronJobSpec - - // Status is a structure describing current status of a job. - Status CronJobStatus -} - -// CronJobList is a collection of cron jobs. -type CronJobList struct { - TypeMeta - ListMeta - - Items []CronJob -} -``` - -The `CronJobSpec` structure is defined to contain all the information how the actual -job execution will look like, including the `JobSpec` from [Job API](job.md) -and the schedule in [Cron](https://en.wikipedia.org/wiki/Cron) format. This implies -that each CronJob execution will be created from the JobSpec actual at a point -in time when the execution will be started. This also implies that any changes -to CronJobSpec will be applied upon subsequent execution of a job. - -```go -// CronJobSpec describes how the job execution will look like and when it will actually run. -type CronJobSpec struct { - - // Schedule contains the schedule in Cron format, see https://en.wikipedia.org/wiki/Cron. 
- Schedule string - - // Optional deadline in seconds for starting the job if it misses scheduled - // time for any reason. Missed jobs executions will be counted as failed ones. - StartingDeadlineSeconds *int64 - - // ConcurrencyPolicy specifies how to treat concurrent executions of a Job. - ConcurrencyPolicy ConcurrencyPolicy - - // Suspend flag tells the controller to suspend subsequent executions, it does - // not apply to already started executions. Defaults to false. - Suspend bool - - // JobTemplate is the object that describes the job that will be created when - // executing a CronJob. - JobTemplate *JobTemplateSpec -} - -// JobTemplateSpec describes of the Job that will be created when executing -// a CronJob, including its standard metadata. -type JobTemplateSpec struct { - ObjectMeta - - // Specification of the desired behavior of the job. - Spec JobSpec -} - -// ConcurrencyPolicy describes how the job will be handled. -// Only one of the following concurrent policies may be specified. -// If none of the following policies is specified, the default one -// is AllowConcurrent. -type ConcurrencyPolicy string - -const ( - // AllowConcurrent allows CronJobs to run concurrently. - AllowConcurrent ConcurrencyPolicy = "Allow" - - // ForbidConcurrent forbids concurrent runs, skipping next run if previous - // hasn't finished yet. - ForbidConcurrent ConcurrencyPolicy = "Forbid" - - // ReplaceConcurrent cancels currently running job and replaces it with a new one. - ReplaceConcurrent ConcurrencyPolicy = "Replace" -) -``` - -`CronJobStatus` structure is defined to contain information about cron -job executions. The structure holds a list of currently running job instances -and additional information about overall successful and unsuccessful job executions. - -```go -// CronJobStatus represents the current state of a Job. -type CronJobStatus struct { - // Active holds pointers to currently running jobs. - Active []ObjectReference - - // Successful tracks the overall amount of successful completions of this job. - Successful int64 - - // Failed tracks the overall amount of failures of this job. - Failed int64 - - // LastScheduleTime keeps information of when was the last time the job was successfully scheduled. - LastScheduleTime Time -} -``` - -Users must use a generated selector for the job. - -## Modifications to Job resource - -TODO for beta: forbid manual selector since that could cause confusing between -subsequent jobs. - -### Running CronJobs using kubectl - -A user should be able to easily start a Scheduled Job using `kubectl` (similarly -to running regular jobs). For example to run a job with a specified schedule, -a user should be able to type something simple like: - -``` -kubectl run pi --image=perl --restart=OnFailure --runAt="0 14 21 7 *" -- perl -Mbignum=bpi -wle 'print bpi(2000)' -``` - -In the above example: - -* `--restart=OnFailure` implies creating a job instead of replicationController. -* `--runAt="0 14 21 7 *"` implies the schedule with which the job should be run, here - July 21, 2pm. This value will be validated according to the same rules which - apply to `.spec.schedule`. - -## Fields Added to Job Template - -When the controller creates a Job from the JobTemplateSpec in the CronJob, it -adds the following fields to the Job: - -- a name, based on the CronJob's name, but with a suffix to distinguish - multiple executions, which may overlap. 
-- the standard created-by annotation on the Job, pointing to the SJ that created it - The standard key is `kubernetes.io/created-by`. The value is a serialized JSON object, like - `{ "kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"CronJob","namespace":"default",` - `"name":"nightly-earnings-report","uid":"5ef034e0-1890-11e6-8935-42010af0003e","apiVersion":...` - This serialization contains the UID of the parent. This is used to match the Job to the SJ that created - it. - -## Updates to CronJobs - -If the schedule is updated on a CronJob, it will: -- continue to use the Status.Active list of jobs to detect conflicts. -- try to fulfill all recently-passed times for the new schedule, by starting - new jobs. But it will not try to fulfill times prior to the - Status.LastScheduledTime. - - Example: If you have a schedule to run every 30 minutes, and change that to hourly, then the previously started - top-of-the-hour run, in Status.Active, will be seen and no new job started. - - Example: If you have a schedule to run every hour, change that to 30-minutely, at 31 minutes past the hour, - one run will be started immediately for the starting time that has just passed. - -If the job template of a CronJob is updated, then future executions use the new template -but old ones still satisfy the schedule and are not re-run just because the template changed. - -If you delete and replace a CronJob with one of the same name, it will: -- not use any old Status.Active, and not consider any existing running or terminated jobs from the previous - CronJob (with a different UID) at all when determining conflicts, what needs to be started, etc. -- If there is an existing Job with the same time-based hash in its name (see below), then - new instances of that job will not be able to be created. So, delete it if you want to re-run. -with the same name as conflicts. -- not "re-run" jobs for "start times" before the creation time of the new CronJobJob object. -- not consider executions from the previous UID when making decisions about what executions to - start, or status, etc. -- lose the history of the old SJ. - -To preserve status, you can suspend the old one, and make one with a new name, or make a note of the old status. - - -## Fault-Tolerance - -### Starting Jobs in the face of controller failures - -If the process with the cronJob controller in it fails, -and takes a while to restart, the cronJob controller -may miss the time window and it is too late to start a job. - -With a single cronJob controller process, we cannot give -very strong assurances about not missing starting jobs. - -With a suggested HA configuration, there are multiple controller -processes, and they use master election to determine which one -is active at any time. - -If the Job's StartingDeadlineSeconds is long enough, and the -lease for the master lock is short enough, and other controller -processes are running, then a Job will be started. - -TODO: consider hard-coding the minimum StartingDeadlineSeconds -at say 1 minute. Then we can offer a clearer guarantee, -assuming we know what the setting of the lock lease duration is. - -### Ensuring jobs are run at most once - -There are three problems here: - -- ensure at most one Job created per "start time" of a schedule. -- ensure that at most one Pod is created per Job -- ensure at most one container start occurs per Pod - -#### Ensuring one Job - -Multiple jobs might be created in the following sequence: - -1. 
cron job controller sends request to start Job J1 to fulfill start time T. -1. the create request is accepted by the apiserver and enqueued but not yet written to etcd. -1. cron job controller crashes -1. new cron job controller starts, and lists the existing jobs, and does not see one created. -1. it creates a new one. -1. the first one eventually gets written to etcd. -1. there are now two jobs for the same start time. - -We can solve this in several ways: - -1. with three-phase protocol, e.g.: - 1. controller creates a "suspended" job. - 1. controller writes an annotation in the SJ saying that it created a job for this time. - 1. controller unsuspends that job. -1. by picking a deterministic name, so that at most one object create can succeed. - -#### Ensuring one Pod - -Job object does not currently have a way to ask for this. -Even if it did, controller is not written to support it. -Same problem as above. - -#### Ensuring one container invocation per Pod - -Kubelet is not written to ensure at-most-one-container-start per pod. - -#### Decision - -This is too hard to do for the alpha version. We will await user -feedback to see if the "at most once" property is needed in the beta version. - -This is awkward but possible for a containerized application ensure on it own, as it needs -to know what CronJob name and Start Time it is from, and then record the attempt -in a shared storage system. We should ensure it could extract this data from its annotations -using the downward API. - -## Name of Jobs - -A CronJob creates one Job at each time when a Job should run. -Since there may be concurrent jobs, and since we might want to keep failed -non-overlapping Jobs around as a debugging record, each Job created by the same CronJob -needs a distinct name. - -To make the Jobs from the same CronJob distinct, we could use a random string, -in the way that pods have a `generateName`. For example, a cronJob named `nightly-earnings-report` -in namespace `ns1` might create a job `nightly-earnings-report-3m4d3`, and later create -a job called `nightly-earnings-report-6k7ts`. This is consistent with pods, but -does not give the user much information. - -Alternatively, we can use time as a uniquifier. For example, the same cronJob could -create a job called `nightly-earnings-report-2016-May-19`. -However, for Jobs that run more than once per day, we would need to represent -time as well as date. Standard date formats (e.g. RFC 3339) use colons for time. -Kubernetes names cannot include time. Using a non-standard date format without colons -will annoy some users. - -Also, date strings are much longer than random suffixes, which means that -the pods will also have long names, and that we are more likely to exceed the -253 character name limit when combining the cron-job name, -the time suffix, and pod random suffix. - -One option would be to compute a hash of the nominal start time of the job, -and use that as a suffix. This would not provide the user with an indication -of the start time, but it would prevent creation of the same execution -by two instances (replicated or restarting) of the controller process. - -We chose to use the hashed-date suffix approach. - -## Manually triggering CronJobs - -A user may wish to manually trigger a CronJob for some reason (see [#47538](http://issues.k8s.io/47538)), such as testing it prior to its scheduled time. 
This could be made possible via an `/instantiate` subresource in the API, which, when POSTed to, would immediately spawn a Job from the JobSpec contained within the CronJob.
-
-## Future evolution
-
-Below are the possible future extensions to the CronJob controller:
-* Be able to specify a workflow template in the `.spec` field. This relates to the work
-  happening in [#18827](https://issues.k8s.io/18827).
-* Be able to specify a more general template in the `.spec` field, to create arbitrary
-  types of resources. This relates to the work happening in [#18215](https://issues.k8s.io/18215).
+Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
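The "Name of Jobs" section above settles on a hashed-date suffix without fixing an implementation. The Go sketch below shows what such a derivation could look like; the choice of FNV-1a and hex encoding is an assumption, made only to demonstrate that the name is deterministic for a given CronJob and nominal start time.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"time"
)

// jobName derives a child Job name from the CronJob name plus a hash of the
// nominal (scheduled) start time, following the hashed-date suffix approach
// chosen above. FNV-1a and hex encoding are illustrative choices, not mandated.
func jobName(cronJobName string, scheduledTime time.Time) string {
	h := fnv.New32a()
	fmt.Fprint(h, scheduledTime.Unix())
	return fmt.Sprintf("%s-%x", cronJobName, h.Sum32())
}

func main() {
	t := time.Date(2016, time.May, 19, 2, 0, 0, 0, time.UTC) // a nominal start time
	fmt.Println(jobName("nightly-earnings-report", t))
	// The same CronJob and scheduled time always produce the same name, so a
	// restarted or replicated controller cannot create a second Job for that
	// start time.
}
```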
\ No newline at end of file diff --git a/contributors/design-proposals/apps/daemon.md b/contributors/design-proposals/apps/daemon.md index bd2f281e..f0fbec72 100644 --- a/contributors/design-proposals/apps/daemon.md +++ b/contributors/design-proposals/apps/daemon.md @@ -1,203 +1,6 @@ -# DaemonSet in Kubernetes +Design proposals have been archived. -**Author**: Ananya Kumar (@AnanyaKumar) +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -**Status**: Implemented. - -This document presents the design of the Kubernetes DaemonSet, describes use -cases, and gives an overview of the code. - -## Motivation - -Many users have requested for a way to run a daemon on every node in a -Kubernetes cluster, or on a certain set of nodes in a cluster. This is essential -for use cases such as building a sharded datastore, or running a logger on every -node. In comes the DaemonSet, a way to conveniently create and manage -daemon-like workloads in Kubernetes. - -## Use Cases - -The DaemonSet can be used for user-specified system services, cluster-level -applications with strong node ties, and Kubernetes node services. Below are -example use cases in each category. - -### User-Specified System Services: - -Logging: Some users want a way to collect statistics about nodes in a cluster -and send those logs to an external database. For example, system administrators -might want to know if their machines are performing as expected, if they need to -add more machines to the cluster, or if they should switch cloud providers. The -DaemonSet can be used to run a data collection service (for example fluentd) on -every node and send the data to a service like ElasticSearch for analysis. - -### Cluster-Level Applications - -Datastore: Users might want to implement a sharded datastore in their cluster. A -few nodes in the cluster, labeled ‘app=datastore’, might be responsible for -storing data shards, and pods running on these nodes might serve data. This -architecture requires a way to bind pods to specific nodes, so it cannot be -achieved using a Replication Controller. A DaemonSet is a convenient way to -implement such a datastore. - -For other uses, see the related [feature request](https://issues.k8s.io/1518) - -## Functionality - -The DaemonSet supports standard API features: - - create - - The spec for DaemonSets has a pod template field. - - Using the pod's nodeSelector field, DaemonSets can be restricted to operate -over nodes that have a certain label. For example, suppose that in a cluster -some nodes are labeled ‘app=database’. You can use a DaemonSet to launch a -datastore pod on exactly those nodes labeled ‘app=database’. - - Using the pod's nodeName field, DaemonSets can be restricted to operate on a -specified node. - - The PodTemplateSpec used by the DaemonSet is the same as the PodTemplateSpec -used by the Replication Controller. - - The initial implementation will not guarantee that DaemonSet pods are -created on nodes before other pods. - - The initial implementation of DaemonSet does not guarantee that DaemonSet -pods show up on nodes (for example because of resource limitations of the node), -but makes a best effort to launch DaemonSet pods (like Replication Controllers -do with pods). Subsequent revisions might ensure that DaemonSet pods show up on -nodes, preempting other pods if necessary. 
- - The DaemonSet controller adds an annotation: -``` -"kubernetes.io/created-by: \<json API object reference\>" -``` - - YAML example: -```yaml - apiVersion: extensions/v1beta1 - kind: DaemonSet - metadata: - labels: - app: datastore - name: datastore - spec: - template: - metadata: - labels: - app: datastore-shard - spec: - nodeSelector: - app: datastore-node - containers: - - name: datastore-shard - image: kubernetes/sharded - ports: - - containerPort: 9042 - name: main -``` - - - commands that get info: - - get (e.g. kubectl get daemonsets) - - describe - - Modifiers: - - delete (if --cascade=true, then first the client turns down all the pods -controlled by the DaemonSet (by setting the nodeSelector to a uuid pair that is -unlikely to be set on any node); then it deletes the DaemonSet; then it deletes -the pods) - - label - - annotate - - update operations like patch and replace (only allowed to selector and to -nodeSelector and nodeName of pod template) - - DaemonSets have labels, so you could, for example, list all DaemonSets -with certain labels (the same way you would for a Replication Controller). - -In general, for all the supported features like get, describe, update, etc, -the DaemonSet works in a similar way to the Replication Controller. However, -note that the DaemonSet and the Replication Controller are different constructs. - -### Persisting Pods - - - Ordinary liveness probes specified in the pod template work to keep pods -created by a DaemonSet running. - - If a daemon pod is killed or stopped, the DaemonSet will create a new -replica of the daemon pod on the node. - -### Cluster Mutations - - - When a new node is added to the cluster, the DaemonSet controller starts -daemon pods on the node for DaemonSets whose pod template nodeSelectors match -the node's labels. - - Suppose the user launches a DaemonSet that runs a logging daemon on all -nodes labeled “logger=fluentd”. If the user then adds the “logger=fluentd” label -to a node (that did not initially have the label), the logging daemon will -launch on the node. Additionally, if a user removes the label from a node, the -logging daemon on that node will be killed. - -## Alternatives Considered - -We considered several alternatives, that were deemed inferior to the approach of -creating a new DaemonSet abstraction. - -One alternative is to include the daemon in the machine image. In this case it -would run outside of Kubernetes proper, and thus not be monitored, health -checked, usable as a service endpoint, easily upgradable, etc. - -A related alternative is to package daemons as static pods. This would address -most of the problems described above, but they would still not be easily -upgradable, and more generally could not be managed through the API server -interface. - -A third alternative is to generalize the Replication Controller. We would do -something like: if you set the `replicas` field of the ReplicationControllerSpec -to -1, then it means "run exactly one replica on every node matching the -nodeSelector in the pod template." The ReplicationController would pretend -`replicas` had been set to some large number -- larger than the largest number -of nodes ever expected in the cluster -- and would use some anti-affinity -mechanism to ensure that no more than one Pod from the ReplicationController -runs on any given node. There are two downsides to this approach. 
First, -there would always be a large number of Pending pods in the scheduler (these -will be scheduled onto new machines when they are added to the cluster). The -second downside is more philosophical: DaemonSet and the Replication Controller -are very different concepts. We believe that having small, targeted controllers -for distinct purposes makes Kubernetes easier to understand and use, compared to -having larger multi-functional controllers (see -["Convert ReplicationController to a plugin"](http://issues.k8s.io/3058) for -some discussion of this topic). - -## Design - -#### Client - -- Add support for DaemonSet commands to kubectl and the client. Client code was -added to pkg/client/unversioned. The main files in Kubectl that were modified are -pkg/kubectl/describe.go and pkg/kubectl/stop.go, since for other calls like Get, Create, -and Update, the client simply forwards the request to the backend via the REST -API. - -#### Apiserver - -- Accept, parse, validate client commands -- REST API calls are handled in pkg/registry/daemonset - - In particular, the api server will add the object to etcd - - DaemonManager listens for updates to etcd (using Framework.informer) -- API objects for DaemonSet were created in expapi/v1/types.go and -expapi/v1/register.go -- Validation code is in expapi/validation - -#### Daemon Manager - -- Creates new DaemonSets when requested. Launches the corresponding daemon pod -on all nodes with labels matching the new DaemonSet's selector. -- Listens for addition of new nodes to the cluster, by setting up a -framework.NewInformer that watches for the creation of Node API objects. When a -new node is added, the daemon manager will loop through each DaemonSet. If the -label of the node matches the selector of the DaemonSet, then the daemon manager -will create the corresponding daemon pod in the new node. -- The daemon manager creates a pod on a node by sending a command to the API -server, requesting for a pod to be bound to the node (the node will be specified -via its hostname.) - -#### Kubelet - -- Does not need to be modified, but health checking will occur for the daemon -pods and revive the pods if they are killed (we set the pod restartPolicy to -Always). We reject DaemonSet objects with pod templates that don't have -restartPolicy set to Always. - -## Open Issues - -- Should work similarly to [Deployment](http://issues.k8s.io/1743). +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
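The Daemon Manager described above creates a daemon pod on a node exactly when the node's labels match the pod template's nodeSelector. The Go sketch below isolates that matching rule as a plain subset check; it is illustrative only and is not taken from the controller's actual code.

```go
package main

import "fmt"

// nodeShouldRunDaemonPod reports whether every key/value pair in the pod
// template's nodeSelector is present in the node's labels. An empty selector
// matches every node, which is why an unrestricted DaemonSet lands on all nodes.
func nodeShouldRunDaemonPod(nodeLabels, nodeSelector map[string]string) bool {
	for k, v := range nodeSelector {
		if nodeLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	selector := map[string]string{"app": "datastore-node"}
	fmt.Println(nodeShouldRunDaemonPod(map[string]string{"app": "datastore-node", "zone": "a"}, selector)) // true
	fmt.Println(nodeShouldRunDaemonPod(map[string]string{"app": "frontend"}, selector))                    // false
}
```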
\ No newline at end of file diff --git a/contributors/design-proposals/apps/daemonset-update.md b/contributors/design-proposals/apps/daemonset-update.md index f4ce1256..f0fbec72 100644 --- a/contributors/design-proposals/apps/daemonset-update.md +++ b/contributors/design-proposals/apps/daemonset-update.md @@ -1,354 +1,6 @@ -# DaemonSet Updates +Design proposals have been archived. -**Author**: @madhusudancs, @lukaszo, @janetkuo +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -**Status**: Proposal -## Abstract - -A proposal for adding the update feature to `DaemonSet`. This feature will be -implemented on server side (in `DaemonSet` API). - -Users already can update a `DaemonSet` today (Kubernetes release 1.5), which will -not cause changes to its subsequent pods, until those pods are killed. In this -proposal, we plan to add a "RollingUpdate" strategy which allows DaemonSet to -downstream its changes to pods. - -## Requirements - -In this proposal, we design DaemonSet updates based on the following requirements: - -- Users can trigger a rolling update of DaemonSet at a controlled speed, which - is achieved by: - - Only a certain number of DaemonSet pods can be down at the same time during - an update - - A DaemonSet pod needs to be ready for a specific amount of time before it's - considered up -- Users can monitor the status of a DaemonSet update (e.g. the number of pods - that are updated and healthy) -- A broken DaemonSet update should not continue, but one can still update the - DaemonSet again to fix it -- Users should be able to update a DaemonSet even during an ongoing DaemonSet - upgrade -- in other words, rollover (e.g. update the DaemonSet to fix a broken - DaemonSet update) - -Here are some potential requirements that haven't been covered by this proposal: - -- Users should be able to view the history of previous DaemonSet updates -- Users can figure out the revision of a DaemonSet's pod (e.g. which version is - this DaemonSet pod?) -- DaemonSet should provide at-most-one guarantee per node (i.e. at most one pod - from a DaemonSet can exist on a node at any time) -- Uptime is critical for each pod of a DaemonSet during an upgrade (e.g. the time - from a DaemonSet pods being killed to recreated and healthy should be < 5s) -- Each DaemonSet pod can still fit on the node after being updated -- Some DaemonSets require the node to be drained before the DaemonSet's pod on it - is updated (e.g. logging daemons) -- DaemonSet's pods are implicitly given higher priority than non-daemons -- DaemonSets can only be operated by admins (i.e. people who manage nodes) - - This is required if we allow DaemonSet controllers to drain, cordon, - uncordon nodes, evict pods, or allow DaemonSet pods to have higher priority - -## Implementation - -### API Object - -To enable DaemonSet upgrades, `DaemonSet` related API object will have the following -changes: - -```go -type DaemonSetUpdateStrategy struct { - // Type of daemon set update. Can be "RollingUpdate" or "OnDelete". - // Default is OnDelete. - // +optional - Type DaemonSetUpdateStrategyType - - // Rolling update config params. Present only if DaemonSetUpdateStrategy = - // RollingUpdate. - //--- - // TODO: Update this to follow our convention for oneOf, whatever we decide it - // to be. Same as Deployment `strategy.rollingUpdate`. 
- // See https://github.com/kubernetes/kubernetes/issues/35345 - // +optional - RollingUpdate *RollingUpdateDaemonSet -} - -type DaemonSetUpdateStrategyType string - -const ( - // Replace the old daemons by new ones using rolling update i.e replace them on each node one after the other. - RollingUpdateDaemonSetStrategyType DaemonSetUpdateStrategyType = "RollingUpdate" - - // Replace the old daemons only when it's killed - OnDeleteDaemonSetStrategyType DaemonSetUpdateStrategyType = "OnDelete" -) - -// Spec to control the desired behavior of daemon set rolling update. -type RollingUpdateDaemonSet struct { - // The maximum number of DaemonSet pods that can be unavailable during - // the update. Value can be an absolute number (ex: 5) or a percentage of total - // number of DaemonSet pods at the start of the update (ex: 10%). Absolute - // number is calculated from percentage by rounding up. - // This must be greater than 0. - // Default value is 1. - // Example: when this is set to 30%, 30% of the currently running DaemonSet - // pods can be stopped for an update at any given time. The update starts - // by stopping at most 30% of the currently running DaemonSet pods and then - // brings up new DaemonSet pods in their place. Once the new pods are ready, - // it then proceeds onto other DaemonSet pods, thus ensuring that at least - // 70% of original number of DaemonSet pods are available at all times - // during the update. - // +optional - MaxUnavailable intstr.IntOrString -} - -// DaemonSetSpec is the specification of a daemon set. -type DaemonSetSpec struct { - // Note: Existing fields, including Selector and Template are omitted in - // this proposal. - - // Update strategy to replace existing DaemonSet pods with new pods. - // +optional - UpdateStrategy DaemonSetUpdateStrategy `json:"updateStrategy,omitempty"` - - // Minimum number of seconds for which a newly created DaemonSet pod should - // be ready without any of its container crashing, for it to be considered - // available. Defaults to 0 (pod will be considered available as soon as it - // is ready). - // +optional - MinReadySeconds int32 `json:"minReadySeconds,omitempty"` - - // DEPRECATED. - // A sequence number representing a specific generation of the template. - // Populated by the system. Can be set at creation time. Read-only otherwise. - // +optional - TemplateGeneration int64 `json:"templateGeneration,omitempty"` - - // The number of old history to retain to allow rollback. - // This is a pointer to distinguish between explicit zero and not specified. - // Defaults to 10. - RevisionHistoryLimit *int32 `json:"revisionHistoryLimit,omitempty"` -} - -// DaemonSetStatus represents the current status of a daemon set. -type DaemonSetStatus struct { - // Note: Existing fields, including CurrentNumberScheduled, NumberMisscheduled, - // DesiredNumberScheduled, NumberReady, and ObservedGeneration are omitted in - // this proposal. 
- - // UpdatedNumberScheduled is the total number of nodes that are running updated - // daemon pod - // +optional - UpdatedNumberScheduled int32 `json:"updatedNumberScheduled"` - - // NumberAvailable is the number of nodes that should be running the - // daemon pod and have one or more of the daemon pod running and - // available (ready for at least minReadySeconds) - // +optional - NumberAvailable int32 `json:"numberAvailable"` - - // NumberUnavailable is the number of nodes that should be running the - // daemon pod and have non of the daemon pod running and available - // (ready for at least minReadySeconds) - // +optional - NumberUnavailable int32 `json:"numberUnavailable"` - - // Count of hash collisions for the DaemonSet. The DaemonSet controller - // uses this field as a collision avoidance mechanism when it needs to - // create the name for the newest ControllerRevision. - // +optional - CollisionCount *int64 `json:"collisionCount,omitempty"` -} - -const ( - // DEPRECATED: DefaultDeploymentUniqueLabelKey is used instead. - // DaemonSetTemplateGenerationKey is the key of the labels that is added - // to daemon set pods to distinguish between old and new pod - // during DaemonSet template update. - DaemonSetTemplateGenerationKey string = "pod-template-generation" - - // DefaultDaemonSetUniqueLabelKey is the default label key that is added - // to existing DaemonSet pods to distinguish between old and new - // DaemonSet pods during DaemonSet template updates. - DefaultDaemonSetUniqueLabelKey string = "daemonset-controller-hash" -) -``` - -### Controller - -#### DaemonSet Controller - -The DaemonSet Controller will make DaemonSet updates happen. It will watch -DaemonSets on the apiserver. - -DaemonSet controller manages [`ControllerRevisions`](controller_history.md) for -DaemonSet revision introspection and rollback. It's referred to as "history" -throughout the rest of this proposal. - -For each pending DaemonSet updates, it will: - -1. Reconstruct DaemonSet history: - - List existing DaemonSet history controlled by this DaemonSet - - Find the history of DaemonSet's current target state, and create one if - not found: - - The `.name` of this history will be unique, generated from pod template - hash with hash collision resolution. If history creation failed: - - If it's because of name collision: - - Compare history with DaemonSet current target state: - - If they're the same, we've already created the history - - Otherwise, bump DaemonSet `.status.collisionCount` by 1, exit and - retry in the next sync loop - - Otherwise, exit and retry again in the next sync loop. - - The history will be labeled with `DefaultDaemonSetUniqueLabelKey`. - - DaemonSet controller will add a ControllerRef in the history - `.ownerReferences`. - - Current history should have the largest `.revision` number amongst all - existing history. Update `.revision` if it's not (e.g. after a rollback.) - - If more than one current history is found, remove duplicates and relabel - their pods' `DefaultDaemonSetUniqueLabelKey`. -1. Sync nodes: - - Find all nodes that should run these pods created by this DaemonSet. - - Create daemon pods on nodes when they should have those pods running but not - yet. Otherwise, delete running daemon pods that shouldn't be running on nodes. - - Label new pods with current `.spec.templateGeneration` and - `DefaultDaemonSetUniqueLabelKey` value of current history when creating them. -1. 
Check `DaemonSetUpdateStrategy`: - - If `OnDelete`: do nothing - - If `RollingUpdate`: - - For all pods owned by this DaemonSet: - - If its `pod-template-generation` label value equals to DaemonSet's - `.spec.templateGeneration`, it's a new pod (don't compare - `DefaultDaemonSetUniqueLabelKey`, for backward compatibility). - - Add `DefaultDaemonSetUniqueLabelKey` label to the new pod based on current - history, if the pod doesn't have this label set yet. - - Otherwise, if the value doesn't match, or the pod doesn't have a - `pod-template-generation` label, check its `DefaultDaemonSetUniqueLabelKey` label: - - If the value matches any of the history's `DefaultDaemonSetUniqueLabelKey` label, - it's a pod generated from that history. - - If that history matches the current target state of the DaemonSet, - it's a new pod. - - Otherwise, it's an old pod. - - Otherwise, if the pod doesn't have a `DefaultDaemonSetUniqueLabelKey` label, or no - matching history is found, it's an old pod. - - If there are old pods found, compare `MaxUnavailable` with DaemonSet - `.status.numberUnavailable` to see how many old daemon pods can be - killed. Then, kill those pods in the order that unhealthy pods (failed, - pending, not ready) are killed first. -1. Clean up old history based on `.spec.revisionHistoryLimit` - - Always keep live history and current history -1. Cleanup, update DaemonSet status - - `.status.numberAvailable` = the total number of DaemonSet pods that have - become `Ready` for `MinReadySeconds` - - `.status.numberUnavailable` = `.status.desiredNumberScheduled` - - `.status.numberAvailable` - -If DaemonSet Controller crashes during an update, it can still recover. - -#### API Server - -In DaemonSet strategy (pkg/registry/extensions/daemonset/strategy.go#PrepareForUpdate), -increase DaemonSet's `.spec.templateGeneration` by 1 if any changes is made to -DaemonSet's `.spec.template`. - -This was originally implemented in 1.6, and kept in 1.7 for backward compatibility. - -### kubectl - -#### kubectl rollout - -Users can use `kubectl rollout` to monitor DaemonSet updates: - -- `kubectl rollout status daemonset/<DaemonSet-Name>`: to see the DaemonSet - upgrade status -- `kubectl rollout history daemonset/<DaemonSet-Name>`: to view the history of - DaemonSet updates. -- `kubectl rollout undo daemonset/<DaemonSet-Name>`: to rollback a DaemonSet - -## Updating DaemonSets mid-way - -Users can update an updated DaemonSet before its rollout completes. -In this case, the existing daemon pods will not continue rolling out and the new -one will begin rolling out. - - -## Deleting DaemonSets - -Deleting a DaemonSet (with cascading) will delete all its pods and history. - - -## DaemonSet Strategies - -DaemonSetStrategy specifies how the new daemon pods should replace existing ones. -To begin with, we will support 2 types: - -* On delete: Do nothing, until existing daemon pods are killed (for backward - compatibility). - - Other alternative names: No-op, External -* Rolling update: We gradually kill existing ones while creating the new one. - - -## Tests - -- Updating a RollingUpdate DaemonSet will trigger updates to its daemon pods. -- Updating an OnDelete DaemonSet will not trigger updates, until the pods are - killed. -- Users can use node labels to choose which nodes this DaemonSet should target. - DaemonSet updates only affect pods on those nodes. 
- - For example, some nodes may be running manifest pods, and other nodes will - be running daemon pods -- DaemonSets can be updated while already being updated (i.e. rollover updates) -- Broken rollout can be rolled back (by applying old config) -- If a daemon pod can no long fit on the node after rolling update, the users - can manually evict or delete other pods on the node to make room for the - daemon pod, and the DaemonSet rollout will eventually succeed (DaemonSet - controller will recreate the failed daemon pod if it can't be scheduled) - - -## Future Plans - -In the future, we may: - -- Implement at-most-one and/or at-least-one guarantees for DaemonSets (i.e. at - most/at least one pod from a DaemonSet can exist on a node at any time) - - At-most-one would use a deterministic name for the pod (e.g. use node name - as daemon pod name suffix) -- Support use cases where uptime is critical for each pod of a DaemonSet during - an upgrade - - One approach is to use dummy pods to pre-pull images to reduce down time -- Support use cases that each DaemonSet pod can still fit on the node after - being updated (unless it becomes larger than the node). Some possible - approaches include: - - Make DaemonSet pods (daemons) have higher priority than non-daemons, and - kubelet will evict pods with lower priority to make room for higher priority - ones - - The DaemonSet controller will evict pods when daemons can't fit on the node - - The DaemonSet controller will cordon the node before upgrading the daemon on - it, and uncordon the node once it's done -- Support use cases that require the node to be drained before the daemons on it - can updated (e.g. logging daemons) - - The DaemonSet controller will drain the node before upgrading the daemon on - it, and uncordon the node once it's done -- Make DaemonSets admin-only resources (admin = people who manage nodes). Some - possible approaches include: - - Remove namespace from DaemonSets (DaemonSets become node-level resources) - - Modify RBAC bootstrap policy to make DaemonSets admin-only - - Delegation or impersonation -- Support more DaemonSet update strategies -- Allow user-defined DaemonSet unique label key -- Support pausing DaemonSet rolling update -- Support auto-rollback DaemonSets - -### API - -Implement a subresource for DaemonSet history (`daemonsets/foo/history`) that -summarizes the information in the history. - -Implement a subresource for DaemonSet rollback (`daemonsets/foo/rollback`) that -triggers a DaemonSet rollback. - -### Tests - -- DaemonSet should support at most one daemon pod per node guarantee. - - Adding or deleting nodes won't break that. -- Users should be able to specify acceptable downtime of their daemon pods, and - DaemonSet updates should respect that. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
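As an illustration of the rolling-update bookkeeping described in the DaemonSet proposal above, the following Go sketch shows how a controller might pick which old daemon pods to delete in a single sync pass, given `MaxUnavailable` and the current `numberUnavailable`. The `podInfo` type, its field names, and the `oldPodsToDelete` helper are invented for this example; this is a sketch of the described behavior, not the actual controller code.

```go
package main

import "fmt"

// podInfo is an illustrative stand-in for the fields the controller inspects
// on each daemon pod; it is not a real Kubernetes API type.
type podInfo struct {
	name      string
	isNew     bool // label matches .spec.templateGeneration or the current history hash
	available bool // Ready for at least MinReadySeconds
}

// oldPodsToDelete returns the old pods that may be deleted in one sync pass:
// the budget is maxUnavailable minus the pods that are already unavailable,
// and unhealthy old pods are picked before healthy ones.
func oldPodsToDelete(pods []podInfo, maxUnavailable, numberUnavailable int) []podInfo {
	var unhealthy, healthy []podInfo
	for _, p := range pods {
		if p.isNew {
			continue
		}
		if p.available {
			healthy = append(healthy, p)
		} else {
			unhealthy = append(unhealthy, p)
		}
	}
	budget := maxUnavailable - numberUnavailable
	if budget <= 0 {
		return nil
	}
	candidates := append(unhealthy, healthy...) // unhealthy old pods are killed first
	if budget > len(candidates) {
		budget = len(candidates)
	}
	return candidates[:budget]
}

func main() {
	pods := []podInfo{
		{name: "fluentd-a", isNew: false, available: true},
		{name: "fluentd-b", isNew: false, available: false},
		{name: "fluentd-c", isNew: true, available: true},
	}
	// One pod is already unavailable, so with maxUnavailable=2 only one more
	// old pod may be taken down in this pass; the unhealthy one goes first.
	for _, p := range oldPodsToDelete(pods, 2, 1) {
		fmt.Println("delete:", p.name)
	}
}
```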
\ No newline at end of file diff --git a/contributors/design-proposals/apps/deploy.md b/contributors/design-proposals/apps/deploy.md index 1030a9a6..f0fbec72 100644 --- a/contributors/design-proposals/apps/deploy.md +++ b/contributors/design-proposals/apps/deploy.md @@ -1,151 +1,6 @@ -- [Deploy through CLI](#deploy-through-cli) - - [Motivation](#motivation) - - [Requirements](#requirements) - - [Related `kubectl` Commands](#related-kubectl-commands) - - [`kubectl run`](#kubectl-run) - - [`kubectl scale` and `kubectl autoscale`](#kubectl-scale-and-kubectl-autoscale) - - [`kubectl rollout`](#kubectl-rollout) - - [`kubectl set`](#kubectl-set) - - [Mutating Operations](#mutating-operations) - - [Example](#example) - - [Support in Deployment](#support-in-deployment) - - [Deployment Status](#deployment-status) - - [Deployment Revision](#deployment-revision) - - [Pause Deployments](#pause-deployments) - - [Failed Deployments](#failed-deployments) +Design proposals have been archived. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -# Deployment rolling update design proposal -**Author**: @janetkuo - -**Status**: implemented - -# Deploy through CLI - -## Motivation - -Users can use [Deployments](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/) or [`kubectl rolling-update`](https://kubernetes.io/docs/tasks/run-application/rolling-update-replication-controller/) to deploy in their Kubernetes clusters. A Deployment provides declarative update for Pods and ReplicationControllers, whereas `rolling-update` allows the users to update their earlier deployment without worrying about schemas and configurations. Users need a way that's similar to `rolling-update` to manage their Deployments more easily. - -`rolling-update` expects ReplicationController as the only resource type it deals with. It's not trivial to support exactly the same behavior with Deployment, which requires: -- Print out scaling up/down events. -- Stop the deployment if users press Ctrl-c. -- The controller should not make any more changes once the process ends. (Delete the deployment when status.replicas=status.updatedReplicas=spec.replicas) - -So, instead, this document proposes another way to support easier deployment management via Kubernetes CLI (`kubectl`). - -## Requirements - -The followings are operations we need to support for the users to easily managing deployments: - -- **Create**: To create deployments. -- **Rollback**: To restore to an earlier revision of deployment. -- **Watch the status**: To watch for the status update of deployments. -- **Pause/resume**: To pause a deployment mid-way, and to resume it. (A use case is to support canary deployment.) -- **Revision information**: To record and show revision information that's meaningful to users. This can be useful for rollback. - -## Related `kubectl` Commands - -### `kubectl run` - -`kubectl run` should support the creation of Deployment (already implemented) and DaemonSet resources. - -### `kubectl scale` and `kubectl autoscale` - -Users may use `kubectl scale` or `kubectl autoscale` to scale up and down Deployments (both already implemented). - -### `kubectl rollout` - -`kubectl rollout` supports both Deployment and DaemonSet. It has the following subcommands: -- `kubectl rollout undo` works like rollback; it allows the users to rollback to a previous revision of deployment. -- `kubectl rollout pause` allows the users to pause a deployment. 
See [pause deployments](#pause-deployments). -- `kubectl rollout resume` allows the users to resume a paused deployment. -- `kubectl rollout status` shows the status of a deployment. -- `kubectl rollout history` shows meaningful revision information of all previous deployments. See [development revision](#deployment-revision). - -### `kubectl set` - -`kubectl set` has the following subcommands: -- `kubectl set env` allows the users to set environment variables of Kubernetes resources. It should support any object that contains a single, primary PodTemplate (such as Pod, ReplicationController, ReplicaSet, Deployment, and DaemonSet). -- `kubectl set image` allows the users to update multiple images of Kubernetes resources. Users will use `--container` and `--image` flags to update the image of a container. It should support anything that has a PodTemplate. - -`kubectl set` should be used for things that are common and commonly modified. Other possible future commands include: -- `kubectl set volume` -- `kubectl set limits` -- `kubectl set security` -- `kubectl set port` - -### Mutating Operations - -Other means of mutating Deployments and DaemonSets, including `kubectl apply`, `kubectl edit`, `kubectl replace`, `kubectl patch`, `kubectl label`, and `kubectl annotate`, may trigger rollouts if they modify the pod template. - -`kubectl create` and `kubectl delete`, for creating and deleting Deployments and DaemonSets, are also relevant. - -### Example - -With the commands introduced above, here's an example of deployment management: - -```console -# Create a Deployment -$ kubectl run nginx --image=nginx --replicas=2 --generator=deployment/v1beta1 - -# Watch the Deployment status -$ kubectl rollout status deployment/nginx - -# Update the Deployment -$ kubectl set image deployment/nginx --container=nginx --image=nginx:<some-revision> - -# Pause the Deployment -$ kubectl rollout pause deployment/nginx - -# Resume the Deployment -$ kubectl rollout resume deployment/nginx - -# Check the change history (deployment revisions) -$ kubectl rollout history deployment/nginx - -# Rollback to a previous revision. -$ kubectl rollout undo deployment/nginx --to-revision=<revision> -``` - -## Support in Deployment - -### Deployment Status - -Deployment status should summarize information about Pods, which includes: -- The number of pods of each revision. -- The number of ready/not ready pods. - -See issue [#17164](https://github.com/kubernetes/kubernetes/issues/17164). - -### Deployment Revision - -We store previous deployment revision information in annotations `kubernetes.io/change-cause` and `deployment.kubernetes.io/revision` of ReplicaSets of the Deployment, to support rolling back changes as well as for the users to view previous changes with `kubectl rollout history`. -- `kubernetes.io/change-cause`, which is optional, records the kubectl command of the last mutation made to this rollout. Users may use `--record` in `kubectl` to record current command in this annotation. -- `deployment.kubernetes.io/revision` records a revision number to distinguish the change sequence of a Deployment's -ReplicaSets. A Deployment obtains the largest revision number from its ReplicaSets and increments the number by 1 upon update or creation of the Deployment, and updates the revision annotation of its new ReplicaSet. - -When the users perform a rollback, i.e. `kubectl rollout undo`, the Deployment first looks at its existing ReplicaSets, regardless of their number of replicas. 
Then it finds the one with annotation `deployment.kubernetes.io/revision` that either contains the specified rollback revision number or contains the second largest revision number among all the ReplicaSets (current new ReplicaSet should obtain the largest revision number) if the user didn't specify any revision number (the user wants to rollback to the last change). Lastly, it -starts scaling up that ReplicaSet it's rolling back to, and scaling down the current ones, and then update the revision counter and the rollout annotations accordingly. - -Note that ReplicaSets are distinguished by PodTemplate (i.e. `.spec.template`). When doing a rollout or rollback, a Deployment reuses existing ReplicaSet if it has the same PodTemplate, and its `kubernetes.io/change-cause` and `deployment.kubernetes.io/revision` annotations will be updated by the new rollout. All previous of revisions of this ReplicaSet will be kept in the annotation `deployment.kubernetes.io/revision-history`. For example, if we had 3 ReplicaSets in -Deployment history, and then we do a rollout with the same PodTemplate as revision 1, then revision 1 is lost and becomes revision 4 after the rollout, and the ReplicaSet that once represented revision 1 will then have an annotation `deployment.kubernetes.io/revision-history=1`. - -To make Deployment revisions more meaningful and readable for users, we can add more annotations in the future. For example, we can add the following flags to `kubectl` for the users to describe and record their current rollout: -- `--description`: adds `description` annotation to an object when it's created to describe the object. -- `--note`: adds `note` annotation to an object when it's updated to record the change. -- `--commit`: adds `commit` annotation to an object with the commit id. - -### Pause Deployments - -Users sometimes need to temporarily disable a Deployment. See issue [#14516](https://github.com/kubernetes/kubernetes/issues/14516). - -For more details, see [pausing and resuming a -Deployment](https://kubernetes.io/docs/user-guide/deployments/#pausing-and-resuming-a-deployment). - -### Failed Deployments - -The Deployment could be marked as "failed" when it gets stuck trying to deploy -its newest ReplicaSet without completing within the given deadline (specified -with `.spec.progressDeadlineSeconds`), see document about -[failed Deployment](https://kubernetes.io/docs/user-guide/deployments/#failed-deployment). +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
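To make the rollback-target selection described in the Deployment Revision section above concrete, here is a small Go sketch of how a client could choose the ReplicaSet to roll back to using the `deployment.kubernetes.io/revision` annotation values. The `revisionedRS` type, the helper name, and the sample ReplicaSet names are illustrative assumptions; this is not kubectl's actual implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// revisionedRS is an illustrative stand-in for a ReplicaSet, keeping only its
// name and the parsed value of the deployment.kubernetes.io/revision annotation.
type revisionedRS struct {
	name     string
	revision int64
}

// pickRollbackTarget chooses the ReplicaSet to roll back to: the one whose
// revision equals toRevision, or, when toRevision is 0 (meaning "previous
// revision"), the one with the second-largest revision, since the largest
// belongs to the current new ReplicaSet.
func pickRollbackTarget(rss []revisionedRS, toRevision int64) (revisionedRS, error) {
	if toRevision != 0 {
		for _, rs := range rss {
			if rs.revision == toRevision {
				return rs, nil
			}
		}
		return revisionedRS{}, fmt.Errorf("unable to find revision %d", toRevision)
	}
	if len(rss) < 2 {
		return revisionedRS{}, fmt.Errorf("no previous revision to roll back to")
	}
	sorted := append([]revisionedRS(nil), rss...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].revision > sorted[j].revision })
	return sorted[1], nil
}

func main() {
	rss := []revisionedRS{
		{name: "nginx-3161026161", revision: 3}, // current new ReplicaSet
		{name: "nginx-2035384211", revision: 2},
		{name: "nginx-1564180365", revision: 1},
	}
	target, _ := pickRollbackTarget(rss, 0)
	fmt.Println("rolling back to:", target.name) // the revision-2 ReplicaSet
}
```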
\ No newline at end of file diff --git a/contributors/design-proposals/apps/deployment.md b/contributors/design-proposals/apps/deployment.md index 30392c4a..f0fbec72 100644 --- a/contributors/design-proposals/apps/deployment.md +++ b/contributors/design-proposals/apps/deployment.md @@ -1,259 +1,6 @@ -# Deployment +Design proposals have been archived. -Authors: -- Brian Grant (@bgrant0607) -- Clayton Coleman (@smarterclayton) -- Dan Mace (@ironcladlou) -- David Oppenheimer (@davidopp) -- Janet Kuo (@janetkuo) -- Michail Kargakis (@kargakis) -- Nikhil Jindal (@nikhiljindal) +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Abstract -A proposal for implementing a new resource - Deployment - which will enable -declarative config updates for ReplicaSets. Users will be able to create a -Deployment, which will spin up a ReplicaSet to bring up the desired Pods. -Users can also target the Deployment to an existing ReplicaSet either by -rolling back an existing Deployment or creating a new Deployment that can -adopt an existing ReplicaSet. The exact mechanics of replacement depends on -the DeploymentStrategy chosen by the user. DeploymentStrategies are explained -in detail in a later section. - -## Implementation - -### API Object - -The `Deployment` API object will have the following structure: - -```go -type Deployment struct { - TypeMeta - ObjectMeta - - // Specification of the desired behavior of the Deployment. - Spec DeploymentSpec - - // Most recently observed status of the Deployment. - Status DeploymentStatus -} - -type DeploymentSpec struct { - // Number of desired pods. This is a pointer to distinguish between explicit - // zero and not specified. Defaults to 1. - Replicas *int32 - - // Label selector for pods. Existing ReplicaSets whose pods are - // selected by this will be scaled down. New ReplicaSets will be - // created with this selector, with a unique label `pod-template-hash`. - // If Selector is empty, it is defaulted to the labels present on the Pod template. - Selector map[string]string - - // Describes the pods that will be created. - Template *PodTemplateSpec - - // The deployment strategy to use to replace existing pods with new ones. - Strategy DeploymentStrategy - - // Minimum number of seconds for which a newly created pod should be ready - // without any of its container crashing, for it to be considered available. - // Defaults to 0 (pod will be considered available as soon as it is ready) - MinReadySeconds int32 -} - -type DeploymentStrategy struct { - // Type of deployment. Can be "Recreate" or "RollingUpdate". - Type DeploymentStrategyType - - // Rolling update config params. Present only if DeploymentStrategyType = - // RollingUpdate. - RollingUpdate *RollingUpdateDeploymentStrategy -} - -type DeploymentStrategyType string - -const ( - // Kill all existing pods before creating new ones. - RecreateDeploymentStrategyType DeploymentStrategyType = "Recreate" - - // Replace the old ReplicaSets by new one using rolling update i.e gradually scale - // down the old ReplicaSets and scale up the new one. - RollingUpdateDeploymentStrategyType DeploymentStrategyType = "RollingUpdate" -) - -// Spec to control the desired behavior of rolling update. -type RollingUpdateDeploymentStrategy struct { - // The maximum number of pods that can be unavailable during the update. - // Value can be an absolute number (ex: 5) or a percentage of total pods at the start of update (ex: 10%). 
- // Absolute number is calculated from percentage by rounding up. - // This can not be 0 if MaxSurge is 0. - // By default, a fixed value of 1 is used. - // Example: when this is set to 30%, the old RC can be scaled down by 30% - // immediately when the rolling update starts. Once new pods are ready, old RC - // can be scaled down further, followed by scaling up the new RC, ensuring - // that at least 70% of original number of pods are available at all times - // during the update. - MaxUnavailable IntOrString - - // The maximum number of pods that can be scheduled above the original number of - // pods. - // Value can be an absolute number (ex: 5) or a percentage of total pods at - // the start of the update (ex: 10%). This can not be 0 if MaxUnavailable is 0. - // Absolute number is calculated from percentage by rounding up. - // By default, a value of 1 is used. - // Example: when this is set to 30%, the new RC can be scaled up by 30% - // immediately when the rolling update starts. Once old pods have been killed, - // new RC can be scaled up further, ensuring that total number of pods running - // at any time during the update is atmost 130% of original pods. - MaxSurge IntOrString -} - -type DeploymentStatus struct { - // Total number of ready pods targeted by this deployment (this - // includes both the old and new pods). - Replicas int32 - - // Total number of new ready pods with the desired template spec. - UpdatedReplicas int32 - - // Count of hash collisions for the Deployment. The Deployment controller uses this - // field as a collision avoidance mechanism when it needs to create the name for the - // newest ReplicaSet. - CollisionCount *int64 -} - -``` - -### Controller - -#### Deployment Controller - -The DeploymentController will process Deployments and crud ReplicaSets. -For each creation or update for a Deployment, it will: - -1. Find all RSs (ReplicaSets) whose label selector is a superset of DeploymentSpec.Selector. - - For now, we will do this in the client - list all RSs and then filter out the - ones we want. Eventually, we want to expose this in the API. -2. The new RS can have the same selector as the old RS and hence we add a unique - selector to all these RSs (and the corresponding label to their pods) to ensure - that they do not select the newly created pods (or old pods get selected by the - new RS). - - The label key will be "pod-template-hash". - - The label value will be the hash of {podTemplateSpec+collisionCount} where podTemplateSpec - is the one that the new RS uses and collisionCount is a counter in the DeploymentStatus - that increments every time a [hash collision](#hashing-collisions) happens (hash - collisions should be rare with fnv). - - If the RSs and pods don't already have this label and selector: - - We will first add this to RS.PodTemplateSpec.Metadata.Labels for all RSs to - ensure that all new pods that they create will have this label. - - Then we will add this label to their existing pods - - Eventually we flip the RS selector to use the new label. - This process potentially can be abstracted to a new endpoint for controllers [1]. -3. Find if there exists an RS for which value of "pod-template-hash" label - is same as hash of DeploymentSpec.PodTemplateSpec. If it exists already, then - this is the RS that will be ramped up. If there is no such RS, then we create - a new one using DeploymentSpec and then add a "pod-template-hash" label - to it. The size of the new RS depends on the used DeploymentStrategyType. -4. 
Scale up the new RS and scale down the olds ones as per the DeploymentStrategy. - Raise events appropriately (both in case of failure or success). -5. Go back to step 1 unless the new RS has been ramped up to desired replicas - and the old RSs have been ramped down to 0. -6. Cleanup old RSs as per revisionHistoryLimit. - -DeploymentController is stateless so that it can recover in case it crashes during a deployment. - -[1] See https://github.com/kubernetes/kubernetes/issues/36897 - -### MinReadySeconds - -We will implement MinReadySeconds using the Ready condition in Pod. We will add -a LastTransitionTime to PodCondition and update kubelet to set Ready to false, -each time any container crashes. Kubelet will set Ready condition back to true once -all containers are ready. For containers without a readiness probe, we will -assume that they are ready as soon as they are up. -https://github.com/kubernetes/kubernetes/issues/11234 tracks updating kubelet -and https://github.com/kubernetes/kubernetes/issues/12615 tracks adding -LastTransitionTime to PodCondition. - -## Changing Deployment mid-way - -### Updating - -Users can update an ongoing Deployment before it is completed. -In this case, the existing rollout will be stalled and the new one will -begin. -For example, consider the following case: -- User updates a Deployment to rolling-update 10 pods with image:v1 to - pods with image:v2. -- User then updates this Deployment to create pods with image:v3, - when the image:v2 RS had been ramped up to 5 pods and the image:v1 RS - had been ramped down to 5 pods. -- When Deployment Controller observes the new update, it will create - a new RS for creating pods with image:v3. It will then start ramping up this - new RS to 10 pods and will ramp down both the existing RSs to 0. - -### Deleting - -Users can pause/cancel a rollout by doing a non-cascading deletion of the Deployment -before it is complete. Recreating the same Deployment will resume it. -For example, consider the following case: -- User creates a Deployment to perform a rolling-update for 10 pods from image:v1 to - image:v2. -- User then deletes the Deployment while the old and new RSs are at 5 replicas each. - User will end up with 2 RSs with 5 replicas each. -User can then re-create the same Deployment again in which case, DeploymentController will -notice that the second RS exists already which it can ramp up while ramping down -the first one. - -### Rollback - -We want to allow the user to rollback a Deployment. To rollback a completed (or -ongoing) Deployment, users can simply use `kubectl rollout undo` or update the -Deployment directly by using its spec.rollbackTo.revision field and specify the -revision they want to rollback to or no revision which means that the Deployment -will be rolled back to its previous revision. - -## Deployment Strategies - -DeploymentStrategy specifies how the new RS should replace existing RSs. -To begin with, we will support 2 types of Deployment: -* Recreate: We kill all existing RSs and then bring up the new one. This results - in quick Deployment but there is a downtime when old pods are down but - the new ones have not come up yet. -* Rolling update: We gradually scale down old RSs while scaling up the new one. - This results in a slower Deployment, but there can be no downtime. Depending on - the strategy parameters, it is possible to have at all times during the rollout - available pods (old or new). 
The number of available pods and when is a pod - considered "available" can be configured using RollingUpdateDeploymentStrategy. - -## Hashing collisions - -Hashing collisions are a real thing with the existing hashing algorithm[1]. We -need to switch to a more stable algorithm like fnv. Preliminary benchmarks[2] -show that while fnv is a bit slower than adler, it is much more stable. Also, -hashing an API object is subject to API changes which means that the name -for a ReplicaSet may differ between minor Kubernetes versions. - -For both of the aforementioned cases, we will use a field in the DeploymentStatus, -called collisionCount, to create a unique hash value when a hash collision happens. -The Deployment controller will compute the hash value of {template+collisionCount}, -and will use the resulting hash in the ReplicaSet names and selectors. One side -effect of this hash collision avoidance mechanism is that we don't need to -migrate ReplicaSets that were created with adler. - -[1] https://github.com/kubernetes/kubernetes/issues/29735 - -[2] https://github.com/kubernetes/kubernetes/pull/39527 - -## Future - -Apart from the above, we want to add support for the following: -* Running the deployment process in a pod: In future, we can run the deployment process in a pod. Then users can define their own custom deployments and we can run it using the image name. -* More DeploymentStrategyTypes: https://github.com/openshift/origin/blob/master/examples/deployment/README.md#deployment-types lists most commonly used ones. -* Triggers: Deployment will have a trigger field to identify what triggered the deployment. Options are: Manual/UserTriggered, Autoscaler, NewImage. -* Automatic rollback on error: We want to support automatic rollback on error or timeout. - -## References - -- https://github.com/kubernetes/kubernetes/issues/1743 has most of the - discussion that resulted in this proposal. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
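As a rough illustration of the collision-avoidance scheme described in the Hashing collisions section above, the sketch below hashes a serialized pod template together with `collisionCount` using fnv. It is a simplified stand-in, assuming the template is already serialized to a string, whereas the real controller hashes the API object itself; the function name and sample JSON are made up for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// podTemplateHash sketches the collision-avoidance scheme: the hash covers the
// pod template plus the Deployment's collisionCount, so bumping collisionCount
// after a detected collision produces a different pod-template-hash (and
// therefore a different ReplicaSet name).
func podTemplateHash(serializedTemplate string, collisionCount *int64) uint32 {
	h := fnv.New32a()
	h.Write([]byte(serializedTemplate))
	if collisionCount != nil {
		fmt.Fprintf(h, "%d", *collisionCount)
	}
	return h.Sum32()
}

func main() {
	template := `{"containers":[{"name":"nginx","image":"nginx:1.9.1"}]}`
	var collisions int64

	fmt.Println("hash:", podTemplateHash(template, &collisions))
	collisions++ // after a hash collision, the controller bumps collisionCount and retries
	fmt.Println("hash:", podTemplateHash(template, &collisions))
}
```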
\ No newline at end of file diff --git a/contributors/design-proposals/apps/indexed-job.md b/contributors/design-proposals/apps/indexed-job.md index 9b142b0f..f0fbec72 100644 --- a/contributors/design-proposals/apps/indexed-job.md +++ b/contributors/design-proposals/apps/indexed-job.md @@ -1,896 +1,6 @@ -# Design: Indexed Feature of Job object +Design proposals have been archived. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Summary - -This design extends kubernetes with user-friendly support for -running embarrassingly parallel jobs. - -Here, *parallel* means on multiple nodes, which means multiple pods. -By *embarrassingly parallel*, it is meant that the pods -have no dependencies between each other. In particular, neither -ordering between pods nor gang scheduling are supported. - -Users already have two other options for running embarrassingly parallel -Jobs (described in the next section), but both have ease-of-use issues. - -Therefore, this document proposes extending the Job resource type to support -a third way to run embarrassingly parallel programs, with a focus on -ease of use. - -This new style of Job is called an *indexed job*, because each Pod of the Job -is specialized to work on a particular *index* from a fixed length array of work -items. - -## Background - -The Kubernetes [Job](../../docs/user-guide/jobs.md) already supports -the embarrassingly parallel use case through *workqueue jobs*. -While [workqueue jobs](../../docs/user-guide/jobs.md#job-patterns) are very -flexible, they can be difficult to use. They: (1) typically require running a -message queue or other database service, (2) typically require modifications -to existing binaries and images and (3) subtle race conditions are easy to - overlook. - -Users also have another option for parallel jobs: creating [multiple Job objects -from a template](hdocs/design/indexed-job.md#job-patterns). For small numbers of -Jobs, this is a fine choice. Labels make it easy to view and delete multiple Job -objects at once. But, that approach also has its drawbacks: (1) for large levels -of parallelism (hundreds or thousands of pods) this approach means that listing -all jobs presents too much information, (2) users want a single source of -information about the success or failure of what the user views as a single -logical process. - -Indexed job fills provides a third option with better ease-of-use for common -use cases. - -## Requirements - -### User Requirements - -- Users want an easy way to run a Pod to completion *for each* item within a -[work list](#example-use-cases). - -- Users want to run these pods in parallel for speed, but to vary the level of -parallelism as needed, independent of the number of work items. - -- Users want to do this without requiring changes to existing images, -or source-to-image pipelines. - -- Users want a single object that encompasses the lifetime of the parallel -program. Deleting it should delete all dependent objects. It should report the -status of the overall process. Users should be able to wait for it to complete, -and can refer to it from other resource types, such as -[ScheduledJob](https://github.com/kubernetes/kubernetes/pull/11980). - - -### Example Use Cases - -Here are several examples of *work lists*: lists of command lines that the user -wants to run, each line its own Pod. 
(Note that in practice, a work list may not -ever be written out in this form, but it exists in the mind of the Job creator, -and it is a useful way to talk about the intent of the user when discussing -alternatives for specifying Indexed Jobs). - -Note that we will not have the user express their requirements in work list -form; it is just a format for presenting use cases. Subsequent discussion will -reference these work lists. - -#### Work List 1 - -Process several files with the same program: - -``` -/usr/local/bin/process_file 12342.dat -/usr/local/bin/process_file 97283.dat -/usr/local/bin/process_file 38732.dat -``` - -#### Work List 2 - -Process a matrix (or image, etc) in rectangular blocks: - -``` -/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 0 --end_col 15 -/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 0 --end_col 15 -/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 16 --end_col 31 -/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 16 --end_col 31 -``` - -#### Work List 3 - -Build a program at several different git commits: - -``` -HASH=3cab5cb4a git checkout $HASH && make clean && make VERSION=$HASH -HASH=fe97ef90b git checkout $HASH && make clean && make VERSION=$HASH -HASH=a8b5e34c5 git checkout $HASH && make clean && make VERSION=$HASH -``` - -#### Work List 4 - -Render several frames of a movie: - -``` -./blender /vol1/mymodel.blend -o /vol2/frame_#### -f 1 -./blender /vol1/mymodel.blend -o /vol2/frame_#### -f 2 -./blender /vol1/mymodel.blend -o /vol2/frame_#### -f 3 -``` - -#### Work List 5 - -Render several blocks of frames (Render blocks to avoid Pod startup overhead for -every frame): - -``` -./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start 1 --frame-end 100 -./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start 101 --frame-end 200 -./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start 201 --frame-end 300 -``` - -## Design Discussion - -### Converting Work Lists into Indexed Jobs. - -Given a work list, like in the [work list examples](#work-list-examples), -the information from the work list needs to get into each Pod of the Job. - -Users will typically not want to create a new image for each job they -run. They will want to use existing images. So, the image is not the place -for the work list. - -A work list can be stored on networked storage, and mounted by pods of the job. -Also, as a shortcut, for small worklists, it can be included in an annotation on -the Job object, which is then exposed as a volume in the pod via the downward -API. - -### What Varies Between Pods of an indexed-job - -Pods need to differ in some way to do something different. (They do not differ -in the work-queue style of Job, but that style has ease-of-use issues). - -A general approach would be to allow pods to differ from each other in arbitrary -ways. For example, the Job object could have a list of PodSpecs to run. -However, this is so general that it provides little value. It would: - -- make the Job Spec very verbose, especially for jobs with thousands of work -items -- Job becomes such a vague concept that it is hard to explain to users -- in practice, we do not see cases where many pods which differ across many -fields of their specs, and need to run as a group, with no ordering constraints. 
-- CLIs and UIs need to support more options for creating Job -- it is useful for monitoring and accounting databases want to aggregate data -for pods with the same controller. However, pods with very different Specs may -not make sense to aggregate. -- profiling, debugging, accounting, auditing and monitoring tools cannot assume -common images/files, behaviors, provenance and so on between Pods of a Job. - -Also, variety has another cost. Pods which differ in ways that affect scheduling -(node constraints, resource requirements, labels) prevent the scheduler from -treating them as fungible, which is an important optimization for the scheduler. - -Therefore, we will not allow Pods from the same Job to differ arbitrarily -(anyway, users can use multiple Job objects for that case). We will try to -allow as little as possible to differ between pods of the same Job, while still -allowing users to express common parallel patterns easily. For users who need to -run jobs which differ in other ways, they can create multiple Jobs, and manage -them as a group using labels. - -From the above work lists, we see a need for Pods which differ in their command -lines, and in their environment variables. These work lists do not require the -pods to differ in other ways. - -Experience in [similar systems](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf) -has shown this model to be applicable to a very broad range of problems, despite -this restriction. - -Therefore we want to allow pods in the same Job to differ **only** in the following - aspects: -- command line -- environment variables - -### Composition of existing images - -The docker image that is used in a job may not be maintained by the person -running the job. Over time, the Dockerfile may change the ENTRYPOINT or CMD. -If we require people to specify the complete command line to use Indexed Job, -then they will not automatically pick up changes in the default -command or args. - -This needs more thought. - -### Running Ad-Hoc Jobs using kubectl - -A user should be able to easily start an Indexed Job using `kubectl`. For -example to run [work list 1](#work-list-1), a user should be able to type -something simple like: - -``` -kubectl run process-files --image=myfileprocessor \ - --per-completion-env=F="12342.dat 97283.dat 38732.dat" \ - --restart=OnFailure \ - -- \ - /usr/local/bin/process_file '$F' -``` - -In the above example: - -- `--restart=OnFailure` implies creating a job instead of replicationController. -- Each pods command line is `/usr/local/bin/process_file $F`. -- `--per-completion-env=` implies the jobs `.spec.completions` is set to the -length of the argument array (3 in the example). -- `--per-completion-env=F=<values>` causes env var with `F` to be available in -the environment when the command line is evaluated. - -How exactly this happens is discussed later in the doc: this is a sketch of the -user experience. - -In practice, the list of files might be much longer and stored in a file on the -users local host, like: - -``` -$ cat files-to-process.txt -12342.dat -97283.dat -38732.dat -... -``` - -So, the user could specify instead: `--per-completion-env=F="$(cat files-to-process.txt)"`. - -However, `kubectl` should also support a format like: - `--per-completion-env=F=@files-to-process.txt`. -That allows `kubectl` to parse the file, point out any syntax errors, and would -not run up against command line length limits (2MB is common, as low as 4kB is -POSIX compliant). 
- -One case we do not try to handle is where the file of work is stored on a cloud -filesystem, and not accessible from the users local host. Then we cannot easily -use indexed job, because we do not know the number of completions. The user -needs to copy the file locally first or use the Work-Queue style of Job (already -supported). - -Another case we do not try to handle is where the input file does not exist yet -because this Job is to be run at a future time, or depends on another job. The -workflow and scheduled job proposal need to consider this case. For that case, -you could use an indexed job which runs a program which shards the input file -(map-reduce-style). - -#### Multiple parameters - -The user may also have multiple parameters, like in [work list 2](#work-list-2). -One way is to just list all the command lines already expanded, one per line, in -a file, like this: - -``` -$ cat matrix-commandlines.txt -/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 0 --end_col 15 -/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 0 --end_col 15 -/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 16 --end_col 31 -/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 16 --end_col 31 -``` - -and run the Job like this: - -``` -kubectl run process-matrix --image=my/matrix \ - --per-completion-env=COMMAND_LINE=@matrix-commandlines.txt \ - --restart=OnFailure \ - -- \ - 'eval "$COMMAND_LINE"' -``` - -However, this may have some subtleties with shell escaping. Also, it depends on -the user knowing all the correct arguments to the docker image being used (more -on this later). - -Instead, kubectl should support multiple instances of the `--per-completion-env` -flag. For example, to implement work list 2, a user could do: - -``` -kubectl run process-matrix --image=my/matrix \ - --per-completion-env=SR="0 16 0 16" \ - --per-completion-env=ER="15 31 15 31" \ - --per-completion-env=SC="0 0 16 16" \ - --per-completion-env=EC="15 15 31 31" \ - --restart=OnFailure \ - -- \ - /usr/local/bin/process_matrix_block -start_row $SR -end_row $ER -start_col $ER --end_col $EC -``` - -### Composition With Workflows and ScheduledJob - -A user should be able to create a job (Indexed or not) which runs at a specific -time(s). For example: - -``` -$ kubectl run process-files --image=myfileprocessor \ - --per-completion-env=F="12342.dat 97283.dat 38732.dat" \ - --restart=OnFailure \ - --runAt=2015-07-21T14:00:00Z - -- \ - /usr/local/bin/process_file '$F' -created "scheduledJob/process-files-37dt3" -``` - -Kubectl should build the same JobSpec, and then put it into a ScheduledJob -(#11980) and create that. - -For [workflow type jobs](../../docs/user-guide/jobs.md#job-patterns), creating a -complete workflow from a single command line would be messy, because of the need -to specify all the arguments multiple times. - -For that use case, the user could create a workflow message by hand. 
Or the user -could create a job template, and then make a workflow from the templates, -perhaps like this: - -``` -$ kubectl run process-files --image=myfileprocessor \ - --per-completion-env=F="12342.dat 97283.dat 38732.dat" \ - --restart=OnFailure \ - --asTemplate \ - -- \ - /usr/local/bin/process_file '$F' -created "jobTemplate/process-files" -$ kubectl run merge-files --image=mymerger \ - --restart=OnFailure \ - --asTemplate \ - -- \ - /usr/local/bin/mergefiles 12342.out 97283.out 38732.out \ -created "jobTemplate/merge-files" -$ kubectl create-workflow process-and-merge \ - --job=jobTemplate/process-files - --job=jobTemplate/merge-files - --dependency=process-files:merge-files -created "workflow/process-and-merge" -``` - -### Completion Indexes - -A JobSpec specifies the number of times a pod needs to complete successfully, -through the `job.Spec.Completions` field. The number of completions will be -equal to the number of work items in the work list. - -Each pod that the job controller creates is intended to complete one work item -from the work list. Since a pod may fail, several pods may, serially, attempt to -complete the same index. Therefore, we call it a *completion index* (or just -*index*), but not a *pod index*. - -For each completion index, in the range 1 to `.job.Spec.Completions`, the job -controller will create a pod with that index, and keep creating them on failure, -until each index is completed. - -An dense integer index, rather than a sparse string index (e.g. using just -`metadata.generate-name`) makes it easy to use the index to lookup parameters -in, for example, an array in shared storage. - -### Pod Identity and Template Substitution in Job Controller - -The JobSpec contains a single pod template. When the job controller creates a -particular pod, it copies the pod template and modifies it in some way to make -that pod distinctive. Whatever is distinctive about that pod is its *identity*. - -We consider several options. - -#### Index Substitution Only - -The job controller substitutes only the *completion index* of the pod into the -pod template when creating it. The JSON it POSTs differs only in a single -fields. - -We would put the completion index as a stringified integer, into an annotation -of the pod. The user can extract it from the annotation into an env var via the -downward API, or put it in a file via a Downward API volume, and parse it -himself. - -Once it is an environment variable in the pod (say `$INDEX`), then one of two -things can happen. - -First, the main program can know how to map from an integer index to what it -needs to do. For example, from Work List 4 above: - -``` -./blender /vol1/mymodel.blend -o /vol2/frame_#### -f $INDEX -``` - -Second, a shell script can be prepended to the original command line which maps -the index to one or more string parameters. For example, to implement Work List -5 above, you could do: - -``` -/vol0/setupenv.sh && ./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start $START_FRAME --frame-end $END_FRAME -``` - -In the above example, `/vol0/setupenv.sh` is a shell script that reads `$INDEX` -and exports `$START_FRAME` and `$END_FRAME`. - -The shell could be part of the image, but more usefully, it could be generated -by a program and stuffed in an annotation or a configMap, and from there added -to a volume. - -The first approach may require the user to modify an existing image (see next -section) to be able to accept an `$INDEX` env var or argument. 
The second -approach requires that the image have a shell. We think that together these two -options cover a wide range of use cases (though not all). - -#### Multiple Substitution - -In this option, the JobSpec is extended to include a list of values to -substitute, and which fields to substitute them into. For example, a worklist -like this: - -``` -FRUIT_COLOR=green process-fruit -a -b -c -f apple.txt --remove-seeds -FRUIT_COLOR=yellow process-fruit -a -b -c -f banana.txt -FRUIT_COLOR=red process-fruit -a -b -c -f cherry.txt --remove-pit -``` - -Can be broken down into a template like this, with three parameters: - -``` -<custom env var 1>; process-fruit -a -b -c <custom arg 1> <custom arg 1> -``` - -and a list of parameter tuples, like this: - -``` -("FRUIT_COLOR=green", "-f apple.txt", "--remove-seeds") -("FRUIT_COLOR=yellow", "-f banana.txt", "") -("FRUIT_COLOR=red", "-f cherry.txt", "--remove-pit") -``` - -The JobSpec can be extended to hold a list of parameter tuples (which are more -easily expressed as a list of lists of individual parameters). For example: - -``` -apiVersion: extensions/v1beta1 -kind: Job -... -spec: - completions: 3 - ... - template: - ... - perCompletionArgs: - container: 0 - - - - "-f apple.txt" - - "-f banana.txt" - - "-f cherry.txt" - - - - "--remove-seeds" - - "" - - "--remove-pit" - perCompletionEnvVars: - - name: "FRUIT_COLOR" - - "green" - - "yellow" - - "red" -``` - -However, just providing custom env vars, and not arguments, is sufficient for -many use cases: parameter can be put into env vars, and then substituted on the -command line. - -#### Comparison - -The multiple substitution approach: - -- keeps the *per completion parameters* in the JobSpec. -- Drawback: makes the job spec large for job with thousands of completions. (But -for very large jobs, the work-queue style or another type of controller, such as -map-reduce or spark, may be a better fit.) -- Drawback: is a form of server-side templating, which we want in Kubernetes but -have not fully designed (see the [StatefulSets proposal](https://github.com/kubernetes/kubernetes/pull/18016/files?short_path=61f4179#diff-61f41798f4bced6e42e45731c1494cee)). - -The index-only approach: - -- Requires that the user keep the *per completion parameters* in a separate -storage, such as a configData or networked storage. -- Makes no changes to the JobSpec. -- Drawback: while in separate storage, they could be mutated, which would have -unexpected effects. -- Drawback: Logic for using index to lookup parameters needs to be in the Pod. -- Drawback: CLIs and UIs are limited to using the "index" as the identity of a -pod from a job. They cannot easily say, for example `repeated failures on the -pod processing banana.txt`. - -Index-only approach relies on at least one of the following being true: - -1. Image containing a shell and certain shell commands (not all images have -this). -1. Use directly consumes the index from annotations (file or env var) and -expands to specific behavior in the main program. - -Also Using the index-only approach from non-kubectl clients requires that they -mimic the script-generation step, or only use the second style. - -#### Decision - -It is decided to implement the Index-only approach now. Once the server-side -templating design is complete for Kubernetes, and we have feedback from users, -we can consider if Multiple Substitution. - -## Detailed Design - -#### Job Resource Schema Changes - -No changes are made to the JobSpec. - - -The JobStatus is also not changed. 
The user can gauge the progress of the job by -the `.status.succeeded` count. - - -#### Job Spec Compatibility - -A job spec written before this change will work exactly the same as before with -the new controller. The Pods it creates will have the same environment as -before. They will have a new annotation, but pod are expected to tolerate -unfamiliar annotations. - -However, if the job controller version is reverted, to a version before this -change, the jobs whose pod specs depend on the new annotation will fail. -This is okay for a Beta resource. - -#### Job Controller Changes - -The Job controller will maintain for each Job a data structured which -indicates the status of each completion index. We call this the -*scoreboard* for short. It is an array of length `.spec.completions`. -Elements of the array are `enum` type with possible values including -`complete`, `running`, and `notStarted`. - -The scoreboard is stored in Job Controller memory for efficiency. In either -case, the Status can be reconstructed from watching pods of the job (such as on -a controller manager restart). The index of the pods can be extracted from the -pod annotation. - -When Job controller sees that the number of running pods is less than the -desired parallelism of the job, it finds the first index in the scoreboard with -value `notRunning`. It creates a pod with this creation index. - -When it creates a pod with creation index `i`, it makes a copy of the -`.spec.template`, and sets -`.spec.template.metadata.annotations.[kubernetes.io/job/completion-index]` to -`i`. It does this in both the index-only and multiple-substitutions options. - -Then it creates the pod. - -When the controller notices that a pod has completed or is running or failed, -it updates the scoreboard. - -When all entries in the scoreboard are `complete`, then the job is complete. - - -#### Downward API Changes - -The downward API is changed to support extracting specific key names into a -single environment variable. So, the following would be supported: - -``` -kind: Pod -version: v1 -spec: - containers: - - name: foo - env: - - name: MY_INDEX - valueFrom: - fieldRef: - fieldPath: metadata.annotations[kubernetes.io/job/completion-index] -``` - -This requires kubelet changes. - -Users who fail to upgrade their kubelets at the same time as they upgrade their -controller manager will see a failure for pods to run when they are created by -the controller. The Kubelet will send an event about failure to create the pod. -The `kubectl describe job` will show many failed pods. - - -#### Kubectl Interface Changes - -The `--completions` and `--completion-index-var-name` flags are added to -kubectl. - -For example, this command: - -``` -kubectl run say-number --image=busybox \ - --completions=3 \ - --completion-index-var-name=I \ - -- \ - sh -c 'echo "My index is $I" && sleep 5' -``` - -will run 3 pods to completion, each printing one of the following lines: - -``` -My index is 1 -My index is 2 -My index is 0 -``` - -Kubectl would create the following pod: - - - -Kubectl will also support the `--per-completion-env` flag, as described -previously. 
For example, this command: - -``` -kubectl run say-fruit --image=busybox \ - --per-completion-env=FRUIT="apple banana cherry" \ - --per-completion-env=COLOR="green yellow red" \ - -- \ - sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5' -``` - -or equivalently: - -``` -echo "apple banana cherry" > fruits.txt -echo "green yellow red" > colors.txt - -kubectl run say-fruit --image=busybox \ - --per-completion-env=FRUIT="$(cat fruits.txt)" \ - --per-completion-env=COLOR="$(cat fruits.txt)" \ - -- \ - sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5' -``` - -or similarly: - -``` -kubectl run say-fruit --image=busybox \ - --per-completion-env=FRUIT=@fruits.txt \ - --per-completion-env=COLOR=@fruits.txt \ - -- \ - sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5' -``` - -will all run 3 pods in parallel. Index 0 pod will log: - -``` -Have a nice green apple -``` - -and so on. - - -Notes: - -- `--per-completion-env=` is of form `KEY=VALUES` where `VALUES` is either a -quoted space separated list or `@` and the name of a text file containing a -list. -- `--per-completion-env=` can be specified several times, but all must have the -same length list. -- `--completions=N` with `N` equal to list length is implied. -- The flag `--completions=3` sets `job.spec.completions=3`. -- The flag `--completion-index-var-name=I` causes an env var to be created named -I in each pod, with the index in it. -- The flag `--restart=OnFailure` is implied by `--completions` or any -job-specific arguments. The user can also specify `--restart=Never` if they -desire but may not specify `--restart=Always` with job-related flags. -- Setting any of these flags in turn tells kubectl to create a Job, not a -replicationController. - -#### How Kubectl Creates Job Specs. - -To pass in the parameters, kubectl will generate a shell script which -can: -- parse the index from the annotation -- hold all the parameter lists. -- lookup the correct index in each parameter list and set an env var. - -For example, consider this command: - -``` -kubectl run say-fruit --image=busybox \ - --per-completion-env=FRUIT="apple banana cherry" \ - --per-completion-env=COLOR="green yellow red" \ - -- \ - sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5' -``` - -First, kubectl generates the PodSpec as it normally does for `kubectl run`. - -But, then it will generate this script: - -```sh -#!/bin/sh -# Generated by kubectl run ... -# Check for needed commands -if [[ ! type cat ]] -then - echo "$0: Image does not include required command: cat" - exit 2 -fi -if [[ ! type grep ]] -then - echo "$0: Image does not include required command: grep" - exit 2 -fi -# Check that annotations are mounted from downward API -if [[ ! -e /etc/annotations ]] -then - echo "$0: Cannot find /etc/annotations" - exit 2 -fi -# Get our index from annotations file -I=$(cat /etc/annotations | grep job.kubernetes.io/index | cut -f 2 -d '\"') || echo "$0: failed to extract index" -export I - -# Our parameter lists are stored inline in this script. -FRUIT_0="apple" -FRUIT_1="banana" -FRUIT_2="cherry" -# Extract the right parameter value based on our index. -# This works on any Bourne-based shell. -FRUIT=$(eval echo \$"FRUIT_$I") -export FRUIT - -COLOR_0="green" -COLOR_1="yellow" -COLOR_2="red" - -COLOR=$(eval echo \$"FRUIT_$I") -export COLOR -``` - -Then it POSTs this script, encoded, inside a ConfigData. -It attaches this volume to the PodSpec. - -Then it will edit the command line of the Pod to run this script before the rest of -the command line. 
- -Then it appends a DownwardAPI volume to the pod spec to get the annotations in a file, like this: -It also appends the Secret (later configData) volume with the script in it. - -So, the Pod template that kubectl creates (inside the job template) looks like this: - -``` -apiVersion: v1 -kind: Job -... -spec: - ... - template: - ... - spec: - containers: - - name: c - image: k8s.gcr.io/busybox - command: - - 'sh' - - '-c' - - '/etc/job-params.sh; echo "this is the rest of the command"' - volumeMounts: - - name: annotations - mountPath: /etc - - name: script - mountPath: /etc - volumes: - - name: annotations - downwardAPI: - items: - - path: "annotations" - ieldRef: - fieldPath: metadata.annotations - - name: script - secret: - secretName: jobparams-abc123 -``` - -###### Alternatives - -Kubectl could append a `valueFrom` line like this to -get the index into the environment: - -```yaml -apiVersion: extensions/v1beta1 -kind: Job -metadata: - ... -spec: - ... - template: - ... - spec: - containers: - - name: foo - ... - env: - # following block added: - - name: I - valueFrom: - fieldRef: - fieldPath: metadata.annotations."kubernetes.io/job-idx" -``` - -However, in order to inject other env vars from parameter list, -kubectl still needs to edit the command line. - -Parameter lists could be passed via a configData volume instead of a secret. -Kubectl can be changed to work that way once the configData implementation is -complete. - -Parameter lists could be passed inside an EnvVar. This would have length -limitations, would pollute the output of `kubectl describe pods` and `kubectl -get pods -o json`. - -Parameter lists could be passed inside an annotation. This would have length -limitations, would pollute the output of `kubectl describe pods` and `kubectl -get pods -o json`. Also, currently annotations can only be extracted into a -single file. Complex logic is then needed to filter out exactly the desired -annotation data. - -Bash array variables could simplify extraction of a particular parameter from a -list of parameters. However, some popular base images do not include -`/bin/bash`. For example, `busybox` uses a compact `/bin/sh` implementation -that does not support array syntax. - -Kubelet does support [expanding variables without a -shell](http://kubernetes.io/kubernetes/v1.1/docs/design/expansion.html). But it does not -allow for recursive substitution, which is required to extract the correct -parameter from a list based on the completion index of the pod. The syntax -could be extended, but doing so seems complex and will be an unfamiliar syntax -for users. - -Putting all the command line editing into a script and running that causes -the least pollution to the original command line, and it allows -for complex error handling. - -Kubectl could store the script in an [Inline Volume]( -https://github.com/kubernetes/kubernetes/issues/13610) if that proposal -is approved. That would remove the need to manage the lifetime of the -configData/secret, and prevent the case where someone changes the -configData mid-job, and breaks things in a hard-to-debug way. - - -## Interactions with other features - -#### Supporting Work Queue Jobs too - -For Work Queue Jobs, completions has no meaning. Parallelism should be allowed -to be greater than it, and pods have no identity. So, the job controller should -not create a scoreboard in the JobStatus, just a count. 
Therefore, we need to -add one of the following to JobSpec: - -- allow unset `.spec.completions` to indicate no scoreboard, and no index for -tasks (identical tasks). -- allow `.spec.completions=-1` to indicate the same. -- add `.spec.indexed` to job to indicate need for scoreboard. - -#### Interaction with vertical autoscaling - -Since pods of the same job will not be created with different resources, -a vertical autoscaler will need to: - -- if it has index-specific initial resource suggestions, suggest those at -admission time; it will need to understand indexes. -- mutate resource requests on already created pods based on usage trend or -previous container failures. -- modify the job template, affecting all indexes. - -#### Comparison to StatefulSets (previously named PetSets) - -The *Index substitution-only* option corresponds roughly to StatefulSet Proposal 1b. -The `perCompletionArgs` approach is similar to StatefulSet Proposal 1e, but more -restrictive and thus less verbose. - -It would be easier for users if Indexed Job and StatefulSet are similar where -possible. However, StatefulSet differs in several key respects: - -- StatefulSet is for ones to tens of instances. Indexed job should work with tens of -thousands of instances. -- When you have few instances, you may want to give them names. When you have many instances, -integer indexes make more sense. -- When you have thousands of instances, storing the work-list in the JobSpec -is verbose. For StatefulSet, this is less of a problem. -- StatefulSets (apparently) need to differ in more fields than indexed Jobs. - -This differs from StatefulSet in that StatefulSet uses names and not indexes. StatefulSet is -intended to support ones to tens of things. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/apps/job.md b/contributors/design-proposals/apps/job.md index 5415ad76..f0fbec72 100644 --- a/contributors/design-proposals/apps/job.md +++ b/contributors/design-proposals/apps/job.md @@ -1,208 +1,6 @@
# Job Controller
+Design proposals have been archived.

## Abstract
+To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/).

A proposal for implementing a new controller - Job controller - which will be responsible for managing pod(s) that require running once to completion even if the machine the pod is running on fails, in contrast to what ReplicationController currently offers.

Several existing issues and PRs were already created regarding that particular subject:
* Job Controller [#1624](https://github.com/kubernetes/kubernetes/issues/1624)
* New Job resource [#7380](https://github.com/kubernetes/kubernetes/pull/7380)


## Use Cases

1. Be able to start one or several pods tracked as a single entity.
1. Be able to run batch-oriented workloads on Kubernetes.
1. Be able to get the job status.
1. Be able to specify the number of instances performing a job at any one time.
1. Be able to specify the number of successfully finished instances required to finish a job.
1. Be able to specify a backoff policy for when a job is continuously failing.


## Motivation

Jobs are needed for executing multi-pod computation to completion; a good example here would be the ability to implement any type of batch-oriented task.


## Backoff policy and failed pod limit

By design, Jobs do not have any notion of failure, other than a pod's `restartPolicy`, which is mistakenly taken as the Job's restart policy ([#30243](https://github.com/kubernetes/kubernetes/issues/30243), [#43964](https://github.com/kubernetes/kubernetes/issues/43964)). There are situations where one wants to fail a Job after some number of retries over a certain period of time, for example due to a logical error in configuration. To do so we are going to introduce the following fields, which will control the backoff policy: a number of retries and an initial time of retry. The two fields will allow fine-grained control over the backoff policy. Each of the two fields will use a default value if none is provided: `BackoffLimit` is set by default to 6 and `BackoffSeconds` to 10s. This will result in the following retry sequence: 10s, 20s, 40s, 1m20s, 2m40s, 5m20s, after which the job will be considered failed.

Additionally, to help debug the issue with a Job, and to limit the impact of having too many failed pods left around (as mentioned in [#30243](https://github.com/kubernetes/kubernetes/issues/30243)), we are going to introduce a field which will allow specifying the maximum number of failed pods to keep around. This number will also take effect if none of the limits described above are set. By default it will take the value 1, to allow debugging job issues without flooding the cluster with too many failed jobs and their accompanying pods.

All of the above fields will be optional and will apply when `restartPolicy` is set to `Never` on a `PodTemplate`. With restart policy `OnFailure` only `BackoffLimit` applies. The reason for that is that failed pods are already restarted by the kubelet with an [exponential backoff](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy).
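To make the doubling concrete, the following Go sketch (illustrative only, not part of the proposed API) reproduces the retry sequence implied by the defaults above, assuming each retry simply doubles the previous delay:

```go
package main

import (
	"fmt"
	"time"
)

// backoffSequence lists the retry delays implied by the proposed defaults:
// the first retry waits `initial`, and each subsequent retry doubles the wait.
// This helper is an illustration, not code from the proposal.
func backoffSequence(initial time.Duration, limit int) []time.Duration {
	delays := make([]time.Duration, 0, limit)
	for d := initial; len(delays) < limit; d *= 2 {
		delays = append(delays, d)
	}
	return delays
}

func main() {
	// BackoffSeconds=10s and BackoffLimit=6 give: [10s 20s 40s 1m20s 2m40s 5m20s]
	fmt.Println(backoffSequence(10*time.Second, 6))
}
```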
-Additionally, failures are counted differently depending on `restartPolicy` -setting. For `Never` we count actual pod failures (reflected in `.status.failed` -field). With `OnFailure`, we take an approximate value of pod restarts (as reported -in `.status.containerStatuses[*].restartCount`). -When `.spec.parallelism` is set to a value higher than 1, the failures are an -overall number (as coming from `.status.failed`) because the controller does not -hold information about failures coming from separate pods. - - -## Implementation - -Job controller is similar to replication controller in that they manage pods. -This implies they will follow the same controller framework that replication -controllers already defined. The biggest difference between a `Job` and a -`ReplicationController` object is the purpose; `ReplicationController` -ensures that a specified number of Pods are running at any one time, whereas -`Job` is responsible for keeping the desired number of Pods to a completion of -a task. This difference will be represented by the `RestartPolicy` which is -required to always take value of `RestartPolicyNever` or `RestartOnFailure`. - - -The new `Job` object will have the following content: - -```go -// Job represents the configuration of a single job. -type Job struct { - TypeMeta - ObjectMeta - - // Spec is a structure defining the expected behavior of a job. - Spec JobSpec - - // Status is a structure describing current status of a job. - Status JobStatus -} - -// JobList is a collection of jobs. -type JobList struct { - TypeMeta - ListMeta - - Items []Job -} -``` - -`JobSpec` structure is defined to contain all the information how the actual job execution -will look like. - -```go -// JobSpec describes how the job execution will look like. -type JobSpec struct { - - // Parallelism specifies the maximum desired number of pods the job should - // run at any given time. The actual number of pods running in steady state will - // be less than this number when ((.spec.completions - .status.successful) < .spec.parallelism), - // i.e. when the work left to do is less than max parallelism. - Parallelism *int32 - - // Completions specifies the desired number of successfully finished pods the - // job should be run with. Defaults to 1. - Completions *int32 - - // Optional duration in seconds relative to the startTime that the job may be active - // before the system tries to terminate it; value must be a positive integer. - // It applies to overall job run time, no matter of the value of completions - // or parallelism parameters. - ActiveDeadlineSeconds *int64 - - // Optional number of retries before marking this job failed. - // Defaults to 6. - BackoffLimit *int32 - - // Optional time (in seconds) specifying how long the initial backoff will last. - // Defaults to 10s. - BackoffSeconds *int64 - - // Optional number of failed pods to retain. - FailedPodsLimit *int32 - - // Selector is a label query over pods running a job. - Selector LabelSelector - - // Template is the object that describes the pod that will be created when - // executing a job. - Template *PodTemplateSpec -} -``` - -`JobStatus` structure is defined to contain information about pods executing -specified job. The structure holds information about pods currently executing -the job. - -```go -// JobStatus represents the current state of a Job. 
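// The counters below (Active, Succeeded, Failed) are maintained by the Job
// controller as the pods it created run; note (see the field comments) that
// Failed is only meaningful for jobs created with RestartPolicyNever.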
-type JobStatus struct { - Conditions []JobCondition - - // CreationTime represents time when the job was created - CreationTime unversioned.Time - - // StartTime represents time when the job was started - StartTime unversioned.Time - - // CompletionTime represents time when the job was completed - CompletionTime unversioned.Time - - // Active is the number of actively running pods. - Active int32 - - // Succeeded is the number of pods successfully completed their job. - Succeeded int32 - - // Failed is the number of pods failures, this applies only to jobs - // created with RestartPolicyNever, otherwise this value will always be 0. - Failed int32 -} - -type JobConditionType string - -// These are valid conditions of a job. -const ( - // JobComplete means the job has completed its execution. - JobComplete JobConditionType = "Complete" -) - -// JobCondition describes current state of a job. -type JobCondition struct { - Type JobConditionType - Status ConditionStatus - LastHeartbeatTime unversioned.Time - LastTransitionTime unversioned.Time - Reason string - Message string -} -``` - -## Events - -Job controller will be emitting the following events: -* JobStart -* JobFinish - -## Future evolution - -Below are the possible future extensions to the Job controller: -* Be able to limit the execution time for a job, similarly to ActiveDeadlineSeconds for Pods. *now implemented* -* Be able to create a chain of jobs dependent one on another. *will be implemented in a separate type called Workflow* -* Be able to specify the work each of the workers should execute (see type 1 from - [this comment](https://github.com/kubernetes/kubernetes/issues/1624#issuecomment-97622142)) -* Be able to inspect Pods running a Job, especially after a Job has finished, e.g. - by providing pointers to Pods in the JobStatus ([see comment](https://github.com/kubernetes/kubernetes/pull/11746/files#r37142628)). -* help users avoid non-unique label selectors ([see this proposal](selector-generation.md)) +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
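As a worked illustration of the fields above, the sketch below (illustrative only, using a simplified signature rather than the proposed Go types) shows one plausible reading of when the controller would set the JobComplete condition: once the number of succeeded pods reaches `Completions`, which defaults to 1 when unset.

```go
package main

import "fmt"

// isComplete sketches the JobComplete check using only fields described in
// this proposal: Completions from JobSpec and Succeeded from JobStatus.
func isComplete(completions *int32, succeeded int32) bool {
	want := int32(1) // Completions defaults to 1 when unset.
	if completions != nil {
		want = *completions
	}
	return succeeded >= want
}

func main() {
	three := int32(3)
	fmt.Println(isComplete(&three, 2)) // false: still running
	fmt.Println(isComplete(&three, 3)) // true: job has completed
	fmt.Println(isComplete(nil, 1))    // true: default of one completion reached
}
```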
\ No newline at end of file diff --git a/contributors/design-proposals/apps/selector-generation.md b/contributors/design-proposals/apps/selector-generation.md index 2f3a6b49..f0fbec72 100644 --- a/contributors/design-proposals/apps/selector-generation.md +++ b/contributors/design-proposals/apps/selector-generation.md @@ -1,176 +1,6 @@
Design
=============
+Design proposals have been archived.

# Goals
+To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/).

Make it really hard to accidentally create a job which has an overlapping selector, while still making it possible to choose an arbitrary selector, and without adding complex constraint solving to the API server.

# Use Cases

1. User can leave all label and selector fields blank and the system will fill in reasonable ones: non-overlappingness guaranteed.
2. User can put on the pod template some labels that are useful to the user, without reasoning about non-overlappingness. The system adds an additional label to ensure it does not overlap.
3. If the user wants to reparent pods to a new job (very rare case) and knows what they are doing, they can completely disable this behavior and specify an explicit selector.
4. If a controller that makes jobs, like scheduled job, wants to use different labels, such as the time and date of the run, it can do that.
5. If the user reads v1beta1 documentation or reuses v1beta1 Job definitions and just changes the API group, the user should not automatically be allowed to specify a selector, since this is very rarely what people want to do and is error prone.
6. If the user downloads an existing job definition, e.g. with `kubectl get jobs/old -o yaml`, and tries to modify and post it, he should not create an overlapping job.
7. If the user downloads an existing job definition, e.g. with `kubectl get jobs/old -o yaml`, tries to modify and post it, and accidentally copies the uniquifying label from the old one, then he should not get an error from a label-key conflict, nor get erratic behavior.
8. If the user reads the swagger docs and sees the selector field, he should not be able to set it without realizing the risks.
9. (Deferred requirement:) If the user wants to specify a preferred name for the non-overlappingness key, they can pick a name.

# Proposed changes

## API

`extensions/v1beta1 Job` remains the same. `batch/v1 Job` changes as follows.

Field `job.spec.manualSelector` is added. It controls whether selectors are automatically generated. In automatic mode, the user cannot make the mistake of creating non-unique selectors. In manual mode, certain rare use cases are supported.

Validation is not changed. A selector must be provided, and it must select the pod template.

Defaulting changes. Defaulting happens in one of two modes:

### Automatic Mode

- User does not specify `job.spec.selector`.
- User is probably unaware of the `job.spec.manualSelector` field and does not think about it.
- User optionally puts labels on the pod template. User does not think about uniqueness, just labeling for the user's own reasons.
- Defaulting logic sets `job.spec.selector` to `matchLabels["controller-uid"]="$UIDOFJOB"`.
- Defaulting logic appends 2 labels to the `.spec.template.metadata.labels`.
  - The first label is controller-uid=$UIDOFJOB.
  - The second label is "job-name=$NAMEOFJOB".

### Manual Mode

- User means User or Controller for the rest of this list.
-- User does specify `job.spec.selector`. -- User does specify `job.spec.manualSelector=true` -- User puts a unique label or label(s) on pod template (required). User does -think carefully about uniqueness. -- No defaulting of pod labels or the selector happen. - -### Rationale - -UID is better than Name in that: -- it allows cross-namespace control someday if we need it. -- it is unique across all kinds. `controller-name=foo` does not ensure -uniqueness across Kinds `job` vs `replicaSet`. Even `job-name=foo` has a -problem: you might have a `batch.Job` and a `snazzyjob.io/types.Job` -- the -latter cannot use label `job-name=foo`, though there is a temptation to do so. -- it uniquely identifies the controller across time. This prevents the case -where, for example, someone deletes a job via the REST api or client -(where cascade=false), leaving pods around. We don't want those to be picked up -unintentionally. It also prevents the case where a user looks at an old job that -finished but is not deleted, and tries to select its pods, and gets the wrong -impression that it is still running. - -Job name is more user friendly. It is self documenting - -Commands like `kubectl get pods -l job-name=myjob` should do exactly what is -wanted 99.9% of the time. Automated control loops should still use the -controller-uid=label. - -Using both gets the benefits of both, at the cost of some label verbosity. - -The field is a `*bool`. Since false is expected to be much more common, -and since the feature is complex, it is better to leave it unspecified so that -users looking at a stored pod spec do not need to be aware of this field. - -### Overriding Unique Labels - -If user does specify `job.spec.selector` then the user must also specify -`job.spec.manualSelector`. This ensures the user knows that what he is doing is -not the normal thing to do. - -To prevent users from copying the `job.spec.manualSelector` flag from existing -jobs, it will be optional and default to false, which means when you ask GET and -existing job back that didn't use this feature, you don't even see the -`job.spec.manualSelector` flag, so you are not tempted to wonder if you should -fiddle with it. - -## Job Controller - -No changes - -## Kubectl - -No required changes. Suggest moving SELECTOR to wide output of `kubectl get -jobs` since users do not write the selector. - -## Docs - -Remove examples that use selector and remove labels from pod templates. -Recommend `kubectl get jobs -l job-name=name` as the way to find pods of a job. - -# Conversion - -The following applies to Job, as well as to other types that adopt this pattern: - -- Type `extensions/v1beta1` gets a field called `job.spec.autoSelector`. -- Both the internal type and the `batch/v1` type will get -`job.spec.manualSelector`. -- The fields `manualSelector` and `autoSelector` have opposite meanings. -- Each field defaults to false when unset, and so v1beta1 has a different -default than v1 and internal. This is intentional: we want new uses to default -to the less error-prone behavior, and we do not want to change the behavior of -v1beta1. - -*Note*: since the internal default is changing, client library consumers that -create Jobs may need to add "job.spec.manualSelector=true" to keep working, or -switch to auto selectors. - -Conversion is as follows: -- `extensions/__internal` to `extensions/v1beta1`: the value of -`__internal.Spec.ManualSelector` is defaulted to false if nil, negated, -defaulted to nil if false, and written `v1beta1.Spec.AutoSelector`. 
-- `extensions/v1beta1` to `extensions/__internal`: the value of -`v1beta1.SpecAutoSelector` is defaulted to false if nil, negated, defaulted to -nil if false, and written to `__internal.Spec.ManualSelector`. - -This conversion gives the following properties. - -1. Users that previously used v1beta1 do not start seeing a new field when they -get back objects. -2. Distinction between originally unset versus explicitly set to false is not -preserved (would have been nice to do so, but requires more complicated -solution). -3. Users who only created v1beta1 examples or v1 examples, will not ever see the -existence of either field. -4. Since v1beta1 are convertible to/from v1, the storage location (path in etcd) -does not need to change, allowing scriptable rollforward/rollback. - -# Future Work - -Follow this pattern for Deployments, ReplicaSet, DaemonSet when going to v1, if -it works well for job. - -Docs will be edited to show examples without a `job.spec.selector`. - -We probably want as much as possible the same behavior for Job and -ReplicationController. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
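To make the automatic-mode defaulting described above concrete, here is a minimal illustrative sketch (plain maps instead of the real API types; the helper name and signature are not part of the proposal) of the selector and labels the defaulting logic produces:

```go
package main

import "fmt"

// autoSelectorDefaults sketches automatic mode: the selector matches only on
// controller-uid, and both controller-uid and job-name labels are appended to
// whatever labels the user already put on the pod template.
func autoSelectorDefaults(jobName, jobUID string, templateLabels map[string]string) (selector, labels map[string]string) {
	selector = map[string]string{"controller-uid": jobUID}
	labels = map[string]string{}
	for k, v := range templateLabels { // keep the user's own labels untouched
		labels[k] = v
	}
	labels["controller-uid"] = jobUID
	labels["job-name"] = jobName
	return selector, labels
}

func main() {
	sel, lbls := autoSelectorDefaults("myjob", "example-uid", map[string]string{"app": "work"})
	fmt.Println(sel)  // map[controller-uid:example-uid]
	fmt.Println(lbls) // map[app:work controller-uid:example-uid job-name:myjob]
}
```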
\ No newline at end of file diff --git a/contributors/design-proposals/apps/stateful-apps.md b/contributors/design-proposals/apps/stateful-apps.md index dd7fddbb..f0fbec72 100644 --- a/contributors/design-proposals/apps/stateful-apps.md +++ b/contributors/design-proposals/apps/stateful-apps.md @@ -1,357 +1,6 @@
# StatefulSets: Running pods which need strong identity and storage
+Design proposals have been archived.

## Motivation
+To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/).

Many examples of clustered software systems require stronger guarantees per instance than are provided by the Replication Controller (aka Replication Controllers). Instances of these systems typically require:
1. Data per instance which should not be lost even if the pod is deleted, typically on a persistent volume
   * Some cluster instances may have tens of TB of stored data - forcing new instances to replicate data from other members over the network is onerous
2. A stable and unique identity associated with that instance of the storage - such as a unique member id
3. A consistent network identity that allows other members to locate the instance even if the pod is deleted
4. A predictable number of instances to ensure that systems can form a quorum
   * This may be necessary during initialization
5. Ability to migrate from node to node with stable network identity (DNS name)
6. The ability to scale up in a controlled fashion, but are very rarely scaled down without human intervention

Kubernetes should expose a pod controller (a StatefulSet) that satisfies these requirements in a flexible manner. It should be easy for users to manage and reason about the behavior of this set. An administrator with familiarity in a particular cluster system should be able to leverage this controller and its supporting documentation to run that clustered system on Kubernetes. It is expected that some adaptation is required to support each new cluster.

This resource is **stateful** because it offers an easy way to link a pod's network identity to its storage identity and because it is intended to be used to run software that is the holders of state for other components. That does not mean that all stateful applications *must* use StatefulSets, but the tradeoffs in this resource are intended to facilitate holding state in the cluster.


## Use Cases

The software listed below forms the primary use-cases for a StatefulSet on the cluster - problems encountered while adapting these for Kubernetes should be addressed in a final design.

* Quorum with Leader Election
  * MongoDB - in replica set mode forms a quorum with an elected leader, but instances must be preconfigured and have stable network identities.
  * ZooKeeper - forms a quorum with an elected leader, but is sensitive to cluster membership changes and replacement instances *must* present consistent identities
  * etcd - forms a quorum with an elected leader, can alter cluster membership in a consistent way, and requires stable network identities
* Decentralized Quorum
  * Cassandra - allows flexible consistency and distributes data via innate hash ring sharding, is also flexible to scaling, more likely to support members that come and go. Scale down may trigger massive rebalances.
-* Active-active - * Galera - has multiple active masters which must remain in sync -* Leader-followers - * Spark in standalone mode - A single unilateral leader and a set of workers - - -## Background - -Replica sets are designed with a weak guarantee - that there should be N replicas of a particular -pod template. Each pod instance varies only by name, and the replication controller errs on the side of -ensuring that N replicas exist as quickly as possible (by creating new pods as soon as old ones begin graceful -deletion, for instance, or by being able to pick arbitrary pods to scale down). In addition, pods by design -have no stable network identity other than their assigned pod IP, which can change over the lifetime of a pod -resource. ReplicaSets are best leveraged for stateless, shared-nothing, zero-coordination, -embarassingly-parallel, or fungible software. - -While it is possible to emulate the guarantees described above by leveraging multiple replication controllers -(for distinct pod templates and pod identities) and multiple services (for stable network identity), the -resulting objects are hard to maintain and must be copied manually in order to scale a cluster. - -By contrast, a DaemonSet *can* offer some of the guarantees above, by leveraging Nodes as stable, long-lived -entities. An administrator might choose a set of nodes, label them a particular way, and create a -DaemonSet that maps pods to each node. The storage of the node itself (which could be network attached -storage, or a local SAN) is the persistent storage. The network identity of the node is the stable -identity. However, while there are examples of clustered software that benefit from close association to -a node, this creates an undue burden on administrators to design their cluster to satisfy these -constraints, when a goal of Kubernetes is to decouple system administration from application management. - - -## Design Assumptions - -* **Specialized Controller** - Rather than increase the complexity of the ReplicaSet to satisfy two distinct - use cases, create a new resource that assists users in solving this particular problem. -* **Safety first** - Running a clustered system on Kubernetes should be no harder - than running a clustered system off Kube. Authors should be given tools to guard against common cluster - failure modes (split brain, phantom member) to prevent introducing more failure modes. Sophisticated - distributed systems designers can implement more sophisticated solutions than StatefulSet if necessary - - new users should not become vulnerable to additional failure modes through an overly flexible design. -* **Controlled scaling** - While flexible scaling is important for some clusters, other examples of clusters - do not change scale without significant external intervention. Human intervention may be required after - scaling. Changing scale during cluster operation can lead to split brain in quorum systems. It should be - possible to scale, but there may be responsibilities on the set author to correctly manage the scale. -* **No generic cluster lifecycle** - Rather than design a general purpose lifecycle for clustered software, - focus on ensuring the information necessary for the software to function is available. For example, - rather than providing a "post-creation" hook invoked when the cluster is complete, provide the necessary - information to the "first" (or last) pod to determine the identity of the remaining cluster members and - allow it to manage its own initialization. 
- - -## Proposed Design - -Add a new resource to Kubernetes to represent a set of pods that are individually distinct but each -individual can safely be replaced-- the name **StatefulSet** is chosen to convey that the individual members of -the set are themselves "stateful" and thus each one is preserved. Each member has an identity, and there will -always be a member that thinks it is the "first" one. - -The StatefulSet is responsible for creating and maintaining a set of **identities** and ensuring that there is -one pod and zero or more **supporting resources** for each identity. There should never be more than one pod -or unique supporting resource per identity at any one time. A new pod can be created for an identity only -if a previous pod has been fully terminated (reached its graceful termination limit or cleanly exited). - -A StatefulSet has 0..N **members**, each with a unique **identity** which is a name that is unique within the -set. - -```go -type StatefulSet struct { - ObjectMeta - - Spec StatefulSetSpec - ... -} - -type StatefulSetSpec struct { - // Replicas is the desired number of replicas of the given template. - // Each replica is assigned a unique name of the form `name-$replica` - // where replica is in the range `0 - (replicas-1)`. - Replicas int - - // A label selector that "owns" objects created under this set - Selector *LabelSelector - - // Template is the object describing the pod that will be created - each - // pod created by this set will match the template, but have a unique identity. - Template *PodTemplateSpec - - // VolumeClaimTemplates is a list of claims that members are allowed to reference. - // The StatefulSet controller is responsible for mapping network identities to - // claims in a way that maintains the identity of a member. Every claim in - // this list must have at least one matching (by name) volumeMount in one - // container in the template. A claim in this list takes precedence over - // any volumes in the template, with the same name. - VolumeClaimTemplates []PersistentVolumeClaim - - // ServiceName is the name of the service that governs this StatefulSet. - // This service must exist before the StatefulSet, and is responsible for - // the network identity of the set. Members get DNS/hostnames that follow the - // pattern: member-specific-string.serviceName.default.svc.cluster.local - // where "member-specific-string" is managed by the StatefulSet controller. - ServiceName string -} -``` - -Like a replication controller, a StatefulSet may be targeted by an autoscaler. The StatefulSet makes no assumptions -about upgrading or altering the pods in the set for now - instead, the user can trigger graceful deletion -and the StatefulSet will replace the terminated member with the newer template once it exits. Future proposals -may offer update capabilities. A StatefulSet requires RestartAlways pods. The addition of forgiveness may be -necessary in the future to increase the safety of the controller recreating pods. - - -### How identities are managed - -A key question is whether scaling down a StatefulSet and then scaling it back up should reuse identities. If not, -scaling down becomes a destructive action (an admin cannot recover by scaling back up). Given the safety -first assumption, identity reuse seems the correct default. 
This implies that identity assignment should -be deterministic and not subject to controller races (a controller that has crashed during scale up should -assign the same identities on restart, and two concurrent controllers should decide on the same outcome -identities). - -The simplest way to manage identities, and easiest to understand for users, is a numeric identity system -starting at I=0 that ranges up to the current replica count and is contiguous. - -Future work: - -* Cover identity reclamation - cleaning up resources for identities that are no longer in use. -* Allow more sophisticated identity assignment - instead of `{name}-{0 - replicas-1}`, allow subsets and - complex indexing. - -### Controller behavior - -When a StatefulSet is scaled up, the controller must create both pods and supporting resources for -each new identity. The controller must create supporting resources for the pod before creating the -pod. If a supporting resource with the appropriate name already exists, the controller should treat that as -creation succeeding. If a supporting resource cannot be created, the controller should flag an error to -status, back-off (like a scheduler or replication controller), and try again later. Each resource created -by a StatefulSet controller must have a set of labels that match the selector, support orphaning, and have a -controller back reference annotation identifying the owning StatefulSet by name and UID. - -When a StatefulSet is scaled down, the pod for the removed identity should be deleted. It is less clear what the -controller should do to supporting resources. If every pod requires a PV, and a user accidentally scales -up to N=200 and then back down to N=3, leaving 197 PVs lying around may be undesirable (potential for -abuse). On the other hand, a cluster of 5 that is accidentally scaled down to 3 might irreparably destroy -the cluster if the PV for identities 4 and 5 are deleted (may not be recoverable). For the initial proposal, -leaving the supporting resources is the safest path (safety first) with a potential future policy applied -to the StatefulSet for how to manage supporting resources (DeleteImmediately, GarbageCollect, Preserve). - -The controller should reflect summary counts of resources on the StatefulSet status to enable clients to easily -understand the current state of the set. - -### Parameterizing pod templates and supporting resources - -Since each pod needs a unique and distinct identity, and the pod needs to know its own identity, the -StatefulSet must allow a pod template to be parameterized by the identity assigned to the pod. The pods that -are created should be easily identified by their cluster membership. - -Because that pod needs access to stable storage, the StatefulSet may specify a template for one or more -**persistent volume claims** that can be used for each distinct pod. The name of the volume claim must -match a volume mount within the pod template. - -Future work: - -* In the future other resources may be added that must also be templated - for instance, secrets (unique secret per member), config data (unique config per member), and in the further future, arbitrary extension resources. -* Consider allowing the identity value itself to be passed as an environment variable via the downward API -* Consider allowing per identity values to be specified that are passed to the pod template or volume claim. 
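As a small illustration of the naming scheme above (the helper and its namespace argument are assumptions for the example, not part of the proposal), member identities and their service-governed DNS names can be enumerated like this:

```go
package main

import "fmt"

// memberDNSNames sketches the numeric identity scheme: members are named
// name-0 .. name-(replicas-1), and each one gets a stable DNS name under the
// governing service, following the pattern described for the ServiceName field.
func memberDNSNames(name, serviceName, namespace string, replicas int) []string {
	names := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		names = append(names,
			fmt.Sprintf("%s-%d.%s.%s.svc.cluster.local", name, i, serviceName, namespace))
	}
	return names
}

func main() {
	for _, dns := range memberDNSNames("mdb", "mongodb", "default", 3) {
		fmt.Println(dns)
	}
	// mdb-0.mongodb.default.svc.cluster.local
	// mdb-1.mongodb.default.svc.cluster.local
	// mdb-2.mongodb.default.svc.cluster.local
}
```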
- - -### Accessing pods by stable network identity - -In order to provide stable network identity, given that pods may not assume pod IP is constant over the -lifetime of a pod, it must be possible to have a resolvable DNS name for the pod that is tied to the -pod identity. There are two broad classes of clustered services - those that require clients to know -all members of the cluster (load balancer intolerant) and those that are amenable to load balancing. -For the former, clients must also be able to easily enumerate the list of DNS names that represent the -member identities and access them inside the cluster. Within a pod, it must be possible for containers -to find and access that DNS name for identifying itself to the cluster. - -Since a pod is expected to be controlled by a single controller at a time, it is reasonable for a pod to -have a single identity at a time. Therefore, a service can expose a pod by its identity in a unique -fashion via DNS by leveraging information written to the endpoints by the endpoints controller. - -The end result might be DNS resolution as follows: - -```sh -# service mongo pointing to pods created by StatefulSet mdb, with identities mdb-1, mdb-2, mdb-3 - -dig mongodb.namespace.svc.cluster.local +short A -172.130.16.50 - -dig mdb-1.mongodb.namespace.svc.cluster.local +short A -# IP of pod created for mdb-1 - -dig mdb-2.mongodb.namespace.svc.cluster.local +short A -# IP of pod created for mdb-2 - -dig mdb-3.mongodb.namespace.svc.cluster.local +short A -# IP of pod created for mdb-3 -``` - -This is currently implemented via an annotation on pods, which is surfaced to endpoints, and finally -surfaced as DNS on the service that exposes those pods. - -```yaml -# The pods created by this StatefulSet will have the DNS names "mysql-0.NAMESPACE.svc.cluster.local" -# and "mysql-1.NAMESPACE.svc.cluster.local" -kind: StatefulSet -metadata: - name: mysql -spec: - replicas: 2 - serviceName: db - template: - spec: - containers: - - image: mysql:latest - -// Example pod created by stateful set -kind: Pod -metadata: - name: mysql-0 - annotations: - pod.beta.kubernetes.io/hostname: "mysql-0" - pod.beta.kubernetes.io/subdomain: db -spec: - ... -``` - - -### Preventing duplicate identities - -The StatefulSet controller is expected to execute like other controllers, as a single writer. However, when -considering designing for safety first, the possibility of the controller running concurrently cannot -be overlooked, and so it is important to ensure that duplicate pod identities are not achieved. - -There are two mechanisms to achieve this at the current time. One is to leverage unique names for pods -that carry the identity of the pod - this prevents duplication because etcd 2 can guarantee single -key transactionality. The other is to use the status field of the StatefulSet to coordinate membership -information. It is possible to leverage both at this time, and encourage users to not assume pod -name is significant, but users are likely to take what they can get. A downside of using unique names -is that it complicates pre-warming of pods and pod migration - on the other hand, those are also -advanced use cases that might be better solved by another, more specialized controller (a -MigratableStatefulSet). - - -### Managing lifecycle of members - -The most difficult aspect of managing a member set is ensuring that all members see a consistent configuration -state of the set. 
Without a strongly consistent view of cluster state, most clustered software is -vulnerable to split brain. For example, a new set is created with 3 members. If the node containing the -first member is partitioned from the cluster, it may not observe the other two members, and thus create its -own cluster of size 1. The other two members do see the first member, so they form a cluster of size 3. -Both clusters appear to have quorum, which can lead to data loss if not detected. - -StatefulSets should provide basic mechanisms that enable a consistent view of cluster state to be possible, -and in the future provide more tools to reduce the amount of work necessary to monitor and update that -state. - -The first mechanism is that the StatefulSet controller blocks creation of new pods until all previous pods -are reporting a healthy status. The StatefulSet controller uses the strong serializability of the underlying -etcd storage to ensure that it acts on a consistent view of the cluster membership (the pods and their) -status, and serializes the creation of pods based on the health state of other pods. This simplifies -reasoning about how to initialize a StatefulSet, but is not sufficient to guarantee split brain does not -occur. - -The second mechanism is having each "member" use the state of the cluster and transform that into cluster -configuration or decisions about membership. This is currently implemented using a side car container -that watches the master (via DNS today, although in the future this may be to endpoints directly) to -receive an ordered history of events, and then applying those safely to the configuration. Note that -for this to be safe, the history received must be strongly consistent (must be the same order of -events from all observers) and the config change must be bounded (an old config version may not -be allowed to exist forever). For now, this is known as a 'babysitter' (working name) and is intended -to help identify abstractions that can be provided by the StatefulSet controller in the future. - - -## Future Evolution - -Criteria for advancing to beta: - -* StatefulSets do not accidentally lose data due to cluster design - the pod safety proposal will - help ensure StatefulSets can guarantee **at most one** instance of a pod identity is running at - any time. -* A design consensus is reached on StatefulSet upgrades. - -Criteria for advancing to GA: - -* StatefulSets solve 80% of clustered software configuration with minimal input from users and are safe from common split brain problems - * Several representative examples of StatefulSets from the community have been proven/tested to be "correct" for a variety of partition problems (possibly via Jepsen or similar) - * Sufficient testing and soak time has been in place (like for Deployments) to ensure the necessary features are in place. -* StatefulSets are considered easy to use for deploying clustered software for common cases - -Requested features: - -* IPs per member for clustered software like Cassandra that cache resolved DNS addresses that can be used outside the cluster - * Individual services can potentially be used to solve this in some cases. -* Send more / simpler events to each pod from a central spot via the "signal API" -* Persistent local volumes that can leverage local storage -* Allow pods within the StatefulSet to identify "leader" in a way that can direct requests from a service to a particular member. -* Provide upgrades of a StatefulSet in a controllable way (like Deployments). 
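The ordered-creation mechanism described earlier in this section (create member i only after members 0..i-1 are healthy) can be sketched as follows; the helper and its boolean inputs are simplifications for illustration, not controller code:

```go
package main

import "fmt"

// nextMemberToCreate returns the ordinal of the next member the controller may
// create, or -1 if it should wait (an earlier member is unhealthy) or if all
// replicas already exist. "healthy" stands in for Running with a Ready condition.
func nextMemberToCreate(healthy []bool, replicas int) int {
	for i := 0; i < replicas; i++ {
		if i >= len(healthy) {
			return i // member i has not been created yet
		}
		if !healthy[i] {
			return -1 // block: an earlier member is not healthy yet
		}
	}
	return -1 // all members exist and are healthy
}

func main() {
	fmt.Println(nextMemberToCreate([]bool{true, true}, 3))  // 2: create the third member
	fmt.Println(nextMemberToCreate([]bool{true, false}, 3)) // -1: wait for member 1
}
```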
- - -## Overlap with other proposals - -* Jobs can be used to perform a run-once initialization of the cluster -* Init containers can be used to prime PVs and config with the identity of the pod. -* Templates and how fields are overridden in the resulting object should have broad alignment -* DaemonSet defines the core model for how new controllers sit alongside replication controller and - how upgrades can be implemented outside of Deployment objects. - - -## History - -StatefulSets were formerly known as PetSets and were renamed to be less "cutesy" and more descriptive as a -prerequisite to moving to beta. No animals were harmed in the making of this proposal. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/apps/statefulset-update.md b/contributors/design-proposals/apps/statefulset-update.md index 06fd291e..f0fbec72 100644 --- a/contributors/design-proposals/apps/statefulset-update.md +++ b/contributors/design-proposals/apps/statefulset-update.md @@ -1,828 +1,6 @@
# StatefulSet Updates
+Design proposals have been archived.

**Author**: kow3ns@
+To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/).

**Status**: Proposal

## Abstract
Currently (as of Kubernetes 1.6), `.Spec.Replicas` and `.Spec.Template.Containers` are the only mutable fields of the StatefulSet API object. Updating `.Spec.Replicas` will scale the number of Pods in the StatefulSet. Updating `.Spec.Template.Containers` causes all subsequently created Pods to have the specified containers. In order to cause the StatefulSet controller to apply its updated `.Spec`, users must manually delete each Pod. This manual method of applying updates is error prone. The implementation of this proposal will add the capability to perform ordered, automated, sequential updates.

## Affected Components
1. API Server
1. Kubectl
1. StatefulSet Controller
1. StatefulSetSpec API object
1. StatefulSetStatus API object

## Use Cases
Upon implementation, this design will support the following in-scope use cases, and it will not rule out the future implementation of the out-of-scope use cases.

### In Scope
- As the administrator of a stateful application, in order to vertically scale my application, I want to update resource limits or requested resources.
- As the administrator of a stateful application, in order to deploy critical security updates, break fix patches, and feature releases, I want to update container images.
- As the administrator of a stateful application, in order to update my application's configuration, I want to update environment variables, container entry point commands or parameters, or configuration files.
- As the administrator of the logging and monitoring infrastructure for my organization, in order to add logging and monitoring side cars, I want to patch a Pod's containers to add images.

### Out of Scope
- As the administrator of a stateful application, in order to increase the application's storage capacity, I want to update PersistentVolumes.
- As the administrator of a stateful application, in order to update the network configuration of the application, I want to update Services and container ports in a consistent way.
- As the administrator of a stateful application, when I scale my application horizontally, I want associated PodDisruptionBudgets to be adjusted to compensate for the application's scaling.

## Assumptions
 - StatefulSet update must support singleton StatefulSets. However, an update in this case will cause a temporary outage. This is acceptable as a single process application is, by definition, not highly available.
 - Disruption in Kubernetes is controlled by PodDisruptionBudgets. As StatefulSet updates progress one Pod at a time, and only occur when all other Pods have a Status of Running and a Ready Condition, they cannot violate reasonable PodDisruptionBudgets.
 - Without priority and preemption, there is no guarantee that an update will not block due to a loss of capacity or due to the scheduling of another Pod between Pod termination and Pod creation.
This is mitigated by blocking the - update when a Pod fails to schedule. Remediation will require operator - intervention. This implementation is no worse than the current behavior with - respect to eviction. - - We will eventually implement a signal that is delivered to Pods to indicate - the - [reason for termination](https://github.com/kubernetes/community/pull/541). - - StatefulSet updates will use the methodology outlined in the - [controller history](https://github.com/kubernetes/community/pull/594) proposal - for version tracking, update detection, and rollback detection. - This will be a general implementation, usable for any Pod in a Kubernetes - cluster. It is, therefore, out of scope to design such a mechanism here. - - Kubelet does not support resizing a container's resources without terminating - the Pod. In place resource reallocation is out of scope for this design. - Vertical scaling must be performed destructively. - - The primary means of configuration update will be configuration files, - command line flags, environment variables, or ConfigMaps consumed as the one - of the former. - - In place configuration update via SIGHUP is not universally - supported, and Kubelet provides no mechanism to perform this currently. Pod - reconfiguration will be performed destructively. - - Stateful applications are likely to evolve wire protocols and storage formats - between versions. In most cases, when updating the application's Pod's - containers, it will not be safe to roll back or forward to an arbitrary - version. Controller based Pod update should work well when rolling out an - update, or performing a rollback, between two specific revisions of the - controlled API object. This is how Deployment functions, and this property is, - perhaps, even more critical for stateful applications. - -## Requirements -This design is based on the following requirements. -- Users must be able to update the containers of a StatefulSet's Pods. - - Updates to container commands, images, resources and configuration must be - supported. -- The update must progress in a sequential, deterministic order and respect the - StatefulSet - [identity](https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets/#pod-identity), - [deployment, and scaling](https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets/#deployment-and-scaling-guarantee) - guarantees. -- A failed update must halt. -- Users must be able to roll back an update. -- Users must be able to roll forward to fix a failing/failed update. -- Users must be able to view the status of an update. -- Users should be able to view a bounded history of the updates that have been -applied to the StatefulSet. - -## API Objects - -The following modifications will be made to the StatefulSetSpec API object. - -```go -// StatefulSetUpdateStrategy indicates the strategy that the StatefulSet -// controller will use to perform updates. It includes any additional parameters -// necessary to preform the update for the indicated strategy. -type StatefulSetUpdateStrategy struct { - // Type indicates the type of the StatefulSetUpdateStrategy. - Type StatefulSetUpdateStrategyType - // Partition is used to communicate the ordinal at which to partition - // the StatefulSet when Type is PartitionStatefulSetStrategyType. This - // value must be set when Type is PartitionStatefulSetStrategyType, - // and it must be nil otherwise. 
    Partition *PartitionStatefulSetStrategy
}

// StatefulSetUpdateStrategyType is a string enumeration type that enumerates
// all possible update strategies for the StatefulSet controller.
type StatefulSetUpdateStrategyType string

const (
    // PartitionStatefulSetStrategyType indicates that updates will only be
    // applied to a partition of the StatefulSet. This is useful for canaries
    // and phased roll outs. When a scale operation is performed with this
    // strategy, new Pods will be created from the updated specification.
    PartitionStatefulSetStrategyType StatefulSetUpdateStrategyType = "Partition"
    // RollingUpdateStatefulSetStrategyType indicates that the update will be
    // applied to all Pods in the StatefulSet with respect to the StatefulSet
    // ordering constraints. When a scale operation is performed with this
    // strategy, new Pods will be created from the updated specification.
    RollingUpdateStatefulSetStrategyType = "RollingUpdate"
    // OnDeleteStatefulSetStrategyType triggers the legacy behavior. Version
    // tracking and ordered rolling restarts are disabled. Pods are recreated
    // from the StatefulSetSpec when they are manually deleted. When a scale
    // operation is performed with this strategy, new Pods will be created
    // from the current specification.
    OnDeleteStatefulSetStrategyType = "OnDelete"
)

// PartitionStatefulSetStrategy contains the parameters used with the
// PartitionStatefulSetStrategyType.
type PartitionStatefulSetStrategy struct {
    // Ordinal indicates the ordinal at which the StatefulSet should be
    // partitioned.
    Ordinal int32
}

type StatefulSetSpec struct {
    // Replicas, Selector, Template, VolumeClaimsTemplate, and ServiceName
    // omitted for brevity.

    // UpdateStrategy indicates the StatefulSetUpdateStrategy that will be
    // employed to update Pods in the StatefulSet when a revision is made to
    // Template or VolumeClaimsTemplate.
    UpdateStrategy StatefulSetUpdateStrategy `json:"updateStrategy,omitempty"`

    // RevisionHistoryLimit is the maximum number of revisions that will
    // be maintained in the StatefulSet's revision history. The revision history
    // consists of all revisions not represented by a currently applied
    // StatefulSetSpec version. The default value is 2.
    RevisionHistoryLimit *int32 `json:"revisionHistoryLimit,omitempty"`
}
```

The following modifications will be made to the StatefulSetStatus API object.

```go
type StatefulSetStatus struct {
    // ObservedGeneration and Replicas fields are omitted for brevity.

    // CurrentRevision, if not empty, indicates the version of the PodSpecTemplate,
    // VolumeClaimsTemplate tuple used to generate Pods in the sequence
    // [0,CurrentReplicas).
    CurrentRevision string `json:"currentRevision,omitempty"`

    // UpdateRevision, if not empty, indicates the version of the PodSpecTemplate,
    // VolumeClaimsTemplate tuple used to generate Pods in the sequence
    // [Replicas-UpdatedReplicas,Replicas).
    UpdateRevision string `json:"updateRevision,omitempty"`

    // ReadyReplicas is the current number of Pods, created by the StatefulSet
    // controller, that have a Status of Running and a Ready Condition.
    ReadyReplicas int32 `json:"readyReplicas,omitempty"`

    // CurrentReplicas is the number of Pods created by the StatefulSet
    // controller from the PodTemplateSpec, VolumeClaimsTemplate tuple indicated
    // by CurrentRevision.
- CurrentReplicas int32 `json:"currentReplicas,omitempty"` - - // UpdatedReplicas is the number of Pods created by the StatefulSet - // controller from the PodTemplateSpec, VolumeClaimsTemplate tuple indicated - // by UpdateRevision. - UpdatedReplicas int32 `json:"updatedReplicas,omitempty"` -} -``` - -Additionally we introduce the following constant. - -```go -// StatefulSetRevisionLabel is the label used by StatefulSet controller to track -// which version of StatefulSet's StatefulSetSpec was used generate a Pod. -const StatefulSetRevisionLabel = "statefulset.kubernetes.io/revision" - -``` -## StatefulSet Controller -The StatefulSet controller will watch for modifications to StatefulSet and Pod -API objects. When a StatefulSet is created or updated, or when one -of the Pods in a StatefulSet is updated or deleted, the StatefulSet -controller will attempt to create, update, or delete Pods to conform the -current state of the system to the user declared [target state](#target-state). - -### Revised Controller Algorithm -The StatefulSet controller will use the following algorithm to continue to -make progress toward the user declared [target state](#target-state) while -respecting the controller's -[identity](https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets/#pod-identity), -[deployment, and scaling](https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets/#deployment-and-scaling-guarantee) -guarantees. The StatefulSet controller will use the technique proposed in -[Controller History](https://github.com/kubernetes/community/pull/594) to -snapshot and version its [target Object state](#target-pod-state). - -1. The controller will reconstruct the -[revision history](#history-reconstruction) of the StatefulSet. -1. The controller will -[process any updates to its StatefulSetSpec](#specification-updates) to -ensure that the StatefulSet's revision history is consistent with the user -declared desired state. -1. The controller will select all Pods in the StatefulSet, filter any Pods not -owned by the StatefulSet, and sort the remaining Pods in ordinal order. -1. For all created Pods, the controller will perform any necessary -[non-destructive state reconciliation](#pod-state-reconciliation). -1. If any Pods with ordinals in the sequence `[0,.Spec.Replicas)` have not been -created, for the Pod corresponding to the lowest such ordinal, the controller -will create the Pod with declared [target Pod state](#target-pod-state). -1. If all Pods in the sequence `[0,.Spec.Replicas)` have been created, but if any -do not have a Ready Condition, the StatefulSet controller will wait for these -Pods to either become Ready, or to be completely deleted. -1. If all Pods in the sequence `[0,.Spec.Replicas)` have a Ready Condition, and -if `.Spec.Replicas` is less than `.Status.Replicas`, the controller will delete -the Pod corresponding to the largest ordinal. This implies that scaling takes -precedence over Pod updates. -1. If all Pods in the sequence `[0,.Spec.Replicas)` have a Status of Running and -a Ready Condition, if `.Spec.Replicas` is equal to `.Status.Replicas`, and if -there are Pods that do not match their [target Pod state](#target-pod-state), -the Pod with the largest ordinal in that set will be deleted. -1. If the StatefulSet controller has achieved the -[declared target state](#target-state) the StatefulSet controller will -[complete any in progress updates](#update-completion). -1. The controller will [report its status](#status-reporting). -1. 
The controller will perform any necessary -[maintenance of its revision history](#history-maintenance). - -### Target State -The target state of the StatefulSet controller with respect to an individual -StatefulSet is defined as follows. - -1. The StatefulSet contains exactly `[0,.Spec.Replicas)` Pods. -1. All Pods in the StatefulSet have the correct -[target Pod state](#target-pod-state). - -### Target Pod State -As in the [Controller History](https://github.com/kubernetes/community/pull/594) -proposal we define the target Object state of StatefulSetSpec specification type -object to be the `.Template` and `.VolumeClaimsTemplate`. The latter is currently -immutable, but we will version it as one day this constraint may be lifted. This -state provides enough information to generate a Pod and its associated -PersistentVolumeClaims. The target Pod State for a Pod in a StatefulSet is as -follows. -1. The Pods PersistentVolumeClaims have been created. - - Note that we do not currently delete PersistentVolumeClaims. -1. If the Pod's ordinal is in the sequence `[0,.Spec.Replicas)` the Pod should -have a Ready Condition. This implies the Pod is Running. -1. If Pod's ordinal is greater than or equal to `.Spec.Replicas`, the Pod -should be completely terminated and deleted. -1. If the StatefulSet's `Spec.UpdateStrategy.Type` is equal to -`OnDeleteStatefulSetStrategyType`, no version tracking is performed, Pods -can be at an arbitrary version, and they will be recreated from the current -`.Spec.Template` and `.Spec.VolumeClaimsTemplate` when the are deleted. -1. If StatefulSet's `Spec.UpdateStrategy.Type` is equal to -`RollingUpdateStatefulSetStrategyType` then the version of the Pod should be -as follows. - 1. If the Pod's ordinal is in the sequence `[0,.Status.CurrentReplicas)`, - the Pod should be consistent with version indicated by `Status.CurrentRevision`. - 1. If the Pod's ordinal is in the sequence - `[.Status.Replicas - .Status.UpdatedReplicas, .Status.Replicas)` - the Pod should be consistent with the version indicated by - `Status.UpdateRevision`. -1. If the StatefulSet's `.Spec.UpdateStrategy.Type` is equal to -`PartitionStatefulSetStrategyType` then the version of the Pod should be -as follows. - 1. If the Pod's ordinal is in the sequence `[0,.Status.CurrentReplicas)`, - the Pod should be consistent with version indicated by `Status.CurrentRevision`. - 1. If the Pod's ordinal is in the sequence - `[.Status.Replicas - .Status.UpdatedReplicas, .Status.Replicas)` the Pod - should be consistent with the version indicated by `Status.UpdateRevision`. - 1. If the Pod does not meet either of the prior two conditions, and if - ordinal is in the sequence `[0, .Spec.UpdateStrategy.Partition.Ordinal)`, - it should be consistent with the version indicated by - `Status.CurrentRevision`. - 1. Otherwise, the Pod should be consistent with the version indicated - by `Status.UpdateRevision`. - -### Pod State Reconciliation -In order to reconcile a Pod with declared desired -[target state](#target-pod-state) the StatefulSet controller will do the -following. - -1. If the Pod is already consistent with its target state the controller will do -nothing. -1. 
If the Pod is labeled with a `StatefulSetRevisionLabel` that indicates -the Pod was generated from a version of the StatefulSetSpec that is semantically -equivalent to, but not equal to, the [target version](#target-pod-state), the -StatefulSet controller will update the Pod with a `StatefulSetRevisionLabel` -indicating the new semantically equivalent version. This form of reconciliation -is non-destructive. -1. If the Pod was not created from the target version, the Pod will be deleted -and recreated from that version. This form of reconciliation is destructive. - -### Specification Updates -The StatefulSet controller will [snapshot](#snapshot-creation) its target -Object state when mutations are made to its `.Spec.Template` or -`.Spec.VolumeClaimsTemplate` (Note that the latter is currently immutable). - -1. When the StatefulSet controller observes a mutation to a StatefulSet's - `.Spec.Template` it will snapshot its target Object state and compare -the snapshot with the version indicated by its `.Status.UpdateRevision`. -1. If the current state is equivalent to the version indicated by -`.Status.UpdateRevision` no update has occurred. -1. If the `Status.CurrentRevision` field is empty, then the StatefulSet has no -revision history. To initialize its revision history, the StatefulSet controller -will set both `.Status.CurrentRevision` and `.Status.UpdateRevision` to the -version of the current snapshot. -1. If the `.Status.CurrentRevision` is not empty, and if the -`.Status.UpdateRevision` is not equal to the version of the current snapshot, -the StatefulSet controller will set the `.Status.UpdateRevision` to the version -indicated by the current snapshot. - -### StatefulSet Revision History -The StatefulSet controller will use the technique proposed in -[Controller History](https://github.com/kubernetes/community/pull/594) to -snapshot and version its target Object state. - -#### Snapshot Creation -In order to snapshot a version of its target Object state, it will -serialize and store the `.Spec.Template` and `.Spec.VolumesClaimsTemplate` -along with the `.Generation` in each snapshot. Each snapshot will be labeled -with the StatefulSet's `.Selector`. - -#### History Reconstruction -As proposed in -[Controller History](https://github.com/kubernetes/community/pull/594), in -order to reconstruct the revision history of a StatefulSet, the StatefulSet -controller will select all snapshots based on its `Spec.Selector` and sort them -by the contained `.Generation`. This will produce an ordered set of -revisions to the StatefulSet's target Object state. - -#### History Maintenance -In order to prevent the revision history of the StatefulSet from exceeding -memory or storage limits, the StatefulSet controller will periodically prune -its revision history so that no more that `.Spec.RevisionHistoryLimit` non-live -versions of target Object state are preserved. - -### Update Completion -The criteria for update completion is as follows. - -1. If the StatefulSet's `.Spec.UpdateStrategy.Type` is equal to -`OnDeleteStatefulSetStrategyType` then no version tracking is performed. In -this case, an update can never be in progress. -1. If the StatefulSet's `.Spec.UpdateStrategy.Type` is equal to -`PartitionStatefulSetStrategyType` updates can not complete. The version -indicated `.Status.UpdateRevision` will only be applied to Pods with ordinals -in the sequence `(.Spec.UpdateStrategy.Partition.Ordinal,.Spec.Replicas)`. -1. 
If the StatefulSet's `.Spec.UpdateStrategy.Type` is equal to -`RollingUpdateStatefulSetStrategyType`, then an update is complete when the -StatefulSet is at its [target state](#target-state). The StatefulSet controller -will signal update completion as follows. - 1. The controller will set `.Status.CurrentRevision` to the value of - `.Status.UpdateRevision`. - 1. The controller will set `.Status.CurrentReplicas` to - `.Status.UpdatedReplicas`. Note that this value will be equal to - `.Status.Replicas`. - 1. The controller will set `.Status.UpdatedReplicas` to 0. - -### Status Reporting -After processing the creation, update, or deletion of a StatefulSet or Pod, -the StatefulSet controller will record its status by persisting a -StatefulSetStatus object. This has two purposes. - -1. It allows the StatefulSet controller to recreate the exact StatefulSet -membership in the event of a hard restart of the entire system. -1. It communicates the current state of the StatefulSet to clients. Using the -`.Status.ObserverGeneration`, clients can construct a linearizable view of -the operations performed by the controller. - -When the StatefulSet controller records the status of a StatefulSet it will -do the following. - -1. The controller will increment the `.Status.ObservedGeneration` to communicate -the `.Generation` of the StatefulSet object that was observed. -1. The controller will set the `.Status.Replicas` to the current number of -created Pods. -1. The controller will set the `.Status.ReadyReplicas` to the current number of -Pods that have a Ready Condition. -1. The controller will set the `.Status.CurrentRevision` and -`.Status.UpdateRevision` in accordance with StatefulSet's -[revision history](#statefulset-revision-history) and -any [complete updates](#update-completion). -1. The controller will set the `.Status.CurrentReplicas` to the number of -Pods that it has created from the version indicated by -`.Status.CurrentRevision`. -1. The controller will set the `.Status.UpdatedReplicas` to the number of Pods -that it has created from the version indicated by `.Status.UpdateRevision`. -1. The controller will then persist the StatefulSetStatus make it durable and -communicate it to observers. - -## API Server -The API Server will perform validation for StatefulSet creation and updates. - -### StatefulSet Validation -As is currently implemented, the API Server will not allow mutation to any -fields of the StatefulSet object other than `.Spec.Replicas` and -`.Spec.Template.Containers`. This design imposes the following, additional -constraints. - -1. If the `.Spec.UpdateStrategy.Type` is equal to -`PartitionStatefulSetStrategyType`, the API Server should fail validation -if any of the following conditions are true. - 1. `.Spec.UpdateStrategy.Partition` is nil. - 1. `.Spec.UpdateStrategy.Partition` is not nil, and - `.Spec.UpdateStrategy.Partition.Ordinal` not in the sequence - `(0,.Spec.Replicas)`. -1. The API Server will fail validation on any update to a StatefulSetStatus -object if any of the following conditions are true. - 1. `.Status.Replicas` is negative. - 1. `.Status.ReadyReplicas` is negative or greater than `.Status.Replicas`. - 1. `.Status.CurrentReplicas` is negative or greater than `.Status.Replicas`. - 1. `.Status.UpdateReplicas` is negative or greater than `.Status.Replicas`. - -## Kubectl -Kubectl will use the `rollout` command to control and provide the status of -StatefulSet updates. 
- - - `kubectl rollout status statefulset <StatefulSet-Name>`: displays the status - of a StatefulSet update. - - `kubectl rollout undo statefulset <StatefulSet-Name>`: triggers a rollback - of the current update. - - `kubectl rollout history statefulset <StatefulSet-Name>`: displays a the - StatefulSets revision history. - -## Usage -This section demonstrates how the design functions in typical usage scenarios. - -### Initial Deployment -Users can create a StatefulSet using `kubectl apply`. - -Given the following manifest `web.yaml` - -```yaml -apiVersion: apps/v1beta1 -kind: StatefulSet -metadata: - name: web -spec: - serviceName: "nginx" - replicas: 3 - template: - metadata: - labels: - app: nginx - spec: - containers: - - name: nginx - image: k8s.gcr.io/nginx-slim:0.8 - ports: - - containerPort: 80 - name: web - volumeMounts: - - name: www - mountPath: /usr/share/nginx/html - volumeClaimTemplates: - - metadata: - name: www - annotations: - volume.alpha.kubernetes.io/storage-class: anything - spec: - accessModes: [ "ReadWriteOnce" ] - resources: - requests: - storage: 1Gi -``` - -Users can use the following command to create the StatefulSet. - -```shell -kubectl apply -f web.yaml -``` - -The only difference between the proposed and current implementation is that -the proposed implementation will initialize the StatefulSet's revision history -upon initial creation. - -### Rolling out an Update -Users can create a rolling update using `kubectl apply`. If a user creates a -StatefulSet [as above](#initial-deployment), the user can trigger a rolling -update by updating the image (as in the manifest as below). - -```yaml -apiVersion: apps/v1beta1 -kind: StatefulSet -metadata: - name: web -spec: - serviceName: "nginx" - replicas: 3 - template: - metadata: - labels: - app: nginx - spec: - updateStrategy: - type: RollingUpdate - containers: - - name: nginx - image: k8s.gcr.io/nginx-slim:0.9 - ports: - - containerPort: 80 - name: web - volumeMounts: - - name: www - mountPath: /usr/share/nginx/html - volumeClaimTemplates: - - metadata: - name: www - annotations: - volume.alpha.kubernetes.io/storage-class: anything - spec: - accessModes: [ "ReadWriteOnce" ] - resources: - requests: - storage: 1Gi -``` - - -Users can use the following command to trigger a rolling update. - -```shell -kubectl apply -f web.yaml -``` - -### Canaries -Users can create a canary using `kubectl apply`. The only difference between a - [rolling update](#rolling-out-an-update) and a canary is that the - `.Spec.UpdateStrategy.Type` is set to `PartitionStatefulSetStrategyType` and - the `.Spec.UpdateStrategy.Partition.Ordinal` is set to `.Spec.Replicas-1`. - - -```yaml -apiVersion: apps/v1beta1 -kind: StatefulSet -metadata: - name: web -spec: - serviceName: "nginx" - replicas: 3 - template: - metadata: - labels: - app: nginx - spec: - updateStrategy: - type: Partition - partition: - ordinal: 2 - containers: - - name: nginx - image: k8s.gcr.io/nginx-slim:0.9 - ports: - - containerPort: 80 - name: web - volumeMounts: - - name: www - mountPath: /usr/share/nginx/html - - volumeClaimTemplates: - - metadata: - name: www - annotations: - volume.alpha.kubernetes.io/storage-class: anything - spec: - accessModes: [ "ReadWriteOnce" ] - resources: - requests: - storage: 1Gi -``` - -Users can also simultaneously scale up and add a canary. This reduces risk -for some deployment scenarios by adding additional capacity for the canary. 
-For example, in the manifest below, `.Spec.Replicas` is increased to `4` while -`.Spec.UpdateStrategy.Partition.Ordinal` is set to `.Spec.Replicas-1`. - -```yaml -apiVersion: apps/v1beta1 -kind: StatefulSet -metadata: - name: web -spec: - serviceName: "nginx" - replicas: 4 - template: - metadata: - labels: - app: nginx - spec: - updateStrategy: - type: Partition - partition: - ordinal: 3 - containers: - - name: nginx - image: k8s.gcr.io/nginx-slim:0.9 - ports: - - containerPort: 80 - name: web - volumeMounts: - - name: www - mountPath: /usr/share/nginx/html - volumeClaimTemplates: - - metadata: - name: www - annotations: - volume.alpha.kubernetes.io/storage-class: anything - spec: - accessModes: [ "ReadWriteOnce" ] - resources: - requests: - storage: 1Gi -``` - -### Phased Roll Outs -Users can create a canary using `kubectl apply`. The only difference between a - [canary](#canaries) and a phased roll out is that the - `.Spec.UpdateStrategy.Partition.Ordinal` is set to a value less than - `.Spec.Replicas-1`. - -```yaml -apiVersion: apps/v1beta1 -kind: StatefulSet -metadata: - name: web -spec: - serviceName: "nginx" - replicas: 4 - template: - metadata: - labels: - app: nginx - spec: - updateStrategy: - type: Partition - partition: - ordinal: 2 - containers: - - name: nginx - image: k8s.gcr.io/nginx-slim:0.9 - ports: - - containerPort: 80 - name: web - volumeMounts: - - name: www - mountPath: /usr/share/nginx/html - volumeClaimTemplates: - - metadata: - name: www - annotations: - volume.alpha.kubernetes.io/storage-class: anything - spec: - accessModes: [ "ReadWriteOnce" ] - resources: - requests: - storage: 1Gi -``` - -Phased roll outs can be used to roll out a configuration, image, or resource -update to some portion of the fleet maintained by the StatefulSet prior to -updating the entire fleet. It is useful to support linear, geometric, and -exponential roll out of an update. Users can modify the -`.Spec.UpdateStrategy.Partition.Ordinal` to allow the roll out to progress. - -```yaml -apiVersion: apps/v1beta1 -kind: StatefulSet -metadata: - name: web -spec: - serviceName: "nginx" - replicas: 3 - template: - metadata: - labels: - app: nginx - spec: - updateStrategy: - type: Partition - partition: - ordinal: 1 - containers: - - name: nginx - image: k8s.gcr.io/nginx-slim:0.9 - ports: - - containerPort: 80 - name: web - volumeMounts: - - name: www - mountPath: /usr/share/nginx/html - volumeClaimTemplates: - - metadata: - name: www - annotations: - volume.alpha.kubernetes.io/storage-class: anything - spec: - accessModes: [ "ReadWriteOnce" ] - resources: - requests: - storage: 1Gi -``` - -### Rollbacks -To rollback an update, users can use the `kubectl rollout` command. - -The command below will roll back the `web` StatefulSet to the previous revision in -its history. If a roll out is in progress, it will stop deploying the target -revision, and roll back to the current revision. - -```shell -kubectl rollout undo statefulset web -``` - -### Rolling Forward -Rolling back is usually the safest, and often the fastest, strategy to mitigate -deployment failure, but rolling forward is sometimes the only practical solution -for stateful applications (e.g. A user has a minor configuration error but has -already modified the storage format for the application). Users can use -sequential `kubectl apply`'s to update the StatefulSet's current -[target state](#target-state). The StatefulSet's `.Spec.GenerationPartition` -will be respected, and it therefore interacts well with canaries and phased roll - outs. 
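As a sketch of the roll-forward flow (the manifest names `web-v2-broken.yaml` and
`web-v3-fixed.yaml` are hypothetical), a stalled update is repaired by applying a
corrected revision on top of it rather than undoing it, using only the `kubectl apply`
and `kubectl rollout status` commands described above.

```shell
# The update from web-v2-broken.yaml stalls on the first updated Pod.
kubectl apply -f web-v2-broken.yaml
kubectl rollout status statefulset web   # reports the stalled roll out

# Roll forward: apply a corrected revision instead of rolling back.
kubectl apply -f web-v3-fixed.yaml
kubectl rollout status statefulset web   # completes once all Pods are updated
```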
- -## Tests -- Updating a StatefulSet's containers will trigger updates to the StatefulSet's -Pods respecting the -[identity](https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets/#pod-identity) -and [deployment, and scaling](https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets/#deployment-and-scaling-guarantee) -guarantees. -- A StatefulSet update will block on failure. -- A StatefulSet update can be rolled back. -- A StatefulSet update can be rolled forward by applying another update. -- A StatefulSet update's status can be retrieved. -- A StatefulSet's revision history contains all updates with respect to the -configured revision history limit. -- A StatefulSet update can create a canary. -- A StatefulSet update can be performed in stages. - -## Future Work -In the future, we may implement the following features to enhance StatefulSet -updates. - -### Termination Reason -Without communicating a signal indicating the reason for termination to a Pod in -a StatefulSet, as proposed [here](https://github.com/kubernetes/community/pull/541), -the tenant application has no way to determine if it is being terminated due to -a scale down operation or due to an update. - -Consider a BASE distributed storage application like Cassandra, where 2 TiB of -persistent data is not atypical, and the data distribution is not identical on -every server. We want to enable two distinct behaviors based on the reason for -termination. - -- If the termination is due to scale down, during the configured termination -grace period, the entry point of the Pod should cause the application to drain -its client connections, replicate its persisted data (so that the cluster is not -left under replicated) and decommission the application to remove it from the -cluster. -- If the termination is due to a temporary capacity loss (e.g. an update or an -image upgrade), the application should drain all of its client connections, -flush any in memory data structures to the file system, and synchronize the -file system with storage media. It should not redistribute its data. - -If the application implements the strategy of always redistributing its data, -we unnecessarily decrease recovery time during an update and incur the -additional network and storage cost of two full data redistributions for every -updated node. -It should be noted that this is already an issue for Node cordon and Pod eviction -(due to drain or taints), and applications can use the same mitigation as they -would for these events for StatefulSet update. - -### VolumeTemplatesSpec Updates -While this proposal does not address -[VolumeTemplateSpec updates](https://github.com/kubernetes/kubernetes/issues/41015), -this would be a valuable feature for production users of storage systems that use -intermittent compaction as a form of garbage collection. Applications that use -log structured merge trees with size tiered compaction (e.g Cassandra) or append -only B(+/*) Trees (e.g Couchbase) can temporarily double their storage requirement -during compaction. If there is insufficient space for compaction -to progress, these applications will either fail or degrade until -additional capacity is added. While, if the user is using AWS EBS or GCE PD, -there are valid manual workarounds to expand the size of a PD, it would be -useful to automate the resize via updates to the StatefulSet's -VolumeClaimsTemplate. 
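Purely as an illustration of the kind of update this future work would enable (it is
not supported by this proposal, and the field is immutable today), the storage request
in the `www` volumeClaimTemplates of the earlier `web` manifests might one day be raised
in place:

```yaml
# Hypothetical: volumeClaimTemplates updates are not part of this proposal.
volumeClaimTemplates:
- metadata:
    name: www
    annotations:
      volume.alpha.kubernetes.io/storage-class: anything
  spec:
    accessModes: [ "ReadWriteOnce" ]
    resources:
      requests:
        storage: 2Gi   # previously 1Gi; compaction can temporarily double usage
```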
- -### In Place Updates -Currently configuration, images, and resource request/limits updates are all -performed destructively. Without a [termination reason](https://github.com/kubernetes/community/pull/541) -implementation, there is little value to implementing in place image updates, -and configuration and resource request/limit updates are not possible. -When [termination reason](#https://github.com/kubernetes/kubernetes/issues/1462) -is implemented we may modify the behavior of StatefulSet update to only update, -rather than delete and create, Pods when the only mutated value is the container - image, and if resizable resource request/limits is implemented, we may extend - the above to allow for updates to Pod resources. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/architecture/OWNERS b/contributors/design-proposals/architecture/OWNERS deleted file mode 100644 index 3baa861d..00000000 --- a/contributors/design-proposals/architecture/OWNERS +++ /dev/null @@ -1,10 +0,0 @@ -# See the OWNERS docs at https://go.k8s.io/owners - -reviewers: - - sig-architecture-leads - - jbeda -approvers: - - sig-architecture-leads - - jbeda -labels: - - sig/architecture diff --git a/contributors/design-proposals/architecture/arch-roadmap-1.png b/contributors/design-proposals/architecture/arch-roadmap-1.png Binary files differdeleted file mode 100644 index 660d8206..00000000 --- a/contributors/design-proposals/architecture/arch-roadmap-1.png +++ /dev/null diff --git a/contributors/design-proposals/architecture/architectural-roadmap.md b/contributors/design-proposals/architecture/architectural-roadmap.md index 04a9002a..f0fbec72 100644 --- a/contributors/design-proposals/architecture/architectural-roadmap.md +++ b/contributors/design-proposals/architecture/architectural-roadmap.md @@ -1,1132 +1,6 @@ -# Kubernetes Architectural Roadmap +Design proposals have been archived. -**Shared with the community** +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Status: First draft - -Last update: 4/20/2017 - -Authors: Brian Grant, Tim Hockin, and Clayton Coleman - -Intended audience: Kubernetes contributors - -* * * - -<!-- markdown-toc start - Don't edit this section. Run M-x markdown-toc-refresh-toc --> -**Table of Contents** - -- [Kubernetes Architectural Roadmap](#kubernetes-architectural-roadmap) - - [Summary/TL;DR](#summarytldr) - - [Background](#background) - - [System Layers](#system-layers) - - [The Nucleus: API and Execution](#the-nucleus-api-and-execution) - - [The API and cluster control plane](#the-api-and-cluster-control-plane) - - [Execution](#execution) - - [The Application Layer: Deployment and Routing](#the-application-layer-deployment-and-routing) - - [The Governance Layer: Automation and Policy Enforcement](#the-governance-layer-automation-and-policy-enforcement) - - [The Interface Layer: Libraries and Tools](#the-interface-layer-libraries-and-tools) - - [The Ecosystem](#the-ecosystem) - - [Managing the matrix](#managing-the-matrix) - - [Layering of the system as it relates to security](#layering-of-the-system-as-it-relates-to-security) - - [Next Steps](#next-steps) - -<!-- markdown-toc end --> - - -## Summary/TL;DR - -This document describes the ongoing architectural development of the Kubernetes system, and the -motivations behind it. System developers wanting to extend or customize -Kubernetes should use this document as a guide to inform where and how best to implement these -enhancements. Application developers wanting to develop large, portable and/or future-proof -Kubernetes applications may refer to this document for guidance on which parts of Kubernetes they -can rely on being present now and in the future. - -The layers of the architecture are named and described (see the diagram below). Distinctions are -drawn between what exists today and what we plan to provide in future, and why. - -Succinctly, the layers comprise: - -1. **_The Nucleus_** which provides standardized API and execution machinery, including basic REST - mechanics, security, individual Pod, container, network interface and storage volume management, - all of which are extensible via well-defined interfaces. 
The Nucleus is non-optional and - expected to be the most stable part of the system. - -2. **_The Application Management Layer_** which provides basic deployment and routing, including - self-healing, scaling, service discovery, load balancing and traffic routing. This is - often referred to as orchestration and the service fabric. Default implementations of all - functions are provided, but conformant replacements are permitted. - -3. **_The Governance Layer_** which provides higher level automation and policy enforcement, - including single- and multi-tenancy, metrics, intelligent autoscaling and provisioning, and - schemes for authorization, quota, network, and storage policy expression and enforcement. These - are optional, and achievable via other solutions. - -4. **_The Interface Layer_** which provides commonly used libraries, tools, UI's and systems used to - interact with the Kubernetes API. - -5. **_The Ecosystem_** which includes everything else associated with Kubernetes, and is not really - "part of" Kubernetes at all. This is where most of the development happens, and includes CI/CD, - middleware, logging, monitoring, data processing, PaaS, serverless/FaaS systems, workflow, - container runtimes, image registries, node and cloud provider management, and many others. - - -## Background - -Kubernetes is a platform for deploying and managing containers. For more information about the -mission, scope, and design of Kubernetes, see [What Is -Kubernetes](http://kubernetes.io/docs/whatisk8s/) and the [architectural -overview](/contributors/design-proposals/architecture/architecture.md). The -latter also describes the current breakdown of the system into components/processes. - -Contributors to Kubernetes need to know what functionality they can -rely upon when adding new features to different parts of the system. - -Additionally, one of the problems that faces platforms like Kubernetes -is to define what is "in" and what is “out”. While Kubernetes must -offer some base functionality on which users can rely when running -their containerized applications or building their extensions, -Kubernetes cannot and should not try to solve every problem that users -have. Adding to the difficulty is that, unlike some other types of -infrastructure services such as databases, load balancers, or -messaging systems, there are few obvious, natural boundaries for the -“built-in” functionality. Consequently, reasonable minds can disagree -on exactly where the boundaries lie or what principles guide the -decisions. - -This document, which was inspired by [similar efforts from the -community](https://docs.google.com/document/d/1J6yCsPtggsSx_yfqNenb3xxBK22k43c5XZkVQmS38Mk/edit), -aims to clarify the intentions of the Kubernetes’s architecture -SIG. It is currently somewhat aspirational, and is intended to be a -blueprint for ongoing and future development. NIY marks items not yet -implemented as of the lated updated date at the head of this document. - -[Presentation version](https://docs.google.com/presentation/d/1oPZ4rznkBe86O4rPwD2CWgqgMuaSXguIBHIE7Y0TKVc/edit#slide=id.p) - -## System Layers - -Just as Linux has a kernel, core system libraries, and optional -userland tools, Kubernetes also has "layers" of functionality and -tools. An understanding of these layers is important for developers of -Kubernetes functionality to determine which cross-concept dependencies -should be allowed and which should not. - -Kubernetes APIs, concepts, and functionality can be sorted into the -following layers. 
- - - -### The Nucleus: API and Execution - -Essential API and execution machinery. - -These APIs and functions, implemented by the upstream Kubernetes -codebase, comprise the bare minimum set of features and concepts -needed to build up the higher-order layers of the system. These -pieces are thoroughly specified and documented, and every -containerized application will use them. Developers can safely assume -they are present. - -They should eventually become stable and "boring". However, Linux has -continuously evolved over its 25-year lifetime and major changes, -including NPTL (which required rebuilding all applications) and -features that enable containers (which are changing how all -applications are run), have been added over the past 10 years. It will -take some time for Kubernetes to stabilize, as well. - -#### The API and cluster control plane - -Kubernetes clusters provide a collection of similar REST APIs, exposed -by the Kubernetes [API -server](https://kubernetes.io/docs/admin/kube-apiserver/), supporting -primarily CRUD operations on (mostly) persistent resources. These APIs -serve as the hub of its control plane. - -REST APIs that follow Kubernetes API conventions (path conventions, -standard metadata, …) are automatically able to benefit from shared -API services (authorization, authentication, audit logging) and -generic client code can interact with them (CLI and UI -discoverability, generic status reporting in UIs and CLIs, generic -waiting conventions for orchestration tools, watching, label -selection). - -The lowest layer of the system also needs to support extension -mechanisms necessary to add the functionality provided by the higher -layers. Additionally, this layer must be suitable for use both in -single-purpose clusters and highly tenanted clusters. The nucleus -should provide sufficient flexibility that higher-level APIs could -introduce new scopes (sets of resources) without compromising the -security model of the cluster. - -Kubernetes cannot function without this basic API machinery and semantics, including: - -* [Authentication](https://kubernetes.io/docs/admin/authentication/): - The authentication scheme is a critical function that must be agreed - upon by both the server and clients. The API server supports basic - auth (username/password) (NOTE: We will likely deprecate basic auth - eventually.), X.509 client certificates, OpenID Connect tokens, and - bearer tokens, any of which (but not all) may be disabled. Clients - should support all forms supported by - [kubeconfig](https://kubernetes.io/docs/user-guide/kubeconfig-file/). Third-party - authentication systems may implement the TokenReview API and - configure the authentication webhook to call it, though choice of a - non-standard authentication mechanism may limit the number of usable - clients. - - * The TokenReview API (same schema as the hook) enables external - authentication checks, such as by Kubelet - - * Pod identity is provided by "[service accounts](https://kubernetes.io/docs/user-guide/service-accounts/)" - - * The ServiceAccount API, including default ServiceAccount - secret creation via a controller and injection via an - admission controller. - -* [Authorization](https://kubernetes.io/docs/admin/authorization/): - Third-party authorization systems may implement the - SubjectAccessReview API and configure the authorization webhook to - call it. 
- - * The SubjectAccessReview (same schema as the hook), - LocalSubjectAccessReview, and SelfSubjectAccessReview APIs - enable external permission checks, such as by Kubelet and other - controllers - -* REST semantics, watch, durability and consistency guarantees, API - versioning, defaulting, and validation - - * NIY: API deficiencies that need to be addressed: - - * [Confusing defaulting - behavior](https://github.com/kubernetes/kubernetes/issues/34292) - * [Lack of - guarantees](https://github.com/kubernetes/kubernetes/issues/30698) - * [Orchestration - support](https://github.com/kubernetes/kubernetes/issues/34363) - * [Support for event-driven - automation](https://github.com/kubernetes/kubernetes/issues/3692) - * [Clean - teardown](https://github.com/kubernetes/kubernetes/issues/4630) - -* NIY: Built-in admission-control semantics, [synchronous - admission-control hooks, and asynchronous resource - initialization](https://github.com/kubernetes/community/pull/132) -- - it needs to be possible for distribution vendors, system - integrators, and cluster administrators to impose additional - policies and automation - -* NIY: API registration and discovery, including API aggregation, to - register additional APIs, to find out which APIs are supported, and - to get the details of supported operations, payloads, and result - schemas - -* NIY: ThirdPartyResource and ThirdPartyResourceData APIs (or their - successors), to support third-party storage and extension APIs - -* NIY: An extensible and HA-compatible replacement for the - /componentstatuses API to determine whether the cluster is fully - turned up and operating correctly: ExternalServiceProvider - (component registration) - -* The Endpoints API (and future evolutions thereof), which is needed - for component registration, self-publication of API server - endpoints, and HA rendezvous, as well as application-layer target - discovery - -* The Namespace API, which is the means of scoping user resources, and - namespace lifecycle (e.g., bulk deletion) - -* The Event API, which is the means of reporting significant - occurrences, such as status changes and errors, and Event garbage - collection - -* NIY: Cascading-deletion garbage collector, finalization, and - orphaning - -* NIY: We need a built-in [add-on - manager](https://github.com/kubernetes/kubernetes/issues/23233) (not - unlike [static pod - manifests](https://kubernetes.io/docs/admin/static-pods/), but at - the cluster level) so that we can automatically add self-hosted - components and dynamic configuration to the cluster, and so we can - factor out functionality from existing components in running - clusters. At its core would be a pull-based declarative reconciler, - as provided by the [current add-on - manager](https://git.k8s.io/kubernetes/cluster/addons/addon-manager) - and as described in the [whitebox app management - doc](https://docs.google.com/document/d/1S3l2F40LCwFKg6WG0srR6056IiZJBwDmDvzHWRffTWk/edit#heading=h.gh6cf96u8mlr). This - would be easier once we have [apply support in the - API](https://github.com/kubernetes/kubernetes/issues/17333). - - * Add-ons should be cluster services that are managed as part of - the cluster and that provide the same degree of - [multi-tenancy](https://docs.google.com/document/d/148Lbe1w1xmUjMx7cIMWTVQmSjJ8qA77HIGCrdB-ugoc/edit) - as that provided by the cluster. 
- - * They may, but are not required to, run in the kube-system - namespace, but the chosen namespace needs to be chosen such that - it won't conflict with users' namespaces. - -* The API server acts as the gateway to the cluster. By definition, - the API server must be accessible by clients from outside the - cluster, whereas the nodes, and certainly pods, may not be. Clients - authenticate the API server and also use it as a bastion and - proxy/tunnel to nodes and pods (and services), using /proxy and - /portforward APIs. - -* TBD: The - [CertificateSigningRequest](/contributors/design-proposals/cluster-lifecycle/kubelet-tls-bootstrap.md) - API, to enable credential generation, in particular to mint Kubelet - credentials - -Ideally, this nuclear API server would only support the minimum -required APIs, and additional functionality would be added via -[aggregation](/contributors/design-proposals/api-machinery/aggregated-api-servers.md), -hooks, initializers, and other extension mechanisms. - -Note that the centralized asynchronous controllers, such as garbage -collection, are currently run by a separate process, called the -[Controller -Manager](https://kubernetes.io/docs/admin/kube-controller-manager/). - -/healthz and /metrics endpoints may be used for cluster management -mechanisms, but are not considered part of the supported API surface, -and should not be used by clients generally in order to detect cluster -presence. The /version endpoint should be used instead. - -The API server depends on the following external components: - -* Persistent state store (etcd, or equivalent; perhaps multiple - instances) - -The API server may depend on: - -* Certificate authority - -* Identity provider - -* TokenReview API implementer - -* SubjectAccessReview API implementer - -#### Execution - -The most important and most prominent controller in Kubernetes is the -[Kubelet](https://kubernetes.io/docs/admin/kubelet/), which is the -primary implementer of the Pod and Node APIs that drive the container -execution layer. Without these APIs, Kubernetes would just be a -CRUD-oriented REST application framework backed by a key-value store -(and perhaps the API machinery will eventually be spun out as an -independent project). - -Kubernetes executes isolated application containers as its default, -native mode of execution. Kubernetes provides -[Pods](https://kubernetes.io/docs/user-guide/pods/) that can host -multiple containers and storage volumes as its fundamental execution -primitive. - -The Kubelet API surface and semantics include: - -* The Pod API, the Kubernetes execution primitive, including: - - * Pod feasibility-based admission control based on policies in the - Pod API (resource requests, node selector, node/pod affinity and - anti-affinity, taints and tolerations). API admission control - may reject pods or add additional scheduling constraints to - them, but Kubelet is the final arbiter of what pods can and - cannot run on a given node, not the schedulers or DaemonSets. - - * Container and volume semantics and lifecycle - - * Pod IP address allocation (a routable IP address per pod is - required) - - * A mechanism (i.e., ServiceAccount) to tie a Pod to a specific - security scope - - * Volume sources: - - * emptyDir - - * hostPath - - * secret - - * configMap - - * downwardAPI - - * NIY: [Container and image volumes](http://issues.k8s.io/831) - (and deprecate gitRepo) - - * NIY: Claims against local storage, so that complex - templating or separate configs are not needed for dev - vs. 
prod application manifests - - * flexVolume (which should replace built-in - cloud-provider-specific volumes) - - * Subresources: binding, status, exec, logs, attach, portforward, - proxy - -* NIY: [Checkpointing of API - resources](https://github.com/kubernetes/kubernetes/issues/489) for - availability and bootstrapping - -* Container image and log lifecycles - -* The Secret API, and mechanisms to enable third-party secret - management - -* The ConfigMap API, for [component - configuration](https://groups.google.com/forum/#!searchin/kubernetes-dev/component$20config%7Csort:relevance/kubernetes-dev/wtXaoHOiSfg/QFW5Ca9YBgAJ) - as well as Pod references - -* The Node API, hosts for Pods - - * May only be visible to cluster administrators in some - configurations - -* Node and pod networks and their controllers (route controller) - -* Node inventory, health, and reachability (node controller) - - * Cloud-provider-specific node inventory functions should be split - into a provider-specific controller. - -* Terminated-pod garbage collection - -* Volume controller - - * Cloud-provider-specific attach/detach logic should be split into - a provider-specific controller. Need a way to extract - provider-specific volume sources from the API. - -* The PersistentVolume API - - * NIY: At least backed by local storage, as mentioned above - -* The PersistentVolumeClaim API - -Again, centralized asynchronous functions, such as terminated-pod -garbage collection, are performed by the Controller Manager. - -The Controller Manager and Kubelet currently call out to a "cloud -provider" interface to query information from the infrastructure layer -and/or to manage infrastructure resources. However, [we’re working to -extract those -touchpoints](/contributors/design-proposals/cloud-provider/cloud-provider-refactoring.md) -([issue](https://github.com/kubernetes/kubernetes/issues/2770)) into -external components. The intended model is that unsatisfiable -application/container/OS-level requests (e.g., Pods, -PersistentVolumeClaims) serve as a signal to external “dynamic -provisioning” systems, which would make infrastructure available to -satisfy those requests and represent them in Kubernetes using -infrastructure resources (e.g., Nodes, PersistentVolumes), so that -Kubernetes could bind the requests and infrastructure resources -together. - -The Kubelet depends on the following external components: - -* Image registry - -* Container Runtime Interface implementation - -* Container Network Interface implementation - -* FlexVolume implementations ("CVI" in the diagram) - -And may depend on: - -* NIY: Cloud-provider node plug-in, to provide node identity, - topology, etc. - -* NIY: Third-party secret management system (e.g., Vault) - -* NIY: Credential generation and rotation controller - -Accepted layering violations: - -* [Explicit service links](https://github.com/kubernetes/community/pull/176) - -* The kubernetes service for the API server - -### The Application Layer: Deployment and Routing - -The application management and composition layer, providing -self-healing, scaling, application lifecycle management, service -discovery, load balancing, and routing -- also known as orchestration -and the service fabric. - -These APIs and functions are REQUIRED for any distribution of -Kubernetes. Kubernetes should provide default implementations for -these APIs, but replacements of the implementations of any or all of -these functions are permitted, provided the conformance tests -pass. 
Without these, most containerized applications will not run, and -few, if any, published examples will work. The vast majority of -containerized applications will use one or more of these. - -Kubernetes’s API provides IaaS-like container-centric primitives and -also lifecycle controllers to support orchestration (self-healing, -scaling, updates, termination) of all major categories of -workloads. These application management, composition, discovery, and -routing APIs and functions include: - -* A default scheduler, which implements the scheduling policies in the - Pod API: resource requests, nodeSelector, node and pod - affinity/anti-affinity, taints and tolerations. The scheduler runs - as a separate process, on or outside the cluster. - -* NIY: A - [rescheduler](/contributors/design-proposals/scheduling/rescheduling.md), - to reactively and proactively delete scheduled pods so that they can - be replaced and rescheduled to other nodes. - -* Continuously running applications: These application types should - all support rollouts (and rollbacks) via declarative updates, - cascading deletion, and orphaning/adoption. Other than DaemonSet, - all should support horizontal scaling. - - * The Deployment API, which orchestrates updates of stateless - applications, including subresources (status, scale, rollback) - - * The ReplicaSet API, for simple fungible/stateless - applications, especially specific versions of Deployment pod - templates, including subresources (status, scale) - - * The DaemonSet API, for cluster services, including subresources - (status) - - * The StatefulSet API, for stateful applications, including - subresources (status, scale) - - * The PodTemplate API, used by DaemonSet and StatefulSet to record change history - -* Terminating batch applications: These should include support for - automatic culling of terminated jobs (NIY). - - * The Job API ([GC - discussion](https://github.com/kubernetes/kubernetes/issues/30243)) - - * The CronJob API - -* Discovery, load balancing, and routing - - * The Service API, including allocation of cluster IPs, repair on - service allocation maps, load balancing via kube-proxy or - equivalent, and automatic Endpoints generation, maintenance, and - deletion for services. NIY: LoadBalancer service support is - OPTIONAL, but conformance tests must pass if it is - supported. If/when they are added, support for [LoadBalancer and - LoadBalancerClaim - APIs](https://github.com/kubernetes/community/pull/275) should - be present if and only if the distribution supports LoadBalancer - services. - - * The Ingress API, including [internal - L7](https://docs.google.com/document/d/1ILXnyU5D5TbVRwmoPnC__YMO9T5lmiA_UiU8HtgSLYk/edit?ts=585421fc) - (NIY) - - * Service DNS. DNS, using the [official Kubernetes - schema](https://git.k8s.io/dns/docs/specification.md), - is required. - -The application layer may depend on: - -* Identity provider (to-cluster identities and/or to-application - identities) - -* NIY: Cloud-provider controller implementation - -* Ingress controller(s) - -* Replacement and/or additional schedulers and/or reschedulers - -* Replacement DNS service - -* Replacement for kube-proxy - -* Replacement and/or - [auxiliary](https://github.com/kubernetes/kubernetes/issues/31571) - workload controllers, especially for extended rollout strategies - -### The Governance Layer: Automation and Policy Enforcement - -Policy enforcement and higher-level automation. 
- -These APIs and functions should be optional for running applications, -and should be achievable via other solutions. - -Each supported API/function should be applicable to a large fraction -of enterprise operations, security, and/or governance scenarios. - -It needs to be possible to configure and discover default policies for -the cluster (perhaps similar to [Openshift’s new project template -mechanism](https://docs.openshift.org/latest/admin_guide/managing_projects.html#template-for-new-projects), -but supporting multiple policy templates, such as for system -namespaces vs. user ones), to support at least the following use -cases: - -* Is this a: (source: [multi-tenancy working - doc](https://docs.google.com/document/d/1IoINuGz8eR8Awk4o7ePKuYv9wjXZN5nQM_RFlVIhA4c/edit?usp=sharing)) - - * Single tenant / single user cluster - - * Multiple trusted tenant cluster - - * Production vs. dev cluster - - * Highly tenanted playground cluster - - * Segmented cluster for reselling compute / app services to others - -* Do I care about limiting: - - * Resource usage - - * Internal segmentation of nodes - - * End users - - * Admins - - * DoS - -Automation APIs and functions: - -* The Metrics APIs (needed for H/V autoscaling, scheduling TBD) - -* The HorizontalPodAutoscaler API - -* NIY: The vertical pod autoscaling API(s) - -* [Cluster autoscaling and/or node - provisioning](https://git.k8s.io/contrib/cluster-autoscaler) - -* The PodDisruptionBudget API - -* Dynamic volume provisioning, for at least one volume source type - - * The StorageClass API, implemented at least for the default - volume type - -* Dynamic load-balancer provisioning - -* NIY: The - [PodPreset](/contributors/design-proposals/service-catalog/pod-preset.md) - API - -* NIY: The [service - broker/catalog](https://github.com/kubernetes-incubator/service-catalog) - APIs - -* NIY: The - [Template](/contributors/design-proposals/apps/OBSOLETE_templates.md) - and TemplateInstance APIs - -Policy APIs and functions: - -* [Authorization](https://kubernetes.io/docs/admin/authorization/): - The ABAC and RBAC authorization policy schemes. - - * RBAC, if used, is configured using a number of APIs: Role, - RoleBinding, ClusterRole, ClusterRoleBinding - -* The LimitRange API - -* The ResourceQuota API - -* The PodSecurityPolicy API - -* The ImageReview API - -* The NetworkPolicy API - -The management layer may depend on: - -* Network policy enforcement mechanism - -* Replacement and/or additional horizontal and vertical pod - autoscalers - -* [Cluster autoscaler and/or node provisioner](https://git.k8s.io/contrib/cluster-autoscaler) - -* Dynamic volume provisioners - -* Dynamic load-balancer provisioners - -* Metrics monitoring pipeline, or a replacement for it - -* Service brokers - -### The Interface Layer: Libraries and Tools - -These mechanisms are suggested for distributions, and also are -available for download and installation independently by users. They -include commonly used libraries, tools, systems, and UIs developed by -official Kubernetes projects, though other tools may be used to -accomplish the same tasks. They may be used by published examples. - -Commonly used libraries, tools, systems, and UIs developed under some -Kubernetes-owned GitHub org. - -* Kubectl -- We see kubectl as one of many client tools, rather than - as a privileged one. Our aim is to make kubectl thinner, by [moving - commonly used non-trivial functionality into the - API](https://github.com/kubernetes/kubernetes/issues/12143). 
This is - necessary in order to facilitate correct operation across Kubernetes - releases, to facilitate API extensibility, to preserve the - API-centric Kubernetes ecosystem model, and to simplify other - clients, especially non-Go clients. - -* Client libraries (e.g., client-go, client-python) - -* Cluster federation (API server, controllers, kubefed) - -* Dashboard - -* Helm - -These components may depend on: - -* Kubectl extensions (discoverable via help) - -* Helm extensions (discoverable via help) - -### The Ecosystem - -These things are not really "part of" Kubernetes at all. - -There are a number of areas where we have already defined [clear-cut -boundaries](https://kubernetes.io/docs/whatisk8s#kubernetes-is-not) -for Kubernetes. - -While Kubernetes must offer functionality commonly needed to deploy -and manage containerized applications, as a general rule, we preserve -user choice in areas complementing Kubernetes’s general-purpose -orchestration functionality, especially areas that have their own -competitive landscapes comprised of numerous solutions satisfying -diverse needs and preferences. Kubernetes may provide plug-in APIs for -such solutions, or may expose general-purpose APIs that could be -implemented by multiple backends, or expose APIs that such solutions -can target. Sometimes, the functionality can compose cleanly with -Kubernetes without explicit interfaces. - -Additionally, to be considered part of Kubernetes, a component would -need to follow Kubernetes design conventions. For instance, systems -whose primary interfaces are domain-specific languages (e.g., -[Puppet](https://docs.puppet.com/puppet/4.9/lang_summary.html), [Open -Policy Agent](http://www.openpolicyagent.org/)) aren’t compatible with -the Kubernetes API-centric approach, and are perfectly fine to use -with Kubernetes, but wouldn’t be considered to be part of -Kubernetes. Similarly, solutions designed to support multiple -platforms likely wouldn’t follow Kubernetes API conventions, and -therefore wouldn’t be considered to be part of Kubernetes. - -* Inside container images: Kubernetes is not opinionated about the - contents of container images -- if it lives inside the container - image, it lives outside Kubernetes. This includes, for example, - language-specific application frameworks. - -* On top of Kubernetes - - * Continuous integration and deployment: Kubernetes is - unopinionated in the source-to-image space. It does not deploy - source code and does not build your application. Continuous - Integration (CI) and continuous deployment workflows are areas - where different users and projects have their own requirements - and preferences, so we aim to facilitate layering CI/CD - workflows on Kubernetes but don't dictate how they should work. - - * Application middleware: Kubernetes does not provide application - middleware, such as message queues and SQL databases, as - built-in infrastructure. It may, however, provide - general-purpose mechanisms, such as service-broker integration, - to make it easier to provision, discover, and access such - components. Ideally, such components would just run on - Kubernetes. - - * Logging and monitoring: Kubernetes does not provide logging - aggregation, comprehensive application monitoring, nor telemetry - analysis and alerting systems, though such mechanisms are - essential to production clusters. - - * Data-processing platforms: Spark and Hadoop are well known - examples, but there are [many such - systems](https://hadoopecosystemtable.github.io/). 
- - * [Application-specific - operators](https://coreos.com/blog/introducing-operators.html): - Kubernetes supports workload management for common categories of - applications, but not for specific applications. - - * Platform as a Service: Kubernetes [provides a - foundation](https://kubernetes.io/blog/2017/02/caas-the-foundation-for-next-gen-paas/) - for a multitude of focused, opinionated PaaSes, including DIY - ones. - - * Functions as a Service: Similar to PaaS, but FaaS additionally - encroaches into containers and language-specific application - frameworks. - - * [Workflow - orchestration](https://github.com/kubernetes/kubernetes/pull/24781#issuecomment-215914822): - "Workflow" is a very broad, diverse area, with solutions - typically tailored to specific use cases (data-flow graphs, - data-driven processing, deployment pipelines, event-driven - automation, business-process execution, iPaaS) and specific - input and event sources, and often requires arbitrary code to - evaluate conditions, actions, and/or failure handling. - - * [Configuration - DSLs](https://github.com/kubernetes/kubernetes/pull/1007/files): - Domain-specific languages do not facilitate layering - higher-level APIs and tools, they usually have limited - expressibility, testability, familiarity, and documentation, - they promote complex configuration generation, they tend to - compromise interoperability and composability, they complicate - dependency management, and uses often subvert abstraction and - encapsulation. - - * [Kompose](https://github.com/kubernetes-incubator/kompose): - Kompose is a project-supported adaptor tool that facilitates - migration to Kubernetes from Docker Compose and enables simple - use cases, but doesn’t follow Kubernetes conventions and is - based on a manually maintained DSL. - - * [ChatOps](https://github.com/harbur/kubebot): Also adaptor - tools, for the multitude of chat services. - -* Underlying Kubernetes - - * Container runtime: Kubernetes does not provide its own container - runtime, but provides an interface for plugging in the container - runtime of your choice. - - * Image registry: Kubernetes pulls container images to the nodes. - - * Cluster state store: Etcd - - * Network: As with the container runtime, we support an interface - (CNI) that facilitates pluggability. - - * File storage: Local filesystems and network-attached storage. - - * Node management: Kubernetes neither provides nor adopts any - comprehensive machine configuration, maintenance, management, or - self-healing systems, which typically are handled differently in - different public/private clouds, for different operating - systems, for mutable vs. immutable infrastructure, for shops - already using tools outside of their Kubernetes clusters, etc. - - * Cloud provider: IaaS provisioning and management. - - * Cluster creation and management: The community has developed - numerous tools, such as minikube, kubeadm, bootkube, kube-aws, - kops, kargo, kubernetes-anywhere, and so on. As can be seen from - the diversity of tools, there is no one-size-fits-all solution - for cluster deployment and management (e.g., upgrades). There's - a spectrum of possible solutions, each with different - tradeoffs. 
That said, common building blocks (e.g., [secure - Kubelet - registration](/contributors/design-proposals/cluster-lifecycle/kubelet-tls-bootstrap.md)) - and approaches (in particular, - [self-hosting](/contributors/design-proposals/cluster-lifecycle/self-hosted-kubernetes.md#what-is-self-hosted)) - would reduce the amount of custom orchestration needed in such - tools. - -We would like to see the ecosystem build and/or integrate solutions to -fill these needs. - -Eventually, most Kubernetes development should fall in the ecosystem. - -## Managing the matrix - -Options, Configurable defaults, Extensions, Plug-ins, Add-ons, -Provider-specific functionality, Version skew, Feature discovery, and -Dependency management. - -Kubernetes is not just an open-source toolkit, but is typically -consumed as a running, easy-to-run, or ready-to-run cluster or -service. We would like most users and use cases to be able to use -stock upstream releases. This means Kubernetes needs sufficient -extensibility without rebuilding to handle such use cases. - -While gaps in extensibility are the primary drivers of code forks and -gaps in upstream cluster lifecycle management solutions are currently -the primary drivers of the proliferation of Kubernetes distributions, -the existence of optional features (e.g., alpha APIs, -provider-specific APIs), configurability, pluggability, and -extensibility make the concept inevitable. - -However, to make it possible for users to deploy and manage their -applications and for developers to build Kubernetes extensions on/for -arbitrary Kubernetes clusters, they must be able to make assumptions -about what a cluster or distribution of Kubernetes provides. Where -functionality falls out of these base assumptions, there needs to be a -way to discover what functionality is available and to express -functionality requirements (dependencies) for usage. - -Cluster components, including add-ons, should be registered via the -[component registration -API](https://github.com/kubernetes/kubernetes/issues/18610) and -discovered via /componentstatuses. - -Enabled built-in APIs, aggregated APIs, and registered third-party -resources should be discoverable via the discovery and OpenAPI -(swagger.json) endpoints. As mentioned above, cloud-provider support -for LoadBalancer-type services should be determined by whether the -LoadBalancer API is present. - -Extensions and their options should be registered via FooClass -resources, similar to -[StorageClass](https://git.k8s.io/kubernetes/pkg/apis/storage/v1beta1/types.go#L31), -but with parameter descriptions, types (e.g., integer vs string), -constraints (e.g., range or regexp) for validation, and default -values, with a reference to fooClassName from the extended API. These -APIs should also configure/expose the presence of related features, -such as dynamic volume provisioning (indicated by a non-empty -storageclass.provisioner field), as well as identifying the -responsible -[controller](https://github.com/kubernetes/kubernetes/issues/31571). We -need to add such APIs for at least scheduler classes, ingress -controller classes, flex volume classes, and compute resource classes -(e.g., GPUs, other accelerators). - -Assuming we transitioned existing network-attached volume sources to -flex volumes, this approach would cover volume sources. 
In the future, -the API should provide only [general-purpose -abstractions](https://docs.google.com/document/d/1QVxD---9tHXYj8c_RayLY9ClrFpqfuejN7p0vtv2kW0/edit#heading=h.mij1ubfelvar), -even if, as with LoadBalancer services, the abstractions are not -implemented in all environments (i.e., the API does not need to cater -to the lowest common denominator). - -NIY: We also need to develop mechanisms for registering and -discovering the following: - -* Admission-control plugins and hooks (including for built-in APIs) - -* Authentication plugins - -* Authorization plugins and hooks - -* Initializers and finalizers - -* [Scheduler - extensions](/contributors/design-proposals/scheduling/scheduler_extender.md) - -* Node labels and [cluster - topology](https://github.com/kubernetes/kubernetes/issues/41442) - (topology classes?) - -NIY: Activation/deactivation of both individual APIs and finer-grain -features could be addressed by the following mechanisms: - -* [The configuration for all components is being converted from - command-line flags to versioned - configuration.](https://github.com/kubernetes/kubernetes/issues/12245) - -* [We intend to store most of that configuration data in - ](https://github.com/kubernetes/kubernetes/issues/1627)[ConfigMap](https://github.com/kubernetes/kubernetes/issues/1627)[s, - to facilitate dynamic reconfiguration, progressive rollouts, and - introspectability.](https://github.com/kubernetes/kubernetes/issues/1627) - -* [Configuration common to all/multiple components should be factored - out into its own configuration - object(s).](https://github.com/kubernetes/kubernetes/issues/19831) - This should include the [feature-gate - mechanism](/contributors/design-proposals/cluster-lifecycle/runtimeconfig.md). - -* An API should be added for semantically meaningful settings, such as - the default length of time to wait before deleting pods on - unresponsive nodes. - -NIY: The problem of [version-skewed -operation](https://github.com/kubernetes/kubernetes/issues/4855), for -features dependent on upgrades of multiple components (including -replicas of the same component in HA clusters), should be addressed -by: - -1. Creating flag gates for all new such features, - -2. Always disabling the features by default in the first minor release - in which they appear, - -3. Providing configuration patches to enable the features, and - -4. Enabling them by default in the next minor release. - -NIY: We additionally need a mechanism to [warn about out of date -nodes](https://github.com/kubernetes/kubernetes/issues/23874), and/or -potentially prevent master upgrades (other than to patch releases) -until/unless the nodes have been upgraded. - -NIY: [Field-level -versioning](https://github.com/kubernetes/kubernetes/issues/34508) -would facilitate solutions to bulk activation of new and/or alpha API -fields, prevention of clobbering of new fields by poorly written -out-of-date clients, and evolution of non-alpha APIs without a -proliferation of full-fledged API definitions. - -The Kubernetes API server silently ignores unsupported resource fields -and query parameters, but not unknown/unregistered APIs (note that -unimplemented/inactive APIs should be disabled). This can facilitate -the reuse of configuration across clusters of multiple releases, but -more often leads to surprises. Kubectl supports optional validation -using the Swagger/OpenAPI specification from the server. 
Such optional -validation should be [provided by the -server](https://github.com/kubernetes/kubernetes/issues/5889) -(NIY). Additionally, shared resource manifests should specify the -minimum required Kubernetes release, for user convenience, which could -potentially be verified by kubectl and other clients. - -Additionally, unsatisfiable Pod scheduling constraints and -PersistentVolumeClaim criteria silently go unmet, which can useful as -demand signals to automatic provisioners, but also makes the system -more error prone. It should be possible to configure rejection of -unsatisfiable requests, using FooClass-style APIs, as described above -([NIY](https://github.com/kubernetes/kubernetes/issues/17324)). - -The Service Catalog mechanism (NIY) should make it possible to assert -the existence of application-level services, such as S3-compatible -cluster storage. - -## Layering of the system as it relates to security - -In order to properly secure a Kubernetes cluster and enable [safe -extension](https://github.com/kubernetes/kubernetes/issues/17456), a -few fundamental concepts need to be defined and agreed on by the -components of the system. It’s best to think of Kubernetes as a series -of rings from a security perspective, with each layer granting the -successive layer capabilities to act. - -1. One or more data storage systems (etcd) for the nuclear APIs - -2. The nuclear APIs - -3. APIs for highly trusted resources (system policies) - -4. Delegated trust APIs and controllers (users grant access to the API - / controller to perform actions on their behalf) either at the - cluster scope or smaller scopes - -5. Untrusted / scoped APIs and controllers and user workloads that run - at various scopes - -When a lower layer depends on a higher layer, it collapses the -security model and makes defending the system more complicated - an -administrator may *choose* to do so to gain operational simplicity, -but that must be a conscious choice. A simple example is etcd: any -component that can write data to etcd is now root on the entire -cluster, and any actor that can corrupt a highly trusted resource can -almost certainly escalate. It is useful to divide the layers above -into separate sets of machines for each layer of processes (etcd -> -apiservers + controllers -> nuclear security extensions -> delegated -extensions -> user workloads), even if some may be collapsed in -practice. - -If the layers described above define concentric circles, then it -should also be possible for overlapping or independent circles to -exist - for instance, administrators may choose an alternative secret -storage solution that cluster workloads have access to yet the -platform does not implicitly have access to. The point of intersection -for these circles tends to be the machines that run the workloads, and -nodes must have no more privileges than those required for proper -function. - -Finally, adding a new capability via extension at any layer should -follow best practices for communicating the impact of that action. - -When a capability is added to the system via extension, what purpose -does it have? 
-
-* Make the system more secure
-
-* Enable a new "production quality" API for consumption by everyone in
-  the cluster
-
-* Automate a common task across a subset of the cluster
-
-* Run a hosted workload that offers APIs to consumers (Spark, a
-  database, etcd)
-
-* These fall into three major groups:
-
-    * Required for the cluster (and hence must run close to the core,
-      and cause operational tradeoffs in the presence of failure)
-
-    * Exposed to all cluster users (must be properly tenanted)
-
-    * Exposed to a subset of cluster users (runs more like traditional
-      "app" workloads)
-
-If an administrator can easily be tricked into installing a new
-cluster-level security rule during extension, then the layering is
-compromised and the system is vulnerable.
-
-## Next Steps
-
-In addition to completing the technical mechanisms described and/or
-implied above, we need to apply the principles in this document to a
-set of more focused documents that answer specific practical
-questions. Here are some suggested documents and the questions they
-answer:
-
-* **Kubernetes API Conventions For Extension API Developers**
-
-    * Audience: someone planning to build a Kubernetes-like API
-      extension (TPR or Aggregated API, etc…)
-
-    * Answers Questions:
-
-        * What conventions should I follow? (metadata, status, etc.)
-
-        * What integrations do I get from following those conventions?
-
-        * What can I omit and Does
-
-        * Document that answers this question:
-
-* **Kubernetes API Conventions For CLI and UI Developers**
-
-    * Audience: someone working on kubectl, dashboard, or another CLI
-      or UI.
-
-    * Answers Questions:
-
-        * How can I show a user which objects are subordinate and should
-          normally be hidden?
-
-* **Required and Optional Behaviors for Kubernetes
-  Distributions/Services**
-
-    * See also the [certification
-      issue](https://github.com/kubernetes/community/issues/432)
-
-    * I just implemented a Hosted version of the Kubernetes API.
-
-        * Which API groups, versions and Kinds do I have to expose?
-
-        * Can I provide my own logging, auth, auditing integrations
-          and so on?
-
-        * Can I call it Kubernetes if it just has pods but no services?
-
-    * I'm packaging up a curated Kubernetes distro and selling it.
-
-        * Which API groups, versions and Kinds do I have to expose in
-          order to call it Kubernetes?
-
-* **Assumptions/conventions for application developers**
-
-    * I want to write portable config across on-prem and several clouds.
-
-        * What API types and behaviors should I assume are always
-          present and fully abstracted? When should I try to detect a
-          feature (by asking the user what cloud provider they have,
-          or, in the future, via a feature discovery API)?
-
-* **Kubernetes Security Models**
-
-    * Audience: hosters, distributors, custom cluster builders
+Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/architecture/architecture.md b/contributors/design-proposals/architecture/architecture.md index a8e103e3..f0fbec72 100644 --- a/contributors/design-proposals/architecture/architecture.md +++ b/contributors/design-proposals/architecture/architecture.md @@ -1,251 +1,6 @@ -# Kubernetes Design and Architecture +Design proposals have been archived. -A much more detailed and updated [Architectural -Roadmap](/contributors/design-proposals/architecture/architectural-roadmap.md) is also available. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Overview -Kubernetes is a production-grade, open-source infrastructure for the deployment, scaling, -management, and composition of application containers across clusters of hosts, inspired -by [previous work at Google](https://research.google.com/pubs/pub44843.html). Kubernetes -is more than just a “container orchestrator”. It aims to eliminate the burden of orchestrating -physical/virtual compute, network, and storage infrastructure, and enable application operators -and developers to focus entirely on container-centric primitives for self-service operation. -Kubernetes also provides a stable, portable foundation (a platform) for building customized -workflows and higher-level automation. - -Kubernetes is primarily targeted at applications composed of multiple containers. It therefore -groups containers using *pods* and *labels* into tightly coupled and loosely coupled formations -for easy management and discovery. - -## Scope - -[Kubernetes](https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/) is a platform for deploying and managing containers. -Kubernetes provides a container runtime, container -orchestration, container-centric infrastructure orchestration, self-healing mechanisms such as health checking and re-scheduling, and service discovery and load balancing. - -Kubernetes aspires to be an extensible, pluggable, building-block OSS -platform and toolkit. Therefore, architecturally, we want Kubernetes to be built -as a collection of pluggable components and layers, with the ability to use -alternative schedulers, controllers, storage systems, and distribution -mechanisms, and we're evolving its current code in that direction. Furthermore, -we want others to be able to extend Kubernetes functionality, such as with -higher-level PaaS functionality or multi-cluster layers, without modification of -core Kubernetes source. Therefore, its API isn't just (or even necessarily -mainly) targeted at end users, but at tool and extension developers. Its APIs -are intended to serve as the foundation for an open ecosystem of tools, -automation systems, and higher-level API layers. Consequently, there are no -"internal" inter-component APIs. All APIs are visible and available, including -the APIs used by the scheduler, the node controller, the replication-controller -manager, Kubelet's API, etc. There's no glass to break -- in order to handle -more complex use cases, one can just access the lower-level APIs in a fully -transparent, composable manner. - -## Goals - -The project is committed to the following (aspirational) [design ideals](principles.md): -* _Portable_. 
Kubernetes runs everywhere -- public cloud, private cloud, bare metal, laptop -- - with consistent behavior so that applications and tools are portable throughout the ecosystem - as well as between development and production environments. -* _General-purpose_. Kubernetes should run all major categories of workloads to enable you to run - all of your workloads on a single infrastructure, stateless and stateful, microservices and - monoliths, services and batch, greenfield and legacy. -* _Meet users partway_. Kubernetes doesn’t just cater to purely greenfield cloud-native - applications, nor does it meet all users where they are. It focuses on deployment and management - of microservices and cloud-native applications, but provides some mechanisms to facilitate - migration of monolithic and legacy applications. -* _Flexible_. Kubernetes functionality can be consumed a la carte and (in most cases) Kubernetes - does not prevent you from using your own solutions in lieu of built-in functionality. -* _Extensible_. Kubernetes enables you to integrate it into your environment and to add the - additional capabilities you need, by exposing the same interfaces used by built-in - functionality. -* _Automatable_. Kubernetes aims to dramatically reduce the burden of manual operations. It - supports both declarative control by specifying users’ desired intent via its API, as well as - imperative control to support higher-level orchestration and automation. The declarative - approach is key to the system’s self-healing and autonomic capabilities. -* _Advance the state of the art_. While Kubernetes intends to support non-cloud-native - applications, it also aspires to advance the cloud-native and DevOps state of the art, such as - in the [participation of applications in their own management](https://kubernetes.io/blog/2016/09/cloud-native-application-interfaces/). - However, in doing - so, we strive not to force applications to lock themselves into Kubernetes APIs, which is, for - example, why we prefer configuration over convention in the [downward API](https://kubernetes.io/docs/tasks/inject-data-application/downward-api-volume-expose-pod-information/#the-downward-api). - Additionally, Kubernetes is not bound by - the lowest common denominator of systems upon which it depends, such as container runtimes and - cloud providers. An example where we pushed the envelope of what was achievable was in its - [IP per Pod networking model](https://kubernetes.io/docs/concepts/cluster-administration/networking/#kubernetes-model). - -## Architecture - -A running Kubernetes cluster contains node agents (kubelet) and a cluster control plane (AKA -*master*), with cluster state backed by a distributed storage system -([etcd](https://github.com/coreos/etcd)). - -### Cluster control plane (AKA *master*) - -The Kubernetes [control plane](https://en.wikipedia.org/wiki/Control_plane) is split -into a set of components, which can all run on a single *master* node, or can be replicated -in order to support high-availability clusters, or can even be run on Kubernetes itself (AKA -[self-hosted](../cluster-lifecycle/self-hosted-kubernetes.md#what-is-self-hosted)). - -Kubernetes provides a REST API supporting primarily CRUD operations on (mostly) persistent resources, which -serve as the hub of its control plane. 
Kubernetes’s API provides IaaS-like -container-centric primitives such as [Pods](https://kubernetes.io/docs/concepts/workloads/pods/pod/), -[Services](https://kubernetes.io/docs/concepts/services-networking/service/), and -[Ingress](https://kubernetes.io/docs/concepts/services-networking/ingress/), and also lifecycle APIs to support orchestration -(self-healing, scaling, updates, termination) of common types of workloads, such as -[ReplicaSet](https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/) (simple fungible/stateless app manager), -[Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/) (orchestrates updates of -stateless apps), [Job](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/) (batch), -[CronJob](https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/) (cron), -[DaemonSet](https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/) (cluster services), and -[StatefulSet](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/) (stateful apps). -We deliberately decoupled service naming/discovery and load balancing from application -implementation, since the latter is diverse and open-ended. - -Both user clients and components containing asynchronous controllers interact with the same API resources, -which serve as coordination points, common intermediate representation, and shared state. Most resources -contain metadata, including [labels](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/) and -[annotations](https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/), fully elaborated desired state (spec), -including default values, and observed state (status). - -Controllers work continuously to drive the actual state towards the desired state, while reporting back the currently observed state for users and for other controllers. - -While the controllers are level-based (as described [here](http://gengnosis.blogspot.com/2007/01/level-triggered-and-edge-triggered.html) and [here](https://hackernoon.com/level-triggering-and-reconciliation-in-kubernetes-1f17fe30333d)) -to maximize fault -tolerance, they typically `watch` for changes to relevant resources in order to minimize reaction -latency and redundant work. This enables decentralized and decoupled -[choreography-like](https://en.wikipedia.org/wiki/Service_choreography) coordination without a -message bus. - -#### API Server - -The [API server](https://kubernetes.io/docs/admin/kube-apiserver/) serves up the -[Kubernetes API](https://kubernetes.io/docs/concepts/overview/kubernetes-api/). It is intended to be a relatively simple -server, with most/all business logic implemented in separate components or in plug-ins. It mainly -processes REST operations, validates them, and updates the corresponding objects in `etcd` (and -perhaps eventually other stores). Note that, for a number of reasons, Kubernetes deliberately does -not support atomic transactions across multiple resources. - -Kubernetes cannot function without this basic API machinery, which includes: -* REST semantics, watch, durability and consistency guarantees, API versioning, defaulting, and - validation -* Built-in admission-control semantics, synchronous admission-control hooks, and asynchronous - resource initialization -* API registration and discovery - -Additionally, the API server acts as the gateway to the cluster. 
By definition, the API server -must be accessible by clients from outside the cluster, whereas the nodes, and certainly -containers, may not be. Clients authenticate the API server and also use it as a bastion and -proxy/tunnel to nodes and pods (and services). - -#### Cluster state store - -All persistent cluster state is stored in an instance of `etcd`. This provides a way to store -configuration data reliably. With `watch` support, coordinating components can be notified very -quickly of changes. - - -#### Controller-Manager Server - -Most other cluster-level functions are currently performed by a separate process, called the -[Controller Manager](https://kubernetes.io/docs/admin/kube-controller-manager/). It performs -both lifecycle functions (e.g., namespace creation and lifecycle, event garbage collection, -terminated-pod garbage collection, cascading-deletion garbage collection, node garbage collection) -and API business logic (e.g., scaling of pods controlled by a -[ReplicaSet](https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/)). - -The application management and composition layer, providing self-healing, scaling, application lifecycle management, service discovery, routing, and service binding and provisioning. - -These functions may eventually be split into separate components to make them more easily -extended or replaced. - -#### Scheduler - - -Kubernetes enables users to ask a cluster to run a set of containers. The scheduler -component automatically chooses hosts to run those containers on. - -The scheduler watches for unscheduled pods and binds them to nodes via the `/binding` pod -subresource API, according to the availability of the requested resources, quality of service -requirements, affinity and anti-affinity specifications, and other constraints. - -Kubernetes supports user-provided schedulers and multiple concurrent cluster schedulers, -using the shared-state approach pioneered by -[Omega](https://research.google.com/pubs/pub41684.html). In addition to the disadvantages of -pessimistic concurrency described by the Omega paper, -[two-level scheduling models](https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/Mesos-A-Platform-for-Fine-Grained-Resource-Sharing-in-the-Data-Center.pdf) that hide information from the upper-level -schedulers need to implement all of the same features in the lower-level scheduler as required by -all upper-layer schedulers in order to ensure that their scheduling requests can be satisfied by -available desired resources. - - -### The Kubernetes Node - -The Kubernetes node has the services necessary to run application containers and -be managed from the master systems. - -#### Kubelet - -The most important and most prominent controller in Kubernetes is the Kubelet, which is the -primary implementer of the Pod and Node APIs that drive the container execution layer. Without -these APIs, Kubernetes would just be a CRUD-oriented REST application framework backed by a -key-value store (and perhaps the API machinery will eventually be spun out as an independent -project). - -Kubernetes executes isolated application containers as its default, native mode of execution, as -opposed to processes and traditional operating-system packages. Not only are application -containers isolated from each other, but they are also isolated from the hosts on which they -execute, which is critical to decoupling management of individual applications from each other and -from management of the underlying cluster physical/virtual infrastructure. 
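
For illustration, a minimal sketch of the kind of Pod specification the Kubelet consumes (the names and images below are hypothetical); a single Pod can group tightly coupled containers around shared storage:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app                      # hypothetical name
spec:
  volumes:
  - name: shared-data
    emptyDir: {}                         # scratch volume shared by both containers
  containers:
  - name: app
    image: example.com/app:1.0           # hypothetical image
    volumeMounts:
    - name: shared-data
      mountPath: /data
  - name: log-shipper                    # sidecar reading what the app writes
    image: example.com/log-shipper:1.0   # hypothetical image
    volumeMounts:
    - name: shared-data
      mountPath: /data
      readOnly: true
```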
- -Kubernetes provides [Pods](https://kubernetes.io/docs/concepts/workloads/pods/pod/) that can host multiple -containers and storage volumes as its fundamental execution primitive in order to facilitate -packaging a single application per container, decoupling deployment-time concerns from build-time -concerns, and migration from physical/virtual machines. The Pod primitive is key to glean the -[primary benefits](https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/#why-containers) of deployment on modern -cloud platforms, such as Kubernetes. - -API admission control may reject pods or add additional scheduling constraints to them, but -Kubelet is the final arbiter of what pods can and cannot run on a given node, not the schedulers -or DaemonSets. - -Kubelet also currently links in the [cAdvisor](https://github.com/google/cadvisor) resource monitoring -agent. - -#### Container runtime - -Each node runs a container runtime, which is responsible for downloading images and running containers. - -Kubelet does not link in the base container runtime. Instead, we're defining a -[Container Runtime Interface](/contributors/devel/sig-node/container-runtime-interface.md) to control the -underlying runtime and facilitate pluggability of that layer. -This decoupling is needed in order to maintain clear component boundaries, facilitate testing, and facilitate pluggability. -Runtimes supported today, either upstream or by forks, include at least docker (for Linux and Windows), -[rkt](https://github.com/rkt/rkt), -[cri-o](https://github.com/kubernetes-incubator/cri-o), and [frakti](https://github.com/kubernetes/frakti). - -#### Kube Proxy - -The [service](https://kubernetes.io/docs/concepts/services-networking/service/) abstraction provides a way to -group pods under a common access policy (e.g., load-balanced). The implementation of this creates -a virtual IP which clients can access and which is transparently proxied to the pods in a Service. -Each node runs a [kube-proxy](https://kubernetes.io/docs/admin/kube-proxy/) process which programs -`iptables` rules to trap access to service IPs and redirect them to the correct backends. This provides a highly-available load-balancing solution with low performance overhead by balancing -client traffic from a node on that same node. - -Service endpoints are found primarily via [DNS](https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/). - -### Add-ons and other dependencies - -A number of components, called [*add-ons*](https://git.k8s.io/kubernetes/cluster/addons) typically run on Kubernetes -itself: -* [DNS](https://git.k8s.io/kubernetes/cluster/addons/dns) -* [Ingress controller](https://github.com/kubernetes/ingress-gce) -* [Heapster](https://github.com/kubernetes/heapster/) (resource monitoring) -* [Dashboard](https://github.com/kubernetes/dashboard/) (GUI) - -### Federation - -A single Kubernetes cluster may span multiple availability zones. - -However, for the highest availability, we recommend using [cluster federation](../multicluster/federation.md). +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/architecture/declarative-application-management.md b/contributors/design-proposals/architecture/declarative-application-management.md index 08e900a4..f0fbec72 100644 --- a/contributors/design-proposals/architecture/declarative-application-management.md +++ b/contributors/design-proposals/architecture/declarative-application-management.md @@ -1,395 +1,6 @@ -# Declarative application management in Kubernetes +Design proposals have been archived. -> This article was authored by Brian Grant (bgrant0607) on 8/2/2017. The original Google Doc can be found here: [https://goo.gl/T66ZcD](https://goo.gl/T66ZcD) +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Most users will deploy a combination of applications they build themselves, also known as **_bespoke_** applications, and **common off-the-shelf (COTS)** components. Bespoke applications are typically stateless application servers, whereas COTS components are typically infrastructure (and frequently stateful) systems, such as databases, key-value stores, caches, and messaging systems. -In the case of the latter, users sometimes have the choice of using hosted SaaS products that are entirely managed by the service provider and are therefore opaque, also known as **_blackbox_** *services*. However, they often run open-source components themselves, and must configure, deploy, scale, secure, monitor, update, and otherwise manage the lifecycles of these **_whitebox_** *COTS applications*. - -This document proposes a unified method of managing both bespoke and off-the-shelf applications declaratively using the same tools and application operator workflow, while leveraging developer-friendly CLIs and UIs, streamlining common tasks, and avoiding common pitfalls. The approach is based on observations of several dozen configuration projects and hundreds of configured applications within Google and in the Kubernetes ecosystem, as well as quantitative analysis of Borg configurations and work on the Kubernetes [system architecture](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/architecture/architecture.md), [API](/contributors/devel/sig-architecture/api-conventions.md), and command-line tool ([kubectl](https://github.com/kubernetes/community/wiki/Roadmap:-kubectl)). - -The central idea is that a toolbox of composable configuration tools should manipulate configuration data in the form of declarative API resource specifications, which serve as a [declarative data model](https://docs.google.com/document/d/1RmHXdLhNbyOWPW_AtnnowaRfGejw-qlKQIuLKQWlwzs/edit#), not express configuration as code or some other representation that is restrictive, non-standard, and/or difficult to manipulate. - -## Declarative configuration - -Why the heavy emphasis on configuration in Kubernetes? Kubernetes supports declarative control by specifying users’ desired intent. The intent is carried out by asynchronous control loops, which interact through the Kubernetes API. This declarative approach is critical to the system’s self-healing, autonomic capabilities, and application updates. This approach is in contrast to manual imperative operations or flowchart-like orchestration. 
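
As a minimal sketch of what such declarative intent looks like (the names and image below are hypothetical), a user records the desired state in a manifest and applies it with something like `kubectl apply -f deployment.yaml`; the relevant controllers then continuously reconcile the cluster toward that state:

```yaml
# deployment.yaml -- declarative desired state, not a sequence of imperative steps
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-web                        # hypothetical name
  labels:
    app: hello-web
spec:
  replicas: 3                            # desired state: three Pods, recreated if lost
  selector:
    matchLabels:
      app: hello-web
  template:
    metadata:
      labels:
        app: hello-web
    spec:
      containers:
      - name: web
        image: example.com/hello-web:1.0 # hypothetical image
        ports:
        - containerPort: 8080
```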
-
-This is aligned with the industry trend towards [immutable infrastructure](http://thenewstack.io/a-brief-look-at-immutable-infrastructure-and-why-it-is-such-a-quest/), which facilitates predictability, reversibility, repeatability, scalability, and availability. Repeatability is even more critical for containers than for VMs, because containers typically have lifetimes that are measured in days, hours, even minutes. Production container images are typically built from configurable/scripted processes and have parameters overridden by configuration rather than modifying them interactively.
-
-What form should this configuration take in Kubernetes? The requirements are as follows:
-
-* Perhaps somewhat obviously, it should support **bulk** management operations: creation, deletion, and updates.
-
-* As stated above, it should be **universal**, usable for both bespoke and off-the-shelf applications, for most major workload categories, including stateless and stateful, and for both development and production environments. It also needs to be applicable to use cases outside application definition, such as policy configuration and component configuration.
-
-* It should **expose** the full power of Kubernetes (all CRUD APIs, API fields, API versions, and extensions), be **consistent** with concepts and properties presented by other tools, and should **teach** Kubernetes concepts and API, while providing a **bridge** for application developers who prefer imperative control or who need wizards and other tools to provide an onramp for beginners.
-
-* It should feel **native** to Kubernetes. There is a place for tools that work across multiple platforms but which are native to another platform and for tools that are designed to work across multiple platforms but are native to none, but such non-native solutions would increase complexity for Kubernetes users by not taking full advantage of Kubernetes-specific mechanisms and conventions.
-
-* It should **integrate** with key user tools and workflows, such as continuous deployment pipelines and application-level configuration formats, and **compose** with built-in and third-party API-based automation, such as [admission control](https://kubernetes.io/docs/admin/admission-controllers/), autoscaling, and [Operators](https://coreos.com/operators). In order to do this, it needs to support **separation of concerns** by supporting multiple distinct configuration sources and preserving declarative intent while allowing automatically set attributes.
-
-* In particular, it should be straightforward (but not required) to manage declarative intent under **version control**, which is [standard industry best practice](http://martinfowler.com/bliki/InfrastructureAsCode.html) and what Google does internally. Version control facilitates reproducibility, reversibility, and an audit trail. Unlike generated build artifacts, configuration is primarily human-authored, or at least it is desirable for it to be human-readable, and it is typically changed with a human in the loop, as opposed to fully automated processes, such as autoscaling. Version control enables the use of familiar tools and processes for change control, review, and conflict resolution.
- -* Users need the ability to **customize** off-the-shelf configurations and to instantiate multiple **variants**, without crossing the [line into the ecosystem](https://docs.google.com/presentation/d/1oPZ4rznkBe86O4rPwD2CWgqgMuaSXguIBHIE7Y0TKVc/edit#slide=id.g21b1f16809_5_86) of [configuration domain-specific languages, platform as a service, functions as a service](https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/#what-kubernetes-is-not), and so on, though users should be able to [layer such tools/systems on top](https://kubernetes.io/blog/2017/02/caas-the-foundation-for-next-gen-paas/) of the mechanism, should they choose to do so. - -* We need to develop clear **conventions**, **examples**, and mechanisms that foster **structure**, to help users understand how to combine Kubernetes’s flexible mechanisms in an effective manner. - -## Configuration customization and variant generation - -The requirement that drives the most complexity in typical configuration solutions is the need to be able to customize configurations of off-the-shelf components and/or to instantiate multiple variants. - -Deploying an application generally requires customization of multiple categories of configuration: - -* Frequently customized - - * Context: namespaces, [names, labels](https://github.com/kubernetes/kubernetes/issues/1698), inter-component references, identity - - * Image: repository/registry (source), tag (image stream/channel), digest (specific image) - - * Application configuration, overriding default values in images: command/args, env, app config files, static data - - * Resource parameters: replicas, cpu, memory, volume sources - - * Consumed services: coordinates, credentials, and client configuration - -* Less frequently customized - - * Management parameters: probe intervals, rollout constraints, utilization targets - -* Customized per environment - - * Environmental adapters: lifecycle hooks, helper sidecars for configuration, monitoring, logging, network/auth proxies, etc - - * Infrastructure mapping: scheduling constraints, tolerations - - * Security and other operational policies: RBAC, pod security policy, network policy, image provenance requirements - -* Rarely customized - - * Application topology, which makes up the basic structure of the application: new/replaced components - -In order to make an application configuration reusable, users need to be able to customize each of those categories of configuration. 
There are multiple approaches that could be used: - -* Fork: simple to understand; supports arbitrary changes and updates via rebasing, but hard to automate in a repeatable fashion to maintain multiple variants - -* Overlay / patch: supports composition and useful for standard transformations, such as setting organizational defaults or injecting environment-specific configuration, but can be fragile with respect to changes in the base configuration - -* Composition: useful for orthogonal concerns - - * Pull: Kubernetes provides APIs for distribution of application secrets (Secret) and configuration data (ConfigMap), and there is a [proposal open](http://issues.k8s.io/831) to support application data as well - - * the resource identity is fixed, by the object reference, but the contents are decoupled - - * the explicit reference makes it harder to consume a continuously updated stream of such resources, and harder to generate multiple variants - - * can give the PodSpec author some degree of control over the consumption of the data, such as environment variable names and volume paths (though service accounts are at conventional locations rather than configured ones) - - * Push: facilitates separation of concerns and late binding - - * can be explicit, such as with kubectl set or HorizontalPodAutoscaler - - * can be implicit, such as with LimitRange, PodSecurityPolicy, PodPreset, initializers - - * good for attaching policies to selected resources within a scope (namespace and/or label selector) - -* Transformation: useful for common cases (e.g., names and labels) - -* Generation: useful for static decisions, like "if this is a Java app…", which can be integrated into the declarative specification - -* Automation: useful for dynamic adaptation, such as horizontal and vertical auto-scaling, improves ease of use and aids encapsulation (by not exposing those details), and can mitigate phase-ordering problems - -* Parameterization: natural for small numbers of choices the user needs to make, but there are many pitfalls, discussed below - -Rather than relying upon a single approach, we should combine these techniques such that disadvantages are mitigated. - -Tools used to customize configuration [within Google](http://queue.acm.org/detail.cfm?id=2898444) have included: - -* Many bespoke domain-specific configuration languages ([DSLs](http://flabbergast.org)) - -* Python-based configuration DSLs (e.g., [Skylark](https://github.com/google/skylark)) - -* Transliterate configuration DSLs into structured data models/APIs, layered over and under existing DSLs in order to provide a form that is more amenable to automatic manipulation - -* Configuration overlay systems, override mechanisms, and template inheritance - -* Configuration generators, manipulation CLIs, IDEs, and wizards - -* Runtime config databases and spreadsheets - -* Several workflow/push/reconciliation engines - -* Autoscaling and resource-planning tools - -Note that forking/branching generally isn’t viable in Google’s monorepo. - -Despite many projects over the years, some of which have been very widely used, the problem is still considered to be not solved satisfactorily. Our experiences with these tools have informed this proposal, however, as well as the design of Kubernetes itself. 
- -A non-exhaustive list of tools built by the Kubernetes community (see [spreadsheet](https://docs.google.com/spreadsheets/d/1FCgqz1Ci7_VCz_wdh8vBitZ3giBtac_H8SBw4uxnrsE/edit#gid=0) for up-to-date list), in no particular order, follows: - -* [Helm](https://github.com/kubernetes/helm) -* [OC new-app](https://docs.openshift.com/online/dev_guide/application_lifecycle/new_app.html) -* [Kompose](https://github.com/kubernetes-incubator/kompose) -* [Spread](https://github.com/redspread/spread) -* [Draft](https://github.com/Azure/draft) -* [Ksonnet](https://github.com/ksonnet/ksonnet-lib)/[Kubecfg](https://github.com/ksonnet/kubecfg) -* [Databricks Jsonnet](https://databricks.com/blog/2017/06/26/declarative-infrastructure-jsonnet-templating-language.html) -* [Kapitan](https://github.com/deepmind/kapitan) -* [Konfd](https://github.com/kelseyhightower/konfd) -* [Templates](https://docs.openshift.com/online/dev_guide/templates.html)/[Ktmpl](https://github.com/InQuicker/ktmpl) -* [Fabric8 client](https://github.com/fabric8io/kubernetes-client) -* [Kubegen](https://github.com/errordeveloper/kubegen) -* [kenv](https://github.com/thisendout/kenv) -* [Ansible](https://docs.ansible.com/ansible/latest/modules/k8s_module.html) -* [Puppet](https://forge.puppet.com/garethr/kubernetes/readme) -* [KPM](https://github.com/coreos/kpm) -* [Nulecule](https://github.com/projectatomic/nulecule) -* [Kedge](https://github.com/kedgeproject/kedge) ([OpenCompose](https://github.com/redhat-developer/opencompose) is deprecated) -* [Chartify](https://github.com/appscode/chartify) -* [Podex](https://github.com/kubernetes/contrib/tree/master/podex) -* [k8sec](https://github.com/dtan4/k8sec) -* [kb80r](https://github.com/UKHomeOffice/kb8or) -* [k8s-kotlin-dsl](https://github.com/fkorotkov/k8s-kotlin-dsl) -* [KY](https://github.com/stellaservice/ky) -* [Kploy](https://github.com/kubernauts/kploy) -* [Kdeploy](https://github.com/flexiant/kdeploy) -* [Kubernetes-deploy](https://github.com/Shopify/kubernetes-deploy) -* [Generator-kubegen](https://www.sesispla.net/blog/language/en/2017/07/introducing-generator-kubegen-a-kubernetes-configuration-file-booster-tool/) -* [K8comp](https://github.com/cststack/k8comp) -* [Kontemplate](https://github.com/tazjin/kontemplate) -* [Kexpand](https://github.com/kopeio/kexpand) -* [Forge](https://github.com/datawire/forge/) -* [Psykube](https://github.com/CommercialTribe/psykube) -* [Koki](http://koki.io) -* [Deploymentizer](https://github.com/InVisionApp/kit-deploymentizer) -* [generator-kubegen](https://github.com/sesispla/generator-kubegen) -* [Broadway](https://github.com/namely/broadway) -* [Srvexpand](https://github.com/kubernetes/kubernetes/pull/1980/files) -* [Rok8s-scripts](https://github.com/reactiveops/rok8s-scripts) -* [ERB-Hiera](https://roobert.github.io/2017/08/16/Kubernetes-Manifest-Templating-with-ERB-and-Hiera/) -* [k8s-icl](https://github.com/archipaorg/k8s-icl) -* [sed](https://stackoverflow.com/questions/42618087/how-to-parameterize-image-version-when-passing-yaml-for-container-creation) -* [envsubst](https://github.com/fabric8io/envsubst) -* [Jinja](https://github.com/tensorflow/ecosystem/tree/master/kubernetes) -* [spiff](https://github.com/cloudfoundry-incubator/spiff) - -Additionally, a number of continuous deployment systems use their own formats and/or schemas. - -The number of tools is a signal of demand for a customization solution, as well as lack of awareness of and/or dissatisfaction with existing tools. 
[Many prefer](https://news.ycombinator.com/item?id=15029086) to use the simplest tool that meets their needs. Most of these tools support customization via simple parameter substitution or a more complex configuration domain-specific language, while not adequately supporting the other customization strategies. The pitfalls of parameterization and domain-specific languages are discussed below. - -### Parameterization pitfalls - -After simply forking (or just cut&paste), parameterization is the most commonly used customization approach. We have [previously discussed](https://github.com/kubernetes/kubernetes/issues/11492) requirements for parameterization mechanisms, such as explicit declaration of parameters for easy discovery, documentation, and validation (e.g., for [form generation](https://github.com/kubernetes/kubernetes/issues/6487)). It should also be straightforward to provide multiple sets of parameter values in support of variants and to manage them under version control, though many tools do not facilitate that. - -Some existing template examples: - -* [Openshift templates](https://github.com/openshift/library/tree/master/official) ([MariaDB example](https://github.com/luciddreamz/library/blob/master/official/mariadb/templates/mariadb-persistent.json)) -* [Helm charts](https://github.com/kubernetes/charts/) ([Jenkins example](https://github.com/kubernetes/charts/blob/master/stable/jenkins/templates/jenkins-master-deployment.yaml)) -* not Kubernetes, but a [Kafka Mesosphere Universe example](https://github.com/mesosphere/universe/blob/version-3.x/repo/packages/C/confluent-kafka/5/marathon.json.mustache) - -Parameterization solutions are easy to implement and to use at small scale, but parameterized templates tend to become complex and difficult to maintain. Syntax-oblivious macro substitution (e.g., sed, jinja, envsubst) can be fragile, and parameter substitution sites generally have to be identified manually, which is tedious and error-prone, especially for the most common use cases, such as resource name prefixing. - -Additionally, performing all customization via template parameters erodes template encapsulation. Some prior configuration-language design efforts made encapsulation a non-goal due to the widespread desire of users to override arbitrary parts of configurations. If used by enough people, someone will want to override each value in a template. Parameterizing every value in a template creates an alternative API schema that contains an out-of-date subset of the full API, and when [every value is a parameter](https://github.com/kubernetes/charts/blob/e002378c13e91bef4a3b0ba718c191ec791ce3f9/stable/artifactory/templates/artifactory-deployment.yaml), a template combined with its parameters is considerably less readable than the expanded result, and less friendly to data-manipulation scripts and tools. - -### Pitfalls of configuration domain-specific languages (DSLs) - -Since parameterization and file imports are common features of most configuration domain-specific languages (DSLs), they inherit the pitfalls of parameterization. The complex custom syntax (and/or libraries) of more sophisticated languages also tends to be more opaque, hiding information such as application topology from humans. Users generally need to understand the input language, transformations applied, and output generated, which is more complex for users to learn. 
Furthermore, custom-built languages [typically lack good tools](http://mikehadlow.blogspot.com/2012/05/configuration-complexity-clock.html) for refactoring, validation, testing, debugging, etc., and hard-coded translations are hard to maintain and keep up to date. And such syntax typically isn’t friendly to tools, for example [hiding information](https://github.com/kubernetes/kubernetes/issues/13241#issuecomment-233731291) about parameters and source dependencies, and is hostile to composition with other tools, configuration sources, configuration languages, runtime automation, and so on. The configuration source must be modified in order to customize additional properties or to add additional resources, which fosters closed, monolithic, fat configuration ecosystems and obstructs separation of concerns. This is especially true of tools and libraries that don’t facilitate post-processing of their output between pre-processing the DSL and actuation of the resulting API resources. - -Additionally, the more powerful languages make it easy for users to shoot themselves in their feet. For instance, it can be easy to mix computation and data. Among other problems, embedded code renders the configuration unparsable by other tools (e.g., extraction, injection, manipulation, validation, diff, interpretation, reconciliation, conversion) and clients. Such languages also make it easy to reduce boilerplate, which can be useful, but when taken to the extreme, impairs readability and maintainability. Nested/inherited templates are seductive, for those languages that enable them, but very hard to make reusable and maintainable in practice. Finally, it can be tempting to use these capabilities for many purposes, such as changing defaults or introducing new abstractions, but this can create different and surprising behavior compared to direct API usage through CLIs, libraries, UIs, etc., and create accidental pseudo-APIs rather than intentional, actual APIs. If common needs can only be addressed using the configuration language, then the configuration transformer must be invoked by most clients, as opposed to using the API directly, which is contrary to the design of Kubernetes as an API-centric system. - -Such languages are powerful and can perform complex transformations, but we found that to be a [mixed blessing within Google](http://research.google.com/pubs/pub44843.html). For instance, there have been many cases where users needed to generate configuration, manipulate configuration, backport altered API field settings into templates, integrate some kind of dynamic automation with declarative configuration, and so on. All of these scenarios were painful to implement with DSL templates in the way. Templates also created new abstractions, changed API default values, and diverged from the API in other ways that disoriented new users. - -A few DSLs are in use in the Kubernetes community, including Go templates (used by Helm, discussed more below), [fluent DSLs](https://github.com/fabric8io/kubernetes-client), and [jsonnet](http://jsonnet.org/), which was inspired by [Google’s Borg configuration language](https://research.google.com/pubs/pub43438.html) ([more on its root language, GCL](http://alexandria.tue.nl/extra1/afstversl/wsk-i/bokharouss2008.pdf)). [Ksonnet-lib](https://github.com/ksonnet/ksonnet-lib) is a community project aimed at building Kubernetes-specific jsonnet libraries. 
Unfortunately, the examples (e.g., [nginx](https://github.com/ksonnet/ksonnet-lib/blob/master/examples/readme/hello-nginx.jsonnet)) appear more complex than the raw Kubernetes API YAML, so while it may provide more expressive power, it is less approachable. Databricks looks like [the biggest success case](https://databricks.com/blog/2017/06/26/declarative-infrastructure-jsonnet-templating-language.html) with jsonnet to date, and uses an approach that is admittedly more readable than ksonnet-lib, as is [Kubecfg](https://github.com/ksonnet/kubecfg). However, they all encourage users to author and manipulate configuration code written in a DSL rather than configuration data written in a familiar and easily manipulated format, and are unnecessarily complex for most use cases. - -Helm is discussed below, with package management. - -In case it’s not clear from the above, I do not consider configuration schemas expressed using common data formats such as JSON and YAML (sans use of substitution syntax) to be configuration DSLs. - -## Configuration using REST API resource specifications - -Given the pitfalls of parameterization and configuration DSLs, as mentioned at the beginning of this document, configuration tooling should manipulate configuration **data**, not convert configuration to code nor other marked-up syntax, and, in the case of Kubernetes, this data should primarily contain specifications of the **literal Kubernetes API resources** required to deploy the application in the manner desired by the user. The Kubernetes API and CLI (kubectl) were designed to support this model, and our documentation and examples use this approach. - -[Kubernetes’s API](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/architecture.md#cluster-control-plane-aka-master) provides IaaS-like container-centric primitives such as Pods, Services, and Ingress, and also lifecycle controllers to support orchestration (self-healing, scaling, updates, termination) of common types of workloads, such as ReplicaSet (simple fungible/stateless app manager), Deployment (orchestrates updates of stateless apps), Job (batch), CronJob (cron), DaemonSet (cluster services), StatefulSet (stateful apps), and [custom third-party controllers/operators](https://coreos.com/blog/introducing-operators.html). The workload controllers, such as Deployment, support declarative upgrades using production-grade strategies such as rolling update, so that the client doesn’t need to perform complex orchestration in the common case. (And we’re moving [proven kubectl features to controllers](https://github.com/kubernetes/kubernetes/issues/12143), generally.) We also deliberately decoupled service naming/discovery and load balancing from application implementation in order to maximize deployment flexibility, which should be preserved by the configuration mechanism. - -[Kubectl apply](https://github.com/kubernetes/kubernetes/issues/15894) [was designed](https://github.com/kubernetes/kubernetes/issues/1702) ([original proposal](https://github.com/kubernetes/kubernetes/issues/1178)) to support declarative updates without clobbering operationally and/or automatically set desired state. Properties not explicitly specified by the user are free to be changed by automated and other out-of-band mechanisms. Apply is implemented as a 3-way merge of the user’s previous configuration, the new configuration, and the live state. 
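
As a rough sketch of where those three inputs live (the object and values below are hypothetical): the previous configuration is recorded on the live object itself, in the `kubectl.kubernetes.io/last-applied-configuration` annotation; the new configuration is whatever file is passed to `kubectl apply -f`; and the live state is the object currently stored by the API server:

```yaml
# Live Deployment as returned by the API server (abridged; values hypothetical)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-web
  annotations:
    # The "previous configuration" input to the 3-way merge, recorded by kubectl apply
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1","kind":"Deployment",
       "metadata":{"name":"hello-web","annotations":{}},
       "spec":{"selector":{"matchLabels":{"app":"hello-web"}},"template":{"...":"..."}}}
spec:
  replicas: 5   # set out of band (e.g., by an autoscaler); because replicas appears in
                # neither the previous nor the new configuration, apply leaves it alone
```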
- -We [chose this simple approach of using literal API resource specifications](https://github.com/kubernetes/kubernetes/pull/1007/files) for the following reasons: - -* KISS: It was simple and natural, given that we designed the API to support CRUD on declarative primitives, and Kubernetes uses the API representation in all scenarios where API resources need to be serialized (e.g., in persistent cluster storage). -* It didn’t require users to learn multiple different schemas, the API and another configuration format. We believe many/most production users will eventually want to use the API, and knowledge of the API transfers to other clients and tools. It doesn’t obfuscate the API, which is relatively easy to read. -* It automatically stays up to date with the API, automatically supports all Kubernetes resources, versions, extensions, etc., and can be automatically converted to new API versions. -* It could share mechanisms with other clients (e.g., Swagger/OpenAPI, which is used for schema validation), which are now supported in several languages: Go, Python, Java, … -* Declarative configuration is only one interface to the system. There are also CLIs (e.g., kubectl), UIs (e.g., dashboard), mobile apps, chat bots, controllers, admission controllers, Operators, deployment pipelines, etc. Those clients will (and should) target the API. The user will need to interact with the system in terms of the API in these other scenarios. -* The API serves as a well defined intermediate representation, pre- and post-creation, with a documented deprecation policy. Tools, libraries, controllers, UI wizards, etc. can be built on top, leaving room for exploration and innovation within the community. Example API-based transformations include: - * Overlay application: kubectl patch - * Generic resource tooling: kubectl label, kubectl annotate - * Common-case tooling: kubectl set image, kubectl set resources - * Dynamic pod transformations: LimitRange, PodSecurityPolicy, PodPreset - * Admission controllers and initializers - * API-based controllers, higher-level APIs, and controllers driven by custom resources - * Automation: horizontal and [vertical pod autoscaling](https://github.com/kubernetes/community/pull/338) -* It is inherently composable: just add more resource manifests, in the same file or another file. No embedded imports required. - -Of course, there are downsides to the approach: - -* Users need to learn some API schema details, though we believe operators will want to learn them, anyway. -* The API schema does contain a fair bit of boilerplate, though it could be auto-generated and generally increases clarity. -* The API introduces a significant number of concepts, though they exist for good reasons. -* The API has no direct representation of common generation steps (e.g., generation of ConfigMap or Secret resources from source data), though these can be described in a declarative format using API conventions, as we do with component configuration in Kubernetes. -* It is harder to fix warts in the API than to paper over them. Fixing "bugs" may break compatibility (e.g., as with changing the default imagePullPolicy). However, the API is versioned, so it is not impossible, and fixing the API benefits all clients, tools, UIs, etc. -* JSON is cumbersome and some users find YAML to be error-prone to write. 
It would also be nice to support a less error-prone data syntax than YAML, such as [Relaxed JSON](https://github.com/phadej/relaxed-json), [HJson](https://hjson.org/), [HCL](https://github.com/hashicorp/hcl), [StrictYAML](https://github.com/crdoconnor/strictyaml/blob/master/FAQ.rst), or [YAML2](https://github.com/yaml/YAML2/wiki/Goals). However, one major disadvantage would be the lack of library support in multiple languages. HCL also wouldn’t directly map to our API schema due to our avoidance of maps. Perhaps there are YAML conventions that could result in less error-prone specifications.
-
-## What needs to be improved?
-
-While the basic mechanisms for this approach are in place, a number of common use cases could be made easier. Most user complaints are around discovering what features exist (especially annotations), documentation of and examples using those features, generating/finding skeleton resource specifications (including boilerplate and commonly needed features), formatting and validation of resource specifications, and determining appropriate CPU and memory resource requests and limits. Specific user scenarios are discussed below.
-
-### Bespoke application deployment
-
-Deployment of bespoke applications involves multiple steps:
-
-1. Build the container image
-2. Generate and/or modify Kubernetes API resource specifications to use the new image
-3. Reconcile those resources with a Kubernetes cluster
-
-Step 1, building the image, is out of scope for Kubernetes. Step 3 is covered by kubectl apply. Some tools in the ecosystem, such as [Draft](https://github.com/Azure/draft), combine the 3 steps.
-
-Kubectl contains ["generator" commands](/contributors/devel/sig-cli/kubectl-conventions.md#generators), such as [kubectl run](https://kubernetes.io/docs/user-guide/kubectl/v1.7/#run), expose, and various create commands, to create commonly needed Kubernetes resource configurations. However, they also don’t help users understand current best practices and conventions, such as proper label and annotation usage. This is partly a matter of updating them and partly one of making the generated resources suitable for consumption by new users. Options supporting declarative output, such as dry run, local, export, etc., don’t currently produce clean, readable, reusable resource specifications ([example](https://blog.heptio.com/using-kubectl-to-jumpstart-a-yaml-file-heptioprotip-6f5b8a63a3ea)). We should clean them up.
-
-Openshift provides a tool, [oc new-app](https://docs.openshift.com/enterprise/3.1/dev_guide/new_app.html), that can pull source-code templates, [detect application types](https://github.com/kubernetes/kubernetes/issues/14801), and create Kubernetes resources for applications from source and from container images. [podex](https://github.com/kubernetes/contrib/tree/master/podex) was built to extract basic information from an image to facilitate creation of default Kubernetes resources, but hasn’t been kept up to date. Similar resource generation tools would be useful for getting started, and even just [validating that the image really exists](https://github.com/kubernetes/kubernetes/issues/12428) would reduce user error.
-
-For updating the image in an existing deployment, kubectl set image works both on the live state and locally.
However, we should [make the image optional](https://github.com/kubernetes/kubernetes/pull/47246) in controllers so that the image could be updated independently of kubectl apply, if desired. And, we need to [automate image tag-to-digest translation](https://github.com/kubernetes/kubernetes/issues/33664) ([original issue](https://github.com/kubernetes/kubernetes/issues/1697)), which is the approach we’d expect users to use in production, as opposed to just immediately re-pulling the new image and restarting all existing containers simultaneously. We should keep the original tag in an imageStream annotation, which could eventually become a field. - -### Continuous deployment - -In addition to PaaSes, such as [Openshift](https://blog.openshift.com/openshift-3-3-pipelines-deep-dive/) and [Deis Workflow](https://github.com/deis/workflow), numerous continuous deployment systems have been integrated with Kubernetes, such as [Google Container Builder](https://github.com/GoogleCloudPlatform/cloud-builders/tree/master/kubectl), [Jenkins](https://github.com/GoogleCloudPlatform/continuous-deployment-on-kubernetes), [Gitlab](https://about.gitlab.com/2016/11/14/idea-to-production/), [Wercker](http://www.wercker.com/integrations/kubernetes), [Drone](https://open.blogs.nytimes.com/2017/01/12/continuous-deployment-to-google-cloud-platform-with-drone/), [Kit](https://invisionapp.github.io/kit/), [Bitbucket Pipelines](https://confluence.atlassian.com/bitbucket/deploy-to-kubernetes-892623297.html), [Codeship](https://blog.codeship.com/continuous-deployment-of-docker-apps-to-kubernetes/), [Shippable](https://www.shippable.com/kubernetes.html), [SemaphoreCI](https://semaphoreci.com/community/tutorials/continuous-deployment-with-google-container-engine-and-kubernetes), [Appscode](https://appscode.com/products/cloud-deployment/), [Kontinuous](https://github.com/AcalephStorage/kontinuous), [ContinuousPipe](https://continuouspipe.io/), [CodeFresh](https://docs.codefresh.io/docs/kubernetes#section-deploy-to-kubernetes), [CloudMunch](https://www.cloudmunch.com/continuous-delivery-for-kubernetes/), [Distelli](https://www.distelli.com/kubernetes/), [AppLariat](https://www.applariat.com/ci-cd-applariat-travis-gke-kubernetes/), [Weave Flux](https://github.com/weaveworks/flux), and [Argo](https://argoproj.github.io/argo-site/#/). Developers usually favor simplicity, whereas operators have more requirements, such as multi-stage deployment pipelines, deployment environment management (e.g., staging and production), and canary analysis. In either case, users need to be able to deploy both updated images and configuration updates, ideally using the same workflow. [Weave Flux](https://github.com/weaveworks/flux) and [Kube-applier](https://blog.box.com/blog/introducing-kube-applier-declarative-configuration-for-kubernetes/) support unified continuous deployment of this style. In other CD systems a unified flow may be achievable by making the image deployment step perform a local kubectl set image (or equivalent) and commit the change to the configuration, and then use another build/deployment trigger on the configuration repository to invoke kubectl apply --prune. - -### Migrating from Docker Compose - -Some developers like Docker’s Compose format as a simplified all-in-one configuration schema, or are at least already familiar with it. 
Kubernetes supports the format using the [Kompose tool](https://github.com/kubernetes/kompose), which provides an easy migration path for these developers by translating the format to Kubernetes resource specifications. - -The Compose format, even with extensions (e.g., replica counts, pod groupings, controller types), is inherently much more limited in expressivity than Kubernetes-native resource specifications, so users would not want to use it forever in production. But it provides a useful onramp, without introducing [yet another schema](https://github.com/kubernetes/kubernetes/pull/1980#issuecomment-60457567) to the community. We could potentially increase usage by including it in a [client-tool release bundle](https://github.com/kubernetes/release/issues/3). - -### Reconciliation of multiple resources and multiple files - -Most applications require multiple Kubernetes resources. Although kubectl supports multiple resources in a single file, most users store the resource specifications using one resource per file, for a number of reasons: - -* It was the approach used by all of our early application-stack examples -* It provides more control by making it easier to specify which resources to operate on -* It’s inherently composable -- just add more files - -The control issue should be addressed by adding support to select resources to mutate by label selector, name, and resource types, which has been planned from the beginning but hasn’t yet been fully implemented. However, we should also [expand and improve kubectl’s support for input from multiple files](https://github.com/kubernetes/kubernetes/issues/24649). - -### Declarative updates - -Kubectl apply (and strategic merge patch, upon which apply is built) has a [number of bugs and shortcomings](https://github.com/kubernetes/kubernetes/issues/35234), which we are fixing, since it is the underpinning of many things (declarative configuration, add-on management, controller diffs). Eventually we need [true API support](https://github.com/kubernetes/kubernetes/issues/17333) for apply so that clients can simply PUT their resource manifests and it can be used as the fundamental primitive for declarative updates for all clients. One of the trickier issues we should address with apply is how to handle [controller selector changes](https://github.com/kubernetes/kubernetes/issues/26202). We are likely to forbid changes for now, as we do with resource name changes. - -Kubectl should also operate on resources in an intelligent order when presented with multiple resources. While we’ve tried to avoid creation-order dependencies, they do exist in a few places, such as with namespaces, custom resource definitions, and ownerReferences. - -### ConfigMap and Secret updates - -We need a declarative syntax for regenerating [Secrets](https://github.com/kubernetes/kubernetes/issues/24744) and [ConfigMaps](https://github.com/kubernetes/kubernetes/issues/30337) from their source files that could be used with apply, and provide easier ways to [roll out new ConfigMaps and garbage collect unneeded ones](https://github.com/kubernetes/kubernetes/issues/22368). This could be embedded in a manifest file, which we need for "package" metadata (see [Addon manager proposal](https://docs.google.com/document/d/1Laov9RCOPIexxTMACG6Ffkko9sFMrrZ2ClWEecjYYVg/edit) and [Helm chart.yaml](https://github.com/kubernetes/helm/blob/master/docs/charts.md)). 
There also needs to be an easier way to [generate names of the new resources](https://github.com/kubernetes/kubernetes/pull/49961) and to update references to ConfigMaps and Secrets, such as in env and volumes. This could be done via new kubectl set commands, but users primarily need the “stream” update model, as with images. - -### Determining success/failure - -The declarative, [asynchronous control-loop-based approach](https://docs.google.com/presentation/d/1oPZ4rznkBe86O4rPwD2CWgqgMuaSXguIBHIE7Y0TKVc/edit#slide=id.g21b1f16809_3_155) makes it more challenging for the user to determine whether the change they made succeeded or failed, or the system is still converging towards the new desired state. Enough status information needs to be reported such that progress and problems are visible to controllers watching the status, and the status needs to be reported in a consistent enough way that a [general-purpose mechanism](https://github.com/kubernetes/kubernetes/issues/34363) can be built that works for arbitrary API types following Kubernetes API conventions. [Third-party attempts](https://github.com/Mirantis/k8s-AppController#dependencies) to monitor the status generally are not implemented correctly, since Kubernetes’s extensible API model requires exposing distributed-system effects to clients. This complexity can be seen all over our [end-to-end tests](https://github.com/kubernetes/kubernetes/blob/master/test/utils/deployment.go#L74), which have been made robust over many thousands of executions. Definitely authors of individual application configurations should not be forced to figure out how to implement such checks, as they currently do in Helm charts (--wait, test). - -### Configuration customization - -The strategy for customization involves the following main approaches: - -1. Fork or simply copy the resource specifications, and then locally modify them, imperatively, declaratively, or manually, in order to reuse off-the-shelf configuration. To facilitate these modifications, we should: - * Automate common customizations, especially [name prefixing and label injection](https://github.com/kubernetes/kubernetes/issues/1698) (including selectors, pod template labels, and object references), which would address the most common substitutions in existing templates - * Fix rough edges for local mutation via kubectl get --export and [kubectl set](https://github.com/kubernetes/kubernetes/issues/21648) ([--dry-run](https://github.com/kubernetes/kubernetes/issues/11488), --local, -o yaml), and enable kubectl to directly update files on disk - * Build fork/branch management tooling for common workflows, such as branch creation, cherrypicking (e.g., to copy configuration changes from a staging to production branch), rebasing, etc., perhaps as a plugin to kubectl. - * Build/improve structural diff, conflict detection, validation (e.g., [kubeval](https://github.com/garethr/kubeval), [ConfigMap element properties](https://github.com/kubernetes/kubernetes/issues/4210)), and comprehension tools -2. Resource overlays, for instantiating multiple variants. Kubectl patch already works locally using strategic merge patch, so the overlays have the same structure as the base resources. The main feature needed to facilitate that is automatic pairing of overlays with the resources they should patch. - -Fork provides one-time customization, which is the most common case. Overlay patches provide deploy-time customization. 
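To illustrate, an overlay carries only the fields being customized and mirrors the structure of the base resource it patches (a minimal sketch with illustrative names):

```yaml
# Base resource (excerpt)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: guestbook
spec:
  replicas: 1
---
# Overlay applied via strategic merge patch: same shape as the base,
# listing only the fields to override or add.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: guestbook
  labels:
    env: production
spec:
  replicas: 3
```

Because the overlay mirrors the base schema, the same validation and tooling apply to both.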
These techniques can be combined with dynamic customization (PodPreset, other admission controllers, third-party controllers, etc.) and run-time customization (initContainers and entrypoint.sh scripts inside containers). - -Benefits of these approaches: - -* Easier for app developers and operators to build initial configurations (no special template syntax) -* Compatible with existing project tooling and conventions, and easy to read since it doesn’t obfuscate the API and doesn’t force users to learn a new way to configure their applications -* Supports best practices -* Handles cases the [original configuration author didn’t envision](http://blog.shippable.com/the-new-devops-matrix-from-hell) -* Handles cases where original author changes things that break existing users -* Supports composition by adding resources: secrets, configmaps, autoscaling -* Supports injection of operational concerns, such as node affinity/anti-affinity and tolerations -* Supports selection among alternatives, and multiple simultaneous versions -* Supports canaries and multi-cluster deployment -* Usable for [add-on management](https://github.com/kubernetes/kubernetes/issues/23233), by avoiding [obstacles that Helm has](https://github.com/kubernetes/kubernetes/issues/23233#issuecomment-285524825), and should eliminate the need for the EnsureExists behavior - -#### What about parameterization? - -An area where more investigation is needed is explicit inline parameter substitution, which, while overused and should be rendered unnecessary by the capabilities described above, is [frequently requested](https://stackoverflow.com/questions/44832085/passing-variables-to-args-field-in-a-yaml-file-kubernetes) and has been reinvented many times by the community. - -A [simple parameterization approach derived from Openshift’s design](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/apps/OBSOLETE_templates.md) was approved because it was constrained in functionality and solved other problems (e.g., instantiation of resource variants by other controllers, [project templates in Openshift](https://docs.openshift.com/container-platform/3.5/dev_guide/templates.html)). That proposal explains some of the reasoning behind the design tradeoffs, as well as the [use cases](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/apps/OBSOLETE_templates.md#use-cases). Work started, but was abandoned, though there is an independent [client-based implementation](https://github.com/InQuicker/ktmpl). However, the Template resource wrapped the resource specifications in another object, which is suboptimal, since transformations would then need to be able to deal with standalone resources, Lists of resources, and Templates, or would need to be applied post-instantiation, and it couldn’t be represented using multiple files, as users prefer. - -What is more problematic is that our client libraries, schema validators, yaml/json parsers/decoders, initializers, and protobuf encodings all require that all specified fields have valid values, so parameters cannot currently be left in non-string (e.g., int, bool) fields in actual resources. Additionally, the API server requires at least complete/final resource names to be specified, and strategic merge also requires all merge keys to be specified. 
Therefore, some amount of pre-instantiation (though not necessarily client-side) transformation is necessary to create valid resources, and we may want to explicitly store the output, or the fields should just contain the default values initially. Parameterized fields could be automatically converted to patches to produce valid resources. Such a transformation could be made reversible, unlike traditional substitution approaches, since the patches could be preserved (e.g., using annotations). The Template API supported the declaration of parameter names, display names, descriptions, default values, required/optional, and types (string, int, bool, base64), and both string and raw json substitutions. If we were to update that specification, we could use the same mechanism for both parameter validation and ConfigMap validation, so that the same mechanism could be used for env substitution and substitution of values of other fields. As mentioned in the [env validation issue](https://github.com/kubernetes/kubernetes/issues/4210#issuecomment-305555589), we should consider a subset of [JSON schema](http://json-schema.org/example1.html), which we’ll probably use for CRD. The only [unsupported attribute](https://tools.ietf.org/html/draft-wright-json-schema-validation-00) appears to be the display name, which is non-critical. [Base64 could be represented using media](http://json-schema.org/latest/json-schema-hypermedia.html#rfc.section.5.3.2). That could be useful as a common parameter schema to facilitate parameter discovery and documentation that is independent of the substitution syntax and mechanism ([example from Deployment Manager](https://github.com/GoogleCloudPlatform/deploymentmanager-samples/blob/master/templates/replicated_service.py.schema)). - -Without parameters how would we support a click-to-deploy experience? People who are kicking the tires, have undemanding use cases, are learning, etc. are unlikely to know what customization they want to perform initially, if they even need any. The main information users need to provide is the name prefix they want to apply. Otherwise, choosing among a few alternatives would suit their needs better than parameters. The overlay approach should support that pretty well. Beyond that, I suggest kicking users over to a Kubernetes-specific configuration wizard or schema-aware IDE, and/or support a fork workflow. - -The other application-definition [use cases mentioned in the Template proposal](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/apps/OBSOLETE_templates.md#use-cases) are achievable without parameterization, as well. - -#### What about application configuration generation? - -A number of legacy applications have configuration mechanisms that couple application options and information about the deployment environment. In such cases, a ConfigMap containing the configuration data is not sufficient, since the runtime information (e.g., identities, secrets, service addresses) must be incorporated. There are a [number of tools used for this purpose outside Kubernetes](https://github.com/kubernetes/kubernetes/issues/2068). However, in Kubernetes, they would have to be run as Pod initContainers, sidecar containers, or container [entrypoint.sh init scripts](https://github.com/kubernetes/kubernetes/issues/30716). As this is only a need of some legacy applications, we should not complicate Kubernetes itself to solve it. 
Instead, we should be prepared to recommend a third-party tool, or provide one, and ensure the downward API provides the information it would need. - -#### What about [package management](https://medium.com/@sdboyer/so-you-want-to-write-a-package-manager-4ae9c17d9527) and Helm? - -[Helm](https://github.com/kubernetes/helm/blob/master/docs/chart_repository.md), [KPM](https://github.com/coreos/kpm), [App Registry](https://github.com/app-registry), [Kubepack](https://kubepack.com/), and [DCOS](https://docs.mesosphere.com/1.7/usage/managing-services/) (for Mesos) bundle whitebox off-the-shelf application configurations into **_packages_**. However, unlike traditional artifact repositories, which store and serve generated build artifacts, configurations are primarily human-authored. As mentioned above, it is industry best practice to manage such configurations using version control systems, and Helm package repositories are backed by source code repositories. (Example: [MariaDB](https://github.com/kubernetes/charts/tree/master/stable/mariadb).) - -Advantages of packages: - -1. Package formats add structure to raw Kubernetes primitives, which are deliberately flexible and freeform - * Starter resource specifications that illustrate API schema and best practices - * Labels for application topology (e.g., app, role, tier, track, env) -- similar to the goals of [Label Schema](http://label-schema.org/rc1/) - * File organization and manifest (list of files), to make it easier for users to navigate larger collections of application specifications, to reduce the need for tooling to search for information, and to facilitate segregation of resources from other artifacts (e.g., container sources) - * Application metadata: name, authors, description, icon, version, source(s), etc. - * Application lifecycle operations: build, test, debug, up, upgrade, down, etc. -1. [Package registries/repositories](https://github.com/app-registry/spec) facilitate [discovery](https://youtu.be/zGJsXyzE5A8?t=1159) of off-the-shelf applications and of their dependencies - * Scattered source repos are hard to find - * Ideally it would be possible to map the format type to a container containing the tool that understands the format. - -Helm is probably the most-used configuration tool other than kubectl, many [application charts](https://github.com/kubernetes/charts) have been developed (as with the [Openshift template library](https://github.com/openshift/library)), and there is an ecosystem growing around it (e.g., [chartify](https://github.com/appscode/chartify), [helmfile](https://github.com/roboll/helmfile), [landscaper](https://github.com/Eneco/landscaper), [draughtsman](https://github.com/giantswarm/draughtsman), [chartmuseum](https://github.com/chartmuseum/chartmuseum)). Helm’s users like the familiar analogy to package management and the structure that it provides. However, while Helm is useful and is the most comprehensive tool, it isn’t suitable for all use cases, such as [add-on management](https://github.com/kubernetes/kubernetes/issues/23233). The biggest obstacle is that its [non-Kubernetes-compatible API](https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/#what-kubernetes-is-not)[ and DSL syntax push it out of Kubernetes proper into the Kubernetes ecosystem](https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/#what-kubernetes-is-not). And, as much as Helm is targeting only Kubernetes, it takes little advantage of that. 
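For context, a chart template embeds Go template directives in the YAML rather than using plain Kubernetes-native specifications, roughly like this illustrative excerpt:

```yaml
# Helm chart template excerpt; the {{ ... }} directives are resolved at
# render time from values.yaml and release metadata.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-web
spec:
  replicas: {{ .Values.replicaCount }}
```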
Additionally, scenarios we’d like to support better include chart authoring (prefer simpler syntax and more straightforward management under version control), operational customization (e.g., via scripting, [forking](https://github.com/kubernetes/helm/issues/2554), or patching/injection), deployment pipelines (e.g., [canaries](https://groups.google.com/forum/#!topic/kubernetes-sig-apps/ouqXYXdsPYw)), multi-cluster / [multi-environment](https://groups.google.com/d/msg/kubernetes-users/GPaGOGxCDD8/NbNL-NPhCAAJ) deployment, and multi-tenancy. - -Helm provides functionality covering several areas: - -* Package conventions: metadata (e.g., name, version, descriptions, icons; Openshift has [something similar](https://github.com/luciddreamz/library/blob/master/official/java/templates/openjdk18-web-basic-s2i.json#L10)), labels, file organization -* Package bundling, unbundling, and hosting -* Package discovery: search and browse -* [Dependency management](https://github.com/kubernetes/helm/blob/master/docs/charts.md#chart-dependencies) -* Application lifecycle management framework: build, install, uninstall, upgrade, test, etc. - * a non-container-centric example of that would be [ElasticBox](https://www.ctl.io/knowledge-base/cloud-application-manager/automating-deployments/start-stop-and-upgrade-boxes/) -* Kubernetes drivers for creation, update, deletion, etc. -* Template expansion / schema transformation -* (It’s currently lacking a formal parameter schema.) - -It's useful for Helm to provide an integrated framework, but the independent functions could be decoupled, and re-bundled into multiple separate tools: - -* Package management -- search, browse, bundle, push, and pull of off-the-shelf application packages and their dependencies. -* Application lifecycle management -- install, delete, upgrade, rollback -- and pre- and post- hooks for each of those lifecycle transitions, and success/failure tests. -* Configuration customization via parameter substitution, aka template expansion, aka rendering. - -That would enable the package-like structure and conventions to be used with raw declarative management via kubectl or other tool that linked in its [business logic](https://github.com/kubernetes/kubernetes/issues/7311), for the lifecycle management to be used without the template expansion, and the template expansion to be used in declarative workflows without the lifecycle management. Support for both client-only and server-side operation and migration from grpc to Kubernetes API extension mechanisms would further expand the addressable use cases. - -([Newer proposal, presented at the Helm Summit](https://docs.google.com/presentation/d/10dp4hKciccincnH6pAFf7t31s82iNvtt_mwhlUbeCDw/edit#slide=id.p).) - -#### What about the service broker? - -The [Open Service Broker API](https://openservicebrokerapi.org/) provides a standardized way to provision and bind to blackbox services. It enables late binding of clients to service providers and enables usage of higher-level application services (e.g., caches, databases, messaging systems, object stores) portably, mitigating lock-in and facilitating hybrid and multi-cloud usage of these services, extending the portability of cloud-native applications running on Kubernetes. The service broker is not intended to be a solution for whitebox applications that require any level of management by the user. That degree of abstraction/encapsulation requires full automation, essentially creating a software appliance (cf. 
[autonomic computing](https://en.wikipedia.org/wiki/Autonomic_computing)): autoscaling, auto-repair, auto-update, automatic monitoring / logging / alerting integration, etc. Operators, initializers, autoscalers, and other automation may eventually achieve this, and we need to for [cluster add-ons](https://github.com/kubernetes/kubernetes/issues/23233) and other [self-hosted components](https://github.com/kubernetes/kubernetes/issues/246), but the typical off-the-shelf application template doesn’t achieve that. - -#### What about configurations with high cyclomatic complexity or massive numbers of variants? - -Consider more automation, such as autoscaling, self-configuration, etc. to reduce the amount of explicit configuration necessary. One could also write a program in some widely used conventional programming language to generate the resource specifications. It’s more likely to have IDE support, test frameworks, documentation generators, etc. than a DSL. Better yet, create composable transformations, applying [the Unix Philosophy](https://en.wikipedia.org/wiki/Unix_philosophy#Eric_Raymond.E2.80.99s_17_Unix_Rules). In any case, don’t look for a silver bullet to solve all configuration-related problems. Decouple solutions instead. - -#### What about providing an intentionally restrictive simplified, tailored developer experience to streamline a specific use case, environment, workflow, etc.? - -This is essentially a [DIY PaaS](https://kubernetes.io/blog/2017/02/caas-the-foundation-for-next-gen-paas/). Write a configuration generator, either client-side or using CRDs ([example](https://github.com/pearsontechnology/environment-operator/blob/dev/docs/User_Guide.md)). The effort involved to document the format, validate it, test it, etc. is similar to building a new API, but I could imagine someone eventually building a SDK to make that easier. - -#### What about more sophisticated deployment orchestration? - -Deployment pipelines, [canary deployments](https://groups.google.com/forum/#!topic/kubernetes-sig-apps/ouqXYXdsPYw), [blue-green deployments](https://groups.google.com/forum/#!topic/kubernetes-sig-apps/mwIq9bpwNCA), dependency-based orchestration, event-driven orchestrations, and [workflow-driven orchestration](https://github.com/kubernetes/kubernetes/issues/1704) should be able to use the building blocks discussed in this document. [AppController](https://github.com/Mirantis/k8s-AppController) and [Smith](https://github.com/atlassian/smith) are examples of tools built by the community. - -#### What about UI wizards, IDE integration, application frameworks, etc.? - -Representing configuration using the literal API types should facilitate programmatic manipulation of the configuration via user-friendly tools, such as UI wizards (e.g., [dashboard](https://kubernetes.io/docs/tasks/access-application-cluster/web-ui-dashboard/#deploying-containerized-applications) and many CD tools, such as [Puppet Pipelines](https://puppet.com/docs/pipelines)) and IDEs (e.g., [VSCode](https://www.youtube.com/watch?v=QfqS9OSVWGs), [IntelliJ](https://github.com/tinselspoon/intellij-kubernetes)), as well as configuration generation and manipulation by application frameworks (e.g., [Spring Cloud](https://github.com/fabric8io/spring-cloud-kubernetes)). +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/architecture/identifiers.md b/contributors/design-proposals/architecture/identifiers.md index 3b872481..f0fbec72 100644 --- a/contributors/design-proposals/architecture/identifiers.md +++ b/contributors/design-proposals/architecture/identifiers.md @@ -1,109 +1,6 @@ -# Identifiers and Names in Kubernetes +Design proposals have been archived. -A summarization of the goals and recommendations for identifiers in Kubernetes. -Described in GitHub issue [#199](http://issue.k8s.io/199). +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Definitions - -`UID`: A non-empty, opaque, system-generated value guaranteed to be unique in time -and space; intended to distinguish between historical occurrences of similar -entities. - -`Name`: A non-empty string guaranteed to be unique within a given scope at a -particular time; used in resource URLs; provided by clients at creation time and -encouraged to be human friendly; intended to facilitate creation idempotence and -space-uniqueness of singleton objects, distinguish distinct entities, and -reference particular entities across operations. - -[rfc1035](http://www.ietf.org/rfc/rfc1035.txt)/[rfc1123](http://www.ietf.org/rfc/rfc1123.txt) `label` (DNS_LABEL): -An alphanumeric (a-z, and 0-9) string, with a maximum length of 63 characters, -with the '-' character allowed anywhere except the first or last character, -suitable for use as a hostname or segment in a domain name. - -[rfc1035](http://www.ietf.org/rfc/rfc1035.txt)/[rfc1123](http://www.ietf.org/rfc/rfc1123.txt) `subdomain` (DNS_SUBDOMAIN): -One or more lowercase rfc1035/rfc1123 labels separated by '.' with a maximum -length of 253 characters. - -[rfc4122](http://www.ietf.org/rfc/rfc4122.txt) `universally unique identifier` (UUID): -A 128 bit generated value that is extremely unlikely to collide across time and -space and requires no central coordination. - -[rfc6335](https://tools.ietf.org/rfc/rfc6335.txt) `port name` (IANA_SVC_NAME): -An alphanumeric (a-z, and 0-9) string, with a maximum length of 15 characters, -with the '-' character allowed anywhere except the first or the last character -or adjacent to another '-' character, it must contain at least a (a-z) -character. - -## Objectives for names and UIDs - -1. Uniquely identify (via a UID) an object across space and time. -2. Uniquely name (via a name) an object across space. -3. Provide human-friendly names in API operations and/or configuration files. -4. Allow idempotent creation of API resources (#148) and enforcement of -space-uniqueness of singleton objects. -5. Allow DNS names to be automatically generated for some objects. - - -## General design - -1. When an object is created via an API, a Name string (a DNS_SUBDOMAIN) must -be specified. Name must be non-empty and unique within the apiserver. This -enables idempotent and space-unique creation operations. Parts of the system -(e.g. replication controller) may join strings (e.g. a base name and a random -suffix) to create a unique Name. For situations where generating a name is -impractical, some or all objects may support a param to auto-generate a name. -Generating random names will defeat idempotency. - * Examples: "guestbook.user", "backend-x4eb1" -2. When an object is created via an API, a Namespace string (a DNS_LABEL) -may be specified. Depending on the API receiver, -namespaces might be validated (e.g. 
apiserver might ensure that the namespace -actually exists). If a namespace is not specified, one will be assigned by the -API receiver. This assignment policy might vary across API receivers (e.g. -apiserver might have a default, kubelet might generate something semi-random). - * Example: "api.k8s.example.com" -3. Upon acceptance of an object via an API, the object is assigned a UID -(a UUID). UID must be non-empty and unique across space and time. - * Example: "01234567-89ab-cdef-0123-456789abcdef" - -## Case study: Scheduling a pod - -Pods can be placed onto a particular node in a number of ways. This case study -demonstrates how the above design can be applied to satisfy the objectives. - -### A pod scheduled by a user through the apiserver - -1. A user submits a pod with Namespace="" and Name="guestbook" to the apiserver. -2. The apiserver validates the input. - 1. A default Namespace is assigned. - 2. The pod name must be space-unique within the Namespace. - 3. Each container within the pod has a name which must be space-unique within -the pod. -3. The pod is accepted. - 1. A new UID is assigned. -4. The pod is bound to a node. - 1. The kubelet on the node is passed the pod's UID, Namespace, and Name. -5. Kubelet validates the input. -6. Kubelet runs the pod. - 1. Each container is started up with enough metadata to distinguish the pod -from whence it came. - 2. Each attempt to run a container is assigned a UID (a string) that is -unique across time. * This may correspond to Docker's container ID. - -### A pod placed by a config file on the node - -1. A config file is stored on the node, containing a pod with UID="", -Namespace="", and Name="cadvisor". -2. Kubelet validates the input. - 1. Since UID is not provided, kubelet generates one. - 2. Since Namespace is not provided, kubelet generates one. - 1. The generated namespace should be deterministic and cluster-unique for -the source, such as a hash of the hostname and file path. - * E.g. Namespace="file-f4231812554558a718a01ca942782d81" -3. Kubelet runs the pod. - 1. Each container is started up with enough metadata to distinguish the pod -from whence it came. - 2. Each attempt to run a container is assigned a UID (a string) that is -unique across time. - 1. This may correspond to Docker's container ID. - +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/architecture/images/apiserver.png b/contributors/design-proposals/architecture/images/apiserver.png Binary files differdeleted file mode 100644 index 2936ca9d..00000000 --- a/contributors/design-proposals/architecture/images/apiserver.png +++ /dev/null diff --git a/contributors/design-proposals/architecture/namespaces.md b/contributors/design-proposals/architecture/namespaces.md index c357f9a9..f0fbec72 100644 --- a/contributors/design-proposals/architecture/namespaces.md +++ b/contributors/design-proposals/architecture/namespaces.md @@ -1,411 +1,6 @@ -# Namespaces +Design proposals have been archived. -## Abstract +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -A Namespace is a mechanism to partition resources created by users into -a logically named group. -## Motivation - -A single cluster should be able to satisfy the needs of multiple user -communities. - -Each user community wants to be able to work in isolation from other -communities. - -Each user community has its own: - -1. resources (pods, services, replication controllers, etc.) -2. policies (who can or cannot perform actions in their community) -3. constraints (this community is allowed this much quota, etc.) - -A cluster operator may create a Namespace for each unique user community. - -The Namespace provides a unique scope for: - -1. named resources (to avoid basic naming collisions) -2. delegated management authority to trusted users -3. ability to limit community resource consumption - -## Use cases - -1. As a cluster operator, I want to support multiple user communities on a -single cluster. -2. As a cluster operator, I want to delegate authority to partitions of the -cluster to trusted users in those communities. -3. As a cluster operator, I want to limit the amount of resources each -community can consume in order to limit the impact to other communities using -the cluster. -4. As a cluster user, I want to interact with resources that are pertinent to -my user community in isolation of what other user communities are doing on the -cluster. - -## Design - -### Data Model - -A *Namespace* defines a logically named group for multiple *Kind*s of resources. - -```go -type Namespace struct { - TypeMeta `json:",inline"` - ObjectMeta `json:"metadata,omitempty"` - - Spec NamespaceSpec `json:"spec,omitempty"` - Status NamespaceStatus `json:"status,omitempty"` -} -``` - -A *Namespace* name is a DNS compatible label. - -A *Namespace* must exist prior to associating content with it. - -A *Namespace* must not be deleted if there is content associated with it. - -To associate a resource with a *Namespace* the following conditions must be -satisfied: - -1. The resource's *Kind* must be registered as having *RESTScopeNamespace* with -the server -2. The resource's *TypeMeta.Namespace* field must have a value that references -an existing *Namespace* - -The *Name* of a resource associated with a *Namespace* is unique to that *Kind* -in that *Namespace*. - -It is intended to be used in resource URLs; provided by clients at creation -time, and encouraged to be human friendly; intended to facilitate idempotent -creation, space-uniqueness of singleton objects, distinguish distinct entities, -and reference particular entities across operations. - -### Authorization - -A *Namespace* provides an authorization scope for accessing content associated -with the *Namespace*. 
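For example, with the RBAC authorization plugin, access can be granted within a single namespace; this is a minimal, illustrative sketch, and the subject and resource names are placeholders:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: development
  name: pod-reader
rules:
  - apiGroups: [""]              # "" is the core API group
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: development
  name: read-pods
subjects:
  - kind: User
    name: alice                  # placeholder subject
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```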
- -See [Authorization plugins](https://kubernetes.io/docs/admin/authorization/) - -### Limit Resource Consumption - -A *Namespace* provides a scope to limit resource consumption. - -A *LimitRange* defines min/max constraints on the amount of resources a single -entity can consume in a *Namespace*. - -See [Admission control: Limit Range](../resource-management/admission_control_limit_range.md) - -A *ResourceQuota* tracks aggregate usage of resources in the *Namespace* and -allows cluster operators to define *Hard* resource usage limits that a -*Namespace* may consume. - -See [Admission control: Resource Quota](../resource-management/admission_control_resource_quota.md) - -### Finalizers - -Upon creation of a *Namespace*, the creator may provide a list of *Finalizer* -objects. - -```go -type FinalizerName string - -// These are internal finalizers to Kubernetes, must be qualified name unless defined here -const ( - FinalizerKubernetes FinalizerName = "kubernetes" -) - -// NamespaceSpec describes the attributes on a Namespace -type NamespaceSpec struct { - // Finalizers is an opaque list of values that must be empty to permanently remove object from storage - Finalizers []FinalizerName -} -``` - -A *FinalizerName* is a qualified name. - -The API Server enforces that a *Namespace* can only be deleted from storage if -and only if it's *Namespace.Spec.Finalizers* is empty. - -A *finalize* operation is the only mechanism to modify the -*Namespace.Spec.Finalizers* field post creation. - -Each *Namespace* created has *kubernetes* as an item in its list of initial -*Namespace.Spec.Finalizers* set by default. - -### Phases - -A *Namespace* may exist in the following phases. - -```go -type NamespacePhase string -const( - NamespaceActive NamespacePhase = "Active" - NamespaceTerminating NamespacePhase = "Terminating" -) - -type NamespaceStatus struct { - ... - Phase NamespacePhase -} -``` - -A *Namespace* is in the **Active** phase if it does not have a -*ObjectMeta.DeletionTimestamp*. - -A *Namespace* is in the **Terminating** phase if it has a -*ObjectMeta.DeletionTimestamp*. - -**Active** - -Upon creation, a *Namespace* goes in the *Active* phase. This means that content -may be associated with a namespace, and all normal interactions with the -namespace are allowed to occur in the cluster. - -If a DELETE request occurs for a *Namespace*, the -*Namespace.ObjectMeta.DeletionTimestamp* is set to the current server time. A -*namespace controller* observes the change, and sets the -*Namespace.Status.Phase* to *Terminating*. - -**Terminating** - -A *namespace controller* watches for *Namespace* objects that have a -*Namespace.ObjectMeta.DeletionTimestamp* value set in order to know when to -initiate graceful termination of the *Namespace* associated content that are -known to the cluster. - -The *namespace controller* enumerates each known resource type in that namespace -and deletes it one by one. - -Admission control blocks creation of new resources in that namespace in order to -prevent a race-condition where the controller could believe all of a given -resource type had been deleted from the namespace, when in fact some other rogue -client agent had created new objects. Using admission control in this scenario -allows each of registry implementations for the individual objects to not need -to take into account Namespace life-cycle. 
- -Once all objects known to the *namespace controller* have been deleted, the -*namespace controller* executes a *finalize* operation on the namespace that -removes the *kubernetes* value from the *Namespace.Spec.Finalizers* list. - -If the *namespace controller* sees a *Namespace* whose -*ObjectMeta.DeletionTimestamp* is set, and whose *Namespace.Spec.Finalizers* -list is empty, it will signal the server to permanently remove the *Namespace* -from storage by sending a final DELETE action to the API server. - -There are situations where the *namespace controller* is unable to guarantee -cleanup of all resources. During a cleanup run, it attempts a best-effort -resource deletion, remembers the errors that occurred and reports back via -**namespace status condition**. Some errors can be transient and will -auto-resolve in the following cleanup runs, others may require manual -intervention. - -These are the status conditions reporting on the process of namespace -termination: -- `NamespaceDeletionDiscoveryFailure` reports on errors during the first phase - of namespace termination - - [resource discovery](../api-machinery/api-group.md). -- `NamespaceDeletionGroupVersionParsingFailure` reports on errors that happen - when parsing the [GVK](../api-machinery/api-group.md) - of all discovered resources. -- `NamespaceDeletionContentFailure` reports on errors preventing the controller - from deleting resources belonging to successfully discovered and parsed GVK. - -When any part of a certain phase fails, the *namespace controller* sets appropriate -status condition with a descriptive message of what went wrong. After -the controller successfully passes that phase, it sets the status condition to -report success. - -Example of a failing namespace termination where -`NamespaceDeletionContentFailure` is no longer reporting any error and -`NamespaceDeletionDiscoveryFailure` continues to fail. - -```yaml -status: - conditions: - - lastTransitionTime: "2019-02-13T12:58:03Z" - message: All content successfully deleted - reason: ContentDeleted - status: "False" - type: NamespaceDeletionContentFailure - - lastTransitionTime: "2019-02-13T12:55:16Z" - message: 'Discovery failed for some groups, 2 failing: unable to retrieve the - complete list of server APIs: mutators.abc.com/v1alpha1: the server is currently - unable to handle the request, validators.abc.com/v1alpha1: the server is - currently unable to handle the request' - reason: DiscoveryFailed - status: "True" - type: NamespaceDeletionDiscoveryFailure - phase: Terminating -``` - -### REST API - -To interact with the Namespace API: - -| Action | HTTP Verb | Path | Description | -| ------ | --------- | ---- | ----------- | -| CREATE | POST | /api/{version}/namespaces | Create a namespace | -| LIST | GET | /api/{version}/namespaces | List all namespaces | -| UPDATE | PUT | /api/{version}/namespaces/{namespace} | Update namespace {namespace} | -| DELETE | DELETE | /api/{version}/namespaces/{namespace} | Delete namespace {namespace} | -| FINALIZE | PUT | /api/{version}/namespaces/{namespace}/finalize | Finalize namespace {namespace} | -| WATCH | GET | /api/{version}/watch/namespaces | Watch all namespaces | - -This specification reserves the name *finalize* as a sub-resource to namespace. - -As a consequence, it is invalid to have a *resourceType* managed by a namespace whose kind is *finalize*. 
- -To interact with content associated with a Namespace: - -| Action | HTTP Verb | Path | Description | -| ---- | ---- | ---- | ---- | -| CREATE | POST | /api/{version}/namespaces/{namespace}/{resourceType}/ | Create instance of {resourceType} in namespace {namespace} | -| GET | GET | /api/{version}/namespaces/{namespace}/{resourceType}/{name} | Get instance of {resourceType} in namespace {namespace} with {name} | -| UPDATE | PUT | /api/{version}/namespaces/{namespace}/{resourceType}/{name} | Update instance of {resourceType} in namespace {namespace} with {name} | -| DELETE | DELETE | /api/{version}/namespaces/{namespace}/{resourceType}/{name} | Delete instance of {resourceType} in namespace {namespace} with {name} | -| LIST | GET | /api/{version}/namespaces/{namespace}/{resourceType} | List instances of {resourceType} in namespace {namespace} | -| WATCH | GET | /api/{version}/watch/namespaces/{namespace}/{resourceType} | Watch for changes to a {resourceType} in namespace {namespace} | -| WATCH | GET | /api/{version}/watch/{resourceType} | Watch for changes to a {resourceType} across all namespaces | -| LIST | GET | /api/{version}/list/{resourceType} | List instances of {resourceType} across all namespaces | - -The API server verifies the *Namespace* on resource creation matches the -*{namespace}* on the path. - -The API server will associate a resource with a *Namespace* if not populated by -the end-user based on the *Namespace* context of the incoming request. If the -*Namespace* of the resource being created, or updated does not match the -*Namespace* on the request, then the API server will reject the request. - -### Storage - -A namespace provides a unique identifier space and therefore must be in the -storage path of a resource. - -In etcd, we want to continue to still support efficient WATCH across namespaces. - -Resources that persist content in etcd will have storage paths as follows: - -/{k8s_storage_prefix}/{resourceType}/{resource.Namespace}/{resource.Name} - -This enables consumers to WATCH /registry/{resourceType} for changes across -namespace of a particular {resourceType}. - -### Kubelet - -The kubelet will register pod's it sources from a file or http source with a -namespace associated with the *cluster-id* - -### Example: OpenShift Origin managing a Kubernetes Namespace - -In this example, we demonstrate how the design allows for agents built on-top of -Kubernetes that manage their own set of resource types associated with a -*Namespace* to take part in Namespace termination. - -OpenShift creates a Namespace in Kubernetes - -```json -{ - "apiVersion":"v1", - "kind": "Namespace", - "metadata": { - "name": "development", - "labels": { - "name": "development" - } - }, - "spec": { - "finalizers": ["openshift.com/origin", "kubernetes"] - }, - "status": { - "phase": "Active" - } -} -``` - -OpenShift then goes and creates a set of resources (pods, services, etc) -associated with the "development" namespace. It also creates its own set of -resources in its own storage associated with the "development" namespace unknown -to Kubernetes. 
- -User deletes the Namespace in Kubernetes, and Namespace now has following state: - -```json -{ - "apiVersion":"v1", - "kind": "Namespace", - "metadata": { - "name": "development", - "deletionTimestamp": "...", - "labels": { - "name": "development" - } - }, - "spec": { - "finalizers": ["openshift.com/origin", "kubernetes"] - }, - "status": { - "phase": "Terminating" - } -} -``` - -The Kubernetes *namespace controller* observes the namespace has a -*deletionTimestamp* and begins to terminate all of the content in the namespace -that it knows about. Upon success, it executes a *finalize* action that modifies -the *Namespace* by removing *kubernetes* from the list of finalizers: - -```json -{ - "apiVersion":"v1", - "kind": "Namespace", - "metadata": { - "name": "development", - "deletionTimestamp": "...", - "labels": { - "name": "development" - } - }, - "spec": { - "finalizers": ["openshift.com/origin"] - }, - "status": { - "phase": "Terminating" - } -} -``` - -OpenShift Origin has its own *namespace controller* that is observing cluster -state, and it observes the same namespace had a *deletionTimestamp* assigned to -it. It too will go and purge resources from its own storage that it manages -associated with that namespace. Upon completion, it executes a *finalize* action -and removes the reference to "openshift.com/origin" from the list of finalizers. - -This results in the following state: - -```json -{ - "apiVersion":"v1", - "kind": "Namespace", - "metadata": { - "name": "development", - "deletionTimestamp": "...", - "labels": { - "name": "development" - } - }, - "spec": { - "finalizers": [] - }, - "status": { - "phase": "Terminating" - } -} -``` - -At this point, the Kubernetes *namespace controller* in its sync loop will see -that the namespace has a deletion timestamp and that its list of finalizers is -empty. As a result, it knows all content associated from that namespace has been -purged. It performs a final DELETE action to remove that Namespace from the -storage. - -At this point, all content associated with that Namespace, and the Namespace -itself are gone. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/architecture/principles.md b/contributors/design-proposals/architecture/principles.md index 7bb548d2..f0fbec72 100644 --- a/contributors/design-proposals/architecture/principles.md +++ b/contributors/design-proposals/architecture/principles.md @@ -1,98 +1,6 @@ -# Design Principles +Design proposals have been archived. -Principles to follow when extending Kubernetes. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## API - -See also the [API conventions](/contributors/devel/sig-architecture/api-conventions.md). - -* All APIs should be declarative. -* API objects should be complementary and composable, not opaque wrappers. -* The control plane should be transparent -- there are no hidden internal APIs. -* The cost of API operations should be proportional to the number of objects -intentionally operated upon. Therefore, common filtered lookups must be indexed. -Beware of patterns of multiple API calls that would incur quadratic behavior. -* Object status must be 100% reconstructable by observation. Any history kept -must be just an optimization and not required for correct operation. -* Cluster-wide invariants are difficult to enforce correctly. Try not to add -them. If you must have them, don't enforce them atomically in master components, -that is contention-prone and doesn't provide a recovery path in the case of a -bug allowing the invariant to be violated. Instead, provide a series of checks -to reduce the probability of a violation, and make every component involved able -to recover from an invariant violation. -* Low-level APIs should be designed for control by higher-level systems. -Higher-level APIs should be intent-oriented (think SLOs) rather than -implementation-oriented (think control knobs). - -## Control logic - -* Functionality must be *level-based*, meaning the system must operate correctly -given the desired state and the current/observed state, regardless of how many -intermediate state updates may have been missed. Edge-triggered behavior must be -just an optimization. - * There should be a CAP-like theorem regarding the tradeoffs between driving control loops via polling or events about simultaneously achieving high performance, reliability, and simplicity -- pick any 2. -* Assume an open world: continually verify assumptions and gracefully adapt to -external events and/or actors. Example: we allow users to kill pods under -control of a replication controller; it just replaces them. -* Do not define comprehensive state machines for objects with behaviors -associated with state transitions and/or "assumed" states that cannot be -ascertained by observation. -* Don't assume a component's decisions will not be overridden or rejected, nor -for the component to always understand why. For example, etcd may reject writes. -Kubelet may reject pods. The scheduler may not be able to schedule pods. Retry, -but back off and/or make alternative decisions. -* Components should be self-healing. For example, if you must keep some state -(e.g., cache) the content needs to be periodically refreshed, so that if an item -does get erroneously stored or a deletion event is missed etc, it will be soon -fixed, ideally on timescales that are shorter than what will attract attention -from humans. -* Component behavior should degrade gracefully. 
Prioritize actions so that the -most important activities can continue to function even when overloaded and/or -in states of partial failure. - -## Architecture - -* Only the apiserver should communicate with etcd/store, and not other -components (scheduler, kubelet, etc.). -* Compromising a single node shouldn't compromise the cluster. -* Components should continue to do what they were last told in the absence of -new instructions (e.g., due to network partition or component outage). -* All components should keep all relevant state in memory all the time. The -apiserver should write through to etcd/store, other components should write -through to the apiserver, and they should watch for updates made by other -clients. -* Watch is preferred over polling. - -## Extensibility - -TODO: pluggability - -## Bootstrapping - -* [Self-hosting](http://issue.k8s.io/246) of all components is a goal. -* Minimize the number of dependencies, particularly those required for -steady-state operation. -* Stratify the dependencies that remain via principled layering. -* Break any circular dependencies by converting hard dependencies to soft -dependencies. - * Also accept that data from other components from another source, such as -local files, which can then be manually populated at bootstrap time and then -continuously updated once those other components are available. - * State should be rediscoverable and/or reconstructable. - * Make it easy to run temporary, bootstrap instances of all components in -order to create the runtime state needed to run the components in the steady -state; use a lock (master election for distributed components, file lock for -local components like Kubelet) to coordinate handoff. We call this technique -"pivoting". - * Have a solution to restart dead components. For distributed components, -replication works well. For local components such as Kubelet, a process manager -or even a simple shell loop works. - -## Availability - -TODO - -## General principles - -* [Eric Raymond's 17 UNIX rules](https://en.wikipedia.org/wiki/Unix_philosophy#Eric_Raymond.E2.80.99s_17_Unix_Rules) +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/architecture/resource-management.md b/contributors/design-proposals/architecture/resource-management.md index 9eff9833..f0fbec72 100644 --- a/contributors/design-proposals/architecture/resource-management.md +++ b/contributors/design-proposals/architecture/resource-management.md @@ -1,128 +1,6 @@ -# The Kubernetes Resource Model (KRM) +Design proposals have been archived. -> This article was authored by Brian Grant (bgrant0607) on 2/20/2018. The original Google Doc can be found [here](https://docs.google.com/document/d/1RmHXdLhNbyOWPW_AtnnowaRfGejw-qlKQIuLKQWlwzs/edit#). +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Kubernetes is not just API-driven, but is *API-centric*. -At the center of the Kubernetes control plane is the [apiserver](https://kubernetes.io/docs/admin/kube-apiserver/), which implements common functionality for all of the system’s APIs. Both user clients and components implementing the business logic of Kubernetes, called controllers, interact with the same APIs. The APIs are REST-like, supporting primarily CRUD operations on (mostly) persistent resources. All persistent cluster state is stored in one or more instances of the [etcd](https://github.com/coreos/etcd) key-value store. - - - -With the growth in functionality over the past four years, the number of built-in APIs grown by more than an order of magnitude. Moreover, Kubernetes now supports multiple API extension mechanisms that are not only used to add new functionality to Kubernetes itself, but provide frameworks for constructing an ecosystem of components, such as [Operators](https://coreos.com/operators/), for managing applications, platforms, infrastructure, and other things beyond the scope of Kubernetes itself. In addition to providing an overview of the common behaviors of built-in Kubernetes API resources, this document attempts to explain the assumptions, expectations, principles, conventions, and goals of the **Kubernetes Resource Model (KRM)** so as to foster consistency and interoperability within that ecosystem as the uses of its API mechanisms and patterns expand. Any API using the same mechanisms and patterns will automatically work with any libraries and tools (e.g., CLIs, UIs, configuration, deployment, workflow) that have already integrated support for the model, which means that integrating support for N APIs implemented using the model in M tools is merely O(M) work rather than O(NM) work. - -## Declarative control - -In Kubernetes, declarative abstractions are primary, rather than layered on top of the system. The Kubernetes control plane is analogous to cloud-provider declarative resource-management systems (NOTE: Kubernetes also doesn’t bake-in templating, for reasons discussed in the last section.), but presents higher-level (e.g., containerized workloads and services), portable abstractions. Imperative operations and flowchart-like workflow orchestration can be built on top of its declarative model, however. - -Kubernetes supports declarative control by recording user intent as the desired state in its API resources. This enables a single API schema for each resource to serve as a declarative data model, as both a source and a target for automated components (e.g., autoscalers), and even as an intermediate representation for resource transformations prior to instantiation. 
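As a minimal illustration of that model (the resource and field values below are illustrative), a single schema carries both the recorded intent and the state the system reports back:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:                        # desired state, recorded by users or tooling
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.21  # placeholder image
status:                      # observed state, written back by controllers
  observedGeneration: 2
  replicas: 3
  availableReplicas: 2
```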
- -The intent is carried out by asynchronous [controllers](/contributors/devel/sig-api-machinery/controllers.md), which interact through the Kubernetes API. Controllers don’t access the state store, etcd, directly, and don’t communicate via private direct APIs. Kubernetes itself does expose some features similar to key-value stores such as etcd and [Zookeeper](https://zookeeper.apache.org/), however, in order to facilitate centralized [state and configuration management and distribution](https://sysgears.com/articles/managing-configuration-of-distributed-system-with-apache-zookeeper/) to decentralized components. - -Controllers continuously strive to make the observed state match the desired state, and report back their status to the apiserver asynchronously. All of the state, desired and observed, is made visible through the API to users and to other controllers. The API resources serve as coordination points, common intermediate representation, and shared state. - -Controllers are level-based (as described [here](http://gengnosis.blogspot.com/2007/01/level-triggered-and-edge-triggered.html) and [here](https://hackernoon.com/level-triggering-and-reconciliation-in-kubernetes-1f17fe30333d)) to maximize fault tolerance, which enables the system to operate correctly just given the desired state and the observed state, regardless of how many intermediate state updates may have been missed. However, they can achieve the benefits of an edge-triggered implementation by monitoring changes to relevant resources via a notification-style watch API, which minimizes reaction latency and redundant work. This facilitates efficient decentralized and decoupled coordination in a more resilient manner than message buses. (NOTE: Polling is simple, and messaging is simple, but neither is ideal. There should be a CAP-like theorem about simultaneously achieving low latency, resilience, and simplicity -- pick any 2. Challenges with using "reliable" messaging for events/updates include bootstrapping consumers, events lost during bus outages, consumers not keeping up, bounding queue state, and delivery to unspecified numbers of consumers.) - -## Additional resource model properties - -The Kubernetes control-plane design is intended to make the system resilient and extensible, supporting both declarative configuration and automation, while providing a consistent experience to users and clients. In order to add functionality conforming to these objectives, it should be as easy as defining a new resource type and adding a new controller. - -The Kubernetes resource model is designed to reinforce these objectives through its core assumptions (e.g., lack of exclusive control and multiple actors), principles (e.g., transparency and loose coupling), and goals (e.g., composability and extensibility): - -* There are few direct inter-component APIs, and no hidden internal resource-oriented APIs. All APIs are visible and available (subject to authorization policy). The distinction between being part of the system and being built on top of the system is deliberately blurred. In order to handle more complex use cases, there's no glass to break. One can just access lower-level APIs in a fully transparent manner, or add new APIs, as necessary. - -* Kubernetes operates in a distributed environment, and the control-plane itself may be sharded and distributed (e.g., as in the case of aggregated APIs). Desired state is updated immediately but actuated asynchronously and eventually. 
Kubernetes does not support atomic transactions across multiple resources and (especially) resource types, pessimistic locking, other durations where declarative intent cannot be updated (e.g., unavailability while busy), discrete synchronous long-running operations, nor synchronous success preconditions based on the results of actuation (e.g., failing to write a new image tag to a PodSpec when the image cannot be pulled). The Kubernetes API also does not provide strong ordering or consistency across multiple resources, and does not enforce referential integrity. Providing stronger semantics would compromise the resilience, extensibility, and observability of the system, while providing less benefit than one might expect, especially given other assumptions, such as the lack of exclusive control and multiple actors. Resources could be modified or deleted immediately after being created. Failures could occur immediately after success, or even prior to apparent success, if not adequately monitored. Caching and concurrency generally obfuscate event ordering. Workflows often involve external, non-transactional resources, such as git repositories and cloud resources. Therefore, graceful tolerance of out-of-order events and problems that could be self-healed automatically is expected. As an example, if a resource can't properly function due to a nonexistent dependent resource, that should be reported as the reason the resource isn't fully functional in the resource's status field. - -* Typically each resource specifies a single desired state. However, for safety reasons, changes to that state may not be fully realized immediately. Since progressive transitions (e.g., rolling updates, traffic shifting, data migrations) are dependent on the underlying mechanisms being controlled, they must be implemented for each resource type, as needed. If multiple versions of some desired state need to coexist simultaneously (e.g., previous and next versions), they each need to be represented explicitly in the system. The convention is to generate a separate resource for each version, each with a content-derived generated name. Comprehensive version control is the responsibility of other systems (e.g., git). - -* The reported observed state is truth. Controllers are expected to reconcile observed and desired state and repair discrepancies, and Kubernetes avoids maintaining opaque, internal persistent state. Resource status must be reconstructable by observation. - -* The current status is represented using as many properties as necessary, rather than being modeled by state machines with explicit, enumerated states. Such state machines are not extensible (states can neither be added nor removed), and they encourage inference of implicit properties from the states rather than representing the properties explicitly. - -* Resources are not assumed to have single, exclusive "owners". They may be read and written by multiple parties and/or components, often, but not always, responsible for orthogonal concerns (not unlike [aspects](https://en.wikipedia.org/wiki/Aspect-oriented_programming)). Controllers cannot assume their decisions will not be overridden or rejected, must continually verify assumptions, and should gracefully adapt to external events and/or actors. Example: we allow users to kill pods under control of a controller; it just replaces them. 
- -* Object references are usually represented using predictable, client-provided names, to facilitate loose coupling, declarative references, disaster recovery, deletion and re-creation (e.g., to change immutable properties or to transition between incompatible APIs), and more. They are also represented as tuples (of name, namespace, API version, and resource type, or subsets of those) rather than URLs in order to facilitate inference of reference components from context. - -## API topology and conventions - -The [API](https://kubernetes.io/docs/reference/api-concepts/) URL structure is of the following form: - -<p align="center"> - /prefix/group/version/namespaces/namespace/resourcetype/name -</p> - -The API is divided into [**groups**](https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-groups) of related **resource types** that co-evolve together, in API [**version**s](https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-versioning). A client may interact with a resource in any supported version for that group and type. - -Instances of a given resource type are usually (NOTE: There are a small number of non-namespaced resources, also, which have global scope within a particular API service.) grouped by user-created [**namespaces**](https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/), which scope [**names**](https://kubernetes.io/docs/concepts/overview/working-with-objects/names/), references, and some policies. - -All resources contain common **metadata**, including the information contained within the URL path, to enable content-based path discovery. Because the resources are uniform and self-describing, they may be operated on generically and in bulk. The metadata also include user-provided key-value metadata in the form of [**labels**](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/) and [**annotations**](https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/). Labels are used for filtering and grouping by identifying attributes, and annotations are generally used by extensions for configuration and checkpointing. - -Most resources also contain the [desired state ](https://kubernetes.io/docs/concepts/overview/working-with-objects/kubernetes-objects/#object-spec-and-status)[(**spec**)](https://kubernetes.io/docs/concepts/overview/working-with-objects/kubernetes-objects/#object-spec-and-status)[ and observed state ](https://kubernetes.io/docs/concepts/overview/working-with-objects/kubernetes-objects/#object-spec-and-status)[(**status**)](https://kubernetes.io/docs/concepts/overview/working-with-objects/kubernetes-objects/#object-spec-and-status). Status is written using the /status subresource (appended to the standard resource path (NOTE: Note that subresources don’t follow the collection-name/collection-item convention. They are singletons.)), using the same API schema, in order to enable distinct authorization policies for users and controllers. - -A few other subresources (e.g., `/scale`), with their own API types, similarly enable distinct authorization policies for controllers, and also polymorphism, since the same subresource type may be implemented for multiple parent resource types. Where distinct authorization policies are not required, polymorphism may be achieved simply by convention, using patch, akin to duck typing. - -Supported data formats include YAML, JSON, and protocol buffers. 
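For concreteness, here are two request paths that follow the structure above (illustrative only): the first addresses the `explorer` pod from the example resource below, and the second uses a hypothetical Deployment name. Note that the legacy core group is served under `/api/v1` and omits the group segment, while named groups live under `/apis/<group>/<version>`.

```
GET /api/v1/namespaces/default/pods/explorer
GET /apis/apps/v1/namespaces/default/deployments/web
```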
- -Example resource: - -```yaml -apiVersion: v1 -kind: Pod -metadata: - namespace: default - name: explorer - labels: - category: demo - annotations: - commit: 483ac937f496b2f36a8ff34c3b3ba84f70ac5782 -spec: - containers: - - name: explorer - image: gcr.io/google_containers/explorer:1.1.3 - args: ["-port=8080"] - ports: - - containerPort: 8080 - protocol: TCP -status: -``` - -API groups may be exposed as a unified API surface while being served by distinct [servers](https://kubernetes.io/docs/tasks/access-kubernetes-api/setup-extension-api-server/) using [**aggregation**](https://kubernetes.io/docs/concepts/api-extension/apiserver-aggregation/), which is particularly useful for APIs with special storage needs. However, Kubernetes also supports [**custom resources**](https://kubernetes.io/docs/concepts/api-extension/custom-resources/) (CRDs), which enables users to define new types that fit the standard API conventions without needing to build and run another server. CRDs can be used to make systems declaratively and dynamically configurable in a Kubernetes-compatible manner, without needing another storage system. - -Each API server supports a custom [discovery API](https://github.com/kubernetes/client-go/blob/master/discovery/discovery_client.go) to enable clients to discover available API groups, versions, and types, and also [OpenAPI](https://kubernetes.io/blog/2016/12/kubernetes-supports-openapi/), which can be used to extract documentation and validation information about the resource types. - -See the [Kubernetes API conventions](/contributors/devel/sig-architecture/api-conventions.md ) for more details. - -## Resource semantics and lifecycle - -Each API resource undergoes [a common sequence of behaviors](https://kubernetes.io/docs/admin/accessing-the-api/) upon each operation. For a mutation, these behaviors include: - -1. [Authentication](https://kubernetes.io/docs/admin/authentication/) -2. [Authorization](https://kubernetes.io/docs/admin/authorization/): [Built-in](https://kubernetes.io/docs/admin/authorization/rbac/) and/or [administrator-defined](https://kubernetes.io/docs/admin/authorization/webhook/) identity-based policies -3. [Defaulting](/contributors/devel/sig-architecture/api-conventions.md#defaulting): API-version-specific default values are made explicit and persisted -4. Conversion: The apiserver converts between the client-requested [API version](https://kubernetes.io/docs/concepts/overview/kubernetes-api/#API-versioning) and the version it uses to store each resource type in etcd -5. [Admission control](https://kubernetes.io/docs/admin/admission-controllers/): [Built-in](https://kubernetes.io/docs/admin/admission-controllers/) and/or [administrator-defined](https://kubernetes.io/docs/admin/extensible-admission-controllers/) resource-type-specific policies -6. [Validation](/contributors/devel/sig-architecture/api-conventions.md#validation): Resource field values are validated. Other than the presence of required fields, the API resource schema is not currently validated, but optional validation may be added in the future -7. Idempotence: Resources are accessed via immutable client-provided, declarative-friendly names -8. [Optimistic concurrency](/contributors/devel/sig-architecture/api-conventions.md#concurrency-control-and-consistency): Writes may specify a precondition that the **resourceVersion** last reported for a resource has not changed -9. 
[Audit logging](https://kubernetes.io/docs/tasks/debug-application-cluster/audit/): Records the sequence of changes to each resource by all actors - -Additional behaviors are supported upon deletion: - -* Graceful termination: Some resources support delayed deletion, which is indicated by **deletionTimestamp** and **deletionGracePeriodSeconds** being set upon deletion - -* Finalization: A **finalizer** is a block on deletion placed by an external controller, and needs to be removed before the resource deletion can complete - -* [Garbage collection](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/api-machinery/garbage-collection.md): A resource may specify **ownerReferences**, in which case the resource will be deleted once all of the referenced resources have been deleted - -And get: - -* List: All resources of a particular type within a particular namespace may be requested; [response chunking](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/api-machinery/api-chunking.md) is supported - -* [Label selection](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#list-and-watch-filtering): Lists may be filtered by their label keys and values - -* Watch: A client may subscribe to changes to listed resources using the resourceVersion returned with the list results - -## Declarative configuration - -Kubernetes API resource specifications are designed for humans to directly author and read as declarative configuration data, as well as to enable composable configuration tools and automated systems to manipulate them programmatically. We chose this simple approach of using literal API resource specifications for configuration, rather than other representations, because it was natural, given that we designed the API to support CRUD on declarative primitives. The API schema must already be well defined, documented, and supported. With this approach, there’s no other representation to keep up to date with new resources and versions, or to require users to learn. [Declarative configuration](https://goo.gl/T66ZcD) is only one client use case; there are also CLIs (e.g., kubectl), UIs, deployment pipelines, etc. The user will need to interact with the system in terms of the API in these other scenarios, and knowledge of the API transfers to other clients and tools. Additionally, configuration, macro/substitution, and templating languages are generally more difficult to manipulate programmatically than pure data, and involve complexity/expressiveness tradeoffs that prevent one solution being ideal for all use cases. Such languages/tools could be layered over the native API schemas, if desired, but they should not assume exclusive control over all API fields, because doing so obstructs automation and creates undesirable coupling with the configuration ecosystem. - -The Kubernetes Resource Model encourages separation of concerns by supporting multiple distinct configuration sources and preserving declarative intent while allowing automatically set attributes. Properties not explicitly declaratively managed by the user are free to be changed by other clients, enabling the desired state to be cooperatively determined by both users and systems. This is achieved by an operation, called [**Apply**](https://docs.google.com/document/d/1q1UGAIfmOkLSxKhVg7mKknplq3OTDWAIQGWMJandHzg/edit#heading=h.xgjl2srtytjt) ("make it so"), that performs a 3-way merge of the previous configuration, the new configuration, and the live state.
A 2-way merge operation, called [strategic merge patch](https://git.k8s.io/community/contributors/devel/sig-api-machinery/strategic-merge-patch.md), enables patches to be expressed using the same schemas as the resources themselves. Such patches can be used for automated updates without custom mutation operations, for common updates (e.g., container image updates), for combining configurations of orthogonal concerns, and for configuration customization, such as overriding properties of variants. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
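To illustrate the strategic-merge-patch behavior described above, the following sketch is a patch against the `explorer` pod from the earlier example; the newer image tag is hypothetical. Because the containers list declares `name` as its merge key, the patch changes only the matching container's image and leaves other containers and fields untouched. With stock tooling, this is the patch format that `kubectl patch` sends by default for built-in types.

```yaml
# Hypothetical strategic merge patch: bump one container image in the example
# pod above. The containers list is merged by the "name" key, so unrelated
# containers and fields are preserved rather than replaced.
spec:
  containers:
  - name: explorer
    image: gcr.io/google_containers/explorer:1.1.4  # hypothetical newer tag
```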
\ No newline at end of file diff --git a/contributors/design-proposals/architecture/scope.md b/contributors/design-proposals/architecture/scope.md index c81f6cc7..f0fbec72 100644 --- a/contributors/design-proposals/architecture/scope.md +++ b/contributors/design-proposals/architecture/scope.md @@ -1,342 +1,6 @@ -# Kubernetes scope +Design proposals have been archived. -Purpose of this doc: Clarify factors affecting decisions regarding -what is and is not in scope for the Kubernetes project. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Related documents: -* [What is Kubernetes?](https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/) -* [Kubernetes design and architecture](architecture.md) -* [Kubernetes architectural roadmap (2017)](architectural-roadmap.md) -* [Design principles](principles.md) -* [Kubernetes resource management](resource-management.md) -Kubernetes is a portable, extensible open-source platform for managing -containerized workloads and services, that facilitates both -declarative configuration and automation. Workload portability is an -especially high priority. Kubernetes provides a flexible, easy-to-run, -secure foundation for running containerized applications on any cloud -provider or your own systems. - -While not a full distribution in the Linux sense, adoption of -Kubernetes has been facilitated by the fact that the upstream releases -are usable on their own, with minimal dependencies (e.g., etcd, a -container runtime, and a networking implementation). - -The high-level scope and goals are often insufficient for making -decisions about where to draw the line, so this documents where the -line is, the rationale for some past decisions, and some general -criteria that have been applied, including non-technical -considerations. For instance, user adoption and continued operation of -the project itself are also important factors. - -## Significant areas - -More details can be found below, but a concise list of areas in scope follows: -* Containerized workload execution and management -* Service discovery, load balancing, and routing -* Workload identity propagation and authentication -* Declarative resource management platform -* Command-line tool -* Web dashboard (UI) -* Cluster lifecycle tools -* Extensibility to support execution and management in diverse environments -* Multi-cluster management tools and systems -* Project GitHub automation and other process automation -* Project continuous build and test infrastructure -* Release tooling -* Documentation -* Usage data collection mechanisms - -## Scope domains - -Most decisions are regarding whether any part of the project should -undertake efforts in a particular area. However, some decisions may -sometimes be necessary for smaller scopes. The term "core" is sometimes -used, but is not well defined. The following are scopes that may be relevant: -* Kubernetes project github orgs - * All github orgs - * The kubernetes github org - * The kubernetes-sigs and kubernetes-incubator github orgs - * The kubernetes-client github org - * Other github orgs -* Release artifacts - * The Kubernetes release bundle - * Binaries built in kubernetes/kubernetes - * “core” server components: apiserver, controller manager, scheduler, kube-proxy, kubelet - * kubectl - * kubeadm - * Other images, packages, etc. 
-* The kubernetes/kubernetes repository (aka k/k) - * master branch - * kubernetes/kubernetes/master/pkg - * kubernetes/kubernetes/master/staging -* [Functionality layers](architectural-roadmap.md) - * required - * pluggable - * optional - * usable independently of the rest of Kubernetes - -## Other inclusion considerations - -The Kubernetes project is a large, complex effort. - -* Is the functionality consistent with the existing implementation - conventions, design principles, architecture, and direction? - -* Do the subproject owners, approvers, reviewers, and regular contributors - agree to maintain the functionality? - -* Do the contributors to the functionality agree to follow the - project’s development conventions and requirements, including CLA, - code of conduct, github and build tooling, testing, documentation, - and release criteria, etc.? - -* Does the functionality improve existing use cases, or mostly enable - new ones? The project isn't completely blocking new functionality - (more reducing the rate of expansion), but it is trying to - limit additions to kubernetes/kubernetes/master, and aims to improve the - quality of the functionality that already exists. - -* Is it needed by project contributors? Example: We need cluster - creation and upgrade functionality in order to run end-to-end tests. - -* Is it necessary in order to enable workload portability? - -* Is it needed in order for upstream releases to be usable? For - example, things without which users otherwise were - reverse-engineering Kubernetes to figure out, and/or copying code - out of Kubernetes itself to make work. - -* Is it functionality that users expect, such as because other - container platforms and/or service discovery and routing mechanisms - provide it? If a capability that relates to Kubernetes's fundamental - purpose were to become table stakes in the industry, Kubernetes - would need to support it in order to stay relevant. (Whether it - would need to be addressed by the core project would depend on the - other criteria.) - -* Is there sufficiently broad user demand and/or sufficient expected - user benefit for the functionality? - -* Is there an adequate mechanism to discover, deploy, express a - dependency on, and upgrade the functionality if implemented using an - extension mechanism? Are there consistent notions of releases, maturity, - quality, version skew, conformance, etc. for extensions? - -* Is it needed as a reference implementation exercising extension - points or other APIs? - -* Is the functionality sufficiently general-purpose? - -* Is it an area where we want to provide an opinionated solution - and/or where fragmentation would be problematic for users, or are - there many reasonable alternative approaches and solutions to the - problem? - -* Is it an area where we want to foster exploration and innovation in - the ecosystem? - -* Has the ecosystem produced adequate solutions on its own? For - instance, have ecosystem projects taken on requirements of the - Kubernetes project, if needed? Example: etcd3 added a number of features - and other improvements to benefit Kubernetes, so the project didn't - need to launch a separate storage effort. - -* Is there an acceptable home for the recommended ecosystem solution(s)? - Example: the [CNCF Sandbox](https://github.com/cncf/toc/blob/master/process/sandbox.md) is one possible home - -* Has the functionality been provided by the project/release/component - historically? 
- -## Technical scope details and rationale - -### Containerized workload execution and management - -Including: -* common general categories of workloads, such as stateless, stateful, batch, and cluster services -* provisioning, allocation, accessing, and managing compute, storage, and network resources on behalf of the workloads, and enforcement of security policies on those resources -* workload prioritization, capacity assessment, placement, and relocation (aka scheduling) -* graceful workload eviction -* local container image caching -* configuration and secret distribution -* manual and automatic horizontal and vertical scaling -* deployment, progressive (aka rolling) upgrades, and downgrades -* self-healing -* exposing container logs, status, health, and resource usage metrics for collection - -### Service discovery, load balancing, and routing - -Including: -* endpoint tracking and discovery, including pod and non-pod endpoints -* the most common L4 and L7 Internet protocols (TCP, UDP, SCTP, HTTP, HTTPS) -* intra-cluster DNS configuration and serving -* external DNS configuration -* accessing external services (e.g., imported services, Open Service Broker) -* exposing traffic latency, throughput, and status metrics for collection -* access authorization - -### Workload identity propagation and authentication - -Including: -* internal identity (e.g., SPIFFE support) -* external identity (e.g., TLS certificate management) - -### Declarative resource management platform - -Including: -* CRUD API operations and behaviors, diff, patch, dry run, watch -* declarative updates (apply) -* resource type definition, registration, discovery, documentation, and validation mechanisms -* pluggable authentication, authorization, admission (API-level policy enforcement), and audit-logging mechanisms -* Namespace (resource scoping primitive) lifecycle -* resource instance persistence and garbage collection -* asynchronous event reporting -* API producer SDK -* API client SDK / libraries in widely used languages -* dynamic, resource-oriented CLI, as a reference implementation for interacting with the API and basic tool for declarative and imperative management - * simplifies getting started and avoids complexities of documenting the system with just, for instance, curl - -### Command-line tool - -Since some Kubernetes primitives are fairly low-level, in addition to -general-purpose resource-oriented operations, the CLI also supports -“porcelain” for common simple, domain-specific operational operations (both -status/progress extraction and mutations) that don’t have discrete API -implementations, such as run, expose, rollout, cp, top, cordon, and -drain. And there should be support for non-resource-oriented APIs, -such as exec, logs, attach, port-forward, and proxy. - -### Web dashboard (UI) - -The project supported a dashboard, initially built into the apiserver, -almost from the beginning. Other projects in the space had UIs and -users expected one. There wasn’t a vendor-neutral one in the -ecosystem, however, and a solution was needed for the project's local -cluster environment, minikube. The dashboard has also served as a UI -reference implementation and a vehicle to drive conventions (e.g., -around resource category terminology). The dashboard has also been -useful as a tool to demonstrate and to learn about Kubernetes -concepts, features, and behaviors. - -### Cluster lifecycle tools - -Cluster lifecycle includes provisioning, bootstrapping, -upgrade/downgrade, and teardown. 
The project develops several such tools. -Tools are needed for the following scenarios/purposes: -* usability of upstream releases: at least one solution that can be used to bootstrap the upstream release (e.g., kubeadm) -* testing: solutions that can be used to run multi-node end-to-end tests (e.g., kind), integration tests, upgrade/downgrade tests, version-skew tests, scalability tests, and other types of tests the projects deems necessary to ensure adequate release quality -* portable, low-dependency local environment: at least one local environment (e.g., minikube), in order to simplify documentation tutorials that require a cluster to exist - -### Extensibility to support execution and management in diverse environments - -Including: -* CRI -* CNI -* CSI -* external cloud providers -* KMS providers -* OSB brokers -* Cluster APIs - -### Multi-cluster management tools and systems - -Many users desire to operate in and deploy applications to multiple -geographic locations and environments, even across multiple providers. -This generally requires managing multiple Kubernetes clusters. While -general deployment pipeline tools and continuous deployment systems -are not in scope, the project has explored multiple mechanisms to -simplify management of resources across multiple clusters, including -Federation v1, Federation v2, and the Cluster Registry API. - -### Project GitHub automation and other process automation - -As one of the largest, most active projects on Github, Kubernetes has -some extreme needs. - -Including: -* prow -* gubernator -* velodrome and kettle -* website infrastructure -* k8s.io - -### Project continuous build and test infrastructure - -Including: -* prow -* tide -* triage dashboard - -### Release tooling - -Including: -* anago - -### Documentation - -Documentation of project-provided functionality and components, for -multiple audiences, including: -* application developers -* application operators -* cluster operators -* ecosystem developers -* distribution providers, and others who want to port Kubernetes to new environments -* project contributors - -### Usage data collection mechanisms - -Including: -* Spartakus - -## Examples of projects and areas not in scope - -Some of these are obvious, but many have been seriously deliberated in the -past. -* The resource instance store (etcd) -* Container runtimes, other than current grandfathered ones -* Network and storage plugins, other than current grandfathered ones -* CoreDNS - * Since intra-cluster DNS is in scope, we need to ensure we have - some solution, which has been kubedns, but now that there is an - adequate alternative outside the project, we are adopting it. -* Service load balancers (e.g., Envoy, Linkerd), other than kube-proxy -* Cloud provider implementations, other than current grandfathered ones -* Container image build tools -* Image registries and distribution mechanisms -* Identity (user/group) sources of truth (e.g., LDAP) -* Key management systems (e.g., Vault) -* CI, CD, and GitOps (push to deploy) systems, other than - infrastructure used to build and test the Kubernetes project itself -* Application-level services, such as middleware (e.g., message - buses), data-processing frameworks (e.g., Spark), machine-learning - frameworks (e.g., Kubeflow), databases (e.g., Mysql), caches, nor - cluster storage systems (e.g., Ceph) as built-in services. 
Such - components can run on Kubernetes, and/or can be accessed by - applications running on Kubernetes through portable mechanisms, such - as the Open Service Broker. Application-specific Operators (e.g., - Cassandra Operator) are also not in scope. -* Application and cluster log aggregation and searching, application - and cluster monitoring aggregation and dashboarding (other than - heapster, which is grandfathered), alerting, application performance - management, tracing, and debugging tools -* General-purpose machine configuration (e.g., Chef, Puppet, Ansible, - Salt), maintenance, automation (e.g., Rundeck), and management systems -* Templating and configuration languages (e.g., jinja, jsonnet, - starlark, hcl, dhall, hocon) -* File packaging tools (e.g., helm, kpm, kubepack, duffle) -* Managing non-containerized applications in VMs, and other general - IaaS functionality -* Full Platform as a Service functionality -* Full Functions as a Service functionality -* [Workflow - orchestration](https://github.com/kubernetes/kubernetes/pull/24781#issuecomment-215914822): - "Workflow" is a very broad, diverse area, with solutions typically - tailored to specific use cases (e.g., data-flow graphs, data-driven - processing, deployment pipelines, event-driven automation, - business-process execution, iPaaS) and specific input and event - sources, and often requires arbitrary code to evaluate conditions, - actions, and/or failure handling. -* Other forms of human-oriented and programmatic interfaces over the - Kubernetes API other than “basic” CLIs (e.g., kubectl) and UI - (dashboard), such as mobile dashboards, IDEs, chat bots, SQL, - interactive shells, etc. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/auth/access.md b/contributors/design-proposals/auth/access.md index 927bb033..f0fbec72 100644 --- a/contributors/design-proposals/auth/access.md +++ b/contributors/design-proposals/auth/access.md @@ -1,372 +1,6 @@ -# K8s Identity and Access Management Sketch +Design proposals have been archived. -This document suggests a direction for identity and access management in the -Kubernetes system. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Background - -High level goals are: - - Have a plan for how identity, authentication, and authorization will fit in -to the API. - - Have a plan for partitioning resources within a cluster between independent -organizational units. - - Ease integration with existing enterprise and hosted scenarios. - -### Actors - -Each of these can act as normal users or attackers. - - External Users: People who are accessing applications running on K8s (e.g. -a web site served by webserver running in a container on K8s), but who do not -have K8s API access. - - K8s Users: People who access the K8s API (e.g. create K8s API objects like -Pods) - - K8s Project Admins: People who manage access for some K8s Users - - K8s Cluster Admins: People who control the machines, networks, or binaries -that make up a K8s cluster. - - K8s Admin means K8s Cluster Admins and K8s Project Admins taken together. - -### Threats - -Both intentional attacks and accidental use of privilege are concerns. - -For both cases it may be useful to think about these categories differently: - - Application Path - attack by sending network messages from the internet to -the IP/port of any application running on K8s. May exploit weakness in -application or misconfiguration of K8s. - - K8s API Path - attack by sending network messages to any K8s API endpoint. - - Insider Path - attack on K8s system components. Attacker may have -privileged access to networks, machines or K8s software and data. Software -errors in K8s system components and administrator error are some types of threat -in this category. - -This document is primarily concerned with K8s API paths, and secondarily with -Internal paths. The Application path also needs to be secure, but is not the -focus of this document. - -### Assets to protect - -External User assets: - - Personal information like private messages, or images uploaded by External -Users. - - web server logs. - -K8s User assets: - - External User assets of each K8s User. - - things private to the K8s app, like: - - credentials for accessing other services (docker private repos, storage -services, facebook, etc) - - SSL certificates for web servers - - proprietary data and code - -K8s Cluster assets: - - Assets of each K8s User. - - Machine Certificates or secrets. - - The value of K8s cluster computing resources (cpu, memory, etc). - -This document is primarily about protecting K8s User assets and K8s cluster -assets from other K8s Users and K8s Project and Cluster Admins. - -### Usage environments - -Cluster in Small organization: - - K8s Admins may be the same people as K8s Users. - - Few K8s Admins. - - Prefer ease of use to fine-grained access control/precise accounting, etc. - - Product requirement that it be easy for potential K8s Cluster Admin to try -out setting up a simple cluster. - -Cluster in Large organization: - - K8s Admins typically distinct people from K8s Users. 
May need to divide -K8s Cluster Admin access by roles. - - K8s Users need to be protected from each other. - - Auditing of K8s User and K8s Admin actions important. - - Flexible accurate usage accounting and resource controls important. - - Lots of automated access to APIs. - - Need to integrate with existing enterprise directory, authentication, -accounting, auditing, and security policy infrastructure. - -Org-run cluster: - - Organization that runs K8s master components is same as the org that runs -apps on K8s. - - Nodes may be on-premises VMs or physical machines; Cloud VMs; or a mix. - -Hosted cluster: - - Offering K8s API as a service, or offering a Paas or Saas built on K8s. - - May already offer web services, and need to integrate with existing customer -account concept, and existing authentication, accounting, auditing, and security -policy infrastructure. - - May want to leverage K8s User accounts and accounting to manage their User -accounts (not a priority to support this use case.) - - Precise and accurate accounting of resources needed. Resource controls -needed for hard limits (Users given limited slice of data) and soft limits -(Users can grow up to some limit and then be expanded). - -K8s ecosystem services: - - There may be companies that want to offer their existing services (Build, CI, -A/B-test, release automation, etc) for use with K8s. There should be some story -for this case. - -Pods configs should be largely portable between Org-run and hosted -configurations. - - -# Design - -Related discussion: -- http://issue.k8s.io/442 -- http://issue.k8s.io/443 - -This doc describes two security profiles: - - Simple profile: like single-user mode. Make it easy to evaluate K8s -without lots of configuring accounts and policies. Protects from unauthorized -users, but does not partition authorized users. - - Enterprise profile: Provide mechanisms needed for large numbers of users. -Defense in depth. Should integrate with existing enterprise security -infrastructure. - -K8s distribution should include templates of config, and documentation, for -simple and enterprise profiles. System should be flexible enough for -knowledgeable users to create intermediate profiles, but K8s developers should -only reason about those two Profiles, not a matrix. - -Features in this doc are divided into "Initial Feature", and "Improvements". -Initial features would be candidates for version 1.00. - -## Identity - -### userAccount - -K8s will have a `userAccount` API object. -- `userAccount` has a UID which is immutable. This is used to associate users -with objects and to record actions in audit logs. -- `userAccount` has a name which is a string and human readable and unique among -userAccounts. It is used to refer to users in Policies, to ensure that the -Policies are human readable. It can be changed only when there are no Policy -objects or other objects which refer to that name. An email address is a -suggested format for this field. -- `userAccount` is not related to the unix username of processes in Pods created -by that userAccount. -- `userAccount` API objects can have labels. - -The system may associate one or more Authentication Methods with a -`userAccount` (but they are not formally part of the userAccount object.) - -In a simple deployment, the authentication method for a user might be an -authentication token which is verified by a K8s server. 
In a more complex -deployment, the authentication might be delegated to another system which is -trusted by the K8s API to authenticate users, but where the authentication -details are unknown to K8s. - -Initial Features: -- There is no superuser `userAccount` -- `userAccount` objects are statically populated in the K8s API store by reading -a config file. Only a K8s Cluster Admin can do this. -- `userAccount` can have a default `namespace`. If API call does not specify a -`namespace`, the default `namespace` for that caller is assumed. -- `userAccount` is global. A single human with access to multiple namespaces is -recommended to only have one userAccount. - -Improvements: -- Make `userAccount` part of a separate API group from core K8s objects like -`pod.` Facilitates plugging in alternate Access Management. - -Simple Profile: - - Single `userAccount`, used by all K8s Users and Project Admins. One access -token shared by all. - -Enterprise Profile: - - Every human user has own `userAccount`. - - `userAccount`s have labels that indicate both membership in groups, and -ability to act in certain roles. - - Each service using the API has own `userAccount` too. (e.g. `scheduler`, -`repcontroller`) - - Automated jobs to denormalize the ldap group info into the local system -list of users into the K8s userAccount file. - -### Unix accounts - -A `userAccount` is not a Unix user account. The fact that a pod is started by a -`userAccount` does not mean that the processes in that pod's containers run as a -Unix user with a corresponding name or identity. - -Initially: -- The unix accounts available in a container, and used by the processes running -in a container are those that are provided by the combination of the base -operating system and the Docker manifest. -- Kubernetes doesn't enforce any relation between `userAccount` and unix -accounts. - -Improvements: -- Kubelet allocates disjoint blocks of root-namespace uids for each container. -This may provide some defense-in-depth against container escapes. (https://github.com/docker/docker/pull/4572) -- requires docker to integrate user namespace support, and deciding what -getpwnam() does for these uids. -- any features that help users avoid use of privileged containers -(http://issue.k8s.io/391) - -### Namespaces - -K8s will have a `namespace` API object. It is similar to a Google Compute -Engine `project`. It provides a namespace for objects created by a group of -people co-operating together, preventing name collisions with non-cooperating -groups. It also serves as a reference point for authorization policies. - -Namespaces are described in [namespaces](../architecture/namespaces.md). - -In the Enterprise Profile: - - a `userAccount` may have permission to access several `namespace`s. - -In the Simple Profile: - - There is a single `namespace` used by the single user. - -Namespaces versus userAccount vs. Labels: -- `userAccount`s are intended for audit logging (both name and UID should be -logged), and to define who has access to `namespace`s. -- `labels` (see [labels](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/)) -should be used to distinguish pods, users, and other objects that cooperate -towards a common goal but are different in some way, such as version, or -responsibilities. -- `namespace`s prevent name collisions between uncoordinated groups of people, -and provide a place to attach common policies for co-operating groups of people. 
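To make the identity model sketched above more concrete, a hypothetical `userAccount` object under this proposal might look like the following; the API group, kind, and field names are assumptions for illustration, since the proposal does not define a schema.

```yaml
# Hypothetical sketch of a userAccount as described in this proposal.
# apiVersion, kind, and field names are assumed, not part of the proposal.
apiVersion: auth.k8s.io/v1alpha1
kind: UserAccount
metadata:
  name: alice@example.com                    # human readable, unique; email format suggested
  uid: 3c2a7d9e-52f1-4c1a-9d6e-0b7f2a1c8e55  # immutable; used in audit logs and policies
  labels:
    group: web-team                          # labels can express group membership...
    role: project-admin                      # ...and the ability to act in certain roles
spec:
  defaultNamespace: web                      # assumed field: used when an API call omits a namespace
```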
- - -## Authentication - -Goals for K8s authentication: -- Include a built-in authentication system with no configuration required to use -in single-user mode, and little configuration required to add several user -accounts, and no https proxy required. -- Allow for authentication to be handled by a system external to Kubernetes, to -allow integration with existing enterprise authorization systems. The -Kubernetes namespace itself should avoid taking contributions of multiple -authorization schemes. Instead, a trusted proxy in front of the apiserver can be -used to authenticate users. - - For organizations whose security requirements only allow FIPS compliant -implementations (e.g. apache) for authentication. - - So the proxy can terminate SSL, and isolate the CA-signed certificate from -less trusted, higher-touch APIserver. - - For organizations that already have existing SaaS web services (e.g. -storage, VMs) and want a common authentication portal. -- Avoid mixing authentication and authorization, so that authorization policies -can be centrally managed, and to allow changes in authentication methods without -affecting authorization code. - -Initially: -- Tokens used to authenticate a user. -- Long lived tokens identify a particular `userAccount`. -- Administrator utility generates tokens at cluster setup. -- OAuth2.0 Bearer tokens protocol, http://tools.ietf.org/html/rfc6750 -- No scopes for tokens. Authorization happens in the API server -- Tokens dynamically generated by apiserver to identify pods which are making -API calls. -- Tokens checked in a module of the APIserver. -- Authentication in apiserver can be disabled by flag, to allow testing without -authorization enabled, and to allow use of an authenticating proxy. In this -mode, a query parameter or header added by the proxy will identify the caller. - -Improvements: -- Refresh of tokens. -- SSH keys to access inside containers. - -To be considered for subsequent versions: -- Fuller use of OAuth (http://tools.ietf.org/html/rfc6749) -- Scoped tokens. -- Tokens that are bound to the channel between the client and the api server - - http://www.ietf.org/proceedings/90/slides/slides-90-uta-0.pdf - - http://www.browserauth.net - -## Authorization - -K8s authorization should: -- Allow for a range of maturity levels, from single-user for those test driving -the system, to integration with existing enterprise authorization systems. -- Allow for centralized management of users and policies. In some -organizations, this will mean that the definition of users and access policies -needs to reside on a system other than k8s and encompass other web services -(such as a storage service). -- Allow processes running in K8s Pods to take on identity, and to allow narrow -scoping of permissions for those identities in order to limit damage from -software faults. -- Have Authorization Policies exposed as API objects so that a single config -file can create or delete Pods, Replication Controllers, Services, and the -identities and policies for those Pods and Replication Controllers. -- Be separate as much as practical from Authentication, to allow Authentication -methods to change over time and space, without impacting Authorization policies. - -K8s will implement a relatively simple -[Attribute-Based Access Control](http://en.wikipedia.org/wiki/Attribute_Based_Access_Control) model. - -The model will be described in more detail in a forthcoming document.
The model -will: -- Be less complex than XACML -- Be easily recognizable to those familiar with Amazon IAM Policies. -- Have a subset/aliases/defaults which allow it to be used in a way comfortable -to those users more familiar with Role-Based Access Control. - -Authorization policy is set by creating a set of Policy objects. - -The API Server will be the Enforcement Point for Policy. For each API call that -it receives, it will construct the Attributes needed to evaluate the policy -(what user is making the call, what resource they are accessing, what they are -trying to do that resource, etc) and pass those attributes to a Decision Point. -The Decision Point code evaluates the Attributes against all the Policies and -allows or denies the API call. The system will be modular enough that the -Decision Point code can either be linked into the APIserver binary, or be -another service that the apiserver calls for each Decision (with appropriate -time-limited caching as needed for performance). - -Policy objects may be applicable only to a single namespace or to all -namespaces; K8s Project Admins would be able to create those as needed. Other -Policy objects may be applicable to all namespaces; a K8s Cluster Admin might -create those in order to authorize a new type of controller to be used by all -namespaces, or to make a K8s User into a K8s Project Admin.) - -## Accounting - -The API should have a `quota` concept (see http://issue.k8s.io/442). A quota -object relates a namespace (and optionally a label selector) to a maximum -quantity of resources that may be used (see [resources design doc](../scheduling/resources.md)). - -Initially: -- A `quota` object is immutable. -- For hosted K8s systems that do billing, Project is recommended level for -billing accounts. -- Every object that consumes resources should have a `namespace` so that -Resource usage stats are roll-up-able to `namespace`. -- K8s Cluster Admin sets quota objects by writing a config file. - -Improvements: -- Allow one namespace to charge the quota for one or more other namespaces. This -would be controlled by a policy which allows changing a billing_namespace = -label on an object. -- Allow quota to be set by namespace owners for (namespace x label) combinations -(e.g. let "webserver" namespace use 100 cores, but to prevent accidents, don't -allow "webserver" namespace and "instance=test" use more than 10 cores. -- Tools to help write consistent quota config files based on number of nodes, -historical namespace usages, QoS needs, etc. -- Way for K8s Cluster Admin to incrementally adjust Quota objects. - -Simple profile: - - A single `namespace` with infinite resource limits. - -Enterprise profile: - - Multiple namespaces each with their own limits. - -Issues: -- Need for locking or "eventual consistency" when multiple apiserver goroutines -are accessing the object store and handling pod creations. - - -## Audit Logging - -API actions can be logged. - -Initial implementation: -- All API calls logged to nginx logs. - -Improvements: -- API server does logging instead. -- Policies to drop logging for high rate trusted API calls, or by users -performing audit or other sensitive functions. - +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
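As a sketch of the attribute-based model outlined in the Authorization section above, a namespace-scoped Policy object might look roughly like the following; the field names are assumptions for illustration, since the proposal defers the detailed model to a forthcoming document.

```yaml
# Hypothetical Policy object in the spirit of the ABAC model described above.
# The apiVersion is omitted and the field names are assumed, not defined here.
kind: Policy
spec:
  user: alice@example.com   # which userAccount the policy applies to
  namespace: web            # scope to one namespace; cluster-wide policies would omit this
  resource: pods            # which resource type may be accessed
  readonly: true            # restrict the user to read-only operations
```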
\ No newline at end of file diff --git a/contributors/design-proposals/auth/apparmor.md b/contributors/design-proposals/auth/apparmor.md index 5130a52d..f0fbec72 100644 --- a/contributors/design-proposals/auth/apparmor.md +++ b/contributors/design-proposals/auth/apparmor.md @@ -1,302 +1,6 @@ -- [Overview](#overview) - - [Motivation](#motivation) - - [Related work](#related-work) -- [Alpha Design](#alpha-design) - - [Overview](#overview-1) - - [Prerequisites](#prerequisites) - - [API Changes](#api-changes) - - [Pod Security Policy](#pod-security-policy) - - [Deploying profiles](#deploying-profiles) - - [Testing](#testing) -- [Beta Design](#beta-design) - - [API Changes](#api-changes-1) -- [Future work](#future-work) - - [System component profiles](#system-component-profiles) - - [Deploying profiles](#deploying-profiles-1) - - [Custom app profiles](#custom-app-profiles) - - [Security plugins](#security-plugins) - - [Container Runtime Interface](#container-runtime-interface) - - [Alerting](#alerting) - - [Profile authoring](#profile-authoring) -- [Appendix](#appendix) +Design proposals have been archived. -# Overview +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -AppArmor is a [mandatory access control](https://en.wikipedia.org/wiki/Mandatory_access_control) -(MAC) system for Linux that supplements the standard Linux user and group based -permissions. AppArmor can be configured for any application to reduce the potential attack surface -and provide greater [defense in depth](https://en.wikipedia.org/wiki/Defense_in_depth_(computing)). -It is configured through profiles tuned to whitelist the access needed by a specific program or -container, such as Linux capabilities, network access, file permissions, etc. Each profile can be -run in either enforcing mode, which blocks access to disallowed resources, or complain mode, which -only reports violations. -AppArmor is similar to SELinux. Both are MAC systems implemented as a Linux security module (LSM), -and are mutually exclusive. SELinux offers a lot of power and very fine-grained controls, but is -generally considered very difficult to understand and maintain. AppArmor sacrifices some of that -flexibility in favor of ease of use. Seccomp-bpf is another Linux kernel security feature for -limiting attack surface, and can (and should!) be used alongside AppArmor. - -## Motivation - -AppArmor can enable users to run a more secure deployment, and / or provide better auditing and -monitoring of their systems. Although it is not the only solution, we should enable AppArmor for -users that want a simpler alternative to SELinux, or are already maintaining a set of AppArmor -profiles. We have heard from multiple Kubernetes users already that AppArmor support is important to -them. The [seccomp proposal](seccomp.md#use-cases) details several use cases that -also apply to AppArmor. - -## Related work - -Much of this design is drawn from the work already done to support seccomp profiles in Kubernetes, -which is outlined in the [seccomp design doc](seccomp.md). The designs should be -kept close to apply lessons learned, and reduce cognitive and maintenance overhead. - -Docker has supported AppArmor profiles since version 1.3, and maintains a default profile which is -applied to all containers on supported systems. - -AppArmor was upstreamed into the Linux kernel in version 2.6.36. 
It is currently maintained by -[Canonical](http://www.canonical.com/), is shipped by default on all Ubuntu and openSUSE systems, -and is supported on several -[other distributions](http://wiki.apparmor.net/index.php/Main_Page#Distributions_and_Ports). - -# Alpha Design - -This section describes the proposed design for -[alpha-level](/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions) support, although -additional features are described in [future work](#future-work). For AppArmor alpha support -(targeted for Kubernetes 1.4) we will enable: - -- Specifying a pre-loaded profile to apply to a pod container -- Restricting pod containers to a set of profiles (admin use case) - -We will also provide a reference implementation of a pod for loading profiles on nodes, but an -official supported mechanism for deploying profiles is out of scope for alpha. - -## Overview - -An AppArmor profile can be specified for a container through the Kubernetes API with a pod -annotation. If a profile is specified, the Kubelet will verify that the node meets the required -[prerequisites](#prerequisites) (e.g. the profile is already configured on the node) before starting -the container, and will not run the container if the profile cannot be applied. If the requirements -are met, the container runtime will configure the appropriate options to apply the profile. Profile -requirements and defaults can be specified on the -[PodSecurityPolicy](pod-security-policy.md). - -## Prerequisites - -When an AppArmor profile is specified, the Kubelet will verify the prerequisites for applying the -profile to the container. In order to [fail -securely](https://www.owasp.org/index.php/Fail_securely), a container **will not be run** if any of -the prerequisites are not met. The prerequisites are: - -1. **Kernel support** - The AppArmor kernel module is loaded. Can be checked by - [libcontainer](https://github.com/opencontainers/runc/blob/4dedd0939638fc27a609de1cb37e0666b3cf2079/libcontainer/apparmor/apparmor.go#L17). -2. **Runtime support** - For the initial implementation, Docker will be required (rkt does not - currently have AppArmor support). All supported Docker versions include AppArmor support. See - [Container Runtime Interface](#container-runtime-interface) for other runtimes. -3. **Installed profile** - The target profile must be loaded prior to starting the container. Loaded - profiles can be found in the AppArmor securityfs \[1\]. - -If any of the prerequisites are not met an event will be generated to report the error and the pod -will be -[rejected](https://github.com/kubernetes/kubernetes/blob/cdfe7b7b42373317ecd83eb195a683e35db0d569/pkg/kubelet/kubelet.go#L2201) -by the Kubelet. - -*[1] The securityfs can be found in `/proc/mounts`, and defaults to `/sys/kernel/security` on my -Ubuntu system. The profiles can be found at `{securityfs}/apparmor/profiles` -([example](http://bazaar.launchpad.net/~apparmor-dev/apparmor/master/view/head:/utils/aa-status#L137)).* - -## API Changes - -The initial alpha support of AppArmor will follow the pattern -[used by seccomp](https://github.com/kubernetes/kubernetes/pull/25324) and specify profiles through -annotations. Profiles can be specified per-container through pod annotations. 
The annotation format -is a key matching the container, and a profile name value: - -``` -container.apparmor.security.alpha.kubernetes.io/<container_name>=<profile_name> -``` - -The profiles can be specified in the following formats (following the convention used by [seccomp](seccomp.md#api-changes)): - -1. `runtime/default` - Applies the default profile for the runtime. For docker, the profile is - generated from a template - [here](https://github.com/docker/docker/blob/master/profiles/apparmor/template.go). If no - AppArmor annotations are provided, this profile is enabled by default if AppArmor is enabled in - the kernel. Runtimes may define this to be unconfined, as Docker does for privileged pods. -2. `localhost/<profile_name>` - The profile name specifies the profile to load. - -*Note: There is no way to explicitly specify an "unconfined" profile, since it is discouraged. If - this is truly needed, the user can load an "allow-all" profile.* - -### Pod Security Policy - -The [PodSecurityPolicy](pod-security-policy.md) allows cluster administrators to control -the security context for a pod and its containers. An annotation can be specified on the -PodSecurityPolicy to restrict which AppArmor profiles can be used, and specify a default if no -profile is specified. - -The annotation key is `apparmor.security.alpha.kubernetes.io/allowedProfileNames`. The value is a -comma delimited list, with each item following the format described [above](#api-changes). If a list -of profiles are provided and a pod does not have an AppArmor annotation, the first profile in the -list will be used by default. - -Enforcement of the policy is standard. See the -[seccomp implementation](https://github.com/kubernetes/kubernetes/pull/28300) as an example. - -## Deploying profiles - -We will provide a reference implementation of a DaemonSet pod for loading profiles on nodes, but -there will not be an official mechanism or API in the initial version (see -[future work](#deploying-profiles-1)). The reference container will contain the `apparmor_parser` -tool and a script for using the tool to load all profiles in a set of (configurable) -directories. The initial implementation will poll (with a configurable interval) the directories for -additions, but will not update or unload existing profiles. The pod can be run in a DaemonSet to -load the profiles onto all nodes. The pod will need to be run in privileged mode. - -This simple design should be sufficient to deploy AppArmor profiles from any volume source, such as -a ConfigMap or PersistentDisk. Users seeking more advanced features should be able extend this -design easily. - -## Testing - -Our e2e testing framework does not currently run nodes with AppArmor enabled, but we can run a node -e2e test suite on an AppArmor enabled node. The cases we should test are: - -- *PodSecurityPolicy* - These tests can be run on a cluster even if AppArmor is not enabled on the - nodes. - - No AppArmor policy allows pods with arbitrary profiles - - With a policy a default is selected - - With a policy arbitrary profiles are prevented - - With a policy allowed profiles are allowed -- *Node AppArmor enforcement* - These tests need to run on AppArmor enabled nodes, in the node e2e - suite. - - A valid container profile gets applied - - An unloaded profile will be rejected - -# Beta Design - -The only part of the design that changes for beta is the API, which is upgraded from -annotation-based to first class fields. 
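Before turning to the field-based beta API, here is what the alpha annotation form described earlier looks like on a pod; the profile name is hypothetical and, per the prerequisites above, must already be loaded on the node or the Kubelet will reject the pod.

```yaml
# Illustrative pod using the alpha AppArmor annotation described above.
# "k8s-example-deny-write" is a hypothetical, pre-loaded profile name.
apiVersion: v1
kind: Pod
metadata:
  name: apparmor-demo
  annotations:
    container.apparmor.security.alpha.kubernetes.io/demo: localhost/k8s-example-deny-write
spec:
  containers:
  - name: demo                       # matches the container name in the annotation key
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
```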
- -## API Changes - -AppArmor profiles will be specified in the container's SecurityContext, as part of an -`AppArmorOptions` struct. The options struct makes the API more flexible to future additions. - -```go -type SecurityContext struct { - ... - // The AppArmor options to be applied to the container. - AppArmorOptions *AppArmorOptions `json:"appArmorOptions,omitempty"` - ... -} - -// Reference to an AppArmor profile loaded on the host. -type AppArmorProfileName string - -// Options specifying how to run Containers with AppArmor. -type AppArmorOptions struct { - // The profile the Container must be run with. - Profile AppArmorProfileName `json:"profile"` -} -``` - -The `AppArmorProfileName` format matches the format for the profile annotation values describe -[above](#api-changes). - -The `PodSecurityPolicySpec` receives a similar treatment with the addition of an -`AppArmorStrategyOptions` struct. Here the `DefaultProfile` is separated from the `AllowedProfiles` -in the interest of making the behavior more explicit. - -```go -type PodSecurityPolicySpec struct { - ... - AppArmorStrategyOptions *AppArmorStrategyOptions `json:"appArmorStrategyOptions,omitempty"` - ... -} - -// AppArmorStrategyOptions specifies AppArmor restrictions and requirements for pods and containers. -type AppArmorStrategyOptions struct { - // If non-empty, all pod containers must be run with one of the profiles in this list. - AllowedProfiles []AppArmorProfileName `json:"allowedProfiles,omitempty"` - // The default profile to use if a profile is not specified for a container. - // Defaults to "runtime/default". Must be allowed by AllowedProfiles. - DefaultProfile AppArmorProfileName `json:"defaultProfile,omitempty"` -} -``` - -# Future work - -Post-1.4 feature ideas. These are not fully-fleshed designs. - -## System component profiles - -We should publish (to GitHub) AppArmor profiles for all Kubernetes system components, including core -components like the API server and controller manager, as well as addons like influxDB and -Grafana. `kube-up.sh` and its successor should have an option to apply the profiles, if the AppArmor -is supported by the nodes. Distros that support AppArmor and provide a Kubernetes package should -include the profiles out of the box. - -## Deploying profiles - -We could provide an official supported solution for loading profiles on the nodes. One option is to -extend the reference implementation described [above](#deploying-profiles) into a DaemonSet that -watches the directory sources to sync changes, or to watch a ConfigMap object directly. Another -option is to add an official API for this purpose, and load the profiles on-demand in the Kubelet. - -## Custom app profiles - -[Profile stacking](http://wiki.apparmor.net/index.php/AppArmorStacking) is an AppArmor feature -currently in development that will enable multiple profiles to be applied to the same object. If -profiles are stacked, the allowed set of operations is the "intersection" of both profiles -(i.e. stacked profiles are never more permissive). Taking advantage of this feature, the cluster -administrator could restrict the allowed profiles on a PodSecurityPolicy to a few broad profiles, -and then individual apps could apply more app specific profiles on top. - -## Security plugins - -AppArmor, SELinux, TOMOYO, grsecurity, SMACK, etc. are all Linux MAC implementations with similar -requirements and features. 
At the very least, the AppArmor implementation should be factored in a -way that makes it easy to add alternative systems. A more advanced approach would be to extract a -set of interfaces for plugins implementing the alternatives. An even higher level approach would be -to define a common API or profile interface for all of them. Work towards this last option is -already underway for Docker, called -[Docker Security Profiles](https://github.com/docker/docker/issues/17142#issuecomment-148974642). - -## Container Runtime Interface - -Other container runtimes will likely add AppArmor support eventually, so the -[Container Runtime Interface](/contributors/devel/sig-node/container-runtime-interface.md) (CRI) needs to be made compatible -with this design. The two important pieces are a way to report whether AppArmor is supported by the -runtime, and a way to specify the profile to load (likely through the `LinuxContainerConfig`). - -## Alerting - -Whether AppArmor is running in enforcing or complain mode it generates logs of policy -violations. These logs can be important cues for intrusion detection, or at the very least a bug in -the profile. Violations should almost always generate alerts in production systems. We should -provide reference documentation for setting up alerts. - -## Profile authoring - -A common method for writing AppArmor profiles is to start with a restrictive profile in complain -mode, and then use the `aa-logprof` tool to build a profile from the logs. We should provide -documentation for following this process in a Kubernetes environment. - -# Appendix - -- [What is AppArmor](https://askubuntu.com/questions/236381/what-is-apparmor) -- [Debugging AppArmor on Docker](https://github.com/docker/labs/blob/master/security/apparmor/README.md) -- Load an AppArmor profile with `apparmor_parser` (required by Docker so it should be available): - - ``` - $ apparmor_parser --replace --write-cache /path/to/profile - ``` - -- Unload with: - - ``` - $ apparmor_parser --remove /path/to/profile - ``` +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
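If the `AppArmorOptions` field proposed for the beta API above were adopted as written, the same intent could be expressed in the container's security context instead of an annotation. This is a sketch only; the field names come from this proposal and may differ from what ultimately shipped:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: apparmor-demo
spec:
  containers:
  - name: nginx
    image: nginx
    securityContext:
      # Proposed beta field; the profile name is illustrative.
      appArmorOptions:
        profile: localhost/k8s-nginx
```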
\ No newline at end of file diff --git a/contributors/design-proposals/auth/bound-service-account-tokens.md b/contributors/design-proposals/auth/bound-service-account-tokens.md index 961e17a2..f0fbec72 100644 --- a/contributors/design-proposals/auth/bound-service-account-tokens.md +++ b/contributors/design-proposals/auth/bound-service-account-tokens.md @@ -1,239 +1,6 @@ -# Bound Service Account Tokens +Design proposals have been archived. -Author: @mikedanese +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -# Objective - -This document describes an API that would allow workloads running on Kubernetes -to request JSON Web Tokens that are audience, time and eventually key bound. - -# Background - -Kubernetes already provisions JWTs to workloads. This functionality is on by -default and thus widely deployed. The current workload JWT system has serious -issues: - -1. Security: JWTs are not audience bound. Any recipient of a JWT can masquerade - as the presenter to anyone else. -1. Security: The current model of storing the service account token in a Secret - and delivering it to nodes results in a broad attack surface for the - Kubernetes control plane when powerful components are run - giving a service - account a permission means that any component that can see that service - account's secrets is at least as powerful as the component. -1. Security: JWTs are not time bound. A JWT compromised via 1 or 2, is valid - for as long as the service account exists. This may be mitigated with - service account signing key rotation but is not supported by client-go and - not automated by the control plane and thus is not widely deployed. -1. Scalability: JWTs require a Kubernetes secret per service account. - -# Proposal - -Infrastructure to support on demand token requests will be implemented in the -core apiserver. Once this API exists, a client of the apiserver will request an -attenuated token for its own use. The API will enforce required attenuations, -e.g. audience and time binding. - -## Token attenuations - -### Audience binding - -Tokens issued from this API will be audience bound. Audience of requested tokens -will be bound by the `aud` claim. The `aud` claim is an array of strings -(usually URLs) that correspond to the intended audience of the token. A -recipient of a token is responsible for verifying that it identifies as one of -the values in the audience claim, and should otherwise reject the token. The -TokenReview API will support this validation. - -### Time binding - -Tokens issued from this API will be time bound. Time validity of these tokens -will be claimed in the following fields: - -* `exp`: expiration time -* `nbf`: not before -* `iat`: issued at - -A recipient of a token should verify that the token is valid at the time that -the token is presented, and should otherwise reject the token. The TokenReview -API will support this validation. - -Cluster administrators will be able to configure the maximum validity duration -for expiring tokens. During the migration off of the old service account tokens, -clients of this API may request tokens that are valid for many years. These -tokens will be drop in replacements for the current service account tokens. - -### Object binding - -Tokens issued from this API may be bound to a Kubernetes object in the same -namespace as the service account. The name, group, version, kind and uid of the -object will be embedded as claims in the issued token. 
A token bound to an -object will only be valid for as long as that object exists. - -Only a subset of object kinds will support object binding. Initially the only -kinds that will be supported are: - -* v1/Pod -* v1/Secret - -The TokenRequest API will validate this binding. - -## API Changes - -### Add `tokenrequests.authentication.k8s.io` - -We will add an imperative API (a la TokenReview) to the -`authentication.k8s.io` API group: - -```golang -type TokenRequest struct { - Spec TokenRequestSpec - Status TokenRequestStatus -} - -type TokenRequestSpec struct { - // Audiences are the intendend audiences of the token. A token issued - // for multiple audiences may be used to authenticate against any of - // the audiences listed. This implies a high degree of trust between - // the target audiences. - Audiences []string - - // ValidityDuration is the requested duration of validity of the request. The - // token issuer may return a token with a different validity duration so a - // client needs to check the 'expiration' field in a response. - ValidityDuration metav1.Duration - - // BoundObjectRef is a reference to an object that the token will be bound to. - // The token will only be valid for as long as the bound object exists. - BoundObjectRef *BoundObjectReference -} - -type BoundObjectReference struct { - // Kind of the referent. Valid kinds are 'Pod' and 'Secret'. - Kind string - // API version of the referent. - APIVersion string - - // Name of the referent. - Name string - // UID of the referent. - UID types.UID -} - -type TokenRequestStatus struct { - // Token is the token data - Token string - - // Expiration is the time of expiration of the returned token. Empty means the - // token does not expire. - Expiration metav1.Time -} - -``` - -This API will be exposed as a subresource under a serviceaccount object. A -requestor for a token for a specific service account will `POST` a -`TokenRequest` to the `/token` subresource of that serviceaccount object. - -### Modify `tokenreviews.authentication.k8s.io` - -The TokenReview API will be extended to support passing an additional audience -field which the service account authenticator will validate. - -```golang -type TokenReviewSpec struct { - // Token is the opaque bearer token. - Token string - // Audiences are the identifiers that the client identifies as. 
- Audiences []string -} -``` - -### Example Flow - -``` -> POST /apis/v1/namespaces/default/serviceaccounts/default/token -> { -> "kind": "TokenRequest", -> "apiVersion": "authentication.k8s.io/v1", -> "spec": { -> "audience": [ -> "https://kubernetes.default.svc" -> ], -> "validityDuration": "99999h", -> "boundObjectRef": { -> "kind": "Pod", -> "apiVersion": "v1", -> "name": "pod-foo-346acf" -> } -> } -> } -{ - "kind": "TokenRequest", - "apiVersion": "authentication.k8s.io/v1", - "spec": { - "audience": [ - "https://kubernetes.default.svc" - ], - "validityDuration": "99999h", - "boundObjectRef": { - "kind": "Pod", - "apiVersion": "v1", - "name": "pod-foo-346acf" - } - }, - "status": { - "token": - "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJz[payload omitted].EkN-[signature omitted]", - "expiration": "Jan 24 16:36:00 PST 3018" - } -} -``` - -The token payload will be: - -``` -{ - "iss": "https://example.com/some/path", - "sub": "system:serviceaccount:default:default, - "aud": [ - "https://kubernetes.default.svc" - ], - "exp": 24412841114, - "iat": 1516841043, - "nbf": 1516841043, - "kubernetes.io": { - "serviceAccountUID": "c0c98eab-0168-11e8-92e5-42010af00002", - "boundObjectRef": { - "kind": "Pod", - "apiVersion": "v1", - "uid": "a4bb8aa4-0168-11e8-92e5-42010af00002", - "name": "pod-foo-346acf" - } - } -} -``` - -## Service Account Authenticator Modification - -The service account token authenticator will be extended to support validation -of time and audience binding claims. - -## ACLs for TokenRequest - -The NodeAuthorizer will allow the kubelet to use its credentials to request a -service account token on behalf of pods running on that node. The -NodeRestriction admission controller will require that these tokens are pod -bound. - -## Footnotes - -* New apiserver flags: - * --service-account-issuer: Identifier of the issuer. - * --service-account-signing-key: Path to issuer private key used for signing. - * --service-account-api-audience: Identifier of the API. Used to validate - tokens authenticating to the Kubernetes API. -* The Kubernetes apiserver will identify itself as `kubernetes.default.svc` - which is the DNS name of the Kubernetes apiserver. When no audience is - requested, the audience is defaulted to an array - containing only this identifier. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
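For reference, the TokenRequest subresource described above is what current client-go exposes as `CreateToken` on the service accounts client. A minimal sketch, assuming an in-cluster configuration, and noting that the API that shipped uses `expirationSeconds` where this proposal names `validityDuration` (namespace, audience, and pod name are illustrative):

```go
package main

import (
	"context"
	"fmt"

	authenticationv1 "k8s.io/api/authentication/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	expiry := int64(3600) // one hour; the issuer may shorten this
	tr := &authenticationv1.TokenRequest{
		Spec: authenticationv1.TokenRequestSpec{
			// Audience-bound: recipients must verify they appear in this list.
			Audiences:         []string{"https://kubernetes.default.svc"},
			ExpirationSeconds: &expiry,
			// Bind the token lifetime to a specific Pod object (name is illustrative).
			BoundObjectRef: &authenticationv1.BoundObjectReference{
				Kind:       "Pod",
				APIVersion: "v1",
				Name:       "pod-foo-346acf",
			},
		},
	}

	// POSTs to the /token subresource of the service account, as described above.
	resp, err := client.CoreV1().ServiceAccounts("default").CreateToken(
		context.TODO(), "default", tr, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("token expires at:", resp.Status.ExpirationTimestamp)
}
```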
\ No newline at end of file diff --git a/contributors/design-proposals/auth/cluster-role-aggregation.md b/contributors/design-proposals/auth/cluster-role-aggregation.md index 12739589..f0fbec72 100644 --- a/contributors/design-proposals/auth/cluster-role-aggregation.md +++ b/contributors/design-proposals/auth/cluster-role-aggregation.md @@ -1,94 +1,6 @@ -# Cluster Role Aggregation -In order to support easy RBAC integration for CustomResources and Extension -APIServers, we need to have a way for API extenders to add permissions to the -"normal" roles for admin, edit, and view. +Design proposals have been archived. -These roles express an intent for the namespaced power of administrators of the -namespace (manage ownership), editors of the namespace (manage content like -pods), and viewers of the namespace (see what is present). As new APIs are -made available, these roles should reflect that intent to prevent migration -concerns every time a new API is added. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -To do this, we will allow one ClusterRole to be built out of a selected set of -ClusterRoles. -## API Changes -```yaml -aggregationRule: - selectors: - - matchLabels: - rbac.authorization.k8s.io/aggregate-to-admin: true -``` - -```go -// ClusterRole is a cluster level, logical grouping of PolicyRules that can be referenced as a unit by a RoleBinding or ClusterRoleBinding. -type ClusterRole struct { - metav1.TypeMeta - // Standard object's metadata. - metav1.ObjectMeta - - // Rules holds all the PolicyRules for this ClusterRole - Rules []PolicyRule - - // AggregationRule is an optional field that describes how to build the Rules for this ClusterRole. - // If AggregationRule is set, then the Rules are controller managed and direct changes to Rules will be - // stomped by the controller. - AggregationRule *AggregationRule -} - -// AggregationRule describes how to locate ClusterRoles to aggregate into the ClusterRole -type AggregationRule struct { - // Selector holds a list of selectors which will be used to find ClusterRoles and create the rules. - // If any of the selectors match, then the ClusterRole's permissions will be added - Selectors []metav1.LabelSelector -} -``` - -The `aggregationRule` stanza contains a list of LabelSelectors which are used -to select the set of ClusterRoles which should be combined. When -`aggregationRule` is set, the list of `rules` becomes controller managed and is -subject to overwriting at any point. - -`aggregationRule` needs to be protected from escalation. The simplest way to -do this is to restrict it to users with verb=`*`, apiGroups=`*`, resources=`*`. We -could later loosen it by using a covers check against all aggregated rules -without changing backward compatibility. - -## Controller -There is a controller which watches for changes to ClusterRoles and then -updates all aggregated ClusterRoles if their list of Rules has changed. Since -there are relatively few ClusterRoles, it checks them all and most -short-circuit. - -## The Payoff -If you want to create a CustomResource for your operator and you want namespace -admin's to be able to create one, instead of trying to: - 1. Create a new ClusterRole - 2. Update every namespace with a matching RoleBinding - 3. Teach everyone to add the RoleBinding to all their admin users - 4. When you remove it, clean up dangling RoleBindings - - Or - - 1. Make a non-declarative patch against the admin ClusterRole - 2. 
When you remove it, try to safely create a new non-declarative patch to -remove it. - -You can simply create a new ClusterRole like -```yaml -apiVersion: rbac.authorization.k8s.io/v1beta1 -kind: ClusterRole -metadata: - name: etcd-operator-admin - labels: - rbac.authorization.k8s.io/aggregate-to-admin: "true" -rules: -- apiGroups: - - etcd.database.coreos.com - resources: - - etcdclusters - verbs: - - "*" -``` -alongside your CustomResourceDefinition. The admin role is updated correctly, and -removal is a `kubectl delete -f` away.
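For completeness, a sketch of the controller-managed `admin` ClusterRole itself under this proposal. The `rules` list is owned by the aggregation controller and is overwritten with the union of all selected roles; note that the selector field that eventually shipped is named `clusterRoleSelectors` rather than `selectors`:

```yaml
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: admin
aggregationRule:
  selectors:
  - matchLabels:
      rbac.authorization.k8s.io/aggregate-to-admin: "true"
# rules is controller-managed; direct edits are stomped by the controller.
rules: []
```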
\ No newline at end of file +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/auth/encryption.md b/contributors/design-proposals/auth/encryption.md index 121e06b4..f0fbec72 100644 --- a/contributors/design-proposals/auth/encryption.md +++ b/contributors/design-proposals/auth/encryption.md @@ -1,443 +1,6 @@ -# Encryption +Design proposals have been archived. -## Abstract +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -The scope of this proposal is to ensure that resources can be encrypted at the -datastore layer with sufficient metadata support to enable integration with -multiple encryption providers and key rotation. Encryption will be optional for -any resource, but will be used by default for the Secret resource. Secrets are -already protected in transit via TLS. - -Full disk encryption of the volumes storing etcd data is already expected as -standard security hygiene. Adding the proposed encryption at the datastore -layer defends against malicious parties gaining access to: - -- etcd backups; or -- A running etcd instance without access to memory of the etcd process. - -Allowing sensitive data to be encrypted adheres to best practices as well as -other requirements such as HIPAA. - -## High level design - -Before a resource is written to etcd and after it is read, an encryption -provider will take the plaintext data and encrypt/decrypt it. These providers -will be able to be created and turned on depending on the users needs or -requirements and will adhere to an encryption interface. This interface will -provide the abstraction to allow various encryption mechanisms to be -implemented, as well as for the method of encryption to be rotated over time. - -For the first iteration, a default provider that handles encryption in-process -using a locally stored key on disk will be developed. - -## Kubernetes Storage Changes - -Kubernetes requires that an update that does not change the serialized form of -object not be persisted to etcd to prevent other components from seeing no-op -updates. - -This must be done within the Kubernetes storage interfaces - we will introduce a -new API to the Kube storage layer that transforms the serialized object into the -desired at-rest form and provides hints as to whether no-op updates should still -persist (when key rotation is in effect). - -```go -// ValueTransformer allows a string value to be transformed before being read from or written to the underlying store. The methods -// must be able to undo the transformation caused by the other. -type ValueTransformer interface { - // TransformFromStorage may transform the provided data from its underlying - // storage representation or return an error. Stale is true if the object - // on disk is stale (encrypted with an older key) and a write to etcd - // should be issued, even if the contents of the object have not changed. - TransformFromStorage([]byte) (data []byte, stale bool, err error) - // TransformToStorage may transform the provided data into the appropriate form in storage or return an error. - TransformToStorage([]byte) (data []byte, err error) -} -``` - -When the storage layer of Kubernetes is initialized for some resource, an -implementation of this interface that manages encryption will be passed down. -Other resources can use a no-op provider by default. - -## Encryption Provider - -An encryption provider implements the ValueTransformer interface. 
Out of the box -this proposal will implement encryption using a standard AES-GCM performing -AEAD, using the standard Go library for AES-GCM. - -Each encryption provider will have a unique string identifier to ensure -versioning of the ciphertext in etcd, and to allow future schemes to be added. - -During encryption, only a single provider is required. During decryption, -multiple providers or keys may be in use (when migrating from an older version -of a provider, or when rotating keys), and thus the ValueTransformer -implementation must be able to delegate to the appropriate provider. - -Note that the ValueTransformer is a general storage interface and not related to -encryption directly. The AES implementation linked below combines -ValueTransformer and encryption provider. - -### AES-GCM Encryption provider - -Implemented in [#41939](https://github.com/kubernetes/kubernetes/pull/41939). - -The simplest possible provider is an AES-GCM encrypter/decrypter using AEAD, -where we create a unique nonce on each new write to etcd, use that as the IV for -AES-GCM of the value (the JSON or protobuf data) along with a set of -authenticated data to create the ciphertext, and then on decryption use the -nonce and the authenticated data to decode. - -The provider will be assigned a versioned identifier to uniquely pair the -implementation with the data at rest, such as “k8s-aes-gcm-v1”. Any -implementation that attempts to decode data associated with this provider id -must follow a known structure and apply a specific algorithm. - -Various options for key generation and management are covered in the following -sections. The provider implements one of those schemes to retrieve a set of -keys. One is identified as the write key, all others are used to decrypt data -from previous keys. Keys must be rotated more often than every 2^32 writes. - -The provider will use the recommended Go defaults for all crypto settings -unless otherwise noted. We should use AES-256 keys (32 bytes). - -Process for transforming a value (object encoded as JSON or protobuf) to and -from stable storage will look like the following: - -Layout as written to etcd2 (json safe string only): -``` -NONCE := read(/dev/urandom) -PLAIN_TEXT := <VALUE> -AUTHENTICATED_DATA := ETCD_KEY -CIPHER_TEXT := aes_gcm_encrypt(KEY, IV:NONCE, PLAIN_TEXT, A:AUTHENTICATED_DATA) -BASE64_DATA := base64(<NONCE><CIPHER_TEXT>) -STORED_DATA := <PROVIDER>:<KEY_ID>:<BASE64_DATA> -``` - -Layout as written to etcd3 (bytes): -``` -NONCE := read(/dev/urandom) -PLAIN_TEXT := <VALUE> -AUTHENTICATED_DATA := ETCD_KEY -CIPHER_TEXT := aes_gcm_encrypt(KEY, IV:NONCE, PLAIN_TEXT, A:AUTHENTICATED_DATA) -STORED_DATA := <PROVIDER_ID>:<KEY_ID>:<NONCE><AUTHENTICATED_DATA><CIPHER_TEXT> -``` - -Pseudo-code for encrypt (golang): -```go -block := aes.NewCipher(primaryKeyString) -aead := cipher.NewGCM(c.block) -keyId := primaryKeyId - -// string prefix chosen over a struct to minimize complexity and for write -// serialization performance. 
-// for each write -nonce := make([]byte, block.BlockSize()) -io.ReadFull(crypto_rand.Reader, nonce) -authenticatedData := ETCD_KEY -cipherText := aead.Seal(nil, nonce, value, authenticatedData) -storedData := providerId + keyId + base64.Encode(nonce + authenticatedData + cipherText) -``` - -Pseudo-code for decrypt (golang): -```go -// for each read -providerId, keyId, base64Encoded := // slice provider and key from value - -// ensure this provider is the one handling providerId -aead := // lookup an aead instance for keyId or error -bytes := base64.Decode(base64Encoded) -nonce, authenticatedData, cipherText := // slice from bytes -out, err := aead.Open(nil, nonce, cipherText, authenticatedData) -``` - -### Alternative Considered: SecretBox - -Using [secretbox](https://godoc.org/golang.org/x/crypto/nacl/secretbox) would -also be a good choice for crypto. We decided to go with AES-GCM for the first -implmentation since: - -- No new library required. -- We'd need to manage AEAD ourselves. -- The cache attack is not much of a concern on x86 with AES-NI, but is more so - on ARM - -There's no problem with adding this as an alternative later. - -## Configuration - -We will add the following options to the API server. At API server startup the -user will specify: - -```yaml ---encryption-provider-config=/path/to/config ---encryption-provider=default ---encrypt-resource=v1/Secrets -``` - -The encryption provider will check it has the keys it needs and if not, generate -them as described in the following section. - -## Key Generation, Distribution and Rotation - -To start with we want to support a simple user-driven key generation, -distribution and rotation scheme. Automatic rotation may be achievable in the -future. - -To enable key rotation a common pattern is to have keys used for resource -encryption encrypted by another set of keys (Key Encryption Keys aka KEK). The -keys used for encrypting kubernetes resources (Data Encryption Keys, aka DEK) -are generated by the apiserver and stored encrypted with one of the KEKs. - -In future versions, storing a KEK off-host and off-loading encryption/decryption -of the DEK to AWS KMS, Google Cloud KMS, Hashicorp Vault etc. should be -possible. The decrypted DEK would be cached locally after boot. - -Using a remote encrypt/decrypt API offered by an external store will be limited -to encrypt/decrypt of keys, not the actual resources for performance reasons. - -Incremental deliverable options are presented below. - -### Option 1: Simple list of keys on disk - -In this solution there is no KEK/DEK scheme, just single keys in a list on disk. -They will live in a file specified by the --encryption-provider-config, which -can be an empty file when encryption is turned on. - -If the key file is empty or the user calls PUT on a /rotate API endpoint keys -are generated as follows: - -1. A new encryption key is created. -1. The key is added to a file on the API master with metadata including an ID - and an expiry time. Subsequent calls to rotate will prepend new keys to the - file such that the first key is always the key to use for encryption. -1. The list of keys being used by the master is updated in memory so that the - new key is in the list of read keys. -1. The list of keys being used by the master is updated in memory so that the - new key is the current write key. -1. All secrets are re-encrypted with the new key. - -Pros: - - - Simplicity. 
- - The generate/write/read interfaces can be pluggable for later replacement - with external secret management systems. - - A single master shouldn't require API Server downtime for rotation. - - No unseal step on startup since the file is already present. - - Attacker with access to /rotate is a DoS at worst, it doesn't return any - keys. - -Cons: - - - Coordination of keys between a deployment with multiple masters will require - updating the KeyEncryptionKeyDatabase file on disk and forcing a re-read. - - Users will be responsible for backing up the keyfile from the API server - disk. - -### Option 2: User supplied encryption key - -In this solution there is no KEK/DEK scheme, just single keys managed by the -user. To enable encryption a user specifies the "user-supplied-key" encryption -provider at api startup. Nothing is actually encrypted until the user calls PUT -on a /rotate API endpoint: - -1. A new encryption key is created. -1. The key is provided back to the caller for persistent storage. Within the - cluster, it only lives in memory on the master. -1. The list of keys being used by the master is updated in memory so that the - new key is in the list of read keys. -1. The list of keys being used by the master is updated in memory so that the - new key is is the current write key. -1. All secrets are re-encrypted with the new key. - -On master restart the api server will wait until the user supplies the list of -keys needed to decrypt all secrets in the database. In most cases this will be a -single key unless the re-encryption step was incomplete. - -Pros: - - - Simplicity. - - A single master shouldn't require API Server downtime. - - User is explicitly in control of managing and backing up the encryption keys. - -Cons: - - - Coordination of keys between a deployment with multiple masters is not - possible. This would have to be added as a subsequent feature using a - consensus protocol. - - API master needs to refuse to start and wait on a decrypt key from the user. - - /rotate API needs to be strongly protected: if an attacker can cause a - rotation and get the new key, it might as well not be encrypted at all. - -### Option 3: Encrypted DEKs in etcd, KEKs on disk - -In order to take an API driven approach for key rotation, new API objects (not -exposed over REST) will be defined: - -* Key Encryption Key (KEK) - key used to unlock the Data Encryption Key. Stored - on API server nodes. -* Data Encryption Key (DEK) - long-lived secret encrypted with a KEK. Stored in - etcd encrypted. Unencrypted in-memory in API servers. -* KEK Slot - to support KEK rotation there will be an ordered list of KEKs - stored in the KEK DB. The current active KEK slot number, is stored in etcd - for consistency. -* KEK DB - a file with N KEKs in a JSON list. KEK[0], by definition, is null. - -```go -type DataEncryptionKey struct { - ObjectMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"` - Value string // Encrypted -} -``` - -```go -type KeyEncryptionKeySlot struct { - ObjectMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"` - Slot int -} -``` - -```go -type KeyEncryptionKeyDatabase struct { - metav1.ObjectMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"` - Keys []string -} -``` - -To enable encryption a user must first create a KEK DB file and tell the API -server to use it with `--encryption-provider-config=/path/to/config`. 
The -file will be a simple YAML file that lists all of the keys: - -```yaml -kind: KeyEncryptionKeyDatabase -version: v1 -keys: - - foo - - bar - - baz -``` - -The user will also need to specify the encryption provider and the resources to -encrypt as follows: -```yaml ---encryption-provider-config=/path/to/key-encryption-key/db ---encryption-provider=default ---encrypt-resource=v1/Secrets ---encrypt-resource=v1/ConfigMap -``` - -Then a user calls PUT on a /rotate API endpoint the first time: - -1. A new encryption key (unencrypted DEK) is created. -1. Encrypt DEK with KEK[1] -1. The list of DEKs being used by the master is updated in memory so that the - new key is in the list of read keys. -1. The list of DEKs being used by the master is updated in etcd so that the - new key is in the list of read keys available to all masters. -1. Confirm that all masters have the new DEK for reading. Key point here is that - all readers have the new key before anyone writes with it. -1. The list of DEKs being used by the master is updated in memory so that the - new key is is the current write key. -1. The list of DEKs being used by the master is updated in etcd so that the new - key is the current write key and is available to all masters. It doesn't - matter if there's some masters using the new key and some using the old key, - since we know all masters can read with the new key. Eventually all masters - will be writing with the new key. -1. All secrets are re-encrypted with the new key. - -After N rotation calls: - -1. A new encryption key (unencrypted DEK) is created. -1. Encrypt DEK with KEK[N+1] - -Each rotation generates a new KEK and DEK. Two DEKs will be in-use temporarily -during rotation, but only one at steady-state. - -Pros: - - - Most closely matches the pattern that will be used for integrating with - external encryption systems. Hashicorp Vault, Amazon KMS, Google KMS and HSM - would eventually serve the purpose of KEK storage rather than local disk. - -Cons: - - - End state is still KEKs on disk on the master. This is equivalent to the much - simpler list of keys on disk in terms of key management and security. - Complexity is much higher. - - Coordination of keys between a deployment with multiple masters will require - manually generating and providing a key in the key file then calling rotate - to have the config re-read. Same as keys on disk. - -### Option 4: Protocol for KEK agreement between masters - -TODO: write a proposal for coordinating KEK agreement among multiple masters and -having the KEK be either user supplied or backed by external store. - - -## External providers - -It should be easy for the user to substitute a default encryption provider for -one of the following: - -* A local HSM implementation that retrieves the keys from the secure enclave - prior to reusing the AES-GCM implementation (initialization of keys only) -* Exchanging a local temporary token for the actual decryption tokens from a - networked secret vault -* Decrypting the AES-256 keys from disk using asymmetric encryption combined - with a user input password -* Sending the data over the network to a key management system for encryption - and decryption (Google KMS, Amazon KMS, Hashicorp Vault w/ Transit backend) - -### Backwards Compatibility - -Once a user encrypts any resource in etcd, they are locked to that Kubernetes -version and higher unless they choose to manually decrypt that resource in etcd. -This will be discouraged. 
It will be highly recommended that users discern if -their Kubernetes cluster is on a stable version before enabling encryption. - -### Performance - -Introducing even a relatively well tuned AES-GCM implementation is likely to -have performance implications for Kubernetes. Fortunately, existing -optimizations occur above the storage layer and so the highest penalty will be -incurred on writes when secrets are created or updated. In multi-tenant Kube -clusters secrets tend to have the highest load factor (there are 20-40 resources -types per namespace, but most resources only have 1 instance where secrets might -have 3-9 instances across 10k namespaces). Writes are uncommon, creates usually -happen only when a namespace is created, and reads are somewhat common. - -### Actionable Items / Milestones - -* [p0] Add ValueTransformer to storage (Done in [#41939](https://github.com/kubernetes/kubernetes/pull/41939)) -* [p0] Create a default implementation of AES-GCM interface (Done in [#41939](https://github.com/kubernetes/kubernetes/pull/41939)) -* [p0] Add encryption flags on kube-apiserver and key rotation API -* [p1] Add kubectl command to call /rotate endpoint -* [p1] Audit of default implementation for safety and security -* [p2] E2E and performance testing -* [p2] Documentation and users guide -* [p2] Read cache layer if encrypting/decrypting Secrets adds too much load on kube-apiserver - - -## Alternative Considered: Encrypting the entire etcd database - -It should be easy for the user to substitute a default encryption provider for -one of the following: - -Rather than encrypting individual resources inside the etcd database, another -approach is to encrypt the entire database. - -Pros: - - - Removes the complexity of deciding which types of things should be encrypted - in the database. - - Protects any other sensitive information that might be exposed if etcd - backups are made public accidentally or one of the other desribed attacks - occurs. - -Cons: - - - Unknown, but likely significant performance impact. If it isn't fast enough - you don't get to fall back on only encrypting the really important stuff. - As a counter argument: Docker [implemented their encryption at this - layer](https://docs.docker.com/engine/swarm/swarm_manager_locking/) and have - been happy with the performance. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
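To make the AES-GCM provider sketched above concrete, here is a minimal, self-contained Go example of a ValueTransformer-style transform using only the standard library. It follows the pseudo-code earlier in the document but simplifies the stored layout to `nonce||ciphertext` with a single static key and no provider/key-ID prefix, and it omits the `stale` bookkeeping the real interface returns for key rotation:

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
	"io"
)

// aesGCMTransformer encrypts values with AES-GCM, authenticating the etcd key
// so a ciphertext cannot be silently moved to a different storage path.
type aesGCMTransformer struct {
	aead cipher.AEAD
}

func newAESGCMTransformer(key []byte) (*aesGCMTransformer, error) {
	block, err := aes.NewCipher(key) // 32-byte key => AES-256
	if err != nil {
		return nil, err
	}
	aead, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	return &aesGCMTransformer{aead: aead}, nil
}

// TransformToStorage seals plaintext into nonce||ciphertext.
func (t *aesGCMTransformer) TransformToStorage(plaintext, etcdKey []byte) ([]byte, error) {
	// GCM nonces are aead.NonceSize() (12) bytes, not the AES block size.
	nonce := make([]byte, t.aead.NonceSize())
	if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
		return nil, err
	}
	return t.aead.Seal(nonce, nonce, plaintext, etcdKey), nil
}

// TransformFromStorage splits nonce||ciphertext and opens it.
func (t *aesGCMTransformer) TransformFromStorage(data, etcdKey []byte) ([]byte, error) {
	n := t.aead.NonceSize()
	if len(data) < n {
		return nil, fmt.Errorf("stored data too short")
	}
	return t.aead.Open(nil, data[:n], data[n:], etcdKey)
}

func main() {
	key := make([]byte, 32)
	if _, err := io.ReadFull(rand.Reader, key); err != nil {
		panic(err)
	}
	tr, err := newAESGCMTransformer(key)
	if err != nil {
		panic(err)
	}
	etcdKey := []byte("/registry/secrets/default/my-secret")
	sealed, err := tr.TransformToStorage([]byte(`{"kind":"Secret"}`), etcdKey)
	if err != nil {
		panic(err)
	}
	plain, err := tr.TransformFromStorage(sealed, etcdKey)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(plain))
}
```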
\ No newline at end of file diff --git a/contributors/design-proposals/auth/enhance-pluggable-policy.md b/contributors/design-proposals/auth/enhance-pluggable-policy.md index 29eff236..f0fbec72 100644 --- a/contributors/design-proposals/auth/enhance-pluggable-policy.md +++ b/contributors/design-proposals/auth/enhance-pluggable-policy.md @@ -1,424 +1,6 @@ -# Enhance Pluggable Policy +Design proposals have been archived. -While trying to develop an authorization plugin for Kubernetes, we found a few -places where API extensions would ease development and add power. There are a -few goals: - 1. Provide an authorization plugin that can evaluate a .Authorize() call based -on the full content of the request to RESTStorage. This includes information -like the full verb, the content of creates and updates, and the names of -resources being acted upon. - 1. Provide a way to ask whether a user is permitted to take an action without - running in process with the API Authorizer. For instance, a proxy for exec - calls could ask whether a user can run the exec they are requesting. - 1. Provide a way to ask who can perform a given action on a given resource. -This is useful for answering questions like, "who can create replication -controllers in my namespace". +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -This proposal adds to and extends the existing API to so that authorizers may -provide the functionality described above. It does not attempt to describe how -the policies themselves can be expressed, that is up the authorization plugins -themselves. - -## Enhancements to existing Authorization interfaces - -The existing Authorization interfaces are described -[here](../admin/authorization.md). A couple additions will allow the development -of an Authorizer that matches based on different rules than the existing -implementation. - -### Request Attributes - -The existing authorizer.Attributes only has 5 attributes (user, groups, -isReadOnly, kind, and namespace). If we add more detailed verbs, content, and -resource names, then Authorizer plugins will have the same level of information -available to RESTStorage components in order to express more detailed policy. -The replacement excerpt is below. - -An API request has the following attributes that can be considered for -authorization: - - user - the user-string which a user was authenticated as. This is included -in the Context. - - groups - the groups to which the user belongs. This is included in the -Context. - - verb - string describing the requesting action. Today we have: get, list, -watch, create, update, and delete. The old `readOnly` behavior is equivalent to -allowing get, list, watch. - - namespace - the namespace of the object being access, or the empty string if -the endpoint does not support namespaced objects. This is included in the -Context. - - resourceGroup - the API group of the resource being accessed - - resourceVersion - the API version of the resource being accessed - - resource - which resource is being accessed - - applies only to the API endpoints, such as `/api/v1beta1/pods`. For -miscellaneous endpoints, like `/version`, the kind is the empty string. - - resourceName - the name of the resource during a get, update, or delete -action. 
- - subresource - which subresource is being accessed - -A non-API request has 2 attributes: - - verb - the HTTP verb of the request - - path - the path of the URL being requested - - -### Authorizer Interface - -The existing Authorizer interface is very simple, but there isn't a way to -provide details about allows, denies, or failures. The extended detail is useful -for UIs that want to describe why certain actions are allowed or disallowed. Not -all Authorizers will want to provide that information, but for those that do, -having that capability is useful. In addition, adding a `GetAllowedSubjects` -method that returns back the users and groups that can perform a particular -action makes it possible to answer questions like, "who can see resources in my -namespace" (see [ResourceAccessReview](#ResourceAccessReview) further down). - -```go -// OLD -type Authorizer interface { - Authorize(a Attributes) error -} -``` - -```go -// NEW -// Authorizer provides the ability to determine if a particular user can perform -// a particular action -type Authorizer interface { - // Authorize takes a Context (for namespace, user, and traceability) and - // Attributes to make a policy determination. - // reason is an optional return value that can describe why a policy decision - // was made. Reasons are useful during debugging when trying to figure out - // why a user or group has access to perform a particular action. - Authorize(ctx api.Context, a Attributes) (allowed bool, reason string, evaluationError error) -} - -// AuthorizerIntrospection is an optional interface that provides the ability to -// determine which users and groups can perform a particular action. This is -// useful for building caches of who can see what. For instance, "which -// namespaces can this user see". That would allow someone to see only the -// namespaces they are allowed to view instead of having to choose between -// listing them all or listing none. -type AuthorizerIntrospection interface { - // GetAllowedSubjects takes a Context (for namespace and traceability) and - // Attributes to determine which users and groups are allowed to perform the - // described action in the namespace. This API enables the ResourceBasedReview - // requests below - GetAllowedSubjects(ctx api.Context, a Attributes) (users util.StringSet, groups util.StringSet, evaluationError error) -} -``` - -### SubjectAccessReviews - -This set of APIs answers the question: can a user or group (use authenticated -user if none is specified) perform a given action. Given the Authorizer -interface (proposed or existing), this endpoint can be implemented generically -against any Authorizer by creating the correct Attributes and making an -.Authorize() call. - -There are three different flavors: - -1. `/apis/authorization.kubernetes.io/{version}/subjectAccessReviews` - this -checks to see if a specified user or group can perform a given action at the -cluster scope or across all namespaces. This is a highly privileged operation. -It allows a cluster-admin to inspect rights of any person across the entire -cluster and against cluster level resources. -2. `/apis/authorization.kubernetes.io/{version}/personalSubjectAccessReviews` - -this checks to see if the current user (including his groups) can perform a -given action at any specified scope. This is an unprivileged operation. It -doesn't expose any information that a user couldn't discover simply by trying an -endpoint themselves. -3. 
`/apis/authorization.kubernetes.io/{version}/ns/{namespace}/localSubjectAccessReviews` - -this checks to see if a specified user or group can perform a given action in -**this** namespace. This is a moderately privileged operation. In a multi-tenant -environment, having a namespace scoped resource makes it very easy to reason -about powers granted to a namespace admin. This allows a namespace admin -(someone able to manage permissions inside of one namespaces, but not all -namespaces), the power to inspect whether a given user or group can manipulate -resources in his namespace. - -SubjectAccessReview is runtime.Object with associated RESTStorage that only -accepts creates. The caller POSTs a SubjectAccessReview to this URL and he gets -a SubjectAccessReviewResponse back. Here is an example of a call and its -corresponding return: - -```json -// input -{ - "kind": "SubjectAccessReview", - "apiVersion": "authorization.kubernetes.io/v1", - "authorizationAttributes": { - "verb": "create", - "resource": "pods", - "user": "Clark", - "groups": ["admins", "managers"] - } -} - -// POSTed like this -curl -X POST /apis/authorization.kubernetes.io/{version}/subjectAccessReviews -d @subject-access-review.json -// or -accessReviewResult, err := Client.SubjectAccessReviews().Create(subjectAccessReviewObject) - -// output -{ - "kind": "SubjectAccessReviewResponse", - "apiVersion": "authorization.kubernetes.io/v1", - "allowed": true -} -``` - -PersonalSubjectAccessReview is runtime.Object with associated RESTStorage that -only accepts creates. The caller POSTs a PersonalSubjectAccessReview to this URL -and he gets a SubjectAccessReviewResponse back. Here is an example of a call and -its corresponding return: - -```json -// input -{ - "kind": "PersonalSubjectAccessReview", - "apiVersion": "authorization.kubernetes.io/v1", - "authorizationAttributes": { - "verb": "create", - "resource": "pods", - "namespace": "any-ns", - } -} - -// POSTed like this -curl -X POST /apis/authorization.kubernetes.io/{version}/personalSubjectAccessReviews -d @personal-subject-access-review.json -// or -accessReviewResult, err := Client.PersonalSubjectAccessReviews().Create(subjectAccessReviewObject) - -// output -{ - "kind": "PersonalSubjectAccessReviewResponse", - "apiVersion": "authorization.kubernetes.io/v1", - "allowed": true -} -``` - -LocalSubjectAccessReview is runtime.Object with associated RESTStorage that only -accepts creates. The caller POSTs a LocalSubjectAccessReview to this URL and he -gets a LocalSubjectAccessReviewResponse back. Here is an example of a call and -its corresponding return: - -```json -// input -{ - "kind": "LocalSubjectAccessReview", - "apiVersion": "authorization.kubernetes.io/v1", - "namespace": "my-ns" - "authorizationAttributes": { - "verb": "create", - "resource": "pods", - "user": "Clark", - "groups": ["admins", "managers"] - } -} - -// POSTed like this -curl -X POST /apis/authorization.kubernetes.io/{version}/localSubjectAccessReviews -d @local-subject-access-review.json -// or -accessReviewResult, err := Client.LocalSubjectAccessReviews().Create(localSubjectAccessReviewObject) - -// output -{ - "kind": "LocalSubjectAccessReviewResponse", - "apiVersion": "authorization.kubernetes.io/v1", - "namespace": "my-ns" - "allowed": true -} -``` - -The actual Go objects look like this: - -```go -type AuthorizationAttributes struct { - // Namespace is the namespace of the action being requested. 
Currently, there - // is no distinction between no namespace and all namespaces - Namespace string `json:"namespace" description:"namespace of the action being requested"` - // Verb is one of: get, list, watch, create, update, delete - Verb string `json:"verb" description:"one of get, list, watch, create, update, delete"` - // Resource is one of the existing resource types - ResourceGroup string `json:"resourceGroup" description:"group of the resource being requested"` - // ResourceVersion is the version of resource - ResourceVersion string `json:"resourceVersion" description:"version of the resource being requested"` - // Resource is one of the existing resource types - Resource string `json:"resource" description:"one of the existing resource types"` - // ResourceName is the name of the resource being requested for a "get" or - // deleted for a "delete" - ResourceName string `json:"resourceName" description:"name of the resource being requested for a get or delete"` - // Subresource is one of the existing subresources types - Subresource string `json:"subresource" description:"one of the existing subresources"` -} - -// SubjectAccessReview is an object for requesting information about whether a -// user or group can perform an action -type SubjectAccessReview struct { - kapi.TypeMeta `json:",inline"` - - // AuthorizationAttributes describes the action being tested. - AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"` - // User is optional, but at least one of User or Groups must be specified - User string `json:"user" description:"optional, user to check"` - // Groups is optional, but at least one of User or Groups must be specified - Groups []string `json:"groups" description:"optional, list of groups to which the user belongs"` -} - -// SubjectAccessReviewResponse describes whether or not a user or group can -// perform an action -type SubjectAccessReviewResponse struct { - kapi.TypeMeta - - // Allowed is required. True if the action would be allowed, false otherwise. - Allowed bool - // Reason is optional. It indicates why a request was allowed or denied. - Reason string -} - -// PersonalSubjectAccessReview is an object for requesting information about -// whether a user or group can perform an action -type PersonalSubjectAccessReview struct { - kapi.TypeMeta `json:",inline"` - - // AuthorizationAttributes describes the action being tested. - AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"` -} - -// PersonalSubjectAccessReviewResponse describes whether this user can perform -// an action -type PersonalSubjectAccessReviewResponse struct { - kapi.TypeMeta - - // Namespace is the namespace used for the access review - Namespace string - // Allowed is required. True if the action would be allowed, false otherwise. - Allowed bool - // Reason is optional. It indicates why a request was allowed or denied. - Reason string -} - -// LocalSubjectAccessReview is an object for requesting information about -// whether a user or group can perform an action -type LocalSubjectAccessReview struct { - kapi.TypeMeta `json:",inline"` - - // AuthorizationAttributes describes the action being tested. 
- AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"` - // User is optional, but at least one of User or Groups must be specified - User string `json:"user" description:"optional, user to check"` - // Groups is optional, but at least one of User or Groups must be specified - Groups []string `json:"groups" description:"optional, list of groups to which the user belongs"` -} - -// LocalSubjectAccessReviewResponse describes whether or not a user or group can -// perform an action -type LocalSubjectAccessReviewResponse struct { - kapi.TypeMeta - - // Namespace is the namespace used for the access review - Namespace string - // Allowed is required. True if the action would be allowed, false otherwise. - Allowed bool - // Reason is optional. It indicates why a request was allowed or denied. - Reason string -} -``` - -### ResourceAccessReview - -This set of APIs answers the question: which users and groups can perform the -specified verb on the specified resourceKind. Given the Authorizer interface -described above, this endpoint can be implemented generically against any -Authorizer by calling the .GetAllowedSubjects() function. - -There are two different flavors: - -1. `/apis/authorization.kubernetes.io/{version}/resourceAccessReview` - this -checks to see which users and groups can perform a given action at the cluster -scope or across all namespaces. This is a highly privileged operation. It allows -a cluster-admin to inspect rights of all subjects across the entire cluster and -against cluster level resources. -2. `/apis/authorization.kubernetes.io/{version}/ns/{namespace}/localResourceAccessReviews` - -this checks to see which users and groups can perform a given action in **this** -namespace. This is a moderately privileged operation. In a multi-tenant -environment, having a namespace scoped resource makes it very easy to reason -about powers granted to a namespace admin. This allows a namespace admin -(someone able to manage permissions inside of one namespaces, but not all -namespaces), the power to inspect which users and groups can manipulate -resources in his namespace. - -ResourceAccessReview is a runtime.Object with associated RESTStorage that only -accepts creates. The caller POSTs a ResourceAccessReview to this URL and he gets -a ResourceAccessReviewResponse back. Here is an example of a call and its -corresponding return: - -```json -// input -{ - "kind": "ResourceAccessReview", - "apiVersion": "authorization.kubernetes.io/v1", - "authorizationAttributes": { - "verb": "list", - "resource": "replicationcontrollers" - } -} - -// POSTed like this -curl -X POST /apis/authorization.kubernetes.io/{version}/resourceAccessReviews -d @resource-access-review.json -// or -accessReviewResult, err := Client.ResourceAccessReviews().Create(resourceAccessReviewObject) - -// output -{ - "kind": "ResourceAccessReviewResponse", - "apiVersion": "authorization.kubernetes.io/v1", - "namespace": "default" - "users": ["Clark", "Hubert"], - "groups": ["cluster-admins"] -} -``` - -The actual Go objects look like this: - -```go -// ResourceAccessReview is a means to request a list of which users and groups -// are authorized to perform the action specified by spec -type ResourceAccessReview struct { - kapi.TypeMeta `json:",inline"` - - // AuthorizationAttributes describes the action being tested. 
- AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"` -} - -// ResourceAccessReviewResponse describes who can perform the action -type ResourceAccessReviewResponse struct { - kapi.TypeMeta - - // Users is the list of users who can perform the action - Users []string - // Groups is the list of groups who can perform the action - Groups []string -} - -// LocalResourceAccessReview is a means to request a list of which users and -// groups are authorized to perform the action specified in a specific namespace -type LocalResourceAccessReview struct { - kapi.TypeMeta `json:",inline"` - - // AuthorizationAttributes describes the action being tested. - AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"` -} - -// LocalResourceAccessReviewResponse describes who can perform the action -type LocalResourceAccessReviewResponse struct { - kapi.TypeMeta - - // Namespace is the namespace used for the access review - Namespace string - // Users is the list of users who can perform the action - Users []string - // Groups is the list of groups who can perform the action - Groups []string -} -``` +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
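As an illustration of how a plugin might implement the proposed `Authorize` signature, here is a minimal, self-contained sketch that allows only read verbs within a single namespace. The `Attributes` struct below is a simplified stand-in for the real interface, and the Context argument is omitted for brevity:

```go
package main

import "fmt"

// Attributes is a simplified stand-in for the authorizer attributes described
// above (user, groups, verb, namespace, resource, resourceName, subresource).
type Attributes struct {
	User      string
	Groups    []string
	Verb      string
	Namespace string
	Resource  string
}

// readOnlyNamespaceAuthorizer allows only get/list/watch in a single namespace.
type readOnlyNamespaceAuthorizer struct {
	namespace string
}

func (a *readOnlyNamespaceAuthorizer) Authorize(attrs Attributes) (allowed bool, reason string, err error) {
	if attrs.Namespace != a.namespace {
		return false, fmt.Sprintf("namespace %q is not managed by this authorizer", attrs.Namespace), nil
	}
	switch attrs.Verb {
	case "get", "list", "watch":
		return true, "read-only access is permitted", nil
	default:
		return false, fmt.Sprintf("verb %q is not allowed in namespace %q", attrs.Verb, a.namespace), nil
	}
}

func main() {
	authz := &readOnlyNamespaceAuthorizer{namespace: "my-ns"}
	allowed, reason, _ := authz.Authorize(Attributes{
		User: "Clark", Verb: "create", Namespace: "my-ns", Resource: "pods",
	})
	fmt.Println(allowed, reason)
}
```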
\ No newline at end of file diff --git a/contributors/design-proposals/auth/flex-volumes-drivers-psp.md b/contributors/design-proposals/auth/flex-volumes-drivers-psp.md index 453d736f..f0fbec72 100644 --- a/contributors/design-proposals/auth/flex-volumes-drivers-psp.md +++ b/contributors/design-proposals/auth/flex-volumes-drivers-psp.md @@ -1,94 +1,6 @@ -# Allow Pod Security Policy to manage access to the Flexvolumes +Design proposals have been archived. -## Current state +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Cluster admins can control the usage of specific volume types by using Pod -Security Policy (PSP). Admins can allow the use of Flexvolumes by listing the -`flexVolume` type in the `volumes` field. The only thing that can be managed is -allowance or disallowance of Flexvolumes. -Technically, Flexvolumes are implemented as vendor drivers. They are executable -files that must be placed on every node at -`/usr/libexec/kubernetes/kubelet-plugins/volume/exec/<vendor~driver>/<driver>`. -In most cases they are scripts. Limiting driver access means not only limiting -an access to the volumes that this driver can provide, but also managing access -to executing a driver’s code (that is arbitrary, in fact). - -It is possible to have many flex drivers for the different storage types. In -essence, Flexvolumes represent not a single volume type, but the different -types that allow usage of various vendor volumes. - -## Desired state - -In order to further improve security and to provide more granular control for -the usage of the different Flexvolumes, we need to enhance PSP. When such a -change takes place, cluster admins will be able to grant access to any -Flexvolumes of a particular driver (in contrast to any volume of all drivers). - -For example, if we have two drivers for Flexvolumes (`cifs` and -`digitalocean`), it will become possible to grant access for one group to use -only volumes from DigitalOcean and grant access for another group to use -volumes from all Flexvolumes. - -## Proposed changes - -It has been suggested to add a whitelist of allowed Flexvolume drivers to the -PSP. It should behave similar to [the existing -`allowedHostPaths`](https://github.com/kubernetes/kubernetes/pull/50212) except -that: - -1) comparison of equality will be used instead of comparison of prefixes. -2) Flexvolume’s driver field will be inspected rather than `hostPath`’s path field. - -### PodSecurityPolicy modifications - -```go -// PodSecurityPolicySpec defines the policy enforced. -type PodSecurityPolicySpec struct { - ... - // AllowedFlexVolumes is a whitelist of allowed Flexvolumes. Empty or nil indicates that all - // Flexvolumes may be used. This parameter is effective only when the usage of the Flexvolumes - // is allowed in the "Volumes" field. - // +optional - AllowedFlexVolumes []AllowedFlexVolume -} - -// AllowedFlexVolume represents a single Flexvolume that is allowed to be used. -type AllowedFlexVolume struct { - // Driver is the name of the Flexvolume driver. - Driver string -} -``` - -Empty `AllowedFlexVolumes` allows usage of Flexvolumes with any driver. It must -behave as before and provide backward compatibility. - -Non-empty `AllowedFlexVolumes` changes the behavior from "all allowed" to "all -disallowed except those that are explicitly listed here". - -### Admission controller modifications - -Admission controller should be updated accordingly to inspect a Pod's volumes. 
-If it finds a `flexVolume`, it should ensure that its driver is allowed to be -used. - -### Validation rules - -Flexvolume driver names must be non-empty. - -If a PSP disallows pods from requesting volumes of type `flexVolume`, then -`AllowedFlexVolumes` must be empty. If it is not empty, the API server must -report an error. - -The API server should allow granting access to Flexvolumes that do not exist at -the time of PSP creation. - -## Notes -It is possible to have even more flexible control over the Flexvolumes and take -into account options that have been passed to a driver. We decided that this is -a desirable feature but outside the scope of this proposal. - -The current change could be enough for many cases. Also, when cluster admins -are able to manage access to particular Flexvolume drivers, it becomes possible -to "emulate" control over the driver’s options by using many drivers with -hard-coded options. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
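As a usage sketch, a PodSecurityPolicy limited to the two example drivers discussed above could look like the following; the driver names are illustrative, and the unrelated required PSP fields are filled with permissive defaults:

```yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: flex-restricted
spec:
  # flexVolume must still be listed in volumes; allowedFlexVolumes narrows it further.
  volumes:
  - flexVolume
  allowedFlexVolumes:
  - driver: cifs
  - driver: digitalocean
  # Remaining required PSP fields, kept permissive for brevity.
  seLinux:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
```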
\ No newline at end of file diff --git a/contributors/design-proposals/auth/image-provenance.md b/contributors/design-proposals/auth/image-provenance.md index 942ea416..f0fbec72 100644 --- a/contributors/design-proposals/auth/image-provenance.md +++ b/contributors/design-proposals/auth/image-provenance.md @@ -1,326 +1,6 @@ +Design proposals have been archived. -# Overview +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Organizations wish to avoid running "unapproved" images. -The exact nature of "approval" is beyond the scope of Kubernetes, but may include reasons like: - - - only run images that are scanned to confirm they do not contain vulnerabilities - - only run images that use a "required" base image - - only run images that contain binaries which were built from peer reviewed, checked-in source - by a trusted compiler toolchain. - - only allow images signed by certain public keys. - - - etc... - -Goals of the design include: -* Block creation of pods that would cause "unapproved" images to run. -* Make it easy for users or partners to build "image provenance checkers" which check whether images are "approved". - * We expect there will be multiple implementations. -* Allow users to request an "override" of the policy in a convenient way (subject to the override being allowed). - * "overrides" are needed to allow "emergency changes", but need to not happen accidentally, since they may - require tedious after-the-fact justification and affect audit controls. - -Non-goals include: -* Encoding image policy into Kubernetes code. -* Implementing objects in core kubernetes which describe complete policies for what images are approved. - * A third-party implementation of an image policy checker could optionally use ThirdPartyResource to store its policy. -* Kubernetes core code dealing with concepts of image layers, build processes, source repositories, etc. - * We expect there will be multiple PaaSes and/or de-facto programming environments, each with different takes on - these concepts. At any rate, Kubernetes is not ready to be opinionated on these concepts. -* Sending more information than strictly needed to a third-party service. - * Information sent by Kubernetes to a third-party service constitutes an API of Kubernetes, and we want to - avoid making these broader than necessary, as it restricts future evolution of Kubernetes, and makes - Kubernetes harder to reason about. Also, excessive information limits cache-ability of decisions. Caching - reduces latency and allows short outages of the backend to be tolerated. - - -Detailed discussion in [Ensuring only images are from approved sources are run]( -https://github.com/kubernetes/kubernetes/issues/22888). - -# Implementation - -A new admission controller will be added. That will be the only change. - -## Admission controller - -An `ImagePolicyWebhook` admission controller will be written. The admission controller examines all pod objects which are -created or updated. It can either admit the pod, or reject it. If it is rejected, the request sees a `403 FORBIDDEN` - -The admission controller code will go in `plugin/pkg/admission/imagepolicy`. - -There will be a cache of decisions in the admission controller. - -If the apiserver cannot reach the webhook backend, it will log a warning and either admit or deny the pod. -A flag will control whether it admits or denies on failure. 
-The rationale for deny is that an attacker could DoS the backend or wait for it to be down, and then sneak a -bad pod into the system. The rationale for allow here is that, if the cluster admin also does -after-the-fact auditing of what images were run (which we think will be common), this will catch -any bad images run during periods of backend failure. With default-allow, the availability of Kubernetes does -not depend on the availability of the backend. - -# Webhook Backend - -The admission controller code in that directory does not contain logic to make an admit/reject decision. Instead, it extracts -relevant fields from the Pod creation/update request and sends those fields to a Backend (which we have been loosely calling "WebHooks" -in Kubernetes). The request the admission controller sends to the backend is called a WebHook request to distinguish it from the -request being admission-controlled. The server that accepts the WebHook request from Kubernetes is called the "Backend" -to distinguish it from the WebHook request itself, and from the API server. - -The whole system will work similarly to the [Authentication WebHook]( -https://github.com/kubernetes/kubernetes/pull/24902 -) or the [AuthorizationWebHook]( -https://github.com/kubernetes/kubernetes/pull/20347). - -The WebHook request can optionally authenticate itself to its backend using a token from a `kubeconfig` file. - -The WebHook request and response are JSON, and correspond to the following `go` structures: - -```go -// Filename: pkg/apis/imagepolicy.k8s.io/register.go -package imagepolicy - -// ImageReview checks if the set of images in a pod are allowed. -type ImageReview struct { - unversioned.TypeMeta - - // Spec holds information about the pod being evaluated - Spec ImageReviewSpec - - // Status is filled in by the backend and indicates whether the pod should be allowed. - Status ImageReviewStatus - } - -// ImageReviewSpec is a description of the pod creation request. -type ImageReviewSpec struct { - // Containers is a list of a subset of the information in each container of the Pod being created. - Containers []ImageReviewContainerSpec - // Annotations is a list of key-value pairs extracted from the Pod's annotations. - // It only includes keys which match the pattern `*.image-policy.k8s.io/*`. - // It is up to each webhook backend to determine how to interpret these annotations, if at all. - Annotations map[string]string - // Namespace is the namespace the pod is being created in. - Namespace string -} - -// ImageReviewContainerSpec is a description of a container within the pod creation request. -type ImageReviewContainerSpec struct { - Image string - // In future, we may add command line overrides, exec health check command lines, and so on. -} - -// ImageReviewStatus is the result of the token authentication request. -type ImageReviewStatus struct { - // Allowed indicates that all images were allowed to be run. - Allowed bool - // Reason should be empty unless Allowed is false in which case it - // may contain a short description of what is wrong. Kubernetes - // may truncate excessively long errors when displaying to the user. - Reason string -} -``` - -## Extending with Annotations - -All annotations on a Pod that match `*.image-policy.k8s.io/*` are sent to the webhook. -Sending annotations allows users who are aware of the image policy backend to send -extra information to it, and for different backends implementations to accept -different information. 
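As an illustration, a backend can be as small as the HTTP handler sketched below: it decodes an `ImageReview`, applies whatever policy it likes, and writes back an `ImageReviewStatus`. The local struct definitions mirror the Go types above with JSON tags added; the allowed-registry prefix and the `break-glass.image-policy.k8s.io/ticket` annotation key are made-up examples, not part of this proposal.

```go
package main

import (
	"encoding/json"
	"net/http"
	"strings"
)

type imageReview struct {
	Spec   imageReviewSpec   `json:"spec"`
	Status imageReviewStatus `json:"status"`
}

type imageReviewSpec struct {
	Containers  []imageReviewContainerSpec `json:"containers"`
	Annotations map[string]string          `json:"annotations"`
	Namespace   string                     `json:"namespace"`
}

type imageReviewContainerSpec struct {
	Image string `json:"image"`
}

type imageReviewStatus struct {
	Allowed bool   `json:"allowed"`
	Reason  string `json:"reason,omitempty"`
}

func main() {
	http.HandleFunc("/review", func(w http.ResponseWriter, r *http.Request) {
		var review imageReview
		if err := json.NewDecoder(r.Body).Decode(&review); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}

		review.Status.Allowed = true
		// Hypothetical "break glass" annotation: skip the registry check entirely.
		if review.Spec.Annotations["break-glass.image-policy.k8s.io/ticket"] == "" {
			for _, c := range review.Spec.Containers {
				if !strings.HasPrefix(c.Image, "registry.example.com/") {
					review.Status.Allowed = false
					review.Status.Reason = "image " + c.Image + " is not from an approved registry"
					break
				}
			}
		}

		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(review)
	})
	http.ListenAndServe(":8080", nil) // TLS and client authentication omitted for brevity
}
```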
- -Examples of information you might put here are - -- request to "break glass" to override a policy, in case of emergency. -- a ticket number from a ticket system that documents the break-glass request -- provide a hint to the policy server as to the imageID of the image being provided, to save it a lookup - -In any case, the annotations are provided by the user and are not validated by Kubernetes in any way. In the future, if an annotation is determined to be widely -useful, we may promote it to a named field of ImageReviewSpec. - -In the case of a Pod update, Kubernetes may send the backend either all images in the updated image, or only the ones that -changed, at its discretion. - -## Interaction with Controllers - -In the case of a Deployment object, no image check is done when the Deployment object is created or updated. -Likewise, no check happens when the Deployment controller creates a ReplicaSet. The check only happens -when the ReplicaSet controller creates a Pod. Checking Pod is necessary since users can directly create pods, -and since third-parties can write their own controllers, which kubernetes might not be aware of or even contain -pod templates. - -The ReplicaSet, or other controller, is responsible for recognizing when a 403 has happened -(whether due to user not having permission due to bad image, or some other permission reason) -and throttling itself and surfacing the error in a way that CLIs and UIs can show to the user. - -Issue [22298](https://github.com/kubernetes/kubernetes/issues/22298) needs to be resolved to -propagate Pod creation errors up through a stack of controllers. - -## Changes in policy over time - -The Backend might change the policy over time. For example, yesterday `redis:v1` was allowed, but today `redis:v1` is not allowed -due to a CVE that just came out (fictional scenario). In this scenario: -. - -- a newly created replicaSet will be unable to create Pods. -- updating a deployment will be safe in the sense that it will detect that the new ReplicaSet is not scaling - up and not scale down the old one. -- an existing replicaSet will be unable to create Pods that replace ones which are terminated. If this is due to - slow loss of nodes, then there should be time to react before significant loss of capacity. -- For non-replicated things (size 1 ReplicaSet, StatefulSet), a single node failure may disable it. -- a node rolling update will eventually check for liveness of replacements, and would be throttled if - in the case when the image was no longer allowed and so replacements could not be started. -- rapid node restarts will cause existing pod objects to be restarted by kubelet. -- slow node restarts or network partitions will cause node controller to delete pods and there will be no replacement - -It is up to the Backend implementor, and the cluster administrator who decides to use that backend, to decide -whether the Backend should be allowed to change its mind. There is a tradeoff between responsiveness -to changes in policy, versus keeping existing services running. The two models that make sense are: - -- never change a policy, unless some external process has ensured no active objects depend on the to-be-forbidden - images. -- change a policy and assume that transition to new image happens faster than the existing pods decay. - -## Ubernetes - -If two clusters share an image policy backend, then they will have the same policies. 
- -The clusters can pass different tokens to the backend, and the backend can use this to distinguish -between different clusters. - -## Image tags and IDs - -Image tags are like: `myrepo/myimage:v1`. - -Image IDs are like: `myrepo/myimage@sha256:beb6bd6a68f114c1dc2ea4b28db81bdf91de202a9014972bec5e4d9171d90ed`. -You can see image IDs with `docker images --no-trunc`. - -The Backend needs to be able to resolve tags to IDs (by talking to the images repo). -If the Backend resolves tags to IDs, there is some risk that the tag-to-ID mapping will be -modified after approval by the Backend, but before Kubelet pulls the image. We will not address this -race condition at this time. - -We will wait and see how much demand there is for closing this hole. If the community demands a solution, -we may suggest one of these: - -1. Use a backend that refuses to accept images that are specified with tags, and require users to resolve to IDs - prior to creating a pod template. - - [kubectl could be modified to automate this process](https://github.com/kubernetes/kubernetes/issues/1697) - - a CI/CD system or templating system could be used that maps IDs to tags before Deployment modification/creation. -1. Audit logs from kubelets to see image IDs were actually run, to see if any unapproved images slipped through. -1. Monitor tag changes in image repository for suspicious activity, or restrict remapping of tags after initial application. - -If none of these works well, we could do the following: - -- Image Policy Admission Controller adds new field to Pod, e.g. `pod.spec.container[i].imageID` (or an annotation). - and kubelet will enforce that both the imageID and image match the image pulled. - -Since this adds complexity and interacts with imagePullPolicy, we avoid adding the above feature initially. - -### Caching - -There will be a cache of decisions in the admission controller. -TTL will be user-controllable, but default to 1 hour for allows and 30s for denies. -Low TTL for deny allows user to correct a setting on the backend and see the fix -rapidly. It is assumed that denies are infrequent. -Caching allows permits RC to scale up services even during short unavailability of the webhook backend. -The ImageReviewSpec is used as the key to the cache. - -In the case of a cache miss and timeout talking to the backend, the default is to allow Pod creation. -Keeping services running is more important than a hypothetical threat from an un-verified image. - - -### Post-pod-creation audit - -There are several cases where an image not currently allowed might still run. Users wanting a -complete audit solution are advised to also do after-the-fact auditing of what images -ran. This can catch: - -- images allowed due to backend not reachable -- images that kept running after policy change (e.g. CVE discovered) -- images started via local files or http option of kubelet -- checking SHA of images allowed by a tag which was remapped - -This proposal does not include post-pod-creation audit. - -## Alternatives considered - -### Admission Control on Controller Objects - -We could have done admission control on Deployments, Jobs, ReplicationControllers, and anything else that creates a Pod, directly or indirectly. -This approach is good because it provides immediate feedback to the user that the image is not allowed. However, we do not expect disallowed images -to be used often. And controllers need to be able to surface problems creating pods for a variety of other reasons anyways. 
- -Other good things about this alternative are: - -- Fewer calls to Backend, once per controller rather than once per pod creation. Caching in backend should be able to help with this, though. -- End user that created the object is seen, rather than the user of the controller process. This can be fixed by implementing `Impersonate-User` for controllers. - -Other problems are: - -- Works only with "core" controllers. Need to update admission controller if we add more "core" controllers. Won't work with "third party controllers", e.g. how we open-source distributed systems like hadoop, spark, zookeeper, etc running on kubernetes. Because those controllers don't have config that can be "admission controlled", or if they do, schema is not known to admission controller, have to "search" for pod templates in json. Yuck. -- How would it work if user created pod directly, which is allowed, and the recommended way to run something at most once. - -### Sending User to Backend - -We could have sent the username of the pod creator to the backend. The username could be used to allow different users to run -different categories of images. This would require propagating the username from e.g. Deployment creation, through to -Pod creation via, e.g. the `Impersonate-User:` header. This feature is [not ready](https://github.com/kubernetes/kubernetes/issues/27152). - When it is, we will re-evaluate adding user as a field of `ImagePolicyRequest`. - -### Enforcement at Docker level - -Docker supports plugins which can check any container creation before it happens. For example the [twistlock/authz](https://github.com/twistlock/authz) -Docker plugin can audit the full request sent to the Docker daemon and approve or deny it. This could include checking if the image is allowed. - -We reject this option because: -- it requires all nodes to be able to configured with how to reach the Backend, which complicates node setup. -- it may not work with other runtimes -- propagating error messages back to the user is more difficult -- it requires plumbing additional information about requests to nodes (if we later want to consider `User` in policy). - -### Policy Stored in API - -We decided to store policy about what SecurityContexts a pod can have in the API, via PodSecurityPolicy. -This is because Pods are a Kubernetes object, and the Policy is very closely tied to the definition of Pods, -and grows in step as the Pods API grows. - -For Image policy, the connection is not as strong. To Kubernetes API, and Image is just a string, and it -does not know any of the image metadata, which lives outside the API. - -Image policy may depend on the Dockerfile, the source code, the source repo, the source review tools, -vulnerability databases, and so on. Kubernetes does not have these as built-in concepts or have plans to add -them anytime soon. - -### Registry whitelist/blacklist - -We considered a whitelist/blacklist of registries and/or repositories. Basically, a prefix match on image strings. - The problem of approving images would be then pushed to a problem of controlling who has access to push to a -trusted registry/repository. That approach is simple for kubernetes. Problems with it are: - -- tricky to allow users to share a repository but have different image policies per user or per namespace. 
-- tricky to do things after image push, such as scan image for vulnerabilities (such as Docker Nautilus), and have those results considered by policy -- tricky to block "older" versions from running, whose interaction with current system may not be well understood. -- how to allow emergency override? -- hard to change policy decision over time. - -We still want to use rkt trust, docker content trust, etc for any registries used. We just need additional -image policy checks beyond what trust can provide. - -### Send every Request to a Generic Admission Control Backend - -Instead of just sending a subset of PodSpec to an Image Provenance backed, we could have sent every object -that is created or updated (or deleted?) to one or ore Generic Admission Control Backends. - -This might be a good idea, but needs quite a bit more thought. Some questions with that approach are: -It will not be a generic webhook. A generic webhook would need a lot more discussion: - -- a generic webhook needs to touch all objects, not just pods. So it won't have a fixed schema. How to express this in our IDL? Harder to write clients - that interpret unstructured data rather than a fixed schema. Harder to version, and to detect errors. -- a generic webhook client needs to ignore kinds it does not care about, or the apiserver needs to know which backends care about which kinds. How - to specify which backends see which requests. Sending all requests including high-rate requests like events and pod-status updated, might be - too high a rate for some backends? - -Additionally, just sending all the fields of just the Pod kind also has problems: -- it exposes our whole API to a webhook backend without giving us (the project) any chance to review or understand how it is being used. -- because we do not know which fields of an object are inspected by the backend, caching of decisions is not effective. Sending fewer fields allows caching. -- sending fewer fields makes it possible to rev the version of the webhook request slower than the version of our internal objects (e.g. pod v2 could still use imageReview v1.) -probably lots more reasons. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file
diff --git a/contributors/design-proposals/auth/kms-grpc-class-diagram.png b/contributors/design-proposals/auth/kms-grpc-class-diagram.png
Binary files differ
deleted file mode 100644
index fe63d8d0..00000000
--- a/contributors/design-proposals/auth/kms-grpc-class-diagram.png
+++ /dev/null
diff --git a/contributors/design-proposals/auth/kms-grpc-deployment-diagram.png b/contributors/design-proposals/auth/kms-grpc-deployment-diagram.png
Binary files differ
deleted file mode 100644
index c5ff1df5..00000000
--- a/contributors/design-proposals/auth/kms-grpc-deployment-diagram.png
+++ /dev/null
diff --git a/contributors/design-proposals/auth/kms-plugin-grpc-api.md b/contributors/design-proposals/auth/kms-plugin-grpc-api.md
index fd268ba7..f0fbec72 100644
--- a/contributors/design-proposals/auth/kms-plugin-grpc-api.md
+++ b/contributors/design-proposals/auth/kms-plugin-grpc-api.md
@@ -1,114 +1,6 @@
-# KMS Plugin API for secrets encryption
+Design proposals have been archived.
-## Background
+To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/).
-Since v1.7, Kubernetes has supported encryption of resources. It supports three kinds of encryption: ``aescbc``, ``aesgcm`` and ``secretbox``, implemented as value transformers. This feature currently only supports encryption using keys in the configuration file (plain text, encoded with base64).
-Using an external trusted service to manage the keys separates the responsibility of key management from operating and managing a Kubernetes cluster. So a new transformer, the "Envelope Transformer", was introduced in 1.8 ([49350](https://github.com/kubernetes/kubernetes/pull/49350)). The "Envelope Transformer" defines an extension point, the interface ``envelope.Service``. The intent was to make it easy to add a new KMS provider by implementing that interface, for example providers for Google Cloud KMS, HashiCorp Vault and Microsoft Azure Key Vault.
-
-But as more KMS providers are added, more vendor dependencies are also introduced. We therefore wish to pull all KMS providers out of the API server, while retaining the ability of the API server to delegate encrypting secrets to an external trusted KMS service.
-
-## High Level Design
-
-At a high level (see [51965](https://github.com/kubernetes/kubernetes/issues/51965)), we use gRPC to decouple the API server from the out-of-tree KMS providers. There is only one envelope service implementation (implementing the interface ``envelope.Service``), and it communicates with the out-of-tree KMS provider through gRPC. The deployment diagram is shown below:
-
-
-
-Here we assume the remote KMS provider is accessible from the API server. How the KMS provider process is launched and managed is not covered in this document.
-
-The API server side (gRPC client) should know nothing about the external KMS. We only need to configure the KMS provider (gRPC server) endpoint for it.
-
-The KMS provider (gRPC server) must handle all details related to the external KMS. It needs to know how to connect to the KMS, how to authenticate, which key or keys will be used, and so on. A well-behaved KMS provider implementation should hide all of these details from the API server.
-
-To add a new KMS provider, we just implement the gRPC server. No new code or dependencies are added to the API server; we only configure the API server so that the gRPC client communicates with the new KMS provider.
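To make the decoupling concrete, below is a minimal sketch of what the API-server-side `envelope.grpcService` could look like. The `kmspb` import path stands for Go stubs generated from the proto definition later in this section, the `Encrypt`/`Decrypt` signatures are assumed to match the existing `envelope.Service` interface, and the unix-socket dialing details are illustrative rather than prescriptive.

```go
package envelope

import (
	"context"
	"time"

	"google.golang.org/grpc"

	// Placeholder import path: kmspb stands for the Go package generated
	// from the envelope/service.proto definition below.
	kmspb "example.com/generated/envelope"
)

const apiVersion = "v1beta1"

// grpcService implements the existing envelope.Service interface by calling
// the remote KMS provider over a local unix domain socket.
type grpcService struct {
	client kmspb.KeyManagementServiceClient
}

// newGRPCService connects to a provider listening on socketPath,
// e.g. "/tmp/kms-provider.sock".
func newGRPCService(socketPath string) (*grpcService, error) {
	conn, err := grpc.Dial(
		"unix://"+socketPath, // grpc-go resolves unix:// targets to the socket directly
		grpc.WithInsecure(),  // local unix domain socket; no TLS (see the Deployment section below)
	)
	if err != nil {
		return nil, err
	}
	return &grpcService{client: kmspb.NewKeyManagementServiceClient(conn)}, nil
}

func (s *grpcService) Encrypt(data []byte) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	resp, err := s.client.Encrypt(ctx, &kmspb.EncryptRequest{Version: apiVersion, Plain: data})
	if err != nil {
		return nil, err
	}
	return resp.Cipher, nil
}

func (s *grpcService) Decrypt(data []byte) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	resp, err := s.client.Decrypt(ctx, &kmspb.DecryptRequest{Version: apiVersion, Cipher: data})
	if err != nil {
		return nil, err
	}
	return resp.Plain, nil
}
```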
- -Following is the class diagram that illustrates a possible implementation: - - - -The class ``envelope.envelopeTransformer`` and the interface ``envelope.Service`` exists in current code base. What we need to do is to implement the class ``envelope.grpcService``. - -## Proto File Definition - -```protobuf -// envelope/service.proto -syntax = "proto3"; - -package envelope; - -service KeyManagementService { - // Version returns the runtime name and runtime version. - rpc Version(VersionRequest) returns (VersionResponse) {} - rpc Decrypt(DecryptRequest) returns (DecryptResponse) {} - rpc Encrypt(EncryptRequest) returns (EncryptResponse) {} -} - -message VersionRequest { - // Version of the KMS plugin API. - string version = 1; -} - -message VersionResponse { - // Version of the KMS plugin API. - string version = 1; - // Name of the KMS provider. - string runtime_name = 2; - // Version of the KMS provider. The string must be semver-compatible. - string runtime_version = 3; -} - -message DecryptRequest { - // Version of the KMS plugin API, now use “v1beta1” - string version = 1; - bytes cipher = 2; -} - -message DecryptResponse { - bytes plain = 1; -} - -message EncryptRequest { - // Version of the KMS plugin API, now use “v1beta1” - string version = 1; - bytes plain = 2; -} - -message EncryptResponse { - bytes cipher = 1; -} -``` - -## Deployment - -To avoid the need to implement authentication and authorization, the KMS provider should run on the master and be called via a local unix domain socket. - -Cluster administrators have various options to ensure the KMS provider runs on the master: taints and tolerations is one. On GKE we target configuration at the kubelet that runs on the master directly (it isn't registered as a regular kubelet) and will have it start the KMS provider. - -## Performance - -The KMS provider will be called on every secret write and make a remote RPC to the KMS provider to do the actual encrypt/decrypt. To keep the overhead of the gRPC call to the KMS provider low, the KMS provider should run on the master. This should mean the extra overhead is small compared to the remote call. - -Unencrypted DEKs are cached on the API server side, so gRPC calls to the KMS provider are only required to fill the cache on startup. - -## Configuration - -The out-of-tree provider will be specified in existing configuration file used to configure any of the encryption providers. The location of this configuration file is identified by the existing startup parameter ``--experimental-encryption-provider-config``. - -To specify the gRPC server endpoint, we add a new configuration parameter ``endpoint`` for the KMS configuration in a current deployment. The endpoint is a unix domain socket connection, for example ``unix:///tmp/kms-provider.sock``. - -Now we expect the API server and KMS provider run in the same Pod. Not support TCP socket connection. So it’s not necessary to add TLS support. - -Here is a sample configuration file with vault out-of-tree provider configured: - -```yaml -kind: EncryptionConfig -apiVersion: v1 -resources: - - resources: - - secrets - providers: - - kms: - name: grpc-kms-provider - cachesize: 1000 - endpoint: unix:///tmp/kms-provider.sock -``` +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/auth/kubectl-exec-plugins.md b/contributors/design-proposals/auth/kubectl-exec-plugins.md index dba6e7b7..d966c6a9 100644 --- a/contributors/design-proposals/auth/kubectl-exec-plugins.md +++ b/contributors/design-proposals/auth/kubectl-exec-plugins.md @@ -1,282 +1,6 @@ -# Out-of-tree client authentication providers +Design proposals have been archived. -Author: @ericchiang +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -# Objective -This document describes a credential rotation strategy for client-go using an exec-based -plugin mechanism. - -# Motivation - -Kubernetes clients can provide three kinds of credentials: bearer tokens, TLS -client certs, and basic authentication username and password. Kubeconfigs can either -in-line the credential, load credentials from a file, or can use an `AuthProvider` -to actively fetch and rotate credentials. `AuthProviders` are compiled into client-go -and target specific providers (GCP, Keystone, Azure AD) or implement a specification -supported but a subset of vendors (OpenID Connect). - -Long term, it's not practical to maintain custom code in kubectl for every provider. This -is in-line with other efforts around kubernetes/kubernetes to move integration with cloud -provider, or other non-standards-based systems, out of core in favor of extension points. - -Credential rotation tools have to be called on a regular basis in case the current -credentials have expired, making [kubectl plugins](https://kubernetes.io/docs/tasks/extend-kubectl/kubectl-plugins/), -kubectl's current extension point, unsuitable for credential rotation. It's easier -to wrap `kubectl` so the tool is invoked on every command. For example, the following -is a [real example]( -https://github.com/heptio/authenticator#4-set-up-kubectl-to-use-heptio-authenticator-for-aws-tokens) -from Heptio's AWS authenticator: - -```terminal -kubectl --kubeconfig /path/to/kubeconfig --token "$(heptio-authenticator-aws token -i CLUSTER_ID -r ROLE_ARN)" [...] -``` - -Beside resulting in a long command, this potentially encourages distributions to -wrap or fork kubectl, changing the way that users interact with different -Kubernetes clusters. - -# Proposal - -This proposal builds off of earlier requests to [support exec-based plugins]( -https://github.com/kubernetes/kubernetes/issues/35530#issuecomment-256170024), and -proposes that we should add this as a first-class feature of kubectl. Specifically, -client-go should be able to receive credentials by executing a command and reading -that command's stdout. - -In fact, client-go already does this today. The GCP plugin can already be configured -to [call a command]( -https://github.com/kubernetes/client-go/blob/kubernetes-1.8.5/plugin/pkg/client/auth/gcp/gcp.go#L228-L240) -other than `gcloud`. - -## Plugin responsibilities - -Plugins are exec'd through client-go and print credentials to stdout. Errors are -surfaced through stderr and a non-zero exit code. client-go will use structured APIs -to pass information to the plugin, and receive credentials from it. - -```go -// ExecCredentials are credentials returned by the plugin. -type ExecCredentials struct { - metav1.TypeMeta `json:",inline"` - - // Token is a bearer token used by the client for request authentication. - Token string `json:"token,omitempty"` - // Expiry indicates a unix time when the provided credentials expire. 
- Expiry int64 `json:"expiry,omitempty"` -} - -// Response defines metadata about a failed request, including HTTP status code and -// response headers. -type Response struct { - // HTTP header returned by the server. - Header map[string][]string `json:"header,omitempty"` - // HTTP status code returned by the server. - Code int32 `json:"code,omitempty"` -} - -// ExecInfo is structed information passed to the plugin. -type ExecInfo struct { - metav1.TypeMeta `json:",inline"` - - // Response is populated when the transport encounters HTTP status codes, such as 401, - // suggesting previous credentials were invalid. - // +optional - Response *Response `json:"response,omitempty"` - - // Interactive is true when the transport detects the command is being called from an - // interactive prompt. - Interactive bool `json:"interactive,omitempty"` -} -``` - -To instruct client-go to use the bearer token `BEARER_TOKEN`, a plugin would print: - -```terminal -$ ./kubectl-example-auth-plugin -{ - "kind": "ExecCredentials", - "apiVersion":"client.authentication.k8s.io/v1alpha1", - "token":"BEARER_TOKEN" -} -``` - -To surface runtime-based information to the plugin, such as a request body for request -signing, client-go will set the environment variable `KUBERNETES_EXEC_INFO` to a JSON -serialized Kubernetes object when calling the plugin. - - -```terminal -KUBERNETES_EXEC_INFO='{ - "kind":"ExecInfo", - "apiVersion":"client.authentication.k8s.io/v1alpha1", - "response": { - "code": 401, - "header": { - "WWW-Authenticate": ["Bearer realm=\"Access to the staging site\""] - } - }, - "interactive": true -}' -``` - -### Caching - -kubectl repeatedly [re-initializes transports](https://github.com/kubernetes/kubernetes/issues/37876) -while client-go transports are long lived over many requests. As a result naive auth -provider implementations that re-request credentials on every request have historically -been slow. - -Plugins will be called on client-go initialization, and again when the API server returns -a 401 HTTP status code indicating expired credentials. Plugins can indicate their credentials -explicit expiry using the `Expiry` field on the returned `ExecCredentials` object, otherwise -credentials will be cached throughout the lifetime of a program. - -## Kubeconfig changes - -The current `AuthProviderConfig` uses `map[string]string` for configuration, which -makes it hard to express things like a list of arguments or list key/value environment -variables. As such, `AuthInfo` should add another field which expresses the `exec` -config. This has the benefit of a more natural structure, but the trade-off of not being -compatible with the existing `kubectl config set-credentials` implementation. - -```go -// AuthInfo contains information that describes identity information. This is use to tell the kubernetes cluster who you are. -type AuthInfo struct { - // Existing fields ... - - // Exec is a command to execute which returns credentials to the transport to use. - // +optional - Exec *ExecAuthProviderConfig `json:"exec,omitempty"` - - // ... -} - -type ExecAuthProviderConfig struct { - Command string `json:"command"` - Args []string `json:"args"` - // Env defines additional environment variables to expose to the process. These - // are unioned with the host's environment, as well as variables client-go uses - // to pass argument to the plugin. - Env []ExecEnvVar `json:"env"` - - // Preferred input version of the ExecInfo. The returned ExecCredentials MUST use - // the same encoding version as the input. 
- APIVersion string `json:"apiVersion,omitempty"` - - // TODO: JSONPath options for filtering output. -} - -type ExecEnvVar struct { - Name string `json:"name"` - Value string `json:"value"` - - // TODO: Load env vars from files or from other envs? -} -``` - -This would allow a user block of a kubeconfig to declare the following: - -```yaml -users: -- name: mmosley - user: - exec: - apiVersion: "client.authentication.k8s.io/v1alpha1" - command: /bin/kubectl-login - args: ["hello", "world"] -``` - -The AWS authenticator, modified to return structured output, would become: - -```yaml -users: -- name: kubernetes-admin - user: - exec: - apiVersion: "client.authentication.k8s.io/v1alpha1" - command: heptio-authenticator-aws - # CLUSTER_ID and ROLE_ARN should be replaced with actual desired values. - args: ["token", "-i", "(CLUSTER_ID)", "-r", "(ROLE_ARN)"] -``` - -## TLS client certificate support - -TLS client certificate support is orthogonal to bearer tokens, but something that -we should consider supporting in the future. Beyond requiring different command -output, it also requires changes to the client-go `AuthProvider` interface. - -The current The auth provider interface doesn't let the user modify the dialer, -only wrap the transport. - -```go -type AuthProvider interface { - // WrapTransport allows the plugin to create a modified RoundTripper that - // attaches authorization headers (or other info) to requests. - WrapTransport(http.RoundTripper) http.RoundTripper - // Login allows the plugin to initialize its configuration. It must not - // require direct user interaction. - Login() error -} -``` - -Since this doesn't let a `AuthProvider` supply things like client certificates, -the signature of the `AuthProvider` should change too ([with corresponding changes -to `k8s.io/client-go-transport`]( -https://gist.github.com/ericchiang/7f5804403b359ebdf79dcf76c4071bff)): - -```go -import ( - "k8s.io/client-go/transport" - // ... -) - -type AuthProvider interface { - // UpdateTransportConfig updates a config by adding a transport wrapper, - // setting a bearer token (should ignore if one is already set), or adding - // TLS client certificate credentials. - // - // This is called once on transport initialization. Providers that need to - // rotate credentials should use Config.WrapTransport to dynamically update - // credentials. - UpdateTransportConfig(c *transport.Config) - - // Login() dropped, it was never used. -} -``` - -This would let auth transports supply TLS credentials, as well as instrument -transports with in-memory rotation code like the utilities implemented by -[`k8s.io/client-go/util/certificate`](https://godoc.org/k8s.io/client-go/util/certificate). - -The `ExecCredentials` would then expand to provide TLS options. - -```go -type ExecCredentials struct { - metav1.TypeMeta `json:",inline"` - - // Token is a bearer token used by the client for request authentication. - Token string `json:"token,omitempty"` - // PEM encoded client certificate and key. - ClientCertificateData string `json:"clientCertificateData,omitempty"` - ClientKeyData string `json:"clientKeyData,omitempty"` - - // Expiry indicates a unix time when the provided credentials expire. - Expiry int64 `json:"expiry,omitempty"` -} -``` - -The `AuthProvider` then adds those credentials to the `transport.Config`. - -## Login - -Historically, `AuthProviders` have had a `Login()` method with the hope that it -could trigger bootstrapping into the cluster. 
While no providers implement this -method, the Azure `AuthProvider` can already prompt an [interactive auth flow]( -https://github.com/kubernetes/client-go/blob/kubernetes-1.8.5/plugin/pkg/client/auth/azure/azure.go#L343). -This suggests that an exec'd tool should be able to trigger its own custom logins, -either by opening a browser, or performing a text based prompt. - -We should take care that interactive stderr and stdin are correctly inherited by -the sub-process to enable this kind of interaction. The plugin will still be -responsible for prompting the user, receiving user feedback, and timeouts. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first. diff --git a/contributors/design-proposals/auth/no-new-privs.md b/contributors/design-proposals/auth/no-new-privs.md index 5c96c9d1..d966c6a9 100644 --- a/contributors/design-proposals/auth/no-new-privs.md +++ b/contributors/design-proposals/auth/no-new-privs.md @@ -1,141 +1,6 @@ -# No New Privileges +Design proposals have been archived. -- [Description](#description) - * [Interactions with other Linux primitives](#interactions-with-other-linux-primitives) -- [Current Implementations](#current-implementations) - * [Support in Docker](#support-in-docker) - * [Support in rkt](#support-in-rkt) - * [Support in OCI runtimes](#support-in-oci-runtimes) -- [Existing SecurityContext objects](#existing-securitycontext-objects) -- [Changes of SecurityContext objects](#changes-of-securitycontext-objects) -- [Pod Security Policy changes](#pod-security-policy-changes) +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Description - -In Linux, the `execve` system call can grant more privileges to a newly-created -process than its parent process. Considering security issues, since Linux kernel -v3.5, there is a new flag named `no_new_privs` added to prevent those new -privileges from being granted to the processes. - -[`no_new_privs`](https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt) -is inherited across `fork`, `clone` and `execve` and can not be unset. With -`no_new_privs` set, `execve` promises not to grant the privilege to do anything -that could not have been done without the `execve` call. - -For more details about `no_new_privs`, please check the -[Linux kernel documentation](https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt). - -This is different from `NOSUID` in that `no_new_privs`can give permission to -the container process to further restrict child processes with seccomp. This -permission goes only one-way in that the container process can not grant more -permissions, only further restrict. - -### Interactions with other Linux primitives - -- suid binaries: will break when `no_new_privs` is enabled -- seccomp2 as a non root user: requires `no_new_privs` -- seccomp2 with dropped `CAP_SYS_ADMIN`: requires `no_new_privs` -- ambient capabilities: requires `no_new_privs` -- selinux transitions: bugs that were fixed documented [here](https://github.com/moby/moby/issues/23981#issuecomment-233121969) - - -## Current Implementations - -### Support in Docker - -Since Docker 1.11, a user can specify `--security-opt` to enable `no_new_privs` -while creating containers, for example -`docker run --security-opt=no_new_privs busybox`. - -Docker provides via their Go api an object named `ContainerCreateConfig` to -configure container creation parameters. 
In this object, there is a string -array `HostConfig.SecurityOpt` to specify the security options. Client can -utilize this field to specify the arguments for security options while -creating new containers. - -This field did not scale well for the Docker client, so it's suggested that -Kubernetes does not follow that design. - -This is not on by default in Docker. - -More details of the Docker implementation can be read -[here](https://github.com/moby/moby/pull/20727) as well as the original -discussion [here](https://github.com/moby/moby/issues/20329). - -### Support in rkt - -Since rkt v1.26.0, the `NoNewPrivileges` option has been enabled in rkt. - -More details of the rkt implementation can be read -[here](https://github.com/rkt/rkt/pull/2677). - -### Support in OCI runtimes - -Since version 0.3.0 of the OCI runtime specification, a user can specify the -`noNewPrivs` boolean flag in the configuration file. - -More details of the OCI implementation can be read -[here](https://github.com/opencontainers/runtime-spec/pull/290). - -## Existing SecurityContext objects - -Kubernetes defines `SecurityContext` for `Container` and `PodSecurityContext` -for `PodSpec`. `SecurityContext` objects define the related security options -for Kubernetes containers, e.g. selinux options. - -To support "no new privileges" options in Kubernetes, it is proposed to make -the following changes: - -## Changes of SecurityContext objects - -Add a new `*bool` type field named `allowPrivilegeEscalation` to the `SecurityContext` -definition. - -By default, ie when `allowPrivilegeEscalation=nil`, we will set `no_new_privs=true` -with the following exceptions: - -- when a container is `privileged` -- when `CAP_SYS_ADMIN` is added to a container -- when a container is not run as root, uid `0` (to prevent breaking suid - binaries) - -The API will reject as invalid `privileged=true` and -`allowPrivilegeEscalation=false`, as well as `capAdd=CAP_SYS_ADMIN` and -`allowPrivilegeEscalation=false.` - -When `allowPrivilegeEscalation` is set to `false` it will enable `no_new_privs` -for that container. - -`allowPrivilegeEscalation` in `SecurityContext` provides container level -control of the `no_new_privs` flag and can override the default in both directions -of the `allowPrivilegeEscalation` setting. - -This requires changes to the Docker, rkt, and CRI runtime integrations so that -kubelet will add the specific `no_new_privs` option. - -## Pod Security Policy changes - -The default can be set via a new `*bool` type field named `defaultAllowPrivilegeEscalation` -in a Pod Security Policy. -This would allow users to set `defaultAllowPrivilegeEscalation=false`, overriding the -default `nil` behavior of `no_new_privs=false` for containers -whose uids are not 0. - -This would also keep the behavior of setting the security context as -`allowPrivilegeEscalation=true` -for privileged containers and those with `capAdd=CAP_SYS_ADMIN`. - -To recap, below is a table defining the default behavior at the pod security -policy level and what can be set as a default with a pod security policy. 
- -| allowPrivilegeEscalation setting | uid = 0 or unset | uid != 0 | privileged/CAP_SYS_ADMIN | -|----------------------------------|--------------------|--------------------|--------------------------| -| nil | no_new_privs=true | no_new_privs=false | no_new_privs=false | -| false | no_new_privs=true | no_new_privs=true | no_new_privs=false | -| true | no_new_privs=false | no_new_privs=false | no_new_privs=false | - -A new `bool` field named `allowPrivilegeEscalation` will be added to the Pod -Security Policy as well to gate whether or not a user is allowed to set the -security context to `allowPrivilegeEscalation=true`. This field will default to -false. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first. diff --git a/contributors/design-proposals/auth/pod-security-context.md b/contributors/design-proposals/auth/pod-security-context.md index 29ff7ff0..f0fbec72 100644 --- a/contributors/design-proposals/auth/pod-security-context.md +++ b/contributors/design-proposals/auth/pod-security-context.md @@ -1,370 +1,6 @@ -## Abstract +Design proposals have been archived. -A proposal for refactoring `SecurityContext` to have pod-level and container-level attributes in -order to correctly model pod- and container-level security concerns. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Motivation -Currently, containers have a `SecurityContext` attribute which contains information about the -security settings the container uses. In practice, many of these attributes are uniform across all -containers in a pod. Simultaneously, there is also a need to apply the security context pattern -at the pod level to correctly model security attributes that apply only at a pod level. - -Users should be able to: - -1. Express security settings that are applicable to the entire pod -2. Express base security settings that apply to all containers -3. Override only the settings that need to be differentiated from the base in individual - containers - -This proposal is a dependency for other changes related to security context: - -1. [Volume ownership management in the Kubelet](https://github.com/kubernetes/kubernetes/pull/12944) -2. [Generic SELinux label management in the Kubelet](https://github.com/kubernetes/kubernetes/pull/14192) - -Goals of this design: - -1. Describe the use cases for which a pod-level security context is necessary -2. Thoroughly describe the API backward compatibility issues that arise from the introduction of - a pod-level security context -3. Describe all implementation changes necessary for the feature - -## Constraints and assumptions - -1. We will not design for intra-pod security; we are not currently concerned about isolating - containers in the same pod from one another -1. We will design for backward compatibility with the current V1 API - -## Use Cases - -1. As a developer, I want to correctly model security attributes which belong to an entire pod -2. As a user, I want to be able to specify container attributes that apply to all containers - without repeating myself -3. As an existing user, I want to be able to use the existing container-level security API - -### Use Case: Pod level security attributes - -Some security attributes make sense only to model at the pod level. For example, it is a -fundamental property of pods that all containers in a pod share the same network namespace. 
-Therefore, using the host namespace makes sense to model at the pod level only, and indeed, today -it is part of the `PodSpec`. Other host namespace support is currently being added and these will -also be pod-level settings; it makes sense to model them as a pod-level collection of security -attributes. - -## Use Case: Override pod security context for container - -Some use cases require the containers in a pod to run with different security settings. As an -example, a user may want to have a pod with two containers, one of which runs as root with the -privileged setting, and one that runs as a non-root UID. To support use cases like this, it should -be possible to override appropriate (i.e., not intrinsically pod-level) security settings for -individual containers. - -## Proposed Design - -### SecurityContext - -For posterity and ease of reading, note the current state of `SecurityContext`: - -```go -package api - -type Container struct { - // Other fields omitted - - // Optional: SecurityContext defines the security options the pod should be run with - SecurityContext *SecurityContext `json:"securityContext,omitempty"` -} - -type SecurityContext struct { - // Capabilities are the capabilities to add/drop when running the container - Capabilities *Capabilities `json:"capabilities,omitempty"` - - // Run the container in privileged mode - Privileged *bool `json:"privileged,omitempty"` - - // SELinuxOptions are the labels to be applied to the container - // and volumes - SELinuxOptions *SELinuxOptions `json:"seLinuxOptions,omitempty"` - - // RunAsUser is the UID to run the entrypoint of the container process. - RunAsUser *int64 `json:"runAsUser,omitempty"` - - // RunAsNonRoot indicates that the container should be run as a non-root user. If the RunAsUser - // field is not explicitly set then the kubelet may check the image for a specified user or - // perform defaulting to specify a user. - RunAsNonRoot bool `json:"runAsNonRoot,omitempty"` -} - -// SELinuxOptions contains the fields that make up the SELinux context of a container. -type SELinuxOptions struct { - // SELinux user label - User string `json:"user,omitempty"` - - // SELinux role label - Role string `json:"role,omitempty"` - - // SELinux type label - Type string `json:"type,omitempty"` - - // SELinux level label. - Level string `json:"level,omitempty"` -} -``` - -### PodSecurityContext - -`PodSecurityContext` specifies two types of security attributes: - -1. Attributes that apply to the pod itself -2. Attributes that apply to the containers of the pod - -In the internal API, fields of the `PodSpec` controlling the use of the host PID, IPC, and network -namespaces are relocated to this type: - -```go -package api - -type PodSpec struct { - // Other fields omitted - - // Optional: SecurityContext specifies pod-level attributes and container security attributes - // that apply to all containers. - SecurityContext *PodSecurityContext `json:"securityContext,omitempty"` -} - -// PodSecurityContext specifies security attributes of the pod and container attributes that apply -// to all containers of the pod. -type PodSecurityContext struct { - // Use the host's network namespace. If this option is set, the ports that will be - // used must be specified. - // Optional: Default to false. 
- HostNetwork bool - // Use the host's IPC namespace - HostIPC bool - - // Use the host's PID namespace - HostPID bool - - // Capabilities are the capabilities to add/drop when running containers - Capabilities *Capabilities `json:"capabilities,omitempty"` - - // Run the container in privileged mode - Privileged *bool `json:"privileged,omitempty"` - - // SELinuxOptions are the labels to be applied to the container - // and volumes - SELinuxOptions *SELinuxOptions `json:"seLinuxOptions,omitempty"` - - // RunAsUser is the UID to run the entrypoint of the container process. - RunAsUser *int64 `json:"runAsUser,omitempty"` - - // RunAsNonRoot indicates that the container should be run as a non-root user. If the RunAsUser - // field is not explicitly set then the kubelet may check the image for a specified user or - // perform defaulting to specify a user. - RunAsNonRoot bool -} - -// Comments and generated docs will change for the container.SecurityContext field to indicate -// the precedence of these fields over the pod-level ones. - -type Container struct { - // Other fields omitted - - // Optional: SecurityContext defines the security options the pod should be run with. - // Settings specified in this field take precedence over the settings defined in - // pod.Spec.SecurityContext. - SecurityContext *SecurityContext `json:"securityContext,omitempty"` -} -``` - -In the V1 API, the pod-level security attributes which are currently fields of the `PodSpec` are -retained on the `PodSpec` for backward compatibility purposes: - -```go -package v1 - -type PodSpec struct { - // Other fields omitted - - // Use the host's network namespace. If this option is set, the ports that will be - // used must be specified. - // Optional: Default to false. - HostNetwork bool `json:"hostNetwork,omitempty"` - // Use the host's pid namespace. - // Optional: Default to false. - HostPID bool `json:"hostPID,omitempty"` - // Use the host's ipc namespace. - // Optional: Default to false. - HostIPC bool `json:"hostIPC,omitempty"` - - // Optional: SecurityContext specifies pod-level attributes and container security attributes - // that apply to all containers. - SecurityContext *PodSecurityContext `json:"securityContext,omitempty"` -} -``` - -The `pod.Spec.SecurityContext` specifies the security context of all containers in the pod. -The containers' `securityContext` field is overlaid on the base security context to determine the -effective security context for the container. - -The new V1 API should be backward compatible with the existing API. Backward compatibility is -defined as: - -> 1. Any API call (e.g. a structure POSTed to a REST endpoint) that worked before your change must -> work the same after your change. -> 2. Any API call that uses your change must not cause problems (e.g. crash or degrade behavior) when -> issued against servers that do not include your change. -> 3. It must be possible to round-trip your change (convert to different API versions and back) with -> no loss of information. - -Previous versions of this proposal attempted to deal with backward compatibility by defining -the affect of setting the pod-level fields on the container-level fields. While trying to find -consensus on this design, it became apparent that this approach was going to be extremely complex -to implement, explain, and support. Instead, we will approach backward compatibility as follows: - -1. Pod-level and container-level settings will not affect one another -2. 
Old clients will be able to use container-level settings in the exact same way -3. Container level settings always override pod-level settings if they are set - -#### Examples - -1. Old client using `pod.Spec.Containers[x].SecurityContext` - - An old client creates a pod: - - ```yaml - apiVersion: v1 - kind: Pod - metadata: - name: test-pod - spec: - containers: - - name: a - securityContext: - runAsUser: 1001 - - name: b - securityContext: - runAsUser: 1002 - ``` - - looks to old clients like: - - ```yaml - apiVersion: v1 - kind: Pod - metadata: - name: test-pod - spec: - containers: - - name: a - securityContext: - runAsUser: 1001 - - name: b - securityContext: - runAsUser: 1002 - ``` - - looks to new clients like: - - ```yaml - apiVersion: v1 - kind: Pod - metadata: - name: test-pod - spec: - containers: - - name: a - securityContext: - runAsUser: 1001 - - name: b - securityContext: - runAsUser: 1002 - ``` - -2. New client using `pod.Spec.SecurityContext` - - A new client creates a pod using a field of `pod.Spec.SecurityContext`: - - ```yaml - apiVersion: v1 - kind: Pod - metadata: - name: test-pod - spec: - securityContext: - runAsUser: 1001 - containers: - - name: a - - name: b - ``` - - appears to new clients as: - - ```yaml - apiVersion: v1 - kind: Pod - metadata: - name: test-pod - spec: - securityContext: - runAsUser: 1001 - containers: - - name: a - - name: b - ``` - - old clients will see: - - ```yaml - apiVersion: v1 - kind: Pod - metadata: - name: test-pod - spec: - containers: - - name: a - - name: b - ``` - -3. Pods created using `pod.Spec.SecurityContext` and `pod.Spec.Containers[x].SecurityContext` - - If a field is set in both `pod.Spec.SecurityContext` and - `pod.Spec.Containers[x].SecurityContext`, the value in `pod.Spec.Containers[x].SecurityContext` - wins. In the following pod: - - ```yaml - apiVersion: v1 - kind: Pod - metadata: - name: test-pod - spec: - securityContext: - runAsUser: 1001 - containers: - - name: a - securityContext: - runAsUser: 1002 - - name: b - ``` - - The effective setting for `runAsUser` for container A is `1002`. - -#### Testing - -A backward compatibility test suite will be established for the v1 API. The test suite will -verify compatibility by converting objects into the internal API and back to the version API and -examining the results. - -All of the examples here will be used as test-cases. As more test cases are added, the proposal will -be updated. - -An example of a test like this can be found in the -[OpenShift API package](https://github.com/openshift/origin/blob/master/pkg/api/compatibility_test.go) - -E2E test cases will be added to test the correct determination of the security context for containers. - -### Kubelet changes - -1. The Kubelet will use the new fields on the `PodSecurityContext` for host namespace control -2. The Kubelet will be modified to correctly implement the backward compatibility and effective - security context determination defined here +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
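To summarize the override rule described in the pod security context proposal above (container-level settings win over pod-level ones), a sketch of the effective-value computation might look as follows; the helper and the trimmed types are illustrative, not the Kubelet's actual code.

```go
package securitycontext

// PodSecurityContext and SecurityContext are trimmed to the single field
// needed to illustrate the rule; the real types carry many more fields.
type PodSecurityContext struct {
	RunAsUser *int64
}

type SecurityContext struct {
	RunAsUser *int64
}

// effectiveRunAsUser returns the UID a container should run as: the
// container-level setting wins when present, otherwise the pod-level
// setting applies, otherwise nil (image or runtime default behavior).
func effectiveRunAsUser(pod *PodSecurityContext, container *SecurityContext) *int64 {
	if container != nil && container.RunAsUser != nil {
		return container.RunAsUser
	}
	if pod != nil {
		return pod.RunAsUser
	}
	return nil
}
```

Applied to example 3 above, `effectiveRunAsUser` yields `1002` for container A (its own setting) and `1001` for container B (the pod-level setting).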
\ No newline at end of file diff --git a/contributors/design-proposals/auth/pod-security-policy.md b/contributors/design-proposals/auth/pod-security-policy.md index 39a4e4bc..f0fbec72 100644 --- a/contributors/design-proposals/auth/pod-security-policy.md +++ b/contributors/design-proposals/auth/pod-security-policy.md @@ -1,345 +1,6 @@ -## Abstract +Design proposals have been archived. -PodSecurityPolicy allows cluster administrators to control the creation and validation of a security -context for a pod and containers. The intent of PodSecurityPolicy is to protect the cluster from the -pod and containers, not to protect a pod or containers from a user. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Motivation - -Administration of a multi-tenant cluster requires the ability to provide varying sets of permissions -among the tenants, the infrastructure components, and end users of the system who may themselves be -administrators within their own isolated namespace. - -Actors in a cluster may include infrastructure that is managed by administrators, infrastructure -that is exposed to end users (builds, deployments), the isolated end user namespaces in the cluster, and -the individual users inside those namespaces. Infrastructure components that operate on behalf of a -user (builds, deployments) should be allowed to run at an elevated level of permissions without -granting the user themselves an elevated set of permissions. - -## Goals - -1. Associate [service accounts](service_accounts.md), groups, and users with -a set of constraints that dictate how a security context is established for a pod and the pod's containers. -1. Provide the ability for users and infrastructure components to run pods with elevated privileges -on behalf of another user or within a namespace where privileges are more restrictive. -1. Secure the ability to reference elevated permissions or to change the constraints under which -a user runs. - -## Use Cases - -Use case 1: -As an administrator, I can create a namespace for a person that can't create privileged containers -AND enforce that the UID of the containers is set to a certain value - -Use case 2: -As a cluster operator, an infrastructure component should be able to create a pod with elevated -privileges in a namespace where regular users cannot create pods with these privileges or execute -commands in that pod. - -Use case 3: -As a cluster administrator, I can allow a given namespace (or service account) to create privileged -pods or to run root pods - -Use case 4: -As a cluster administrator, I can allow a project administrator to control the security contexts of -pods and service accounts within a project - - -## Requirements - -1. Provide a set of restrictions that controls how a security context is created for pods and containers -as a new cluster-scoped object called `PodSecurityPolicy`. -1. User information in `user.Info` must be available to admission controllers. (Completed in -https://github.com/kubernetes/kubernetes/pull/8203) -1. Some authorizers may restrict a user's ability to reference a service account. Systems requiring -the ability to secure service accounts on a user level must be able to add a policy that enables -referencing specific service accounts themselves. -1. Admission control must validate the creation of Pods against the allowed set of constraints. 
- -## Design - -### Model - -PodSecurityPolicy objects exist in the root scope, outside of a namespace. The -PodSecurityPolicy will reference users and groups that are allowed -to operate under the constraints. In order to support this, `ServiceAccounts` must be mapped -to a user name or group list by the authentication/authorization layers. This allows the security -context to treat users, groups, and service accounts uniformly. - -Below is a list of PodSecurityPolicies which will likely serve most use cases: - -1. A default policy object. This object is permissioned to something which covers all actors, such -as a `system:authenticated` group, and will likely be the most restrictive set of constraints. -1. A default constraints object for service accounts. This object can be identified as serving -a group identified by `system:service-accounts`, which can be imposed by the service account authenticator / token generator. -1. Cluster admin constraints identified by `system:cluster-admins` group - a set of constraints with elevated privileges that can be used -by an administrative user or group. -1. Infrastructure components constraints which can be identified either by a specific service -account or by a group containing all service accounts. - -```go -// PodSecurityPolicy governs the ability to make requests that affect the SecurityContext -// that will be applied to a pod and container. -type PodSecurityPolicy struct { - unversioned.TypeMeta `json:",inline"` - api.ObjectMeta `json:"metadata,omitempty"` - - // Spec defines the policy enforced. - Spec PodSecurityPolicySpec `json:"spec,omitempty"` -} - -// PodSecurityPolicySpec defines the policy enforced. -type PodSecurityPolicySpec struct { - // Privileged determines if a pod can request to be run as privileged. - Privileged bool `json:"privileged,omitempty"` - // Capabilities is a list of capabilities that can be added. - Capabilities []api.Capability `json:"capabilities,omitempty"` - // Volumes allows and disallows the use of different types of volume plugins. - Volumes VolumeSecurityPolicy `json:"volumes,omitempty"` - // HostNetwork determines if the policy allows the use of HostNetwork in the pod spec. - HostNetwork bool `json:"hostNetwork,omitempty"` - // HostPorts determines which host port ranges are allowed to be exposed. - HostPorts []HostPortRange `json:"hostPorts,omitempty"` - // HostPID determines if the policy allows the use of HostPID in the pod spec. - HostPID bool `json:"hostPID,omitempty"` - // HostIPC determines if the policy allows the use of HostIPC in the pod spec. - HostIPC bool `json:"hostIPC,omitempty"` - // SELinuxContext is the strategy that will dictate the allowable labels that may be set. - SELinuxContext SELinuxContextStrategyOptions `json:"seLinuxContext,omitempty"` - // RunAsUser is the strategy that will dictate the allowable RunAsUser values that may be set. - RunAsUser RunAsUserStrategyOptions `json:"runAsUser,omitempty"` - - // The users who have permissions to use this policy - Users []string `json:"users,omitempty"` - // The groups that have permission to use this policy - Groups []string `json:"groups,omitempty"` -} - -// HostPortRange defines a range of host ports that will be enabled by a policy -// for pods to use. It requires both the start and end to be defined. -type HostPortRange struct { - // Start is the beginning of the port range which will be allowed. - Start int `json:"start"` - // End is the end of the port range which will be allowed. 
- End int `json:"end"` -} - -// VolumeSecurityPolicy allows and disallows the use of different types of volume plugins. -type VolumeSecurityPolicy struct { - // HostPath allows or disallows the use of the HostPath volume plugin. - // More info: http://kubernetes.io/docs/user-guide/volumes/#hostpath - HostPath bool `json:"hostPath,omitempty"` - // EmptyDir allows or disallows the use of the EmptyDir volume plugin. - // More info: http://kubernetes.io/docs/user-guide/volumes/#emptydir - EmptyDir bool `json:"emptyDir,omitempty"` - // GCEPersistentDisk allows or disallows the use of the GCEPersistentDisk volume plugin. - // More info: http://kubernetes.io/docs/user-guide/volumes/#gcepersistentdisk - GCEPersistentDisk bool `json:"gcePersistentDisk,omitempty"` - // AWSElasticBlockStore allows or disallows the use of the AWSElasticBlockStore volume plugin. - // More info: http://kubernetes.io/docs/user-guide/volumes/#awselasticblockstore - AWSElasticBlockStore bool `json:"awsElasticBlockStore,omitempty"` - // GitRepo allows or disallows the use of the GitRepo volume plugin. - GitRepo bool `json:"gitRepo,omitempty"` - // Secret allows or disallows the use of the Secret volume plugin. - // More info: http://kubernetes.io/docs/user-guide/volumes/#secret - Secret bool `json:"secret,omitempty"` - // NFS allows or disallows the use of the NFS volume plugin. - // More info: http://kubernetes.io/docs/user-guide/volumes/#nfs - NFS bool `json:"nfs,omitempty"` - // ISCSI allows or disallows the use of the ISCSI volume plugin. - // More info: http://releases.k8s.io/HEAD/examples/volumes/iscsi/README.md - ISCSI bool `json:"iscsi,omitempty"` - // Glusterfs allows or disallows the use of the Glusterfs volume plugin. - // More info: http://releases.k8s.io/HEAD/examples/volumes/glusterfs/README.md - Glusterfs bool `json:"glusterfs,omitempty"` - // PersistentVolumeClaim allows or disallows the use of the PersistentVolumeClaim volume plugin. - // More info: http://kubernetes.io/docs/user-guide/persistent-volumes/#persistentvolumeclaims - PersistentVolumeClaim bool `json:"persistentVolumeClaim,omitempty"` - // RBD allows or disallows the use of the RBD volume plugin. - // More info: http://releases.k8s.io/HEAD/examples/volumes/rbd/README.md - RBD bool `json:"rbd,omitempty"` - // Cinder allows or disallows the use of the Cinder volume plugin. - // More info: http://releases.k8s.io/HEAD/examples/mysql-cinder-pd/README.md - Cinder bool `json:"cinder,omitempty"` - // CephFS allows or disallows the use of the CephFS volume plugin. - CephFS bool `json:"cephfs,omitempty"` - // DownwardAPI allows or disallows the use of the DownwardAPI volume plugin. - DownwardAPI bool `json:"downwardAPI,omitempty"` - // FC allows or disallows the use of the FC volume plugin. - FC bool `json:"fc,omitempty"` -} - -// SELinuxContextStrategyOptions defines the strategy type and any options used to create the strategy. -type SELinuxContextStrategyOptions struct { - // Type is the strategy that will dictate the allowable labels that may be set. - Type SELinuxContextStrategy `json:"type"` - // seLinuxOptions required to run as; required for MustRunAs - // More info: http://releases.k8s.io/HEAD/docs/design/security_context.md#security-context - SELinuxOptions *api.SELinuxOptions `json:"seLinuxOptions,omitempty"` -} - -// SELinuxContextStrategyType denotes strategy types for generating SELinux options for a -// SecurityContext. -type SELinuxContextStrategy string - -const ( - // container must have SELinux labels of X applied. 
- SELinuxStrategyMustRunAs SELinuxContextStrategy = "MustRunAs" - // container may make requests for any SELinux context labels. - SELinuxStrategyRunAsAny SELinuxContextStrategy = "RunAsAny" -) - -// RunAsUserStrategyOptions defines the strategy type and any options used to create the strategy. -type RunAsUserStrategyOptions struct { - // Type is the strategy that will dictate the allowable RunAsUser values that may be set. - Type RunAsUserStrategy `json:"type"` - // UID is the user id that containers must run as. Required for the MustRunAs strategy if not using - // a strategy that supports pre-allocated uids. - UID *int64 `json:"uid,omitempty"` - // UIDRangeMin defines the min value for a strategy that allocates by a range based strategy. - UIDRangeMin *int64 `json:"uidRangeMin,omitempty"` - // UIDRangeMax defines the max value for a strategy that allocates by a range based strategy. - UIDRangeMax *int64 `json:"uidRangeMax,omitempty"` -} - -// RunAsUserStrategyType denotes strategy types for generating RunAsUser values for a -// SecurityContext. -type RunAsUserStrategy string - -const ( - // container must run as a particular uid. - RunAsUserStrategyMustRunAs RunAsUserStrategy = "MustRunAs" - // container must run as a particular uid. - RunAsUserStrategyMustRunAsRange RunAsUserStrategy = "MustRunAsRange" - // container must run as a non-root uid - RunAsUserStrategyMustRunAsNonRoot RunAsUserStrategy = "MustRunAsNonRoot" - // container may make requests for any uid. - RunAsUserStrategyRunAsAny RunAsUserStrategy = "RunAsAny" -) -``` - -### PodSecurityPolicy Lifecycle - -As reusable objects in the root scope, PodSecurityPolicy follows the lifecycle of the -cluster itself. Maintenance of constraints such as adding, assigning, or changing them is the -responsibility of the cluster administrator. Deleting is not considered in PodSecurityPolicy, -It's important for controllers without the ability to use psps (like the namespace controller) -to be able to delete pods. - -Creating a new user within a namespace should not require the cluster administrator to -define the user's PodSecurityPolicy. They should receive the default set of policies -that the administrator has defined for the groups they are assigned. - - -## Default PodSecurityPolicy And Overrides - -In order to establish policy for service accounts and users, there must be a way -to identify the default set of constraints that is to be used. This is best accomplished by using -groups. As mentioned above, groups may be used by the authentication/authorization layer to ensure -that every user maps to at least one group (with a default example of `system:authenticated`) and it -is up to the cluster administrator to ensure that a `PodSecurityPolicy` object exists that -references the group. - -If an administrator would like to provide a user with a changed set of security context permissions, -they may do the following: - -1. Create a new `PodSecurityPolicy` object and add a reference to the user or a group -that the user belongs to. -1. Add the user (or group) to an existing `PodSecurityPolicy` object with the proper -elevated privileges. - -## Admission - -Admission control using an authorizer provides the ability to control the creation of resources -based on capabilities granted to a user. In terms of the `PodSecurityPolicy`, it means -that an admission controller may inspect the user info made available in the context to retrieve -an appropriate set of policies for validation. 
- -The appropriate set of PodSecurityPolicies is defined as all of the policies -available that have reference to the user or groups that the user belongs to. - -Admission will use the PodSecurityPolicy to ensure that any requests for a -specific security context setting are valid and to generate settings using the following approach: - -1. Determine all the available `PodSecurityPolicy` objects that are allowed to be used -1. Sort the `PodSecurityPolicy` objects in a most restrictive to least restrictive order. -1. For each `PodSecurityPolicy`, generate a `SecurityContext` for each container. The generation phase will not override -any user requested settings in the `SecurityContext`, and will rely on the validation phase to ensure that -the user requests are valid. -1. Validate the generated `SecurityContext` to ensure it falls within the boundaries of the `PodSecurityPolicy` -1. If all containers validate under a single `PodSecurityPolicy` then the pod will be admitted -1. If all containers DO NOT validate under the `PodSecurityPolicy` then try the next `PodSecurityPolicy` -1. If no `PodSecurityPolicy` validates for the pod then the pod will not be admitted - - -## Creation of a SecurityContext Based on PodSecurityPolicy - -The creation of a `SecurityContext` based on a `PodSecurityPolicy` is based upon the configured -settings of the `PodSecurityPolicy`. - -There are three scenarios under which a `PodSecurityPolicy` field may fall: - -1. Governed by a boolean: fields of this type will be defaulted to the most restrictive value. -For instance, `AllowPrivileged` will always be set to false if unspecified. - -1. Governed by an allowable set: fields of this type will be checked against the set to ensure -their value is allowed. For example, `AllowCapabilities` will ensure that only capabilities -that are allowed to be requested are considered valid. `HostNetworkSources` will ensure that -only pods created from source X are allowed to request access to the host network. -1. Governed by a strategy: Items that have a strategy to generate a value will provide a -mechanism to generate the value as well as a mechanism to ensure that a specified value falls into -the set of allowable values. See the Types section for the description of the interfaces that -strategies must implement. - -Strategies have the ability to become dynamic. In order to support a dynamic strategy it should be -possible to make a strategy that has the ability to either be pre-populated with dynamic data by -another component (such as an admission controller) or has the ability to retrieve the information -itself based on the data in the pod. An example of this would be a pre-allocated UID for the namespace. -A dynamic `RunAsUser` strategy could inspect the namespace of the pod in order to find the required pre-allocated -UID and generate or validate requests based on that information. - - -```go -// SELinuxStrategy defines the interface for all SELinux constraint strategies. -type SELinuxStrategy interface { - // Generate creates the SELinuxOptions based on constraint rules. - Generate(pod *api.Pod, container *api.Container) (*api.SELinuxOptions, error) - // Validate ensures that the specified values fall within the range of the strategy. - Validate(pod *api.Pod, container *api.Container) fielderrors.ValidationErrorList -} - -// RunAsUserStrategy defines the interface for all uid constraint strategies. -type RunAsUserStrategy interface { - // Generate creates the uid based on policy rules. 
- Generate(pod *api.Pod, container *api.Container) (*int64, error) - // Validate ensures that the specified values fall within the range of the strategy. - Validate(pod *api.Pod, container *api.Container) fielderrors.ValidationErrorList -} -``` - -## Escalating Privileges by an Administrator - -An administrator may wish to create a resource in a namespace that runs with -escalated privileges. By allowing security context -constraints to operate on both the requesting user and the pod's service account, administrators are able to -create pods in namespaces with elevated privileges based on the administrator's security context -constraints. - -This also allows the system to guard commands being executed in the non-conforming container. For -instance, an `exec` command can first check the security context of the pod against the security -context constraints of the user or the user's ability to reference a service account. -If it does not validate then it can block users from executing the command. Since the validation -will be user aware, administrators would still be able to run the commands that are restricted to normal users. - -## Interaction with the Kubelet - -In certain cases, the Kubelet may need provide information about -the image in order to validate the security context. An example of this is a cluster -that is configured to run with a UID strategy of `MustRunAsNonRoot`. - -In this case the admission controller can set the existing `MustRunAsNonRoot` flag on the `SecurityContext` -based on the UID strategy of the `SecurityPolicy`. It should still validate any requests on the pod -for a specific UID and fail early if possible. However, if the `RunAsUser` is not set on the pod -it should still admit the pod and allow the Kubelet to ensure that the image does not run as -`root` with the existing non-root checks. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/auth/proc-mount-type.md b/contributors/design-proposals/auth/proc-mount-type.md index 073fc23e..f0fbec72 100644 --- a/contributors/design-proposals/auth/proc-mount-type.md +++ b/contributors/design-proposals/auth/proc-mount-type.md @@ -1,93 +1,6 @@ -# ProcMount/ProcMountType Option +Design proposals have been archived. -## Background +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Currently, docker and most other container runtimes mask -and set as read-only certain paths in `/proc`. This is to prevent data -from being exposed into a container that should not be. However, there are -certain use cases where it is necessary to turn this off. -## Motivation - -For end-users who would like to run unprivileged containers using user namespaces -_nested inside_ CRI containers, we need a `ProcMount` option. That is, -we need an option to explicitly turn off the masking and -read-only setting of these paths so that we can -mount `/proc` in the nested container as an unprivileged user. - -Please see the following filed issues for more information: -- [opencontainers/runc#1658](https://github.com/opencontainers/runc/issues/1658#issuecomment-373122073) -- [moby/moby#36597](https://github.com/moby/moby/issues/36597) -- [moby/moby#36644](https://github.com/moby/moby/pull/36644) - -Please also see the [use case for building images securely in kubernetes](https://github.com/jessfraz/blog/blob/master/content/post/building-container-images-securely-on-kubernetes.md). - -Unmasking the paths in `/proc` really only makes sense when a user -is nesting -unprivileged containers with user namespaces, as it exposes more information -than is necessary to the program running in the container spawned by -kubernetes. - -The main use case for this option is to run -[genuinetools/img](https://github.com/genuinetools/img) inside a kubernetes -container. That program then launches sub-containers that take advantage of -user namespaces, re-mask /proc, and set /proc as read-only, so -there is no concern with having an unmasked /proc open in the top-level container. - -It should be noted that this is different from the host /proc. It is still -a newly mounted /proc; the container runtime simply will not mask the paths. - -Since the only use case for this option is to run unprivileged nested -containers, -this option should only be allowed or used if the user in the container is not `root`. -This can be easily enforced with `MustRunAs`. -Since the user inside is still unprivileged, -doing things to `/proc` would be off limits regardless, since Linux user -support already prevents this. - -## Existing SecurityContext objects - -Kubernetes defines `SecurityContext` for `Container` and `PodSecurityContext` -for `PodSpec`. `SecurityContext` objects define the related security options -for Kubernetes containers, e.g. SELinux options. - -To support "ProcMount" options in Kubernetes, it is proposed to make -the following changes: - -## Changes of SecurityContext objects - -Add a new `string` type named `ProcMountType` that will hold the viable -options for `procMount`, and add a `procMount` field of that type to the `SecurityContext` -definition. - -By default, `procMount` is `Default`, i.e. the same behavior as today, and the -paths are masked.
- -This will look like the following in the spec: - -```go -type ProcMountType string - -const ( - // DefaultProcMount uses the container runtime default ProcMountType. Most - // container runtimes mask certain paths in /proc to avoid accidental security - // exposure of special devices or information. - DefaultProcMount ProcMountType = "Default" - - // UnmaskedProcMount bypasses the default masking behavior of the container - // runtime and ensures the newly created /proc for the container stays intact with - // no modifications. - UnmaskedProcMount ProcMountType = "Unmasked" -) - -procMount *ProcMountType -``` - -This requires changes to the CRI runtime integrations so that -kubelet will add the specific `unmasked` or `whatever_it_is_named` option. - -## Pod Security Policy changes - -A new `[]ProcMountType{}` field named `allowedProcMounts` will be added to the Pod -Security Policy as well to gate which ProcMountTypes a user is allowed to -set. This field will default to `[]ProcMountType{ DefaultProcMount }`. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/auth/runas-groupid.md b/contributors/design-proposals/auth/runas-groupid.md index b05b2950..f0fbec72 100644 --- a/contributors/design-proposals/auth/runas-groupid.md +++ b/contributors/design-proposals/auth/runas-groupid.md @@ -1,238 +1,6 @@ -# RunAsGroup Proposal +Design proposals have been archived. -**Author**: krmayankk@ +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -**Status**: Proposal - -## Abstract - - -As a Kubernetes user, we should be able to specify both the user id and the group id for the containers running -inside a pod on a per-container basis, similar to how docker allows this with the docker run options `-u, ---user="" Username or UID (format: <name|uid>[:<group|gid>]) format`. - -PodSecurityContext allows Kubernetes users to specify RunAsUser, which can be overridden by RunAsUser -in SecurityContext on a per-container basis. There is no equivalent field for specifying the primary -group of the running container. - -## Motivation - -Enterprise Kubernetes users want to run containers as non-root. This means running containers with a -non-zero user id and a non-zero primary group id. This gives enterprises confidence that their customer code -is running with least privilege and, if it escapes the container boundary, will still cause the least harm -by decreasing the attack surface. - -### What is the significance of the Primary Group Id? -The primary group id is the group id used when creating files and directories. It is also the default group -associated with a user when they log in. All groups are defined in the `/etc/group` file and are created -with the `groupadd` command. A process/container runs with the uid/primary gid of the calling user. If no -primary group is specified for a user, group 0 (root) is assumed. This means any files/directories created -by a process running as a user with no primary group associated with it will be owned by group id 0 (root). - -## Goals - -1. Provide the ability to specify the primary group id for a container inside a Pod -2. Bring launching of containers using Kubernetes on par with Docker by supporting the same features. - - -## Use Cases - -### Use case 1: -As a Kubernetes user, I should be able to control both the user id and the primary group id of containers -launched using Kubernetes at runtime, so that I can run the container as non-root with the least possible -privilege. - -### Use case 2: -As a Kubernetes user, I should be able to control both the user id and the primary group id of containers -launched using Kubernetes at runtime, so that I can override the user id and primary group id specified -in the Dockerfile of the container image, without having to create a new Docker image. - -## Design - -### Model - -Introduce a new API field in SecurityContext and PodSecurityContext called `RunAsGroup`. - -#### SecurityContext - -``` -// SecurityContext holds security configuration that will be applied to a container. -// Some fields are present in both SecurityContext and PodSecurityContext. When both -// are set, the values in SecurityContext take precedence. -type SecurityContext struct { - //Other fields not shown for brevity - ..... - - // The UID to run the entrypoint of the container process. - // Defaults to user specified in image metadata if unspecified. - // May also be set in PodSecurityContext.
If set in both SecurityContext and - // PodSecurityContext, the value specified in SecurityContext takes precedence. - // +optional - RunAsUser *int64 - // The GID to run the entrypoint of the container process. - // Defaults to group specified in image metadata if unspecified. - // May also be set in PodSecurityContext. If set in both SecurityContext and - // PodSecurityContext, the value specified in SecurityContext takes precedence. - // +optional - RunAsGroup *int64 - // Indicates that the container must run as a non-root user. - // If true, the Kubelet will validate the image at runtime to ensure that it - // does not run as UID 0 (root) and fail to start the container if it does. - // If unset or false, no such validation will be performed. - // May also be set in SecurityContext. If set in both SecurityContext and - // PodSecurityContext, the value specified in SecurityContext takes precedence. - // +optional - RunAsNonRoot *bool - // Indicates that the container must run as a non-root group. - // If true, the Kubelet will validate the image at runtime to ensure that it - // does not run as GID 0 (root) and fail to start the container if it does. - // If unset or false, no such validation will be performed. - // May also be set in SecurityContext. If set in both SecurityContext and - // PodSecurityContext, the value specified in SecurityContext takes precedence. - // +optional - RunAsNonRootGroup *bool - - ..... - } -``` - -#### PodSecurityContext - -``` -type PodSecurityContext struct { - //Other fields not shown for brevity - ..... - - // The UID to run the entrypoint of the container process. - // Defaults to user specified in image metadata if unspecified. - // May also be set in SecurityContext. If set in both SecurityContext and - // PodSecurityContext, the value specified in SecurityContext takes precedence - // for that container. - // +optional - RunAsUser *int64 - // The GID to run the entrypoint of the container process. - // Defaults to group specified in image metadata if unspecified. - // May also be set in PodSecurityContext. If set in both SecurityContext and - // PodSecurityContext, the value specified in SecurityContext takes precedence. - // +optional - RunAsGroup *int64 - // Indicates that the container must run as a non-root user. - // If true, the Kubelet will validate the image at runtime to ensure that it - // does not run as UID 0 (root) and fail to start the container if it does. - // If unset or false, no such validation will be performed. - // May also be set in SecurityContext. If set in both SecurityContext and - // PodSecurityContext, the value specified in SecurityContext takes precedence. - // +optional - RunAsNonRoot *bool - // Indicates that the container must run as a non-root group. - // If true, the Kubelet will validate the image at runtime to ensure that it - // does not run as GID 0 (root) and fail to start the container if it does. - // If unset or false, no such validation will be performed. - // May also be set in SecurityContext. If set in both SecurityContext and - // PodSecurityContext, the value specified in SecurityContext takes precedence. - // +optional - RunAsNonRootGroup *bool - - - ..... - } -``` - -#### PodSecurityPolicy - -PodSecurityPolicy defines strategies or conditions that a pod must run with in order to be accepted -into the system. Two of the relevant strategies are RunAsUser and SupplementalGroups. 
We introduce -a new strategy called RunAsGroup which will support the following options: -- MustRunAs -- MustRunAsNonRoot -- RunAsAny - -``` -// PodSecurityPolicySpec defines the policy enforced. - type PodSecurityPolicySpec struct { - //Other fields not shown for brevity - ..... - // RunAsUser is the strategy that will dictate the allowable RunAsUser values that may be set. - RunAsUser RunAsUserStrategyOptions - // SupplementalGroups is the strategy that will dictate what supplemental groups are used by the SecurityContext. - SupplementalGroups SupplementalGroupsStrategyOptions - - - // RunAsGroup is the strategy that will dictate the allowable RunAsGroup values that may be set. - RunAsGroup RunAsGroupStrategyOptions - ..... -} - -// RunAsGroupStrategyOptions defines the strategy type and any options used to create the strategy. - type RunAsUserStrategyOptions struct { - // Rule is the strategy that will dictate the allowable RunAsGroup values that may be set. - Rule RunAsGroupStrategy - // Ranges are the allowed ranges of gids that may be used. - // +optional - Ranges []GroupIDRange - } - -// RunAsGroupStrategy denotes strategy types for generating RunAsGroup values for a - // SecurityContext. - type RunAsGroupStrategy string - - const ( - // container must run as a particular gid. - RunAsGroupStrategyMustRunAs RunAsGroupStrategy = "MustRunAs" - // container must run as a non-root gid - RunAsGroupStrategyMustRunAsNonRoot RunAsGroupStrategy = "MustRunAsNonRoot" - // container may make requests for any gid. - RunAsGroupStrategyRunAsAny RunAsGroupStrategy = "RunAsAny" - ) -``` - -## Behavior - -Following points should be noted: - -- `FSGroup` and `SupplementalGroups` will continue to have their old meanings and would be untouched. -- The `RunAsGroup` In the SecurityContext will override the `RunAsGroup` in the PodSecurityContext. -- If both `RunAsUser` and `RunAsGroup` are NOT provided, the USER field in Dockerfile is used -- If both `RunAsUser` and `RunAsGroup` are specified, that is passed directly as User. -- If only one of `RunAsUser` or `RunAsGroup` is specified, the remaining value is decided by the Runtime, - where the Runtime behavior is to make it run with uid or gid as 0. -- If a non numeric Group is specified in the Dockerfile and `RunAsNonRootGroup` is set, this will be - treated as error, similar to the behavior of `RunAsNonRoot` for non numeric User in Dockerfile. - -Basically, we guarantee to set the values provided by user, and the runtime dictates the rest. - -Here is an example of what gets passed to docker User -- runAsUser set to 9999, runAsGroup set to 9999 -> Config.User set to 9999:9999 -- runAsUser set to 9999, runAsGroup unset -> Config.User set to 9999 -> docker runs you with 9999:0 -- runAsUser unset, runAsGroup set to 9999 -> Config.User set to :9999 -> docker runs you with 0:9999 -- runAsUser unset, runAsGroup unset -> Config.User set to whatever is present in Dockerfile -This is to keep the behavior backward compatible and as expected. - -## Summary of Changes needed - -At a high level, the changes classify into: -1. API -2. Validation -3. CRI -4. Runtime for Docker and rkt -5. Swagger -6. DockerShim -7. Admission -8. 
Registry - -- plugin/pkg/admission/security/podsecuritypolicy -- plugin/pkg/admission/securitycontext -- pkg/securitycontext/util.go -- pkg/security/podsecuritypolicy/selinux -- pkg/security/podsecuritypolicy/user -- pkg/security/podsecuritypolicy/group -- pkg/registry/extensions/podsecuritypolicy/storage -- pkg/kubelet/rkt -- pkg/kubelet/kuberuntime -- pkg/kubelet/dockershim/ -- pkg/kubelet/apis/cri/v1alpha1/runtime -- pkg/apis/extensions/validation/ -- pkg/api/validation/ -- api/swagger-spec/ -- api/openapi-spec/swagger.json +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/auth/secrets.md b/contributors/design-proposals/auth/secrets.md index a5352444..f0fbec72 100644 --- a/contributors/design-proposals/auth/secrets.md +++ b/contributors/design-proposals/auth/secrets.md @@ -1,624 +1,6 @@ -## Abstract +Design proposals have been archived. -A proposal for the distribution of [secrets](https://kubernetes.io/docs/concepts/configuration/secret/) -(passwords, keys, etc) to the Kubelet and to containers inside Kubernetes using -a custom [volume](https://kubernetes.io/docs/concepts/storage/volumes/#secret) type. See the -[secrets example](https://kubernetes.io/docs/concepts/configuration/secret/#using-secrets) for more information. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Motivation - -Secrets are needed in containers to access internal resources like the -Kubernetes master or external resources such as git repositories, databases, -etc. Users may also want behaviors in the kubelet that depend on secret data -(credentials for image pull from a docker registry) associated with pods. - -Goals of this design: - -1. Describe a secret resource -2. Define the various challenges attendant to managing secrets on the node -3. Define a mechanism for consuming secrets in containers without modification - -## Constraints and Assumptions - -* This design does not prescribe a method for storing secrets; storage of -secrets should be pluggable to accommodate different use-cases -* Encryption of secret data and node security are orthogonal concerns -* It is assumed that node and master are secure and that compromising their -security could also compromise secrets: - * If a node is compromised, the only secrets that could potentially be -exposed should be the secrets belonging to containers scheduled onto it - * If the master is compromised, all secrets in the cluster may be exposed -* Secret rotation is an orthogonal concern, but it should be facilitated by -this proposal -* A user who can consume a secret in a container can know the value of the -secret; secrets must be provisioned judiciously - -## Use Cases - -1. As a user, I want to store secret artifacts for my applications and consume -them securely in containers, so that I can keep the configuration for my -applications separate from the images that use them: - 1. As a cluster operator, I want to allow a pod to access the Kubernetes -master using a custom `.kubeconfig` file, so that I can securely reach the -master - 2. As a cluster operator, I want to allow a pod to access a Docker registry -using credentials from a `.dockercfg` file, so that containers can push images - 3. As a cluster operator, I want to allow a pod to access a git repository -using SSH keys, so that I can push to and fetch from the repository -2. As a user, I want to allow containers to consume supplemental information -about services such as username and password which should be kept secret, so -that I can share secrets about a service amongst the containers in my -application securely -3. As a user, I want to associate a pod with a `ServiceAccount` that consumes a -secret and have the kubelet implement some reserved behaviors based on the types -of secrets the service account consumes: - 1. Use credentials for a docker registry to pull the pod's docker image - 2. Present Kubernetes auth token to the pod or transparently decorate -traffic between the pod and master service -4. 
As a user, I want to be able to indicate that a secret expires and for that -secret's value to be rotated once it expires, so that the system can help me -follow good practices - -### Use-Case: Configuration artifacts - -Many configuration files contain secrets intermixed with other configuration -information. For example, a user's application may contain a properties file -than contains database credentials, SaaS API tokens, etc. Users should be able -to consume configuration artifacts in their containers and be able to control -the path on the container's filesystems where the artifact will be presented. - -### Use-Case: Metadata about services - -Most pieces of information about how to use a service are secrets. For example, -a service that provides a MySQL database needs to provide the username, -password, and database name to consumers so that they can authenticate and use -the correct database. Containers in pods consuming the MySQL service would also -consume the secrets associated with the MySQL service. - -### Use-Case: Secrets associated with service accounts - -[Service Accounts](service_accounts.md) are proposed as a mechanism to decouple -capabilities and security contexts from individual human users. A -`ServiceAccount` contains references to some number of secrets. A `Pod` can -specify that it is associated with a `ServiceAccount`. Secrets should have a -`Type` field to allow the Kubelet and other system components to take action -based on the secret's type. - -#### Example: service account consumes auth token secret - -As an example, the service account proposal discusses service accounts consuming -secrets which contain Kubernetes auth tokens. When a Kubelet starts a pod -associated with a service account which consumes this type of secret, the -Kubelet may take a number of actions: - -1. Expose the secret in a `.kubernetes_auth` file in a well-known location in -the container's file system -2. Configure that node's `kube-proxy` to decorate HTTP requests from that pod -to the `kubernetes-master` service with the auth token, e. g. by adding a header -to the request (see the [LOAS Daemon](http://issue.k8s.io/2209) proposal) - -#### Example: service account consumes docker registry credentials - -Another example use case is where a pod is associated with a secret containing -docker registry credentials. The Kubelet could use these credentials for the -docker pull to retrieve the image. - -### Use-Case: Secret expiry and rotation - -Rotation is considered a good practice for many types of secret data. It should -be possible to express that a secret has an expiry date; this would make it -possible to implement a system component that could regenerate expired secrets. -As an example, consider a component that rotates expired secrets. The rotator -could periodically regenerate the values for expired secrets of common types and -update their expiry dates. - -## Deferral: Consuming secrets as environment variables - -Some images will expect to receive configuration items as environment variables -instead of files. We should consider what the best way to allow this is; there -are a few different options: - -1. Force the user to adapt files into environment variables. 
Users can store -secrets that need to be presented as environment variables in a format that is -easy to consume from a shell: - - $ cat /etc/secrets/my-secret.txt - export MY_SECRET_ENV=MY_SECRET_VALUE - - The user could `source` the file at `/etc/secrets/my-secret` prior to -executing the command for the image either inline in the command or in an init -script. - -2. Give secrets an attribute that allows users to express the intent that the -platform should generate the above syntax in the file used to present a secret. -The user could consume these files in the same manner as the above option. - -3. Give secrets attributes that allow the user to express that the secret -should be presented to the container as an environment variable. The container's -environment would contain the desired values and the software in the container -could use them without accommodation the command or setup script. - -For our initial work, we will treat all secrets as files to narrow the problem -space. There will be a future proposal that handles exposing secrets as -environment variables. - -## Flow analysis of secret data with respect to the API server - -There are two fundamentally different use-cases for access to secrets: - -1. CRUD operations on secrets by their owners -2. Read-only access to the secrets needed for a particular node by the kubelet - -### Use-Case: CRUD operations by owners - -In use cases for CRUD operations, the user experience for secrets should be no -different than for other API resources. - -#### Data store backing the REST API - -The data store backing the REST API should be pluggable because different -cluster operators will have different preferences for the central store of -secret data. Some possibilities for storage: - -1. An etcd collection alongside the storage for other API resources -2. A collocated [HSM](http://en.wikipedia.org/wiki/Hardware_security_module) -3. A secrets server like [Vault](https://www.vaultproject.io/) or -[Keywhiz](https://square.github.io/keywhiz/) -4. An external datastore such as an external etcd, RDBMS, etc. - -#### Size limit for secrets - -There should be a size limit for secrets in order to: - -1. Prevent DOS attacks against the API server -2. Allow kubelet implementations that prevent secret data from touching the -node's filesystem - -The size limit should satisfy the following conditions: - -1. Large enough to store common artifact types (encryption keypairs, -certificates, small configuration files) -2. Small enough to avoid large impact on node resource consumption (storage, -RAM for tmpfs, etc) - -To begin discussion, we propose an initial value for this size limit of **1MB**. - -#### Other limitations on secrets - -Defining a policy for limitations on how a secret may be referenced by another -API resource and how constraints should be applied throughout the cluster is -tricky due to the number of variables involved: - -1. Should there be a maximum number of secrets a pod can reference via a -volume? -2. Should there be a maximum number of secrets a service account can reference? -3. Should there be a total maximum number of secrets a pod can reference via -its own spec and its associated service account? -4. Should there be a total size limit on the amount of secret data consumed by -a pod? -5. How will cluster operators want to be able to configure these limits? -6. How will these limits impact API server validations? -7. How will these limits affect scheduling? - -For now, we will not implement validations around these limits. 
Cluster -operators will decide how much node storage is allocated to secrets. It will be -the operator's responsibility to ensure that the allocated storage is sufficient -for the workload scheduled onto a node. - -For now, kubelets will only attach secrets to api-sourced pods, and not file- -or http-sourced ones. Doing so would: - - confuse the secrets admission controller in the case of mirror pods. - - create an apiserver-liveness dependency -- avoiding this dependency is a -main reason to use non-api-source pods. - -### Use-Case: Kubelet read of secrets for node - -The use-case where the kubelet reads secrets has several additional requirements: - -1. Kubelets should only be able to receive secret data which is required by -pods scheduled onto the kubelet's node -2. Kubelets should have read-only access to secret data -3. Secret data should not be transmitted over the wire insecurely -4. Kubelets must ensure pods do not have access to each other's secrets - -#### Read of secret data by the Kubelet - -The Kubelet should only be allowed to read secrets which are consumed by pods -scheduled onto that Kubelet's node and their associated service accounts. -Authorization of the Kubelet to read this data would be delegated to an -authorization plugin and associated policy rule. - -#### Secret data on the node: data at rest - -Consideration must be given to whether secret data should be allowed to be at -rest on the node: - -1. If secret data is not allowed to be at rest, the size of secret data becomes -another draw on the node's RAM - should it affect scheduling? -2. If secret data is allowed to be at rest, should it be encrypted? - 1. If so, how should be this be done? - 2. If not, what threats exist? What types of secret are appropriate to -store this way? - -For the sake of limiting complexity, we propose that initially secret data -should not be allowed to be at rest on a node; secret data should be stored on a -node-level tmpfs filesystem. This filesystem can be subdivided into directories -for use by the kubelet and by the volume plugin. - -#### Secret data on the node: resource consumption - -The Kubelet will be responsible for creating the per-node tmpfs file system for -secret storage. It is hard to make a prescriptive declaration about how much -storage is appropriate to reserve for secrets because different installations -will vary widely in available resources, desired pod to node density, overcommit -policy, and other operation dimensions. That being the case, we propose for -simplicity that the amount of secret storage be controlled by a new parameter to -the kubelet with a default value of **64MB**. It is the cluster operator's -responsibility to handle choosing the right storage size for their installation -and configuring their Kubelets correctly. - -Configuring each Kubelet is not the ideal story for operator experience; it is -more intuitive that the cluster-wide storage size be readable from a central -configuration store like the one proposed in [#1553](http://issue.k8s.io/1553). -When such a store exists, the Kubelet could be modified to read this -configuration item from the store. - -When the Kubelet is modified to advertise node resources (as proposed in -[#4441](http://issue.k8s.io/4441)), the capacity calculation -for available memory should factor in the potential size of the node-level tmpfs -in order to avoid memory overcommit on the node. - -#### Secret data on the node: isolation - -Every pod will have a [security context](security_context.md). 
-Secret data on the node should be isolated according to the security context of -the container. The Kubelet volume plugin API will be changed so that a volume -plugin receives the security context of a volume along with the volume spec. -This will allow volume plugins to implement setting the security context of -volumes they manage. - -## Community work - -Several proposals / upstream patches are notable as background for this -proposal: - -1. [Docker vault proposal](https://github.com/docker/docker/issues/10310) -2. [Specification for image/container standardization based on volumes](https://github.com/docker/docker/issues/9277) -3. [Kubernetes service account proposal](service_accounts.md) -4. [Secrets proposal for docker (1)](https://github.com/docker/docker/pull/6075) -5. [Secrets proposal for docker (2)](https://github.com/docker/docker/pull/6697) - -## Proposed Design - -We propose a new `Secret` resource which is mounted into containers with a new -volume type. Secret volumes will be handled by a volume plugin that does the -actual work of fetching the secret and storing it. Secrets contain multiple -pieces of data that are presented as different files within the secret volume -(example: SSH key pair). - -In order to remove the burden from the end user in specifying every file that a -secret consists of, it should be possible to mount all files provided by a -secret with a single `VolumeMount` entry in the container specification. - -### Secret API Resource - -A new resource for secrets will be added to the API: - -```go -type Secret struct { - TypeMeta - ObjectMeta - - // Data contains the secret data. Each key must be a valid DNS_SUBDOMAIN. - // The serialized form of the secret data is a base64 encoded string, - // representing the arbitrary (possibly non-string) data value here. - Data map[string][]byte `json:"data,omitempty"` - - // Used to facilitate programmatic handling of secret data. - Type SecretType `json:"type,omitempty"` -} - -type SecretType string - -const ( - SecretTypeOpaque SecretType = "Opaque" // Opaque (arbitrary data; default) - SecretTypeServiceAccountToken SecretType = "kubernetes.io/service-account-token" // Kubernetes auth token - SecretTypeDockercfg SecretType = "kubernetes.io/dockercfg" // Docker registry auth - SecretTypeDockerConfigJson SecretType = "kubernetes.io/dockerconfigjson" // Latest Docker registry auth - // FUTURE: other type values -) - -const MaxSecretSize = 1 * 1024 * 1024 -``` - -A Secret can declare a type in order to provide type information to system -components that work with secrets. The default type is `opaque`, which -represents arbitrary user-owned data. - -Secrets are validated against `MaxSecretSize`. The keys in the `Data` field must -be valid DNS subdomains. - -A new REST API and registry interface will be added to accompany the `Secret` -resource. The default implementation of the registry will store `Secret` -information in etcd. Future registry implementations could store the `TypeMeta` -and `ObjectMeta` fields in etcd and store the secret data in another data store -entirely, or store the whole object in another data store. - -#### Other validations related to secrets - -Initially there will be no validations for the number of secrets a pod -references, or the number of secrets that can be associated with a service -account. These may be added in the future as the finer points of secrets and -resource allocation are fleshed out. 
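For concreteness, here is a minimal sketch of the validation described above: the `MaxSecretSize` cap and the requirement that every key in `Data` be a valid DNS subdomain. The `validateSecret` helper and the simplified regular expression are illustrative assumptions, not the actual apiserver validation code.

```go
package validation

import (
	"fmt"
	"regexp"
)

// MaxSecretSize mirrors the 1MB limit proposed above.
const MaxSecretSize = 1 * 1024 * 1024

// dnsSubdomain is a simplified approximation of the DNS_SUBDOMAIN rule used
// elsewhere in the API; the real rule also limits the length of each label.
var dnsSubdomain = regexp.MustCompile(`^[a-z0-9]([-a-z0-9.]*[a-z0-9])?$`)

// validateSecret checks the two rules described above: every key must be a
// valid DNS subdomain, and the total payload must not exceed MaxSecretSize.
func validateSecret(data map[string][]byte) error {
	total := 0
	for key, value := range data {
		if !dnsSubdomain.MatchString(key) {
			return fmt.Errorf("key %q is not a valid DNS subdomain", key)
		}
		total += len(value)
	}
	if total > MaxSecretSize {
		return fmt.Errorf("secret data is larger than %d bytes", MaxSecretSize)
	}
	return nil
}
```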
- -### Secret Volume Source - -A new `SecretSource` type of volume source will be added to the `VolumeSource` -struct in the API: - -```go -type VolumeSource struct { - // Other fields omitted - - // SecretSource represents a secret that should be presented in a volume - SecretSource *SecretSource `json:"secret"` -} - -type SecretSource struct { - Target ObjectReference -} -``` - -Secret volume sources are validated to ensure that the specified object -reference actually points to an object of type `Secret`. - -In the future, the `SecretSource` will be extended to allow: - -1. Fine-grained control over which pieces of secret data are exposed in the -volume -2. The paths and filenames for how secret data are exposed - -### Secret Volume Plugin - -A new Kubelet volume plugin will be added to handle volumes with a secret -source. This plugin will require access to the API server to retrieve secret -data and therefore the volume `Host` interface will have to change to expose a -client interface: - -```go -type Host interface { - // Other methods omitted - - // GetKubeClient returns a client interface - GetKubeClient() client.Interface -} -``` - -The secret volume plugin will be responsible for: - -1. Returning a `volume.Mounter` implementation from `NewMounter` that: - 1. Retrieves the secret data for the volume from the API server - 2. Places the secret data onto the container's filesystem - 3. Sets the correct security attributes for the volume based on the pod's -`SecurityContext` -2. Returning a `volume.Unmounter` implementation from `NewUnmounter` that -cleans the volume from the container's filesystem - -### Kubelet: Node-level secret storage - -The Kubelet must be modified to accept a new parameter for the secret storage -size and to create a tmpfs file system of that size to store secret data. Rough -accounting of specific changes: - -1. The Kubelet should have a new field added called `secretStorageSize`; units -are megabytes -2. `NewMainKubelet` should accept a value for secret storage size -3. The Kubelet server should have a new flag added for secret storage size -4. The Kubelet's `setupDataDirs` method should be changed to create the secret -storage - -### Kubelet: New behaviors for secrets associated with service accounts - -For use-cases where the Kubelet's behavior is affected by the secrets associated -with a pod's `ServiceAccount`, the Kubelet will need to be changed. For example, -if secrets of type `docker-reg-auth` affect how the pod's images are pulled, the -Kubelet will need to be changed to accommodate this. Subsequent proposals can -address this on a type-by-type basis. - -## Examples - -For clarity, let's examine some detailed examples of some common use-cases in -terms of the suggested changes. All of these examples are assumed to be created -in a namespace called `example`. - -### Use-Case: Pod with ssh keys - -To create a pod that uses an ssh key stored as a secret, we first need to create -a secret: - -```json -{ - "kind": "Secret", - "apiVersion": "v1", - "metadata": { - "name": "ssh-key-secret" - }, - "data": { - "id-rsa": "dmFsdWUtMg0KDQo=", - "id-rsa.pub": "dmFsdWUtMQ0K" - } -} -``` - -**Note:** The serialized JSON and YAML values of secret data are encoded as -base64 strings. Newlines are not valid within these strings and must be -omitted. 
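As a small illustration of the note above (not part of the proposal itself), the following sketch shows one way the single-line base64 values for the `data` map could be produced; the `id-rsa` file name mirrors the example secret and is an assumption.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"log"
	"os"
)

func main() {
	// Read the raw key material; the path is illustrative only.
	raw, err := os.ReadFile("id-rsa")
	if err != nil {
		log.Fatal(err)
	}
	// StdEncoding yields a single-line base64 string, which is the form the
	// serialized JSON/YAML secret data expects (no embedded newlines).
	fmt.Printf("%q: %q\n", "id-rsa", base64.StdEncoding.EncodeToString(raw))
}
```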
- -Now we can create a pod which references the secret with the ssh key and -consumes it in a volume: - -```json -{ - "kind": "Pod", - "apiVersion": "v1", - "metadata": { - "name": "secret-test-pod", - "labels": { - "name": "secret-test" - } - }, - "spec": { - "volumes": [ - { - "name": "secret-volume", - "secret": { - "secretName": "ssh-key-secret" - } - } - ], - "containers": [ - { - "name": "ssh-test-container", - "image": "mySshImage", - "volumeMounts": [ - { - "name": "secret-volume", - "readOnly": true, - "mountPath": "/etc/secret-volume" - } - ] - } - ] - } -} -``` - -When the container's command runs, the pieces of the key will be available in: - - /etc/secret-volume/id-rsa.pub - /etc/secret-volume/id-rsa - -The container is then free to use the secret data to establish an ssh -connection. - -### Use-Case: Pods with prod / test credentials - -This example illustrates a pod which consumes a secret containing prod -credentials and another pod which consumes a secret with test environment -credentials. - -The secrets: - -```json -{ - "apiVersion": "v1", - "kind": "List", - "items": - [{ - "kind": "Secret", - "apiVersion": "v1", - "metadata": { - "name": "prod-db-secret" - }, - "data": { - "password": "dmFsdWUtMg0KDQo=", - "username": "dmFsdWUtMQ0K" - } - }, - { - "kind": "Secret", - "apiVersion": "v1", - "metadata": { - "name": "test-db-secret" - }, - "data": { - "password": "dmFsdWUtMg0KDQo=", - "username": "dmFsdWUtMQ0K" - } - }] -} -``` - -The pods: - -```json -{ - "apiVersion": "v1", - "kind": "List", - "items": - [{ - "kind": "Pod", - "apiVersion": "v1", - "metadata": { - "name": "prod-db-client-pod", - "labels": { - "name": "prod-db-client" - } - }, - "spec": { - "volumes": [ - { - "name": "secret-volume", - "secret": { - "secretName": "prod-db-secret" - } - } - ], - "containers": [ - { - "name": "db-client-container", - "image": "myClientImage", - "volumeMounts": [ - { - "name": "secret-volume", - "readOnly": true, - "mountPath": "/etc/secret-volume" - } - ] - } - ] - } - }, - { - "kind": "Pod", - "apiVersion": "v1", - "metadata": { - "name": "test-db-client-pod", - "labels": { - "name": "test-db-client" - } - }, - "spec": { - "volumes": [ - { - "name": "secret-volume", - "secret": { - "secretName": "test-db-secret" - } - } - ], - "containers": [ - { - "name": "db-client-container", - "image": "myClientImage", - "volumeMounts": [ - { - "name": "secret-volume", - "readOnly": true, - "mountPath": "/etc/secret-volume" - } - ] - } - ] - } - }] -} -``` - -The specs for the two pods differ only in the value of the object referred to by -the secret volume source. Both containers will have the following files present -on their filesystems: - - /etc/secret-volume/username - /etc/secret-volume/password +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/auth/security.md b/contributors/design-proposals/auth/security.md index d2c3e0e2..f0fbec72 100644 --- a/contributors/design-proposals/auth/security.md +++ b/contributors/design-proposals/auth/security.md @@ -1,214 +1,6 @@ -# Security in Kubernetes +Design proposals have been archived. -Kubernetes should define a reasonable set of security best practices that allows -processes to be isolated from each other, from the cluster infrastructure, and -which preserves important boundaries between those who manage the cluster, and -those who use the cluster. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -While Kubernetes today is not primarily a multi-tenant system, the long term -evolution of Kubernetes will increasingly rely on proper boundaries between -users and administrators. The code running on the cluster must be appropriately -isolated and secured to prevent malicious parties from affecting the entire -cluster. - - -## High Level Goals - -1. Ensure a clear isolation between the container and the underlying host it -runs on -2. Limit the ability of the container to negatively impact the infrastructure -or other containers -3. [Principle of Least Privilege](http://en.wikipedia.org/wiki/Principle_of_least_privilege) - -ensure components are only authorized to perform the actions they need, and -limit the scope of a compromise by limiting the capabilities of individual -components -4. Reduce the number of systems that have to be hardened and secured by -defining clear boundaries between components -5. Allow users of the system to be cleanly separated from administrators -6. Allow administrative functions to be delegated to users where necessary -7. Allow applications to be run on the cluster that have "secret" data (keys, -certs, passwords) which is properly abstracted from "public" data. - -## Use cases - -### Roles - -We define "user" as a unique identity accessing the Kubernetes API server, which -may be a human or an automated process. Human users fall into the following -categories: - -1. k8s admin - administers a Kubernetes cluster and has access to the underlying -components of the system -2. k8s project administrator - administrates the security of a small subset of -the cluster -3. k8s developer - launches pods on a Kubernetes cluster and consumes cluster -resources - -Automated process users fall into the following categories: - -1. k8s container user - a user that processes running inside a container (on the -cluster) can use to access other cluster resources independent of the human -users attached to a project -2. k8s infrastructure user - the user that Kubernetes infrastructure components -use to perform cluster functions with clearly defined roles - -### Description of roles - -* Developers: - * write pod specs. - * making some of their own images, and using some "community" docker images - * know which pods need to talk to which other pods - * decide which pods should share files with other pods, and which should not. - * reason about application level security, such as containing the effects of a -local-file-read exploit in a webserver pod. - * do not often reason about operating system or organizational security. - * are not necessarily comfortable reasoning about the security properties of a -system at the level of detail of Linux Capabilities, SELinux, AppArmor, etc. 
- -* Project Admins: - * allocate identity and roles within a namespace - * reason about organizational security within a namespace - * don't give a developer permissions that are not needed for role. - * protect files on shared storage from unnecessary cross-team access - * are less focused about application security - -* Administrators: - * are less focused on application security. Focused on operating system -security. - * protect the node from bad actors in containers, and properly-configured -innocent containers from bad actors in other containers. - * comfortable reasoning about the security properties of a system at the level -of detail of Linux Capabilities, SELinux, AppArmor, etc. - * decides who can use which Linux Capabilities, run privileged containers, use -hostPath, etc. - * e.g. a team that manages Ceph or a mysql server might be trusted to have -raw access to storage devices in some organizations, but teams that develop the -applications at higher layers would not. - - -## Proposed Design - -A pod runs in a *security context* under a *service account* that is defined by -an administrator or project administrator, and the *secrets* a pod has access to -is limited by that *service account*. - - -1. The API should authenticate and authorize user actions [authn and authz](access.md) -2. All infrastructure components (kubelets, kube-proxies, controllers, -scheduler) should have an infrastructure user that they can authenticate with -and be authorized to perform only the functions they require against the API. -3. Most infrastructure components should use the API as a way of exchanging data -and changing the system, and only the API should have access to the underlying -data store (etcd) -4. When containers run on the cluster and need to talk to other containers or -the API server, they should be identified and authorized clearly as an -autonomous process via a [service account](service_accounts.md) - 1. If the user who started a long-lived process is removed from access to -the cluster, the process should be able to continue without interruption - 2. If the user who started processes are removed from the cluster, -administrators may wish to terminate their processes in bulk - 3. When containers run with a service account, the user that created / -triggered the service account behavior must be associated with the container's -action -5. When container processes run on the cluster, they should run in a -[security context](security_context.md) that isolates those processes via Linux -user security, user namespaces, and permissions. - 1. Administrators should be able to configure the cluster to automatically -confine all container processes as a non-root, randomly assigned UID - 2. Administrators should be able to ensure that container processes within -the same namespace are all assigned the same unix user UID - 3. Administrators should be able to limit which developers and project -administrators have access to higher privilege actions - 4. Project administrators should be able to run pods within a namespace -under different security contexts, and developers must be able to specify which -of the available security contexts they may use - 5. Developers should be able to run their own images or images from the -community and expect those images to run correctly - 6. Developers may need to ensure their images work within higher security -requirements specified by administrators - 7. When available, Linux kernel user namespaces can be used to ensure 5.2 -and 5.4 are met. - 8. 
When application developers want to share filesystem data via distributed -filesystems, the Unix user ids on those filesystems must be consistent across -different container processes -6. Developers should be able to define [secrets](secrets.md) that are -automatically added to the containers when pods are run - 1. Secrets are files injected into the container whose values should not be -displayed within a pod. Examples: - 1. An SSH private key for git cloning remote data - 2. A client certificate for accessing a remote system - 3. A private key and certificate for a web server - 4. A .kubeconfig file with embedded cert / token data for accessing the -Kubernetes master - 5. A .dockercfg file for pulling images from a protected registry - 2. Developers should be able to define the pod spec so that a secret lands -in a specific location - 3. Project administrators should be able to limit developers within a -namespace from viewing or modifying secrets (anyone who can launch an arbitrary -pod can view secrets) - 4. Secrets are generally not copied from one namespace to another when a -developer's application definitions are copied - - -### Related design discussion - -* [Authorization and authentication](access.md) -* [Secret distribution via files](http://pr.k8s.io/2030) -* [Docker secrets](https://github.com/docker/docker/pull/6697) -* [Docker vault](https://github.com/docker/docker/issues/10310) -* [Service Accounts:](service_accounts.md) -* [Secret volumes](http://pr.k8s.io/4126) - -## Specific Design Points - -### TODO: authorization, authentication - -### Isolate the data store from the nodes and supporting infrastructure - -Access to the central data store (etcd) in Kubernetes allows an attacker to run -arbitrary containers on hosts, to gain access to any protected information -stored in either volumes or in pods (such as access tokens or shared secrets -provided as environment variables), to intercept and redirect traffic from -running services by inserting middlemen, or to simply delete the entire history -of the cluster. - -As a general principle, access to the central data store should be restricted to -the components that need full control over the system and which can apply -appropriate authorization and authentication of change requests. In the future, -etcd may offer granular access control, but that granularity will require an -administrator to understand the schema of the data to properly apply security. -An administrator must be able to properly secure Kubernetes at a policy level, -rather than at an implementation level, and schema changes over time should not -risk unintended security leaks. - -Both the kubelet and kube-proxy need information related to their specific roles - -for the kubelet, the set of pods it should be running, and for the kube-proxy, the -set of services and endpoints to load balance. The kubelet also needs to provide -information about running pods and historical termination data. The access -pattern for both kubelet and kube-proxy to load their configuration is an efficient -"wait for changes" request over HTTP. It should be possible to limit the kubelet -and kube-proxy to only access the information they need to perform their roles and no -more. - -The controller manager for Replication Controllers and other future controllers -act on behalf of a user via delegation to perform automated maintenance on -Kubernetes resources. 
Their ability to access or modify resource state should be -strictly limited to their intended duties and they should be prevented from -accessing information not pertinent to their role. For example, a replication -controller needs only to create a copy of a known pod configuration, to -determine the running state of an existing pod, or to delete an existing pod -that it created - it does not need to know the contents or current state of a -pod, nor have access to any data in the pods attached volumes. - -The Kubernetes pod scheduler is responsible for reading data from the pod to fit -it onto a node in the cluster. At a minimum, it needs access to view the ID of a -pod (to craft the binding), its current state, any resource information -necessary to identify placement, and other data relevant to concerns like -anti-affinity, zone or region preference, or custom logic. It does not need the -ability to modify pods or see other resources, only to create bindings. It -should not need the ability to delete bindings unless the scheduler takes -control of relocating components on failed hosts (which could be implemented by -a separate component that can delete bindings but not create them). The -scheduler may need read access to user or project-container information to -determine preferential location (underspecified at this time). +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/auth/security_context.md b/contributors/design-proposals/auth/security_context.md index 360f5046..f0fbec72 100644 --- a/contributors/design-proposals/auth/security_context.md +++ b/contributors/design-proposals/auth/security_context.md @@ -1,188 +1,6 @@ -# Security Contexts +Design proposals have been archived. -## Abstract +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -A security context is a set of constraints that are applied to a container in -order to achieve the following goals (from [security design](security.md)): - -1. Ensure a clear isolation between container and the underlying host it runs -on -2. Limit the ability of the container to negatively impact the infrastructure -or other containers - -## Background - -The problem of securing containers in Kubernetes has come up -[before](http://issue.k8s.io/398) and the potential problems with container -security are [well known](http://opensource.com/business/14/7/docker-security-selinux). -Although it is not possible to completely isolate Docker containers from their -hosts, new features like [user namespaces](https://github.com/docker/libcontainer/pull/304) -make it possible to greatly reduce the attack surface. - -## Motivation - -### Container isolation - -In order to improve container isolation from host and other containers running -on the host, containers should only be granted the access they need to perform -their work. To this end it should be possible to take advantage of Docker -features such as the ability to -[add or remove capabilities](https://docs.docker.com/reference/run/#runtime-privilege-linux-capabilities-and-lxc-configuration) -and [assign MCS labels](https://docs.docker.com/reference/run/#security-configuration) -to the container process. - -Support for user namespaces has recently been -[merged](https://github.com/docker/libcontainer/pull/304) into Docker's -libcontainer project and should soon surface in Docker itself. It will make it -possible to assign a range of unprivileged uids and gids from the host to each -container, improving the isolation between host and container and between -containers. - -### External integration with shared storage - -In order to support external integration with shared storage, processes running -in a Kubernetes cluster should be able to be uniquely identified by their Unix -UID, such that a chain of ownership can be established. Processes in pods will -need to have consistent UID/GID/SELinux category labels in order to access -shared disks. - -## Constraints and Assumptions - -* It is out of the scope of this document to prescribe a specific set of -constraints to isolate containers from their host. Different use cases need -different settings. -* The concept of a security context should not be tied to a particular security -mechanism or platform (i.e. SELinux, AppArmor) -* Applying a different security context to a scope (namespace or pod) requires -a solution such as the one proposed for [service accounts](service_accounts.md). - -## Use Cases - -In order of increasing complexity, following are example use cases that would -be addressed with security contexts: - -1. Kubernetes is used to run a single cloud application. 
In order to protect -nodes from containers: - * All containers run as a single non-root user - * Privileged containers are disabled - * All containers run with a particular MCS label - * Kernel capabilities like CHOWN and MKNOD are removed from containers - -2. Just like case #1, except that I have more than one application running on -the Kubernetes cluster. - * Each application is run in its own namespace to avoid name collisions - * For each application a different uid and MCS label is used - -3. Kubernetes is used as the base for a PAAS with multiple projects, each -project represented by a namespace. - * Each namespace is associated with a range of uids/gids on the node that -are mapped to uids/gids on containers using linux user namespaces. - * Certain pods in each namespace have special privileges to perform system -actions such as talking back to the server for deployment, run docker builds, -etc. - * External NFS storage is assigned to each namespace and permissions set -using the range of uids/gids assigned to that namespace. - -## Proposed Design - -### Overview - -A *security context* consists of a set of constraints that determine how a -container is secured before getting created and run. A security context resides -on the container and represents the runtime parameters that will be used to -create and run the container via container APIs. A *security context provider* -is passed to the Kubelet so it can have a chance to mutate Docker API calls in -order to apply the security context. - -It is recommended that this design be implemented in two phases: - -1. Implement the security context provider extension point in the Kubelet -so that a default security context can be applied on container run and creation. -2. Implement a security context structure that is part of a service account. The -default context provider can then be used to apply a security context based on -the service account associated with the pod. - -### Security Context Provider - -The Kubelet will have an interface that points to a `SecurityContextProvider`. -The `SecurityContextProvider` is invoked before creating and running a given -container: - -```go -type SecurityContextProvider interface { - // ModifyContainerConfig is called before the Docker createContainer call. - // The security context provider can make changes to the Config with which - // the container is created. - // An error is returned if it's not possible to secure the container as - // requested with a security context. - ModifyContainerConfig(pod *api.Pod, container *api.Container, config *docker.Config) - - // ModifyHostConfig is called before the Docker runContainer call. - // The security context provider can make changes to the HostConfig, affecting - // security options, whether the container is privileged, volume binds, etc. - // An error is returned if it's not possible to secure the container as requested - // with a security context. - ModifyHostConfig(pod *api.Pod, container *api.Container, hostConfig *docker.HostConfig) -} -``` - -If the value of the SecurityContextProvider field on the Kubelet is nil, the -kubelet will create and run the container as it does today. - -### Security Context - -A security context resides on the container and represents the runtime -parameters that will be used to create and run the container via container APIs. -Following is an example of an initial implementation: - -```go -type Container struct { - ... other fields omitted ... 
- // Optional: SecurityContext defines the security options the pod should be run with - SecurityContext *SecurityContext -} - -// SecurityContext holds security configuration that will be applied to a container. SecurityContext -// contains duplication of some existing fields from the Container resource. These duplicate fields -// will be populated based on the Container configuration if they are not set. Defining them on -// both the Container AND the SecurityContext will result in an error. -type SecurityContext struct { - // Capabilities are the capabilities to add/drop when running the container - Capabilities *Capabilities - - // Run the container in privileged mode - Privileged *bool - - // SELinuxOptions are the labels to be applied to the container - // and volumes - SELinuxOptions *SELinuxOptions - - // RunAsUser is the UID to run the entrypoint of the container process. - RunAsUser *int64 -} - -// SELinuxOptions are the labels to be applied to the container. -type SELinuxOptions struct { - // SELinux user label - User string - - // SELinux role label - Role string - - // SELinux type label - Type string - - // SELinux level label. - Level string -} -``` - -### Admission - -It is up to an admission plugin to determine if the security context is -acceptable or not. At the time of writing, the admission control plugin for -security contexts will only allow a context that has defined capabilities or -privileged. Contexts that attempt to define a UID or SELinux options will be -denied by default. In the future the admission plugin will base this decision -upon configurable policies that reside within the [service account](http://pr.k8s.io/2297). +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/auth/service_accounts.md b/contributors/design-proposals/auth/service_accounts.md index af72e467..f0fbec72 100644 --- a/contributors/design-proposals/auth/service_accounts.md +++ b/contributors/design-proposals/auth/service_accounts.md @@ -1,206 +1,6 @@ -# Service Accounts +Design proposals have been archived. -## Motivation +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Processes in Pods may need to call the Kubernetes API. For example: - - scheduler - - replication controller - - node controller - - a map-reduce type framework which has a controller that then tries to make a -dynamically determined number of workers and watch them - - continuous build and push system - - monitoring system - -They also may interact with services other than the Kubernetes API, such as: - - an image repository, such as docker -- both when the images are pulled to -start the containers, and for writing images in the case of pods that generate -images. - - accessing other cloud services, such as blob storage, in the context of a -large, integrated, cloud offering (hosted or private). - - accessing files in an NFS volume attached to the pod - -## Design Overview - -A service account binds together several things: - - a *name*, understood by users, and perhaps by peripheral systems, for an -identity - - a *principal* that can be authenticated and [authorized](../admin/authorization.md) - - a [security context](security_context.md), which defines the Linux -Capabilities, User IDs, Groups IDs, and other capabilities and controls on -interaction with the file system and OS. - - a set of [secrets](secrets.md), which a container may use to access various -networked resources. - -## Design Discussion - -A new object Kind is added: - -```go -type ServiceAccount struct { - TypeMeta `json:",inline" yaml:",inline"` - ObjectMeta `json:"metadata,omitempty" yaml:"metadata,omitempty"` - - username string - securityContext ObjectReference // (reference to a securityContext object) - secrets []ObjectReference // (references to secret objects -} -``` - -The name ServiceAccount is chosen because it is widely used already (e.g. by -Kerberos and LDAP) to refer to this type of account. Note that it has no -relation to Kubernetes Service objects. - -The ServiceAccount object does not include any information that could not be -defined separately: - - username can be defined however users are defined. - - securityContext and secrets are only referenced and are created using the -REST API. - -The purpose of the serviceAccount object is twofold: - - to bind usernames to securityContexts and secrets, so that the username can -be used to refer succinctly in contexts where explicitly naming securityContexts -and secrets would be inconvenient - - to provide an interface to simplify allocation of new securityContexts and -secrets. - -These features are explained later. - -### Names - -From the standpoint of the Kubernetes API, a `user` is any principal which can -authenticate to Kubernetes API. This includes a human running `kubectl` on her -desktop and a container in a Pod on a Node making API calls. - -There is already a notion of a username in Kubernetes, which is populated into a -request context after authentication. However, there is no API object -representing a user. 
While this may evolve, it is expected that in mature -installations, the canonical storage of user identifiers will be handled by a -system external to Kubernetes. - -Kubernetes does not dictate how to divide up the space of user identifier -strings. User names can be simple Unix-style short usernames, (e.g. `alice`), or -may be qualified to allow for federated identity (`alice@example.com` vs. -`alice@example.org`.) Naming convention may distinguish service accounts from -user accounts (e.g. `alice@example.com` vs. -`build-service-account-a3b7f0@foo-namespace.service-accounts.example.com`), but -Kubernetes does not require this. - -Kubernetes also does not require that there be a distinction between human and -Pod users. It will be possible to setup a cluster where Alice the human talks to -the Kubernetes API as username `alice` and starts pods that also talk to the API -as user `alice` and write files to NFS as user `alice`. But, this is not -recommended. - -Instead, it is recommended that Pods and Humans have distinct identities, and -reference implementations will make this distinction. - -The distinction is useful for a number of reasons: - - the requirements for humans and automated processes are different: - - Humans need a wide range of capabilities to do their daily activities. -Automated processes often have more narrowly-defined activities. - - Humans may better tolerate the exceptional conditions created by -expiration of a token. Remembering to handle this in a program is more annoying. -So, either long-lasting credentials or automated rotation of credentials is -needed. - - A Human typically keeps credentials on a machine that is not part of the -cluster and so not subject to automatic management. A VM with a -role/service-account can have its credentials automatically managed. - - the identity of a Pod cannot in general be mapped to a single human. - - If policy allows, it may be created by one human, and then updated by -another, and another, until its behavior cannot be attributed to a single human. - -**TODO**: consider getting rid of separate serviceAccount object and just -rolling its parts into the SecurityContext or Pod Object. - -The `secrets` field is a list of references to /secret objects that a process -started as that service account should have access to be able to assert that -role. - -The secrets are not inline with the serviceAccount object. This way, most or -all users can have permission to `GET /serviceAccounts` so they can remind -themselves what serviceAccounts are available for use. - -Nothing will prevent creation of a serviceAccount with two secrets of type -`SecretTypeKubernetesAuth`, or secrets of two different types. Kubelet and -client libraries will have some behavior, TBD, to handle the case of multiple -secrets of a given type (pick first or provide all and try each in order, etc). - -When a serviceAccount and a matching secret exist, then a `User.Info` for the -serviceAccount and a `BearerToken` from the secret are added to the map of -tokens used by the authentication process in the apiserver, and similarly for -other types. (We might have some types that do not do anything on apiserver but -just get pushed to the kubelet.) - -### Pods - -The `PodSpec` is extended to have a `Pods.Spec.ServiceAccountUsername` field. If -this is unset, then a default value is chosen. If it is set, then the -corresponding value of `Pods.Spec.SecurityContext` is set by the Service Account -Finalizer (see below). 
- -TBD: how policy limits which users can make pods with which service accounts. - -### Authorization - -Kubernetes API Authorization Policies refer to users. Pods created with a -`Pods.Spec.ServiceAccountUsername` typically get a `Secret` which allows them to -authenticate to the Kubernetes APIserver as a particular user. So any policy -that is desired can be applied to them. - -A higher level workflow is needed to coordinate creation of serviceAccounts, -secrets and relevant policy objects. Users are free to extend Kubernetes to put -this business logic wherever is convenient for them, though the Service Account -Finalizer is one place where this can happen (see below). - -### Kubelet - -The kubelet will treat as "not ready to run" (needing a finalizer to act on it) -any Pod which has an empty SecurityContext. - -The kubelet will set a default, restrictive, security context for any pods -created from non-Apiserver config sources (http, file). - -Kubelet watches apiserver for secrets which are needed by pods bound to it. - -**TODO**: how to only let kubelet see secrets it needs to know. - -### The service account finalizer - -There are several ways to use Pods with SecurityContexts and Secrets. - -One way is to explicitly specify the securityContext and all secrets of a Pod -when the pod is initially created, like this: - -**TODO**: example of pod with explicit refs. - -Another way is with the *Service Account Finalizer*, a plugin process which is -optional, and which handles business logic around service accounts. - -The Service Account Finalizer watches Pods, Namespaces, and ServiceAccount -definitions. - -First, if it finds pods which have a `Pod.Spec.ServiceAccountUsername` but no -`Pod.Spec.SecurityContext` set, then it copies in the referenced securityContext -and secrets references for the corresponding `serviceAccount`. - -Second, if ServiceAccount definitions change, it may take some actions. - -**TODO**: decide what actions it takes when a serviceAccount definition changes. -Does it stop pods, or just allow someone to list ones that are out of spec? In -general, people may want to customize this? - -Third, if a new namespace is created, it may create a new serviceAccount for -that namespace. This may include a new username (e.g. -`NAMESPACE-default-service-account@serviceaccounts.$CLUSTERID.kubernetes.io`), -a new securityContext, a newly generated secret to authenticate that -serviceAccount to the Kubernetes API, and default policies for that service -account. - -**TODO**: more concrete example. What are typical default permissions for -default service account (e.g. readonly access to services in the same namespace -and read-write access to events in that namespace?) - -Finally, it may provide an interface to automate creation of new -serviceAccounts. In that case, the user may want to GET serviceAccounts to see -what has been created. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/autoscaling/OWNERS b/contributors/design-proposals/autoscaling/OWNERS deleted file mode 100644 index 9a70bb4c..00000000 --- a/contributors/design-proposals/autoscaling/OWNERS +++ /dev/null @@ -1,8 +0,0 @@ -# See the OWNERS docs at https://go.k8s.io/owners - -reviewers: - - sig-autoscaling-leads -approvers: - - sig-autoscaling-leads -labels: - - sig/autoscaling diff --git a/contributors/design-proposals/autoscaling/horizontal-pod-autoscaler.md b/contributors/design-proposals/autoscaling/horizontal-pod-autoscaler.md index 98ad92ef..f0fbec72 100644 --- a/contributors/design-proposals/autoscaling/horizontal-pod-autoscaler.md +++ b/contributors/design-proposals/autoscaling/horizontal-pod-autoscaler.md @@ -1,256 +1,6 @@ -<h2>Warning! This document might be outdated.</h2> +Design proposals have been archived. -# Horizontal Pod Autoscaling +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Preface -This document briefly describes the design of the horizontal autoscaler for -pods. The autoscaler (implemented as a Kubernetes API resource and controller) -is responsible for dynamically controlling the number of replicas of some -collection (e.g. the pods of a ReplicationController) to meet some objective(s), -for example a target per-pod CPU utilization. - -This design supersedes [autoscaling.md](http://releases.k8s.io/release-1.0/docs/proposals/autoscaling.md). - -## Overview - -The resource usage of a serving application usually varies over time: sometimes -the demand for the application rises, and sometimes it drops. In Kubernetes -version 1.0, a user can only manually set the number of serving pods. Our aim is -to provide a mechanism for the automatic adjustment of the number of pods based -on CPU utilization statistics (a future version will allow autoscaling based on -other resources/metrics). - -## Scale Subresource - -In Kubernetes version 1.1, we are introducing Scale subresource and implementing -horizontal autoscaling of pods based on it. Scale subresource is supported for -replication controllers and deployments. Scale subresource is a Virtual Resource -(does not correspond to an object stored in etcd). It is only present in the API -as an interface that a controller (in this case the HorizontalPodAutoscaler) can -use to dynamically scale the number of replicas controlled by some other API -object (currently ReplicationController and Deployment) and to learn the current -number of replicas. Scale is a subresource of the API object that it serves as -the interface for. The Scale subresource is useful because whenever we introduce -another type we want to autoscale, we just need to implement the Scale -subresource for it. The wider discussion regarding Scale took place in issue -[#1629](https://github.com/kubernetes/kubernetes/issues/1629). - -Scale subresource is in API for replication controller or deployment under the -following paths: - -`apis/extensions/v1beta1/replicationcontrollers/myrc/scale` - -`apis/extensions/v1beta1/deployments/mydeployment/scale` - -It has the following structure: - -```go -// represents a scaling request for a resource. -type Scale struct { - unversioned.TypeMeta - api.ObjectMeta - - // defines the behavior of the scale. - Spec ScaleSpec - - // current status of the scale. 
- Status ScaleStatus -} - -// describes the attributes of a scale subresource -type ScaleSpec struct { - // desired number of instances for the scaled object. - Replicas int `json:"replicas,omitempty"` -} - -// represents the current status of a scale subresource. -type ScaleStatus struct { - // actual number of observed instances of the scaled object. - Replicas int `json:"replicas"` - - // label query over pods that should match the replicas count. - Selector map[string]string `json:"selector,omitempty"` -} -``` - -Writing to `ScaleSpec.Replicas` resizes the replication controller/deployment -associated with the given Scale subresource. `ScaleStatus.Replicas` reports how -many pods are currently running in the replication controller/deployment, and -`ScaleStatus.Selector` returns selector for the pods. - -## HorizontalPodAutoscaler Object - -In Kubernetes version 1.1, we are introducing HorizontalPodAutoscaler object. It -is accessible under: - -`apis/extensions/v1beta1/horizontalpodautoscalers/myautoscaler` - -It has the following structure: - -```go -// configuration of a horizontal pod autoscaler. -type HorizontalPodAutoscaler struct { - unversioned.TypeMeta - api.ObjectMeta - - // behavior of autoscaler. - Spec HorizontalPodAutoscalerSpec - - // current information about the autoscaler. - Status HorizontalPodAutoscalerStatus -} - -// specification of a horizontal pod autoscaler. -type HorizontalPodAutoscalerSpec struct { - // reference to Scale subresource; horizontal pod autoscaler will learn the current resource - // consumption from its status,and will set the desired number of pods by modifying its spec. - ScaleRef SubresourceReference - // lower limit for the number of pods that can be set by the autoscaler, default 1. - MinReplicas *int - // upper limit for the number of pods that can be set by the autoscaler. - // It cannot be smaller than MinReplicas. - MaxReplicas int - // target average CPU utilization (represented as a percentage of requested CPU) over all the pods; - // if not specified it defaults to the target CPU utilization at 80% of the requested resources. - CPUUtilization *CPUTargetUtilization -} - -type CPUTargetUtilization struct { - // fraction of the requested CPU that should be utilized/used, - // e.g. 70 means that 70% of the requested CPU should be in use. - TargetPercentage int -} - -// current status of a horizontal pod autoscaler -type HorizontalPodAutoscalerStatus struct { - // most recent generation observed by this autoscaler. - ObservedGeneration *int64 - - // last time the HorizontalPodAutoscaler scaled the number of pods; - // used by the autoscaler to control how often the number of pods is changed. - LastScaleTime *unversioned.Time - - // current number of replicas of pods managed by this autoscaler. - CurrentReplicas int - - // desired number of replicas of pods managed by this autoscaler. - DesiredReplicas int - - // current average CPU utilization over all pods, represented as a percentage of requested CPU, - // e.g. 70 means that an average pod is using now 70% of its requested CPU. - CurrentCPUUtilizationPercentage *int -} -``` - -`ScaleRef` is a reference to the Scale subresource. -`MinReplicas`, `MaxReplicas` and `CPUUtilization` define autoscaler -configuration. We are also introducing HorizontalPodAutoscalerList object to -enable listing all autoscalers in a namespace: - -```go -// list of horizontal pod autoscaler objects. 
-type HorizontalPodAutoscalerList struct { - unversioned.TypeMeta - unversioned.ListMeta - - // list of horizontal pod autoscaler objects. - Items []HorizontalPodAutoscaler -} -``` - -## Autoscaling Algorithm - -The autoscaler is implemented as a control loop. It periodically queries pods -described by `Status.PodSelector` of Scale subresource, and collects their CPU -utilization. Then, it compares the arithmetic mean of the pods' CPU utilization -with the target defined in `Spec.CPUUtilization`, and adjusts the replicas of -the Scale if needed to match the target (preserving condition: MinReplicas <= -Replicas <= MaxReplicas). - -The period of the autoscaler is controlled by the -`--horizontal-pod-autoscaler-sync-period` flag of controller manager. The -default value is 30 seconds. - - -CPU utilization is the recent CPU usage of a pod (average across the last 1 -minute) divided by the CPU requested by the pod. In Kubernetes version 1.1, CPU -usage is taken directly from Heapster. In future, there will be API on master -for this purpose (see issue [#11951](https://github.com/kubernetes/kubernetes/issues/11951)). - -The target number of pods is calculated from the following formula: - -``` -TargetNumOfPods = ceil(sum(CurrentPodsCPUUtilization) / Target) -``` - -Starting and stopping pods may introduce noise to the metric (for instance, -starting may temporarily increase CPU). So, after each action, the autoscaler -should wait some time for reliable data. Scale-up can only happen if there was -no rescaling within the last 3 minutes. Scale-down will wait for 5 minutes from -the last rescaling. Moreover any scaling will only be made if: -`avg(CurrentPodsConsumption) / Target` drops below 0.9 or increases above 1.1 -(10% tolerance). Such approach has two benefits: - -* Autoscaler works in a conservative way. If new user load appears, it is -important for us to rapidly increase the number of pods, so that user requests -will not be rejected. Lowering the number of pods is not that urgent. - -* Autoscaler avoids thrashing, i.e.: prevents rapid execution of conflicting -decision if the load is not stable. - -## Relative vs. absolute metrics - -We chose values of the target metric to be relative (e.g. 90% of requested CPU -resource) rather than absolute (e.g. 0.6 core) for the following reason. If we -choose absolute metric, user will need to guarantee that the target is lower -than the request. Otherwise, overloaded pods may not be able to consume more -than the autoscaler's absolute target utilization, thereby preventing the -autoscaler from seeing high enough utilization to trigger it to scale up. This -may be especially troublesome when user changes requested resources for a pod -because they would need to also change the autoscaler utilization threshold. -Therefore, we decided to choose relative metric. For user, it is enough to set -it to a value smaller than 100%, and further changes of requested resources will -not invalidate it. - -## Support in kubectl - -To make manipulation of HorizontalPodAutoscaler object simpler, we added support -for creating/updating/deleting/listing of HorizontalPodAutoscaler to kubectl. In -addition, in future, we are planning to add kubectl support for the following -use-cases: -* When creating a replication controller or deployment with -`kubectl create [-f]`, there should be a possibility to specify an additional -autoscaler object. 
(This should work out-of-the-box when creation of autoscaler -is supported by kubectl as we may include multiple objects in the same config -file). -* *[future]* When running an image with `kubectl run`, there should be an -additional option to create an autoscaler for it. -* *[future]* We will add a new command `kubectl autoscale` that will allow for -easy creation of an autoscaler object for already existing replication -controller/deployment. - -## Next steps - -We list here some features that are not supported in Kubernetes version 1.1. -However, we want to keep them in mind, as they will most probably be needed in -the future. -Our design is in general compatible with them. -* *[future]* **Autoscale pods based on metrics different than CPU** (e.g. -memory, network traffic, qps). This includes scaling based on a custom/application metric. -* *[future]* **Autoscale pods base on an aggregate metric.** Autoscaler, -instead of computing average for a target metric across pods, will use a single, -external, metric (e.g. qps metric from load balancer). The metric will be -aggregated while the target will remain per-pod (e.g. when observing 100 qps on -load balancer while the target is 20 qps per pod, autoscaler will set the number -of replicas to 5). -* *[future]* **Autoscale pods based on multiple metrics.** If the target numbers -of pods for different metrics are different, choose the largest target number of -pods. -* *[future]* **Scale the number of pods starting from 0.** All pods can be -turned-off, and then turned-on when there is a demand for them. When a request -to service with no pods arrives, kube-proxy will generate an event for -autoscaler to create a new pod. Discussed in issue [#3247](https://github.com/kubernetes/kubernetes/issues/3247). -* *[future]* **When scaling down, make more educated decision which pods to -kill.** E.g.: if two or more pods from the same replication controller are on -the same node, kill one of them. Discussed in issue [#4301](https://github.com/kubernetes/kubernetes/issues/4301). +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/autoscaling/hpa-external-metrics.md b/contributors/design-proposals/autoscaling/hpa-external-metrics.md index a9279f19..f0fbec72 100644 --- a/contributors/design-proposals/autoscaling/hpa-external-metrics.md +++ b/contributors/design-proposals/autoscaling/hpa-external-metrics.md @@ -1,205 +1,6 @@ -# **HPA v2 API extension proposal** +Design proposals have been archived. -# Objective +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -[Horizontal Pod Autoscaler v2 API](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/autoscaling/hpa-v2.md) allows users to autoscale based on custom metrics. However, there are some use-cases that are not well supported by the current API. The goal of this document is to propose the following changes to the API: -* Allow autoscaling based on metrics coming from outside of Kubernetes. Example use-case is autoscaling based on a hosted cloud service used by a pod. -* Allow specifying per-pod target for global metrics. This makes more sense for many metrics than a global target (ex. 200 QPS / pod makes sense while a global target for QPS doesn't). - -# Overview - -A new External metric source will be added. It will identify a specific metric to autoscale on based on metric name and a label selector. The assumed model is that specific time series in monitoring systems are identified with metric name and a set of key-value labels or tags. The details vary in different systems (ref: [Prometheus](https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels), [Stackdriver](https://cloud.google.com/monitoring/api/v3/metrics#time_series), [Datadog](https://docs.datadoghq.com/agent/tagging/), [Sysdig](https://www.sysdig.org/wiki/sysdig-user-guide/#user-content-filtering)), however, in general the adapter should be able to use metric name and a set of labels to construct a query sufficient to identify specific time series in underlying system. - -External and Object metrics will specify the desired target by setting exactly one of two fields: TargetValue (global target) or TargetAverageValue (per-pod target). - - -# Multiple metric values - -Label selector specified by user can match multiple time series, resulting in multiple values provided to HPA. In such case the sum of all those values will be used for autoscaling. This is meant to allow autoscaling using a metric that is drilled down by some criteria not relevant for autoscaling a particular workload (for example to allow autoscaling based on a total number of HTTP requests, regardless of which HTTP method is used). - -If the need arises we can easily add other simple aggregations and allow user to choose one of them. - -# Example - -This is an example HPA configuration autoscaling based on number of pending messages in a message queue (RabbtMQ) running outside of cluster. - -```yaml -kind: HorizontalPodAutoscaler -apiVersion: autoscaling/v2beta2 -spec: - scaleTargetRef: - kind: ReplicationController - name: Worker - minReplicas: 2 - maxReplicas: 10 - metrics: - - type: External - external: - metricName: queue_messages_ready - metricSelector: - matchLabels: - queue: worker_tasks - targetAverageValue: 30 -``` - -# API - -This is the part of autoscaling/v2beta2 API that includes changes from v2beta1. - -Some parts containing obvious changes (MetricSpec and MetricStatus) have been omitted for clarity. 
- -```go -// MetricSourceType indicates the type of metric. -type MetricSourceType string - -var ( - // ObjectMetricSourceType is a metric describing a Kubernetes object - // (for example, hits-per-second on an Ingress object). - ObjectMetricSourceType MetricSourceType = "Object" - // PodsMetricSourceType is a metric describing each pod in the current scale - // target (for example, transactions-processed-per-second). The values - // will be averaged together before being compared to the target value. - PodsMetricSourceType MetricSourceType = "Pods" - // ResourceMetricSourceType is a resource metric known to Kubernetes, as - // specified in requests and limits, describing each pod in the current - // scale target (e.g. CPU or memory). Such metrics are built in to - // Kubernetes, and have special scaling options on top of those available - // to normal per-pod metrics (the "pods" source). - ResourceMetricSourceType MetricSourceType = "Resource" - // ExternalMetricSourceType is a global metric that is not associated - // with any Kubernetes object. It allows autoscaling based on information - // coming from components running outside of cluster - // (for example length of queue in cloud messaging service, or - // QPS from loadbalancer running outside of cluster). - ExternalMetricSourceType MetricSourceType = "External" -) - -// ObjectMetricSource indicates how to scale on a metric describing a -// Kubernetes object (for example, hits-per-second on an Ingress object). -type ObjectMetricSource struct { - // target is the described Kubernetes object. - Target CrossVersionObjectReference `json:"target" protobuf:"bytes,1,name=target"` - - // metricName is the name of the metric in question. - MetricName string `json:"metricName" protobuf:"bytes,2,name=metricName"` - // TargetValue is the target value of the metric (as a quantity). - // Mutually exclusive with TargetAverageValue. - TargetValue *resource.Quantity `json:"targetValue,omitempty" protobuf:"bytes,3,opt,name=targetValue"` - // TargetAverageValue is the target per-pod value of global metric. - // Mutually exclusive with TargetValue. - TargetAverageValue *resource.Quantity `json:"targetAverageValue,omitempty" protobuf="bytes,4,opt,name=targetAverageValue"` -} - -// ExternalMetricSource indicates how to scale on a metric not associated with -// any Kubernetes object (for example length of queue in cloud -// messaging service, or QPS from loadbalancer running outside of cluster). -type ExternalMetricSource struct { - // MetricName is the name of a metric used for autoscaling in - // metric system. - MetricName string `json:"metricName" protobuf:"bytes,1,name=metricName"` - - // MetricSelector is used to identify a specific time series - // within a given metric. - MetricSelector metav1.LabelSelector `json:"metricSelector" protobuf:"bytes,2,name=metricSelector"` - - // TargetValue is the target value of the metric (as a quantity). - // Mutually exclusive with TargetAverageValue. - TargetValue *resource.Quantity `json:"targetValue,omitempty" protobuf:"bytes,3,opt,name=targetValue"` - // TargetAverageValue is the target per-pod value (as a quantity) of global metric. - // Mutually exclusive with TargetValue. - TargetAverageValue *resource.Quantity `json:"targetAverageValue,omitempty" protobuf="bytes,4,opt,name=targetAverageValue"` -} - -// ExternalMetricStatus indicates the current value of a global metric -// not associated with any Kubernetes object. 
-type ExternalMetricStatus struct { - // MetricName is the name of a metric used for autoscaling in - // metric system. - MetricName string `json:"metricName" protobuf:"bytes,1,name=metricName"` - - // MetricSelector is used to identify a specific time series - // within a given metric. - MetricSelector metav1.LabelSelector `json:"metricSelector" protobuf:"bytes,2,name=metricSelector"` - - // CurrentValue is the current value of the metric (as a quantity) - CurrentValue resource.Quantity `json:"currentValue" protobuf:"bytes,3,name=currentValue"` - - // CurrentAverageValue is the current value of metric averaged over - // autoscaled pods. - CurrentAverageValue *resource.Quantity `json:"currentAverageValue,omitempty" protobuf:"bytes,4,opt,name=currentAverageValue"` -} -``` - -# Implementation - -A new [External Metrics API](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/external-metrics-api.md) will be used to obtain values of External metrics. - -As a result of proposed changes TargetValue field becomes optional. Kubernetes convention is to make optional fields pointers. Changing value to pointer makes this proposal non backward compatible, requiring moving to v2beta2 as opposed to extending v2beta1 API. - -[Kubernetes deprecation policy](https://kubernetes.io/docs/reference/deprecation-policy/) requires that autoscaling/v2beta1 is still supported for 1.10 and 1.11 and roundtrip between different API versions can be made. This will be implemented the same way as current conversion between v2 and v1 - MetricSpec specifying TargetAverageValue will be serialized into json and stored in annotation when converting to v2beta1 representation. - -# Future considerations - -### Add LabelSelector to Object and Pods metric - -Object and Pods metrics rely on implicit assumption that there is just a single time series per metric. This can be pretty limiting when using applications that drill down the metrics by some additional criteria (ex. method in case of metrics related to HTTP requests). We could consider adding LabelSelector to Object and Pods metrics working similarly to how it works for External metrics. - -# Alternatives considered - -### Identifying external metrics - -The main argument for choosing metric name and label selector over other ways of identifying external metric was access control. Any query specified by user in External metric spec will be executed by HPA controller (ie. system account). Per-metric access control can be applied using standard kubernetes mechanisms. It is up to adapter implementing [External Metrics API to ensure access control at labels level](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/external-metrics-api.md#access-control). - -An important consideration is that query languages of monitoring systems can be very powerful (ex. [Prometheus docs](https://prometheus.io/docs/prometheus/latest/querying/examples/#using-functions-operators-etc) and [example queries](https://github.com/infinityworks/prometheus-example-queries/blob/master/README.md#promql-examples)), making it extremely difficult (if not impossible) to implement access control in adapter. The idea behind the proposed approach is to limit expressive power available to user to identifying an existing time series in monitoring system. 
- -In order to use a more advanced query (including aggregations of multiple time series) for autoscaling the user will have to re-export the result of such query as a new metric. Such advanced use-cases are likely to be rare and re-exporting is a viable workaround. On the other hand, allowing user to perform arbitrary query in underlying metric system using system account would have been a blocker for enabling External Metrics API for a large number of users. - -### Using Object metric source for external metrics -An alternative to adding External metric source would be to reuse existing Object metric source. Both External metrics proposed in this document and Object metrics represent a global custom metric and there is no conceptual difference between them, except for how the metric is identified in underlying monitoring system. Using Object metrics for both use-cases can help keep API simple. - -The immediate problem with this approach is that there is no inherent relationship between any Kubernetes object and an arbitrary external metric. It's not clear how Kubernetes object reference used to specify Object metrics could be used to identify such metric. This section discusses different solutions to this problem along with their pros and cons. - -#### Attach external metric to a Kubernetes object -One possible solution would be to allow user to explicitly specify a relationship between an external metric and some Kubernetes object (in particular attaching metric to a Namespace seems logical). Custom Metrics Adapter could use this additional information to translate object reference provided by HPA into query to monitoring system. This could be either left to adapter (i.e. adapter specific config) or introduced to custom metrics API. Below table details pros and cons of each approach. - -<table> - <tbody> - <tr> - <th>Option</th> - <th>Pros</th> - <th>Cons</th> - </tr> - <tr> - <td>Adapter specific config</td> - <td><ul> - <li>No need to change HPA API. - <li>Adapter-specific way of configuring metrics will better match underlying metrics system than any generic solution. It could be both more logical and offer better validation. - </ul></td> - <td><ul> - <li>This just pushes the problem to adapter. Different adapters are likely to solve it differently (or not at all), resulting in widespread incompatibility. - <li>Hard to use. Instead of just understanding HPA API user would need to know about Custom Metric Adapter and learn it’s configuration syntax and best practices. - <li>Potential access control problems if shared config is used for attaching multiple metrics in different namespaces. - </ul></td> - </tr> - <td>Add an object containing mapping to Custom Metrics API</td> - <td><ul> - <li>No need to change HPA API. - </ul></td> - <td><ul> - <li>This is just moving External metric from HPA into a separate object, for no obvious benefit (in most cases the new object will map 1-1 to HPA anyway). - <li>Harder to use (need to create an additional object, than reference it from HPA. - </ul></td> - </tr> - </tbody> -</table> - -Overall it seems that the same information provided to External metric spec would have to be provided by user anyway. Storing it elsewhere makes the feature more complex to use, for no clear benefit. - -#### Implicitly attach external metrics to namespace -After [extending Object metrics with LabelSelector](#add-labelselector-to-object-and-pods-metric) Object metric will contain enough information to identify external metric. 
Theoretically Custom Metrics Adapter could implicitly attach every available metric to some arbitrary object (ex. `default` namespace or every namespace). However, this is equivalent to just making the object reference optional using the fact that the set of fields in Object metric would be superset of fields in External metric. Technically it would work, but it feels like a hack and it's likely to be confusing to users. Also it makes it easy to rely on access control on referenced object, which could be easily circumvented if every metric is available via every object. - -#### Relabel external metrics to add metadata attaching them to a chosen Kubernetes object -After [extending Object metrics with LabelSelector](#add-labelselector-to-object-and-pods-metric) user could just [relabel their metrics](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config) to add metadata attaching them to Kubernetes object of their choice. However, this approach assumes user has sufficient access to relabel metrics. This is not always the case (for example when autoscaling on metrics from a hosted service). - -A variant of this approach would be to ask user to create a pod that reads a metric and reexports it. This will work even without any changes in HPA, however, it requires complex setup, wastes resources and may introduce additional latency. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/autoscaling/hpa-status-conditions.md b/contributors/design-proposals/autoscaling/hpa-status-conditions.md index d3354582..f0fbec72 100644 --- a/contributors/design-proposals/autoscaling/hpa-status-conditions.md +++ b/contributors/design-proposals/autoscaling/hpa-status-conditions.md @@ -1,121 +1,6 @@ -Horizontal Pod Autoscaler Status Conditions -=========================================== +Design proposals have been archived. -Currently, the HPA status conveys the last scale time, current and desired -replicas, and the last-retrieved values of the metrics used to autoscale. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -However, the status field conveys no information about whether or not the -HPA controller encountered difficulties while attempting to fetch metrics, -or to scale. While this information is generally conveyed via events, -events are difficult to use to determine the current state of the HPA. -Other objects, such as Pods, include a `Conditions` field, which describe -the current condition of the object. Adding such a field to the HPA -provides clear indications of the current state of the HPA, allowing users -to more easily recognize problems in their setups. - -API Change ----------- - -The status of the HPA object will gain a new field, `Conditions`, of type -`[]HorizontalPodAutoscalerCondition`, defined as follows: - -```go -// HorizontalPodAutoscalerConditionType are the valid conditions of -// a HorizontalPodAutoscaler (see later on in the proposal for valid -// values) -type HorizontalPodAutoscalerConditionType string - -// HorizontalPodAutoscalerCondition describes the state of -// a HorizontalPodAutoscaler at a certain point. -type HorizontalPodAutoscalerCondition struct { - // type describes the current condition - Type HorizontalPodAutoscalerConditionType - // status is the status of the condition (True, False, Unknown) - Status ConditionStatus - // LastTransitionTime is the last time the condition transitioned from - // one status to another - // +optional - LastTransitionTime metav1.Time - // reason is the reason for the condition's last transition. - // +optional - Reason string - // message is a human-readable explanation containing details about - // the transition - Message string -} -``` - -Current Conditions Conveyed via Events --------------------------------------- - -The following is a list of events emitted by the HPA controller (as of the -writing of this proposal), with descriptions of the conditions which they -represent. All of these events are caused by issues which block scaling -entirely. - -- *SelectorRequired*: the target scalable resource's scale is missing - a selector. - -- *InvalidSelector*: the target scalable's selector couldn't be parsed. - -- *FailedGet{Object,Pods,Resource}Metric*: the HPA controller was unable - to fetch one metric. - -- *InvalidMetricSourceType*: the HPA controller encountered an unknown - metric source type. - -- *FailedComputeMetricsReplicas*: this is fired in conjunction with one of - the two previous events. - -- *FailedConvertHPA*: the HPA controller was unable to convert the given - HPA to the v2alpha1 version. - -- *FailedGetScale*: the HPA controller was unable to actually fetch the - scale for the given scalable resource. 
- -- *FailedRescale*: a scale update was needed and the HPA controller was - unable to actually update the scale subresource of the target scalable. - -- *SuccessfulRescale*: a scale update was needed and everything went - properly. - -- *FailedUpdateStatus*: the HPA controller failed to update the status of - the HPA object. - -New Conditions Types --------------------- - -The above conditions can be coalesced into several condition types. Each -condition has one or more associated `Reason` values which map back to -some of the events described above. - -- *CanAccessScale*: this condition, when false, indicates issues actually - getting or updating the scale of the target scalable. Potential - `Reason` values include `FailedGet`, `FailedUpdate` -- *InBackoff*: this condition, when true, indicates that the HPA is - currently within a "scale forbidden window", and therefore will not - perform scale operations in a particular direction. Potential `Reason` - values include `BackoffBoth`, `BackoffDownscale`, and `BackoffUpscale`. -- *CanComputeReplicas*: this condition, when false, indicates issues - computing the desired replica counts. Potential `Reason` values include - `FailedGet{Object,Pods,Resource}Metric`, `InvalidMetricSourceType`, and - `InvalidSelector` (which includes both missing and unparsable selectors, - which can be detailed in the `Message` field). -- *DesiredOutsideRange*: this condition, when true, indicates that the - desired scale currently would be outside the range allowed by the HPA - spec, and is therefore capped. Potential `Reason` values include - `TooFewReplicas` and `TooManyReplicas`. - -The `FailedUpdateStatus` event is not described here, as a failure to -update the HPA status would preclude actually conveying this information. -`FailedConvertHPA` is also not described, since it exists more as an -implementation detail of how the current mechanics of the HPA are -implemented, and less as part of the inherent functionality of the HPA -controller. - -Open Questions --------------- - -* Should `CanScale` be split into `CanGetScale` and `CanUpdateScale` or - something equivalent? +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/autoscaling/hpa-v2.md b/contributors/design-proposals/autoscaling/hpa-v2.md index e01fa299..f0fbec72 100644 --- a/contributors/design-proposals/autoscaling/hpa-v2.md +++ b/contributors/design-proposals/autoscaling/hpa-v2.md @@ -1,289 +1,6 @@ -Horizontal Pod Autoscaler with Arbitrary Metrics -=============================================== +Design proposals have been archived. -The current Horizontal Pod Autoscaler object only has support for CPU as -a percentage of requested CPU. While this is certainly a common case, one -of the most frequently sought-after features for the HPA is the ability to -scale on different metrics (be they custom metrics, memory, etc). +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -The current HPA controller supports targeting "custom" metrics (metrics -with a name prefixed with "custom/") via an annotation, but this is -suboptimal for a number of reasons: it does not allow for arbitrary -"non-custom" metrics (e.g. memory), it does not allow for metrics -describing other objects (e.g. scaling based on metrics on services), and -carries the various downsides of annotations (not be typed/validated, -being hard for a user to hand-construct, etc). - -Object Design -------------- - -### Requirements ### - -This proposal describes a new version of the Horizontal Pod Autoscaler -object with the following requirements kept in mind: - -1. The HPA should continue to support scaling based on percentage of CPU - request - -2. The HPA should support scaling on arbitrary metrics associated with - pods - -3. The HPA should support scaling on arbitrary metrics associated with - other Kubernetes objects in the same namespace as the HPA (and the - namespace itself) - -4. The HPA should make scaling on multiple metrics in a single HPA - possible and explicit (splitting metrics across multiple HPAs leads to - the possibility of fighting between HPAs) - -### Specification ### - -```go -type HorizontalPodAutoscalerSpec struct { - // the target scalable object to autoscale - ScaleTargetRef CrossVersionObjectReference `json:"scaleTargetRef"` - - // the minimum number of replicas to which the autoscaler may scale - // +optional - MinReplicas *int32 `json:"minReplicas,omitempty"` - // the maximum number of replicas to which the autoscaler may scale - MaxReplicas int32 `json:"maxReplicas"` - - // the metrics to use to calculate the desired replica count (the - // maximum replica count across all metrics will be used). The - // desired replica count is calculated multiplying the ratio between - // the target value and the current value by the current number of - // pods. Ergo, metrics used must decrease as the pod count is - // increased, and vice-versa. See the individual metric source - // types for more information about how each type of metric - // must respond. - // +optional - Metrics []MetricSpec `json:"metrics,omitempty"` -} - -// a type of metric source -type MetricSourceType string -var ( - // a metric describing a kubernetes object (for example, hits-per-second on an Ingress object) - ObjectSourceType MetricSourceType = "Object" - // a metric describing each pod in the current scale target (for example, transactions-processed-per-second). 
- // The values will be averaged together before being compared to the target value - PodsSourceType MetricSourceType = "Pods" - // a resource metric known to Kubernetes, as specified in requests and limits, describing each pod - // in the current scale target (e.g. CPU or memory). Such metrics are built in to Kubernetes, - // and have special scaling options on top of those available to normal per-pod metrics (the "pods" source) - ResourceSourceType MetricSourceType = "Resource" -) - -// a specification for how to scale based on a single metric -// (only `type` and one other matching field should be set at once) -type MetricSpec struct { - // the type of metric source (should match one of the fields below) - Type MetricSourceType `json:"type"` - - // a metric describing a single kubernetes object (for example, hits-per-second on an Ingress object) - Object *ObjectMetricSource `json:"object,omitempty"` - // a metric describing each pod in the current scale target (for example, transactions-processed-per-second). - // The values will be averaged together before being compared to the target value - Pods *PodsMetricSource `json:"pods,omitemtpy"` - // a resource metric (such as those specified in requests and limits) known to Kubernetes - // describing each pod in the current scale target (e.g. CPU or memory). Such metrics are - // built in to Kubernetes, and have special scaling options on top of those available to - // normal per-pod metrics using the "pods" source. - Resource *ResourceMetricSource `json:"resource,omitempty"` -} - -// a metric describing a single kubernetes object (for example, hits-per-second on an Ingress object) -type ObjectMetricSource struct { - // the described Kubernetes object - Target CrossVersionObjectReference `json:"target"` - - // the name of the metric in question - MetricName string `json:"metricName"` - // the target value of the metric (as a quantity) - TargetValue resource.Quantity `json:"targetValue"` -} - -// a metric describing each pod in the current scale target (for example, transactions-processed-per-second). -// The values will be averaged together before being compared to the target value -type PodsMetricSource struct { - // the name of the metric in question - MetricName string `json:"metricName"` - // the target value of the metric (as a quantity) - TargetAverageValue resource.Quantity `json:"targetAverageValue"` -} - -// a resource metric known to Kubernetes, as specified in requests and limits, describing each pod -// in the current scale target (e.g. CPU or memory). The values will be averaged together before -// being compared to the target. Such metrics are built in to Kubernetes, and have special -// scaling options on top of those available to normal per-pod metrics using the "pods" source. -// Only one "target" type should be set. -type ResourceMetricSource struct { - // the name of the resource in question - Name api.ResourceName `json:"name"` - // the target value of the resource metric, represented as - // a percentage of the requested value of the resource on the pods. - // +optional - TargetAverageUtilization *int32 `json:"targetAverageUtilization,omitempty"` - // the target value of the resource metric as a raw value, similarly - // to the "pods" metric source type. - // +optional - TargetAverageValue *resource.Quantity `json:"targetAverageValue,omitempty"` -} - -type HorizontalPodAutoscalerStatus struct { - // most recent generation observed by this autoscaler. 
- ObservedGeneration *int64 `json:"observedGeneration,omitempty"` - // last time the autoscaler scaled the number of pods; - // used by the autoscaler to control how often the number of pods is changed. - LastScaleTime *unversioned.Time `json:"lastScaleTime,omitempty"` - - // the last observed number of replicas from the target object. - CurrentReplicas int32 `json:"currentReplicas"` - // the desired number of replicas as last computed by the autoscaler - DesiredReplicas int32 `json:"desiredReplicas"` - - // the last read state of the metrics used by this autoscaler - CurrentMetrics []MetricStatus `json:"currentMetrics" protobuf:"bytes,5,rep,name=currentMetrics"` -} - -// the status of a single metric -type MetricStatus struct { - // the type of metric source - Type MetricSourceType `json:"type"` - - // a metric describing a single kubernetes object (for example, hits-per-second on an Ingress object) - Object *ObjectMetricStatus `json:"object,omitemtpy"` - // a metric describing each pod in the current scale target (for example, transactions-processed-per-second). - // The values will be averaged together before being compared to the target value - Pods *PodsMetricStatus `json:"pods,omitemtpy"` - // a resource metric known to Kubernetes, as specified in requests and limits, describing each pod - // in the current scale target (e.g. CPU or memory). Such metrics are built in to Kubernetes, - // and have special scaling options on top of those available to normal per-pod metrics using the "pods" source. - Resource *ResourceMetricStatus `json:"resource,omitempty"` -} - -// a metric describing a single kubernetes object (for example, hits-per-second on an Ingress object) -type ObjectMetricStatus struct { - // the described Kubernetes object - Target CrossVersionObjectReference `json:"target"` - - // the name of the metric in question - MetricName string `json:"metricName"` - // the current value of the metric (as a quantity) - CurrentValue resource.Quantity `json:"currentValue"` -} - -// a metric describing each pod in the current scale target (for example, transactions-processed-per-second). -// The values will be averaged together before being compared to the target value -type PodsMetricStatus struct { - // the name of the metric in question - MetricName string `json:"metricName"` - // the current value of the metric (as a quantity) - CurrentAverageValue resource.Quantity `json:"currentAverageValue"` -} - -// a resource metric known to Kubernetes, as specified in requests and limits, describing each pod -// in the current scale target (e.g. CPU or memory). The values will be averaged together before -// being compared to the target. Such metrics are built in to Kubernetes, and have special -// scaling options on top of those available to normal per-pod metrics using the "pods" source. -// Only one "target" type should be set. Note that the current raw value is always displayed -// (even when the current values as request utilization is also displayed). 
-type ResourceMetricStatus struct { - // the name of the resource in question - Name api.ResourceName `json:"name"` - // the target value of the resource metric, represented as - // a percentage of the requested value of the resource on the pods - // (only populated if the corresponding request target was set) - // +optional - CurrentAverageUtilization *int32 `json:"currentAverageUtilization,omitempty"` - // the current value of the resource metric as a raw value - CurrentAverageValue resource.Quantity `json:"currentAverageValue"` -} -``` - -### Example ### - -In this example, we scale based on the `hits-per-second` value recorded as -describing a service in our namespace, plus the CPU usage of the pods in -the ReplicationController being autoscaled. - -```yaml -kind: HorizontalPodAutoscaler -apiVersion: autoscaling/v2alpha1 -metadata: - name: WebFrontend -spec: - scaleTargetRef: - kind: ReplicationController - name: WebFrontend - minReplicas: 2 - maxReplicas: 10 - metrics: - - type: Resource - resource: - name: cpu - targetAverageUtilization: 80 - - type: Object - object: - target: - kind: Service - name: Frontend - metricName: hits-per-second - targetValue: 1k -``` - -### Alternatives and Future Considerations ### - -Since the new design mirrors volume plugins (and similar APIs), it makes -it relatively easy to introduce new fields in a backwards-compatible way: -we simply introduce a new field in `MetricSpec` as a new "metric type". - -#### External #### - -It was discussed adding a source type of `External` which has a single -opaque metric field and target value. This would indicate that the HPA -was under control of an external autoscaler, which would allow external -autoscalers to be present in the cluster while still indicating to tooling -that autoscaling is taking place. - -However, since this raises a number of questions and complications about -interaction with the existing autoscaler, it was decided to exclude this -feature. We may reconsider in the future. - -#### Limit Percentages #### - -In cluster environments where request is automatically set for scheduling -purposes, it is advantageous to be able to autoscale on percentage of -limit for resource metrics. We may wish to consider adding -a `targetPercentageOfLimit` to the `ResourceMetricSource` type. - -#### Referring to the current Namespace #### - -It is beneficial to be able to refer to a metric on the current namespace, -similarly to the `ObjectMetricSource` source type, but without an explicit -name. Because of the similarity to `ObjectMetricSource`, it may simply be -sufficient to allow specificying a `kind` of "Namespace" without a name. -Alternatively, a similar source type to `PodsMetricSource` could be used. - -#### Calculating Final Desired Replica Count #### - -Since we have multiple replica counts (one from each metric), we must have -a way to aggregated them into a final replica count. In this iteration of -the proposal, we simply take the maximum of all the computed replica -counts. However, in certain cases, it could be useful to allow the user -to specify that they wanted the minimum or average instead. - -In the general case, maximum should be sufficient, but if the need arises, -it should be fairly easy to add such a field in. - -Mechanical Concerns -------------------- - -The HPA will derive metrics from two sources: resource metrics (i.e. 
CPU -request percentage) will come from the -[master metrics API](../instrumentation/resource-metrics-api.md), while other metrics will -come from the [custom metrics API](../instrumentation/custom-metrics-api.md), which is -an adapter API which sources metrics directly from the monitoring -pipeline. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
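The archived HPA v2 proposal above says the desired replica count comes from the ratio between the current metric value and its target scaled by the current pod count, that the maximum across all metrics wins, and that the result is bounded by `minReplicas`/`maxReplicas`. A minimal sketch of that arithmetic follows, under one reading of the spec comment (current value divided by target, which matches how the shipped HPA controller behaves); the function names and sample numbers are illustrative, not part of the proposal:

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas computes a replica count for a single metric:
// the current-to-target ratio scaled by the current pod count, rounded up.
func desiredReplicas(currentReplicas int32, currentValue, targetValue float64) int32 {
	return int32(math.Ceil(float64(currentReplicas) * currentValue / targetValue))
}

// finalReplicas takes the maximum across all per-metric proposals and clamps
// it to the [minReplicas, maxReplicas] range from the HPA spec.
func finalReplicas(proposals []int32, minReplicas, maxReplicas int32) int32 {
	result := minReplicas
	for _, p := range proposals {
		if p > result {
			result = p
		}
	}
	if result > maxReplicas {
		result = maxReplicas
	}
	return result
}

func main() {
	cpu := desiredReplicas(4, 95, 80)      // Resource metric: 95% of request vs. an 80% target.
	hits := desiredReplicas(4, 1500, 1000) // Object metric: 1500 hits/s vs. a 1k target.
	fmt.Println(finalReplicas([]int32{cpu, hits}, 2, 10)) // prints 6
}
```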
\ No newline at end of file diff --git a/contributors/design-proposals/autoscaling/images/vpa-architecture.png b/contributors/design-proposals/autoscaling/images/vpa-architecture.png Binary files differdeleted file mode 100644 index c8af3073..00000000 --- a/contributors/design-proposals/autoscaling/images/vpa-architecture.png +++ /dev/null diff --git a/contributors/design-proposals/autoscaling/initial-resources.md b/contributors/design-proposals/autoscaling/initial-resources.md index 7ce09770..f0fbec72 100644 --- a/contributors/design-proposals/autoscaling/initial-resources.md +++ b/contributors/design-proposals/autoscaling/initial-resources.md @@ -1,72 +1,6 @@ -## Abstract +Design proposals have been archived. -Initial Resources is a data-driven feature that based on historical data tries to estimate resource usage of a container without Resources specified -and set them before the container is run. This document describes design of the component. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Motivation - -Since we want to make Kubernetes as simple as possible for its users we don't want to require setting [Resources](../node/resource-qos.md) for container by its owner. -On the other hand having Resources filled is critical for scheduling decisions. -Current solution to set up Resources to hardcoded value has obvious drawbacks. -We need to implement a component which will set initial Resources to a reasonable value. - -## Design - -InitialResources component will be implemented as an [admission plugin](../../plugin/pkg/admission/) and invoked right before -[LimitRanger](https://github.com/kubernetes/kubernetes/blob/7c9bbef96ed7f2a192a1318aa312919b861aee00/cluster/gce/config-default.sh#L91). -For every container without Resources specified it will try to predict amount of resources that should be sufficient for it. -So that a pod without specified resources will be treated as -. - -InitialResources will set only [request](../node/resource-qos.md#requests-and-limits) (independently for each resource type: cpu, memory) field in the first version to avoid killing containers due to OOM (however the container still may be killed if exceeds requested resources). -To make the component work with LimitRanger the estimated value will be capped by min and max possible values if defined. -It will prevent from situation when the pod is rejected due to too low or too high estimation. - -The container won't be marked as managed by this component in any way, however appropriate event will be exported. -The predicting algorithm should have very low latency to not increase significantly e2e pod startup latency -[#3954](https://github.com/kubernetes/kubernetes/pull/3954). - -### Predicting algorithm details - -In the first version estimation will be made based on historical data for the Docker image being run in the container (both the name and the tag matters). -CPU/memory usage of each container is exported periodically (by default with 1 minute resolution) to the backend (see more in [Monitoring pipeline](#monitoring-pipeline)). 
- -InitialResources will set Request for both cpu/mem as the 90th percentile of the first (in the following order) set of samples defined in the following way: - -* 7 days same image:tag, assuming there is at least 60 samples (1 hour) -* 30 days same image:tag, assuming there is at least 60 samples (1 hour) -* 30 days same image, assuming there is at least 1 sample - -If there is still no data the default value will be set by LimitRanger. Same parameters will be configurable with appropriate flags. - -#### Example - -If we have at least 60 samples from image:tag over the past 7 days, we will use the 90th percentile of all of the samples of image:tag over the past 7 days. -Otherwise, if we have at least 60 samples from image:tag over the past 30 days, we will use the 90th percentile of all of the samples over of image:tag the past 30 days. -Otherwise, if we have at least 1 sample from image over the past 30 days, we will use that the 90th percentile of all of the samples of image over the past 30 days. -Otherwise we will use default value. - -### Monitoring pipeline - -In the first version there will be available 2 options for backend for predicting algorithm: - -* [InfluxDB](../../docs/user-guide/monitoring.md#influxdb-and-grafana) - aggregation will be made in SQL query -* [GCM](../../docs/user-guide/monitoring.md#google-cloud-monitoring) - since GCM is not as powerful as InfluxDB some aggregation will be made on the client side - -Both will be hidden under an abstraction layer, so it would be easy to add another option. -The code will be a part of Initial Resources component to not block development, however in the future it should be a part of Heapster. - - -## Next steps - -The first version will be quite simple so there is a lot of possible improvements. Some of them seem to have high priority -and should be introduced shortly after the first version is done: - -* observe OOM and then react to it by increasing estimation -* add possibility to specify if estimation should be made, possibly as ```InitialResourcesPolicy``` with options: *always*, *if-not-set*, *never* -* add other features to the model like *namespace* -* remember predefined values for the most popular images like *mysql*, *nginx*, *redis*, etc. -* dry mode, which allows to ask system for resource recommendation for a container without running it -* add estimation as annotations for those containers that already has resources set -* support for other data sources like [Hawkular](http://www.hawkular.org/) +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
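The sampling rules in the archived Initial Resources proposal above (90th percentile; 7 days of image:tag samples if there are at least 60, then 30 days of image:tag, then 30 days of image, then the default left to LimitRanger) translate into a small fallback function. A minimal sketch, assuming the monitoring backend has already filtered the three sample sets; the nearest-rank percentile and every name here are illustrative:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the p-th percentile (0-100) of samples using the
// nearest-rank method; the input slice is copied so it is not reordered.
func percentile(samples []float64, p float64) float64 {
	s := append([]float64(nil), samples...)
	sort.Float64s(s)
	rank := int(math.Ceil(p/100*float64(len(s)))) - 1
	if rank < 0 {
		rank = 0
	}
	return s[rank]
}

// estimateRequest mirrors the fallback order described above.
func estimateRequest(tag7d, tag30d, image30d []float64, defaultValue float64) float64 {
	switch {
	case len(tag7d) >= 60:
		return percentile(tag7d, 90)
	case len(tag30d) >= 60:
		return percentile(tag30d, 90)
	case len(image30d) >= 1:
		return percentile(image30d, 90)
	default:
		return defaultValue // no history: the proposal leaves this to LimitRanger
	}
}

func main() {
	imageSamples := []float64{0.12, 0.15, 0.2, 0.4} // CPU cores observed for the same image
	fmt.Println(estimateRequest(nil, nil, imageSamples, 0.1)) // prints 0.4
}
```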
\ No newline at end of file diff --git a/contributors/design-proposals/autoscaling/vertical-pod-autoscaler.md b/contributors/design-proposals/autoscaling/vertical-pod-autoscaler.md index d28e4a9d..f0fbec72 100644 --- a/contributors/design-proposals/autoscaling/vertical-pod-autoscaler.md +++ b/contributors/design-proposals/autoscaling/vertical-pod-autoscaler.md @@ -1,729 +1,6 @@ -Vertical Pod Autoscaler -======================= -**Authors:** kgrygiel, mwielgus -**Contributors:** DirectXMan12, fgrzadkowski, jszczepkowski, smarterclayton +Design proposals have been archived. -Vertical Pod Autoscaler -([#10782](https://github.com/kubernetes/kubernetes/issues/10782)), -later referred to as VPA (aka. "rightsizing" or "autopilot") is an -infrastructure service that automatically sets resource requirements of Pods -and dynamically adjusts them in runtime, based on analysis of historical -resource utilization, amount of resources available in the cluster and real-time -events, such as OOMs. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -- [Introduction](#introduction) - - [Background](#background) - - [Purpose](#purpose) - - [Related features](#related-features) -- [Requirements](#requirements) - - [Functional](#functional) - - [Availability](#availability) - - [Extensibility](#extensibility) -- [Design](#design) - - [Overview](#overview) - - [Architecture overview](#architecture-overview) - - [API](#api) - - [Admission Controller](#admission-controller) - - [Recommender](#recommender) - - [Updater](#updater) - - [Recommendation model](#recommendation-model) - - [History Storage](#history-storage) - - [Open questions](#open-questions) -- [Future work](#future-work) - - [Pods that require VPA to start](#pods-that-require-vpa-to-start) - - [Combining vertical and horizontal scaling](#combining-vertical-and-horizontal-scaling) - - [Batch workloads](#batch-workloads) -- [Alternatives considered](#alternatives-considered) - - [Pods point at VPA](#pods-point-at-vpa) - - [VPA points at Deployment](#vpa-points-at-deployment) - - [Actuation using the Deployment update mechanism](#actuation-using-the-deployment-update-mechanism) - ------------- -Introduction ------------- -### Background ### -* [Compute resources](https://kubernetes.io/docs/user-guide/compute-resources/) -* [Resource QoS](/contributors/design-proposals/node/resource-qos.md) -* [Admission Controllers](https://kubernetes.io/docs/admin/admission-controllers/) -* [External Admission Webhooks](https://kubernetes.io/docs/admin/extensible-admission-controllers/#external-admission-webhooks) - -### Purpose ### -Vertical scaling has two objectives: - -1. Reducing the maintenance cost, by automating configuration of resource -requirements. - -2. Improving utilization of cluster resources, while minimizing the risk of containers running out of memory or getting CPU starved. - -### Related features ### -#### Horizontal Pod Autoscaler #### -["Horizontal Pod Autoscaler"](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/) -(often abbreviated to HPA) is an infrastructure service that dynamically adjusts -the number of Pods in a replication controller based on realtime analysis of CPU -utilization or other, user specified signals. -Usually the user will choose horizontal scaling for stateless workloads and -vertical scaling for stateful. 
In some cases both solutions could be combined -([see more](#combining-vertical-and-horizontal-scaling)). - -#### Cluster Autoscaler #### -["Cluster Autoscaler"](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) -is a tool that automatically adjusts the size of the Kubernetes cluster based on -the overall cluster utilization. -Cluster Autoscaler and Pod Autoscalers (vertical or horizontal) are -complementary features. Combined together they provide a fully automatic scaling -solution. - -#### Initial resources #### -["Initial Resources"](https://github.com/kgrygiel/community/blob/master/contributors/design-proposals/initial-resources.md) -is a very preliminary, proof-of-concept feature providing initial request based -on historical utilization. It is designed to only kick in on Pod creation. -VPA is intended to supersede this feature. - -#### In-place updates #### -In-place Pod updates ([#5774](https://github.com/kubernetes/kubernetes/issues/5774)) is a planned feature to -allow changing resources (request/limit) of existing containers without killing them, assuming sufficient free resources available on the node. -Vertical Pod Autoscaler will greatly benefit from this ability, however it is -not considered a blocker for the MVP. - -#### Resource estimation #### -Resource estimation is another planned feature, meant to improve node resource -utilization by temporarily reclaiming unused resources of running containers. -It is different from Vertical Autoscaling in that it operates on a shorter -timeframe (using only local, short-term history), re-offers resources at a -lower quality, and does not provide initial resource predictions. -VPA and resource estimation are complementary. Details will follow once -Resource Estimation is designed. - ------------- -Requirements ------------- - -### Functional ### - -1. VPA is capable of setting container resources (CPU & memory request/limit) at - Pod submission time. - -2. VPA is capable of adjusting container resources of existing Pods, in - particular reacting to CPU starvation and container OOM events. - -3. When VPA restarts Pods, it respects the disruption budget. - -4. It is possible for the user to configure VPA with fixed constraints on - resources, specifically: min & max request. - -5. VPA is compatible with Pod controllers, at least with Deployments. - In particular: - * Updates of resources do not interfere/conflict with spec updates. - * It is possible to do a rolling update of the VPA policy (e.g. min resources) - on an existing Deployment. - -6. It is possible to create Pod(s) that start following the VPA policy - immediately. In particular such Pods must not be scheduled until VPA policy - is applied. - -7. Disabling VPA is easy and fast ("panic button"), without disrupting existing - Pods. - -### Availability ### -1. Downtime of heavy-weight components (database/recommender) must not block - recreating existing Pods. Components on critical path for Pod creation - (admission controller) are designed to be highly available. - -### Extensibility ### -1. VPA is capable of performing in-place updates once they become available. - ------- -Design ------- - -### Overview ### -(see further sections for details and justification) - -1. We introduce a new type of **API resource**: - `VerticalPodAutoscaler`. 
It consists of a **label selector** to match Pods, - the **resources policy** (controls how VPA computes the resources), the - **update policy** (controls how changes are applied to Pods) and the - recommended Pod resources (an output field). - -2. **VPA Recommender** is a new component which **consumes utilization signals - and OOM events** for all Pods in the cluster from the - [Metrics Server](https://github.com/kubernetes-incubator/metrics-server). - -3. VPA Recommender **watches all Pods**, keeps calculating fresh recommended - resources for them and **stores the recommendations in the VPA objects**. - -4. Additionally the Recommender **exposes a synchronous API** that takes a Pod - description and returns recommended resources. - -5. All Pod creation requests go through the VPA **Admission Controller**. - If the Pod is matched by any VerticalPodAutoscaler object, the admission - controller **overrides resources** of containers in the Pod with the - recommendation provided by the VPA Recommender. If the Recommender is not - available, it falls back to the recommendation cached in the VPA object. - -6. **VPA Updater** is a component responsible for **real-time updates** of Pods. - If a Pod uses VPA in `"Auto"` mode, the Updater can decide to update it with - recommender resources. - In MVP this is realized by just evicting the Pod in order to have it - recreated with new resources. This approach requires the Pod to belong to a - Replica Set (or some other owner capable of recreating it). - In future the Updater will take advantage of in-place updates, which would - most likely lift this constraint. - Because restarting/rescheduling Pods is disruptive to the service, it must be - rare. - -7. VPA only controls the resource **request** of containers. It sets the limit - to infinity. The request is calculated based on analysis of the current and - previous runs (see [Recommendation model](#recommendation-model) below). - -8. **History Storage** is a component that consumes utilization signals and OOMs - (same data as the Recommender) from the API Server and stores it persistently. - It is used by the Recommender to **initialize its state on startup**. - It can be backed by an arbitrary database. The first implementation will use - [Prometheus](https://github.com/kubernetes/charts/tree/master/stable/prometheus), - at least for the resource utilization part. - -### Architecture overview ### - - -### API ### -We introduce a new type of API object `VerticalPodAutoscaler`, which -consists of the Target, that is a [label selector](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#label-selectors) -for matching Pods and two policy sections: the update policy and the resources -policy. -Additionally it holds the most recent recommendation computed by VPA. - -#### VPA API object overview #### -```go -// VerticalPodAutoscalerSpec is the specification of the behavior of the autoscaler. -type VerticalPodAutoscalerSpec { - // A label query that determines the set of pods controlled by the Autoscaler. - // More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#label-selectors - Selector *metav1.LabelSelector - - // Describes the rules on how changes are applied to the pods. - // +optional - UpdatePolicy PodUpdatePolicy - - // Controls how the autoscaler computes recommended resources. - // +optional - ResourcePolicy PodResourcePolicy -} - -// VerticalPodAutoscalerStatus describes the runtime state of the autoscaler. 
-type VerticalPodAutoscalerStatus { - // The time when the status was last refreshed. - LastUpdateTime metav1.Time - // The most recently computed amount of resources recommended by the - // autoscaler for the controlled pods. - // +optional - Recommendation RecommendedPodResources - // A free-form human readable message describing the status of the autoscaler. - StatusMessage string -} -``` - -The complete API definition is included [below](#complete_vpa_api_object_definition). - -#### Label Selector #### -The label selector determines which Pods will be scaled according to the given -VPA policy. The Recommender will aggregate signals for all Pods matched by a -given VPA, so it is important that the user set labels to group similarly -behaving Pods under one VPA. - -It is yet to be determined how to resolve conflicts, i.e. when the Pod is -matched by more than one VPA (this is not a VPA-specific problem though). - -#### Update Policy #### -The update policy controls how VPA applies changes. In MVP it consists of a -single field `mode` that enables the feature. - -```json -"updatePolicy" { - "mode": "", -} -``` - -Mode can be set to one of the following: - -1. `"Initial"`: VPA only assigns resources on Pod creation and does not - change them during lifetime of the Pod. -2. `"Auto"` (default): VPA assigns resources on Pod creation and - additionally can update them during lifetime of the Pod, including evicting / - rescheduling the Pod. -3. `"Off"`: VPA never changes Pod resources. The recommender still sets the - recommended resources in the VPA object. This can be used for a “dry run”. - -To disable VPA updates the user can do any of the following: (1) change the -updatePolicy to `"Off"` or (2) delete the VPA or (3) change the Pod labels to no -longer match the VPA selector. - -Note: disabling VPA prevents it from doing further changes, but does not revert -resources of the running Pods, until they are updated. -For example, when running a Deployment, the user would need to perform an update -to revert Pod to originally specified resources. - -#### Resource Policy #### -The resources policy controls how VPA computes the recommended resources. -In MVP it consists of (optional) lower and upper bound on the request of each -container. -The resources policy could later be extended with additional knobs to let the -user tune the recommendation algorithm to their specific use-case. - -#### Recommendation #### -The VPA resource has an output-only field keeping a recent recommendation, -filled by the Recommender. This field can be used to obtain a recent -recommendation even during a temporary unavailability of the Recommender. -The recommendation consists of the recommended target amount of resources as -well as an range (min..max), which can be used by the Updater to make decisions -on when to update the pod. -In the case of a resource crunch the Updater may decide to squeeze pod resources -towards the recommended minimum. -The width of the (min..max) range also reflects the confidence of a -recommendation. For example, for a workload with a very spiky usage it is much -harder to determine the optimal balance between performance and resource -utilization, compared to a workload with stable usage. - -#### Complete VPA API object definition #### - -```go -// VerticalPodAutoscaler is the configuration for a vertical pod -// autoscaler, which automatically manages pod resources based on historical and -// real time resource utilization. 
-type VerticalPodAutoscaler struct { - metav1.TypeMeta - // Standard object metadata. - // More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#metadata - // +optional - metav1.ObjectMeta - - // Specification of the behavior of the autoscaler. - // More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#spec-and-status. - // +optional - Spec VerticalPodAutoscalerSpec - - // Current information about the autoscaler. - // +optional - Status VerticalPodAutoscalerStatus -} - -// VerticalPodAutoscalerSpec is the specification of the behavior of the autoscaler. -type VerticalPodAutoscalerSpec { - // A label query that determines the set of pods controlled by the Autoscaler. - // More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#label-selectors - Selector *metav1.LabelSelector - - // Describes the rules on how changes are applied to the pods. - // +optional - UpdatePolicy PodUpdatePolicy - - // Controls how the autoscaler computes recommended resources. - // +optional - ResourcePolicy PodResourcePolicy -} - -// VerticalPodAutoscalerStatus describes the runtime state of the autoscaler. -type VerticalPodAutoscalerStatus { - // The time when the status was last refreshed. - LastUpdateTime metav1.Time - // The most recently computed amount of resources recommended by the - // autoscaler for the controlled pods. - // +optional - Recommendation RecommendedPodResources - // A free-form human readable message describing the status of the autoscaler. - StatusMessage string -} - -// UpdateMode controls when autoscaler applies changes to the pod resources. -type UpdateMode string -const ( - // UpdateModeOff means that autoscaler never changes Pod resources. - // The recommender still sets the recommended resources in the - // VerticalPodAutoscaler object. This can be used for a "dry run". - UpdateModeOff UpdateMode = "Off" - // UpdateModeInitial means that autoscaler only assigns resources on pod - // creation and does not change them during the lifetime of the pod. - UpdateModeInitial UpdateMode = "Initial" - // UpdateModeAuto means that autoscaler assigns resources on pod creation - // and additionally can update them during the lifetime of the pod, - // including evicting / rescheduling the pod. - UpdateModeAuto UpdateMode = "Auto" -) - -// PodUpdatePolicy describes the rules on how changes are applied to the pods. -type PodUpdatePolicy struct { - // Controls when autoscaler applies changes to the pod resources. - // +optional - UpdateMode UpdateMode -} - -const ( - // DefaultContainerResourcePolicy can be passed as - // ContainerResourcePolicy.Name to specify the default policy. - DefaultContainerResourcePolicy = "*" -) -// ContainerResourcePolicy controls how autoscaler computes the recommended -// resources for a specific container. -type ContainerResourcePolicy struct { - // Name of the container or DefaultContainerResourcePolicy, in which - // case the policy is used by the containers that don't have their own - // policy specified. - Name string - // Whether autoscaler is enabled for the container. Defaults to "On". - // +optional - Mode ContainerScalingMode - // Specifies the minimal amount of resources that will be recommended - // for the container. - // +optional - MinAllowed api.ResourceRequirements - // Specifies the maximum amount of resources that will be recommended - // for the container. 
- // +optional - MaxAllowed api.ResourceRequirements -} - -// PodResourcePolicy controls how autoscaler computes the recommended resources -// for containers belonging to the pod. -type PodResourcePolicy struct { - // Per-container resource policies. - ContainerPolicies []ContainerResourcePolicy -} - -// ContainerScalingMode controls whether autoscaler is enabled for a speciifc -// container. -type ContainerScalingMode string -const ( - // ContainerScalingModeOn means autoscaling is enabled for a container. - ContainerScalingModeOn ContainerScalingMode = "On" - // ContainerScalingModeOff means autoscaling is disabled for a container. - ContainerScalingModeOff ContainerScalingMode = "Off" -) - -// RecommendedPodResources is the recommendation of resources computed by -// autoscaler. -type RecommendedPodResources struct { - // Resources recommended by the autoscaler for each container. - ContainerRecommendations []RecommendedContainerResources -} - -// RecommendedContainerResources is the recommendation of resources computed by -// autoscaler for a specific container. Respects the container resource policy -// if present in the spec. -type RecommendedContainerResources struct { - // Name of the container. - Name string - // Recommended amount of resources. - Target api.ResourceRequirements - // Minimum recommended amount of resources. - // Running the application with less resources is likely to have - // significant impact on performance/availability. - // +optional - MinRecommended api.ResourceRequirements - // Maximum recommended amount of resources. - // Any resources allocated beyond this value are likely wasted. - // +optional - MaxRecommended api.ResourceRequirements -} -``` - -### Admission Controller ### - -VPA Admission Controller intercepts Pod creation requests. If the Pod is matched -by a VPA config with mode not set to “off”, the controller rewrites the request -by applying recommended resources to the Pod spec. Otherwise it leaves the Pod -spec unchanged. - -The controller gets the recommended resources by fetching -/recommendedPodResources from the Recommender. If the call times out or fails, -the controller falls back to the recommendation cached in the VPA object. -If this is also not available the controller lets the request pass-through -with originally specified resources. - -Note: in future it will be possible to (optionally) enforce using VPA by marking -the Pod as "requiring VPA". This will disallow scheduling the Pod before a -corresponding VPA config is created. The Admission Controller will reject such -Pods if it finds no matching VPA config. This ability will be convenient for the -user who wants to create the VPA config together with submitting the Pod. - -The VPA Admission Controller will be implemented as an -[External Admission Hook](https://kubernetes.io/docs/admin/extensible-admission-controllers/#external-admission-webhooks). -Note however that this depends on the proposed feature to allow -[mutating webhook admission controllers](/contributors/design-proposals/api-machinery/admission_control_extension.md#future-work). - -### Recommender ### -Recommender is the main component of the VPA. It is responsible for -computing recommended resources. On startup the recommender fetches -historical resource utilization of all Pods (regardless of whether -they use VPA) together with the history of Pod OOM events from the -History Storage. It aggregates this data and keeps it in memory. 
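The Admission Controller fallback described above (live recommendation first, then the recommendation cached in the VPA object, then the pod's original resources) can be summarized as a small decision helper. A minimal sketch; the `resolveResources` name and the string-valued `Resources` map are simplifications for illustration, not the proposal's API types:

```go
package main

import "fmt"

// Resources is a simplified stand-in for container resource requests
// (quantities reduced to strings for illustration).
type Resources map[string]string

// resolveResources applies the fallback order: prefer a live recommendation
// from the Recommender, then the recommendation cached in the VPA object,
// and finally the resources from the original pod spec.
func resolveResources(live, cached, original Resources) Resources {
	if live != nil {
		return live
	}
	if cached != nil {
		return cached
	}
	return original
}

func main() {
	original := Resources{"cpu": "100m", "memory": "128Mi"}
	cached := Resources{"cpu": "250m", "memory": "512Mi"}
	// Recommender unavailable (nil): the cached recommendation wins.
	fmt.Println(resolveResources(nil, cached, original))
}
```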
- -During normal operation the recommender consumes real time updates of -resource utilization and new events via the Metrics API from -the [Metrics Server](https://github.com/kubernetes-incubator/metrics-server). -Additionally it watches all Pods and all VPA objects in the -cluster. For every Pod that is matched by some VPA selector the -Recommender computes the recommended resources and sets the -recommendation on the VPA object. - -It is important to realize that one VPA object has one recommendation. -The user is expected to use one VPA to control Pods with similar -resource usage patterns, typically a group of replicas or shards of -a single workload. - -The Recommender acts as an -[extension-apiserver](https://kubernetes.io/docs/concepts/api-extension/apiserver-aggregation/), -exposing a synchronous method that takes a Pod Spec and the Pod metadata -and returns recommended resources. - -#### Recommender API #### - -```POST /recommendationQuery``` - -Request body: -```go -// RecommendationQuery obtains resource recommendation for a pod. -type RecommendationQuery struct { - metav1.TypeMeta - // +optional - metav1.ObjectMeta - - // Spec is filled in by the caller to request a recommendation. - Spec RecommendationQuerySpec - - // Status is filled in by the server with the recommended pod resources. - // +optional - Status RecommendationQueryStatus -} - -// RecommendationQuerySpec is a request of recommendation for a pod. -type RecommendationQuerySpec struct { - // Pod for which to compute the recommendation. Does not need to exist. - Pod core.Pod -} - -// RecommendationQueryStatus is a response to the recommendation request. -type RecommendationQueryStatus { - // Recommendation holds recommended resources for the pod. - // +optional - Recommendation autoscaler.RecommendedPodResources - // Error indicates that the recommendation was not available. Either - // Recommendation or Error must be present. - // +optional - Error string -} -``` - -Notice that this API method may be called for an existing Pod, as well as for a -yet-to-be-created Pod. - -### Updater ### -VPA Updater is a component responsible for applying recommended resources to -existing Pods. -It monitors all VPA objects and Pods in the cluster, periodically fetching -recommendations for the Pods that are controlled by VPA by calling the -Recommender API. -When recommended resources significantly diverge from actually configured -resources, the Updater may decide to update a Pod. -In MVP (until in-place updates of Pod resources are available) -this means evicting Pods in order to have them recreated with the recommended -resources. - -The Updater relies on other mechanisms (such as Replica Set) to recreate a -deleted Pod. However it does not verify whether such mechanism is actually -configured for the Pod. Such checks could be implemented in the CLI and warn -the user when the VPA would match Pods, that are not automatically restarted. - -While terminating Pods is disruptive and generally undesired, it is sometimes -justified in order to (1) avoid CPU starvation (2) reduce the risk of correlated -OOMs across multiple Pods at random time or (3) save resources over long periods -of time. - -Apart from its own policy on how often a Pod can be evicted, the Updater also -respects the Pod disruption budget, by using Eviction API to evict Pods. - -The Updater only touches pods that point to a VPA with updatePolicy.mode set -to `"Auto"`. 
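The Updater paragraph above says a pod is updated (in MVP, evicted) when its configured resources significantly diverge from the recommendation, whose (min..max) range bounds acceptable values. A minimal sketch of that core test follows; the `shouldEvict` name and the numbers are illustrative, and the disruption-budget and rate-limiting checks the proposal also requires are deliberately omitted:

```go
package main

import "fmt"

// shouldEvict marks a pod as a candidate for eviction (and recreation with
// new resources) when its configured request falls outside the recommended
// (min..max) range for a resource.
func shouldEvict(currentRequest, recommendedMin, recommendedMax float64) bool {
	return currentRequest < recommendedMin || currentRequest > recommendedMax
}

func main() {
	// CPU request in cores: configured 0.1, recommended range 0.25..1.0.
	fmt.Println(shouldEvict(0.1, 0.25, 1.0)) // true: below the recommended minimum
	fmt.Println(shouldEvict(0.5, 0.25, 1.0)) // false: inside the recommended range
}
```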
- -The Updater will also need to understand how to adjust the recommendation before -applying it to a Pod, based on the current state of the cluster (e.g. quota, -space available on nodes or other scheduling constraints). -Otherwise it may deschedule a Pod permanently. This mechanism is not yet -designed. - -### Recommendation model ### - -VPA controls the request (memory and CPU) of containers. In MVP it always sets -the limit to infinity. It is not yet clear whether there is a use-case for VPA -setting the limit. - -The request is calculated based on analysis of the current and previous runs of -the container and other containers with similar properties (name, image, -command, args). -The recommendation model (MVP) assumes that the memory and CPU consumption are -independent random variables with distribution equal to the one observed in the -last N days (recommended value is N=8 to capture weekly peaks). -A more advanced model in future could attempt to detect trends, periodicity and -other time-related patterns. - -For CPU the objective is to **keep the fraction of time when the container usage -exceeds a high percentage (e.g. 95%) of request below a certain threshold** -(e.g. 1% of time). -In this model the "CPU usage" is defined as mean usage measured over a short -interval. The shorter the measurement interval, the better the quality of -recommendations for spiky, latency sensitive workloads. Minimum reasonable -resolution is 1/min, recommended is 1/sec. - -For memory the objective is to **keep the probability of the container usage -exceeding the request in a specific time window below a certain threshold** -(e.g. below 1% in 24h). The window must be long (≥ 24h) to ensure that evictions -caused by OOM do not visibly affect (a) availability of serving applications -(b) progress of batch computations (a more advanced model could allow user to -specify SLO to control this). - -#### Handling OOMs #### -When a container is evicted due to exceeding available memory, its actual memory -requirements are not known (the amount consumed obviously gives the lower -bound). This is modelled by translating OOM events to artificial memory usage -samples by applying a "safety margin" multiplier to the last observed usage. - -### History Storage ### -VPA defines data access API for providers of historical events and resource -utilization. Initially we will use Prometheus as the reference implementation of -this API, at least for the resource utilization part. The historical events -could be backed by another solution, e.g. -[Infrastore](https://github.com/kubernetes/kubernetes/issues/44095). -Users will be able to plug their own implementations. - -History Storage is populated with real time updates of resources utilization and -events, similarly to the Recommender. The storage keeps at least 8 days of data. -This data is only used to initialize the Recommender on startup. - -### Open questions ### -1. How to resolve conflicts if multiple VPA objects match a Pod. - -2. How to adjust the recommendation before applying it to a specific pod, - based on the current state of the cluster (e.g. quota, space available on - nodes or other scheduling constraints). - ------------ -Future work ------------ - -### Pods that require VPA to start ### -In the current proposal the Pod will be scheduled with originally configured -resources if no matching VPA config is present at the Pod admission time. -This may be undesired behavior. 
In particular the user may want to create the -VPA config together with submitting the Pod, which leads to a race condition: -the outcome depends on which resource (VPA or the Pod) is processed first. - -In order to address this problem we propose to allow marking Pods with a special -annotation ("requires VPA") that prevents the Admission Controller from allowing -the Pod if a corresponding VPA is not available. - -An alternative would be to introduce a VPA Initializer serving the same purpose. - -### Combining vertical and horizontal scaling ### -In principle it may be possible to use both vertical and horizontal scaling for -a single workload (group of Pods), as long as the two mechanisms operate on -different resources. -The right approach is to let the -[Horizontal Pod Autoscaler](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/) -scale the group based on the _bottleneck_ resource. The Vertical Pod Autoscaler -could then control other resources. Examples: - -1. A CPU-bound workload can be scaled horizontally based on the CPU utilization -while using vertical scaling to adjust memory. - -2. An IO-bound workload can be scaled horizontally based on the IO throughput -while using vertical scaling to adjust both memory and CPU. - -However this is a more advanced form of autoscaling and it is not well supported -by the MVP version of Vertical Pod Autoscaler. The difficulty comes from the -fact that changing the number of instances affects not only the utilization of -the bottleneck resource (which is the principle of horizontal scaling) but -potentially also non-bottleneck resources that are controlled by VPA. -The VPA model will have to be extended to take the size of the group into account -when aggregating the historical resource utilization and when producing a -recommendation, in order to allow combining it with HPA. - -### Batch workloads ### -Batch workloads have different CPU requirements than latency sensitive -workloads. Instead of latency they care about throughput, which means VPA should -base the CPU requirements on average CPU consumption rather than high -percentiles of CPU distribution. - -TODO: describe the recommendation model for the batch workloads and how VPA will -distinguish between batch and serving. A possible approach is to look at -`PodSpec.restartPolicy`. -An alternative would be to let the user specify the latency requirements of the -workload in the `PodResourcePolicy`. - ------------------------ -Alternatives considered ------------------------ - -### Pods point at VPA ### -*REJECTED BECAUSE IT REQUIRES MODIFYING THE POD SPEC* - -#### proposal: #### -Instead of VPA using label selectors, Pod Spec is extended with an optional -field `verticalPodAutoscalerPolicy`, -a [reference](https://kubernetes.io/docs/api-reference/v1/definitions/#_v1_localobjectreference) -to the VPA config. - -#### pros: #### -* Consistency is enforced at the API level: - * At most one VPA can point to a given Pod. - * It is always clear at admission stage whether the Pod should use - VPA or not. No race conditions. -* It is cheap to find the VPA for a given Pod. - -#### cons: #### -* Requires changing the core part of the API (Pod Spec). - -### VPA points at Deployment ### - -#### proposal: #### -VPA has a reference to Deployment object. Doesn’t use label selector to match -Pods. - -#### pros: #### -* More consistent with HPA. - -#### cons: #### -* Extending VPA support from Deployment to other abstractions that manage Pods - requires additional work. 
VPA must be aware of all such abstractions. -* It is not possible to do a rolling update of the VPA config. - For example setting `max_memory` in the VPA config will apply to the whole - Deployment immediately. -* VPA can’t be shared between deployments. - -### Actuation using the Deployment update mechanism ### - -In this solution the Deployment itself is responsible for actuating VPA -decisions. - -#### Actuation by update of spec #### -In this variant changes of resources are applied similarly to normal changes of -the spec, i.e. using the Deployment rolling update mechanism. - -**pros:** existing clean API (and implementation), one common update policy -(e.g. max surge, max unavailable). - -**cons:** conflicting with user (config) update - update of resources and spec -are tied together (they are executed at the same rate), problem with rollbacks, -problem with pause. Not clear how to handle in-place updates? (this problem has -to be solved regardless of VPA though). - -#### Dedicated method for resource update #### -In this variant Deployment still uses the rolling update mechanism for updating -resources, but update of resources is treated in a special way, so that it can -be performed in parallel with config update. - -**pros:** handles concurrent resources and spec updates, solves resource updates -without VPA, more consistent with HPA, all update logic lives in one place (less -error-prone). - -**cons:** specific to Deployment, high complexity (multiple replica set created -underneath - exposed to the user, can be confusing and error-prone). +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
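The "Handling OOMs" section of the archived proposal above models an OOM kill as an artificial memory usage sample obtained by applying a "safety margin" multiplier to the last observed usage. A minimal sketch; the 1.2 multiplier is an arbitrary illustrative value, since the proposal does not fix a number:

```go
package main

import "fmt"

// oomMemorySample converts an OOM event into an artificial memory usage
// sample by inflating the last observed usage with a safety margin, so the
// recommendation model learns that the real requirement was above what it saw.
func oomMemorySample(lastObservedBytes int64, safetyMargin float64) int64 {
	return int64(float64(lastObservedBytes) * safetyMargin)
}

func main() {
	lastObserved := int64(512 << 20) // 512 MiB observed just before the OOM kill
	fmt.Println(oomMemorySample(lastObserved, 1.2)) // feed this back into the usage history
}
```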
\ No newline at end of file diff --git a/contributors/design-proposals/aws/OWNERS b/contributors/design-proposals/aws/OWNERS deleted file mode 100644 index b035a798..00000000 --- a/contributors/design-proposals/aws/OWNERS +++ /dev/null @@ -1,8 +0,0 @@ -# See the OWNERS docs at https://go.k8s.io/owners - -reviewers: - - provider-aws -approvers: - - provider-aws -labels: - - sig/aws diff --git a/contributors/design-proposals/aws/aws_under_the_hood.md b/contributors/design-proposals/aws/aws_under_the_hood.md index ec8b0740..f0fbec72 100644 --- a/contributors/design-proposals/aws/aws_under_the_hood.md +++ b/contributors/design-proposals/aws/aws_under_the_hood.md @@ -1,305 +1,6 @@ -# Peeking under the hood of Kubernetes on AWS +Design proposals have been archived. -This document provides high-level insight into how Kubernetes works on AWS and -maps to AWS objects. We assume that you are familiar with AWS. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -We encourage you to use [kube-up](../getting-started-guides/aws.md) to create -clusters on AWS. We recommend that you avoid manual configuration but are aware -that sometimes it's the only option. - -Tip: You should open an issue and let us know what enhancements can be made to -the scripts to better suit your needs. - -That said, it's also useful to know what's happening under the hood when -Kubernetes clusters are created on AWS. This can be particularly useful if -problems arise or in circumstances where the provided scripts are lacking and -you manually created or configured your cluster. - -**Table of contents:** - * [Architecture overview](#architecture-overview) - * [Storage](#storage) - * [Auto Scaling group](#auto-scaling-group) - * [Networking](#networking) - * [NodePort and LoadBalancer services](#nodeport-and-loadbalancer-services) - * [Identity and access management (IAM)](#identity-and-access-management-iam) - * [Tagging](#tagging) - * [AWS objects](#aws-objects) - * [Manual infrastructure creation](#manual-infrastructure-creation) - * [Instance boot](#instance-boot) - -### Architecture overview - -Kubernetes is a cluster of several machines that consists of a Kubernetes -master and a set number of nodes (previously known as 'nodes') for which the -master is responsible. See the [Architecture](architecture.md) topic for -more details. - -By default on AWS: - -* Instances run Ubuntu 15.04 (the official AMI). It includes a sufficiently - modern kernel that pairs well with Docker and doesn't require a - reboot. (The default SSH user is `ubuntu` for this and other ubuntu images.) -* Nodes use aufs instead of ext4 as the filesystem / container storage (mostly - because this is what Google Compute Engine uses). - -You can override these defaults by passing different environment variables to -kube-up. - -### Storage - -AWS supports persistent volumes by using [Elastic Block Store (EBS)](https://kubernetes.io/docs/concepts/storage/volumes/#awselasticblockstore). -These can then be attached to pods that should store persistent data (e.g. if -you're running a database). - -By default, nodes in AWS use [instance storage](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html) -unless you create pods with persistent volumes -[(EBS)](https://kubernetes.io/docs/concepts/storage/volumes/#awselasticblockstore). 
In general, Kubernetes -containers do not have persistent storage unless you attach a persistent -volume, and so nodes on AWS use instance storage. Instance storage is cheaper, -often faster, and historically more reliable. Unless you can make do with -whatever space is left on your root partition, you must choose an instance type -that provides you with sufficient instance storage for your needs. - -To configure Kubernetes to use EBS storage, pass the environment variable -`KUBE_AWS_STORAGE=ebs` to kube-up. - -Note: The master uses a persistent volume ([etcd](architecture.md#etcd)) to -track its state. Similar to nodes, containers are mostly run against instance -storage, except that we repoint some important data onto the persistent volume. - -The default storage driver for Docker images is aufs. Specifying btrfs (by -passing the environment variable `DOCKER_STORAGE=btrfs` to kube-up) is also a -good choice for a filesystem. btrfs is relatively reliable with Docker and has -improved its reliability with modern kernels. It can easily span multiple -volumes, which is particularly useful when we are using an instance type with -multiple ephemeral instance disks. - -### Auto Scaling group - -Nodes (but not the master) are run in an -[Auto Scaling group](http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/AutoScalingGroup.html) -on AWS. Currently auto-scaling (e.g. based on CPU) is not actually enabled -([#11935](http://issues.k8s.io/11935)). Instead, the Auto Scaling group means -that AWS will relaunch any nodes that are terminated. - -We do not currently run the master in an AutoScalingGroup, but we should -([#11934](http://issues.k8s.io/11934)). - -### Networking - -Kubernetes uses an IP-per-pod model. This means that a node, which runs many -pods, must have many IPs. AWS uses virtual private clouds (VPCs) and advanced -routing support so each EC2 instance is assigned a /24 CIDR in the VPC routing -table. - -It is also possible to use overlay networking on AWS, but that is not the -default configuration of the kube-up script. - -### NodePort and LoadBalancer services - -Kubernetes on AWS integrates with [Elastic Load Balancing -(ELB)](http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/US_SetUpASLBApp.html). -When you create a service with `Type=LoadBalancer`, Kubernetes (the -kube-controller-manager) will create an ELB, create a security group for the -ELB which allows access on the service ports, attach all the nodes to the ELB, -and modify the security group for the nodes to allow traffic from the ELB to -the nodes. This traffic reaches kube-proxy where it is then forwarded to the -pods. - -ELB has some restrictions: -* ELB requires that all nodes listen on a single port, -* ELB acts as a forwarding proxy (i.e. the source IP is not preserved, but see below -on ELB annotations for pods speaking HTTP). - -To work with these restrictions, in Kubernetes, [LoadBalancer -services](https://kubernetes.io/docs/concepts/services-networking/service/#type-loadbalancer) are exposed as -[NodePort services](https://kubernetes.io/docs/concepts/services-networking/service/#type-nodeport). Then -kube-proxy listens externally on the cluster-wide port that's assigned to -NodePort services and forwards traffic to the corresponding pods. - -For example, if we configure a service of Type LoadBalancer with a -public port of 80: -* Kubernetes will assign a NodePort to the service (e.g. 
port 31234) -* ELB is configured to proxy traffic on the public port 80 to the NodePort -assigned to the service (in this example port 31234). -* Then any in-coming traffic that ELB forwards to the NodePort (31234) -is recognized by kube-proxy and sent to the correct pods for that service. - -Note that we do not automatically open NodePort services in the AWS firewall -(although we do open LoadBalancer services). This is because we expect that -NodePort services are more of a building block for things like inter-cluster -services or for LoadBalancer. To consume a NodePort service externally, you -will likely have to open the port in the node security group -(`kubernetes-node-<clusterid>`). - -For SSL support, starting with 1.3 two annotations can be added to a service: - -``` -service.beta.kubernetes.io/aws-load-balancer-ssl-cert=arn:aws:acm:us-east-1:123456789012:certificate/12345678-1234-1234-1234-123456789012 -``` - -The first specifies which certificate to use. It can be either a -certificate from a third party issuer that was uploaded to IAM or one created -within AWS Certificate Manager. - -``` -service.beta.kubernetes.io/aws-load-balancer-backend-protocol=(https|http|ssl|tcp) -``` - -The second annotation specifies which protocol a pod speaks. For HTTPS and -SSL, the ELB will expect the pod to authenticate itself over the encrypted -connection. - -HTTP and HTTPS will select layer 7 proxying: the ELB will terminate -the connection with the user, parse headers and inject the `X-Forwarded-For` -header with the user's IP address (pods will only see the IP address of the -ELB at the other end of its connection) when forwarding requests. - -TCP and SSL will select layer 4 proxying: the ELB will forward traffic without -modifying the headers. - -### Identity and Access Management (IAM) - -kube-proxy sets up two IAM roles, one for the master called -[kubernetes-master](../../cluster/aws/templates/iam/kubernetes-master-policy.json) -and one for the nodes called -[kubernetes-node](../../cluster/aws/templates/iam/kubernetes-minion-policy.json). - -The master is responsible for creating ELBs and configuring them, as well as -setting up advanced VPC routing. Currently it has blanket permissions on EC2, -along with rights to create and destroy ELBs. - -The nodes do not need a lot of access to the AWS APIs. They need to download -a distribution file, and then are responsible for attaching and detaching EBS -volumes from itself. - -The node policy is relatively minimal. In 1.2 and later, nodes can retrieve ECR -authorization tokens, refresh them every 12 hours if needed, and fetch Docker -images from it, as long as the appropriate permissions are enabled. Those in -[AmazonEC2ContainerRegistryReadOnly](http://docs.aws.amazon.com/AmazonECR/latest/userguide/ecr_managed_policies.html#AmazonEC2ContainerRegistryReadOnly), -without write access, should suffice. The master policy is probably overly -permissive. The security conscious may want to lock-down the IAM policies -further ([#11936](http://issues.k8s.io/11936)). - -We should make it easier to extend IAM permissions and also ensure that they -are correctly configured ([#14226](http://issues.k8s.io/14226)). - -### Tagging - -All AWS resources are tagged with a tag named "KubernetesCluster", with a value -that is the unique cluster-id. This tag is used to identify a particular -'instance' of Kubernetes, even if two clusters are deployed into the same VPC. 
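Referring back to the two ELB annotations described under NodePort and LoadBalancer services above, a minimal sketch of a `Type=LoadBalancer` Service that uses them is shown below. The service name, selector, and ports are placeholders, and the certificate ARN reuses the illustrative value from the annotation example; this is not a manifest taken from this proposal.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-https-service          # placeholder name
  annotations:
    # Certificate served by the ELB; the ARN is a placeholder.
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "arn:aws:acm:us-east-1:123456789012:certificate/12345678-1234-1234-1234-123456789012"
    # The pods speak plain HTTP behind the ELB, so layer 7 proxying is selected.
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
spec:
  type: LoadBalancer
  selector:
    app: my-app                   # placeholder selector
  ports:
  - port: 443                     # public port terminated on the ELB
    targetPort: 8080              # container port reached via the NodePort and kube-proxy
```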
-Resources are considered to belong to the same cluster if and only if they have -the same value in the tag named "KubernetesCluster". (The kube-up script is -not configured to create multiple clusters in the same VPC by default, but it -is possible to create another cluster in the same VPC.) - -Within the AWS cloud provider logic, we filter requests to the AWS APIs to -match resources with our cluster tag. By filtering the requests, we ensure -that we see only our own AWS objects. - -**Important:** If you choose not to use kube-up, you must pick a unique -cluster-id value, and ensure that all AWS resources have a tag with -`Name=KubernetesCluster,Value=<clusterid>`. - -### AWS objects - -The kube-up script does a number of things in AWS: -* Creates an S3 bucket (`AWS_S3_BUCKET`) and then copies the Kubernetes -distribution and the salt scripts into it. They are made world-readable and the -HTTP URLs are passed to instances; this is how Kubernetes code gets onto the -machines. -* Creates two IAM profiles based on templates in [cluster/aws/templates/iam](../../cluster/aws/templates/iam/): - * `kubernetes-master` is used by the master. - * `kubernetes-node` is used by nodes. -* Creates an AWS SSH key named `kubernetes-<fingerprint>`. Fingerprint here is -the OpenSSH key fingerprint, so that multiple users can run the script with -different keys and their keys will not collide (with near-certainty). It will -use an existing key if one is found at `AWS_SSH_KEY`, otherwise it will create -one there. (With the default Ubuntu images, if you have to SSH in: the user is -`ubuntu` and that user can `sudo`). -* Creates a VPC for use with the cluster (with a CIDR of 172.20.0.0/16) and -enables the `dns-support` and `dns-hostnames` options. -* Creates an internet gateway for the VPC. -* Creates a route table for the VPC, with the internet gateway as the default -route. -* Creates a subnet (with a CIDR of 172.20.0.0/24) in the AZ `KUBE_AWS_ZONE` -(defaults to us-west-2a). Currently, each Kubernetes cluster runs in a -single AZ on AWS. Although, there are two philosophies in discussion on how to -achieve High Availability (HA): - * cluster-per-AZ: An independent cluster for each AZ, where each cluster -is entirely separate. - * cross-AZ-clusters: A single cluster spans multiple AZs. -The debate is open here, where cluster-per-AZ is discussed as more robust but -cross-AZ-clusters are more convenient. -* Associates the subnet to the route table -* Creates security groups for the master (`kubernetes-master-<clusterid>`) -and the nodes (`kubernetes-node-<clusterid>`). -* Configures security groups so that masters and nodes can communicate. This -includes intercommunication between masters and nodes, opening SSH publicly -for both masters and nodes, and opening port 443 on the master for the HTTPS -API endpoints. -* Creates an EBS volume for the master of size `MASTER_DISK_SIZE` and type -`MASTER_DISK_TYPE`. -* Launches a master with a fixed IP address (172.20.0.9) that is also -configured for the security group and all the necessary IAM credentials. An -instance script is used to pass vital configuration information to Salt. Note: -The hope is that over time we can reduce the amount of configuration -information that must be passed in this way. -* Once the instance is up, it attaches the EBS volume and sets up a manual -routing rule for the internal network range (`MASTER_IP_RANGE`, defaults to -10.246.0.0/24). -* For auto-scaling, on each nodes it creates a launch configuration and group. 
-The name for both is <*KUBE_AWS_INSTANCE_PREFIX*>-node-group. The default -name is kubernetes-node-group. The auto-scaling group has a min and max size -that are both set to NUM_NODES. You can change the size of the auto-scaling -group to add or remove the total number of nodes from within the AWS API or -Console. Each nodes self-configures, meaning that they come up; run Salt with -the stored configuration; connect to the master; are assigned an internal CIDR; -and then the master configures the route-table with the assigned CIDR. The -kube-up script performs a health-check on the nodes but it's a self-check that -is not required. - -If attempting this configuration manually, it is recommend to follow along -with the kube-up script, and being sure to tag everything with a tag with name -`KubernetesCluster` and value set to a unique cluster-id. Also, passing the -right configuration options to Salt when not using the script is tricky: the -plan here is to simplify this by having Kubernetes take on more node -configuration, and even potentially remove Salt altogether. - -### Manual infrastructure creation - -While this work is not yet complete, advanced users might choose to manually -create certain AWS objects while still making use of the kube-up script (to -configure Salt, for example). These objects can currently be manually created: -* Set the `AWS_S3_BUCKET` environment variable to use an existing S3 bucket. -* Set the `VPC_ID` environment variable to reuse an existing VPC. -* Set the `SUBNET_ID` environment variable to reuse an existing subnet. -* If your route table has a matching `KubernetesCluster` tag, it will be reused. -* If your security groups are appropriately named, they will be reused. - -Currently there is no way to do the following with kube-up: -* Use an existing AWS SSH key with an arbitrary name. -* Override the IAM credentials in a sensible way -([#14226](http://issues.k8s.io/14226)). -* Use different security group permissions. -* Configure your own auto-scaling groups. - -If any of the above items apply to your situation, open an issue to request an -enhancement to the kube-up script. You should provide a complete description of -the use-case, including all the details around what you want to accomplish. - -### Instance boot - -The instance boot procedure is currently pretty complicated, primarily because -we must marshal configuration from Bash to Salt via the AWS instance script. -As we move more post-boot configuration out of Salt and into Kubernetes, we -will hopefully be able to simplify this. - -When the kube-up script launches instances, it builds an instance startup -script which includes some configuration options passed to kube-up, and -concatenates some of the scripts found in the cluster/aws/templates directory. -These scripts are responsible for mounting and formatting volumes, downloading -Salt and Kubernetes from the S3 bucket, and then triggering Salt to actually -install Kubernetes. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
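As a closing illustration for the Storage section above, a hedged sketch of a PersistentVolume backed by EBS follows; the name, size, and volume ID are placeholders and not values from this document.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-ebs-example                  # placeholder name
spec:
  capacity:
    storage: 20Gi                       # placeholder size
  accessModes:
  - ReadWriteOnce                       # an EBS volume attaches to a single node at a time
  awsElasticBlockStore:
    volumeID: vol-0123456789abcdef0     # placeholder EBS volume ID
    fsType: ext4
```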
\ No newline at end of file diff --git a/contributors/design-proposals/cli/OWNERS b/contributors/design-proposals/cli/OWNERS deleted file mode 100644 index 96fdea25..00000000 --- a/contributors/design-proposals/cli/OWNERS +++ /dev/null @@ -1,8 +0,0 @@ -# See the OWNERS docs at https://go.k8s.io/owners - -reviewers: - - sig-cli-leads -approvers: - - sig-cli-leads -labels: - - sig/cli diff --git a/contributors/design-proposals/cli/apply_refactor.md b/contributors/design-proposals/cli/apply_refactor.md index 8f9ca6da..f0fbec72 100644 --- a/contributors/design-proposals/cli/apply_refactor.md +++ b/contributors/design-proposals/cli/apply_refactor.md @@ -1,184 +1,6 @@ -# Apply v2 +Design proposals have been archived. -## Background +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -`kubectl apply` reads a file or set of files, and updates the cluster state based off the file contents. -It does a couple things: - -1. Create / Update / (Delete) the live resources based on the file contents -2. Update currently and previously configured fields, without clobbering fields set by other means, - such as imperative kubectl commands, other deployment and management tools, admission controllers, - initializers, horizontal and vertical autoscalers, operators, and other controllers. - -Essential complexity in the apply code comes from supporting custom strategies for -merging fields built into the object schema - -such as merging lists together based on a field `key` and deleting individual -items from a list by `key`. - -Accidental complexity in the apply code comes from the structure growing organically in ways that have -broken encapsulation and separation of concerns. This has lead to maintenance challenges as -keeping ordering for items in a list, and correctly merging lists of primitives (`key`less). - -Round tripping changes through PATCHes introduces additional accidental complexity, -as they require imperative directives that are not part of the object schema. - -## Objective - - -Reduce maintenance burden by minimizing accidental complexity in the apply codebase. - -This should help: - -- Simplify introducing new merge semantics -- Simplify enabling / disabling new logic with flags - -## Changes - -Implementation of proposed changes under review in PR [52349](https://github.com/kubernetes/kubernetes/pull/52349) - -### Use read-update instead of patch - -#### Why - -Building a PATCH from diff creates additional code complexity vs directly updating the object. - -- Need to generate imperative delete directives instead of simply deleting an item from a list. -- Using PATCH semantics and directives is less well known and understood by most users - than using the object schema itself. This makes it harder for non-experts to maintain the codebase. -- Using PATCH semantics is more work to implement a diff of the changes as - PATCH must be separately merged on the remote object for to display the diff. - -#### New approach - -1. Read the live object -2. Compare the live object to last-applied and local files -3. Update the fields on the live object that was read -4. Send a PUT to update the modified object -5. If encountering optimistic lock failure, retry back to 1. - -### Restructure code into modular components - -In the current implementation of apply - parsing and traversing the object trees, diffing the -contents and generating the patch are entangled. This creates maintenance and -testing challenges. 
We should instead encapsulate discrete responsibilities in separate packages - -such as collating the object values and updating the target object. - -#### Phase 1: Parse last-applied, local, live objects and collate - -Provide a structure that contains the last, local and live value for each field. This -will make it easy to walk a single tree when making decisions about how to update the object. -Decisions about ordering of lists or parsing metadata for fields are made here. - -#### Phase 2: Diff and update objects - -Use the visitor pattern to encapsulate how to update each field type for each merge strategy. -Unit test each visit function. Decisions about how to replace, merge, or delete a field or -list item are made here. - -## Notable items - -- Merge will use openapi to get the schema from the server -- Merge can be run either on the server side or the client side -- Merge can handle 2-way or 3-way merges of objects (initially will not support PATCH directives) - -## Out of scope of this doc - -In order to make apply sufficiently maintainable and extensible to new API types, as well as to make its -behavior more intuitive for users, the merge behavior, including how it is specified in the API schema, -must be systematically redesigned and more thoroughly tested. - -Examples of issues that need to be resolved - -- schema metadata `patchStrategy` and `mergeKey` are implicit, unversioned and incorrect in some cases. - to fix the incorrect metadata, the metadata must be versioned so PATCHes generated with old metadata continue - to be merged by the server in the manner they were intended - - need to version all schema metadata for each object and provide this as part of the request - - e.g. container port [39188](https://github.com/kubernetes/kubernetes/issues/39188) -- no semantic way to represent union fields [35345](https://github.com/kubernetes/kubernetes/issues/35345) - - -## Detailed analysis of structure and impact today - -The following PRs constitute the focus of ~6 months of engineering work. Each of the PRs is very complex -relative to the problem it is solving.
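Before looking at the individual PRs below, here is a small hypothetical illustration of the three inputs that the collate phase described above gathers, and the result a correct merge should produce; the field names are invented for this sketch.

```yaml
# last-applied: what the user previously applied
last-applied:
  replicas: 2
  labels:
    app: web
# local: the user's current configuration file (replicas removed, tier added)
local:
  labels:
    app: web
    tier: frontend
# live: the object currently stored in the cluster (another writer set paused)
live:
  replicas: 2
  labels:
    app: web
  paused: true
# result: replicas is cleared because it was removed from the local file,
# tier is added, and paused is preserved because apply never managed it
result:
  labels:
    app: web
    tier: frontend
  paused: true
```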
- -### Patterns observed - -- PRs frequently closed or deferred because maintainers / reviewers cannot reason about the impact or - correctness of the changes -- Relatively simple changes - - are 200+ lines of code - - modify dozens of existing locations in the code - - are spread across 1000+ lines of existing code -- Changes that add new directives require updates in multiple locations - create patch + apply patch - -### PRs - -[38665](https://github.com/kubernetes/kubernetes/pull/38665/files) -- Support deletion of primitives from lists -- Lines (non-test): ~200 -- ~6 weeks -[44597](https://github.com/kubernetes/kubernetes/pull/44597/files) -- Support deleting fields not listed in the patch -- Lines (non-test): ~250 -- ~6 weeks -[45980](https://github.com/kubernetes/kubernetes/pull/45980/files#diff-101008d96c4444a5813f7cb6b54aaff6) -- Keep ordering of items when merging lists -- Lines (non-test): ~650 -[46161](https://github.com/kubernetes/kubernetes/pull/46161/files#diff-101008d96c4444a5813f7cb6b54aaff6) -- Support using multiple fields for a merge key -- Status: Deferred indefinitely - too hard for maintainers to understand impact and correctness of changes -[46560](https://github.com/kubernetes/kubernetes/pull/46560/files) -- Support diff apply (1st attempt) -- Status: Closed - too hard for maintainers to understand impact and correctness of changes -[49174](https://github.com/kubernetes/kubernetes/pull/49174/files) -- Support diff apply (2nd attempt) -- Status: Deferred indefinitely - too hard for maintainers to understand impact and correctness of changes -- Maintainer reviews: 3 - - -### Analysis - causes of complexity - -Apply is implemented by diffing the 3 sources (last-applied, local, remote) as 2 2-way diffs and then -merging the results of those 2 diffs into a 3rd result. The diffs can each produce patch request where -a single logic update (e.g. remove 'foo' and add 'bar' to a field that is a list) may require spreading the -patch result across multiple pieces of the patch (a 'delete' directive, an 'order' directive -and the list itself). - -Because of the way diff is implemented with 2-way diffs, a simple bit of logic -"compare local to remote" and do X - is non-trivial to define. The code that compares local to remote -is also executed to compare last-applied to local, but with the local argument differing in location. -To compare local to remote means understanding what will happen when the same code is executed -comparing last-applied to local, and then putting in the appropriate guards to short-circuit the -logic in one context or the other as needed. last-applied and remote are not compared directly, and instead -are only compared indirectly when the 2 diff results are merged. Information that is redundant or -should be checked for consistency across all 3 sources (e.g. checking for conflicts) is spread across -3 logic locations - the first 2-way diff, the second 2-way diff and the merge of the 2 diffs. - -That the diffs each may produce multiple patch directives + results that constitute an update to a single -field compounds the complexity of that comparing a single field occurs across 3 locations. - -The diff / patch logic itself does not follow any sort of structure to encapsulate complexity -into components so that logic doesn't bleed cross concerns. 
The logic to collate the last-applied, local and -remote field values, the logic to diff the field values and the logic to create the patch is -all combined in the same group of package-scoped functions, instead of encapsulating -each of these responsibilities in its own interface. - -Sprinkling the implementation across dozens of locations makes it very challenging to -flag guard the new behavior. If issues are discovered during the stabilization period we cannot -easily revert to the previous behavior by changing a default flag value. The inability to build -in these sorts of break-glass options further degrades confidence in safely accepting PRs. - -This is a text-book example of what the [Visitor pattern](https://en.wikipedia.org/wiki/Visitor_pattern) -was designed to address. - -- Encapsulate logic in *Element*s and *Visitor*s -- Introduce logic for new a field type by adding a new *Element* type -- Introduce logic for new a merge strategy by defining a new *Visitor* implementation -- Introduce logic on structuring of a field by updating the parsing function for that field type - -If the apply diff logic was redesigned, most of the preceding PRs could be implemented by -only touching a few existing code locations to introduce the new type / method, and -then encapsulating the logic in a single type. This would make it simple to flag guard -new behaviors before defaulting them to on. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/cli/get-describe-apiserver-extensions.md b/contributors/design-proposals/cli/get-describe-apiserver-extensions.md index aa129f4a..f0fbec72 100644 --- a/contributors/design-proposals/cli/get-describe-apiserver-extensions.md +++ b/contributors/design-proposals/cli/get-describe-apiserver-extensions.md @@ -1,192 +1,6 @@ -# Provide open-api extensions for kubectl get / kubectl describe columns +Design proposals have been archived. -Status: Pending +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Version: Alpha -## Motivation - -`kubectl get` and `kubectl describe` do not provide a rich experience -for resources retrieved through federated apiservers and types not -compiled into the kubectl binary. Kubectl should support printing -columns configured per-type without having the types compiled in. - -## Proposal - -Allow the apiserver to define the type specific columns that will be -printed using the open-api swagger.json spec already fetched by kubectl. -This provides a limited describe to only print out fields on the object -and related events. - -**Note:** This solution will only work for types compiled into the apiserver -providing the open-api swagger.json to kubectl. This solution will -not work for TPR, though TPR could possibly be solved in a similar -way by apply an annotation with the same key / value to the TPR. - -## User Experience - -### Use Cases - -- As a user, when I run `kubectl get` on sig-service-catalog resources - defined in a federated apiserver, I want to see more than just the - name and the type of the resource. -- As a user, when I run `kubectl describe` on sig-service-catalog - resources defined in a federated apiserver, I want the command - to succeed, and to see events for the resource along with important - fields of the resource. - -## Implementation - -Define the open-api extensions `x-kubernetes-kubectl-get-columns` and -`x-kubernetes-kubectl-describe-columns`. These extensions have a -string value containing the columns to be printed by kubectl. The -string format is the same as the `--custom-columns` for `kubectl get`. - -### Apiserver - -- Populate the open-api extension value for resource types. - -This is done by hardcoding the extension for types compiled into -the api server. As such this is only a solution for types -implemented using federated apiservers. - -### Kubectl - -Overview: - -- In `kubectl get` use the `x-kubernetes-kubectl-get-columns` value - when printing an object iff 1) it is defined and 2) the output type - is "" (empty string) or "wide". - -- In `kubectl describe` use the `x-kubernetes-kubectl-describe-columns` value - when printing an object iff 1) it is defined - - -#### Option 1: Re-parse the open-api swagger.json in a kubectl library - -Re-parse the open-api swagger.json schema and build a map of group version kind -> columns -parsed from the schema. For this would look similar to validation/schema.go - -In get.go and describe.go: After fetching the "Infos" from the -resource builder, lookup the group version kind from the populated map. 
- -**Pros:** - - Simple and straightforward solution - - Scope of impacted Kubernetes components is minimal - - Doable in 1.6 - -**Cons:** - - Hacky solution - - Can not be cleanly extended to support TPR - -#### Option 2: Modify api-machinery RestMapper - -Modify the api-machinery RestMapper to parse extensions prefixed -with `x-kubernetes` and include them in the *RestMapping* used by the resource builder. - -```go -type RESTMapping struct { - // Resource is a string representing the name of this resource as a REST client would see it - Resource string - - GroupVersionKind schema.GroupVersionKind - - // Scope contains the information needed to deal with REST Resources that are in a resource hierarchy - Scope RESTScope - - runtime.ObjectConvertor - MetadataAccessor - - // Extensions - ApiExtensions ApiExtensions -} - -type ApiExtensions struct { - Extensions map[string]interface{} -} -``` - -The tags would then be easily accessible from the kubectl get / describe -functions through: `resource.Builder -> Infos -> Mapping -> DisplayOptions` - -**Pros:** - - Clean + generalized solution - - The same strategy can be applied to support TPR - - Can support exposing future extensions such as patchStrategy and mergeKey - - Can be used by other clients / tools - -**Cons:** - - Fields are only loosely tied to rest - - Complicated due to the broad scope and impact - - May not be doable in 1.6 - -#### Considerations - -What should be used for oth an open-api extension columns tag AND a -compiled in printer exist for a type? - -- Apiserver only provides `describe` for types that are never compiled in - - Compiled in `describe` is much more rich - aggregating data across many other types. - e.g. Node describe aggregating Pod data - - kubectl will not be able to provide any `describe` information for new types when version skewed against a newer server -- Always use the extensions if present - - Allows server to control columns. Adds new columns for types on old clients that maybe missing the columns. -- Always use the compiled in commands if present - - The compiled in `describe` is richer and provides aggregated information about many types. -- Always use the `get` extension if present. Always use the `describe` compiled in code if present. - - Inconsistent behavior across how extensions are handled - -### Client/Server Backwards/Forwards compatibility - -#### Newer client - -Client doesn't find the open-api extensions. Fallback on 1.5 behavior. - -In the future, this will provide stronger backwards / forwards compatibility -as it will allow clients to print objects - -#### Newer server - -Client doesn't respect open-api extensions. Uses 1.5 behavior. - -## Alternatives considered - -### Fork Kubectl and compile in go types - -Fork kubectl and compile in the go types. Implement get / describe -for the new types in the forked version. - -**Pros:** *This is what will happen for sig-service catalog if we take no action in 1.6* - -**Cons:** Bad user experience. No clear solution for patching forked kubectl. -User has to use a separate kubectl binary per-apiserver. Bad president. - -I really don't want this solution to be used. - -### Kubectl describe fully implemented in the server - -Implement a sub-resource "/describe" in the apiserver. This executes -the describe business logic for the object and returns either a string -or json blob for kubectl to print. - -**Pros:** Higher fidelity. Can aggregate data and fetch other objects. - -**Cons:** Higher complexity. Requires more api changes. 
- -### Write per-type columns to kubectl.config or another local file - -Support checking a local file containing per-type information including -the columns to print. - -**Pros:** Simplest solution. Easy for user to override values. - -**Cons:** Requires manual configuration on user side. Does not provide a consistent experience across clients. - -### Write per-type go templates to kubectl.config or another local file - -Support checking a local file containing per-type information including -the go template. - -**Pros:** Higher fidelity. Easy for user to override values. - -**Cons:** Higher complexity. Requires manual configuration on user side. Does not provide a consistent experience across clients. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
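For reference, the `x-kubernetes-kubectl-get-columns` and `x-kubernetes-kubectl-describe-columns` extensions described in the archived proposal above might surface in the served swagger.json roughly as follows (rendered here as YAML); the type name and column specifications are illustrative assumptions, not values taken from the proposal.

```yaml
definitions:
  io.k8s.servicecatalog.v1alpha1.Instance:        # hypothetical federated-apiserver type
    description: An example resource served by an aggregated apiserver.
    # Same column syntax as `kubectl get -o custom-columns=...`
    x-kubernetes-kubectl-get-columns: "NAME:.metadata.name,CLASS:.spec.serviceClassName,AGE:.metadata.creationTimestamp"
    x-kubernetes-kubectl-describe-columns: "Name:.metadata.name,Namespace:.metadata.namespace,Class:.spec.serviceClassName"
```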
\ No newline at end of file diff --git a/contributors/design-proposals/cli/kubectl-create-from-env-file.md b/contributors/design-proposals/cli/kubectl-create-from-env-file.md index 71d6d853..f0fbec72 100644 --- a/contributors/design-proposals/cli/kubectl-create-from-env-file.md +++ b/contributors/design-proposals/cli/kubectl-create-from-env-file.md @@ -1,84 +1,6 @@ -# Kubectl create configmap/secret --env-file +Design proposals have been archived. -## Goals +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Allow a Docker environment file (.env) to populate an entire `ConfigMap` or `Secret`. -The populated `ConfigMap` or `Secret` can be referenced by a pod to load all -the data contained within. -## Design - -The `create configmap` subcommand would add a new option called -`--from-env-file`. The option will accept a single file. The option may not be -used in conjunction with `--from-file` or `--from-literal`. - -The `create secret generic` subcommand would add a new option called -`--from-env-file`. The option will accept a single file. The option may not be -used in conjunction with `--from-file` or `--from-literal`. - -### Environment file specification - -An environment file consists of lines to be in VAR=VAL format. Lines beginning -with # (i.e. comments) are ignored, as are blank lines. Any whitespace in -front of the VAR is removed. VAR must be a valid C_IDENTIFIER. If the line -consists of just VAR, then the VAL will be given a value from the current -environment. - -Any ill-formed line will be flagged as an error and will prevent the -`ConfigMap` or `Secret` from being created. - -[Docker's environment file processing](https://github.com/moby/moby/blob/master/opts/env.go) - -## Examples - -``` -$ cat game.env -enemies=aliens -lives=3 -enemies_cheat=true -enemies_cheat_level=noGoodRotten -secret_code_passphrase=UUDDLRLRBABAS -secret_code_allowed=true -secret_code_lives=30 -``` - -Create configmap from an env file: -``` -kubectl create configmap game-config --from-env-file=./game.env -``` - -The populated configmap would look like: -``` -$ kubectl get configmaps game-config -o yaml - -apiVersion: v1 -data: - enemies: aliens - lives: 3 - enemies_cheat: true - enemies_cheat_level: noGoodRotten - secret_code_passphrase: UUDDLRLRBABAS - secret_code_allowed: true - secret_code_lives: 30 -``` - -Create secret from an env file: -``` -kubectl create secret generic game-config --from-env-file=./game.env -``` - -The populated secret would look like: -``` -$ kubectl get secret game-config -o yaml - -apiVersion: v1 -type: Opaque -data: - enemies: YWxpZW5z - enemies_cheat: dHJ1ZQ== - enemies_cheat_level: bm9Hb29kUm90dGVu - lives: Mw== - secret_code_allowed: dHJ1ZQ== - secret_code_lives: MzA= - secret_code_passphrase: VVVERExSTFJCQUJBUw== -``` +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
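The goals of the archived proposal above note that a pod can load everything from the populated `ConfigMap`; one plausible way to express that reference is sketched below using `envFrom`, with the pod, container, and image names as placeholders. Running such a pod would print each key from `game.env` as an environment variable.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: game-pod                   # placeholder name
spec:
  containers:
  - name: game                     # placeholder container name
    image: busybox                 # placeholder image
    command: ["env"]               # print the injected environment and exit
    envFrom:
    - configMapRef:
        name: game-config          # ConfigMap created from game.env above
  restartPolicy: Never
```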
\ No newline at end of file diff --git a/contributors/design-proposals/cli/kubectl-extension.md b/contributors/design-proposals/cli/kubectl-extension.md index 1589f4c3..f0fbec72 100644 --- a/contributors/design-proposals/cli/kubectl-extension.md +++ b/contributors/design-proposals/cli/kubectl-extension.md @@ -1,52 +1,6 @@ +Design proposals have been archived. -# Kubectl Extension +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Abstract --------- - -Allow `kubectl` to be extended to include other commands that can provide new functionality without recompiling Kubectl - - -Motivation and Background -------------------------- - -Kubernetes is designed to be a composable and extensible system, with the ability to add new APIs and features via Third Party Resources -or API federation, by making the server provide functionality that eases writing generic clients, and by supporting other authentication -systems. Given that `kubectl` is the primary method for interacting with the server, some new extensions are difficult to make usable -for end users without recompiling that command. In addition, it is difficult to prototype new functionality for kubectl outside of the -Kubernetes source tree. - -Ecosystem tools like OpenShift, Deis, and Helm add additional workflow around kubectl targeted at the end user. It is beneficial -to encourage workflows to develop around Kubernetes without requiring them to be part of Kubernetes to both the end user community -and the Kubernetes developer community. - -There are many tools that currently offer CLI extension for the same reasons - [Git](https://www.kernel.org/pub/software/scm/git/docs/howto/new-command.html) and -[Heroku](https://devcenter.heroku.com/articles/developing-cli-plug-ins#creating-the-package) are two relevant examples in the space. - - -Proposal --------- - -Define a system for `kubectl` that allows new subcommands and subcommand trees to be added by placing an executable in a specific -location on disk, like Git. Allow third parties to extend kubectl by placing their extensions in that directory. Ensure that help -and other logic correctly includes those extensions. - -A kubectl command extension would be an executable located in `EXEC_PATH` (an arbitrary directory to be defined that follows similar -conventions in Linux) with a name pattern like `kubectl-COMMAND[-SUBCOMMAND[...]]` with one or many sub parts. The presence of -a command extension overrides any built in command. - -A key requirement is that the lookup be fast (since it would be invoked on every execution of `kubectl`) and so some true extension -behavior (such as complex inference of commands) may not be supported in order to reduce the complexity of the lookup. - -Kubectl would lazily include the appropriate commands in preference to the internal command structure if detected (a user asking for -`kubectl a b c` would *first* check for `kubectl-a-b-c`, `kubectl-a-b`, or `kubectl-a` before loading the internal command). - -All kubectl command extensions MUST: - -* Support the `-h` and `--help` flags to display a help page -* Respect the semantics of KUBECONFIG lookup (to be further specified) - -All kubectl command extensions SHOULD: - -* Follow the display and output conventions of normal kubectl commands. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/cli/kubectl-login.md b/contributors/design-proposals/cli/kubectl-login.md index 01ab19bd..f0fbec72 100644 --- a/contributors/design-proposals/cli/kubectl-login.md +++ b/contributors/design-proposals/cli/kubectl-login.md @@ -1,216 +1,6 @@ -# Kubectl Login Subcommand +Design proposals have been archived. -**Authors**: Eric Chiang (@ericchiang) +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Goals - -`kubectl login` is an entrypoint for any user attempting to connect to an -existing server. It should provide a more tailored experience than the existing -`kubectl config` including config validation, auth challenges, and discovery. - -Short term the subcommand should recognize and attempt to help: - -* New users with an empty configuration trying to connect to a server. -* Users with no credentials, by prompt for any required information. -* Fully configured users who want to validate credentials. -* Users trying to switch servers. -* Users trying to reauthenticate as the same user because credentials have expired. -* Authenticate as a different user to the same server. - -Long term `kubectl login` should enable authentication strategies to be -discoverable from a master to avoid the end-user having to know how their -sysadmin configured the Kubernetes cluster. - -## Design - -The "login" subcommand helps users move towards a fully functional kubeconfig by -evaluating the current state of the kubeconfig and trying to prompt the user for -and validate the necessary information to login to the kubernetes cluster. - -This is inspired by a similar tools such as: - - * [os login](https://docs.openshift.org/latest/cli_reference/get_started_cli.html#basic-setup-and-login) - * [gcloud auth login](https://cloud.google.com/sdk/gcloud/reference/auth/login) - * [aws configure](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html) - -The steps taken are: - -1. If no cluster configured, prompt user for cluster information. -2. If no user is configured, discover the authentication strategies supported by the API server. -3. Prompt the user for some information based on the authentication strategy they choose. -4. Attempt to login as a user, including authentication challenges such as OAuth2 flows, and display user info. - -Importantly, each step is skipped if the existing configuration is validated or -can be supplied without user interaction (refreshing an OAuth token, redeeming -a Kerberos ticket, etc.). Users with fully configured kubeconfigs will only see -the user they're logged in as, useful for opaque credentials such as X509 certs -or bearer tokens. - -The command differs from `kubectl config` by: - -* Communicating with the API server to determine if the user is supplying valid auth events. -* Validating input and being opinionated about the input it asks for. -* Triggering authentication challenges for example: - * Basic auth: Actually try to communicate with the API server. - * OpenID Connect: Create an OAuth2 redirect. - -However `kubectl login` should still be seen as a supplement to, not a -replacement for, `kubectl config` by helping validate any kubeconfig generated -by the latter command. - -## Credential validation - -When clusters utilize authorization plugins access decisions are based on the -correct configuration of an auth-N plugin, an auth-Z plugin, and client side -credentials. 
Being rejected then begs several questions. Is the user's -kubeconfig misconfigured? Is the authorization plugin setup wrong? Is the user -authenticating as a different user than the one they assume? - -To help `kubectl login` diagnose misconfigured credentials, responses from the -API server to authenticated requests SHOULD include the `Authentication-Info` -header as defined in [RFC 7615](https://tools.ietf.org/html/rfc7615). The value -will hold name value pairs for `username` and `uid`. Since usernames and IDs -can be arbitrary strings, these values will be escaped using the `quoted-string` -format noted in the RFC. - -``` -HTTP/1.1 200 OK -Authentication-Info: username="janedoe@example.com", uid="123456" -``` - -If the user successfully authenticates this header will be set, regardless of -auth-Z decisions. For example a 401 Unauthorized (user didn't provide valid -credentials) would lack this header, while a 403 Forbidden response would -contain it. - -## Authentication discovery - -A long term goal of `kubectl login` is to facilitate a customized experience -for clusters configured with different auth providers. This will require some -way for the API server to indicate to `kubectl` how a user is expected to -login. - -Currently, this document doesn't propose a specific implementation for -discovery. While it'd be preferable to utilize an existing standard (such as the -`WWW-Authenticate` HTTP header), discovery may require a solution custom to the -API server, such as an additional discovery endpoint with a custom type. - -## Use in non-interactive session - -For the initial implementation, if `kubectl login` requires prompting and is -called from a non-interactive session (determined by if the session is using a -TTY) it errors out, recommending using `kubectl config` instead. In future -updates `kubectl login` may include options for non-interactive sessions so -auth strategies which require custom behavior not built into `kubectl config`, -such as the exchanges in Kerberos or OpenID Connect, can be triggered from -scripts. - -## Examples - -If kubeconfig isn't configured, `kubectl login` will attempt to fully configure -and validate the client's credentials. - -``` -$ kubectl login -Cluster URL []: https://172.17.4.99:443 -Cluster CA [(defaults to host certs)]: ${PWD}/ssl/ca.pem -Cluster Name ["cluster-1"]: - -The kubernetes server supports the following methods: - - 1. Bearer token - 2. Username and password - 3. Keystone - 4. OpenID Connect - 5. TLS client certificate - -Enter login method [1]: 4 - -Logging in using OpenID Connect. - -Issuer ["valuefromdiscovery"]: https://accounts.google.com -Issuer CA [(defaults to host certs)]: -Scopes ["profile email"]: -Client ID []: client@localhost:foobar -Client Secret []: ***** - -Open the following address in a browser. - - https://accounts.google.com/o/oauth2/v2/auth?redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scopes=openid%20email&access_type=offline&... - -Enter security code: **** - -Logged in as "janedoe@gmail.com" -``` - -Human readable names are provided by a combination of the auth providers -understood by `kubectl login` and the authenticator discovery. For instance, -Keystone uses basic auth credentials in the same way as a static user file, but -if the discovery indicates that the Keystone plugin is being used it should be -presented to the user differently. - -Users with configured credentials will simply auth against the API server and see -who they are. 
Running this command again simply validates the user's credentials. - -``` -$ kubectl login -Logged in as "janedoe@gmail.com" -``` - -Users who are halfway through the flow will start where they left off. For -instance if a user has configured the cluster field but on a user field, they will -be prompted for credentials. - -``` -$ kubectl login -No auth type configured. The kubernetes server supports the following methods: - - 1. Bearer token - 2. Username and password - 3. Keystone - 4. OpenID Connect - 5. TLS client certificate - -Enter login method [1]: 2 - -Logging in with basic auth. Enter the following fields. - -Username: janedoe -Password: **** - -Logged in as "janedoe@gmail.com" -``` - -Users who wish to switch servers can provide the `--switch-cluster` flag which -will prompt the user for new cluster details and switch the current context. It -behaves identically to `kubectl login` when a cluster is not set. - -``` -$ kubectl login --switch-cluster -# ... -``` - -Switching users goes through a similar flow attempting to prompt the user for -new credentials to the same server. - -``` -$ kubectl login --switch-user -# ... -``` - -## Work to do - -Phase 1: - -* Provide a simple dialog for configuring authentication. -* Kubectl can trigger authentication actions such as trigging OAuth2 redirects. -* Validation of user credentials thought the `Authentication-Info` endpoint. - -Phase 2: - -* Update proposal with auth provider discovery mechanism. -* Customize dialog using discovery data. - -Further improvements will require adding more authentication providers, and -adapting existing plugins to take advantage of challenge based authentication. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
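One plausible shape for the kubeconfig user entry that a successful OpenID Connect login like the dialog above could produce is sketched below. The field layout assumes the existing `oidc` client auth provider, and the issuer, client, and token values are placeholders; the proposal itself does not mandate this exact output.

```yaml
users:
- name: janedoe@gmail.com
  user:
    auth-provider:
      name: oidc
      config:
        idp-issuer-url: https://accounts.google.com
        client-id: "client@localhost:foobar"       # value entered in the dialog above
        client-secret: PLACEHOLDER_SECRET
        id-token: PLACEHOLDER_ID_TOKEN
        refresh-token: PLACEHOLDER_REFRESH_TOKEN
```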
\ No newline at end of file diff --git a/contributors/design-proposals/cli/kubectl_apply_getsetdiff_last_applied_config.md b/contributors/design-proposals/cli/kubectl_apply_getsetdiff_last_applied_config.md index af3c0ff8..f0fbec72 100644 --- a/contributors/design-proposals/cli/kubectl_apply_getsetdiff_last_applied_config.md +++ b/contributors/design-proposals/cli/kubectl_apply_getsetdiff_last_applied_config.md @@ -1,192 +1,6 @@ -# Kubectl apply subcommands for last-config +Design proposals have been archived. -## Abstract +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -`kubectl apply` uses the `last-applied-config` annotation to compute -the removal of fields from local object configuration files and then -send patches to delete those fields from the live object. Reading or -updating the `last-applied-config` is complex as it requires parsing -out and writing to the annotation. Here we propose a set of porcelain -commands for users to better understand what is going on in the system -and make updates. - -## Motivation - -What is going on behind the scenes with `kubectl apply` is opaque. Users -have to interact directly with annotations on the object to view -and make changes. In order to stop having `apply` manage a field on -an object, it must be manually removed from the annotation and then be removed -from the local object configuration. Users should be able to simply edit -the local object configuration and set it as the last-applied-config -to be used for the next diff base. Storing the last-applied-config -in an annotation adds black magic to `kubectl apply`, and it would -help users learn and understand if the value was exposed in a discoverable -manner. - -## Use Cases - -1. As a user, I want to be able to diff the last-applied-configuration - against the current local configuration to see which changes the command is seeing -2. As a user, I want to remove fields from being managed by the local - object configuration by removing them from the local object configuration - and setting the last-applied-configuration to match. -3. As a user, I want to be able to view the last-applied-configuration - on the live object that will be used to calculate the diff patch - to update the live object from the configuration file. - -## Naming and Format possibilities - -### Naming - -1. *cmd*-last-applied - -Rejected alternatives: - -2. ~~last-config~~ -3. ~~last-applied-config~~ -4. ~~last-configuration~~ -5. ~~last-applied-configuration~~ -6. ~~last~~ - -### Formats - -1. Apply subcommands - - `kubectl apply set-last-applied/view-last-applied/diff-last-applied - - a little bit odd to have 2 verbs in a row - - improves discoverability to have these as subcommands so they are tied to apply - -Rejected alternatives: - -2. ~~Set/View subcommands~~ - - `kubectl set/view/diff last-applied - - consistent with other set/view commands - - clutters discoverability of set/view commands since these are only for apply - - clutters discoverability for last-applied commands since they are for apply -3. ~~Apply flags~~ - - `kubectl apply [--set-last-applied | --view-last-applied | --diff-last-applied] - - Not a fan of these - -## view last-applied - -Porcelain command that retrieves the object and prints the annotation value as yaml or json. - -Prints an error message if the object is not managed by `apply`. - -1. 
Get the last-applied by type/name - -```sh -kubectl apply view-last-applied deployment/nginx -``` - -```yaml -apiVersion: extensions/v1beta1 -kind: Deployment -metadata: - name: nginx -spec: - replicas: 1 - template: - metadata: - labels: - run: nginx - spec: - containers: - - image: nginx - name: nginx -``` - -2. Get the last-applied by file, print as json - -```sh -kubectl apply view-last-applied -f deployment_nginx.yaml -o json -``` - -Same as above, but in json - -## diff last-applied - -Porcelain command that retrieves the object and displays a diff against -the local configuration - -1. Diff the last-applied - -```sh -kubectl apply diff-last-applied -f deployment_nginx.yaml -``` - -Opens up a 2-way diff in the default diff viewer. This should -follow the same semantics as `git diff`. It should accept either a -flag `--diff-viewer=meld` or check the environment variable -`KUBECTL_EXTERNAL_DIFF=meld`. If neither is specified, the `diff` -command should be used. - -This is meant to show the user what they changed in the configuration, -since it was last applied, but not show what has changed in the server. - -The supported output formats should be `yaml` and `json`, as specified -by the `-o` flag. - -A future goal is to provide a 3-way diff with `kubectl apply diff -f deployment_nginx.yaml`. -Together these tools would give the user the ability to see what is going -on and compare changes made to the configuration file vs other -changes made to the server independent of the configuration file. - -## set last-applied - -Porcelain command that sets the last-applied-config annotation to as -if the local configuration file had just been applied. - -1. Set the last-applied-config - -```sh -kubectl apply set-last-applied -f deployment_nginx.yaml -``` - -Sends a Patch request to set the last-applied-config as if -the configuration had just been applied. - -## edit last-applied - -1. Open the last-applied-config in an editor - -```sh -kubectl apply edit-last-applied -f deployment_nginx.yaml -``` - -Since the last-applied-configuration annotation exists only -on the live object, this command can alternatively take the -kind/name. - -```sh -kubectl apply edit-last-applied deployment/nginx -``` - -Sends a Patch request to set the last-applied-config to -the value saved in the editor. - -## Example workflow to stop managing a field with apply - using get/set - -As a user, I want to have the replicas on a Deployment managed by an autoscaler -instead of by the configuration. - -1. Check to make sure the live object is up-to-date - - `kubectl apply diff-last-applied -f deployment_nginx.yaml` - - Expect no changes -2. Update the deployment_nginx.yaml by removing the replicas field -3. Diff the last-applied-config to make sure the only change is the removal of the replicas field -4. Remove the replicas field from the last-applied-config so it doesn't get deleted next apply - - `kubectl apply set-last-applied -f deployment_nginx.yaml` -5. Verify the last-applied-config has been updated - - `kubectl apply view-last-applied -f deployment_nginx.yaml` - -## Example workflow to stop managing a field with apply - using edit - -1. Check to make sure the live object is up-to-date - - `kubectl apply diff-last-applied -f deployment_nginx.yaml` - - Expect no changes -2. Update the deployment_nginx.yaml by removing the replicas field -3. Edit the last-applied-config and remove the replicas field - - `kubectl apply edit-last-applied deployment/nginx` -4. 
Verify the last-applied-config has been updated - - `kubectl apply view-last-applied -f deployment_nginx.yaml` +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
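For context on what the subcommands in the archived proposal above read and write, the annotation itself lives on the live object roughly as shown below; the embedded JSON is abbreviated here for readability.

```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nginx
  annotations:
    # Value written by `kubectl apply`; the real content is the full applied object as JSON.
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"extensions/v1beta1","kind":"Deployment","metadata":{"name":"nginx"},"spec":{"replicas":1}}
spec:
  replicas: 1
```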
\ No newline at end of file diff --git a/contributors/design-proposals/cli/multi-fields-merge-key.md b/contributors/design-proposals/cli/multi-fields-merge-key.md index 857deb25..f0fbec72 100644 --- a/contributors/design-proposals/cli/multi-fields-merge-key.md +++ b/contributors/design-proposals/cli/multi-fields-merge-key.md @@ -1,126 +1,6 @@ -# Multi-fields Merge Key in Strategic Merge Patch +Design proposals have been archived. -## Abstract +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Support multi-fields merge key in Strategic Merge Patch. -## Background - -Strategic Merge Patch is covered in this [doc](/contributors/devel/sig-api-machinery/strategic-merge-patch.md). -In Strategic Merge Patch, we use Merge Key to identify the entries in the list of non-primitive types. -It must always be present and unique to perform the merge on the list of non-primitive types, -and will be preserved. - -The merge key exists in the struct tag (e.g. in [types.go](https://github.com/kubernetes/kubernetes/blob/5a9759b0b41d5e9bbd90d5a8f3a4e0a6c0b23b47/pkg/api/v1/types.go#L2831)) -and the [OpenAPI spec](https://git.k8s.io/kubernetes/api/openapi-spec/swagger.json). - -## Motivation - -The current implementation only support a single field as merge key. -For some element Kinds, the identity is actually defined using multiple fields. -[Service port](https://github.com/kubernetes/kubernetes/issues/39188) is an evidence indicating that -we need to support multi-fields Merge Key. - -## Scope - -This proposal only covers how we introduce ability to support multi-fields merge key for strategic merge patch. -It will cover how we support new APIs with multi-fields merge key. - -This proposal does NOT cover how we change the merge keys from one single field to multi-fields -for existing APIs without breaking backward compatibility, -e.g. we are not addressing the service port issue mentioned above. -That part will be addressed by [#476](https://github.com/kubernetes/community/pull/476). - -## Proposed Change - -### API Change - -If a merge key has multiple fields, it will be a string of merge key fields separated by ",", i.e. `patchMergeKey:"<key1>,<key2>,<key3>"`. - -If a merge key only has one field, it will be the same as before, i.e. `patchMergeKey:"<key1>"`. - -There are no patch format changes. -Patches for fields that have multiple fields in the merge key must include all of the fields of the merge key in the patch. - -If a new API uses multi-fields merge key, all the fields of the merge key are required to present. -Otherwise, the server will reject the patch. - -E.g. -foo and bar are the merge keys. - -Live list: -```yaml -list: -- foo: a - bar: x - other: 1 -- foo: a - bar: y - other: 2 -- foo: b - bar: x - other: 3 -``` - -Patch 1: -```yaml -list: -- foo: a # field 1 of merge key - bar: x # field 2 of merge key - other: 4 - another: val -``` - -Result after merging patch 1: -```yaml -list: -- foo: a - bar: x - other: 4 - another: val -- foo: a - bar: y - other: 2 -- foo: b - bar: x - other: 3 -``` - -Patch 2: -```yaml -list: -- $patch: delete - foo: a # field 1 of merge key - bar: x # field 2 of merge key -``` - -Result after merging patch 2: -```yaml -list: -- foo: a - bar: y - other: 2 -- foo: b - bar: x - other: 3 -``` - -### Strategic Merge Patch pkg - -We will add logic to support -- returning a list of fields instead of one single field when looking up merge key. 
-- merging lists respecting a list of fields as the merge key. - -### Open API - -Open API will not be affected, -since a multi-fields merge key is still one single string carried as an extension in the Open API spec. - -### Docs - -Document that the developer should make sure the merge key can uniquely identify an entry in all cases. - -## Version Skew and Backward Compatibility - -It is fully backward compatible, -because there are no patch format changes and no changes to the existing APIs. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
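As an illustration of how a multi-field merge key could be carried in the OpenAPI spec, a hedged sketch follows; the `Widget` type and its fields are invented for this example and mirror the `foo`/`bar` keys used above.

```yaml
definitions:
  example.v1.Widget:                                # hypothetical type
    properties:
      list:
        type: array
        items:
          $ref: '#/definitions/example.v1.WidgetEntry'
        x-kubernetes-patch-strategy: merge
        x-kubernetes-patch-merge-key: "foo,bar"     # multi-field merge key, comma separated
```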
\ No newline at end of file diff --git a/contributors/design-proposals/cli/preserve-order-in-strategic-merge-patch.md b/contributors/design-proposals/cli/preserve-order-in-strategic-merge-patch.md index 1d3c2484..f0fbec72 100644 --- a/contributors/design-proposals/cli/preserve-order-in-strategic-merge-patch.md +++ b/contributors/design-proposals/cli/preserve-order-in-strategic-merge-patch.md @@ -1,566 +1,6 @@ -# Preserve Order in Strategic Merge Patch +Design proposals have been archived. -Author: @mengqiy - -## Motivation - -Background of the Strategic Merge Patch is covered [here](/contributors/devel/sig-api-machinery/strategic-merge-patch.md). - -The Kubernetes API may apply semantic meaning to the ordering of items within a list, -however the strategic merge patch does not keep the ordering of elements. -Ordering has semantic meaning for Environment variables, -as later environment variables may reference earlier environment variables, -but not the other way around. - -One use case is the environment variables. We don't preserve the order which causes -issue [40373](https://github.com/kubernetes/kubernetes/issues/40373). - -## Proposed Change - -We will use the following notions through the doc. -Notion: -list to be merged: same as live list, which is the list current in the server. -parallel list: the list with `$setElementOrder` directive in the patch. -patch list: the list in the patch that contains the value changes. - -Changes are all in strategic merge patch package. -The proposed solution is similar to the solution used for deleting elements from lists of primitives. - -Add to the current patch, a directive ($setElementOrder) containing a list of element keys - -either the patch merge key, or for primitives the value. When applying the patch, -the server ensures that the relative ordering of elements matches the directive. - -The server will reject the patch if it doesn't satisfy the following 2 requirements. -- the relative order of any two items in the `$setElementOrder` list -matches that in the patch list if they present. -- the items in the patch list must be a subset or the same as the `$setElementOrder` list if the directive presents. - -The relative order of two items are determined by the following order: - -1. relative order in the $setElementOrder if both items are present -2. else relative order in the patch if both items are present -3. else relative order in the server-side list if both items are present -4. else append to the end - -If the relative order of the live config in the server is different from the order of the parallel list, -the user's patch will always override the order in the server. - -Here is a simple example of the patch format: - -Suppose we have a type called list. The patch will look like below. -The order from the parallel list ($setElementOrder/list) will be respected. - -```yaml -$setElementOrder/list: -- A -- B -- C -list: -- A -- C -``` - -All the items in the server's live list but not in the parallel list will come before the parallel list. -The relative order between these appended items are kept. - -The patched list will look like: - -``` -mergingList: -- serverOnlyItem1 \ - ... |===> items in the server's list but not in the parallel list -- serverOnlyItemM / -- parallelListItem1 \ - ... |===> items from the parallel list -- parallelListItemN / -``` - -### When $setElementOrder is not present and patching a list - -The new directive $setElementOrder is optional. 
-When the $setElementOrder is missing, -relative order in the patch list will be respected. - -Examples where A and C have been changed, B has been deleted and D has been added. - -Patch: - -```yaml -list: -- A' -- B' -- D -``` - -Live: - -```yaml -list: -- B -- C -- A -``` - -Result: - -```yaml -list: -- C # server-only item comes first -- A' -- B' -- D -``` - -### `$setElementOrder` may contain elements not present in the patch list - -The $setElementOrder value may contain elements that are not present in the patch -but present in the list to be merged to reorder the elements as part of the merge. - -Example where A & B have not changed: - -Patch: - -```yaml -$setElementOrder/list: -- A -- B -``` - -Live: - -```yaml -list: -- B -- A -``` - -Result: - -```yaml -list: -- A -- B -``` - -### When the list to be merged contains elements not found in `$setElementOrder` - -If the list to be merged contains elements not found in $setElementOrder, -they will come before all elements defined in $setElementOrder, but keep their relative ordering. - -Example where A & B have been changed: - -Patch: - -```yaml -$setElementOrder/list: -- A -- B -list: -- A -- B -``` - -Live: - -```yaml -list: -- C -- B -- D -- A -- E -``` - -Result: - -```yaml -list: -- C -- D -- E -- A -- B -``` - -### When `$setElementOrder` contains elements not found in the list to be merged - -If `$setElementOrder` contains elements not found in the list to be merged, -the elements that are not found will be ignored instead of failing the request. - -Patch: -```yaml -$setElementOrder/list: -- C -- A -- B -list: -- A -- B -``` - -Live: -```yaml -list: -- A -- B -``` - -Result: - -```yaml -list: -- A -- B -``` - -## Version Skew and Backwards Compatibility - -The new version patch is always a superset of the old version patch. -The new patch has one additional parallel list which will be dropped by the old server. - -As mentioned [above](#when-setelementorder-is-not-present-and-patching-a-list), -the new directive is optional. -Patch requests without the directive will change a little, -but still be fully backward compatible. - -### kubectl -If an old kubectl sends a old patch to a new server, -the server will honor the order in the list as mentioned above. -The behavior is a little different from before but is not a breaking change. - -If a new kubectl sends a new patch to an old server, the server doesn't recognise the parallel list and will drop it. -So it will behave the same as before. - -## Example - -### List of Maps - -We take environment variables as an example. -Environment variables is a list of maps with merge patch strategy. - -Suppose we define a list of environment variables and we call them -the original environment variables: - -```yaml -env: -- name: ENV1 - value: foo -- name: ENV2 - value: bar -- name: ENV3 - value: baz -``` - -Then the server appends two environment variables and reorder the list: - -```yaml -env: -- name: ENV2 - value: bar -- name: ENV5 - value: server-added-2 -- name: ENV1 - value: foo -- name: ENV3 - value: baz -- name: ENV4 - value: server-added-1 -``` - -Then the user wants to change it from the original to the following using `kubectl apply`: - -```yaml -env: -- name: ENV1 - value: foo -- name: ENV2 - value: bar -- name: ENV6 - value: new-env -``` - -The old patch without parallel list will looks like: - -```yaml -env: -- name: ENV3 - $patch: delete -- name: ENV6 - value: new-env -``` - -The new patch will looks like below. 
It is the - -```yaml -$setElementOrder/env: -- name: ENV1 -- name: ENV2 -- name: ENV6 -env: -- name: ENV3 - $patch: delete -- name: ENV6 - value: new-env -``` - -After server applying the new patch: - -```yaml -env: -- name: ENV5 - value: server-added-2 -- name: ENV4 - value: server-added-1 -- name: ENV1 - value: foo -- name: ENV2 - value: bar -- name: ENV6 - value: new-env -``` - -### List of Primitives - -We take finalizers as an example. -finalizers is a list of strings. - -Suppose we define a list of finalizers and we call them -the original finalizers: - -```yaml -finalizers: -- a -- b -- c -``` - -Then the server appends two finalizers and reorder the list: - -```yaml -finalizers: -- b -- e -- a -- c -- d -``` - -Then the user wants to change it from the original to the following using `kubectl apply`: - -```yaml -finalizers: -- a -- b -- f -``` - -The old patch without parallel list will looks like: - -```yaml -$deleteFromPrimitiveList/finalizers: -- c -finalizers: -- f -``` - -The new patch will looks like below. It is the - -```yaml -$setElementOrder/finalizers: -- a -- b -- f -$deleteFromPrimitiveList/finalizers: -- c -finalizers: -- f -``` - -After server applying the patch: - -```yaml -finalizers: -- e -- d -- a -- b -- f -``` - - -# Alternative Considered - -# 1. Use the patch list to set order - -## Proposed Change - -This approach can considered as merging the parallel list and patch list into one single list. - -For list of maps, the patch list will have all entries that are -either a map that contains the mergeKey and other changes -or a map that contains the mergeKey only. - -For list of primitives, the patch list will be the same as the list in users' local config. - -## Reason of Rejection - -It cannot work correctly in the following concurrent writers case, -because PATCH in k8s doesn't use optimistic locking, so the following may happen. - -Live config is: - -```yaml -list: -- mergeKey: a - other: A -- mergeKey: b - other: B -- mergeKey: c - other: C -``` - -Writer foo first GET the object from the server. -It wants to delete B, so it calculate the patch and is about to send it to the server: - -```yaml -list: -- mergeKey: a -- mergeKey: b - $patch: delete -- mergeKey: c -``` - -Before foo sending the patch to the server, -writer bar GET the object and it want to update A. - -Patch from bar is: - -```yaml -list: -- mergeKey: a - other: A' -- mergeKey: b -- mergeKey: c -``` - -After the server first applying foo's patch and then bar's patch, -the final result will be wrong. -Because entry b has been recreated which is not desired. - -```yaml -list: -- mergeKey: a - other: A -- mergeKey: b -- mergeKey: c - other: C -``` - -# 2. Use $position Directive - -## Proposed Change - -Use an approach similar to [MongoDB](https://docs.mongodb.com/manual/reference/operator/update/position/). -When patching a list of maps with merge patch strategy, -use a new directive `$position` in each map in the list. - -If the order in the user's config is different from the order of the live config, -we will insert the `$position` directive in each map in the list. -We guarantee that the order of the user's list will always override the order of live list. - -All the items in the server's live list but not in the patch list will be append to the end of the patch list. -The relative order between these appended items are kept. -If the relative order of live config in the server is different from the order in the patch, -user's patch will always override the order in the server. 
- -When patching a list of primitives with merge patch strategy, -we send a whole list from user's config. - -## Version Skew - -It is NOT backward compatible in terms of list of primitives. - -When patching a list of maps: -- An old client sends an old patch to a new server, the server just merges the change and no reordering. -The server behaves the same as before. -- A new client sends a new patch to an old server, the server doesn't understand the new directive. -So it just simply does the merge. - -When patching a list of primitives: -- An old client sends an old patch to a new server, the server will reorder the patch list which is sublist of user's. -The server has the WRONG behavior. -- A new client sends a new patch to an old server, the server will deduplicate after merging. -The server behaves the same as before. - -## Example - -For patching list of maps: - -Suppose we define a list of environment variables and we call them -the original environment variables: -```yaml -env: - - name: ENV1 - value: foo - - name: ENV2 - value: bar - - name: ENV3 - value: baz -``` - -Then the server appends two environment variables and reorder the list: -```yaml -env: - - name: ENV2 - value: bar - - name: ENV5 - value: server-added-2 - - name: ENV1 - value: foo - - name: ENV3 - value: baz - - name: ENV4 - value: server-added-1 -``` - -Then the user wants to change it from the original to the following using `kubectl apply`: -```yaml -env: - - name: ENV1 - value: foo - - name: ENV2 - value: bar - - name: ENV6 - value: new-env -``` - -The patch will looks like: -```yaml -env: - - name: ENV1 - $position: 0 - - name: ENV2 - $position: 1 - - name: ENV6 - value: new-env - $position: 2 - - name: ENV3 - $patch: delete -``` - -After server applying the patch: -```yaml -env: - - name: ENV1 - value: foo - - name: ENV2 - value: bar - - name: ENV6 - value: new-env - - name: ENV5 - value: server-added-2 - - name: ENV4 - value: server-added-1 -``` +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/cli/simple-rolling-update.md b/contributors/design-proposals/cli/simple-rolling-update.md index 32d75820..f0fbec72 100644 --- a/contributors/design-proposals/cli/simple-rolling-update.md +++ b/contributors/design-proposals/cli/simple-rolling-update.md @@ -1,126 +1,6 @@ -## Simple rolling update +Design proposals have been archived. -This is a lightweight design document for simple -[rolling update](https://kubernetes.io/docs/user-guide/kubectl/kubectl_rolling-update.md#rolling-update) in `kubectl`. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Complete execution flow can be found [here](#execution-details). See the -[example of rolling update](https://kubernetes.io/docs/tutorials/kubernetes-basics/update-intro/) for more information. -### Lightweight rollout - -Assume that we have a current replication controller named `foo` and it is -running image `image:v1` - -`kubectl rolling-update foo [foo-v2] --image=myimage:v2` - -If the user doesn't specify a name for the 'next' replication controller, then -the 'next' replication controller is renamed to -the name of the original replication controller. - -Obviously there is a race here, where if you kill the client between delete foo, -and creating the new version of 'foo' you might be surprised about what is -there, but I think that's ok. See [Recovery](#recovery) below - -If the user does specify a name for the 'next' replication controller, then the -'next' replication controller is retained with its existing name, and the old -'foo' replication controller is deleted. For the purposes of the rollout, we add -a unique-ifying label `kubernetes.io/deployment` to both the `foo` and -`foo-next` replication controllers. The value of that label is the hash of the -complete JSON representation of the`foo-next` or`foo` replication controller. -The name of this label can be overridden by the user with the -`--deployment-label-key` flag. - -#### Recovery - -If a rollout fails or is terminated in the middle, it is important that the user -be able to resume the roll out. To facilitate recovery in the case of a crash of -the updating process itself, we add the following annotations to each -replication controller in the `kubernetes.io/` annotation namespace: - * `desired-replicas` The desired number of replicas for this replication -controller (either N or zero) - * `update-partner` A pointer to the replication controller resource that is -the other half of this update (syntax `<name>` the namespace is assumed to be -identical to the namespace of this replication controller.) - -Recovery is achieved by issuing the same command again: - -```sh -kubectl rolling-update foo [foo-v2] --image=myimage:v2 -``` - -Whenever the rolling update command executes, the kubectl client looks for -replication controllers called `foo` and `foo-next`, if they exist, an attempt -is made to roll `foo` to `foo-next`. If `foo-next` does not exist, then it is -created, and the rollout is a new rollout. If `foo` doesn't exist, then it is -assumed that the rollout is nearly completed, and `foo-next` is renamed to -`foo`. Details of the execution flow are given below. - - -### Aborting a rollout - -Abort is assumed to want to reverse a rollout in progress. 
- -`kubectl rolling-update foo [foo-v2] --rollback` - -This is really just semantic sugar for: - -`kubectl rolling-update foo-v2 foo` - -With the added detail that it moves the `desired-replicas` annotation from -`foo-v2` to `foo` - - -### Execution Details - -For the purposes of this example, assume that we are rolling from `foo` to -`foo-next` where the only change is an image update from `v1` to `v2` - -If the user doesn't specify a `foo-next` name, then it is either discovered from -the `update-partner` annotation on `foo`. If that annotation doesn't exist, -then `foo-next` is synthesized using the pattern -`<controller-name>-<hash-of-next-controller-JSON>` - -#### Initialization - - * If `foo` and `foo-next` do not exist: - * Exit, and indicate an error to the user, that the specified controller -doesn't exist. - * If `foo` exists, but `foo-next` does not: - * Create `foo-next` populate it with the `v2` image, set -`desired-replicas` to `foo.Spec.Replicas` - * Goto Rollout - * If `foo-next` exists, but `foo` does not: - * Assume that we are in the rename phase. - * Goto Rename - * If both `foo` and `foo-next` exist: - * Assume that we are in a partial rollout - * If `foo-next` is missing the `desired-replicas` annotation - * Populate the `desired-replicas` annotation to `foo-next` using the -current size of `foo` - * Goto Rollout - -#### Rollout - - * While size of `foo-next` < `desired-replicas` annotation on `foo-next` - * increase size of `foo-next` - * if size of `foo` > 0 - decrease size of `foo` - * Goto Rename - -#### Rename - - * delete `foo` - * create `foo` that is identical to `foo-next` - * delete `foo-next` - -#### Abort - - * If `foo-next` doesn't exist - * Exit and indicate to the user that they may want to simply do a new -rollout with the old version - * If `foo` doesn't exist - * Exit and indicate not found to the user - * Otherwise, `foo-next` and `foo` both exist - * Set `desired-replicas` annotation on `foo` to match the annotation on -`foo-next` - * Goto Rollout with `foo` and `foo-next` trading places. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
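-
-As an illustration of the recovery bookkeeping described above (names, hash value and the exact
-annotation keys are a sketch, not fixed by this document), a replication controller in
-mid-rollout might carry metadata such as:
-
-```yaml
-metadata:
-  name: foo-v2
-  labels:
-    kubernetes.io/deployment: "3f9c2a7b"       # hash of this controller's JSON
-  annotations:
-    kubernetes.io/desired-replicas: "5"        # target size for this rollout
-    kubernetes.io/update-partner: "foo"        # the other half of this rollout
-```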
\ No newline at end of file diff --git a/contributors/design-proposals/cloud-provider/cloud-provider-refactoring.md b/contributors/design-proposals/cloud-provider/cloud-provider-refactoring.md index 99c03478..f0fbec72 100644 --- a/contributors/design-proposals/cloud-provider/cloud-provider-refactoring.md +++ b/contributors/design-proposals/cloud-provider/cloud-provider-refactoring.md @@ -1,163 +1,6 @@ -## Refactor Cloud Provider out of Kubernetes Core +Design proposals have been archived. -As kubernetes has evolved tremendously, it has become difficult for different cloudproviders (currently 7) to make changes and iterate quickly. Moreover, the cloudproviders are constrained by the kubernetes build/release life-cycle. This proposal aims to move towards a kubernetes code base where cloud providers specific code will move out of the core repository and into "official" repositories, where it will be maintained by the cloud providers themselves. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -### 1. Current use of Cloud Provider -The following components have cloudprovider dependencies - - 1. kube-controller-manager - 2. kubelet - 3. kube-apiserver - -#### Cloud Provider in Kube-Controller-Manager - -The kube-controller-manager has many controller loops - - - nodeController - - volumeController - - routeController - - serviceController - - replicationController - - endpointController - - resourceQuotaController - - namespaceController - - deploymentController - - etc.. - -Among these controller loops, the following are cloud provider dependent. - - - nodeController - - volumeController - - routeController - - serviceController - -The nodeController uses the cloudprovider to check if a node has been deleted from the cloud. If cloud provider reports a node as deleted, then this controller immediately deletes the node from kubernetes. This check removes the need to wait for a specific amount of time to conclude that an inactive node is actually dead. - -The volumeController uses the cloudprovider to create, delete, attach and detach volumes to nodes. For instance, the logic for provisioning, attaching, and detaching a EBS volume resides in the AWS cloudprovider. The volumeController uses this code to perform its operations. - -The routeController configures routes for hosts in the cloud provider. - -The serviceController maintains a list of currently active nodes, and is responsible for creating and deleting LoadBalancers in the underlying cloud. - -#### Cloud Provider in Kubelet - -Moving on to the kubelet, the following cloud provider dependencies exist in kubelet. - - - Find the cloud nodename of the host that kubelet is running on for the following reasons : - 1. To obtain the config map for the kubelet, if one already exists - 2. To uniquely identify current node using nodeInformer - 3. To instantiate a reference to the current node object - - Find the InstanceID, ProviderID, ExternalID, Zone Info of the node object while initializing it - - Periodically poll the cloud provider to figure out if the node has any new IP addresses associated with it - - It sets a condition that makes the node unschedulable until cloud routes are configured. 
- - It allows the cloud provider to post process DNS settings - -#### Cloud Provider in Kube-apiserver - -Finally, in the kube-apiserver, the cloud provider is used for transferring SSH keys to all of the nodes, and within an admission controller for setting labels on persistent volumes. - -### 2. Strategy for refactoring Kube-Controller-Manager - -In order to create a 100% cloud independent controller manager, the controller-manager will be split into multiple binaries. - -1. Cloud dependent controller-manager binaries -2. Cloud independent controller-manager binaries - This is the existing `kube-controller-manager` that is being shipped with kubernetes releases. - -The cloud dependent binaries will run those loops that rely on cloudprovider as a kubernetes system service. The rest of the controllers will be run in the cloud independent controller manager. - -The decision to run entire controller loops, rather than only the very minute parts that rely on cloud provider was made because it makes the implementation simple. Otherwise, the shared datastructures and utility functions have to be disentangled, and carefully separated to avoid any concurrency issues. This approach among other things, prevents code duplication and improves development velocity. - -Note that the controller loop implementation will continue to reside in the core repository. It takes in cloudprovider.Interface as an input in its constructor. Vendor maintained cloud-controller-manager binary could link these controllers in, as it serves as a reference form of the controller implementation. - -There are four controllers that rely on cloud provider specific code. These are node controller, service controller, route controller and attach detach controller. Copies of each of these controllers have been bundled them together into one binary. The cloud dependent binary registers itself as a controller, and runs the cloud specific controller loops with the user-agent named "external-controller-manager". - -RouteController and serviceController are entirely cloud specific. Therefore, it is really simple to move these two controller loops out of the cloud-independent binary and into the cloud dependent binary. - -NodeController does a lot more than just talk to the cloud. It does the following operations - - -1. CIDR management -2. Monitor Node Status -3. Node Pod Eviction - -While Monitoring Node status, if the status reported by kubelet is either 'ConditionUnknown' or 'ConditionFalse', then the controller checks if the node has been deleted from the cloud provider. If it has already been deleted from the cloud provider, then it deletes the nodeobject without waiting for the `monitorGracePeriod` amount of time. This is the only operation that needs to be moved into the cloud dependent controller manager. - -Finally, The attachDetachController is tricky, and it is not simple to disentangle it from the controller-manager easily, therefore, this will be addressed with Flex Volumes (Discussed under a separate section below) - -### 3. Strategy for refactoring Kubelet - -The majority of the calls by the kubelet to the cloud is done during the initialization of the Node Object. The other uses are for configuring Routes (in case of GCE), scrubbing DNS, and periodically polling for IP addresses. - -All of the above steps, except the Node initialization step can be moved into a controller. Specifically, IP address polling, and configuration of Routes can be moved into the cloud dependent controller manager. 
- -Scrubbing DNS, after discussing with @thockin, was found to be redundant. So, it can be disregarded. It is being removed. - -Finally, Node initialization needs to be addressed. This is the trickiest part. Pods will be scheduled even on uninitialized nodes. This can lead to scheduling pods on incompatible zones, and other weird errors. Therefore, an approach is needed where kubelet can create a Node, but mark it as "NotReady". Then, some asynchronous process can update it and mark it as ready. This is now possible because of the concept of Taints. - -This approach requires kubelet to be started with known taints. This will make the node unschedulable until these taints are removed. The external cloud controller manager will asynchronously update the node objects and remove the taints. - -### 4. Strategy for refactoring Kube-ApiServer - -Kube-apiserver uses the cloud provider for two purposes - -1. Distribute SSH Keys - This can be moved to the cloud dependent controller manager -2. Admission Controller for PV - This can be refactored using the taints approach used in Kubelet - -### 5. Strategy for refactoring Volumes - -Volumes need cloud providers, but they only need SPECIFIC cloud providers. The majority of volume management logic resides in the controller manager. These controller loops need to be moved into the cloud-controller manager. The cloud controller manager also needs a mechanism to read parameters for initialization from cloud config. This can be done via config maps. - -There is an entirely different approach to refactoring volumes - Flex Volumes. There is an undergoing effort to move all of the volume logic from the controller-manager into plugins called Flex Volumes. In the Flex volumes world, all of the vendor specific code will be packaged in a separate binary as a plugin. After discussing with @thockin, this was decidedly the best approach to remove all cloud provider dependency for volumes out of kubernetes core. - -### 6. Deployment, Upgrades and Downgrades - -This change will introduce new binaries to the list of binaries required to run kubernetes. The change will be designed such that these binaries can be installed via `kubectl apply -f` and the appropriate instances of the binaries will be running. - -##### 6.1 Upgrading kubelet and proxy - -The kubelet and proxy runs on every node in the kubernetes cluster. Based on your setup (systemd/other), you can follow the normal upgrade steps for it. This change does not affect the kubelet and proxy upgrade steps for your setup. - -##### 6.2 Upgrading plugins - -Plugins such as cni, flex volumes can be upgraded just as you normally upgrade them. This change does not affect the plugin upgrade steps for your setup. - -###### 6.3 Upgrading kubernetes core services - -The master node components (kube-controller-manager,kube-scheduler, kube-apiserver etc.) can be upgraded just as you normally upgrade them. This change does not affect the plugin upgrade steps for your setup. - -##### 6.4 Applying the cloud-controller-manager - -This is the only step that is different in the upgrade process. In order to complete the upgrade process, you need to apply the cloud-controller-manager deployment to the setup. A deployment descriptor file will be provided with this change. You need to apply this change using - -``` -kubectl apply -f cloud-controller-manager.yml -``` - -This will start the cloud specific controller manager in your kubernetes setup. 
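-
-Purely as a sketch of what `cloud-controller-manager.yml` could contain (image, flags and
-apiVersion are placeholders, not the actual descriptor that will ship with the change):
-
-```yaml
-apiVersion: apps/v1
-kind: Deployment
-metadata:
-  name: cloud-controller-manager
-  namespace: kube-system
-spec:
-  replicas: 1
-  selector:
-    matchLabels:
-      app: cloud-controller-manager
-  template:
-    metadata:
-      labels:
-        app: cloud-controller-manager
-    spec:
-      containers:
-      - name: cloud-controller-manager
-        image: example.com/cloud-controller-manager:v1.6    # placeholder image
-        command:
-        - cloud-controller-manager
-        - --cloud-provider=gce                               # cloud of this cluster
-        - --kubeconfig=/etc/kubernetes/controller-manager.conf
-```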
- -The downgrade steps are also the same as before for all the components except the cloud-controller-manager. In case of the cloud-controller-manager, the deployment should be deleted using - -``` -kubectl delete -f cloud-controller-manager.yml -``` - -### 7. Roadmap - -##### 7.1 Transition plan - -Release 1.6: Add the first implementation of the cloud-controller-manager binary. This binary's purpose is to let users run two controller managers and address any issues that they uncover, that we might have missed. It also doubles as a reference implementation to the external cloud controller manager for the future. Since the cloud-controller-manager runs cloud specific controller loops, it is important to ensure that the kube-controller-manager does not run these loops as well. This is done by leaving the `--cloud-provider` flag unset in the kube-controller-manager. At this stage, the cloud-controller-manager will still be in "beta" stage and optional. - -Release 1.7: In this release, all of the supported turnups will be converted to use cloud controller by default. At this point users will still be allowed to opt-out. Users will be expected run the monolithic cloud controller binary. The cloud controller manager will still continue to use the existing library, but code will be factored out to reduce literal duplication between the controller-manager and the cloud-controller-manager. A deprecation announcement will be made to inform users to switch to the cloud-controller-manager. - -Release 1.8: The main change aimed for this release is to break up the various cloud providers into individual binaries. Users will still be allowed to opt-out. There will be a second warning to inform users about the deprecation of the `--cloud-provider` option in the controller-manager. - -Release 1.9: All of the legacy cloud providers will be completely removed in this version - -##### 7.2 Code/Library Evolution - -* Break controller-manager into 2 binaries. One binary will be the existing controller-manager, and the other will only run the cloud specific loops with no other changes. The new cloud-controller-manager will still load all the cloudprovider libraries, and therefore will allow the users to choose which cloud-provider to use. -* Move the cloud specific parts of kubelet out using the external admission controller pattern mentioned in the previous sections above. -* The cloud controller will then be made into a library. It will take the cloudprovider.Interface as an argument to its constructor. Individual cloudprovider binaries will be created using this library. -* Cloud specific operations will be moved out of kube-apiserver using the external admission controller pattern mentioned above. -* All cloud specific volume controller loops (attach, detach, provision operation controllers) will be switched to using flex volumes. Flex volumes do not need in-tree cloud specific calls. -* As the final step, all of the cloud provider specific code will be moved out of tree. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/cloud-provider/cloudprovider-storage-metrics.md b/contributors/design-proposals/cloud-provider/cloudprovider-storage-metrics.md index 838c7e43..f0fbec72 100644 --- a/contributors/design-proposals/cloud-provider/cloudprovider-storage-metrics.md +++ b/contributors/design-proposals/cloud-provider/cloudprovider-storage-metrics.md @@ -1,136 +1,6 @@ -# Cloud Provider (specifically GCE and AWS) metrics for Storage API calls +Design proposals have been archived. -## Goal +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Kubernetes should provide metrics such as - count & latency percentiles -for cloud provider API it uses to provision persistent volumes. -In a ideal world - we would want these metrics for all cloud providers -and for all API calls kubernetes makes but to limit the scope of this feature -we will implement metrics for: - -* GCE -* AWS - -We will also implement metrics only for storage API calls for now. This feature -does introduces hooks into kubernetes code which can be used to add additional metrics -but we only focus on storage API calls here. - -## Motivation - -* Cluster admins should be able to monitor Cloud API usage of Kubernetes. It will help - them detect problems in certain scenarios which can blow up the API quota of Cloud - provider. -* Cluster admins should also be able to monitor health and latency of Cloud API on - which kubernetes depends on. - -## Implementation - -### Metric format and collection - -Metrics emitted from cloud provider will fall under category of service metrics -as defined in [Kubernetes Monitoring Architecture](/contributors/design-proposals/instrumentation/monitoring_architecture.md). - - -The metrics will be emitted using [Prometheus format](https://prometheus.io/docs/instrumenting/exposition_formats/) and available for collection -from `/metrics` HTTP endpoint of kubelet, controller etc. All Kubernetes core components already emit -metrics on `/metrics` HTTP endpoint. This proposal merely extends available metrics to include Cloud provider metrics as well. - - -Any collector which can parse Prometheus metric format should be able to collect -metrics from these endpoints. - -A more detailed description of monitoring pipeline can be found in [Monitoring architecture] (/contributors/design-proposals/instrumentation/monitoring_architecture.md#monitoring-pipeline) document. - - -#### Metric Types - -Since we are interested in count(or rate) and latency percentile metrics of API calls Kubernetes is making to -the external Cloud Provider - we will use [Histogram](https://prometheus.io/docs/practices/histograms/) type for -emitting these metrics. - -We will be using `HistogramVec` type so as we can attach dimensions at runtime. All metrics will contain API action -being taken as a dimension. The cloudprovider maintainer may choose to add additional dimensions as needed. If a -dimension is not available at point of emission sentinel value `<n/a>` should be emitted as a placeholder. - -We are also interested in counter of cloudprovider API errors. `NewCounterVec` type will be used for keeping -track of API errors. - -### GCE Implementation - -To begin with we will start emitting following metrics for GCE. Because these metrics are of type -`Histogram` - both count and latency will be automatically calculated. 
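-
-For illustration (sample values only), a scrape of the `/metrics` endpoint would then expose
-per-request bucket, sum and count series, from which both request rates and latency
-percentiles can be derived:
-
-```
-cloudprovider_gce_api_request_duration_seconds_bucket { request = "disk_insert", le = "0.5"} 12
-cloudprovider_gce_api_request_duration_seconds_bucket { request = "disk_insert", le = "+Inf"} 15
-cloudprovider_gce_api_request_duration_seconds_sum { request = "disk_insert"} 8.7
-cloudprovider_gce_api_request_duration_seconds_count { request = "disk_insert"} 15
-cloudprovider_gce_api_request_errors { request = "disk_insert"} 1
-```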
- -#### GCE Latency metrics - -All gce latency metrics will be named - `cloudprovider_gce_api_request_duration_seconds`. api request -being made will be reported as dimensions. - - -To begin we will start emitting following metrics: - -``` -cloudprovider_gce_api_request_duration_seconds { request = "instance_list"} -cloudprovider_gce_api_request_duration_seconds { request = "disk_insert"} -cloudprovider_gce_api_request_duration_seconds { request = "disk_delete"} -cloudprovider_gce_api_request_duration_seconds { request = "attach_disk"} -cloudprovider_gce_api_request_duration_seconds { request = "detach_disk"} -cloudprovider_gce_api_request_duration_seconds { request = "list_disk"} -``` - -#### GCE API error metrics. - -All gce error metrics will be named `cloudprovider_gce_api_request_errors`. api request being made will be -reported as a dimension. - -To begin with we expect to report following error metrics: - -``` -cloudprovider_gce_api_request_errors { request = "instance_list"} -cloudprovider_gce_api_request_errors { request = "disk_insert"} -cloudprovider_gce_api_request_errors { request = "disk_delete"} -cloudprovider_gce_api_request_errors { request = "attach_disk"} -cloudprovider_gce_api_request_errors { request = "detach_disk"} -cloudprovider_gce_api_request_errors { request = "list_disk"} -``` - - -### AWS Implementation - -For AWS currently we will use wrapper type `awsSdkEC2` to intercept all storage API calls and -emit metric datapoints. The reason we are not using approach used for `aws/log_handler` is - because AWS SDK doesn't uses Contexts and hence we can't pass custom information such as API call name or namespace to record with metrics. - - -#### AWS Latency metrics - -All aws API metrics will be named - `cloudprovider_aws_api_request_duration_seconds`. `request` will be reported as dimensions. -AWS maintainer may choose to add additional dimensions as needed. - -To begin with we will start emitting following metrics for AWS: - -``` -cloudprovider_aws_api_request_duration_seconds { request = "attach_volume"} -cloudprovider_aws_api_request_duration_seconds { request = "detach_volume"} -cloudprovider_aws_api_request_duration_seconds { request = "create_tags"} -cloudprovider_aws_api_request_duration_seconds { request = "create_volume"} -cloudprovider_aws_api_request_duration_seconds { request = "delete_volume"} -cloudprovider_aws_api_request_duration_seconds { request = "describe_instance"} -cloudprovider_aws_api_request_duration_seconds { request = "describe_volume"} -``` - -#### AWS Error metrics - -All aws error metrics will be named `cloudprovider_aws_api_request_errors`. api request being made will be -reported as a dimension. - -To begin with we expect to report following error metrics: - -``` -cloudprovider_aws_api_request_errors { request = "attach_volume"} -cloudprovider_aws_api_request_errors { request = "detach_volume"} -cloudprovider_aws_api_request_errors { request = "create_tags"} -cloudprovider_aws_api_request_errors { request = "create_volume"} -cloudprovider_aws_api_request_errors { request = "delete_volume"} -cloudprovider_aws_api_request_errors { request = "describe_instance"} -cloudprovider_aws_api_request_errors { request = "describe_volume"} -``` +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/cluster-lifecycle/OWNERS b/contributors/design-proposals/cluster-lifecycle/OWNERS deleted file mode 100644 index 71322d9e..00000000 --- a/contributors/design-proposals/cluster-lifecycle/OWNERS +++ /dev/null @@ -1,8 +0,0 @@ -# See the OWNERS docs at https://go.k8s.io/owners - -reviewers: - - sig-cluster-lifecycle-leads -approvers: - - sig-cluster-lifecycle-leads -labels: - - sig/cluster-lifecycle diff --git a/contributors/design-proposals/cluster-lifecycle/bootstrap-discovery.md b/contributors/design-proposals/cluster-lifecycle/bootstrap-discovery.md index f481e02d..f0fbec72 100644 --- a/contributors/design-proposals/cluster-lifecycle/bootstrap-discovery.md +++ b/contributors/design-proposals/cluster-lifecycle/bootstrap-discovery.md @@ -1,244 +1,6 @@ -# Super Simple Discovery API +Design proposals have been archived. -## Overview +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -It is surprisingly hard to figure out how to talk to a Kubernetes cluster. Not only do clients need to know where to look on the network, they also need to identify the set of root certificates to trust when talking to that endpoint. -This presents a set of problems: -* It should be super easy for users to configure client systems with a minimum of effort `kubectl` or `kubeadm init` (or other client systems). - * Establishing this should be doable even in the face of nodes and master components booting out of order. - * We should have mechanisms that don't require users to ever have to manually manage certificate files. -* Over the life of the cluster this information could change and client systems should be able to adapt. - -While this design is mainly being created to help `kubeadm` possible, these problems aren't isolated there and can be used outside of the kubeadm context. - -Mature organizations should be able to distribute and manage root certificates out of band of Kubernetes installations. In that case, clients will defer to corporation wide system installed root certificates or root certificates distributed through other means. However, for smaller and more casual users distributing or obtaining certificates represents a challenge. - -Similarly, mature organizations will be able to rely on a centrally managed DNS system to distribute the location of a set of API servers and keep those names up to date over time. Those DNS servers will be managed for high availability. - -With that in mind, the proposals here will devolve into simply using DNS names that are validated with system installed root certificates. - -## Cluster location information (aka ClusterInfo) - -First we define a set of information that identifies a cluster and how to talk to it. We will call this ClusterInfo in this document. - -While we could define a new format for communicating the set of information needed here, we'll start by using the standard [`kubeconfig`](http://kubernetes.io/docs/user-guide/kubeconfig-file/) file format. - -It is expected that the `kubeconfig` file will have a single unnamed `Cluster` entry. Other information (especially authentication secrets) MUST be omitted. - -### Evolving kubeconfig - -In the future we look forward to enhancing `kubeconfig` to address some issues. These are out of scope for this design. Some of this is covered in [#30395](https://github.com/kubernetes/kubernetes/issues/30395). 
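-
-For reference, the ClusterInfo payload described above is nothing more than a standard
-kubeconfig carrying a single unnamed cluster entry and no user credentials; a minimal
-sketch (server address and CA data are placeholders):
-
-```yaml
-apiVersion: v1
-kind: Config
-clusters:
-- name: ""
-  cluster:
-    server: https://10.0.0.1:6443
-    certificate-authority-data: "<base64 CA bundle>"
-contexts: []
-users: []
-```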
- -Additions include: - -* A cluster serial number/identifier. - * In an HA world, API servers may come and go and it is necessary to make sure we are talking to the same cluster as we thought we were talking to. -* A _set_ of addresses for finding the cluster. - * It is implied that all of these are equivalent and that a client can try multiple until an appropriate target is found. - * Initially I'm proposing a flat set here. In the future we can introduce more structure that hints to the user which addresses to try first. -* Better documentation and exposure of: - * The root certificates can be a bundle to enable rotation. - * If no root certificates are given (and the insecure bit isn't set) then the client trusts the system managed list of CAs. - -### Client caching and update - -**This is to be implemented in a later phase** - -Any client of the cluster will want to have this information. As the configuration of the cluster changes we need the client to keep this information up to date. The ClusterInfo ConfigMap (defined below) is expected to be a common place to get the latest ClusterInfo for any cluster. Clients should periodically grab this and cache it. It is assumed that the information here won't drift so fast that clients won't be able to find *some* way to connect. - -In exceptional circumstances it is possible that this information may be out of date and a client would be unable to connect to a cluster. Consider the case where a user has kubectl set up and working well and then doesn't run kubectl for quite a while. It is possible that over this time (a) the set of servers will have migrated so that all endpoints are now invalid or (b) the root certificates will have rotated so that the user can no longer trust any endpoint. - -## Methods - -Now that we know *what* we want to get to the client, the question is how. We want to do this in as secure a way possible (as there are cryptographic keys involved) without requiring a lot of overhead in terms of information that needs to be copied around. - -### Method: Out of Band - -The simplest way to obtain ClusterInfo this would be to simply put this object in a file and copy it around. This is more overhead for the user, but it is easy to implement and lets users rely on existing systems to distribute configuration. - -For the `kubeadm` flow, the command line might look like: - -``` -kubeadm join --discovery-file=my-cluster.yaml -``` - -After loading the ClusterInfo from a file, the client MAY look for updated information from the server by reading the `kube-public/cluster-info` ConfigMap defined below. However, when retrieving this ConfigMap the client MUST validate the certificate chain when talking to the API server. - -**Note:** TLS bootstrap (which establishes a way for a client to authenticate itself to the server) is a separate issue and has its own set of methods. This command line may have a TLS bootstrap token (or config file) on the command line also. For this reason, even thought the `--discovery-file` argument is in the form of a `kubeconfig`, it MUST NOT contain client credentials as defined above. - -### Method: HTTPS Endpoint - -If the ClusterInfo information is hosted in a trusted place via HTTPS you can just request it that way. This will use the root certificates that are installed on the system. It may or may not be appropriate based on the user's constraints. This method MUST use HTTPS. Also, even though the payload for this URL is the `kubeconfig` format, it MUST NOT contain client credentials. 
- -``` -kubeadm join --discovery-file="https://example/mycluster.yaml" -``` - -This is really a shorthand for someone doing something like (assuming we support stdin with `-`): - -``` -curl https://example.com/mycluster.yaml | kubeadm join --discovery-file=- -``` - -After loading the ClusterInfo from a URL, the client MAY look for updated information from the server by reading the `kube-public/cluster-info` ConfigMap defined below. However, when retrieving this ConfigMap the client MUST validate the certificate chain when talking to the API server. - -**Note:** support for loading from stdin for `--discovery-file` may not be implemented immediately. - -### Method: Bootstrap Token - -There won't always be a trusted external endpoint to talk to and transmitting -the locator file out of band is a pain. However, we want something more secure -than just hitting HTTP and trusting whatever we get back. In this case, we -assume we have the following: - - * An address for at least one of the API servers (which will implement this API). - * This address is technically an HTTPS URL base but is often expressed as a bare domain or IP. - * A shared secret token - -An interesting aspect here is that this information is often easily obtained before the API server is configured or started. This makes some cluster bring-up scenarios much easier. - -The user experience for joining a cluster would be something like: - -``` -kubeadm join --token=ae23dc.faddc87f5a5ab458 <address>:<port> -``` - -**Note:** This is logically a different use of the token used for authentication for TLS bootstrap. We harmonize these usages and allow the same token to play double duty. - -#### Implementation Flow - -`kubeadm` will implement the following flow: - -* `kubeadm` connects to the API server address specified over TLS. As we don't yet have a root certificate to trust, this is an insecure connection and the server certificate is not validated. `kubeadm` provides no authentication credentials at all. - * Implementation note: the API server doesn't have to expose a new and special insecure HTTP endpoint. - * (D)DoS concern: Before this flow is secure to use/enable publicly (when not bootstrapping), the API Server must support rate-limiting. There are a couple of ways rate-limiting can be implemented to work for this use-case, but defining the rate-limiting flow in detail here is out of scope. One simple idea is limiting unauthenticated requests to come from clients in RFC1918 ranges. -* `kubeadm` requests a ConfigMap containing the kubeconfig file defined above. - * This ConfigMap exists at a well known URL: `https://<server>/api/v1/namespaces/kube-public/configmaps/cluster-info` - * This ConfigMap is really public. Users don't need to authenticate to read this ConfigMap. In fact, the client MUST NOT use a bearer token here as we don't trust this endpoint yet. -* The API server returns the ConfigMap with the kubeconfig contents as normal - * Extra data items on that ConfigMap contains JWS signatures. `kubeadm` finds the correct signature based on the `token-id` part of the token. (Described below). -* `kubeadm` verifies the JWS and can now trust the server. Further communication is simpler as the CA certificate in the kubeconfig file can be trusted. - - -#### NEW: Bootstrap Token Structure - -To first make this work, we put some structure into the token. It has both a token identifier and the token value, separated by a dot. Example: - -``` -ae23dc.faddc87f5a5ab458 -``` - -The first part of the token is the `token-id`. 
The second part is the `token-secret`. By having a token identifier, we make it easier to specify *which* token you are talking about without sending the token itself in the clear. - -This new type of token is different from the current CSV token authenticator that is currently part of Kubernetes. The CSV token authenticator requires an update on disk and a restart of the API server to update/delete tokens. As we prove out this token mechanism we may wish to deprecate and eventually remove that mechanism. - -The `token-id` must be 6 characters and the `token-secret` must be 16 characters. They must be lower case ASCII letters and numbers. Specifically it must match the regular expression: `[a-z0-9]{6}\.[a-z0-9]{16}`. There is no strong reasoning behind this beyond the history of how this has been implemented in alpha versions. - -#### NEW: Bootstrap Token Secrets - -Bootstrap tokens are stored and managed via Kubernetes secrets in the `kube-system` namespace. They have the type `bootstrap.kubernetes.io/token`. - -The following keys are on the secret data: -* **token-id**. As defined above. -* **token-secret**. As defined above. -* **expiration**. After this time the token should be automatically deleted. This is encoded as an absolute UTC time using RFC3339. -* **usage-bootstrap-signing**. Set to `true` to indicate this token should be used for signing bootstrap configs. If this is missing from the token secret or set to any other value, the usage is not allowed. -* **usage-bootstrap-authentication**. Set to true to indicate that this token should be used for authenticating to the API server. If this is missing from the token secret or set to any other value, the usage is not allowed. The bootstrap token authenticator will use this token to auth as a user that is `system:bootstrap:<token-id>` in the group `system:bootstrappers`. -* **description**. An optional free form description field for denoting the purpose of the token. If users have especially complex token management neads, they are encouraged to use labels and annotations instead of packing machined readable data in to this field. -* **auth-groups**. A comma-separated list of which groups the token should be authenticated as. All groups must have the `system:bootstrappers:` prefix. - - -These secrets MUST be named `bootstrap-token-<token-id>`. If a token doesn't adhere to this naming scheme it MUST be ignored. The secret MUST also be ignored if the `token-id` key in the secret doesn't match the name of the secret. - -#### Quick Primer on JWS - -[JSON Web Signatures](https://tools.ietf.org/html/rfc7515) are a way to sign, serialize and verify a payload. It supports both symmetric keys (aka shared secrets) along with asymmetric keys (aka public key infrastructure or key pairs). The JWS is split in to 3 parts: -1. a header about how it is signed -2. the clear text payload -3. the signature. - -There are a couple of different ways of encoding this data -- either as a JSON object or as a set of BASE64URL strings for including in headers or URL parameters. In this case, we are using a shared secret and the HMAC-SHA256 signing algorithm and encoding it as a JSON object. The popular JWT (JSON Web Tokens) specification is a type of JWS. - -The JWS specification [describes how to encode](https://tools.ietf.org/html/rfc7515#appendix-F) "detached content". In this way the signature is calculated as normal but the content isn't included in the signature. 
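-
-Tying the token structure and the secret fields above together, a bootstrap token Secret
-could look like the following sketch (the token value reuses the earlier example; the
-expiration and description are arbitrary, and `stringData` is used only to keep the
-values readable):
-
-```yaml
-apiVersion: v1
-kind: Secret
-metadata:
-  name: bootstrap-token-ae23dc        # MUST be bootstrap-token-<token-id>
-  namespace: kube-system
-type: bootstrap.kubernetes.io/token
-stringData:
-  token-id: ae23dc
-  token-secret: faddc87f5a5ab458
-  expiration: "2017-03-10T03:22:11Z"
-  usage-bootstrap-signing: "true"
-  usage-bootstrap-authentication: "true"
-  description: "bootstrap token generated by kubeadm init"
-```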
- -#### NEW: `kube-public` namespace - -Kubernetes ConfigMaps are per-namespace and are generally only visible to principals that have read access on that namespace. To create a config map that *everyone* can see, we introduce a new `kube-public` namespace. This namespace, by convention, is readable by all users (including those not authenticated). Note that is a convention -(to expose everything in `kube-public`), not something that's done by default in Kubernetes. `kubeadm` does _solely_ expose the `cluster-info` ConfigMap, not anything else. - -In the initial implementation the `kube-public` namespace (and the `cluster-info` config map) will be created by `kubeadm`. That means that these won't exist for clusters that aren't bootstrapped with `kubeadm`. As we have need for this configmap in other contexts (self describing HA clusters?) we'll make this be more generally available. - -#### NEW: `cluster-info` ConfigMap - -A new well known ConfigMap will be created in the `kube-public` namespace called `cluster-info`. - -Users configuring the cluster (and eventually the cluster itself) will update the `kubeconfig` key here with the limited `kubeconfig` above. - -A new controller (`bootstrapsigner`) is introduced that will watch for both new/modified bootstrap tokens and changes to the `cluster-info` ConfigMap. As things change it will generate new JWS signatures. These will be saved under ConfigMap keys of the pattern `jws-kubeconfig-<token-id>`. - -Another controller (`tokencleaner`) is introduced that deletes tokens that are past their expiration time. - -Logically these controllers could run as a separate component in the control plane. But, for the sake of efficiency, they are bundled as part of the Kubernetes controller-manager. - -## `kubeadm` UX - -We extend kubeadm with a set of flags and helper commands for managing and using these tokens. - -### `kubeadm init` flags - -* `--token` If set, this injects the bootstrap token to use when initializing the cluster. If this is unset, then a random token is created and shown to the user. If set explicitly to the empty string then no token is generated or created. This token is used for both discovery and TLS bootstrap by having `usage-bootstrap-signing` and `usage-bootstrap-authentication` set on the token secret. -* `--token-ttl` If set, this sets the TTL for the lifetime of this token. Defaults to 0 which means "forever" in v1.6 and v1.7. Defaults to `24h` in v1.8 - -### `kubeadm join` flags - -* `--token` This sets the token for both discovery and bootstrap auth. -* `--discovery-file` If set, this will load the cluster-info from a file on disk or from a HTTPS URL (the HTTPS requirement due to the sensitive nature of the data) -* `--discovery-token` If set, (or set via `--token`) then we will be using the token scheme described above. -* `--tls-bootstrap-token` (not officially part of this spec) This sets the token used to temporarily authenticate to the API server in order to submit a CSR for signing. If the `system:csr-approver:approve-node-client-csr` ClusterRole is bound to the group the Bootstrap Token authenticates to, the CSR will be approved automatically (by the `csrapprover` controller) for a hands off joining flow. - -Only one of `--discovery-file` or `--discovery-token` can be set. If more than one is set then an error is surfaced and `kubeadm join` exits. Setting `--token` counts as setting `--discovery-token`. 
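-
-To make the token-based discovery flow more concrete, the `cluster-info` ConfigMap that a
-joining node fetches anonymously might look roughly like this (the kubeconfig contents are
-shortened and the JWS value is a placeholder):
-
-```yaml
-apiVersion: v1
-kind: ConfigMap
-metadata:
-  name: cluster-info
-  namespace: kube-public
-data:
-  kubeconfig: |
-    apiVersion: v1
-    kind: Config
-    clusters:
-    - name: ""
-      cluster:
-        server: https://10.0.0.1:6443
-        certificate-authority-data: "<base64 CA bundle>"
-  jws-kubeconfig-ae23dc: "<detached JWS signature over the kubeconfig value>"
-```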
- -### `kubeadm token` commands - -`kubeadm` provides a set of utilities for manipulating token secrets in a running server. - -* `kubeadm token create [token]` Creates a token server side. With no options this'll create a token that is used for discovery and TLS bootstrap. - * `[token]` The actual token value (in `id.secret` form) to write in. If unset, a random value is generated. - * `--usages` A list of usages. Defaults to `signing,authentication`. - * If the `signing` usage is specified, the token will be used (by the BootstrapSigner controller in the KCM) to JWS-sign the ConfigMap and can then be used for discovery. - * If the `authentication` usage is specified, the token can be used to authenticate for TLS bootstrap. - * `--ttl` The TTL for this token. This sets the expiration of the token as a duration from the current time. This is converted into an absolute UTC time as it is written into the token secret. - * `--description` Sets the free form description field for the token. -* `kubeadm token delete <token-id>|<token-id>.<token-secret>` - * Users can either just specify the id or the full token. This will delete the token if it exists. -* `kubeadm token list` - * List tokens in a table form listing out the `token-id.token-secret`, the TTL, the absolute expiration time, the usages, and the description. - * **Question** Support a `--json` or `-o json` way to make this info programmatic? We don't want to recreate `kubectl` here and these aren't plain API objects so we can't reuse that plumbing easily. -* `kubeadm token generate` This currently exists but is documented here for completeness. This pure client side method just generated a random token in the correct form. - -## Implementation Details - -Our documentations (and output from `kubeadm`) should stress to users that when the token is configured for authentication and used for TLS bootstrap is a pretty powerful credential due to that any person with access to it can claim to be a node. -The highest risk regarding being able to claim a credential in the `system:nodes` group is that it can read all Secrets in the cluster, which may compromise the cluster. -The [Node Authorizer](/contributors/design-proposals/node/kubelet-authorizer.md) locks this down a bit, but an untrusted person could still try to -guess a node's name, get such a credential, guess the name of the Secret and be able to get that. - -Users should set a TTL on the token to limit the above mentioned risk. `kubeadm` sets a 24h TTL on the node bootstrap token by default in v1.8. -Or, after the cluster is up and running, users should delete the token using `kubeadm token delete`. - -After some back and forth, we decided to keep the separator in the token between the ID and Secret be a `.`. During the 1.6 cycle, at one point `:` was implemented but then reverted. - -See [kubernetes/client-go#114](https://github.com/kubernetes/client-go/issues/114) for details on creating a shared package with common constants for this scheme. - -This proposal assumes RBAC to lock things down in a couple of ways. First, it will open up `cluster-info` ConfigMap in `kube-public` so that it is readable by unauthenticated users (user `system:anonymous`). Next, it will make it so that the identities in the `system:bootstrappers` group can only be used with the certs API to submit CSRs. After a TLS certificate is created, the groupapprover controller will approve the CSR and the CSR identity can be used instead of the bootstrap token. 
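-
-As a sketch of the kind of binding a deployer (or `kubeadm`) would create, with both names
-treated as placeholders rather than something fixed by this proposal:
-
-```yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: ClusterRoleBinding
-metadata:
-  name: kubeadm:kubelet-bootstrap        # placeholder name
-roleRef:
-  apiGroup: rbac.authorization.k8s.io
-  kind: ClusterRole
-  name: system:node-bootstrapper         # a role permitting CSR creation
-subjects:
-- apiGroup: rbac.authorization.k8s.io
-  kind: Group
-  name: system:bootstrappers
-```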
- -The end goal is to make it possible to delegate the client TLS Bootstrapping part to the kubelet, so `kubeadm join`'s function will solely be to verify the validity of the token and fetch the CA bundle. - -The binding of the `system:bootstrappers` (or similar) group to the ability to submit CSRs is not part of the default RBAC configuration. Consumers of this feature like `kubeadm` will have to explicitly create this binding. - -## Revision history - - - Initial proposal ([@jbeda](https://github.com/jbeda)): [link](https://github.com/kubernetes/community/blob/cb9f198a0763e0a7540cdcc9db912a403ab1acab/contributors/design-proposals/bootstrap-discovery.md) - - v1.6 updates ([@jbeda](https://github.com/jbeda)): [link](https://github.com/kubernetes/community/blob/d8ce9e91b0099795318bb06c13f00d9dad41ac26/contributors/design-proposals/bootstrap-discovery.md) - - v1.8 updates ([@luxas](https://github.com/luxas)) +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/cluster-lifecycle/cluster-deployment.md b/contributors/design-proposals/cluster-lifecycle/cluster-deployment.md index 46af0c5c..f0fbec72 100644 --- a/contributors/design-proposals/cluster-lifecycle/cluster-deployment.md +++ b/contributors/design-proposals/cluster-lifecycle/cluster-deployment.md @@ -1,167 +1,6 @@ -# Objective +Design proposals have been archived. -Simplify the cluster provisioning process for a cluster with one master and multiple worker nodes. -It should be secured with SSL and have all the default add-ons. There should not be significant -differences in the provisioning process across deployment targets (cloud provider + OS distribution) -once machines meet the node specification. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -# Overview - -Cluster provisioning can be broken into a number of phases, each with their own exit criteria. -In some cases, multiple phases will be combined together to more seamlessly automate the cluster setup, -but in all cases the phases can be run sequentially to provision a functional cluster. - -It is possible that for some platforms we will provide an optimized flow that combines some of the steps -together, but that is out of scope of this document. - -# Deployment flow - -**Note**: _Exit critieria_ in the following sections are not intended to list all tests that should pass, -rather list those that must pass. - -## Step 1: Provision cluster - -**Objective**: Create a set of machines (master + nodes) where we will deploy Kubernetes. - -For this phase to be completed successfully, the following requirements must be completed for all nodes: -- Basic connectivity between nodes (i.e. nodes can all ping each other) -- Docker installed (and in production setups should be monitored to be always running) -- One of the supported OS - -We will provide a node specification conformance test that will verify if provisioning has been successful. - -This step is provider specific and will be implemented for each cloud provider + OS distribution separately -using provider specific technology (cloud formation, deployment manager, PXE boot, etc). -Some OS distributions may meet the provisioning criteria without needing to run any post-boot steps as they -ship with all of the requirements for the node specification by default. - -**Substeps** (on the GCE example): - -1. Create network -2. Create firewall rules to allow communication inside the cluster -3. Create firewall rule to allow ```ssh``` to all machines -4. Create firewall rule to allow ```https``` to master -5. Create persistent disk for master -6. Create static IP address for master -7. Create master machine -8. Create node machines -9. Install docker on all machines - -**Exit criteria**: - -1. Can ```ssh``` to all machines and run a test docker image -2. Can ```ssh``` to master and nodes and ping other machines - -## Step 2: Generate certificates - -**Objective**: Generate security certificates used to configure secure communication between client, master and nodes - -TODO: Enumerate certificates which have to be generated. - -## Step 3: Deploy master - -**Objective**: Run kubelet and all the required components (e.g. etcd, apiserver, scheduler, controllers) on the master machine. - -**Substeps**: - -1. copy certificates -2. copy manifests for static pods: - 1. etcd - 2. apiserver, controller manager, scheduler -3. 
run kubelet in docker container (configuration is read from apiserver Config object) -4. run kubelet-checker in docker container - -**v1.2 simplifications**: - -1. kubelet-runner.sh - we will provide a custom docker image to run kubelet; it will contain -kubelet binary and will run it using ```nsenter``` to workaround problems with mount propagation -1. kubelet config file - we will read kubelet configuration file from disk instead of apiserver; it will -be generated locally and copied to all nodes. - -**Exit criteria**: - -1. Can run basic API calls (e.g. create, list and delete pods) from the client side (e.g. replication -controller works - user can create RC object and RC manager can create pods based on that) -2. Critical master components works: - 1. scheduler - 2. controller manager - -## Step 4: Deploy nodes - -**Objective**: Start kubelet on all nodes and configure kubernetes network. -Each node can be deployed separately and the implementation should make it ~impossible to change this assumption. - -### Step 4.1: Run kubelet - -**Substeps**: - -1. copy certificates -2. run kubelet in docker container (configuration is read from apiserver Config object) -3. run kubelet-checker in docker container - -**v1.2 simplifications**: - -1. kubelet config file - we will read kubelet configuration file from disk instead of apiserver; it will -be generated locally and copied to all nodes. - -**Exit criteria**: - -1. All nodes are registered, but not ready due to lack of kubernetes networking. - -### Step 4.2: Setup kubernetes networking - -**Objective**: Configure the Kubernetes networking to allow routing requests to pods and services. - -To keep default setup consistent across open source deployments we will use Flannel to configure -kubernetes networking. However, implementation of this step will allow to easily plug in different -network solutions. - -**Substeps**: - -1. copy manifest for flannel server to master machine -2. create a daemonset with flannel daemon (it will read assigned CIDR and configure network appropriately). - -**v1.2 simplifications**: - -1. flannel daemon will run as a standalone binary (not in docker container) -2. flannel server will assign CIDRs to nodes outside of kubernetes; this will require restarting kubelet -after reconfiguring network bridge on local machine; this will also require running master and node differently -(```--configure-cbr0=false``` on node and ```--allocate-node-cidrs=false``` on master), which breaks encapsulation -between nodes - -**Exit criteria**: - -1. Pods correctly created, scheduled, run and accessible from all nodes. - -## Step 5: Add daemons - -**Objective:** Start all system daemons (e.g. kube-proxy) - -**Substeps:**: - -1. Create daemonset for kube-proxy - -**Exit criteria**: - -1. Services work correctly on all nodes. - -## Step 6: Add add-ons - -**Objective**: Add default add-ons (e.g. dns, dashboard) - -**Substeps:**: - -1. Create Deployments (and daemonsets if needed) for all add-ons - -## Deployment technology - -We will use Ansible as the default technology for deployment orchestration. It has low requirements on the cluster machines -and seems to be popular in kubernetes community which will help us to maintain it. - -For simpler UX we will provide simple bash scripts that will wrap all basic commands for deployment (e.g. ```up``` or ```down```) - -One disadvantage of using Ansible is that it adds a dependency on a machine which runs deployment scripts. 
We will work around
-this by distributing deployment scripts via a docker image so that the user can run the following command to create a cluster:
-
-```docker run k8s.gcr.io/deploy_kubernetes:v1.2 up --num-nodes=3 --provider=aws``` +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
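The Step 1 substeps above enumerate the GCE resources to create but not the commands; a rough `gcloud` sketch follows, in which all names, zones, and CIDR ranges are illustrative assumptions rather than values prescribed by this proposal.

```sh
# 1-2. Network and intra-cluster traffic.
gcloud compute networks create k8s-net
gcloud compute firewall-rules create k8s-internal \
  --network k8s-net --allow tcp,udp,icmp --source-ranges 10.240.0.0/16

# 3-4. ssh to all machines, https to the master.
gcloud compute firewall-rules create k8s-ssh --network k8s-net --allow tcp:22
gcloud compute firewall-rules create k8s-master-https \
  --network k8s-net --allow tcp:443 --target-tags k8s-master

# 5-8. Master persistent disk, static IP, master and node VMs.
gcloud compute disks create k8s-master-pd --zone us-central1-b
gcloud compute addresses create k8s-master-ip --region us-central1
gcloud compute instances create k8s-master --zone us-central1-b \
  --network k8s-net --tags k8s-master
gcloud compute instances create k8s-node-1 k8s-node-2 --zone us-central1-b \
  --network k8s-net

# 9. Install docker on every machine (distribution specific).
```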
\ No newline at end of file diff --git a/contributors/design-proposals/cluster-lifecycle/clustering.md b/contributors/design-proposals/cluster-lifecycle/clustering.md index e681d8e9..f0fbec72 100644 --- a/contributors/design-proposals/cluster-lifecycle/clustering.md +++ b/contributors/design-proposals/cluster-lifecycle/clustering.md @@ -1,123 +1,6 @@ -# Clustering in Kubernetes +Design proposals have been archived. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Overview -The term "clustering" refers to the process of having all members of the -Kubernetes cluster find and trust each other. There are multiple different ways -to achieve clustering with different security and usability profiles. This -document attempts to lay out the user experiences for clustering that Kubernetes -aims to address. - -Once a cluster is established, the following is true: - -1. **Master -> Node** The master needs to know which nodes can take work and -what their current status is wrt capacity. - 1. **Location** The master knows the name and location of all of the nodes in -the cluster. - * For the purposes of this doc, location and name should be enough -information so that the master can open a TCP connection to the Node. Most -probably we will make this either an IP address or a DNS name. It is going to be -important to be consistent here (master must be able to reach kubelet on that -DNS name) so that we can verify certificates appropriately. - 2. **Target AuthN** A way to securely talk to the kubelet on that node. -Currently we call out to the kubelet over HTTP. This should be over HTTPS and -the master should know what CA to trust for that node. - 3. **Caller AuthN/Z** This would be the master verifying itself (and -permissions) when calling the node. Currently, this is only used to collect -statistics as authorization isn't critical. This may change in the future -though. -2. **Node -> Master** The nodes currently talk to the master to know which pods -have been assigned to them and to publish events. - 1. **Location** The nodes must know where the master is at. - 2. **Target AuthN** Since the master is assigning work to the nodes, it is -critical that they verify whom they are talking to. - 3. **Caller AuthN/Z** The nodes publish events and so must be authenticated to -the master. Ideally this authentication is specific to each node so that -authorization can be narrowly scoped. The details of the work to run (including -things like environment variables) might be considered sensitive and should be -locked down also. - -**Note:** While the description here refers to a singular Master, in the future -we should enable multiple Masters operating in an HA mode. While the "Master" is -currently the combination of the API Server, Scheduler and Controller Manager, -we will restrict ourselves to thinking about the main API and policy engine -- -the API Server. - -## Current Implementation - -A central authority (generally the master) is responsible for determining the -set of machines which are members of the cluster. Calls to create and remove -worker nodes in the cluster are restricted to this single authority, and any -other requests to add or remove worker nodes are rejected. (1.i.) - -Communication from the master to nodes is currently over HTTP and is not secured -or authenticated in any way. (1.ii, 1.iii.) - -The location of the master is communicated out of band to the nodes. 
For GCE, -this is done via Salt. Other cluster instructions/scripts use other methods. -(2.i.) - -Currently most communication from the node to the master is over HTTP. When it -is done over HTTPS there is currently no verification of the cert of the master -(2.ii.) - -Currently, the node/kubelet is authenticated to the master via a token shared -across all nodes. This token is distributed out of band (using Salt for GCE) and -is optional. If it is not present then the kubelet is unable to publish events -to the master. (2.iii.) - -Our current mix of out of band communication doesn't meet all of our needs from -a security point of view and is difficult to set up and configure. - -## Proposed Solution - -The proposed solution will provide a range of options for setting up and -maintaining a secure Kubernetes cluster. We want to both allow for centrally -controlled systems (leveraging pre-existing trust and configuration systems) or -more ad-hoc automagic systems that are incredibly easy to set up. - -The building blocks of an easier solution: - -* **Move to TLS** We will move to using TLS for all intra-cluster communication. -We will explicitly identify the trust chain (the set of trusted CAs) as opposed -to trusting the system CAs. We will also use client certificates for all AuthN. -* [optional] **API driven CA** Optionally, we will run a CA in the master that -will mint certificates for the nodes/kubelets. There will be pluggable policies -that will automatically approve certificate requests here as appropriate. - * **CA approval policy** This is a pluggable policy object that can -automatically approve CA signing requests. Stock policies will include -`always-reject`, `queue` and `insecure-always-approve`. With `queue` there would -be an API for evaluating and accepting/rejecting requests. Cloud providers could -implement a policy here that verifies other out of band information and -automatically approves/rejects based on other external factors. -* **Scoped Kubelet Accounts** These accounts are per-node and (optionally) give -a node permission to register itself. - * To start with, we'd have the kubelets generate a cert/account in the form of -`kubelet:<host>`. To start we would then hard code policy such that we give that -particular account appropriate permissions. Over time, we can make the policy -engine more generic. -* [optional] **Bootstrap API endpoint** This is a helper service hosted outside -of the Kubernetes cluster that helps with initial discovery of the master. - -### Static Clustering - -In this sequence diagram there is out of band admin entity that is creating all -certificates and distributing them. It is also making sure that the kubelets -know where to find the master. This provides for a lot of control but is more -difficult to set up as lots of information must be communicated outside of -Kubernetes. - - - -### Dynamic Clustering - -This diagram shows dynamic clustering using the bootstrap API endpoint. This -endpoint is used to both find the location of the master and communicate the -root CA for the master. - -This flow has the admin manually approving the kubelet signing requests. This is -the `queue` policy defined above. This manual intervention could be replaced by -code that can verify the signing requests via other means. - - +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
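In the static clustering flow, an out-of-band admin creates and distributes every certificate. Purely as an illustration of what the "Manual CA" role does (not something this document prescribes), the same steps can be played out with plain `openssl`; the names, hostnames, and validity periods below are arbitrary.

```sh
# Create the cluster root CA.
openssl genrsa -out ca.key 2048
openssl req -x509 -new -nodes -key ca.key -subj "/CN=kube-ca" -days 365 -out ca.crt

# Issue a certificate for one kubelet; the CN should match the name the master
# will use to reach the node, per the location/AuthN requirements above.
openssl genrsa -out kubelet-node1.key 2048
openssl req -new -key kubelet-node1.key -subj "/CN=node1.example.com" -out kubelet-node1.csr
openssl x509 -req -in kubelet-node1.csr -CA ca.crt -CAkey ca.key \
  -CAcreateserial -days 365 -out kubelet-node1.crt
```

The master certificate would be minted the same way, and the admin then distributes `ca.crt` plus each per-machine key/cert pair out of band.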
\ No newline at end of file diff --git a/contributors/design-proposals/cluster-lifecycle/clustering/.gitignore b/contributors/design-proposals/cluster-lifecycle/clustering/.gitignore deleted file mode 100644 index 67bcd6cb..00000000 --- a/contributors/design-proposals/cluster-lifecycle/clustering/.gitignore +++ /dev/null @@ -1 +0,0 @@ -DroidSansMono.ttf diff --git a/contributors/design-proposals/cluster-lifecycle/clustering/Dockerfile b/contributors/design-proposals/cluster-lifecycle/clustering/Dockerfile deleted file mode 100644 index e7abc753..00000000 --- a/contributors/design-proposals/cluster-lifecycle/clustering/Dockerfile +++ /dev/null @@ -1,26 +0,0 @@ -# Copyright 2016 The Kubernetes Authors. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -FROM debian:jessie - -RUN apt-get update -RUN apt-get -qy install python-seqdiag make curl - -WORKDIR /diagrams - -RUN curl -sLo DroidSansMono.ttf https://googlefontdirectory.googlecode.com/hg/apache/droidsansmono/DroidSansMono.ttf - -ADD . /diagrams - -CMD bash -c 'make >/dev/stderr && tar cf - *.png'
\ No newline at end of file diff --git a/contributors/design-proposals/cluster-lifecycle/clustering/Makefile b/contributors/design-proposals/cluster-lifecycle/clustering/Makefile deleted file mode 100644 index e72d441e..00000000 --- a/contributors/design-proposals/cluster-lifecycle/clustering/Makefile +++ /dev/null @@ -1,41 +0,0 @@ -# Copyright 2016 The Kubernetes Authors. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -FONT := DroidSansMono.ttf - -PNGS := $(patsubst %.seqdiag,%.png,$(wildcard *.seqdiag)) - -.PHONY: all -all: $(PNGS) - -.PHONY: watch -watch: - fswatch *.seqdiag | xargs -n 1 sh -c "make || true" - -$(FONT): - curl -sLo $@ https://googlefontdirectory.googlecode.com/hg/apache/droidsansmono/$(FONT) - -%.png: %.seqdiag $(FONT) - seqdiag --no-transparency -a -f '$(FONT)' $< - -# Build the stuff via a docker image -.PHONY: docker -docker: - docker build -t clustering-seqdiag . - docker run --rm clustering-seqdiag | tar xvf - - -.PHONY: docker-clean -docker-clean: - docker rmi clustering-seqdiag || true - docker images -q --filter "dangling=true" | xargs docker rmi diff --git a/contributors/design-proposals/cluster-lifecycle/clustering/OWNERS b/contributors/design-proposals/cluster-lifecycle/clustering/OWNERS deleted file mode 100644 index 741be590..00000000 --- a/contributors/design-proposals/cluster-lifecycle/clustering/OWNERS +++ /dev/null @@ -1,6 +0,0 @@ -# See the OWNERS docs at https://go.k8s.io/owners - -reviewers: - - michelleN -approvers: - - michelleN diff --git a/contributors/design-proposals/cluster-lifecycle/clustering/README.md b/contributors/design-proposals/cluster-lifecycle/clustering/README.md index 9fe1f027..f0fbec72 100644 --- a/contributors/design-proposals/cluster-lifecycle/clustering/README.md +++ b/contributors/design-proposals/cluster-lifecycle/clustering/README.md @@ -1,31 +1,6 @@ -This directory contains diagrams for the clustering design doc. +Design proposals have been archived. -This depends on the `seqdiag` [utility](http://blockdiag.com/en/seqdiag/index.html). -Assuming you have a non-borked python install, this should be installable with: +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -```sh -pip install seqdiag -``` - -Just call `make` to regenerate the diagrams. - -## Building with Docker - -If you are on a Mac or your pip install is messed up, you can easily build with -docker: - -```sh -make docker -``` - -The first run will be slow but things should be fast after that. - -To clean up the docker containers that are created (and other cruft that is left -around) you can run `make docker-clean`. - -## Automatically rebuild on file changes - -If you have the fswatch utility installed, you can have it monitor the file -system and automatically rebuild when files have changed. Just do a -`make watch`. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/cluster-lifecycle/clustering/dynamic.png b/contributors/design-proposals/cluster-lifecycle/clustering/dynamic.png Binary files differdeleted file mode 100644 index 92b40fee..00000000 --- a/contributors/design-proposals/cluster-lifecycle/clustering/dynamic.png +++ /dev/null diff --git a/contributors/design-proposals/cluster-lifecycle/clustering/dynamic.seqdiag b/contributors/design-proposals/cluster-lifecycle/clustering/dynamic.seqdiag deleted file mode 100644 index 567d5bf9..00000000 --- a/contributors/design-proposals/cluster-lifecycle/clustering/dynamic.seqdiag +++ /dev/null @@ -1,24 +0,0 @@ -seqdiag { - activation = none; - - - user[label = "Admin User"]; - bootstrap[label = "Bootstrap API\nEndpoint"]; - master; - kubelet[stacked]; - - user -> bootstrap [label="createCluster", return="cluster ID"]; - user <-- bootstrap [label="returns\n- bootstrap-cluster-uri"]; - - user ->> master [label="start\n- bootstrap-cluster-uri"]; - master => bootstrap [label="setMaster\n- master-location\n- master-ca"]; - - user ->> kubelet [label="start\n- bootstrap-cluster-uri"]; - kubelet => bootstrap [label="get-master", return="returns\n- master-location\n- master-ca"]; - kubelet ->> master [label="signCert\n- unsigned-kubelet-cert", return="returns\n- kubelet-cert"]; - user => master [label="getSignRequests"]; - user => master [label="approveSignRequests"]; - kubelet <<-- master [label="returns\n- kubelet-cert"]; - - kubelet => master [label="register\n- kubelet-location"] -} diff --git a/contributors/design-proposals/cluster-lifecycle/clustering/static.png b/contributors/design-proposals/cluster-lifecycle/clustering/static.png Binary files differdeleted file mode 100644 index bcdeca7e..00000000 --- a/contributors/design-proposals/cluster-lifecycle/clustering/static.png +++ /dev/null diff --git a/contributors/design-proposals/cluster-lifecycle/clustering/static.seqdiag b/contributors/design-proposals/cluster-lifecycle/clustering/static.seqdiag deleted file mode 100644 index bdc54b76..00000000 --- a/contributors/design-proposals/cluster-lifecycle/clustering/static.seqdiag +++ /dev/null @@ -1,16 +0,0 @@ -seqdiag { - activation = none; - - admin[label = "Manual Admin"]; - ca[label = "Manual CA"] - master; - kubelet[stacked]; - - admin => ca [label="create\n- master-cert"]; - admin ->> master [label="start\n- ca-root\n- master-cert"]; - - admin => ca [label="create\n- kubelet-cert"]; - admin ->> kubelet [label="start\n- ca-root\n- kubelet-cert\n- master-location"]; - - kubelet => master [label="register\n- kubelet-location"]; -} diff --git a/contributors/design-proposals/cluster-lifecycle/dramatically-simplify-cluster-creation.md b/contributors/design-proposals/cluster-lifecycle/dramatically-simplify-cluster-creation.md index 3472115d..f0fbec72 100644 --- a/contributors/design-proposals/cluster-lifecycle/dramatically-simplify-cluster-creation.md +++ b/contributors/design-proposals/cluster-lifecycle/dramatically-simplify-cluster-creation.md @@ -1,261 +1,6 @@ -# Proposal: Dramatically Simplify Kubernetes Cluster Creation +Design proposals have been archived. 
-> ***Please note: this proposal doesn't reflect final implementation, it's here for the purpose of capturing the original ideas.*** -> ***You should probably [read `kubeadm` docs](http://kubernetes.io/docs/getting-started-guides/kubeadm/), to understand the end-result of this effor.*** +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Luke Marsden & many others in [SIG-cluster-lifecycle](/sig-cluster-lifecycle). -17th August 2016 - -*This proposal aims to capture the latest consensus and plan of action of SIG-cluster-lifecycle. It should satisfy the first bullet point [required by the feature description](https://github.com/kubernetes/features/issues/11).* - -See also: [this presentation to community hangout on 4th August 2016](https://docs.google.com/presentation/d/17xrFxrTwqrK-MJk0f2XCjfUPagljG7togXHcC39p0sM/edit?ts=57a33e24#slide=id.g158d2ee41a_0_76) - -## Motivation - -Kubernetes is hard to install, and there are many different ways to do it today. None of them are excellent. We believe this is hindering adoption. - -## Goals - -Have one recommended, official, tested, "happy path" which will enable a majority of new and existing Kubernetes users to: - -* Kick the tires and easily turn up a new cluster on infrastructure of their choice - -* Get a reasonably secure, production-ready cluster, with reasonable defaults and a range of easily-installable add-ons - -We plan to do so by improving and simplifying Kubernetes itself, rather than building lots of tooling which "wraps" Kubernetes by poking all the bits into the right place. - -## Scope of project - -There are logically 3 steps to deploying a Kubernetes cluster: - -1. *Provisioning*: Getting some servers - these may be VMs on a developer's workstation, VMs in public clouds, or bare-metal servers in a user's data center. - -2. *Install & Discovery*: Installing the Kubernetes core components on those servers (kubelet, etc) - and bootstrapping the cluster to a state of basic liveness, including allowing each server in the cluster to discover other servers: for example teaching etcd servers about their peers, having TLS certificates provisioned, etc. - -3. *Add-ons*: Now that basic cluster functionality is working, installing add-ons such as DNS or a pod network (should be possible using kubectl apply). - -Notably, this project is *only* working on dramatically improving 2 and 3 from the perspective of users typing commands directly into root shells of servers. The reason for this is that there are a great many different ways of provisioning servers, and users will already have their own preferences. - -What's more, once we've radically improved the user experience of 2 and 3, it will make the job of tools that want to do all three much easier. - -## User stories - -### Phase I - -**_In time to be an alpha feature in Kubernetes 1.4._** - -Note: the current plan is to deliver `kubeadm` which implements these stories as "alpha" packages built from master (after the 1.4 feature freeze), but which are capable of installing a Kubernetes 1.4 cluster. - -* *Install*: As a potential Kubernetes user, I can deploy a Kubernetes 1.4 cluster on a handful of computers running Linux and Docker by typing two commands on each of those computers. The process is so simple that it becomes obvious to me how to easily automate it if I so wish. - -* *Pre-flight check*: If any of the computers don't have working dependencies installed (e.g. 
bad version of Docker, too-old Linux kernel), I am informed early on and given clear instructions on how to fix it so that I can keep trying until it works. - -* *Control*: Having provisioned a cluster, I can gain user credentials which allow me to remotely control it using kubectl. - -* *Install-addons*: I can select from a set of recommended add-ons to install directly after installing Kubernetes on my set of initial computers with kubectl apply. - -* *Add-node*: I can add another computer to the cluster. - -* *Secure*: As an attacker with (presumed) control of the network, I cannot add malicious nodes I control to the cluster created by the user. I also cannot remotely control the cluster. - -### Phase II - -**_In time for Kubernetes 1.5:_** -*Everything from Phase I as beta/stable feature, everything else below as beta feature in Kubernetes 1.5.* - -* *Upgrade*: Later, when Kubernetes 1.4.1 or any newer release is published, I can upgrade to it by typing one other command on each computer. - -* *HA*: If one of the computers in the cluster fails, the cluster carries on working. I can find out how to replace the failed computer, including if the computer was one of the masters. - -## Top-down view: UX for Phase I items - -We will introduce a new binary, kubeadm, which ships with the Kubernetes OS packages (and binary tarballs, for OSes without package managers). - -``` -laptop$ kubeadm --help -kubeadm: bootstrap a secure kubernetes cluster easily. - - /==========================================================\ - | KUBEADM IS ALPHA, DO NOT USE IT FOR PRODUCTION CLUSTERS! | - | | - | But, please try it out! Give us feedback at: | - | https://github.com/kubernetes/kubernetes/issues | - | and at-mention @kubernetes/sig-cluster-lifecycle | - \==========================================================/ - -Example usage: - - Create a two-machine cluster with one master (which controls the cluster), - and one node (where workloads, like pods and containers run). - - On the first machine - ==================== - master# kubeadm init master - Your token is: <token> - - On the second machine - ===================== - node# kubeadm join node --token=<token> <ip-of-master> - -Usage: - kubeadm [command] - -Available Commands: - init Run this on the first server you deploy onto. - join Run this on other servers to join an existing cluster. - user Get initial admin credentials for a cluster. - manual Advanced, less-automated functionality, for power users. - -Use "kubeadm [command] --help" for more information about a command. -``` - -### Install - -*On first machine:* - -``` -master# kubeadm init master -Initializing kubernetes master... [done] -Cluster token: 73R2SIPM739TNZOA -Run the following command on machines you want to become nodes: - kubeadm join node --token=73R2SIPM739TNZOA <master-ip> -You can now run kubectl here. -``` - -*On N "node" machines:* - -``` -node# kubeadm join node --token=73R2SIPM739TNZOA <master-ip> -Initializing kubernetes node... [done] -Bootstrapping certificates... [done] -Joined node to cluster, see 'kubectl get nodes' on master. -``` - -Note `[done]` would be colored green in all of the above. - -### Install: alternative for automated deploy - -*The user (or their config management system) creates a token and passes the same one to both init and join.* - -``` -master# kubeadm init master --token=73R2SIPM739TNZOA -Initializing kubernetes master... [done] -You can now run kubectl here. 
-``` - -### Pre-flight check - -``` -master# kubeadm init master -Error: socat not installed. Unable to proceed. -``` - -### Control - -*On master, after Install, kubectl is automatically able to talk to localhost:8080:* - -``` -master# kubectl get pods -[normal kubectl output] -``` - -*To mint new user credentials on the master:* - -``` -master# kubeadm user create -o kubeconfig-bob bob - -Waiting for cluster to become ready... [done] -Creating user certificate for user... [done] -Waiting for user certificate to be signed... [done] -Your cluster configuration file has been saved in kubeconfig. - -laptop# scp <master-ip>:/root/kubeconfig-bob ~/.kubeconfig -laptop# kubectl get pods -[normal kubectl output] -``` - -### Install-addons - -*Using CNI network as example:* - -``` -master# kubectl apply --purge -f \ - https://git.io/kubernetes-addons/<X>.yaml -[normal kubectl apply output] -``` - -### Add-node - -*Same as Install – "on node machines".* - -### Secure - -``` -node# kubeadm join --token=GARBAGE node <master-ip> -Unable to join mesh network. Check your token. -``` - -## Work streams – critical path – must have in 1.4 before feature freeze - -1. [TLS bootstrapping](https://github.com/kubernetes/features/issues/43) - so that kubeadm can mint credentials for kubelets and users - - * Requires [#25764](https://github.com/kubernetes/kubernetes/pull/25764) and auto-signing [#30153](https://github.com/kubernetes/kubernetes/pull/30153) but does not require [#30094](https://github.com/kubernetes/kubernetes/pull/30094). - * @philips, @gtank & @yifan-gu - -1. Fix for [#30515](https://github.com/kubernetes/kubernetes/issues/30515) - so that kubeadm can install a kubeconfig which kubelet then picks up - - * @smarterclayton - -## Work streams – can land after 1.4 feature freeze - -1. [Debs](https://github.com/kubernetes/release/pull/35) and [RPMs](https://github.com/kubernetes/release/pull/50) (and binaries?) - so that kubernetes can be installed in the first place - - * @mikedanese & @dgoodwin - -1. [kubeadm implementation](https://github.com/lukemarsden/kubernetes/tree/kubeadm-scaffolding) - the kubeadm CLI itself, will get bundled into "alpha" kubeadm packages - - * @lukemarsden & @errordeveloper - -1. [Implementation of JWS server](https://github.com/jbeda/kubernetes/blob/discovery-api/docs/proposals/super-simple-discovery-api.md#method-jws-token) from [#30707](https://github.com/kubernetes/kubernetes/pull/30707) - so that we can implement the simple UX with no dependencies - - * @jbeda & @philips? - -1. Documentation - so that new users can see this in 1.4 (even if it's caveated with alpha/experimental labels and flags all over it) - - * @lukemarsden - -1. `kubeadm` alpha packages - - * @lukemarsden, @mikedanese, @dgoodwin - -### Nice to have - -1. [Kubectl apply --purge](https://github.com/kubernetes/kubernetes/pull/29551) - so that addons can be maintained using k8s infrastructure - - * @lukemarsden & @errordeveloper - -## kubeadm implementation plan - -Based on [@philips' comment here](https://github.com/kubernetes/kubernetes/pull/30361#issuecomment-239588596). -The key point with this implementation plan is that it requires basically no changes to kubelet except [#30515](https://github.com/kubernetes/kubernetes/issues/30515). -It also doesn't require kubelet to do TLS bootstrapping - kubeadm handles that. - -### kubeadm init master - -1. User installs and configures kubelet to look for manifests in `/etc/kubernetes/manifests` -1. API server CA certs are generated by kubeadm -1. 
kubeadm generates pod manifests to launch API server and etcd -1. kubeadm pushes replica set for prototype jsw-server and the JWS into API server with host-networking so it is listening on the master node IP -1. kubeadm prints out the IP of JWS server and JWS token - -### kubeadm join node --token IP - -1. User installs and configures kubelet to have a kubeconfig at `/var/lib/kubelet/kubeconfig` but the kubelet is in a crash loop and is restarted by host init system -1. kubeadm talks to jws-server on IP with token and gets the cacert, then talks to the apiserver TLS bootstrap API to get client cert, etc and generates a kubelet kubeconfig -1. kubeadm places kubeconfig into `/var/lib/kubelet/kubeconfig` and waits for kubelet to restart -1. Mission accomplished, we think. - -## See also - -* [Joe Beda's "K8s the hard way easier"](https://docs.google.com/document/d/1lJ26LmCP-I_zMuqs6uloTgAnHPcuT7kOYtQ7XSgYLMA/edit#heading=h.ilgrv18sg5t) which combines Kelsey's "Kubernetes the hard way" with history of proposed UX at the end (scroll all the way down to the bottom). +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
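For the automated-deploy variant, where the same token is handed to both `init` and `join`, the purely client-side `kubeadm token generate` described elsewhere in this family of proposals can mint the value up front; `<master-ip>` stays a placeholder.

```sh
# Generate a token once, outside the cluster.
TOKEN=$(kubeadm token generate)

# First machine.
kubeadm init master --token="${TOKEN}"

# Every other machine.
kubeadm join node --token="${TOKEN}" <master-ip>
```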
\ No newline at end of file diff --git a/contributors/design-proposals/cluster-lifecycle/ha_master.md b/contributors/design-proposals/cluster-lifecycle/ha_master.md index 3d0de1f1..f0fbec72 100644 --- a/contributors/design-proposals/cluster-lifecycle/ha_master.md +++ b/contributors/design-proposals/cluster-lifecycle/ha_master.md @@ -1,233 +1,6 @@ -# Automated HA master deployment +Design proposals have been archived. -**Author:** filipg@, jsz@ +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -# Introduction - -We want to allow users to easily replicate kubernetes masters to have highly available cluster, -initially using `kube-up.sh` and `kube-down.sh`. - -This document describes technical design of this feature. It assumes that we are using aforementioned -scripts for cluster deployment. All of the ideas described in the following sections should be easy -to implement on GCE, AWS and other cloud providers. - -It is a non-goal to design a specific setup for bare-metal environment, which -might be very different. - -# Overview - -In a cluster with replicated master, we will have N VMs, each running regular master components -such as apiserver, etcd, scheduler or controller manager. These components will interact in the -following way: -* All etcd replicas will be clustered together and will be using master election - and quorum mechanism to agree on the state. All of these mechanisms are integral - parts of etcd and we will only have to configure them properly. -* All apiserver replicas will be working independently talking to an etcd on - 127.0.0.1 (i.e. local etcd replica), which if needed will forward requests to the current etcd master - (as explained [here](https://coreos.com/etcd/docs/latest/getting-started-with-etcd.html)). -* We will introduce provider specific solutions to load balance traffic between master replicas - (see section `load balancing`) -* Controller manager, scheduler & cluster autoscaler will use lease mechanism and - only a single instance will be an active master. All other will be waiting in a standby mode. -* All add-on managers will work independently and each of them will try to keep add-ons in sync - -# Detailed design - -## Components - -### etcd - -``` -Note: This design for etcd clustering is quite pet-set like - each etcd -replica has its name which is explicitly used in etcd configuration etc. In -medium-term future we would like to have the ability to run masters as part of -autoscaling-group (AWS) or managed-instance-group (GCE) and add/remove replicas -automatically. This is pretty tricky and this design does not cover this. -It will be covered in a separate doc. -``` - -All etcd instances will be clustered together and one of them will be an elected master. -In order to commit any change quorum of the cluster will have to confirm it. Etcd will be -configured in such a way that all writes and reads will go through the master (requests -will be forwarded by the local etcd server such that it's invisible for the user). It will -affect latency for all operations, but it should not increase by much more than the network -latency between master replicas (latency between GCE zones with a region is < 10ms). - -Currently etcd exposes port only using localhost interface. In order to allow clustering -and inter-VM communication we will also have to use public interface. 
To secure the -communication we will use SSL (as described [here](https://coreos.com/etcd/docs/latest/security.html)). - -When generating command line for etcd we will always assume it's part of a cluster -(initially of size 1) and list all existing kubernetes master replicas. -Based on that, we will set the following flags: -* `-initial-cluster` - list of all hostnames/DNS names for master replicas (including the new one) -* `-initial-cluster-state` (keep in mind that we are adding master replicas one by one): - * `new` if we are adding the first replica, i.e. the list of existing master replicas is empty - * `existing` if there are more than one replica, i.e. the list of existing master replicas is non-empty. - -This will allow us to have exactly the same logic for HA and non-HA master. List of DNS names for VMs -with master replicas will be generated in `kube-up.sh` script and passed to as a env variable -`INITIAL_ETCD_CLUSTER`. - -### apiservers - -All apiservers will work independently. They will contact etcd on 127.0.0.1, i.e. they will always contact -etcd replica running on the same VM. If needed, such requests will be forwarded by etcd server to the -etcd leader. This functionality is completely hidden from the client (apiserver -in our case). - -Caching mechanism, which is implemented in apiserver, will not be affected by -replicating master because: -* GET requests go directly to etcd -* LIST requests go either directly to etcd or to cache populated via watch - (depending on the ResourceVersion in ListOptions). In the second scenario, - after a PUT/POST request, changes might not be visible in LIST response. - This is however not worse than it is with the current single master. -* WATCH does not give any guarantees when change will be delivered. - -#### load balancing - -With multiple apiservers we need a way to load balance traffic to/from master replicas. As different cloud -providers have different capabilities and limitations, we will not try to find a common lowest -denominator that will work everywhere. Instead we will document various options and apply different -solution for different deployments. Below we list possible approaches: - -1. `Managed DNS` - user need to specify a domain name during cluster creation. DNS entries will be managed -automatically by the deployment tool that will be integrated with solutions like Route53 (AWS) -or Google Cloud DNS (GCP). For load balancing we will have two options: - 1.1. create an L4 load balancer in front of all apiservers and update DNS name appropriately - 1.2. use round-robin DNS technique to access all apiservers directly -2. `Unmanaged DNS` - this is very similar to `Managed DNS`, with the exception that DNS entries -will be manually managed by the user. We will provide detailed documentation for the entries we -expect. -3. [GCP only] `Promote master IP` - in GCP, when we create the first master replica, we generate a static -external IP address that is later assigned to the master VM. When creating additional replicas we -will create a loadbalancer infront of them and reassign aforementioned IP to point to the load balancer -instead of a single master. When removing second to last replica we will reverse this operation (assign -IP address to the remaining master VM and delete load balancer). That way user will not have to provide -a domain name and all client configurations will keep working. - -This will also impact `kubelet <-> master` communication as it should use load -balancing for it. 
Depending on the chosen method we will use it to properly configure -kubelet. - -#### `kubernetes` service - -Kubernetes maintains a special service called `kubernetes`. Currently it keeps a -list of IP addresses for all apiservers. As it uses a command line flag -`--apiserver-count` it is not very dynamic and would require restarting all -masters to change number of master replicas. - -To allow dynamic changes to the number of apiservers in the cluster, we will -introduce a `ConfigMap` in `kube-system` namespace, that will keep an expiration -time for each apiserver (keyed by IP). Each apiserver will do three things: - -1. periodically update expiration time for it's own IP address -2. remove all the stale IP addresses from the endpoints list -3. add it's own IP address if it's not on the list yet. - -That way we will not only solve the problem of dynamically changing number -of apiservers in the cluster, but also the problem of non-responsive apiservers -that should be removed from the `kubernetes` service endpoints list. - -#### Certificates - -Certificate generation will work as today. In particular, on GCE, we will -generate it for the public IP used to access the cluster (see `load balancing` -section) and local IP of the master replica VM. - -That means that with multiple master replicas and a load balancer in front -of them, accessing one of the replicas directly (using it's ephemeral public -IP) will not work on GCE without appropriate flags: - -- `kubectl --insecure-skip-tls-verify=true` -- `curl --insecure` -- `wget --no-check-certificate` - -For other deployment tools and providers the details of certificate generation -may be different, but it must be possible to access the cluster by using either -the main cluster endpoint (DNS name or IP address) or internal service called -`kubernetes` that points directly to the apiservers. - -### controller manager, scheduler & cluster autoscaler - -Controller manager and scheduler will by default use a lease mechanism to choose an active instance -among all masters. Only one instance will be performing any operations. -All other will be waiting in standby mode. - -We will use the same configuration in non-replicated mode to simplify deployment scripts. - -### add-on manager - -All add-on managers will be working independently. Each of them will observe current state of -add-ons and will try to sync it with files on disk. As a result, due to races, a single add-on -can be updated multiple times in a row after upgrading the master. Long-term we should fix this -by using a similar mechanisms as controller manager or scheduler. However, currently add-on -manager is just a bash script and adding a master election mechanism would not be easy. - -## Adding replica - -Command to add new replica on GCE using kube-up script: - -``` -KUBE_REPLICATE_EXISTING_MASTER=true KUBE_GCE_ZONE=us-central1-b kubernetes/cluster/kube-up.sh -``` - -A pseudo-code for adding a new master replica using managed DNS and a loadbalancer is the following: - -``` -1. If there is no load balancer for this cluster: - 1. Create load balancer using ephemeral IP address - 2. Add existing apiserver to the load balancer - 3. Wait until load balancer is working, i.e. all data is propagated, in GCE up to 20 min (sic!) - 4. Update DNS to point to the load balancer. -2. Clone existing master (create a new VM with the same configuration) including - all env variables (certificates, IP ranges etc), with the exception of - `INITIAL_ETCD_CLUSTER`. -3. 
SSH to an existing master and run the following command to extend etcd cluster - with the new instance: - `curl <existing_master>:4001/v2/members -XPOST -H "Content-Type: application/json" -d '{"peerURLs":["http://<new_master>:2380"]}'` -4. Add IP address of the new apiserver to the load balancer. -``` - -A simplified algorithm for adding a new master replica and promoting master IP to the load balancer -is identical to the one when using DNS, with a different step to setup load balancer: - -``` -1. If there is no load balancer for this cluster: - 1. Unassign IP from the existing master replica - 2. Create load balancer using static IP reclaimed in the previous step - 3. Add existing apiserver to the load balancer - 4. Wait until load balancer is working, i.e. all data is propagated, in GCE up to 20 min (sic!) -... -``` - -## Deleting replica - -Command to delete one replica on GCE using kube-up script: - -``` -KUBE_DELETE_NODES=false KUBE_GCE_ZONE=us-central1-b kubernetes/cluster/kube-down.sh -``` - -A pseudo-code for deleting an existing replica for the master is the following: - -``` -1. Remove replica IP address from the load balancer or DNS configuration -2. SSH to one of the remaining masters and run the following command to remove replica from the cluster: - `curl etcd-0:4001/v2/members/<id> -XDELETE -L` -3. Delete replica VM -4. If load balancer has only a single target instance, then delete load balancer -5. Update DNS to point to the remaining master replica, or [on GCE] assign static IP back to the master VM. -``` - -## Upgrades - -Upgrading replicated master will be possible by upgrading them one by one using existing tools -(e.g. upgrade.sh for GCE). This will work out of the box because: -* Requests from nodes will be correctly served by either new or old master because apiserver is backward compatible. -* Requests from scheduler (and controllers) go to a local apiserver via localhost interface, so both components -will be in the same version. -* Apiserver talks only to a local etcd replica which will be in a compatible version -* We assume we will introduce this setup after we upgrade to etcd v3 so we don't need to cover upgrading database. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
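Pulled out of the pseudo-code above, the etcd membership calls are ordinary requests against the v2 members API (port 4001, as used throughout this document); the `<...>` placeholders are kept as-is.

```sh
# Inspect current membership from any master replica.
curl 127.0.0.1:4001/v2/members

# Announce a new master replica before starting it; the new replica then starts
# etcd with -initial-cluster listing all replicas and -initial-cluster-state=existing.
curl <existing_master>:4001/v2/members -XPOST \
  -H "Content-Type: application/json" \
  -d '{"peerURLs":["http://<new_master>:2380"]}'

# Drop a member when deleting a replica.
curl <existing_master>:4001/v2/members/<id> -XDELETE -L
```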
\ No newline at end of file diff --git a/contributors/design-proposals/cluster-lifecycle/high-availability.md b/contributors/design-proposals/cluster-lifecycle/high-availability.md index d893c597..f0fbec72 100644 --- a/contributors/design-proposals/cluster-lifecycle/high-availability.md +++ b/contributors/design-proposals/cluster-lifecycle/high-availability.md @@ -1,4 +1,6 @@ -# High Availability of Scheduling and Controller Components in Kubernetes +Design proposals have been archived. -This document is deprecated. For more details about running a highly available -cluster master, please see the [admin instructions document](https://kubernetes.io/docs/admin/high-availability/). +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). + + +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/cluster-lifecycle/kubelet-tls-bootstrap.md b/contributors/design-proposals/cluster-lifecycle/kubelet-tls-bootstrap.md index f725b1a9..f0fbec72 100644 --- a/contributors/design-proposals/cluster-lifecycle/kubelet-tls-bootstrap.md +++ b/contributors/design-proposals/cluster-lifecycle/kubelet-tls-bootstrap.md @@ -1,239 +1,6 @@ -# Kubelet TLS bootstrap +Design proposals have been archived. -Author: George Tankersley (george.tankersley@coreos.com) +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Preface - -This document describes a method for a kubelet to bootstrap itself -into a TLS-secured cluster. Crucially, it automates the provision and -distribution of signed certificates. - -## Overview - -When a kubelet runs for the first time, it must be given TLS assets -or generate them itself. In the first case, this is a burden on the cluster -admin and a significant logistical barrier to secure Kubernetes rollouts. In -the second, the kubelet must self-sign its certificate and forfeits many of the -advantages of a PKI system. Instead, we propose that the kubelet generate a -private key and a CSR for submission to a cluster-level certificate signing -process. - -## Preliminaries - -We assume the existence of a functioning control plane. The -apiserver should be configured for TLS initially or possess the ability to -generate valid TLS credentials for itself. If secret information is passed in -the request (e.g. auth tokens supplied with the request or included in -ExtraInfo) then all communications from the node to the apiserver must take -place over a verified TLS connection. - -Each node is additionally provisioned with the following information: - -1. Location of the apiserver -2. Any CA certificates necessary to trust the apiserver's TLS certificate -3. Access tokens (if needed) to communicate with the CSR endpoint - -These should not change often and are thus simple to include in a static -provisioning script. - -## API Changes - -### CertificateSigningRequest Object - -We introduce a new API object to represent PKCS#10 certificate signing -requests. It will be accessible under: - -`/apis/certificates/v1beta1/certificatesigningrequests/mycsr` - -It will have the following structure: - -```go -// Describes a certificate signing request -type CertificateSigningRequest struct { - unversioned.TypeMeta `json:",inline"` - api.ObjectMeta `json:"metadata,omitempty"` - - // The certificate request itself and any additional information. - Spec CertificateSigningRequestSpec `json:"spec,omitempty"` - - // Derived information about the request. - Status CertificateSigningRequestStatus `json:"status,omitempty"` -} - -// This information is immutable after the request is created. -type CertificateSigningRequestSpec struct { - // Base64-encoded PKCS#10 CSR data - Request string `json:"request"` - - // Any extra information the node wishes to send with the request. - ExtraInfo []string `json:"extrainfo,omitempty"` -} - -// This information is derived from the request by Kubernetes and cannot be -// modified by users. All information is optional since it might not be -// available in the underlying request. This is intended to aid approval -// decisions. 
-type CertificateSigningRequestStatus struct { - // Information about the requesting user (if relevant) - // See user.Info interface for details - Username string `json:"username,omitempty"` - UID string `json:"uid,omitempty"` - Groups []string `json:"groups,omitempty"` - - // Fingerprint of the public key in request - Fingerprint string `json:"fingerprint,omitempty"` - - // Subject fields from the request - Subject internal.Subject `json:"subject,omitempty"` - - // DNS SANs from the request - Hostnames []string `json:"hostnames,omitempty"` - - // IP SANs from the request - IPAddresses []string `json:"ipaddresses,omitempty"` - - Conditions []CertificateSigningRequestCondition `json:"conditions,omitempty"` -} - -type RequestConditionType string - -// These are the possible states for a certificate request. -const ( - Approved RequestConditionType = "Approved" - Denied RequestConditionType = "Denied" -) - -type CertificateSigningRequestCondition struct { - // request approval state, currently Approved or Denied. - Type RequestConditionType `json:"type"` - // brief reason for the request state - Reason string `json:"reason,omitempty"` - // human readable message with details about the request state - Message string `json:"message,omitempty"` - // If request was approved, the controller will place the issued certificate here. - Certificate []byte `json:"certificate,omitempty"` -} - -type CertificateSigningRequestList struct { - unversioned.TypeMeta `json:",inline"` - unversioned.ListMeta `json:"metadata,omitempty"` - - Items []CertificateSigningRequest `json:"items,omitempty"` -} -``` - -We also introduce CertificateSigningRequestList to allow listing all the CSRs in the cluster: - -```go -type CertificateSigningRequestList struct { - api.TypeMeta - api.ListMeta - - Items []CertificateSigningRequest -} -``` - -## Certificate Request Process - -### Node initialization - -When the kubelet executes it checks a location on disk for TLS assets -(currently `/var/run/kubernetes/kubelet.{key,crt}` by default). If it finds -them, it proceeds. If there are no TLS assets, the kubelet generates a keypair -and self-signed certificate. We propose the following optional behavior: - -1. Generate a keypair -2. Generate a CSR for that keypair with CN set to the hostname (or - `--hostname-override` value) and DNS/IP SANs supplied with whatever values - the host knows for itself. -3. Post the CSR to the CSR API endpoint. -4. Set a watch on the CSR object to be notified of approval or rejection. - -### Controller response - -The apiserver persists the CertificateSigningRequests and exposes the List of -all CSRs for an administrator to approve or reject. - -A new certificate controller watches for certificate requests. It must first -validate the signature on each CSR and add `Condition=Denied` on -any requests with invalid signatures (with Reason and Message incidicating -such). For valid requests, the controller will derive the information in -`CertificateSigningRequestStatus` and update that object. The controller should -watch for updates to the approval condition of any CertificateSigningRequest. -When a request is approved (signified by Conditions containing only Approved) -the controller should generate and sign a certificate based on that CSR, then -update the condition with the certificate data using the `/approval` -subresource. 
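Steps 1-3 of the node initialization flow can be pictured with ordinary command-line tools; this is only a sketch of the data involved (the kubelet itself would do this in-process with Go's crypto libraries), and the hostname/SAN handling is simplified.

```sh
# 1. Generate a keypair.
openssl genrsa -out kubelet.key 2048

# 2. Build a PKCS#10 CSR with CN set to the hostname; DNS/IP SANs would be
#    supplied via an openssl config file (or -addext on newer OpenSSL releases).
openssl req -new -key kubelet.key -subj "/CN=$(hostname -f)" -out kubelet.csr

# 3. The base64-encoded CSR data is what goes into the Request field of the
#    CertificateSigningRequest posted to
#    /apis/certificates/v1beta1/certificatesigningrequests.
base64 kubelet.csr
```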
- -### Manual CSR approval - -An administrator using `kubectl` or another API client can query the -CertificateSigningRequestList and update the approval condition of -CertificateSigningRequests. The default state is empty, indicating that there -has been no decision so far. A state of "Approved" indicates that the admin has -approved the request and the certificate controller should issue the -certificate. A state of "Denied" indicates that admin has denied the -request. An admin may also supply Reason and Message fields to explain the -rejection. - -## kube-apiserver support - -The apiserver will present the new endpoints mentioned above and support the -relevant object types. - -## kube-controller-manager support - -To handle certificate issuance, the controller-manager will need access to CA -signing assets. This could be as simple as a private key and a config file or -as complex as a PKCS#11 client and supplementary policy system. For now, we -will add flags for a signing key, a certificate, and a basic policy file. - -## kubectl support - -To support manual CSR inspection and approval, we will add support for listing, -inspecting, and approving or denying CertificateSigningRequests to kubectl. The -interaction will be similar to -[salt-key](https://docs.saltstack.com/en/latest/ref/cli/salt-key.html). - -Specifically, the admin will have the ability to retrieve the full list of -pending CSRs, inspect their contents, and set their approval conditions to one -of: - -1. **Approved** if the controller should issue the cert -2. **Denied** if the controller should not issue the cert - -The suggested command for listing is `kubectl get csrs`. The approve/deny -interactions can be accomplished with normal updates, but would be more -conveniently accessed by direct subresource updates. We leave this for future -updates to kubectl. - -## Security Considerations - -### Endpoint Access Control - -The ability to post CSRs to the signing endpoint should be controlled. As a -simple solution we propose that each node be provisioned with an auth token -(possibly static across the cluster) that is scoped via ABAC to only allow -access to the CSR endpoint. - -### Expiration & Revocation - -The node is responsible for monitoring its own certificate expiration date. -When the certificate is close to expiration, the kubelet should begin repeating -this flow until it successfully obtains a new certificate. If the expiring -certificate has not been revoked and the previous certificate request is still -approved, then it may do so using the same keypair unless the cluster policy -(see "Future Work") requires fresh keys. - -Revocation is for the most part an unhandled problem in Go, requiring each -application to produce its own logic around a variety of parsing functions. For -now, our suggested best practice is to issue only short-lived certificates. In -the future it may make sense to add CRL support to the apiserver's client cert -auth. - -## Future Work - -- revocation UI in kubectl and CRL support at the apiserver -- supplemental policy (e.g. cluster CA only issues 30-day certs for hostnames *.k8s.example.com, each new cert must have fresh keys, ...) -- fully automated provisioning (using a handshake protocol or external list of authorized machines) +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
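As a usage sketch of the manual path: the proposal itself suggests `kubectl get csrs` plus ordinary updates of the approval condition, while the dedicated approve/deny verbs shown below are the form that eventually shipped in kubectl and are included here only for orientation.

```sh
# List pending requests and inspect one.
kubectl get csr
kubectl get csr <name> -o yaml

# Record the admin's decision; the certificate controller then issues (or
# declines to issue) the certificate via the /approval subresource.
kubectl certificate approve <name>
kubectl certificate deny <name>
```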
\ No newline at end of file diff --git a/contributors/design-proposals/cluster-lifecycle/local-cluster-ux.md b/contributors/design-proposals/cluster-lifecycle/local-cluster-ux.md index 2933bce9..f0fbec72 100644 --- a/contributors/design-proposals/cluster-lifecycle/local-cluster-ux.md +++ b/contributors/design-proposals/cluster-lifecycle/local-cluster-ux.md @@ -1,156 +1,6 @@ -# Kubernetes Local Cluster Experience +Design proposals have been archived. -This proposal attempts to improve the existing local cluster experience for kubernetes. -The current local cluster experience is sub-par and often not functional. -There are several options to setup a local cluster (docker, vagrant, linux processes, etc) and we do not test any of them continuously. -Here are some highlighted issues: -- Docker based solution breaks with docker upgrades, does not support DNS, and many kubelet features are not functional yet inside a container. -- Vagrant based solution are too heavy and have mostly failed on macOS. -- Local linux cluster is poorly documented and is undiscoverable. -From an end user perspective, they want to run a kubernetes cluster. They care less about *how* a cluster is setup locally and more about what they can do with a functional cluster. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Primary Goals - -From a high level the goal is to make it easy for a new user to run a Kubernetes cluster and play with curated examples that require least amount of knowledge about Kubernetes. -These examples will only use kubectl and only a subset of Kubernetes features that are available will be exposed. - -- Works across multiple OSes - macOS, Linux and Windows primarily. -- Single command setup and teardown UX. -- Unified UX across OSes -- Minimal dependencies on third party software. -- Minimal resource overhead. -- Eliminate any other alternatives to local cluster deployment. - -## Secondary Goals - -- Enable developers to use the local cluster for kubernetes development. - -## Non Goals - -- Simplifying kubernetes production deployment experience. [Kube-deploy](https://github.com/kubernetes/kube-deploy) is attempting to tackle this problem. -- Supporting all possible deployment configurations of Kubernetes like various types of storage, networking, etc. - - -## Local cluster requirements - -- Includes all the master components & DNS (Apiserver, scheduler, controller manager, etcd and kube dns) -- Basic auth -- Service accounts should be setup -- Kubectl should be auto-configured to use the local cluster -- Tested & maintained as part of Kubernetes core - -## Existing solutions - -Following are some of the existing solutions that attempt to simplify local cluster deployments. - -### [Spread](https://github.com/redspread/spread) - -Spread's UX is great! -It is adapted from monokube and includes DNS as well. -It satisfies almost all the requirements, excepting that of requiring docker to be pre-installed. -It has a loose dependency on docker. -New releases of docker might break this setup. - -### [Kmachine](https://github.com/skippbox/kmachine) - -Kmachine is adapted from docker-machine. -It exposes the entire docker-machine CLI. -It is possible to repurpose Kmachine to meet all our requirements. - -### [Monokube](https://github.com/polvi/monokube) - -Single binary that runs all kube master components. -Does not include DNS. -This is only a part of the overall local cluster solution. 
- -### Vagrant - -The kube-up.sh script included in Kubernetes release supports a few Vagrant based local cluster deployments. -kube-up.sh is not user friendly. -It typically takes a long time for the cluster to be set up using vagrant and often times is unsuccessful on macOS. -The [Core OS single machine guide](https://coreos.com/kubernetes/docs/latest/kubernetes-on-vagrant-single.html) uses Vagrant as well and it just works. -Since we are targeting a single command install/teardown experience, vagrant needs to be an implementation detail and not be exposed to our users. - -## Proposed Solution - -To avoid exposing users to third party software and external dependencies, we will build a toolbox that will be shipped with all the dependencies including all kubernetes components, hypervisor, base image, kubectl, etc. -*Note: Docker provides a [similar toolbox](https://www.docker.com/products/docker-toolbox).* -This "Localkube" tool will be referred to as "Minikube" in this proposal to avoid ambiguity against Spread's existing ["localkube"](https://github.com/redspread/localkube). -The final name of this tool is TBD. Suggestions are welcome! - -Minikube will provide a unified CLI to interact with the local cluster. -The CLI will support only a few operations: -- **Start** - creates & starts a local cluster along with setting up kubectl & networking (if necessary) -- **Stop** - suspends the local cluster & preserves cluster state -- **Delete** - deletes the local cluster completely -- **Upgrade** - upgrades internal components to the latest available version (upgrades are not guaranteed to preserve cluster state) - -For running and managing the kubernetes components themselves, we can re-use [Spread's localkube](https://github.com/redspread/localkube). -Localkube is a self-contained go binary that includes all the master components including DNS and runs them using multiple go threads. -Each Kubernetes release will include a localkube binary that has been tested exhaustively. - -To support Windows and macOS, minikube will use [libmachine](https://github.com/docker/machine/tree/master/libmachine) internally to create and destroy virtual machines. -Minikube will be shipped with an hypervisor (virtualbox) in the case of macOS. -Minikube will include a base image that will be well tested. - -In the case of Linux, since the cluster can be run locally, we ideally want to avoid setting up a VM. -Since docker is the only fully supported runtime as of Kubernetes v1.2, we can initially use docker to run and manage localkube. -There is risk of being incompatible with the existing version of docker. -By using a VM, we can avoid such incompatibility issues though. -Feedback from the community will be helpful here. - -If the goal is to run outside of a VM, we can have minikube prompt the user if docker is unavailable or version is incompatible. -Alternatives to docker for running the localkube core includes using [rkt](https://coreos.com/rkt/docs/latest/), setting up systemd services, or a System V Init script depending on the distro. 
- -To summarize the pipeline is as follows: - -##### macOS / Windows - -minikube -> libmachine -> virtualbox/hyper V -> linux VM -> localkube - -##### Linux - -minikube -> docker -> localkube - -### Alternatives considered - -#### Bring your own docker - -##### Pros - -- Kubernetes users will probably already have it -- No extra work for us -- Only one VM/daemon, we can just reuse the existing one - -##### Cons - -- Not designed to be wrapped, may be unstable -- Might make configuring networking difficult on macOS and Windows -- Versioning and updates will be challenging. We can mitigate some of this with testing at HEAD, but we'll - inevitably hit situations where it's infeasible to work with multiple versions of docker. -- There are lots of different ways to install docker, networking might be challenging if we try to support many paths. - -#### Vagrant - -##### Pros - -- We control the entire experience -- Networking might be easier to build -- Docker can't break us since we'll include a pinned version of Docker -- Easier to support rkt or hyper in the future -- Would let us run some things outside of containers (kubelet, maybe ingress/load balancers) - -##### Cons - -- More work -- Extra resources (if the user is also running docker-machine) -- Confusing if there are two docker daemons (images built in one can't be run in another) -- Always needs a VM, even on Linux -- Requires installing and possibly understanding Vagrant. - -## Releases & Distribution - -- Minikube will be released independent of Kubernetes core in order to facilitate fixing of issues that are outside of Kubernetes core. -- The latest version of Minikube is guaranteed to support the latest release of Kubernetes, including documentation. -- The Google Cloud SDK will package minikube and provide utilities for configuring kubectl to use it, but will not in any other way wrap minikube. - +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
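The Linux pipeline above assumes a usable docker daemon. A rough sketch of the kind of pre-flight check minikube could run before choosing that path is shown below; it shells out to the standard `docker version` command, and the minimum-version constant and fallback message are illustrative assumptions rather than anything this proposal prescribes.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// Assumed minimum docker server version, for illustration only.
const minDockerVersion = "1.10"

func main() {
	// Ask the local docker daemon for its server version.
	out, err := exec.Command("docker", "version", "--format", "{{.Server.Version}}").Output()
	if err != nil {
		fmt.Println("docker is unavailable; prompt the user or fall back to the VM-based pipeline")
		return
	}
	version := strings.TrimSpace(string(out))
	fmt.Printf("found docker server %s (minimum assumed: %s)\n", version, minDockerVersion)
	// A real implementation would compare the versions here and warn on incompatibility.
}
```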
\ No newline at end of file diff --git a/contributors/design-proposals/cluster-lifecycle/runtimeconfig.md b/contributors/design-proposals/cluster-lifecycle/runtimeconfig.md index c1f30f5c..f0fbec72 100644 --- a/contributors/design-proposals/cluster-lifecycle/runtimeconfig.md +++ b/contributors/design-proposals/cluster-lifecycle/runtimeconfig.md @@ -1,66 +1,6 @@ -# Overview +Design proposals have been archived. -Proposes adding a `--feature-config` to core kube system components: -apiserver , scheduler, controller-manager, kube-proxy, and selected addons. -This flag will be used to enable/disable alpha features on a per-component basis. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Motivation - -Motivation is enabling/disabling features that are not tied to -an API group. API groups can be selectively enabled/disabled in the -apiserver via existing `--runtime-config` flag on apiserver, but there is -currently no mechanism to toggle alpha features that are controlled by -e.g. annotations. This means the burden of controlling whether such -features are enabled in a particular cluster is on feature implementors; -they must either define some ad hoc mechanism for toggling (e.g. flag -on component binary) or else toggle the feature on/off at compile time. - -By adding a`--feature-config` to all kube-system components, alpha features -can be toggled on a per-component basis by passing `enableAlphaFeature=true|false` -to `--feature-config` for each component that the feature touches. - -## Design - -The following components will all get a `--feature-config` flag, -which loads a `config.ConfigurationMap`: - -- kube-apiserver -- kube-scheduler -- kube-controller-manager -- kube-proxy -- kube-dns - -(Note kubelet is omitted, it's dynamic config story is being addressed -by [#29459](https://issues.k8s.io/29459)). Alpha features that are not accessed via an alpha API -group should define an `enableFeatureName` flag and use it to toggle -activation of the feature in each system component that the feature -uses. - -## Suggested conventions - -This proposal only covers adding a mechanism to toggle features in -system components. Implementation details will still depend on the alpha -feature's owner(s). The following are suggested conventions: - -- Naming for feature config entries should follow the pattern - "enable<FeatureName>=true". -- Features that touch multiple components should reserve the same key - in each component to toggle on/off. -- Alpha features should be disabled by default. Beta features may - be enabled by default. Refer to [this file](/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions) - for more detailed guidance on alpha vs. beta. - -## Upgrade support - -As the primary motivation for cluster config is toggling alpha -features, upgrade support is not in scope. Enabling or disabling -a feature is necessarily a breaking change, so config should -not be altered in a running cluster. - -## Future work - -1. The eventual plan is for component config to be managed by versioned -APIs and not flags ([#12245](https://issues.k8s.io/12245)). When that is added, toggling of features -could be handled by versioned component config and the component flags -deprecated. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
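As an illustration of the suggested `enable<FeatureName>=true` convention, a minimal sketch of parsing a `--feature-config` value and checking a toggle might look like the following. The feature names are hypothetical, and a real component would use the shared `config.ConfigurationMap` type rather than a plain map:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseFeatureConfig turns "enableFoo=true,enableBar=false" into a map.
func parseFeatureConfig(value string) map[string]bool {
	features := map[string]bool{}
	for _, pair := range strings.Split(value, ",") {
		kv := strings.SplitN(strings.TrimSpace(pair), "=", 2)
		if len(kv) != 2 {
			continue
		}
		if enabled, err := strconv.ParseBool(kv[1]); err == nil {
			features[kv[0]] = enabled
		}
	}
	return features
}

func main() {
	// Hypothetical value passed as --feature-config to one component.
	flags := parseFeatureConfig("enableAlphaFeature=true,enableOtherFeature=false")

	// Alpha features default to disabled when the key is absent.
	if flags["enableAlphaFeature"] {
		fmt.Println("alpha feature is on for this component")
	}
}
```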
\ No newline at end of file diff --git a/contributors/design-proposals/cluster-lifecycle/self-hosted-final-cluster.png b/contributors/design-proposals/cluster-lifecycle/self-hosted-final-cluster.png Binary files differdeleted file mode 100644 index e5302b07..00000000 --- a/contributors/design-proposals/cluster-lifecycle/self-hosted-final-cluster.png +++ /dev/null diff --git a/contributors/design-proposals/cluster-lifecycle/self-hosted-kubelet.md b/contributors/design-proposals/cluster-lifecycle/self-hosted-kubelet.md index 765086f2..f0fbec72 100644 --- a/contributors/design-proposals/cluster-lifecycle/self-hosted-kubelet.md +++ b/contributors/design-proposals/cluster-lifecycle/self-hosted-kubelet.md @@ -1,130 +1,6 @@ -# Proposal: Self-hosted kubelet +Design proposals have been archived. -## Abstract +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -In a self-hosted Kubernetes deployment (see [this -comment](https://github.com/kubernetes/kubernetes/issues/246#issuecomment-64533959) -for background on self hosted kubernetes), we have the initial bootstrap problem. -When running self-hosted components, there needs to be a mechanism for pivoting -from the initial bootstrap state to the kubernetes-managed (self-hosted) state. -In the case of a self-hosted kubelet, this means pivoting from the initial -kubelet defined and run on the host, to the kubelet pod which has been scheduled -to the node. -This proposal presents a solution to the kubelet bootstrap, and assumes a -functioning control plane (e.g. an apiserver, controller-manager, scheduler, and -etcd cluster), and a kubelet that can securely contact the API server. This -functioning control plane can be temporary, and not necessarily the "production" -control plane that will be used after the initial pivot / bootstrap. - -## Background and Motivation - -In order to understand the goals of this proposal, one must understand what -"self-hosted" means. This proposal defines "self-hosted" as a kubernetes cluster -that is installed and managed by the kubernetes installation itself. This means -that each kubernetes component is described by a kubernetes manifest (Daemonset, -Deployment, etc) and can be updated via kubernetes. - -The overall goal of this proposal is to make kubernetes easier to install and -upgrade. We can then treat kubernetes itself just like any other application -hosted in a kubernetes cluster, and have access to easy upgrades, monitoring, -and durability for core kubernetes components themselves. - -We intend to achieve this by using kubernetes to manage itself. However, in -order to do that we must first "bootstrap" the cluster, by using kubernetes to -install kubernetes components. This is where this proposal fits in, by -describing the necessary modifications, and required procedures, needed to run a -self-hosted kubelet. - -The approach being proposed for a self-hosted kubelet is a "pivot" style -installation. This procedure assumes a short-lived “bootstrap” kubelet will run -and start a long-running “self-hosted” kubelet. Once the self-hosted kubelet is -running the bootstrap kubelet will exit. As part of this, we propose introducing -a new `--bootstrap` flag to the kubelet. The behaviour of that flag will be -explained in detail below. - -## Proposal - -We propose adding a new flag to the kubelet, the `--bootstrap` flag, which is -assumed to be used in conjunction with the `--lock-file` flag. 
The `--lock-file` -flag is used to ensure only a single kubelet is running at any given time during -this pivot process. When the `--bootstrap` flag is provided, after the kubelet -acquires the file lock, it will begin asynchronously waiting on -[inotify](http://man7.org/linux/man-pages/man7/inotify.7.html) events. Once an -"open" event is received, the kubelet will assume another kubelet is attempting -to take control and will exit by calling `exit(0)`. - -Thus, the initial bootstrap becomes: - -1. "bootstrap" kubelet is started by $init system. -1. "bootstrap" kubelet pulls down "self-hosted" kubelet as a pod from a - daemonset -1. "self-hosted" kubelet attempts to acquire the file lock, causing "bootstrap" - kubelet to exit -1. "self-hosted" kubelet acquires lock and takes over -1. "bootstrap" kubelet is restarted by $init system and blocks on acquiring the - file lock - -During an upgrade of the kubelet, for simplicity we will consider 3 kubelets, -namely "bootstrap", "v1", and "v2". We imagine the following scenario for -upgrades: - -1. Cluster administrator introduces "v2" kubelet daemonset -1. "v1" kubelet pulls down and starts "v2" -1. Cluster administrator removes "v1" kubelet daemonset -1. "v1" kubelet is killed -1. Both "bootstrap" and "v2" kubelets race for file lock -1. If "v2" kubelet acquires lock, process has completed -1. If "bootstrap" kubelet acquires lock, it is assumed that "v2" kubelet will - fail a health check and be killed. Once restarted, it will try to acquire the - lock, triggering the "bootstrap" kubelet to exit. - -Alternatively, it would also be possible via this mechanism to delete the "v1" -daemonset first, allow the "bootstrap" kubelet to take over, and then introduce -the "v2" kubelet daemonset, effectively eliminating the race between "bootstrap" -and "v2" for lock acquisition, and the reliance on the failing health check -procedure. - -Eventually this could be handled by a DaemonSet upgrade policy. - -This will allow a "self-hosted" kubelet with minimal new concepts introduced -into the core Kubernetes code base, and remains flexible enough to work well -with future [bootstrapping -services](https://github.com/kubernetes/kubernetes/issues/5754). - -## Production readiness considerations / Out of scope issues - -* Deterministically pulling and running kubelet pod: we would prefer not to have - to loop until we finally get a kubelet pod. -* It is possible that the bootstrap kubelet version is incompatible with the - newer versions that were run in the node. For example, the cgroup - configurations might be incompatible. In the beginning, we will require - cluster admins to keep the configuration in sync. Since we want the bootstrap - kubelet to come up and run even if the API server is not available, we should - persist the configuration for bootstrap kubelet on the node. Once we have - checkpointing in kubelet, we will checkpoint the updated config and have the - bootstrap kubelet use the updated config, if it were to take over. -* Currently best practice when upgrading the kubelet on a node is to drain all - pods first. Automatically draining of the node during kubelet upgrade is out - of scope for this proposal. It is assumed that either the cluster - administrator or the daemonset upgrade policy will handle this. 
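A rough, Linux-only sketch of the `--bootstrap` behaviour described above (acquire the lock file, then exit as soon as another kubelet opens it) is shown below. The lock path is only an example, and error handling and inotify event parsing are heavily simplified:

```go
//go:build linux

package main

import (
	"os"
	"syscall"
)

func main() {
	const lockPath = "/var/run/kubelet.lock" // example path

	// Acquire the exclusive lock implied by --lock-file; this blocks
	// until no other kubelet holds it.
	f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_RDWR, 0600)
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		panic(err)
	}

	// With --bootstrap, watch the lock file for "open" events: another
	// kubelet trying to take the lock is the signal to step aside.
	ifd, err := syscall.InotifyInit()
	if err != nil {
		panic(err)
	}
	if _, err := syscall.InotifyAddWatch(ifd, lockPath, syscall.IN_OPEN); err != nil {
		panic(err)
	}

	// ... the bootstrap kubelet's real work would run in other goroutines ...

	buf := make([]byte, 4096)
	if _, err := syscall.Read(ifd, buf); err == nil {
		// Only IN_OPEN is watched, so any event means another kubelet
		// opened the lock file; hand over control.
		os.Exit(0)
	}
}
```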
- -## Other discussion - -Various similar approaches have been discussed -[here](https://github.com/kubernetes/kubernetes/issues/246#issuecomment-64533959) -and -[here](https://github.com/kubernetes/kubernetes/issues/23073#issuecomment-198478997). -Other discussion around the kubelet being able to be run inside a container is -[here](https://github.com/kubernetes/kubernetes/issues/4869). Note this isn't a -strict requirement as the kubelet could be run in a chroot jail via rkt fly or -other such similar approach. - -Additionally, [Taints and -Tolerations](../../docs/design/taint-toleration-dedicated.md), whose design has -already been accepted, would make the overall kubelet bootstrap more -deterministic. With this, we would also need the ability for a kubelet to -register itself with a given taint when it first contacts the API server. Given -that, a kubelet could register itself with a given taint such as -“component=kubelet”, and a kubelet pod could exist that has a toleration to that -taint, ensuring it is the only pod the “bootstrap” kubelet runs. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
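To illustrate the taint-based hand-off idea in the last paragraph, here is a small sketch of the objects involved, using today's `k8s.io/api/core/v1` types (which postdate this proposal); the key, value, and effect are just the example from the text:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	// Taint the kubelet would register itself with on first contact
	// with the API server.
	taint := v1.Taint{
		Key:    "component",
		Value:  "kubelet",
		Effect: v1.TaintEffectNoSchedule,
	}

	// Matching toleration carried by the self-hosted kubelet pod, making it
	// the only pod the "bootstrap" kubelet will run.
	toleration := v1.Toleration{
		Key:      "component",
		Operator: v1.TolerationOpEqual,
		Value:    "kubelet",
		Effect:   v1.TaintEffectNoSchedule,
	}

	fmt.Printf("taint: %+v\ntoleration: %+v\n", taint, toleration)
}
```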
\ No newline at end of file diff --git a/contributors/design-proposals/cluster-lifecycle/self-hosted-kubernetes.md b/contributors/design-proposals/cluster-lifecycle/self-hosted-kubernetes.md index de13b012..f0fbec72 100644 --- a/contributors/design-proposals/cluster-lifecycle/self-hosted-kubernetes.md +++ b/contributors/design-proposals/cluster-lifecycle/self-hosted-kubernetes.md @@ -1,103 +1,6 @@ -# Proposal: Self-hosted Control Plane +Design proposals have been archived. -Author: Brandon Philips <brandon.philips@coreos.com> +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Motivations -> Running our components in pods would solve many problems, which we'll otherwise need to implement other, less portable, more brittle solutions to, and doesn't require much that we don't need to do for other reasons. Full self-hosting is the eventual goal. -> -> - Brian Grant ([ref](https://github.com/kubernetes/kubernetes/issues/4090#issuecomment-74890508)) - -### What is self-hosted? - -Self-hosted Kubernetes runs all required and optional components of a Kubernetes cluster on top of Kubernetes itself. - -The advantages of a self-hosted Kubernetes cluster are: - -1. **Small Dependencies:** self-hosted should reduce the number of components required, on host, for a Kubernetes cluster to be deployed to a Kubelet (ideally running in a container). This should greatly simplify the perceived complexity of Kubernetes installation. -2. **Deployment consistency:** self-hosted reduces the number of files that are written to disk or managed via configuration management or manual installation via SSH. Our hope is to reduce the number of moving parts relying on the host OS to make deployments consistent in all environments. -3. **Introspection:** internal components can be debugged and inspected by users using existing Kubernetes APIs like `kubectl logs` -4. **Cluster Upgrades:** Related to introspection the components of a Kubernetes cluster are now subject to control via Kubernetes APIs. Upgrades of Kubelet's are possible via new daemon sets, API servers can be upgraded using daemon sets and potentially deployments in the future, and flags of add-ons can be changed by updating deployments, etc. -5. **Easier Highly-Available Configurations:** Using Kubernetes APIs will make it easier to scale up and monitor an HA environment without complex external tooling. Because of the complexity of these configurations tools that create them without self-hosted often implement significant complex logic. - -However, there is a spectrum of ways that a cluster can be self-hosted. To do this we are going to divide the Kubernetes cluster into a variety of layers beginning with the Kubelet (level 0) and going up to the add-ons (level 4). A cluster can self-host all of these levels 0-4 or only partially self-host. - - - -For example, a 0-4 self-hosted cluster means that the kubelet is a daemon set, the API server runs as a pod and is exposed as a service, and so on. While a 1-4 self-hosted cluster would have a system installed Kubelet. And a 2-4 system would have everything except etcd self-hosted. - -It is also important to point out that self-hosted stands alongside other methods to install and configure Kubernetes, including scripts like kube-up.sh, configuration management tools, and deb/rpm/etc packages. 
A non-goal of this self-hosted proposal is replacing or introducing anything that might impede these installation and management methods. In fact it is likely that by dogfooding Kubernetes APIs via self-hosted improvements will be made to Kubernetes components that will simplify other installation and management methods. - -## Practical Implementation Overview - -This document outlines the current implementation of "self-hosted Kubernetes" installation and upgrade of Kubernetes clusters based on the work that the teams at CoreOS and Google have been doing. The work is motivated by the early ["Self-hosted Proposal"](https://github.com/kubernetes/kubernetes/issues/246#issuecomment-64533959) by Brian Grant. - -The entire system is working today and is used by Bootkube, a Kubernetes Incubator project, to create 2-4 and 1-4 self-hosted clusters. All Tectonic clusters created since July 2016 are 2-4 self-hosted and will be moving to 1-4 early in 2017 as the self-hosted etcd work becomes stable in bootkube. This document outlines the implementation, not the experience. The experience goal is that users not know all of these details and instead get a working Kubernetes cluster out the other end that can be upgraded using the Kubernetes APIs. - -The target audience are projects in SIG Cluster Lifecycle thinking about and building the way forward for install and upgrade of Kubernetes. We hope to inspire direction in various Kubernetes components like kubelet and [kubeadm](https://github.com/kubernetes/kubernetes/pull/38407) to make self-hosted a compelling mainstream installation method. If you want a higher level demonstration of "Self-Hosted" and the value see this [video and blog](https://coreos.com/blog/self-hosted-kubernetes.html). - -### Bootkube - -Today, the first component of the installation of a self-hosted cluster is [`bootkube`](https://github.com/kubernetes-incubator/bootkube). Bootkube provides a temporary Kubernetes control plane that tells a kubelet to execute all of the components necessary to run a full blown Kubernetes control plane. When the kubelet connects to this temporary API server it will deploy the required Kubernetes components as pods. This diagram shows all of the moving parts: - - - -Note: In the future this temporary control plane may be replaced with a kubelet API that will enable injection of this state directly into the kubelet without a temporary Kubernetes API server. - -At the end of this process the bootkube can be shut down and the system kubelet will coordinate, through a POSIX lock (see `kubelet --exit-on-lock-contention`), to let the self-hosted kubelet take over lifecycle and management of the control plane components. The final cluster state looks like this: - - - -There are a few things to note. First, generally, the control components like the API server, etc will be pinned to a set of dedicated control nodes. For security policy, service discovery, and scaling reasons it is easiest to assume that control nodes will always exist on N nodes. - -Another challenge is load balancing the API server. Bedrock for the API server will be DNS, TLS, and a load balancer that live off cluster and that load balancer will want to only healthcheck a handful of servers for the API server port liveness probe. - -### Bootkube Challenges - -This process has a number of moving parts. Most notably the hand off of control from the "host system" to the Kubernetes self-hosted system. 
And things can go wrong: - -1) The self-hosted Kubelet is in a precarious position as there is no one around to restart the process if it crashes. The high level is that the system init system will watch for the Kubelet POSIX lock and start the system Kubelet if the lock is missing. Once the system Kubelet starts it will launch the self-hosted Kubelet. - -2) Recovering from reboots of single-master installations is a challenge as the Kubelet won't have an API server to talk to restart the self-hosted components. We are solving this today with "[user space checkpointing](https://github.com/kubernetes-incubator/bootkube/tree/master/cmd/checkpoint#checkpoint)" container in the Kubelet pod that will periodically check the pod manifests and persist them to the static pod manifest directory. Longer term we would like for the kubelet to be able to checkpoint itself without external code. - -## Long Term Goals - -Ideally bootkube disappears over time and is replaced by a [Kubelet pod API](https://github.com/kubernetes/kubernetes/issues/28138). The write API would enable an external installation program to setup the control plane of a self-hosted Kubernetes cluster without requiring an existing API server. - -[Checkpointing](https://github.com/kubernetes/kubernetes/issues/489) is also required to make for a reliable system that can survive a number of normal operations like full down scenarios of the control plane. Today, we can sufficiently do checkpointing external of the Kubelet process, but checkpointing inside of the Kubelet would be ideal. - -A simple updater can take care of helping users update from v1.3.0 to v1.3.1, etc over time. - -### Self-hosted Cluster Upgrades - -#### Kubelet upgrades - -The kubelet could be upgraded in a very similar process to that outlined in the self-hosted proposal. - -However, because of the challenges around the self-hosted Kubelet (see above) Tectonic currently has a 1-4 self-hosted cluster with an alternative Kubelet update scheme which side-steps the self-hosted Kubelet issues. First, a kubelet system service is launched that uses the [chrooted kubelet](https://github.com/kubernetes/community/pull/131) implemented by the [kubelet-wrapper](https://coreos.com/kubernetes/docs/latest/kubelet-wrapper.html). Then, when an update is required, a node annotation is made which is read by a long-running daemonset that updates the kubelet-wrapper configuration. This makes Kubelet versions updateable from the cluster API. - -#### API Server, Scheduler, and Controller Manager - -Upgrading these components is fairly straightforward. They are stateless, easily run in containers, and can be modeled as pods and services. Upgrades are simply a matter of deploying new versions, health checking them, and changing the service label selectors. - -In HA configurations the API servers should be able to be upgraded in-place one-by-one and rely on external load balancing or client retries to recover from the temporary downtime. This is not well tested upstream and something we need to fix (see known issues). - -#### etcd self-hosted - -As the primary data store of Kubernetes etcd plays an important role. Today, etcd does not run on top of the self-hosted cluster. However, progress is being made with the introduction of the [etcd Operator](https://coreos.com/blog/introducing-the-etcd-operator.html) and integration into [bootkube](https://github.com/kubernetes-incubator/bootkube/blob/848cf581451425293031647b5754b528ec5bf2a0/cmd/bootkube/start.go#L37). 
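As a sketch of the upgrade step described above for the API server, scheduler, and controller manager (deploy new versions, health check them, then change the service label selectors), the following uses a recent client-go to repoint a service at the new pods. The kubeconfig path, namespace, service name, and labels are illustrative; the proposal does not prescribe them:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/etc/kubernetes/admin.kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Once the new API server pods pass their health checks, repoint the
	// service at them by patching its label selector.
	patch := []byte(`{"spec":{"selector":{"k8s-app":"kube-apiserver","version":"v2"}}}`)
	_, err = client.CoreV1().Services("kube-system").Patch(
		context.TODO(), "kube-apiserver",
		types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
}
```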
- -### Highly-available Clusters - -Self-hosted will make operating highly-available clusters even easier. For internal critical components like the scheduler and controller manager, which already know how to leader elect using the Kubernetes leader election API, creating HA instances will be a simple matter of `kubectl scale` for most administrators. For the data store, etcd, the etcd Operator will ease much of the scaling concern. - -However, the API server will be a slightly trickier matter for most deployments as the API server relies on either external load balancing or external DNS in most common HA configurations. But, with the addition of Kubernetes label metadata on the [Node API](https://github.com/kubernetes/kubernetes/pull/39112) self-hosted may make it easier for systems administrators to create glue code that finds the appropriate Node IPs and adds them to these external systems. - -### Conclusions - -Kubernetes self-hosted is working today. Bootkube is an implementation of the "temporary control plane" and this entire process has been used by [`bootkube`](https://github.com/kubernetes-incubator/bootkube) users and Tectonic since the Kubernetes v1.4 release. We are excited to give users a simpler installation flow and sustainable cluster lifecycle upgrade/management. - -## Known Issues - -- [Health check endpoints for components don't work correctly](https://github.com/kubernetes-incubator/bootkube/issues/64#issuecomment-228144345) -- [kubeadm does do self-hosted, but isn't tested yet](https://github.com/kubernetes/kubernetes/pull/40075) -- The Kubernetes [versioning policy](/contributors/design-proposals/release/versioning.md) allows for version skew of kubelet and control plane but not skew between control plane components themselves. We must add testing and validation to Kubernetes that this skew works. Otherwise the work to make Kubernetes HA is rather pointless if it can't be upgraded in an HA manner as well. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/cluster-lifecycle/self-hosted-layers.png b/contributors/design-proposals/cluster-lifecycle/self-hosted-layers.png Binary files differdeleted file mode 100644 index 1dc3e06a..00000000 --- a/contributors/design-proposals/cluster-lifecycle/self-hosted-layers.png +++ /dev/null diff --git a/contributors/design-proposals/cluster-lifecycle/self-hosted-moving-parts.png b/contributors/design-proposals/cluster-lifecycle/self-hosted-moving-parts.png Binary files differdeleted file mode 100644 index 423add2e..00000000 --- a/contributors/design-proposals/cluster-lifecycle/self-hosted-moving-parts.png +++ /dev/null diff --git a/contributors/design-proposals/dir_struct.txt b/contributors/design-proposals/dir_struct.txt index ef61ae75..f0fbec72 100644 --- a/contributors/design-proposals/dir_struct.txt +++ b/contributors/design-proposals/dir_struct.txt @@ -1,244 +1,6 @@ -Uncategorized - create_sheet.py - design_proposal_template.md - dir_struct.txt - multi-platform.md - owners - readme.md -./sig-cli - get-describe-apiserver-extensions.md - kubectl-create-from-env-file.md - kubectl-extension.md - kubectl-login.md - kubectl_apply_getsetdiff_last_applied_config.md - multi-fields-merge-key.md - owners - preserve-order-in-strategic-merge-patch.md - simple-rolling-update.md -./network - command_execution_port_forwarding.md - external-lb-source-ip-preservation.md - flannel-integration.md - network-policy.md - networking.md - service-discovery.md - service-external-name.md -./resource-management - admission_control_limit_range.md - admission_control_resource_quota.md - device-plugin-overview.png - device-plugin.md - device-plugin.png - gpu-support.md - hugepages.md - resource-quota-scoping.md -./testing - flakiness-sla.md -./autoscaling - horizontal-pod-autoscaler.md - hpa-status-conditions.md - hpa-v2.md - initial-resources.md -./architecture - architecture.md - architecture.png - architecture.svg - identifiers.md - namespaces.md - principles.md -./api-machinery - add-new-patchstrategy-to-clear-fields-not-present-in-patch.md - admission_control.md - admission_control_event_rate_limit.md - admission_control_extension.md - aggregated-api-servers.md - api-chunking.md - api-group.md - apiserver-build-in-admission-plugins.md - apiserver-count-fix.md - apiserver-watch.md - auditing.md - bulk_watch.md - client-package-structure.md - controller-ref.md - csi-client-structure-proposal.md - csi-new-client-library-procedure.md - customresources-validation.md - dynamic-admission-control-configuration.md - event_compression.md - extending-api.md - garbage-collection.md - metadata-policy.md - protobuf.md - server-get.md - synchronous-garbage-collection.md - thirdpartyresources.md -./node - all-in-one-volume.md - annotations-downward-api.md - configmap.md - container-init.md - container-runtime-interface-v1.md - cpu-manager.md - cri-dockershim-checkpoint.md - disk-accounting.md - downward_api_resources_limits_requests.md - dynamic-kubelet-configuration.md - envvar-configmap.md - expansion.md - kubelet-auth.md - kubelet-authorizer.md - kubelet-cri-logging.md - kubelet-eviction.md - kubelet-hypercontainer-runtime.md - kubelet-rkt-runtime.md - kubelet-rootfs-distribution.md - kubelet-systemd.md - node-allocatable.md - optional-configmap.md - pleg.png - pod-cache.png - pod-lifecycle-event-generator.md - pod-pid-namespace.md - pod-resource-management.md - propagation.md - resource-qos.md - runtime-client-server.md - runtime-pod-cache.md - seccomp.md - 
secret-configmap-downwardapi-file-mode.md - selinux-enhancements.md - selinux.md - sysctl.md -./service-catalog - pod-preset.md -./instrumentation - core-metrics-pipeline.md - custom-metrics-api.md - metrics-server.md - monitoring_architecture.md - monitoring_architecture.png - performance-related-monitoring.md - resource-metrics-api.md - volume_stats_pvc_ref.md -./auth - access.md - apparmor.md - enhance-pluggable-policy.md - flex-volumes-drivers-psp.md - image-provenance.md - no-new-privs.md - pod-security-context.md - pod-security-policy.md - secrets.md - security.md - security_context.md - service_accounts.md -./multicluster - control-plane-resilience.md - federated-api-servers.md - federated-ingress.md - federated-placement-policy.md - federated-replicasets.md - federated-services.md - federation-clusterselector.md - federation-high-level-arch.png - federation-lite.md - federation-phase-1.md - federation.md - ubernetes-cluster-state.png - ubernetes-design.png - ubernetes-scheduling.png -./scalability - kubemark.md - kubemark_architecture.png - scalability-testing.md -./cluster-lifecycle - bootstrap-discovery.md - cluster-deployment.md - clustering.md - dramatically-simplify-cluster-creation.md - ha_master.md - high-availability.md - kubelet-tls-bootstrap.md - local-cluster-ux.md - runtimeconfig.md - self-hosted-final-cluster.png - self-hosted-kubelet.md - self-hosted-kubernetes.md - self-hosted-layers.png - self-hosted-moving-parts.png -./cluster-lifecycle/clustering - .gitignore - dockerfile - dynamic.png - dynamic.seqdiag - makefile - owners - readme.md - static.png - static.seqdiag -./release - release-notes.md - release-test-signal.md - versioning.md -./scheduling - hugepages.md - multiple-schedulers.md - nodeaffinity.md - pod-preemption.md - pod-priority-api.md - podaffinity.md - rescheduler.md - rescheduling-for-critical-pods.md - rescheduling.md - resources.md - scheduler_extender.md - taint-node-by-condition.md - taint-toleration-dedicated.md -./scheduling/images - .gitignore - owners - preemption_1.png - preemption_2.png - preemption_3.png - preemption_4.png -./apps - controller_history.md - cronjob.md - daemon.md - daemonset-update.md - deploy.md - deployment.md - indexed-job.md - job.md - obsolete_templates.md - selector-generation.md - stateful-apps.md - statefulset-update.md -./storage - containerized-mounter.md - containerized-mounter.md~ - default-storage-class.md - flexvolume-deployment.md - grow-volume-size.md - local-storage-overview.md - mount-options.md - owners - persistent-storage.md - pod-safety.md - volume-hostpath-qualifiers.md - volume-metrics.md - volume-ownership-management.md - volume-provisioning.md - volume-selectors.md - volume-snapshotting.md - volume-snapshotting.png - volumes.md -./aws - aws_under_the_hood.md -./gcp - gce-l4-loadbalancer-healthcheck.md -./cloud-provider - cloud-provider-refactoring.md - cloudprovider-storage-metrics.md +Design proposals have been archived. + +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). + + +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/gcp/OWNERS b/contributors/design-proposals/gcp/OWNERS deleted file mode 100644 index 5edab0a0..00000000 --- a/contributors/design-proposals/gcp/OWNERS +++ /dev/null @@ -1,8 +0,0 @@ -# See the OWNERS docs at https://go.k8s.io/owners - -reviewers: - - provider-gcp -approvers: - - provider-gcp -labels: - - sig/gcp diff --git a/contributors/design-proposals/gcp/gce-l4-loadbalancer-healthcheck.md b/contributors/design-proposals/gcp/gce-l4-loadbalancer-healthcheck.md index 22e6d17b..f0fbec72 100644 --- a/contributors/design-proposals/gcp/gce-l4-loadbalancer-healthcheck.md +++ b/contributors/design-proposals/gcp/gce-l4-loadbalancer-healthcheck.md @@ -1,61 +1,6 @@ -# GCE L4 load-balancers' health checks for nodes +Design proposals have been archived. -## Goal -Set up health checks for GCE L4 load-balancer to ensure it is only -targeting healthy nodes. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Motivation -On cloud providers which support external load balancers, setting the -type field to "LoadBalancer" will provision a L4 load-balancer for the -service ([doc](https://kubernetes.io/docs/concepts/services-networking/service/#type-loadbalancer)), -which load-balances traffic to k8s nodes. As of k8s 1.6, we don't -create health check for L4 load-balancer by default, which means all -traffic will be forwarded to any one of the nodes blindly. -This is undesired in cases: -- k8s components including kubelet dead on nodes. Nodes will be flipped -to unhealthy after a long propagation (~40s), even if we remove nodes -from target pool at that point it is too slow. -- kube-proxy dead on nodes while kubelet is still alive. Requests will -be continually forwarded to nodes that may not be able to properly route -traffic. - -For now, the only case health check will be created is for -[OnlyLocal Service](https://kubernetes.io/docs/tutorials/services/source-ip/#source-ip-for-services-with-typeloadbalancer). -We should have a node-level health check for load balancers that are used -by non-OnlyLocal services. - -## Design -Healthchecking the kube-proxys seems to be the best choice: -- kube-proxy runs on every nodes and it is the pivot for service traffic -routing. -- Port 10249 on nodes is currently used for both kube-proxy's healthz and -pprof. -- We already have a similar mechanism for healthchecking OnlyLocal services -in kube-proxy. - -The plan is to enable health check on all LoadBalancer services (if use GCP -as cloud provider). - -## Implementation -kube-proxy -- Separate healthz from pprof (/metrics) to use a different port and bind it -to 0.0.0.0. As we will only allow traffic from load-balancer source IPs, this -wouldn't be a big security concern. -- Make healthz check timestamp in iptables mode while always returns "ok" in -other modes. - -GCE cloud provider (through kube-controller-manager) -- Manage `k8s-l4-healthcheck` firewall and healthcheck resources. -These two resources should be shared among all non-OnlyLocal LoadBalancer -services. -- Add a new flag to pipe in the healthz port num as it is configurable on -kube-proxy. - -Version skew: -- Running higher version master (with L4 health check feature enabled) with -lower version nodes (without kube-proxy exposing healthz port) should fall -back to the original behavior (no health check). -- Rollback shouldn't be a big issue. 
Even if the health check is left on the Network -load-balancer, it will fail on all nodes and fall back to blindly forwarding -traffic. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
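The kube-proxy side of the design (a healthz endpoint, separate from pprof, that checks the last iptables sync timestamp and otherwise answers "ok") might look roughly like the sketch below. The port, staleness threshold, and wiring are illustrative only:

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// healthzServer tracks the last successful iptables sync and serves /healthz.
type healthzServer struct {
	mu       sync.Mutex
	lastSync time.Time
	maxAge   time.Duration
}

func (h *healthzServer) recordSync() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.lastSync = time.Now()
}

func (h *healthzServer) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	h.mu.Lock()
	stale := time.Since(h.lastSync) > h.maxAge
	h.mu.Unlock()
	if stale {
		// The load balancer's health checkers will mark this node unhealthy.
		http.Error(w, "iptables rules have not been synced recently", http.StatusServiceUnavailable)
		return
	}
	fmt.Fprint(w, "ok")
}

func main() {
	h := &healthzServer{maxAge: 2 * time.Minute} // example threshold
	h.recordSync()                               // would be called after every proxy rules sync

	mux := http.NewServeMux()
	mux.Handle("/healthz", h)

	// Bind to 0.0.0.0 so the cloud load balancer's health checkers can reach it;
	// firewall rules restrict access to load-balancer source IPs.
	panic(http.ListenAndServe("0.0.0.0:10256", mux))
}
```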
\ No newline at end of file diff --git a/contributors/design-proposals/instrumentation/OWNERS b/contributors/design-proposals/instrumentation/OWNERS deleted file mode 100644 index 3e1efb0c..00000000 --- a/contributors/design-proposals/instrumentation/OWNERS +++ /dev/null @@ -1,8 +0,0 @@ -# See the OWNERS docs at https://go.k8s.io/owners - -reviewers: - - sig-instrumentation-leads -approvers: - - sig-instrumentation-leads -labels: - - sig/instrumentation diff --git a/contributors/design-proposals/instrumentation/core-metrics-pipeline.md b/contributors/design-proposals/instrumentation/core-metrics-pipeline.md index 1ca5dbd9..f0fbec72 100644 --- a/contributors/design-proposals/instrumentation/core-metrics-pipeline.md +++ b/contributors/design-proposals/instrumentation/core-metrics-pipeline.md @@ -1,150 +1,6 @@ -# Core Metrics in kubelet +Design proposals have been archived. -**Author**: David Ashpole (@dashpole) +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -**Last Updated**: 1/31/2017 - -**Status**: Proposal - -This document proposes a design for the set of metrics included in an eventual Core Metrics Pipeline. - - -- [Core Metrics in kubelet](#core-metrics-in-kubelet) - - [Introduction](#introduction) - - [Definitions](#definitions) - - [Background](#background) - - [Motivations](#motivations) - - [Proposal](#proposal) - - [Non Goals](#non-goals) - - [Design](#design) - - [Metric Requirements:](#metric-requirements) - - [Proposed Core Metrics:](#proposed-core-metrics) - - [On-Demand Design:](#on-demand-design) - - [Future Work](#future-work) - - -## Introduction - -### Definitions -"Kubelet": The daemon that runs on every kubernetes node and controls pod and container lifecycle, among many other things. -["cAdvisor":](https://github.com/google/cadvisor) An open source container monitoring solution which only monitors containers, and has no concept of kubernetes constructs like pods or volumes. -["Summary API":](https://git.k8s.io/kubernetes/pkg/kubelet/apis/stats/v1alpha1/types.go) A kubelet API which currently exposes node metrics for use by both system components and monitoring systems. -["CRI":](/contributors/devel/sig-node/container-runtime-interface.md) The Container Runtime Interface designed to provide an abstraction over runtimes (docker, rkt, etc). -"Core Metrics": A set of metrics described in the [Monitoring Architecture](/contributors/design-proposals/instrumentation/monitoring_architecture.md) whose purpose is to provide metrics for first-class resource isolation and utilization features, including [resource feasibility checking](https://github.com/eBay/Kubernetes/blob/master/docs/design/resources.md#the-resource-model) and node resource management. -"Resource": A consumable element of a node (e.g. memory, disk space, CPU time, etc). -"First-class Resource": A resource critical for scheduling, whose requests and limits can be (or soon will be) set via the Pod/Container Spec. -"Metric": A measure of consumption of a Resource. - -### Background -The [Monitoring Architecture](/contributors/design-proposals/instrumentation/monitoring_architecture.md) proposal contains a blueprint for a set of metrics referred to as "Core Metrics". The purpose of this proposal is to specify what those metrics are, to enable work relating to the collection, by the kubelet, of the metrics. 
- -Kubernetes vendors cAdvisor into its codebase, and the kubelet uses cAdvisor as a library that enables it to collect metrics on containers. The kubelet can then combine container-level metrics from cAdvisor with the kubelet's knowledge of kubernetes constructs (e.g. pods) to produce the kubelet Summary statistics, which provides metrics for use by the kubelet, or by users through the [Summary API](https://git.k8s.io/kubernetes/pkg/kubelet/apis/stats/v1alpha1/types.go). cAdvisor works by collecting metrics at an interval (10 seconds, by default), and the kubelet then simply queries these cached metrics whenever it has a need for them. - -Currently, cAdvisor collects a large number of metrics related to system and container performance. However, only some of these metrics are consumed by the kubelet summary API, and many are not used. The kubelet [Summary API](https://git.k8s.io/kubernetes/pkg/kubelet/apis/stats/v1alpha1/types.go) is published to the kubelet summary API endpoint (stats/summary). Some of the metrics provided by the summary API are consumed by kubernetes system components, but many are included for the sole purpose of providing metrics for monitoring. - -### Motivations -The [Monitoring Architecture](/contributors/design-proposals/instrumentation/monitoring_architecture.md) proposal explains why a separate monitoring pipeline is required. - -By publishing core metrics, the kubelet is relieved of its responsibility to provide metrics for monitoring. -The third party monitoring pipeline also is relieved of any responsibility to provide these metrics to system components. - -cAdvisor is structured to collect metrics on an interval, which is appropriate for a stand-alone metrics collector. However, many functions in the kubelet are latency-sensitive (eviction, for example), and would benefit from a more "On-Demand" metrics collection design. - -### Proposal -This proposal is to use this set of core metrics, collected by the kubelet, and used solely by kubernetes system components to support "First-Class Resource Isolation and Utilization Features". This proposal is not designed to be an API published by the kubelet, but rather a set of metrics collected by the kubelet that will be transformed, and published in the future. - -The target "Users" of this set of metrics are kubernetes components (though not necessarily directly). This set of metrics itself is not designed to be user-facing, but is designed to be general enough to support user-facing components. - -### Non Goals -Everything covered in the [Monitoring Architecture](/contributors/design-proposals/instrumentation/monitoring_architecture.md) design doc will not be covered in this proposal. This includes the third party metrics pipeline, and the methods by which the metrics found in this proposal are provided to other kubernetes components. - -Integration with CRI will not be covered in this proposal. In future proposals, integrating with CRI may provide a better abstraction of information required by the core metrics pipeline to collect metrics. - -The kubelet API endpoint, including the format, url pattern, versioning strategy, and name of the API will be the topic of a follow-up proposal to this proposal. - -## Design -This design covers only metrics to be included in the Core Metrics Pipeline. - -High level requirements for the design are as follows: - - The kubelet collects the minimum possible number of metrics to provide "First-Class Resource Isolation and Utilization Features". 
- - Metrics can be fetched "On Demand", giving the kubelet more up-to-date stats. - -This proposal purposefully omits many metrics that may eventually become core metrics. This is by design. Once metrics are needed to support First-Class Resource Isolation and Utilization Features, they can be added to the core metrics API. - -### Metric Requirements -The core metrics api is designed to provide metrics for "First Class Resource Isolation and Utilization Features" within kubernetes. - -Many kubernetes system components currently support these features. Many more components that support these features are in development. -The following is not meant to be an exhaustive list, but gives the current set of use cases for these metrics. - -Metrics requirements for "First Class Resource Isolation and Utilization Features", based on kubernetes component needs, are as follows: - - - Kubelet - - Node-level usage metrics for Filesystems, CPU, and Memory - - Pod-level usage metrics for Filesystems and Memory - - Metrics Server (outlined in [Monitoring Architecture](/contributors/design-proposals/instrumentation/monitoring_architecture.md)), which exposes the [Resource Metrics API](/contributors/design-proposals/instrumentation/resource-metrics-api.md) to the following system components: - - Scheduler - - Node-level usage metrics for Filesystems, CPU, and Memory - - Pod-level usage metrics for Filesystems, CPU, and Memory - - Vertical-Pod-Autoscaler - - Node-level usage metrics for Filesystems, CPU, and Memory - - Pod-level usage metrics for Filesystems, CPU, and Memory - - Container-level usage metrics for Filesystems, CPU, and Memory - - Horizontal-Pod-Autoscaler - - Node-level usage metrics for CPU and Memory - - Pod-level usage metrics for CPU and Memory - - Cluster Federation - - Node-level usage metrics for Filesystems, CPU, and Memory - - kubectl top and Kubernetes Dashboard - - Node-level usage metrics for Filesystems, CPU, and Memory - - Pod-level usage metrics for Filesystems, CPU, and Memory - - Container-level usage metrics for Filesystems, CPU, and Memory - -### Proposed Core Metrics: -This section defines "usage metrics" for filesystems, CPU, and Memory. -As stated in Non-Goals, this proposal does not attempt to define the specific format by which these are exposed. For convenience, it may be necessary to include static information such as start time, node capacities for CPU, Memory, or filesystems, and more. - -```go -// CpuUsage holds statistics about the amount of cpu time consumed -type CpuUsage struct { - // The time at which these Metrics were updated. - Timestamp metav1.Time - // Cumulative CPU usage (sum of all cores) since object creation. - CumulativeUsageNanoSeconds *uint64 -} - -// MemoryUsage holds statistics about the quantity of memory consumed -type MemoryUsage struct { - // The time at which these metrics were updated. - Timestamp metav1.Time - // The amount of "working set" memory. This includes recently accessed memory, - // dirty memory, and kernel memory. - UsageBytes *uint64 -} - -// FilesystemUsage holds statistics about the quantity of local storage (e.g. disk) resources consumed -type FilesystemUsage struct { - // The time at which these metrics were updated. - Timestamp metav1.Time - // StorageIdentifier must uniquely identify the node-level storage resource that is consumed. - // It may utilize device, partition, filesystem id, or other identifiers. - StorageIdentifier string - // UsedBytes represents the disk space consumed, in bytes. 
- UsedBytes *uint64 - // UsedInodes represents the inodes consumed - UsedInodes *uint64 -} -``` - -### On-Demand Design -Interface: -The interface for exposing these metrics within the kubelet contains methods for fetching each relevant metric. These methods contains a "recency" parameter which specifies how recently the metrics must have been computed. Kubelet components which require very up-to-date metrics (eviction, for example), use very low values. Other components use higher values. - -Implementation: -To keep performance bounded while still offering metrics "On-Demand", all calls to get metrics are cached, and a minimum recency is established to prevent repeated metrics computation. Before computing new metrics, the previous metrics are checked to see if they meet the recency requirements of the caller. If the age of the metrics meet the recency requirements, then the cached metrics are returned. If not, then new metrics are computed and cached. - -## Future work -Suggested, tentative future work, which may be covered by future proposals: - - Decide on the format, name, and kubelet endpoint for publishing these metrics. - - Integrate with the CRI to allow compatibility with a greater number of runtimes, and to create a better runtime abstraction. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
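A simplified sketch of the On-Demand pattern described above, where callers state how recent the metrics must be and the provider returns cached values when they are fresh enough, could look like this. The interface shape and field names are illustrative and not the actual kubelet code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// MemoryUsage mirrors the shape of the core memory metric above (simplified).
type MemoryUsage struct {
	Timestamp  time.Time
	UsageBytes uint64
}

// onDemandProvider caches the last computed metrics and enforces a minimum
// recency so that repeated callers do not trigger repeated collection.
type onDemandProvider struct {
	mu         sync.Mutex
	cached     MemoryUsage
	minRecency time.Duration // lower bound on how often we recompute
	collect    func() MemoryUsage
}

// GetMemoryUsage returns metrics no older than maxAge, recomputing if needed.
func (p *onDemandProvider) GetMemoryUsage(maxAge time.Duration) MemoryUsage {
	p.mu.Lock()
	defer p.mu.Unlock()
	age := time.Since(p.cached.Timestamp)
	if age <= maxAge || age <= p.minRecency {
		return p.cached // cached value satisfies the caller's recency requirement
	}
	p.cached = p.collect()
	return p.cached
}

func main() {
	p := &onDemandProvider{
		minRecency: time.Second,
		collect: func() MemoryUsage {
			// Stand-in for actually reading cgroup statistics.
			return MemoryUsage{Timestamp: time.Now(), UsageBytes: 128 << 20}
		},
	}
	// Eviction-style caller: wants very fresh metrics.
	fmt.Println(p.GetMemoryUsage(100 * time.Millisecond))
	// Less latency-sensitive caller: happy with metrics up to a minute old.
	fmt.Println(p.GetMemoryUsage(time.Minute))
}
```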
\ No newline at end of file diff --git a/contributors/design-proposals/instrumentation/custom-metrics-api.md b/contributors/design-proposals/instrumentation/custom-metrics-api.md index a10e91d4..f0fbec72 100644 --- a/contributors/design-proposals/instrumentation/custom-metrics-api.md +++ b/contributors/design-proposals/instrumentation/custom-metrics-api.md @@ -1,329 +1,6 @@ -Custom Metrics API -================== +Design proposals have been archived. -The new [metrics monitoring vision](monitoring_architecture.md) proposes -an API that the Horizontal Pod Autoscaler can use to access arbitrary -metrics. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Similarly to the [master metrics API](resource-metrics-api.md), the new -API should be structured around accessing metrics by referring to -Kubernetes objects (or groups thereof) and a metric name. For this -reason, the API could be useful for other consumers (most likely -controllers) that want to consume custom metrics (similarly to how the -master metrics API is generally useful to multiple cluster components). -The HPA can refer to metrics describing all pods matching a label -selector, as well as an arbitrary named object. - -API Paths ---------- - -The root API path will look like `/apis/custom-metrics/v1alpha1`. For -brevity, this will be left off below. - -- `/{object-type}/{object-name}/{metric-name...}`: retrieve the given - metric for the given non-namespaced object (e.g. Node, PersistentVolume) - -- `/{object-type}/*/{metric-name...}`: retrieve the given metric for all - non-namespaced objects of the given type - -- `/{object-type}/*/{metric-name...}?labelSelector=foo`: retrieve the - given metric for all non-namespaced objects of the given type matching - the given label selector - -- `/namespaces/{namespace-name}/{object-type}/{object-name}/{metric-name...}`: - retrieve the given metric for the given namespaced object - -- `/namespaces/{namespace-name}/{object-type}/*/{metric-name...}`: retrieve the given metric for all - namespaced objects of the given type - -- `/namespaces/{namespace-name}/{object-type}/*/{metric-name...}?labelSelector=foo`: retrieve the given - metric for all namespaced objects of the given type matching the - given label selector - -- `/namespaces/{namespace-name}/metrics/{metric-name}`: retrieve the given - metric which describes the given namespace. 
- -For example, to retrieve the custom metric "hits-per-second" for all -ingress objects matching "app=frontend` in the namespaces "webapp", the -request might look like: - -``` -GET /apis/custom-metrics/v1alpha1/namespaces/webapp/ingress.extensions/*/hits-per-second?labelSelector=app%3Dfrontend` - ---- - -Verb: GET -Namespace: webapp -APIGroup: custom-metrics -APIVersion: v1alpha1 -Resource: ingress.extensions -Subresource: hits-per-second -Name: ResourceAll(*) -``` - -Notice that getting metrics which describe a namespace follows a slightly -different pattern from other resources; Since namespaces cannot feasibly -have unbounded subresource names (due to collision with resource names, -etc), we introduce a pseudo-resource named "metrics", which represents -metrics describing namespaces, where the resource name is the metric name: - -``` -GET /apis/custom-metrics/v1alpha1/namespaces/webapp/metrics/queue-length - ---- - -Verb: GET -Namespace: webapp -APIGroup: custom-metrics -APIVersion: v1alpha1 -Resource: metrics -Name: queue-length -``` - -NB: the branch-node LIST operations (e.g. `LIST -/apis/custom-metrics/v1alpha1/namespaces/webapp/pods/`) are unsupported in -v1alpha1. They may be defined in a later version of the API. - -API Path Design, Discovery, and Authorization ---------------------------------------------- - -The API paths in this proposal are designed to a) resemble normal -Kubernetes APIs, b) facilitate writing authorization rules, and c) -allow for discovery. - -Since the API structure follows the same structure as other Kubernetes -APIs, it allows for fine grained control over access to metrics. Access -can be controlled on a per-metric basic (each metric is a subresource, so -metrics may be whitelisted by allowing access to a particular -resource-subresource pair), or granted in general for a namespace (by -allowing access to any resource in the `custom-metrics` API group). - -Similarly, since metrics are simply subresources, a normal Kubernetes API -discovery document can be published by the adapter's API server, allowing -clients to discover the available metrics. - -Note that we introduce the syntax of having a name of ` * ` here since -there is no current syntax for getting the output of a subresource on -multiple objects. - -API Objects ------------ - -The request URLs listed above will return the `MetricValueList` type described -below (when a name is given that is not ` * `, the API should simply return a -list with a single element): - -```go - -// a list of values for a given metric for some set of objects -type MetricValueList struct { - metav1.TypeMeta`json:",inline"` - metav1.ListMeta`json:"metadata,omitempty"` - - // the value of the metric across the described objects - Items []MetricValue `json:"items"` -} - -// a metric value for some object -type MetricValue struct { - metav1.TypeMeta`json:",inline"` - - // a reference to the described object - DescribedObject ObjectReference `json:"describedObject"` - - // the name of the metric - MetricName string `json:"metricName"` - - // indicates the time at which the metrics were produced - Timestamp unversioned.Time `json:"timestamp"` - - // indicates the window ([Timestamp-Window, Timestamp]) from - // which these metrics were calculated, when returning rate - // metrics calculated from cumulative metrics (or zero for - // non-calculated instantaneous metrics). 
- WindowSeconds *int64 `json:"window,omitempty"` - - // the value of the metric for this - Value resource.Quantity -} -``` - -For instance, the example request above would yield the following object: - -```json -{ - "kind": "MetricValueList", - "apiVersion": "custom-metrics/v1alpha1", - "items": [ - { - "metricName": "hits-per-second", - "describedObject": { - "kind": "Ingress", - "apiVersion": "extensions", - "name": "server1", - "namespace": "webapp" - }, - "timestamp": SOME_TIMESTAMP_HERE, - "windowSeconds": "10", - "value": "10" - }, - { - "metricName": "hits-per-second", - "describedObject": { - "kind": "Ingress", - "apiVersion": "extensions", - "name": "server2", - "namespace": "webapp" - }, - "timestamp": ANOTHER_TIMESTAMP_HERE, - "windowSeconds": "10", - "value": "15" - } - ] -} -``` - -Semantics ---------- - -### Object Types ### - -In order to properly identify resources, we must use resource names -qualified with group names (since the group for the requests will always -be `custom-metrics`). - -The `object-type` parameter should be the string form of -`unversioned.GroupResource`. Note that we do not include version in this; -we simply wish to uniquely identify all the different types of objects in -Kubernetes. For example, the pods resource (which exists in the un-named -legacy API group) would be represented simply as `pods`, while the jobs -resource (which exists in the `batch` API group) would be represented as -`jobs.batch`. - -In the case of cross-group object renames, the adapter should maintain -a list of "equivalent versions" that the monitoring system uses. This is -monitoring-system dependent (for instance, the monitoring system might -record all HorizontalPodAutoscalers as in `autoscaling`, but should be -aware that HorizontalPodAutoscaler also exist in `extensions`). - -Note that for namespace metrics, we use a pseudo-resource called -`metrics`. Since there is no resource in the legacy API group, this will -not clash with any existing resources. - -### Metric Names ### - -Metric names must be able to appear as a single subresource. In particular, -metric names, *as passed to the API*, may not contain the characters '%', '/', -or '?', and may not be named '.' or '..' (but may contain these sequences). -Note, specifically, that URL encoding is not acceptable to escape the forbidden -characters, due to issues in the Go URL handling libraries. Otherwise, metric -names are open-ended. - -### Metric Values and Timing ### - -There should be only one metric value per object requested. The returned -metrics should be the most recently available metrics, as with the resource -metrics API. Implementers *should* attempt to return all metrics with roughly -identical timestamps and windows (when appropriate), but consumers should also -verify that any differences in timestamps are within tolerances for -a particular application (e.g. a dashboard might simply display the older -metric with a note, while the horizontal pod autoscaler controller might choose -to pretend it did not receive that metric value). - -### Labeled Metrics (or lack thereof) ### - -For metrics systems that support differentiating metrics beyond the -Kubernetes object hierarchy (such as using additional labels), the metrics -systems should have a metric which represents all such series aggregated -together. 
Additionally, implementors may choose to identify the individual -"sub-metrics" via the metric name, but this is expected to be fairly rare, -since it most likely requires specific knowledge of individual metrics. -For instance, suppose we record filesystem usage by filesystem inside the -container. There should then be a metric `filesystem/usage`, and the -implementors of the API may choose to expose more detailed metrics like -`filesystem/usage/my-first-filesystem`. - -### Resource Versions ### - -API implementors should set the `resourceVersion` field based on the -scrape time of the metric. The resource version is expected to increment -when the scrape/collection time of the returned metric changes. While the -API does not support writes, and does not currently support watches, -populating resource version preserves the normal expected Kubernetes API -semantics. - -Relationship to HPA v2 ----------------------- - -The URL paths in this API are designed to correspond to different source -types in the [HPA v2](../autoscaling/hpa-v2.md). Specifically, the `pods` source type -corresponds to a URL of the form -`/namespaces/$NS/pods/*/$METRIC_NAME?labelSelector=foo`, while the -`object` source type corresponds to a URL of the form -`/namespaces/$NS/$RESOURCE.$GROUP/$OBJECT_NAME/$METRIC_NAME`. - -The HPA then takes the results, aggregates them together (in the case of -the former source type), and uses the resulting value to produce a usage -ratio. - -The resource source type is taken from the API provided by the -"metrics" API group (the master/resource metrics API). - -The HPA will consume the API as a federated API server. - -Relationship to Resource Metrics API ------------------------------------- - -The metrics presented by this API may be a superset of those present in the -resource metrics API, but this is not guaranteed. Clients that need the -information in the resource metrics API should use that to retrieve those -metrics, and supplement those metrics with this API. - -Mechanical Concerns -------------------- - -This API is intended to be implemented by monitoring pipelines (e.g. -inside Heapster, or as an adapter on top of a solution like Prometheus). -It shares many mechanical requirements with normal Kubernetes APIs, such -as the need to support encoding different versions of objects in both JSON -and protobuf, as well as acting as a discoverable API server. For these -reasons, it is expected that implemenators will make use of the Kubernetes -genericapiserver code. If implementors choose not to use this, they must -still follow all of the Kubernetes API server conventions in order to work -properly with consumers of the API. - -Specifically, they must support the semantics of the GET verb in -Kubernetes, including outputting in different API versions and formats as -requested by the client. They must support integrating with API discovery -(including publishing a discovery document, etc). - -Location --------- - -The types and clients for this API will live in a separate repository -under the Kubernetes organization (e.g. `kubernetes/metrics`). This -repository will most likely also house other metrics-related APIs for -Kubernetes (e.g. historical metrics API definitions, the resource metrics -API definitions, etc). - -Note that there will not be a canonical implementation of the custom -metrics API under Kubernetes, just the types and clients. Implementations -will be left up to the monitoring pipelines. 
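As an illustration of how the HPA v2 source types described above map onto API paths, the sketch below builds the two URL forms. The helper names are made up for this example and are not part of any published client library.

```go
package main

import (
	"fmt"
	"net/url"
)

// apiRoot matches the root path given earlier in this proposal.
const apiRoot = "/apis/custom-metrics/v1alpha1"

// podsMetricPath builds the URL form used by the HPA v2 "pods" source type:
//   /namespaces/$NS/pods/*/$METRIC_NAME?labelSelector=...
func podsMetricPath(namespace, metric, selector string) string {
	return fmt.Sprintf("%s/namespaces/%s/pods/*/%s?labelSelector=%s",
		apiRoot, namespace, metric, url.QueryEscape(selector))
}

// objectMetricPath builds the form used by the "object" source type:
//   /namespaces/$NS/$RESOURCE.$GROUP/$OBJECT_NAME/$METRIC_NAME
func objectMetricPath(namespace, resource, group, name, metric string) string {
	return fmt.Sprintf("%s/namespaces/%s/%s.%s/%s/%s",
		apiRoot, namespace, resource, group, name, metric)
}

func main() {
	fmt.Println(podsMetricPath("webapp", "hits-per-second", "app=frontend"))
	fmt.Println(objectMetricPath("webapp", "ingress", "extensions", "server1", "hits-per-second"))
}
```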
- -Alternative Considerations --------------------------- - -### Quantity vs Float ### - -In the past, custom metrics were represented as floats. In general, -however, Kubernetes APIs are not supposed to use floats. The API proposed -above thus uses `resource.Quantity`. This adds a bit of encoding -overhead, but makes the API line up nicely with other Kubernetes APIs. - -### Labeled Metrics ### - -Many metric systems support labeled metrics, allowing for dimensionality -beyond the Kubernetes object hierarchy. Since the HPA currently doesn't -support specifying metric labels, this is not supported via this API. We -may wish to explore this in the future. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
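A small illustration of the Quantity-vs-float trade-off discussed above: fractional metric values can be carried losslessly as milli-units. This sketch assumes the standard `k8s.io/apimachinery` module is available.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// 10.5 hits/second expressed as a Quantity in milli-units: no float needed.
	q := resource.MustParse("10500m")
	fmt.Println(q.String())     // "10500m"
	fmt.Println(q.MilliValue()) // 10500
}
```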
\ No newline at end of file diff --git a/contributors/design-proposals/instrumentation/events-redesign.md b/contributors/design-proposals/instrumentation/events-redesign.md index bf2ae606..f0fbec72 100644 --- a/contributors/design-proposals/instrumentation/events-redesign.md +++ b/contributors/design-proposals/instrumentation/events-redesign.md @@ -1,384 +1,6 @@ -# Make Kubernetes Events useful and safe +Design proposals have been archived. -Status: Pending +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Version: Beta -Implementation Owner: gmarek@google.com - -Approvers: -- [X] thockin - API changes -- [X] briangrant - API changes -- [X] konryd - API changes from UI/UX side -- [X] pszczesniak - logging team side -- [X] wojtekt - performance side -- [ ] derekwaynecarr - "I told you so" Events person:) - -## Overview -This document describes an effort which aims at fixing few issues in current way Events are structured and implemented. This effort has two main goals - reduce performance impact that Events have on the rest of the cluster and add more structure to the Event object which is first and necessary step to make it possible to automate Event analysis. - -This doc combines those two goals in a single effort, which includes both API changes and changes in EventRecorder library. To finish this effort audit of all the Events in the system has to be done, "event style guide" needs to be written, but those are not a part of this proposal. - -Doc starts with more detailed description of the background and motivation for this change. After that introduction we describe our proposal in detail, including both API changes and EventRecorder/deduplication logic updates. Later we consider various effects of this proposal, including performance impact and backward compatibility. We finish with describing considered alternatives and presenting the work plan. - -## Background/motivation -There's a relatively wide agreement that current implementation of Events in Kubernetes is problematic. Events are supposed to give app developer insight into what's happening with his/her app. Important requirement for Event library is that it shouldn't cause/worsen performance problems in the cluster. - -The problem is that neither of those requirements are actually met. Currently Events are extremely spammy (e.g. Event is emitted when Pod is unable to schedule every few seconds) with unclear semantics (e.g. Reason was understood by developers as "reason for taking action" or "reason for emitting event"). Also there are well known performance problems caused by Events (e.g. #47366, #47899) - Events can overload API server if there's something wrong with the cluster (e.g. some correlated crashloop on many Nodes, user created way more Pods that fit on the cluster which fail to schedule repeatedly). This was raised by the community on number of occasions. - -Our goal is to solve both those problems, i.e.: -Update Event semantics such that they'll be considered useful by app developers. -Reduce impact that Events have on the system's performance and stability. - -Those two loosely coupled efforts will drastically improve users experience when they'll need more insight into what's happening with an application. - -In the rest of document I'll shortly characterize both efforts and explain where they interact. 
- -Style guide for writing Events will be created as a part of this effort and all new Events will need to go through API-like review (process will be described in the style guide). - -### Non goals -It's not a goal of this effort to persist Events outside of etcd or for longer time. - -### Current Event API -Current Event object consists of: -- InvolvedObject (ObjectRef) -- First/LastSeenTimestamp (1s precision time when Event in a given group was first/last seen) -- Reason (short, machine understandable description of what happened that Event was emitted, e.g. ImageNotFound) -- Message (longer description of what happened, e.g. "failed to start a Pod <PodName>" -- Source (component that emits event + its host) -- Standard object stuff (ObjectMeta) -- Type (Normal/Warning) - -Deduplication logic groups together Event which have the same: -- Source Component and Host -- InvolvedObject Kind, Namespace, Name, API version and UID, -- Type (works as Severity) -- Reason - -In particular it ignores Message. It starts grouping events with different messages only after 10 single ones are emitted, which is confusing. - -Current deduplication can be split into two kinds: deduplication happening when "Messages" are different, and one happening when "Messages" are the same. - -First one occurs mostly in Controllers, which create Events for creating single Pods, by putting Pod data inside message. Current Event logic means that we'll create 10 separate Events, with InvolvedObject being, e.g. ReplicationController, and messages saying "RC X created Pod Y" or something equivalent. Afterwards we'll have single Event object with InvolvedObject being the same ReplicationController, but with the message saying "those events were deduped" and count set to size of RC minus 10. Because part of semantics of a given Event is included in the `message` field. - -Deduplication on identical messages can be seen in retry loops. - -### Usability issues with the current API -Users would like to be able to use Events also for debugging and trace analysis of Kubernetes clusters. Current implementation makes it hard for the following reasons: -1s granularity of timestamps (system reacts much quicker than that, making it more or less unusable), -deduplication, that leaves only count and, first and last timestamps (e.g. when Controller is creating a number of Pods information about it is deduplicated), -`InvolvedObject`, `Message`, `Reason` and `Source` semantics are far from obvious. If we treat `Event` as a sentence object of this sentence is stored either in `Message` (if the subject is a Kubernetes object (e.g. Controller)), or in `InvolvedObject`, if the subject is some kind of a controller (e.g. Kubelet). -hard to query for interesting series using standard tools (e.g. all Events mentioning given Pod is pretty much impossible because of deduplication logic) -As semantic information is passed in the message, which in turn is ignored by the deduplication logic it is not clear that this mechanism will not cause deduplication of Events that are completely different. 
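A minimal sketch of the grouping behaviour described above: the message is deliberately absent from the key, which is why events with different messages can still be deduplicated together. The struct is illustrative, not the actual event correlator code.

```go
package main

import "fmt"

// eventKey mirrors the grouping described above: source, involved object,
// type and reason. The message is intentionally not part of the key.
type eventKey struct {
	sourceComponent, sourceHost string
	objectKind, objectNamespace, objectName, objectUID, apiVersion string
	eventType, reason string
}

func main() {
	counts := map[eventKey]int{}
	k := eventKey{
		sourceComponent: "kubelet", sourceHost: "node-1",
		objectKind: "Pod", objectNamespace: "default", objectName: "web-0",
		objectUID: "1234-abcd", apiVersion: "v1",
		eventType: "Warning", reason: "FailedMount",
	}
	// Two events with different messages collapse onto the same key.
	counts[k]++
	counts[k]++
	fmt.Println(counts[k]) // 2
}
```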
- -## Proposal - -### High level ideas (TL;DR): -When this proposal is implemented users and administrators: -will be able to better track interesting changes in the state of objects they're interested in -will be convinced that Events do not destabilize their clusters - -### API changes goals -We want to achieve following things: -Make it easy to list all interesting Events in common scenarios using kubectl: -Listing Events mentioning given Pod, -Listing Events emitted by a given component (e.g. Kubelet on a given machine, NodeController), -Make timestamps precise enough to allow better events correlation, -Update the field names to better indicate their function. - -### API changes -Make all semantic information about events first-class fields, allowing better deduplication and querying -Add "action" to "reason" to reduce confusion about the semantics of them, -Add "related" field to denote second object taking part in the action, -Increase timestamp precision. - -### Performance changes -"Event series" detection and sending only "series start" and "series finish" Events, -Add more aggressive backoff policy for Events, -API changes -We'd like to propose following structure in Events object in the new events API group: - -```golang -type Event struct { - // <type and object metadata> - - // Time when this Event was first observed. - EventTime metav1.MicroTime - - // Data about the Event series this event represents or nil if it's - // a singleton Event. - // +optional - Series *EventSeries - - // Name of the controller that emitted this Event, e.g. `kubernetes.io/kubelet`. - ReportingController string - - // ID of the controller instance, e.g. `kubelet-xyzf`. - ReportingInstance string - - // What action was taken or what failed regarding the Regarding object. - Action string - - // Why the action was taken or why the operation failed. - Reason string - - // The object this Event is “about”. In most cases it's the object that the - // given controller implements. - // +optional - Regarding ObjectReference - - // Optional secondary object for more complex actions. - // +optional - Related *ObjectReference - - // Human readable description of the Event. Possibly discarded when and - // Event series is being deduplicated. - // +optional - Note string - - // Type of this event (Normal, Warning), new types could be added in the - // future. 
- // +optional - Type string -} - -type EventSeries struct { - Count int32 - LastObservedTime MicroTime - State EventSeriesState -} - -const ( - EventSeriesStateOngoing = "Ongoing" - EventSeriesStateFinished = "Finished" - EventSeriesStateUnknown = "Unknown" -) -``` - -### Few examples: - -| Regarding | Action | Reason | ReportingController | Related | -| ----------| -------| -------| --------------------|---------| -| Node X | BecameUnreachable | HeartbeatTooOld | kubernetes.io/node-ctrl | <nil> | -| Node Y | FailedToAttachVolume | Unknown | kubernetes.io/pv-attach-ctrl | PVC X | -| ReplicaSet X | FailedToInstantiatePod | QuotaExceeded | kubernetes.io/replica-set-ctrl | <nil> | -| ReplicaSet X | InstantiatedPod | | kubernetes.io/replica-set-ctrl | Pod Y | -| Ingress X | CreatedLoadBalancer | | kubernetes.io/ingress-ctrl | <nil> | -| Pod X | ScheduledOn | | kubernetes.io/scheduler | Node Y | -| Pod X | FailedToSchedule | FitResourcesPredicateFailed | kubernetes.io/scheduler | <nil> | - -### Comparison between old and new API: - -| Old | New | -|-------------|-------------| -| Old Event { | New Event { | -| TypeMeta | TypeMeta | -| ObjectMeta | ObjectMeta | -| InvolvedObject ObjectReference | Regarding ObjectReference | -| | Related *ObjectReference | -| | Action string | -| Reason string | Reason string | -| Message string | Note string | -| Source EventSource | | -| | ReportingController string | -| | ReportingInstance string | -| FirstTimestamp metav1.Time | | -| LastTimestamp metav1.Time | | -| | EventTime metav1.MicroTime | -| Count int32 | | -| | Series EventSeries | -| Type string | Type string | -| } | } | - -Namespace in which Event will live will be equal to -- Namespace of Regarding object, if it's namespaced, -- NamespaceSystem, if it's not. - -Note that this means that if Event has both Regarding and Related objects, and only one of them is namespaced, it should be used as Regarding object. - -The biggest change is the semantics of the Event object in case of loops. If Series is nil it means that Event is a singleton, i.e. it happened only once and the semantics is exactly the same as currently in Events with `count = 1`. If Series is not nil it means that the Event is either beginning or the end of an Event series - equivalence of current Events with `count > 1`. Events for ongoing series have Series.State set to EventSeriesStateOngoing, while endings have Series.State set to EventSeriesStateFinished. - -This change is better described in the section below. - -## Performance improvements design -We want to replace current behavior, where EventRecorder patches Event object every time when deduplicated Event occurs with an approach where being in the loop is treated as a state, hence Events only should be updated only when system enters or exits loop state (or is a singleton Event). - -Because Event object TTL in etcd we can't have above implemented cleanly, as we need to update Event objects periodically to prevent etcd garbage collection from removing ongoing series. We can use this need to update users with new data about number of occurrences. - -The assumption we make for deduplication logic after API changes is that Events with the same <Regarding, Action, Reason, ReportingController, ReportingInstance, Related> tuples are considered isomorphic. This allows us to define notion of "event series", which is series of isomorphic events happening not farther away from each other than some defined threshold. E.g. 
Events happening every second are considered a series, but Events happening every hour are not. - -The main goal of this change is to limit number of API requests sent to the API server to the minimum. This is important as overloading the API server can severely impact usability of the system. - -In the absence of errors in the system (all Pods are happily running/starting, Nodes are healthy, etc.) the number of Events is easily manageable by the system. This means that it's enough to concentrate on erroneous states and limit number of Events published when something's wrong with the cluster. - -There are two cases to consider: Event series, which result in ~1 API call per ~30 minutes, so won't cause a problem until there's a huge number of them; and huge number of non-series Events. To improve the latter we require that no high-cardinality data are put into any of Regarding, Action, Reason, ReportingController, ReportingInstance, Related fields. Which bound the number of Events to O(number of objects in the system^2). - -## Changes in EventRecorder -EventRecorder is our client library for Events that are used in components to emit Events. The main function in this library is `Eventf`, which takes the data and passes it to the EventRecorder backend, which does deduplication and forwards it to the API server. - -We need to write completely new deduplication logic for new Events, preserving the old one to avoid necessity to rewrite all places when Events are used together with this change. Additionally we need to add a new `Eventf`-equivalent function to the interface that will handle creation of new kind of events. - -New deduplication logic will work in the following way: -- When event is emitted for the first time it's written to the API server without series field set. -- When isomorphic event is emitted within the threshold from the original one EventRecorder detects the start of the series, updates the Event object, with the Series field set carrying count and sets State to EventSeriesStateOngoing. In the EventRecorder it also creates an entry in `activeSeries` map with the timestamp of last observed Event in the series. -- All subsequent isomorphic Events don't result in any API calls, they only update last observed timestamp value and count in the EventRecorder. -- For all active series every 30 minutes EventRecorder will create a "heartbeat" call. Goal of this update is to periodically update user on number of occurrences and prevent garbage collection in etcd. The heartbeat will be an Event update that updates the count and last observed time fields in the series field. -- For all active series every 6 minutes (longer that the longest backoff period) EventRecorder will check if it noticed any attempts to emit isomorphic Event. If there were, it'll check again after aforementioned period (6 minutes). If there weren't it assumes that series is finished and emits closing Event call. This updates the Event by setting state to EventSeriesStateFinished to true and updating the count and last observed time fields in the series field. 
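A rough sketch of the series bookkeeping just described, using the 6-minute finish check and the 30-minute heartbeat; the names and structure are illustrative only, not the proposed EventRecorder implementation.

```go
package main

import (
	"fmt"
	"time"
)

// seriesState tracks one active event series inside the recorder.
type seriesState struct {
	count        int32
	lastObserved time.Time
}

const (
	// finishThreshold mirrors the "no isomorphic Event for 6 minutes" rule.
	finishThreshold = 6 * time.Minute
	// heartbeatEvery mirrors the 30-minute refresh that keeps etcd's TTL from
	// garbage-collecting an ongoing series.
	heartbeatEvery = 30 * time.Minute
)

// shouldFinish reports whether the series should be closed with a
// State=Finished update on the next sweep.
func shouldFinish(s seriesState, now time.Time) bool {
	return now.Sub(s.lastObserved) >= finishThreshold
}

func main() {
	s := seriesState{count: 4242, lastObserved: time.Now().Add(-7 * time.Minute)}
	fmt.Println(shouldFinish(s, time.Now())) // true: emit the closing update
	_ = heartbeatEvery                       // used by the periodic heartbeat loop (not shown)
}
```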
- -### Short example: -After first occurrence, Event looks like: -``` -{ - regarding: A, - action: B, - reportingController: C, - ..., -} -``` -After second occurrence, Event looks like: -``` -{ - regarding: A, - action: B, - reportingController: C, - ..., - series: {count: 2, state: "Ongoing"}, -} -``` -After half an hour of crashlooping, Event looks like: -``` -{ - regarding: A, - action: B, - reportingController: C, - ..., - series: {count: 4242, state: "Ongoing"}, -} -``` -Minute after crashloop stopped, Event looks like: -``` -{ - regarding: A, - action: B, - reportingController: C, - ..., - series: {count: 424242, state: "Finished"}, -} -``` - -### Client side changes -All clients will need to eventually migrate to use new Events, but no other actions are required from them. Deduplication logic change will be completely transparent after the move to the new API. - -### Restarts -Event recorder library will list all Events emitted by corresponding components and reconstruct internal activeSeries map from it. - -## Defence in depth -Because Events proved problematic we want to add multiple levels of protection in the client library to reduce chances that Events will be overloading API servers in the future. We propose to do two things. - -### Aggressive backoff -We need to make sure that kubernetes client used by EventRecorder uses properly configured and backoff pretty aggressively. Events should not destabilize the cluster, so if EventRecorder receives 429 response it should exponentially back off for non-negligible amount of time, to let API server recover. - -### Other related changes -To allow easier querying we need to make following fields selectable for Events: -- event.reportingComponent -- event.reportingInstance -- event.action -- event.reason -- event.regarding... -- event.related... -- event.type - -Kubectl will need to be updated to use new Events if present. - -## Considerations - -### Performance impact -We're not changing how Events are stored in the etcd (except adding new fields to the storage type). We'll keep current TTL for all Event objects. - -Proposed changes alone will have possibly three effects on performance: we will emit more Events for Pod creation (disable deduplication for "Create Pod" Event emitted by controllers), we will emit fewer Events for hotloops (3 API calls + 1 call/30min per hotloop series, instead of 1/iteration), and Events will be bigger. This means that Event memory footprint will grow slightly, but in the unhealthy looping state number of API calls will be reduced significantly. - -We looked at the amount of memory used in our performance tests in cluster of various size. The results are following: - -| | 5 nodes | 100 nodes | 500 nodes | 5000 nodes | -|-|---------|-----------|-----------|------------| -| event-etcd | 28MB | 65MB | 161MB | n/a | -| All master component | 530MB | 1,2GB | 3,9GB | n/a | -| Excess resources in default config | 3,22GB | 13,8GB | 56,1GB | n/a | - -The difference in size of the Event object comes from new Action and Related fields. We can safely estimate the increase to be smaller than 30%. We'll also emit additional Event per Pod creation, as currently Events for that are being deduplicated. There are currently at least 6 Events emitted when Pod is started, so impact of this change can be bounded by 20%. This means that in the worst case the increase in Event size can be bounded by 56%. As can be seen in the table above we can easily afford such increase. 
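For clarity, the 56% worst-case bound quoted above combines the two effects multiplicatively:

```
1.30 (<=30% larger Event objects) x 1.20 (<=20% more Events per Pod start) = 1.56  =>  ~56% increase
```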
- -### Backward compatibility -Kubernetes API machinery moves towards moving all resources for which it make sense to separate API groups e.g. to allow defining separate storage for it. For this reason we're going to create a new `events` API group in which Event resources will live. - -In the same time we can't stop emitting v1.Events from the Core group as this is considered breaking API change. For this reason we decided to create a new API group for events but map it to the same internal type as core Events. - -As objects are stored in the versioned format we need to add new fields to the Core group, as we're going to use Core group as storage format for new Events. - -After the change we'll have three types of Event objects. Internal representation (denoted internal), "old" core API group type (denoted core) and "new" events API group (denoted events). They will look in the following way - green color denotes added fields: - -| internal.Event | core.Event | events.Event | -|----------------|------------|--------------| -| TypeMeta | TypeMeta | TypeMeta | -| ObjectMeta | ObjectMeta | ObjectMeta | -| InvolvedObject ObjectReference | InvolvedObject ObjectReference | Regarding ObjectReference | -| Related *ObjectReference | Related *ObjectReference | Related *ObjectReference | -| Action string | Action string | Action string | -| Reason string | Reason string | Reason string | -| Message string | Message string | Note string | -| Source.Component string | Source.Component string | ReportingController string | -| Source.Host string | Source.Host string | DeprecatedHost string | -| ReportingInstance string | ReportingInstance string | ReportingInstance string | -| FirstTimestamp metav1.Time | FirstTimestamp metav1.Time | DeprecatedFirstTimestamp metav1.Time | -| LastTimestamp metav1.Time | LastTimestamp metav1.Time | DeprecatedLastTimestamp metav1.Time | -| EventTime metav1.MicroTime | EventTime metav1.MicroTime | EventTime metav1.MicroTime | -| Count int32 | Count int32 | DeprecatedCount int32 | -| Series.Count int32 | Series.Count int32 | Series.Count int32 | -| Series.LastObservedTime | Series.LastObservedTime | Series.LastObservedTime | -| Series.State string | Series.State string | Series.State string | -| Type string | Type string | Type string | - -Considered alternative was to create a separate type that will hold all additional fields in core.Event type. It was dropped, as it's not clear it would help with the clarity of the API. - -There will be conversion functions that'll allow reading/writing Events as both core.Event and events.Event types. As we don't want to officially extend core.Event type, new fields will be set only if Event would be written through events.Event endpoint (e.g. if Event will be created by core.Event endpoint EventTime won't be set). - -This solution gives us clean(-ish) events.Event API and possibility to implement separate storage for Events in the future. The cost is adding more fields to core.Event type. We think that this is not a big price to pay, as the general direction would be to use separate API groups more and core group less in the future. - -`Events` API group will be added directly as beta API, as otherwise kubernetes component's wouldn't be allowed to use it. 
- -### Sample queries with "new" Events - -#### Get all NodeController Events -List Events from the NamespaceSystem with field selector `reportingController = "kubernetes.io/node-controller"` - -#### Get all Events from lifetime of a given Pod -List all Event with field selector `regarding.name = podName, regarding.namespace = podNamespace`, and `related.name = podName, related.namespace = podNamespace`. You need to join results outside of the kubernetes API. - -### Related work -There's ongoing effort for adding Event deduplication and teeing to the server side. It will allow even easier usage of Events, but in principle it's independent work that should not interfere with one proposed here. - -Another effort to protect API server from too many Events by dropping requests servers side in admission plugin is worked on by @staebler. -## Considered alternatives for API changes -### Leaving current dedup mechanism but improve backoff behavior -As we're going to move all semantic information to fields, instead of passing some of them in message, we could just call it a day, and leave the deduplication logic as is. When doing that we'd need to depend on the client-recorder library on protecting API server, by using number of techniques, like batching, aggressive backing off and allowing admin to reduce number of Events emitted by the system. This solution wouldn't drastically reduce number of API requests and we'd need to hope that small incremental changes would be enough. - -### Timestamp list as a dedup mechanism -Another considered solution was to store timestamps of Events explicitly instead of only count. This gives users more information, as people complain that current dedup logic is too strong and it's hard to "decompress" Event if needed. This change has clearly worse performance characteristic, but fixes the problem of "decompressing" Events and generally making deduplication lossless. We believe that individual repeated events are not interesting per se, what's interesting is when given series started and when it finished, which is how we ended with the current proposal. - -### Events as an aggregated object -We considered adding nested information about occurrences into the Event. In other words we'd have single Event object per Subject and instead of having only `Count`, we could have stored slice of `timestamp-object` pairs, as a slightly heavier deduplication information. This would have non-negligible impact on size of the event-etcd, and additional price for it would be much harder query logic (querying nested slices is currently not implemented in kubernetes API), e.g. "Give me all Events that refer Pod X" would be hard. - -### Using new API group for storing data -Instead of adding "new" fields to the "old" versioned type, we could have change the version in which we store Events to the new group and use annotations to store "deprecated" fields. This would allow us to avoid having "hybrid" type, as `v1.Events` became, but the change would have a much higher risk (we would have been moving battle-tested and simple `v1.Event` store to new `events.Event` store with some of the data present only in annotations). Additionally performance would degrade, as we'd need to parse JSONs from annotations to get values for "old" fields. -Adding panic button that would stop creation/update of Events -If all other prevention mechanism fail we’d like a way for cluster admin to disable Events in the cluster, to stop them overloading the server. 
However, we dropped this idea, as it's currently possible to achieve the similar result by changing RBAC rules. - -### Pivoting towards more machine readable Events by introducing stricter structure -We considered making easier for automated systems to use Events by enforcing "active voice" for Event objects. This would allow us to assure which field in the Event points to the active component, and which to direct and indirect objects. We dropped this idea because Events are supposed to be consumed only by humans. - -### Pivoting towards making Events more helpful for cluster operator during debugging -We considered exposing more data that cluster operator would need to use Events for debugging, e.g. making ReportingController more central to the semantics of Event and adding some way to easily grep though the logs of appropriate component when looking for context of a given Event. This idea was dropped because Events are supposed to give application developer who's running his application on the cluster a rough understanding what was happening with his app. - -## Proposed implementation plan -- 1.9 - create a beta api-group Events with the new Event type, -- 1.10 - migrate at controllers running in controller manager to use events API group, -- 1.11 - finish migrating Events in all components and move and move events storage representation to the new type. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
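To illustrate the field-selector queries listed above, here is a hedged sketch using `k8s.io/apimachinery/pkg/fields` to build the NodeController selector; the selectable field names follow this proposal and may differ from what a given release actually registers.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/fields"
)

func main() {
	// Field selector for the "all NodeController Events" query described above.
	sel := fields.Set{
		"reportingController": "kubernetes.io/node-controller",
	}.AsSelector()
	fmt.Println(sel.String()) // reportingController=kubernetes.io/node-controller
}
```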
\ No newline at end of file diff --git a/contributors/design-proposals/instrumentation/external-metrics-api.md b/contributors/design-proposals/instrumentation/external-metrics-api.md index 2b99ef12..f0fbec72 100644 --- a/contributors/design-proposals/instrumentation/external-metrics-api.md +++ b/contributors/design-proposals/instrumentation/external-metrics-api.md @@ -1,87 +1,6 @@ -# **External Metrics API** +Design proposals have been archived. -# Overview +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -[HPA v2 API extension proposal](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/autoscaling/hpa-external-metrics.md) introduces new External metric type for autoscaling based on metrics coming from outside of Kubernetes cluster. This document proposes a new External Metrics API that will be used by HPA controller to get those metrics. -This API performs a similar role to and is based on existing [Custom Metrics API](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/custom-metrics-api.md). Unless explicitly specified otherwise all sections related to semantics, implementation and design decisions in [Custom Metrics API design](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/custom-metrics-api.md) apply to External Metrics API as well. It is generally expected that a Custom Metrics Adapter will provide both Custom Metrics API and External Metrics API, however, this is not a requirement and both APIs can be implemented and used separately. - - -# API - -The API will consist of a single path: - - -``` -/apis/external.metrics.k8s.io/v1beta1/namespaces/<namespace_name>/<metric_name>?labelSelector=<selector> -``` - -Similar to endpoints in Custom Metrics API it would only support GET requests. - -The query would return the `ExternalMetricValueList` type described below: - -```go -// a list of values for a given metric for some set labels -type ExternalMetricValueList struct { - metav1.TypeMeta `json:",inline"` - metav1.ListMeta `json:"metadata,omitempty"` - - // value of the metric matching a given set of labels - Items []ExternalMetricValue `json:"items"` -} - -// a metric value for external metric -type ExternalMetricValue struct { - metav1.TypeMeta`json:",inline"` - - // the name of the metric - MetricName string `json:"metricName"` - - // label set identifying the value within metric - MetricLabels map[string]string `json:"metricLabels"` - - // indicates the time at which the metrics were produced - Timestamp unversioned.Time `json:"timestamp"` - - // indicates the window ([Timestamp-Window, Timestamp]) from - // which these metrics were calculated, when returning rate - // metrics calculated from cumulative metrics (or zero for - // non-calculated instantaneous metrics). - WindowSeconds *int64 `json:"window,omitempty"` - - // the value of the metric - Value resource.Quantity -} -``` - -# Semantics - -## Namespaces - -Kubernetes namespaces don't have a natural 1-1 mapping to metrics coming from outside of Kubernetes. It is up to adapter implementing the API to decide which metric is available in which namespace. In particular a single metric may be available through many different namespaces. - -## Metric Values - -A request for a given metric may return multiple values if MetricSelector matches multiple time series. 
Each value should include a complete set of labels, which is sufficient to uniquely identify a timeseries. - -A single value should always be returned if MetricSelector specifies a single value for every label defined for a given metric. - -## Metric names - -The Custom Metrics API [doesn't allow using certain characters in metric names](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/custom-metrics-api.md#metric-names). The reason for that is a technical limitation in Go libraries. This list of forbidden characters includes slash (`/`). This is problematic as many systems use slashes in their metric naming convention. - -Rather than expect metric adapters to come up with their own ways of handling this, this document proposes introducing `\|` as a custom escape sequence for slash. The HPA controller will automatically replace any slashes in the MetricName field of an External metric with this escape sequence. - -Otherwise the allowed metric names are the same as in the Custom Metrics API. - -## Access Control - -Access can be controlled with per-metric granularity, the same as in the Custom Metrics API. The API has been designed to allow adapters to implement more granular access control if required. A possible future extension of the API supporting label-level access control is described in the [ExternalMetricsPolicy](#externalmetricspolicy) section. - -# Future considerations - -## ExternalMetricsPolicy - -If more granular access control turns out to be a common requirement, an ExternalMetricsPolicy object could be added to the API. This object could be defined at cluster level, per namespace or per user, and would consist of a list of rules. Each rule would consist of a mandatory regexp and either a label selector or a 'deny' statement. For each metric the rules would be applied top to bottom, with the first matching rule being used. A query that hits a deny rule, or that specifies a selector that is not a subset of the selector specified by the policy, would be rejected with a 403 error. - -Additionally, an admission controller could be used to check the policy when creating an HPA object. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
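A minimal sketch of the slash-escaping rule described in the Metric names section above; the function name is made up for this illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// escapeMetricName applies the slash-escaping convention described above: the
// HPA controller replaces "/" with the `\|` sequence before putting the metric
// name on the URL path.
func escapeMetricName(name string) string {
	return strings.ReplaceAll(name, "/", `\|`)
}

func main() {
	fmt.Println(escapeMetricName("custom.googleapis.com/my/metric"))
	// Output: custom.googleapis.com\|my\|metric
}
```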
\ No newline at end of file diff --git a/contributors/design-proposals/instrumentation/metrics-server.md b/contributors/design-proposals/instrumentation/metrics-server.md index 163b2385..f0fbec72 100644 --- a/contributors/design-proposals/instrumentation/metrics-server.md +++ b/contributors/design-proposals/instrumentation/metrics-server.md @@ -1,89 +1,6 @@ -Metrics Server -============== +Design proposals have been archived. -Resource Metrics API is an effort to provide a first-class Kubernetes API -(stable, versioned, discoverable, available through apiserver and with client support) -that serves resource usage metrics for pods and nodes. The use cases were discussed -and the API was proposed a while ago in -[another proposal](/contributors/design-proposals/instrumentation/resource-metrics-api.md). -This document describes the architecture and the design of the second part of this effort: -making the mentioned API available in the same way as the other Kubernetes APIs. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -### Scalability limitations ### -We want to collect up to 10 metrics from each pod and node running in a cluster. -Starting with Kubernetes 1.6 we support 5000 nodes clusters with 30 pods per node. -Assuming we want to collect metrics with 1 minute granularity this means: -``` -10 x 5000 x 30 / 60 = 25000 metrics per second by average -``` - -Kubernetes apiserver persists all Kubernetes resources in its key-value store [etcd](https://coreos.com/etcd/). -It’s not able to handle such load. On the other hand metrics tend to change frequently, -are temporary and in case of loss of them we can collect them during the next housekeeping operation. -We will store them in memory then. This means that we can’t reuse the main apiserver -and instead we will introduce a new one - metrics server. - -### Current status ### -The API has been already implemented in Heapster, but users and Kubernetes components -can only access it through master proxy mechanism and have to decode it on their own. -Heapster serves the API using go http library which doesn’t offer a number of functionality -that is offered by Kubernetes API server like authorization/authentication or client generation. -There is also a prototype of Heapster using [generic apiserver](https://github.com/kubernetes/apiserver) library. - -The API is in alpha and there is a plan to graduate it to beta (and later to GA), -but it’s out of the scope of this document. - -### Dependencies ### -In order to make metrics server available for users in exactly the same way -as the regular Kubernetes API we need a mechanism that redirects requests to `/apis/metrics` -endpoint from the apiserver to metrics server. The solution for this problem is -[kube-aggregator](https://github.com/kubernetes/kube-aggregator). -The effort is on track to be completed for Kubernetes 1.7 release. -Previously metrics server was blocked on this dependency. - -### Design ### -Metrics server will be implemented in line with -[Kubernetes monitoring architecture](/contributors/design-proposals/instrumentation/monitoring_architecture.md) -and inspired by [Heapster](https://github.com/kubernetes/heapster). -It will be a cluster level component which periodically scrapes metrics from all Kubernetes nodes -served by Kubelet through Summary API. 
Then metrics will be aggregated, -stored in memory (see Scalability limitations) and served in -[Metrics API](https://git.k8s.io/metrics/pkg/apis/metrics/v1alpha1/types.go) format. - -Metrics server will use apiserver library to implement http server functionality. -The library offers common Kubernetes functionality like authorization/authentication, -versioning, support for auto-generated client. To store data in memory we will replace -the default storage layer (etcd) by introducing in-memory store which will implement -[Storage interface](https://git.k8s.io/apiserver/pkg/registry/rest/rest.go). - -Only the most recent value of each metric will be remembered. If a user needs an access -to historical data they should either use 3rd party monitoring solution or -archive the metrics on their own (more details in the mentioned vision). - -Since the metrics are stored in memory, once the component is restarted, all data are lost. -This is an acceptable behavior because shortly after the restart the newest metrics will be collected, -though we will try to minimize the priority of this (see also Deployment). - -### Deployment ### -Since metrics server is prerequisite for a number of Kubernetes components (HPA, scheduler, kubectl top) -it will run by default in all Kubernetes clusters. Metrics server initiates connections to nodes, -due to security reasons (our policy allows only connection in the opposite direction) so it has to run on user’s node. - -There will be only one instance of metrics server running in each cluster. In order to handle -high metrics volume, metrics server will be vertically autoscaled by -[addon-resizer](https://git.k8s.io/contrib/addon-resizer). -We will measure its resource usage characteristic. Our experience from profiling Heapster shows -that it scales vertically effectively. If we hit performance limits we will consider scaling it -horizontally, though it’s rather complicated and is out of the scope of this doc. - -Metrics server will be Kubernetes addon, create by kube-up script and managed by -[addon-manager](https://git.k8s.io/kubernetes/cluster/addons/addon-manager). -Since there are a number of dependent components, it will be marked as a critical addon. -In the future when the priority/preemption feature is introduced we will migrate to use this -proper mechanism for marking it as a high-priority, system component. - -### Users migration ### -In order to make the API usable we will provide auto-generated set of clients. -Currently the API is being used by a number of components and after we will introduce -the metrics server we will migrate all of them to use the new path. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
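The in-memory, latest-value-only storage described above could look roughly like the following sketch; the types and names are simplified stand-ins, not the actual metrics-server code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// podKey identifies a pod; podMetrics is a simplified stand-in for the real
// Metrics API types.
type podKey struct{ namespace, name string }

type podMetrics struct {
	timestamp    time.Time
	cpuNanoCores uint64
	memoryBytes  uint64
}

// memoryStore keeps only the most recent sample per pod, matching the
// "latest value only, nothing persisted" behaviour described above.
type memoryStore struct {
	mu   sync.RWMutex
	pods map[podKey]podMetrics
}

func newMemoryStore() *memoryStore {
	return &memoryStore{pods: map[podKey]podMetrics{}}
}

// Store overwrites any previous sample for the pod; history is deliberately
// not kept, since a restart simply waits for the next scrape.
func (s *memoryStore) Store(k podKey, m podMetrics) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.pods[k] = m
}

// Latest returns the newest (and only) sample for the pod, if any.
func (s *memoryStore) Latest(k podKey) (podMetrics, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	m, ok := s.pods[k]
	return m, ok
}

func main() {
	st := newMemoryStore()
	st.Store(podKey{"default", "web-0"}, podMetrics{time.Now(), 250000000, 64 << 20})
	m, _ := st.Latest(podKey{"default", "web-0"})
	fmt.Println(m.cpuNanoCores, m.memoryBytes)
}
```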
\ No newline at end of file diff --git a/contributors/design-proposals/instrumentation/monitoring_architecture.md b/contributors/design-proposals/instrumentation/monitoring_architecture.md index 59e4ccbc..f0fbec72 100644 --- a/contributors/design-proposals/instrumentation/monitoring_architecture.md +++ b/contributors/design-proposals/instrumentation/monitoring_architecture.md @@ -1,198 +1,6 @@ -# Kubernetes monitoring architecture +Design proposals have been archived. -## Executive Summary +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Monitoring is split into two pipelines: - -* A **core metrics pipeline** consisting of Kubelet, a resource estimator, a slimmed-down -Heapster called metrics-server, and the API server serving the master metrics API. These -metrics are used by core system components, such as scheduling logic (e.g. scheduler and -horizontal pod autoscaling based on system metrics) and simple out-of-the-box UI components -(e.g. `kubectl top`). This pipeline is not intended for integration with third-party -monitoring systems. -* A **monitoring pipeline** used for collecting various metrics from the system and exposing -them to end-users, as well as to the Horizontal Pod Autoscaler (for custom metrics) and Infrastore -via adapters. Users can choose from many monitoring system vendors, or run none at all. In -open-source, Kubernetes will not ship with a monitoring pipeline, but third-party options -will be easy to install. We expect that such pipelines will typically consist of a per-node -agent and a cluster-level aggregator. - -The architecture is illustrated in the diagram in the Appendix of this doc. - -## Introduction and Objectives - -This document proposes a high-level monitoring architecture for Kubernetes. It covers -a subset of the issues mentioned in the “Kubernetes Monitoring Architecture” doc, -specifically focusing on an architecture (components and their interactions) that -hopefully meets the numerous requirements. We do not specify any particular timeframe -for implementing this architecture, nor any particular roadmap for getting there. - -### Terminology - -There are two types of metrics, system metrics and service metrics. System metrics are -generic metrics that are generally available from every entity that is monitored (e.g. -usage of CPU and memory by container and node). Service metrics are explicitly defined -in application code and exported (e.g. number of 500s served by the API server). Both -system metrics and service metrics can originate from users’ containers or from system -infrastructure components (master components like the API server, addon pods running on -the master, and addon pods running on user nodes). 
- -We divide system metrics into - -* *core metrics*, which are metrics that Kubernetes understands and uses for operation -of its internal components and core utilities -- for example, metrics used for scheduling -(including the inputs to the algorithms for resource estimation, initial resources/vertical -autoscaling, cluster autoscaling, and horizontal pod autoscaling excluding custom metrics), -the kube dashboard, and “kubectl top.” As of now this would consist of cpu cumulative usage, -memory instantaneous usage, disk usage of pods, disk usage of containers -* *non-core metrics*, which are not interpreted by Kubernetes; we generally assume they -include the core metrics (though not necessarily in a format Kubernetes understands) plus -additional metrics. - -Service metrics can be divided into those produced by Kubernetes infrastructure components -(and thus useful for operation of the Kubernetes cluster) and those produced by user applications. -Service metrics used as input to horizontal pod autoscaling are sometimes called custom metrics. -Of course horizontal pod autoscaling also uses core metrics. - -We consider logging to be separate from monitoring, so logging is outside the scope of -this doc. - -### Requirements - -The monitoring architecture should - -* include a solution that is part of core Kubernetes and - * makes core system metrics about nodes, pods, and containers available via a standard - master API (today the master metrics API), such that core Kubernetes features do not - depend on non-core components - * requires Kubelet to only export a limited set of metrics, namely those required for - core Kubernetes components to correctly operate (this is related to [#18770](https://github.com/kubernetes/kubernetes/issues/18770)) - * can scale up to at least 5000 nodes - * is small enough that we can require that all of its components be running in all deployment - configurations -* include an out-of-the-box solution that can serve historical data, e.g. to support Initial -Resources and vertical pod autoscaling as well as cluster analytics queries, that depends -only on core Kubernetes -* allow for third-party monitoring solutions that are not part of core Kubernetes and can -be integrated with components like Horizontal Pod Autoscaler that require service metrics - -## Architecture - -We divide our description of the long-term architecture plan into the core metrics pipeline -and the monitoring pipeline. For each, it is necessary to think about how to deal with each -type of metric (core metrics, non-core metrics, and service metrics) from both the master -and minions. - -### Core metrics pipeline - -The core metrics pipeline collects a set of core system metrics. There are two sources for -these metrics - -* Kubelet, providing per-node/pod/container usage information (the current cAdvisor that -is part of Kubelet will be slimmed down to provide only core system metrics) -* a resource estimator that runs as a DaemonSet and turns raw usage values scraped from -Kubelet into resource estimates (values used by scheduler for a more advanced usage-based -scheduler) - -These sources are scraped by a component we call *metrics-server* which is like a slimmed-down -version of today's Heapster. metrics-server stores locally only latest values and has no sinks. -metrics-server exposes the master metrics API. (The configuration described here is similar -to the current Heapster in “standalone” mode.) 
-[Discovery summarizer](../api-machinery/aggregated-api-servers.md) -makes the master metrics API available to external clients such that from the client's perspective -it looks the same as talking to the API server. - -Core (system) metrics are handled as described above in all deployment environments. The only -easily replaceable part is resource estimator, which could be replaced by power users. In -theory, metric-server itself can also be substituted, but it'd be similar to substituting -apiserver itself or controller-manager - possible, but not recommended and not supported. - -Eventually the core metrics pipeline might also collect metrics from Kubelet and Docker daemon -themselves (e.g. CPU usage of Kubelet), even though they do not run in containers. - -The core metrics pipeline is intentionally small and not designed for third-party integrations. -“Full-fledged” monitoring is left to third-party systems, which provide the monitoring pipeline -(see next section) and can run on Kubernetes without having to make changes to upstream components. -In this way we can remove the burden we have today that comes with maintaining Heapster as the -integration point for every possible metrics source, sink, and feature. - -#### Infrastore - -We will build an open-source Infrastore component (most likely reusing existing technologies) -for serving historical queries over core system metrics and events, which it will fetch from -the master APIs. Infrastore will expose one or more APIs (possibly just SQL-like queries -- -this is TBD) to handle the following use cases - -* initial resources -* vertical autoscaling -* oldtimer API -* decision-support queries for debugging, capacity planning, etc. -* usage graphs in the [Kubernetes Dashboard](https://github.com/kubernetes/dashboard) - -In addition, it may collect monitoring metrics and service metrics (at least from Kubernetes -infrastructure containers), described in the upcoming sections. - -### Monitoring pipeline - -One of the goals of building a dedicated metrics pipeline for core metrics, as described in the -previous section, is to allow for a separate monitoring pipeline that can be very flexible -because core Kubernetes components do not need to rely on it. By default we will not provide -one, but we will provide an easy way to install one (using a single command, most likely using -Helm). We described the monitoring pipeline in this section. - -Data collected by the monitoring pipeline may contain any sub- or superset of the following groups -of metrics: - -* core system metrics -* non-core system metrics -* service metrics from user application containers -* service metrics from Kubernetes infrastructure containers; these metrics are exposed using -Prometheus instrumentation - -It is up to the monitoring solution to decide which of these are collected. - -In order to enable horizontal pod autoscaling based on custom metrics, the provider of the -monitoring pipeline would also have to create a stateless API adapter that pulls the custom -metrics from the monitoring pipeline and exposes them to the Horizontal Pod Autoscaler. Such -API will be a well defined, versioned API similar to regular APIs. Details of how it will be -exposed or discovered will be covered in a detailed design doc for this component. - -The same approach applies if it is desired to make monitoring pipeline metrics available in -Infrastore. These adapters could be standalone components, libraries, or part of the monitoring -solution itself. 
- -There are many possible combinations of node and cluster-level agents that could comprise a -monitoring pipeline, including -cAdvisor + Heapster + InfluxDB (or any other sink) -* cAdvisor + collectd + Heapster -* cAdvisor + Prometheus -* snapd + Heapster -* snapd + SNAP cluster-level agent -* Sysdig - -As an example we'll describe a potential integration with cAdvisor + Prometheus. - -Prometheus has the following metric sources on a node: -* core and non-core system metrics from cAdvisor -* service metrics exposed by containers via HTTP handler in Prometheus format -* [optional] metrics about node itself from Node Exporter (a Prometheus component) - -All of them are polled by the Prometheus cluster-level agent. We can use the Prometheus -cluster-level agent as a source for horizontal pod autoscaling custom metrics by using a -standalone API adapter that proxies/translates between the Prometheus Query Language endpoint -on the Prometheus cluster-level agent and an HPA-specific API. Likewise an adapter can be -used to make the metrics from the monitoring pipeline available in Infrastore. Neither -adapter is necessary if the user does not need the corresponding feature. - -The command that installs cAdvisor+Prometheus should also automatically set up collection -of the metrics from infrastructure containers. This is possible because the names of the -infrastructure containers and metrics of interest are part of the Kubernetes control plane -configuration itself, and because the infrastructure containers export their metrics in -Prometheus format. - -## Appendix: Architecture diagram - -### Open-source monitoring pipeline - - +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
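For reference, "service metrics exposed by containers via HTTP handler in Prometheus format", as described above, typically amounts to a few lines with the Prometheus Go client. This is a generic sketch, not a prescribed Kubernetes component; the metric name and port are made up.

```go
// instrumented_service.go - minimal example of a container exposing service
// metrics in Prometheus format for the monitoring pipeline to scrape.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var requestsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "myapp_http_requests_total", // made-up metric name
		Help: "Number of HTTP requests handled, by path.",
	},
	[]string{"path"},
)

func main() {
	prometheus.MustRegister(requestsTotal)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		requestsTotal.WithLabelValues(r.URL.Path).Inc()
		w.Write([]byte("ok"))
	})

	// The Prometheus cluster-level agent polls this endpoint.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```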
\ No newline at end of file
diff --git a/contributors/design-proposals/instrumentation/monitoring_architecture.png b/contributors/design-proposals/instrumentation/monitoring_architecture.png
Binary files differ
deleted file mode 100644
index 570996b7..00000000
--- a/contributors/design-proposals/instrumentation/monitoring_architecture.png
+++ /dev/null
diff --git a/contributors/design-proposals/instrumentation/performance-related-monitoring.md b/contributors/design-proposals/instrumentation/performance-related-monitoring.md
index f2b75813..f0fbec72 100644
--- a/contributors/design-proposals/instrumentation/performance-related-monitoring.md
+++ b/contributors/design-proposals/instrumentation/performance-related-monitoring.md
@@ -1,112 +1,6 @@
-# Performance Monitoring
+Design proposals have been archived.
-## Reason for this document
+To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/).
-This document serves as a place to gather information about past performance regressions, their causes and impact, and to discuss ideas for avoiding similar regressions in the future.
-The main reason for doing this is to understand what kind of monitoring needs to be in place to keep Kubernetes fast.
-## Known past and present performance issues
-
-### Higher logging level causing scheduler stair stepping
-
-Issue https://github.com/kubernetes/kubernetes/issues/14216 was opened because @spiffxp observed a regression in scheduler performance in the 1.1 branch in comparison to the `old` 1.0 cut.
-In the end it turned out to be caused by running the scheduler with `--v=4` (instead of the default `--v=2`) together with `--logtostderr`, which disables batching of log lines, plus a number of log statements without an explicit V level. This caused weird behavior of the whole component.
-
-Because we now know that logging may have a big performance impact, we should consider instrumenting the logging mechanism and computing statistics such as the number of logged messages and their total and average size. Each binary should be responsible for exposing its own metrics. An unaccounted but far too large number of days, if not weeks, of engineering time was lost because of this issue.
-
-### Adding per-pod probe-time, which increased the number of PodStatus updates, causing major slowdown
-
-In September 2015 we tried to add per-pod probe times to the PodStatus. It caused (https://github.com/kubernetes/kubernetes/issues/14273) a massive increase in both the number and total volume of object (PodStatus) changes. This drastically increased the load on the API server, which wasn't able to handle the new number of requests quickly enough, violating our response time SLO. We had to revert this change.
-
-### Late Ready->Running PodPhase transition caused test failures as it seemed like slowdown
-
-In late September we encountered a strange problem (https://github.com/kubernetes/kubernetes/issues/14554): we observed increased latencies in small clusters (a few Nodes). It turned out to be caused by an added latency between the PodRunning and PodReady phases. This was not a real regression, but our tests thought it was, which shows how careful we need to be.
-
-### Huge number of handshakes slows down API server
-
-This was a long-standing performance issue and an important bottleneck for scalability (https://github.com/kubernetes/kubernetes/issues/13671). The bug directly causing this problem was incorrect (from golang's standpoint) handling of TCP connections.
Secondary issue was that elliptic curve encryption (only one available in go 1.4) -is unbelievably slow. - -## Proposed metrics/statistics to gather/compute to avoid problems - -### Cluster-level metrics - -Basic ideas: -- number of Pods/ReplicationControllers/Services in the cluster -- number of running replicas of master components (if they are replicated) -- current elected master of etcd cluster (if running distributed version) -- number of master component restarts -- number of lost Nodes - -### Logging monitoring - -Log spam is a serious problem and we need to keep it under control. Simplest way to check for regressions, suggested by @brendandburns, is to compute the rate in which log files -grow in e2e tests. - -Basic ideas: -- log generation rate (B/s) - -### REST call monitoring - -We do measure REST call duration in the Density test, but we need an API server monitoring as well, to avoid false failures caused e.g. by the network traffic. We already have -some metrics in place (https://git.k8s.io/kubernetes/pkg/apiserver/metrics/metrics.go), but we need to revisit the list and add some more. - -Basic ideas: -- number of calls per verb, client, resource type -- latency distribution per verb, client, resource type -- number of calls that was rejected per client, resource type and reason (invalid version number, already at maximum number of requests in flight) -- number of relists in various watchers - -### Rate limit monitoring - -Reverse of REST call monitoring done in the API server. We need to know when a given component increases a pressure it puts on the API server. As a proxy for number of -requests sent we can track how saturated are rate limiters. This has additional advantage of giving us data needed to fine-tune rate limiter constants. - -Because we have rate limiting on both ends (client and API server) we should monitor number of inflight requests in API server and how it relates to `max-requests-inflight`. - -Basic ideas: -- percentage of used non-burst limit, -- amount of time in last hour with depleted burst tokens, -- number of inflight requests in API server. - -### Network connection monitoring - -During development we observed incorrect use/reuse of HTTP connections multiple times already. We should at least monitor number of created connections. - -### ETCD monitoring - -@xiang-90 and @hongchaodeng - you probably have way more experience on what'd be good to look at from the ETCD perspective. - -Basic ideas: -- ETCD memory footprint -- number of objects per kind -- read/write latencies per kind -- number of requests from the API server -- read/write counts per key (it may be too heavy though) - -### Resource consumption - -On top of all things mentioned above we need to monitor changes in resource usage in both: cluster components (API server, Kubelet, Scheduler, etc.) and system add-ons -(Heapster, L7 load balancer, etc.). Monitoring memory usage is tricky, because if no limits are set, system won't apply memory pressure to processes, which makes their memory -footprint constantly grow. We argue that monitoring usage in tests still makes sense, as tests should be repeatable, and if memory usage will grow drastically between two runs -it most likely can be attributed to some kind of regression (assuming that nothing else has changed in the environment). - -Basic ideas: -- CPU usage -- memory usage - -### Other saturation metrics - -We should monitor other aspects of the system, which may indicate saturation of some component. 
- -Basic ideas: -- queue length for queues in the system, -- wait time for WaitGroups. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
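To make the proposed REST call monitoring concrete, here is a minimal sketch of the kind of instrumentation being suggested: a latency histogram labeled by verb and resource, exported in Prometheus format. The metric name and handler are illustrative; the real API server metrics live in the `pkg/apiserver/metrics` package referenced above.

```go
// restcall_metrics_sketch.go - illustrative instrumentation for REST call
// latency per verb and resource, in the spirit of the proposal above.
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var requestLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "apiserver_request_latency_seconds_sketch", // illustrative name
		Help:    "Latency distribution of REST calls, by verb and resource.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"verb", "resource"},
)

// instrument wraps a handler and records how long it took to serve.
func instrument(verb, resource string, handler http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		handler(w, r)
		requestLatency.WithLabelValues(verb, resource).Observe(time.Since(start).Seconds())
	}
}

func main() {
	prometheus.MustRegister(requestLatency)
	http.HandleFunc("/api/v1/pods", instrument("LIST", "pods", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("[]"))
	}))
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```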
\ No newline at end of file diff --git a/contributors/design-proposals/instrumentation/resource-metrics-api.md b/contributors/design-proposals/instrumentation/resource-metrics-api.md index 075c6180..f0fbec72 100644 --- a/contributors/design-proposals/instrumentation/resource-metrics-api.md +++ b/contributors/design-proposals/instrumentation/resource-metrics-api.md @@ -1,148 +1,6 @@ -# Resource Metrics API +Design proposals have been archived. -This document describes API part of MVP version of Resource Metrics API effort in Kubernetes. -Once the agreement will be made the document will be extended to also cover implementation details. -The shape of the effort may be also a subject of changes once we will have more well-defined use cases. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Goal - -The goal for the effort is to provide resource usage metrics for pods and nodes through the API server. -This will be a stable, versioned API which core Kubernetes components can rely on. -In the first version only the well-defined use cases will be handled, -although the API should be easily extensible for potential future use cases. - -## Main use cases - -This section describes well-defined use cases which should be handled in the first version. -Use cases which are not listed below are out of the scope of MVP version of Resource Metrics API. - -#### Horizontal Pod Autoscaler - -HPA uses the latest value of cpu usage as an average aggregated across 1 minute -(the window may change in the future). The data for a given set of pods -(defined either by pod list or label selector) should be accessible in one request -due to performance issues. - -#### Scheduler - -Scheduler in order to schedule best-effort pods requires node level resource usage metrics -as an average aggregated across 1 minute (the window may change in the future). -The metrics should be available for all resources supported in the scheduler. -Currently the scheduler does not need this information, because it schedules best-effort pods -without considering node usage. But having the metrics available in the API server is a blocker -for adding the ability to take node usage into account when scheduling best-effort pods. - -## Other considered use cases - -This section describes the other considered use cases and explains why they are out -of the scope of the MVP version. - -#### Custom metrics in HPA - -HPA requires the latest value of application level metrics. - -The design of the pipeline for collecting application level metrics should -be revisited and it's not clear whether application level metrics should be -available in API server so the use case initially won't be supported. - -#### Cluster Federation - -The Cluster Federation control system might want to consider cluster-level usage (in addition to cluster-level request) -of running pods when choosing where to schedule new pods. Although -Cluster Federation is still in design, -we expect the metrics API described here to be sufficient. Cluster-level usage can be -obtained by summing over usage of all nodes in the cluster. - -#### kubectl top - -This feature is not yet specified/implemented although it seems reasonable to provide users information -about resource usage on pod/node level. 
- -Since this feature has not been fully specified yet it will be not supported initially in the API although -it will be probably possible to provide a reasonable implementation of the feature anyway. - -#### Kubernetes dashboard - -[Kubernetes dashboard](https://github.com/kubernetes/dashboard) in order to draw graphs requires resource usage -in timeseries format from relatively long period of time. The aggregations should be also possible on various levels -including replication controllers, deployments, services, etc. - -Since the use case is complicated it will not be supported initially in the API and they will query Heapster -directly using some custom API there. - -## Proposed API - -Initially the metrics API will be in a separate [API group](../api-machinery/api-group.md) called ```metrics```. -Later if we decided to have Node and Pod in different API groups also -NodeMetrics and PodMetrics should be in different API groups. - -#### Schema - -The proposed schema is as follow. Each top-level object has `TypeMeta` and `ObjectMeta` fields -to be compatible with Kubernetes API standards. - -```go -type NodeMetrics struct { - unversioned.TypeMeta - ObjectMeta - - // The following fields define time interval from which metrics were - // collected in the following format [Timestamp-Window, Timestamp]. - Timestamp unversioned.Time - Window unversioned.Duration - - // The memory usage is the memory working set. - Usage v1.ResourceList -} - -type PodMetrics struct { - unversioned.TypeMeta - ObjectMeta - - // The following fields define time interval from which metrics were - // collected in the following format [Timestamp-Window, Timestamp]. - Timestamp unversioned.Time - Window unversioned.Duration - - // Metrics for all containers are collected within the same time window. - Containers []ContainerMetrics -} - -type ContainerMetrics struct { - // Container name corresponding to the one from v1.Pod.Spec.Containers. - Name string - // The memory usage is the memory working set. - Usage v1.ResourceList -} -``` - -By default `Usage` is the mean from samples collected within the returned time window. -The default time window is 1 minute. - -#### Endpoints - -All endpoints are GET endpoints, rooted at `/apis/metrics/v1alpha1/`. -There won't be support for the other REST methods. - -The list of supported endpoints: -- `/nodes` - all node metrics; type `[]NodeMetrics` -- `/nodes/{node}` - metrics for a specified node; type `NodeMetrics` -- `/namespaces/{namespace}/pods` - all pod metrics within namespace with support for `all-namespaces`; type `[]PodMetrics` -- `/namespaces/{namespace}/pods/{pod}` - metrics for a specified pod; type `PodMetrics` - -The following query parameters are supported: -- `labelSelector` - restrict the list of returned objects by labels (list endpoints only) - -In the future we may want to introduce the following params: -`aggregator` (`max`, `min`, `95th`, etc.) and `window` (`1h`, `1d`, `1w`, etc.) -which will allow to get the other aggregates over the custom time window. - -## Further improvements - -Depending on the further requirements the following features may be added: -- support for more metrics -- support for application level metrics -- watch for metrics -- possibility to query for window sizes and aggregation functions (though single window size/aggregation function per request) -- cluster level metrics +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
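A minimal client sketch for the endpoints listed above, assuming the aggregated API is reachable via `kubectl proxy` on `localhost:8001`. The JSON field names are an assumption (the usual lowerCamelCase Kubernetes serialization); the proposal itself only defines the Go types.

```go
// metrics_client_sketch.go - fetches node metrics from the proposed
// /apis/metrics/v1alpha1/nodes endpoint and decodes a small subset of it.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// nodeMetrics mirrors a subset of the NodeMetrics schema above; field names
// are assumed, not specified by this proposal.
type nodeMetrics struct {
	Metadata struct {
		Name string `json:"name"`
	} `json:"metadata"`
	Timestamp string            `json:"timestamp"`
	Window    string            `json:"window"`
	Usage     map[string]string `json:"usage"` // e.g. {"cpu": "120m", "memory": "250Mi"}
}

func main() {
	// Assumes `kubectl proxy` is listening on localhost:8001; adjust for
	// direct, authenticated access to the API server.
	resp, err := http.Get("http://localhost:8001/apis/metrics/v1alpha1/nodes")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// The proposal types this response as []NodeMetrics; a List wrapper
	// would only change the decoding target.
	var nodes []nodeMetrics
	if err := json.NewDecoder(resp.Body).Decode(&nodes); err != nil {
		log.Fatal(err)
	}
	for _, n := range nodes {
		fmt.Printf("%s: cpu=%s memory=%s (window %s)\n",
			n.Metadata.Name, n.Usage["cpu"], n.Usage["memory"], n.Window)
	}
}
```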
\ No newline at end of file diff --git a/contributors/design-proposals/instrumentation/volume_stats_pvc_ref.md b/contributors/design-proposals/instrumentation/volume_stats_pvc_ref.md index 1b2f599b..f0fbec72 100644 --- a/contributors/design-proposals/instrumentation/volume_stats_pvc_ref.md +++ b/contributors/design-proposals/instrumentation/volume_stats_pvc_ref.md @@ -1,57 +1,6 @@ -# Add PVC reference in Volume Stats - -## Background -Pod volume stats tracked by kubelet do not currently include any information about the PVC (if the pod volume was referenced via a PVC) - -This prevents exposing (and querying) volume metrics labeled by PVC name which is preferable for users, given that PVC is a top-level API object. - -## Proposal - -Modify ```VolumeStats``` tracked in Kubelet and populate with PVC info: - -``` -// VolumeStats contains data about Volume filesystem usage. -type VolumeStats struct { - // Embedded FsStats - FsStats - // Name is the name given to the Volume - // +optional - Name string `json:"name,omitempty"` -+ // PVCRef is a reference to the measured PVC. -+ // +optional -+ PVCRef PVCReference `json:"pvcRef"` -} - -+// PVCReference contains enough information to describe the referenced PVC. -+type PVCReference struct { -+ Name string `json:"name"` -+ Namespace string `json:"namespace"` -+} -``` - -## Implementation -2 options are described below. Option 1 supports current requirements/requested use cases. Option 2 supports an additional use case that was being discussed and is called out for completeness/discussion/feedback. - -### Option 1 -- Modify ```kubelet::server::stats::calcAndStoreStats()``` - - If the pod volume is referenced via a PVC, populate ```PVCRef``` in VolumeStats using the Pod spec - - - The Pod spec is already available in this method, so the changes are contained to this function. - -- The limitation of this approach is that we're limited to reporting only what is available in the pod spec (Pod namespace and PVC claimname) - -### Option 2 -- Modify the ```volumemanager::GetMountedVolumesForPod()``` (or add a new function) to return additional volume information from the actual/desired state-of-world caches - - Use this to populate PVCRef in VolumeStats - -- This allows us to get information not available in the Pod spec such as the PV name/UID which can be used to label metrics - enables exposing/querying volume metrics by PV name -- It's unclear whether this is a use case we need to/should support: - * Volume metrics are only refreshed for mounted volumes which implies a bound/available PVC - * We expect most user-storage interactions to be via the PVC -- Admins monitoring PVs (and not PVC's) so that they know when their users are running out of space or are over-provisioning would be a use case supporting adding PV information to - metrics - - +Design proposals have been archived. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
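As a rough sketch of Option 1, deriving the PVC reference needs nothing beyond the pod spec: find the named volume and, if it is backed by a claim, record the claim name together with the pod's namespace. The code below uses current `k8s.io/api/core/v1` types plus local copies of the proposed structs; it is an illustration, not the actual kubelet change.

```go
// pvc_ref_sketch.go - illustration of Option 1: derive the PVCRef for a pod
// volume purely from the pod spec.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// Local, simplified copies of the proposed types.
type PVCReference struct {
	Name      string `json:"name"`
	Namespace string `json:"namespace"`
}

type VolumeStats struct {
	Name   string        `json:"name,omitempty"`
	PVCRef *PVCReference `json:"pvcRef,omitempty"`
}

// pvcRefForVolume returns a PVCReference if the named pod volume is backed by
// a PersistentVolumeClaim, or nil otherwise.
func pvcRefForVolume(pod *v1.Pod, volumeName string) *PVCReference {
	for _, vol := range pod.Spec.Volumes {
		if vol.Name == volumeName && vol.PersistentVolumeClaim != nil {
			return &PVCReference{
				Name:      vol.PersistentVolumeClaim.ClaimName,
				Namespace: pod.Namespace, // PVCs live in the pod's namespace
			}
		}
	}
	return nil
}

func main() {
	pod := &v1.Pod{}
	pod.Namespace = "default"
	pod.Spec.Volumes = []v1.Volume{{
		Name: "data",
		VolumeSource: v1.VolumeSource{
			PersistentVolumeClaim: &v1.PersistentVolumeClaimVolumeSource{ClaimName: "my-claim"},
		},
	}}

	stats := VolumeStats{Name: "data", PVCRef: pvcRefForVolume(pod, "data")}
	fmt.Printf("%s -> %+v\n", stats.Name, *stats.PVCRef)
}
```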
\ No newline at end of file diff --git a/contributors/design-proposals/multi-platform.md b/contributors/design-proposals/multi-platform.md index 32258ab9..f0fbec72 100644 --- a/contributors/design-proposals/multi-platform.md +++ b/contributors/design-proposals/multi-platform.md @@ -1,529 +1,6 @@ -# Kubernetes for multiple platforms +Design proposals have been archived. -**Author**: Lucas Käldström ([@luxas](https://github.com/luxas)) +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -**Status** (25th of August 2016): Some parts are already implemented; but still there quite a lot of work to be done. - -## Abstract - -We obviously want Kubernetes to run on as many platforms as possible, in order to make Kubernetes an even more powerful system. -This is a proposal that explains what should be done in order to achieve a true cross-platform container management system. - -Kubernetes is written in Go, and Go code is portable across platforms. -Docker and rkt are also written in Go, and it's already possible to use them on various platforms. -When it's possible to run containers on a specific architecture, people also want to use Kubernetes to manage the containers. - -In this proposal, a `platform` is defined as `operating system/architecture` or `${GOOS}/${GOARCH}` in Go terms. - -The following platforms are proposed to be built for in a Kubernetes release: - - linux/amd64 - - linux/arm (GOARM=6 initially, but we probably have to bump this to GOARM=7 due to that the most of other ARM things are ARMv7) - - linux/arm64 - - linux/ppc64le - -If there's interest in running Kubernetes on `linux/s390x` too, it won't require many changes to the source now when we've laid the ground for a multi-platform Kubernetes already. - -There is also work going on with porting Kubernetes to Windows (`windows/amd64`). See [this issue](https://github.com/kubernetes/kubernetes/issues/22623) for more details. - -But note that when porting to a new OS like windows, a lot of os-specific changes have to be implemented before cross-compiling, releasing and other concerns this document describes may apply. - -## Motivation - -Then the question probably is: Why? - -In fact, making it possible to run Kubernetes on other platforms will enable people to create customized and highly-optimized solutions that exactly fits their hardware needs. - -Example: [Paypal validates arm64 for real-time data analysis](http://www.datacenterdynamics.com/content-tracks/servers-storage/paypal-successfully-tests-arm-based-servers/93835.fullarticle) - -Also, by including other platforms to the Kubernetes party a healthy competition between platforms can/will take place. - -Every platform obviously has both pros and cons. By adding the option to make clusters of mixed platforms, the end user may take advantage of the good sides of every platform. - -## Use Cases - -For a large enterprise where computing power is the king, one may imagine the following combinations: - - `linux/amd64`: For running most of the general-purpose computing tasks, cluster addons, etc. - - `linux/ppc64le`: For running highly-optimized software; especially massive compute tasks - - `windows/amd64`: For running services that are only compatible on windows; e.g. 
business applications written in C# .NET - -For a mid-sized business where efficiency is most important, these could be combinations: - - `linux/amd64`: For running most of the general-purpose computing tasks, plus tasks that require very high single-core performance. - - `linux/arm64`: For running webservices and high-density tasks => the cluster could autoscale in a way that `linux/amd64` machines could hibernate at night in order to minimize power usage. - -For a small business or university, arm is often sufficient: - - `linux/arm`: Draws very little power, and can run web sites and app backends efficiently on Scaleway for example. - -And last but not least; Raspberry Pi's should be used for [education at universities](http://kubecloud.io/) and are great for **demoing Kubernetes' features at conferences.** - -## Main proposal - -### Release binaries for all platforms - -First and foremost, binaries have to be released for all platforms. -This affects the build-release tools. Fortunately, this is quite straightforward to implement, once you understand how Go cross-compilation works. - -Since Kubernetes' release and build jobs run on `linux/amd64`, binaries have to be cross-compiled and Docker images should be cross-built. -Builds should be run in a Docker container in order to get reproducible builds; and `gcc` should be installed for all platforms inside that image (`kube-cross`) - -All released binaries should be uploaded to `https://storage.googleapis.com/kubernetes-release/release/${version}/bin/${os}/${arch}/${binary}` - -This is a fairly long topic. If you're interested how to cross-compile, see [details about cross-compilation](#cross-compilation-details) - -### Support all platforms in a "run everywhere" deployment - -The easiest way of running Kubernetes on another architecture at the time of writing is probably by using the docker-multinode deployment. Of course, you may choose whatever deployment you want, the binaries are easily downloadable from the URL above. - -[docker-multinode](https://git.k8s.io/kube-deploy/docker-multinode) is intended to be a "kick-the-tires" multi-platform solution with Docker as the only real dependency (but it's not production ready) - -But when we (`sig-cluster-lifecycle`) have standardized the deployments to about three and made them production ready; at least one deployment should support **all platforms**. - -### Set up a build and e2e CI's - -#### Build CI - -Kubernetes should always enforce that all binaries are compiling. -**On every PR, `make release` have to be run** in order to require the code proposed to be merged to be compatible for all architectures. - -For more information, see [conflicts](#conflicts) - -#### e2e CI - -To ensure all functionality really is working on all other platforms, the community should be able to set up a CI. -To be able to do that, all the test-specific images have to be ported to multiple architectures, and the test images should preferably be manifest lists. -If the test images aren't manifest lists, the test code should automatically choose the right image based on the image naming. - -IBM volunteered to run continuously running e2e tests for `linux/ppc64le`. -Still it's hard to set up a such CI (even on `linux/amd64`), but that work belongs to `kubernetes/test-infra` proposals. - -When it's possible to test Kubernetes using Kubernetes; volunteers should be given access to publish their results on `k8s-testgrid.appspot.com`. 
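Given the release bucket layout described earlier in this section, resolving the right binary for a platform is just string formatting. The sketch below prints download URLs for the proposed server platforms; the version and binary name are placeholders.

```go
// release_urls.go - prints release-binary URLs following the layout described
// above; version and binary name are placeholders.
package main

import "fmt"

func main() {
	const (
		version = "v1.4.0" // placeholder
		binary  = "kubelet"
	)
	// The server platforms proposed in this document, as ${os}/${arch}.
	platforms := []string{"linux/amd64", "linux/arm", "linux/arm64", "linux/ppc64le"}

	for _, p := range platforms {
		fmt.Printf("https://storage.googleapis.com/kubernetes-release/release/%s/bin/%s/%s\n",
			version, p, binary)
	}
}
```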
- -### Official support level - -When all e2e tests are passing for a given platform; the platform should be officially supported by the Kubernetes team. -At the time of writing, `amd64` is in the officially supported category. - -When a platform is building and it's possible to set up a cluster with the core functionality, the platform is supported on a "best-effort" and experimental basis. -At the time of writing, `arm`, `arm64` and `ppc64le` are in the experimental category; the e2e tests aren't cross-platform yet. - -### Docker image naming and manifest lists - -#### Docker manifest lists - -Here's a good article about how the "manifest list" in the Docker image [manifest spec v2](https://github.com/docker/distribution/pull/1068) works: [A step towards multi-platform Docker images](https://integratedcode.us/2016/04/22/a-step-towards-multi-platform-docker-images/) - -A short summary: A manifest list is a list of Docker images with a single name (e.g. `busybox`), that holds layers for multiple platforms _when it's stored in a registry_. -When the image is pulled by a client (`docker pull busybox`), only layers for the target platforms are downloaded. -Right now we have to write `busybox-${ARCH}` for example instead, but that leads to extra scripting and unnecessary logic. - -For reference see [docker/docker#24739](https://github.com/docker/docker/issues/24739) and [appc/docker2aci#193](https://github.com/appc/docker2aci/issues/193) - -#### Image naming - -This has been debated quite a lot about; how we should name non-amd64 docker images that are pushed to `gcr.io`. See [#23059](https://github.com/kubernetes/kubernetes/pull/23059) and [#23009](https://github.com/kubernetes/kubernetes/pull/23009). - -This means that the naming `k8s.gcr.io/${binary}:${version}` should contain a _manifest list_ for future tags. -The manifest list thereby becomes a wrapper that is pointing to the `-${arch}` images. -This requires `docker-1.10` or newer, which probably means Kubernetes v1.4 and higher. - -TL;DR; - - `${binary}-${arch}:${version}` images should be pushed for all platforms - - `${binary}:${version}` images should point to the `-${arch}`-specific ones, and docker will then download the right image. - -### Components should expose their platform - -It should be possible to run clusters with mixed platforms smoothly. After all, bringing heterogeneous machines together to a single unit (a cluster) is one of Kubernetes' greatest strengths. And since the Kubernetes' components communicate over HTTP, two binaries of different architectures may talk to each other normally. - -The crucial thing here is that the components that handle platform-specific tasks (e.g. kubelet) should expose their platform. In the kubelet case, we've initially solved it by exposing the labels `beta.kubernetes.io/{os,arch}` on every node. This way a user may run binaries for different platforms on a multi-platform cluster, but still it requires manual work to apply the label to every manifest. - -Also, [the apiserver now exposes](https://github.com/kubernetes/kubernetes/pull/19905) it's platform at `GET /version`. But note that the value exposed at `/version` only is the apiserver's platform; there might be kubelets of various other platforms. - -### Standardize all image Makefiles to follow the same pattern - -All Makefiles should push for all platforms when doing `make push`, and build for all platforms when doing `make build`. 
-Under the hood; they should compile binaries in a container for reproducibility, and use QEMU for emulating Dockerfile `RUN` commands if necessary. - -### Remove linux/amd64 hard-codings from the codebase - -All places where `linux/amd64` is hardcoded in the codebase should be rewritten. - -#### Make kubelet automatically use the right pause image - -The `pause` is used for connecting containers into Pods. It's a binary that just sleeps forever. -When Kubernetes starts up a Pod, it first starts a `pause` container, and let's all "real" containers join the same network by setting `--net=${pause_container_id}`. - -So in order to start Kubernetes Pods on any other architecture, an ever-sleeping image have to exist. - -Fortunately, `kubelet` has the `--pod-infra-container-image` option, and it has been used when running Kubernetes on other platforms. - -But relying on the deployment setup to specify the right image for the platform isn't great, the kubelet should be smarter than that. - -This specific problem has been fixed in [#23059](https://github.com/kubernetes/kubernetes/pull/23059). - -#### Vendored packages - -Here are two common problems that a vendored package might have when trying to add/update it: - - Including constants combined with build tags - -```go -//+ build linux,amd64 -const AnAmd64OnlyConstant = 123 -``` - - - Relying on platform-specific syscalls (e.g. `syscall.Dup2`) - -If someone tries to add a dependency that doesn't satisfy these requirements; the CI will catch it and block the PR until the author has updated the vendored repo and fixed the problem. - -### kubectl should be released for all platforms that are relevant - -kubectl is released for more platforms than the proposed server platforms, if you want to check out an up-to-date list of them, [see here](https://git.k8s.io/kubernetes/hack/lib/golang.sh). - -kubectl is trivial to cross-compile, so if there's interest in adding a new platform for it, it may be as easy as appending the platform to the list linked above. - -### Addons - -Addons like dns, heapster and ingress play a big role in a working Kubernetes cluster, and we should aim to be able to deploy these addons on multiple platforms too. - -`kube-dns`, `dashboard` and `addon-manager` are the most important images, and they are already ported for multiple platforms. - -These addons should also be converted to multiple platforms: - - heapster, influxdb + grafana - - nginx-ingress - - elasticsearch, fluentd + kibana - - registry - -### Conflicts - -What should we do if there's a conflict between keeping e.g. `linux/ppc64le` builds vs. merging a release blocker? - -In fact, we faced this problem while this proposal was being written; in [#25243](https://github.com/kubernetes/kubernetes/pull/25243). It is quite obvious that the release blocker is of higher priority. - -However, before temporarily [deactivating builds](https://github.com/kubernetes/kubernetes/commit/2c9b83f291e3e506acc3c08cd10652c255f86f79), the author of the breaking PR should first try to fix the problem. If it turns out being really hard to solve, builds for the affected platform may be deactivated and a P1 issue should be made to activate them again. - -## Cross-compilation details (for reference) - -### Go language details - -Go 1.5 introduced many changes. To name a few that are relevant to Kubernetes: - - C was eliminated from the tree (it was earlier used for the bootstrap runtime). 
- - All processors are used by default, which means we should be able to remove [lines like this one](https://github.com/kubernetes/kubernetes/blob/v1.2.0/cmd/kubelet/kubelet.go#L37) - - The garbage collector became more efficient (but also [confused our latency test](https://github.com/golang/go/issues/14396)). - - `linux/arm64` and `linux/ppc64le` were added as new ports. - - The `GO15VENDOREXPERIMENT` was started. We switched from `Godeps/_workspace` to the native `vendor/` in [this PR](https://github.com/kubernetes/kubernetes/pull/24242). - - It's not required to pre-build the whole standard library `std` when cross-compiling. [Details](#prebuilding-the-standard-library-std) - - Builds are approximately twice as slow as earlier. That affects the CI. [Details](#releasing) - - The native Go DNS resolver will suffice in the most situations. This makes static linking much easier. - -All release notes for Go 1.5 [are here](https://golang.org/doc/go1.5) - -Go 1.6 didn't introduce as many changes as Go 1.5 did, but here are some of notes: - - It should perform a little bit better than Go 1.5. - - `linux/mips64` and `linux/mips64le` were added as new ports. - - Go < 1.6.2 for `ppc64le` had [bugs in it](https://github.com/kubernetes/kubernetes/issues/24922). - -All release notes for Go 1.6 [are here](https://golang.org/doc/go1.6) - -In Kubernetes 1.2, the only supported Go version was `1.4.2`, so `linux/arm` was the only possible extra architecture: [#19769](https://github.com/kubernetes/kubernetes/pull/19769). -In Kubernetes 1.3, [we upgraded to Go 1.6](https://github.com/kubernetes/kubernetes/pull/22149), which made it possible to build Kubernetes for even more architectures [#23931](https://github.com/kubernetes/kubernetes/pull/23931). - -#### The `sync/atomic` bug on 32-bit platforms - -From https://golang.org/pkg/sync/atomic/#pkg-note-BUG: -> On both ARM and x86-32, it is the caller's responsibility to arrange for 64-bit alignment of 64-bit words accessed atomically. The first word in a global variable or in an allocated struct or slice can be relied upon to be 64-bit aligned. - -`etcd` have had [issues](https://github.com/coreos/etcd/issues/2308) with this. See [how to fix it here](https://github.com/coreos/etcd/pull/3249) - -```go -// 32-bit-atomic-bug.go -package main -import "sync/atomic" - -type a struct { - b chan struct{} - c int64 -} - -func main(){ - d := a{} - atomic.StoreInt64(&d.c, 10 * 1000 * 1000 * 1000) -} -``` - -```console -$ GOARCH=386 go build 32-bit-atomic-bug.go -$ file 32-bit-atomic-bug -32-bit-atomic-bug: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, not stripped -$ ./32-bit-atomic-bug -panic: runtime error: invalid memory address or nil pointer dereference -[signal 0xb code=0x1 addr=0x0 pc=0x808cd9b] - -goroutine 1 [running]: -panic(0x8098de0, 0x1830a038) - /usr/local/go/src/runtime/panic.go:481 +0x326 -sync/atomic.StoreUint64(0x1830e0f4, 0x540be400, 0x2) - /usr/local/go/src/sync/atomic/asm_386.s:190 +0xb -main.main() - /tmp/32-bit-atomic-bug.go:11 +0x4b -``` - -This means that all structs should keep all `int64` and `uint64` fields at the top of the struct to be safe. If we would move `a.c` to the top of the `a` struct above, the operation would succeed. - -The bug affects `32-bit` platforms when a `(u)int64` field is accessed by an `atomic` method. 
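For completeness, here is a fixed variant of the snippet above: moving the `int64` field to the top of the struct gives it the guaranteed 64-bit alignment, so the same store succeeds when built with `GOARCH=386` or `GOARCH=arm`.

```go
// 32-bit-atomic-fix.go - same operation as the example above, but with the
// 64-bit field placed first so it is properly aligned on 32-bit platforms.
package main

import (
	"fmt"
	"sync/atomic"
)

type a struct {
	c int64 // 64-bit field first: guaranteed 64-bit alignment
	b chan struct{}
}

func main() {
	d := a{}
	atomic.StoreInt64(&d.c, 10*1000*1000*1000)
	fmt.Println(atomic.LoadInt64(&d.c))
}
```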
-It would be great to write a tool that checks so all `atomic` accessed fields are aligned at the top of the struct, but it's hard: [coreos/etcd#5027](https://github.com/coreos/etcd/issues/5027). - -## Prebuilding the Go standard library (`std`) - -A great blog post [that is describing this](https://medium.com/@rakyll/go-1-5-cross-compilation-488092ba44ec#.5jcd0owem) - -Before Go 1.5, the whole Go project had to be cross-compiled from source for **all** platforms that _might_ be used, and that was quite a slow process: - -```console -# From build/build-image/cross/Dockerfile when we used Go 1.4 -$ cd /usr/src/go/src -$ for platform in ${PLATFORMS}; do GOOS=${platform%/*} GOARCH=${platform##*/} ./make.bash --no-clean; done -``` - -With Go 1.5+, cross-compiling the Go repository isn't required anymore. Go will automatically cross-compile the `std` packages that are being used by the code that is being compiled, _and throw it away after the compilation_. -If you cross-compile multiple times, Go will build parts of `std`, throw it away, compile parts of it again, throw that away and so on. - -However, there is an easy way of cross-compiling all `std` packages in advance with Go 1.5+: - -```console -# From build/build-image/cross/Dockerfile when we're using Go 1.5+ -$ for platform in ${PLATFORMS}; do GOOS=${platform%/*} GOARCH=${platform##*/} go install std; done -``` - -### Static cross-compilation - -Static compilation with Go 1.5+ is dead easy: - -```go -// main.go -package main -import "fmt" -func main() { - fmt.Println("Hello Kubernetes!") -} -``` - -```console -$ go build main.go -$ file main -main: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, not stripped -$ GOOS=linux GOARCH=arm go build main.go -$ file main -main: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), statically linked, not stripped -``` - -The only thing you have to do is change the `GOARCH` and `GOOS` variables. Here's a list of valid values for [GOOS/GOARCH](https://golang.org/doc/install/source#environment) - -#### Static compilation with `net` - -Consider this: - -```go -// main-with-net.go -package main -import "net" -import "fmt" -func main() { - fmt.Println(net.ParseIP("10.0.0.10").String()) -} -``` - -```console -$ go build main-with-net.go -$ file main-with-net -main-with-net: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, - interpreter /lib64/ld-linux-x86-64.so.2, not stripped -$ GOOS=linux GOARCH=arm go build main-with-net.go -$ file main-with-net -main-with-net: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), statically linked, not stripped -``` - -Wait, what? Just because we included `net` from the `std` package, the binary defaults to being dynamically linked when the target platform equals to the host platform? -Let's take a look at `go env` to get a clue why this happens: - -```console -$ go env -GOARCH="amd64" -GOHOSTARCH="amd64" -GOHOSTOS="linux" -GOOS="linux" -GOPATH="/go" -GOROOT="/usr/local/go" -GO15VENDOREXPERIMENT="1" -CC="gcc" -CXX="g++" -CGO_ENABLED="1" -``` - -See the `CGO_ENABLED=1` at the end? That's where compilation for the host and cross-compilation differs. By default, Go will link statically if no `cgo` code is involved. `net` is one of the packages that prefers `cgo`, but doesn't depend on it. - -When cross-compiling on the other hand, `CGO_ENABLED` is set to `0` by default. 
- -To always be safe, run this when compiling statically: - -```console -$ CGO_ENABLED=0 go build -a -installsuffix cgo main-with-net.go -$ file main-with-net -main-with-net: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, not stripped -``` - -See [golang/go#9344](https://github.com/golang/go/issues/9344) for more details. - -### Dynamic cross-compilation - -In order to dynamically compile a go binary with `cgo`, we need `gcc` installed at build time. - -The only Kubernetes binary that is using C code is the `kubelet`, or in fact `cAdvisor` on which `kubelet` depends. `hyperkube` is also dynamically linked as long as `kubelet` is. We should aim to make `kubelet` statically linked. - -The normal `x86_64-linux-gnu` can't cross-compile binaries, so we have to install gcc cross-compilers for every platform. We do this in the [`kube-cross`](https://git.k8s.io/kubernetes/build/build-image/cross/Dockerfile) image, -and depend on the [`emdebian.org` repository](https://wiki.debian.org/CrossToolchains). Depending on `emdebian` isn't ideal, so we should consider using the latest `gcc` cross-compiler packages from the `ubuntu` main repositories in the future. - -Here's an example when cross-compiling plain C code: - -```c -// main.c -#include <stdio.h> -main() -{ - printf("Hello Kubernetes!\n"); -} -``` - -```console -$ arm-linux-gnueabi-gcc -o main-c main.c -$ file main-c -main-c: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), dynamically linked, - interpreter /lib/ld-linux.so.3, for GNU/Linux 2.6.32, not stripped -``` - -And here's an example when cross-compiling `go` and `c`: - -```go -// main-cgo.go -package main -/* -char* sayhello(void) { return "Hello Kubernetes!"; } -*/ -import "C" -import "fmt" -func main() { - fmt.Println(C.GoString(C.sayhello())) -} -``` - -```console -$ CGO_ENABLED=1 CC=arm-linux-gnueabi-gcc GOOS=linux GOARCH=arm go build main-cgo.go -$ file main-cgo -./main-cgo: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), dynamically linked, - interpreter /lib/ld-linux.so.3, for GNU/Linux 2.6.32, not stripped -``` - -The bad thing with dynamic compilation is that it adds an unnecessary dependency on `glibc` _at runtime_. - -### Static compilation with CGO code - -Lastly, it's even possible to cross-compile `cgo` code _statically_: - -```console -$ CGO_ENABLED=1 CC=arm-linux-gnueabi-gcc GOARCH=arm go build -ldflags '-extldflags "-static"' main-cgo.go -$ file main-cgo -./main-cgo: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), statically linked, - for GNU/Linux 2.6.32, not stripped -``` - -This is especially useful if we want to include the binary in a container. -If the binary is statically compiled, we may use `busybox` or even `scratch` as the base image. -This should be the preferred way of compiling binaries that strictly require C code to be a part of it. - -#### GOARM - -32-bit ARM comes in two main flavours: ARMv5 and ARMv7. Go has the `GOARM` environment variable that controls which version of ARM Go should target. Here's a table of all ARM versions and how they play together: - -ARM Version | GOARCH | GOARM | GCC package | No. of bits ------------ | ------ | ----- | ----------- | ----------- -ARMv5 | arm | 5 | armel | 32-bit -ARMv6 | arm | 6 | - | 32-bit -ARMv7 | arm | 7 | armhf | 32-bit -ARMv8 | arm64 | - | aarch64 | 64-bit - -The compatibility between the versions is pretty straightforward, ARMv5 binaries may run on ARMv7 hosts, but not vice versa. 
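Pulling the table and the toolchain examples above together, a per-platform build environment might look like the sketch below. The cross-compiler binary names are the standard Debian ones and are an assumption, not something this proposal mandates.

```go
// crossenv_sketch.go - prints an illustrative cross-compilation environment
// for each proposed platform; toolchain names are assumptions.
package main

import "fmt"

type buildEnv struct {
	GOOS, GOARCH, GOARM, CC string
}

func main() {
	envs := map[string]buildEnv{
		"linux/amd64":   {"linux", "amd64", "", "gcc"},
		"linux/arm":     {"linux", "arm", "6", "arm-linux-gnueabi-gcc"}, // GOARM=6 initially, per this proposal
		"linux/arm64":   {"linux", "arm64", "", "aarch64-linux-gnu-gcc"},
		"linux/ppc64le": {"linux", "ppc64le", "", "powerpc64le-linux-gnu-gcc"},
	}
	for platform, e := range envs {
		fmt.Printf("%-14s GOOS=%s GOARCH=%s", platform, e.GOOS, e.GOARCH)
		if e.GOARM != "" {
			fmt.Printf(" GOARM=%s", e.GOARM)
		}
		fmt.Printf(" CC=%s CGO_ENABLED=1\n", e.CC)
	}
}
```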
- -## Cross-building docker images for linux - -After binaries have been cross-compiled, they should be distributed in some manner. - -The default and maybe the most intuitive way of doing this is by packaging it in a docker image. - -### Trivial Dockerfile - -All `Dockerfile` commands except for `RUN` works for any architecture without any modification. -The base image has to be switched to an arch-specific one, but except from that, a cross-built image is only a `docker build` away. - -```Dockerfile -FROM armel/busybox -ENV kubernetes=true -COPY kube-apiserver /usr/local/bin/ -CMD ["/usr/local/bin/kube-apiserver"] -``` - -```console -$ file kube-apiserver -kube-apiserver: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), statically linked, not stripped -$ docker build -t k8s.gcr.io/kube-apiserver-arm:v1.x.y . -Step 1 : FROM armel/busybox - ---> 9bb1e6d4f824 -Step 2 : ENV kubernetes true - ---> Running in 8a1bfcb220ac - ---> e4ef9f34236e -Removing intermediate container 8a1bfcb220ac -Step 3 : COPY kube-apiserver /usr/local/bin/ - ---> 3f0c4633e5ac -Removing intermediate container b75a054ab53c -Step 4 : CMD /usr/local/bin/kube-apiserver - ---> Running in 4e6fe931a0a5 - ---> 28f50e58c909 -Removing intermediate container 4e6fe931a0a5 -Successfully built 28f50e58c909 -``` - -### Complex Dockerfile - -However, in the most cases, `RUN` statements are needed when building the image. - -The `RUN` statement invokes `/bin/sh` inside the container, but in this example, `/bin/sh` is an ARM binary, which can't execute on an `amd64` processor. - -#### QEMU to the rescue - -Here's a way to run ARM Docker images on an amd64 host by using `qemu`: - -```console -# Register other architectures` magic numbers in the binfmt_misc kernel module, so it`s possible to run foreign binaries -$ docker run --rm --privileged multiarch/qemu-user-static:register --reset -# Download qemu 2.5.0 -$ curl -sSL https://github.com/multiarch/qemu-user-static/releases/download/v2.5.0/x86_64_qemu-arm-static.tar.xz \ - | tar -xJ -# Run a foreign docker image, and inject the amd64 qemu binary for translating all syscalls -$ docker run -it -v $(pwd)/qemu-arm-static:/usr/bin/qemu-arm-static armel/busybox /bin/sh - -# Now we`re inside an ARM container although we`re running on an amd64 host -$ uname -a -Linux 0a7da80f1665 4.2.0-25-generic #30-Ubuntu SMP Mon Jan 18 12:31:50 UTC 2016 armv7l GNU/Linux -``` - -Here a linux module called `binfmt_misc` registered the "magic numbers" in the kernel, so the kernel may detect which architecture a binary is, and prepend the call with `/usr/bin/qemu-(arm|aarch64|ppc64le)-static`. For example, `/usr/bin/qemu-arm-static` is a statically linked `amd64` binary that translates all ARM syscalls to `amd64` syscalls. - -The multiarch guys have done a great job here, you may find the source for this and other images at [GitHub](https://github.com/multiarch) - - -## Implementation - -## History - -32-bit ARM (`linux/arm`) was the first platform Kubernetes was ported to, and luxas' project [`Kubernetes on ARM`](https://github.com/luxas/kubernetes-on-arm) (released on GitHub the 31st of September 2015) -served as a way of running Kubernetes on ARM devices easily. -The 30th of November 2015, a tracking issue about making Kubernetes run on ARM was opened: [#17981](https://github.com/kubernetes/kubernetes/issues/17981). It later shifted focus to how to make Kubernetes a more platform-independent system. 
- -The 27th of April 2016, Kubernetes `v1.3.0-alpha.3` was released, and it became the first release that was able to run the [docker getting started guide](http://kubernetes.io/docs/getting-started-guides/docker/) on `linux/amd64`, `linux/arm`, `linux/arm64` and `linux/ppc64le` without any modification. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/multicluster/OWNERS b/contributors/design-proposals/multicluster/OWNERS deleted file mode 100644 index bedef962..00000000 --- a/contributors/design-proposals/multicluster/OWNERS +++ /dev/null @@ -1,8 +0,0 @@ -# See the OWNERS docs at https://go.k8s.io/owners - -reviewers: - - sig-multicluster-leads -approvers: - - sig-multicluster-leads -labels: - - sig/multicluster diff --git a/contributors/design-proposals/multicluster/cluster-registry/api-design.md b/contributors/design-proposals/multicluster/cluster-registry/api-design.md index 3c2b748c..f0fbec72 100644 --- a/contributors/design-proposals/multicluster/cluster-registry/api-design.md +++ b/contributors/design-proposals/multicluster/cluster-registry/api-design.md @@ -1,164 +1,6 @@ -# Cluster Registry API +Design proposals have been archived. -@perotinus, @madhusudancs +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Original draft: 08/16/2017 -**Reviewed** in SIG multi-cluster meeting on 8/29 - -*This doc is a Markdown conversion of the original Cluster Registry API -[Google doc](https://docs.google.com/document/d/1Oi9EO3Jwtp69obakl-9YpLkP764GZzsz95XJlX1a960/edit). -That doc is deprecated, and this one is canonical; however, the old doc will be -preserved so as not to lose comment and revision history that it contains.* -## Table of Contents - -- [Purpose](#purpose) -- [Motivating use cases](#motivating-use-cases) -- [API](#api) -- [Authorization-based filtering of the list of clusters](#authorization-based-filtering-of-the-list-of-clusters) -- [Status](#status) -- [Auth](#auth) -- [Key differences vs existing Federation API `Cluster` object](#key-differences-vs-existing-federation-api-cluster-object) -- [Open questions](#open-questions) - -## Purpose - -The cluster registry API is intended to provide a common abstraction for other -tools that will perform operations on multiple clusters. It provides an -interface to a list of objects that will store metadata about clusters that can -be used by other tools. The cluster registry implementation is meant to remain -simple: we believe there is benefit in defining a common layer that can be used -by many different tools to solve different kinds of multi-cluster problems. - -It may be helpful to consider this API as an extension of the `kubeconfig` file. -The `kubeconfig` file contains a list of clusters with the auth data necessary -for kubectl to access them; the cluster registry API intends to provide this -data, plus some additional useful metadata, from a remote location instead of -from the user's local machine. - -## Motivating use cases - -These were presented at the SIG-Federation F2F meeting on 8/4/17 -([Atlassian](https://docs.google.com/document/d/1PH859COCWSkRxILrQd6wDdYLGJaBtWQkSN3I-Lnam3g/edit#heading=h.suxgoa67n1aw), -[CoreOS](https://docs.google.com/presentation/d/1InJagQNOxqA0ftK0peJLzyEFU2IZEXrJprDN6fcleMg/edit#slide=id.p), -[Google](https://docs.google.com/presentation/d/1Php_HnHI-Sy20ieyd_jBgr7XTs0fKT0Cq9z6dC4zOMc/), -[RedHat](https://docs.google.com/presentation/d/1dExjeSQTXI8_k00nqXRkSIFPTkzAzUTFtETU4Trg5yw/edit#slide=id.p)). -Each of the use cases presented assumes the ability to access a registry of -clusters, and so all are valid motivating use cases for the cluster registry -API. Note that these use cases will generally require more tooling than the -cluster registry itself. 
The cluster registry API will support what these other -tools will need in order to operate, but will not intrinsically support these -use cases. - -- Consistent configuration across clusters/replication of resources -- Federated Ingress: load balancing across multiple clusters, potentially - geo-aware -- Multi-cluster application distribution, with policy/balancing -- Disaster recovery/failover -- Human- and tool- parseable interaction with a list of clusters -- Monitoring/health checking/status reporting for clusters, potentially with - dashboards -- Policy-based and jurisdictional placement of workloads - -## API - -This document defines the cluster registry API. It is an evolution of the -[current Federation cluster API](https://git.k8s.io/federation/apis/federation/types.go#L99), -and is designed more specifically for the "cluster registry" use case in -contrast to the Federation `Cluster` object, which was made for the -active-control-plane Federation. - -The API is a Kubernetes-style REST API that supports the following operations: - -1. `POST` - to create new objects. -1. `GET` - to retrieve both lists and individual objects. -1. `PUT` - to update or create an object. -1. `DELETE` - to delete an object. -1. `PATCH` - to modify the fields of an object. - -Optional API operations: - -1. `WATCH` - to receive a stream of changes made to a given object. As `WATCH` - is not a standard HTTP method, this operation will be implemented as `GET - /<resource>&watch=true`. We believe that it's not always necessary to - support WATCH for this API. Implementations can choose to support or not - support this operation. An implementation that does not support the - operation should return HTTP error 405, StatusMethodNotAllowed, per the - [relevant Kubernetes API conventions](/contributors/devel/sig-architecture/api-conventions.md#error-codes). - -We also intend to support a use case where the server returns a file that can be -stored for later use. We expect this to be doable with the standard API -machinery; and if the API is implemented not using the Kubernetes API machinery, -that the returned file must be interoperable with the response from a Kubernetes -API server. - -[The API](https://git.k8s.io/cluster-registry/pkg/apis/clusterregistry/v1alpha1/types.go) -is defined in the cluster registry repo, and is not replicated here in order to -avoid mismatches. - -All top-level objects that define resources in Kubernetes embed a -`meta.ObjectMeta` that in-turn contains a number of fields. All the fields in -that struct are potentially useful with the exception of the `ClusterName` and -the `Namespace` fields. Having a `ClusterName` field alongside a `Name` field in -the cluster registry API will be confusing to our users. Therefore, in the -initial API implementation, we will add validation logic that rejects `Cluster` -objects that contain a value for the `ClusterName` field. The `Cluster` object's -`Namespace` field will be disabled by making the object be root scoped instead -of namespace scoped. - -The `Cluster` object will have `Spec` and `Status` fields, following the -[Kubernetes API conventions](/contributors/devel/sig-architecture/api-conventions.md#spec-and-status). 
-There was argument in favor of a `State` field instead of `Spec` and `Status` -fields, since the `Cluster` in the registry does not necessarily hold a user's -intent about the cluster being represented, but instead may hold descriptive -information about the cluster and information about the status of the cluster; -and because the cluster registry provides no controller that performs -reconciliation on `Cluster` objects. However, after -[discussion with SIG-arch](https://groups.google.com/forum/#!topic/kubernetes-sig-architecture/ptK2mVtha38), -the decision was made in favor of spec and status. - -## Authorization-based filtering of the list of clusters - -The initial version of the cluster registry supports a cluster list API that -does not take authorization rules into account. It returns a list of clusters -similar to how other Kubernetes List APIs list the objects in the presence of -RBAC rules. A future version of this API will take authorization rules into -account and only return the subset of clusters a user is authorized to access in -the registry. - -## Status - -There are use cases for the cluster registry that call for storing status that -is provided by more active controllers, e.g. health checks and cluster capacity. -At this point, these use cases are not as well-defined as the use cases that -require a data store, and so we do not intend to propose a complete definition -for the `ClusterStatus` type. We recognize the value of conventions, so as these -use cases become more clearly defined, the API of the `ClusterStatus` will be -extended appropriately. - -## Auth - -The cluster registry API will not provide strongly-typed objects for returning -auth info. Instead, it will provide a generic type that clients can use as they -see fit. This is intended to mirror what `kubectl` does with its -[AuthProviderConfig](https://git.k8s.io/client-go/tools/clientcmd/api/types.go#L144). -As open standards are developed for cluster auth, the API can be extended to -provide first-class support for these. We want to avoid baking non-open -standards into the API, and so having to support potentially a multiplicity of -them as they change. The cluster registry itself is not intended to be a -credential store, but instead to provide "pointers" that will provide the -information needed by callers to authenticate to a cluster. There is some more -context -[here](https://docs.google.com/a/google.com/document/d/1cxKV4Faywsn_to49csN0S0TZLYuHgExusEsmgKQWc28/edit?usp=sharing). - -## Key differences vs existing Federation API `Cluster` object - -- Active controller is not required; the registry can be used without any - controllers -- `WATCH` support is not required - -## Open questions - -All open questions have been -[migrated](https://github.com/kubernetes/cluster-registry/issues?utf8=%E2%9C%93&q=is%3Aissue%20is%3Aopen%20%22Migrated%20from%20the%20Cluster%22) -to issues in the cluster registry repo. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
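To illustrate the "generic type" approach from the Auth section above, the shape below mirrors client-go's `AuthProviderConfig`: a provider name plus an opaque string map that points at auth material rather than storing it. The field names and values are illustrative, not the final cluster registry API.

```go
// authinfo_sketch.go - illustrative generic auth "pointer" for a registered
// cluster, mirroring the shape of client-go's AuthProviderConfig.
package main

import "fmt"

// AuthProviderConfig is the kind of untyped auth reference a registry entry
// could carry: a provider name plus opaque configuration that clients
// interpret as they see fit. No credentials are stored in the registry.
type AuthProviderConfig struct {
	Name   string            `json:"name"`
	Config map[string]string `json:"config,omitempty"`
}

func main() {
	auth := AuthProviderConfig{
		Name: "oidc", // example provider
		Config: map[string]string{
			// Pointers to where auth material lives, not the material itself.
			"idp-issuer-url": "https://accounts.example.com",
			"client-id":      "kubernetes",
		},
	}
	fmt.Printf("%+v\n", auth)
}
```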
\ No newline at end of file diff --git a/contributors/design-proposals/multicluster/cluster-registry/project-design-and-plan.md b/contributors/design-proposals/multicluster/cluster-registry/project-design-and-plan.md index 09685ffd..f0fbec72 100644 --- a/contributors/design-proposals/multicluster/cluster-registry/project-design-and-plan.md +++ b/contributors/design-proposals/multicluster/cluster-registry/project-design-and-plan.md @@ -1,335 +1,6 @@ -# Cluster registry design and plan +Design proposals have been archived. -@perotinus +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Updated: 11/2/17 -*REVIEWED* in SIG-multicluster meeting on 10/24 - -*This doc is a Markdown conversion of the original Cluster registry design and -plan -[Google doc](https://docs.google.com/document/d/1bVvq9lDIbE-Glyr6GkSGWYkLb2cCNk9bR8LL7Wm-L6g/edit). -That doc is deprecated, and this one is canonical; however, the old doc will be -preserved so as not to lose comment and revision history that it contains.* - -## Table of Contents - -- [Background](#background) -- [Goal](#goal) -- [Technical requirements](#technical-requirements) - - [Alpha](#alpha) - - [Beta](#beta) - - [Later](#later) -- [Implementation design](#implementation-design) - - [Alternatives](#alternatives) - - [Using a CRD](#using-a-crd) -- [Tooling design](#tooling-design) - - [User tooling](#user-tooling) -- [Repository process](#repository-process) -- [Release strategy](#release-strategy) - - [Version skew](#version-skew) -- [Test strategy](#test-strategy) -- [Milestones and timelines](#milestones-and-timelines) - - [Alpha (targeting late Q4 '17)](#alpha-targeting-late-q4-'17) - - [Beta (targeting mid Q1 '18)](#beta-targeting-mid-q1-'18) - - [Stable (targeting mid Q2 '18)](#stable-targeting-mid-q2-'18) - - [Later](#later) - -## Background - -SIG-multicluster has identified a cluster registry as being a key enabling -component for multi-cluster use cases. The goal of the SIG in this project is -that the API defined for the cluster registry become a standard for -multi-cluster tools. The API design is being discussed in -[a separate doc](https://drive.google.com/a/google.com/open?id=1Oi9EO3Jwtp69obakl-9YpLkP764GZzsz95XJlX1a960). -A working prototype of the cluster registry has been assembled in -[a new repository](https://github.com/kubernetes/cluster-registry). - -## Goal - -This document intends to lay out an initial plan for moving the cluster registry -from the prototype state through alpha, beta and eventually to a stable release. - -## Technical requirements - -These requirements are derived mainly from the output of the -[August 9th multi-cluster SIG meeting](https://docs.google.com/document/d/11cB3HK67BZUb7aNOCpQK8JsA8Na8F-F_6bFu77KrKT4/edit#heading=h.cuvqls7pl9qc). -However, they also derive (at least indirectly) from the results of the -[SIG F2F meeting ](https://docs.google.com/document/d/1HkVBSm9L9UJC2f3wfs_8zt1PJmv6iepdtJ2fmkCOHys/edit) -that took place earlier, as well as the use cases presented by various parties -at that meeting. 
- -### Alpha - -- [API] Provides an HTTP server with a Kubernetes-style API for CRUD - operations on a registry of clusters - - Provides a way to filter across clusters by label - - Supports annotations -- [API] Provides information about each cluster's API server's - location/endpoint -- [API] Provides the ability to assign user-friendly names for clusters -- [API] Provides pointers to authentication information for clusters' API - servers - - This implies that it does not support storing credentials directly -- [Implementation] Supports both independent and aggregated deployment models - - Supports delegated authn/authz in aggregated mode - - Supports integration into Federation via aggregated deployment model - -### Beta - -- Supports providing a flat file of clusters for storage and later use - - This may be provided by the ecosystem rather than the registry API - implementation directly -- Supports independent authz for reading/writing clusters -- `kubectl` integration with the cluster registry API server is first-class - and on par with kubectl integration with core Kubernetes APIs -- Supports grouping of clusters -- Supports specifying and enforcing read and/or write authorization for groups - of clusters in the registry -- Working Federation integration -- Supports status from active controllers - -### Later - -- Supports an HA deployment strategy -- Supports guarantees around immutability/identity of clusters in list -- Version skew between various components is understood and supported skews - are defined and tested - -## Implementation design - -The cluster registry will be implemented using the -[Kubernetes API machinery](https://github.com/kubernetes/apimachinery). The -cluster registry API server will be a fork and rework of the existing Federation -API server, scaled down and simplified to match the simpler set of requirements -for the cluster registry. It will use the -[apiserver](https://github.com/kubernetes/apiserver) library, plus some code -copied from the core Kubernetes repo that provides scaffolding for certain -features in the API server. This is currently implemented in a prototype form -in the [cluster-registry repo](https://github.com/kubernetes/cluster-registry). - -The API will be implemented using the Kubernetes API machinery, as a new API -with two objects, `Cluster` and `ClusterList`. Other APIs may be added in the -future to support future use cases, but the intention is that the cluster -registry API server remain minimal and only provide the APIs that users of a -cluster registry would want for the cluster registry. - -The cluster registry will not be responsible for storing secrets. It will -contain pointers to other secret stores which will need to be implemented -independently. The cluster registry API will not provide proxy access to -clusters, and will not interact with clusters on a user's behalf. Storing secret -information in the cluster registry will be heavily discouraged by its -documentation, and will be considered a misuse of the registry. This allows us -to sidestep the complexity of implementing a secure credential storage. - -The expectation is that Federation and other programs will use the -cluster-registry as an aggregated API server rather than via direct code -integration. Therefore, the cluster registry will explicitly support being -deployed as an API server that can be aggregated by other API servers. 
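The archived text above describes the registry as a minimal aggregated API server exposing `Cluster` and `ClusterList` objects, with clients expected to use `kubectl` or generated client libraries. As a minimal sketch of what a raw client call against such a registry might look like, assuming a `clusterregistry.k8s.io/v1alpha1` group/version, a reachable service URL, and bearer-token auth (all assumptions, not taken from the proposal):

```python
# Illustrative only: list Cluster objects from a cluster registry API server.
# The group/version, endpoint, and auth scheme below are assumptions.
import requests

API = "https://clusterregistry.example.com/apis/clusterregistry.k8s.io/v1alpha1"
TOKEN = "replace-with-a-real-bearer-token"  # hypothetical credential

resp = requests.get(
    f"{API}/clusters",
    headers={"Authorization": f"Bearer {TOKEN}"},
    verify="/path/to/ca.crt",  # CA bundle for the registry's serving certificate
    timeout=10,
)
resp.raise_for_status()

# Print the name and spec of each registered cluster.
for cluster in resp.json().get("items", []):
    print(cluster["metadata"]["name"], cluster.get("spec", {}))
```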
- -### Alternatives - -#### Using a CRD - -The cluster registry could be implemented as a CRD that is registered with a -Kubernetes API server. This implementation is lighter weight than running a full -API server. If desired, the administrator could then disable the majority of the -Kubernetes APIs for non-admin users and so make it appear as if the API server -only supports cluster objects. It should be possible for a user to migrate -without much effort from a CRD-based to an API-server-based implementation of -the cluster registry, but the cluster-registry project is not currently planning -to spend time supporting this use case. CRDs do not support (and may never -support) versioning, which is very desirable for the cluster registry API. Users -who wish to use a CRD implementation will have to design and maintain it -themselves. - -## Tooling design - -### User tooling - -In the alpha stage, the cluster registry will repurpose the kubefed tool from -Federation and use it to initialize a cluster registry. In early stages, this -will only create a deployment with one replica, running the API server and etcd -in a `Pod`. As the HA requirements for the cluster registry are fleshed out, -this tool may need to be updated or replaced to support deploying a cluster -registry in multiple clusters and with multiple etcd replicas. - -Since the cluster registry is a Kubernetes API server that serves a custom -resource type, it will be usable by `kubectl`. We expect that the kubectl -experience for custom APIs will soon be on par with that of core Kubernetes -APIs, since there has been a significant investment in making `kubectl` provide -very detailed output from its describe subcommand; and `kubectl` 1.9 is expected -to introduce API server-originated columns. Therefore, we will not initially -implement any special tooling for interacting with the registry, and will tell -users to use `kubectl` or generated client libraries. - -## Repository process - -The cluster registry is a top-level Kubernetes repository, and thus it requires -some process to ensure stability for dependent projects and accountability. -Since the Federation project wants to use the cluster registry instead of its -current cluster API, there is a requirement to establish process even though the -project is young and does not yet have a lot of contributors. - -The standard Kubernetes Prow bots have been enabled on the repo. There is -currently some functionality around managing PR approval and reviewer assignment -and such that does not yet live in Prow, but given the limited number of -contributors at this point it seems reasonable to wait for sig-testing to -implement this functionality in Prow rather than enabling the deprecated tools. -In most cases where a process is necessary, we will defer to the spirit of the -Kubernetes process (if not the letter) though we will modify it as necessary for -the scope and scale of the cluster registry project. There is not yet a merge -queue, and until there is a clear need for one we do not intend to add one. - -The code in the repository will use bazel as its build system. This is in-line -with what the Kubernetes project is attempting to move towards, and since the -cluster registry has a similar but more limited set of needs than Kubernetes, we -expect that bazel will support our needs adequately. 
The structure that bazel -uses is meant to be compatible with go tooling, so if necessary, we can migrate -away from bazel in the future without having to entirely revamp the repository -structure. - -There is not currently a good process in Kubernetes for keeping vendored -dependencies up-to-date. The strategy we expect to take with the cluster -registry is to update only when necessary, and to use the same dependency -versions that Kubernetes uses unless there is some particular incompatibility. - -## Release strategy - -For early versions, the cluster registry release will consist of a container -image and a client tool. The container image will contain the cluster registry -API server, and the tool will be used to bootstrap this image plus an etcd image -into a running cluster. The container will be published to GCR, and the client -tool releases will be stored in a GCS bucket, following a pattern used by other -k/ projects. - -For now, the release process will be managed mostly manually by repository -maintainers. We will create a release branch that will be used for releases, and -use GitHub releases along with some additional per-release documentation -(`CHANGELOG`, etc). `CHANGELOG`s and release notes will be collected manually -until the volume of work becomes too great. We do not intend to create multiple -release branches until the project is more stable. Releases will undergo a -more-detailed set of tests that will ensure compatibility with recent released -versions of `kubectl` (for the registry) and Kubernetes (for the cluster -registry as an aggregated API server). Having a well-defined release process -will be a stable release requirement, and by that point we expect to have gained -some practical experience that will make it easier to codify the requirements -around doing releases. The cluster registry will use semantic versioning, but -its versions will not map to Kubernetes versions. Cluster registry releases will -not follow the Kubernetes release cycle, though Kubernetes releases may trigger -cluster registry releases if there are compatibility issues that need to be -fixed. - -Projects that want to vendor the cluster registry will be able to do so. We -expect that these projects will vendor from the release branch if they want a -stable version, or from a desired SHA if they are comfortable using a version -that has not necessarily been fully vetted. - -As the project matures, we expect the tool to evolve (or be replaced) in order -to support deployment against an existing etcd instance (potentially provided by -an etcd operator), and to provide a HA story for hosting a cluster registry. -This is considered future work and will not be addressed directly in this -document. - -Cross-compilation support in bazel is still a work in progress, so the cluster -registry will not be able to easily provide binary releases for every platform -until this is supported by bazel. If it becomes necessary to provide -cross-platform binaries before bazel cross-compilation is available, the -repository is setup to support common go tooling, so we should be able to devise -a process for doing so. - -### Version skew - -There are several participants in the cluster registry ecosystem whose versions -will be conceptually able to float relative to each other: - -- The cluster registry API server -- The host cluster's API server -- `kubectl` - -We will need to define the version skew restraints between these components and -ensure that our testing validates key skews that we care about. 
- -## Test strategy - -The cluster registry is a simple component, and so should be able to be tested -extensively without too much difficulty. The API machinery code is already -tested by its owning teams, and since the cluster registry is a straightforward -API server, it should not require much testing of its own. The bulk of the tests -written will be integration tests, to ensure that it runs correctly in -aggregated and independent modes in a Kubernetes cluster; to verify that various -versions of kubectl can interact with it; and to verify that it can be upgraded -safely. A full testing strategy is a requirement for a GA launch; we expect -development of a test suite to be an ongoing effort in the early stages of -development. - -The command-line tool will require testing. It should be E2E tested against -recent versions of Kubernetes, to ensure that a simple cluster registry can be -created in a Kubernetes cluster. The multiplicity of configuration options it -provides cannot conveniently be tested in E2E tests, and so will be validated in -unit tests. - -## Milestones and timelines - -### Alpha (targeting late Q4 '17) - -- Test suite is running on each PR, with reasonable unit test coverage and - minimal integration/E2E testing -- Repository processes (OWNERship, who can merge, review lag standards, - project planning, issue triaging, etc.) established -- Contributor documentation written -- User documentation drafted -- Cluster registry API also alpha -- All Alpha technical requirements met - -### Beta (targeting mid Q1 '18) - -- Full suite of integration/E2E tests running on each PR -- API is beta -- Preparatory tasks for GA -- All Beta technical requirements met -- User documentation complete and proofed, content-wise -- Enough feedback solicited from users, or inferred from download - statistics/repository issues - -### Stable (targeting mid Q2 '18) - -- Fully fleshed-out user documentation -- User documentation is published in a finalized location -- First version of API is GA -- Documented upgrade test procedure, with appropriate test tooling implemented -- Plan for and documentation about Kubernetes version support (e.g., which - versions of Kubernetes a cluster registry can be a delegated API server - with) -- Releases for all platforms -- Well-defined and documented release process - -### Later - -- Documented approach to doing a HA deployment -- Work on Later technical requirements - -## Questions - -- How does RBAC work in this case? -- Generated code. How does it work? How do we make it clean? Do we check it - in? -- Do we need/want an example client that uses the cluster registry for some - basic operations? A demo, as it were? -- Labels vs. fields. Versioning? Do we graduate? Do we define a policy for - label names and have the server validate it? -- Where should the user documentation for the cluster registry live? - kubernetes.io doesn't quite seem appropriate, but perhaps that's the right - place? -- Is there a reasonable way to support a CRD-based implementation? Should this - project support it directly, or work to not prevent it from working, or - ignore it? 
- -## History - -| Date | Details | -|--|--| -| 10/9/17 | Initial draft | -| 10/16/17 | Minor edits based on comments | -| 10/19/17 | Added section specifically about version skew; change P0/P1/P2 to alpha/beta/stable; added some milestone requirements | -| 11/2/17 | Resolved most comments and added minor tweaks to text in order to do so | +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/multicluster/control-plane-resilience.md b/contributors/design-proposals/multicluster/control-plane-resilience.md index 174e7de0..f0fbec72 100644 --- a/contributors/design-proposals/multicluster/control-plane-resilience.md +++ b/contributors/design-proposals/multicluster/control-plane-resilience.md @@ -1,235 +1,6 @@ -# Kubernetes and Cluster Federation Control Plane Resilience +Design proposals have been archived. -## Long Term Design and Current Status +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -### by Quinton Hoole, Mike Danese and Justin Santa-Barbara - -### December 14, 2015 - -## Summary - -Some amount of confusion exists around how we currently, and in future -want to ensure resilience of the Kubernetes (and by implication -Kubernetes Cluster Federation) control plane. This document is an attempt to capture that -definitively. It covers areas including self-healing, high -availability, bootstrapping and recovery. Most of the information in -this document already exists in the form of github comments, -PR's/proposals, scattered documents, and corridor conversations, so -document is primarily a consolidation and clarification of existing -ideas. - -## Terms - -* **Self-healing:** automatically restarting or replacing failed - processes and machines without human intervention -* **High availability:** continuing to be available and work correctly - even if some components are down or uncontactable. This typically - involves multiple replicas of critical services, and a reliable way - to find available replicas. Note that it's possible (but not - desirable) to have high - availability properties (e.g. multiple replicas) in the absence of - self-healing properties (e.g. if a replica fails, nothing replaces - it). Fairly obviously, given enough time, such systems typically - become unavailable (after enough replicas have failed). -* **Bootstrapping**: creating an empty cluster from nothing -* **Recovery**: recreating a non-empty cluster after perhaps - catastrophic failure/unavailability/data corruption - -## Overall Goals - -1. **Resilience to single failures:** Kubernetes clusters constrained - to single availability zones should be resilient to individual - machine and process failures by being both self-healing and highly - available (within the context of such individual failures). -1. **Ubiquitous resilience by default:** The default cluster creation - scripts for (at least) GCE, AWS and basic bare metal should adhere - to the above (self-healing and high availability) by default (with - options available to disable these features to reduce control plane - resource requirements if so required). It is hoped that other - cloud providers will also follow the above guidelines, but the - above 3 are the primary canonical use cases. -1. **Resilience to some correlated failures:** Kubernetes clusters - which span multiple availability zones in a region should by - default be resilient to complete failure of one entire availability - zone (by similarly providing self-healing and high availability in - the default cluster creation scripts as above). -1. **Default implementation shared across cloud providers:** The - differences between the default implementations of the above for - GCE, AWS and basic bare metal should be minimized. 
This implies - using shared libraries across these providers in the default - scripts in preference to highly customized implementations per - cloud provider. This is not to say that highly differentiated, - customized per-cloud cluster creation processes (e.g. for GKE on - GCE, or some hosted Kubernetes provider on AWS) are discouraged. - But those fall squarely outside the basic cross-platform OSS - Kubernetes distro. -1. **Self-hosting:** Where possible, Kubernetes's existing mechanisms - for achieving system resilience (replication controllers, health - checking, service load balancing etc) should be used in preference - to building a separate set of mechanisms to achieve the same thing. - This implies that self hosting (the kubernetes control plane on - kubernetes) is strongly preferred, with the caveat below. -1. **Recovery from catastrophic failure:** The ability to quickly and - reliably recover a cluster from catastrophic failure is critical, - and should not be compromised by the above goal to self-host - (i.e. it goes without saying that the cluster should be quickly and - reliably recoverable, even if the cluster control plane is - broken). This implies that such catastrophic failure scenarios - should be carefully thought out, and the subject of regular - continuous integration testing, and disaster recovery exercises. - -## Relative Priorities - -1. **(Possibly manual) recovery from catastrophic failures:** having a -Kubernetes cluster, and all applications running inside it, disappear forever -perhaps is the worst possible failure mode. So it is critical that we be able to -recover the applications running inside a cluster from such failures in some -well-bounded time period. - 1. In theory a cluster can be recovered by replaying all API calls - that have ever been executed against it, in order, but most - often that state has been lost, and/or is scattered across - multiple client applications or groups. So in general it is - probably infeasible. - 1. In theory a cluster can also be recovered to some relatively - recent non-corrupt backup/snapshot of the disk(s) backing the - etcd cluster state. But we have no default consistent - backup/snapshot, verification or restoration process. And we - don't routinely test restoration, so even if we did routinely - perform and verify backups, we have no hard evidence that we - can in practise effectively recover from catastrophic cluster - failure or data corruption by restoring from these backups. So - there's more work to be done here. -1. **Self-healing:** Most major cloud providers provide the ability to - easily and automatically replace failed virtual machines within a - small number of minutes (e.g. GCE - [Auto-restart](https://cloud.google.com/compute/docs/instances/setting-instance-scheduling-options#autorestart) - and Managed Instance Groups, - AWS[ Auto-recovery](https://aws.amazon.com/blogs/aws/new-auto-recovery-for-amazon-ec2/) - and [Auto scaling](https://aws.amazon.com/autoscaling/) etc). This - can fairly trivially be used to reduce control-plane down-time due - to machine failure to a small number of minutes per failure - (i.e. typically around "3 nines" availability), provided that: - 1. cluster persistent state (i.e. etcd disks) is either: - 1. truly persistent (i.e. remote persistent disks), or - 1. reconstructible (e.g. 
using etcd [dynamic member - addition](https://github.com/coreos/etcd/blob/master/Documentation/v2/runtime-configuration.md#add-a-new-member) - or [backup and - recovery](https://github.com/coreos/etcd/blob/master/Documentation/v2/admin_guide.md#disaster-recovery)). - 1. and boot disks are either: - 1. truly persistent (i.e. remote persistent disks), or - 1. reconstructible (e.g. using boot-from-snapshot, - boot-from-pre-configured-image or - boot-from-auto-initializing image). -1. **High Availability:** This has the potential to increase - availability above the approximately "3 nines" level provided by - automated self-healing, but it's somewhat more complex, and - requires additional resources (e.g. redundant API servers and etcd - quorum members). In environments where cloud-assisted automatic - self-healing might be infeasible (e.g. on-premise bare-metal - deployments), it also gives cluster administrators more time to - respond (e.g. replace/repair failed machines) without incurring - system downtime. - -## Design and Status (as of December 2015) - -<table> -<tr> -<td><b>Control Plane Component</b></td> -<td><b>Resilience Plan</b></td> -<td><b>Current Status</b></td> -</tr> -<tr> -<td><b>API Server</b></td> -<td> - -Multiple stateless, self-hosted, self-healing API servers behind a HA -load balancer, built out by the default "kube-up" automation on GCE, -AWS and basic bare metal (BBM). Note that the single-host approach of -having etcd listen only on localhost to ensure that only API server can -connect to it will no longer work, so alternative security will be -needed in the regard (either using firewall rules, SSL certs, or -something else). All necessary flags are currently supported to enable -SSL between API server and etcd (OpenShift runs like this out of the -box), but this needs to be woven into the "kube-up" and related -scripts. Detailed design of self-hosting and related bootstrapping -and catastrophic failure recovery will be detailed in a separate -design doc. - -</td> -<td> - -No scripted self-healing or HA on GCE, AWS or basic bare metal -currently exists in the OSS distro. To be clear, "no self healing" -means that even if multiple e.g. API servers are provisioned for HA -purposes, if they fail, nothing replaces them, so eventually the -system will fail. Self-healing and HA can be set up -manually by following documented instructions, but this is not -currently an automated process, and it is not tested as part of -continuous integration. So it's probably safest to assume that it -doesn't actually work in practise. - -</td> -</tr> -<tr> -<td><b>Controller manager and scheduler</b></td> -<td> - -Multiple self-hosted, self healing warm standby stateless controller -managers and schedulers with leader election and automatic failover of API -server clients, automatically installed by default "kube-up" automation. - -</td> -<td>As above.</td> -</tr> -<tr> -<td><b>etcd</b></td> -<td> - -Multiple (3-5) etcd quorum members behind a load balancer with session -affinity (to prevent clients from being bounced from one to another). 
- -Regarding self-healing, if a node running etcd goes down, it is always necessary -to do three things: -<ol> -<li>allocate a new node (not necessary if running etcd as a pod, in -which case specific measures are required to prevent user pods from -interfering with system pods, for example using node selectors as -described in <A HREF="https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#nodeselector">nodeSelector</A>), -<li>start an etcd replica on that new node, and -<li>have the new replica recover the etcd state. -</ol> -In the case of local disk (which fails in concert with the machine), the etcd -state must be recovered from the other replicas. This is called -<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/op-guide/runtime-configuration.md#add-a-new-member">dynamic member addition</A>. - -In the case of remote persistent disk, the etcd state can be recovered by -attaching the remote persistent disk to the replacement node, thus the state is -recoverable even if all other replicas are down. - -There are also significant performance differences between local disks and remote -persistent disks. For example, the -<A HREF="https://cloud.google.com/compute/docs/disks/#comparison_of_disk_types"> -sustained throughput local disks in GCE is approximately 20x that of remote disks</A>. - -Hence we suggest that self-healing be provided by remotely mounted persistent -disks in non-performance critical, single-zone cloud deployments. For -performance critical installations, faster local SSD's should be used, in which -case remounting on node failure is not an option, so -<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md "> -etcd runtime configuration</A> should be used to replace the failed machine. -Similarly, for cross-zone self-healing, cloud persistent disks are zonal, so -automatic <A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md"> -runtime configuration</A> is required. Similarly, basic bare metal deployments -cannot generally rely on remote persistent disks, so the same approach applies -there. -</td> -<td> -<A HREF="http://kubernetes.io/v1.1/docs/admin/high-availability.html"> -Somewhat vague instructions exist</A> on how to set some of this up manually in -a self-hosted configuration. But automatic bootstrapping and self-healing is not -described (and is not implemented for the non-PD cases). This all still needs to -be automated and continuously tested. -</td> -</tr> -</table> +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/multicluster/federated-api-servers.md b/contributors/design-proposals/multicluster/federated-api-servers.md index 00b1c23b..f0fbec72 100644 --- a/contributors/design-proposals/multicluster/federated-api-servers.md +++ b/contributors/design-proposals/multicluster/federated-api-servers.md @@ -1,4 +1,6 @@ -# Federated API Servers +Design proposals have been archived. -Moved to [aggregated-api-servers.md](../api-machinery/aggregated-api-servers.md) since cluster -federation stole the word "federation" from this effort and it was very confusing. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). + + +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/multicluster/federated-hpa.md b/contributors/design-proposals/multicluster/federated-hpa.md index 7639425b..f0fbec72 100644 --- a/contributors/design-proposals/multicluster/federated-hpa.md +++ b/contributors/design-proposals/multicluster/federated-hpa.md @@ -1,271 +1,6 @@ -# Federated Pod Autoscaler +Design proposals have been archived. -# Requirements & Design Document +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -irfan.rehman@huawei.com, quinton.hoole@huawei.com -# Use cases - -1 – Users can schedule replicas of same application, across the -federated clusters, using replicaset (or deployment). -Users however further might need to let the replicas be scaled -independently in each cluster, depending on the current usage metrics -of the replicas; including the CPU, memory and application defined -custom metrics. - -2 - As stated in the previous use case, a federation user schedules -replicas of same application, into federated clusters and subsequently -creates a horizontal pod autoscaler targeting the object responsible for -the replicas. User would want the auto-scaling to continue based on -the in-cluster metrics, even if for some reason, there is an outage at -federation level. User (or other users) should still be able to access -the deployed application into all federated clusters. Further, if the -load on the deployed app varies, the autoscaler should continue taking -care of scaling the replicas for a smooth user experience. - -3 - A federation that consists of an on-premise cluster and a cluster -running in a public cloud has a user workload (eg. deployment or rs) -preferentially running in the on-premise cluster. However if there are -spikes in the app usage, such that the capacity in the on-premise cluster -is not sufficient, the workload should be able to get scaled beyond the -on-premise cluster boundary and into the other clusters which are part -of this federation. - -Please refer to some additional use cases, which partly led to the derivation -of the above use case, and are listed in the **glossary** section of this document. - -# User workflow - -User wants to schedule a set of common workload across federated clusters. -He creates a replicaset or a deployment to schedule the workload (with or -without preferences). The federation then distributes the replicas of the -given workload into the federated clusters. As the user at this point is -unaware of the exact usage metrics of the individual pods created in the -federated clusters, he creates an HPA into the federation, providing metric -parameters to be used in the scale request for a resource. It is now the -responsibility of this HPA to monitor the relevant resource metrics and the -scaling of the pods per cluster then is controlled by the associated HPA. - -# Alternative approaches - -## Design Alternative 1 - -Make the autoscaling resource available and implement support for -horizontalpodautoscalers objects at federation. The HPA API resource -will need to be exposed at the federation level, which can follow the -version similar to one implemented in the latest k8s cluster release. - -Once the HPA object is created at federation, the federation controller -creates and monitors a similar HPA object (partitioning the min and max values) -in each of the federated clusters. 
Based on the metadata in spec of the HPA -describing the scaleTargetRef, the HPA will be applied on the already existing -target objects. If the target object is not present in the cluster (either -because, its not created until now, or deleted for some reason), the HPA will -still exist but no action will be taken. The HPA's action will become -applicable when the target object is created in the given cluster anytime in -future. Also as stated already the federation controller will need to partition -the min and max values appropriately into the federated clusters among the HPA -objects such that the total of min and that of max replicas satisfies the -constraints specified by the user at federation. The point of control over the -scaling of replicas will lie locally with the federated hpa controller. The -federated controller will however watch the cluster local HPAs wrt current -replicas of the target objects and will do intelligent dynamic adjustments of -min and max values of the HPA replicas across the clusters based on the run time -conditions. - -The federation controller by default will distribute the min and max replicas of the -HPA equally among all clusters. The min values will first be distributed such that -any cluster into which the replicas are distributed does not get a min replicas -lesser than 1. This means that HPA can actually be created in lesser number of -ready clusters then available in federation. Once this distribution happens, the -max replicas of the hpa will be distributed across all those clusters into which -the HPA needs to be created. The default distribution can be overridden using the -annotations on the HPA object, very similar to the annotations on federated -replicaset object as described -[here](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federated-replicasets.md#federatereplicaset-preferences). - -One of the points to note here is that, doing this brings a two point control on -number of replicas of the target object, one by the federated target object (rs or -deployment) and other by the hpa local to the federated cluster. Solution to which -is discussed in the following section. Another additional note here is that, the -preferences would consider use of only minreplicas and maxreplicas in this phase -of implementation and weights will be discarded for this alternative design. - -### Rebalancing of workload replicas and control over the same. - -The current implementation of federated replicasets (and deployments) first -distributes the replicas into underlying clusters and then monitors the status -of the pods in each cluster. In case there are clusters which have active pods -lesser than what federation reconciler desires, federation control plane will -trigger creation of the missing pods (which federation considers missing), or -in other case would trigger removal of pods, if the control plane considers that -the given cluster has more pods than needed. This is something which counters -the role of HPA in individual cluster. To handle this, the knowledge that HPA -is active separately targeting this object has to be percolated to the federation -control plane monitoring the individual replicas such that, the federation control -plane stops reconciling the replicas in the individual clusters. 
In other words -the link between the HPA wrt to the corresponding objects will need to be -maintained and if an HPA is active, other federation controllers (aka replicaset -and deployment controllers) reconcile process, would stop updating and/or -rebalancing the replicas in and across the underlying clusters. The reconcile -of the objects (rs or deployment) would still continue, to handle the scenario -of the object missing from any given federated cluster. -The mechanism to achieve this behaviour shall be as below: - - User creates a workload object (for example rs) in federation. - - User then creates an HPA object in federation (this step and the previous - step can follow either order of creation). - - The rs as an object will exist in federation control plane with or without - the user preferences and/or cluster selection annotations. - - The HPA controller will first evaluate which cluster(s) get the replicas - and which don't (if any). This list of clusters will be a subset of the - cluster selector already applied on the hpa object. - - The HPA controller will apply this list on the federated rs object as the - cluster selection annotation overriding the user provided preferences (if any). - The control over the placement of workload replicas and the add on preferences - will thus lie completely with the HPA objects. This is an important assumption - that the user of these federated objects interacting with each other should be - aware of; and if the user needs to place replicas in specific clusters, together - with workload autoscaling he/she should apply these preferences on the HPA - object. Any preferences applied on the workload object (rs or deployment) will - be overridden. - - The target workload object (for example rs) replicas will be kept unchanged - in the cluster which already has the replicas, will be created with one replica - if the particular cluster does not have the same and HPA calculation resulted - in some replicas for that cluster and deleted from the clusters which has the - replicas and the federated HPA calculations result in no replicas for that - particular cluster. - - The desired replicas per cluster as per the federated HPA dynamic rebalance - mechanism, elaborated in the next section, will be set on individual clusters - local HPA, which in turn will set the same on the target local object. - -### Dynamic HPA min/max rebalance - -The proposal in this section can be used to improve the distribution of replicas -across the clusters such that there are more replicas in those clusters, where -they are needed more. The federation hpa controller will monitor the status of -the local HPAs in the federated clusters and update the min and/or max values -set on the local HPAs as below (assuming that all previous steps are done and -local HPAs in federated clusters are active): - -1. At some point, one or more of the cluster HPA's hit the upper limit of their -allowed scaling such that _DesiredReplicas == MaxReplicas_; Or more appropriately -_CurrentReplicas == DesiredReplicas == MaxReplicas_. - -2. If the above is observed the Federation HPA tries to transfer allocation -of _MaxReplicas_ from clusters where it is not needed (_DesiredReplicas < MaxReplicas_) -or where it cannot be used, e.g. due to capacity constraints -(_CurrentReplicas < DesiredReplicas <= MaxReplicas_) to the clusters which have -reached their upper limit (1 above). - -3. 
It will be taken care that the _MaxReplica_ does not become lesser than _MinReplica_ -in any of the clusters in this redistribution. Additionally if the usage of the same -could be established, _MinReplicas_ can also be distributed as in 4 below. - -4. An exactly similar approach can also be applied to _MinReplicas_ of the local HPAs, -so as to reduce the min from those clusters, where -_CurrentReplicas == DesiredReplicas == MinReplicas_ and the observed average resource -metric usage (on the HPA) is lesser then a given threshold, to those clusters, -where the _DesiredReplicas > MinReplicas_. - -However, as stated in 3 above, the approach of distribution will first be implemented -only for _MaxReplicas_ to establish it utility, before implementing the same for _MinReplicas_. - -## Design Alternative 2 - -Same as the previous one, the API will need to be exposed at federation. - -However, when the request to create HPA is sent to federation, federation controller -will not create the HPA into the federated clusters. The HPA object will reside in the -federation API server only. The federation controller will need to get a metrics -client to each of the federated clusters and collect all the relevant metrics -periodically from all those clusters. The federation controller will further calculate -the current average metrics utilisation across all clusters (using the collected metrics) -of the given target object and calculate the replicas globally to attain the target -utilisation as specified in the federation HPA. After arriving at the target replicas, -the target replica number is set directly on the target object (replicaset, deployment, ..) -using its scales sub-resource at federation. It will be left to the actual target object -controller (for example RS controller) to distribute the replicas accordingly into the -federated clusters. The point of control over the scaling of replicas will lie completely -with the federation controllers. - -### Algorithm (for alternative 2) - -Federated HPA (FHPA), from every cluster gets: - -- ```avg_i``` average metric value (like CPU utilization) for all pods matching the -deployment/rs selector. -- ```count_i``` number of replicas that were used to calculate the average. - -To calculate the target number of replicas HPA calculates the sum of all metrics from -all clusters: - -```sum(avg_i * count_i)``` and divides it by target metric value. The target replica -count (validated against HPA min/max and thresholds) is set on Federated -Deployment/replica set. So the deployment has the correct number of replicas -(that should match the desired metric value) and provides all of the rebalancing/failover -mechanisms. - -Further, this can be expanded such that FHPA places replicas where they are needed the -most (in cluster that have the most traffic). For that FHPA would play with weights in -Federated Deployment. Each cluster will get the weight of ```100 * avg_i/sum(avg_i)```. -Weights hint Federated Deployment where to put replicas. But they are only hints so -if placing a replica in the desired cluster is not possible then it will be placed elsewhere, -what is probably better than not having the replica at all. - -# Other Scenario - -Other scenario, for example rolling updates (when user updates the deployment or RS), -recreation of the object (when user specifies the strategy as recreate while updating -the object), will continue to be handled the way they are handled in an individual k8s -cluster. 
Additionally there is a shortcoming in the current implementation of the -federated deployments rolling update. There is an existing proposal as part of the -[federated deployment design doc](https://github.com/kubernetes/community/pull/325). -Given it is implemented, the rolling updates for a federated deployment while a -federated HPA is active on the same object will also work fine. - -# Conclusion - -The design alternative 2 has the following major drawbacks, which are sufficient to -discard it as a probable implementation option: -- This option needs the federation control plane controller to collect metrics -data from each cluster, which is an overhead with increasing gravity of the problem -with increasing number of federated clusters, in a given federation. -- The monitoring and update of objects which are targeted by the federated HPA object -(when needed) for a particular federated cluster would stop if for whatever reasons -the network link between the federated cluster and federation control plane is severed. -A bigger problem can happen in case of an outage of the federation control plane -altogether. - -In Design Alternative 1 the autoscaling of replicas will continue, even if a given -cluster gets disconnected from federation or in case of the federation control plane -outage. This would happen because the local HPAs with the last know maxreplica and -minreplicas would exist in the local clusters. Additionally in this alternative there -is no need of collection and processing of the pod metrics for the target object from -each individual cluster. -This document proposes to use ***design alternative 1*** as the preferred implementation. - -# Glossary - -These use cases are specified using the terminology partly specific to telecom products/platforms: - -1 - A telecom service provider has a large number of base stations, for a particular region, -each with some set of virtualized resources each running some specific network functions. -In a specific scenario the resources need to be treated logically separate (thus making large -number of smaller clusters), but still a very similar workload needs to be deployed on each -cluster (network function stacks, for example). - -2 - In one of the architectures, the IOT matrix has IOT gateways, which aggregate a large -number of IOT sensors in a small area (for example a shopping mall). The IOT gateway is -envisioned as a virtualized resource, and in some cases multiple such resources need -aggregation, each forming a small cluster. Each of these clusters might run very similar -functions, but will independently scale based on the demand of that area. - -3 - A telecom service provider has a large number of base stations, each with some set of -virtualized resources, and each running specific network functions and each specifically -catering to different network abilities (2g, 3g, 4g, etc). Each of these virtualized base -stations, make small clusters and can cater to specific network abilities, such that one -can cater to one or more network abilities. At a given point of time there would be some -number of end user agents (cell phones) associated with each, and these UEs can come and -go within the range of each. While the UEs move, a more centralized entity (read federation) -needs to make a decision as to which exact base station cluster is suitable and with needed -resources to handle the incoming UEs. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
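The formulas in the "Algorithm (for alternative 2)" section above can be restated as a minimal sketch, assuming the per-cluster metric averages and replica counts have already been collected; the function and variable names are hypothetical, and the archived document ultimately prefers alternative 1:

```python
# Illustrative restatement of the archived "Design Alternative 2" arithmetic.
# Not an implementation of the proposal; names and structure are hypothetical.
import math

def fhpa_alternative_2(per_cluster, target_value, min_replicas, max_replicas):
    """per_cluster maps cluster name -> (avg_metric_value, replica_count)."""
    # Global desired replicas: sum(avg_i * count_i) / target, clamped to [min, max].
    total_usage = sum(avg * count for avg, count in per_cluster.values())
    desired = math.ceil(total_usage / target_value)
    desired = max(min_replicas, min(max_replicas, desired))

    # Per-cluster weights hinting where replicas are needed most:
    # 100 * avg_i / sum(avg_i).
    avg_sum = sum(avg for avg, _ in per_cluster.values())
    weights = {
        name: round(100 * avg / avg_sum) if avg_sum else 0
        for name, (avg, _) in per_cluster.items()
    }
    return desired, weights

# Example: two clusters reporting average CPU utilization and replica counts,
# with a 50% utilization target and HPA bounds of 2..20 replicas.
print(fhpa_alternative_2({"eu": (80, 4), "us": (40, 6)}, 50, 2, 20))
# -> (12, {'eu': 67, 'us': 33})
```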
\ No newline at end of file diff --git a/contributors/design-proposals/multicluster/federated-ingress.md b/contributors/design-proposals/multicluster/federated-ingress.md index ad07f1dc..f0fbec72 100644 --- a/contributors/design-proposals/multicluster/federated-ingress.md +++ b/contributors/design-proposals/multicluster/federated-ingress.md @@ -1,190 +1,6 @@ -# Kubernetes Federated Ingress +Design proposals have been archived. - Requirements and High Level Design +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). - Quinton Hoole - - July 17, 2016 - -## Overview/Summary - -[Kubernetes Ingress](https://kubernetes.io/docs/concepts/services-networking/ingress/) -provides an abstraction for sophisticated L7 load balancing through a -single IP address (and DNS name) across multiple pods in a single -Kubernetes cluster. Multiple alternative underlying implementations -are provided, including one based on GCE L7 load balancing and another -using an in-cluster nginx/HAProxy deployment (for non-GCE -environments). An AWS implementation, based on Elastic Load Balancers -and Route53 is under way by the community. - -To extend the above to cover multiple clusters, Kubernetes Federated -Ingress aims to provide a similar/identical API abstraction and, -again, multiple implementations to cover various -cloud-provider-specific as well as multi-cloud scenarios. The general -model is to allow the user to instantiate a single Ingress object via -the Federation API, and have it automatically provision all of the -necessary underlying resources (L7 cloud load balancers, in-cluster -proxies etc) to provide L7 load balancing across a service spanning -multiple clusters. - -Four options are outlined: - -1. GCP only -1. AWS only -1. Cross-cloud via GCP in-cluster proxies (i.e. clients get to AWS and on-prem via GCP). -1. Cross-cloud via AWS in-cluster proxies (i.e. clients get to GCP and on-prem via AWS). - -Option 1 is the: - -1. easiest/quickest, -1. most featureful - -Recommendations: - -+ Suggest tackling option 1 (GCP only) first (target beta in v1.4) -+ Thereafter option 3 (cross-cloud via GCP) -+ We should encourage/facilitate the community to tackle option 2 (AWS-only) - -## Options - -## Google Cloud Platform only - backed by GCE L7 Load Balancers - -This is an option for federations across clusters which all run on Google Cloud Platform (i.e. GCE and/or GKE) - -### Features - -In summary, all of [GCE L7 Load Balancer](https://cloud.google.com/compute/docs/load-balancing/http/) features: - -1. Single global virtual (a.k.a. "anycast") IP address ("VIP" - no dependence on dynamic DNS) -1. Geo-locality for both external and GCP-internal clients -1. Load-based overflow to next-closest geo-locality (i.e. cluster). Based on either queries per second, or CPU load (unfortunately on the first-hop target VM, not the final destination K8s Service). -1. URL-based request direction (different backend services can fulfill each different URL). -1. HTTPS request termination (at the GCE load balancer, with server SSL certs) - -### Implementation - -1. Federation user creates (federated) Ingress object (the services - backing the ingress object must share the same nodePort, as they - share a single GCP health check). -1. 
Federated Ingress Controller creates Ingress object in each cluster - in the federation (after [configuring each cluster ingress - controller to share the same ingress UID](https://gist.github.com/bprashanth/52648b2a0b6a5b637f843e7efb2abc97)). -1. Each cluster-level Ingress Controller ("GLBC") creates Google L7 - Load Balancer machinery (forwarding rules, target proxy, URL map, - backend service, health check) which ensures that traffic to the - Ingress (backed by a Service), is directed to the nodes in the cluster. -1. KubeProxy redirects to one of the backend Pods (currently round-robin, per KubeProxy instance) - -An alternative implementation approach involves lifting the current -Federated Ingress Controller functionality up into the Federation -control plane. This alternative is not considered any further -detail in this document. - -### Outstanding work Items - -1. This should in theory all work out of the box. Need to confirm -with a manual setup. ([#29341](https://github.com/kubernetes/kubernetes/issues/29341)) -1. Implement Federated Ingress: - 1. API machinery (~1 day) - 1. Controller (~3 weeks) -1. Add DNS field to Ingress object (currently missing, but needs to be added, independent of federation) - 1. API machinery (~1 day) - 1. KubeDNS support (~ 1 week?) - -### Pros - -1. Global VIP is awesome - geo-locality, load-based overflow (but see caveats below) -1. Leverages existing K8s Ingress machinery - not too much to add. -1. Leverages existing Federated Service machinery - controller looks - almost identical, DNS provider also re-used. - -### Cons - -1. Only works across GCP clusters (but see below for a light at the end of the tunnel, for future versions). - -## Amazon Web Services only - backed by Route53 - -This is an option for AWS-only federations. Parts of this are -apparently work in progress, see e.g. -[AWS Ingress controller](https://github.com/kubernetes/contrib/issues/346) -[[WIP/RFC] Simple ingress -> DNS controller, using AWS -Route53](https://github.com/kubernetes/contrib/pull/841). - -### Features - -In summary, most of the features of [AWS Elastic Load Balancing](https://aws.amazon.com/elasticloadbalancing/) and [Route53 DNS](https://aws.amazon.com/route53/). - -1. Geo-aware DNS direction to closest regional elastic load balancer -1. DNS health checks to route traffic to only healthy elastic load -balancers -1. A variety of possible DNS routing types, including Latency Based Routing, Geo DNS, and Weighted Round Robin -1. Elastic Load Balancing automatically routes traffic across multiple - instances and multiple Availability Zones within the same region. -1. Health checks ensure that only healthy Amazon EC2 instances receive traffic. - -### Implementation - -1. Federation user creates (federated) Ingress object -1. Federated Ingress Controller creates Ingress object in each cluster in the federation -1. Each cluster-level AWS Ingress Controller creates/updates - 1. (regional) AWS Elastic Load Balancer machinery which ensures that traffic to the Ingress (backed by a Service), is directed to one of the nodes in one of the clusters in the region. - 1. (global) AWS Route53 DNS machinery which ensures that clients are directed to the closest non-overloaded (regional) elastic load balancer. -1. KubeProxy redirects to one of the backend Pods (currently round-robin, per KubeProxy instance) in the destination K8s cluster. 
- -### Outstanding Work Items - -Most of this remains is currently unimplemented ([AWS Ingress controller](https://github.com/kubernetes/contrib/issues/346) -[[WIP/RFC] Simple ingress -> DNS controller, using AWS -Route53](https://github.com/kubernetes/contrib/pull/841). - -1. K8s AWS Ingress Controller -1. Re-uses all of the non-GCE specific Federation machinery discussed above under "GCP-only...". - -### Pros - -1. Geo-locality (via geo-DNS, not VIP) -1. Load-based overflow -1. Real load balancing (same caveats as for GCP above). -1. L7 SSL connection termination. -1. Seems it can be made to work for hybrid with on-premise (using VPC). More research required. - -### Cons - -1. K8s Ingress Controller still needs to be developed. Lots of work. -1. geo-DNS based locality/failover is not as nice as VIP-based (but very useful, nonetheless) -1. Only works on AWS (initial version, at least). - -## Cross-cloud via GCP - -### Summary - -Use GCP Federated Ingress machinery described above, augmented with additional HA-proxy backends in all GCP clusters to proxy to non-GCP clusters (via either Service External IP's, or VPN directly to KubeProxy or Pods). - -### Features - -As per GCP-only above, except that geo-locality would be to the closest GCP cluster (and possibly onwards to the closest AWS/on-prem cluster). - -### Implementation - -TBD - see Summary above in the mean time. - -### Outstanding Work - -Assuming that GCP-only (see above) is complete: - -1. Wire-up the HA-proxy load balancers to redirect to non-GCP clusters -1. Probably some more - additional detailed research and design necessary. - -### Pros - -1. Works for cross-cloud. - -### Cons - -1. Traffic to non-GCP clusters proxies through GCP clusters. Additional bandwidth costs (3x?) in those cases. - -## Cross-cloud via AWS - -In theory the same approach as "Cross-cloud via GCP" above could be used, except that AWS infrastructure would be used to get traffic first to an AWS cluster, and then proxied onwards to non-AWS and/or on-prem clusters. -Detail docs TBD. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/multicluster/federated-placement-policy.md b/contributors/design-proposals/multicluster/federated-placement-policy.md index c30374ea..f0fbec72 100644 --- a/contributors/design-proposals/multicluster/federated-placement-policy.md +++ b/contributors/design-proposals/multicluster/federated-placement-policy.md @@ -1,371 +1,6 @@ -# Policy-based Federated Resource Placement +Design proposals have been archived. -This document proposes a design for policy-based control over placement of -Federated resources. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Tickets: -- https://github.com/kubernetes/kubernetes/issues/39982 - -Authors: - -- Torin Sandall (torin@styra.com, tsandall@github) and Tim Hinrichs - (tim@styra.com). -- Based on discussions with Quinton Hoole (quinton.hoole@huawei.com, - quinton-hoole@github), Nikhil Jindal (nikhiljindal@github). - -## Background - -Resource placement is a policy-rich problem affecting many deployments. -Placement may be based on company conventions, external regulation, pricing and -performance requirements, etc. Furthermore, placement policies evolve over time -and vary across organizations. As a result, it is difficult to anticipate the -policy requirements of all users. - -A simple example of a placement policy is - -> Certain apps must be deployed on clusters in EU zones with sufficient PCI -> compliance. - -The [Kubernetes Cluster -Federation](/contributors/design-proposals/multicluster/federation.md#policy-engine-and-migrationreplication-controllers) -design proposal includes a pluggable policy engine component that decides how -applications/resources are placed across federated clusters. - -Currently, the placement decision can be controlled for Federated ReplicaSets -using the `federation.kubernetes.io/replica-set-preferences` annotation. In the -future, the [Cluster -Selector](https://github.com/kubernetes/kubernetes/issues/29887) annotation will -provide control over placement of other resources. The proposed design supports -policy-based control over both of these annotations (as well as others). - -This proposal is based on a POC built using the Open Policy Agent project. [This -short video (7m)](https://www.youtube.com/watch?v=hRz13baBhfg) provides an -overview and demo of the POC. - -## Design - -The proposed design uses the [Open Policy Agent](http://www.openpolicyagent.org) -project (OPA) to realize the policy engine component from the Federation design -proposal. OPA is an open-source, general purpose policy engine that includes a -declarative policy language and APIs to answer policy queries. - -The proposed design allows administrators to author placement policies and have -them automatically enforced when resources are created or updated. The design -also covers support for automatic remediation of resource placement when policy -(or the relevant state of the world) changes. - -In the proposed design, the policy engine (OPA) is deployed on top of Kubernetes -in the same cluster as the Federation Control Plane: - - - -The proposed design is divided into following sections: - -1. Control over the initial placement decision (admission controller) -1. Remediation of resource placement (opa-kube-sync/remediator) -1. Replication of Kubernetes resources (opa-kube-sync/replicator) -1. Management and storage of policies (ConfigMap) - -### 1. 
Initial Placement Decision - -To provide policy-based control over the initial placement decision, we propose -a new admission controller that integrates with OPA: - -When admitting requests, the admission controller executes an HTTP API call -against OPA. The API call passes the JSON representation of the resource in the -message body. - -The response from OPA contains the desired value for the resource’s annotations -(defined in policy by the administrator). The admission controller updates the -annotations on the resource and admits the request: - - - -The admission controller updates the resource by **merging** the annotations in -the response with existing annotations on the resource. If there are overlapping -annotation keys the admission controller replaces the existing value with the -value from the response. - -#### Example Policy Engine Query: - -```http -POST /v1/data/io/k8s/federation/admission HTTP/1.1 -Content-Type: application/json -``` - -```json -{ - "input": { - "apiVersion": "extensions/v1beta1", - "kind": "ReplicaSet", - "metadata": { - "annotations": { - "policy.federation.alpha.kubernetes.io/eu-jurisdiction-required": "true", - "policy.federation.alpha.kubernetes.io/pci-compliance-level": "2" - }, - "creationTimestamp": "2017-01-23T16:25:14Z", - "generation": 1, - "labels": { - "app": "nginx-eu" - }, - "name": "nginx-eu", - "namespace": "default", - "resourceVersion": "364993", - "selfLink": "/apis/extensions/v1beta1/namespaces/default/replicasets/nginx-eu", - "uid": "84fab96d-e188-11e6-ac83-0a580a54020e" - }, - "spec": { - "replicas": 4, - "selector": {...}, - "template": {...}, - } - } -} -``` - -#### Example Policy Engine Response: - -```http -HTTP/1.1 200 OK -Content-Type: application/json -``` - -```json -{ - "result": { - "annotations": { - "federation.kubernetes.io/replica-set-preferences": { - "clusters": { - "gce-europe-west1": { - "weight": 1 - }, - "gce-europe-west2": { - "weight": 1 - } - }, - "rebalance": true - } - } - } -} -``` - -> This example shows the policy engine returning the replica-set-preferences. -> The policy engine could similarly return a desired value for other annotations -> such as the Cluster Selector annotation. - -#### Conflicts - -A conflict arises if the developer and the policy define different values for an -annotation. In this case, the developer's intent is provided as a policy query -input and the policy author's intent is encoded in the policy itself. Since the -policy is the only place where both the developer and policy author intents are -known, the policy (or policy engine) should be responsible for resolving the -conflict. - -There are a few options for handling conflicts. As a concrete example, this is -how a policy author could handle invalid clusters/conflicts: - -``` -package io.k8s.federation.admission - -errors["requested replica-set-preferences includes invalid clusters"] { - invalid_clusters = developer_clusters - policy_defined_clusters - invalid_clusters != set() -} - -annotations["replica-set-preferences"] = value { - value = developer_clusters & policy_defined_clusters -} - -# Not shown here: -# -# policy_defined_clusters[...] { ... } -# developer_clusters[...] { ... } -``` - -The admission controller will execute a query against -/io/k8s/federation/admission and if the policy detects an invalid cluster, the -"errors" key in the response will contain a non-empty array. In this case, the -admission controller will deny the request. 
- -```http -HTTP/1.1 200 OK -Content-Type: application/json -``` - -```json -{ - "result": { - "errors": [ - "requested replica-set-preferences includes invalid clusters" - ], - "annotations": { - "federation.kubernetes.io/replica-set-preferences": { - ... - } - } - } -} -``` - -This example shows how the policy could handle conflicts when the author's -intent is to define clusters that MAY be used. If the author's intent is to -define what clusters MUST be used, then the logic would not use intersection. - -#### Configuration - -The admission controller requires configuration for the OPA endpoint: - -``` -{ - "EnforceSchedulingPolicy": { - "url": “https://opa.federation.svc.cluster.local:8181/v1/data/io/k8s/federation/annotations”, - "token": "super-secret-token-value" - } -} -``` - -- `url` specifies the URL of the policy engine API to query. The query response - contains the annotations to apply to the resource. -- `token` specifies a static token to use for authentication when contacting the - policy engine. In the future, other authentication schemes may be supported. - -The configuration file is provided to the federation-apiserver with the -`--admission-control-config-file` command line argument. - -The admission controller is enabled in the federation-apiserver by providing the -`--admission-control` command line argument. E.g., -`--admission-control=AlwaysAdmit,EnforceSchedulingPolicy`. - -The admission controller will be enabled by default. - -#### Error Handling - -The admission controller is designed to **fail closed** if policies have been -created. - -Request handling may fail because of: - -- Serialization errors -- Request timeouts or other network errors -- Authentication or authorization errors from the policy engine -- Other unexpected errors from the policy engine - -In the event of request timeouts (or other network errors) or back-pressure -hints from the policy engine, the admission controller should retry after -applying a backoff. The admission controller should also create an event so that -developers can identify why their resources are not being scheduled. - -Policies are stored as ConfigMap resources in a well-known namespace. This -allows the admission controller to check if one or more policies exist. If one -or more policies exist, the admission controller will fail closed. Otherwise -the admission controller will **fail open**. - -### 2. Remediation of Resource Placement - -When policy changes or the environment in which resources are deployed changes -(e.g. a cluster’s PCI compliance rating gets up/down-graded), resources might -need to be moved for them to obey the placement policy. Sometimes administrators -may decide to remediate manually, other times they may want Kubernetes to -remediate automatically. - -To automatically reschedule resources onto desired clusters, we introduce a -remediator component (**opa-kube-sync**) that is deployed as a sidecar with OPA. - - - -The notifications sent to the remediator by OPA specify the new value for -annotations such as replica-set-preferences. - -When the remediator component (in the sidecar) receives the notification it -sends a PATCH request to the federation-apiserver to update the affected -resource. This way, the actual rebalancing of ReplicaSets is still handled by -the [Rescheduling -Algorithm](/contributors/design-proposals/multicluster/federated-replicasets.md) -in the Federated ReplicaSet controller. 
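For illustration only, here is a minimal sketch of the kind of merge patch the remediator could send to the federation-apiserver to overwrite the replica-set-preferences annotation on an affected ReplicaSet. The apiserver URL, namespace/name arguments, and bearer-token handling are assumptions, not part of the proposal:

```go
// Illustrative sketch: send a JSON merge patch that replaces the
// replica-set-preferences annotation on a federated ReplicaSet.
// Endpoint layout and auth handling are assumptions.
package remediator

import (
	"bytes"
	"fmt"
	"net/http"
)

func patchReplicaSetPreferences(apiServer, namespace, name, token, newPrefsJSON string) error {
	// Annotation values are strings, so the preferences JSON is embedded as a quoted string.
	patch := fmt.Sprintf(
		`{"metadata":{"annotations":{"federation.kubernetes.io/replica-set-preferences":%q}}}`,
		newPrefsJSON)

	url := fmt.Sprintf("%s/apis/extensions/v1beta1/namespaces/%s/replicasets/%s",
		apiServer, namespace, name)
	req, err := http.NewRequest(http.MethodPatch, url, bytes.NewBufferString(patch))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/merge-patch+json")
	req.Header.Set("Authorization", "Bearer "+token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("patch failed: %s", resp.Status)
	}
	return nil
}
```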
- -The remediator component must be deployed with a kubeconfig for the -federation-apiserver so that it can identify itself when sending the PATCH -requests. We can use the same mechanism that is used for the -federation-controller-manager (which also needs ot identify itself when sending -requests to the federation-apiserver.) - -### 3. Replication of Kubernetes Resources - -Administrators must be able to author policies that refer to properties of -Kubernetes resources. For example, assuming the following sample policy (in -English): - -> Certain apps must be deployed on Clusters in EU zones with sufficient PCI -> compliance. - -The policy definition must refer to the geographic region and PCI compliance -rating of federated clusters. Today, the geographic region is stored as an -attribute on the cluster resource and the PCI compliance rating is an example of -data that may be included in a label or annotation. - -When the policy engine is queried for a placement decision (e.g., by the -admission controller), it must have access to the data representing the -federated clusters. - -To provide OPA with the data representing federated clusters as well as other -Kubernetes resource types (such as federated ReplicaSets), we use a sidecar -container that is deployed alongside OPA. The sidecar (“opa-kube-sync”) is -responsible for replicating Kubernetes resources into OPA: - - - -The sidecar/replicator component will implement the (somewhat common) list/watch -pattern against the federation-apiserver: - -- Initially, it will GET all resources of a particular type. -- Subsequently, it will GET with the **watch** and **resourceVersion** - parameters set and process add, remove, update events accordingly. - -Each resource received by the sidecar/replicator component will be pushed into -OPA. The sidecar will likely rely on one of the existing Kubernetes Go client -libraries to handle the low-level list/watch behavior. - -As new resource types are introduced in the federation-apiserver, the -sidecar/replicator component will need to be updated to support them. As a -result, the sidecar/replicator component must be designed so that it is easy to -add support for new resource types. - -Eventually, the sidecar/replicator component may allow admins to configure which -resource types are replicated. As an optimization, the sidecar may eventually -analyze policies to determine which resource properties are requires for policy -evaluation. This would allow it to replicate the minimum amount of data into -OPA. - -### 4. Policy Management - -Policies are written in a text-based, declarative language supported by OPA. The -policies can be loaded into the policy engine either on startup or via HTTP -APIs. - -To avoid introducing additional persistent state, we propose storing policies -in ConfigMap resources in the Federation Control Plane inside a well-known -namespace (e.g., `kube-federationscheduling-policy`). The ConfigMap resources -will be replicated into the policy engine by the sidecar. - -The sidecar can establish a watch on the ConfigMap resources in the Federation -Control Plane. This will enable hot-reloading of policies whenever they change. - -## Applicability to Other Policy Engines - -This proposal was designed based on a POC with OPA, but it can be applied to -other policy engines as well. The admission and remediation components are -comprised of two main pieces of functionality: (i) applying annotation values to -federated resources and (ii) asking the policy engine for annotation values. 
The -code for applying annotation values is completely independent of the policy -engine. The code that asks the policy engine for annotation values happens both -within the admission and remediation components. In the POC, asking OPA for -annotation values amounts to a simple RESTful API call that any other policy -engine could implement. - -## Future Work - -- This proposal uses ConfigMaps to store and manage policies. In the future, we - want to introduce a first-class **Policy** API resource. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
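As a rough illustration of the replication pattern described in section 3 above (the opa-kube-sync sidecar mirroring federated Cluster objects into the policy engine), the sketch below pushes a trimmed-down cluster document into OPA's data API. The OPA URL, document path, `Cluster` shape, and the `listClusters` helper are all assumptions standing in for the real list/watch plumbing:

```go
// Illustrative sketch of the opa-kube-sync replicator idea: cluster objects
// obtained from a list/watch loop are mirrored into OPA so policies can
// reference them as data.clusters[<name>].
package opakubesync

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Cluster is a trimmed-down stand-in for the federation Cluster object;
// only the fields a placement policy is likely to reference are kept.
type Cluster struct {
	Name        string            `json:"name"`
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
}

// pushCluster upserts one cluster document into OPA under /v1/data/clusters/<name>.
func pushCluster(opaURL string, c Cluster) error {
	body, err := json.Marshal(c)
	if err != nil {
		return err
	}
	url := fmt.Sprintf("%s/v1/data/clusters/%s", opaURL, c.Name)
	req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("OPA rejected update for %s: %s", c.Name, resp.Status)
	}
	return nil
}

// syncOnce lists the federated clusters (listClusters would wrap a Kubernetes
// client list/watch in a real implementation) and mirrors each into OPA.
func syncOnce(opaURL string, listClusters func() ([]Cluster, error)) error {
	clusters, err := listClusters()
	if err != nil {
		return err
	}
	for _, c := range clusters {
		if err := pushCluster(opaURL, c); err != nil {
			return err
		}
	}
	return nil
}
```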
\ No newline at end of file diff --git a/contributors/design-proposals/multicluster/federated-replicasets.md b/contributors/design-proposals/multicluster/federated-replicasets.md index f6c5b1cb..f0fbec72 100644 --- a/contributors/design-proposals/multicluster/federated-replicasets.md +++ b/contributors/design-proposals/multicluster/federated-replicasets.md @@ -1,508 +1,6 @@ -# Federated ReplicaSets +Design proposals have been archived. -# Requirements & Design Document +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -This document is a markdown version converted from a working [Google Doc](https://docs.google.com/a/google.com/document/d/1C1HEHQ1fwWtEhyl9JYu6wOiIUJffSmFmZgkGta4720I/edit?usp=sharing). Please refer to the original for extended commentary and discussion. -Author: Marcin Wielgus [mwielgus@google.com](mailto:mwielgus@google.com) -Based on discussions with -Quinton Hoole [quinton@google.com](mailto:quinton@google.com), Wojtek Tyczyński [wojtekt@google.com](mailto:wojtekt@google.com) - -## Overview - -### Summary & Vision - -When running a global application on a federation of Kubernetes -clusters the owner currently has to start it in multiple clusters and -control whether he has both enough application replicas running -locally in each of the clusters (so that, for example, users are -handled by a nearby cluster, with low latency) and globally (so that -there is always enough capacity to handle all traffic). If one of the -clusters has issues or hasn't enough capacity to run the given set of -replicas the replicas should be automatically moved to some other -cluster to keep the application responsive. - -In single cluster Kubernetes there is a concept of ReplicaSet that -manages the replicas locally. We want to expand this concept to the -federation level. - -### Goals - -+ Win large enterprise customers who want to easily run applications - across multiple clusters -+ Create a reference controller implementation to facilitate bringing - other Kubernetes concepts to Federated Kubernetes. - -## Glossary - -Federation Cluster - a cluster that is a member of federation. - -Local ReplicaSet (LRS) - ReplicaSet defined and running on a cluster -that is a member of federation. - -Federated ReplicaSet (FRS) - ReplicaSet defined and running inside of Federated K8S server. - -Federated ReplicaSet Controller (FRSC) - A controller running inside -of Federated K8S server that controls FRS. - -## User Experience - -### Critical User Journeys - -+ [CUJ1] User wants to create a ReplicaSet in each of the federation - cluster. They create a definition of federated ReplicaSet on the - federated master and (local) ReplicaSets are automatically created - in each of the federation clusters. The number of replicas is each - of the Local ReplicaSets is (perhaps indirectly) configurable by - the user. -+ [CUJ2] When the current number of replicas in a cluster drops below - the desired number and new replicas cannot be scheduled then they - should be started in some other cluster. - -### Features Enabling Critical User Journeys - -Feature #1 -> CUJ1: -A component which looks for newly created Federated ReplicaSets and -creates the appropriate Local ReplicaSet definitions in the federated -clusters. 
- -Feature #2 -> CUJ2: -A component that checks how many replicas are actually running in each -of the subclusters and if the number matches to the -FederatedReplicaSet preferences (by default spread replicas evenly -across the clusters but custom preferences are allowed - see -below). If it doesn't and the situation is unlikely to improve soon -then the replicas should be moved to other subclusters. - -### API and CLI - -All interaction with FederatedReplicaSet will be done by issuing -kubectl commands pointing on the Federated Master API Server. All the -commands would behave in a similar way as on the regular master, -however in the next versions (1.5+) some of the commands may give -slightly different output. For example kubectl describe on federated -replica set should also give some information about the subclusters. - -Moreover, for safety, some defaults will be different. For example for -kubectl delete federatedreplicaset cascade will be set to false. - -FederatedReplicaSet would have the same object as local ReplicaSet -(although it will be accessible in a different part of the -api). Scheduling preferences (how many replicas in which cluster) will -be passed as annotations. - -### FederateReplicaSet preferences - -The preferences are expressed by the following structure, passed as a -serialized json inside annotations. - -```go -type FederatedReplicaSetPreferences struct { - // If set to true then already scheduled and running replicas may be moved to other clusters to - // in order to bring cluster replicasets towards a desired state. Otherwise, if set to false, - // up and running replicas will not be moved. - Rebalance bool `json:"rebalance,omitempty"` - - // Map from cluster name to preferences for that cluster. It is assumed that if a cluster - // doesn't have a matching entry then it should not have local replica. The cluster matches - // to "*" if there is no entry with the real cluster name. - Clusters map[string]LocalReplicaSetPreferences -} - -// Preferences regarding number of replicas assigned to a cluster replicaset within a federated replicaset. -type ClusterReplicaSetPreferences struct { - // Minimum number of replicas that should be assigned to this Local ReplicaSet. 0 by default. - MinReplicas int64 `json:"minReplicas,omitempty"` - - // Maximum number of replicas that should be assigned to this Local ReplicaSet. Unbounded if no value provided (default). - MaxReplicas *int64 `json:"maxReplicas,omitempty"` - - // A number expressing the preference to put an additional replica to this LocalReplicaSet. 0 by default. - Weight int64 -} -``` - -How this works in practice: - -**Scenario 1**. I want to spread my 50 replicas evenly across all available clusters. Config: - -```go -FederatedReplicaSetPreferences { - Rebalance : true - Clusters : map[string]LocalReplicaSet { - "*" : LocalReplicaSet{ Weight: 1} - } -} -``` - -Example: - -+ Clusters A,B,C, all have capacity. - Replica layout: A=16 B=17 C=17. -+ Clusters A,B,C and C has capacity for 6 replicas. - Replica layout: A=22 B=22 C=6 -+ Clusters A,B,C. B and C are offline: - Replica layout: A=50 - -**Scenario 2**. I want to have only 2 replicas in each of the clusters. 
- -```go -FederatedReplicaSetPreferences { - Rebalance : true - Clusters : map[string]LocalReplicaSet { - "*" : LocalReplicaSet{ MaxReplicas: 2; Weight: 1} - } -} -``` - -Or - -```go -FederatedReplicaSetPreferences { - Rebalance : true - Clusters : map[string]LocalReplicaSet { - "*" : LocalReplicaSet{ MinReplicas: 2; Weight: 0 } - } - } - -``` - -Or - -```go -FederatedReplicaSetPreferences { - Rebalance : true - Clusters : map[string]LocalReplicaSet { - "*" : LocalReplicaSet{ MinReplicas: 2; MaxReplicas: 2} - } -} -``` - -There is a global target for 50, however if there are 3 clusters there will be only 6 replicas running. - -**Scenario 3**. I want to have 20 replicas in each of 3 clusters. - -```go -FederatedReplicaSetPreferences { - Rebalance : true - Clusters : map[string]LocalReplicaSet { - "*" : LocalReplicaSet{ MinReplicas: 20; Weight: 0} - } -} -``` - -There is a global target for 50, however clusters require 60. So some clusters will have less replicas. - Replica layout: A=20 B=20 C=10. - -**Scenario 4**. I want to have equal number of replicas in clusters A,B,C, however don't put more than 20 replicas to cluster C. - -```go -FederatedReplicaSetPreferences { - Rebalance : true - Clusters : map[string]LocalReplicaSet { - "*" : LocalReplicaSet{ Weight: 1} - “C” : LocalReplicaSet{ MaxReplicas: 20, Weight: 1} - } -} -``` - -Example: - -+ All have capacity. - Replica layout: A=16 B=17 C=17. -+ B is offline/has no capacity - Replica layout: A=30 B=0 C=20 -+ A and B are offline: - Replica layout: C=20 - -**Scenario 5**. I want to run my application in cluster A, however if there are troubles FRS can also use clusters B and C, equally. - -```go -FederatedReplicaSetPreferences { - Clusters : map[string]LocalReplicaSet { - “A” : LocalReplicaSet{ Weight: 1000000} - “B” : LocalReplicaSet{ Weight: 1} - “C” : LocalReplicaSet{ Weight: 1} - } -} -``` - -Example: - -+ All have capacity. - Replica layout: A=50 B=0 C=0. -+ A has capacity for only 40 replicas - Replica layout: A=40 B=5 C=5 - -**Scenario 6**. I want to run my application in clusters A, B and C. Cluster A gets twice the QPS than other clusters. - -```go -FederatedReplicaSetPreferences { - Clusters : map[string]LocalReplicaSet { - “A” : LocalReplicaSet{ Weight: 2} - “B” : LocalReplicaSet{ Weight: 1} - “C” : LocalReplicaSet{ Weight: 1} - } -} -``` - -**Scenario 7**. I want to spread my 50 replicas evenly across all available clusters, but if there -are already some replicas, please do not move them. Config: - -```go -FederatedReplicaSetPreferences { - Rebalance : false - Clusters : map[string]LocalReplicaSet { - "*" : LocalReplicaSet{ Weight: 1} - } -} -``` - -Example: - -+ Clusters A,B,C, all have capacity, but A already has 20 replicas - Replica layout: A=20 B=15 C=15. -+ Clusters A,B,C and C has capacity for 6 replicas, A has already 20 replicas. - Replica layout: A=22 B=22 C=6 -+ Clusters A,B,C and C has capacity for 6 replicas, A has already 30 replicas. - Replica layout: A=30 B=14 C=6 - -## The Idea - -A new federated controller - Federated Replica Set Controller (FRSC) -will be created inside federated controller manager. Below are -enumerated the key idea elements: - -+ [I0] It is considered OK to have slightly higher number of replicas - globally for some time. - -+ [I1] FRSC starts an informer on the FederatedReplicaSet that listens - on FRS being created, updated or deleted. On each create/update the - scheduling code will be started to calculate where to put the - replicas. 
The default behavior is to start the same amount of - replicas in each of the cluster. While creating LocalReplicaSets - (LRS) the following errors/issues can occur: - - + [E1] Master rejects LRS creation (for known or unknown - reason). In this case another attempt to create a LRS should be - attempted in 1m or so. This action can be tied with - [[I5]](#heading=h.ififs95k9rng). Until the LRS is created - the situation is the same as [E5]. If this happens multiple - times all due replicas should be moved elsewhere and later moved - back once the LRS is created. - - + [E2] LRS with the same name but different configuration already - exists. The LRS is then overwritten and an appropriate event - created to explain what happened. Pods under the control of the - old LRS are left intact and the new LRS may adopt them if they - match the selector. - - + [E3] LRS is new but the pods that match the selector exist. The - pods are adopted by the RS (if not owned by some other - RS). However they may have a different image, configuration - etc. Just like with regular LRS. - -+ [I2] For each of the cluster FRSC starts a store and an informer on - LRS that will listen for status updates. These status changes are - only interesting in case of troubles. Otherwise it is assumed that - LRS runs trouble free and there is always the right number of pod - created but possibly not scheduled. - - - + [E4] LRS is manually deleted from the local cluster. In this case - a new LRS should be created. It is the same case as - [[E1]](#heading=h.wn3dfsyc4yuh). Any pods that were left behind - won't be killed and will be adopted after the LRS is recreated. - - + [E5] LRS fails to create (not necessary schedule) the desired - number of pods due to master troubles, admission control - etc. This should be considered as the same situation as replicas - unable to schedule (see [[I4]](#heading=h.dqalbelvn1pv)). - - + [E6] It is impossible to tell that an informer lost connection - with a remote cluster or has other synchronization problem so it - should be handled by cluster liveness probe and deletion - [[I6]](#heading=h.z90979gc2216). - -+ [I3] For each of the cluster start an store and informer to monitor - whether the created pods are eventually scheduled and what is the - current number of correctly running ready pods. Errors: - - + [E7] It is impossible to tell that an informer lost connection - with a remote cluster or has other synchronization problem so it - should be handled by cluster liveness probe and deletion - [[I6]](#heading=h.z90979gc2216) - -+ [I4] It is assumed that a not scheduled pod is a normal situation -and can last up to X min if there is a huge traffic on the -cluster. However if the replicas are not scheduled in that time then -FRSC should consider moving most of the unscheduled replicas -elsewhere. For that purpose FRSC will maintain a data structure -where for each FRS controlled LRS we store a list of pods belonging -to that LRS along with their current status and status change timestamp. - -+ [I5] If a new cluster is added to the federation then it doesn't - have a LRS and the situation is equal to - [[E1]](#heading=h.wn3dfsyc4yuh)/[[E4]](#heading=h.vlyovyh7eef). - -+ [I6] If a cluster is removed from the federation then the situation - is equal to multiple [E4]. 
It is assumed that if a connection with - a cluster is lost completely then the cluster is removed from the - cluster list (or marked accordingly) so - [[E6]](#heading=h.in6ove1c1s8f) and [[E7]](#heading=h.37bnbvwjxeda) - don't need to be handled. - -+ [I7] All ToBeChecked FRS are browsed every 1 min (configurable), - checked against the current list of clusters, and all missing LRS - are created. This will be executed in combination with [I8]. - -+ [I8] All pods from ToBeChecked FRS/LRS are browsed every 1 min - (configurable) to check whether some replica move between clusters - is needed or not. - -+ FRSC never moves replicas to LRS that have not scheduled/running -pods or that has pods that failed to be created. - - + When FRSC notices that a number of pods are not scheduler/running - or not_even_created in one LRS for more than Y minutes it takes - most of them from LRS, leaving couple still waiting so that once - they are scheduled FRSC will know that it is ok to put some more - replicas to that cluster. - -+ [I9] FRS becomes ToBeChecked if: - + It is newly created - + Some replica set inside changed its status - + Some pods inside cluster changed their status - + Some cluster is added or deleted. -> FRS stops ToBeChecked if is in desired configuration (or is stable enough). - -## (RE)Scheduling algorithm - -To calculate the (re)scheduling moves for a given FRS: - -1. For each cluster FRSC calculates the number of replicas that are placed -(not necessary up and running) in the cluster and the number of replicas that -failed to be scheduled. Cluster capacity is the difference between the -placed and failed to be scheduled. - -2. Order all clusters by their weight and hash of the name so that every time -we process the same replica-set we process the clusters in the same order. -Include federated replica set name in the cluster name hash so that we get -slightly different ordering for different RS. So that not all RS of size 1 -end up on the same cluster. - -3. Assign minimum preferred number of replicas to each of the clusters, if -there is enough replicas and capacity. - -4. If rebalance = false, assign the previously present replicas to the clusters, -remember the number of extra replicas added (ER). Of course if there -is enough replicas and capacity. - -5. Distribute the remaining replicas with regard to weights and cluster capacity. -In multiple iterations calculate how many of the replicas should end up in the cluster. -For each of the cluster cap the number of assigned replicas by max number of replicas and -cluster capacity. If there were some extra replicas added to the cluster in step -4, don't really add the replicas but balance them gains ER from 4. - -## Goroutines layout - -+ [GR1] Involved in FRS informer (see - [[I1]]). Whenever a FRS is created and - updated it puts the new/updated FRS on FRS_TO_CHECK_QUEUE with - delay 0. - -+ [GR2_1...GR2_N] Involved in informers/store on LRS (see - [[I2]]). On all changes the FRS is put on - FRS_TO_CHECK_QUEUE with delay 1min. - -+ [GR3_1...GR3_N] Involved in informers/store on Pods - (see [[I3]] and [[I4]]). They maintain the status store - so that for each of the LRS we know the number of pods that are - actually running and ready in O(1) time. They also put the - corresponding FRS on FRS_TO_CHECK_QUEUE with delay 1min. - -+ [GR4] Involved in cluster informer (see - [[I5]] and [[I6]] ). It puts all FRS on FRS_TO_CHECK_QUEUE - with delay 0. 
- -+ [GR5_*] Goroutines handling FRS_TO_CHECK_QUEUE that put FRS on - FRS_CHANNEL after the given delay (and remove from - FRS_TO_CHECK_QUEUE). Every time an already present FRS is added to - FRS_TO_CHECK_QUEUE the delays are compared and updated so that the - shorter delay is used. - -+ [GR6] Contains a selector that listens on a FRS_CHANNEL. Whenever - an FRS is received it is put on a work queue. The work queue has no delay - and makes sure that a single replica set is processed by - only one goroutine. - -+ [GR7_*] Goroutines related to the workqueue. They fire DoFrsCheck on the FRS. - Multiple replica sets can be processed in parallel. Two Goroutines cannot - process the same FRS at the same time. - - -## Func DoFrsCheck - -The function does [[I7]] and [[I8]]. It is assumed that it runs on a -single thread/goroutine, so the same FRS is never checked and evaluated on multiple -goroutines at once (however, if needed, the function can be parallelized for -different FRS). It takes data only from the stores maintained by GR2_* and -GR3_*. External communication is only required to: - -+ Create LRS. If a LRS doesn't exist it is created after the - rescheduling, when we know how many replicas it should have. - -+ Update LRS replica targets. - -If FRS is not in the desired state then it is put to -FRS_TO_CHECK_QUEUE with delay 1min (possibly increasing). - -## Monitoring and status reporting - -FRSC should expose a number of metrics from the run, such as: - -+ FRSC -> LRS communication latency -+ Total time spent in various elements of DoFrsCheck - -FRSC should also expose the status of FRS as an annotation on FRS and -as events. - -## Workflow - -Here is the sequence of tasks that need to be done in order for a -typical FRS to be split into a number of LRS's and to be created in -the underlying federated clusters. - -Note a: the reason the workflow would be helpful at this phase is that -for every one or two steps we can create PRs accordingly to start with -the development. - -Note b: we assume that the federation is already in place and the -federated clusters are added to the federation. - -Step 1. The client sends an RS create request to the -federation-apiserver. - -Step 2. The federation-apiserver persists an FRS into the federation etcd. - -Note c: the federation-apiserver populates the clusterid field in the FRS -before persisting it into the federation etcd. - -Step 3. The federation-level “informer” in FRSC watches the federation -etcd for new/modified FRS's, with empty clusterid or clusterid equal -to the federation ID, and if detected, it calls the scheduling code. - -Step 4. The scheduling code decides how many replicas go to each target cluster. - -Note d: the scheduler populates the clusterid field in the LRS with the -IDs of target clusters. - -Note e: at this point let us assume that it only does the even -distribution, i.e., equal weights for all of the underlying clusters. - -Step 5. As soon as the scheduler function returns control to FRSC, -the FRSC starts a number of cluster-level “informer”s, one per -target cluster, to watch changes in every target cluster etcd -regarding the posted LRS's, and if any violation from the scheduled -number of replicas is detected the scheduling code is re-called for -re-scheduling purposes. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
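To make the "(RE)Scheduling algorithm" section above more concrete, here is a much-simplified sketch of the weighted distribution step. It ignores the rebalance flag, the hash-based cluster ordering, and capacity feedback; the `clusterPref` struct and helper names are assumptions, not the proposal's real types:

```go
// Simplified sketch of weight-based replica distribution across clusters:
// satisfy minimums first, then hand out the remainder by weight, bounded by
// each cluster's maxReplicas and remaining capacity.
package scheduler

import "sort"

type clusterPref struct {
	name        string
	minReplicas int64
	maxReplicas int64 // 0 means unbounded
	weight      int64
	capacity    int64 // replicas the cluster can still accept
}

// distribute returns the number of replicas assigned per cluster.
func distribute(total int64, prefs []clusterPref) map[string]int64 {
	out := map[string]int64{}
	// Deterministic ordering; the proposal additionally mixes in a hash of the
	// replica set name so small replica sets don't all land on the same cluster.
	sort.Slice(prefs, func(i, j int) bool { return prefs[i].name < prefs[j].name })

	// Step 1: satisfy minimums, bounded by capacity and remaining replicas.
	for i := range prefs {
		n := min64(prefs[i].minReplicas, prefs[i].capacity, total)
		out[prefs[i].name] = n
		total -= n
	}

	// Step 2: crude weighted round-robin for the remainder. Each pass gives a
	// cluster up to `weight` more replicas, bounded by capacity and maxReplicas.
	for total > 0 {
		progress := false
		for i := range prefs {
			p := prefs[i]
			if p.weight <= 0 {
				continue
			}
			room := p.capacity - out[p.name]
			if p.maxReplicas > 0 && p.maxReplicas-out[p.name] < room {
				room = p.maxReplicas - out[p.name]
			}
			give := min64(p.weight, room, total)
			if give <= 0 {
				continue
			}
			out[p.name] += give
			total -= give
			progress = true
			if total == 0 {
				break
			}
		}
		if !progress {
			break // no cluster can accept more replicas
		}
	}
	return out
}

func min64(vals ...int64) int64 {
	m := vals[0]
	for _, v := range vals[1:] {
		if v < m {
			m = v
		}
	}
	return m
}
```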
\ No newline at end of file diff --git a/contributors/design-proposals/multicluster/federated-services.md b/contributors/design-proposals/multicluster/federated-services.md index a43726d4..f0fbec72 100644 --- a/contributors/design-proposals/multicluster/federated-services.md +++ b/contributors/design-proposals/multicluster/federated-services.md @@ -1,515 +1,6 @@ -# Kubernetes Cluster Federation (previously nicknamed "Ubernetes") +Design proposals have been archived. -## Cross-cluster Load Balancing and Service Discovery +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -### Requirements and System Design - -### by Quinton Hoole, Dec 3 2015 - -## Requirements - -### Discovery, Load-balancing and Failover - -1. **Internal discovery and connection**: Pods/containers (running in - a Kubernetes cluster) must be able to easily discover and connect - to endpoints for Kubernetes services on which they depend in a - consistent way, irrespective of whether those services exist in a - different kubernetes cluster within the same cluster federation. - Hence-forth referred to as "cluster-internal clients", or simply - "internal clients". -1. **External discovery and connection**: External clients (running - outside a Kubernetes cluster) must be able to discover and connect - to endpoints for Kubernetes services on which they depend. - 1. **External clients predominantly speak HTTP(S)**: External - clients are most often, but not always, web browsers, or at - least speak HTTP(S) - notable exceptions include Enterprise - Message Busses (Java, TLS), DNS servers (UDP), - SIP servers and databases) -1. **Find the "best" endpoint:** Upon initial discovery and - connection, both internal and external clients should ideally find - "the best" endpoint if multiple eligible endpoints exist. "Best" - in this context implies the closest (by network topology) endpoint - that is both operational (as defined by some positive health check) - and not overloaded (by some published load metric). For example: - 1. An internal client should find an endpoint which is local to its - own cluster if one exists, in preference to one in a remote - cluster (if both are operational and non-overloaded). - Similarly, one in a nearby cluster (e.g. in the same zone or - region) is preferable to one further afield. - 1. An external client (e.g. in New York City) should find an - endpoint in a nearby cluster (e.g. U.S. East Coast) in - preference to one further away (e.g. Japan). -1. **Easy fail-over:** If the endpoint to which a client is connected - becomes unavailable (no network response/disconnected) or - overloaded, the client should reconnect to a better endpoint, - somehow. - 1. In the case where there exist one or more connection-terminating - load balancers between the client and the serving Pod, failover - might be completely automatic (i.e. the client's end of the - connection remains intact, and the client is completely - oblivious of the fail-over). This approach incurs network speed - and cost penalties (by traversing possibly multiple load - balancers), but requires zero smarts in clients, DNS libraries, - recursing DNS servers etc, as the IP address of the endpoint - remains constant over time. - 1. In a scenario where clients need to choose between multiple load - balancer endpoints (e.g. 
one per cluster), multiple DNS A - records associated with a single DNS name enable even relatively - dumb clients to try the next IP address in the list of returned - A records (without even necessarily re-issuing a DNS resolution - request). For example, all major web browsers will try all A - records in sequence until a working one is found (TBD: justify - this claim with details for Chrome, IE, Safari, Firefox). - 1. In a slightly more sophisticated scenario, upon disconnection, a - smarter client might re-issue a DNS resolution query, and - (modulo DNS record TTL's which can typically be set as low as 3 - minutes, and buggy DNS resolvers, caches and libraries which - have been known to completely ignore TTL's), receive updated A - records specifying a new set of IP addresses to which to - connect. - -### Portability - -A Kubernetes application configuration (e.g. for a Pod, Replication -Controller, Service etc) should be able to be successfully deployed -into any Kubernetes Cluster or Federation of Clusters, -without modification. More specifically, a typical configuration -should work correctly (although possibly not optimally) across any of -the following environments: - -1. A single Kubernetes Cluster on one cloud provider (e.g. Google - Compute Engine, GCE). -1. A single Kubernetes Cluster on a different cloud provider - (e.g. Amazon Web Services, AWS). -1. A single Kubernetes Cluster on a non-cloud, on-premise data center -1. A Federation of Kubernetes Clusters all on the same cloud provider - (e.g. GCE). -1. A Federation of Kubernetes Clusters across multiple different cloud - providers and/or on-premise data centers (e.g. one cluster on - GCE/GKE, one on AWS, and one on-premise). - -### Trading Portability for Optimization - -It should be possible to explicitly opt out of portability across some -subset of the above environments in order to take advantage of -non-portable load balancing and DNS features of one or more -environments. More specifically, for example: - -1. For HTTP(S) applications running on GCE-only Federations, - [GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules) - should be usable. These provide single, static global IP addresses - which load balance and fail over globally (i.e. across both regions - and zones). These allow for really dumb clients, but they only - work on GCE, and only for HTTP(S) traffic. -1. For non-HTTP(S) applications running on GCE-only Federations within - a single region, - [GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/) - should be usable. These provide TCP (i.e. both HTTP/S and - non-HTTP/S) load balancing and failover, but only on GCE, and only - within a single region. - [Google Cloud DNS](https://cloud.google.com/dns) can be used to - route traffic between regions (and between different cloud - providers and on-premise clusters, as it's plain DNS, IP only). -1. For applications running on AWS-only Federations, - [AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/) - should be usable. These provide both L7 (HTTP(S)) and L4 load - balancing, but only within a single region, and only on AWS - ([AWS Route 53 DNS service](https://aws.amazon.com/route53/) can be - used to load balance and fail over across multiple regions, and is - also capable of resolving to non-AWS endpoints). - -## Component Cloud Services - -Cross-cluster Federated load balancing is built on top of the following: - -1. 
[GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules) - provide single, static global IP addresses which load balance and - fail over globally (i.e. across both regions and zones). These - allow for really dumb clients, but they only work on GCE, and only - for HTTP(S) traffic. -1. [GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/) - provide both HTTP(S) and non-HTTP(S) load balancing and failover, - but only on GCE, and only within a single region. -1. [AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/) - provide both L7 (HTTP(S)) and L4 load balancing, but only within a - single region, and only on AWS. -1. [Google Cloud DNS](https://cloud.google.com/dns) (or any other - programmable DNS service, like - [CloudFlare](http://www.cloudflare.com) can be used to route - traffic between regions (and between different cloud providers and - on-premise clusters, as it's plain DNS, IP only). Google Cloud DNS - doesn't provide any built-in geo-DNS, latency-based routing, health - checking, weighted round robin or other advanced capabilities. - It's plain old DNS. We would need to build all the aforementioned - on top of it. It can provide internal DNS services (i.e. serve RFC - 1918 addresses). - 1. [AWS Route 53 DNS service](https://aws.amazon.com/route53/) can - be used to load balance and fail over across regions, and is also - capable of routing to non-AWS endpoints). It provides built-in - geo-DNS, latency-based routing, health checking, weighted - round robin and optional tight integration with some other - AWS services (e.g. Elastic Load Balancers). -1. Kubernetes L4 Service Load Balancing: This provides both a - [virtual cluster-local](http://kubernetes.io/v1.1/docs/user-guide/services.html#virtual-ips-and-service-proxies) - and a - [real externally routable](http://kubernetes.io/v1.1/docs/user-guide/services.html#type-loadbalancer) - service IP which is load-balanced (currently simple round-robin) - across the healthy pods comprising a service within a single - Kubernetes cluster. -1. [Kubernetes Ingress](http://kubernetes.io/v1.1/docs/user-guide/ingress.html): -A generic wrapper around cloud-provided L4 and L7 load balancing services, and -roll-your-own load balancers run in pods, e.g. HA Proxy. - -## Cluster Federation API - -The Cluster Federation API for load balancing should be compatible with the equivalent -Kubernetes API, to ease porting of clients between Kubernetes and -federations of Kubernetes clusters. -Further details below. - -## Common Client Behavior - -To be useful, our load balancing solution needs to work properly with real -client applications. There are a few different classes of those... - -### Browsers - -These are the most common external clients. These are all well-written. See below. - -### Well-written clients - -1. Do a DNS resolution every time they connect. -1. Don't cache beyond TTL (although a small percentage of the DNS - servers on which they rely might). -1. Do try multiple A records (in order) to connect. -1. (in an ideal world) Do use SRV records rather than hard-coded port numbers. - -Examples: - -+ all common browsers (except for SRV records) -+ ... - -### Dumb clients - -1. Don't do a DNS resolution every time they connect (or do cache beyond the -TTL). -1. Do try multiple A records - -Examples: - -+ ... - -### Dumber clients - -1. Only do a DNS lookup once on startup. -1. 
Only try the first returned DNS A record. - -Examples: - -+ ... - -### Dumbest clients - -1. Never do a DNS lookup - are pre-configured with a single (or possibly -multiple) fixed server IP(s). Nothing else matters. - -## Architecture and Implementation - -### General Control Plane Architecture - -Each cluster hosts one or more Cluster Federation master components (Federation API -servers, controller managers with leader election, and etcd quorum members. This -is documented in more detail in a separate design doc: -[Kubernetes and Cluster Federation Control Plane Resilience](https://docs.google.com/document/d/1jGcUVg9HDqQZdcgcFYlWMXXdZsplDdY6w3ZGJbU7lAw/edit#). - -In the description below, assume that 'n' clusters, named 'cluster-1'... -'cluster-n' have been registered against a Cluster Federation "federation-1", -each with their own set of Kubernetes API endpoints,so, -"[http://endpoint-1.cluster-1](http://endpoint-1.cluster-1), -[http://endpoint-2.cluster-1](http://endpoint-2.cluster-1) -... [http://endpoint-m.cluster-n](http://endpoint-m.cluster-n) . - -### Federated Services - -Federated Services are pretty straight-forward. They're comprised of multiple -equivalent underlying Kubernetes Services, each with their own external -endpoint, and a load balancing mechanism across them. Let's work through how -exactly that works in practice. - -Our user creates the following Federated Service (against a Federation -API endpoint): - - $ kubectl create -f my-service.yaml --context="federation-1" - -where service.yaml contains the following: - - kind: Service - metadata: - labels: - run: my-service - name: my-service - namespace: my-namespace - spec: - ports: - - port: 2379 - protocol: TCP - targetPort: 2379 - name: client - - port: 2380 - protocol: TCP - targetPort: 2380 - name: peer - selector: - run: my-service - type: LoadBalancer - -The Cluster Federation control system in turn creates one equivalent service (identical config to the above) -in each of the underlying Kubernetes clusters, each of which results in -something like this: - - $ kubectl get -o yaml --context="cluster-1" service my-service - - apiVersion: v1 - kind: Service - metadata: - creationTimestamp: 2015-11-25T23:35:25Z - labels: - run: my-service - name: my-service - namespace: my-namespace - resourceVersion: "147365" - selfLink: /api/v1/namespaces/my-namespace/services/my-service - uid: 33bfc927-93cd-11e5-a38c-42010af00002 - spec: - clusterIP: 10.0.153.185 - ports: - - name: client - nodePort: 31333 - port: 2379 - protocol: TCP - targetPort: 2379 - - name: peer - nodePort: 31086 - port: 2380 - protocol: TCP - targetPort: 2380 - selector: - run: my-service - sessionAffinity: None - type: LoadBalancer - status: - loadBalancer: - ingress: - - ip: 104.197.117.10 - -Similar services are created in `cluster-2` and `cluster-3`, each of which are -allocated their own `spec.clusterIP`, and `status.loadBalancer.ingress.ip`. 
- -In the Cluster Federation `federation-1`, the resulting federated service looks as follows: - - $ kubectl get -o yaml --context="federation-1" service my-service - - apiVersion: v1 - kind: Service - metadata: - creationTimestamp: 2015-11-25T23:35:23Z - labels: - run: my-service - name: my-service - namespace: my-namespace - resourceVersion: "157333" - selfLink: /api/v1/namespaces/my-namespace/services/my-service - uid: 33bfc927-93cd-11e5-a38c-42010af00007 - spec: - clusterIP: - ports: - - name: client - nodePort: 31333 - port: 2379 - protocol: TCP - targetPort: 2379 - - name: peer - nodePort: 31086 - port: 2380 - protocol: TCP - targetPort: 2380 - selector: - run: my-service - sessionAffinity: None - type: LoadBalancer - status: - loadBalancer: - ingress: - - hostname: my-service.my-namespace.my-federation.my-domain.com - -Note that the federated service: - -1. Is API-compatible with a vanilla Kubernetes service. -1. has no clusterIP (as it is cluster-independent) -1. has a federation-wide load balancer hostname - -In addition to the set of underlying Kubernetes services (one per cluster) -described above, the Cluster Federation control system has also created a DNS name (e.g. on -[Google Cloud DNS](https://cloud.google.com/dns) or -[AWS Route 53](https://aws.amazon.com/route53/), depending on configuration) -which provides load balancing across all of those services. For example, in a -very basic configuration: - - $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com - my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.117.10 - my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77 - my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157 - -Each of the above IP addresses (which are just the external load balancer -ingress IP's of each cluster service) is of course load balanced across the pods -comprising the service in each cluster. - -In a more sophisticated configuration (e.g. on GCE or GKE), the Cluster -Federation control system -automatically creates a -[GCE Global L7 Load Balancer](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules) -which exposes a single, globally load-balanced IP: - - $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com - my-service.my-namespace.my-federation.my-domain.com 180 IN A 107.194.17.44 - -Optionally, the Cluster Federation control system also configures the local DNS servers (SkyDNS) -in each Kubernetes cluster to preferentially return the local -clusterIP for the service in that cluster, with other clusters' -external service IP's (or a global load-balanced IP) also configured -for failover purposes: - - $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com - my-service.my-namespace.my-federation.my-domain.com 180 IN A 10.0.153.185 - my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77 - my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157 - -If Cluster Federation Global Service Health Checking is enabled, multiple service health -checkers running across the federated clusters collaborate to monitor the health -of the service endpoints, and automatically remove unhealthy endpoints from the -DNS record (e.g. a majority quorum is required to vote a service endpoint -unhealthy, to avoid false positives due to individual health checker network -isolation). 
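To make the DNS programming described above concrete, the following sketch collects the healthy per-cluster ingress IPs for a federated service and writes them as one A record set. The `dnsProvider` interface and the `clusterService` shape are assumptions standing in for a concrete backend (Google Cloud DNS, Route 53, ...) and the real service-status plumbing:

```go
// Illustrative sketch of federated-service DNS reconciliation.
package servicedns

// dnsProvider abstracts whatever DNS backend is configured for the federation.
type dnsProvider interface {
	// UpsertARecords replaces the A record set for fqdn with the given IPs.
	UpsertARecords(fqdn string, ips []string, ttlSeconds int) error
}

// clusterService is a trimmed-down view of one underlying cluster's service.
type clusterService struct {
	Cluster   string
	IngressIP string // status.loadBalancer.ingress[0].ip in that cluster
	Healthy   bool   // result of federation-level health checking
}

// reconcileServiceDNS publishes one A record per healthy cluster endpoint, so
// clients resolving the federation-wide name get the full list of ingress IPs.
func reconcileServiceDNS(p dnsProvider, service, namespace, federationZone string,
	endpoints []clusterService) error {

	fqdn := service + "." + namespace + "." + federationZone
	var ips []string
	for _, e := range endpoints {
		if e.Healthy && e.IngressIP != "" {
			ips = append(ips, e.IngressIP)
		}
	}
	// A short TTL keeps failover reasonably fast for clients that honour it.
	return p.UpsertARecords(fqdn, ips, 180)
}
```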
- -### Federated Replication Controllers - -So far we have a federated service defined, with a resolvable load balancer -hostname by which clients can reach it, but no pods serving traffic directed -there. So now we need a Federated Replication Controller. These are also fairly -straight-forward, being comprised of multiple underlying Kubernetes Replication -Controllers which do the hard work of keeping the desired number of Pod replicas -alive in each Kubernetes cluster. - - $ kubectl create -f my-service-rc.yaml --context="federation-1" - -where `my-service-rc.yaml` contains the following: - - kind: ReplicationController - metadata: - labels: - run: my-service - name: my-service - namespace: my-namespace - spec: - replicas: 6 - selector: - run: my-service - template: - metadata: - labels: - run: my-service - spec: - containers: - image: gcr.io/google_samples/my-service:v1 - name: my-service - ports: - - containerPort: 2379 - protocol: TCP - - containerPort: 2380 - protocol: TCP - -The Cluster Federation control system in turn creates one equivalent replication controller -(identical config to the above, except for the replica count) in each -of the underlying Kubernetes clusters, each of which results in -something like this: - - $ ./kubectl get -o yaml rc my-service --context="cluster-1" - kind: ReplicationController - metadata: - creationTimestamp: 2015-12-02T23:00:47Z - labels: - run: my-service - name: my-service - namespace: my-namespace - selfLink: /api/v1/namespaces/my-namespace/replicationcontrollers/my-service - uid: 86542109-9948-11e5-a38c-42010af00002 - spec: - replicas: 2 - selector: - run: my-service - template: - metadata: - labels: - run: my-service - spec: - containers: - image: gcr.io/google_samples/my-service:v1 - name: my-service - ports: - - containerPort: 2379 - protocol: TCP - - containerPort: 2380 - protocol: TCP - resources: {} - dnsPolicy: ClusterFirst - restartPolicy: Always - status: - replicas: 2 - -The exact number of replicas created in each underlying cluster will of course -depend on what scheduling policy is in force. In the above example, the -scheduler created an equal number of replicas (2) in each of the three -underlying clusters, to make up the total of 6 replicas required. To handle -entire cluster failures, various approaches are possible, including: -1. **simple overprovisioning**, such that sufficient replicas remain even if a - cluster fails. This wastes some resources, but is simple and reliable. - -2. **pod autoscaling**, where the replication controller in each - cluster automatically and autonomously increases the number of - replicas in its cluster in response to the additional traffic - diverted from the failed cluster. This saves resources and is relatively - simple, but there is some delay in the autoscaling. - -3. **federated replica migration**, where the Cluster Federation - control system detects the cluster failure and automatically - increases the replica count in the remaining clusters to make up - for the lost replicas in the failed cluster. This does not seem to - offer any benefits relative to pod autoscaling above, and is - arguably more complex to implement, but we note it here as a - possibility. - -### Implementation Details - -The implementation approach and architecture is very similar to Kubernetes, so -if you're familiar with how Kubernetes works, none of what follows will be -surprising. 
One additional design driver not present in Kubernetes is that -the Cluster Federation control system aims to be resilient to individual cluster and availability zone -failures. So the control plane spans multiple clusters. More specifically: - -+ Cluster Federation runs its own distinct set of API servers (typically one - or more per underlying Kubernetes cluster). These are completely - distinct from the Kubernetes API servers for each of the underlying - clusters. -+ Cluster Federation runs its own distinct quorum-based metadata store (etcd, - by default). Approximately 1 quorum member runs in each underlying - cluster ("approximately" because we aim for an odd number of quorum - members, and typically don't want more than 5 quorum members, even - if we have a larger number of federated clusters, so 2 clusters->3 - quorum members, 3->3, 4->3, 5->5, 6->5, 7->5 etc). - -Cluster Controllers in the Federation control system watch against the -Federation API server/etcd -state, and apply changes to the underlying Kubernetes clusters accordingly. They -also implement the anti-entropy mechanism for reconciling Cluster Federation "desired desired" -state against Kubernetes "actual desired" state. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
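The anti-entropy reconciliation mentioned above can be pictured with the following minimal sketch: the federation state is treated as the source of truth and each underlying cluster is repeatedly driven towards it. The `clusterClient` interface and the `objectKey`/`object` types are assumptions, not the real federation controller code:

```go
// Minimal sketch of per-cluster anti-entropy reconciliation.
package federationcontroller

import "reflect"

type objectKey struct{ Namespace, Name string }

type object struct {
	Key  objectKey
	Spec map[string]interface{} // stand-in for a real typed spec
}

type clusterClient interface {
	List() (map[objectKey]object, error)
	CreateOrUpdate(o object) error
	Delete(k objectKey) error
}

// reconcileCluster drives one cluster's "actual desired" state toward the
// federation's "desired desired" state.
func reconcileCluster(desired map[objectKey]object, c clusterClient) error {
	actual, err := c.List()
	if err != nil {
		return err
	}
	// Create or update anything missing or drifted.
	for k, want := range desired {
		have, ok := actual[k]
		if !ok || !reflect.DeepEqual(have.Spec, want.Spec) {
			if err := c.CreateOrUpdate(want); err != nil {
				return err
			}
		}
	}
	// Remove objects the federation no longer wants in this cluster.
	for k := range actual {
		if _, ok := desired[k]; !ok {
			if err := c.Delete(k); err != nil {
				return err
			}
		}
	}
	return nil
}
```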
\ No newline at end of file diff --git a/contributors/design-proposals/multicluster/federation-clusterselector.md b/contributors/design-proposals/multicluster/federation-clusterselector.md index 9cc9f45f..f0fbec72 100644 --- a/contributors/design-proposals/multicluster/federation-clusterselector.md +++ b/contributors/design-proposals/multicluster/federation-clusterselector.md @@ -1,81 +1,6 @@ -# ClusterSelector Federated Resource Placement +Design proposals have been archived. -This document proposes a design for label based control over placement of -Federated resources. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Tickets: -- https://github.com/kubernetes/kubernetes/issues/29887 - -Authors: - -- Dan Wilson (emaildanwilson@github.com). -- Nikhil Jindal (nikhiljindal@github). - -## Background - -End users will often need a simple way to target a subset of clusters for deployment of resources. In some cases this will be for a specific cluster in other cases it will be groups of clusters. -A few examples... - -1. Deploy the foo service to all clusters in Europe -1. Deploy the bar service to cluster test15 -1. Deploy the baz service to all prod clusters globally - -Currently, it's possible to control placement decision of Federated ReplicaSets -using the `federation.kubernetes.io/replica-set-preferences` annotation. This provides functionality to change the number of ReplicaSets created on each Federated Cluster, by setting the quantity for each Cluster by Cluster Name. Since cluster names are required, in situations where clusters are add/removed from Federation it would require the object definitions to change in order to maintain the same configuration. From the example above, if a new cluster is created in Europe and added to federation, then the replica-set-preferences would need to be updated to include the new cluster name. - -This proposal is to provide placement decision support for all object types using Labels on the Federated Clusters as opposed to cluster names. The matching language currently used for nodeAffinity placement decisions onto nodes can be leveraged. - -Carrying forward the examples from above... - -1. "location=europe" -1. "someLabel exists" -1. "environment notin ["qa", "dev"] - -## Design - -The proposed design uses a ClusterSelector annotation that has a value that is parsed into a struct definition that follows the same design as the [NodeSelector type used w/ nodeAffinity](https://git.k8s.io/kubernetes/pkg/api/types.go#L1972) and will also use the [Matches function](https://git.k8s.io/apimachinery/pkg/labels/selector.go#L172) of the apimachinery project to determine if an object should be sent on to federated clusters or not. - -In situations where objects are not to be forwarded to federated clusters, instead a delete api call will be made using the object definition. If the object does not exist it will be ignored. - -The federation-controller will be used to implement this with shared logic stored as utility functions to reduce duplicated code where appropriate. - -### End User Functionality -The annotation `federation.alpha.kubernetes.io/cluster-selector` is used on kubernetes objects to specify additional placement decisions that should be made. The value of the annotation will be a json object of type ClusterSelector which is an array of type ClusterSelectorRequirement. - -Each ClusterSelectorRequirement is defined in three possible parts consisting of -1. 
Key - Matches against label keys on the Federated Clusters. -1. Operator - Represents how the Key and/or Values will be matched against the label keys and values on the Federated Clusters; one of ("In", "in", "=", "==", "NotIn", "notin", "Exists", "exists", "!=", "DoesNotExist", "!", "Gt", "gt", "Lt", "lt"). -1. Values - Matches against the label values on the Federated Clusters using the Key specified. When the operator is "Exists", "exists", "DoesNotExist" or "!" then Values should not be specified. - -Here is an example ConfigMap that uses the ClusterSelector annotation. The YAML format is used here to show that the value of the annotation is still JSON. -```yaml -apiVersion: v1 -data: - myconfigkey: myconfigdata -kind: ConfigMap -metadata: - annotations: - federation.alpha.kubernetes.io/cluster-selector: '[{"key": "location", "operator": - "in", "values": ["europe"]}, {"key": "environment", "operator": "==", "values": - ["prod"]}]' - creationTimestamp: 2017-02-07T19:43:40Z - name: myconfig -``` - -In order for the ConfigMap in the example above to be forwarded to a Federated Cluster, that cluster MUST have two labels: "location" with the value "europe" and "environment" with the value "prod". - -### Matching Logic - -The logic to determine if an object is sent to a Federated Cluster will follow these rules. - -1. An object with no `federation.alpha.kubernetes.io/cluster-selector` annotation will always be forwarded on to all Federated Clusters even if they have labels configured. (This ensures no regression from existing functionality.) - -1. If an object contains the `federation.alpha.kubernetes.io/cluster-selector` annotation then ALL ClusterSelectorRequirements must match in order for the object to be forwarded to the Federated Cluster. - -1. If `federation.kubernetes.io/replica-set-preferences` are also defined they will be applied AFTER the ClusterSelectorRequirements. - -## Open Questions - -1. Should there be any special considerations for when dependent resources would not be forwarded together to a Federated Cluster? -1. How to improve usability of this feature long term. It will certainly help to give first-class API support, but easier ways to map labels or requirements to objects may be required. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
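The matching rules described above can be illustrated with a short sketch: an object with no cluster-selector annotation is forwarded everywhere; otherwise every requirement must match the cluster's labels. Only a subset of operators is shown, and the type and function names here are assumptions rather than the real API types:

```go
// Illustrative sketch of ClusterSelector matching against cluster labels.
package clusterselector

import "encoding/json"

const selectorAnnotation = "federation.alpha.kubernetes.io/cluster-selector"

type requirement struct {
	Key      string   `json:"key"`
	Operator string   `json:"operator"`
	Values   []string `json:"values,omitempty"`
}

// sendToCluster decides whether an object (represented by its annotations)
// should be forwarded to a cluster with the given labels.
func sendToCluster(objAnnotations, clusterLabels map[string]string) (bool, error) {
	raw, ok := objAnnotations[selectorAnnotation]
	if !ok {
		return true, nil // rule 1: no annotation, forward to all clusters
	}
	var reqs []requirement
	if err := json.Unmarshal([]byte(raw), &reqs); err != nil {
		return false, err
	}
	for _, r := range reqs { // rule 2: ALL requirements must match
		if !matches(r, clusterLabels) {
			return false, nil
		}
	}
	return true, nil
}

func matches(r requirement, labels map[string]string) bool {
	val, present := labels[r.Key]
	switch r.Operator {
	case "Exists", "exists":
		return present
	case "DoesNotExist", "!":
		return !present
	case "In", "in", "=", "==":
		return present && contains(r.Values, val)
	case "NotIn", "notin", "!=":
		return !present || !contains(r.Values, val)
	default:
		// Gt/Lt and other operators omitted from this sketch.
		return false
	}
}

func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}
```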
\ No newline at end of file diff --git a/contributors/design-proposals/multicluster/federation-high-level-arch.png b/contributors/design-proposals/multicluster/federation-high-level-arch.png Binary files differdeleted file mode 100644 index 8a416cc1..00000000 --- a/contributors/design-proposals/multicluster/federation-high-level-arch.png +++ /dev/null diff --git a/contributors/design-proposals/multicluster/federation-lite.md b/contributors/design-proposals/multicluster/federation-lite.md index 3dd1f4d8..f0fbec72 100644 --- a/contributors/design-proposals/multicluster/federation-lite.md +++ b/contributors/design-proposals/multicluster/federation-lite.md @@ -1,197 +1,6 @@ -# Kubernetes Multi-AZ Clusters +Design proposals have been archived. -## (previously nicknamed "Ubernetes-Lite") +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Introduction - -Full Cluster Federation will offer sophisticated federation between multiple kubernetes -clusters, offering true high-availability, multiple provider support & -cloud-bursting, multiple region support etc. However, many users have -expressed a desire for a "reasonably" high-available cluster, that runs in -multiple zones on GCE or availability zones in AWS, and can tolerate the failure -of a single zone without the complexity of running multiple clusters. - -Multi-AZ Clusters aim to deliver exactly that functionality: to run a single -Kubernetes cluster in multiple zones. It will attempt to make reasonable -scheduling decisions, in particular so that a replication controller's pods are -spread across zones, and it will try to be aware of constraints - for example -that a volume cannot be mounted on a node in a different zone. - -Multi-AZ Clusters are deliberately limited in scope; for many advanced functions -the answer will be "use full Cluster Federation". For example, multiple-region -support is not in scope. Routing affinity (e.g. so that a webserver will -prefer to talk to a backend service in the same zone) is similarly not in -scope. - -## Design - -These are the main requirements: - -1. kube-up must allow bringing up a cluster that spans multiple zones. -1. pods in a replication controller should attempt to spread across zones. -1. pods which require volumes should not be scheduled onto nodes in a different zone. -1. load-balanced services should work reasonably - -### kube-up support - -kube-up support for multiple zones will initially be considered -advanced/experimental functionality, so the interface is not initially going to -be particularly user-friendly. As we design the evolution of kube-up, we will -make multiple zones better supported. - -For the initial implementation, kube-up must be run multiple times, once for -each zone. The first kube-up will take place as normal, but then for each -additional zone the user must run kube-up again, specifying -`KUBE_USE_EXISTING_MASTER=true` and `KUBE_SUBNET_CIDR=172.20.x.0/24`. This will then -create additional nodes in a different zone, but will register them with the -existing master. - -### Zone spreading - -This will be implemented by modifying the existing scheduler priority function -`SelectorSpread`. Currently this priority function aims to put pods in an RC -on different hosts, but it will be extended first to spread across zones, and -then to spread across hosts. 
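As a rough illustration of the zone-aware extension to `SelectorSpread` described above, the sketch below scores nodes higher when their zone currently runs fewer pods of the same replication controller. The plain maps and the 0-10 score range are assumed simplifications for illustration, not the real priority function.

```go
package main

import "fmt"

// zoneSpreadScore favours nodes whose zone currently runs the fewest pods
// belonging to the same controller. Scores are normalised to 0..10.
func zoneSpreadScore(nodeZone map[string]string, podNode []string) map[string]int {
	// Count the controller's existing pods per zone.
	podsPerZone := map[string]int{}
	maxCount := 0
	for _, node := range podNode {
		zone := nodeZone[node]
		podsPerZone[zone]++
		if podsPerZone[zone] > maxCount {
			maxCount = podsPerZone[zone]
		}
	}
	scores := map[string]int{}
	for node, zone := range nodeZone {
		if maxCount == 0 {
			scores[node] = 10 // no existing pods: all zones equally good
			continue
		}
		// Fewer pods already in the zone => higher score for its nodes.
		scores[node] = 10 * (maxCount - podsPerZone[zone]) / maxCount
	}
	return scores
}

func main() {
	nodeZone := map[string]string{
		"node-a1": "us-central1-a", "node-a2": "us-central1-a",
		"node-b1": "us-central1-b",
	}
	// Existing pods of the RC run on node-a1 and node-a2 (both in zone a).
	fmt.Println(zoneSpreadScore(nodeZone, []string{"node-a1", "node-a2"}))
	// node-b1 gets the highest score, so new pods drift towards zone b.
}
```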
- -So that the scheduler does not need to call out to the cloud provider on every -scheduling decision, we must somehow record the zone information for each node. -The implementation of this will be described in the implementation section. - -Note that zone spreading is 'best effort'; zones are just be one of the factors -in making scheduling decisions, and thus it is not guaranteed that pods will -spread evenly across zones. However, this is likely desirable: if a zone is -overloaded or failing, we still want to schedule the requested number of pods. - -### Volume affinity - -Most cloud providers (at least GCE and AWS) cannot attach their persistent -volumes across zones. Thus when a pod is being scheduled, if there is a volume -attached, that will dictate the zone. This will be implemented using a new -scheduler predicate (a hard constraint): `VolumeZonePredicate`. - -When `VolumeZonePredicate` observes a pod scheduling request that includes a -volume, if that volume is zone-specific, `VolumeZonePredicate` will exclude any -nodes not in that zone. - -Again, to avoid the scheduler calling out to the cloud provider, this will rely -on information attached to the volumes. This means that this will only support -PersistentVolumeClaims, because direct mounts do not have a place to attach -zone information. PersistentVolumes will then include zone information where -volumes are zone-specific. - -### Load-balanced services should operate reasonably - -For both AWS & GCE, Kubernetes creates a native cloud load-balancer for each -service of type LoadBalancer. The native cloud load-balancers on both AWS & -GCE are region-level, and support load-balancing across instances in multiple -zones (in the same region). For both clouds, the behaviour of the native cloud -load-balancer is reasonable in the face of failures (indeed, this is why clouds -provide load-balancing as a primitive). - -For multi-AZ clusters we will therefore simply rely on the native cloud provider -load balancer behaviour, and we do not anticipate substantial code changes. - -One notable shortcoming here is that load-balanced traffic still goes through -kube-proxy controlled routing, and kube-proxy does not (currently) favor -targeting a pod running on the same instance or even the same zone. This will -likely produce a lot of unnecessary cross-zone traffic (which is likely slower -and more expensive). This might be sufficiently low-hanging fruit that we -choose to address it in kube-proxy / multi-AZ clusters, but this can be addressed -after the initial implementation. - - -## Implementation - -The main implementation points are: - -1. how to attach zone information to Nodes and PersistentVolumes -1. how nodes get zone information -1. how volumes get zone information - -### Attaching zone information - -We must attach zone information to Nodes and PersistentVolumes, and possibly to -other resources in future. There are two obvious alternatives: we can use -labels/annotations, or we can extend the schema to include the information. - -For the initial implementation, we propose to use labels. The reasoning is: - -1. It is considerably easier to implement. -1. We will reserve the two labels `failure-domain.alpha.kubernetes.io/zone` and -`failure-domain.alpha.kubernetes.io/region` for the two pieces of information -we need. By putting this under the `kubernetes.io` namespace there is no risk -of collision, and by putting it under `alpha.kubernetes.io` we clearly mark -this as an experimental feature. -1. 
We do not yet know whether these labels will be sufficient for all -environments, nor which entities will require zone information. Labels give us -more flexibility here. -1. Because the labels are reserved, we can move to schema-defined fields in -future using our cross-version mapping techniques. - -### Node labeling - -We do not want to require an administrator to manually label nodes. We instead -modify the kubelet to include the appropriate labels when it registers itself. -The information is easily obtained by the kubelet from the cloud provider. - -### Volume labeling - -As with nodes, we do not want to require an administrator to manually label -volumes. We will create an admission controller `PersistentVolumeLabel`. -`PersistentVolumeLabel` will intercept requests to create PersistentVolumes, -and will label them appropriately by calling in to the cloud provider. - -## AWS Specific Considerations - -The AWS implementation here is fairly straightforward. The AWS API is -region-wide, meaning that a single call will find instances and volumes in all -zones. In addition, instance ids and volume ids are unique per-region (and -hence also per-zone). I believe they are actually globally unique, but I do -not know if this is guaranteed; in any case we only need global uniqueness if -we are to span regions, which will not be supported by multi-AZ clusters (to do -that correctly requires a full Cluster Federation type approach). - -## GCE Specific Considerations - -The GCE implementation is more complicated than the AWS implementation because -GCE APIs are zone-scoped. To perform an operation, we must perform one REST -call per zone and combine the results, unless we can determine in advance that -an operation references a particular zone. For many operations, we can make -that determination, but in some cases - such as listing all instances, we must -combine results from calls in all relevant zones. - -A further complexity is that GCE volume names are scoped per-zone, not -per-region. Thus it is permitted to have two volumes both named `myvolume` in -two different GCE zones. (Instance names are currently unique per-region, and -thus are not a problem for multi-AZ clusters). - -The volume scoping leads to a (small) behavioural change for multi-AZ clusters on -GCE. If you had two volumes both named `myvolume` in two different GCE zones, -this would not be ambiguous when Kubernetes is operating only in a single zone. -But, when operating a cluster across multiple zones, `myvolume` is no longer -sufficient to specify a volume uniquely. Worse, the fact that a volume happens -to be unambiguous at a particular time is no guarantee that it will continue to -be unambiguous in future, because a volume with the same name could -subsequently be created in a second zone. While perhaps unlikely in practice, -we cannot automatically enable multi-AZ clusters for GCE users if this then causes -volume mounts to stop working. - -This suggests that (at least on GCE), multi-AZ clusters must be optional (i.e. -there must be a feature-flag). It may be that we can make this feature -semi-automatic in future, by detecting whether nodes are running in multiple -zones, but it seems likely that kube-up could instead simply set this flag. - -For the initial implementation, creating volumes with identical names will -yield undefined results. Later, we may add some way to specify the zone for a -volume (and possibly require that volumes have their zone specified when -running in multi-AZ cluster mode). 
We could add a new `zone` field to the -PersistentVolume type for GCE PD volumes, or we could use a DNS-style dotted -name for the volume name (<name>.<zone>) - -Initially therefore, the GCE changes will be to: - -1. change kube-up to support creation of a cluster in multiple zones -1. pass a flag enabling multi-AZ clusters with kube-up -1. change the kubernetes cloud provider to iterate through relevant zones when resolving items -1. tag GCE PD volumes with the appropriate zone information +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
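To illustrate the `VolumeZonePredicate` and the reserved `failure-domain.alpha.kubernetes.io/*` labels discussed above, here is a minimal sketch of the hard constraint: a node is rejected as soon as any zone- or region-labelled volume used by the pod disagrees with the node's labels. Reducing pods, PVCs and PVs to plain label maps, and the function shape itself, are illustrative assumptions rather than the scheduler's actual predicate code.

```go
package main

import "fmt"

const (
	zoneLabel   = "failure-domain.alpha.kubernetes.io/zone"
	regionLabel = "failure-domain.alpha.kubernetes.io/region"
)

// volumeZonePredicate returns true if the node may host a pod whose
// PersistentVolumes carry the label sets in pvLabels.
func volumeZonePredicate(nodeLabels map[string]string, pvLabels []map[string]string) bool {
	for _, pv := range pvLabels {
		for _, key := range []string{zoneLabel, regionLabel} {
			want, volumeIsScoped := pv[key]
			if !volumeIsScoped {
				continue // volume is not zone/region specific for this key
			}
			if nodeLabels[key] != want {
				return false // node sits in a different zone/region than the volume
			}
		}
	}
	return true
}

func main() {
	node := map[string]string{zoneLabel: "us-east-1a", regionLabel: "us-east-1"}
	ebsVolume := map[string]string{zoneLabel: "us-east-1b", regionLabel: "us-east-1"}
	fmt.Println(volumeZonePredicate(node, []map[string]string{ebsVolume})) // false: wrong zone
}
```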
\ No newline at end of file diff --git a/contributors/design-proposals/multicluster/federation-phase-1.md b/contributors/design-proposals/multicluster/federation-phase-1.md index 85c10ddb..f0fbec72 100644 --- a/contributors/design-proposals/multicluster/federation-phase-1.md +++ b/contributors/design-proposals/multicluster/federation-phase-1.md @@ -1,402 +1,6 @@ -# Ubernetes Design Spec (phase one) +Design proposals have been archived. -**Huawei PaaS Team** +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## INTRODUCTION -In this document we propose a design for the “Control Plane” of -Kubernetes (K8S) federation (a.k.a. “Ubernetes”). For background of -this work please refer to -[this proposal](federation.md). -The document is arranged as following. First we briefly list scenarios -and use cases that motivate K8S federation work. These use cases drive -the design and they also verify the design. We summarize the -functionality requirements from these use cases, and define the “in -scope” functionalities that will be covered by this design (phase -one). After that we give an overview of the proposed architecture, API -and building blocks. And also we go through several activity flows to -see how these building blocks work together to support use cases. - -## REQUIREMENTS - -There are many reasons why customers may want to build a K8S -federation: - -+ **High Availability:** Customers want to be immune to the outage of - a single availability zone, region or even a cloud provider. -+ **Sensitive workloads:** Some workloads can only run on a particular - cluster. They cannot be scheduled to or migrated to other clusters. -+ **Capacity overflow:** Customers prefer to run workloads on a - primary cluster. But if the capacity of the cluster is not - sufficient, workloads should be automatically distributed to other - clusters. -+ **Vendor lock-in avoidance:** Customers want to spread their - workloads on different cloud providers, and can easily increase or - decrease the workload proportion of a specific provider. -+ **Cluster Size Enhancement:** Currently K8S cluster can only support -a limited size. While the community is actively improving it, it can -be expected that cluster size will be a problem if K8S is used for -large workloads or public PaaS infrastructure. While we can separate -different tenants to different clusters, it would be good to have a -unified view. - -Here are the functionality requirements derived from above use cases: - -+ Clients of the federation control plane API server can register and deregister -clusters. -+ Workloads should be spread to different clusters according to the - workload distribution policy. -+ Pods are able to discover and connect to services hosted in other - clusters (in cases where inter-cluster networking is necessary, - desirable and implemented). -+ Traffic to these pods should be spread across clusters (in a manner - similar to load balancing, although it might not be strictly - speaking balanced). -+ The control plane needs to know when a cluster is down, and migrate - the workloads to other clusters. -+ Clients have a unified view and a central control point for above - activities. - -## SCOPE - -It's difficult to have a perfect design with one click that implements -all the above requirements. Therefore we will go with an iterative -approach to design and build the system. This document describes the -phase one of the whole work. 
In phase one we will cover only the -following objectives: - -+ Define the basic building blocks and API objects of the control plane -+ Implement a basic end-to-end workflow - + Clients register federated clusters - + Clients submit a workload - + The workload is distributed to different clusters - + Service discovery - + Load balancing - -The following parts are NOT covered in phase one: - -+ Authentication and authorization (other than basic client authentication against the ubernetes API, and from the ubernetes control plane to the underlying kubernetes clusters). -+ Deployment units other than replication controller and service -+ Complex distribution policy of workloads -+ Service affinity and migration - -## ARCHITECTURE - -The overall architecture of a control plane is shown as follows: - - - -Some design principles we are following in this architecture: - -1. Keep the underlying K8S clusters independent. They should have no knowledge of the control plane or of each other. -1. Keep the Ubernetes API interface compatible with the K8S API as much as possible. -1. Re-use concepts from K8S as much as possible. This reduces customers' learning curve and is good for adoption. Below is a brief description of each module contained in the above diagram. - -## Ubernetes API Server - -The API Server in the Ubernetes control plane works just like the API Server in K8S. It talks to a distributed key-value store to persist, retrieve and watch API objects. This store is completely distinct from the kubernetes key-value stores (etcd) in the underlying kubernetes clusters. We still use `etcd` as the distributed storage so customers don't need to learn and manage a different storage system, although it is envisaged that other storage systems (Consul, ZooKeeper) will probably be developed and supported over time. - -## Ubernetes Scheduler - -The Ubernetes Scheduler schedules resources onto the underlying Kubernetes clusters. For example, it watches for unscheduled Ubernetes replication controllers (those that have not yet been scheduled onto underlying Kubernetes clusters) and performs the global scheduling work. For each unscheduled replication controller, it calls the policy engine to decide how to split workloads among clusters. It creates a Kubernetes Replication Controller for one or more underlying clusters, and posts them back to the `etcd` storage. - -One subtlety worth noting here is that the scheduling decision is arrived at by combining the application-specific request from the user (which might include, for example, placement constraints) with the global policy specified by the federation administrator (for example, "prefer on-premise clusters over AWS clusters" or "spread load equally across clusters"). - -## Ubernetes Cluster Controller - -The cluster controller performs the following two kinds of work: - -1. It watches all the sub-resources that are created by Ubernetes components, such as a sub-RC or a sub-service, and then creates the corresponding API objects on the underlying K8S clusters. -1. It periodically retrieves the available resource metrics from each underlying K8S cluster, and updates them as the status of the corresponding `cluster` API object. An alternative design might be to run a pod in each underlying cluster that reports metrics for that cluster to the Ubernetes control plane. Which approach is better remains an open topic of discussion. 
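The cluster controller's first responsibility above can be sketched as a very small reconciliation step: ignore sub-resources bound to other clusters, and mirror the rest into the underlying cluster through an ordinary Kubernetes API client built from the registered cluster's address and credential. The interfaces and names below are illustrative assumptions, not the phase-one implementation.

```go
package federation

// SubResource is any object (sub-RC, sub-service, ...) that the Ubernetes
// scheduler has bound to a specific underlying cluster. Illustrative type.
type SubResource struct {
	Kind        string
	Name        string
	ClusterName string
	Spec        []byte // serialized object to create downstream
}

// ClusterClient is a plain Kubernetes API client for one underlying cluster,
// built from the address/credential stored in the `cluster` API object.
type ClusterClient interface {
	CreateOrUpdate(kind, name string, spec []byte) error
}

// ClusterController reconciles sub-resources for a single underlying cluster.
type ClusterController struct {
	ClusterName string
	Client      ClusterClient
}

// Sync is called for every sub-resource event seen on the control plane; it
// skips objects bound to other clusters and mirrors the rest downstream.
func (c *ClusterController) Sync(obj SubResource) error {
	if obj.ClusterName != c.ClusterName {
		return nil // another cluster controller is responsible for this object
	}
	return c.Client.CreateOrUpdate(obj.Kind, obj.Name, obj.Spec)
}
```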
- -## Ubernetes Service Controller - -The Ubernetes service controller is a federation-level implementation -of K8S service controller. It watches service resources created on -control plane, creates corresponding K8S services on each involved K8S -clusters. Besides interacting with services resources on each -individual K8S clusters, the Ubernetes service controller also -performs some global DNS registration work. - -## API OBJECTS - -## Cluster - -Cluster is a new first-class API object introduced in this design. For -each registered K8S cluster there will be such an API resource in -control plane. The way clients register or deregister a cluster is to -send corresponding REST requests to following URL: -`/api/{$version}/clusters`. Because control plane is behaving like a -regular K8S client to the underlying clusters, the spec of a cluster -object contains necessary properties like K8S cluster address and -credentials. The status of a cluster API object will contain -following information: - -1. Which phase of its lifecycle -1. Cluster resource metrics for scheduling decisions. -1. Other metadata like the version of cluster - -$version.clusterSpec - -<table style="border:1px solid #000000;border-collapse:collapse;"> -<tbody> -<tr> -<td style="padding:5px;"><b>Name</b><br> -</td> -<td style="padding:5px;"><b>Description</b><br> -</td> -<td style="padding:5px;"><b>Required</b><br> -</td> -<td style="padding:5px;"><b>Schema</b><br> -</td> -<td style="padding:5px;"><b>Default</b><br> -</td> -</tr> -<tr> -<td style="padding:5px;">Address<br> -</td> -<td style="padding:5px;">address of the cluster<br> -</td> -<td style="padding:5px;">yes<br> -</td> -<td style="padding:5px;">address<br> -</td> -<td style="padding:5px;"><p></p></td> -</tr> -<tr> -<td style="padding:5px;">Credential<br> -</td> -<td style="padding:5px;">the type (e.g. bearer token, client -certificate etc) and data of the credential used to access cluster. It's used for system routines (not behalf of users)<br> -</td> -<td style="padding:5px;">yes<br> -</td> -<td style="padding:5px;">string <br> -</td> -<td style="padding:5px;"><p></p></td> -</tr> -</tbody> -</table> - -$version.clusterStatus - -<table style="border:1px solid #000000;border-collapse:collapse;"> -<tbody> -<tr> -<td style="padding:5px;"><b>Name</b><br> -</td> -<td style="padding:5px;"><b>Description</b><br> -</td> -<td style="padding:5px;"><b>Required</b><br> -</td> -<td style="padding:5px;"><b>Schema</b><br> -</td> -<td style="padding:5px;"><b>Default</b><br> -</td> -</tr> -<tr> -<td style="padding:5px;">Phase<br> -</td> -<td style="padding:5px;">the recently observed lifecycle phase of the cluster<br> -</td> -<td style="padding:5px;">yes<br> -</td> -<td style="padding:5px;">enum<br> -</td> -<td style="padding:5px;"><p></p></td> -</tr> -<tr> -<td style="padding:5px;">Capacity<br> -</td> -<td style="padding:5px;">represents the available resources of a cluster<br> -</td> -<td style="padding:5px;">yes<br> -</td> -<td style="padding:5px;">any<br> -</td> -<td style="padding:5px;"><p></p></td> -</tr> -<tr> -<td style="padding:5px;">ClusterMeta<br> -</td> -<td style="padding:5px;">Other cluster metadata like the version<br> -</td> -<td style="padding:5px;">yes<br> -</td> -<td style="padding:5px;">ClusterMeta<br> -</td> -<td style="padding:5px;"><p></p></td> -</tr> -</tbody> -</table> - -**For simplicity we didn't introduce a separate “cluster metrics” API -object here**. 
The cluster resource metrics are stored in cluster -status section, just like what we did to nodes in K8S. In phase one it -only contains available CPU resources and memory resources. The -cluster controller will periodically poll the underlying cluster API -Server to get cluster capability. In phase one it gets the metrics by -simply aggregating metrics from all nodes. In future we will improve -this with more efficient ways like leveraging heapster, and also more -metrics will be supported. Similar to node phases in K8S, the “phase” -field includes following values: - -+ pending: newly registered clusters or clusters suspended by admin - for various reasons. They are not eligible for accepting workloads -+ running: clusters in normal status that can accept workloads -+ offline: clusters temporarily down or not reachable -+ terminated: clusters removed from federation - -Below is the state transition diagram. - - - -## Replication Controller - -A global workload submitted to control plane is represented as a - replication controller in the Cluster Federation control plane. When a replication controller -is submitted to control plane, clients need a way to express its -requirements or preferences on clusters. Depending on different use -cases it may be complex. For example: - -+ This workload can only be scheduled to cluster Foo. It cannot be - scheduled to any other clusters. (use case: sensitive workloads). -+ This workload prefers cluster Foo. But if there is no available - capacity on cluster Foo, it's OK to be scheduled to cluster Bar - (use case: workload ) -+ Seventy percent of this workload should be scheduled to cluster Foo, - and thirty percent should be scheduled to cluster Bar (use case: - vendor lock-in avoidance). In phase one, we only introduce a - _clusterSelector_ field to filter acceptable clusters. In default - case there is no such selector and it means any cluster is - acceptable. - -Below is a sample of the YAML to create such a replication controller. - -```yaml -apiVersion: v1 -kind: ReplicationController -metadata: - name: nginx-controller -spec: - replicas: 5 - selector: - app: nginx - template: - metadata: - labels: - app: nginx - spec: - containers: - - name: nginx - image: nginx - ports: - - containerPort: 80 - clusterSelector: - name in (Foo, Bar) -``` - -Currently clusterSelector (implemented as a -[LabelSelector](../../pkg/apis/extensions/v1beta1/types.go#L704)) -only supports a simple list of acceptable clusters. Workloads will be -evenly distributed on these acceptable clusters in phase one. After -phase one we will define syntax to represent more advanced -constraints, like cluster preference ordering, desired number of -split workloads, desired ratio of workloads spread on different -clusters, etc. - -Besides this explicit “clusterSelector” filter, a workload may have -some implicit scheduling restrictions. For example it defines -“nodeSelector” which can only be satisfied on some particular -clusters. How to handle this will be addressed after phase one. - -## Federated Services - -The Service API object exposed by the Cluster Federation is similar to service -objects on Kubernetes. It defines the access to a group of pods. The -federation service controller will create corresponding Kubernetes -service objects on underlying clusters. These are detailed in a -separate design document: [Federated Services](federated-services.md). - -## Pod - -In phase one we only support scheduling replication controllers. 
Pod -scheduling will be supported in a later phase. This is primarily in -order to keep the Cluster Federation API compatible with the Kubernetes API. - -## ACTIVITY FLOWS - -## Scheduling - -The below diagram shows how workloads are scheduled on the Cluster Federation control -plane: - -1. A replication controller is created by the client. -1. The API Server persists it into the storage. -1. The cluster controller periodically polls the latest available resource metrics from the underlying clusters. -1. The scheduler is watching all pending RCs. It picks up the RC, makes policy-driven decisions and splits it into different sub RCs. -1. Each cluster controller is watching the sub RCs bound to its corresponding cluster. It picks up the newly created sub RC. -1. The cluster controller issues requests to the underlying cluster API Server to create the RC. In phase one we don't support complex distribution policies. The scheduling rule is basically (a minimal sketch of these two rules is given below): - 1. If an RC does not specify any nodeSelector, it will be scheduled to the least loaded K8S cluster(s) that has enough available resources. - 1. If an RC specifies _N_ acceptable clusters in the clusterSelector, all replicas will be evenly distributed among these clusters. - -There is a potential race condition here. Say at time _T1_ the control plane learns there are _m_ available resources in a K8S cluster. As the cluster is working independently it still accepts workload requests from other K8S clients or even another Cluster Federation control plane. The Cluster Federation scheduling decision is based on this data of available resources. However, when the actual RC creation happens on the cluster at time _T2_, the cluster may not have enough resources at that time. We will address this problem in later phases with some proposed solutions like resource reservation mechanisms. - - - -## Service Discovery - -This part has been included in the section “Federated Service” of the document “[Federated Cross-cluster Load Balancing and Service Discovery Requirements and System Design](federated-services.md)”. Please refer to that document for details. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
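Here is a minimal Go sketch of the two phase-one scheduling rules referenced above. It assumes, for illustration, that rule (i) applies when no cluster selector is given, and it treats cluster capacity as abstract "replica slots"; the types and function names are ours, not the proposal's API.

```go
package scheduler

import (
	"errors"
	"sort"
)

// ClusterLoad is the control plane's (possibly stale) view of one registered
// cluster, as polled by the cluster controller. Illustrative type.
type ClusterLoad struct {
	Name      string
	Available int // free capacity, in abstract "replica slots"
}

// Split maps cluster name -> number of replicas of the sub-RC to create there.
type Split map[string]int

// Schedule implements the two phase-one rules: with a cluster selector,
// spread replicas evenly over the selected clusters; without one, place
// everything on the least loaded cluster that still has enough room.
func Schedule(replicas int, selected []string, clusters []ClusterLoad) (Split, error) {
	if len(selected) > 0 {
		split := Split{}
		per, extra := replicas/len(selected), replicas%len(selected)
		for i, name := range selected {
			n := per
			if i < extra { // hand out the remainder one replica at a time
				n++
			}
			if n > 0 {
				split[name] = n
			}
		}
		return split, nil
	}
	// No selector: pick the least loaded cluster (most free capacity).
	sort.Slice(clusters, func(i, j int) bool {
		return clusters[i].Available > clusters[j].Available
	})
	if len(clusters) == 0 || clusters[0].Available < replicas {
		return nil, errors.New("no cluster has enough available capacity")
	}
	return Split{clusters[0].Name: replicas}, nil
}
```

Handing the remainder out one replica per cluster is just one simple way to keep the split "even"; the proposal does not prescribe a tie-breaking rule.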
\ No newline at end of file diff --git a/contributors/design-proposals/multicluster/federation.md b/contributors/design-proposals/multicluster/federation.md index 21c159d7..f0fbec72 100644 --- a/contributors/design-proposals/multicluster/federation.md +++ b/contributors/design-proposals/multicluster/federation.md @@ -1,643 +1,6 @@ -# Kubernetes Cluster Federation +Design proposals have been archived. -## (previously nicknamed "Ubernetes") +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Requirements Analysis and Product Proposal -## _by Quinton Hoole ([quinton@google.com](mailto:quinton@google.com))_ - -_Initial revision: 2015-03-05_ -_Last updated: 2015-08-20_ -This doc: [tinyurl.com/ubernetesv2](http://tinyurl.com/ubernetesv2) -Original slides: [tinyurl.com/ubernetes-slides](http://tinyurl.com/ubernetes-slides) -Updated slides: [tinyurl.com/ubernetes-whereto](http://tinyurl.com/ubernetes-whereto) - -## Introduction - -Today, each Kubernetes cluster is a relatively self-contained unit, -which typically runs in a single "on-premise" data centre or single -availability zone of a cloud provider (Google's GCE, Amazon's AWS, -etc). - -Several current and potential Kubernetes users and customers have -expressed a keen interest in tying together ("federating") multiple -clusters in some sensible way in order to enable the following kinds -of use cases (intentionally vague): - -1. _"Preferentially run my workloads in my on-premise cluster(s), but - automatically overflow to my cloud-hosted cluster(s) if I run out - of on-premise capacity"_. -1. _"Most of my workloads should run in my preferred cloud-hosted - cluster(s), but some are privacy-sensitive, and should be - automatically diverted to run in my secure, on-premise - cluster(s)"_. -1. _"I want to avoid vendor lock-in, so I want my workloads to run - across multiple cloud providers all the time. I change my set of - such cloud providers, and my pricing contracts with them, - periodically"_. -1. _"I want to be immune to any single data centre or cloud - availability zone outage, so I want to spread my service across - multiple such zones (and ideally even across multiple cloud - providers)."_ - -The above use cases are by necessity left imprecisely defined. The -rest of this document explores these use cases and their implications -in further detail, and compares a few alternative high level -approaches to addressing them. The idea of cluster federation has -informally become known as _"Ubernetes"_. - -## Summary/TL;DR - -Four primary customer-driven use cases are explored in more detail. -The two highest priority ones relate to High Availability and -Application Portability (between cloud providers, and between -on-premise and cloud providers). - -Four primary federation primitives are identified (location affinity, -cross-cluster scheduling, service discovery and application -migration). Fortunately not all four of these primitives are required -for each primary use case, so incremental development is feasible. - -## What exactly is a Kubernetes Cluster? - -A central design concept in Kubernetes is that of a _cluster_. While -loosely speaking, a cluster can be thought of as running in a single -data center, or cloud provider availability zone, a more precise -definition is that each cluster provides: - -1. a single Kubernetes API entry point, -1. a consistent, cluster-wide resource naming scheme -1. a scheduling/container placement domain -1. 
a service network routing domain -1. an authentication and authorization model. - -The above in turn imply the need for a relatively performant, reliable -and cheap network within each cluster. - -There is also assumed to be some degree of failure correlation across -a cluster, i.e. whole clusters are expected to fail, at least -occasionally (due to cluster-wide power and network failures, natural -disasters etc). Clusters are often relatively homogeneous in that all -compute nodes are typically provided by a single cloud provider or -hardware vendor, and connected by a common, unified network fabric. -But these are not hard requirements of Kubernetes. - -Other classes of Kubernetes deployments than the one sketched above -are technically feasible, but come with some challenges of their own, -and are not yet common or explicitly supported. - -More specifically, having a Kubernetes cluster span multiple -well-connected availability zones within a single geographical region -(e.g. US North East, UK, Japan etc) is worthy of further -consideration, in particular because it potentially addresses -some of these requirements. - -## What use cases require Cluster Federation? - -Let's name a few concrete use cases to aid the discussion: - -## 1.Capacity Overflow - -_"I want to preferentially run my workloads in my on-premise cluster(s), but automatically "overflow" to my cloud-hosted cluster(s) when I run out of on-premise capacity."_ - -This idea is known in some circles as "[cloudbursting](http://searchcloudcomputing.techtarget.com/definition/cloud-bursting)". - -**Clarifying questions:** What is the unit of overflow? Individual - pods? Probably not always. Replication controllers and their - associated sets of pods? Groups of replication controllers - (a.k.a. distributed applications)? How are persistent disks - overflowed? Can the "overflowed" pods communicate with their - brethren and sistren pods and services in the other cluster(s)? - Presumably yes, at higher cost and latency, provided that they use - external service discovery. Is "overflow" enabled only when creating - new workloads/replication controllers, or are existing workloads - dynamically migrated between clusters based on fluctuating available - capacity? If so, what is the desired behaviour, and how is it - achieved? How, if at all, does this relate to quota enforcement - (e.g. if we run out of on-premise capacity, can all or only some - quotas transfer to other, potentially more expensive off-premise - capacity?) - -It seems that most of this boils down to: - -1. **location affinity** (pods relative to each other, and to other - stateful services like persistent storage - how is this expressed - and enforced?) -1. **cross-cluster scheduling** (given location affinity constraints - and other scheduling policy, which resources are assigned to which - clusters, and by what?) -1. **cross-cluster service discovery** (how do pods in one cluster - discover and communicate with pods in another cluster?) -1. **cross-cluster migration** (how do compute and storage resources, - and the distributed applications to which they belong, move from - one cluster to another) -1. **cross-cluster load-balancing** (how does is user traffic directed - to an appropriate cluster?) -1. **cross-cluster monitoring and auditing** (a.k.a. Unified Visibility) - -## 2. 
Sensitive Workloads - -_"I want most of my workloads to run in my preferred cloud-hosted -cluster(s), but some are privacy-sensitive, and should be -automatically diverted to run in my secure, on-premise cluster(s). The -list of privacy-sensitive workloads changes over time, and they're -subject to external auditing."_ - -**Clarifying questions:** -1. What kinds of rules determine which -workloads go where? - 1. Is there in fact a requirement to have these rules be - declaratively expressed and automatically enforced, or is it - acceptable/better to have users manually select where to run - their workloads when starting them? - 1. Is a static mapping from container (or more typically, - replication controller) to cluster maintained and enforced? - 1. If so, is it only enforced on startup, or are things migrated - between clusters when the mappings change? - -This starts to look quite similar to "1. Capacity Overflow", and again -seems to boil down to: - -1. location affinity -1. cross-cluster scheduling -1. cross-cluster service discovery -1. cross-cluster migration -1. cross-cluster monitoring and auditing -1. cross-cluster load balancing - -## 3. Vendor lock-in avoidance - -_"My CTO wants us to avoid vendor lock-in, so she wants our workloads -to run across multiple cloud providers at all times. She changes our -set of preferred cloud providers and pricing contracts with them -periodically, and doesn't want to have to communicate and manually -enforce these policy changes across the organization every time this -happens. She wants it centrally and automatically enforced, monitored -and audited."_ - -**Clarifying questions:** - -1. How does this relate to other use cases (high availability, -capacity overflow etc), as they may all be across multiple vendors. -It's probably not strictly speaking a separate -use case, but it's brought up so often as a requirement, that it's -worth calling out explicitly. -1. Is a useful intermediate step to make it as simple as possible to - migrate an application from one vendor to another in a one-off fashion? - -Again, I think that this can probably be - reformulated as a Capacity Overflow problem - the fundamental - principles seem to be the same or substantially similar to those - above. - -## 4. "High Availability" - -_"I want to be immune to any single data centre or cloud availability -zone outage, so I want to spread my service across multiple such zones -(and ideally even across multiple cloud providers), and have my -service remain available even if one of the availability zones or -cloud providers "goes down"_. - -It seems useful to split this into multiple sets of sub use cases: - -1. Multiple availability zones within a single cloud provider (across - which feature sets like private networks, load balancing, - persistent disks, data snapshots etc are typically consistent and - explicitly designed to inter-operate). - 1. within the same geographical region (e.g. metro) within which network - is fast and cheap enough to be almost analogous to a single data - center. - 1. across multiple geographical regions, where high network cost and - poor network performance may be prohibitive. -1. Multiple cloud providers (typically with inconsistent feature sets, - more limited interoperability, and typically no cheap inter-cluster - networking described above). - -The single cloud provider case might be easier to implement (although -the multi-cloud provider implementation should just work for a single -cloud provider). 
Propose high-level design catering for both, with -initial implementation targeting single cloud provider only. - -**Clarifying questions:** -**How does global external service discovery work?** In the steady - state, which external clients connect to which clusters? GeoDNS or - similar? What is the tolerable failover latency if a cluster goes - down? Maybe something like (make up some numbers, notwithstanding - some buggy DNS resolvers, TTL's, caches etc) ~3 minutes for ~90% of - clients to re-issue DNS lookups and reconnect to a new cluster when - their home cluster fails is good enough for most Kubernetes users - (or at least way better than the status quo), given that these sorts - of failure only happen a small number of times a year? - -**How does dynamic load balancing across clusters work, if at all?** - One simple starting point might be "it doesn't". i.e. if a service - in a cluster is deemed to be "up", it receives as much traffic as is - generated "nearby" (even if it overloads). If the service is deemed - to "be down" in a given cluster, "all" nearby traffic is redirected - to some other cluster within some number of seconds (failover could - be automatic or manual). Failover is essentially binary. An - improvement would be to detect when a service in a cluster reaches - maximum serving capacity, and dynamically divert additional traffic - to other clusters. But how exactly does all of this work, and how - much of it is provided by Kubernetes, as opposed to something else - bolted on top (e.g. external monitoring and manipulation of GeoDNS)? - -**How does this tie in with auto-scaling of services?** More - specifically, if I run my service across _n_ clusters globally, and - one (or more) of them fail, how do I ensure that the remaining _n-1_ - clusters have enough capacity to serve the additional, failed-over - traffic? Either: - -1. I constantly over-provision all clusters by 1/n (potentially expensive), or -1. I "manually" (or automatically) update my replica count configurations in the - remaining clusters by 1/n when the failure occurs, and Kubernetes - takes care of the rest for me, or -1. Auto-scaling in the remaining clusters takes - care of it for me automagically as the additional failed-over - traffic arrives (with some latency). Note that this implies that - the cloud provider keeps the necessary resources on hand to - accommodate such auto-scaling (e.g. via something similar to AWS reserved - and spot instances) - -Up to this point, this use case ("Unavailability Zones") seems materially different from all the others above. It does not require dynamic cross-cluster service migration (we assume that the service is already running in more than one cluster when the failure occurs). Nor does it necessarily involve cross-cluster service discovery or location affinity. As a result, I propose that we address this use case somewhat independently of the others (although I strongly suspect that it will become substantially easier once we've solved the others). - -All of the above (regarding "Unavailability Zones") refers primarily -to already-running user-facing services, and minimizing the impact on -end users of those services becoming unavailable in a given cluster. -What about the people and systems that deploy Kubernetes services -(devops etc)? Should they be automatically shielded from the impact -of the cluster outage? i.e. have their new resource creation requests -automatically diverted to another cluster during the outage? 
While -this specific requirement seems non-critical (manual fail-over seems -relatively non-arduous, ignoring the user-facing issues above), it -smells a lot like the first three use cases listed above ("Capacity -Overflow, Sensitive Services, Vendor lock-in..."), so if we address -those, we probably get this one free of charge. - -## Core Challenges of Cluster Federation - -As we saw above, a few common challenges fall out of most of the use -cases considered above, namely: - -## Location Affinity - -Can the pods comprising a single distributed application be -partitioned across more than one cluster? More generally, how far -apart, in network terms, can a given client and server within a -distributed application reasonably be? A server need not necessarily -be a pod, but could instead be a persistent disk housing data, or some -other stateful network service. What is tolerable is typically -application-dependent, primarily influenced by network bandwidth -consumption, latency requirements and cost sensitivity. - -For simplicity, let's assume that all Kubernetes distributed -applications fall into one of three categories with respect to relative -location affinity: - -1. **"Strictly Coupled"**: Those applications that strictly cannot be - partitioned between clusters. They simply fail if they are - partitioned. When scheduled, all pods _must_ be scheduled to the - same cluster. To move them, we need to shut the whole distributed - application down (all pods) in one cluster, possibly move some - data, and then bring the up all of the pods in another cluster. To - avoid downtime, we might bring up the replacement cluster and - divert traffic there before turning down the original, but the - principle is much the same. In some cases moving the data might be - prohibitively expensive or time-consuming, in which case these - applications may be effectively _immovable_. -1. **"Strictly Decoupled"**: Those applications that can be - indefinitely partitioned across more than one cluster, to no - disadvantage. An embarrassingly parallel YouTube porn detector, - where each pod repeatedly dequeues a video URL from a remote work - queue, downloads and chews on the video for a few hours, and - arrives at a binary verdict, might be one such example. The pods - derive no benefit from being close to each other, or anything else - (other than the source of YouTube videos, which is assumed to be - equally remote from all clusters in this example). Each pod can be - scheduled independently, in any cluster, and moved at any time. -1. **"Preferentially Coupled"**: Somewhere between Coupled and - Decoupled. These applications prefer to have all of their pods - located in the same cluster (e.g. for failure correlation, network - latency or bandwidth cost reasons), but can tolerate being - partitioned for "short" periods of time (for example while - migrating the application from one cluster to another). Most small - to medium sized LAMP stacks with not-very-strict latency goals - probably fall into this category (provided that they use sane - service discovery and reconnect-on-fail, which they need to do - anyway to run effectively, even in a single Kubernetes cluster). - -From a fault isolation point of view, there are also opposites of the -above. For example, a master database and its slave replica might -need to be in different availability zones. We'll refer to this a -anti-affinity, although it is largely outside the scope of this -document. 
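One way a federation scheduler or migrator might represent the three affinity classes above is a small enumeration that answers the two questions later sections keep asking: can the application be partitioned across clusters, and may it be moved automatically? This is purely illustrative; the proposal never defines such an API field.

```go
package federation

// AffinityClass captures the three relative location-affinity buckets
// described above. Illustrative only; not a real API field.
type AffinityClass int

const (
	StrictlyCoupled       AffinityClass = iota // all pods in one cluster, effectively immovable
	PreferentiallyCoupled                      // all pods in one cluster, movable as a unit
	StrictlyDecoupled                          // pods may be spread and moved independently
)

// CanPartition reports whether an application's pods may be split across
// more than one cluster at the same time.
func (c AffinityClass) CanPartition() bool {
	return c == StrictlyDecoupled
}

// CanMigrateAutomatically reports whether the federation system may move the
// application between clusters without operator involvement (the proposal
// defers automated migration of strictly coupled applications).
func (c AffinityClass) CanMigrateAutomatically() bool {
	return c != StrictlyCoupled
}
```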
- -Note that there is somewhat of a continuum with respect to network -cost and quality between any two nodes, ranging from two nodes on the -same L2 network segment (lowest latency and cost, highest bandwidth) -to two nodes on different continents (highest latency and cost, lowest -bandwidth). One interesting point on that continuum relates to -multiple availability zones within a well-connected metro or region -and single cloud provider. Despite being in different data centers, -or areas within a mega data center, network in this case is often very fast -and effectively free or very cheap. For the purposes of this network location -affinity discussion, this case is considered analogous to a single -availability zone. Furthermore, if a given application doesn't fit -cleanly into one of the above, shoe-horn it into the best fit, -defaulting to the "Strictly Coupled and Immovable" bucket if you're -not sure. - -And then there's what I'll call _absolute_ location affinity. Some -applications are required to run in bounded geographical or network -topology locations. The reasons for this are typically -political/legislative (data privacy laws etc), or driven by network -proximity to consumers (or data providers) of the application ("most -of our users are in Western Europe, U.S. West Coast" etc). - -**Proposal:** First tackle Strictly Decoupled applications (which can - be trivially scheduled, partitioned or moved, one pod at a time). - Then tackle Preferentially Coupled applications (which must be - scheduled in totality in a single cluster, and can be moved, but - ultimately in total, and necessarily within some bounded time). - Leave strictly coupled applications to be manually moved between - clusters as required for the foreseeable future. - -## Cross-cluster service discovery - -I propose having pods use standard discovery methods used by external -clients of Kubernetes applications (i.e. DNS). DNS might resolve to a -public endpoint in the local or a remote cluster. Other than Strictly -Coupled applications, software should be largely oblivious of which of -the two occurs. - -_Aside:_ How do we avoid "tromboning" through an external VIP when DNS -resolves to a public IP on the local cluster? Strictly speaking this -would be an optimization for some cases, and probably only matters to -high-bandwidth, low-latency communications. We could potentially -eliminate the trombone with some kube-proxy magic if necessary. More -detail to be added here, but feel free to shoot down the basic DNS -idea in the mean time. In addition, some applications rely on private -networking between clusters for security (e.g. AWS VPC or more -generally VPN). It should not be necessary to forsake this in -order to use Cluster Federation, for example by being forced to use public -connectivity between clusters. - -## Cross-cluster Scheduling - -This is closely related to location affinity above, and also discussed -there. The basic idea is that some controller, logically outside of -the basic Kubernetes control plane of the clusters in question, needs -to be able to: - -1. Receive "global" resource creation requests. -1. Make policy-based decisions as to which cluster(s) should be used - to fulfill each given resource request. In a simple case, the - request is just redirected to one cluster. In a more complex case, - the request is "demultiplexed" into multiple sub-requests, each to - a different cluster. 
Knowledge of the (albeit approximate) available capacity in each cluster will be required by the controller to sanely split the request. Similarly, knowledge of the properties of the application (Location Affinity class -- Strictly Coupled, Strictly Decoupled etc, privacy class etc) will be required. It is also conceivable that knowledge of service SLAs and monitoring thereof might provide an input into scheduling/placement algorithms. -1. Multiplex the responses from the individual clusters into an aggregate response. - -There is of course a lot of detail still missing from this section, including discussion of: - -1. admission control -1. initial placement of instances of a new service vs. scheduling new instances of an existing service in response to auto-scaling -1. rescheduling pods due to failure (response might be different depending on whether it's the failure of a node, rack, or whole AZ) -1. data placement relative to compute capacity, etc. - -## Cross-cluster Migration - -Again this is closely related to location affinity discussed above, and is in some sense an extension of Cross-cluster Scheduling. When certain events occur, it becomes necessary or desirable for the cluster federation system to proactively move distributed applications (either in part or in whole) from one cluster to another. Examples of such events include: - -1. A low capacity event in a cluster (or a cluster failure). -1. A change of scheduling policy ("we no longer use cloud provider X"). -1. A change of resource pricing ("cloud provider Y dropped their prices - let's migrate there"). - -Strictly Decoupled applications can be trivially moved, in part or in whole, one pod at a time, to one or more clusters (within applicable policy constraints, for example "PrivateCloudOnly"). - -For Preferentially Coupled applications, the federation system must first locate a single cluster with sufficient capacity to accommodate the entire application, then reserve that capacity, and incrementally move the application, one (or more) resources at a time, over to the new cluster, within some bounded time period (and possibly within a predefined "maintenance" window). Strictly Coupled applications (with the exception of those deemed completely immovable) require the federation system to: - -1. start up an entire replica application in the destination cluster -1. copy persistent data to the new application instance (possibly before starting pods) -1. switch user traffic across -1. tear down the original application instance - -It is proposed that support for automated migration of Strictly Coupled applications be deferred to a later date. - -## Other Requirements - -These are often left implicit by customers, but are worth calling out explicitly: - -1. Software failure isolation between Kubernetes clusters should be retained as far as is practically possible. The federation system should not materially increase the failure correlation across clusters. For this reason the federation control plane software should ideally be completely independent of the Kubernetes cluster control software, and look just like any other Kubernetes API client, with no special treatment. If the federation control plane software fails catastrophically, the underlying Kubernetes clusters should remain independently usable. -1. Unified monitoring, alerting and auditing across federated Kubernetes clusters. -1. 
Unified authentication, authorization and quota management across - clusters (this is in direct conflict with failure isolation above, - so there are some tough trade-offs to be made here). - -## Proposed High-Level Architectures - -Two distinct potential architectural approaches have emerged from discussions -thus far: - -1. An explicitly decoupled and hierarchical architecture, where the - Federation Control Plane sits logically above a set of independent - Kubernetes clusters, each of which is (potentially) unaware of the - other clusters, and of the Federation Control Plane itself (other - than to the extent that it is an API client much like any other). - One possible example of this general architecture is illustrated - below, and will be referred to as the "Decoupled, Hierarchical" - approach. -1. A more monolithic architecture, where a single instance of the - Kubernetes control plane itself manages a single logical cluster - composed of nodes in multiple availability zones and cloud - providers. - -A very brief, non-exhaustive list of pro's and con's of the two -approaches follows. (In the interest of full disclosure, the author -prefers the Decoupled Hierarchical model for the reasons stated below). - -1. **Failure isolation:** The Decoupled Hierarchical approach provides - better failure isolation than the Monolithic approach, as each - underlying Kubernetes cluster, and the Federation Control Plane, - can operate and fail completely independently of each other. In - particular, their software and configurations can be updated - independently. Such updates are, in our experience, the primary - cause of control-plane failures, in general. -1. **Failure probability:** The Decoupled Hierarchical model incorporates - numerically more independent pieces of software and configuration - than the Monolithic one. But the complexity of each of these - decoupled pieces is arguably better contained in the Decoupled - model (per standard arguments for modular rather than monolithic - software design). Which of the two models presents higher - aggregate complexity and consequent failure probability remains - somewhat of an open question. -1. **Scalability:** Conceptually the Decoupled Hierarchical model wins - here, as each underlying Kubernetes cluster can be scaled - completely independently w.r.t. scheduling, node state management, - monitoring, network connectivity etc. It is even potentially - feasible to stack federations of clusters (i.e. create - federations of federations) should scalability of the independent - Federation Control Plane become an issue (although the author does - not envision this being a problem worth solving in the short - term). -1. **Code complexity:** I think that an argument can be made both ways - here. It depends on whether you prefer to weave the logic for - handling nodes in multiple availability zones and cloud providers - within a single logical cluster into the existing Kubernetes - control plane code base (which was explicitly not designed for - this), or separate it into a decoupled Federation system (with - possible code sharing between the two via shared libraries). The - author prefers the latter because it: - 1. Promotes better code modularity and interface design. - 1. Allows the code - bases of Kubernetes and the Federation system to progress - largely independently (different sets of developers, different - release schedules etc). -1. **Administration complexity:** Again, I think that this could be argued - both ways. 
Superficially it would seem that administration of a - single Monolithic multi-zone cluster might be simpler by virtue of - being only "one thing to manage", however in practise each of the - underlying availability zones (and possibly cloud providers) has - its own capacity, pricing, hardware platforms, and possibly - bureaucratic boundaries (e.g. "our EMEA IT department manages those - European clusters"). So explicitly allowing for (but not - mandating) completely independent administration of each - underlying Kubernetes cluster, and the Federation system itself, - in the Decoupled Hierarchical model seems to have real practical - benefits that outweigh the superficial simplicity of the - Monolithic model. -1. **Application development and deployment complexity:** It's not clear - to me that there is any significant difference between the two - models in this regard. Presumably the API exposed by the two - different architectures would look very similar, as would the - behavior of the deployed applications. It has even been suggested - to write the code in such a way that it could be run in either - configuration. It's not clear that this makes sense in practise - though. -1. **Control plane cost overhead:** There is a minimum per-cluster - overhead -- two possibly virtual machines, or more for redundant HA - deployments. For deployments of very small Kubernetes - clusters with the Decoupled Hierarchical approach, this cost can - become significant. - -### The Decoupled, Hierarchical Approach - Illustrated - - - -## Cluster Federation API - -It is proposed that this look a lot like the existing Kubernetes API -but be explicitly multi-cluster. - -+ Clusters become first class objects, which can be registered, - listed, described, deregistered etc via the API. -+ Compute resources can be explicitly requested in specific clusters, - or automatically scheduled to the "best" cluster by the Cluster - Federation control system (by a - pluggable Policy Engine). -+ There is a federated equivalent of a replication controller type (or - perhaps a [deployment](deployment.md)), - which is multicluster-aware, and delegates to cluster-specific - replication controllers/deployments as required (e.g. a federated RC for n - replicas might simply spawn multiple replication controllers in - different clusters to do the hard work). - -## Policy Engine and Migration/Replication Controllers - -The Policy Engine decides which parts of each application go into each -cluster at any point in time, and stores this desired state in the -Desired Federation State store (an etcd or -similar). Migration/Replication Controllers reconcile this against the -desired states stored in the underlying Kubernetes clusters (by -watching both, and creating or updating the underlying Replication -Controllers and related Services accordingly). - -## Authentication and Authorization - -This should ideally be delegated to some external auth system, shared -by the underlying clusters, to avoid duplication and inconsistency. -Either that, or we end up with multilevel auth. Local readonly -eventually consistent auth slaves in each cluster and in the Cluster -Federation control system -could potentially cache auth, to mitigate an SPOF auth system. - -## Data consistency, failure and availability characteristics - -The services comprising the Cluster Federation control plane) have to run - somewhere. 
Several options exist here: -* For high availability Cluster Federation deployments, these - services may run in either: - * a dedicated Kubernetes cluster, not co-located in the same - availability zone with any of the federated clusters (for fault - isolation reasons). If that cluster/availability zone, and hence the Federation - system, fails catastrophically, the underlying pods and - applications continue to run correctly, albeit temporarily - without the Federation system. - * across multiple Kubernetes availability zones, probably with - some sort of cross-AZ quorum-based store. This provides - theoretically higher availability, at the cost of some - complexity related to data consistency across multiple - availability zones. - * For simpler, less highly available deployments, just co-locate the - Federation control plane in/on/with one of the underlying - Kubernetes clusters. The downside of this approach is that if - that specific cluster fails, all automated failover and scaling - logic which relies on the federation system will also be - unavailable at the same time (i.e. precisely when it is needed). - But if one of the other federated clusters fails, everything - should work just fine. - -There is some further thinking to be done around the data consistency - model upon which the Federation system is based, and it's impact - on the detailed semantics, failure and availability - characteristics of the system. - -## Proposed Next Steps - -Identify concrete applications of each use case and configure a proof -of concept service that exercises the use case. For example, cluster -failure tolerance seems popular, so set up an apache frontend with -replicas in each of three availability zones with either an Amazon Elastic -Load Balancer or Google Cloud Load Balancer pointing at them? What -does the zookeeper config look like for N=3 across 3 AZs -- and how -does each replica find the other replicas and how do clients find -their primary zookeeper replica? And now how do I do a shared, highly -available redis database? Use a few common specific use cases like -this to flesh out the detailed API and semantics of Cluster Federation. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
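The federated replication controller delegation sketched above (a federated controller for n replicas spawning per-cluster replication controllers) reduces, in its simplest form, to splitting a desired replica count across registered clusters. The following Go sketch illustrates only that splitting step; the `Cluster` type and the even-split policy are placeholder assumptions, and a real Policy Engine would weight clusters by capacity, cost or locality rather than dividing evenly.

```go
package main

import "fmt"

// Cluster is a hypothetical registered member of the federation.
type Cluster struct {
	Name string
}

// splitReplicas divides the desired replica count of a federated
// replication controller across the registered clusters as evenly as
// possible. A pluggable policy engine could replace this with
// capacity- or locality-aware weighting.
func splitReplicas(total int, clusters []Cluster) map[string]int {
	plan := make(map[string]int, len(clusters))
	if len(clusters) == 0 {
		return plan
	}
	base, extra := total/len(clusters), total%len(clusters)
	for i, c := range clusters {
		n := base
		if i < extra {
			n++ // hand out the remainder one replica at a time
		}
		plan[c.Name] = n
	}
	return plan
}

func main() {
	clusters := []Cluster{{Name: "us-east"}, {Name: "us-west"}, {Name: "eu-west"}}
	// A federated RC asking for 10 replicas would drive per-cluster
	// replication controllers with these sizes.
	for name, replicas := range splitReplicas(10, clusters) {
		fmt.Printf("cluster %s: %d replicas\n", name, replicas)
	}
}
```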
\ No newline at end of file diff --git a/contributors/design-proposals/multicluster/multicluster-reserved-namespaces.md b/contributors/design-proposals/multicluster/multicluster-reserved-namespaces.md index 0f664cb2..f0fbec72 100644 --- a/contributors/design-proposals/multicluster/multicluster-reserved-namespaces.md +++ b/contributors/design-proposals/multicluster/multicluster-reserved-namespaces.md @@ -1,51 +1,6 @@ -# Multicluster reserved namespaces +Design proposals have been archived. -@perotinus +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -06/06/2018 -## Background - -sig-multicluster has identified the need for a canonical set of namespaces that -can be used for supporting multicluster applications and use cases. Initially, -an [issue](https://github.com/kubernetes/cluster-registry/issues/221) was filed -in the cluster-registry repository describing the need for a namespace that -would be used for public, global cluster records. This topic was further -discussed at the -[SIG meeting on June 5, 2018](https://www.youtube.com/watch?v=j6tHK8_mWz8&t=3012) -and in a -[thread](https://groups.google.com/forum/#!topic/kubernetes-sig-multicluster/8u-li_ZJpDI) -on the SIG mailing list. - -## Reserved namespaces - -We determined that there is currently a strong case for two reserved namespaces -for multicluster use: - -- `kube-multicluster-public`: a global, public namespace for storing cluster - registry Cluster objects. If there are other custom resources that - correspond with the global, public Cluster objects, they can also be stored - here. For example, a custom resource that contains cloud-provider-specific - metadata about a cluster. Tools built against the cluster registry can - expect to find the canonical set of Cluster objects in this namespace[1]. - -- `kube-multicluster-system`: an administrator-accessible namespace that - contains components, such as multicluster controllers and their - dependencies, that are not meant to be seen by most users directly. - -The definition of these namespaces is not intended to be exhaustive: in the -future, there may be reason to define more multicluster namespaces, and -potentially conventions for namespaces that are replicated between clusters (for -example, to support a global cluster list that is replicated to all clusters -that are contained in the list). - -## Conventions for reserved namespaces - -By convention, resources in these namespaces are local to the clusters in which -they exist and will not be replicated to other clusters. In other words, these -namespaces are private to the clusters they are in, and multicluster operations -must not replicate them or their resources into other clusters. - -[1] Tools are by no means compelled to look in this namespace for clusters, and -can choose to reference Cluster objects from other namespaces as is suitable to -their design and environment. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
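As a sketch of how a tool might rely on this convention, the snippet below lists cluster registry `Cluster` objects out of `kube-multicluster-public` with the client-go dynamic client. The `clusterregistry.k8s.io/v1alpha1` group/version and the `clusters` resource name are assumptions about the cluster registry API of that era and should be replaced with whatever the registry in use actually serves.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Use whatever kubeconfig points at the cluster hosting the registry.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Assumed group/version/resource for cluster-registry Cluster objects.
	gvr := schema.GroupVersionResource{Group: "clusterregistry.k8s.io", Version: "v1alpha1", Resource: "clusters"}

	// By convention, the canonical set of Cluster records lives in kube-multicluster-public.
	list, err := dyn.Resource(gvr).Namespace("kube-multicluster-public").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, c := range list.Items {
		fmt.Println(c.GetName())
	}
}
```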
\ No newline at end of file diff --git a/contributors/design-proposals/multicluster/ubernetes-cluster-state.png b/contributors/design-proposals/multicluster/ubernetes-cluster-state.png Binary files differdeleted file mode 100644 index 56ec2df8..00000000 --- a/contributors/design-proposals/multicluster/ubernetes-cluster-state.png +++ /dev/null diff --git a/contributors/design-proposals/multicluster/ubernetes-design.png b/contributors/design-proposals/multicluster/ubernetes-design.png Binary files differdeleted file mode 100644 index 44924846..00000000 --- a/contributors/design-proposals/multicluster/ubernetes-design.png +++ /dev/null diff --git a/contributors/design-proposals/multicluster/ubernetes-scheduling.png b/contributors/design-proposals/multicluster/ubernetes-scheduling.png Binary files differdeleted file mode 100644 index 01774882..00000000 --- a/contributors/design-proposals/multicluster/ubernetes-scheduling.png +++ /dev/null diff --git a/contributors/design-proposals/network/OWNERS b/contributors/design-proposals/network/OWNERS deleted file mode 100644 index 42bb9ad2..00000000 --- a/contributors/design-proposals/network/OWNERS +++ /dev/null @@ -1,8 +0,0 @@ -# See the OWNERS docs at https://go.k8s.io/owners - -reviewers: - - sig-network-leads -approvers: - - sig-network-leads -labels: - - sig/network diff --git a/contributors/design-proposals/network/command_execution_port_forwarding.md b/contributors/design-proposals/network/command_execution_port_forwarding.md index b4545662..f0fbec72 100644 --- a/contributors/design-proposals/network/command_execution_port_forwarding.md +++ b/contributors/design-proposals/network/command_execution_port_forwarding.md @@ -1,154 +1,6 @@ -# Container Command Execution & Port Forwarding in Kubernetes +Design proposals have been archived. -## Abstract +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -This document describes how to use Kubernetes to execute commands in containers, -with stdin/stdout/stderr streams attached and how to implement port forwarding -to the containers. - -## Background - -See the following related issues/PRs: - -- [Support attach](http://issue.k8s.io/1521) -- [Real container ssh](http://issue.k8s.io/1513) -- [Provide easy debug network access to services](http://issue.k8s.io/1863) -- [OpenShift container command execution proposal](https://github.com/openshift/origin/pull/576) - -## Motivation - -Users and administrators are accustomed to being able to access their systems -via SSH to run remote commands, get shell access, and do port forwarding. - -Supporting SSH to containers in Kubernetes is a difficult task. You must -specify a "user" and a hostname to make an SSH connection, and `sshd` requires -real users (resolvable by NSS and PAM). Because a container belongs to a pod, -and the pod belongs to a namespace, you need to specify namespace/pod/container -to uniquely identify the target container. Unfortunately, a -namespace/pod/container is not a real user as far as SSH is concerned. Also, -most Linux systems limit user names to 32 characters, which is unlikely to be -large enough to contain namespace/pod/container. We could devise some scheme to -map each namespace/pod/container to a 32-character user name, adding entries to -`/etc/passwd` (or LDAP, etc.) and keeping those entries fully in sync all the -time. 
Alternatively, we could write custom NSS and PAM modules that allow the -host to resolve a namespace/pod/container to a user without needing to keep -files or LDAP in sync. - -As an alternative to SSH, we are using a multiplexed streaming protocol that -runs on top of HTTP. There are no requirements about users being real users, -nor is there any limitation on user name length, as the protocol is under our -control. The only downside is that standard tooling that expects to use SSH -won't be able to work with this mechanism, unless adapters can be written. - -## Constraints and Assumptions - -- SSH support is not currently in scope. -- CGroup confinement is ultimately desired, but implementing that support is not -currently in scope. -- SELinux confinement is ultimately desired, but implementing that support is -not currently in scope. - -## Use Cases - -- A user of a Kubernetes cluster wants to run arbitrary commands in a -container with local stdin/stdout/stderr attached to the container. -- A user of a Kubernetes cluster wants to connect to local ports on his computer -and have them forwarded to ports in a container. - -## Process Flow - -### Remote Command Execution Flow - -1. The client connects to the Kubernetes Master to initiate a remote command -execution request. -2. The Master proxies the request to the Kubelet where the container lives. -3. The Kubelet executes nsenter + the requested command and streams -stdin/stdout/stderr back and forth between the client and the container. - -### Port Forwarding Flow - -1. The client connects to the Kubernetes Master to initiate a remote command -execution request. -2. The Master proxies the request to the Kubelet where the container lives. -3. The client listens on each specified local port, awaiting local connections. -4. The client connects to one of the local listening ports. -4. The client notifies the Kubelet of the new connection. -5. The Kubelet executes nsenter + socat and streams data back and forth between -the client and the port in the container. - -## Design Considerations - -### Streaming Protocol - -The current multiplexed streaming protocol used is SPDY. This is not the -long-term desire, however. As soon as there is viable support for HTTP/2 in Go, -we will switch to that. - -### Master as First Level Proxy - -Clients should not be allowed to communicate directly with the Kubelet for -security reasons. Therefore, the Master is currently the only suggested entry -point to be used for remote command execution and port forwarding. This is not -necessarily desirable, as it means that all remote command execution and port -forwarding traffic must travel through the Master, potentially impacting other -API requests. - -In the future, it might make more sense to retrieve an authorization token from -the Master, and then use that token to initiate a remote command execution or -port forwarding request with a load balanced proxy service dedicated to this -functionality. This would keep the streaming traffic out of the Master. - -### Kubelet as Backend Proxy - -The kubelet is currently responsible for handling remote command execution and -port forwarding requests. Just like with the Master described above, this means -that all remote command execution and port forwarding streaming traffic must -travel through the Kubelet, which could result in a degraded ability to service -other requests. - -In the future, it might make more sense to use a separate service on the node. 
- -Alternatively, we could possibly inject a process into the container that only -listens for a single request, expose that process's listening port on the node, -and then issue a redirect to the client such that it would connect to the first -level proxy, which would then proxy directly to the injected process's exposed -port. This would minimize the amount of proxying that takes place. - -### Scalability - -There are at least 2 different ways to execute a command in a container: -`docker exec` and `nsenter`. While `docker exec` might seem like an easier and -more obvious choice, it has some drawbacks. - -#### `docker exec` - -We could expose `docker exec` (i.e. have Docker listen on an exposed TCP port -on the node), but this would require proxying from the edge and securing the -Docker API. `docker exec` calls go through the Docker daemon, meaning that all -stdin/stdout/stderr traffic is proxied through the Daemon, adding an extra hop. -Additionally, you can't isolate 1 malicious `docker exec` call from normal -usage, meaning an attacker could initiate a denial of service or other attack -and take down the Docker daemon, or the node itself. - -We expect remote command execution and port forwarding requests to be long -running and/or high bandwidth operations, and routing all the streaming data -through the Docker daemon feels like a bottleneck we can avoid. - -#### `nsenter` - -The implementation currently uses `nsenter` to run commands in containers, -joining the appropriate container namespaces. `nsenter` runs directly on the -node and is not proxied through any single daemon process. - -### Security - -Authentication and authorization hasn't specifically been tested yet with this -functionality. We need to make sure that users are not allowed to execute -remote commands or do port forwarding to containers they aren't allowed to -access. - -Additional work is required to ensure that multiple command execution or port -forwarding connections from different clients are not able to see each other's -data. This can most likely be achieved via SELinux labeling and unique process - contexts. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
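For reference, the remote command execution flow described in that proposal is what the pod `exec` subresource exposes; a minimal client-side sketch using today's client-go streaming helpers might look as follows, assuming a reachable kubeconfig and a hypothetical pod and container name.

```go
package main

import (
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/remotecommand"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Build the URL for the exec subresource of a (hypothetical) pod.
	req := client.CoreV1().RESTClient().Post().
		Resource("pods").Namespace("default").Name("my-pod").
		SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Container: "my-container",
			Command:   []string{"date"},
			Stdout:    true,
			Stderr:    true,
		}, scheme.ParameterCodec)

	// The executor negotiates the multiplexed streaming protocol (SPDY here)
	// and attaches the requested stdio streams, as described in the proposal.
	exec, err := remotecommand.NewSPDYExecutor(cfg, "POST", req.URL())
	if err != nil {
		panic(err)
	}
	if err := exec.Stream(remotecommand.StreamOptions{Stdout: os.Stdout, Stderr: os.Stderr}); err != nil {
		panic(err)
	}
}
```

Port forwarding follows the same pattern against the `portforward` subresource, with `k8s.io/client-go/tools/portforward` handling the multiplexed streams on the client side.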
\ No newline at end of file diff --git a/contributors/design-proposals/network/coredns.md b/contributors/design-proposals/network/coredns.md index 50217366..f0fbec72 100644 --- a/contributors/design-proposals/network/coredns.md +++ b/contributors/design-proposals/network/coredns.md @@ -1,223 +1,6 @@ -# Add CoreDNS for DNS-based Service Discovery +Design proposals have been archived. -Status: Pending +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Version: Alpha -Implementation Owner: @johnbelamaric - -## Motivation - -CoreDNS is another CNCF project and is the successor to SkyDNS, which kube-dns is based on. It is a flexible, extensible -authoritative DNS server and directly integrates with the Kubernetes API. It can serve as cluster DNS, -complying with the [dns spec](https://git.k8s.io/dns/docs/specification.md). - -CoreDNS has fewer moving parts than kube-dns, since it is a single executable and single process. It is written in Go so -it is memory-safe (kube-dns includes dnsmasq which is not). It supports a number of use cases that kube-dns does not -(see below). As a general-purpose authoritative DNS server it has a lot of functionality that kube-dns could not reasonably -be expected to add. See, for example, the [intro](https://docs.google.com/presentation/d/1v6Coq1JRlqZ8rQ6bv0Tg0usSictmnN9U80g8WKxiOjQ/edit#slide=id.g249092e088_0_181) or [coredns.io](https://coredns.io) or the [CNCF webinar](https://youtu.be/dz9S7R8r5gw). - -## Proposal - -The proposed solution is to enable the selection of CoreDNS as an alternate to Kube-DNS during cluster deployment, with the -intent to make it the default in the future. - -## User Experience - -### Use Cases - - * Standard DNS-based service discovery - * Federation records - * Stub domain support - * Adding custom DNS entries - * Making an alias for an external name [#39792](https://github.com/kubernetes/kubernetes/issues/39792) - * Dynamically adding services to another domain, without running another server [#55](https://github.com/kubernetes/dns/issues/55) - * Adding an arbitrary entry inside the cluster domain (for example TXT entries [#38](https://github.com/kubernetes/dns/issues/38)) - * Verified pod DNS entries (ensure pod exists in specified namespace) - * Experimental server-side search path to address latency issues [#33554](https://github.com/kubernetes/kubernetes/issues/33554) - * Limit PTR replies to the cluster CIDR [#125](https://github.com/kubernetes/dns/issues/125) - * Serve DNS for selected namespaces [#132](https://github.com/kubernetes/dns/issues/132) - * Serve DNS based on a label selector - * Support for wildcard queries (e.g., `*.namespace.svc.cluster.local` returns all services in `namespace`) - -By default, the user experience would be unchanged. For more advanced uses, existing users would need to modify the -ConfigMap that contains the CoreDNS configuration file. - -### Configuring CoreDNS - -The CoreDNS configuration file is called a `Corefile` and syntactically is the same as a -[Caddyfile](https://caddyserver.com/docs/caddyfile). The file consists of multiple stanzas called _server blocks_. -Each of these represents a set of zones for which that server block should respond, along with the list -of plugins to apply to a given request. 
More details on this can be found in the -[Corefile Explained](https://coredns.io/2017/07/23/corefile-explained/) and -[How Queries Are Processed](https://coredns.io/2017/06/08/how-queries-are-processed-in-coredns/) blog -entries. - -### Configuration for Standard Kubernetes DNS - -The intent is to make configuration as simple as possible. The following Corefile will behave according -to the spec, except that it will not respond to Pod queries. It assumes the cluster domain is `cluster.local` -and the cluster CIDRs are all within 10.0.0.0/8. - -``` -. { - errors - log - cache 30 - health - prometheus - kubernetes 10.0.0.0/8 cluster.local - proxy . /etc/resolv.conf -} - -``` - -The `.` means that queries for the root zone (`.`) and below should be handled by this server block. Each -of the lines within `{ }` represent individual plugins: - - * `errors` enables [error logging](https://coredns.io/plugins/errors) - * `log` enables [query logging](https://coredns.io/plugins/log/) - * `cache 30` enables [caching](https://coredns.io/plugins/cache/) of positive and negative responses for 30 seconds - * `health` opens an HTTP port to allow [health checks](https://coredns.io/plugins/health) from Kubernetes - * `prometheus` enables Prometheus [metrics](https://coredns.io/plugins/metrics) - * `kubernetes 10.0.0.0/8 cluster.local` connects to the Kubernetes API and [serves records](https://coredns.io/plugins/kubernetes/) for the `cluster.local` domain and reverse DNS for 10.0.0.0/8 per the [spec](https://git.k8s.io/dns/docs/specification.md) - * `proxy . /etc/resolv.conf` [forwards](https://coredns.io/plugins/proxy) any queries not handled by other plugins (the `.` means the root domain) to the nameservers configured in `/etc/resolv.conf` - -### Configuring Stub Domains - -To configure stub domains, you add additional server blocks for those domains: - -``` -example.com { - proxy example.com 8.8.8.8:53 -} - -. { - errors - log - cache 30 - health - prometheus - kubernetes 10.0.0.0/8 cluster.local - proxy . /etc/resolv.conf -} -``` - -### Configuring Federation - -Federation is implemented as a separate plugin. You simply list the federation names and -their corresponding domains. - -``` -. { - errors - log - cache 30 - health - prometheus - kubernetes 10.0.0.0/8 cluster.local - federation cluster.local { - east east.example.com - west west.example.com - } - proxy . /etc/resolv.conf -} -``` - -### Reverse DNS - -Reverse DNS is supported for Services and Endpoints. It is not for Pods. - -You have to configure the reverse zone to make it work. That means knowing the service CIDR and configuring that -ahead of time (until [#25533](https://github.com/kubernetes/kubernetes/issues/25533) is implemented). - -Since reverse DNS zones are on classful boundaries, if you have a classless CIDR for your service CIDR -(say, a /12), then you have to widen that to the containing classful network. That leaves a subset of that network -open to the spoofing described in [#125](https://github.com/kubernetes/dns/issues/125); this is to be fixed -in [#1074](https://github.com/coredns/coredns/issues/1074). - -PTR spoofing by manual endpoints -([#124](https://github.com/kubernetes/dns/issues/124)) would -still be an issue even with [#1074](https://github.com/coredns/coredns/issues/1074) solved (as it is in kube-dns). This could be resolved in the case -where `pods verified` is enabled but that is not done at this time. 
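To make the classful-boundary caveat concrete, the sketch below widens a classless service CIDR to its containing octet-aligned network and derives the `in-addr.arpa` zone that would have to be answered; it only illustrates the arithmetic and is not CoreDNS code.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// reverseZone widens a (possibly classless) IPv4 CIDR to the containing
// octet-aligned network and returns the in-addr.arpa zone covering it.
// For example, 10.96.0.0/12 widens to 10.0.0.0/8 -> "10.in-addr.arpa."
func reverseZone(cidr string) (string, error) {
	_, ipnet, err := net.ParseCIDR(cidr)
	if err != nil {
		return "", err
	}
	ones, _ := ipnet.Mask.Size()
	octets := ones / 8 // round down to the containing /8, /16 or /24
	ip := ipnet.IP.To4()
	labels := make([]string, 0, octets+1)
	for i := octets - 1; i >= 0; i-- {
		labels = append(labels, fmt.Sprintf("%d", ip[i]))
	}
	return strings.Join(append(labels, "in-addr.arpa."), "."), nil
}

func main() {
	zone, _ := reverseZone("10.96.0.0/12")
	// The /12 service range has to be served as the wider 10.in-addr.arpa
	// zone, which is what opens the spoofing window discussed above.
	fmt.Println(zone)
}
```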
- -### Deployment and Operations - -Typically when deployed for cluster DNS, CoreDNS is managed by a Deployment. The -CoreDNS pod only contains a single container, as opposed to kube-dns which requires three -containers. This simplifies troubleshooting. - -The Kubernetes integration is stateless and so multiple pods may be run. Each pod will have its -own connection to the API server. If you (like OpenShift) run a DNS pod for each node, you should not enable -`pods verified` as that could put a high load on the API server. Instead, if you wish to support -that functionality, you can run another central deployment and configure the per-node -instances to proxy `pod.cluster.local` to the central deployment. - -All logging is to standard out, and may be disabled if -desired. In very high queries-per-second environments, it is advisable to disable query logging to -avoid I/O for every query. - -CoreDNS can be configured to provide an HTTP health check endpoint, so that it can be monitored -by a standard Kubernetes HTTP health check. Readiness checks are not currently supported but -are in the works (see [#588](https://github.com/coredns/coredns/issues/588)). For Kubernetes, a -CoreDNS instance will be considered ready when it has finished syncing with the API. - -CoreDNS performance metrics can be published for Prometheus. - -When a change is made to the Corefile, you can send each CoreDNS instance a SIGUSR1, which will -trigger a graceful reload of the Corefile. - -### Performance and Resource Load - -The performance test was done in GCE with the following components: - - * CoreDNS system with machine type : n1-standard-1 ( 1 CPU, 2.3 GHz Intel Xeon E5 v3 (Haswell)) - * Client system with machine type: n1-standard-1 ( 1 CPU, 2.3 GHz Intel Xeon E5 v3 (Haswell)) - * Kubemark Cluster with 5000 nodes - -CoreDNS and client are running out-of-cluster (due to it being a Kubemark cluster). - -The following is the summary of the performance of CoreDNS. CoreDNS cache was disabled. - -Services (with 1% change per minute\*) | Max QPS\*\* | Latency (Median) | CoreDNS memory (at max QPS) | CoreDNS CPU (at max QPS) | ------------- | ------------- | -------------- | --------------------- | ----------------- | -1,000 | 18,000 | 0.1 ms | 38 MB | 95 % | -5,000 | 16,000 | 0.1 ms | 73 MB | 93 % | -10,000 | 10,000 | 0.1 ms | 115 MB | 78 % | - -\* We simulated service change load by creating and destroying 1% of services per minute. - -\** Max QPS with < 1 % packet loss - -## Implementation - -Each distribution project (kubeadm, minikube, kubespray, and others) will implement CoreDNS as an optional -add-on as appropriate for that project. - -### Client/Server Backwards/Forwards compatibility - -No changes to other components are needed. - -The method for configuring the DNS server will change. Thus, in cases where users have customized -the DNS configuration, they will need to modify their configuration if they move to CoreDNS. -For example, if users have configured stub domains, they would need to modify that configuration. - -When serving SRV requests for headless services, some responses are different from kube-dns, though still within -the specification (see [#975](https://github.com/coredns/coredns/issues/975)). In summary, these are: - - * kube-dns uses endpoint names that have an opaque identifier. CoreDNS instead uses the pod IP with dashes. - * kube-dns returns a bogus SRV record with port = 0 when no SRV prefix is present in the query. 
- coredns returns all SRV record for the service (see also [#140](https://github.com/kubernetes/dns/issues/140)) - -Additionally, federation may return records in a slightly different manner (see [#1034](https://github.com/coredns/coredns/issues/1034)), -though this may be changed prior to completing this proposal. - -In the plan for the Alpha, there will be no automated conversion of the kube-dns configuration. However, as -part of the Beta, code will be provided that will produce a proper Corefile based upon the existing kube-dns -configuration. - -## Alternatives considered - -Maintain existing kube-dns, add functionality to meet the currently unmet use cases above, and fix underlying issues. -Ensuring the use of memory-safe code would require replacing dnsmasq with another (memory-safe) caching DNS server, -or implementing caching within kube-dns. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
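Since the SRV differences above only matter to clients that actually resolve SRV records, a small illustration of such a lookup from inside a pod follows; the service and namespace names are invented, and the `Target` values in the answers are where kube-dns and CoreDNS differ.

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Resolve the SRV records for a (hypothetical) named port "http" on a
	// headless service. With CoreDNS the targets are dashed-pod-IP names,
	// with kube-dns they are opaque endpoint identifiers; both satisfy the
	// DNS-based service discovery specification.
	_, records, err := net.LookupSRV("http", "tcp", "my-headless-svc.my-namespace.svc.cluster.local")
	if err != nil {
		panic(err)
	}
	for _, srv := range records {
		fmt.Printf("%s:%d (priority %d, weight %d)\n", srv.Target, srv.Port, srv.Priority, srv.Weight)
	}
}
```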
\ No newline at end of file diff --git a/contributors/design-proposals/network/external-lb-source-ip-preservation.md b/contributors/design-proposals/network/external-lb-source-ip-preservation.md index f6a7d680..f0fbec72 100644 --- a/contributors/design-proposals/network/external-lb-source-ip-preservation.md +++ b/contributors/design-proposals/network/external-lb-source-ip-preservation.md @@ -1,235 +1,6 @@ -- [Overview](#overview) - - [Motivation](#motivation) -- [Alpha Design](#alpha-design) - - [Overview](#overview-1) - - [Traffic Steering using LB programming](#traffic-steering-using-lb-programming) - - [Traffic Steering using Health Checks](#traffic-steering-using-health-checks) - - [Choice of traffic steering approaches by individual Cloud Provider implementations](#choice-of-traffic-steering-approaches-by-individual-cloud-provider-implementations) - - [API Changes](#api-changes) - - [Local Endpoint Recognition Support](#local-endpoint-recognition-support) - - [Service Annotation to opt-in for new behaviour](#service-annotation-to-opt-in-for-new-behaviour) - - [NodePort allocation for HealthChecks](#nodeport-allocation-for-healthchecks) - - [Behavior Changes expected](#behavior-changes-expected) - - [External Traffic Blackholed on nodes with no local endpoints](#external-traffic-blackholed-on-nodes-with-no-local-endpoints) - - [Traffic Balancing Changes](#traffic-balancing-changes) - - [Cloud Provider support](#cloud-provider-support) - - [GCE 1.4](#gce-14) - - [GCE Expected Packet Source/Destination IP (Datapath)](#gce-expected-packet-sourcedestination-ip-datapath) - - [GCE Expected Packet Destination IP (HealthCheck path)](#gce-expected-packet-destination-ip-healthcheck-path) - - [AWS TBD](#aws-tbd) - - [Openstack TBD](#openstack-tbd) - - [Azure TBD](#azure-tbd) - - [Testing](#testing) -- [Beta Design](#beta-design) - - [API Changes from Alpha to Beta](#api-changes-from-alpha-to-beta) -- [Future work](#future-work) -- [Appendix](#appendix) +Design proposals have been archived. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -# Overview -Kubernetes provides an external loadbalancer service type which creates a virtual external ip -(in supported cloud provider environments) that can be used to load-balance traffic to -the pods matching the service pod-selector. - -## Motivation - -The current implementation requires that the cloud loadbalancer balances traffic across all -Kubernetes worker nodes, and this traffic is then equally distributed to all the backend -pods for that service. -Due to the DNAT required to redirect the traffic to its ultimate destination, the return -path for each session MUST traverse the same node again. To ensure this, the node also -performs a SNAT, replacing the source ip with its own. - -This causes the service endpoint to see the session as originating from a cluster local ip address. -*The original external source IP is lost* - -This is not a satisfactory solution - the original external source IP MUST be preserved for a -lot of applications and customer use-cases. - -# Alpha Design - -This section describes the proposed design for -[alpha-level](/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions) support, although -additional features are described in [future work](#future-work). 
- -## Overview - -The double hop must be prevented by programming the external load balancer to direct traffic -only to nodes that have local pods for the service. This can be accomplished in two ways, either -by API calls to add/delete nodes from the LB node pool or by adding health checking to the LB and -failing/passing health checks depending on the presence of local pods. - -## Traffic Steering using LB programming - -This approach requires that the Cloud LB be reprogrammed to be in sync with endpoint presence. -Whenever the first service endpoint is scheduled onto a node, the node is added to the LB pool. -Whenever the last service endpoint is unhealthy on a node, the node needs to be removed from the LB pool. - -This is a slow operation, on the order of 30-60 seconds, and involves the Cloud Provider API path. -If the API endpoint is temporarily unavailable, the datapath will be misprogrammed till the -reprogramming is successful and the API->datapath tables are updated by the cloud provider backend. - -## Traffic Steering using Health Checks - -This approach requires that all worker nodes in the cluster be programmed into the LB target pool. -To steer traffic only onto nodes that have endpoints for the service, we program the LB to perform -node healthchecks. The kube-proxy daemons running on each node will be responsible for responding -to these healthcheck requests (URL `/healthz`) from the cloud provider LB healthchecker. An additional nodePort -will be allocated for these health check for this purpose. -kube-proxy already watches for Service and Endpoint changes, it will maintain an in-memory lookup -table indicating the number of local endpoints for each service. -For a value of zero local endpoints, it responds with a health check failure (503 Service Unavailable), -and success (200 OK) for non-zero values. - -Healthchecks are programmable with a min period of 1 second on most cloud provider LBs, and min -failures to trigger node health state change can be configurable from 2 through 5. - -This will allow much faster transition times on the order of 1-5 seconds, and involve no -API calls to the cloud provider (and hence reduce the impact of API unreliability), keeping the -time window where traffic might get directed to nodes with no local endpoints to a minimum. - -## Choice of traffic steering approaches by individual Cloud Provider implementations - -The cloud provider package may choose either of these approaches. kube-proxy will provide these -healthcheck responder capabilities, regardless of the cloud provider configured on a cluster. - -## API Changes - -### Local Endpoint Recognition Support - -To allow kube-proxy to recognize if an endpoint is local requires that the EndpointAddress struct -should also contain the NodeName it resides on. This new string field will be read-only and -populated *only* by the Endpoints Controller. - -### Service Annotation to opt-in for new behaviour - -A new annotation `service.alpha.kubernetes.io/external-traffic` will be recognized -by the service controller only for services of Type LoadBalancer. Services that wish to opt-in to -the new LoadBalancer behaviour must annotate the Service to request the new ESIPP behavior. -Supported values for this annotation are OnlyLocal and Global. -- OnlyLocal activates the new logic (described in this proposal) and balances locally within a node. -- Global activates the old logic of balancing traffic across the entire cluster. 
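A minimal sketch of the health check responder behaviour described above: an HTTP handler that fails with 503 while the node has no local endpoints for the service and returns 200 otherwise. This is illustrative only; the real kube-proxy logic would maintain the per-service endpoint counts from its Service and Endpoints watches and listen on the allocated health check nodePort.

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

// localEndpoints is the number of service endpoints scheduled onto this
// node, as would be maintained from the kube-proxy Endpoints watch.
var localEndpoints int64

func healthz(w http.ResponseWriter, r *http.Request) {
	n := atomic.LoadInt64(&localEndpoints)
	if n == 0 {
		// No local endpoints: fail the cloud LB health check so external
		// traffic is steered away from this node.
		http.Error(w, "no local endpoints", http.StatusServiceUnavailable)
		return
	}
	// Non-zero local endpoints: report healthy (200 OK).
	fmt.Fprintf(w, "%d local endpoints\n", n)
}

func main() {
	// The health check nodePort allocated for the service would be the
	// listen port here; 10256 is just a placeholder.
	http.HandleFunc("/healthz", healthz)
	http.ListenAndServe(":10256", nil)
}
```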
- -### NodePort allocation for HealthChecks - -An additional nodePort allocation will be necessary for services that are of type LoadBalancer and -have the new annotation specified. This additional nodePort is necessary for kube-proxy to listen for -healthcheck requests on all nodes. -This NodePort will be added as an annotation (`service.alpha.kubernetes.io/healthcheck-nodeport`) to -the Service after allocation (in the alpha release). The value of this annotation may also be -specified during the Create call and the allocator will reserve that specific nodePort. - - -## Behavior Changes expected - -### External Traffic Blackholed on nodes with no local endpoints - -When the last endpoint on the node has gone away and the LB has not marked the node as unhealthy, -worst-case window size = (N+1) * HCP, where N = minimum failed healthchecks and HCP = Health Check Period, -external traffic will still be steered to the node. This traffic will be blackholed and not forwarded -to other endpoints elsewhere in the cluster. - -Internal pod to pod traffic should behave as before, with equal probability across all pods. - -### Traffic Balancing Changes - -GCE/AWS load balancers do not provide weights for their target pools. This was not an issue with the old LB -kube-proxy rules which would correctly balance across all endpoints. - -With the new functionality, the external traffic will not be equally load balanced across pods, but rather -equally balanced at the node level (because GCE/AWS and other external LB implementations do not have the ability -for specifying the weight per node, they balance equally across all target nodes, disregarding the number of -pods on each node). - -We can, however, state that for NumServicePods << NumNodes or NumServicePods >> NumNodes, a fairly close-to-equal -distribution will be seen, even without weights. - -Once the external load balancers provide weights, this functionality can be added to the LB programming path. -*Future Work: No support for weights is provided for the 1.4 release, but may be added at a future date* - -## Cloud Provider support - -This feature is added as an opt-in annotation. -Default behaviour of LoadBalancer type services will be unchanged for all Cloud providers. -The annotation will be ignored by existing cloud provider libraries until they add support. - -### GCE 1.4 - -For the 1.4 release, this feature will be implemented for the GCE cloud provider. - -#### GCE Expected Packet Source/Destination IP (Datapath) - -- Node: On the node, we expect to see the real source IP of the client. Destination IP will be the Service Virtual External IP. - -- Pod: For processes running inside the Pod network namespace, the source IP will be the real client source IP. The destination address will the be Pod IP. - -#### GCE Expected Packet Destination IP (HealthCheck path) - -kube-proxy listens on the health check node port for TCP health checks on :::. -This allow responding to health checks when the destination IP is either the VM IP or the Service Virtual External IP. -In practice, tcpdump traces on GCE show source IP is 169.254.169.254 and destination address is the Service Virtual External IP. - -### AWS TBD - -TBD *discuss timelines and feasibility with Kubernetes sig-aws team members* - -### Openstack TBD - -This functionality may not be introduced in Openstack in the near term. 
- -*Note from Openstack team member @anguslees* -Underlying vendor devices might be able to do this, but we only expose full-NAT/proxy loadbalancing through the OpenStack API (LBaaS v1/v2 and Octavia). So I'm afraid this will be unsupported on OpenStack, afaics. - -### Azure TBD - -*To be confirmed* For the 1.4 release, this feature will be implemented for the Azure cloud provider. - -## Testing - -The cases we should test are: - -1. Core Functionality Tests - - 1.1 Source IP Preservation - - Test the main intent of this change, source ip preservation - use the all-in-one network tests container - with new functionality that responds with the client IP. Verify the container is seeing the external IP - of the test client. - - 1.2 Health Check responses - - Testcases use pods explicitly pinned to nodes and delete/add to nodes randomly. Validate that healthchecks succeed - and fail on the expected nodes as endpoints move around. Gather LB response times (time from pod declares ready to - time for Cloud LB to declare node healthy and vice versa) to endpoint changes. - -2. Inter-Operability Tests - - Validate that internal cluster communications are still possible from nodes without local endpoints. This change - is only for externally sourced traffic. - -3. Backward Compatibility Tests - - Validate that old and new functionality can simultaneously exist in a single cluster. Create services with and without - the annotation, and validate datapath correctness. - -# Beta Design - -The only part of the design that changes for beta is the API, which is upgraded from -annotation-based to first class fields. - -## API Changes from Alpha to Beta - -Annotation `service.alpha.kubernetes.io/node-local-loadbalancer` will switch to a Service object field. - -# Future work - -Post-1.4 feature ideas. These are not fully-fleshed designs. - - - -# Appendix - -<!-- BEGIN MUNGE: GENERATED_ANALYTICS --> -[]() -<!-- END MUNGE: GENERATED_ANALYTICS --> +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
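To make the traffic balancing change described earlier concrete, the snippet below works through the per-pod traffic share when an external LB splits traffic equally per node rather than per pod; the 2/1/1 pod layout is an arbitrary example.

```go
package main

import "fmt"

func main() {
	// Pods per node for a 4-replica service spread over 3 nodes.
	podsPerNode := []int{2, 1, 1}
	nodeShare := 1.0 / float64(len(podsPerNode))
	for node, pods := range podsPerNode {
		// Each node receives an equal share of external traffic, which is
		// then split among that node's local pods.
		fmt.Printf("node %d: each of %d pods gets %.1f%% of traffic\n",
			node, pods, 100*nodeShare/float64(pods))
	}
}
```

With four replicas spread 2/1/1 across three nodes, the lone pods each receive roughly a third of the external traffic while the co-located pods receive roughly a sixth each, which is the skew the proposal accepts when NumServicePods is comparable to NumNodes.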
\ No newline at end of file diff --git a/contributors/design-proposals/network/flannel-integration.md b/contributors/design-proposals/network/flannel-integration.md index 3448ab28..f0fbec72 100644 --- a/contributors/design-proposals/network/flannel-integration.md +++ b/contributors/design-proposals/network/flannel-integration.md @@ -1,128 +1,6 @@ -# Flannel integration with Kubernetes +Design proposals have been archived. -## Why? +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -* Networking works out of the box. -* Cloud gateway configuration is regulated by quota. -* Consistent bare metal and cloud experience. -* Lays foundation for integrating with networking backends and vendors. - -## How? - -Thus: - -``` -Master | Node1 ----------------------------------------------------------------------- -{192.168.0.0/16, 256 /24} | docker - | | | restart with podcidr -apiserver <------------------ kubelet (sends podcidr) - | | | here's podcidr, mtu -flannel-server:10253 <------------------ flannel-daemon -Allocates a /24 ------------------> [config iptables, VXLan] - <------------------ [watch subnet leases] -I just allocated ------------------> [config VXLan] -another /24 | -``` - -## Proposal - -Explaining vxlan is out of the scope of this document, however it does take some basic understanding to grok the proposal. Assume some pod wants to communicate across nodes with the above setup. Check the flannel vxlan devices: - -```console -node1 $ ip -d link show flannel.1 -4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue state UNKNOWN mode DEFAULT - link/ether a2:53:86:b5:5f:c1 brd ff:ff:ff:ff:ff:ff - vxlan -node1 $ ip -d link show eth0 -2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT qlen 1000 - link/ether 42:01:0a:f0:00:04 brd ff:ff:ff:ff:ff:ff - -node2 $ ip -d link show flannel.1 -4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue state UNKNOWN mode DEFAULT - link/ether 56:71:35:66:4a:d8 brd ff:ff:ff:ff:ff:ff - vxlan -node2 $ ip -d link show eth0 -2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT qlen 1000 - link/ether 42:01:0a:f0:00:03 brd ff:ff:ff:ff:ff:ff -``` - -Note that we're ignoring cbr0 for the sake of simplicity. Spin-up a container on each node. 
We're using raw docker for this example only because we want control over where the container lands: - -``` -node1 $ docker run -it radial/busyboxplus:curl /bin/sh -[ root@5ca3c154cde3:/ ]$ ip addr show -1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue -8: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue - link/ether 02:42:12:10:20:03 brd ff:ff:ff:ff:ff:ff - inet 192.168.32.3/24 scope global eth0 - valid_lft forever preferred_lft forever - -node2 $ docker run -it radial/busyboxplus:curl /bin/sh -[ root@d8a879a29f5d:/ ]$ ip addr show -1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue -16: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue - link/ether 02:42:12:10:0e:07 brd ff:ff:ff:ff:ff:ff - inet 192.168.14.7/24 scope global eth0 - valid_lft forever preferred_lft forever -[ root@d8a879a29f5d:/ ]$ ping 192.168.32.3 -PING 192.168.32.3 (192.168.32.3): 56 data bytes -64 bytes from 192.168.32.3: seq=0 ttl=62 time=1.190 ms -``` - -__What happened?__: - -From 1000 feet: -* vxlan device driver starts up on node1 and creates a udp tunnel endpoint on 8472 -* container 192.168.32.3 pings 192.168.14.7 - - what's the MAC of 192.168.14.0? - - L2 miss, flannel looks up MAC of subnet - - Stores `192.168.14.0 <-> 56:71:35:66:4a:d8` in neighbor table - - what's tunnel endpoint of this MAC? - - L3 miss, flannel looks up destination VM ip - - Stores `10.240.0.3 <-> 56:71:35:66:4a:d8` in bridge database -* Sends `[56:71:35:66:4a:d8, 10.240.0.3][vxlan: port, vni][02:42:12:10:20:03, 192.168.14.7][icmp]` - -__But will it blend?__ - -Kubernetes integration is fairly straight-forward once we understand the pieces involved, and can be prioritized as follows: -* Kubelet understands flannel daemon in client mode, flannel server manages independent etcd store on master, node controller backs off CIDR allocation -* Flannel server consults the Kubernetes master for everything network related -* Flannel daemon works through network plugins in a generic way without bothering the kubelet: needs CNI x Kubernetes standardization - -The first is accomplished in this PR, while a timeline for 2. and 3. is TDB. To implement the flannel api we can either run a proxy per node and get rid of the flannel server, or service all requests in the flannel server with something like a go-routine per node: -* `/network/config`: read network configuration and return -* `/network/leases`: - - Post: Return a lease as understood by flannel - - Lookip node by IP - - Store node metadata from [flannel request] (https://github.com/coreos/flannel/blob/master/subnet/subnet.go#L34) in annotations - - Return [Lease object] (https://github.com/coreos/flannel/blob/master/subnet/subnet.go#L40) reflecting node cidr - - Get: Handle a watch on leases -* `/network/leases/subnet`: - - Put: This is a request for a lease. If the nodecontroller is allocating CIDRs we can probably just no-op. -* `/network/reservations`: TDB, we can probably use this to accommodate node controller allocating CIDR instead of flannel requesting it - -The ick-iest part of this implementation is going to the `GET /network/leases`, i.e. the watch proxy. We can side-step by waiting for a more generic Kubernetes resource. 
However, we can also implement it as follows: -* Watch all nodes, ignore heartbeats -* On each change, figure out the lease for the node, construct a [lease watch result](https://github.com/coreos/flannel/blob/0bf263826eab1707be5262703a8092c7d15e0be4/subnet/subnet.go#L72), and send it down the watch with the RV from the node -* Implement a lease list that does a similar translation - -I say this is gross without an api object because for each node->lease translation one has to store and retrieve the node metadata sent by flannel (eg: VTEP) from node annotations. [Reference implementation](https://github.com/bprashanth/kubernetes/blob/network_vxlan/pkg/kubelet/flannel_server.go) and [watch proxy](https://github.com/bprashanth/kubernetes/blob/network_vxlan/pkg/kubelet/watch_proxy.go). - -# Limitations - -* Integration is experimental -* Flannel etcd not stored in persistent disk -* CIDR allocation does *not* flow from Kubernetes down to nodes anymore - -# Wishlist - -This proposal is really just a call for community help in writing a Kubernetes x flannel backend. - -* CNI plugin integration -* Flannel daemon in privileged pod -* Flannel server talks to apiserver, described in proposal above -* HTTPs between flannel daemon/server -* Investigate flannel server running on every node (as done in the reference implementation mentioned above) -* Use flannel reservation mode to support node controller podcidr allocation +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
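A very rough sketch of the node-watch-to-lease translation discussed in the proposal, written against today's client-go for brevity. The simplified `lease` shape and the `flannel.alpha.coreos.com/backend-data` annotation key are assumptions here (the reference implementation keeps the VTEP metadata it needs in node annotations); treat this as an outline rather than a flannel-compatible wire format.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// lease is an assumed, simplified stand-in for flannel's subnet lease.
type lease struct {
	Subnet      string `json:"subnet"`
	PublicIP    string `json:"publicIP"`
	BackendData string `json:"backendData,omitempty"`
}

// nodeAddress picks the node's internal IP as the tunnel endpoint.
func nodeAddress(n *corev1.Node) string {
	for _, a := range n.Status.Addresses {
		if a.Type == corev1.NodeInternalIP {
			return a.Address
		}
	}
	return ""
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Watch nodes and translate each change into a lease-shaped event;
	// a real implementation would ignore heartbeat-only updates.
	w, err := client.CoreV1().Nodes().Watch(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for ev := range w.ResultChan() {
		node, ok := ev.Object.(*corev1.Node)
		if !ok || node.Spec.PodCIDR == "" {
			continue
		}
		l := lease{
			Subnet:      node.Spec.PodCIDR,
			PublicIP:    nodeAddress(node),
			BackendData: node.Annotations["flannel.alpha.coreos.com/backend-data"],
		}
		out, _ := json.Marshal(l)
		fmt.Printf("%s lease: %s\n", ev.Type, out)
	}
}
```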
\ No newline at end of file diff --git a/contributors/design-proposals/network/network-policy.md b/contributors/design-proposals/network/network-policy.md index 6a4b01a8..f0fbec72 100644 --- a/contributors/design-proposals/network/network-policy.md +++ b/contributors/design-proposals/network/network-policy.md @@ -1,299 +1,6 @@ -# NetworkPolicy +Design proposals have been archived. -## Abstract +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -A proposal for implementing a new resource - NetworkPolicy - which -will enable definition of ingress policies for selections of pods. -The design for this proposal has been created by, and discussed -extensively within the Kubernetes networking SIG. It has been implemented -and tested using Kubernetes API extensions by various networking solutions already. - -In this design, users can create various NetworkPolicy objects which select groups of pods and -define how those pods should be allowed to communicate with each other. The -implementation of that policy at the network layer is left up to the -chosen networking solution. - -> Note that this proposal does not yet include egress / cidr-based policy, which is still actively undergoing discussion in the SIG. These are expected to augment this proposal in a backwards compatible way. - -## Implementation - -The implementation in Kubernetes consists of: -- A v1beta1 NetworkPolicy API object -- A structure on the `Namespace` object to control policy, to be developed as an annotation for now. - -### Namespace changes - -The following objects will be defined on a Namespace Spec. ->NOTE: In v1beta1 the Namespace changes will be implemented as an annotation. - -```go -type IngressIsolationPolicy string - -const ( - // Deny all ingress traffic to pods in this namespace. Ingress means - // any incoming traffic to pods, whether that be from other pods within this namespace - // or any source outside of this namespace. - DefaultDeny IngressIsolationPolicy = "DefaultDeny" -) - -// Standard NamespaceSpec object, modified to include a new -// NamespaceNetworkPolicy field. -type NamespaceSpec struct { - // This is a pointer so that it can be left undefined. - NetworkPolicy *NamespaceNetworkPolicy `json:"networkPolicy,omitempty"` -} - -type NamespaceNetworkPolicy struct { - // Ingress configuration for this namespace. This config is - // applied to all pods within this namespace. For now, only - // ingress is supported. This field is optional - if not - // defined, then the cluster default for ingress is applied. - Ingress *NamespaceIngressPolicy `json:"ingress,omitempty"` -} - -// Configuration for ingress to pods within this namespace. -// For now, this only supports specifying an isolation policy. -type NamespaceIngressPolicy struct { - // The isolation policy to apply to pods in this namespace. - // Currently this field only supports "DefaultDeny", but could - // be extended to support other policies in the future. When set to DefaultDeny, - // pods in this namespace are denied ingress traffic by default. When not defined, - // the cluster default ingress isolation policy is applied (currently allow all). 
- Isolation *IngressIsolationPolicy `json:"isolation,omitempty"` -} -``` - -```yaml -kind: Namespace -apiVersion: v1 -spec: - networkPolicy: - ingress: - isolation: DefaultDeny -``` - -The above structures will be represented in v1beta1 as a json encoded annotation like so: - -```yaml -kind: Namespace -apiVersion: v1 -metadata: - annotations: - net.beta.kubernetes.io/network-policy: | - { - "ingress": { - "isolation": "DefaultDeny" - } - } -``` - -### NetworkPolicy Go Definition - -For a namespace with ingress isolation, connections to pods in that namespace (from any source) are prevented. -The user needs a way to explicitly declare which connections are allowed into pods of that namespace. - -This is accomplished through ingress rules on `NetworkPolicy` -objects (of which there can be multiple in a single namespace). Pods selected by -one or more NetworkPolicy objects should allow any incoming connections that match any -ingress rule on those NetworkPolicy objects, per the network plugin's capabilities. - -NetworkPolicy objects and the above namespace isolation both act on _connections_ rather than individual packets. That is to say that if traffic from pod A to pod B is allowed by the configured -policy, then the return packets for that connection from B -> A are also allowed, even if the policy in place would not allow B to initiate a connection to A. NetworkPolicy objects act on a broad definition of _connection_ which includes both TCP and UDP streams. If new network policy is applied that would block an existing connection between two endpoints, the enforcer of policy -should terminate and block the existing connection as soon as can be expected by the implementation. - -We propose adding the new NetworkPolicy object to the `extensions/v1beta1` API group for now. - -The SIG also considered the following while developing the proposed NetworkPolicy object: -- A per-pod policy field. We discounted this in favor of the loose coupling that labels provide, similar to Services. -- Per-Service policy. We chose not to attach network policy to services to avoid semantic overloading of a single object, and conflating the existing semantics of load-balancing and service discovery with those of network policy. - -```go -type NetworkPolicy struct { - TypeMeta - ObjectMeta - - // Specification of the desired behavior for this NetworkPolicy. - Spec NetworkPolicySpec -} - -type NetworkPolicySpec struct { - // Selects the pods to which this NetworkPolicy object applies. The array of ingress rules - // is applied to any pods selected by this field. Multiple network policies can select the - // same set of pods. In this case, the ingress rules for each are combined additively. - // This field is NOT optional and follows standard unversioned.LabelSelector semantics. - // An empty podSelector matches all pods in this namespace. - PodSelector unversioned.LabelSelector `json:"podSelector"` - - // List of ingress rules to be applied to the selected pods. - // Traffic is allowed to a pod if namespace.networkPolicy.ingress.isolation is undefined and cluster policy allows it, - // OR if the traffic source is the pod's local node, - // OR if the traffic matches at least one ingress rule across all of the NetworkPolicy - // objects whose podSelector matches the pod. - // If this field is empty then this NetworkPolicy does not affect ingress isolation. - // If this field is present and contains at least one rule, this policy allows any traffic - // which matches at least one of the ingress rules in this list. 
- Ingress []NetworkPolicyIngressRule `json:"ingress,omitempty"` -} - -// This NetworkPolicyIngressRule matches traffic if and only if the traffic matches both ports AND from. -type NetworkPolicyIngressRule struct { - // List of ports which should be made accessible on the pods selected for this rule. - // Each item in this list is combined using a logical OR. - // If this field is not provided, this rule matches all ports (traffic not restricted by port). - // If this field is empty, this rule matches no ports (no traffic matches). - // If this field is present and contains at least one item, then this rule allows traffic - // only if the traffic matches at least one port in the ports list. - Ports *[]NetworkPolicyPort `json:"ports,omitempty"` - - // List of sources which should be able to access the pods selected for this rule. - // Items in this list are combined using a logical OR operation. - // If this field is not provided, this rule matches all sources (traffic not restricted by source). - // If this field is empty, this rule matches no sources (no traffic matches). - // If this field is present and contains at least on item, this rule allows traffic only if the - // traffic matches at least one item in the from list. - From *[]NetworkPolicyPeer `json:"from,omitempty"` -} - -type NetworkPolicyPort struct { - // Optional. The protocol (TCP or UDP) which traffic must match. - // If not specified, this field defaults to TCP. - Protocol *api.Protocol `json:"protocol,omitempty"` - - // If specified, the port on the given protocol. This can - // either be a numerical or named port. If this field is not provided, - // this matches all port names and numbers. - // If present, only traffic on the specified protocol AND port - // will be matched. - Port *intstr.IntOrString `json:"port,omitempty"` -} - -type NetworkPolicyPeer struct { - // Exactly one of the following must be specified. - - // This is a label selector which selects Pods in this namespace. - // This field follows standard unversioned.LabelSelector semantics. - // If present but empty, this selector selects all pods in this namespace. - PodSelector *unversioned.LabelSelector `json:"podSelector,omitempty"` - - // Selects Namespaces using cluster scoped-labels. This - // matches all pods in all namespaces selected by this label selector. - // This field follows standard unversioned.LabelSelector semantics. - // If present but empty, this selector selects all namespaces. - NamespaceSelector *unversioned.LabelSelector `json:"namespaceSelector,omitempty"` -} -``` - -### Behavior - -The following pseudo-code attempts to define when traffic is allowed to a given pod when using this API. - -```python -def is_traffic_allowed(traffic, pod): - """ - Returns True if traffic is allowed to this pod, False otherwise. - """ - if not pod.Namespace.Spec.NetworkPolicy.Ingress.Isolation: - # If ingress isolation is disabled on the Namespace, use cluster default. - return clusterDefault(traffic, pod) - elif traffic.source == pod.node.kubelet: - # Traffic is from kubelet health checks. - return True - else: - # If namespace ingress isolation is enabled, only allow traffic - # that matches a network policy which selects this pod. - for network_policy in network_policies(pod.Namespace): - if not network_policy.Spec.PodSelector.selects(pod): - # This policy doesn't select this pod. Try the next one. - continue - - # This policy selects this pod. Check each ingress rule - # defined on this policy to see if it allows the traffic. 
- # If at least one does, then the traffic is allowed. - for ingress_rule in network_policy.Ingress or []: - if ingress_rule.matches(traffic): - return True - - # Ingress isolation is DefaultDeny and no policies match the given pod and traffic. - return False -``` - -### Potential Future Work / Questions - -- A single podSelector per NetworkPolicy may lead to managing a large number of NetworkPolicy objects, each of which is small and easy to understand on its own. However, this may lead for a policy change to require touching several policy objects. Allowing an optional podSelector per ingress rule additionally to the podSelector per NetworkPolicy object would allow the user to group rules into logical segments and define size/complexity ratio where it makes sense. This may lead to a smaller number of objects with more complexity if the user opts in to the additional podSelector. This increases the complexity of the NetworkPolicy object itself. This proposal has opted to favor a larger number of smaller objects that are easier to understand, with the understanding that additional podSelectors could be added to this design in the future should the requirement become apparent. - -- Is the `Namespaces` selector in the `NetworkPolicyPeer` struct too coarse? Do we need to support the AND combination of `Namespaces` and `Pods`? - -### Examples - -1) Only allow traffic from frontend pods on TCP port 6379 to backend pods in the same namespace. - -```yaml -kind: Namespace -apiVersion: v1 -metadata: - name: myns - annotations: - net.beta.kubernetes.io/network-policy: | - { - "ingress": { - "isolation": "DefaultDeny" - } - } ---- -kind: NetworkPolicy -apiVersion: extensions/v1beta1 -metadata: - name: allow-frontend - namespace: myns -spec: - podSelector: - matchLabels: - role: backend - ingress: - - from: - - podSelector: - matchLabels: - role: frontend - ports: - - protocol: TCP - port: 6379 -``` - -2) Allow TCP 443 from any source in Bob's namespaces. - -```yaml -kind: NetworkPolicy -apiVersion: extensions/v1beta1 -metadata: - name: allow-tcp-443 -spec: - podSelector: - matchLabels: - role: frontend - ingress: - - ports: - - protocol: TCP - port: 443 - from: - - namespaceSelector: - matchLabels: - user: bob -``` - -3) Allow all traffic to all pods in this namespace. - -```yaml -kind: NetworkPolicy -apiVersion: extensions/v1beta1 -metadata: - name: allow-all -spec: - podSelector: - ingress: - - {} -``` - -## References - -- https://github.com/kubernetes/kubernetes/issues/22469 tracks network policy in kubernetes. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
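As a small operational sketch to accompany the examples above, the snippet below patches a namespace with the v1beta1 isolation annotation so that its pods are denied ingress unless selected by a NetworkPolicy; it uses today's client-go for illustration, the namespace name is arbitrary, and error handling is minimal.

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Set the v1beta1 isolation annotation described above; pods in "myns"
	// are then denied ingress unless a NetworkPolicy selects them.
	patch := []byte(`{"metadata":{"annotations":{"net.beta.kubernetes.io/network-policy":"{\"ingress\":{\"isolation\":\"DefaultDeny\"}}"}}}`)
	if _, err := client.CoreV1().Namespaces().Patch(context.TODO(), "myns", types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
		panic(err)
	}
}
```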
\ No newline at end of file diff --git a/contributors/design-proposals/network/networking.md b/contributors/design-proposals/network/networking.md index ff97aa83..f0fbec72 100644 --- a/contributors/design-proposals/network/networking.md +++ b/contributors/design-proposals/network/networking.md @@ -1,188 +1,6 @@ -# Networking +Design proposals have been archived. -There are 4 distinct networking problems to solve: +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -1. Highly-coupled container-to-container communications -2. Pod-to-Pod communications -3. Pod-to-Service communications -4. External-to-internal communications -## Model and motivation - -Kubernetes deviates from the default Docker networking model (though as of -Docker 1.8 their network plugins are getting closer). The goal is for each pod -to have an IP in a flat shared networking namespace that has full communication -with other physical computers and containers across the network. IP-per-pod -creates a clean, backward-compatible model where pods can be treated much like -VMs or physical hosts from the perspectives of port allocation, networking, -naming, service discovery, load balancing, application configuration, and -migration. - -Dynamic port allocation, on the other hand, requires supporting both static -ports (e.g., for externally accessible services) and dynamically allocated -ports, requires partitioning centrally allocated and locally acquired dynamic -ports, complicates scheduling (since ports are a scarce resource), is -inconvenient for users, complicates application configuration, is plagued by -port conflicts and reuse and exhaustion, requires non-standard approaches to -naming (e.g. consul or etcd rather than DNS), requires proxies and/or -redirection for programs using standard naming/addressing mechanisms (e.g. web -browsers), requires watching and cache invalidation for address/port changes -for instances in addition to watching group membership changes, and obstructs -container/pod migration (e.g. using CRIU). NAT introduces additional complexity -by fragmenting the addressing space, which breaks self-registration mechanisms, -among other problems. - -## Container to container - -All containers within a pod behave as if they are on the same host with regard -to networking. They can all reach each other's ports on localhost. This offers -simplicity (static ports know a priori), security (ports bound to localhost -are visible within the pod but never outside it), and performance. This also -reduces friction for applications moving from the world of uncontainerized apps -on physical or virtual hosts. People running application stacks together on -the same host have already figured out how to make ports not conflict and have -arranged for clients to find them. - -The approach does reduce isolation between containers within a pod — -ports could conflict, and there can be no container-private ports, but these -seem to be relatively minor issues with plausible future workarounds. Besides, -the premise of pods is that containers within a pod share some resources -(volumes, cpu, ram, etc.) and therefore expect and tolerate reduced isolation. -Additionally, the user can control what containers belong to the same pod -whereas, in general, they don't control what pods land together on a host. - -## Pod to pod - -Because every pod gets a "real" (not machine-private) IP address, pods can -communicate without proxies or translations. 
The pod can use well-known port -numbers and can avoid the use of higher-level service discovery systems like -DNS-SD, Consul, or Etcd. - -When any container calls ioctl(SIOCGIFADDR) (get the address of an interface), -it sees the same IP that any peer container would see them coming from — -each pod has its own IP address that other pods can know. By making IP addresses -and ports the same both inside and outside the pods, we create a NAT-less, flat -address space. Running "ip addr show" should work as expected. This would enable -all existing naming/discovery mechanisms to work out of the box, including -self-registration mechanisms and applications that distribute IP addresses. We -should be optimizing for inter-pod network communication. Within a pod, -containers are more likely to use communication through volumes (e.g., tmpfs) or -IPC. - -This is different from the standard Docker model. In that mode, each container -gets an IP in the 172-dot space and would only see that 172-dot address from -SIOCGIFADDR. If these containers connect to another container the peer would see -the connect coming from a different IP than the container itself knows. In short -— you can never self-register anything from a container, because a -container can not be reached on its private IP. - -An alternative we considered was an additional layer of addressing: pod-centric -IP per container. Each container would have its own local IP address, visible -only within that pod. This would perhaps make it easier for containerized -applications to move from physical/virtual hosts to pods, but would be more -complex to implement (e.g., requiring a bridge per pod, split-horizon/VP DNS) -and to reason about, due to the additional layer of address translation, and -would break self-registration and IP distribution mechanisms. - -Like Docker, ports can still be published to the host node's interface(s), but -the need for this is radically diminished. - -## Implementation - -For the Google Compute Engine cluster configuration scripts, we use [advanced -routing rules](https://developers.google.com/compute/docs/networking#routing) -and ip-forwarding-enabled VMs so that each VM has an extra 256 IP addresses that -get routed to it. This is in addition to the 'main' IP address assigned to the -VM that is NAT-ed for Internet access. The container bridge (called `cbr0` to -differentiate it from `docker0`) is set up outside of Docker proper. - -Example of GCE's advanced routing rules: - -```sh -gcloud compute routes add "${NODE_NAMES[$i]}" \ - --project "${PROJECT}" \ - --destination-range "${NODE_IP_RANGES[$i]}" \ - --network "${NETWORK}" \ - --next-hop-instance "${NODE_NAMES[$i]}" \ - --next-hop-instance-zone "${ZONE}" & -``` - -GCE itself does not know anything about these IPs, though. This means that when -a pod tries to egress beyond GCE's project the packets must be SNAT'ed -(masqueraded) to the VM's IP, which GCE recognizes and allows. - -### Other implementations - -With the primary aim of providing IP-per-pod-model, other implementations exist -to serve the purpose outside of GCE. - - [OpenVSwitch with GRE/VxLAN](../admin/ovs-networking.md) - - [Flannel](https://github.com/coreos/flannel#flannel) - - [L2 networks](http://blog.oddbit.com/2014/08/11/four-ways-to-connect-a-docker/) - ("With Linux Bridge devices" section) - - [Weave](https://github.com/zettio/weave) is yet another way to build an - overlay network, primarily aiming at Docker integration. 
- - [Calico](https://github.com/Metaswitch/calico) uses BGP to enable real - container IPs. - - [Cilium](https://github.com/cilium/cilium) supports Overlay Network mode (IPv4/IPv6) and Direct Routing model (IPv6) - -## Pod to service - -The [service](https://kubernetes.io/docs/concepts/services-networking/service/) abstraction provides a way to group pods under a -common access policy (e.g. load-balanced). The implementation of this creates a -virtual IP which clients can access and which is transparently proxied to the -pods in a Service. Each node runs a kube-proxy process which programs -`iptables` rules to trap access to service IPs and redirect them to the correct -backends. This provides a highly-available load-balancing solution with low -performance overhead by balancing client traffic from a node on that same node. - -## External to internal - -So far the discussion has been about how to access a pod or service from within -the cluster. Accessing a pod from outside the cluster is a bit more tricky. We -want to offer highly-available, high-performance load balancing to target -Kubernetes Services. Most public cloud providers are simply not flexible enough -yet. - -The way this is generally implemented is to set up external load balancers (e.g. -GCE's ForwardingRules or AWS's ELB) which target all nodes in a cluster. When -traffic arrives at a node it is recognized as being part of a particular Service -and routed to an appropriate backend Pod. This does mean that some traffic will -get double-bounced on the network. Once cloud providers have better offerings -we can take advantage of those. - -## Challenges and future work - -### Docker API - -Right now, docker inspect doesn't show the networking configuration of the -containers, since they derive it from another container. That information should -be exposed somehow. - -### External IP assignment - -We want to be able to assign IP addresses externally from Docker -[#6743](https://github.com/dotcloud/docker/issues/6743) so that we don't need -to statically allocate fixed-size IP ranges to each node, so that IP addresses -can be made stable across pod infra container restarts -([#2801](https://github.com/dotcloud/docker/issues/2801)), and to facilitate -pod migration. Right now, if the pod infra container dies, all the user -containers must be stopped and restarted because the netns of the pod infra -container will change on restart, and any subsequent user container restart -will join that new netns, thereby not being able to see its peers. -Additionally, a change in IP address would encounter DNS caching/TTL problems. -External IP assignment would also simplify DNS support (see below). - -### IPv6 - -IPv6 support would be nice but requires significant internal changes in a few -areas. First pods should be able to report multiple IP addresses -[Kubernetes issue #27398](https://github.com/kubernetes/kubernetes/issues/27398) -and the network plugin architecture Kubernetes uses needs to allow returning -IPv6 addresses too [CNI issue #245](https://github.com/containernetworking/cni/issues/245). -Kubernetes code that deals with IP addresses must then be audited and fixed to -support both IPv4 and IPv6 addresses and not assume IPv4. -AWS started rolling out basic -[ipv6 support](https://aws.amazon.com/about-aws/whats-new/2016/12/announcing-internet-protocol-version-6-support-for-ec2-instances-in-amazon-virtual-private-cloud/), -but direct ipv6 assignment to instances doesn't appear to be supported by other -major cloud providers (e.g. 
GCE) yet. We'd happily take pull requests from people -running Kubernetes on bare metal, though. :-) +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
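The SNAT behaviour described above boils down to a per-packet decision: destinations inside the pod address space keep the pod IP, everything else is masqueraded to the VM's IP. Below is a minimal Go sketch of that decision, assuming a hypothetical 10.244.0.0/16 cluster CIDR; the real rule is programmed with iptables, not Go.

```go
package main

import (
	"fmt"
	"net"
)

// needsMasquerade shows the decision described above: traffic that stays
// inside the pod address space keeps the pod IP, traffic leaving the project
// is SNAT'ed to the VM's IP. The 10.244.0.0/16 range is a made-up example.
func needsMasquerade(dst net.IP, clusterCIDR *net.IPNet) bool {
	return !clusterCIDR.Contains(dst)
}

func main() {
	_, clusterCIDR, _ := net.ParseCIDR("10.244.0.0/16")
	for _, d := range []string{"10.244.3.7", "8.8.8.8"} {
		fmt.Printf("dst %-12s masquerade=%v\n", d, needsMasquerade(net.ParseIP(d), clusterCIDR))
	}
}
```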
\ No newline at end of file diff --git a/contributors/design-proposals/network/nodeport-ip-range.md b/contributors/design-proposals/network/nodeport-ip-range.md index 908222de..f0fbec72 100644 --- a/contributors/design-proposals/network/nodeport-ip-range.md +++ b/contributors/design-proposals/network/nodeport-ip-range.md @@ -1,85 +1,6 @@ -# Support specifying NodePort IP range +Design proposals have been archived. -Author: @m1093782566 +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -# Objective -This document proposes creating a option for kube-proxy to specify NodePort IP range. - -# Background - -NodePort type service gives developers the freedom to set up their own load balancers, to expose one or more nodes’ IPs directly. The service will be visible as the nodes's IPs. For now, the NodePort addresses are the IPs from all available interfaces. - -With iptables magic, all the IPs whose `ADDRTYPE` matches `dst-type LOCAL` will be taken as the address of NodePort, which might look like, - -```shell -Chain KUBE-SERVICES (2 references) -target prot opt source destination -KUBE-NODEPORTS all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL -``` -By default, kube-proxy accepts everything from NodePort without any filter. It can be a problem for nodes which has both public and private NICs, and people only want to provide a service in private network and avoid exposing any internal service on the public IPs. - -# Proposal - -This proposal builds off of earlier requests to [[proxy] Listening on a specific IP for nodePort ](https://github.com/kubernetes/kubernetes/issues/21070), but proposes that we should find a way to tell kube-proxy what the NodePort IP blocks are instead of a single IP. - -## Create new kube-proxy configuration option - -There should be an admin option to kube-proxy for specifying which IP to NodePort. The option is a list of IP blocks, say `--nodeport-addresses`. These IP blocks as a parameter to select the interfaces where nodeport works. In case someone would like to expose a service on localhost for local visit and some other interfaces for particular purpose, an array of IP blocks would do that. People can populate it from their private subnets the same on every node. - -The `--nodeport-addresses` is defaulted to `0.0.0.0/0`, which means select all available interfaces and is compliance with current NodePort behaviour. - -If people set the `--nodeport-addresses` option to "127.0.0.0/8", kube-proxy will only select the loopback interface for NodePort. - -If people set the `--nodeport-addresses` option to "default-route", kube-proxy will select the "who has the default route" interfaces. It's the same heuristic we use for `--advertise-address` in kube-apiserver and others. - -If people provide a non-zero IP block for `--nodeport-addresses`, kube-proxy will filter that down to just the IPs that applied to the node. - -So, the following values for `--nodeport-addresses` are all valid: - -``` -0.0.0.0/0 -127.0.0.0/8 -default-route -127.0.0.1/32,default-route -127.0.0.0/8,192.168.0.0/16 -``` - - -And an empty string for `--nodeport-addresses` is considered as invalid. - -> NOTE: There is already a option `--bind-address`, but it has nothing to do with nodeport and we need IP blocks instead of single IP. 
- -kube-proxy will periodically refresh proxy rules based on the list of IP blocks specified by `--nodeport-addresses`, in case of something like DHCP. - -For example, if the IP address of `eth0` changes from `172.10.1.2` to `172.10.2.100` and the user specifies `172.10.0.0/16` for `--nodeport-addresses`, kube-proxy will make sure proxy rules for `-d 172.10.0.0/16` exist. - -However, if the IP address of `eth0` changes from `172.10.1.2` to `192.168.3.4` and the user only specifies `172.10.0.0/16` for `--nodeport-addresses`, kube-proxy will NOT create proxy rules for `192.168.3.4` unless `eth0` has the default route. - -In the DHCP use case, the network administrator usually reserves a RANGE of IP addresses for the DHCP server, so an IP address change will always fall within a reserved range. That is to say, the IP address of an interface will not change from `172.10.1.2` to `192.168.3.4` in our example. - -## Kube-proxy implementation support - -The implementation is simple. - -### iptables - -iptables supports specifying a CIDR in the destination parameter (`-d`), e.g. `-d 192.168.0.0/16`. - -For the special `default-route` case, we should use the `-i` option in the iptables command, e.g. `-i eth0`. - -### Linux userspace - -Same as iptables. - -### ipvs - -Create IPVS virtual services one by one according to the provided node IPs, which is almost the same as the current behaviour (fetch all IPs from the host). - -### Windows userspace - -Create multiple goroutines; each goroutine listens on a specific node IP to serve NodePort. - -### winkernel - -Need to specify node IPs [here](https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/winkernel/proxier.go#L1053) - the current behaviour is to leave the VIP empty to automatically select the node IP. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
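To make the "filter that down to just the IPs that applied to the node" step concrete, here is a small Go sketch that intersects the configured `--nodeport-addresses` blocks with a node's interface IPs. The node IPs are invented for illustration, and the `default-route` special case is omitted for brevity.

```go
package main

import (
	"fmt"
	"net"
)

// filterNodePortAddresses keeps only the node IPs that fall inside one of the
// configured --nodeport-addresses blocks. The node IPs below are invented for
// illustration, and the "default-route" special value is not handled here.
func filterNodePortAddresses(cidrs []string, nodeIPs []string) ([]string, error) {
	var matched []string
	for _, c := range cidrs {
		_, block, err := net.ParseCIDR(c)
		if err != nil {
			return nil, fmt.Errorf("invalid --nodeport-addresses entry %q: %v", c, err)
		}
		for _, s := range nodeIPs {
			if ip := net.ParseIP(s); ip != nil && block.Contains(ip) {
				matched = append(matched, s)
			}
		}
	}
	return matched, nil
}

func main() {
	ips, err := filterNodePortAddresses(
		[]string{"127.0.0.0/8", "192.168.0.0/16"},
		[]string{"127.0.0.1", "192.168.10.5", "203.0.113.7"},
	)
	fmt.Println(ips, err) // [127.0.0.1 192.168.10.5] <nil>
}
```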
\ No newline at end of file diff --git a/contributors/design-proposals/network/pod-resolv-conf.md b/contributors/design-proposals/network/pod-resolv-conf.md index ed6e090f..f0fbec72 100644 --- a/contributors/design-proposals/network/pod-resolv-conf.md +++ b/contributors/design-proposals/network/pod-resolv-conf.md @@ -1,210 +1,6 @@ -# Custom /etc/resolv.conf +Design proposals have been archived. -* Status: pending -* Version: alpha -* Implementation owner: Bowei Du <[bowei@google.com](mailto:bowei@google.com)>, - Zihong Zheng <[zihongz@google.com](mailto:zihongz@google.com)> +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -# Overview -The `/etc/resolv.conf` in a pod is managed by Kubelet and its contents are -generated based on `pod.dnsPolicy`. For `dnsPolicy: Default`, the `search` and -`nameserver` fields are taken from the `resolve.conf` on the node where the pod -is running. If the `dnsPolicy` is `ClusterFirst`, the search contents of the -resolv.conf is the hosts `resolv.conf` augmented with the following options: - -* Search paths to add aliases for domain names in the same namespace and - cluster suffix. -* `options ndots` to 5 to ensure the search paths are searched for all - potential matches. - -The configuration of both search paths and `ndots` results in query -amplification of five to ten times for non-cluster internal names. This is due -to the fact that each of the search path expansions must be tried before the -actual result is found. This order of magnitude increase of query rate imposes a -large load on the kube-dns service. At the same time, there are user -applications do not need the convenience of the name aliases and do not wish to -pay this performance cost. - - -## Existing workarounds - -The current work around for this problem is to specify an FQDN for name -resolution. Any domain name that ends with a period (e.g. `foo.bar.com.`) will -not be search path expanded. However, use of FQDNs is not well-known practice -and imposes application-level changes. Cluster operators may not have the luxury -of enforcing such a change to applications that run on their infrastructure. - -It is also possible for the user to insert a short shell script snippet that -rewrites `resolv.conf` on container start-up. This has the same problems as the -previous approach and is also awkward for the user. This also forces the -container to have additional executable code such as a shell or scripting engine -which increases the applications security surface area. - - -# Proposal sketch - -This proposal gives users a way to overlay tweaks into the existing -`DnsPolicy`. A new PodSpec field `dnsParams` will contains fields that are -merged with the settings currently selected with `DnsPolicy`. - -The fields of `DnsParams` are: - -* `nameservers` is a list of additional nameservers to use for resolution. On - `resolv.conf` platforms, these are entries to `nameserver`. -* `search` is a list of additional search path subdomains. On `resolv.conf` - platforms, these are entries to the `search` setting. These domains will be - appended to the existing search path. -* `options` that are an OS-dependent list of (name, value) options. These values - are NOT expected to be generally portable across platforms. For containers that - use `/etc/resolv.conf` style configuration, these correspond to the parameters - passed to the `option` lines. 
Options will override if their names coincide, - i.e, if the `DnsPolicy` sets `ndots:5` and `ndots:1` appears in the `Spec`, - then the final value will be `ndots:1`. - -For users that want to completely customize their resolution configuration, we -add a new `DnsPolicy: Custom` that does not define any settings. This is -essentially an empty `resolv.conf` with no fields defined. - -## Pod API examples - -### Host `/etc/resolv.conf` - -Assume in the examples below that the host has the following `/etc/resolv.conf`: - -```bash -nameserver 10.1.1.10 -search foo.com -options ndots:1 -``` - -### Override DNS server and search paths - -In the example below, the user wishes to use their own DNS resolver and add the -pod namespace and a custom expansion to the search path, as they do not use the -other name aliases: - -```yaml -# Pod spec -apiVersion: v1 -kind: Pod -metadata: {"namespace": "ns1", "name": "example"} -spec: - ... - dnsPolicy: Custom - dnsParams: - nameservers: ["1.2.3.4"] - search: - - ns1.svc.cluster.local - - my.dns.search.suffix - options: - - name: ndots - value: 2 - - name: edns0 -``` - -The pod will get the following `/etc/resolv.conf`: - -```bash -nameserver 1.2.3.4 -search ns1.svc.cluster.local my.dns.search.suffix -options ndots:2 edns0 -``` - -## Overriding `ndots` - -Override `ndots:5` in `ClusterFirst` with `ndots:1`. This keeps all of the -settings intact: - -```yaml -dnsPolicy: ClusterFirst -dnsParams: -- options: - - name: ndots - - value: 1 -``` - -Resulting `resolv.conf`: - -```bash -nameserver 10.0.0.10 -search default.svc.cluster.local svc.cluster.local cluster.local foo.com -options ndots:1 -``` - -# API changes - -```go -type PodSpec struct { - ... - DNSPolicy string `json:"dnsPolicy,omitempty"` - DNSParams *PodDNSParams `json:"dnsParams,omitempty"` - ... -} - -type PodDNSParams struct { - Nameservers []string `json:"nameservers,omitempty"` - Search []string `json:"search,omitempty"` - Options []PodDNSParamsOption `json:"options,omitempty" patchStrategy:"merge" patchMergeKey:"name"` -} - -type PodDNSParamsOption struct { - Name string `json:"name"` - Value *string `json:"value,omitempty"` -} -``` - -## Semantics - -Let the following be the Go representation of the `resolv.conf`: - -```go -type ResolvConf struct { - Nameserver []string // "nameserver" entries - Search []string // "search" entries - Options []PodDNSParamsOption // "options" entries -} -``` - -Let `var HostResolvConf ResolvConf` be the host `resolv.conf`. - -Then the final Pod `resolv.conf` will be: - -```go -func podResolvConf() ResolvConf { - var podResolv ResolvConf - - switch (pod.DNSPolicy) { - case "Default": - podResolv = HostResolvConf - case "ClusterFirst: - podResolv.Nameservers = []string{ KubeDNSClusterIP } - podResolv.Search = ... // populate with ns.svc.suffix, svc.suffix, suffix, host entries... - podResolv.Options = []PodDNSParamsOption{{"ndots","5" }} - case "Custom": // start with empty `resolv.conf` - break - } - - // Append the additional nameservers. - podResolv.Nameservers = append(Nameservers, pod.DNSParams.Nameservers...) - // Append the additional search paths. - podResolv.Search = append(Search, pod.DNSParams.Search...) - // Merge the DnsParams.Options with the options derived from the given DNSPolicy. - podResolv.Options = mergeOptions(pod.Options, pod.DNSParams.Options) - - return podResolv -} -``` - -### Invalid configurations - -The follow configurations will result in an invalid Pod spec: - -* Nameservers or search paths exceed system limits. 
(Three nameservers, six - search paths, 256 characters for `glibc`). -* Invalid option appears for the given platform. - -# References - -* [Kubernetes DNS name specification](https://git.k8s.io/dns/docs/specification.md) -* [`/etc/resolv.conf` manpage](http://manpages.ubuntu.com/manpages/zesty/man5/resolv.conf.5.html) +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
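The `mergeOptions` call in the semantics sketch above is referenced but never defined. One plausible implementation, assuming options from `dnsParams` override policy-derived options with the same name, is shown below.

```go
package main

import "fmt"

// PodDNSParamsOption mirrors the proposed API type: a name plus an optional value.
type PodDNSParamsOption struct {
	Name  string
	Value *string
}

// mergeOptions is one possible implementation of the merge referenced above:
// options coming from dnsParams override policy-derived options with the same
// name, and the order of first appearance is preserved.
func mergeOptions(policyOpts, paramOpts []PodDNSParamsOption) []PodDNSParamsOption {
	var merged []PodDNSParamsOption
	index := map[string]int{}
	for _, list := range [][]PodDNSParamsOption{policyOpts, paramOpts} {
		for _, o := range list {
			if i, ok := index[o.Name]; ok {
				merged[i] = o // later occurrence wins, e.g. ndots:5 -> ndots:1
				continue
			}
			index[o.Name] = len(merged)
			merged = append(merged, o)
		}
	}
	return merged
}

func strp(s string) *string { return &s }

func main() {
	policy := []PodDNSParamsOption{{Name: "ndots", Value: strp("5")}}
	params := []PodDNSParamsOption{{Name: "ndots", Value: strp("1")}, {Name: "edns0"}}
	for _, o := range mergeOptions(policy, params) {
		if o.Value != nil {
			fmt.Printf("options %s:%s\n", o.Name, *o.Value)
		} else {
			fmt.Printf("options %s\n", o.Name)
		}
	}
	// Prints:
	// options ndots:1
	// options edns0
}
```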
\ No newline at end of file diff --git a/contributors/design-proposals/network/service-discovery.md b/contributors/design-proposals/network/service-discovery.md index c3da6b2b..f0fbec72 100644 --- a/contributors/design-proposals/network/service-discovery.md +++ b/contributors/design-proposals/network/service-discovery.md @@ -1,62 +1,6 @@ -# Service Discovery Proposal +Design proposals have been archived. -## Goal of this document +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -To consume a service, a developer needs to know the full URL and a description of the API. Kubernetes contains the host and port information of a service, but it lacks the scheme and the path information needed if the service is not bound at the root. In this document we propose some standard kubernetes service annotations to fix these gaps. It is important that these annotations are a standard to allow for standard service discovery across Kubernetes implementations. Note that the example largely speaks to consuming WebServices but that the same concepts apply to other types of services. - -## Endpoint URL, Service Type - -A URL can accurately describe the location of a Service. A generic URL is of the following form - - scheme:[//[user:password@]host[:port]][/]path[?query][#fragment] - -however for the purpose of service discovery we can simplify this to the following form - - scheme:[//host[:port]][/]path - -If a user and/or password is required then this information can be passed using Kubernetes Secrets. Kubernetes contains the host and port of each service but it lacks the scheme and path. - -`Service Path` - Every Service has one (or more) endpoint. As a rule the endpoint should be located at the root "/" of the location URL, i.e. `http://172.100.1.52/`. There are cases where this is not possible and the actual service endpoint could be located at `http://172.100.1.52/cxfcdi`. The Kubernetes metadata for a service does not capture the path part, making it hard to consume this service. - -`Service Scheme` - Services can be deployed using different schemes. Some popular schemes include `http`,`https`,`file`,`ftp` and `jdbc`. - -`Service Protocol` - Services use different protocols that clients need to speak in order to communicate with the service, some examples of service level protocols are SOAP, REST (Yes, technically REST isn't a protocol but an architectural style). For service consumers it can be hard to tell what protocol is expected. - -## Service Description - -The API of a service is the point of interaction with a service consumer. The description of the API is an essential piece of information at creation time of the service consumer. It has become common to publish a service definition document on a know location on the service itself. This 'well known' place it not very standard, so it is proposed the service developer provides the service description path and the type of Definition Language (DL) used. - -`Service Description Path` - To facilitate the consumption of the service by client, the location this document would be greatly helpful to the service consumer. In some cases the client side code can be generated from such a document. It is assumed that the service description document is published somewhere on the service endpoint itself. - -`Service Description Language` - A number of Definition Languages (DL) have been developed to describe the service. 
Some of examples are `WSDL`, `WADL` and `Swagger`. In order to consume a description document it is good to know the type of DL used. - -## Standard Service Annotations - -Kubernetes allows the creation of Service Annotations. Here we propose the use of the following standard annotations - -* `api.service.kubernetes.io/path` - the path part of the service endpoint url. An example value could be `cxfcdi`, -* `api.service.kubernetes.io/scheme` - the scheme part of the service endpoint url. Some values could be `http` or `https`. -* `api.service.kubernetes.io/protocol` - the protocol of the service. Known values are `SOAP`, `XML-RPC` and `REST`, -* `api.service.kubernetes.io/description-path` - the path part of the service description document's endpoint. It is a pretty safe assumption that the service self-documents. An example value for a swagger 2.0 document can be `cxfcdi/swagger.json`, -* `api.kubernetes.io/description-language` - the type of Description Language used. Known values are `WSDL`, `WADL`, `SwaggerJSON`, `SwaggerYAML`. - -The fragment below is taken from the service section of the kubernetes.json were these annotations are used - - ... - "objects" : [ { - "apiVersion" : "v1", - "kind" : "Service", - "metadata" : { - "annotations" : { - "api.service.kubernetes.io/protocol" : "REST", - "api.service.kubernetes.io/scheme" "http", - "api.service.kubernetes.io/path" : "cxfcdi", - "api.service.kubernetes.io/description-path" : "cxfcdi/swagger.json", - "api.service.kubernetes.io/description-language" : "SwaggerJSON" - }, - ... - -## Conclusion - -Five service annotations are proposed as a standard way to describe a service endpoint. These five annotation are promoted as a Kubernetes standard, so that services can be discovered and a service catalog can be build to facilitate service consumers. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
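As a rough illustration of how a consumer or a service catalog might use the proposed annotations, the Go sketch below composes an endpoint URL from the scheme and path annotations plus the host and port Kubernetes already knows for the service. The host, port, and the `http` fallback are assumptions made for this example, not part of the proposal.

```go
package main

import "fmt"

// buildServiceURL combines the host and port Kubernetes already exposes for a
// Service with the proposed annotations to form a usable endpoint URL. The
// host, port, and "http" fallback are assumptions made for this example.
func buildServiceURL(host string, port int, annotations map[string]string) string {
	scheme := annotations["api.service.kubernetes.io/scheme"]
	if scheme == "" {
		scheme = "http"
	}
	path := annotations["api.service.kubernetes.io/path"]
	return fmt.Sprintf("%s://%s:%d/%s", scheme, host, port, path)
}

func main() {
	ann := map[string]string{
		"api.service.kubernetes.io/scheme":           "http",
		"api.service.kubernetes.io/path":             "cxfcdi",
		"api.service.kubernetes.io/description-path": "cxfcdi/swagger.json",
	}
	// Endpoint URL for the service itself.
	fmt.Println(buildServiceURL("172.100.1.52", 80, ann))
	// A catalog could fetch the description document the same way.
	fmt.Println(buildServiceURL("172.100.1.52", 80, map[string]string{
		"api.service.kubernetes.io/scheme": ann["api.service.kubernetes.io/scheme"],
		"api.service.kubernetes.io/path":   ann["api.service.kubernetes.io/description-path"],
	}))
}
```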
\ No newline at end of file diff --git a/contributors/design-proposals/network/service-external-name.md b/contributors/design-proposals/network/service-external-name.md index 69073f8b..f0fbec72 100644 --- a/contributors/design-proposals/network/service-external-name.md +++ b/contributors/design-proposals/network/service-external-name.md @@ -1,156 +1,6 @@ -# Service externalName +Design proposals have been archived. -Author: Tim Hockin (@thockin), Rodrigo Campos (@rata), Rudi C (@therc) +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Date: August 2016 -Status: Implementation in progress - -# Goal - -Allow a service to have a CNAME record in the cluster internal DNS service. For -example, the lookup for a `db` service could return a CNAME that points to the -RDS resource `something.rds.aws.amazon.com`. No proxying is involved. - -# Motivation - -There were many related issues, but we'll try to summarize them here. More info -is on GitHub issues/PRs: [#13748](https://issues.k8s.io/13748), [#11838](https://issues.k8s.io/11838), [#13358](https://issues.k8s.io/13358), [#23921](https://issues.k8s.io/23921) - -One motivation is to present as native cluster services, services that are -hosted externally. Some cloud providers, like AWS, hand out hostnames (IPs are -not static) and the user wants to refer to these services using regular -Kubernetes tools. This was requested in bugs, at least for AWS, for RedShift, -RDS, Elasticsearch Service, ELB, etc. - -Other users just want to use an external service, for example `oracle`, with dns -name `oracle-1.testdev.mycompany.com`, without having to keep DNS in sync, and -are fine with a CNAME. - -Another use case is to "integrate" some services for local development. For -example, consider a search service running in Kubernetes in staging, let's say -`search-1.stating.mycompany.com`. It's running on AWS, so it resides behind an -ELB (which has no static IP, just a hostname). A developer is building an app -that consumes `search-1`, but doesn't want to run it on their machine (before -Kubernetes, they didn't, either). They can just create a service that has a -CNAME to the `search-1` endpoint in staging and be happy as before. - -Also, Openshift needs this for "service refs". Service ref is really just the -three use cases mentioned above, but in the future a way to automatically inject -"service ref"s into namespaces via "service catalog"[1] might be considered. And -service ref is the natural way to integrate an external service, since it takes -advantage of native DNS capabilities already in wide use. - -[1]: https://github.com/kubernetes/kubernetes/pull/17543 - -# Alternatives considered - -In the issues linked above, some alternatives were also considered. A partial -summary of them follows. - -One option is to add the hostname to endpoints, as proposed in -https://github.com/kubernetes/kubernetes/pull/11838. This is problematic, as -endpoints are used in many places and users assume the required fields (such as -IP address) are always present and valid (and check that, too). If the field is -not required anymore or if there is just a hostname instead of the IP, -applications could break. Even assuming those cases could be solved, the -hostname will have to be resolved, which presents further questions and issues: -the timeout to use, whether the lookup is synchronous or asynchronous, dealing -with DNS TTL and more. 
One imperfect approach was to only resolve the hostname -upon creation, but this was considered not a great idea. A better approach -would be at a higher level, maybe a service type. - -There are more ideas described in [#13748](https://issues.k8s.io/13748), but all raised further issues, -ranging from using another upstream DNS server to creating a Name object -associated with DNSs. - -# Proposed solution - -The proposed solution works at the service layer, by adding a new `externalName` -type for services. This will create a CNAME record in the internal cluster DNS -service. No virtual IP or proxying is involved. - -Using a CNAME gets rid of unnecessary DNS lookups. There's no need for the -Kubernetes control plane to issue them, to pick a timeout for them and having to -refresh them when the TTL for a record expires. It's way simpler to implement, -while solving the right problem. And addressing it at the service layer avoids -all the complications mentioned above about doing it at the endpoints layer. - -The solution was outlined by Tim Hockin in -https://github.com/kubernetes/kubernetes/issues/13748#issuecomment-230397975 - -Currently a ServiceSpec looks like this, with comments edited for clarity: - -```go -type ServiceSpec struct { - Ports []ServicePort - - // If not specified, the associated Endpoints object is not automatically managed - Selector map[string]string - - // "", a real IP, or "None". If not specified, this is default allocated. If "None", this Service is not load-balanced - ClusterIP string - - // ClusterIP, NodePort, LoadBalancer. Only applies if clusterIP != "None" - Type ServiceType - - // Only applies if clusterIP != "None" - ExternalIPs []string - SessionAffinity ServiceAffinity - - // Only applies to type=LoadBalancer - LoadBalancerIP string - LoadBalancerSourceRanges []string -``` - -The proposal is to change it to: - -```go -type ServiceSpec struct { - Ports []ServicePort - - // If not specified, the associated Endpoints object is not automatically managed -+ // Only applies if type is ClusterIP, NodePort, or LoadBalancer. If type is ExternalName, this is ignored. - Selector map[string]string - - // "", a real IP, or "None". If not specified, this is default allocated. If "None", this Service is not load-balanced. -+ // Only applies if type is ClusterIP, NodePort, or LoadBalancer. If type is ExternalName, this is ignored. - ClusterIP string - -- // ClusterIP, NodePort, LoadBalancer. Only applies if clusterIP != "None" -+ // ExternalName, ClusterIP, NodePort, LoadBalancer. Only applies if clusterIP != "None" - Type ServiceType - -+ // Only applies if type is ExternalName -+ ExternalName string - - // Only applies if clusterIP != "None" - ExternalIPs []string - SessionAffinity ServiceAffinity - - // Only applies to type=LoadBalancer - LoadBalancerIP string - LoadBalancerSourceRanges []string -``` - -For example, it can be used like this: - -```yaml -apiVersion: v1 -kind: Service -metadata: - name: my-rds -spec: - ports: - - port: 12345 - type: ExternalName - externalName: myapp.rds.whatever.aws.says -``` - -There is one issue to take into account, that no other alternative considered -fixes, either: TLS. If the service is a CNAME for an endpoint that uses TLS, -connecting with the Kubernetes name `my-service.my-ns.svc.cluster.local` may -result in a failure during server certificate validation. This is acknowledged -and left for future consideration. 
For the time being, users and administrators -might need to ensure that the server certificates also mention the Kubernetes -name as an alternate host name. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
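For completeness, this is what the consumer side looks like: resolving the Service name from inside the cluster should return the configured CNAME target. The sketch below assumes the `my-rds` example above lives in the `default` namespace with the `cluster.local` suffix, so it only resolves when run inside such a cluster.

```go
package main

import (
	"fmt"
	"net"
)

// Resolving the Service name from inside the cluster should return the CNAME
// target configured in externalName. The name below assumes the my-rds example
// above lives in the default namespace with the cluster.local suffix.
func main() {
	cname, err := net.LookupCNAME("my-rds.default.svc.cluster.local")
	if err != nil {
		fmt.Println("lookup failed (expected outside a cluster):", err)
		return
	}
	fmt.Println("CNAME:", cname) // e.g. myapp.rds.whatever.aws.says.
}
```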
\ No newline at end of file diff --git a/contributors/design-proposals/network/support_traffic_shaping_for_kubelet_cni.md b/contributors/design-proposals/network/support_traffic_shaping_for_kubelet_cni.md index 659bbf53..f0fbec72 100644 --- a/contributors/design-proposals/network/support_traffic_shaping_for_kubelet_cni.md +++ b/contributors/design-proposals/network/support_traffic_shaping_for_kubelet_cni.md @@ -1,89 +1,6 @@ -# Support traffic shaping for CNI network plugin +Design proposals have been archived. -Version: Alpha +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Authors: @m1093782566 -## Motivation and background - -Currently the kubenet code supports applying basic traffic shaping during pod setup. This will happen if bandwidth-related annotations have been added to the pod's metadata, for example: - -```json -{ - "kind": "Pod", - "metadata": { - "name": "iperf-slow", - "annotations": { - "kubernetes.io/ingress-bandwidth": "10M", - "kubernetes.io/egress-bandwidth": "10M" - } - } -} -``` - -Our current implementation uses the `linux tc` to add an download(ingress) and upload(egress) rate limiter using 1 root `qdisc`, 2 `class `(one for ingress and one for egress) and 2 `filter`(one for ingress and one for egress attached to the ingress and egress classes respectively). - -Kubelet CNI code doesn't support it yet, though CNI has already added a [traffic sharping plugin](https://github.com/containernetworking/plugins/tree/master/plugins/meta/bandwidth). We can replicate the behavior we have today in kubenet for kubelet CNI network plugin if we feel this is an important feature. - -## Goal - -Support traffic shaping for CNI network plugin in Kubernetes. - -## Non-goal - -CNI plugins to implement this sort of traffic shaping guarantee. - -## Proposal - -If kubelet starts up with `network-plugin = cni` and user enabled traffic shaping via the network plugin configuration, it would then populate the `runtimeConfig` section of the config when calling the `bandwidth` plugin. - -Traffic shaping in Kubelet CNI network plugin can work with ptp and bridge network plugins. - -### Pod Setup - -When we create a pod with bandwidth configuration in its metadata, for example, - -```json -{ - "kind": "Pod", - "metadata": { - "name": "iperf-slow", - "annotations": { - "kubernetes.io/ingress-bandwidth": "10M", - "kubernetes.io/egress-bandwidth": "10M" - } - } -} -``` - -Kubelet would firstly parse the ingress and egress bandwidth values and transform them to Kbps because both `ingressRate` and `egressRate` in cni bandwidth plugin are in Kbps. A user would add something like this to their CNI config list if they want to enable traffic shaping via the plugin: - -```json -{ - "type": "bandwidth", - "capabilities": {"trafficShaping": true} -} -``` - -Kubelet would then populate the `runtimeConfig` section of the config when calling the `bandwidth` plugin: - -```json -{ - "type": "bandwidth", - "runtimeConfig": { - "trafficShaping": { - "ingressRate": "X", - "egressRate": "Y" - } - } -} -``` - -### Pod Teardown - -When we delete a pod, kubelet will build the runtime config for calling cni plugin `DelNetwork/DelNetworkList` API, which will remove this pod's bandwidth configuration. - -## Next step - -* Support ingress and egress burst bandwidth in Pod. -* Graduate annotations to Pod Spec. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
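The first step kubelet performs, parsing the bandwidth annotations into the rate values placed under `runtimeConfig.trafficShaping`, can be sketched as follows. This is a simplified stand-in that only understands plain K/M/G suffixes; a real implementation would likely reuse the generic resource quantity parser.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseBandwidth is a simplified stand-in for the annotation parsing step: it
// turns a value like "10M" into the Kbit rate that would be placed under
// runtimeConfig.trafficShaping. Only plain K/M/G suffixes are handled.
func parseBandwidth(v string) (int64, error) {
	multipliers := []struct {
		suffix string
		kbit   int64
	}{{"K", 1}, {"M", 1000}, {"G", 1000 * 1000}}
	for _, m := range multipliers {
		if strings.HasSuffix(v, m.suffix) {
			n, err := strconv.ParseInt(strings.TrimSuffix(v, m.suffix), 10, 64)
			if err != nil {
				return 0, err
			}
			return n * m.kbit, nil
		}
	}
	return 0, fmt.Errorf("unsupported bandwidth value %q", v)
}

func main() {
	annotations := map[string]string{
		"kubernetes.io/ingress-bandwidth": "10M",
		"kubernetes.io/egress-bandwidth":  "10M",
	}
	ingress, _ := parseBandwidth(annotations["kubernetes.io/ingress-bandwidth"])
	egress, _ := parseBandwidth(annotations["kubernetes.io/egress-bandwidth"])
	fmt.Printf("ingressRate=%d Kbit, egressRate=%d Kbit\n", ingress, egress)
}
```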
\ No newline at end of file diff --git a/contributors/design-proposals/node/OWNERS b/contributors/design-proposals/node/OWNERS deleted file mode 100644 index 810bc689..00000000 --- a/contributors/design-proposals/node/OWNERS +++ /dev/null @@ -1,8 +0,0 @@ -# See the OWNERS docs at https://go.k8s.io/owners - -reviewers: - - sig-node-leads -approvers: - - sig-node-leads -labels: - - sig/node diff --git a/contributors/design-proposals/node/accelerator-monitoring.md b/contributors/design-proposals/node/accelerator-monitoring.md index 5c247c19..f0fbec72 100644 --- a/contributors/design-proposals/node/accelerator-monitoring.md +++ b/contributors/design-proposals/node/accelerator-monitoring.md @@ -1,101 +1,6 @@ -# Monitoring support for hardware accelerators +Design proposals have been archived. -Version: Alpha +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Owner: @mindprince (agarwalrohit@google.com) -## Motivation - -We have had alpha support for running containers with GPUs attached in Kubernetes for a while. To take this to beta and GA, we need to provide GPU monitoring, so that users can get insights into how their GPU jobs are performing. - -## Detailed Design - -The current metrics pipeline for Kubernetes is: -- Container level metrics are collected by [cAdvisor](https://github.com/google/cadvisor). -- Kubelet embeds cAdvisor as a library. It uses its knowledge of pod-to-container mappings and the metrics from cAdvisor to expose pod level metrics as the summary API. -- [Heapster](https://github.com/kubernetes/heapster) uses kubelet’s summary API and pushes metrics to the some sink. -There are plans to change this pipeline but the details for that are still not finalized. - -To expose GPU metrics to Kubernetes users, we would need to make changes to all these components. - -First up is cAdvisor: we need to make cAdvisor collect metrics for GPUs that are attached to a container. - -The source for getting metrics for NVIDIA GPUs is [NVIDIA Management Library (NVML)](https://developer.nvidia.com/nvidia-management-library-nvml). NVML is a closed source C library [with a documented API](http://docs.nvidia.com/deploy/nvml-api/index.html). Because we want to use NVML from cAdvisor (which is written in Go), we need to [write a cgo wrapper for NVML](https://github.com/mindprince/gonvml). - -The cAdvisor binary is statically linked currently. Because we can’t statically link the closed source NVML code, we would need to make cAdvisor a dynamically linked binary. We would use `dlopen` in the cgo wrapper to dynamically load NVML. Because kubelet embeds cAdvisor, kubelet will also need to be a dynamically linked binary. In my testing, kubelet running on GCE 1.7.x clusters was found to be a dynamically linked binary already but now being dynamically linked will become a requirement. - -When cAdvisor starts up, it would read the vendor files in `/sys/bus/pci/devices/*` to see if any NVIDIA devices (vendor ID: `0x10de`) are attached to the node. -- If no NVIDIA devices are found, this code path would become dormant for the rest of cAdvisor/kubelet lifetime. -- If NVIDIA devices are found, we would start a goroutine that would check for the presence of NVML by trying to dynamically load it at regular intervals (say every minute or every 5 minutes). We need to do this regular checking instead of doing it just once because it may happen that cAdvisor is started before the nvidia drivers and nvml are installed. 
Once the NVML dynamic loading succeeds, we would use NVML’s query methods to find out how many devices exist on the node and create a map from their minor numbers to their handles and cache that map. The goroutine would exit at this point. - -If we detected the presence of NVML in the previous step, whenever a new container is detected by cAdvisor, cAdvisor would read the `devices.list` file from the container [devices cgroup](https://www.kernel.org/doc/Documentation/cgroup-v1/devices.txt). The `devices.list` file lists the major:minor number of all the devices that the container is allowed to access. If we find any device with major number `195` ([which is the major number assigned to NVIDIA devices](https://github.com/torvalds/linux/blob/v4.13/Documentation/admin-guide/devices.txt#L2583)), we would cache the list of corresponding minor numbers for that container. - -During every housekeeping operation, in addition to collecting all the existing metrics, we will use the cached nvidia device minor numbers and the map from minor numbers to device handles to get metrics for GPU devices attached to the container. - -The following new metrics would be exposed per container from cAdvisor: - -``` -type ContainerStats struct { -... - // Metrics for Accelerators. - // Each Accelerator corresponds to one element in the array. - Accelerators []AcceleratorStats `json:"accelerators,omitempty"` -... -} - -type AcceleratorStats struct { - // Make of the accelerator (nvidia, amd, google etc.) - Make string `json:"make"` - - // Model of the accelerator (tesla-p100, tesla-k80) - Model string `json:"model"` - - // ID of the accelerator. device minor number? Or UUID? - ID string `json:"id"` - - // Total accelerator memory. - // unit: bytes - MemoryTotal uint64 `json:"memory_total"` - - // Total accelerator memory allocated. - // unit: bytes - MemoryUsed uint64 `json:"memory_used"` - - // Percent of time over the past sample period during which - // the accelerator was actively processing. - DutyCycle uint64 `json:"duty_cycle"` -} -``` - -The API is generic to add support for different types of accelerators in the future even though we will only add support for NVIDIA GPUs initially. The API is inspired by what Google has in borg. - -We will update kubelet’s summary API to also add these metrics. - -From the summary API, they will flow to heapster and stackdriver. - -## Caveats -- As mentioned before, this would add a requirement that cAdvisor and kubelet are dynamically linked. -- We would need to make sure that kubelet is able to access the nvml libraries. Some existing container based nvidia driver installers install drivers in a special directory. We would need to make sure that directory is in kubelet’s `LD_LIBRARY_PATH`. - -## Testing Plan -- Adding unit tests and e2e tests to cAdvisor for this code. -- Manually testing various scenarios with nvml installed and not installed; containers running with nvidia devices attached and not attached. -- Performance/Utilization testing: impact on cAdvisor/kubelet resource usage. Impact on GPU performance when we collect metrics. - -## Alternatives Rejected -Why collect GPU metrics in cAdvisor? Why not collect them in [device plugins](/contributors/design-proposals/resource-management/device-plugin.md)? The path forward if we collected GPU metrics in device plugin is not clear and may take a lot of time to get finalized. - -Here’s a rough sketch of how things could work: - -(1) device plugin -> kubelet summary API -> heapster -> ... 
-- device plugin collects GPU metrics using the cgo wrapper. This is straightforward; in fact, it may even be easier because we don’t have to worry about making kubelet dynamically linked. -- device plugin exposes a new container-level metrics API. This is complicated. There's no good way to have a device plugin metrics API. All we can have is a device plugin metrics endpoint. We can't really define what the metrics inside that endpoint will look like because different device types can have wildly different metrics. We can't have a metrics structure that will work well both for GPUs and NICs, for example. -- We would have to make the kubelet understand whatever metrics are exposed in the device plugin metrics endpoint and expose them through the summary API. This is not ideal because device plugins are out-of-tree and controlled by vendors, so there can’t be a mapping between the metrics exposed by the device plugins and what’s exposed in the kubelet’s summary API. If we try to define such a mapping, it becomes an implicit API that new device plugins have to follow to get their metrics exposed by the kubelet, or they would have to update the mapping. - -(2) device plugin -> heapster -> ... -- If we don’t go through the kubelet, we can make heapster directly talk to the metrics endpoint exposed by the device plugin. This has the same problem as the last bullet point: how would heapster understand the metrics exposed by the device plugins so that it [can expose them to its backends](https://github.com/kubernetes/heapster/blob/v1.4.3/docs/storage-schema.md)? In addition, we would have to solve the issue of how to map containers to their pods. - -(3) device plugin -> … -- If we don’t go through kubelet or heapster, we can have the device plugins directly expose metrics to the monitoring agent. For example, device plugins can expose a /metrics endpoint in Prometheus format, and Prometheus can scrape it directly or a prom-to-sd container can send metrics from that endpoint directly to Stackdriver. This becomes a more DIY solution, where there’s no real monitoring support provided by Kubernetes; device plugin vendors are expected to add metrics to their plugins, and users/operators are expected to plumb those metrics to their metrics storage backends. This approach also requires a way to map containers to their pods. - -Once the new monitoring architecture plans are implemented, we can revisit this and maybe collect GPU metrics in device plugins instead of cAdvisor. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
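Returning to the cAdvisor approach, the `devices.list` scan described above is simple enough to sketch. The input below is fabricated, and a real implementation would likely also filter out control devices such as minor 255 (`nvidiactl`) rather than treat every major-195 entry as a GPU.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// nvidiaMajor is the device major number assigned to NVIDIA devices.
const nvidiaMajor = "195"

// parseNvidiaMinors sketches the devices.list scan described above: lines look
// like "c 195:0 rwm", and we keep the minor numbers of character devices with
// major number 195. The sample input is fabricated.
func parseNvidiaMinors(devicesList string) []int {
	var minors []int
	for _, line := range strings.Split(strings.TrimSpace(devicesList), "\n") {
		fields := strings.Fields(line)
		if len(fields) != 3 || fields[0] != "c" {
			continue
		}
		majMin := strings.SplitN(fields[1], ":", 2)
		if len(majMin) != 2 || majMin[0] != nvidiaMajor {
			continue
		}
		if minor, err := strconv.Atoi(majMin[1]); err == nil {
			minors = append(minors, minor)
		}
	}
	return minors
}

func main() {
	sample := "c 1:5 rwm\nc 195:0 rwm\nc 195:1 rwm\nc 195:255 rwm"
	// The minor numbers would be used to look up cached NVML handles during
	// housekeeping; a real implementation would likely also filter out control
	// devices such as minor 255 (nvidiactl).
	fmt.Println(parseNvidiaMinors(sample)) // [0 1 255]
}
```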
\ No newline at end of file diff --git a/contributors/design-proposals/node/all-in-one-volume.md b/contributors/design-proposals/node/all-in-one-volume.md index e1796817..f0fbec72 100644 --- a/contributors/design-proposals/node/all-in-one-volume.md +++ b/contributors/design-proposals/node/all-in-one-volume.md @@ -1,301 +1,6 @@ -## Abstract +Design proposals have been archived. -Describes a proposal for a new volume type that can project secrets, -configmaps, and downward API items. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Motivation -Users often need to build directories that contain multiple types of -configuration and secret data. For example, a configuration directory for some -software package may contain both config files and credentials. Currently, there -is no way to achieve this in Kubernetes without scripting inside of a container. - -## Constraints and Assumptions - -1. The volume types must remain unchanged for backward compatibility -2. There will be a new volume type for this proposed functionality, but no - other API changes -3. The new volume type should support atomic updates in the event of an input - change - -## Use Cases - -1. As a user, I want to automatically populate a single volume with the keys - from multiple secrets, configmaps, and with downward API information, so - that I can synthesize a single directory with various sources of - information -2. As a user, I want to populate a single volume with the keys from multiple - secrets, configmaps, and with downward API information, explicitly - specifying paths for each item, so that I can have full control over the - contents of that volume - -### Populating a single volume without pathing - -A user should be able to map any combination of resources mentioned above into a -single directory. There are plenty of examples of software that needs to be -configured both with config files and secret data. The combination of having -that data not only accessible, but in the same location provides for an easier -user experience. - -### Populating a single volume with pathing - -Currently it is possible to define the path within a volume for specific -resources. Therefore the same is true for each resource contained within the -new single volume. - -## Current State Overview - -The only way of utilizing secrets, configmaps, and downward API (while -maintaining atomic updates) currently is to access the data using separate mount -paths as shown in the volumeMounts section below: - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: volume-test -spec: - containers: - - name: container-test - image: busybox - volumeMounts: - - name: mysecret - mountPath: "/secrets" - readOnly: true - - name: podInfo - mountPath: "/podinfo" - readOnly: true - - name: config-volume - mountPath: "/config" - readOnly: true - volumes: - - name: mysecret - secret: - secretName: jpeeler-db-secret - items: - - key: username - path: my-group/my-username - - name: podInfo - downwardAPI: - items: - - path: "labels" - fieldRef: - fieldPath: metadata.labels - - path: "annotations" - fieldRef: - fieldPath: metadata.annotations - - name: config-volume - configMap: - name: special-config - items: - - key: special.how - path: path/to/special-key -``` - -## Analysis - -There are several combinations of resources that can be used at once, which -all warrant consideration. 
The combinations are listed with one instance of -each resource, but real world usage will support multiple instances of a -specific resource too. Each example was written with the expectation that all -of the resources are to be projected to the same directory (or with the same -non-root parent), though it is not strictly required. - -### ConfigMap + Secrets + Downward API - -The user wishes to deploy containers with configuration data that includes -passwords. An application using these resources could be deploying OpenStack -on Kubernetes. The configuration data may need to be assembled differently -depending on if the services are going to be used for production or for -testing. If a pod is labeled with production or testing, the downward API -selector metadata.labels can be used to produce the correct OpenStack configs. - -### ConfigMap + Secrets - -Again, the user wishes to deploy containers involving configuration data and -passwords. This time the user is executing an Ansible playbook stored as a -configmap, with some sensitive encrypted tasks that are decrypted using a vault -password file. - -### ConfigMap + Downward API - -In this case, the user wishes to generate a config including the pod’s name -(available via the metadata.name selector). This application may then pass the -pod name along with requests in order to easily determine the source without -using IP tracking. - -### Secrets + Downward API - -A user may wish to use a secret as a public key to encrypt the namespace of -the pod (available via the metadata.namespace selector). This example may be -the most contrived, but perhaps the operator wishes to use the application to -deliver the namespace information securely without using an encrypted -transport. - -### Collisions between keys when configured paths are identical - -In the event the user specifies any keys with the same path, the pod spec will -not be accepted as valid. Note the specified path for mysecret and myconfigmap -are the same: - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: volume-test -spec: - containers: - - name: container-test - image: busybox - volumeMounts: - - name: all-in-one - mountPath: "/projected-volume" - readOnly: true - volumes: - - name: all-in-one - projected: - sources: - - secret: - name: mysecret - items: - - key: username - path: my-group/data - - configMap: - name: myconfigmap - items: - - key: config - path: my-group/data -``` - - -### Collisions between keys without configured paths - -The only run time validation can occur is when all the paths are known at pod -creation, similar to the above scenario. Otherwise, when a conflict occurs the -most recent specified resource will overwrite anything preceding it (this is -true for resources that are updated after pod creation as well). - -### Collisions when one path is explicit and the other is automatically projected - -In the event that there is a collision due to a user specified path matching -data that is automatically projected, the latter resource will overwrite -anything preceding it as before. 
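A possible shape for the collision check described above, assuming all projected paths are known at pod creation time, is sketched below; real validation would operate on the API types rather than plain strings.

```go
package main

import (
	"fmt"
	"path"
)

// detectPathCollisions sketches the validation described above: when every
// projected path is known at pod creation, declaring the same path twice makes
// the spec invalid. Paths are cleaned so "a//b" and "a/b" are treated as equal.
func detectPathCollisions(paths []string) error {
	seen := map[string]bool{}
	for _, p := range paths {
		clean := path.Clean(p)
		if seen[clean] {
			return fmt.Errorf("conflicting duplicate path %q in projected volume", clean)
		}
		seen[clean] = true
	}
	return nil
}

func main() {
	// Mirrors the invalid example above: the secret and the configmap both
	// project to my-group/data.
	fmt.Println(detectPathCollisions([]string{"my-group/data", "my-group/data"}))
}
```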
- -## Code changes - -### Proposed API objects - -```go -type ProjectedVolumeSource struct { - Sources []VolumeProjection `json:"sources"` - DefaultMode *int32 `json:"defaultMode,omitempty"` -} - -type VolumeProjection struct { - Secret *SecretProjection `json:"secret,omitempty"` - ConfigMap *ConfigMapProjection `json:"configMap,omitempty"` - DownwardAPI *DownwardAPIProjection `json:"downwardAPI,omitempty"` -} - -type SecretProjection struct { - LocalObjectReference - Items []KeyToPath - Optional *bool -} - -type ConfigMapProjection struct { - LocalObjectReference - Items []KeyToPath - Optional *bool -} - -type DownwardAPIProjection struct { - Items []DownwardAPIVolumeFile -} -``` - -### Additional required modifications - -Add to the VolumeSource struct: - -```go -Projected *ProjectedVolumeSource `json:"projected,omitempty"` -// (other existing fields omitted for brevity) -``` - -The appropriate conversion code would need to be generated for v1, validations -written, and the new volume plugin code produced as well. - -## Examples - -### Sample pod spec with a secret, a downward API, and a configmap - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: volume-test -spec: - containers: - - name: container-test - image: busybox - volumeMounts: - - name: all-in-one - mountPath: "/projected-volume" - readOnly: true - volumes: - - name: all-in-one - projected: - sources: - - secret: - name: mysecret - items: - - key: username - path: my-group/my-username - - downwardAPI: - items: - - path: "labels" - fieldRef: - fieldPath: metadata.labels - - path: "cpu_limit" - resourceFieldRef: - containerName: container-test - resource: limits.cpu - - configMap: - name: myconfigmap - items: - - key: config - path: my-group/my-config -``` - -### Sample pod spec with multiple secrets with a non-default permission mode set - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: volume-test -spec: - containers: - - name: container-test - image: busybox - volumeMounts: - - name: all-in-one - mountPath: "/projected-volume" - readOnly: true - volumes: - - name: all-in-one - projected: - sources: - - secret: - name: mysecret - items: - - key: username - path: my-group/my-username - - secret: - name: mysecret2 - items: - - key: password - path: my-group/my-password - mode: 511 -``` +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
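One small note on the last example: if `mode: 511` is read as a decimal integer (as JSON integers are), it corresponds to the octal permission bits 0777, which is presumably the intent. A one-line check:

```go
package main

import "fmt"

func main() {
	// mode: 511 read as a decimal integer corresponds to octal 0777.
	fmt.Printf("decimal 511 = octal %o\n", 511)
}
```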
\ No newline at end of file diff --git a/contributors/design-proposals/node/annotations-downward-api.md b/contributors/design-proposals/node/annotations-downward-api.md index dcad5ab1..f0fbec72 100644 --- a/contributors/design-proposals/node/annotations-downward-api.md +++ b/contributors/design-proposals/node/annotations-downward-api.md @@ -1,64 +1,6 @@ -# Exposing annotations via environment downward API +Design proposals have been archived. -Author: Michal Rostecki \<michal@kinvolk.io\> +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Introduction - -Annotations of the pod can be taken through the Kubernetes API, but currently -there is no way to pass them to the application inside the container. This means -that annotations can be used by the core Kubernetes services and the user outside -of the Kubernetes cluster. - -Of course using Kubernetes API from the application running inside the container -managed by Kubernetes is technically possible, but that's an idea which denies -the principles of microservices architecture. - -The purpose of the proposal is to allow to pass the annotation as the environment -variable to the container. - -### Use-case - -The primary usecase for this proposal are StatefulSets. There is an idea to expose -StatefulSet index to the applications running inside the pods managed by StatefulSet. -Since StatefulSet creates pods as the API objects, passing this index as an -annotation seems to be a valid way to do this. However, to finally pass this -information to the containerized application, we need to pass this annotation. -That's why the downward API for annotations is needed here. - -## API - -The exact `fieldPath` to the annotation will have the following syntax: - -``` -metadata.annotations['annotationKey'] -``` - -Which means that: -- the *annotationKey* will be specified inside brackets (`[`, `]`) and single quotation - marks (`'`) -- if the *annotationKey* contains `[`, `]` or `'` characters inside, they will need to - be escaped (like `\[`, `\]`, `\'`) and having these characters unescaped should result - in validation error - -Examples: -- `metadata.annotations['spec.pod.beta.kubernetes.io/statefulset-index']` -- `metadata.annotations['foo.bar/example-annotation']` -- `metadata.annotations['foo.bar/more\'complicated\]example\[with\'characters"to-escape']` - -So, assuming that we would want to pass the `pod.beta.kubernetes.io/statefulset-index` -annotation as a `STATEFULSET_INDEX` variable, the environment variable definition -will look like: - -``` -env: - - name: STATEFULSET_INDEX - valueFrom: - fieldRef: - fieldPath: metadata.annotations['spec.pod.beta.kubernetes.io/statefulset-index'] -``` - -## Implementation - -In general, this environment downward API part will be implemented in the same -place as the other metadata - as a label conversion function. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
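For the annotations downward API proposal above, the bracketed `fieldPath` syntax and its escaping rules are the trickiest part. The following is a minimal parsing sketch; `splitAnnotationFieldPath` is a hypothetical name and this is not the actual Kubernetes fieldpath implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// splitAnnotationFieldPath is a hypothetical parser for the syntax described
// above: metadata.annotations['<annotationKey>'], where [, ] and ' inside the
// key must be escaped as \[, \] and \'. It returns the unescaped key.
func splitAnnotationFieldPath(fieldPath string) (string, error) {
	const prefix = "metadata.annotations['"
	const suffix = "']"
	if len(fieldPath) < len(prefix)+len(suffix) ||
		!strings.HasPrefix(fieldPath, prefix) || !strings.HasSuffix(fieldPath, suffix) {
		return "", fmt.Errorf("%q is not of the form metadata.annotations['<key>']", fieldPath)
	}
	raw := fieldPath[len(prefix) : len(fieldPath)-len(suffix)]

	var key strings.Builder
	escaped := false
	for _, r := range raw {
		switch {
		case escaped:
			if r != '[' && r != ']' && r != '\'' {
				return "", fmt.Errorf("invalid escape sequence \\%c in %q", r, fieldPath)
			}
			key.WriteRune(r)
			escaped = false
		case r == '\\':
			escaped = true
		case r == '[' || r == ']' || r == '\'':
			// Unescaped brackets or quotes inside the key are a validation error.
			return "", fmt.Errorf("unescaped %q inside annotation key in %q", string(r), fieldPath)
		default:
			key.WriteRune(r)
		}
	}
	if escaped {
		return "", fmt.Errorf("trailing escape character in %q", fieldPath)
	}
	return key.String(), nil
}

func main() {
	for _, p := range []string{
		"metadata.annotations['spec.pod.beta.kubernetes.io/statefulset-index']",
		"metadata.annotations['foo.bar/example-annotation']",
	} {
		key, err := splitAnnotationFieldPath(p)
		fmt.Println(key, err)
	}
}
```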
\ No newline at end of file diff --git a/contributors/design-proposals/node/container-init.md b/contributors/design-proposals/node/container-init.md index e26f92b4..f0fbec72 100644 --- a/contributors/design-proposals/node/container-init.md +++ b/contributors/design-proposals/node/container-init.md @@ -1,440 +1,6 @@ -# Pod initialization +Design proposals have been archived. -@smarterclayton +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -March 2016 - -## Proposal and Motivation - -Within a pod there is a need to initialize local data or adapt to the current -cluster environment that is not easily achieved in the current container model. -Containers start in parallel after volumes are mounted, leaving no opportunity -for coordination between containers without specialization of the image. If -two containers need to share common initialization data, both images must -be altered to cooperate using filesystem or network semantics, which introduces -coupling between images. Likewise, if an image requires configuration in order -to start and that configuration is environment dependent, the image must be -altered to add the necessary templating or retrieval. - -This proposal introduces the concept of an **init container**, one or more -containers started in sequence before the pod's normal containers are started. -These init containers may share volumes, perform network operations, and perform -computation prior to the start of the remaining containers. They may also, by -virtue of their sequencing, block or delay the startup of application containers -until some precondition is met. In this document we refer to the existing pod -containers as **app containers**. - -This proposal also provides a high level design of **volume containers**, which -initialize a particular volume, as a feature that specializes some of the tasks -defined for init containers. The init container design anticipates the existence -of volume containers and highlights where they will take future work - -## Design Points - -* Init containers should be able to: - * Perform initialization of shared volumes - * Download binaries that will be used in app containers as execution targets - * Inject configuration or extension capability to generic images at startup - * Perform complex templating of information available in the local environment - * Initialize a database by starting a temporary execution process and applying - schema info. - * Delay the startup of application containers until preconditions are met - * Register the pod with other components of the system -* Reduce coupling: - * Between application images, eliminating the need to customize those images for - Kubernetes generally or specific roles - * Inside of images, by specializing which containers perform which tasks - (install git into init container, use filesystem contents - in web container) - * Between initialization steps, by supporting multiple sequential init containers -* Init containers allow simple start preconditions to be implemented that are - decoupled from application code - * The order init containers start should be predictable and allow users to easily - reason about the startup of a container - * Complex ordering and failure will not be supported - all complex workflows can - if necessary be implemented inside of a single init container, and this proposal - aims to enable that ordering without adding undue complexity to the system. 
- Pods in general are not intended to support DAG workflows. -* Both run-once and run-forever pods should be able to use init containers -* As much as possible, an init container should behave like an app container - to reduce complexity for end users, for clients, and for divergent use cases. - An init container is a container with the minimum alterations to accomplish - its goal. -* Volume containers should be able to: - * Perform initialization of a single volume - * Start in parallel - * Perform computation to initialize a volume, and delay start until that - volume is initialized successfully. - * Using a volume container that does not populate a volume to delay pod start - (in the absence of init containers) would be an abuse of the goal of volume - containers. -* Container pre-start hooks are not sufficient for all initialization cases: - * They cannot easily coordinate complex conditions across containers - * They can only function with code in the image or code in a shared volume, - which would have to be statically linked (not a common pattern in wide use) - * They cannot be implemented with the current Docker implementation - see - [#140](https://github.com/kubernetes/kubernetes/issues/140) - - - -## Alternatives - -* Any mechanism that runs user code on a node before regular pod containers - should itself be a container and modeled as such - we explicitly reject - creating new mechanisms for running user processes. -* The container pre-start hook (not yet implemented) requires execution within - the container's image and so cannot adapt existing images. It also cannot - block startup of containers -* Running a "pre-pod" would defeat the purpose of the pod being an atomic - unit of scheduling. - - -## Design - -Each pod may have 0..N init containers defined along with the existing -1..M app containers. - -On startup of the pod, after the network and volumes are initialized, the -init containers are started in order. Each container must exit successfully -before the next is invoked. If a container fails to start (due to the runtime) -or exits with failure, it is retried according to the pod RestartPolicy. -RestartPolicyNever pods will immediately fail and exit. RestartPolicyAlways -pods will retry the failing init container with increasing backoff until it -succeeds. To align with the design of application containers, init containers -will only support "infinite retries" (RestartPolicyAlways) or "no retries" -(RestartPolicyNever). - -A pod cannot be ready until all init containers have succeeded. The ports -on an init container are not aggregated under a service. A pod that is -being initialized is in the `Pending` phase but should have a distinct -condition. Each app container and all future init containers should have -the reason `PodInitializing`. The pod should have a condition `Initializing` -set to `false` until all init containers have succeeded, and `true` thereafter. -If the pod is restarted, the `Initializing` condition should be set to `false`. - -If the pod is "restarted" all containers stopped and started due to -a node restart, change to the pod definition, or admin interaction, all -init containers must execute again. 
Restartable conditions are defined as: - -* An init container image is changed -* The pod infrastructure container is restarted (shared namespaces are lost) -* The Kubelet detects that all containers in a pod are terminated AND - no record of init container completion is available on disk (due to GC) - -Changes to the init container spec are limited to the container image field. -Altering the container image field is equivalent to restarting the pod. - -Because init containers can be restarted, retried, or reexecuted, container -authors should make their init behavior idempotent by handling volumes that -are already populated or the possibility that this instance of the pod has -already contacted a remote system. - -Each init container has all of the fields of an app container. The following -fields are prohibited from being used on init containers by validation: - -* `readinessProbe` - init containers must exit for pod startup to continue, - are not included in rotation, and so cannot define readiness distinct from - completion. - -Init container authors may use `activeDeadlineSeconds` on the pod and -`livenessProbe` on the container to prevent init containers from failing -forever. The active deadline includes init containers. - -Because init containers are semantically different in lifecycle from app -containers (they are run serially, rather than in parallel), for backwards -compatibility and design clarity they will be identified as distinct fields -in the API: - - pod: - spec: - containers: ... - initContainers: - - name: init-container1 - image: ... - ... - - name: init-container2 - ... - status: - containerStatuses: ... - initContainerStatuses: - - name: init-container1 - ... - - name: init-container2 - ... - -This separation also serves to make the order of container initialization -clear - init containers are executed in the order that they appear, then all -app containers are started at once. - -The name of each app and init container in a pod must be unique - it is a -validation error for any container to share a name. - -While init containers are in alpha state, they will be serialized as an annotation -on the pod with the name `pod.alpha.kubernetes.io/init-containers` and the status -of the containers will be stored as `pod.alpha.kubernetes.io/init-container-statuses`. -Mutation of these annotations is prohibited on existing pods. - - -### Resources - -Given the ordering and execution for init containers, the following rules -for resource usage apply: - -* The highest of any particular resource request or limit defined on all init - containers is the **effective init request/limit** -* The pod's **effective request/limit** for a resource is the higher of: - * sum of all app containers request/limit for a resource - * effective init request/limit for a resource -* Scheduling is done based on effective requests/limits, which means - init containers can reserve resources for initialization that are not used - during the life of the pod. -* The lowest QoS tier of init containers per resource is the **effective init QoS tier**, - and the highest QoS tier of both init containers and regular containers is the - **effective pod QoS tier**. 
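As a rough illustration of these rules, here is a minimal sketch, with resources reduced to plain integer maps and a hypothetical `effective` helper rather than the real quantity types used by the API:

```go
package main

import "fmt"

// container is a pared-down stand-in for a pod container: resource name
// ("cpu" in millicores, "memory" in bytes) mapped to a plain integer quantity.
type container struct {
	Requests map[string]int64
	Limits   map[string]int64
}

// effective returns the pod-level effective value for one resource, following
// the rules above: the higher of (sum over app containers) and (max over init
// containers). Pass limits=false to compute the effective request instead.
func effective(resource string, initContainers, appContainers []container, limits bool) int64 {
	pick := func(c container) int64 {
		if limits {
			return c.Limits[resource]
		}
		return c.Requests[resource]
	}
	var sumApps, maxInit int64
	for _, c := range appContainers {
		sumApps += pick(c)
	}
	for _, c := range initContainers {
		if v := pick(c); v > maxInit {
			maxInit = v
		}
	}
	if maxInit > sumApps {
		return maxInit
	}
	return sumApps
}

func main() {
	// The first worked example below: init limits of 100m/1GiB and 50m/2GiB,
	// and two app containers with limits of 10m/1100MiB each.
	const MiB = 1 << 20
	initContainers := []container{
		{Limits: map[string]int64{"cpu": 100, "memory": 1024 * MiB}},
		{Limits: map[string]int64{"cpu": 50, "memory": 2048 * MiB}},
	}
	appContainers := []container{
		{Limits: map[string]int64{"cpu": 10, "memory": 1100 * MiB}},
		{Limits: map[string]int64{"cpu": 10, "memory": 1100 * MiB}},
	}
	fmt.Println("effective cpu limit (m):", effective("cpu", initContainers, appContainers, true))
	fmt.Println("effective memory limit (MiB):", effective("memory", initContainers, appContainers, true)/MiB)
}
```

The proposal's worked examples below apply the same rules by hand.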
- -So the following pod: - - pod: - spec: - initContainers: - - limits: - cpu: 100m - memory: 1GiB - - limits: - cpu: 50m - memory: 2GiB - containers: - - limits: - cpu: 10m - memory: 1100MiB - - limits: - cpu: 10m - memory: 1100MiB - -has an effective pod limit of `cpu: 100m`, `memory: 2200MiB` (highest init -container cpu is larger than sum of all app containers, sum of container -memory is larger than the max of all init containers). The scheduler, node, -and quota must respect the effective pod request/limit. - -In the absence of a defined request or limit on a container, the effective -request/limit will be applied. For example, the following pod: - - pod: - spec: - initContainers: - - limits: - cpu: 100m - memory: 1GiB - containers: - - request: - cpu: 10m - memory: 1100MiB - -will have an effective request of `10m / 1100MiB`, and an effective limit -of `100m / 1GiB`, i.e.: - - pod: - spec: - initContainers: - - request: - cpu: 10m - memory: 1GiB - - limits: - cpu: 100m - memory: 1100MiB - containers: - - request: - cpu: 10m - memory: 1GiB - - limits: - cpu: 100m - memory: 1100MiB - -and thus have the QoS tier **Burstable** (because request is not equal to -limit). - -Quota and limits will be applied based on the effective pod request and -limit. - -Pod level cGroups will be based on the effective pod request and limit, the -same as the scheduler. - - -### Kubelet and container runtime details - -Container runtimes should treat the set of init and app containers as one -large pool. An individual init container execution should be identical to -an app container, including all standard container environment setup -(network, namespaces, hostnames, DNS, etc). - -All app container operations are permitted on init containers. The -logs for an init container should be available for the duration of the pod -lifetime or until the pod is restarted. - -During initialization, app container status should be shown with the reason -PodInitializing if any init containers are present. Each init container -should show appropriate container status, and all init containers that are -waiting for earlier init containers to finish should have the `reason` -PendingInitialization. - -The container runtime should aggressively prune failed init containers. -The container runtime should record whether all init containers have -succeeded internally, and only invoke new init containers if a pod -restart is needed (for Docker, if all containers terminate or if the pod -infra container terminates). Init containers should follow backoff rules -as necessary. The Kubelet *must* preserve at least the most recent instance -of an init container to serve logs and data for end users and to track -failure states. The Kubelet *should* prefer to garbage collect completed -init containers over app containers, as long as the Kubelet is able to -track that initialization has been completed. In the future, container -state checkpointing in the Kubelet may remove or reduce the need to -preserve old init containers. - -For the initial implementation, the Kubelet will use the last termination -container state of the highest indexed init container to determine whether -the pod has completed initialization. During a pod restart, initialization -will be restarted from the beginning (all initializers will be rerun). - - -### API Behavior - -All APIs that access containers by name should operate on both init and -app containers. Because names are unique the addition of the init container -should be transparent to use cases. 
- -A client with no knowledge of init containers should see appropriate -container status `reason` and `message` fields while the pod is in the -`Pending` phase, and so be able to communicate that to end users. - - -### Example init containers - -* Wait for a service to be created - - pod: - spec: - initContainers: - - name: wait - image: centos:centos7 - command: ["/bin/sh", "-c", "for i in {1..100}; do sleep 1; if dig myservice; then exit 0; fi; exit 1"] - containers: - - name: run - image: application-image - command: ["/my_application_that_depends_on_myservice"] - -* Register this pod with a remote server - - pod: - spec: - initContainers: - - name: register - image: centos:centos7 - command: ["/bin/sh", "-c", "curl -X POST http://$MANAGEMENT_SERVICE_HOST:$MANAGEMENT_SERVICE_PORT/register -d 'instance=$(POD_NAME)&ip=$(POD_IP)'"] - env: - - name: POD_NAME - valueFrom: - field: metadata.name - - name: POD_IP - valueFrom: - field: status.podIP - containers: - - name: run - image: application-image - command: ["/my_application_that_depends_on_myservice"] - -* Wait for an arbitrary period of time - - pod: - spec: - initContainers: - - name: wait - image: centos:centos7 - command: ["/bin/sh", "-c", "sleep 60"] - containers: - - name: run - image: application-image - command: ["/static_binary_without_sleep"] - -* Clone a git repository into a volume (can be implemented by volume containers in the future): - - pod: - spec: - initContainers: - - name: download - image: image-with-git - command: ["git", "clone", "https://github.com/myrepo/myrepo.git", "/var/lib/data"] - volumeMounts: - - mountPath: /var/lib/data - volumeName: git - containers: - - name: run - image: centos:centos7 - command: ["/var/lib/data/binary"] - volumeMounts: - - mountPath: /var/lib/data - volumeName: git - volumes: - - emptyDir: {} - name: git - -* Execute a template transformation based on environment (can be implemented by volume containers in the future): - - pod: - spec: - initContainers: - - name: copy - image: application-image - command: ["/bin/cp", "mytemplate.j2", "/var/lib/data/"] - volumeMounts: - - mountPath: /var/lib/data - volumeName: data - - name: transform - image: image-with-jinja - command: ["/bin/sh", "-c", "jinja /var/lib/data/mytemplate.j2 > /var/lib/data/mytemplate.conf"] - volumeMounts: - - mountPath: /var/lib/data - volumeName: data - containers: - - name: run - image: application-image - command: ["/myapplication", "-conf", "/var/lib/data/mytemplate.conf"] - volumeMounts: - - mountPath: /var/lib/data - volumeName: data - volumes: - - emptyDir: {} - name: data - -* Perform a container build - - pod: - spec: - initContainers: - - name: copy - image: base-image - workingDir: /home/user/source-tree - command: ["make"] - containers: - - name: commit - image: image-with-docker - command: - - /bin/sh - - -c - - docker commit $(complex_bash_to_get_container_id_of_copy) \ - docker push $(commit_id) myrepo:latest - volumesMounts: - - mountPath: /var/run/docker.sock - volumeName: dockersocket - -## Backwards compatibility implications - -Since this is a net new feature in the API and Kubelet, new API servers during upgrade may not -be able to rely on Kubelets implementing init containers. The management of feature skew between -master and Kubelet is tracked in issue [#4855](https://github.com/kubernetes/kubernetes/issues/4855). 
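Tying the status rules above together, the sketch below encodes the completion rule from the Kubelet section: initialization is complete once the highest-indexed init container has terminated successfully. The types are reduced stand-ins invented for this sketch, not the real container status API.

```go
package main

import "fmt"

// containerStatus is a reduced stand-in for the status reported by the
// runtime: whether the container has terminated and with which exit code.
type containerStatus struct {
	Name       string
	Terminated bool
	ExitCode   int32
}

// initializationComplete is a sketch of the completion rule described above:
// the pod has finished initializing when the last (highest-indexed) init
// container has terminated with exit code 0.
func initializationComplete(initStatuses []containerStatus) bool {
	if len(initStatuses) == 0 {
		return true // no init containers: nothing to wait for
	}
	last := initStatuses[len(initStatuses)-1]
	return last.Terminated && last.ExitCode == 0
}

func main() {
	statuses := []containerStatus{
		{Name: "init-container1", Terminated: true, ExitCode: 0},
		{Name: "init-container2", Terminated: false},
	}
	fmt.Println(initializationComplete(statuses)) // false: init-container2 still running
	statuses[1] = containerStatus{Name: "init-container2", Terminated: true, ExitCode: 0}
	fmt.Println(initializationComplete(statuses)) // true
}
```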
- - -## Future work - -* Unify pod QoS class with init containers -* Implement container / image volumes to make composition of runtime from images efficient +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/node/container-runtime-interface-v1.md b/contributors/design-proposals/node/container-runtime-interface-v1.md index f2de2640..f0fbec72 100644 --- a/contributors/design-proposals/node/container-runtime-interface-v1.md +++ b/contributors/design-proposals/node/container-runtime-interface-v1.md @@ -1,264 +1,6 @@ -# Redefine Container Runtime Interface +Design proposals have been archived. -The umbrella issue: [#28789](https://issues.k8s.io/28789) +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Motivation -Kubelet employs a declarative pod-level interface, which acts as the sole -integration point for container runtimes (e.g., `docker` and `rkt`). The -high-level, declarative interface has caused higher integration and maintenance -cost, and also slowed down feature velocity for the following reasons. - 1. **Not every container runtime supports the concept of pods natively**. - When integrating with Kubernetes, a significant amount of work needs to - go into implementing a shim of significant size to support all pod - features. This also adds maintenance overhead (e.g., `docker`). - 2. **High-level interface discourages code sharing and reuse among runtimes**. - E.g, each runtime today implements an all-encompassing `SyncPod()` - function, with the Pod Spec as the input argument. The runtime implements - logic to determine how to achieve the desired state based on the current - status, (re-)starts pods/containers and manages lifecycle hooks - accordingly. - 3. **Pod Spec is evolving rapidly**. New features are being added constantly. - Any pod-level change or addition requires changing of all container - runtime shims. E.g., init containers and volume containers. - -## Goals and Non-Goals - -The goals of defining the interface are to - - **improve extensibility**: Easier container runtime integration. - - **improve feature velocity** - - **improve code maintainability** - -The non-goals include - - proposing *how* to integrate with new runtimes, i.e., where the shim - resides. The discussion of adopting a client-server architecture is tracked - by [#13768](https://issues.k8s.io/13768), where benefits and shortcomings of - such an architecture is discussed. - - versioning the new interface/API. We intend to provide API versioning to - offer stability for runtime integrations, but the details are beyond the - scope of this proposal. - - adding support to Windows containers. Windows container support is a - parallel effort and is tracked by [#22623](https://issues.k8s.io/22623). - The new interface will not be augmented to support Windows containers, but - it will be made extensible such that the support can be added in the future. - - re-defining Kubelet's internal interfaces. These interfaces, though, may - affect Kubelet's maintainability, is not relevant to runtime integration. - - improving Kubelet's efficiency or performance, e.g., adopting event stream - from the container runtime [#8756](https://issues.k8s.io/8756), - [#16831](https://issues.k8s.io/16831). - -## Requirements - - * Support the already integrated container runtime: `docker` and `rkt` - * Support hypervisor-based container runtimes: `hyper`. - -The existing pod-level interface will remain as it is in the near future to -ensure supports of all existing runtimes are continued. 
Meanwhile, we will -work with all parties involved to switching to the proposed interface. - - -## Container Runtime Interface - -The main idea of this proposal is to adopt an imperative container-level -interface, which allows Kubelet to directly control the lifecycles of the -containers. - -Pod is composed of a group of containers in an isolated environment with -resource constraints. In Kubernetes, pod is also the smallest schedulable unit. -After a pod has been scheduled to the node, Kubelet will create the environment -for the pod, and add/update/remove containers in that environment to meet the -Pod Spec. To distinguish between the environment and the pod as a whole, we -will call the pod environment **PodSandbox.** - -The container runtimes may interpret the PodSandBox concept differently based -on how it operates internally. For runtimes relying on hypervisor, sandbox -represents a virtual machine naturally. For others, it can be Linux namespaces. - -In short, a PodSandbox should have the following features. - - * **Isolation**: E.g., Linux namespaces or a full virtual machine, or even - support additional security features. - * **Compute resource specifications**: A PodSandbox should implement pod-level - resource demands and restrictions. - -*NOTE: The resource specification does not include externalized costs to -container setup that are not currently trackable as Pod constraints, e.g., -filesystem setup, container image pulling, etc.* - -A container in a PodSandbox maps to an application in the Pod Spec. For Linux -containers, they are expected to share at least network, IPC and sometimes PID -namespaces. PID sharing is defined in [Shared PID -Namespace](pod-pid-namespace.md). Other namespaces are discussed in -[#1615](https://issues.k8s.io/1615). - - -Below is an example of the proposed interfaces. - -```go -// PodSandboxManager contains basic operations for sandbox. -type PodSandboxManager interface { - Create(config *PodSandboxConfig) (string, error) - Delete(id string) (string, error) - List(filter PodSandboxFilter) []PodSandboxListItem - Status(id string) PodSandboxStatus -} - -// ContainerRuntime contains basic operations for containers. -type ContainerRuntime interface { - Create(config *ContainerConfig, sandboxConfig *PodSandboxConfig, PodSandboxID string) (string, error) - Start(id string) error - Stop(id string, timeout int) error - Remove(id string) error - List(filter ContainerFilter) ([]ContainerListItem, error) - Status(id string) (ContainerStatus, error) - Exec(id string, cmd []string, streamOpts StreamOptions) error -} - -// ImageService contains image-related operations. -type ImageService interface { - List() ([]Image, error) - Pull(image ImageSpec, auth AuthConfig) error - Remove(image ImageSpec) error - Status(image ImageSpec) (Image, error) - Metrics(image ImageSpec) (ImageMetrics, error) -} - -type ContainerMetricsGetter interface { - ContainerMetrics(id string) (ContainerMetrics, error) -} - -All functions listed above are expected to be thread-safe. -``` - -### Pod/Container Lifecycle - -The PodSandbox's lifecycle is decoupled from the containers, i.e., a sandbox -is created before any containers, and can exist after all containers in it have -terminated. - -Assume there is a pod with a single container C. 
To start a pod: - -``` - create sandbox Foo --> create container C --> start container C -``` - -To delete a pod: - -``` - stop container C --> remove container C --> delete sandbox Foo -``` - -The container runtime must not apply any transition (such as starting a new -container) unless explicitly instructed by Kubelet. It is Kubelet's -responsibility to enforce garbage collection, restart policy, and otherwise -react to changes in lifecycle. - -The only transitions that are possible for a container are described below: - -``` -() -> Created // A container can only transition to created from the - // empty, nonexistent state. The ContainerRuntime.Create - // method causes this transition. -Created -> Running // The ContainerRuntime.Start method may be applied to a - // Created container to move it to Running -Running -> Exited // The ContainerRuntime.Stop method may be applied to a running - // container to move it to Exited. - // A container may also make this transition under its own volition -Exited -> () // An exited container can be moved to the terminal empty - // state via a ContainerRuntime.Remove call. -``` - - -Kubelet is also responsible for gracefully terminating all the containers -in the sandbox before deleting the sandbox. If Kubelet chooses to delete -the sandbox with running containers in it, those containers should be forcibly -deleted. - -Note that every PodSandbox/container lifecycle operation (create, start, -stop, delete) should either return an error or block until the operation -succeeds. A successful operation should include a state transition of the -PodSandbox/container. E.g., if a `Create` call for a container does not -return an error, the container state should be "created" when the runtime is -queried. - -### Updates to PodSandbox or Containers - -Kubernetes support updates only to a very limited set of fields in the Pod -Spec. These updates may require containers to be re-created by Kubelet. This -can be achieved through the proposed, imperative container-level interface. -On the other hand, PodSandbox update currently is not required. - - -### Container Lifecycle Hooks - -Kubernetes supports post-start and pre-stop lifecycle hooks, with ongoing -discussion for supporting pre-start and post-stop hooks in -[#140](https://issues.k8s.io/140). - -These lifecycle hooks will be implemented by Kubelet via `Exec` calls to the -container runtime. This frees the runtimes from having to support hooks -natively. - -Illustration of the container lifecycle and hooks: - -``` - pre-start post-start pre-stop post-stop - | | | | - exec exec exec exec - | | | | - create --------> start ----------------> stop --------> remove -``` - -In order for the lifecycle hooks to function as expected, the `Exec` call -will need access to the container's filesystem (e.g., mount namespaces). - -### Extensibility - -There are several dimensions for container runtime extensibility. - - Host OS (e.g., Linux) - - PodSandbox isolation mechanism (e.g., namespaces or VM) - - PodSandbox OS (e.g., Linux) - -As mentioned previously, this proposal will only address the Linux based -PodSandbox and containers. All Linux-specific configuration will be grouped -into one field. A container runtime is required to enforce all configuration -applicable to its platform, and should return an error otherwise. - -### Keep it minimal - -The proposed interface is experimental, i.e., it will go through (many) changes -until it stabilizes. 
The principle is to keep the interface minimal and -extend it later if needed. This includes a several features that are still in -discussion and may be achieved alternatively: - - * `AttachContainer`: [#23335](https://issues.k8s.io/23335) - * `PortForward`: [#25113](https://issues.k8s.io/25113) - -## Alternatives - -**[Status quo] Declarative pod-level interface** - - Pros: No changes needed. - - Cons: All the issues stated in #motivation - -**Allow integration at both pod- and container-level interfaces** - - Pros: Flexibility. - - Cons: All the issues stated in #motivation - -**Imperative pod-level interface** -The interface contains only CreatePod(), StartPod(), StopPod() and RemovePod(). -This implies that the runtime needs to take over container lifecycle -management (i.e., enforce restart policy), lifecycle hooks, liveness checks, -etc. Kubelet will mainly be responsible for interfacing with the apiserver, and -can potentially become a very thin daemon. - - Pros: Lower maintenance overhead for the Kubernetes maintainers if `Docker` - shim maintenance cost is discounted. - - Cons: This will incur higher integration cost because every new container - runtime needs to implement all the features and need to understand the - concept of pods. This would also lead to lower feature velocity because the - interface will need to be changed, and the new pod-level feature will need - to be supported in each runtime. - -## Related Issues - - * Metrics: [#27097](https://issues.k8s.io/27097) - * Log management: [#24677](https://issues.k8s.io/24677) +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
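To make the lifecycle ordering above concrete, here is a small sketch of a caller driving the proposed interfaces: create the sandbox, then create and start each container, with teardown in the reverse order. The trimmed interfaces, placeholder config types, and in-memory fakes exist only for this sketch and are not part of the proposed API.

```go
package main

import "fmt"

// Trimmed versions of the PodSandboxManager and ContainerRuntime interfaces
// proposed above, reduced to the calls used in this sketch.
type sandboxManager interface {
	Create(config *PodSandboxConfig) (string, error)
	Delete(id string) (string, error)
}

type containerRuntime interface {
	Create(config *ContainerConfig, sandboxConfig *PodSandboxConfig, podSandboxID string) (string, error)
	Start(id string) error
	Stop(id string, timeout int) error
	Remove(id string) error
}

// Minimal config placeholders; the proposal defines much richer types.
type PodSandboxConfig struct{ Name string }
type ContainerConfig struct{ Name string }

// startPod follows the startup order above: create the sandbox, then create
// and start each container. Error handling is reduced to early returns; the
// real Kubelet reconciles state and applies the restart policy.
func startPod(sm sandboxManager, cr containerRuntime, sbCfg *PodSandboxConfig, cCfgs []*ContainerConfig) error {
	sandboxID, err := sm.Create(sbCfg)
	if err != nil {
		return fmt.Errorf("create sandbox: %v", err)
	}
	for _, cCfg := range cCfgs {
		containerID, err := cr.Create(cCfg, sbCfg, sandboxID)
		if err != nil {
			return fmt.Errorf("create container %s: %v", cCfg.Name, err)
		}
		if err := cr.Start(containerID); err != nil {
			return fmt.Errorf("start container %s: %v", cCfg.Name, err)
		}
	}
	return nil
}

// stopPod mirrors the teardown order above: stop and remove each container,
// then delete the sandbox.
func stopPod(sm sandboxManager, cr containerRuntime, sandboxID string, containerIDs []string, gracePeriod int) error {
	for _, id := range containerIDs {
		if err := cr.Stop(id, gracePeriod); err != nil {
			return err
		}
		if err := cr.Remove(id); err != nil {
			return err
		}
	}
	_, err := sm.Delete(sandboxID)
	return err
}

// Trivial in-memory fakes, present only so the sketch runs on its own.
type fakeSandboxes struct{}

func (fakeSandboxes) Create(*PodSandboxConfig) (string, error) { return "sandbox-foo", nil }
func (fakeSandboxes) Delete(id string) (string, error)         { return id, nil }

type fakeContainers struct{ started []string }

func (f *fakeContainers) Create(c *ContainerConfig, _ *PodSandboxConfig, sb string) (string, error) {
	return sb + "/" + c.Name, nil
}
func (f *fakeContainers) Start(id string) error       { f.started = append(f.started, id); return nil }
func (f *fakeContainers) Stop(id string, _ int) error { return nil }
func (f *fakeContainers) Remove(id string) error      { return nil }

func main() {
	cr := &fakeContainers{}
	if err := startPod(fakeSandboxes{}, cr, &PodSandboxConfig{Name: "foo"}, []*ContainerConfig{{Name: "C"}}); err != nil {
		panic(err)
	}
	fmt.Println("started:", cr.started)
	fmt.Println(stopPod(fakeSandboxes{}, cr, "sandbox-foo", cr.started, 30))
}
```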
\ No newline at end of file diff --git a/contributors/design-proposals/node/cpu-manager.md b/contributors/design-proposals/node/cpu-manager.md index 2dde3b6f..f0fbec72 100644 --- a/contributors/design-proposals/node/cpu-manager.md +++ b/contributors/design-proposals/node/cpu-manager.md @@ -1,424 +1,6 @@ -# CPU Manager +Design proposals have been archived. -_Authors:_ +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -* @ConnorDoyle - Connor Doyle <connor.p.doyle@intel.com> -* @flyingcougar - Szymon Scharmach <szymon.scharmach@intel.com> -* @sjenning - Seth Jennings <sjenning@redhat.com> -**Contents:** - -* [Overview](#overview) -* [Proposed changes](#proposed-changes) -* [Operations and observability](#operations-and-observability) -* [Practical challenges](#practical-challenges) -* [Implementation roadmap](#implementation-roadmap) -* [Appendix A: cpuset pitfalls](#appendix-a-cpuset-pitfalls) - -## Overview - -_Problems to solve:_ - -1. Poor or unpredictable performance observed compared to virtual machine - based orchestration systems. Application latency and lower CPU - throughput compared to VMs due to cpu quota being fulfilled across all - cores, rather than exclusive cores, which results in fewer context - switches and higher cache affinity. -1. Unacceptable latency attributed to the OS process scheduler, especially - for “fast” virtual network functions (want to approach line rate on - modern server NICs.) - -_Solution requirements:_ - -1. Provide an API-driven contract from the system to a user: "if you are a - Guaranteed pod with 1 or more cores of cpu, the system will try to make - sure that the pod gets its cpu quota primarily from reserved core(s), - resulting in fewer context switches and higher cache affinity". -1. Support the case where in a given pod, one container is latency-critical - and another is not (e.g. auxiliary side-car containers responsible for - log forwarding, metrics collection and the like.) -1. Do not cap CPU quota for guaranteed containers that are granted - exclusive cores, since that would be antithetical to (1) above. -1. Take physical processor topology into account in the CPU affinity policy. - -### Related issues - -* Feature: [Further differentiate performance characteristics associated - with pod level QoS](https://github.com/kubernetes/features/issues/276) -* Feature: [Add CPU Manager for pod cpuset - assignment](https://github.com/kubernetes/features/issues/375) - -## Proposed changes - -### CPU Manager component - -The *CPU Manager* is a new software component in Kubelet responsible for -assigning pod containers to sets of CPUs on the local node. In later -phases, the scope will expand to include caches, a critical shared -processor resource. - -The kuberuntime notifies the CPU manager when containers come and -go. The first such notification occurs in between the container runtime -interface calls to create and start the container. The second notification -occurs after the container is stopped by the container runtime. The CPU -Manager writes CPU settings for containers using a new CRI method named -[`UpdateContainerResources`](https://github.com/kubernetes/kubernetes/pull/46105). -This new method is invoked from two places in the CPU manager: during each -call to `AddContainer` and also periodically from a separate -reconciliation loop. - - - -_CPU Manager block diagram. 
`Policy`, `State`, and `Topology` types are -factored out of the CPU Manager to promote reuse and to make it easier -to build and test new policies. The shared state abstraction allows -other Kubelet components to be agnostic of the CPU manager policy for -observability and checkpointing extensions._ - -#### Discovering CPU topology - -The CPU Manager must understand basic topology. First of all, it must -determine the number of logical CPUs (hardware threads) available for -allocation. On architectures that support [hyper-threading][ht], sibling -threads share a number of hardware resources including the cache -hierarchy. On multi-socket systems, logical CPUs co-resident on a socket -share L3 cache. Although there may be some programs that benefit from -disjoint caches, the policies described in this proposal assume cache -affinity will yield better application and overall system performance for -most cases. In all scenarios described below, we prefer to acquire logical -CPUs topologically. For example, allocating two CPUs on a system that has -hyper-threading turned on yields both sibling threads on the same -physical core. Likewise, allocating two CPUs on a non-hyper-threaded -system yields two cores on the same socket. - -**Decision:** Initially the CPU Manager will re-use the existing discovery -mechanism in cAdvisor. - -Alternate options considered for discovering topology: - -1. Read and parse the virtual file [`/proc/cpuinfo`][procfs] and construct a - convenient data structure. -1. Execute a simple program like `lscpu -p` in a subprocess and construct a - convenient data structure based on the output. Here is an example of - [data structure to represent CPU topology][topo] in go. The linked package - contains code to build a ThreadSet from the output of `lscpu -p`. -1. Execute a mature external topology program like [`mpi-hwloc`][hwloc] -- - potentially adding support for the hwloc file format to the Kubelet. - -#### CPU Manager interfaces (sketch) - -```go -type State interface { - GetCPUSet(containerID string) (cpuset.CPUSet, bool) - GetDefaultCPUSet() cpuset.CPUSet - GetCPUSetOrDefault(containerID string) cpuset.CPUSet - SetCPUSet(containerID string, cpuset CPUSet) - SetDefaultCPUSet(cpuset CPUSet) - Delete(containerID string) -} - -type Manager interface { - Start(ActivePodsFunc, status.PodStatusProvider, runtimeService) - AddContainer(p *Pod, c *Container, containerID string) error - RemoveContainer(containerID string) error - State() state.Reader -} - -type Policy interface { - Name() string - Start(s state.State) - AddContainer(s State, pod *Pod, container *Container, containerID string) error - RemoveContainer(s State, containerID string) error -} - -type CPUSet map[int]struct{} // set operations and parsing/formatting helpers - -type CPUTopology // convenient type for querying and filtering CPUs -``` - -#### Configuring the CPU Manager - -Kubernetes will ship with three CPU manager policies. Only one policy is -active at a time on a given node, chosen by the operator via Kubelet -configuration. The three policies are **none**, **static** and **dynamic**. - -The active CPU manager policy is set through a new Kubelet -configuration value `--cpu-manager-policy`. The default value is `none`. - -The CPU manager periodically writes resource updates through the CRI in -order to reconcile in-memory cpuset assignments with cgroupfs. The -reconcile frequency is set through a new Kubelet configuration value -`--cpu-manager-reconcile-period`. 
If not specified, it defaults to the -same duration as `--node-status-update-frequency` (which itself defaults -to 10 seconds at time of writing.) - -Each policy is described below. - -#### Policy 1: "none" cpuset control [default] - -This policy preserves the existing Kubelet behavior of doing nothing -with the cgroup `cpuset.cpus` and `cpuset.mems` controls. This "none" -policy would become the default CPU Manager policy until the effects of -the other policies are better understood. - -#### Policy 2: "static" cpuset control - -The "static" policy allocates exclusive CPUs for containers if they are -included in a pod of "Guaranteed" [QoS class][qos] and the container's -resource limit for the CPU resource is an integer greater than or -equal to one. All other containers share a set of CPUs. - -When exclusive CPUs are allocated for a container, those CPUs are -removed from the allowed CPUs of every other container running on the -node. Once allocated at pod admission time, an exclusive CPU remains -assigned to a single container for the lifetime of the pod (until it -becomes terminal.) - -The Kubelet requires the total CPU reservation from `--kube-reserved` -and `--system-reserved` to be greater than zero when the static policy is -enabled. This is because zero CPU reservation would allow the shared pool to -become empty. The set of reserved CPUs is taken in order of ascending -physical core ID. Operator documentation will be updated to explain how to -configure the system to use the low-numbered physical cores for kube-reserved -and system-reserved cgroups. - -Workloads that need to know their own CPU mask, e.g. for managing -thread-level affinity, can read it from the virtual file `/proc/self/status`: - -``` -$ grep -i cpus /proc/self/status -Cpus_allowed: 77 -Cpus_allowed_list: 0-2,4-6 -``` - -Note that containers running in the shared cpuset should not attempt any -application-level CPU affinity of their own, as those settings may be -overwritten without notice (whenever exclusive cores are -allocated or deallocated.) - -##### Implementation sketch - -The static policy maintains the following sets of logical CPUs: - -- **SHARED:** Burstable, BestEffort, and non-integral Guaranteed containers - run here. Initially this contains all CPU IDs on the system. As - exclusive allocations are created and destroyed, this CPU set shrinks - and grows, accordingly. This is stored in the state as the default - CPU set. - -- **RESERVED:** A subset of the shared pool which is not exclusively - allocatable. The membership of this pool is static for the lifetime of - the Kubelet. The size of the reserved pool is the ceiling of the total - CPU reservation from `--kube-reserved` and `--system-reserved`. - Reserved CPUs are taken topologically starting with lowest-indexed - physical core, as reported by cAdvisor. - -- **ASSIGNABLE:** Equal to `SHARED - RESERVED`. Exclusive CPUs are allocated - from this pool. - -- **EXCLUSIVE ALLOCATIONS:** CPU sets assigned exclusively to one container. - These are stored as explicit assignments in the state. - -When an exclusive allocation is made, the static policy also updates the -default cpuset in the state abstraction. The CPU manager's periodic -reconcile loop takes care of updating the cpuset in cgroupfs for any -containers that may be running in the shared pool. For this reason, -applications running within exclusively-allocated containers must tolerate -potentially sharing their allocated CPUs for up to the CPU manager -reconcile period. 
- -```go -func (p *staticPolicy) Start(s State) { - fullCpuset := cpuset.NewCPUSet() - for cpuid := 0; cpuid < p.topology.NumCPUs; cpuid++ { - fullCpuset.Add(cpuid) - } - // Figure out which cores shall not be used in shared pool - reserved, _ := takeByTopology(p.topology, fullCpuset, p.topology.NumReservedCores) - s.SetDefaultCPUSet(fullCpuset.Difference(reserved)) -} - -func (p *staticPolicy) AddContainer(s State, pod *Pod, container *Container, containerID string) error { - if numCPUs := numGuaranteedCPUs(pod, container); numCPUs != 0 { - // container should get some exclusively allocated CPUs - cpuset, err := p.allocateCPUs(s, numCPUs) - if err != nil { - return err - } - s.SetCPUSet(containerID, cpuset) - } - // container belongs in the shared pool (nothing to do; use default cpuset) - return nil -} - -func (p *staticPolicy) RemoveContainer(s State, containerID string) error { - if toRelease, ok := s.GetCPUSet(containerID); ok { - s.Delete(containerID) - s.SetDefaultCPUSet(s.GetDefaultCPUSet().Union(toRelease)) - } - return nil -} -``` - -##### Example pod specs and interpretation - -| Pod | Interpretation | -| ------------------------------------------ | ------------------------------ | -| Pod [Guaranteed]:<br /> A:<br />  cpu: 0.5 | Container **A** is assigned to the shared cpuset. | -| Pod [Guaranteed]:<br /> A:<br />  cpu: 2.0 | Container **A** is assigned two sibling threads on the same physical core (HT) or two physical cores on the same socket (no HT.)<br /><br /> The shared cpuset is shrunk to make room for the exclusively allocated CPUs. | -| Pod [Guaranteed]:<br /> A:<br />  cpu: 1.0<br /> B:<br />  cpu: 0.5 | Container **A** is assigned one exclusive CPU and container **B** is assigned to the shared cpuset. | -| Pod [Guaranteed]:<br /> A:<br />  cpu: 1.5<br /> B:<br />  cpu: 0.5 | Both containers **A** and **B** are assigned to the shared cpuset. | -| Pod [Burstable] | All containers are assigned to the shared cpuset. | -| Pod [BestEffort] | All containers are assigned to the shared cpuset. | - -##### Example scenarios and interactions - -1. _A container arrives that requires exclusive cores._ - 1. Kuberuntime calls the CRI delegate to create the container. - 1. Kuberuntime adds the container with the CPU manager. - 1. CPU manager adds the container to the static policy. - 1. Static policy acquires CPUs from the default pool, by - topological-best-fit. - 1. Static policy updates the state, adding an assignment for the new - container and removing those CPUs from the default pool. - 1. CPU manager reads container assignment from the state. - 1. CPU manager updates the container resources via the CRI. - 1. Kuberuntime calls the CRI delegate to start the container. - -1. _A container that was assigned exclusive cores terminates._ - 1. Kuberuntime removes the container with the CPU manager. - 1. CPU manager removes the container with the static policy. - 1. Static policy adds the container's assigned CPUs back to the default - pool. - 1. Kuberuntime calls the CRI delegate to remove the container. - 1. Asynchronously, the CPU manager's reconcile loop updates the - cpuset for all containers running in the shared pool. - -1. _The shared pool becomes empty._ - 1. This cannot happen. The size of the shared pool is greater than - the number of exclusively allocatable CPUs. The Kubelet requires the - total CPU reservation from `--kube-reserved` and `--system-reserved` - to be greater than zero when the static policy is enabled. 
The number - of exclusively allocatable CPUs is - `floor(capacity.cpu - allocatable.cpu)` and the shared pool initially - contains all CPUs in the system. - -#### Policy 3: "dynamic" cpuset control - -_TODO: Describe the policy._ - -Capturing discussions from resource management meetings and proposal comments: - -Unlike the static policy, when the dynamic policy allocates exclusive CPUs to -a container, the cpuset may change during the container's lifetime. If deemed -necessary, we discussed providing a signal in the following way. We could -project (a subset of) the CPU manager state into a volume visible to selected -containers. User workloads could subscribe to update events in a normal Linux -manner (e.g. inotify.) - -##### Implementation sketch - -```go -func (p *dynamicPolicy) Start(s State) { - // TODO -} - -func (p *dynamicPolicy) AddContainer(s State, pod *Pod, container *Container, containerID string) error { - // TODO -} - -func (p *dynamicPolicy) RemoveContainer(s State, containerID string) error { - // TODO -} -``` - -##### Example pod specs and interpretation - -| Pod | Interpretation | -| ------------------------------------------ | ------------------------------ | -| | | -| | | - -## Operations and observability - -* Checkpointing assignments - * The CPU Manager must be able to pick up where it left off in case the - Kubelet restarts for any reason. -* Read effective CPU assignments at runtime for alerting. This could be - satisfied by the checkpointing requirement. - -## Practical challenges - -1. Synchronizing CPU Manager state with the container runtime via the - CRI. Runc/libcontainer allows container cgroup settings to be updated - after creation, but neither the Kubelet docker shim nor the CRI - implement a similar interface. - 1. Mitigation: [PR 46105](https://github.com/kubernetes/kubernetes/pull/46105) -1. Compatibility with the `isolcpus` Linux kernel boot parameter. The operator - may want to correlate exclusive cores with the isolated CPUs, in which - case the static policy outlined above, where allocations are taken - directly from the shared pool, is too simplistic. - 1. Mitigation: defer supporting this until a new policy tailored for - use with `isolcpus` can be added. - -## Implementation roadmap - -### Phase 1: None policy [TARGET: Kubernetes v1.8] - -* Internal API exists to allocate CPUs to containers - ([PR 46105](https://github.com/kubernetes/kubernetes/pull/46105)) -* Kubelet configuration includes a CPU manager policy (initially only none) -* None policy is implemented. -* All existing unit and e2e tests pass. -* Initial unit tests pass. - -### Phase 2: Static policy [TARGET: Kubernetes v1.8] - -* Kubelet can discover "basic" CPU topology (HT-to-physical-core map) -* Static policy is implemented. -* Unit tests for static policy pass. -* e2e tests for static policy pass. -* Performance metrics for one or more plausible synthetic workloads show - benefit over none policy. - -### Phase 3: Beta support [TARGET: Kubernetes v1.9] - -* Container CPU assignments are durable across Kubelet restarts. -* Expanded user and operator docs and tutorials. - -### Later phases [TARGET: After Kubernetes v1.9] - -* Static policy also manages [cache allocation][cat] on supported platforms. -* Dynamic policy is implemented. -* Unit tests for dynamic policy pass. -* e2e tests for dynamic policy pass. -* Performance metrics for one or more plausible synthetic workloads show - benefit over none policy. -* Kubelet can discover "advanced" topology (NUMA). 
-* Node-level coordination for NUMA-dependent resource allocations, for example - devices, CPUs, memory-backed volumes including hugepages. - -## Appendix A: cpuset pitfalls - -1. [`cpuset.sched_relax_domain_level`][cpuset-files]. "controls the width of - the range of CPUs over which the kernel scheduler performs immediate - rebalancing of runnable tasks across CPUs." -1. Child cpusets must be subsets of their parents. If B is a child of A, - then B must be a subset of A. Attempting to shrink A such that B - would contain allowed CPUs not in A is not allowed (the write will - fail.) Nested cpusets must be shrunk bottom-up. By the same rationale, - nested cpusets must be expanded top-down. -1. Dynamically changing cpusets by directly writing to the sysfs would - create inconsistencies with container runtimes. -1. The `exclusive` flag. This will not be used. We will achieve - exclusivity for a CPU by removing it from all other assigned cpusets. -1. Tricky semantics when cpusets are combined with CFS shares and quota. - -[cat]: http://www.intel.com/content/www/us/en/communications/cache-monitoring-cache-allocation-technologies.html -[cpuset-files]: http://man7.org/linux/man-pages/man7/cpuset.7.html#FILES -[ht]: http://www.intel.com/content/www/us/en/architecture-and-technology/hyper-threading/hyper-threading-technology.html -[hwloc]: https://www.open-mpi.org/projects/hwloc -[node-allocatable]: /contributors/design-proposals/node/node-allocatable.md#phase-2---enforce-allocatable-on-pods -[procfs]: http://man7.org/linux/man-pages/man5/proc.5.html -[qos]: /contributors/design-proposals/node/resource-qos.md -[topo]: http://github.com/intelsdi-x/swan/tree/master/pkg/isolation/topo +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
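Relatedly, the static-policy section above notes that workloads can read their own CPU mask from `/proc/self/status`. A small sketch of doing that from Go follows; `cpusAllowedList` is a hypothetical helper for illustration and is not part of the CPU manager.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// cpusAllowedList parses the Cpus_allowed_list line of a /proc/<pid>/status
// style blob into a list of CPU IDs, e.g. "0-2,4-6" -> [0 1 2 4 5 6]. It is a
// sketch of what a workload might do to discover its own mask, per the
// static-policy section above.
func cpusAllowedList(status string) ([]int, error) {
	for _, line := range strings.Split(status, "\n") {
		if !strings.HasPrefix(line, "Cpus_allowed_list:") {
			continue
		}
		list := strings.TrimSpace(strings.TrimPrefix(line, "Cpus_allowed_list:"))
		var cpus []int
		for _, part := range strings.Split(list, ",") {
			bounds := strings.SplitN(part, "-", 2)
			lo, err := strconv.Atoi(strings.TrimSpace(bounds[0]))
			if err != nil {
				return nil, err
			}
			hi := lo
			if len(bounds) == 2 {
				if hi, err = strconv.Atoi(strings.TrimSpace(bounds[1])); err != nil {
					return nil, err
				}
			}
			for cpu := lo; cpu <= hi; cpu++ {
				cpus = append(cpus, cpu)
			}
		}
		return cpus, nil
	}
	return nil, fmt.Errorf("no Cpus_allowed_list entry found")
}

func main() {
	raw, err := os.ReadFile("/proc/self/status")
	if err != nil {
		// Fall back to the example from the text above on non-Linux systems.
		raw = []byte("Cpus_allowed:\t77\nCpus_allowed_list:\t0-2,4-6\n")
	}
	cpus, err := cpusAllowedList(string(raw))
	fmt.Println(cpus, err)
}
```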
\ No newline at end of file diff --git a/contributors/design-proposals/node/cri-dockershim-checkpoint.md b/contributors/design-proposals/node/cri-dockershim-checkpoint.md index 9f3a10b5..f0fbec72 100644 --- a/contributors/design-proposals/node/cri-dockershim-checkpoint.md +++ b/contributors/design-proposals/node/cri-dockershim-checkpoint.md @@ -1,127 +1,6 @@ -# CRI: Dockershim PodSandbox Checkpoint - -## Umbrella Issue -[#34672](https://github.com/kubernetes/kubernetes/issues/34672) - -## Background -[Container Runtime Interface (CRI)](/contributors/devel/sig-node/container-runtime-interface.md) -is an ongoing project to allow container runtimes to integrate with -kubernetes via a newly-defined API. -[Dockershim](https://github.com/kubernetes/kubernetes/blob/release-1.5/pkg/kubelet/dockershim) -is the Docker CRI implementation. This proposal aims to introduce -checkpoint mechanism in dockershim. - -## Motivation -### Why do we need checkpoint? - - -With CRI, Kubelet only passes configurations (SandboxConfig, -ContainerConfig and ImageSpec) when creating sandbox, container and -image, and only use the reference id to manage them after creation. -However, information in configuration is not only needed during creation. - -In the case of dockershim with CNI network plugin, CNI plugins needs -the same information from PodSandboxConfig at creation and deletion. - -``` -Kubelet --------------------------------- - | RunPodSandbox(PodSandboxConfig) - | StopPodSandbox(PodSandboxID) - V -Dockershim------------------------------- - | SetUpPod - | TearDownPod - V -Network Plugin--------------------------- - | ADD - | DEL - V -CNI plugin------------------------------- -``` - - -In addition, checkpoint helps to improve the reliability of dockershim. -With checkpoints, critical information for disaster recovery could be -preserved. Kubelet makes decisions based on the reported pod states -from runtime shims. Dockershim currently gathers states from docker -engine. However, in case of disaster, docker engine may lose all -container information, including the reference ids. Without necessary -information, kubelet and dockershim could not conduct proper clean up. -For example, if docker containers are removed underneath kubelet, reference -to the allocated IPs and iptables setup for the pods are also lost. -This leads to resource leak and potential iptables rule conflict. - -### Why checkpoint in dockershim? -- CNI specification does not require CNI plugins to be stateful. And CNI -specification does not provide interface to retrieve states from CNI plugins. -- Currently there is no uniform checkpoint requirements across existing runtime shims. -- Need to preserve backward compatibility for kubelet. -- Easier to maintain backward compatibility by checkpointing at a lower level. - -## PodSandbox Checkpoint -Checkpoint file will be created for each PodSandbox. Files will be -placed under `/var/lib/dockershim/sandbox/`. File name will be the -corresponding `PodSandboxID`. File content will be json encoded. -Data structure is as follows: - -```go -const schemaVersion = "v1" - -type Protocol string - -// PortMapping is the port mapping configurations of a sandbox. -type PortMapping struct { - // Protocol of the port mapping. - Protocol *Protocol `json:"protocol,omitempty"` - // Port number within the container. - ContainerPort *int32 `json:"container_port,omitempty"` - // Port number on the host. 
- HostPort *int32 `json:"host_port,omitempty"` -} - -// CheckpointData contains all types of data that can be stored in the checkpoint. -type CheckpointData struct { - PortMappings []*PortMapping `json:"port_mappings,omitempty"` -} - -// PodSandboxCheckpoint is the checkpoint structure for a sandbox -type PodSandboxCheckpoint struct { - // Version of the pod sandbox checkpoint schema. - Version string `json:"version"` - // Pod name of the sandbox. Same as the pod name in the PodSpec. - Name string `json:"name"` - // Pod namespace of the sandbox. Same as the pod namespace in the PodSpec. - Namespace string `json:"namespace"` - // Data to checkpoint for pod sandbox. - Data *CheckpointData `json:"data,omitempty"` -} -``` - - -## Workflow Changes - - -`RunPodSandbox` creates checkpoint: -``` -() --> Pull Image --> Create Sandbox Container --> (Create Sandbox Checkpoint) --> Start Sandbox Container --> Set Up Network --> () -``` - -`RemovePodSandbox` removes checkpoint: -``` -() --> Remove Sandbox --> (Remove Sandbox Checkpoint) --> () -``` - -`ListPodSandbox` need to include all PodSandboxes as long as their -checkpoint files exist. If sandbox checkpoint exists but sandbox -container could not be found, the PodSandbox object will include -PodSandboxID, namespace and name. PodSandbox state will be `PodSandboxState_SANDBOX_NOTREADY`. - -`StopPodSandbox` and `RemovePodSandbox` need to conduct proper error handling to ensure idempotency. - - - -## Future extensions -This proposal is mainly driven by networking use cases. More could be added into checkpoint. +Design proposals have been archived. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
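A minimal sketch of writing and reading the checkpoint file described above, assuming the JSON encoding and per-sandbox file naming from this proposal. The helper names are hypothetical, the structs are repeated locally so the sketch is self-contained, and the directory is parameterized so the example does not touch `/var/lib/dockershim/sandbox`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// The types below mirror the checkpoint structures defined above; only the
// fields needed for this sketch are repeated here.
const schemaVersion = "v1"

type Protocol string

type PortMapping struct {
	Protocol      *Protocol `json:"protocol,omitempty"`
	ContainerPort *int32    `json:"container_port,omitempty"`
	HostPort      *int32    `json:"host_port,omitempty"`
}

type CheckpointData struct {
	PortMappings []*PortMapping `json:"port_mappings,omitempty"`
}

type PodSandboxCheckpoint struct {
	Version   string          `json:"version"`
	Name      string          `json:"name"`
	Namespace string          `json:"namespace"`
	Data      *CheckpointData `json:"data,omitempty"`
}

// writeCheckpoint persists a sandbox checkpoint as JSON in a per-sandbox file,
// as described above.
func writeCheckpoint(dir, podSandboxID string, cp *PodSandboxCheckpoint) error {
	blob, err := json.Marshal(cp)
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(dir, podSandboxID), blob, 0600)
}

// readCheckpoint loads a checkpoint back, e.g. during disaster recovery when
// the container engine has lost its state.
func readCheckpoint(dir, podSandboxID string) (*PodSandboxCheckpoint, error) {
	blob, err := os.ReadFile(filepath.Join(dir, podSandboxID))
	if err != nil {
		return nil, err
	}
	cp := &PodSandboxCheckpoint{}
	return cp, json.Unmarshal(blob, cp)
}

func main() {
	dir, _ := os.MkdirTemp("", "sandbox-checkpoints")
	defer os.RemoveAll(dir)

	tcp, port := Protocol("tcp"), int32(8080)
	cp := &PodSandboxCheckpoint{
		Version:   schemaVersion,
		Name:      "mypod",
		Namespace: "default",
		Data:      &CheckpointData{PortMappings: []*PortMapping{{Protocol: &tcp, ContainerPort: &port}}},
	}
	if err := writeCheckpoint(dir, "sandbox-id-123", cp); err != nil {
		panic(err)
	}
	back, err := readCheckpoint(dir, "sandbox-id-123")
	fmt.Println(back.Name, back.Namespace, *back.Data.PortMappings[0].ContainerPort, err)
}
```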
\ No newline at end of file diff --git a/contributors/design-proposals/node/cri-windows.md b/contributors/design-proposals/node/cri-windows.md index 0192f6c4..f0fbec72 100644 --- a/contributors/design-proposals/node/cri-windows.md +++ b/contributors/design-proposals/node/cri-windows.md @@ -1,94 +1,6 @@ -# CRI: Windows Container Configuration +Design proposals have been archived. -**Authors**: Jiangtian Li (@JiangtianLi), Pengfei Ni (@feiskyer), Patrick Lang(@PatrickLang) +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -**Status**: Proposed -## Background -Container Runtime Interface (CRI) defines [APIs and configuration types](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/cri/v1alpha1/runtime/api.proto) for kubelet to integrate various container runtimes. The Open Container Initiative (OCI) Runtime Specification defines [platform specific configuration](https://github.com/opencontainers/runtime-spec/blob/master/config.md#platform-specific-configuration), including Linux, Windows, and Solaris. Currently CRI only supports Linux container configuration. This proposal is to bring the Memory & CPU resource restrictions already specified in OCI for Windows to CRI. - -The Linux & Windows schedulers differ in design and the units used, but can accomplish the same goal of limiting resource consumption of individual containers. - -For example, on Linux platform, cpu quota and cpu period represent CPU resource allocation to tasks in a cgroup and cgroup by [Linux kernel CFS scheduler](https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt). Container created in the cgroup are subject to those limitations, and additional processes forked or created will inherit the same cgroup. - -On the Windows platform, processes may be assigned to a job object, which can have [CPU rate control information](https://msdn.microsoft.com/en-us/library/windows/desktop/hh448384(v=vs.85).aspx), memory, and storage resource constraints enforced by the Windows kernel scheduler. A job object is created by Windows to at container creation time so all processes in the container will be aggregated and bound to the resource constraint. - -## Umbrella Issue -[#56734](https://github.com/kubernetes/kubernetes/issues/56734) - -## Feature Request -[#547](https://github.com/kubernetes/features/issues/547) - -## Motivation -The goal is to start filling the gap of platform support in CRI, specifically for Windows platform. For example, currently in dockershim Windows containers are scheduled using the default resource constraints and does not respect the resource requests and limits specified in POD. With this proposal, Windows containers will be able to leverage POD spec and CRI to allocate compute resource and respect restriction. - -## Proposed design - -The design is faily straightforward and to align CRI container configuration for Windows with [OCI runtime specification](https://github.com/opencontainers/runtime-spec/blob/master/specs-go/config.go): -``` -// WindowsResources has container runtime resource constraints for containers running on Windows. -type WindowsResources struct { - // Memory restriction configuration. - Memory *WindowsMemoryResources `json:"memory,omitempty"` - // CPU resource restriction configuration. 
- CPU *WindowsCPUResources `json:"cpu,omitempty"` -} -``` -
-Since Storage and Iops limits for Windows containers are optional, they can be postponed to align with the Linux container configuration in CRI. Therefore we propose to add the following to CRI for Windows containers (PR [here](https://github.com/kubernetes/kubernetes/pull/57076)). -
-### API definition -``` -// WindowsContainerConfig contains platform-specific configuration for -// Windows-based containers. -message WindowsContainerConfig { - // Resources specification for the container. - WindowsContainerResources resources = 1; -} -
-// WindowsContainerResources specifies Windows specific configuration for -// resources. -message WindowsContainerResources { - // CPU shares (relative weight vs. other containers). Default: 0 (not specified). - int64 cpu_shares = 1; - // Number of CPUs available to the container. Default: 0 (not specified). - int64 cpu_count = 2; - // Specifies the portion of processor cycles that this container can use as a percentage times 100. - int64 cpu_maximum = 3; - // Memory limit in bytes. Default: 0 (not specified). - int64 memory_limit_in_bytes = 4; -} -``` -
-### Mapping from Kubernetes API ResourceRequirements to Windows Container Resources -[Kubernetes API ResourceRequirements](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.9/#resourcerequirements-v1-core) contains two fields: limits and requests. Limits describes the maximum amount of compute resources allowed. Requests describes the minimum amount of compute resources required. If Requests is omitted for a container, it defaults to Limits if that is explicitly specified, otherwise to an implementation-defined value. -
-Windows Container Resources defines [resource control for Windows containers](https://docs.microsoft.com/en-us/virtualization/windowscontainers/manage-containers/resource-controls). Note that resource control differs between Hyper-V containers (Hyper-V isolation) and Windows Server containers (process isolation). Windows containers utilize job objects to group and track processes associated with each container. Resource controls are implemented on the parent job object associated with the container. In the case of Hyper-V isolation, resource controls are automatically applied both to the virtual machine and to the job object of the container running inside the virtual machine. This ensures that even if a process running in the container bypassed or escaped the job object's controls, the virtual machine would ensure it could not exceed the defined resource controls. -
-[CPUCount](https://github.com/Microsoft/hcsshim/blob/master/interface.go#L76) specifies the number of processors to assign to the container. [CPUShares](https://github.com/Microsoft/hcsshim/blob/master/interface.go#L77) specifies the relative weight compared to other containers with cpu shares. The range is from 1 to 10000. [CPUMaximum or CPUPercent](https://github.com/Microsoft/hcsshim/blob/master/interface.go#L78) specifies the portion of processor cycles that this container can use as a percentage times 100. The range is from 1 to 10000. On Windows Server containers, the processor resource controls are mutually exclusive; the order of precedence is CPUCount first, then CPUShares, and CPUPercent last (refer to [Docker User Manuals](https://github.com/docker/docker-ce/blob/master/components/cli/man/docker-run.1.md)). On Hyper-V containers, CPUMaximum applies to each processor independently; for example, CPUCount=2 and CPUMaximum=5000 (50%) would limit each CPU to 50%.
- -The mapping of resource limits/requests to Windows Container Resources is in the following table (refer to [Docker's conversion to OCI spec](https://github.com/moby/moby/blob/master/daemon/oci_windows.go#L265-#L289)); a code sketch of this mapping appears below: -
-| | Windows Server Container | Hyper-V Container | -| ------------- |:-------------------------|:-----------------:| -| cpu_count | `cpu_count = int((container.Resources.Limits.Cpu().MilliValue() + 1000)/1000)` <br> `// 0 if not set` | Same | -| cpu_shares | `// milliCPUToShares converts milliCPU to 0-10000` <br> `cpu_shares=milliCPUToShares(container.Resources.Limits.Cpu().MilliValue())` <br> `if cpu_shares == 0 {` <br> `cpu_shares=milliCPUToShares(container.Resources.Request.Cpu().MilliValue())` <br> `}` | Same | -| cpu_maximum | `container.Resources.Limits.Cpu().MilliValue()/sysinfo.NumCPU()/1000*10000` | `container.Resources.Limits.Cpu().MilliValue()/cpu_count/1000*10000` | -| memory_limit_in_bytes | `container.Resources.Limits.Memory().Value()` | Same | -||| -
-
-## Implementation -The implementation will mainly be in two parts: -* In kuberuntime, where the configuration is generated from the POD spec. -* In the container runtime, where the configuration is passed to the container configuration. For example, in dockershim, it is passed to [HostConfig](https://github.com/moby/moby/blob/master/api/types/container/host_config.go). -
-In both parts, we need to implement: -* Fork code for Windows from Linux. -* Convert from Resources.Requests and Resources.Limits to the Windows configuration in CRI, and convert from the Windows configuration in CRI to the container configuration. -
-To implement resource controls for Windows containers, refer to [this MSDN documentation](https://docs.microsoft.com/en-us/virtualization/windowscontainers/manage-containers/resource-controls) and [Docker's conversion to OCI spec](https://github.com/moby/moby/blob/master/daemon/oci_windows.go). -
-## Future work -
-Windows [storage resource controls](https://github.com/opencontainers/runtime-spec/blob/master/config-windows.md#storage), security context (analog to SELinux, Apparmor, readOnlyRootFilesystem, etc.) and pod resource controls (analog to LinuxPodSandboxConfig.cgroup_parent already in CRI) are under investigation and would be handled in separate proposals. They will supplement and not replace the fields in `WindowsContainerResources` from this proposal. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
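To make the mapping table above concrete, here is a hedged sketch of the Windows Server container conversion. The function name, the ceiling-style rounding of `cpu_count`, and the reordering of the `cpu_maximum` arithmetic (multiply before divide, to avoid integer truncation) are assumptions of this sketch rather than part of the proposal.

```go
package main

import "fmt"

// toWindowsResources converts a container's CPU limit (in milli-CPUs) and
// memory limit (in bytes) into the proposed WindowsContainerResources fields
// for a process-isolated (Windows Server) container.
func toWindowsResources(cpuLimitMilli, memoryLimitBytes, numCPU int64) (cpuCount, cpuMaximum, memoryLimit int64) {
	if cpuLimitMilli > 0 {
		// Round the milli-CPU limit up to a whole number of CPUs.
		cpuCount = (cpuLimitMilli + 999) / 1000
		// Portion of the node's processor cycles, as a percentage times 100 (range 1-10000).
		cpuMaximum = cpuLimitMilli * 10000 / (numCPU * 1000)
	}
	memoryLimit = memoryLimitBytes // bytes pass through unchanged
	return
}

func main() {
	// Example: limits of cpu=500m and memory=128Mi on a 4-CPU node.
	count, maximum, memory := toWindowsResources(500, 128*1024*1024, 4)
	fmt.Println(count, maximum, memory) // 1 1250 134217728, i.e. 12.5% of the node's cycles
}
```

A real implementation would also have to honour the precedence rules between CPUCount, CPUShares and CPUMaximum described earlier.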
\ No newline at end of file diff --git a/contributors/design-proposals/node/disk-accounting.md b/contributors/design-proposals/node/disk-accounting.md index d7afb38b..f0fbec72 100644 --- a/contributors/design-proposals/node/disk-accounting.md +++ b/contributors/design-proposals/node/disk-accounting.md @@ -1,581 +1,6 @@ -**Author**: Vishnu Kannan +Design proposals have been archived. -**Last** **Updated**: 11/16/2015 +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -**Status**: Pending Review -This proposal is an attempt to come up with a means for accounting disk usage in Kubernetes clusters that are running docker as the container runtime. Some of the principles here might apply for other runtimes too. - -### Why is disk accounting necessary? - -As of kubernetes v1.1 clusters become unusable over time due to the local disk becoming full. The kubelets on the node attempt to perform garbage collection of old containers and images, but that doesn't prevent running pods from using up all the available disk space. - -Kubernetes users have no insight into how the disk is being consumed. - -Large images and rapid logging can lead to temporary downtime on the nodes. The node has to free up disk space by deleting images and containers. During this cleanup, existing pods can fail and new pods cannot be started. The node will also transition into an `OutOfDisk` condition, preventing more pods from being scheduled to the node. - -Automated eviction of pods that are hogging the local disk is not possible since proper accounting isn’t available. - -Since local disk is a non-compressible resource, users need means to restrict usage of local disk by pods and containers. Proper disk accounting is a prerequisite. As of today, a misconfigured low QoS class pod can end up bringing down the entire cluster by taking up all the available disk space (misconfigured logging for example) - -### Goals - -1. Account for disk usage on the nodes. - -2. Compatibility with the most common docker storage backends - devicemapper, aufs and overlayfs - -3. Provide a roadmap for enabling disk as a schedulable resource in the future. - -4. Provide a plugin interface for extending support to non-default filesystems and storage drivers. - -### Non Goals - -1. Compatibility with all storage backends. The matrix is pretty large already and the priority is to get disk accounting to on most widely deployed platforms. - -2. Support for filesystems other than ext4 and xfs. - -### Introduction - -Disk accounting in Kubernetes cluster running with docker is complex because of the plethora of ways in which disk gets utilized by a container. - -Disk can be consumed for: - -1. Container images - -2. Container's writable layer - -3. Container's logs - when written to stdout/stderr and default logging backend in docker is used. - -4. Local volumes - hostPath, emptyDir, gitRepo, etc. - -As of Kubernetes v1.1, kubelet exposes disk usage for the entire node and the container's writable layer for aufs docker storage driver. -This information is made available to end users via the heapster monitoring pipeline. - -#### Image layers - -Image layers are shared between containers (COW) and so accounting for images is complicated. - -Image layers will have to be accounted as system overhead. - -As of today, it is not possible to check if there is enough disk space available on the node before an image is pulled. 
- -#### Writable Layer -
-Docker creates a writable layer for every container on the host. Depending on the storage driver, the location and the underlying filesystem of this layer will change. -
-Any files that the container creates or updates (assuming there are no volumes) will be considered as writable layer usage. -
-The underlying filesystem is whatever the docker storage directory resides on. It is ext4 by default on most distributions, and xfs on RHEL. -
-#### Container logs -
-Docker engine provides a pluggable logging interface. Kubernetes is currently using the default logging mode, which is `local file`. In this mode, the docker daemon stores bytes written by containers to their stdout or stderr, to local disk. These log files are contained in a special directory that is managed by the docker daemon. These logs are exposed via the `docker logs` interface which is then exposed via kubelet and apiserver APIs. Currently, there is a hard requirement for persisting these log files on the disk. -
-#### Local Volumes -
-Volumes are slightly different from other local disk use cases. They are pod scoped. Their lifetime is tied to that of a pod. Due to this property, accounting of volumes will also be at the pod level. -
-As of now, the volume types that can use local disk directly are ‘HostPath’, ‘EmptyDir’, and ‘GitRepo’. Secrets and Downward API volumes wrap these primitive volumes. -Everything else is a network based volume. -
-‘HostPath’ volumes map existing directories on the host filesystem into a pod. Kubernetes manages only the mapping. It does not manage the source on the host filesystem. -
-In addition to this, the changes introduced by a pod on the source of a hostPath volume are not cleaned up by kubernetes once the pod exits. Due to these limitations, we will have to account hostPath volumes to system overhead. We should explicitly discourage use of HostPath in read-write mode. -
-`EmptyDir`, `GitRepo` and other local storage volumes map to a directory on the host root filesystem that is managed by Kubernetes (the kubelet). Their contents are erased as soon as the pod exits. Tracking and potentially restricting usage for volumes is possible. -
-### Docker storage model -
-Before we start exploring solutions, let's get familiar with how docker handles storage for images, writable layer and logs. -
-On all storage drivers, logs are stored under `<docker root dir>/containers/<container-id>/` -
-The default location of the docker root directory is `/var/lib/docker`. -
-Volumes are handled by kubernetes. -*Caveat: Volumes specified as part of Docker images are not handled by Kubernetes currently.* -
-Container images and writable layers are managed by docker and their location will change depending on the storage driver. Each image layer and writable layer is referred to by an ID. The image layers are read-only. Once saved, existing writable layers can be frozen. The saving feature is not of importance to kubernetes since it works only on immutable images. -
-*Note: Image layer IDs can be obtained by running `docker history -q --no-trunc <imagename>`* -
-##### Aufs -
-Image layers and writable layers are stored under `/var/lib/docker/aufs/diff/<id>`. -
-The writable layer's ID is equivalent to the container ID. -
-##### Devicemapper -
-Each container and each image gets its own block device. Since this driver works at the block level, it is not possible to access the layers directly without mounting them. Each container gets its own block device while running.
- -##### Overlayfs - -Image layers and writable layers are stored under `/var/lib/docker/overlay/<id>`. - -Identical files are hard-linked between images. - -The image layers contain all their data under a `root` subdirectory. - -Everything under `/var/lib/docker/overlay/<id>` are files required for running the container, including its writable layer. - -### Improve disk accounting - -Disk accounting is dependent on the storage driver in docker. A common solution that works across all storage drivers isn't available. - -I'm listing a few possible solutions for disk accounting below along with their limitations. - -We need a plugin model for disk accounting. Some storage drivers in docker will require special plugins. - -#### Container Images - -As of today, the partition that is holding docker images is flagged by cadvisor, and it uses filesystem stats to identify the overall disk usage of that partition. - -Isolated usage of just image layers is available today using `docker history <image name>`. -But isolated usage isn't of much use because image layers are shared between containers and so it is not possible to charge a single pod for image disk usage. - -Continuing to use the entire partition availability for garbage collection purposes in kubelet, should not affect reliability. -We might garbage collect more often. -As long as we do not expose features that require persisting old containers, computing image layer usage wouldn't be necessary. - -Main goals for images are -1. Capturing total image disk usage -2. Check if a new image will fit on disk. - -In case we choose to compute the size of image layers alone, the following are some of the ways to achieve that. - -*Note that some of the strategies mentioned below are applicable in general to other kinds of storage like volumes, etc.* - -##### Docker History - -It is possible to run `docker history` and then create a graph of all images and corresponding image layers. -This graph will let us figure out the disk usage of all the images. - -**Pros** -* Compatible across storage drivers. - -**Cons** -* Requires maintaining an internal representation of images. - -##### Enhance docker - -Docker handles the upload and download of image layers. It can embed enough information about each layer. If docker is enhanced to expose this information, we can statically identify space about to be occupied by read-only image layers, even before the image layers are downloaded. - -A new [docker feature](https://github.com/docker/docker/pull/16450) (docker pull --dry-run) is pending review, which outputs the disk space that will be consumed by new images. Once this feature lands, we can perform feasibility checks and reject pods that will consume more disk space that what is current availability on the node. - -Another option is to expose disk usage of all images together as a first-class feature. - -**Pros** - -* Works across all storage drivers since docker abstracts the storage drivers. - -* Less code to maintain in kubelet. - -**Cons** - -* Not available today. - -* Requires serialized image pulls. - -* Metadata files are not tracked. - -##### Overlayfs and Aufs - -###### `du` - -We can list all the image layer specific directories, excluding container directories, and run `du` on each of those directories. - -**Pros**: - -* This is the least-intrusive approach. - -* It will work off the box without requiring any additional configuration. - -**Cons**: - -* `du` can consume a lot of cpu and memory. 
There have been several issues reported against the kubelet in the past that were related to `du`. -
-* It is time consuming. Cannot be run frequently. Requires special handling to constrain resource usage - setting lower nice value or running in a sub-container. -
-* Can block container deletion by keeping file descriptors open. -
-
-###### Linux gid based Disk Quota -
-The [disk quota](https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/ch-disk-quotas.html) feature provided by the linux kernel can be used to track the usage of image layers. Ideally, we need `project` support for disk quota, which lets us track usage of directory hierarchies using `project ids`. Unfortunately, that feature is only available for zfs filesystems. Since most of our distributions use `ext4` by default, we will have to use either `uid` or `gid` based quota tracking. -
-Both `uids` and `gids` are meant for security. Overloading that concept for disk tracking is painful and ugly. But that is what we have today. -
-Kubelet needs to define a gid for tracking image layers and make that gid or group the owner of `/var/lib/docker/[aufs | overlayfs]` recursively. Once this is done, the quota sub-system in the kernel will report the blocks being consumed by the storage driver on the underlying partition. -
-Since this number also includes the container's writable layer, we will have to somehow subtract that usage from the overall usage of the storage driver directory. Luckily, we can use the same mechanism for tracking the container’s writable layer. Once we apply a different `gid` to the container's writable layer, which is located under `/var/lib/docker/<storage_driver>/diff/<container_id>`, the quota subsystem will not include the container's writable layer usage. -
-Xfs, on the other hand, supports project quota, which lets us track disk usage of arbitrary directories using a project. Support for this feature in ext4 is being reviewed. So on xfs, we can use quota without having to clobber the writable layer's uid and gid. -
-**Pros**: -
-* Low overhead tracking provided by the kernel. -
-
-**Cons** -
-* Requires updates to default ownership on docker's internal storage driver directories. We will have to deal with storage driver implementation details in any approach that is not docker native. -
-* Requires additional node configuration - the quota subsystem needs to be set up on the node. This can either be automated or made a requirement for the node. -
-* Kubelet needs to perform gid management. A range of gids has to be allocated to the kubelet for the purposes of quota management. This range must not be used for any other purposes out of band. Not required if project quota is available. -
-* Breaks `docker save` semantics. Since kubernetes assumes immutable images, this is not a blocker. To support quota in docker, we will need user-namespaces along with custom gid mapping for each container. This feature does not exist today. This is not an issue with project quota. -
-*Note: Refer to the [Appendix](#appendix) section for more real examples of using quota with docker.* -
-**Project Quota** -
-Project Quota support for ext4 is currently being reviewed upstream. If that feature lands in upstream sometime soon, project IDs will be used for disk tracking instead of uids and gids. -
-
-##### Devicemapper -
-The devicemapper storage driver will set up two volumes, metadata and data, that will be used to store image layers and container writable layers.
The volumes can be real devices or loopback. A Pool device is created which uses the underlying volume for real storage. - -A new thinly-provisioned volume, based on the pool, will be created for running container's. - -The kernel tracks the usage of the pool device at the block device layer. The usage here includes image layers and container's writable layers. - -Since the kubelet has to track the writable layer usage anyways, we can subtract the aggregated root filesystem usage from the overall pool device usage to get the image layer's disk usage. - -Linux quota and `du` will not work with device mapper. - -A docker dry run option (mentioned above) is another possibility. - - -#### Container Writable Layer - -###### Overlayfs / Aufs - -Docker creates a separate directory for the container's writable layer which is then overlayed on top of read-only image layers. - -Both the previously mentioned options of `du` and `Linux Quota` will work for this case as well. - -Kubelet can use `du` to track usage and enforce `limits` once disk becomes a schedulable resource. As mentioned earlier `du` is resource intensive. - -To use Disk quota, kubelet will have to allocate a separate gid per container. Kubelet can reuse the same gid for multiple instances of the same container (restart scenario). As and when kubelet garbage collects dead containers, the usage of the container will drop. - -If local disk becomes a schedulable resource, `linux quota` can be used to impose `request` and `limits` on the container writable layer. -`limits` can be enforced using hard limits. Enforcing `request` will be tricky. One option is to enforce `requests` only when the disk availability drops below a threshold (10%). Kubelet can at this point evict pods that are exceeding their requested space. Other options include using `soft limits` with grace periods, but this option is complex. - -###### Devicemapper - -FIXME: How to calculate writable layer usage with devicemapper? - -To enforce `limits` the volume created for the container's writable layer filesystem can be dynamically [resized](https://jpetazzo.github.io/2014/01/29/docker-device-mapper-resize/), to not use more than `limit`. `request` will have to be enforced by the kubelet. - - -#### Container logs - -Container logs are not storage driver specific. We can use either `du` or `quota` to track log usage per container. Log files are stored under `/var/lib/docker/containers/<container-id>`. - -In the case of quota, we can create a separate gid for tracking log usage. This will let users track log usage and writable layer's usage individually. - -For the purposes of enforcing limits though, kubelet will use the sum of logs and writable layer. - -In the future, we can consider adding log rotation support for these log files either in kubelet or via docker. - - -#### Volumes - -The local disk based volumes map to a directory on the disk. We can use `du` or `quota` to track the usage of volumes. - -There exists a concept called `FsGroup` today in kubernetes, which lets users specify a gid for all volumes in a pod. If that is set, we can use the `FsGroup` gid for quota purposes. This requires `limits` for volumes to be a pod level resource though. - - -### Yet to be explored - -* Support for filesystems other than ext4 and xfs like `zfs` - -* Support for Btrfs - -It should be clear at this point that we need a plugin based model for disk accounting. Support for other filesystems both CoW and regular can be added as and when required. 
As we progress towards making accounting work on the above mentioned storage drivers, we can come up with an abstraction for storage plugins in general. - - -### Implementation Plan and Milestones - -#### Milestone 1 - Get accounting to just work! - -This milestone targets exposing the following categories of disk usage from the kubelet - infrastructure (images, sys daemons, etc), containers (log + writable layer) and volumes. - -* `du` works today. Use `du` for all the categories and ensure that it works on both on aufs and overlayfs. - -* Add device mapper support. - -* Define a storage driver based pluggable disk accounting interface in cadvisor. - -* Reuse that interface for accounting volumes in kubelet. - -* Define a disk manager module in kubelet that will serve as a source of disk usage information for the rest of the kubelet. - -* Ensure that the kubelet metrics APIs (/apis/metrics/v1beta1) exposes the disk usage information. Add an integration test. - - -#### Milestone 2 - node reliability - -Improve user experience by doing whatever is necessary to keep the node running. - -NOTE: [`Out of Resource Killing`](https://github.com/kubernetes/kubernetes/issues/17186) design is a prerequisite. - -* Disk manager will evict pods and containers based on QoS class whenever the disk availability is below a critical level. - -* Explore combining existing container and image garbage collection logic into disk manager. - -Ideally, this phase should be completed before v1.2. - - -#### Milestone 3 - Performance improvements - -In this milestone, we will add support for quota and make it opt-in. There should be no user visible changes in this phase. - -* Add gid allocation manager to kubelet - -* Reconcile gids allocated after restart. - -* Configure linux quota automatically on startup. Do not set any limits in this phase. - -* Allocate gids for pod volumes, container's writable layer and logs, and also for image layers. - -* Update the docker runtime plugin in kubelet to perform the necessary `chown's` and `chmod's` between container creation and startup. - -* Pass the allocated gids as supplementary gids to containers. - -* Update disk manager in kubelet to use quota when configured. - - -#### Milestone 4 - Users manage local disks - -In this milestone, we will make local disk a schedulable resource. - -* Finalize volume accounting - is it at the pod level or per-volume. - -* Finalize multi-disk management policy. Will additional disks be handled as whole units? - -* Set aside some space for image layers and rest of the infra overhead - node allocable resources includes local disk. - -* `du` plugin triggers container or pod eviction whenever usage exceeds limit. - -* Quota plugin sets hard limits equal to user specified `limits`. - -* Devicemapper plugin resizes writable layer to not exceed the container's disk `limit`. - -* Disk manager evicts pods based on `usage` - `request` delta instead of just QoS class. - -* Sufficient integration testing to this feature. - - -### Appendix - - -#### Implementation Notes - -The following is a rough outline of the testing I performed to corroborate by prior design ideas. - -Test setup information - -* Testing was performed on GCE virtual machines - -* All the test VMs were using ext4. - -* Distribution tested against is mentioned as part of each graph driver. - -##### AUFS testing notes: - -Tested on Debian jessie - -1. 
Setup Linux Quota following this [tutorial](https://www.google.com/url?q=https://www.howtoforge.com/tutorial/linux-quota-ubuntu-debian/&sa=D&ust=1446146816105000&usg=AFQjCNHThn4nwfj1YLoVmv5fJ6kqAQ9FlQ). - -2. Create a new group ‘x’ on the host and enable quota for that group - - 1. `groupadd -g 9000 x` - - 2. `setquota -g 9000 -a 0 100 0 100` // 100 blocks (4096 bytes each*) - - 3. `quota -g 9000 -v` // Check that quota is enabled - -3. Create a docker container - - 4. `docker create -it busybox /bin/sh -c "dd if=/dev/zero of=/file count=10 bs=1M"` - - 8d8c56dcfbf5cda9f9bfec7c6615577753292d9772ab455f581951d9a92d169d - -4. Change group on the writable layer directory for this container - - 5. `chmod a+s /var/lib/docker/aufs/diff/8d8c56dcfbf5cda9f9bfec7c6615577753292d9772ab455f581951d9a92d169d` - - 6. `chown :x /var/lib/docker/aufs/diff/8d8c56dcfbf5cda9f9bfec7c6615577753292d9772ab455f581951d9a92d169d` - -5. Start the docker container - - 7. `docker start 8d` - - 8. Check usage using quota and group ‘x’ - - ```shell - $ quota -g x -v - Disk quotas for group x (gid 9000): - Filesystem blocks quota limit grace files quota limit grace - /dev/sda1 10248 0 0 3 0 0 - ``` - - Using the same workflow, we can add new sticky group IDs to emptyDir volumes and account for their usage against pods. - - Since each container requires a gid for the purposes of quota, we will have to reserve ranges of gids for use by the kubelet. Since kubelet does not checkpoint its state, recovery of group id allocations will be an interesting problem. More on this later. - -Track the space occupied by images after it has been pulled locally as follows. - -*Note: This approach requires serialized image pulls to be of any use to the kubelet.* - -1. Create a group specifically for the graph driver - - 1. `groupadd -g 9001 docker-images` - -2. Update group ownership on the ‘graph’ (tracks image metadata) and ‘storage driver’ directories. - - 2. `chown -R :9001 /var/lib/docker/[overlay | aufs]` - - 3. `chmod a+s /var/lib/docker/[overlay | aufs]` - - 4. `chown -R :9001 /var/lib/docker/graph` - - 5. `chmod a+s /var/lib/docker/graph` - -3. Any new images pulled or containers created will be accounted to the `docker-images` group by default. - -4. Once we update the group ownership on newly created containers to a different gid, the container writable layer's specific disk usage gets dropped from this group. - -#### Overlayfs - -Tested on Ubuntu 15.10. - -Overlayfs works similar to Aufs. The path to the writable directory for container writable layer changes. - -* Setup Linux Quota following this [tutorial](https://www.google.com/url?q=https://www.howtoforge.com/tutorial/linux-quota-ubuntu-debian/&sa=D&ust=1446146816105000&usg=AFQjCNHThn4nwfj1YLoVmv5fJ6kqAQ9FlQ). - -* Create a new group ‘x’ on the host and enable quota for that group - - * `groupadd -g 9000 x` - - * `setquota -g 9000 -a 0 100 0 100` // 100 blocks (4096 bytes each*) - - * `quota -g 9000 -v` // Check that quota is enabled - -* Create a docker container - - * `docker create -it busybox /bin/sh -c "dd if=/dev/zero of=/file count=10 bs=1M"` - - * `b8cc9fae3851f9bcefe922952b7bca0eb33aa31e68e9203ce0639fc9d3f3c61` - -* Change group on the writable layer’s directory for this container - - * `chmod -R a+s /var/lib/docker/overlay/b8cc9fae3851f9bcefe922952b7bca0eb33aa31e68e9203ce0639fc9d3f3c61b/*` - - * `chown -R :9000 /var/lib/docker/overlay/b8cc9fae3851f9bcefe922952b7bca0eb33aa31e68e9203ce0639fc9d3f3c61b/*` - -* Check quota before and after running the container. 
- - ```shell - $ quota -g x -v - Disk quotas for group x (gid 9000): - Filesystem blocks quota limit grace files quota limit grace - /dev/sda1 48 0 0 19 0 0 - ``` - - * Start the docker container - - * `docker start b8` - - Notice the **blocks** has changed - - ```sh - $ quota -g x -v - Disk quotas for group x (gid 9000): - Filesystem blocks quota limit grace files quota limit grace - /dev/sda1 10288 0 0 20 0 0 - ``` - -##### Device mapper - -Usage of Linux Quota should be possible for the purposes of volumes and log files. - -Devicemapper storage driver in docker uses ["thin targets"](https://www.kernel.org/doc/Documentation/device-mapper/thin-provisioning.txt). Underneath there are two block devices - “data” and “metadata”, using which more block devices are created for containers. More information [here](http://www.projectatomic.io/docs/filesystems/). - -These devices can be loopback or real storage devices. - -The base device has a maximum storage capacity. This means that the sum total of storage space occupied by images and containers cannot exceed this capacity. - -By default, all images and containers are created from an initial filesystem with a 10GB limit. - -A separate filesystem is created for each container as part of start (not create). - -It is possible to [resize](https://jpetazzo.github.io/2014/01/29/docker-device-mapper-resize/) the container filesystem. - -For the purposes of image space tracking, we can - -#### Testing notes: -Notice the **Pool Name** -```shell -$ docker info -... -Storage Driver: devicemapper - Pool Name: docker-8:1-268480-pool - Pool Blocksize: 65.54 kB - Backing Filesystem: extfs - Data file: /dev/loop0 - Metadata file: /dev/loop1 - Data Space Used: 2.059 GB - Data Space Total: 107.4 GB - Data Space Available: 48.45 GB - Metadata Space Used: 1.806 MB - Metadata Space Total: 2.147 GB - Metadata Space Available: 2.146 GB - Udev Sync Supported: true - Deferred Removal Enabled: false - Data loop file: /var/lib/docker/devicemapper/devicemapper/data - Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata - Library Version: 1.02.99 (2015-06-20) -``` - -```shell -$ dmsetup table docker-8\:1-268480-pool -0 209715200 thin-pool 7:1 7:0 128 32768 1 skip_block_zeroing -``` - -128 is the data block size - -Usage from kernel for the primary block device - -```shell -$ dmsetup status docker-8\:1-268480-pool -0 209715200 thin-pool 37 441/524288 31424/1638400 - rw discard_passdown queue_if_no_space - -``` - -Usage/Available - 31424/1638400 - -Usage in MB = 31424 * 512 * 128 (block size from above) bytes = 1964 MB - -Capacity in MB = 1638400 * 512 * 128 bytes = 100 GB - -#### Log file accounting - -* Setup Linux quota for a container as mentioned above. - -* Update group ownership on the following directories to that of the container group ID created for graphing. Adapting the examples above: - - * `chmod -R a+s /var/lib/docker/**containers**/b8cc9fae3851f9bcefe922952b7bca0eb33aa31e68e9203ce0639fc9d3f3c61b/*` - - * `chown -R :9000 /var/lib/docker/**container**/b8cc9fae3851f9bcefe922952b7bca0eb33aa31e68e9203ce0639fc9d3f3c61b/*` - -##### Testing titbits - -* Ubuntu 15.10 doesn't ship with the quota module on virtual machines. [Install ‘linux-image-extra-virtual’](http://askubuntu.com/questions/109585/quota-format-not-supported-in-kernel) package to get quota to work. - -* Overlay storage driver needs kernels >= 3.18. I used Ubuntu 15.10 to test Overlayfs. 
- -* If you use a non-default location for docker storage, change `/var/lib/docker` in the examples to your storage location. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
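As a companion to the devicemapper appendix above, the following hedged sketch shows the pool-usage arithmetic; the function name and inputs are assumptions of this sketch, and the example numbers mirror the `dmsetup` output shown earlier.

```go
package main

import "fmt"

// poolUsage converts thin-pool block counts into bytes.
// usedBlocks/totalBlocks come from `dmsetup status` (e.g. 31424/1638400) and
// dataBlockSectors comes from `dmsetup table` (e.g. 128 sectors, i.e. 64 KiB blocks).
func poolUsage(usedBlocks, totalBlocks, dataBlockSectors int64) (usedBytes, capacityBytes int64) {
	const sectorSize = 512 // bytes per 512-byte sector
	usedBytes = usedBlocks * dataBlockSectors * sectorSize
	capacityBytes = totalBlocks * dataBlockSectors * sectorSize
	return
}

func main() {
	used, capacity := poolUsage(31424, 1638400, 128)
	fmt.Printf("used=%d bytes (~%d MB), capacity=%d bytes (~%d GB)\n",
		used, used/(1024*1024), capacity, capacity/(1024*1024*1024))
	// Matches the appendix: roughly 1964 MB used out of a 100 GB pool.
}
```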
\ No newline at end of file diff --git a/contributors/design-proposals/node/downward_api_resources_limits_requests.md b/contributors/design-proposals/node/downward_api_resources_limits_requests.md index e8cf4438..f0fbec72 100644 --- a/contributors/design-proposals/node/downward_api_resources_limits_requests.md +++ b/contributors/design-proposals/node/downward_api_resources_limits_requests.md @@ -1,618 +1,6 @@ -# Downward API for resource limits and requests +Design proposals have been archived. -## Background +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Currently the downward API (via environment variables and volume plugin) only -supports exposing a Pod's name, namespace, annotations, labels and its IP -([see details](http://kubernetes.io/docs/user-guide/downward-api/)). This -document explains the need and design to extend them to expose resource -(e.g. cpu, memory) limits and requests. -## Motivation -
-Software applications require configuration to work optimally with the resources they're allowed to use. -Exposing the requested and limited amounts of available resources inside containers will allow -these applications to be configured more easily. Although docker already -exposes some of this information inside containers, the downward API helps -expose this information in a runtime-agnostic manner in Kubernetes. -
-## Use cases -
-As an application author, I want to be able to use cpu or memory requests and -limits to configure the operational requirements of my applications inside containers. -For example, Java applications expect to be made aware of the available heap size via -a command line argument to the JVM, for example: java -Xmx:`<heap-size>`. Similarly, an -application may want to configure its thread pool based on available cpu resources and -the exported value of GOMAXPROCS. -
-## Design -
-This is mostly driven by the discussion in [this issue](https://github.com/kubernetes/kubernetes/issues/9473). -There are three approaches discussed in this document to obtain resources limits -and requests to be exposed as environment variables and volumes inside -containers: -
-1. The first approach requires users to specify full json path selectors -in which selectors are relative to the pod spec. The benefit of this -approach is that it can specify pod-level resources, and since containers are -also part of a pod spec, it can be used to specify container-level -resources too. -
-2. The second approach requires specifying partial json path selectors -which are relative to the container spec. This approach helps -in retrieving container-specific resource limits and requests, and at -the same time, it is simpler to specify than full json path selectors. -
-3. In the third approach, users specify fixed strings (magic keys) to retrieve -resources limits and requests and do not specify any json path -selectors. This approach is similar to the existing downward API -implementation approach. The advantages of this approach are that it is -simpler to specify than the first two, and it does not require any type of -conversion between internal and versioned objects or json selectors, as -discussed below. -
-Before discussing a bit more about the merits of each approach, here is a -brief discussion about json path selectors and some implications related -to their use. -
-#### JSONpath selectors -
-Versioned objects in kubernetes have json tags as part of their golang fields.
-Currently, objects in the internal API have json tags, but it is planned that -these will eventually be removed (see [3933](https://github.com/kubernetes/kubernetes/issues/3933) -for discussion). So for discussion in this proposal, we assume that -internal objects do not have json tags. In the first two approaches -(full and partial json selectors), when a user creates a pod and its -containers, the user specifies a json path selector in the pod's -spec to retrieve values of its limits and requests. The selector -is composed of json tags similar to json paths used with kubectl -([json](http://kubernetes.io/docs/user-guide/jsonpath/)). This proposal -uses kubernetes' json path library to process the selectors to retrieve -the values. As kubelet operates on internal objects (without json tags), -and the selectors are part of versioned objects, retrieving values of -the limits and requests can be handled using these two solutions: -
-1. By converting an internal object to a versioned object, and then using -the json path library to retrieve the values from the versioned object -by processing the selector. -
-2. By converting a json selector of the versioned objects to the internal -object's golang expression and then using the json path library to -retrieve the values from the internal object by processing the golang -expression. However, converting a json selector of the versioned objects -to the internal object's golang expression will still require an instance -of the versioned object, so it seems like more work than the first solution, -unless there is another way that does not require the versioned object. -
-So there is a one-time conversion cost associated with the first (full -path) and second (partial path) approaches, whereas the third approach -(magic keys) does not require any such conversion and can directly -work on internal objects. If we want to avoid the conversion cost and keep -the implementation simple, my opinion is that the magic keys approach -is the easiest way to expose limits and requests with the -least impact on existing functionality. -
-To summarize merits/demerits of each approach: -
-|Approach | Scope | Conversion cost | JSON selectors | Future extension| -| ---------- | ------------------- | -------------------| ------------------- | ------------------- | -|Full selectors | Pod/Container | Yes | Yes | Possible | -|Partial selectors | Container | Yes | Yes | Possible | -|Magic keys | Container | No | No | Possible| -
-Note: Pod resources can always be accessed using the existing `type ObjectFieldSelector` object -in conjunction with the partial selectors and magic keys approaches. -
-### API with full JSONpath selectors -
-Full json path selectors specify the complete path to the resources -limits and requests relative to the pod spec. -
-#### Environment variables -
-This table shows how selectors can be used for various requests and -limits to be exposed as environment variables. Environment variable names -are examples only and not necessarily as specified, and the selectors do not -have to start with a dot.
- -| Env Var Name | Selector | -| ---- | ------------------- | -| CPU_LIMIT | spec.containers[?(@.name=="container-name")].resources.limits.cpu| -| MEMORY_LIMIT | spec.containers[?(@.name=="container-name")].resources.limits.memory| -| CPU_REQUEST | spec.containers[?(@.name=="container-name")].resources.requests.cpu| -| MEMORY_REQUEST | spec.containers[?(@.name=="container-name")].resources.requests.memory | - -#### Volume plugin - -This table shows how selectors can be used for various requests and -limits to be exposed as volumes. The path names are examples only and -not necessarily as specified, and the selectors do not have to start with dot. - - -| Path | Selector | -| ---- | ------------------- | -| cpu_limit | spec.containers[?(@.name=="container-name")].resources.limits.cpu| -| memory_limit| spec.containers[?(@.name=="container-name")].resources.limits.memory| -| cpu_request | spec.containers[?(@.name=="container-name")].resources.requests.cpu| -| memory_request |spec.containers[?(@.name=="container-name")].resources.requests.memory| - -Volumes are pod scoped, so a selector must be specified with a container name. - -Full json path selectors will use existing `type ObjectFieldSelector` -to extend the current implementation for resources requests and limits. - -```go -// ObjectFieldSelector selects an APIVersioned field of an object. -type ObjectFieldSelector struct { - APIVersion string `json:"apiVersion"` - // Required: Path of the field to select in the specified API version - FieldPath string `json:"fieldPath"` -} -``` - -#### Examples - -These examples show how to use full selectors with environment variables and volume plugin. - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: dapi-test-pod -spec: - containers: - - name: test-container - image: k8s.gcr.io/busybox - command: [ "/bin/sh","-c", "env" ] - resources: - requests: - memory: "64Mi" - cpu: "250m" - limits: - memory: "128Mi" - cpu: "500m" - env: - - name: CPU_LIMIT - valueFrom: - fieldRef: - fieldPath: spec.containers[?(@.name=="test-container")].resources.limits.cpu -``` - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: kubernetes-downwardapi-volume-example -spec: - containers: - - name: client-container - image: k8s.gcr.io/busybox - command: ["sh", "-c", "while true; do if [[ -e /etc/labels ]]; then cat /etc/labels; fi; if [[ -e /etc/annotations ]]; then cat /etc/annotations; fi;sleep 5; done"] - resources: - requests: - memory: "64Mi" - cpu: "250m" - limits: - memory: "128Mi" - cpu: "500m" - volumeMounts: - - name: podinfo - mountPath: /etc - readOnly: false - volumes: - - name: podinfo - downwardAPI: - items: - - path: "cpu_limit" - fieldRef: - fieldPath: spec.containers[?(@.name=="client-container")].resources.limits.cpu -``` - -#### Validations - -For APIs with full json path selectors, verify that selectors are -valid relative to pod spec. - - -### API with partial JSONpath selectors - -Partial json path selectors specify paths to resources limits and requests -relative to the container spec. These will be implemented by introducing a -`ContainerSpecFieldSelector` (json: `containerSpecFieldRef`) to extend the current -implementation for `type DownwardAPIVolumeFile struct` and `type EnvVarSource struct`. - -```go -// ContainerSpecFieldSelector selects an APIVersioned field of an object. 
-type ContainerSpecFieldSelector struct { - APIVersion string `json:"apiVersion"` - // Container name - ContainerName string `json:"containerName,omitempty"` - // Required: Path of the field to select in the specified API version - FieldPath string `json:"fieldPath"` -} - -// Represents a single file containing information from the downward API -type DownwardAPIVolumeFile struct { - // Required: Path is the relative path name of the file to be created. - Path string `json:"path"` - // Selects a field of the pod: only annotations, labels, name and - // namespace are supported. - FieldRef *ObjectFieldSelector `json:"fieldRef, omitempty"` - // Selects a field of the container: only resources limits and requests - // (resources.limits.cpu, resources.limits.memory, resources.requests.cpu, - // resources.requests.memory) are currently supported. - ContainerSpecFieldRef *ContainerSpecFieldSelector `json:"containerSpecFieldRef,omitempty"` -} - -// EnvVarSource represents a source for the value of an EnvVar. -// Only one of its fields may be set. -type EnvVarSource struct { - // Selects a field of the container: only resources limits and requests - // (resources.limits.cpu, resources.limits.memory, resources.requests.cpu, - // resources.requests.memory) are currently supported. - ContainerSpecFieldRef *ContainerSpecFieldSelector `json:"containerSpecFieldRef,omitempty"` - // Selects a field of the pod; only name and namespace are supported. - FieldRef *ObjectFieldSelector `json:"fieldRef,omitempty"` - // Selects a key of a ConfigMap. - ConfigMapKeyRef *ConfigMapKeySelector `json:"configMapKeyRef,omitempty"` - // Selects a key of a secret in the pod's namespace. - SecretKeyRef *SecretKeySelector `json:"secretKeyRef,omitempty"` -} -``` - -#### Environment variables - -This table shows how partial selectors can be used for various requests and -limits to be exposed as environment variables. Environment variable names -are examples only and not necessarily as specified, and the selectors do not -have to start with dot. - -| Env Var Name | Selector | -| -------------------- | -------------------| -| CPU_LIMIT | resources.limits.cpu | -| MEMORY_LIMIT | resources.limits.memory | -| CPU_REQUEST | resources.requests.cpu | -| MEMORY_REQUEST | resources.requests.memory | - -Since environment variables are container scoped, it is optional -to specify container name as part of the partial selectors as they are -relative to container spec. If container name is not specified, then -it defaults to current container. However, container name could be specified -to expose variables from other containers. - -#### Volume plugin - -This table shows volume paths and partial selectors used for resources cpu and memory. -Volume path names are examples only and not necessarily as specified, and the -selectors do not have to start with dot. - -| Path | Selector | -| -------------------- | -------------------| -| cpu_limit | resources.limits.cpu | -| memory_limit | resources.limits.memory | -| cpu_request | resources.requests.cpu | -| memory_request | resources.requests.memory | - -Volumes are pod scoped, the container name must be specified as part of -`containerSpecFieldRef` with them. - -#### Examples - -These examples show how to use partial selectors with environment variables and volume plugin. 
- -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: dapi-test-pod -spec: - containers: - - name: test-container - image: k8s.gcr.io/busybox - command: [ "/bin/sh","-c", "env" ] - resources: - requests: - memory: "64Mi" - cpu: "250m" - limits: - memory: "128Mi" - cpu: "500m" - env: - - name: CPU_LIMIT - valueFrom: - containerSpecFieldRef: - fieldPath: resources.limits.cpu -``` -
-``` -apiVersion: v1 -kind: Pod -metadata: - name: kubernetes-downwardapi-volume-example -spec: - containers: - - name: client-container - image: k8s.gcr.io/busybox - command: ["sh", "-c", "while true; do if [[ -e /etc/labels ]]; then cat /etc/labels; fi; if [[ -e /etc/annotations ]]; then cat /etc/annotations; fi; sleep 5; done"] - resources: - requests: - memory: "64Mi" - cpu: "250m" - limits: - memory: "128Mi" - cpu: "500m" - volumeMounts: - - name: podinfo - mountPath: /etc - readOnly: false - volumes: - - name: podinfo - downwardAPI: - items: - - path: "cpu_limit" - containerSpecFieldRef: - containerName: "client-container" - fieldPath: resources.limits.cpu -``` -
-#### Validations -
-For APIs with partial json path selectors, verify -that the selectors are valid relative to the container spec. -Also verify that a container name is provided with volumes. -
-
-### API with magic keys -
-In this approach, users specify fixed strings (or magic keys) to retrieve resources -limits and requests. This approach is similar to the existing downward -API implementation approach. The fixed strings used for resources limits and requests -for cpu and memory are `limits.cpu`, `limits.memory`, -`requests.cpu` and `requests.memory`. Though these strings are the same -as the json path selectors, they are processed as fixed strings. These will be implemented by -introducing a `ResourceFieldSelector` (json: `resourceFieldRef`) to extend the current -implementation for `type DownwardAPIVolumeFile struct` and `type EnvVarSource struct`. -
-The fields in ResourceFieldSelector are `containerName` to specify the name of a -container, `resource` to specify the type of a resource (cpu or memory), and `divisor` -to specify the output format of values of exposed resources. The default value of divisor -is `1`, which means cores for cpu and bytes for memory. For cpu, divisor's valid -values are `1m` (millicores), `1`(cores), and for memory, the valid values in fixed point integer -(decimal) are `1`(bytes), `1k`(kilobytes), `1M`(megabytes), `1G`(gigabytes), -`1T`(terabytes), `1P`(petabytes), `1E`(exabytes), and in their power-of-two equivalents `1Ki`(kibibytes), -`1Mi`(mebibytes), `1Gi`(gibibytes), `1Ti`(tebibytes), `1Pi`(pebibytes), `1Ei`(exbibytes). -For more information about these resource formats, [see details](resources.md). -
-Also, the exposed values will be the `ceiling` of the actual values in the requested format specified by divisor. -For example, if requests.cpu is `250m` (250 millicores) and the divisor by default is `1`, then -the exposed value will be `1` core. This is because 250 millicores converted to cores is 0.25, and -the ceiling of 0.25 is 1. -
-```go -type ResourceFieldSelector struct { - // Container name - ContainerName string `json:"containerName,omitempty"` - // Required: Resource to select - Resource string `json:"resource"` - // Specifies the output format of the exposed resources - Divisor resource.Quantity `json:"divisor,omitempty"` -} -
-// Represents a single file containing information from the downward API -type DownwardAPIVolumeFile struct { - // Required: Path is the relative path name of the file to be created.
- Path string `json:"path"` - // Selects a field of the pod: only annotations, labels, name and - // namespace are supported. - FieldRef *ObjectFieldSelector `json:"fieldRef, omitempty"` - // Selects a resource of the container: only resources limits and requests - // (limits.cpu, limits.memory, requests.cpu and requests.memory) are currently supported. - ResourceFieldRef *ResourceFieldSelector `json:"resourceFieldRef,omitempty"` -} - -// EnvVarSource represents a source for the value of an EnvVar. -// Only one of its fields may be set. -type EnvVarSource struct { - // Selects a resource of the container: only resources limits and requests - // (limits.cpu, limits.memory, requests.cpu and requests.memory) are currently supported. - ResourceFieldRef *ResourceFieldSelector `json:"resourceFieldRef,omitempty"` - // Selects a field of the pod; only name and namespace are supported. - FieldRef *ObjectFieldSelector `json:"fieldRef,omitempty"` - // Selects a key of a ConfigMap. - ConfigMapKeyRef *ConfigMapKeySelector `json:"configMapKeyRef,omitempty"` - // Selects a key of a secret in the pod's namespace. - SecretKeyRef *SecretKeySelector `json:"secretKeyRef,omitempty"` -} -``` - -#### Environment variables - -This table shows environment variable names and strings used for resources cpu and memory. -The variable names are examples only and not necessarily as specified. - -| Env Var Name | Resource | -| -------------------- | -------------------| -| CPU_LIMIT | limits.cpu | -| MEMORY_LIMIT | limits.memory | -| CPU_REQUEST | requests.cpu | -| MEMORY_REQUEST | requests.memory | - -Since environment variables are container scoped, it is optional -to specify container name as part of the partial selectors as they are -relative to container spec. If container name is not specified, then -it defaults to current container. However, container name could be specified -to expose variables from other containers. - -#### Volume plugin - -This table shows volume paths and strings used for resources cpu and memory. -Volume path names are examples only and not necessarily as specified. - -| Path | Resource | -| -------------------- | -------------------| -| cpu_limit | limits.cpu | -| memory_limit | limits.memory| -| cpu_request | requests.cpu | -| memory_request | requests.memory | - -Volumes are pod scoped, the container name must be specified as part of -`resourceFieldRef` with them. - -#### Examples - -These examples show how to use magic keys approach with environment variables and volume plugin. - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: dapi-test-pod -spec: - containers: - - name: test-container - image: k8s.gcr.io/busybox - command: [ "/bin/sh","-c", "env" ] - resources: - requests: - memory: "64Mi" - cpu: "250m" - limits: - memory: "128Mi" - cpu: "500m" - env: - - name: CPU_LIMIT - valueFrom: - resourceFieldRef: - resource: limits.cpu - - name: MEMORY_LIMIT - valueFrom: - resourceFieldRef: - resource: limits.memory - divisor: "1Mi" -``` - -In the above example, the exposed values of CPU_LIMIT and MEMORY_LIMIT will be 1 (in cores) and 128 (in Mi), respectively. 
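The arithmetic behind these exposed values can be sketched as follows. This is a hedged illustration rather than the kubelet's actual implementation; the helper name and the apimachinery import path are assumptions.

```go
package main

import (
	"fmt"
	"math"

	"k8s.io/apimachinery/pkg/api/resource"
)

// exposedValue returns ceil(quantity / divisor), the number written to the
// environment variable or downward API volume file.
func exposedValue(quantity, divisor resource.Quantity) int64 {
	return int64(math.Ceil(float64(quantity.MilliValue()) / float64(divisor.MilliValue())))
}

func main() {
	cpuLimit := resource.MustParse("500m")
	memLimit := resource.MustParse("128Mi")

	fmt.Println(exposedValue(cpuLimit, resource.MustParse("1")))   // 1 (cores)
	fmt.Println(exposedValue(memLimit, resource.MustParse("1Mi"))) // 128 (Mi)
	fmt.Println(exposedValue(cpuLimit, resource.MustParse("1m")))  // 500 (millicores)
	fmt.Println(exposedValue(memLimit, resource.MustParse("1")))   // 134217728 (bytes)
}
```

The same rule applies to the volume example that follows.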
- -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: kubernetes-downwardapi-volume-example -spec: - containers: - - name: client-container - image: k8s.gcr.io/busybox - command: ["sh", "-c","while true; do if [[ -e /etc/labels ]]; then cat /etc/labels; fi; if [[ -e /etc/annotations ]]; then cat /etc/annotations; fi; sleep 5; done"] - resources: - requests: - memory: "64Mi" - cpu: "250m" - limits: - memory: "128Mi" - cpu: "500m" - volumeMounts: - - name: podinfo - mountPath: /etc - readOnly: false - volumes: - - name: podinfo - downwardAPI: - items: - - path: "cpu_limit" - resourceFieldRef: - containerName: client-container - resource: limits.cpu - divisor: "1m" - - path: "memory_limit" - resourceFieldRef: - containerName: client-container - resource: limits.memory -``` - -In the above example, the exposed values of CPU_LIMIT and MEMORY_LIMIT will be 500 (in millicores) and 134217728 (in bytes), respectively. - - -#### Validations - -For APIs with magic keys, verify that the resource strings are valid and is one -of `limits.cpu`, `limits.memory`, `requests.cpu` and `requests.memory`. -Also verify that container name is provided with volumes. - -## Pod-level and container-level resource access - -Pod-level resources (like `metadata.name`, `status.podIP`) will always be accessed with `type ObjectFieldSelector` object in -all approaches. Container-level resources will be accessed by `type ObjectFieldSelector` -with full selector approach; and by `type ContainerSpecFieldRef` and `type ResourceFieldRef` -with partial and magic keys approaches, respectively. The following table -summarizes resource access with these approaches. - -| Approach | Pod resources| Container resources | -| -------------------- | -------------------|-------------------| -| Full selectors | `ObjectFieldSelector` | `ObjectFieldSelector`| -| Partial selectors | `ObjectFieldSelector`| `ContainerSpecFieldRef` | -| Magic keys | `ObjectFieldSelector`| `ResourceFieldRef` | - -## Output format - -The output format for resources limits and requests will be same as -cgroups output format, i.e. cpu in cpu shares (cores multiplied by 1024 -and rounded to integer) and memory in bytes. For example, memory request -or limit of `64Mi` in the container spec will be output as `67108864` -bytes, and cpu request or limit of `250m` (millicores) will be output as -`256` of cpu shares. - -## Implementation approach - -The current implementation of this proposal will focus on the API with magic keys -approach. The main reason for selecting this approach is that it might be -easier to incorporate and extend resource specific functionality. - -## Applied example - -Here we discuss how to use exposed resource values to set, for example, Java -memory size or GOMAXPROCS for your applications. Lets say, you expose a container's -(running an application like tomcat for example) requested memory as `HEAP_SIZE` -and requested cpu as CPU_LIMIT (or could be GOMAXPROCS directly) environment variable. -One way to set the heap size or cpu for this application would be to wrap the binary -in a shell script, and then export `JAVA_OPTS` (assuming your container image supports it) -and GOMAXPROCS environment variables inside the container image. 
The spec file for the -application pod could look like: - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: kubernetes-downwardapi-volume-example -spec: - containers: - - name: test-container - image: k8s.gcr.io/busybox - command: [ "/bin/sh","-c", "env" ] - resources: - requests: - memory: "64M" - cpu: "250m" - limits: - memory: "128M" - cpu: "500m" - env: - - name: HEAP_SIZE - valueFrom: - resourceFieldRef: - resource: requests.memory - - name: CPU_LIMIT - valueFrom: - resourceFieldRef: - resource: requests.cpu -``` - -Note that the value of divisor by default is `1`. Now inside the container, -the HEAP_SIZE (in bytes) and GOMAXPROCS (in cores) could be exported as: - -```sh -export JAVA_OPTS="$JAVA_OPTS -Xmx:$(HEAP_SIZE)" - -and - -export GOMAXPROCS=$(CPU_LIMIT)" -``` +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/node/dynamic-kubelet-configuration.md b/contributors/design-proposals/node/dynamic-kubelet-configuration.md index fdbce1b2..f0fbec72 100644 --- a/contributors/design-proposals/node/dynamic-kubelet-configuration.md +++ b/contributors/design-proposals/node/dynamic-kubelet-configuration.md @@ -1,310 +1,6 @@ -# Dynamic Kubelet Configuration +Design proposals have been archived. -## Abstract +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -A proposal for making it possible to (re)configure Kubelets in a live cluster by providing config via the API server. Some subordinate items include local checkpointing of Kubelet configuration and the ability for the Kubelet to read config from a file on disk, rather than from command line flags. - -## Motivation - -The Kubelet is currently configured via command-line flags. This is painful for a number of reasons: -- It makes it difficult to change the way Kubelets are configured in a running cluster, because it is often tedious to change the Kubelet startup configuration (without adding your own configuration management system e.g. Ansible, Salt, Puppet). -- It makes it difficult to manage different Kubelet configurations for different nodes, e.g. if you want to canary a new config or slowly flip the switch on a new feature. -- The current lack of versioned Kubelet configuration means that any time we change Kubelet flags, we risk breaking someone's setup. - -## Example Use Cases - -- Staged rollout of configuration changes, including tuning adjustments and enabling new Kubelet features. -- Streamline cluster bootstrap. The Kubeadm folks want to plug in to dynamic config, for example: [kubernetes/kubeadm#28](https://github.com/kubernetes/kubeadm/issues/28). -- Making it easier to run tests with different Kubelet configurations, because tests can modify Kubelet configuration on the fly. - -## Primary Goals of the Design - -Kubernetes should: - -- Provide a versioned object to represent the Kubelet configuration. -- Provide the ability to specify a dynamic configuration source for each node. -- Provide a way to share the same configuration source between nodes. -- Protect against bad configuration pushes. -- Recommend, but not mandate, the basics of a workflow for updating configuration. - -Additionally, we should: - -- Add Kubelet support for consuming configuration via a file on disk. This aids work towards deprecating flags in favor of on-disk configuration. -- Make it possible to opt-out of remote configuration as an extra layer of protection. This should probably be a flag, rather than a dynamic field, as it would otherwise be too easy to accidentally turn off dynamic config with a config push. - -## Design - -Two really important questions: -1. How should one organize and represent configuration in a cluster? -2. How should one orchestrate changes to that configuration? - -This doc primarily focuses on (1) and the downstream (API server -> Kubelet) aspects of (2). - -### Organization of the Kubelet's Configuration Type - -In general, components should expose their configuration types from their own source trees. The types are currently in the alpha `componentconfig` API group, and should be broken out into the trees of their individual components. PR [#42759](https://github.com/kubernetes/kubernetes/pull/42759) reorganized the Kubelet's tree to facilitate this. 
PR [#44252](https://github.com/kubernetes/kubernetes/pull/44252) initiates the decomposition of the type. - -Components with several same-configured instances, like the Kubelet, should be able to share configuration sources. A 1:N mapping of config-object:instances is much more efficient than requiring a config object per-instance. As one example, we removed the `HostNameOverride` and `NodeIP` fields from the configuration type because these cannot be shared between Nodes - [#40117](https://github.com/kubernetes/kubernetes/pull/40117). - -Components that currently take command line flags should not just map these flags directly into their configuration types. We should, in general, think about which parameters make sense to configure dynamically, which can be shared between instances, and which are so low-level that they shouldn't really be exposed on the component's interface in the first place. Thus, the Kubelet's flags should be kept separate from configuration - [#40117](https://github.com/kubernetes/kubernetes/pull/40117). - -The Kubelet's current configuration type is an unreadable monolith. We should decompose it into sub-objects for convenience of composition and management. An example grouping is in PR [#44252](https://github.com/kubernetes/kubernetes/pull/44252). - -### Representing and Referencing Configuration - -#### Cluster-level object -- The Kubelet's configuration should be: - + *Structured* into sub-categories, so that it is readable. - + *A bundled payload*, so that all Kubelet parameters roll out to a given `Node` in unison. -- The Kubelet's configuration should be stored in the `Data` of a `ConfigMap` object. Each value should be a `YAML` or `JSON` blob, and should be associated with the correct key. - + Note that today, there is only a single `KubeletConfiguration` object (required under the `kubelet` key). -- The `ConfigMap` containing the desired configuration should be specified via the `Node` object corresponding to the Kubelet. The `Node` will have a new `spec` subfield, `configSource`, which is a new type, `NodeConfigSource` (described below). - -`ConfigMap` object containing the Kubelet's configuration: -``` -kind: ConfigMap -metadata: - name: my-kubelet-config -data: - kubelet: "{JSON blob}" -``` - -#### On-disk - -There are two types of configuration that are stored on disk: -- cached configurations from a remote source, e.g. `ConfigMaps` from etcd. -- the local "init" configuration, e.g. the set of config files the node is provisioned with. - -The Kubelet should accept a `--dynamic-config-dir` flag, which specifies a directory for storing all of the information necessary for dynamic configuration from remote sources; e.g. the cached configurations, which configuration is currently in use, which configurations are known to be bad, etc. -- When the Kubelet downloads a `ConfigMap`, it will checkpoint a serialization of the `ConfigMap` object to a file at `{dynamic-config-dir}/checkpoints/{UID}`. -- We checkpoint the entire object, rather than unpacking the contents to disk, because the former is less complex and reduces chance for errors during the checkpoint process. - -The Kubelet should also accept a `--init-config-dir` flag, which specifies a directory containing a local set of "init" configuration files the node was provisioned with. The Kubelet will substitute these files for the built-in `KubeletConfiguration` defaults (and also for the existing command-line flags that map to `KubeletConfiguration` fields). 
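As a minimal sketch of the checkpointing step described above, assuming plain JSON serialization and an illustrative `--dynamic-config-dir` path (the real kubelet would use its own serialization machinery):

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// checkpoint writes a serialization of the entire ConfigMap object to
// {dynamic-config-dir}/checkpoints/{UID}, as described above. Plain JSON is
// used here only for brevity.
func checkpoint(dynamicConfigDir string, cm *corev1.ConfigMap) error {
	dir := filepath.Join(dynamicConfigDir, "checkpoints")
	if err := os.MkdirAll(dir, 0700); err != nil {
		return err
	}
	data, err := json.Marshal(cm)
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(dir, string(cm.UID)), data, 0600)
}

func main() {
	cm := &corev1.ConfigMap{
		// Name matches the earlier example; the namespace and UID are illustrative.
		ObjectMeta: metav1.ObjectMeta{Name: "my-kubelet-config", Namespace: "kube-system", UID: "0d0ba925-example"},
		Data:       map[string]string{"kubelet": "{JSON blob}"},
	}
	fmt.Println(checkpoint("/var/lib/kubelet/dynamic-config", cm))
}
```

Checkpointing the whole object, rather than unpacking its keys onto disk, keeps the stored format trivially recoverable into the original API object.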
- -### Orchestration of configuration - -There are a lot of opinions around how to orchestrate configuration in a cluster. The following items form the basis of a robust solution: - -#### Robust Kubelet behavior - -To make config updates robust, the Kubelet should be able to locally and automatically recover from bad config pushes. We should strive to avoid requiring operator intervention, though this may not be possible in all scenarios. - -Recovery involves: -- Checkpointing configuration on-disk, so prior versions are locally available for rollback. -- Tracking a last-known-good (LKG) configuration, which will be the rollback target if the current configuration turns out to be bad. -- Tracking bad configurations, so the Kubelet can avoid using known-bad configurations across restarts. -- Detecting whether a crash-loop correlates with a new configuration, marking these configurations bad, and rolling back to the last-known-good when this happens. - -##### Finding and checkpointing intended configuration - -The Kubelet finds its intended configuration by looking for the `ConfigMap` referenced via it's `Node`'s optional `spec.configSource` field. This field will be a new type: -``` -type NodeConfigSource struct { - ConfigMapRef *ObjectReference -} -``` - -For now, this type just contains an `ObjectReference` that points to a `ConfigMap`. The `spec.configSource` field will be of type `*NodeConfigSource`, because it is optional. - -The `spec.configSource` field can be considered "correct," "empty," or "invalid." The field is "empty" if and only if it is `nil`. The field is "correct" if and only if it is neither "empty" nor "invalid." The field is "invalid" if it fails to meet any of the following criteria: -- Exactly one subfield of `NodeConfigSource` must be non-`nil`. -- All information contained in the non-`nil` subfield must meet the requirements of that subfield. - -The requirements of the `ConfigMapRef` subfield are as follows: -- All of `ConfigMapRef.UID`, `ConfigMapRef.Namespace`, and `ConfigMapRef.Name`must be non-empty. -- Both `ConfigMapRef.UID` and the `ConfigMapRef.Namespace`-`ConfigMapRef.Name` pair must unambiguously refer to the same object. -- The referenced object must exist. -- The referenced object must be a `ConfigMap`. - -The Kubelet must have permission to read `ConfigMap`s in the namespace that contains the referenced `ConfigMap`. - -If the `spec.configSource` is empty, the Kubelet will use its "init" configuration or built-in defaults (including values from flags that currently map to configuration) if the "init" config is also absent. - -If the `spec.configSource` is invalid (or if some other issue prevents syncing configuration with what is specified on the `Node`): -- If the Kubelet is in its startup sequence, it will defer to its LKG configuration and report the failure to determine the desired configuration via a `NodeCondition` (discussed later) when the status sync loop starts. -- If the failure to determine desired configuration occurs as part of the configuration sync-loop operation of a live Kubelet, the failure will be reported in a `NodeCondition` (discussed later), but the Kubelet will not change its configuration. This is to prevent disrupting live Kubelets in the event of user error. - -If the `spec.configSource` is correct and using `ConfigMapRef`, the Kubelet checkpoints this `ConfigMap` to the `dynamic-config-dir`, as specified above in the *Representing and Referencing Configuration* section. 
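A sketch of the local portion of these checks follows; the cross-checks against the API server (that the referenced object exists, is a `ConfigMap`, and that its UID matches the namespace/name pair) are omitted, and the error strings are only illustrative:

```go
package main

import (
	"errors"
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// NodeConfigSource mirrors the type proposed above.
type NodeConfigSource struct {
	ConfigMapRef *corev1.ObjectReference
}

// classifyConfigSource reports whether the source is "empty", and returns an
// error when it is "invalid" under the local rules above: exactly one subfield
// must be non-nil, and a ConfigMapRef must carry a non-empty UID, namespace,
// and name.
func classifyConfigSource(src *NodeConfigSource) (empty bool, err error) {
	if src == nil {
		return true, nil // empty: fall back to the init config or built-in defaults
	}
	if src.ConfigMapRef == nil {
		return false, errors.New("invalid NodeConfigSource, exactly one subfield must be non-nil, but all were nil")
	}
	ref := src.ConfigMapRef
	if ref.UID == "" || ref.Namespace == "" || ref.Name == "" {
		return false, errors.New("invalid ConfigMapRef: uid, namespace, and name must all be non-empty")
	}
	return false, nil
}

func main() {
	_, err := classifyConfigSource(&NodeConfigSource{ConfigMapRef: &corev1.ObjectReference{Name: "my-kubelet-config"}})
	fmt.Println(err) // invalid: uid and namespace are missing

	empty, _ := classifyConfigSource(nil)
	fmt.Println(empty) // true
}
```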
- -The Kubelet serializes a "current" and "last-known-good" `NodeConfigSource` to disk to track which configuration it is currently using, and which to roll back to in the event of a failure. If either of these files is empty, the Kubelet interprets this as "use the local config," whether that means the combination of built-in defaults and flags, or the "init" config. - -The Kubelet detects new configuration by watching the `Node` object for changes to the `spec.configSource` field. When the Kubelet detects new configuration, it checkpoints it as necessary, saves the new `NodeConfigSource` to the "current" file, and calls `os.Exit(0)`. This relies on the process manager (e.g. `systemd`) to restart the Kubelet. When the Kubelet restarts, it will attempt to load the new configuration. - -##### Metrics for Bad Configuration - -These are metrics we can use to detect bad configuration. Some are perfect indicators (validation) and some are imperfect (`P(ThinkBad == ActuallyBad) < 1`). - -Perfect Metrics: -- `KubeletConfiguration` cannot be deserialized -- `KubeletConfiguration` fails a validation step - -Imperfect Metrics: -- Kubelet restarts occur above a frequency threshold when using a given configuration, before that configuration is out of a "trial period." - -We should absolutely use the perfect metrics we have available. These immediately tell us when we have bad configuration. Adding a user-defined trial period, within which restarts above a user-defined frequency can be treated as crash-loops caused by the current configuration, adds some protection against more complex configuration mishaps. - -More advanced error detection is probably best left to an out-of-band component, like the Node Problem Detector. We shouldn't go overboard attempting to make the Kubelet too smart. - -##### Tracking LKG (last-known-good) configuration - -The Kubelet tracks its "last-known-good" configuration by saving the relevant `NodeConfigSource` to a file, just as it does for its "current" configuration. If this file is empty, the Kubelet should treat the "init" config or combination of built-in defaults and flags as it's last-known-good config. - -Any configuration retrieved from the API server must persist beyond a trial period before it can be considered LKG. This trial period will be called `ConfigTrialDuration`, will be a `Duration` as defined by `k8s.io/apimachinery/pkg/apis/meta/v1/duration.go`, and will be a parameter of the `KubeletConfiguration`. The trial period on a given configuration is the trial period used for that configuration (as opposed to, say, using the trial period set on the previous configuration). This is useful if you have, for example, a configuration you want to roll out with a longer trial period for additional caution. - -Similarly, the number of restarts tolerated during the trial period is exposed to the user via the `CrashLoopThreshold` field of the `KubeletConfiguration`. This field has a minimum of `0` and a maximum of `10`. The maximum of `10` is an arbitrary threshold to prevent unbounded growth of the startups-tracking file (discussed later). We implicitly allow one more restart than the user-provided threshold, because one restart is necessary to begin using a new configuration. - -The "init" configuration will be automatically considered good. If a node is provisioned with an init configuration, it MUST be a valid configuration. 
The Kubelet will always attempt to deserialize the init configuration and validate it on startup, regardless of whether a remote configuration exists. If this fails, the Kubelet will refuse to start. Similarly, the Kubelet will refuse to start if the sum total of built-in defaults and flag values that still map to configuration is invalid. This is to make invalid node provisioning extremely obvious. - -It is very important to be strict about the validity of the init and default configurations, because these are the baseline last-known-good configurations. If either configuration turns out to be bad, there is nothing to fall back to. We presume a user provisions nodes with an init configuration when the Kubelet defaults are inappropriate for their use case. It would thus be inappropriate to fall back to the Kubelet defaults if the init configuration exists. - -As the init configuration and the built-in defaults are automatically considered good, intentionally setting `spec.configSource` on the `Node` to its empty default will reset the last-known-good back to whichever local config is in use. - -##### Rolling back to the LKG config - -When a configuration correlates too strongly with a crash loop, the Kubelet will "roll-back" to its last-known-good configuration. This process involves three components: -1. The Kubelet must choose to use its LKG configuration instead of its intended current configuration. -2. The Kubelet must remember which configuration was bad, so it doesn't roll forward to that configuration again. -3. The Kubelet must report that it rolled back to LKG due to the *belief* that it had a bad configuration. - -Regarding (2), when the Kubelet detects a bad configuration, it will add an entry to a "bad configs" file in the `dynamic-config-dir`, mapping the namespace and name of the `ConfigMap` to the time at which it was determined to be a bad config and the reason it was marked bad. The Kubelet will not roll forward to any of these configurations again unless their entries are removed from this file. For example, the contents of this file might look like (shown here with a `reason` matching what would be reported in a `NodeCondition`: -``` -{ - "{uid}": { - "time": "RFC 3339 formatted timestamp", - "reason": "failed to validate current (UID: {UID})" - } -} -``` - -Regarding (1), the Kubelet will check the "bad configs" file on startup. It will use the "last-known-good" config if the "current" config referenced is listed in the "bad configs" file. - -Regarding (3), the Kubelet should report via the `Node`'s status: -- That it is using LKG. -- The configuration LKG refers to. -- The supposedly bad configuration that the Kubelet decided to avoid. -- The reason it thinks the configuration is bad. - -##### Tracking restart frequency against the current configuration - -Every time the Kubelet starts up, it will append the startup time to a "startups" file in the `dynamic-config-dir`. This file is a JSON list of string RFC3339-formatted timestamps. On Kubelet startup, if the time elapsed since the last modification to the "current" file does not exceed `ConfigTrialDuration`, the Kubelet will count the number of timestamps in the "startups" file that occur after the last modification to "current." If this number exceeds the `CrashLoopThreshold`, the configuration will be marked bad and considered the cause of the crash-loop. The Kubelet will then roll back to its LKG configuration. 
We use "exceeds" as the trigger, because the Kubelet must be able to restart once to adopt a new configuration. - -##### Dead-end states - -We may use imperfect indicators to detect bad configuration. It is possible for a crash-loop unrelated to the current configuration to cause that configuration to be marked bad. This becomes evident when the Kubelet rolls back to the LKG configuration and continues to crash. In this scenario, an out-of-band node repair is required to revive the Kubelet. Since the current configuration was not, in fact, the cause of the issue, the component in charge of node repair should also reset that belief by removing the entry for the current configuration from the "bad configs" file. - -##### How to version "bad configs" and "startups" tracking files - -Having unversioned types for persisted data presents an upgrade risk; if a new Kubelet expects a different data format, the new Kubelet may fail. We need a versioning mechanism for these files to protect against this. - -We have at least these two options: -- Define a versioned API group, similar to how we have a versioned KubeletConfiguration object, for these data-tracking types. -- Alternatively, store these files under a directory that contains the Kubelet's version string in its name. - -There are tradeoffs between these approaches: - -**Theoretical kubelet version pros:** -1. Config marked bad due to a bug in an old Kubelet won't be considered bad by the new Kubelet that contains the bug fix, e.g. incorrect bad-config information won't leak across Kubelet upgrades. -2. If you have to roll back the Kubelet version while the current config is still in its trial period, config marked bad by a new Kubelet due to a bug or crash-loop won't all-of-a-sudden look bad to an old Kubelet that used to accept it. -3. We never worry about whether a new Kubelet is compatible with a given schema for this tracking state, because the state is always local to a given Kubelet version. - -**Pros in practice:** -1. In practice, you probably moved away from the config your old Kubelet marked bad. It is not unreasonable, however, that the ConfigMap behind the old config still exists in your cluster, and you would like to roll this config out now that you have a new Kubelet. In this case, it's nice not to have to delete and recreate the same ConfigMap (config source objects are checkpointed by UID, re-creation gets you a new UID) just to roll it out. In practice, this is probably a rare scenario because users shouldn't initiate a Kubelet upgrade until they are sure the current config is stable. -2. This is a worse information leak than the previous scenario. A new Kubelet that crash-loops might mark the current config bad (or simply record enough startups that the old Kubelet will mark it bad), and then you won't be able to roll back to the same Kubelet + config combination as before because the old Kubelet will now reject the current config. In practice, this is probably a rare scenario because users shouldn't initiate a Kubelet upgrade until they are sure the current config is stable. -3. This is nice in practice: it also allows us to make what would be breaking changes (if we used apimachinery) across Kubelet versions, which means we can more easily iterate on the Kubelet's behavior in the face of bad config. - -**Theoretical kubelet version cons:** -1. Different Kubelet versions can't make use of each-other's bad-config information. 
Thus if you initiate a Kubelet upgrade on a cluster where the Kubelets are currently using the last-known-good config because the current config is bad, the new Kubelets will have to rediscover that the current config is bad. -2. Different Kubelet versions can't make use of each-other's startup count. Thus if you initiate a Kubelet upgrade on a cluster where the Kubelets are currently using the last-known-good config because the current config caused a crash-loop, the new Kubelets will have to rediscover that the current config is bad by crash-looping. -3. The Kubelet has to make sure it properly namespaces this state under a versioned directory name. - -**Cons in practice:** -1. This scenario is certainly possible, but except for the crash-loop case, it is unlikely to cause serious issues, as all other "bad config" cases are discovered almost immediately during loading/parsing/validating the config. In general, we should recommend that Kubelet version upgrades only be performed when the `ConfigOK` condition is `True` anyway, so this should be a rare scenario. -2. This is slightly worse than Con 1, because a Kubelet crash-loop will be more likely to disrupt Kubelet version upgrades, e.g. causing an automatic rollback. As above, however, we should recommend that Kubelet version upgrades only be performed when the `ConfigOK` condition is `True`, in which case this should be a rare scenario. -3. This isn't particularly difficult to do, but it is additional code. - -Most of the pros and cons aren't particularly strong either way; the scenarios should be rare in practice given good operational procedures. The remaining pros/cons come down to which versioning scheme is nicer from a development perspective. Namespacing under a versioned directory is fairly easy to do and allows us to iterate quickly and safely across Kubelet versions. - -This doc proposes namespacing under a versioned directory. - - -##### Reporting Configuration Status - -Succinct messages related to the state of `Node` configuration should be reported in a `NodeCondition`, in `status.conditions`. These should inform users which `ConfigMap` the Kubelet is using, and if the Kubelet has detected any issues with the configuration. The Kubelet should report this condition during startup, after attempting to validate configuration but before actually using it, so that the chance of a bad configuration inhibiting status reporting is minimized. - -All `NodeCondition`s contain the fields: `lastHeartbeatTime:Time`, `lastTransitionTime:Time`, `message:string`, `reason:string`, `status:string(True|False|Unknown)`, and `type:string`. - -These are some brief descriptions of how these fields should be interpreted for node-configuration related conditions: -- `lastHeartbeatTime`: The last time the Kubelet updated the condition. The Kubelet will typically do this whenever it is restarted, because that is when configuration changes occur. The Kubelet will update this on restart regardless of whether the configuration, or the condition, has changed. -- `lastTransitionTime`: The last time this condition changed. The Kubelet will not update this unless it intends to set a different condition than is currently set. -- `message`: Think of this as the "effect" of the `reason`. -- `reason`: Think of this as the "cause" of the `message`. -- `status`: `True` if the currently set configuration is considered OK, `False` if it is known not to be. `Unknown` is used when the Kubelet cannot determine the user's desired configuration. 
-- `type`: `ConfigOK` will always be used for these conditions. - -The following list of example conditions, sans `type`, `lastHeartbeatTime`, and `lastTransitionTime`, can be used to get a feel for the relationship between `message`, `reason`, and `status`: - -Config is OK: -``` -message: "using current (UID: {cur-UID})" -reason: "all checks passed" -status: "True" -``` - -No remote config specified: -``` -message: "using current (init)" -reason: "current is set to the local default, and an init config was provided" -status: "True" -``` - -No remote config specified, no local `init` config provided: -``` -message: "using current (default)" -reason: "current is set to the local default, and no init config was provided" -status: "True" -``` - -If `Node.spec.configSource` is invalid during Kubelet startup: -``` -message: "using last-known-good (init)" -reason: "failed to sync, desired config unclear, cause: invalid NodeConfigSource, exactly one subfield must be non-nil, but all were nil" -status: "Unknown" -``` - -Validation of a configuration fails: -``` -message: "using last known good: (UID: {lkg-UID})" -reason: "failed to validate current (UID: {cur-UID})" -status: "False" -``` - -The same text as the `reason`, along with more details on the precise nature of the error, will be printed in the Kubelet log. - -### Operational Considerations - -#### Rollout workflow - -Kubernetes does not have the concepts of immutable, or even undeleteable API objects. This makes it easy to "shoot yourself in the foot" by modifying or deleting a `ConfigMap`. This results in undefined behavior, because the assumption is that these `ConfigMaps` are not mutated once deployed. For example, this design includes no method for invalidating the Kubelet's local cache of configurations, so there is no concept of eventually consistent results from edits or deletes of a `ConfigMap`. You may, in such a scenario, end up with partially consistent results or no results at all. - -Thus, we recommend that rollout workflow consist only of creating new `ConfigMap` objects and updating the `spec.configSource` field on each `Node` to point to that new object. This results in a controlled rollout with well-defined behavior. - -There is discussion in [#10179](https://github.com/kubernetes/kubernetes/issues/10179) regarding ways to prevent unintentional mutation and deletion of objects. - -## Additional concerns not-yet-addressed - -### Monitoring configuration status - -- A way to query/monitor the config in-use on a given node. Today this is possible via the configz endpoint, but this is just for debugging, not production use. There are a number of other potential solutions, e.g. exposing live config via Prometheus. - -### Orchestration -- A specific orchestration solution for rolling out kubelet configuration. There are several factors to think about, including these general rollout considerations: - + Selecting the objects that the new config should be rolled out to (Today it is probably okay to roll out `Node` config with the intention that the nodes are eventually homogeneously configured across the cluster. But what if a user intentionally wants different sets of nodes to be configured differently? A cluster running multiple separate applications, for example.). - + Specifying the desired end-state of the rollout. - + Specifying the rate at which to roll out. - + Detecting problems with the rollout and automatically stopping bad rollouts. - + Specifying how risk-averse to be when deciding to stop a rollout. 
- + Recording a history of rollouts so that it is possible to roll back to previous versions. - + Properly handling objects that are added or removed while a configuration rollout is in process. For example, `Nodes` might be added or removed due to an autoscaling feature, etc. - + Reconciling configuration with objects added after the completion of a rollout, e.g. new `Nodes`. - + Pausing/resuming a rollout. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/node/envvar-configmap.md b/contributors/design-proposals/node/envvar-configmap.md index 9464a1af..f0fbec72 100644 --- a/contributors/design-proposals/node/envvar-configmap.md +++ b/contributors/design-proposals/node/envvar-configmap.md @@ -1,184 +1,6 @@ -# ConfigMaps as environment variables +Design proposals have been archived. -## Goal +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Populating environment variables of a container from an entire ConfigMap. - -## Design Points - -A container can specify a set of existing ConfigMaps to populate environment variables. - -There needs to be an easy way to isolate the variables introduced by a given -ConfigMap. The contents of a ConfigMap may not be known in advance and it may -be generated by someone or something else. Services may provide binding -information via a ConfigMap. If you have a common service with multiple -instances like a Message Queue or Database, there needs to be a way to -uniquely identify and prevent collision when consuming multiple ConfigMaps in -a single container using this feature. - -## Proposed Design - -Containers can specify a set of sources that are consumed as environment -variables. One such source is a ConfigMap. -Each key defined in the ConfigMap's `Data` object must be a "C" identifier. If -an invalid key is present, the container will fail to start. - -Environment variables defined by a `Container` are processed in a specific -order. The processing order is as follows: - -1. All automatic service environment variables -1. All `EnvFrom` blocks in order -1. All `Env` blocks in order. - -The last value processed for any given environment variable will be the -decided winner. Variable references defined by an `EnvVar` struct will be -resolved by the current values defined even if the value is replaced later. - -To prevent collisions amongst multiple ConfigMaps, each defined ConfigMap can -have an optional associated prefix that is prepended to each key in the -ConfigMap. Prefixes must be a "C" identifier. - -### Kubectl updates - -The `describe` command will display the configmap name that have been defined as -part of the environment variable section including the optional prefix when -defined. - -### API Resource - -A new `EnvFromSource` type containing a `ConfigMapRef` will be added to the -`Container` struct. - -```go -// EnvFromSource represents the source of a set of ConfigMaps -type EnvFromSource struct { - // A string to place in front of every key. Must be a C_IDENTIFIER. - // +optional - Prefix string `json:"prefix,omitempty"` - // The ConfigMap to select from - ConfigMapRef *LocalObjectReference `json:"configMapRef,omitempty"` -} - -type Container struct { - // List of sources to populate environment variables in the container. - // The keys defined within a source must be a C_IDENTIFIER. An invalid key - // will prevent the container from starting. When a key exists in multiple - // sources, the value associated with the last source will take precedence. - // Values defined by an Env with a duplicate key will take precedence over - // any listed source. - // Cannot be updated. 
- // +optional - EnvFrom []EnvFromSource `json:"envFrom,omitempty"` -} -``` - -### Examples - -### Consuming `ConfigMap` as Environment Variables - -```yaml -apiVersion: v1 -kind: ConfigMap -metadata: - name: etcd-env-config -data: - number_of_members: "1" - initial_cluster_state: new - initial_cluster_token: DUMMY_ETCD_INITIAL_CLUSTER_TOKEN - discovery_token: DUMMY_ETCD_DISCOVERY_TOKEN - discovery_url: http://etcd_discovery:2379 - etcdctl_peers: http://etcd:2379 - duplicate_key: FROM_CONFIG_MAP - REPLACE_ME: "a value" -``` - -This pod consumes the entire `ConfigMap` as environment variables: - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: config-env-example -spec: - containers: - - name: etcd - image: openshift/etcd-20-centos7 - ports: - - containerPort: 2379 - protocol: TCP - - containerPort: 2380 - protocol: TCP - env: - - name: duplicate_key - value: FROM_ENV - - name: expansion - value: $(REPLACE_ME) - envFrom: - - configMapRef: - name: etcd-env-config -``` - -The resulting environment variables will be: - -``` -number_of_members="1" -initial_cluster_state="new" -initial_cluster_token="DUMMY_ETCD_INITIAL_CLUSTER_TOKEN" -discovery_token="DUMMY_ETCD_DISCOVERY_TOKEN" -discovery_url="http://etcd_discovery:2379" -etcdctl_peers="http://etcd:2379" -duplicate_key="FROM_ENV" -expansion="a value" -REPLACE_ME="a value" -``` - -### Consuming multiple `ConfigMap` as Environment Variables - -```yaml -apiVersion: v1 -kind: ConfigMap -metadata: - name: env-config -data: - key1: a - key2: b -``` - -This pod consumes the entire `ConfigMap` as environment variables: - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: config-env-example -spec: - containers: - - name: etcd - image: openshift/etcd-20-centos7 - ports: - - containerPort: 2379 - protocol: TCP - - containerPort: 2380 - protocol: TCP - envFrom: - - prefix: cm1_ - configMapRef: - name: env-config - - prefix: cm2_ - configMapRef: - name: env-config -``` - -The resulting environment variables will be: - -``` -cm1_key1="a" -cm1_key2="b" -cm2_key1="a" -cm2_key2="b" -``` - -### Future - -Add similar support for Secrets. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/node/expansion.md b/contributors/design-proposals/node/expansion.md index 2647e85c..f0fbec72 100644 --- a/contributors/design-proposals/node/expansion.md +++ b/contributors/design-proposals/node/expansion.md @@ -1,412 +1,6 @@ -# Variable expansion in pod command, args, and env +Design proposals have been archived. -## Abstract +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -A proposal for the expansion of environment variables using a simple `$(var)` -syntax. -## Motivation - -It is extremely common for users to need to compose environment variables or -pass arguments to their commands using the values of environment variables. -Kubernetes should provide a facility for the 80% cases in order to decrease -coupling and the use of workarounds. - -## Goals - -1. Define the syntax format -2. Define the scoping and ordering of substitutions -3. Define the behavior for unmatched variables -4. Define the behavior for unexpected/malformed input - -## Constraints and Assumptions - -* This design should describe the simplest possible syntax to accomplish the -use-cases. -* Expansion syntax will not support more complicated shell-like behaviors such -as default values (viz: `$(VARIABLE_NAME:"default")`), inline substitution, etc. - -## Use Cases - -1. As a user, I want to compose new environment variables for a container using -a substitution syntax to reference other variables in the container's -environment and service environment variables. -1. As a user, I want to substitute environment variables into a container's -command. -1. As a user, I want to do the above without requiring the container's image to -have a shell. -1. As a user, I want to be able to specify a default value for a service -variable which may not exist. -1. As a user, I want to see an event associated with the pod if an expansion -fails (ie, references variable names that cannot be expanded). - -### Use Case: Composition of environment variables - -Currently, containers are injected with docker-style environment variables for -the services in their pod's namespace. There are several variables for each -service, but users routinely need to compose URLs based on these variables -because there is not a variable for the exact format they need. Users should be -able to build new environment variables with the exact format they need. -Eventually, it should also be possible to turn off the automatic injection of -the docker-style variables into pods and let the users consume the exact -information they need via the downward API and composition. - -#### Expanding expanded variables - -It should be possible to reference an variable which is itself the result of an -expansion, if the referenced variable is declared in the container's environment -prior to the one referencing it. Put another way -- a container's environment is -expanded in order, and expanded variables are available to subsequent -expansions. - -### Use Case: Variable expansion in command - -Users frequently need to pass the values of environment variables to a -container's command. Currently, Kubernetes does not perform any expansion of -variables. The workaround is to invoke a shell in the container's command and -have the shell perform the substitution, or to write a wrapper script that sets -up the environment and runs the command. This has a number of drawbacks: - -1. 
Solutions that require a shell are unfriendly to images that do not contain -a shell. -2. Wrapper scripts make it harder to use images as base images. -3. Wrapper scripts increase coupling to Kubernetes. - -Users should be able to do the 80% case of variable expansion in command without -writing a wrapper script or adding a shell invocation to their containers' -commands. - -### Use Case: Images without shells - -The current workaround for variable expansion in a container's command requires -the container's image to have a shell. This is unfriendly to images that do not -contain a shell (`scratch` images, for example). Users should be able to perform -the other use-cases in this design without regard to the content of their -images. - -### Use Case: See an event for incomplete expansions - -It is possible that a container with incorrect variable values or command line -may continue to run for a long period of time, and that the end-user would have -no visual or obvious warning of the incorrect configuration. If the kubelet -creates an event when an expansion references a variable that cannot be -expanded, it will help users quickly detect problems with expansions. - -## Design Considerations - -### What features should be supported? - -In order to limit complexity, we want to provide the right amount of -functionality so that the 80% cases can be realized and nothing more. We felt -that the essentials boiled down to: - -1. Ability to perform direct expansion of variables in a string. -2. Ability to specify default values via a prioritized mapping function but -without support for defaults as a syntax-level feature. - -### What should the syntax be? - -The exact syntax for variable expansion has a large impact on how users perceive -and relate to the feature. We considered implementing a very restrictive subset -of the shell `${var}` syntax. This syntax is an attractive option on some level, -because many people are familiar with it. However, this syntax also has a large -number of lesser known features such as the ability to provide default values -for unset variables, perform inline substitution, etc. - -In the interest of preventing conflation of the expansion feature in Kubernetes -with the shell feature, we chose a different syntax similar to the one in -Makefiles, `$(var)`. We also chose not to support the bar `$var` format, since -it is not required to implement the required use-cases. - -Nested references, ie, variable expansion within variable names, are not -supported. - -#### How should unmatched references be treated? - -Ideally, it should be extremely clear when a variable reference couldn't be -expanded. We decided the best experience for unmatched variable references would -be to have the entire reference, syntax included, show up in the output. As an -example, if the reference `$(VARIABLE_NAME)` cannot be expanded, then -`$(VARIABLE_NAME)` should be present in the output. - -#### Escaping the operator - -Although the `$(var)` syntax does overlap with the `$(command)` form of command -substitution supported by many shells, because unexpanded variables are present -verbatim in the output, we expect this will not present a problem to many users. -If there is a collision between a variable name and command substitution syntax, -the syntax can be escaped with the form `$$(VARIABLE_NAME)`, which will evaluate -to `$(VARIABLE_NAME)` whether `VARIABLE_NAME` can be expanded or not. 
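To make the escaping and unmatched-reference rules concrete, the following is a simplified, illustrative Go sketch of these semantics. It is not the proposed expansion package itself, and event recording for unresolved names is omitted:

```go
package main

import (
	"fmt"
	"strings"
)

// expand applies the $(var) semantics sketched here: "$$" escapes to a literal
// "$", a well-formed "$(name)" is resolved through the mapping function, and a
// malformed reference (a "$(" with no closing ")") passes through untouched.
func expand(input string, mapping func(string) string) string {
	var out strings.Builder
	for i := 0; i < len(input); i++ {
		if input[i] != '$' {
			out.WriteByte(input[i])
			continue
		}
		if i+1 >= len(input) { // trailing "$" is an ordinary character
			out.WriteByte('$')
			continue
		}
		switch input[i+1] {
		case '$': // escaped operator: "$$" emits one "$"
			out.WriteByte('$')
			i++
		case '(': // candidate variable reference
			if end := strings.IndexByte(input[i+2:], ')'); end >= 0 {
				out.WriteString(mapping(input[i+2 : i+2+end]))
				i += 2 + end
			} else { // malformed: no closer, so "$" is ordinary text
				out.WriteByte('$')
			}
		default: // bare "$" with no opener is an ordinary character
			out.WriteByte('$')
		}
	}
	return out.String()
}

// mappingFor wraps unresolved names back in the reference syntax, mirroring
// the behavior intended for the mapping function (event recording omitted).
func mappingFor(contexts ...map[string]string) func(string) string {
	return func(name string) string {
		for _, ctx := range contexts {
			if v, ok := ctx[name]; ok {
				return v
			}
		}
		return "$(" + name + ")"
	}
}

func main() {
	m := mappingFor(map[string]string{"VAR_A": "A", "VAR_B": "B"})
	fmt.Println(expand("$(VAR_A)_$(VAR_B)", m)) // A_B
	fmt.Println(expand("$$(VAR_A)", m))         // $(VAR_A)
	fmt.Println(expand("foo$(VAR_DNE)bar", m))  // foo$(VAR_DNE)bar
	fmt.Println(expand("$(VAR_B)___$(A", m))    // B___$(A
}
```

The outputs shown in the comments line up with several rows of the input/output table given later in this design.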
- -## Design - -This design encompasses the variable expansion syntax and specification and the -changes needed to incorporate the expansion feature into the container's -environment and command. - -### Syntax and expansion mechanics - -This section describes the expansion syntax, evaluation of variable values, and -how unexpected or malformed inputs are handled. - -#### Syntax - -The inputs to the expansion feature are: - -1. A utf-8 string (the input string) which may contain variable references. -2. A function (the mapping function) that maps the name of a variable to the -variable's value, of type `func(string) string`. - -Variable references in the input string are indicated exclusively with the syntax -`$(<variable-name>)`. The syntax tokens are: - -- `$`: the operator, -- `(`: the reference opener, and -- `)`: the reference closer. - -The operator has no meaning unless accompanied by the reference opener and -closer tokens. The operator can be escaped using `$$`. One literal `$` will be -emitted for each `$$` in the input. - -The reference opener and closer characters have no meaning when not part of a -variable reference. If a variable reference is malformed, viz: `$(VARIABLE_NAME` -without a closing expression, the operator and expression opening characters are -treated as ordinary characters without special meanings. - -#### Scope and ordering of substitutions - -The scope in which variable references are expanded is defined by the mapping -function. Within the mapping function, any arbitrary strategy may be used to -determine the value of a variable name. The most basic implementation of a -mapping function is to use a `map[string]string` to lookup the value of a -variable. - -In order to support default values for variables like service variables -presented by the kubelet, which may not be bound because the service that -provides them does not yet exist, there should be a mapping function that uses a -list of `map[string]string` like: - -```go -func MakeMappingFunc(maps ...map[string]string) func(string) string { - return func(input string) string { - for _, context := range maps { - val, ok := context[input] - if ok { - return val - } - } - - return "" - } -} - -// elsewhere -containerEnv := map[string]string{ - "FOO": "BAR", - "ZOO": "ZAB", - "SERVICE2_HOST": "some-host", -} - -serviceEnv := map[string]string{ - "SERVICE_HOST": "another-host", - "SERVICE_PORT": "8083", -} - -// single-map variation -mapping := MakeMappingFunc(containerEnv) - -// default variables not found in serviceEnv -mappingWithDefaults := MakeMappingFunc(serviceEnv, containerEnv) -``` - -### Implementation changes - -The necessary changes to implement this functionality are: - -1. Add a new interface, `ObjectEventRecorder`, which is like the -`EventRecorder` interface, but scoped to a single object, and a function that -returns an `ObjectEventRecorder` given an `ObjectReference` and an -`EventRecorder`. -2. Introduce `third_party/golang/expansion` package that provides: - 1. An `Expand(string, func(string) string) string` function. - 2. A `MappingFuncFor(ObjectEventRecorder, ...map[string]string) string` -function. -3. Make the kubelet expand environment correctly. -4. Make the kubelet expand command correctly. - -#### Event Recording - -In order to provide an event when an expansion references undefined variables, -the mapping function must be able to create an event. 
In order to facilitate -this, we should create a new interface in the `api/client/record` package which -is similar to `EventRecorder`, but scoped to a single object: - -```go -// ObjectEventRecorder knows how to record events about a single object. -type ObjectEventRecorder interface { - // Event constructs an event from the given information and puts it in the queue for sending. - // 'reason' is the reason this event is generated. 'reason' should be short and unique; it will - // be used to automate handling of events, so imagine people writing switch statements to - // handle them. You want to make that easy. - // 'message' is intended to be human readable. - // - // The resulting event will be created in the same namespace as the reference object. - Event(reason, message string) - - // Eventf is just like Event, but with Sprintf for the message field. - Eventf(reason, messageFmt string, args ...interface{}) - - // PastEventf is just like Eventf, but with an option to specify the event's 'timestamp' field. - PastEventf(timestamp unversioned.Time, reason, messageFmt string, args ...interface{}) -} -``` - -There should also be a function that can construct an `ObjectEventRecorder` from a `runtime.Object` -and an `EventRecorder`: - -```go -type objectRecorderImpl struct { - object runtime.Object - recorder EventRecorder -} - -func (r *objectRecorderImpl) Event(reason, message string) { - r.recorder.Event(r.object, reason, message) -} - -func ObjectEventRecorderFor(object runtime.Object, recorder EventRecorder) ObjectEventRecorder { - return &objectRecorderImpl{object, recorder} -} -``` - -#### Expansion package - -The expansion package should provide two methods: - -```go -// MappingFuncFor returns a mapping function for use with Expand that -// implements the expansion semantics defined in the expansion spec; it -// returns the input string wrapped in the expansion syntax if no mapping -// for the input is found. If no expansion is found for a key, an event -// is raised on the given recorder. -func MappingFuncFor(recorder record.ObjectEventRecorder, context ...map[string]string) func(string) string { - // ... -} - -// Expand replaces variable references in the input string according to -// the expansion spec using the given mapping function to resolve the -// values of variables. -func Expand(input string, mapping func(string) string) string { - // ... -} -``` - -#### Kubelet changes - -The Kubelet should be made to correctly expand variables references in a -container's environment, command, and args. Changes will need to be made to: - -1. The `makeEnvironmentVariables` function in the kubelet; this is used by -`GenerateRunContainerOptions`, which is used by both the docker and rkt -container runtimes. -2. The docker manager `setEntrypointAndCommand` func has to be changed to -perform variable expansion. -3. The rkt runtime should be made to support expansion in command and args -when support for it is implemented. - -### Examples - -#### Inputs and outputs - -These examples are in the context of the mapping: - -| Name | Value | -|-------------|------------| -| `VAR_A` | `"A"` | -| `VAR_B` | `"B"` | -| `VAR_C` | `"C"` | -| `VAR_REF` | `$(VAR_A)` | -| `VAR_EMPTY` | `""` | - -No other variables are defined. 
- -| Input | Result | -|--------------------------------|----------------------------| -| `"$(VAR_A)"` | `"A"` | -| `"___$(VAR_B)___"` | `"___B___"` | -| `"___$(VAR_C)"` | `"___C"` | -| `"$(VAR_A)-$(VAR_A)"` | `"A-A"` | -| `"$(VAR_A)-1"` | `"A-1"` | -| `"$(VAR_A)_$(VAR_B)_$(VAR_C)"` | `"A_B_C"` | -| `"$$(VAR_B)_$(VAR_A)"` | `"$(VAR_B)_A"` | -| `"$$(VAR_A)_$$(VAR_B)"` | `"$(VAR_A)_$(VAR_B)"` | -| `"f000-$$VAR_A"` | `"f000-$VAR_A"` | -| `"foo\\$(VAR_C)bar"` | `"foo\Cbar"` | -| `"foo\\\\$(VAR_C)bar"` | `"foo\\Cbar"` | -| `"foo\\\\\\\\$(VAR_A)bar"` | `"foo\\\\Abar"` | -| `"$(VAR_A$(VAR_B))"` | `"$(VAR_A$(VAR_B))"` | -| `"$(VAR_A$(VAR_B)"` | `"$(VAR_A$(VAR_B)"` | -| `"$(VAR_REF)"` | `"$(VAR_A)"` | -| `"%%$(VAR_REF)--$(VAR_REF)%%"` | `"%%$(VAR_A)--$(VAR_A)%%"` | -| `"foo$(VAR_EMPTY)bar"` | `"foobar"` | -| `"foo$(VAR_Awhoops!"` | `"foo$(VAR_Awhoops!"` | -| `"f00__(VAR_A)__"` | `"f00__(VAR_A)__"` | -| `"$?_boo_$!"` | `"$?_boo_$!"` | -| `"$VAR_A"` | `"$VAR_A"` | -| `"$(VAR_DNE)"` | `"$(VAR_DNE)"` | -| `"$$$$$$(BIG_MONEY)"` | `"$$$(BIG_MONEY)"` | -| `"$$$$$$(VAR_A)"` | `"$$$(VAR_A)"` | -| `"$$$$$$$(GOOD_ODDS)"` | `"$$$$(GOOD_ODDS)"` | -| `"$$$$$$$(VAR_A)"` | `"$$$A"` | -| `"$VAR_A)"` | `"$VAR_A)"` | -| `"${VAR_A}"` | `"${VAR_A}"` | -| `"$(VAR_B)_______$(A"` | `"B_______$(A"` | -| `"$(VAR_C)_______$("` | `"C_______$("` | -| `"$(VAR_A)foobarzab$"` | `"Afoobarzab$"` | -| `"foo-\\$(VAR_A"` | `"foo-\$(VAR_A"` | -| `"--$($($($($--"` | `"--$($($($($--"` | -| `"$($($($($--foo$("` | `"$($($($($--foo$("` | -| `"foo0--$($($($("` | `"foo0--$($($($("` | -| `"$(foo$$var)"` | `"$(foo$$var)"` | - -#### In a pod: building a URL - -Notice the `$(var)` syntax. - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: expansion-pod -spec: - containers: - - name: test-container - image: k8s.gcr.io/busybox - command: [ "/bin/sh", "-c", "env" ] - env: - - name: PUBLIC_URL - value: "http://$(GITSERVER_SERVICE_HOST):$(GITSERVER_SERVICE_PORT)" - restartPolicy: Never -``` - -#### In a pod: building a URL using downward API - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: expansion-pod -spec: - containers: - - name: test-container - image: k8s.gcr.io/busybox - command: [ "/bin/sh", "-c", "env" ] - env: - - name: POD_NAMESPACE - valueFrom: - fieldRef: - fieldPath: "metadata.namespace" - - name: PUBLIC_URL - value: "http://gitserver.$(POD_NAMESPACE):$(SERVICE_PORT)" - restartPolicy: Never -``` +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/node/kubelet-auth.md b/contributors/design-proposals/node/kubelet-auth.md index cb34f65d..f0fbec72 100644 --- a/contributors/design-proposals/node/kubelet-auth.md +++ b/contributors/design-proposals/node/kubelet-auth.md @@ -1,103 +1,6 @@ -# Kubelet Authentication / Authorization +Design proposals have been archived. -Author: Jordan Liggitt (jliggitt@redhat.com) +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Overview - -The kubelet exposes endpoints which give access to data of varying sensitivity, -and allow performing operations of varying power on the node and within containers. -There is no built-in way to limit or subdivide access to those endpoints, -so deployers must secure the kubelet API using external, ad-hoc methods. - -This document proposes a method for authenticating and authorizing access -to the kubelet API, using interfaces and methods that complement the existing -authentication and authorization used by the API server. - -## Preliminaries - -This proposal assumes the existence of: - -* a functioning API server -* the SubjectAccessReview and TokenReview APIs - -It also assumes each node is additionally provisioned with the following information: - -1. Location of the API server -2. Any CA certificates necessary to trust the API server's TLS certificate -3. Client credentials authorized to make SubjectAccessReview and TokenReview API calls - -## API Changes - -None - -## Kubelet Authentication - -Enable starting the kubelet with one or more of the following authentication methods: - -* x509 client certificate -* bearer token -* anonymous (current default) - -For backwards compatibility, the default is to enable anonymous authentication. - -### x509 client certificate - -Add a new `--client-ca-file=[file]` option to the kubelet. -When started with this option, the kubelet authenticates incoming requests using x509 -client certificates, validated against the root certificates in the provided bundle. -The kubelet will reuse the x509 authenticator already used by the API server. - -The master API server can already be started with `--kubelet-client-certificate` and -`--kubelet-client-key` options in order to make authenticated requests to the kubelet. - -### Bearer token - -Add a new `--authentication-token-webhook=[true|false]` option to the kubelet. -When true, the kubelet authenticates incoming requests with bearer tokens by making -`TokenReview` API calls to the API server. - -The kubelet will reuse the webhook authenticator already used by the API server, configured -to call the API server using the connection information already provided to the kubelet. - -To improve performance of repeated requests with the same bearer token, the -`--authentication-token-webhook-cache-ttl` option supported by the API server -would be supported. - -### Anonymous - -Add a new `--anonymous-auth=[true|false]` option to the kubelet. -When true, requests to the secure port that are not rejected by other configured -authentication methods are treated as anonymous requests, and given a username -of `system:anonymous` and a group of `system:unauthenticated`. - -## Kubelet Authorization - -Add a new `--authorization-mode` option to the kubelet, specifying one of the following modes: -* `Webhook` -* `AlwaysAllow` (current default) - -For backwards compatibility, the authorization mode defaults to `AlwaysAllow`. 
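On the authentication side, the bearer-token path described above ultimately reduces to a `TokenReview` call against the API server. As a rough sketch only, written against present-day client-go names (an assumption here) and with an illustrative kubeconfig path, such a check could look like:

```go
package main

import (
	"context"
	"fmt"
	"strings"

	authnv1 "k8s.io/api/authentication/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// authenticateToken posts a TokenReview and reports the resolved username,
// which is conceptually what the webhook token authenticator does for the
// kubelet. Result caching (the cache TTL option above) is omitted.
func authenticateToken(ctx context.Context, client kubernetes.Interface, token string) (string, bool, error) {
	review := &authnv1.TokenReview{Spec: authnv1.TokenReviewSpec{Token: token}}
	result, err := client.AuthenticationV1().TokenReviews().Create(ctx, review, metav1.CreateOptions{})
	if err != nil {
		return "", false, err
	}
	if !result.Status.Authenticated {
		return "", false, nil
	}
	return result.Status.User.Username, true, nil
}

func main() {
	// The kubeconfig path is illustrative; the kubelet is assumed to be
	// provisioned with credentials allowed to create TokenReviews.
	config, err := clientcmd.BuildConfigFromFlags("", "/var/lib/kubelet/kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	token := strings.TrimPrefix("Bearer example-token", "Bearer ")
	user, ok, err := authenticateToken(context.Background(), client, token)
	fmt.Println(user, ok, err)
}
```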
- -### Webhook - -Webhook mode converts the request to authorization attributes, and makes a `SubjectAccessReview` -API call to check if the authenticated subject is allowed to make a request with those attributes. -This enables authorization policy to be centrally managed by the authorizer configured for the API server. - -The kubelet will reuse the webhook authorizer already used by the API server, configured -to call the API server using the connection information already provided to the kubelet. - -To improve performance of repeated requests with the same authenticated subject and request attributes, -the same webhook authorizer caching options supported by the API server would be supported: - -* `--authorization-webhook-cache-authorized-ttl` -* `--authorization-webhook-cache-unauthorized-ttl` - -### AlwaysAllow - -This mode allows any authenticated request. - -## Future Work - -* Add support for CRL revocation for x509 client certificate authentication (http://issue.k8s.io/18982) +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/node/kubelet-authorizer.md b/contributors/design-proposals/node/kubelet-authorizer.md index 0352ea94..f0fbec72 100644 --- a/contributors/design-proposals/node/kubelet-authorizer.md +++ b/contributors/design-proposals/node/kubelet-authorizer.md @@ -1,184 +1,6 @@ -# Scoped Kubelet API Access +Design proposals have been archived. -Author: Jordan Liggitt (jliggitt@redhat.com) +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Overview -Kubelets are primarily responsible for: -* creating and updating status of their Node API object -* running and updating status of Pod API objects bound to their node -* creating/deleting "mirror pod" API objects for statically-defined pods running on their node - -To run a pod, a kubelet must have read access to the following objects referenced by the pod spec: -* Secrets -* ConfigMaps -* PersistentVolumeClaims (and any bound PersistentVolume or referenced StorageClass object) - -As of 1.6, kubelets have read/write access to all Node and Pod objects, and -read access to all Secret, ConfigMap, PersistentVolumeClaim, and PersistentVolume objects. -This means that compromising a node gives access to credentials that allow modifying other nodes, -pods belonging to other nodes, and accessing confidential data unrelated to the node's pods. - -This document proposes limiting a kubelet's API access using a new node authorizer, admission plugin, and additional API validation: -* Node authorizer - * Authorizes requests from identifiable nodes using a fixed policy identical to the default RBAC `system:node` cluster role - * Further restricts secret, configmap, persistentvolumeclaim and persistentvolume access to only allow reading objects referenced by pods bound to the node making the request -* Node admission - * Limit identifiable nodes to only be able to mutate their own Node API object - * Limit identifiable nodes to only be able to create mirror pods bound to themselves - * Limit identifiable nodes to only be able to mutate mirror pods bound to themselves - * Limit identifiable nodes to not be able to create mirror pods that reference API objects (secrets, configmaps, service accounts, persistent volume claims) -* Additional API validation - * Reject mirror pods that are not bound to a node - * Reject pod updates that remove mirror pod annotations - -## Alternatives considered - -**Can this just be enforced by authorization?** - -Authorization does not have access to request bodies (or the existing object, for update requests), -so it could not restrict access based on fields in the incoming or existing object. - -**Can this just be enforced by admission?** - -Admission is only called for mutating requests, so it could not restrict read access. - -**Can an existing authorizer be used?** - -Only one authorizer (RBAC) has in-tree support for dynamically programmable policy. - -Manifesting RBAC policy rules to give each node access to individual objects within namespaces -would require large numbers of frequently-modified roles and rolebindings, resulting in -significant write-multiplication. - -Additionally, not all clusters will use RBAC, but all useful clusters will have nodes. -A node-specific authorizer allows cluster admins to continue to use their authorization mode of choice. 
- -## Node identification - -The first step is to identify whether a particular API request is being made by -a node, and if so, from which node. - -The proposed node authorizer and admission plugin will take a `NodeIdentifier` interface: - -```go -type NodeIdentifier interface { - // IdentifyNode determines node information from the given user.Info. - // nodeName is the name of the Node API object associated with the user.Info, - // and may be empty if a specific node cannot be determined. - // isNode is true if the user.Info represents an identity issued to a node. - IdentifyNode(user.Info) (nodeName string, isNode bool) -} -``` - -The default `NodeIdentifier` implementation: -* `isNode` - true if the user groups contain the `system:nodes` group and the user name is in the format `system:node:<nodeName>` -* `nodeName` - set if `isNode` is true, by extracting the `<nodeName>` portion of the `system:node:<nodeName>` username - -This group and user name format match the identity created for each kubelet as part of [kubelet TLS bootstrapping](https://kubernetes.io/docs/admin/kubelet-tls-bootstrapping/). - -## Node authorizer - -A new node authorization mode (`Node`) will be made available for use in combination -with other authorization modes (for example `--authorization-mode=Node,RBAC`). - -The node authorizer does the following: -1. If a request is not from a node (`IdentifyNode()` returns isNode=false), reject -2. If a specific node cannot be identified (`IdentifyNode()` returns nodeName=""), reject -3. If a request is for a secret, configmap, persistent volume or persistent volume claim, reject unless the verb is `get`, and the requested object is related to the requesting node: - - * node <-pod - * node <-pod-> secret - * node <-pod-> configmap - * node <-pod-> pvc - * node <-pod-> pvc <-pv - * node <-pod-> pvc <-pv-> secret -4. For other resources, allow if allowed by the rules in the default `system:node` cluster role - -Subsequent authorizers in the chain can run and choose to allow requests rejected by the node authorizer. - -## Node admission - -A new node admission plugin (`--admission-control=...,NodeRestriction,...`) is made available that does the following: - -1. If a request is not from a node (`IdentifyNode()` returns isNode=false), allow the request -2. If a specific node cannot be identified (`IdentifyNode()` returns nodeName=""), reject the request -3. 
For requests made by identifiable nodes: - * Limits `create` of node resources: - * only allow the node object corresponding to the node making the API request - * Limits `create` of pod resources: - * only allow pods with mirror pod annotations - * only allow pods with nodeName set to the node making the API request - * do not allow pods that reference any API objects (secrets, serviceaccounts, configmaps, or persistentvolumeclaims) - * Limits `update` of node and nodes/status resources: - * only allow updating the node object corresponding to the node making the API request - * Limits `update` of pods/status resources: - * only allow reporting status for pods with nodeName set to the node making the API request - * Limits `delete` of node resources: - * only allow deleting the node object corresponding to the node making the API request - * Limits `delete` of pod resources: - * only allow deleting pods with nodeName set to the node making the API request - -## API Changes - -Change Pod validation for mirror pods: - * Reject `create` of pod resources with mirror pod annotations that do not specify a nodeName - * Reject `update` of pod resources with mirror pod annotations that modify or remove the mirror pod annotation - -## RBAC Changes - -In 1.6, the `system:node` cluster role is automatically bound to the `system:nodes` group when using RBAC. -Because the node authorizer accomplishes the same purpose, with the benefit of additional restrictions -on secret and configmap access, the automatic binding of the `system:nodes` group to the `system:node` role will be deprecated in 1.7. - -In 1.7, the binding will not be created if the `Node` authorization mode is used. - -In 1.8, the binding will not be created at all. - -The `system:node` cluster role will continue to be created when using RBAC, -for compatibility with deployment methods that bind other users or groups to that role. - -## Migration considerations - -### Kubelets outside the `system:nodes` group - -Kubelets outside the `system:nodes` group would not be authorized by the `Node` authorization mode, -and would need to continue to be authorized via whatever mechanism currently authorizes them. -The node admission plugin would not restrict requests from these kubelets. - -### Kubelets with undifferentiated usernames - -In some deployments, kubelets have credentials that place them in the `system:nodes` group, -but do not identify the particular node they are associated with. - -These kubelets would not be authorized by the `Node` authorization mode, -and would need to continue to be authorized via whatever mechanism currently authorizes them. - -The `NodeRestriction` admission plugin would ignore requests from these kubelets, -since the default node identifier implementation would not consider that a node identity. - -### Upgrades from previous versions - -Upgraded 1.6 clusters using RBAC will continue functioning as-is because the `system:nodes` group binding will already exist. - -If a cluster admin wishes to start using the `Node` authorizer and `NodeRestriction` admission plugin -to limit node access to the API, they can do that non-disruptively: -1. Enable the `Node` authorization mode (`--authorization-mode=Node,RBAC`) and the `NodeRestriction` admission plugin -2. Ensure all their kubelets' credentials conform to the group/username requirements -3. Audit their apiserver logs to ensure the `Node` authorizer is not rejecting requests from kubelets (no `NODE DENY` messages logged) -4. 
Delete the `system:node` cluster role binding
-
-## Future work
-
-Node and pod mutation, and secret and configmap read access are the most critical permissions to restrict.
-Future work could further limit a kubelet's API access:
-* only write events with the kubelet set as the event source
-* only get endpoints objects referenced by pods bound to the kubelet's node (currently only needed for glusterfs volumes)
-* only get/list/watch pods bound to the kubelet's node (requires additional list/watch authorization capabilities)
-* only get/list/watch its own node object (requires additional list/watch authorization capabilities)
-
-Features that expand or modify the APIs or objects accessed by the kubelet will need to involve the node authorizer.
-Known features in the design or development stages that might modify kubelet API access are:
-* [Dynamic kubelet configuration](https://github.com/kubernetes/features/issues/281)
-* [Local storage management](/contributors/design-proposals/storage/local-storage-overview.md)
-* [Bulk watch of secrets/configmaps](https://github.com/kubernetes/community/pull/443)
+Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
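As a sketch of the identification scheme described above, the following Go snippet shows what a default-style `NodeIdentifier` could look like. The `UserInfo` interface is a simplified stand-in for the API server's `user.Info`, and the code is illustrative rather than the actual in-tree implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// UserInfo is a simplified stand-in for the apiserver's user.Info interface.
type UserInfo interface {
	GetName() string
	GetGroups() []string
}

const (
	nodesGroup         = "system:nodes"
	nodeUserNamePrefix = "system:node:"
)

// defaultNodeIdentifier mirrors the behavior described above: a request is
// treated as coming from a node only if the identity carries the
// system:nodes group and a system:node:<nodeName> user name.
type defaultNodeIdentifier struct{}

func (defaultNodeIdentifier) IdentifyNode(u UserInfo) (nodeName string, isNode bool) {
	if u == nil {
		return "", false
	}
	inNodesGroup := false
	for _, g := range u.GetGroups() {
		if g == nodesGroup {
			inNodesGroup = true
			break
		}
	}
	if !inNodesGroup || !strings.HasPrefix(u.GetName(), nodeUserNamePrefix) {
		return "", false
	}
	// nodeName may still be empty if the credential is undifferentiated,
	// e.g. a bare "system:node:" user name.
	return strings.TrimPrefix(u.GetName(), nodeUserNamePrefix), true
}

// simpleUser is a test double used only for the demonstration below.
type simpleUser struct {
	name   string
	groups []string
}

func (s simpleUser) GetName() string     { return s.name }
func (s simpleUser) GetGroups() []string { return s.groups }

func main() {
	id := defaultNodeIdentifier{}
	name, isNode := id.IdentifyNode(simpleUser{
		name:   "system:node:node-1",
		groups: []string{"system:nodes", "system:authenticated"},
	})
	fmt.Println(name, isNode) // prints: node-1 true
}
```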
\ No newline at end of file diff --git a/contributors/design-proposals/node/kubelet-cri-logging.md b/contributors/design-proposals/node/kubelet-cri-logging.md index 3ece02a3..f0fbec72 100644 --- a/contributors/design-proposals/node/kubelet-cri-logging.md +++ b/contributors/design-proposals/node/kubelet-cri-logging.md @@ -1,246 +1,6 @@ -# CRI: Log management for container stdout/stderr streams +Design proposals have been archived. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Goals and non-goals - -Container Runtime Interface (CRI) is an ongoing project to allow container -runtimes to integrate with kubernetes via a newly-defined API. The goal of this -proposal is to define how container's *stdout/stderr* log streams should be -handled in CRI. - -The explicit non-goal is to define how (non-stdout/stderr) application logs -should be handled. Collecting and managing arbitrary application logs is a -long-standing issue [1] in kubernetes and is worth a proposal of its own. Even -though this proposal does not touch upon these logs, the direction of -this proposal is aligned with one of the most-discussed solutions, logging -volumes [1], for general logging management. - -*In this proposal, “logs” refer to the stdout/stderr streams of the -containers, unless specified otherwise.* - -Previous CRI logging issues: - - Tracking issue: https://github.com/kubernetes/kubernetes/issues/30709 - - Proposal (by @tmrtfs): https://github.com/kubernetes/kubernetes/pull/33111 - -The scope of this proposal is narrower than the #33111 proposal, and hopefully -this will encourage a more focused discussion. - - -## Background - -Below is a brief overview of logging in kubernetes with docker, which is the -only container runtime with fully functional integration today. - -**Log lifecycle and management** - -Docker supports various logging drivers (e.g., syslog, journal, and json-file), -and allows users to configure the driver by passing flags to the docker daemon -at startup. Kubernetes defaults to the "json-file" logging driver, in which -docker writes the stdout/stderr streams to a file in the json format as shown -below. - -``` -{“log”: “The actual log line”, “stream”: “stderr”, “time”: “2016-10-05T00:00:30.082640485Z”} -``` - -Docker deletes the log files when the container is removed, and a cron-job (or -systemd timer-based job) on the node is responsible to rotate the logs (using -`logrotate`). To preserve the logs for introspection and debuggability, kubelet -keeps the terminated container until the pod object has been deleted from the -apiserver. - -**Container log retrieval** - -The kubernetes CLI tool, kubectl, allows users to access the container logs -using [`kubectl logs`] -(http://kubernetes.io/docs/user-guide/kubectl/kubectl_logs/) command. -`kubectl logs` supports flags such as `--since` that requires understanding of -the format and the metadata (i.e., timestamps) of the logs. In the current -implementation, kubelet calls `docker logs` with parameters to return the log -content. As of now, docker only supports `log` operations for the “journal” and -“json-file” drivers [2]. In other words, *the support of `kubectl logs` is not -universal in all kubernetes deployments*. - -**Cluster logging support** - -In a production cluster, logs are usually collected, aggregated, and shipped to -a remote store where advanced analysis/search/archiving functions are -supported. 
In kubernetes, the default cluster-addons includes a per-node log -collection daemon, `fluentd`. To facilitate the log collection, kubelet creates -symbolic links to all the docker containers logs under `/var/log/containers` -with pod and container metadata embedded in the filename. - -``` -/var/log/containers/<pod_name>_<pod_namespace>_<container_name>-<container_id>.log` -``` - -The fluentd daemon watches the `/var/log/containers/` directory and extract the -metadata associated with the log from the path. Note that this integration -requires kubelet to know where the container runtime stores the logs, and will -not be directly applicable to CRI. - - -## Requirements - - 1. **Provide ways for CRI-compliant runtimes to support all existing logging - features, i.e., `kubectl logs`.** - - 2. **Allow kubelet to manage the lifecycle of the logs to pave the way for - better disk management in the future.** This implies that the lifecycle - of containers and their logs need to be decoupled. - - 3. **Allow log collectors to easily integrate with Kubernetes across - different container runtimes while preserving efficient storage and - retrieval.** - -Requirement (1) provides opportunities for runtimes to continue support -`kubectl logs --since` and related features. Note that even though such -features are only supported today for a limited set of log drivers, this is an -important usability tool for a fresh, basic kubernetes cluster, and should not -be overlooked. Requirement (2) stems from the fact that disk is managed by -kubelet as a node-level resource (not per-pod) today, hence it is difficult to -delegate to the runtime by enforcing per-pod disk quota policy. In addition, -container disk quota is not well supported yet, and such limitation may not -even be well-perceived by users. Requirement (1) is crucial to the kubernetes' -extensibility and usability across all deployments. - -## Proposed solution - -This proposal intends to satisfy the requirements by - - 1. Enforce where the container logs should be stored on the host - filesystem. Both kubelet and the log collector can interact with - the log files directly. - - 2. Ask the runtime to decorate the logs in a format that kubelet understands. - -**Log directories and structures** - -Kubelet will be configured with a root directory (e.g., `/var/log/pods` or -`/var/lib/kubelet/logs/) to store all container logs. Below is an example of a -path to the log of a container in a pod. - -``` -/var/log/pods/<podUID>/<containerName>_<instance#>.log -``` - -In CRI, this is implemented by setting the pod-level log directory when -creating the pod sandbox, and passing the relative container log path -when creating a container. - -``` -PodSandboxConfig.LogDirectory: /var/log/pods/<podUID>/ -ContainerConfig.LogPath: <containerName>_<instance#>.log -``` - -Because kubelet determines where the logs are stored and can access them -directly, this meets requirement (1). As for requirement (2), the log collector -can easily extract basic pod metadata (e.g., pod UID, container name) from -the paths, and watch the directly for any changes. In the future, we can -extend this by maintaining a metadata file in the pod directory. - -**Log format** - -The runtime should decorate each log entry with a RFC 3339Nano timestamp -prefix, the stream type (i.e., "stdout" or "stderr"), the tags of the log -entry, the log content that ends with a newline. - -The `tags` fields can support multiple tags, delimited by `:`. 
Currently, only -one tag is defined in CRI to support multi-line log entries: partial or full. -Partial (`P`) is used when a log entry is split into multiple lines by the -runtime, and the entry has not ended yet. Full (`F`) indicates that the log -entry is completed -- it is either a single-line entry, or this is the last -line of the multiple-line entry. - -For example, -``` -2016-10-06T00:17:09.669794202Z stdout F The content of the log entry 1 -2016-10-06T00:17:09.669794202Z stdout P First line of log entry 2 -2016-10-06T00:17:09.669794202Z stdout P Second line of the log entry 2 -2016-10-06T00:17:10.113242941Z stderr F Last line of the log entry 2 -``` - -With the knowledge, kubelet can parse the logs and serve them for `kubectl -logs` requests. This meets requirement (3). Note that the format is defined -deliberately simple to provide only information necessary to serve the requests. -We do not intend for kubelet to host various logging plugins. It is also worth -mentioning again that the scope of this proposal is restricted to stdout/stderr -streams of the container, and we impose no restriction to the logging format of -arbitrary container logs. - -**Who should rotate the logs?** - -We assume that a separate task (e.g., cron job) will be configured on the node -to rotate the logs periodically, similar to today's implementation. - -We do not rule out the possibility of letting kubelet or a per-node daemon -(`DaemonSet`) to take up the responsibility, or even declare rotation policy -in the kubernetes API as part of the `PodSpec`, but it is beyond the scope of -this proposal. - -**What about non-supported log formats?** - -If a runtime chooses to store logs in non-supported formats, it essentially -opts out of `kubectl logs` features, which is backed by kubelet today. It is -assumed that the user can rely on the advanced, cluster logging infrastructure -to examine the logs. - -It is also possible that in the future, `kubectl logs` can contact the cluster -logging infrastructure directly to serve logs [1a]. Note that this does not -eliminate the need to store the logs on the node locally for reliability. - - -**How can existing runtimes (docker/rkt) comply to the logging requirements?** - -In the short term, the ongoing docker-CRI integration [3] will support the -proposed solution only partially by (1) creating symbolic links for kubelet -to access, but not manage the logs, and (2) add support for json format in -kubelet. A more sophisticated solution that either involves using a custom -plugin or launching a separate process to copy and decorate the log will be -considered as a mid-term solution. - -For rkt, implementation will rely on providing external file-descriptors for -stdout/stderr to applications via systemd [4]. Those streams are currently -managed by a journald sidecar, which collects stream outputs and store them -in the journal file of the pod. This will replaced by a custom sidecar which -can produce logs in the format expected by this specification and can handle -clients attaching as well. - -## Alternatives - -There are ad-hoc solutions/discussions that addresses one or two of the -requirements, but no comprehensive solution for CRI specifically has been -proposed so far (with the exception of @tmrtfs's proposal -[#33111](https://github.com/kubernetes/kubernetes/pull/33111), which has a much -wider scope). It has come up in discussions that kubelet can delegate all the -logging management to the runtime to allow maximum flexibility. 
However, it is -difficult for this approach to meet either requirement (1) or (2), without -defining complex logging API. - -There are also possibilities to implement the current proposal by imposing the -log file paths, while leveraging the runtime to access and/or manage logs. This -requires the runtime to expose knobs in CRI to retrieve, remove, and examine -the disk usage of logs. The upside of this approach is that kubelet needs not -mandate the logging format, assuming runtime already includes plugins for -various logging formats. Unfortunately, this is not true for existing runtimes -such as docker, which supports log retrieval only for a very limited number of -log drivers [2]. On the other hand, the downside is that we would be enforcing -more requirements on the runtime through log storage location on the host, and -a potentially premature logging API that may change as the disk management -evolves. - -## References - -[1] Log management issues: - - a. https://github.com/kubernetes/kubernetes/issues/17183 - - b. https://github.com/kubernetes/kubernetes/issues/24677 - - c. https://github.com/kubernetes/kubernetes/pull/13010 - -[2] Docker logging drivers: - - https://docs.docker.com/engine/admin/logging/overview/ - -[3] Docker CRI integration: - - https://github.com/kubernetes/kubernetes/issues/31459 - -[4] rkt support: https://github.com/systemd/systemd/pull/4179 +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
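To make the log format described above concrete, here is a small Go sketch that parses one decorated line into its timestamp, stream, tag, and content fields. The function and type names are illustrative only; a production parser would also handle edge cases such as reassembling partial (`P`) lines.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// logEntry holds the four fields of a decorated log line: an RFC 3339Nano
// timestamp, the stream ("stdout"/"stderr"), the tags (e.g. "F" for full,
// "P" for partial), and the log content.
type logEntry struct {
	Timestamp time.Time
	Stream    string
	Tags      string
	Content   string
}

func parseLine(line string) (logEntry, error) {
	// Split into at most 4 fields: timestamp, stream, tags, content.
	parts := strings.SplitN(strings.TrimRight(line, "\n"), " ", 4)
	if len(parts) != 4 {
		return logEntry{}, fmt.Errorf("malformed log line: %q", line)
	}
	ts, err := time.Parse(time.RFC3339Nano, parts[0])
	if err != nil {
		return logEntry{}, fmt.Errorf("bad timestamp: %v", err)
	}
	return logEntry{Timestamp: ts, Stream: parts[1], Tags: parts[2], Content: parts[3]}, nil
}

func main() {
	entry, err := parseLine("2016-10-06T00:17:09.669794202Z stdout F The content of the log entry 1\n")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s [%s/%s] %s\n",
		entry.Timestamp.Format(time.RFC3339Nano), entry.Stream, entry.Tags, entry.Content)
}
```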
\ No newline at end of file diff --git a/contributors/design-proposals/node/kubelet-eviction.md b/contributors/design-proposals/node/kubelet-eviction.md index 51fa9203..f0fbec72 100644 --- a/contributors/design-proposals/node/kubelet-eviction.md +++ b/contributors/design-proposals/node/kubelet-eviction.md @@ -1,462 +1,6 @@ -# Kubelet - Eviction Policy +Design proposals have been archived. -**Authors**: Derek Carr (@derekwaynecarr), Vishnu Kannan (@vishh) +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -**Status**: Proposed (memory evictions WIP) -This document presents a specification for how the `kubelet` evicts pods when compute resources are too low. - -## Goals - -The node needs a mechanism to preserve stability when available compute resources are low. - -This is especially important when dealing with incompressible compute resources such -as memory or disk. If either resource is exhausted, the node would become unstable. - -The `kubelet` has some support for influencing system behavior in response to a system OOM by -having the system OOM killer see higher OOM score adjust scores for containers that have consumed -the largest amount of memory relative to their request. System OOM events are very compute -intensive, and can stall the node until the OOM killing process has completed. In addition, -the system is prone to return to an unstable state since the containers that are killed due to OOM -are either restarted or a new pod is scheduled on to the node. - -Instead, we would prefer a system where the `kubelet` can pro-actively monitor for -and prevent against total starvation of a compute resource, and in cases of where it -could appear to occur, pro-actively fail one or more pods, so the workload can get -moved and scheduled elsewhere when/if its backing controller creates a new pod. - -## Scope of proposal - -This proposal defines a pod eviction policy for reclaiming compute resources. - -As of now, memory and disk based evictions are supported. -The proposal focuses on a simple default eviction strategy -intended to cover the broadest class of user workloads. - -## Eviction Signals - -The `kubelet` will support the ability to trigger eviction decisions on the following signals. - -| Eviction Signal | Description | -|------------------|---------------------------------------------------------------------------------| -| memory.available | memory.available := node.status.capacity[memory] - node.stats.memory.workingSet | -| nodefs.available | nodefs.available := node.stats.fs.available | -| nodefs.inodesFree | nodefs.inodesFree := node.stats.fs.inodesFree | -| imagefs.available | imagefs.available := node.stats.runtime.imagefs.available | -| imagefs.inodesFree | imagefs.inodesFree := node.stats.runtime.imagefs.inodesFree | - -Each of the above signals support either a literal or percentage based value. The percentage based value -is calculated relative to the total capacity associated with each signal. - -`kubelet` supports only two filesystem partitions. - -1. The `nodefs` filesystem that kubelet uses for volumes, daemon logs, etc. -1. The `imagefs` filesystem that container runtimes uses for storing images and container writable layers. - -`imagefs` is optional. `kubelet` auto-discovers these filesystems using cAdvisor. -`kubelet` does not care about any other filesystems. Any other types of configurations are not currently supported by the kubelet. 
For example, it is *not OK* to store volumes and logs in a dedicated `imagefs`. - -## Eviction Thresholds - -The `kubelet` will support the ability to specify eviction thresholds. - -An eviction threshold is of the following form: - -`<eviction-signal><operator><quantity | int%>` - -* valid `eviction-signal` tokens as defined above. -* valid `operator` tokens are `<` -* valid `quantity` tokens must match the quantity representation used by Kubernetes -* an eviction threshold can be expressed as a percentage if ends with `%` token. - -If threshold criteria are met, the `kubelet` will take pro-active action to attempt -to reclaim the starved compute resource associated with the eviction signal. - -The `kubelet` will support soft and hard eviction thresholds. - -For example, if a node has `10Gi` of memory, and the desire is to induce eviction -if available memory falls below `1Gi`, an eviction signal can be specified as either -of the following (but not both). - -* `memory.available<10%` -* `memory.available<1Gi` - -### Soft Eviction Thresholds - -A soft eviction threshold pairs an eviction threshold with a required -administrator specified grace period. No action is taken by the `kubelet` -to reclaim resources associated with the eviction signal until that grace -period has been exceeded. If no grace period is provided, the `kubelet` will -error on startup. - -In addition, if a soft eviction threshold has been met, an operator can -specify a maximum allowed pod termination grace period to use when evicting -pods from the node. If specified, the `kubelet` will use the lesser value among -the `pod.Spec.TerminationGracePeriodSeconds` and the max allowed grace period. -If not specified, the `kubelet` will kill pods immediately with no graceful -termination. - -To configure soft eviction thresholds, the following flags will be supported: - -``` ---eviction-soft="": A set of eviction thresholds (e.g. memory.available<1.5Gi) that if met over a corresponding grace period would trigger a pod eviction. ---eviction-soft-grace-period="": A set of eviction grace periods (e.g. memory.available=1m30s) that correspond to how long a soft eviction threshold must hold before triggering a pod eviction. ---eviction-max-pod-grace-period="0": Maximum allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met. -``` - -### Hard Eviction Thresholds - -A hard eviction threshold has no grace period, and if observed, the `kubelet` -will take immediate action to reclaim the associated starved resource. If a -hard eviction threshold is met, the `kubelet` will kill the pod immediately -with no graceful termination. - -To configure hard eviction thresholds, the following flag will be supported: - -``` ---eviction-hard="": A set of eviction thresholds (e.g. memory.available<1Gi) that if met would trigger a pod eviction. -``` - -## Eviction Monitoring Interval - -The `kubelet` will initially evaluate eviction thresholds at the same -housekeeping interval as `cAdvisor` housekeeping. - -In Kubernetes 1.2, this was defaulted to `10s`. - -It is a goal to shrink the monitoring interval to a much shorter window. -This may require changes to `cAdvisor` to let alternate housekeeping intervals -be specified for selected data (https://github.com/google/cadvisor/issues/1247) - -For the purposes of this proposal, we expect the monitoring interval to be no -more than `10s` to know when a threshold has been triggered, but we will strive -to reduce that latency time permitting. 
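As an illustration of the threshold grammar above, the following Go sketch parses expressions such as `memory.available<1Gi` or `memory.available<10%`. Quantity handling is deliberately simplified (the real kubelet relies on the Kubernetes `resource.Quantity` parser), so treat this as a sketch of the grammar rather than the actual configuration code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// threshold is a simplified view of a parsed eviction threshold:
// a signal, the "<" operator, and either a literal quantity or a percentage.
type threshold struct {
	Signal     string
	Percentage float64 // set when the value ends in "%"
	Quantity   string  // raw quantity string otherwise (e.g. "1Gi")
}

func parseThreshold(expr string) (threshold, error) {
	idx := strings.Index(expr, "<")
	if idx <= 0 || idx == len(expr)-1 {
		return threshold{}, fmt.Errorf("malformed eviction threshold %q", expr)
	}
	t := threshold{Signal: expr[:idx]}
	value := expr[idx+1:]
	if strings.HasSuffix(value, "%") {
		pct, err := strconv.ParseFloat(strings.TrimSuffix(value, "%"), 64)
		if err != nil {
			return threshold{}, fmt.Errorf("bad percentage in %q: %v", expr, err)
		}
		t.Percentage = pct / 100
		return t, nil
	}
	t.Quantity = value
	return t, nil
}

func main() {
	for _, expr := range []string{"memory.available<1Gi", "memory.available<10%"} {
		t, err := parseThreshold(expr)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%+v\n", t)
	}
}
```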
- -## Node Conditions - -The `kubelet` will support a node condition that corresponds to each eviction signal. - -If a hard eviction threshold has been met, or a soft eviction threshold has been met -independent of its associated grace period, the `kubelet` will report a condition that -reflects the node is under pressure. - -The following node conditions are defined that correspond to the specified eviction signal. - -| Node Condition | Eviction Signal | Description | -|----------------|------------------|------------------------------------------------------------------| -| MemoryPressure | memory.available | Available memory on the node has satisfied an eviction threshold | -| DiskPressure | nodefs.available, nodefs.inodesFree, imagefs.available, or imagefs.inodesFree | Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold | - -The `kubelet` will continue to report node status updates at the frequency specified by -`--node-status-update-frequency` which defaults to `10s`. - -### Oscillation of node conditions - -If a node is oscillating above and below a soft eviction threshold, but not exceeding -its associated grace period, it would cause the corresponding node condition to -constantly oscillate between true and false, and could cause poor scheduling decisions -as a consequence. - -To protect against this oscillation, the following flag is defined to control how -long the `kubelet` must wait before transitioning out of a pressure condition. - -``` ---eviction-pressure-transition-period=5m0s: Duration for which the kubelet has to wait -before transitioning out of an eviction pressure condition. -``` - -The `kubelet` would ensure that it has not observed an eviction threshold being met -for the specified pressure condition for the period specified before toggling the -condition back to `false`. - -## Eviction scenarios - -### Memory - -Let's assume the operator started the `kubelet` with the following: - -``` ---eviction-hard="memory.available<100Mi" ---eviction-soft="memory.available<300Mi" ---eviction-soft-grace-period="memory.available=30s" -``` - -The `kubelet` will run a sync loop that looks at the available memory -on the node as reported from `cAdvisor` by calculating (capacity - workingSet). -If available memory is observed to drop below 100Mi, the `kubelet` will immediately -initiate eviction. If available memory is observed as falling below `300Mi`, -it will record when that signal was observed internally in a cache. If at the next -sync, that criteria was no longer satisfied, the cache is cleared for that -signal. If that signal is observed as being satisfied for longer than the -specified period, the `kubelet` will initiate eviction to attempt to -reclaim the resource that has met its eviction threshold. - -### Memory CGroup Notifications - -When the `kubelet` is started with `--experimental-kernel-memcg-notification=true`, -it will use cgroup events on the memory.usage_in_bytes file in order to trigger the eviction manager. -With the addition of on-demand metrics, this permits the `kubelet` to trigger the eviction manager, -collect metrics, and respond with evictions much quicker than using the sync loop alone. - -To do this, we periodically adjust the memory cgroup threshold based on total_inactive_file. The eviction manager -periodically measures total_inactive_file, and sets the threshold for usage_in_bytes to mem_capacity - eviction_hard + -total_inactive_file. 
This means that the threshold is crossed when usage_in_bytes - total_inactive_file -= mem_capacity - eviction_hard. - -### Disk - -Let's assume the operator started the `kubelet` with the following: - -``` ---eviction-hard="nodefs.available<1Gi,nodefs.inodesFree<1,imagefs.available<10Gi,imagefs.inodesFree<10" ---eviction-soft="nodefs.available<1.5Gi,nodefs.inodesFree<10,imagefs.available<20Gi,imagefs.inodesFree<100" ---eviction-soft-grace-period="nodefs.available=1m,imagefs.available=2m" -``` - -The `kubelet` will run a sync loop that looks at the available disk -on the node's supported partitions as reported from `cAdvisor`. -If available disk space on the node's primary filesystem is observed to drop below 1Gi -or the free inodes on the node's primary filesystem is less than 1, -the `kubelet` will immediately initiate eviction. -If available disk space on the node's image filesystem is observed to drop below 10Gi -or the free inodes on the node's primary image filesystem is less than 10, -the `kubelet` will immediately initiate eviction. - -If available disk space on the node's primary filesystem is observed as falling below `1.5Gi`, -or if the free inodes on the node's primary filesystem is less than 10, -or if available disk space on the node's image filesystem is observed as falling below `20Gi`, -or if the free inodes on the node's image filesystem is less than 100, -it will record when that signal was observed internally in a cache. If at the next -sync, that criterion was no longer satisfied, the cache is cleared for that -signal. If that signal is observed as being satisfied for longer than the -specified period, the `kubelet` will initiate eviction to attempt to -reclaim the resource that has met its eviction threshold. - -## Eviction of Pods - -If an eviction threshold has been met, the `kubelet` will initiate the -process of evicting pods until it has observed the signal has gone below -its defined threshold. - -The eviction sequence works as follows: - -* for each monitoring interval, if eviction thresholds have been met - * find candidate pod - * fail the pod - * block until pod is terminated on node - -If a pod is not terminated because a container does not happen to die -(i.e. processes stuck in disk IO for example), the `kubelet` may select -an additional pod to fail instead. The `kubelet` will invoke the `KillPod` -operation exposed on the runtime interface. If an error is returned, -the `kubelet` will select a subsequent pod. - -## Eviction Strategy - -The `kubelet` will implement an eviction strategy oriented around -[Priority](/contributors/design-proposals/scheduling/pod-priority-api.md) -and pod usage relative to requests. It will target pods that are the lowest -Priority, and are the largest consumers of the starved resource relative to -their scheduling request. - -It will target pods whose usage of the starved resource exceeds its requests. -Of those pods, it will rank by priority, then usage - requests. If system -daemons are exceeding their allocation (see [Strategy Caveat](strategy-caveat) below), -and all pods are using less than their requests, then it must evict a pod -whose usage is less than requests, based on priority, then usage - requests. - -Prior to v1.9: -The `kubelet` will implement a default eviction strategy oriented around -the pod quality of service class. - -It will target pods that are the largest consumers of the starved compute -resource relative to their scheduling request. 
It ranks pods within a -quality of service tier in the following order. - -* `BestEffort` pods that consume the most of the starved resource are failed -first. -* `Burstable` pods that consume the greatest amount of the starved resource -relative to their request for that resource are killed first. If no pod -has exceeded its request, the strategy targets the largest consumer of the -starved resource. -* `Guaranteed` pods that consume the greatest amount of the starved resource -relative to their request are killed first. If no pod has exceeded its request, -the strategy targets the largest consumer of the starved resource. - -### Strategy Caveat - -A pod consuming less resources than its requests is guaranteed to never be -evicted because of another pod's resource consumption. That said, guarantees -are only as good as the underlying foundation they are built upon. If a system daemon -(i.e. `kubelet`, `docker`, `journald`, etc.) is consuming more resources than -were reserved via `system-reserved` or `kube-reserved` allocations, then the node -must choose to evict a pod, even if it is consuming less than its requests. -It must take action in order to preserve node stability, and to limit the impact -of the unexpected consumption to other well-behaved pod(s). - -## Disk based evictions - -### With Imagefs - -If `nodefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order: - -1. Delete logs -1. Evict Pods if required. - -If `imagefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order: - -1. Delete unused images -1. Evict Pods if required. - -### Without Imagefs - -If `nodefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order: - -1. Delete logs -1. Delete unused images -1. Evict Pods if required. - -Let's explore the different options for freeing up disk space. - -### Delete logs of dead pods/containers - -As of today, logs are tied to a container's lifetime. `kubelet` keeps dead containers around, -to provide access to logs. -In the future, if we store logs of dead containers outside of the container itself, then -`kubelet` can delete these logs to free up disk space. -Once the lifetime of containers and logs are split, kubelet can support more user friendly policies -around log evictions. `kubelet` can delete logs of the oldest containers first. -Since logs from the first and the most recent incarnation of a container is the most important for most applications, -kubelet can try to preserve these logs and aggressively delete logs from other container incarnations. - -Until logs are split from container's lifetime, `kubelet` can delete dead containers to free up disk space. - -### Delete unused images - -`kubelet` performs image garbage collection based on thresholds today. It uses a high and a low watermark. -Whenever disk usage exceeds the high watermark, it removes images until the low watermark is reached. -`kubelet` employs a LRU policy when it comes to deleting images. - -The existing policy will be replaced with a much simpler policy. -Images will be deleted based on eviction thresholds. If kubelet can delete logs and keep disk space availability -above eviction thresholds, then kubelet will not delete any images. -If `kubelet` decides to delete unused images, it will delete *all* unused images. - -### Evict pods - -There is no ability to specify disk limits for pods/containers today. -Disk is a best effort resource. 
When necessary, `kubelet` can evict pods one at a time. -`kubelet` will follow the [Eviction Strategy](#eviction-strategy) mentioned above for making eviction decisions. -`kubelet` will evict the pod that will free up the maximum amount of disk space on the filesystem that has hit eviction thresholds. -Within each QoS bucket, `kubelet` will sort pods according to their disk usage. -`kubelet` will sort pods in each bucket as follows: - -#### Without Imagefs - -If `nodefs` is triggering evictions, `kubelet` will sort pods based on their total disk usage -- local volumes + logs & writable layer of all its containers. - -#### With Imagefs - -If `nodefs` is triggering evictions, `kubelet` will sort pods based on the usage on `nodefs` -- local volumes + logs of all its containers. - -If `imagefs` is triggering evictions, `kubelet` will sort pods based on the writable layer usage of all its containers. - -## Minimum eviction reclaim - -In certain scenarios, eviction of pods could result in reclamation of small amount of resources. This can result in -`kubelet` hitting eviction thresholds in repeated successions. In addition to that, eviction of resources like `disk`, - is time consuming. - -To mitigate these issues, `kubelet` will have a per-resource `minimum-reclaim`. Whenever `kubelet` observes -resource pressure, `kubelet` will attempt to reclaim at least `minimum-reclaim` amount of resource. - -Following are the flags through which `minimum-reclaim` can be configured for each evictable resource: - -`--eviction-minimum-reclaim="memory.available=0Mi,nodefs.available=500Mi,imagefs.available=2Gi"` - -The default `eviction-minimum-reclaim` is `0` for all resources. - -## Deprecation of existing features - -`kubelet` has been freeing up disk space on demand to keep the node stable. As part of this proposal, -some of the existing features/flags around disk space retrieval will be deprecated in-favor of this proposal. - -| Existing Flag | New Flag | Rationale | -| ------------- | -------- | --------- | -| `--image-gc-high-threshold` | `--eviction-hard` or `eviction-soft` | existing eviction signals can capture image garbage collection | -| `--image-gc-low-threshold` | `--eviction-minimum-reclaim` | eviction reclaims achieve the same behavior | -| `--maximum-dead-containers` | | deprecated once old logs are stored outside of container's context | -| `--maximum-dead-containers-per-container` | | deprecated once old logs are stored outside of container's context | -| `--minimum-container-ttl-duration` | | deprecated once old logs are stored outside of container's context | -| `--low-diskspace-threshold-mb` | `--eviction-hard` or `eviction-soft` | this use case is better handled by this proposal | -| `--outofdisk-transition-frequency` | `--eviction-pressure-transition-period` | make the flag generic to suit all compute resources | - -## Kubelet Admission Control - -### Feasibility checks during kubelet admission - -#### Memory - -The `kubelet` will reject `BestEffort` pods if any of the memory -eviction thresholds have been exceeded independent of the configured -grace period. 
- -Let's assume the operator started the `kubelet` with the following: - -``` ---eviction-soft="memory.available<256Mi" ---eviction-soft-grace-period="memory.available=30s" -``` - -If the `kubelet` sees that it has less than `256Mi` of memory available -on the node, but the `kubelet` has not yet initiated eviction since the -grace period criteria has not yet been met, the `kubelet` will still immediately -fail any incoming best effort pods. - -The reasoning for this decision is the expectation that the incoming pod is -likely to further starve the particular compute resource and the `kubelet` should -return to a steady state before accepting new workloads. - -#### Disk - -The `kubelet` will reject all pods if any of the disk eviction thresholds have been met. - -Let's assume the operator started the `kubelet` with the following: - -``` ---eviction-soft="nodefs.available<1500Mi" ---eviction-soft-grace-period="nodefs.available=30s" -``` - -If the `kubelet` sees that it has less than `1500Mi` of disk available -on the node, but the `kubelet` has not yet initiated eviction since the -grace period criteria has not yet been met, the `kubelet` will still immediately -fail any incoming pods. - -The rationale for failing **all** pods instead of just best effort is because disk is currently -a best effort resource for all QoS classes. - -Kubelet will apply the same policy even if there is a dedicated `image` filesystem. - -## Scheduler - -The node will report a condition when a compute resource is under pressure. The -scheduler should view that condition as a signal to dissuade placing additional -best effort pods on the node. - -In this case, the `MemoryPressure` condition if true should dissuade the scheduler -from placing new best effort pods on the node since they will be rejected by the `kubelet` in admission. - -On the other hand, the `DiskPressure` condition if true should dissuade the scheduler from -placing **any** new pods on the node since they will be rejected by the `kubelet` in admission. - -## Best Practices - -### DaemonSet - -As `Priority` is a key factor in the eviction strategy, if you do not want -pods belonging to a `DaemonSet` to be evicted, specify a sufficiently high priorityClass -in the pod spec template. If you want pods belonging to a `DaemonSet` to run only if -there are sufficient resources, specify a lower or default priorityClass.
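For illustration, here is a minimal Go sketch of the Priority-then-(usage minus requests) ordering described in the eviction strategy above. The pod fields and values are simplified placeholders, and the "usage exceeds requests" filtering step is omitted for brevity; this is not the kubelet's internal ranking code.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a simplified view of a pod being considered for eviction:
// its Priority and its usage/request of the starved resource (in bytes).
type candidate struct {
	Name     string
	Priority int32
	Usage    int64
	Request  int64
}

// rankForEviction orders candidates so the first entry is evicted first:
// lowest Priority first, and within equal Priority the pod exceeding its
// request by the largest amount first.
func rankForEviction(pods []candidate) {
	sort.SliceStable(pods, func(i, j int) bool {
		if pods[i].Priority != pods[j].Priority {
			return pods[i].Priority < pods[j].Priority
		}
		return pods[i].Usage-pods[i].Request > pods[j].Usage-pods[j].Request
	})
}

func main() {
	pods := []candidate{
		{Name: "batch", Priority: 0, Usage: 900, Request: 100},
		{Name: "web", Priority: 1000, Usage: 800, Request: 500},
		{Name: "logger", Priority: 0, Usage: 200, Request: 400},
	}
	rankForEviction(pods)
	for _, p := range pods {
		fmt.Println(p.Name)
	}
	// Expected order: batch, logger, web. "batch" is low priority and most
	// over its request, so it is the first eviction candidate.
}
```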
\ No newline at end of file +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/node/kubelet-hypercontainer-runtime.md b/contributors/design-proposals/node/kubelet-hypercontainer-runtime.md index 8aba0b1a..f0fbec72 100644 --- a/contributors/design-proposals/node/kubelet-hypercontainer-runtime.md +++ b/contributors/design-proposals/node/kubelet-hypercontainer-runtime.md @@ -1,40 +1,6 @@ -Kubelet HyperContainer Container Runtime
-=======================================
+Design proposals have been archived.
-Authors: Pengfei Ni (@feiskyer), Harry Zhang (@resouer)
+To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/).
-## Abstract
-
-This proposal aims to support the [HyperContainer](http://hypercontainer.io) container
-runtime in the Kubelet.
-
-## Motivation
-
-HyperContainer is a Hypervisor-agnostic Container Engine that allows you to run Docker images using
-hypervisors (KVM, Xen, etc.). By running containers within separate VM instances, it offers
-hardware-enforced isolation, which is required in multi-tenant environments.
-
-## Goals
-
-1. Complete pod/container/image lifecycle management with HyperContainer.
-2. Set up networking via network plugins.
-3. Pass 100% of the node e2e tests.
-4. Easy to deploy for both local dev/test and production clusters.
-
-## Design
-
-The HyperContainer runtime will make use of the kubelet Container Runtime Interface. [Frakti](https://github.com/kubernetes/frakti) implements the CRI interface and exposes
-a local endpoint to the Kubelet. Frakti communicates with [hyperd](https://github.com/hyperhq/hyperd)
-via its gRPC API to manage the lifecycle of sandboxes, containers, and images.
-
-
-
-## Limitations
-
-Since pods run directly inside a hypervisor, host networking is not supported by the HyperContainer
-runtime.
-
-## Development
-
-The HyperContainer runtime is maintained at <https://github.com/kubernetes/frakti>.
+Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/node/kubelet-rkt-runtime.md b/contributors/design-proposals/node/kubelet-rkt-runtime.md index 1bc6435b..f0fbec72 100644 --- a/contributors/design-proposals/node/kubelet-rkt-runtime.md +++ b/contributors/design-proposals/node/kubelet-rkt-runtime.md @@ -1,99 +1,6 @@ -Next generation rkt runtime integration -======================================= +Design proposals have been archived. -Authors: Euan Kemp (@euank), Yifan Gu (@yifan-gu) +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Abstract - -This proposal describes the design and road path for integrating rkt with kubelet with the new container runtime interface. - -## Background - -Currently, the Kubernetes project supports rkt as a container runtime via an implementation under [pkg/kubelet/rkt package](https://github.com/kubernetes/kubernetes/tree/v1.5.0-alpha.0/pkg/kubelet/rkt). - -This implementation, for historical reasons, has required implementing a large amount of logic shared by the original Docker implementation. - -In order to make additional container runtime integrations easier, more clearly defined, and more consistent, a new [Container Runtime Interface](https://github.com/kubernetes/kubernetes/blob/v1.5.0-alpha.0/pkg/kubelet/api/v1alpha1/runtime/api.proto) (CRI) is being designed. -The existing runtimes, in order to both prove the correctness of the interface and reduce maintenance burden, are incentivized to move to this interface. - -This document proposes how the rkt runtime integration will transition to using the CRI. - -## Goals - -### Full-featured - -The CRI integration must work as well as the existing integration in terms of features. - -Until that's the case, the existing integration will continue to be maintained. - -### Easy to Deploy - -The new integration should not be any more difficult to deploy and configure than the existing integration. - -### Easy to Develop - -This iteration should be as easy to work and iterate on as the original one. - -It will be available in an initial usable form quickly in order to validate the CRI. - -## Design - -In order to fulfill the above goals, the rkt CRI integration will make the following choices: - -### Remain in-process with Kubelet - -The current rkt container runtime integration is able to be deployed simply by deploying the kubelet binary. - -This is, in no small part, to make it *Easy to Deploy*. - -Remaining in-process also helps this integration not regress on performance, one axis of being *Full-Featured*. - -### Communicate through gRPC - -Although the kubelet and rktlet will be compiled together, the runtime and kubelet will still communicate through gRPC interface for better API abstraction. - -For the near short term, they will still talk through a unix socket before we implement a custom gRPC connection that skips the network stack. - -### Developed as a Separate Repository - -Brian Grant's discussion on splitting the Kubernetes project into [separate repos](https://github.com/kubernetes/kubernetes/issues/24343) is a compelling argument for why it makes sense to split this work into a separate repo. - -In order to be *Easy to Develop*, this iteration will be maintained as a separate repository, and re-vendored back in. - -This choice will also allow better long-term growth in terms of better issue-management, testing pipelines, and so on. 
- -Unfortunately, in the short term, it's possible that some aspects of this will also cause pain and it's very difficult to weight each side correctly. - -### Exec the rkt binary (initially) - -While significant work on the rkt [api-service](https://coreos.com/rkt/docs/latest/subcommands/api-service.html) has been made, -it has also been a source of problems and additional complexity, -and was never transitioned to entirely. - -In addition, the rkt cli has historically been the primary interface to the rkt runtime. - -The initial integration will execute the rkt binary directly for app creation/start/stop/removal, as well as image pulling/removal. - -The creation of pod sandbox is also done via rkt command line, but it will run under `systemd-run` so it's monitored by the init process. - -In the future, some of these decisions are expected to be changed such that rkt is vendored as a library dependency for all operations, and other init systems will be supported as well. - - -## Roadmap and Milestones - -1. rktlet integrate with kubelet to support basic pod/container lifecycle (pod creation, container creation/start/stop, pod stop/removal) [[Done]](https://github.com/kubernetes-incubator/rktlet/issues/9) -2. rktlet integrate with kubelet to support more advanced features: - - Support kubelet networking, host network - - Support mount / volumes [[#33526]](https://github.com/kubernetes/kubernetes/issues/33526) - - Support exposing ports - - Support privileged containers - - Support selinux options [[#33139]](https://github.com/kubernetes/kubernetes/issues/33139) - - Support attach [[#29579]](https://github.com/kubernetes/kubernetes/issues/29579) - - Support exec [[#29579]](https://github.com/kubernetes/kubernetes/issues/29579) - - Support logging [[#33111]](https://github.com/kubernetes/kubernetes/pull/33111) - -3. rktlet integrate with kubelet, pass 100% e2e and node e2e tests, with nspawn stage1. -4. rktlet integrate with kubelet, pass 100% e2e and node e2e tests, with kvm stage1. -5. Revendor rktlet into `pkg/kubelet/rktshim`, and start deprecating the `pkg/kubelet/rkt` package. -6. Eventually replace the current `pkg/kubelet/rkt` package. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
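As a rough sketch of the CRI-over-a-unix-socket wiring described above, the following Go snippet dials a gRPC client connection through a local socket. The socket path is a made-up placeholder, and the snippet stops short of constructing the generated CRI runtime and image service clients.

```go
package main

import (
	"context"
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Hypothetical socket path; the real endpoint is whatever the CRI shim
	// (e.g. rktlet) is configured to listen on.
	const socketPath = "/var/run/rktlet.sock"

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, socketPath,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		// Dial the target as a unix socket instead of a TCP address.
		grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
			return (&net.Dialer{}).DialContext(ctx, "unix", addr)
		}),
	)
	if err != nil {
		log.Fatalf("dial CRI endpoint: %v", err)
	}
	defer conn.Close()

	// A real integration would now construct the generated CRI clients
	// (RuntimeService/ImageService) on top of conn.
	log.Printf("created gRPC client connection to %s", socketPath)
}
```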
\ No newline at end of file diff --git a/contributors/design-proposals/node/kubelet-rootfs-distribution.md b/contributors/design-proposals/node/kubelet-rootfs-distribution.md index 16c29404..f0fbec72 100644 --- a/contributors/design-proposals/node/kubelet-rootfs-distribution.md +++ b/contributors/design-proposals/node/kubelet-rootfs-distribution.md @@ -1,165 +1,6 @@ -# Running Kubelet in a Chroot +Design proposals have been archived. -Authors: Vishnu Kannan \<vishh@google.com\>, Euan Kemp \<euan.kemp@coreos.com\>, Brandon Philips \<brandon.philips@coreos.com\> +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Introduction -The Kubelet is a critical component of Kubernetes that must be run on every node in a cluster. - -However, right now it's not always easy to run it *correctly*. The Kubelet has -a number of dependencies that must exist in its filesystem, including various -mount and network utilities. Missing any of these can lead to unexpected -differences between Kubernetes hosts. For example, the Google Container VM -image (GCI) is missing various mount commands even though the Kernel supports -those filesystem types. Similarly, CoreOS Container Linux intentionally doesn't ship with -many mount utilities or socat in the base image. Other distros have a related -problem of ensuring these dependencies are present and versioned appropriately -for the Kubelet. - -In order to solve this problem, it's proposed that running the Kubelet in a -prepackaged chroot should be a supported, recommended, way of running a fully -functioning Kubelet. - -## The Kubelet Chroot - -The easiest way to express all filesystem dependencies of the Kubelet comprehensively is to ship a filesystem image and run the Kubelet within it. The [hyperkube image](../../cluster/images/hyperkube/) already provides such a filesystem. - -Even though the hyperkube image is distributed as a container, this method of -running the Kubelet intentionally is using a chroot and is neither a container nor pod. - -The kubelet chroot will essentially operate as follows: - -``` -container-download-and-extract k8s.gcr.io/hyperkube:v1.4.0 /path/to/chroot -mount --make-shared /var/lib/kubelet -mount --rbind /var/lib/kubelet /path/to/chroot/var/lib/kubelet -# And many more mounts, omitted -... -chroot /path/to/kubelet /usr/bin/hyperkube kubelet -``` - -Note: Kubelet might need access to more directories on the host and we intend to identity mount all those directories into the chroot. A partial list can be found in the CoreOS Container Linux kubelet-wrapper script. -This logic will also naturally be abstracted so it's no more difficult for the user to run the Kubelet. - -Currently, the Kubelet does not need access to arbitrary paths on the host (as -hostPath volumes are managed entirely by the docker daemon process, including -SELinux context applying), so Kubelet makes no operations at those paths). - -This will likely change in the future, at which point a shared bindmount of `/` -will be made available at a known path in the Kubelet chroot. This change will -necessarily be more intrusive since it will require the kubelet to behave -differently (use the shared rootfs mount's path) when running within the -chroot. - -## Current Use - -This method of running the Kubelet is already in use by users of CoreOS Container Linux. 
The details of this implementation are found in the [kubelet wrapper documentation](https://coreos.com/kubernetes/docs/latest/kubelet-wrapper.html). - -## Implementation - -### Target Distros - -The two distros which benefit the most from this change are GCI and CoreOS Container Linux. Initially, these changes will only be implemented for those distros. - -This work will also only initially target the GCE provider and `kube-up` method of deployment. - -#### Hyperkube Image Packaging - -The Hyperkube image is distributed as part of an official release to the `k8s.gcr.io` registry, but is not included along with the `kube-up` artifacts used for deployment. - -This will need to be remediated in order to complete this proposal. - -### Testing & Rollout - -In order to ensure the paths remain complete, e2e tests *must* be run against a -Kubelet operating in this manner as part of the submit queue. - -To ensure that this feature does not unduly impact others, it will be added to -GCI, but gated behind a feature-flag for a sort confidence-building period -(e.g. `KUBE_RUN_HYPERKUBE_IMAGE=false`). A temporary non-blocking e2e job will -be added with that option. If the results look clean after a week, the -deployment option can be removed and the GCI image can completely switch over. - -Once that testing is in place, it can be rolled out across other distros as -desired. - - -#### Everything else - -In the initial implementation, rkt or docker can be used to extract the rootfs of the hyperkube image. rkt fly or a systemd unit (using [`RootDirectory`](https://www.freedesktop.org/software/systemd/man/systemd.exec.html#RootDirectory=)) can be used to perform the needed setup, chroot, and execution of the kubelet within that rootfs. - - - -## FAQ - -#### Will this replace or break other installation options? - -Other installation options include using RPMs, DEBs, and simply running the statically compiled Kubelet binary. - -All of these methods will continue working as they do now. In the future they may choose to also run the kubelet in this manner, but they don't necessarily have to. - - -#### Is this running the kubelet as a pod? - -This is different than running the Kubelet as a pod. Rather than using namespaces, it uses only a chroot and shared bind mounts. - -## Alternatives - -#### Container + Shared bindmounts - -Instead of using a chroot with shared bindmounts, a proper pod or container could be used if the container supported shared bindmounts. - -This introduces some additional complexity in requiring something more than just the bare minimum. It also relies on having a container runtime available and puts said runtime in the critical path for the Kubelet. - -#### "Dependency rootfs" aware kubelet - -The Kubelet could be made aware of the rootfs containing all its dependencies, but not chrooted into it (e.g. started with a `--dependency-root-dir=/path/to/extracted/container` flag). - -The Kubelet could then always search for the binary it wishes to run in that path first and prefer it, as well as preferring libraries in that path. It would effectively run all dependencies similar to the following: - -```bash -export PATH=${dep_root}/bin:${dep_root}/usr/bin:... -export LD_LIBRARY_PATH=${dep_root}/lib:${dep_root}/usr/lib:... 
-# Run 'mount': -$ ${dep_root}/lib/x86_64-linux-gnu/ld.so --inhibit-cache mount $args -``` - -**Downsides**: - -This adds significant complexity and, due to the dynamic library hackery, might require some container-specific knowledge of the Kubelet or a rootfs of a predetermined form. - -This solution would also have to still solve the packaging of that rootfs, though the solution would likely be identical to the solution for distributing the chroot-kubelet-rootfs. - -#### Waiting for Flexv2 + port-forwarding changes - -The CRI effort plans to change how [port-forward](https://github.com/kubernetes/kubernetes/issues/29579) works, towards a method which will not depend explicitly on socat or other networking utilities. - -Similarly, for the mount utilities, the [Flex Volume v2](https://github.com/kubernetes/features/issues/93) feature is aiming to solve this utility. - - -**Downsides**: - -This requires waiting on other features which might take a significant time to land. It also could end up not fully fixing the problem (e.g. pushing down port-forwarding to the runtime doesn't ensure the runtime doesn't rely on host utilities). - -The Flex Volume feature is several releases out from fully replacing the current volumes as well. - -Finally, there are dependencies that neither of these proposals cover. An -effort to identify these is underway [here](https://issues.k8s.io/26093). - -## Non-Alternatives - -#### Pod + containerized flag - -Currently, there's a `--containerized` flag. This flag doesn't actually remove the dependency on mount utilities on the node though, so does not solve the problem described here. It also is under consideration for [removal](https://issues.k8s.io/26093). - -## Open Questions - -#### Why not a mount namespace? - -#### Timeframe - -During the 1.6 timeframe, the changes mentioned in implementation will be undergone for the CoreOS Container Linux and GCI distros. - -Based on the test results and additional problems that may arise, rollout will -be determined from there. Hopefully the rollout can also occur in the 1.6 -timeframe. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
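To make the `RootDirectory=` variant mentioned under "Everything else" above more concrete, the following is a minimal sketch of a systemd service that runs the hyperkube kubelet inside a previously extracted rootfs. The unit name, chroot path, bind-mount list, and kubelet arguments are illustrative assumptions, not values defined by the proposal.

```bash
# Minimal sketch only; paths, unit name, and flags are assumptions.
cat <<'EOF' >/etc/systemd/system/kubelet-chroot.service
[Unit]
Description=Kubelet running inside a prepackaged hyperkube rootfs
After=network-online.target

[Service]
# Rootfs previously extracted from the hyperkube image (see the proposal's
# container-download-and-extract step), e.g. to /var/lib/kubelet-chroot.
RootDirectory=/var/lib/kubelet-chroot
# Bind host state the kubelet needs into the chroot (partial list; the proposal
# notes many more mounts are required in practice).
BindPaths=/var/lib/kubelet /etc/kubernetes
ExecStart=/usr/bin/hyperkube kubelet
Restart=always

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now kubelet-chroot.service
```

An equivalent setup can also be scripted directly with `mount --rbind` and `chroot`, as shown earlier in the proposal text.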
\ No newline at end of file diff --git a/contributors/design-proposals/node/kubelet-systemd.md b/contributors/design-proposals/node/kubelet-systemd.md index cef68d2a..f0fbec72 100644 --- a/contributors/design-proposals/node/kubelet-systemd.md +++ b/contributors/design-proposals/node/kubelet-systemd.md @@ -1,403 +1,6 @@ -# Kubelet and systemd interaction +Design proposals have been archived. -**Author**: Derek Carr (@derekwaynecarr) - -**Status**: Proposed - -## Motivation - -Many Linux distributions have either adopted, or plan to adopt `systemd` as their init system. - -This document describes how the node should be configured, and a set of enhancements that should -be made to the `kubelet` to better integrate with these distributions independent of container -runtime. - -## Scope of proposal - -This proposal does not account for running the `kubelet` in a container. - -## Background on systemd - -To help understand this proposal, we first provide a brief summary of `systemd` behavior. - -### systemd units - -`systemd` manages a hierarchy of `slice`, `scope`, and `service` units. - -* `service` - application on the server that is launched by `systemd`; how it should start/stop; -when it should be started; under what circumstances it should be restarted; and any resource -controls that should be applied to it. -* `scope` - a process or group of processes which are not launched by `systemd` (i.e. fork), like -a service, resource controls may be applied -* `slice` - organizes a hierarchy in which `scope` and `service` units are placed. a `slice` may -contain `slice`, `scope`, or `service` units; processes are attached to `service` and `scope` -units only, not to `slices`. The hierarchy is intended to be unified, meaning a process may -only belong to a single leaf node. - -### cgroup hierarchy: split versus unified hierarchies - -Classical `cgroup` hierarchies were split per resource group controller, and a process could -exist in different parts of the hierarchy. - -For example, a process `p1` could exist in each of the following at the same time: - -* `/sys/fs/cgroup/cpu/important/` -* `/sys/fs/cgroup/memory/unimportant/` -* `/sys/fs/cgroup/cpuacct/unimportant/` - -In addition, controllers for one resource group could depend on another in ways that were not -always obvious. - -For example, the `cpu` controller depends on the `cpuacct` controller yet they were treated -separately. - -Many found it confusing for a single process to belong to different nodes in the `cgroup` hierarchy -across controllers. - -The Kernel direction for `cgroup` support is to move toward a unified `cgroup` hierarchy, where the -per-controller hierarchies are eliminated in favor of hierarchies like the following: - -* `/sys/fs/cgroup/important/` -* `/sys/fs/cgroup/unimportant/` - -In a unified hierarchy, a process may only belong to a single node in the `cgroup` tree. - -### cgroupfs single writer - -The Kernel direction for `cgroup` management is to promote a single-writer model rather than -allowing multiple processes to independently write to parts of the file-system. - -In distributions that run `systemd` as their init system, the cgroup tree is managed by `systemd` -by default since it implicitly interacts with the cgroup tree when starting units. Manual changes -made by other cgroup managers to the cgroup tree are not guaranteed to be preserved unless `systemd` -is made aware. `systemd` can be told to ignore sections of the cgroup tree by configuring the unit -to have the `Delegate=` option. 
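As a concrete illustration of the `Delegate=` option just mentioned, a drop-in along the following lines marks a unit as delegated so that cgroups created underneath it by another manager are left untouched by `systemd`. The target unit (`docker.service`) and file name are only examples.

```bash
# Illustrative drop-in; the target unit and file name are examples, not part of this proposal.
mkdir -p /etc/systemd/system/docker.service.d
cat <<'EOF' >/etc/systemd/system/docker.service.d/10-delegate.conf
[Service]
# Ask systemd not to manage cgroups created below this unit by other writers.
Delegate=yes
EOF
systemctl daemon-reload
systemctl restart docker.service
```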
- -See: http://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#Delegate= - -### cgroup management with systemd and container runtimes - -A `slice` corresponds to an inner-node in the `cgroup` file-system hierarchy. - -For example, the `system.slice` is represented as follows: - -`/sys/fs/cgroup/<controller>/system.slice` - -A `slice` is nested in the hierarchy by its naming convention. - -For example, the `system-foo.slice` is represented as follows: - -`/sys/fs/cgroup/<controller>/system.slice/system-foo.slice/` - -A `service` or `scope` corresponds to leaf nodes in the `cgroup` file-system hierarchy managed by -`systemd`. Services and scopes can have child nodes managed outside of `systemd` if they have been -delegated with the `Delegate=` option. - -For example, if the `docker.service` is associated with the `system.slice`, it is -represented as follows: - -`/sys/fs/cgroup/<controller>/system.slice/docker.service/` - -To demonstrate the use of `scope` units using the `docker` container runtime, if a -user launches a container via `docker run -m 100M busybox`, a `scope` will be created -because the process was not launched by `systemd` itself. The `scope` is parented by -the `slice` associated with the launching daemon. - -For example: - -`/sys/fs/cgroup/<controller>/system.slice/docker-<container-id>.scope` - -`systemd` defines a set of slices. By default, service and scope units are placed in -`system.slice`, virtual machines and containers registered with `systemd-machined` are -found in `machine.slice`, and user sessions handled by `systemd-logind` in `user.slice`. - -## Node Configuration on systemd - -### kubelet cgroup driver - -The `kubelet` reads and writes to the `cgroup` tree during bootstrapping -of the node. In the future, it will write to the `cgroup` tree to satisfy other -purposes around quality of service, etc. - -The `kubelet` must cooperate with `systemd` in order to ensure proper function of the -system. The bootstrapping requirements for a `systemd` system are different than one -without it. - -The `kubelet` will accept a new flag to control how it interacts with the `cgroup` tree. - -* `--cgroup-driver=` - cgroup driver used by the kubelet. `cgroupfs` or `systemd`. - -By default, the `kubelet` should default `--cgroup-driver` to `systemd` on `systemd` distributions. - -The `kubelet` should associate node bootstrapping semantics to the configured -`cgroup driver`. - -### Node allocatable - -The proposal makes no changes to the definition as presented here: -https://git.k8s.io/kubernetes/docs/proposals/node-allocatable.md - -The node will report a set of allocatable compute resources defined as follows: - -`[Allocatable] = [Node Capacity] - [Kube-Reserved] - [System-Reserved]` - -### Node capacity - -The `kubelet` will continue to interface with `cAdvisor` to determine node capacity. - -### System reserved - -The node may set aside a set of designated resources for non-Kubernetes components. - -The `kubelet` accepts the followings flags that support this feature: - -* `--system-reserved=` - A set of `ResourceName`=`ResourceQuantity` pairs that -describe resources reserved for host daemons. -* `--system-container=` - Optional resource-only container in which to place all -non-kernel processes that are not already in a container. Empty for no container. -Rolling back the flag requires a reboot. (Default: ""). - -The current meaning of `system-container` is inadequate on `systemd` environments. 
-The `kubelet` should use the flag to know the location that has the processes that -are associated with `system-reserved`, but it should not modify the cgroups of -existing processes on the system during bootstrapping of the node. This is -because `systemd` is the `cgroup manager` on the host and it has not delegated -authority to the `kubelet` to change how it manages `units`. - -The following describes the type of things that can happen if this does not change: -https://bugzilla.redhat.com/show_bug.cgi?id=1202859 - -As a result, the `kubelet` needs to distinguish placement of non-kernel processes -based on the cgroup driver, and only do its current behavior when not on `systemd`. - -The flag should be modified as follows: - -* `--system-container=` - Name of resource-only container that holds all -non-kernel processes whose resource consumption is accounted under -system-reserved. The default value is cgroup driver specific. systemd -defaults to system, cgroupfs defines no default. Rolling back the flag -requires a reboot. - -The `kubelet` will error if the defined `--system-container` does not exist -on `systemd` environments. It will verify that the appropriate `cpu` and `memory` -controllers are enabled. - -### Kubernetes reserved - -The node may set aside a set of resources for Kubernetes components: - -* `--kube-reserved=:` - A set of `ResourceName`=`ResourceQuantity` pairs that -describe resources reserved for host daemons. - -The `kubelet` does not enforce `--kube-reserved` at this time, but the ability -to distinguish the static reservation from observed usage is important for node accounting. - -This proposal asserts that `kubernetes.slice` is the default slice associated with -the `kubelet` and `kube-proxy` service units defined in the project. Keeping it -separate from `system.slice` allows for accounting to be distinguished separately. - -The `kubelet` will detect its `cgroup` to track `kube-reserved` observed usage on `systemd`. -If the `kubelet` detects that its a child of the `system-container` based on the observed -`cgroup` hierarchy, it will warn. - -If the `kubelet` is launched directly from a terminal, it's most likely destination will -be in a `scope` that is a child of `user.slice` as follows: - -`/sys/fs/cgroup/<controller>/user.slice/user-1000.slice/session-1.scope` - -In this context, the parent `scope` is what will be used to facilitate local developer -debugging scenarios for tracking `kube-reserved` usage. - -The `kubelet` has the following flag: - -* `--resource-container="/kubelet":` Absolute name of the resource-only container to create -and run the Kubelet in (Default: /kubelet). - -This flag will not be supported on `systemd` environments since the init system has already -spawned the process and placed it in the corresponding container associated with its unit. - -### Kubernetes container runtime reserved - -This proposal asserts that the reservation of compute resources for any associated -container runtime daemons is tracked by the operator under the `system-reserved` or -`kubernetes-reserved` values and any enforced limits are set by the -operator specific to the container runtime. - -**Docker** - -If the `kubelet` is configured with the `container-runtime` set to `docker`, the -`kubelet` will detect the `cgroup` associated with the `docker` daemon and use that -to do local node accounting. 
If an operator wants to impose runtime limits on the -`docker` daemon to control resource usage, the operator should set those explicitly in -the `service` unit that launches `docker`. The `kubelet` will not set any limits itself -at this time and will assume whatever budget was set aside for `docker` was included in -either `--kube-reserved` or `--system-reserved` reservations. - -Many OS distributions package `docker` by default, and it will often belong to the -`system.slice` hierarchy, and therefore operators will need to budget it for there -by default unless they explicitly move it. - -**rkt** - -rkt has no client/server daemon, and therefore has no explicit requirements on container-runtime -reservation. - -### kubelet cgroup enforcement - -The `kubelet` does not enforce the `system-reserved` or `kube-reserved` values by default. - -The `kubelet` should support an additional flag to turn on enforcement: - -* `--system-reserved-enforce=false` - Optional flag that if true tells the `kubelet` -to enforce the `system-reserved` constraints defined (if any) -* `--kube-reserved-enforce=false` - Optional flag that if true tells the `kubelet` -to enforce the `kube-reserved` constraints defined (if any) - -Usage of this flag requires that end-user containers are launched in a separate part -of cgroup hierarchy via `cgroup-root`. - -If this flag is enabled, the `kubelet` will continually validate that the configured -resource constraints are applied on the associated `cgroup`. - -### kubelet cgroup-root behavior under systemd - -The `kubelet` supports a `cgroup-root` flag which is the optional root `cgroup` to use for pods. - -This flag should be treated as a pass-through to the underlying configured container runtime. - -If `--cgroup-enforce=true`, this flag warrants special consideration by the operator depending -on how the node was configured. For example, if the container runtime is `docker` and its using -the `systemd` cgroup driver, then `docker` will take the daemon wide default and launch containers -in the same slice associated with the `docker.service`. By default, this would mean `system.slice` -which could cause end-user pods to be launched in the same part of the cgroup hierarchy as system daemons. - -In those environments, it is recommended that `cgroup-root` is configured to be a subtree of `machine.slice`. - -### Proposed cgroup hierarchy - -``` -$ROOT - | - +- system.slice - | | - | +- sshd.service - | +- docker.service (optional) - | +- ... - | - +- kubernetes.slice - | | - | +- kubelet.service - | +- docker.service (optional) - | - +- machine.slice (container runtime specific) - | | - | +- docker-<container-id>.scope - | - +- user.slice - | +- ... -``` - -* `system.slice` corresponds to `--system-reserved`, and contains any services the -operator brought to the node as normal configuration. -* `kubernetes.slice` corresponds to the `--kube-reserved`, and contains kube specific -daemons. -* `machine.slice` should parent all end-user containers on the system and serve as the -root of the end-user cluster workloads run on the system. -* `user.slice` is not explicitly tracked by the `kubelet`, but it is possible that `ssh` -sessions to the node where the user launches actions directly. Any resource accounting -reserved for those actions should be part of `system-reserved`. - -The container runtime daemon, `docker` in this outline, must be accounted for in either -`system.slice` or `kubernetes.slice`. 
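A rough sketch of how an operator could realize the `kubernetes.slice` placement described above is shown below. The drop-in file names are assumptions, and moving `docker.service` is optional depending on where the runtime is budgeted.

```bash
# Sketch only: place the kubelet (and optionally the container runtime) under
# kubernetes.slice so kube-reserved usage can be accounted separately.
mkdir -p /etc/systemd/system/kubelet.service.d /etc/systemd/system/docker.service.d

cat <<'EOF' >/etc/systemd/system/kubelet.service.d/10-slice.conf
[Service]
Slice=kubernetes.slice
EOF

# Optional: account the runtime under kubernetes.slice instead of system.slice.
cat <<'EOF' >/etc/systemd/system/docker.service.d/10-slice.conf
[Service]
Slice=kubernetes.slice
EOF

systemctl daemon-reload
systemctl restart kubelet.service docker.service

# Inspect the resulting cgroup placement.
systemd-cgls
```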
- -In the future, the depth of the container hierarchy is not recommended to be rooted -more than 2 layers below the root as it historically has caused issues with node performance -in other `cgroup` aware systems (https://bugzilla.redhat.com/show_bug.cgi?id=850718). It -is anticipated that the `kubelet` will parent containers based on quality of service -in the future. In that environment, those changes will be relative to the configured -`cgroup-root`. - -### Linux Kernel Parameters - -The `kubelet` will set the following: - -* `sysctl -w vm.overcommit_memory=1` -* `sysctl -w vm.panic_on_oom=0` -* `sysctl -w kernel/panic=10` -* `sysctl -w kernel/panic_on_oops=1` - -### OOM Score Adjustment - -The `kubelet` at bootstrapping will set the `oom_score_adj` value for Kubernetes -daemons, and any dependent container-runtime daemons. - -If `container-runtime` is set to `docker`, then set its `oom_score_adj=-999` - -## Implementation concerns - -### kubelet block-level architecture - -``` -+----------+ +----------+ +----------+ -| | | | | Pod | -| Node <-------+ Container<----+ Lifecycle| -| Manager | | Manager | | Manager | -| +-------> | | | -+---+------+ +-----+----+ +----------+ - | | - | | - | +-----------------+ - | | | - | | | -+---v--v--+ +-----v----+ -| cgroups | | container| -| library | | runtimes | -+---+-----+ +-----+----+ - | | - | | - +---------+----------+ - | - | - +-----------v-----------+ - | Linux Kernel | - +-----------------------+ -``` - -The `kubelet` should move to an architecture that resembles the above diagram: - -* The `kubelet` should not interface directly with the `cgroup` file-system, but instead -should use a common `cgroups library` that has the proper abstraction in place to -work with either `cgroupfs` or `systemd`. The `kubelet` should just use `libcontainer` -abstractions to facilitate this requirement. The `libcontainer` abstractions as -currently defined only support an `Apply(pid)` pattern, and we need to separate that -abstraction to allow cgroup to be created and then later joined. -* The existing `ContainerManager` should separate node bootstrapping into a separate -`NodeManager` that is dependent on the configured `cgroup-driver`. -* The `kubelet` flags for cgroup paths will convert internally as part of cgroup library, -i.e. `/foo/bar` will just convert to `foo-bar.slice` - -### kubelet accounting for end-user pods - -This proposal re-enforces that it is inappropriate at this time to depend on `--cgroup-root` as the -primary mechanism to distinguish and account for end-user pod compute resource usage. - -Instead, the `kubelet` can and should sum the usage of each running `pod` on the node to account for -end-user pod usage separate from system-reserved and kubernetes-reserved accounting via `cAdvisor`. - -## Known issues - -### Docker runtime support for --cgroup-parent - -Docker versions <= 1.0.9 did not have proper support for `-cgroup-parent` flag on `systemd`. This -was fixed in this PR (https://github.com/docker/docker/pull/18612). As result, it's expected -that containers launched by the `docker` daemon may continue to go in the default `system.slice` and -appear to be counted under system-reserved node usage accounting. - -If operators run with later versions of `docker`, they can avoid this issue via the use of `cgroup-root` -flag on the `kubelet`, but this proposal makes no requirement on operators to do that at this time, and -this can be revisited if/when the project adopts docker 1.10. 
- -Some OS distributions will fix this bug in versions of docker <= 1.0.9, so operators should -be aware of how their version of `docker` was packaged when using this feature. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
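For reference, the kernel parameters listed in the proposal above can be applied and persisted roughly as in the following sketch; the file name under `/etc/sysctl.d/` is an assumption.

```bash
# Sketch: persist the kernel parameters the kubelet is expected to set.
cat <<'EOF' >/etc/sysctl.d/90-kubelet.conf
vm.overcommit_memory = 1
vm.panic_on_oom = 0
kernel.panic = 10
kernel.panic_on_oops = 1
EOF
sysctl --system   # reload settings from all sysctl configuration files
```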
\ No newline at end of file diff --git a/contributors/design-proposals/node/node-allocatable.md b/contributors/design-proposals/node/node-allocatable.md index 07f52b74..f0fbec72 100644 --- a/contributors/design-proposals/node/node-allocatable.md +++ b/contributors/design-proposals/node/node-allocatable.md @@ -1,336 +1,6 @@ -# Node Allocatable Resources +Design proposals have been archived. -### Authors: timstclair@, vishh@ +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Overview - -Kubernetes nodes typically run many OS system daemons in addition to kubernetes daemons like kubelet, runtime, etc. and user pods. -Kubernetes assumes that all the compute resources available, referred to as `Capacity`, in a node are available for user pods. -In reality, system daemons use non-trivial amount of resources and their availability is critical for the stability of the system. -To address this issue, this proposal introduces the concept of `Allocatable` which identifies the amount of compute resources available to user pods. -Specifically, the kubelet will provide a few knobs to reserve resources for OS system daemons and kubernetes daemons. - -By explicitly reserving compute resources, the intention is to avoid overcommiting the node and not have system daemons compete with user pods. -The resources available to system daemons and user pods will be capped based on user specified reservations. - -If `Allocatable` is available, the scheduler will use that instead of `Capacity`, thereby not overcommiting the node. - -## Design - -### Definitions - -1. **Node Capacity** - Already provided as - [`NodeStatus.Capacity`](https://htmlpreview.github.io/?https://github.com/kubernetes/kubernetes/blob/HEAD/docs/api-reference/v1/definitions.html#_v1_nodestatus), - this is total capacity read from the node instance, and assumed to be constant. -2. **System-Reserved** (proposed) - Compute resources reserved for processes which are not managed by - Kubernetes. Currently this covers all the processes lumped together in the `/system` raw - container. -3. **Kubelet Allocatable** - Compute resources available for scheduling (including scheduled & - unscheduled resources). This value is the focus of this proposal. See [below](#api-changes) for - more details. -4. **Kube-Reserved** (proposed) - Compute resources reserved for Kubernetes components such as the - docker daemon, kubelet, kube proxy, etc. - -### API changes - -#### Allocatable - -Add `Allocatable` (4) to -[`NodeStatus`](https://htmlpreview.github.io/?https://github.com/kubernetes/kubernetes/blob/HEAD/docs/api-reference/v1/definitions.html#_v1_nodestatus): - -``` -type NodeStatus struct { - ... - // Allocatable represents schedulable resources of a node. - Allocatable ResourceList `json:"allocatable,omitempty"` - ... -} -``` - -Allocatable will be computed by the Kubelet and reported to the API server. It is defined to be: - -``` - [Allocatable] = [Node Capacity] - [Kube-Reserved] - [System-Reserved] - [Hard-Eviction-Threshold] -``` - -The scheduler will use `Allocatable` in place of `Capacity` when scheduling pods, and the Kubelet -will use it when performing admission checks. - -*Note: Since kernel usage can fluctuate and is out of kubernetes control, it will be reported as a - separate value (probably via the metrics API). 
Reporting kernel usage is out-of-scope for this - proposal.* - -#### Kube-Reserved - -`KubeReserved` is the parameter specifying resources reserved for kubernetes components (4). It is -provided as a command-line flag to the Kubelet at startup, and therefore cannot be changed during -normal Kubelet operation (this may change in the [future](#future-work)). - -The flag will be specified as a serialized `ResourceList`, with resources defined by the API -`ResourceName` and values specified in `resource.Quantity` format, e.g.: - -``` ---kube-reserved=cpu=500m,memory=5Mi -``` - -Initially we will only support CPU and memory, but will eventually support more resources like [local storage](#phase-3) and io proportional weights to improve node reliability. - -#### System-Reserved - -In the initial implementation, `SystemReserved` will be functionally equivalent to -[`KubeReserved`](#kube-reserved), but with a different semantic meaning. While KubeReserved -designates resources set aside for kubernetes components, SystemReserved designates resources set -aside for non-kubernetes components (currently this is reported as all the processes lumped -together in the `/system` raw container on non-systemd nodes). - -## Kubelet Evictions Thresholds - -To improve the reliability of nodes, kubelet evicts pods whenever the node runs out of memory or local storage. -Together, evictions and node allocatable help improve node stability. - -As of v1.5, evictions are based on overall node usage relative to `Capacity`. -Kubelet evicts pods based on QoS and user configured eviction thresholds. -More details in [this doc](./kubelet-eviction.md#enforce-node-allocatable) - -From v1.6, if `Allocatable` is enforced by default across all pods on a node using cgroups, pods cannot exceed `Allocatable`. -Memory and CPU limits are enforced using cgroups, but there exists no easy means to enforce storage limits though. -Enforcing storage limits using Linux Quota is not possible since it's not hierarchical. -Once storage is supported as a resource for `Allocatable`, Kubelet has to perform evictions based on `Allocatable` in addition to `Capacity`. - -Note that eviction limits are enforced on pods only and system daemons are free to use any amount of resources unless their reservations are enforced. - -Here is an example to illustrate Node Allocatable for memory: - -Node Capacity is `32Gi`, kube-reserved is `2Gi`, system-reserved is `1Gi`, eviction-hard is set to `<100Mi` - -For this node, the effective Node Allocatable is `28.9Gi` only; i.e. if kube and system components use up all their reservation, the memory available for pods is only `28.9Gi` and kubelet will evict pods once overall usage of pods crosses that threshold. - -If we enforce Node Allocatable (`28.9Gi`) via top level cgroups, then pods can never exceed `28.9Gi` in which case evictions will not be performed unless kernel memory consumption is above `100Mi`. - -In order to support evictions and avoid memcg OOM kills for pods, we will set the top level cgroup limits for pods to be `Node Allocatable` + `Eviction Hard Thresholds`. - -However, the scheduler is not expected to use more than `28.9Gi` and so `Node Allocatable` on Node Status will be `28.9Gi`. - -If kube and system components do not use up all their reservation, with the above example, pods will face memcg OOM kills from the node allocatable cgroup before kubelet evictions kick in. 
-To better enforce QoS under this situation, Kubelet will apply the hard eviction thresholds on the node allocatable cgroup as well, if node allocatable is enforced. -The resulting behavior will be the same for user pods. -With the above example, Kubelet will evict pods whenever pods consume more than `28.9Gi` which will be `<100Mi` from `29Gi` which will be the memory limits on the Node Allocatable cgroup. - -## General guidelines - -System daemons are expected to be treated similar to `Guaranteed` pods. -System daemons can burst within their bounding cgroups and this behavior needs to be managed as part of kubernetes deployment. -For example, Kubelet can have its own cgroup and share `KubeReserved` resources with the Container Runtime. -However, Kubelet cannot burst and use up all available Node resources if `KubeReserved` is enforced. - -Users are advised to be extra careful while enforcing `SystemReserved` reservation since it can lead to critical services being CPU starved or OOM killed on the nodes. -The recommendation is to enforce `SystemReserved` only if a user has profiled their nodes exhaustively to come up with precise estimates. - -To begin with enforce `Allocatable` on `pods` only. -Once adequate monitoring and alerting is in place to track kube daemons, attempt to enforce `KubeReserved` based on heuristics. -More on this in [Phase 2](#phase-2-enforce-allocatable-on-pods). - -The resource requirements of kube system daemons will grow over time as more and more features are added. -Over time, the project will attempt to bring down utilization, but that is not a priority as of now. -So expect a drop in `Allocatable` capacity over time. - -`Systemd-logind` places ssh sessions under `/user.slice`. -Its usage will not be accounted for in the nodes. -Take into account resource reservation for `/user.slice` while configuring `SystemReserved`. -Ideally `/user.slice` should reside under `SystemReserved` top level cgroup. - -## Recommended Cgroups Setup - -Following is the recommended cgroup configuration for Kubernetes nodes. -All OS system daemons are expected to be placed under a top level `SystemReserved` cgroup. -`Kubelet` and `Container Runtime` are expected to be placed under `KubeReserved` cgroup. -The reason for recommending placing the `Container Runtime` under `KubeReserved` is as follows: - -1. A container runtime on Kubernetes nodes is not expected to be used outside of the Kubelet. -1. It's resource consumption is tied to the number of pods running on a node. - -Note that the hierarchy below recommends having dedicated cgroups for kubelet and the runtime to individually track their usage. -```text - -/ (Cgroup Root) -. -+..systemreserved or system.slice (Specified via `--system-reserved-cgroup`; `SystemReserved` enforced here *optionally* by kubelet) -. . .tasks(sshd,udev,etc) -. -. -+..podruntime or podruntime.slice (Specified via `--kube-reserved-cgroup`; `KubeReserved` enforced here *optionally* by kubelet) -. . -. +..kubelet -. . .tasks(kubelet) -. . -. +..runtime -. .tasks(docker-engine, containerd) -. -. -+..kubepods or kubepods.slice (Node Allocatable enforced here by Kubelet) -. . -. +..PodGuaranteed -. . . -. . +..Container1 -. . . .tasks(container processes) -. . . -. . +..PodOverhead -. . . .tasks(per-pod processes) -. . ... -. . -. +..Burstable -. . . -. . +..PodBurstable -. . . . -. . . +..Container1 -. . . . .tasks(container processes) -. . . +..Container2 -. . . . .tasks(container processes) -. . . . -. . . ... -. . . -. . ... -. . -. . -. +..Besteffort -. 
. . -. . +..PodBesteffort -. . . . -. . . +..Container1 -. . . . .tasks(container processes) -. . . +..Container2 -. . . . .tasks(container processes) -. . . . -. . . ... -. . . -. . ... - -``` - -`systemreserved` & `kubereserved` cgroups are expected to be created by users. -If Kubelet is creating cgroups for itself and docker daemon, it will create the `kubereserved` cgroups automatically. - -`kubepods` cgroups will be created by kubelet automatically if it is not already there. -Creation of `kubepods` cgroup is tied to QoS Cgroup support which is controlled by `--cgroups-per-qos` flag. -If the cgroup driver is set to `systemd` then Kubelet will create a `kubepods.slice` via systemd. -By default, Kubelet will `mkdir` `/kubepods` cgroup directly via cgroupfs. - -#### Containerizing Kubelet - -If Kubelet is managed using a container runtime, have the runtime create cgroups for kubelet under `kubereserved`. - -### Metrics - -Kubelet identifies it's own cgroup and exposes it's usage metrics via the Summary metrics API (/stats/summary) -With docker runtime, kubelet identifies docker runtime's cgroups too and exposes metrics for it via the Summary metrics API. -To provide a complete overview of a node, Kubelet will expose metrics from cgroups enforcing `SystemReserved`, `KubeReserved` & `Allocatable` too. - -## Implementation Phases - -### Phase 1 - Introduce Allocatable to the system without enforcement - -**Status**: Implemented v1.2 - -In this phase, Kubelet will support specifying `KubeReserved` & `SystemReserved` resource reservations via kubelet flags. -The defaults for these flags will be `""`, meaning zero cpu or memory reservations. -Kubelet will compute `Allocatable` and update `Node.Status` to include it. -The scheduler will use `Allocatable` instead of `Capacity` if it is available. - -### Phase 2 - Enforce Allocatable on Pods - -**Status**: Targeted for v1.6 - -In this phase, Kubelet will automatically create a top level cgroup to enforce Node Allocatable across all user pods. -The creation of this cgroup is controlled by `--cgroups-per-qos` flag. - -Kubelet will support specifying the top level cgroups for `KubeReserved` and `SystemReserved` and support *optionally* placing resource restrictions on these top level cgroups. - -Users are expected to specify `KubeReserved` and `SystemReserved` based on their deployment requirements. - -Resource requirements for Kubelet and the runtime is typically proportional to the number of pods running on a node. -Once a user identified the maximum pod density for each of their nodes, they will be able to compute `KubeReserved` using [this performance dashboard](http://node-perf-dash.k8s.io/#/builds). -[This blog post](https://kubernetes.io/blog/2016/11/visualize-kubelet-performance-with-node-dashboard/) explains how the dashboard has to be interpreted. -Note that this dashboard provides usage metrics for docker runtime only as of now. - -Support for evictions based on Allocatable will be introduced in this phase. - -New flags introduced in this phase are as follows: - -1. `--enforce-node-allocatable=[pods][,][kube-reserved][,][system-reserved]` - - * This flag will default to `pods` in v1.6. - * This flag will be a `no-op` unless `--kube-reserved` and/or `--system-reserved` has been specified. - * If `--cgroups-per-qos=false`, then this flag has to be set to `""`. Otherwise its an error and kubelet will fail. - * It is recommended to drain and restart nodes prior to upgrading to v1.6. 
This is necessary for `--cgroups-per-qos` feature anyways which is expected to be turned on by default in `v1.6`. - * Users intending to turn off this feature can set this flag to `""`. - * Specifying `kube-reserved` value in this flag is invalid if `--kube-reserved-cgroup` flag is not specified. - * Specifying `system-reserved` value in this flag is invalid if `--system-reserved-cgroup` flag is not specified. - * By including `kube-reserved` or `system-reserved` in this flag's value, and by specifying the following two flags, Kubelet will attempt to enforce the reservations specified via `--kube-reserved` & `system-reserved` respectively. - -2. `--kube-reserved-cgroup=<absolute path to a cgroup>` - * This flag helps kubelet identify the control group managing all kube components like Kubelet & container runtime that fall under the `KubeReserved` reservation. - * Example: `/kube.slice`. Note that absolute paths are required and systemd naming scheme isn't supported. - -3. `--system-reserved-cgroup=<absolute path to a cgroup>` - * This flag helps kubelet identify the control group managing all OS specific system daemons that fall under the `SystemReserved` reservation. - * Example: `/system.slice`. Note that absolute paths are required and systemd naming scheme isn't supported. - -4. `--experimental-node-allocatable-ignore-eviction-threshold` - * This flag is provided as an `opt-out` option to avoid including Hard eviction thresholds in Node Allocatable which can impact existing clusters. - * The default value is `false`. - -#### Rollout details - -This phase is expected to improve Kubernetes node stability. -However it requires users to specify non-default values for `--kube-reserved` & `--system-reserved` flags though. - -The rollout of this phase has been long due and hence we are attempting to include it in v1.6. - -Since `KubeReserved` and `SystemReserved` continue to have `""` as defaults, the node's `Allocatable` does not change automatically. -Since this phase requires node drains (or pod restarts/terminations), it is considered disruptive to users. - -To rollback this phase, set `--enforce-node-allocatable` flag to `""` and `--experimental-node-allocatable-ignore-eviction-threshold` to `true`. -The former disables Node Allocatable enforcement on all pods and the latter avoids including hard eviction thresholds in Node Allocatable. - -This rollout in v1.6 might cause the following symptoms: - -1. If `--kube-reserved` and/or `--system-reserved` flags are also specified, OOM kills of containers and/or evictions of pods. This can happen primarily to `Burstable` and `BestEffort` pods since they can no longer use up all the resource available on the node. -1. Total allocatable capacity in the cluster reduces resulting in pods staying `Pending` because Hard Eviction Thresholds are included in Node Allocatable. - -##### Proposed Timeline - -```text -02/14/2017 - Discuss the rollout plan in sig-node meeting -02/15/2017 - Flip the switch to enable pod level cgroups by default -02/21/2017 - Merge phase 2 implementation -02/27/2017 - Kubernetes Feature complete (i.e. code freeze) -03/01/2017 - Send an announcement to kubernetes-dev@ about this rollout along with rollback options and potential issues. Recommend users to set kube and system reserved. 
-03/22/2017 - Kubernetes 1.6 release -``` - -### Phase 3 - Metrics & support for Storage - -*Status*: Targeted for v1.7 - -In this phase, Kubelet will expose usage metrics for `KubeReserved`, `SystemReserved` and `Allocatable` top level cgroups via Summary metrics API. -`Storage` will also be introduced as a reservable resource in this phase. - -## Known Issues - -### Kubernetes reservation is smaller than kubernetes component usage - -**Solution**: Initially, do nothing (best effort). Let the kubernetes daemons overflow the reserved -resources and hope for the best. If the node usage is less than Allocatable, there will be some room -for overflow and the node should continue to function. If the node has been scheduled to `allocatable` -(worst-case scenario) it may enter an unstable state, which is the current behavior in this -situation. - -A recommended alternative is to enforce KubeReserved once Kubelet supports it (Phase 2). -In the future we may set a parent cgroup for kubernetes components, with limits set -according to `KubeReserved`. - -### 3rd party schedulers - -The community should be notified that an update to schedulers is recommended, but if a scheduler is -not updated it falls under the above case of "scheduler is not allocatable-resources aware". +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
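Putting the Phase 2 flags together with the worked memory example above (32Gi capacity, 2Gi kube-reserved, 1Gi system-reserved, 100Mi hard eviction, hence 28.9Gi allocatable), a kubelet invocation could look roughly like the sketch below. The CPU values are illustrative assumptions; the cgroup paths follow the flag examples given in the proposal.

```bash
# Sketch only; memory values follow the proposal's example, CPU values are assumptions.
kubelet \
  --enforce-node-allocatable=pods \
  --kube-reserved=cpu=500m,memory=2Gi \
  --system-reserved=cpu=500m,memory=1Gi \
  --kube-reserved-cgroup=/kube.slice \
  --system-reserved-cgroup=/system.slice \
  --eviction-hard=memory.available<100Mi
# Resulting Allocatable reported on the node:
#   32Gi - 2Gi - 1Gi - 100Mi = 28.9Gi
```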
\ No newline at end of file diff --git a/contributors/design-proposals/node/node-usernamespace-remapping.md b/contributors/design-proposals/node/node-usernamespace-remapping.md index 37f22836..f0fbec72 100644 --- a/contributors/design-proposals/node/node-usernamespace-remapping.md +++ b/contributors/design-proposals/node/node-usernamespace-remapping.md @@ -1,209 +1,6 @@ -# Support Node-Level User Namespaces Remapping +Design proposals have been archived. -- [Summary](#summary) -- [Motivation](#motivation) -- [Goals](#goals) -- [Non-Goals](#non-goals) -- [Use Stories](#user-stories) -- [Proposal](#proposal) -- [Future Work](#future-work) -- [Risks and Mitigations](risks-and-mitigations) -- [Graduation Criteria](graduation-criteria) -- [Alternatives](alternatives) +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -_Authors:_ - -* Mrunal Patel <mpatel@redhat.com> -* Jan Pazdziora <jpazdziora@redhat.com> -* Vikas Choudhary <vichoudh@redhat.com> - -## Summary -Container security consists of many different kernel features that work together to make containers secure. User namespaces is one such feature that enables interesting possibilities for containers by allowing them to be root inside the container while not being root on the host. This gives more capabilities to the containers while protecting the host from the container being root and adds one more layer to container security. -In this proposal we discuss: -- use-cases/user-stories that benefit from this enhancement -- implementation design and scope for alpha release -- long-term roadmap to fully support this feature beyond alpha - -## Motivation -From user_namespaces(7): -> User namespaces isolate security-related identifiers and attributes, in particular, user IDs and group IDs, the root directory, keys, and capabilities. A process's user and group IDs can be different inside and outside a user namespace. In particular, a process can have a normal unprivileged user ID outside a user namespace while at the same time having a user ID of 0 inside the namespace; in other words, the process has full privileges for operations inside the user namespace, but is unprivileged for operations outside the namespace. - -In order to run Pods with software which expects to run as root or with elevated privileges while still containing the processes and protecting both the Nodes and other Pods, Linux kernel mechanism of user namespaces can be used make the processes in the Pods view their environment as having the privileges, while on the host (Node) level these processes appear as without privileges or with privileges only affecting processes in the same Pods - -The purpose of using user namespaces in Kubernetes is to let the processes in Pods think they run as one uid set when in fact they run as different “real” uids on the Nodes. - -In this text, most everything said about uids can also be applied to gids. - -## Goals -Enable user namespace support in a kubernetes cluster so that workloads that work today also work with user namespaces enabled at runtime. Furthermore, make workloads that require root/privileged user inside the container, safer for the node using the additional security of user namespaces. Containers will run in a user namespace different from user-namespace of the underlying host. - -## Non-Goals -- Non-goal is to support pod/container level user namespace isolation. 
There can be images using different users but on the node, pods/containers running with these images will share common user namespace remapping configuration. In other words, all containers on a node share a common user-namespace range. -- Remote volumes support eg. NFS - -## User Stories -- As a cluster admin, I want to protect the node from the rogue container process(es) running inside pod containers with root privileges. If such a process is able to break out into the node, it could be a security issue. -- As a cluster admin, I want to support all the images irrespective of what user/group that image is using. -- As a cluster admin, I want to allow some pods to disable user namespaces if they require elevated privileges. - -## Proposal -Proposal is to support user-namespaces for the pod containers. This can be done at two levels: -- Node-level : This proposal explains this part in detail. -- Namespace-Level/Pod-level: Plan is to target this in future due to missing support in the low level system components such as runtimes and kernel. More on this in the `Future Work` section. - -Node-level user-namespace support means that, if feature is enabled, all pods on a node will share a common user-namespace, common UID(and GID) range (which is a subset of node’s total UIDs(and GIDs)). This common user-namespace is runtime’s default user-namespace range which is remapped to containers’ UIDs(and GID), starting with the first UID as container’s ‘root’. -In general Linux convention, UID(or GID) mapping consists of three parts: -1. Host (U/G)ID: First (U/G)ID of the range on the host that is being remapped to the (U/G)IDs in the container user-namespace -2. Container (U/G)ID: First (U/G)ID of the range in the container namespace and this is mapped to the first (U/G)ID on the host(mentioned in previous point). -3. Count/Size: Total number of consecutive mapping between host and container user-namespaces, starting from the first one (including) mentioned above. - -As an example, `host_id 1000, container_id 0, size 10` -In this case, 1000 to 1009 on host will be mapped to 0 to 9 inside the container. - -User-namespace support should be enabled only when container runtime on the node supports user-namespace remapping and is enabled in its configuration. To enable user-namespaces, feature-gate flag will need to be passed to Kubelet like this `--feature-gates=”NodeUserNamespace=true”` - -A new CRI API, `GetRuntimeConfigInfo` will be added. Kubelet will use this API: -- To verify if user-namespace remapping is enabled at runtime. If found disabled, kubelet will fail to start -- To determine the default user-namespace range at the runtime, starting UID of which is mapped to the UID '0' of the container. - -### Volume Permissions -Kubelet will change the file permissions, i.e chown, at `/var/lib/kubelet/pods` prior to any container start to get file permissions updated according to remapped UID and GID. -This proposal will work only for local volumes and not with remote volumes such as NFS. - -### How to disable `NodeUserNamespace` for a specific pod -This can be done in two ways: -- **Alpha:** Implicitly using host namespace for the pod containers -This support is already present (currently it seems broken, will be fixed) in Kubernetes as an experimental functionality, which can be enabled using `feature-gates=”ExperimentalHostUserNamespaceDefaulting=true”`. 
-If Pod-Security-Policy is configured to allow the following to be requested by a pod, host user-namespace will be enabled for the container: - - host namespaces (pid, ipc, net) - - non-namespaced capabilities (mknod, sys_time, sys_module) - - the pod contains a privileged container or using host path volumes. - - https://github.com/kubernetes/kubernetes/commit/d0d78f478ce0fb9d5e121db3b7c6993b482af82c#diff-a53fa76e941e0bdaee26dcbc435ad2ffR437 introduced via https://github.com/kubernetes/kubernetes/commit/d0d78f478ce0fb9d5e121db3b7c6993b482af82c. - -- **Beta:** Explicit API to request host user-namespace in pod spec - This is being targeted under Beta graduation plans. - -### CRI API Changes -Proposed CRI API changes: - -```golang -// Runtime service defines the public APIs for remote container runtimes -service RuntimeService { - // Version returns the runtime name, runtime version, and runtime API version. - rpc Version(VersionRequest) returns (VersionResponse) {} - ……. - ……. - // GetRuntimeConfigInfo returns the configuration details of the runtime. - rpc GetRuntimeConfigInfo(GetRuntimeConfigInfoRequest) returns (GetRuntimeConfigInfoResponse) {} -} -// LinuxIDMapping represents a single user namespace mapping in Linux. -message LinuxIDMapping { - // container_id is the starting id for the mapping inside the container. - uint32 container_id = 1; - // host_id is the starting id for the mapping on the host. - uint32 host_id = 2; - // size is the length of the mapping. - uint32 size = 3; -} - -message LinuxUserNamespaceConfig { - // is_enabled, if true indicates that user-namespaces are supported and enabled in the container runtime - bool is_enabled = 1; - // uid_mappings is an array of user id mappings. - repeated LinuxIDMapping uid_mappings = 1; - // gid_mappings is an array of group id mappings. - repeated LinuxIDMapping gid_mappings = 2; -} -message GetRuntimeConfig { - LinuxUserNamespaceConfig user_namespace_config = 1; -} - -message GetRuntimeConfigInfoRequest {} - -message GetRuntimeConfigInfoResponse { - GetRuntimeConfig runtime_config = 1 -} - -... - -// NamespaceOption provides options for Linux namespaces. -message NamespaceOption { - // Network namespace for this container/sandbox. - // Note: There is currently no way to set CONTAINER scoped network in the Kubernetes API. - // Namespaces currently set by the kubelet: POD, NODE - NamespaceMode network = 1; - // PID namespace for this container/sandbox. - // Note: The CRI default is POD, but the v1.PodSpec default is CONTAINER. - // The kubelet's runtime manager will set this to CONTAINER explicitly for v1 pods. - // Namespaces currently set by the kubelet: POD, CONTAINER, NODE - NamespaceMode pid = 2; - // IPC namespace for this container/sandbox. - // Note: There is currently no way to set CONTAINER scoped IPC in the Kubernetes API. - // Namespaces currently set by the kubelet: POD, NODE - NamespaceMode ipc = 3; - // User namespace for this container/sandbox. - // Note: There is currently no way to set CONTAINER scoped user namespace in the Kubernetes API. - // The container runtime should ignore this if user namespace is NOT enabled. - // POD is the default value. 
Kubelet will set it to NODE when trying to use host user-namespace - // Namespaces currently set by the kubelet: POD, NODE - NamespaceMode user = 4; -} - -``` - -### Runtime Support -- Docker: Here is the [user-namespace documentation](https://docs.docker.com/engine/security/userns-remap/) and this is the [implementation PR](https://github.com/moby/moby/pull/12648) - - Concerns: -Docker API does not provide user-namespace mapping. Therefore to handle `GetRuntimeConfigInfo` API, changes will be done in `dockershim` to read system files, `/etc/subuid` and `/etc/subgid`, for figuring out default user-namespace mapping. `/info` api will be used to figure out if user-namespace is enabled and `Docker Root Dir` will be used to figure out host uid mapped to the uid `0` in container. eg. `Docker Root Dir: /var/lib/docker/2131616.2131616` this shows host uid `2131616` will be mapped to uid `0` -- CRI-O: https://github.com/kubernetes-incubator/cri-o/pull/1519 -- Containerd: https://github.com/containerd/containerd/blob/129167132c5e0dbd1b031badae201a432d1bd681/container_opts_unix.go#L149 - -### Implementation Roadmap -#### Phase 1: Support in Kubelet, Alpha, [Target: Kubernetes v1.11] -- Add feature gate `NodeUserNamespace`, disabled by default -- Add new CRI API, `GetRuntimeConfigInfo()` -- Add logic in Kubelet to handle pod creation which includes parsing GetRuntimeConfigInfo response and changing file-permissions in /var/lib/kubelet with learned userns mapping. -- Add changes in dockershim to implement GetRuntimeConfigInfo() for docker runtime -- Add changes in CRI-O to implement userns support and GetRuntimeConfigInfo() support -- Unit test cases -- e2e tests - -#### Phase 2: Beta Support [Target: Kubernetes v1.12] -- PSP integration -- To grow ExperimentalHostUserNamespaceDefaulting from experimental feature gate to a Kubelet flag -- API changes to allow pod able to request HostUserNamespace in pod spec -- e2e tests - -### References -- Default host user namespace via experimental flag - - https://github.com/kubernetes/kubernetes/pull/31169 -- Enable userns support for containers launched by kubelet - - https://github.com/kubernetes/features/issues/127 -- Track Linux User Namespaces in the Pod Security Policy - - https://github.com/kubernetes/kubernetes/issues/59152 -- Add support for experimental-userns-remap-root-uid and experimental-userns-remap-root-gid options to match the remapping used by the container runtime. - - https://github.com/kubernetes/kubernetes/pull/55707 -- rkt User Namespaces Background - - https://coreos.com/rkt/docs/latest/devel/user-namespaces.html - -## Future Work -### Namespace-Level/Pod-Level user-namespace support -There is no runtime today which supports creating containers with a specified user namespace configuration. For example here is the discussion related to this support in Docker https://github.com/moby/moby/issues/28593 -Once user-namespace feature in the runtimes has evolved to support container’s request for a specific user-namespace mapping(UID and GID range), we can extend current Node-Level user-namespace support in Kubernetes to support Namespace-level isolation(or if desired even pod-level isolation) by dividing and allocating learned mapping from runtime among Kubernetes namespaces (or pods, if desired). From end-user UI perspective, we don't expect any change in the UI related to user namespaces support. -### Remote Volumes -Remote Volumes support should be investigated and should be targeted in future once support is there at lower infra layers. 
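The dockershim heuristics described in the Runtime Support section above can be checked by hand roughly as follows; the exact remapped uid (`2131616` in the proposal's example) and the `dockremap` user name depend on the node's configuration.

```bash
# Rough manual equivalents of the checks described above (illustrative only).
docker info | grep -i 'userns'          # userns is listed under Security Options when remapping is enabled
docker info | grep 'Docker Root Dir'    # e.g. /var/lib/docker/2131616.2131616 -> host uid 2131616 maps to container uid 0
cat /etc/subuid /etc/subgid             # subordinate id ranges, e.g. dockremap:2131616:65536
```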
- - -## Risks and Mitigations -The main risk with this change stems from the fact that processes in Pods will run with different “real” uids than they used to, while expecting the original uids to make operations on the Nodes or consistently access shared persistent storage. -- This can be mitigated by turning the feature on gradually, per-Pod or per Kubernetes namespace. -- For the Kubernetes' cluster Pods (that provide the Kubernetes functionality), testing of their behaviour and ability to run in user namespaced setups is crucial. - -## Graduation Criteria -- PSP integration -- API changes to allow pod able to request host user namespace using for example, `HostUserNamespace: True`, in pod spec -- e2e tests - -## Alternatives -User Namespace mappings can be passed explicitly through kubelet flags similar to https://github.com/kubernetes/kubernetes/pull/55707 but we do not prefer this option because this is very much prone to mis-configuration. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
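As a manual illustration of the volume-permission handling described in the proposal above (not the kubelet's literal implementation), with the example mapping `host_id 1000, container_id 0, size 10` the container's root corresponds to host uid/gid 1000, so pod directories would be re-owned roughly as follows:

```bash
# Illustrative sketch; the mapping values come from the proposal's example.
REMAPPED_ROOT_UID=1000   # host uid that maps to uid 0 inside the container
REMAPPED_ROOT_GID=1000   # host gid that maps to gid 0 inside the container
chown -R "${REMAPPED_ROOT_UID}:${REMAPPED_ROOT_GID}" /var/lib/kubelet/pods
```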
\ No newline at end of file diff --git a/contributors/design-proposals/node/optional-configmap.md b/contributors/design-proposals/node/optional-configmap.md index 1b11bb12..f0fbec72 100644 --- a/contributors/design-proposals/node/optional-configmap.md +++ b/contributors/design-proposals/node/optional-configmap.md @@ -1,174 +1,6 @@ -# Optional ConfigMaps and Secrets +Design proposals have been archived. -## Goal +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Allow the ConfigMaps or Secrets that are used to populate the environment variables of a -container and files within a Volume to be optional. -## Use Cases - -When deploying an application to multiple environments like development, test, -and production, there may be certain environment variables that must reflect -the values that are relevant to said environment. One way to do so would be to -have a well named ConfigMap which contains all the environment variables -needed. With the introduction of optional ConfigMaps, one could instead define a required -ConfigMap which contains all the environment variables for any environment -with a set of initialized or default values. An additional optional ConfigMap -can also be specified which allows the deployer to provide any overrides for -the current environment. - -An application developer can populate a volume with files defined from a -ConfigMap. The developer may have some required files to be created and have -optional additional files at a different target. The developer can specify on -the Pod that there is an optional ConfigMap that will provide these additional -files if the ConfigMap exists. - -## Design Points - -A container can specify an entire ConfigMap to be populated as environment -variables via `EnvFrom`. When required, the container fails to start if the -ConfigMap does not exist. If the ConfigMap is optional, the container will -skip the non-existent ConfigMap and proceed as normal. - -A container may also specify a single environment variable to retrieve its -value from a ConfigMap via `Env`. If the key does not exist in the ConfigMap -during container start, the container will fail to start. If however, the -ConfigMap is marked optional, during container start, a non-existent ConfigMap -or a missing key in the ConfigMap will not prevent the container from -starting. Any previous value for the given key will be used. - -Any changes to the ConfigMap will not affect environment variables of running -containers. If the Container is restarted, the set of environment variables -will be re-evaluated. - -The same processing rules applies to Secrets. - -A pod can specify a set of Volumes to mount. A ConfigMap can represent the -files to populate the volume. The ConfigMaps can be marked as optional. The -default is to require the ConfigMap existence. If the ConfigMap is required -and does not exist, the volume creation will fail. If the ConfigMap is marked -as optional, the volume will be created regardless, and the files will be -populated only if the ConfigMap exists and has content. If the ConfigMap is -changed, the volume will eventually reflect the new set of data available from -the ConfigMap. - -## Proposed Design - -To support an optional ConfigMap either as a ConfigMapKeySelector, ConfigMapEnvSource or a -ConfigMapVolumeSource, a boolean will be added to specify whether it is -optional. The default will be required. 
- -To support an optional Secret either as a SecretKeySelector, or a -SecretVolumeSource, a boolean will be added to specify whether it is optional. -The default will be required. - -### Kubectl updates - -The `describe` command will display the additional optional field of the -ConfigMap and Secret for both the environment variables and volume sources. - -### API Resource - -A new `Optional` field of type boolean will be added. - -```go -type ConfigMapKeySelector struct { - // Specify whether the ConfigMap must be defined - // +optional - Optional *bool `json:"optional,omitempty" protobuf:"varint,3,opt,name=optional"` -} - -type ConfigMapEnvSource struct { - // Specify whether the ConfigMap must be defined - // +optional - Optional *bool `json:"optional,omitempty" protobuf:"varint,2,opt,name=optional"` -} - -type ConfigMapVolumeSource struct { - // Specify whether the ConfigMap must be defined - // +optional - Optional *bool `json:"optional,omitempty" protobuf:"varint,4,opt,name=optional"` -} - -type SecretKeySelector struct { - // Specify whether the ConfigMap must be defined - // +optional - Optional *bool `json:"optional,omitempty" protobuf:"varint,3,opt,name=optional"` -} - -type SecretVolumeSource struct { - // Specify whether the Secret must be defined - // +optional - Optional *bool `json:"optional,omitempty" protobuf:"varint,4,opt,name=optional"` -} -``` - -### Examples - -Optional `ConfigMap` as Environment Variables - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: config-env-example -spec: - containers: - - name: etcd - image: openshift/etcd-20-centos7 - ports: - - containerPort: 2379 - protocol: TCP - - containerPort: 2380 - protocol: TCP - env: - - name: foo - valueFrom: - configMapKeyRef: - name: etcd-env-config - key: port - optional: true -``` - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: config-env-example -spec: - containers: - - name: etcd - image: openshift/etcd-20-centos7 - ports: - - containerPort: 2379 - protocol: TCP - - containerPort: 2380 - protocol: TCP - envFrom: - - configMap: - name: etcd-env-config - optional: true -``` - -Optional `ConfigMap` as a VolumeSource - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: config-env-example -spec: - volumes: - - name: pod-configmap-volume - configMap: - name: configmap-test-volume - optional: true - containers: - - name: etcd - image: openshift/etcd-20-centos7 - ports: - - containerPort: 2379 - protocol: TCP - - containerPort: 2380 - protocol: TCP -``` +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
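A short usage sketch of the first example above: with `optional: true`, the pod starts even while `etcd-env-config` is missing, and the value is only picked up after the ConfigMap exists and the container is restarted. The manifest file name is an assumption.

```bash
# Illustrative only; the manifest file name is an assumption.
kubectl apply -f config-env-example.yaml
kubectl get pod config-env-example        # runs even though etcd-env-config does not exist yet

# Create the optional ConfigMap afterwards; environment variables are only
# re-evaluated when the container is restarted.
kubectl create configmap etcd-env-config --from-literal=port=2379
kubectl delete pod config-env-example
kubectl apply -f config-env-example.yaml
```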
\ No newline at end of file diff --git a/contributors/design-proposals/node/pleg.png b/contributors/design-proposals/node/pleg.png Binary files differdeleted file mode 100644 index f15c5d83..00000000 --- a/contributors/design-proposals/node/pleg.png +++ /dev/null diff --git a/contributors/design-proposals/node/plugin-watcher.md b/contributors/design-proposals/node/plugin-watcher.md index eb8f4be1..f0fbec72 100644 --- a/contributors/design-proposals/node/plugin-watcher.md +++ b/contributors/design-proposals/node/plugin-watcher.md @@ -1,169 +1,6 @@ +Design proposals have been archived. -# Plugin Watcher Utility +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Background -Portability and extendability are the major goals of Kubernetes from its beginning and we have seen more plugin mechanisms developed on Kubernetes to further improve them. Moving in this direction, Kubelet is starting to support pluggable [device exporting](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-management/device-plugin.md) and [CSI volume plugins](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/container-storage-interface.md). We are seeing the need for a common Kubelet plugin discovery model that can be used by different types of node-level plugins, such as device plugins, CSI, and CNI, to establish communication channels with Kubelet. This document lists two possible approaches of implementing this common Kubelet plugin discovery model. We are hoping to discuss these proposals with the OSS community to gather consensus on which model we would like to take forward. - - -## General Requirements - -The primary goal of the Kubelet plugin discovery model is to provide a common mechanism for users to dynamically deploy vendor specific plugins that make different types of devices, or storage system, or network components available on a Kubernetes node. - -Here are the general requirements to consider when designing this system: - -* Security/authentication requirements -* Underlying communication channel to use: stay with gRPC v.s. flexibility to support multiple communication protocol -* How to detect API version mismatching -* Ping-pong plugin registration -* Failure recovery or upgrade story upon kubelet restart and/or plugin restart -* How to prevent single misbehaving plugin from flooding kubelet -* How to de-registration -* Need to support some existing protocol that is not bound to K8s, like CSI - -## Proposed Models - -#### Model 1: plugin registers with Kubelet through grpc (currently used in device plugin) - -* Currently requires plugin to run with privilege and communicate to kubelet through unix socket under a canonical directory, but has flexibility to support different communication channels or authentication methods. -* API version mismatch is detected during registration. -* Currently always take newest plugin upon re-registration. Can implement some policy to reject plugin re-registration if a plugin re-registers too frequently. Can terminate the communication channel if a plugin sends too many updates to Kubelet. -* In the current implementation, kubelet removes all of the device plugin unix sockets. Device plugins are expected to watch for such event and re-register with the new kubelet instance. The solution is a bit ad-hoc. 
There is also a temporary period that we can't schedule new pods requiring device plugin resource on the node after kubelet restart, till the corresponding device plugin re-registers. This temporary period can be avoided if we also checkpoints device plugin socket information on Kubelet side. Pods previously scheduled can continue with device plugin allocation information already recorded in a checkpoint file. Checkpointing plugin socket information is easier to be added in DevicePlugins that already maintains a checkpoint file for other purposes. This however could be a new requirement for other plugin systems like CSI. - - - -#### Model 2: Kubelet watches new plugins under a canonical path through inotify (Preferred one and current implementation) - -* Plugin can export a registration rpc for API version checking or further authentication. Kubelet doesn't need to export a rpc service. -* We will take gRPC as the single supported communication channel. -* Can take the newest plugin from the latest inotify creation. May require socket name to follow certain naming convention (e.g., resourceName.timestamp) to detect ping-pong plugin registration, and ignore socket creations from a plugin if it creates too many sockets during a short period of time. We can even require that the resource name embedded in the socket path to be part of the identification process, e.g., a plugin at `/var/lib/kubelet/plugins/resourceName.timestamp` must identify itself as resourceName or it will be rejected. -* Easy to avoid temporary plugin unavailability after kubelet restart. Kubelet just needs to scan through the special directory. It can remove plugin sockets that fail to respond, and always take the last live socket when multiple registrations happen with the same plugin name. This simplifies device plugin implementation because they don't need to detect Kubelet restarts and re-register. -* A plugin should remove its socket upon termination to avoid leaving dead sockets in the canonical path, although this is not strictly required. -* CSI needs flexibility to not only bound to Kubernetes. With probe model, may need to add an interface for K8s to get plugin information. -* We can introduce special plugin pod for which we automatically setup its environment to communicate with kubelet. Even if Kubelet runs in a container, it is easy to config the communication path between plugin and Kubelet. - - -**More Implementation Details on Model 2:** - -* Kubelet will have a new module, PluginWatcher, which will probe a canonical path recursively -* On detecting a socket creation, Watcher will try to get plugin identity details using a gRPC client on the discovered socket and the RPCs of a newly introduced `Identity` service. -* Plugins must implement `Identity` service RPCs for initial communication with Watcher. - -**Identity Service Primitives:** -```golang - -// PluginInfo is the message sent from a plugin to the Kubelet pluginwatcher for plugin registration -message PluginInfo { - // Type of the Plugin. CSIPlugin or DevicePlugin - string type = 1; - // Plugin name that uniquely identifies the plugin for the given plugin type. - // For DevicePlugin, this is the resource name that the plugin manages and - // should follow the extended resource name convention. - // For CSI, this is the CSI driver registrar name. - string name = 2; - // Optional endpoint location. If found set by Kubelet component, - // Kubelet component will use this endpoint for specific requests. 
- // This allows the plugin to register using one endpoint and possibly use - // a different socket for control operations. CSI uses this model to delegate - // its registration external from the plugin. - string endpoint = 3; - // Plugin service API versions the plugin supports. - // For DevicePlugin, this maps to the deviceplugin API versions the - // plugin supports at the given socket. - // The Kubelet component communicating with the plugin should be able - // to choose any preferred version from this list, or returns an error - // if none of the listed versions is supported. - repeated string supported_versions = 4; -} - -// RegistrationStatus is the message sent from Kubelet pluginwatcher to the plugin for notification on registration status -message RegistrationStatus { - // True if plugin gets registered successfully at Kubelet - bool plugin_registered = 1; - // Error message in case plugin fails to register, empty string otherwise - string error = 2; -} - -// RegistrationStatusResponse is sent by plugin to kubelet in response to RegistrationStatus RPC -message RegistrationStatusResponse { -} - -// InfoRequest is the empty request message from Kubelet -message InfoRequest { -} - -// Registration is the service advertised by the Plugins. -service Registration { - rpc GetInfo(InfoRequest) returns (PluginInfo) {} - rpc NotifyRegistrationStatus(RegistrationStatus) returns (RegistrationStatusResponse) {} -} -``` - -**PluginWatcher primitives:** -```golang -// Watcher is the plugin watcher -type Watcher struct { - path string - handlers map[string]RegisterCallbackFn - stopCh chan interface{} - fs utilfs.Filesystem - fsWatcher *fsnotify.Watcher - wg sync.WaitGroup - mutex sync.Mutex -} - -// RegisterCbkFn is the type of the callback function that handlers will provide -type RegisterCallbackFn func(pluginName string, endpoint string, versions []string, socketPath string) (chan bool, error) - -// AddHandler registers a callback to be invoked for a particular type of plugin -func (w *Watcher) AddHandler(pluginType string, handlerCbkFn RegisterCbkFn) { - w.handlers[handlerType] = handlerCbkFn -} - -// Start watches for the creation of plugin sockets at the path -func (w *Watcher) Start() error { - -// Probes on the canonical path for socket creations in a forever loop - -// For any new socket creation, invokes `Info()` at plugins Identity service -resp, err := client.Info(context.Background(), &watcherapi.Empty{}) - -// Keeps the connection open and passes plugin's identity details, along with socket path to the handler using callback function registered by handler. Handler callback is selected based on the Type of the plugin, for example device plugin or CSI plugin -// Handler Callback is supposed to authenticate the plugin details and if all correct, register the Plugin at the kubelet subsystem. - -if handlerCbkFn, ok := w.handlers[resp.Type]; ok { - err = handlerCbkFn(resp, event.Name) -... -} - -// After Callback returns, PluginWatcher notifies back status to the plugin - -client.NotifyRegistrationStatus(ctx, ®isterapi.RegistrationStatus{ -... 
-}) - -``` - - -**How any Kubelet sub-module can use PluginWatcher:** - - - -* There must be a callback function defined in the sub-module of the signature: - -```golang -type RegisterCallbackFn func(pluginName string, endpoint string, versions []string, socketPath string) (chan bool, error) -``` -* Just after sub-module start, this callback should be registered with the PluginWatcher, eg: -```golang -kl.pluginWatcher.AddHandler(pluginwatcherapi.DevicePlugin, kl.containerManager.GetPluginRegistrationHandlerCbkFunc()) -``` - -**Open issues (Points from the meeting notes for the record):** -* Discuss with security team if this is a viable approach (and if cert auth can be added on top for added security). -* Plugin author should be able to write yaml once, so the plugin dir should not be hard coded. 3 options: - * Downward API param for plugin directory that will be used as hostpath src - * A new volume plugin that can be used by plugin to drop a socket - * Have plugins call kubelet -- link local interface - * Bigger change -- kubelet doesn't do this - * Path of most resistance +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
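To complement the kubelet-side PluginWatcher snippets in the archived proposal above, here is a minimal plugin-side sketch of the Registration handshake. The import path, socket path, and resource name are assumptions for illustration, not part of the proposal; the generated gRPC names follow the `Registration` service defined above.

```golang
// Plugin-side sketch of the Registration ("Identity") service described above.
package main

import (
	"context"
	"log"
	"net"

	"google.golang.org/grpc"

	registerapi "k8s.io/kubelet/pkg/apis/pluginregistration/v1" // assumed import path
)

type examplePlugin struct{}

// GetInfo tells the PluginWatcher what kind of plugin this is and which
// plugin API versions it speaks, so kubelet can pick the right handler.
func (p *examplePlugin) GetInfo(ctx context.Context, req *registerapi.InfoRequest) (*registerapi.PluginInfo, error) {
	return &registerapi.PluginInfo{
		Type:              "DevicePlugin",         // as enumerated in the PluginInfo message above
		Name:              "example.com/fast-nic", // illustrative resource name
		SupportedVersions: []string{"v1beta1"},
	}, nil
}

// NotifyRegistrationStatus receives the registration outcome from kubelet.
func (p *examplePlugin) NotifyRegistrationStatus(ctx context.Context, status *registerapi.RegistrationStatus) (*registerapi.RegistrationStatusResponse, error) {
	if !status.PluginRegistered {
		log.Printf("registration failed: %s", status.Error)
	}
	return &registerapi.RegistrationStatusResponse{}, nil
}

func main() {
	// Socket under the canonical plugin directory; the watcher discovers it via inotify.
	lis, err := net.Listen("unix", "/var/lib/kubelet/plugins/example.com-fast-nic.sock")
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer()
	registerapi.RegisterRegistrationServer(srv, &examplePlugin{})
	log.Fatal(srv.Serve(lis))
}
```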
\ No newline at end of file diff --git a/contributors/design-proposals/node/pod-cache.png b/contributors/design-proposals/node/pod-cache.png Binary files differdeleted file mode 100644 index dee86c40..00000000 --- a/contributors/design-proposals/node/pod-cache.png +++ /dev/null diff --git a/contributors/design-proposals/node/pod-lifecycle-event-generator.md b/contributors/design-proposals/node/pod-lifecycle-event-generator.md index 42ca80c4..f0fbec72 100644 --- a/contributors/design-proposals/node/pod-lifecycle-event-generator.md +++ b/contributors/design-proposals/node/pod-lifecycle-event-generator.md @@ -1,196 +1,6 @@ -# Kubelet: Pod Lifecycle Event Generator (PLEG) +Design proposals have been archived. -In Kubernetes, Kubelet is a per-node daemon that manages the pods on the node, -driving the pod states to match their pod specifications (specs). To achieve -this, Kubelet needs to react to changes in both (1) pod specs and (2) the -container states. For the former, Kubelet watches the pod specs changes from -multiple sources; for the latter, Kubelet polls the container runtime -periodically (e.g., 10s) for the latest states for all containers. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Polling incurs non-negligible overhead as the number of pods/containers increases, -and is exacerbated by Kubelet's parallelism -- one worker (goroutine) per pod, which -queries the container runtime individually. Periodic, concurrent, large number -of requests causes high CPU usage spikes (even when there is no spec/state -change), poor performance, and reliability problems due to overwhelmed container -runtime. Ultimately, it limits Kubelet's scalability. - -(Related issues reported by users: [#10451](https://issues.k8s.io/10451), -[#12099](https://issues.k8s.io/12099), [#12082](https://issues.k8s.io/12082)) - -## Goals and Requirements - -The goal of this proposal is to improve Kubelet's scalability and performance -by lowering the pod management overhead. - - Reduce unnecessary work during inactivity (no spec/state changes) - - Lower the concurrent requests to the container runtime. - -The design should be generic so that it can support different container runtimes -(e.g., Docker and rkt). - -## Overview - -This proposal aims to replace the periodic polling with a pod lifecycle event -watcher. - - - -## Pod Lifecycle Event - -A pod lifecycle event interprets the underlying container state change at the -pod-level abstraction, making it container-runtime-agnostic. The abstraction -shields Kubelet from the runtime specifics. - -```go -type PodLifeCycleEventType string - -const ( - ContainerStarted PodLifeCycleEventType = "ContainerStarted" - ContainerStopped PodLifeCycleEventType = "ContainerStopped" - NetworkSetupCompleted PodLifeCycleEventType = "NetworkSetupCompleted" - NetworkFailed PodLifeCycleEventType = "NetworkFailed" -) - -// PodLifecycleEvent is an event reflects the change of the pod state. -type PodLifecycleEvent struct { - // The pod ID. - ID types.UID - // The type of the event. - Type PodLifeCycleEventType - // The accompanied data which varies based on the event type. - Data interface{} -} -``` - -Using Docker as an example, starting of a POD infra container would be -translated to a NetworkSetupCompleted`pod lifecycle event. - - -## Detect Changes in Container States Via Relisting - -In order to generate pod lifecycle events, PLEG needs to detect changes in -container states. 
We can achieve this by periodically relisting all containers -(e.g., docker ps). Although this is similar to Kubelet's polling today, it will -only be performed by a single thread (PLEG). This means that we still -benefit from not having all pod workers hitting the container runtime -concurrently. Moreover, only the relevant pod worker would be woken up -to perform a sync. - -The upside of relying on relisting is that it is container runtime-agnostic, -and requires no external dependency. - -### Relist period - -The shorter the relist period is, the sooner that Kubelet can detect the -change. Shorter relist period also implies higher cpu usage. Moreover, the -relist latency depends on the underlying container runtime, and usually -increases as the number of containers/pods grows. We should set a default -relist period based on measurements. Regardless of what period we set, it will -likely be significantly shorter than the current pod sync period (10s), i.e., -Kubelet will detect container changes sooner. - - -## Impact on the Pod Worker Control Flow - -Kubelet is responsible for dispatching an event to the appropriate pod -worker based on the pod ID. Only one pod worker would be woken up for -each event. - -Today, the pod syncing routine in Kubelet is idempotent as it always -examines the pod state and the spec, and tries to drive to state to -match the spec by performing a series of operations. It should be -noted that this proposal does not intend to change this property -- -the sync pod routine would still perform all necessary checks, -regardless of the event type. This trades some efficiency for -reliability and eliminate the need to build a state machine that is -compatible with different runtimes. - -## Leverage Upstream Container Events - -Instead of relying on relisting, PLEG can leverage other components which -provide container events, and translate these events into pod lifecycle -events. This will further improve Kubelet's responsiveness and reduce the -resource usage caused by frequent relisting. - -The upstream container events can come from: - -(1). *Event stream provided by each container runtime* - -Docker's API exposes an [event -stream](https://docs.docker.com/engine/api/v1.40/#operation/SystemEvents). -Nonetheless, rkt does not support this yet, but they will eventually support it -(see [coreos/rkt#1193](https://github.com/coreos/rkt/issues/1193)). - -(2). *cgroups event stream by cAdvisor* - -cAdvisor is integrated in Kubelet to provide container stats. It watches cgroups -containers using inotify and exposes an event stream. Even though it does not -support rkt yet, it should be straightforward to add such a support. - -Option (1) may provide richer sets of events, but option (2) has the advantage -to be more universal across runtimes, as long as the container runtime uses -cgroups. Regardless of what one chooses to implement now, the container event -stream should be easily swappable with a clearly defined interface. - -Note that we cannot solely rely on the upstream container events due to the -possibility of missing events. PLEG should relist infrequently to ensure no -events are missed. - -## Generate Expected Events - -*This is optional for PLEGs which performs only relisting, but required for -PLEGs that watch upstream events.* - -A pod worker's actions could lead to pod lifecycle events (e.g., -create/kill a container), which the worker would not observe until -later. The pod worker should ignore such events to avoid unnecessary -work. 
- -For example, assume a pod has two containers, A and B. The worker - - - Creates container A - - Receives an event `(ContainerStopped, B)` - - Receives an event `(ContainerStarted, A)` - - -The worker should ignore the `(ContainerStarted, A)` event since it is -expected. Arguably, the worker could process `(ContainerStopped, B)` -as soon as it receives the event, before observing the creation of -A. However, it is desirable to wait until the expected event -`(ContainerStarted, A)` is observed to keep a consistent per-pod view -at the worker. Therefore, the control flow of a single pod worker -should adhere to the following rules: - -1. Pod worker should process the events sequentially. -2. Pod worker should not start syncing until it observes the outcome of its own - actions in the last sync to maintain a consistent view. - -In other words, a pod worker should record the expected events, and -only wake up to perform the next sync until all expectations are met. - - - Creates container A, records an expected event `(ContainerStarted, A)` - - Receives `(ContainerStopped, B)`; stores the event and goes back to sleep. - - Receives `(ContainerStarted, A)`; clears the expectation. Proceeds to handle - `(ContainerStopped, B)`. - -We should set an expiration time for each expected events to prevent the worker -from being stalled indefinitely by missing events. - -## TODOs for v1.2 - -For v1.2, we will add a generic PLEG which relists periodically, and leave -adopting container events for future work. We will also *not* implement the -optimization that generate and filters out expected events to minimize -redundant syncs. - -- Add a generic PLEG using relisting. Modify the container runtime interface - to provide all necessary information to detect container state changes - in `GetPods()` (#13571). - -- Benchmark docker to adjust relisting frequency. - -- Fix/adapt features that rely on frequent, periodic pod syncing. - * Liveness/Readiness probing: Create a separate probing manager using - explicitly container probing period [#10878](https://issues.k8s.io/10878). - * Instruct pod workers to set up a wake-up call if syncing failed, so that - it can retry. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
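As a rough illustration of the relisting approach in the archived proposal above, the sketch below diffs two consecutive relists and emits the pod lifecycle events defined earlier in that proposal. The `containerRecord` helper type and the function name are hypothetical; the snippet assumes the `PodLifecycleEvent` type, event constants, and `types` import from the proposal's own code.

```go
// Hypothetical sketch: turn two consecutive relists into pod lifecycle events.
// Keys are container IDs as reported by the runtime.
type containerRecord struct {
	podID   types.UID // pod that owns the container
	running bool      // true if the runtime reports the container as running
}

func eventsFromRelist(oldState, newState map[string]containerRecord) []*PodLifecycleEvent {
	var events []*PodLifecycleEvent
	// Containers that started (new, or transitioned to running) since the last relist.
	for id, cur := range newState {
		if prev, ok := oldState[id]; cur.running && (!ok || !prev.running) {
			events = append(events, &PodLifecycleEvent{ID: cur.podID, Type: ContainerStarted, Data: id})
		}
	}
	// Containers that stopped or disappeared since the last relist.
	for id, prev := range oldState {
		if cur, ok := newState[id]; prev.running && (!ok || !cur.running) {
			events = append(events, &PodLifecycleEvent{ID: prev.podID, Type: ContainerStopped, Data: id})
		}
	}
	return events
}
```

Only the pod workers whose pods appear in the returned events would then be woken up to sync, which is the main saving over per-worker polling described above.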
\ No newline at end of file diff --git a/contributors/design-proposals/node/pod-pid-namespace.md b/contributors/design-proposals/node/pod-pid-namespace.md index aeac92fe..f0fbec72 100644 --- a/contributors/design-proposals/node/pod-pid-namespace.md +++ b/contributors/design-proposals/node/pod-pid-namespace.md @@ -1,10 +1,6 @@ -# Shared PID Namespace +Design proposals have been archived. -* Status: Superseded -* Version: N/A -* Implementation Owner: [@verb](https://github.com/verb) +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -The Shared PID Namespace proposal has moved to the -[Shared PID Namespace KEP][shared-pid-kep]. -[shared-pid-kep]: https://git.k8s.io/enhancements/keps/sig-node/20190920-pod-pid-namespace.md +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/node/pod-resource-management.md b/contributors/design-proposals/node/pod-resource-management.md index 2c3348a6..f0fbec72 100644 --- a/contributors/design-proposals/node/pod-resource-management.md +++ b/contributors/design-proposals/node/pod-resource-management.md @@ -1,601 +1,6 @@ -# Kubelet pod level resource management +Design proposals have been archived. -**Authors**: +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -1. Buddha Prakash (@dubstack) -1. Vishnu Kannan (@vishh) -1. Derek Carr (@derekwaynecarr) - -**Last Updated**: 02/21/2017 - -**Status**: Implementation planned for Kubernetes 1.6 - -This document proposes a design for introducing pod level resource accounting -to Kubernetes. It outlines the implementation and associated rollout plan. - -## Introduction - -Kubernetes supports container level isolation by allowing users -to specify [compute resource requirements](/contributors/design-proposals/scheduling/resources.md) via requests and -limits on individual containers. The `kubelet` delegates creation of a -cgroup sandbox for each container to its associated container runtime. - -Each pod has an associated [Quality of Service (QoS)](resource-qos.md) -class based on the aggregate resource requirements made by individual -containers in the pod. The `kubelet` has the ability to -[evict pods](kubelet-eviction.md) when compute resources are scarce. It evicts -pods with the lowest QoS class in order to attempt to maintain stability of the -node. - -The `kubelet` has no associated cgroup sandbox for individual QoS classes or -individual pods. This inhibits the ability to perform proper resource -accounting on the node, and introduces a number of code complexities when -trying to build features around QoS. - -This design introduces a new cgroup hierarchy to enable the following: - -1. Enforce QoS classes on the node. -1. Simplify resource accounting at the pod level. -1. Allow containers in a pod to share slack resources within its pod cgroup. -For example, a Burstable pod has two containers, where one container makes a -CPU request and the other container does not. The latter container should -get CPU time not used by the former container. Today, it must compete for -scare resources at the node level across all BestEffort containers. -1. Ability to charge per container overhead to the pod instead of the node. -This overhead is container runtime specific. For example, `docker` has -an associated `containerd-shim` process that is created for each container -which should be charged to the pod. -1. Ability to charge any memory usage of memory-backed volumes to the pod when -an individual container exits instead of the node. - -## Enabling QoS and Pod level cgroups - -To enable the new cgroup hierarchy, the operator must enable the -`--cgroups-per-qos` flag. Once enabled, the `kubelet` will start managing -inner nodes of the described cgroup hierarchy. - -The `--cgroup-root` flag if not specified when the `--cgroups-per-qos` flag -is enabled will default to `/`. The `kubelet` will parent any cgroups -it creates below that specified value per the -[node allocatable](node-allocatable.md) design. - -## Configuring a cgroup driver - -The `kubelet` will support manipulation of the cgroup hierarchy on -the host using a cgroup driver. The driver is configured via the -`--cgroup-driver` flag. 
- -The supported values are the following: - -* `cgroupfs` is the default driver that performs direct manipulation of the -cgroup filesystem on the host in order to manage cgroup sandboxes. -* `systemd` is an alternative driver that manages cgroup sandboxes using -transient slices for resources that are supported by that init system. - -Depending on the configuration of the associated container runtime, -operators may have to choose a particular cgroup driver to ensure -proper system behavior. For example, if operators use the `systemd` -cgroup driver provided by the `docker` runtime, the `kubelet` must -be configured to use the `systemd` cgroup driver. - -Implementation of either driver will delegate to the libcontainer library -in opencontainers/runc. - -### Conversion of cgroupfs to systemd naming conventions - -Internally, the `kubelet` maintains both an abstract and a concrete name -for its associated cgroup sandboxes. The abstract name follows the traditional -`cgroupfs` style syntax. The concrete name is the name for how the cgroup -sandbox actually appears on the host filesystem after any conversions performed -based on the cgroup driver. - -If the `systemd` cgroup driver is used, the `kubelet` converts the `cgroupfs` -style syntax into transient slices, and as a result, it must follow `systemd` -conventions for path encoding. - -For example, the cgroup name `/burstable/pod123-456` is translated to a -transient slice with the name `burstable-pod123_456.slice`. Given how -systemd manages the cgroup filesystem, the concrete name for the cgroup -sandbox becomes `/burstable.slice/burstable-pod123_456.slice`. - -## Integration with container runtimes - -The `kubelet` when integrating with container runtimes always provides the -concrete cgroup filesystem name for the pod sandbox. - -## Conversion of CPU millicores to cgroup configuration - -Kubernetes measures CPU requests and limits in millicores. - -The following formula is used to convert CPU in millicores to cgroup values: - -* cpu.shares = (cpu in millicores * 1024) / 1000 -* cpu.cfs_period_us = 100000 (i.e. 100ms) -* cpu.cfs_quota_us = quota = (cpu in millicores * 100000) / 1000 - -## Pod level cgroups - -The `kubelet` will create a cgroup sandbox for each pod. - -The naming convention for the cgroup sandbox is `pod<pod.UID>`. It enables -the `kubelet` to associate a particular cgroup on the host filesystem -with a corresponding pod without managing any additional state. This is useful -when the `kubelet` restarts and needs to verify the cgroup filesystem. - -A pod can belong to one of the following 3 QoS classes in decreasing priority: - -1. Guaranteed -1. Burstable -1. BestEffort - -The resource configuration for the cgroup sandbox is dependent upon the -pod's associated QoS class. 
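Before the per-QoS configurations below, a worked example of the millicore conversion described above (the container values are chosen purely for illustration): a container requesting 250m CPU with a 500m limit maps to the following cgroup settings.

```
cpu.shares        = 250 * 1024 / 1000   = 256
cpu.cfs_period_us = 100000              (100ms)
cpu.cfs_quota_us  = 500 * 100000 / 1000 = 50000
```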
- -### Guaranteed QoS - -A pod in this QoS class has its cgroup sandbox configured as follows: - -``` -pod<UID>/cpu.shares = sum(pod.spec.containers.resources.requests[cpu]) -pod<UID>/cpu.cfs_quota_us = sum(pod.spec.containers.resources.limits[cpu]) -pod<UID>/memory.limit_in_bytes = sum(pod.spec.containers.resources.limits[memory]) -``` - -### Burstable QoS - -A pod in this QoS class has its cgroup sandbox configured as follows: - -``` -pod<UID>/cpu.shares = sum(pod.spec.containers.resources.requests[cpu]) -``` - -If all containers in the pod specify a cpu limit: - -``` -pod<UID>/cpu.cfs_quota_us = sum(pod.spec.containers.resources.limits[cpu]) -``` - -Finally, if all containers in the pod specify a memory limit: - -``` -pod<UID>/memory.limit_in_bytes = sum(pod.spec.containers.resources.limits[memory]) -``` - -### BestEffort QoS - -A pod in this QoS class has its cgroup sandbox configured as follows: - -``` -pod<UID>/cpu.shares = 2 -``` - -## QoS level cgroups - -The `kubelet` defines a `--cgroup-root` flag that is used to specify the `ROOT` -node in the cgroup hierarchy below which the `kubelet` should manage individual -cgroup sandboxes. It is strongly recommended that users keep the default -value for `--cgroup-root` as `/` in order to avoid deep cgroup hierarchies. The -`kubelet` creates a cgroup sandbox under the specified path `ROOT/kubepods` per -[node allocatable](node-allocatable.md) to parent pods. For simplicity, we will -refer to `ROOT/kubepods` as `ROOT` in this document. - -The `ROOT` cgroup sandbox is used to parent all pod sandboxes that are in -the Guaranteed QoS class. By definition, pods in this class have cpu and -memory limits specified that are equivalent to their requests so the pod -level cgroup sandbox confines resource consumption without the need of an -additional cgroup sandbox for the tier. - -When the `kubelet` launches, it will ensure a `Burstable` cgroup sandbox -and a `BestEffort` cgroup sandbox exist as children of `ROOT`. These cgroup -sandboxes will parent pod level cgroups in those associated QoS classes. - -The `kubelet` highly prioritizes resource utilization, and thus -allows BestEffort and Burstable pods to potentially consume as many -resources that are presently available on the node. - -For compressible resources like CPU, the `kubelet` attempts to mitigate -the issue via its use of CPU CFS shares. CPU time is proportioned -dynamically when there is contention using CFS shares that attempts to -ensure minimum requests are satisfied. - -For incompressible resources, this prioritization scheme can inhibit the -ability of a pod to have its requests satisfied. For example, a Guaranteed -pods memory request may not be satisfied if there are active BestEffort -pods consuming all available memory. - -As a node operator, I may want to satisfy the following use cases: - -1. I want to prioritize access to compressible resources for my system -and/or kubernetes daemons over end-user pods. -1. I want to prioritize access to compressible resources for my Guaranteed -workloads over my Burstable workloads. -1. I want to prioritize access to compressible resources for my Burstable -workloads over my BestEffort workloads. - -Almost all operators are encouraged to support the first use case by enforcing -[node allocatable](node-allocatable.md) via `--system-reserved` and `--kube-reserved` -flags. 
It is understood that not all operators may feel the need to extend -that level of reservation to Guaranteed and Burstable workloads if they choose -to prioritize utilization. That said, many users in the community deploy -cluster services as Guaranteed or Burstable workloads via a `DaemonSet` and would like a similar -resource reservation model as is provided via [node allocatable](node-allocatable) -for system and kubernetes daemons. - -For operators that have this concern, the `kubelet` with opt-in configuration -will attempt to limit the ability for a pod in a lower QoS tier to burst utilization -of a compressible resource that was requested by a pod in a higher QoS tier. - -The `kubelet` will support a flag `experimental-qos-reserved` that -takes a set of percentages per incompressible resource that controls how the -QoS cgroup sandbox attempts to reserve resources for its tier. It attempts -to reserve requested resources to exclude pods from lower QoS classes from -using resources requested by higher QoS classes. The flag will accept values -in a range from 0-100%, where a value of `0%` instructs the `kubelet` to attempt -no reservation, and a value of `100%` will instruct the `kubelet` to attempt to -reserve the sum of requested resource across all pods on the node. The `kubelet` -initially will only support `memory`. The default value per incompressible -resource if not specified is for no reservation to occur for the incompressible -resource. - -Prior to starting a pod, the `kubelet` will attempt to update the -QoS cgroup sandbox associated with the lower QoS tier(s) in order -to prevent consumption of the requested resource by the new pod. -For example, prior to starting a Guaranteed pod, the Burstable -and BestEffort QoS cgroup sandboxes are adjusted. For resource -specific details, and concerns, see the sections per resource that -follow. - -The `kubelet` will allocate resources to the QoS level cgroup -dynamically in response to the following events: - -1. kubelet startup/recovery -1. prior to creation of the pod level cgroup -1. after deletion of the pod level cgroup -1. at periodic intervals to reach `experimental-qos-reserved` -heurisitc that converge to a desired state. - -All writes to the QoS level cgroup sandboxes are protected via a -common lock in the kubelet to ensure we do not have multiple concurrent -writers to this tier in the hierarchy. - -### QoS level CPU allocation - -The `BestEffort` cgroup sandbox is statically configured as follows: - -``` -ROOT/besteffort/cpu.shares = 2 -``` - -This ensures that allocation of CPU time to pods in this QoS class -is given the lowest priority. - -The `Burstable` cgroup sandbox CPU share allocation is dynamic based -on the set of pods currently scheduled to the node. - -``` -ROOT/burstable/cpu.shares = max(sum(Burstable pods cpu requests), 2) -``` - -The Burstable cgroup sandbox is updated dynamically in the exit -points described in the previous section. Given the compressible -nature of CPU, and the fact that cpu.shares are evaluated via relative -priority, the risk of an update being incorrect is minimized as the `kubelet` -converges to a desired state. Failure to set `cpu.shares` at the QoS level -cgroup would result in `500m` of cpu for a Guaranteed pod to have different -meaning than `500m` of cpu for a Burstable pod in the current hierarchy. This -is because the default `cpu.shares` value if unspecified is `1024` and `cpu.shares` -are evaluated relative to sibling nodes in the cgroup hierarchy. 
As a consequence, -all of the Burstable pods under contention would have a relative priority of 1 cpu -unless updated dynamically to capture the sum of requests. For this reason, -we will always set `cpu.shares` for the QoS level sandboxes -by default as part of roll-out for this feature. - -### QoS level memory allocation - -By default, no memory limits are applied to the BestEffort -and Burstable QoS level cgroups unless a `--qos-reserve-requests` value -is specified for memory. - -The heuristic that is applied is as follows for each QoS level sandbox: - -``` -ROOT/burstable/memory.limit_in_bytes = - Node.Allocatable - {(summation of memory requests of `Guaranteed` pods)*(reservePercent / 100)} -ROOT/besteffort/memory.limit_in_bytes = - Node.Allocatable - {(summation of memory requests of all `Guaranteed` and `Burstable` pods)*(reservePercent / 100)} -``` - -A value of `--experimental-qos-reserved=memory=100%` will cause the -`kubelet` to adjust the Burstable and BestEffort cgroups from consuming memory -that was requested by a higher QoS class. This increases the risk -of inducing OOM on BestEffort and Burstable workloads in favor of increasing -memory resource guarantees for Guaranteed and Burstable workloads. A value of -`--experimental-qos-reserved=memory=0%` will allow a Burstable -and BestEffort QoS sandbox to consume up to the full node allocatable amount if -available, but increases the risk that a Guaranteed workload will not have -access to requested memory. - -Since memory is an incompressible resource, it is possible that a QoS -level cgroup sandbox may not be able to reduce memory usage below the -value specified in the heuristic described earlier during pod admission -and pod termination. - -As a result, the `kubelet` runs a periodic thread to attempt to converge -to this desired state from the above heuristic. If unreclaimable memory -usage has exceeded the desired limit for the sandbox, the `kubelet` will -attempt to set the effective limit near the current usage to put pressure -on the QoS cgroup sandbox and prevent further consumption. - -The `kubelet` will not wait for the QoS cgroup memory limit to converge -to the desired state prior to execution of the pod, but it will always -attempt to cap the existing usage of QoS cgroup sandboxes in lower tiers. -This does mean that the new pod could induce an OOM event at the `ROOT` -cgroup, but ideally per our QoS design, the oom_killer targets a pod -in a lower QoS class, or eviction evicts a lower QoS pod. The periodic -task is then able to converge to the steady desired state so any future -pods in a lower QoS class do not impact the pod at a higher QoS class. - -Adjusting the memory limits for the QoS level cgroup sandbox carries -greater risk given the incompressible nature of memory. As a result, -we are not enabling this function by default, but would like operators -that want to value resource priority over resource utilization to gather -real-world feedback on its utility. - -As a best practice, operators that want to provide a similar resource -reservation model for Guaranteed pods as we offer via enforcement of -node allocatable are encouraged to schedule their Guaranteed pods first -as it will ensure the Burstable and BestEffort tiers have had their QoS -memory limits appropriately adjusted before taking unbounded workload on -node. 
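As a purely illustrative example of the reservation heuristic above (all numbers hypothetical): on a node with 8Gi allocatable memory, 2Gi of Guaranteed memory requests, 3Gi of Burstable memory requests, and `--experimental-qos-reserved=memory=100%`, the QoS sandboxes would be capped as follows.

```
ROOT/burstable/memory.limit_in_bytes  = 8Gi - 2Gi         = 6Gi
ROOT/besteffort/memory.limit_in_bytes = 8Gi - (2Gi + 3Gi) = 3Gi
```

With `--experimental-qos-reserved=memory=50%`, the same node would instead yield 7Gi and 5.5Gi respectively.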
- -## Memory backed volumes - -The pod level cgroup ensures that any writes to a memory backed volume -are correctly charged to the pod sandbox even when a container process -in the pod restarts. - -All memory backed volumes are removed when a pod reaches a terminal state. - -The `kubelet` verifies that a pod's cgroup is deleted from the -host before deleting a pod from the API server as part of the graceful -deletion process. - -## Log basic cgroup management - -The `kubelet` will log and collect metrics associated with cgroup manipulation. - -It will log metrics for cgroup create, update, and delete actions. - -## Rollout Plan - -### Kubernetes 1.5 - -The support for the described cgroup hierarchy is experimental. - -### Kubernetes 1.6+ - -The feature will be enabled by default. - -As a result, we will recommend that users drain their nodes prior -to upgrade of the `kubelet`. If users do not drain their nodes, the -`kubelet` will act as follows: - -1. If a pod has a `RestartPolicy=Never`, then mark the pod -as `Failed` and terminate its workload. -1. All other pods that are not parented by a pod-level cgroup -will be restarted. - -The `cgroups-per-qos` flag will be enabled by default, but user's -may choose to opt-out. We may deprecate this opt-out mechanism -in Kubernetes 1.7, and remove the flag entirely in Kubernetes 1.8. - -#### Risk Assessment - -The impact of the unified cgroup hierarchy is restricted to the `kubelet`. - -Potential issues: - -1. Bugs -1. Performance and/or reliability issues for `BestEffort` pods. This is -most likely to appear on E2E test runs that mix/match pods across different -QoS tiers. -1. User misconfiguration; most notably the `--cgroup-driver` needs to match -the expected behavior of the container runtime. We provide clear errors -in `kubelet` logs for container runtimes that we include in tree. - -#### Proposed Timeline - -* 01/31/2017 - Discuss the rollout plan in sig-node meeting -* 02/14/2017 - Flip the switch to enable pod level cgroups by default - * enable existing experimental behavior by default -* 02/21/2017 - Assess impacts based on enablement -* 02/27/2017 - Kubernetes Feature complete (i.e. code freeze) - * opt-in behavior surrounding the feature (`experimental-qos-reserved` support) completed. -* 03/01/2017 - Send an announcement to kubernetes-dev@ about the rollout and potential impact -* 03/22/2017 - Kubernetes 1.6 release -* TBD (1.7?) - Eliminate the option to not use the new cgroup hierarchy. - -This is based on the tentative timeline of kubernetes 1.6 release. Need to work out the timeline with the 1.6 release czar. - -## Future enhancements - -### Add Pod level metrics to Kubelet's metrics provider - -Update the `kubelet` metrics provider to include pod level metrics. - -### Evaluate supporting evictions local to QoS cgroup sandboxes - -Rather than induce eviction at `/` or `/kubepods`, evaluate supporting -eviction decisions for the unbounded QoS tiers (Burstable, BestEffort). - -## Examples - -The following describes the cgroup representation of a node with pods -across multiple QoS classes. - -### Cgroup Hierarchy - -The following identifies a sample hierarchy based on the described design. - -It assumes the flag `--experimental-qos-reserved` is not enabled for clarity. - -``` -$ROOT - | - +- Pod1 - | | - | +- Container1 - | +- Container2 - | ... - +- Pod2 - | +- Container3 - | ... - +- ... - | - +- burstable - | | - | +- Pod3 - | | | - | | +- Container4 - | | ... - | +- Pod4 - | | +- Container5 - | | ... - | +- ... 
- | - +- besteffort - | | - | +- Pod5 - | | | - | | +- Container6 - | | +- Container7 - | | ... - | +- ... -``` - -### Guaranteed Pods - -We have two pods Pod1 and Pod2 having Pod Spec given below - -```yaml -kind: Pod -metadata: - name: Pod1 -spec: - containers: - name: foo - resources: - limits: - cpu: 10m - memory: 1Gi - name: bar - resources: - limits: - cpu: 100m - memory: 2Gi -``` - -```yaml -kind: Pod -metadata: - name: Pod2 -spec: - containers: - name: foo - resources: - limits: - cpu: 20m - memory: 2Gii -``` - -Pod1 and Pod2 are both classified as Guaranteed and are nested under the `ROOT` cgroup. - -``` -/ROOT/Pod1/cpu.quota = 110m -/ROOT/Pod1/cpu.shares = 110m -/ROOT/Pod1/memory.limit_in_bytes = 3Gi -/ROOT/Pod2/cpu.quota = 20m -/ROOT/Pod2/cpu.shares = 20m -/ROOT/Pod2/memory.limit_in_bytes = 2Gi -``` - -#### Burstable Pods - -We have two pods Pod3 and Pod4 having Pod Spec given below: - -```yaml -kind: Pod -metadata: - name: Pod3 -spec: - containers: - name: foo - resources: - limits: - cpu: 50m - memory: 2Gi - requests: - cpu: 20m - memory: 1Gi - name: bar - resources: - limits: - cpu: 100m - memory: 1Gi -``` - -```yaml -kind: Pod -metadata: - name: Pod4 -spec: - containers: - name: foo - resources: - limits: - cpu: 20m - memory: 2Gi - requests: - cpu: 10m - memory: 1Gi -``` - -Pod3 and Pod4 are both classified as Burstable and are hence nested under -the Burstable cgroup. - -``` -/ROOT/burstable/cpu.shares = 130m -/ROOT/burstable/memory.limit_in_bytes = Allocatable - 5Gi -/ROOT/burstable/Pod3/cpu.quota = 150m -/ROOT/burstable/Pod3/cpu.shares = 120m -/ROOT/burstable/Pod3/memory.limit_in_bytes = 3Gi -/ROOT/burstable/Pod4/cpu.quota = 20m -/ROOT/burstable/Pod4/cpu.shares = 10m -/ROOT/burstable/Pod4/memory.limit_in_bytes = 2Gi -``` - -#### Best Effort pods - -We have a pod, Pod5, having Pod Spec given below: - -```yaml -kind: Pod -metadata: - name: Pod5 -spec: - containers: - name: foo - resources: - name: bar - resources: -``` - -Pod5 is classified as BestEffort and is hence nested under the BestEffort cgroup - -``` -/ROOT/besteffort/cpu.shares = 2 -/ROOT/besteffort/cpu.quota= not set -/ROOT/besteffort/memory.limit_in_bytes = Allocatable - 7Gi -/ROOT/besteffort/Pod5/memory.limit_in_bytes = no limit -``` +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/node/propagation.md b/contributors/design-proposals/node/propagation.md index 20cf58d9..f0fbec72 100644 --- a/contributors/design-proposals/node/propagation.md +++ b/contributors/design-proposals/node/propagation.md @@ -1,311 +1,6 @@ -# HostPath Volume Propagation +Design proposals have been archived. -## Abstract +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -A proposal to add support for propagation mode in HostPath volume, which allows -mounts within containers to visible outside the container and mounts after pods -creation visible to containers. Propagation [modes] (https://www.kernel.org/doc/Documentation/filesystems/sharedsubtree.txt) contains "shared", "slave", "private", -"unbindable". Out of them, docker supports "shared" / "slave" / "private". -Several existing issues and PRs were already created regarding that particular -subject: -* Capability to specify mount propagation mode of per volume with docker [#20698] (https://github.com/kubernetes/kubernetes/pull/20698) -* Set propagation to "shared" for hostPath volume [#31504] (https://github.com/kubernetes/kubernetes/pull/31504) - -## Use Cases - -1. (From @Kaffa-MY) Our team attempts to containerize flocker with zfs as back-end -storage, and launch them in DaemonSet. Containers in the same flocker node need -to read/write and share the same mounted volume. Currently the volume mount -propagation mode cannot be specified between the host and the container, and then -the volume mount of each container would be isolated from each other. -This use case is also referenced by Containerized Volume Client Drivers - Design -Proposal [#22216] (https://github.com/kubernetes/kubernetes/pull/22216) - -1. (From @majewsky) I'm currently putting the [OpenStack Swift object storage] (https://github.com/openstack/swift) into -k8s on CoreOS. Swift's storage services expect storage drives to be mounted at -/srv/node/{drive-id} (where {drive-id} is defined by the cluster's ring, the topology -description data structure which is shared between all cluster members). Because -there are several such services on each node (about a dozen, actually), I assemble -/srv/node in the host mount namespace, and pass it into the containers as a hostPath -volume. -Swift is designed such that drives can be mounted and unmounted at any time (most -importantly to hot-swap failed drives) and the services can keep running, but if -the services run in a private mount namespace, they won't see the mounts/unmounts -performed on the host mount namespace until the containers are restarted. -The slave mount namespace is the correct solution for this AFAICS. Until this -becomes available in k8s, we will have to have operations restart containers manually -based on monitoring alerts. - -1. (From @victorgp) When using CoreOS Container Linux that does not provides external fuse systems -like, in our case, GlusterFS, and you need a container to do the mounts. The only -way to see those mounts in the host, hence also visible by other containers, is by -sharing the mount propagation. - -1. (From @YorikSar) For OpenStack project, Neutron, we need network namespaces -created by it to persist across reboot of pods with Neutron agents. Without it -we have unnecessary data plane downtime during rolling update of these agents. 
-Neutron L3 agent creates interfaces and iptables rules for each virtual router -in a separate network namespace. For managing them it uses ip netns command that -creates persistent network namespaces by calling unshare(CLONE_NEWNET) and then -bind-mounting new network namespace's inode from /proc/self/ns/net to file with -specified name in /run/netns dir. These bind mounts are the only references to -these namespaces that remain. -When we restart the pod, its mount namespace is destroyed with all these bind -mounts, so all network namespaces created by the agent are gone. For them to -survive we need to bind mount a dir from host mount namespace to container one -with shared flag, so that all bind mounts are propagated across mount namespaces -and references to network namespaces persist. - -1. (From https://github.com/kubernetes/kubernetes/issues/46643) I expect the - container to start and any fuse mounts it creates in a volume that exists on - other containers in the pod (that are using :slave) are available to those - other containers. - - In other words, two containers in the same pod share an EmptyDir. One - container mounts something in it and the other one can see it. The first - container must have (r)shared mount propagation to the EmptyDir, the second - one can have (r)slave. - - -## Implementation Alternatives - -### Add an option in VolumeMount API - -The new `VolumeMount` will look like: - -```go -type MountPropagationMode string - -const ( - // MountPropagationHostToContainer means that the volume in a container will - // receive new mounts from the host or other containers, but filesystems - // mounted inside the container won't be propagated to the host or other - // containers. - // Note that this mode is recursively applied to all mounts in the volume - // ("rslave" in Linux terminology). - MountPropagationHostToContainer MountPropagationMode = "HostToContainer" - // MountPropagationBidirectional means that the volume in a container will - // receive new mounts from the host or other containers, and its own mounts - // will be propagated from the container to the host or other containers. - // Note that this mode is recursively applied to all mounts in the volume - // ("rshared" in Linux terminology). - MountPropagationBidirectional MountPropagationMode = "Bidirectional" -) - -type VolumeMount struct { - // Required: This must match the Name of a Volume [above]. - Name string `json:"name"` - // Optional: Defaults to false (read-write). - ReadOnly bool `json:"readOnly,omitempty"` - // Required. - MountPath string `json:"mountPath"` - // mountPropagation is the mode how are mounts in the volume propagated from - // the host to the container and from the container to the host. - // When not set, MountPropagationHostToContainer is used. - // This field is alpha in 1.8 and can be reworked or removed in a future - // release. - // Optional. - MountPropagation *MountPropagationMode `json:"mountPropagation,omitempty"` -} -``` - -Default would be `HostToContainer`, i.e. `rslave`, which should not break -backward compatibility, `Bidirectional` must be explicitly requested. -Using enum instead of simple `PropagateMounts bool` allows us to extend the -modes to `private` or non-recursive `shared` and `slave` if we need so in -future. - -Only privileged containers are allowed to use `Bidirectional` for their volumes. -This will be enforced during validation. - -Opinion against this: - -1. This will affect all volumes, while only HostPath need this. 
It could be -checked during validation and any non-HostPath volumes with non-default -propagation could be rejected. - -1. This need API change, which is discouraged. - -### Add an option in HostPathVolumeSource - -The new `HostPathVolumeSource` will look like: - -```go -type MountPropagationMode string - -const ( - // MountPropagationHostToContainer means that the volume in a container will - // receive new mounts from the host or other containers, but filesystems - // mounted inside the container won't be propagated to the host or other - // containers. - // Note that this mode is recursively applied to all mounts in the volume - // ("rslave" in Linux terminology). - MountPropagationHostToContainer MountPropagationMode = "HostToContainer" - // MountPropagationBidirectional means that the volume in a container will - // receive new mounts from the host or other containers, and its own mounts - // will be propagated from the container to the host or other containers. - // Note that this mode is recursively applied to all mounts in the volume - // ("rshared" in Linux terminology). - MountPropagationBidirectional MountPropagationMode = "Bidirectional" -) - -type HostPathVolumeSource struct { - Path string `json:"path"` - // mountPropagation is the mode how are mounts in the volume propagated from - // the host to the container and from the container to the host. - // When not set, MountPropagationHostToContainer is used. - // This field is alpha in 1.8 and can be reworked or removed in a future - // release. - // Optional. - MountPropagation *MountPropagationMode `json:"mountPropagation,omitempty"` -} -``` - -Default would be `HostToContainer`, i.e. `rslave`, which should not break -backward compatibility, `Bidirectional` must be explicitly requested. -Using enum instead of simple `PropagateMounts bool` allows us to extend the -modes to `private` or non-recursive `shared` and `slave` if we need so in -future. - -Only privileged containers can use HostPath with `Bidirectional` mount -propagation - kubelet silently downgrades the propagation to `HostToContainer` -when running `Bidirectional` HostPath in a non-privileged container. This allows -us to use the same `HostPathVolumeSource` in a pod with two containers, one -non-privileged with `HostToContainer` propagation and second privileged with -`Bidirectional` that mounts stuff for the first one. - -Opinion against this: - -1. This need API change, which is discouraged. - -1. All containers use this volume will share the same propagation mode. - -1. Silent downgrade from `Bidirectional` to `HostToContainer` for non-privileged - containers. - -1. (From @jonboulle) May cause cross-runtime compatibility issue. - -1. It's not possible to validate a pod + mount propagation. Mount propagation - is stored in a HostPath PersistentVolume object, while privileged mode is - stored in Pod object. Validator sees only one object and we don't do - cross-object validation and can't reject non-privileged pod that uses a PV - with shared mount propagation. - -### Make HostPath shared for privileged containers, slave for non-privileged. - -Given only HostPath needs this feature, and CAP_SYS_ADMIN access is needed when -making mounts inside container, we can bind propagation mode with existing option -privileged, or we can introduce a new option in SecurityContext to control this. - -The propagation mode could be determined by the following logic: - -```go -// Environment check to ensure "shared" is supported. 
-if !dockerNewerThanV110 || !mountPathIsShared { - return "" -} -if container.SecurityContext.Privileged { - return "shared" -} else { - return "slave" -} -``` - -Opinion against this: - -1. This changes the behavior of existing config. - -1. (From @euank) "shared" is not correctly supported by some kernels, we need -runtime support matrix and when that will be addressed. - -1. This may cause silently fail and be a debuggability nightmare on many -distros. - -1. (From @euank) Changing those mountflags may make docker even less stable, -this may lock up kernel accidentally or potentially leak mounts. - -1. (From @jsafrane) Typical container that needs to mount something needs to -see host's `/dev` and `/sys` as HostPath volumes. This would make them shared -without any way to opt-out. Docker creates a new `/dev/shm` in the -container, which gets propagated to the host, shadowing host's `/dev/shm`. -Similarly, systemd running in a container is very picky about `/sys/fs/cgroup` -and something prevents it from starting if `/sys` is shared. - -## Decision - -* We will take 'Add an option in VolumeMount API' - * With an alpha feature gate in 1.8. - * Only privileged containers can use `rshared` (`Bidirectional`) mount - propagation (with a validator). - -* During alpha, all the behavior above must be explicitly enabled by - `kubelet --feature-gates=MountPropagation=true` - It will be used only for testing of volume plugins in e2e tests and - Mount propagation may be redesigned or even removed in any future release. - - When the feature is enabled: - - * The default mount propagation of **all** volumes (incl. GCE, AWS, Cinder, - Gluster, Flex, ...) will be `slave`, which is different to current - `private`. Extensive testing is needed! We may restrict it to HostPath + - EmptyDir in Beta. - - * **Any** volume in a privileged container can be `Bidirectional`. We may - restrict it to HostPath + EmptyDir in Beta. - - * Kubelet's Docker shim layer will check that it is able to run a container - with shared mount propagation on `/var/lib/kubelet` during startup and log - a warning otherwise. This ensures that both Docker and kubelet see the same - `/var/lib/kubelet` and it can be shared into containers. - E.g. Google COS-58 runs Docker in a separate mount namespace with slave - propagation and thus can't run a container with shared propagation on - anything. - - This will be done via simple docker version check (1.13 is required) when - the feature gate is enabled. - - * Node conformance suite will check that mount propagation in /var/lib/kubelet - works. - - * When running on a distro with `private` as default mount propagation - (probably anything that does not run systemd, such as Debian Wheezy), - Kubelet will make `/var/lib/kubelet` share-able into containers and it will - refuse to start if it's unsuccessful. - - It sounds complicated, but it's simple - `mount --bind --rshared /var/lib/kubelet /var/lib/kubelet`. See - kubernetes/kubernetes#45724 - - -## Extra Concerns - -@lucab and @euank has some extra concerns about pod isolation when propagation -modes are changed, listed below: - -1. how to clean such pod resources (as mounts are now crossing pod boundaries, -thus they can be kept busy indefinitely by processes outside of the pod) - -1. side-effects on restarts (possibly piling up layers of full-propagation mounts) - -1. how does this interacts with other mount features (nested volumeMounts may or -may not propagate back to the host, depending of ordering of mount operations) - -1. 
limitations this imposes on runtimes (RO-remounting may now affect the host; -is it on purpose or a dangerous side-effect?) - -1. A shared mount target imposes some constraints on its parent subtree (generally, -it has to be shared as well), which in turn prevents some mount operations when -preparing a pod (e.g. MS_MOVE). - -1. The "on-by-default" nature means existing hostpath mounts, which used to be -harmless, could begin consuming kernel resources and cause a node to crash. Even -if a pod does not create any new mountpoints under its hostpath bindmount, it's -not hard to reach multiplicative explosions with shared bindmounts, and so the -change in default + no cleanup could result in existing workloads knocking the -node over. - -These concerns are valid and we decided to limit the propagation mode to the HostPath -volume only; for HostPath, we expect that runtimes should NOT perform any additional -actions (such as cleanup). This behavior is also consistent with current HostPath -logic: Kubernetes does not manage the contents of a HostPath volume either. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/node/resource-qos.md b/contributors/design-proposals/node/resource-qos.md index 82ddef09..f0fbec72 100644 --- a/contributors/design-proposals/node/resource-qos.md +++ b/contributors/design-proposals/node/resource-qos.md @@ -1,215 +1,6 @@ -# Resource Quality of Service in Kubernetes +Design proposals have been archived. -**Author(s)**: Vishnu Kannan (vishh@), Ananya Kumar (@AnanyaKumar) -**Last Updated**: 5/17/2016 +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -**Status**: Implemented - -*This document presents the design of resource quality of service for containers in Kubernetes, and describes use cases and implementation details.* - -## Introduction - -This document describes the way Kubernetes provides different levels of Quality of Service to pods depending on what they *request*. -Pods that need to stay up reliably can request guaranteed resources, while pods with less stringent requirements can use resources with weaker or no guarantee. - -Specifically, for each resource, containers specify a request, which is the amount of that resource that the system will guarantee to the container, and a limit which is the maximum amount that the system will allow the container to use. -The system computes pod level requests and limits by summing up per-resource requests and limits across all containers. -When request == limit, the resources are guaranteed, and when request < limit, the pod is guaranteed the request but can opportunistically scavenge the difference between request and limit if they are not being used by other containers. -This allows Kubernetes to oversubscribe nodes, which increases utilization, while at the same time maintaining resource guarantees for the containers that need guarantees. -Borg increased utilization by about 20% when it started allowing use of such non-guaranteed resources, and we hope to see similar improvements in Kubernetes. - -## Requests and Limits - -For each resource, containers can specify a resource request and limit, `0 <= request <= `[`Node Allocatable`](node-allocatable.md) & `request <= limit <= Infinity`. -If a pod is successfully scheduled, the container is guaranteed the amount of resources requested. -Scheduling is based on `requests` and not `limits`. -The pods and its containers will not be allowed to exceed the specified limit. -How the request and limit are enforced depends on whether the resource is [compressible or incompressible](../scheduling/resources.md). - -### Compressible Resource Guarantees - -- For now, we are only supporting CPU. -- Pods are guaranteed to get the amount of CPU they request, they may or may not get additional CPU time (depending on the other jobs running). This isn't fully guaranteed today because cpu isolation is at the container level. Pod level cgroups will be introduced soon to achieve this goal. -- Excess CPU resources will be distributed based on the amount of CPU requested. For example, suppose container A requests for 600 milli CPUs, and container B requests for 300 milli CPUs. Suppose that both containers are trying to use as much CPU as they can. Then the extra 100 milli CPUs will be distributed to A and B in a 2:1 ratio (implementation discussed in later sections). -- Pods will be throttled if they exceed their limit. If limit is unspecified, then the pods can use excess CPU when available. 
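To make the proportional sharing described in the compressible-resource bullets above concrete, here is a minimal illustrative sketch (not part of the original proposal) of how milli-CPU requests could be mapped to cgroup `cpu.shares`; the function name and the 1024-shares-per-core constant are assumptions that mirror common runtime behavior.

```go
package main

import "fmt"

const (
	sharesPerCPU  = 1024 // cpu.shares granted per full CPU (assumption mirroring common runtime behavior)
	milliCPUToCPU = 1000
	minShares     = 2 // kernel minimum for cpu.shares
)

// milliCPUToShares converts a milli-CPU request into cgroup cpu.shares.
// Shares only influence how *excess* CPU is divided; they are not a hard cap.
func milliCPUToShares(milliCPU int64) int64 {
	if milliCPU == 0 {
		return minShares
	}
	shares := (milliCPU * sharesPerCPU) / milliCPUToCPU
	if shares < minShares {
		return minShares
	}
	return shares
}

func main() {
	// Container A requests 600m, container B requests 300m: the resulting
	// shares preserve the 2:1 ratio used above to split idle CPU between them.
	fmt.Println(milliCPUToShares(600), milliCPUToShares(300)) // 614 307
}
```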
- -### Incompressible Resource Guarantees - -- For now, we are only supporting memory. -- Pods will get the amount of memory they request, if they exceed their memory request, they could be killed (if some other pod needs memory), but if pods consume less memory than requested, they will not be killed (except in cases where system tasks or daemons need more memory). -- When Pods use more memory than their limit, a process that is using the most amount of memory, inside one of the pod's containers, will be killed by the kernel. - -### Admission/Scheduling Policy - -- Pods will be admitted by Kubelet & scheduled by the scheduler based on the sum of requests of its containers. The scheduler & kubelet will ensure that sum of requests of all containers is within the node's [allocatable](node-allocatable.md) capacity (for both memory and CPU). - -## QoS Classes - -In an overcommitted system (where sum of limits > machine capacity) containers might eventually have to be killed, for example if the system runs out of CPU or memory resources. Ideally, we should kill containers that are less important. For each resource, we divide containers into 3 QoS classes: *Guaranteed*, *Burstable*, and *Best-Effort*, in decreasing order of priority. - -The relationship between "Requests and Limits" and "QoS Classes" is subtle. Theoretically, the policy of classifying pods into QoS classes is orthogonal to the requests and limits specified for the container. Hypothetically, users could use an (currently unplanned) API to specify whether a pod is guaranteed or best-effort. However, in the current design, the policy of classifying pods into QoS classes is intimately tied to "Requests and Limits" - in fact, QoS classes are used to implement some of the memory guarantees described in the previous section. - -Pods can be of one of 3 different classes: - -- If `limits` and optionally `requests` (not equal to `0`) are set for all resources across all containers and they are *equal*, then the pod is classified as **Guaranteed**. - -Examples: - -```yaml -containers: - name: foo - resources: - limits: - cpu: 10m - memory: 1Gi - name: bar - resources: - limits: - cpu: 100m - memory: 100Mi -``` - -```yaml -containers: - name: foo - resources: - limits: - cpu: 10m - memory: 1Gi - requests: - cpu: 10m - memory: 1Gi - - name: bar - resources: - limits: - cpu: 100m - memory: 100Mi - requests: - cpu: 100m - memory: 100Mi -``` - -- If `requests` and optionally `limits` are set (not equal to `0`) for one or more resources across one or more containers, and they are *not equal*, then the pod is classified as **Burstable**. -When `limits` are not specified, they default to the node capacity. - -Examples: - -Container `bar` has no resources specified. - -```yaml -containers: - name: foo - resources: - limits: - cpu: 10m - memory: 1Gi - requests: - cpu: 10m - memory: 1Gi - - name: bar -``` - -Container `foo` and `bar` have limits set for different resources. - -```yaml -containers: - name: foo - resources: - limits: - memory: 1Gi - - name: bar - resources: - limits: - cpu: 100m -``` - -Container `foo` has no limits set, and `bar` has neither requests nor limits specified. - -```yaml -containers: - name: foo - resources: - requests: - cpu: 10m - memory: 1Gi - - name: bar -``` - -- If `requests` and `limits` are not set for all of the resources, across all containers, then the pod is classified as **Best-Effort**. 
- -Examples: - -```yaml -containers: - name: foo - resources: - name: bar - resources: -``` - -Pods will not be killed if CPU guarantees cannot be met (for example if system tasks or daemons take up lots of CPU), they will be temporarily throttled. - -Memory is an incompressible resource and so let's discuss the semantics of memory management a bit. - -- *Best-Effort* pods will be treated as lowest priority. Processes in these pods are the first to get killed if the system runs out of memory. -These containers can use any amount of free memory in the node though. - -- *Guaranteed* pods are considered top-priority and are guaranteed to not be killed until they exceed their limits, or if the system is under memory pressure and there are no lower priority containers that can be evicted. - -- *Burstable* pods have some form of minimal resource guarantee, but can use more resources when available. -Under system memory pressure, these containers are more likely to be killed once they exceed their requests and no *Best-Effort* pods exist. - -### OOM Score configuration at the Nodes - -Pod OOM score configuration -- Note that the OOM score of a process is 10 times the % of memory the process consumes, adjusted by OOM_SCORE_ADJ, barring exceptions (e.g. process is launched by root). Processes with higher OOM scores are killed. -- The base OOM score is between 0 and 1000, so if process A's OOM_SCORE_ADJ - process B's OOM_SCORE_ADJ is over a 1000, then process A will always be OOM killed before B. -- The final OOM score of a process is also between 0 and 1000 - -*Best-effort* - - Set OOM_SCORE_ADJ: 1000 - - So processes in best-effort containers will have an OOM_SCORE of 1000 - -*Guaranteed* - - Set OOM_SCORE_ADJ: -998 - - So processes in guaranteed containers will have an OOM_SCORE of 0 or 1 - -*Burstable* - - If total memory request > 99.8% of available memory, OOM_SCORE_ADJ: 2 - - Otherwise, set OOM_SCORE_ADJ to 1000 - 10 * (% of memory requested) - - This ensures that the OOM_SCORE of burstable pod is > 1 - - If memory request is `0`, OOM_SCORE_ADJ is set to `999`. - - So burstable pods will be killed if they conflict with guaranteed pods - - If a burstable pod uses less memory than requested, its OOM_SCORE < 1000 - - So best-effort pods will be killed if they conflict with burstable pods using less than requested memory - - If a process in burstable pod's container uses more memory than what the container had requested, its OOM_SCORE will be 1000, if not its OOM_SCORE will be < 1000 - - Assuming that a container typically has a single big process, if a burstable pod's container that uses more memory than requested conflicts with another burstable pod's container using less memory than requested, the former will be killed - - If burstable pod's containers with multiple processes conflict, then the formula for OOM scores is a heuristic, it will not ensure "Request and Limit" guarantees. - -*Pod infra containers* or *Special Pod init process* - - OOM_SCORE_ADJ: -998 - -*Kubelet, Docker* - - OOM_SCORE_ADJ: -999 (won't be OOM killed) - - Hack, because these critical tasks might die if they conflict with guaranteed containers. In the future, we should place all user-pods into a separate cgroup, and set a limit on the memory they can consume. - -## Known issues and possible improvements - -The above implementation provides for basic oversubscription with protection, but there are a few known limitations. - -#### Support for Swap - -- The current QoS policy assumes that swap is disabled. 
If swap is enabled, then resource guarantees (for pods that specify resource requirements) will not hold. For example, suppose 2 guaranteed pods have reached their memory limit. They can continue allocating memory by utilizing disk space. Eventually, if there isn't enough swap space, processes in the pods might get killed. The node must take into account swap space explicitly for providing deterministic isolation behavior. - -## Alternative QoS Class Policy - -An alternative is to have user-specified numerical priorities that guide Kubelet on which tasks to kill (if the node runs out of memory, lower priority tasks will be killed). -A strict hierarchy of user-specified numerical priorities is not desirable because: - -1. Achieved behavior would be emergent based on how users assigned priorities to their pods. No particular SLO could be delivered by the system, and usage would be subject to gaming if not restricted administratively -2. Changes to desired priority bands would require changes to all user pod configurations. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
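Referring back to the OOM score table in the archived resource QoS text above, the following is a small hedged sketch (not the kubelet implementation) of the Burstable `OOM_SCORE_ADJ` heuristic; the function name and the clamping choices are assumptions.

```go
package main

import "fmt"

// burstableOOMScoreAdj mirrors the heuristic described above for Burstable
// pods: guaranteed pods get -998, best-effort pods get 1000, and burstable
// pods land in between, proportionally to how much memory they request.
func burstableOOMScoreAdj(memoryRequestBytes, memoryCapacityBytes int64) int {
	if memoryRequestBytes == 0 {
		return 999
	}
	fraction := float64(memoryRequestBytes) / float64(memoryCapacityBytes)
	if fraction > 0.998 {
		return 2
	}
	adj := 1000 - int(10*fraction*100) // 1000 - 10 * (% of memory requested)
	if adj < 2 {
		adj = 2 // keep it strictly greater than 1 so guaranteed pods win
	}
	if adj > 999 {
		adj = 999 // stay below the best-effort value of 1000
	}
	return adj
}

func main() {
	// A pod requesting 25% of node memory gets OOM_SCORE_ADJ = 750.
	fmt.Println(burstableOOMScoreAdj(2<<30, 8<<30))
}
```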
\ No newline at end of file diff --git a/contributors/design-proposals/node/runtime-client-server.md b/contributors/design-proposals/node/runtime-client-server.md index b50e003d..f0fbec72 100644 --- a/contributors/design-proposals/node/runtime-client-server.md +++ b/contributors/design-proposals/node/runtime-client-server.md @@ -1,202 +1,6 @@ -# Client/Server container runtime +Design proposals have been archived. -## Abstract +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -A proposal of client/server implementation of kubelet container runtime interface. - -## Motivation - -Currently, any container runtime has to be linked into the kubelet. This makes -experimentation difficult, and prevents users from landing an alternate -container runtime without landing code in core kubernetes. - -To facilitate experimentation and to enable user choice, this proposal adds a -client/server implementation of the [new container runtime interface](https://github.com/kubernetes/kubernetes/pull/25899). The main goal -of this proposal is: - -- make it easy to integrate new container runtimes -- improve code maintainability - -## Proposed design - -**Design of client/server container runtime** - -The main idea of client/server container runtime is to keep main control logic in kubelet while letting remote runtime only do dedicated actions. An alpha [container runtime API](../../pkg/kubelet/api/v1alpha1/runtime/api.proto) is introduced for integrating new container runtimes. The API is based on [protobuf](https://developers.google.com/protocol-buffers/) and [gRPC](http://www.grpc.io) for a number of benefits: - -- Perform faster than json -- Get client bindings for free: gRPC supports ten languages -- No encoding/decoding codes needed -- Manage api interfaces easily: server and client interfaces are generated automatically - -A new container runtime manager `KubeletGenericRuntimeManager` will be introduced to kubelet, which will - -- conforms to kubelet's [Runtime](../../pkg/kubelet/container/runtime.go#L58) interface -- manage Pods and Containers lifecycle according to kubelet policies -- call remote runtime's API to perform specific pod, container or image operations - -A simple workflow of invoking remote runtime API on starting a Pod with two containers can be shown: - -``` -Kubelet KubeletGenericRuntimeManager RemoteRuntime - + + + - | | | - +---------SyncPod------------->+ | - | | | - | +---- Create PodSandbox ------->+ - | +<------------------------------+ - | | | - | XXXXXXXXXXXX | - | | X | - | | NetworkPlugin. 
| - | | SetupPod | - | | X | - | XXXXXXXXXXXX | - | | | - | +<------------------------------+ - | +---- Pull image1 -------->+ - | +<------------------------------+ - | +---- Create container1 ------->+ - | +<------------------------------+ - | +---- Start container1 -------->+ - | +<------------------------------+ - | | | - | +<------------------------------+ - | +---- Pull image2 -------->+ - | +<------------------------------+ - | +---- Create container2 ------->+ - | +<------------------------------+ - | +---- Start container2 -------->+ - | +<------------------------------+ - | | | - | <-------Success--------------+ | - | | | - + + + -``` - -And deleting a pod can be shown: - -``` -Kubelet KubeletGenericRuntimeManager RemoteRuntime - + + + - | | | - +---------SyncPod------------->+ | - | | | - | +---- Stop container1 ----->+ - | +<------------------------------+ - | +---- Delete container1 ----->+ - | +<------------------------------+ - | | | - | +---- Stop container2 ------>+ - | +<------------------------------+ - | +---- Delete container2 ------>+ - | +<------------------------------+ - | | | - | XXXXXXXXXXXX | - | | X | - | | NetworkPlugin. | - | | TeardownPod | - | | X | - | XXXXXXXXXXXX | - | | | - | | | - | +---- Delete PodSandbox ------>+ - | +<------------------------------+ - | | | - | <-------Success--------------+ | - | | | - + + + -``` - -**API definition** - -Since we are going to introduce more image formats and want to separate image management from containers and pods, this proposal introduces two services `RuntimeService` and `ImageService`. Both services are defined at [pkg/kubelet/api/v1alpha1/runtime/api.proto](../../pkg/kubelet/api/v1alpha1/runtime/api.proto): - -```proto -// Runtime service defines the public APIs for remote container runtimes -service RuntimeService { - // Version returns the runtime name, runtime version and runtime API version - rpc Version(VersionRequest) returns (VersionResponse) {} - - // CreatePodSandbox creates a pod-level sandbox. - // The definition of PodSandbox is at https://github.com/kubernetes/kubernetes/pull/25899 - rpc CreatePodSandbox(CreatePodSandboxRequest) returns (CreatePodSandboxResponse) {} - // StopPodSandbox stops the sandbox. If there are any running containers in the - // sandbox, they should be force terminated. - rpc StopPodSandbox(StopPodSandboxRequest) returns (StopPodSandboxResponse) {} - // DeletePodSandbox deletes the sandbox. If there are any running containers in the - // sandbox, they should be force deleted. - rpc DeletePodSandbox(DeletePodSandboxRequest) returns (DeletePodSandboxResponse) {} - // PodSandboxStatus returns the Status of the PodSandbox. - rpc PodSandboxStatus(PodSandboxStatusRequest) returns (PodSandboxStatusResponse) {} - // ListPodSandbox returns a list of SandBox. - rpc ListPodSandbox(ListPodSandboxRequest) returns (ListPodSandboxResponse) {} - - // CreateContainer creates a new container in specified PodSandbox - rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse) {} - // StartContainer starts the container. - rpc StartContainer(StartContainerRequest) returns (StartContainerResponse) {} - // StopContainer stops a running container with a grace period (i.e., timeout). - rpc StopContainer(StopContainerRequest) returns (StopContainerResponse) {} - // RemoveContainer removes the container. If the container is running, the container - // should be force removed. 
- rpc RemoveContainer(RemoveContainerRequest) returns (RemoveContainerResponse) {} - // ListContainers lists all containers by filters. - rpc ListContainers(ListContainersRequest) returns (ListContainersResponse) {} - // ContainerStatus returns status of the container. - rpc ContainerStatus(ContainerStatusRequest) returns (ContainerStatusResponse) {} - - // Exec executes the command in the container. - rpc Exec(stream ExecRequest) returns (stream ExecResponse) {} -} - -// Image service defines the public APIs for managing images -service ImageService { - // ListImages lists existing images. - rpc ListImages(ListImagesRequest) returns (ListImagesResponse) {} - // ImageStatus returns the status of the image. - rpc ImageStatus(ImageStatusRequest) returns (ImageStatusResponse) {} - // PullImage pulls a image with authentication config. - rpc PullImage(PullImageRequest) returns (PullImageResponse) {} - // RemoveImage removes the image. - rpc RemoveImage(RemoveImageRequest) returns (RemoveImageResponse) {} -} -``` - -Note that some types in [pkg/kubelet/api/v1alpha1/runtime/api.proto](../../pkg/kubelet/api/v1alpha1/runtime/api.proto) are already defined at [Container runtime interface/integration](https://github.com/kubernetes/kubernetes/pull/25899). -We should decide how to integrate the types in [#25899](https://github.com/kubernetes/kubernetes/pull/25899) with gRPC services: - -* Auto-generate those types into protobuf by [go2idl](../../cmd/libs/go2idl/) - - Pros: - - trace type changes automatically, all type changes in Go will be automatically generated into proto files - - Cons: - - type change may break existing API implementations, e.g. new fields added automatically may not noticed by remote runtime - - needs to convert Go types to gRPC generated types, and vise versa - - needs processing attributes order carefully so as not to break generated protobufs (this could be done by using [protobuf tag](https://developers.google.com/protocol-buffers/docs/gotutorial)) - - go2idl doesn't support gRPC, [protoc-gen-gogo](https://github.com/gogo/protobuf) is still required for generating gRPC client -* Embed those types as raw protobuf definitions and generate Go files by [protoc-gen-gogo](https://github.com/gogo/protobuf) - - Pros: - - decouple type definitions, all type changes in Go will be added to proto manually, so it's easier to track gRPC API version changes - - Kubelet could reuse Go types generated by `protoc-gen-gogo` to avoid type conversions - - Cons: - - duplicate definition of same types - - hard to track type changes automatically - - need to manage proto files manually - -For better version controlling and fast iterations, this proposal embeds all those types in `api.proto` directly. - -## Implementation - -Each new runtime should implement the [gRPC](http://www.grpc.io) server based on [pkg/kubelet/api/v1alpha1/runtime/api.proto](../../pkg/kubelet/api/v1alpha1/runtime/api.proto). For version controlling, `KubeletGenericRuntimeManager` will request `RemoteRuntime`'s `Version()` interface with the runtime api version. To keep backward compatibility, the API follows standard [protobuf guide](https://developers.google.com/protocol-buffers/docs/proto) to deprecate or add new interfaces. - -A new flag `--container-runtime-endpoint` (overrides `--container-runtime`) will be introduced to kubelet which identifies the unix socket file of the remote runtime service. 
And new flag `--image-service-endpoint` will be introduced to kubelet which identifies the unix socket file of the image service. - -To facilitate switching current container runtime (e.g. `docker` or `rkt`) to new runtime API, `KubeletGenericRuntimeManager` will provide a plugin mechanism allowing to specify local implementation or gRPC implementation. - -## Community Discussion - -This proposal is first filed by [@brendandburns](https://github.com/brendandburns) at [kubernetes/13768](https://github.com/kubernetes/kubernetes/issues/13768): - -* [kubernetes/13768](https://github.com/kubernetes/kubernetes/issues/13768) -* [kubernetes/13709](https://github.com/kubernetes/kubernetes/pull/13079) -* [New container runtime interface](https://github.com/kubernetes/kubernetes/pull/25899) +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
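For the archived client/server runtime proposal above, here is a minimal sketch of what a kubelet-side gRPC call could look like, assuming Go bindings generated from `api.proto`; the `runtimeapi` import path, the socket path, and the request fields are illustrative assumptions rather than a definitive implementation.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	// Assumed import path for the Go bindings generated from api.proto.
	runtimeapi "k8s.io/kubernetes/pkg/kubelet/api/v1alpha1/runtime"
)

func main() {
	// Dial the unix socket that --container-runtime-endpoint would point to.
	conn, err := grpc.Dial("unix:///var/run/remote-runtime.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial remote runtime: %v", err)
	}
	defer conn.Close()

	client := runtimeapi.NewRuntimeServiceClient(conn)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Version is the handshake the kubelet uses to check runtime API compatibility.
	resp, err := client.Version(ctx, &runtimeapi.VersionRequest{})
	if err != nil {
		log.Fatalf("version call failed: %v", err)
	}
	fmt.Printf("runtime %s %s, API %s\n",
		resp.GetRuntimeName(), resp.GetRuntimeVersion(), resp.GetRuntimeApiVersion())
}
```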
\ No newline at end of file diff --git a/contributors/design-proposals/node/runtime-pod-cache.md b/contributors/design-proposals/node/runtime-pod-cache.md index 752741f1..f0fbec72 100644 --- a/contributors/design-proposals/node/runtime-pod-cache.md +++ b/contributors/design-proposals/node/runtime-pod-cache.md @@ -1,169 +1,6 @@ -# Kubelet: Runtime Pod Cache +Design proposals have been archived. -This proposal builds on top of the Pod Lifecycle Event Generator (PLEG) proposed -in [#12802](https://issues.k8s.io/12802). It assumes that Kubelet subscribes to -the pod lifecycle event stream to eliminate periodic polling of pod -states. Please see [#12802](https://issues.k8s.io/12802). for the motivation and -design concept for PLEG. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Runtime pod cache is an in-memory cache which stores the *status* of -all pods, and is maintained by PLEG. It serves as a single source of -truth for internal pod status, freeing Kubelet from querying the -container runtime. - -## Motivation - -With PLEG, Kubelet no longer needs to perform comprehensive state -checking for all pods periodically. It only instructs a pod worker to -start syncing when there is a change of its pod status. Nevertheless, -during each sync, a pod worker still needs to construct the pod status -by examining all containers (whether dead or alive) in the pod, due to -the lack of the caching of previous states. With the integration of -pod cache, we can further improve Kubelet's CPU usage by - - 1. Lowering the number of concurrent requests to the container - runtime since pod workers no longer have to query the runtime - individually. - 2. Lowering the total number of inspect requests because there is no - need to inspect containers with no state changes. - -***Don't we already have a [container runtime cache] -(https://git.k8s.io/kubernetes/pkg/kubelet/container/runtime_cache.go)?*** - -The runtime cache is an optimization that reduces the number of `GetPods()` -calls from the workers. However, - - * The cache does not store all information necessary for a worker to - complete a sync (e.g., `docker inspect`); workers still need to inspect - containers individually to generate `api.PodStatus`. - * Workers sometimes need to bypass the cache in order to retrieve the - latest pod state. - -This proposal generalizes the cache and instructs PLEG to populate the cache, so -that the content is always up-to-date. - -**Why can't each worker cache its own pod status?** - -The short answer is yes, they can. The longer answer is that localized -caching limits the use of the cache content -- other components cannot -access it. This often leads to caching at multiple places and/or passing -objects around, complicating the control flow. - -## Runtime Pod Cache - - - -Pod cache stores the `PodStatus` for all pods on the node. `PodStatus` encompasses -all the information required from the container runtime to generate -`api.PodStatus` for a pod. - -```go -// PodStatus represents the status of the pod and its containers. -// api.PodStatus can be derived from examining PodStatus and api.Pod. -type PodStatus struct { - ID types.UID - Name string - Namespace string - IP string - ContainerStatuses []*ContainerStatus -} - -// ContainerStatus represents the status of a container. 
-type ContainerStatus struct { - ID ContainerID - Name string - State ContainerState - CreatedAt time.Time - StartedAt time.Time - FinishedAt time.Time - ExitCode int - Image string - ImageID string - Hash uint64 - RestartCount int - Reason string - Message string -} -``` - -`PodStatus` is defined in the container runtime interface, hence is -runtime-agnostic. - -PLEG is responsible for updating the entries pod cache, hence always keeping -the cache up-to-date. - -1. Detect change of container state -2. Inspect the pod for details -3. Update the pod cache with the new PodStatus - - If there is no real change of the pod entry, do nothing - - Otherwise, generate and send out the corresponding pod lifecycle event - -Note that in (3), PLEG can check if there is any disparity between the old -and the new pod entry to filter out duplicated events if needed. - -### Evict cache entries - -Note that the cache represents all the pods/containers known by the container -runtime. A cache entry should only be evicted if the pod is no longer visible -by the container runtime. PLEG is responsible for deleting entries in the -cache. - -### Generate `api.PodStatus` - -Because pod cache stores the up-to-date `PodStatus` of the pods, Kubelet can -generate the `api.PodStatus` by interpreting the cache entry at any -time. To avoid sending intermediate status (e.g., while a pod worker -is restarting a container), we will instruct the pod worker to generate a new -status at the beginning of each sync. - -### Cache contention - -Cache contention should not be a problem when the number of pods is -small. When Kubelet scales, we can always shard the pods by ID to -reduce contention. - -### Disk management - -The pod cache is not capable to fulfill the needs of container/image garbage -collectors as they may demand more than pod-level information. These components -will still need to query the container runtime directly at times. We may -consider extending the cache for these use cases, but they are beyond the scope -of this proposal. - - -## Impact on Pod Worker Control Flow - -A pod worker may perform various operations (e.g., start/kill a container) -during a sync. They will expect to see the results of such operations reflected -in the cache in the next sync. Alternately, they can bypass the cache and -query the container runtime directly to get the latest status. However, this -is not desirable since the cache is introduced exactly to eliminate unnecessary, -concurrent queries. Therefore, a pod worker should be blocked until all expected -results have been updated to the cache by PLEG. - -Depending on the type of PLEG (see [#12802](https://issues.k8s.io/12802)) in -use, the methods to check whether a requirement is met can differ. For a -PLEG that solely relies on relisting, a pod worker can simply wait until the -relist timestamp is newer than the end of the worker's last sync. On the other -hand, if pod worker knows what events to expect, they can also block until the -events are observed. - -It should be noted that `api.PodStatus` will only be generated by the pod -worker *after* the cache has been updated. This means that the perceived -responsiveness of Kubelet (from querying the API server) will be affected by -how soon the cache can be populated. For the pure-relisting PLEG, the relist -period can become the bottleneck. 
On the other hand, a PLEG which watches the -upstream event stream (and knows what events to expect) is not restricted -by such periods and should improve Kubelet's perceived responsiveness. - -## TODOs for v1.2 - - - Redefine container runtime types ([#12619](https://issues.k8s.io/12619)) - and introduce `PodStatus`. Refactor dockertools and rkt to use the new type. - - - Add the cache and instruct PLEG to populate it. - - - Refactor Kubelet to use the cache. - - - Deprecate the old runtime cache. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
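For the archived runtime pod cache proposal above, here is a minimal, hedged sketch of the data structure it describes: a PLEG-maintained map from pod ID to the latest `PodStatus` plus a modification timestamp. Type and method names are illustrative; the real design would also need a blocking "entry newer than my last sync" primitive for pod workers.

```go
package cache

import (
	"sync"
	"time"
)

// PodStatus stands in for the runtime-agnostic status type defined in the
// proposal; only the ID matters for this sketch.
type PodStatus struct {
	ID string
	// ... name, namespace, IP, container statuses, etc.
}

type entry struct {
	status   *PodStatus
	modified time.Time // when PLEG last wrote this entry
}

// Cache is a PLEG-maintained, in-memory source of truth for pod statuses.
type Cache struct {
	mu      sync.RWMutex
	entries map[string]entry
}

func New() *Cache {
	return &Cache{entries: make(map[string]entry)}
}

// Set is called by PLEG after it inspects a pod, keeping the entry current.
func (c *Cache) Set(status *PodStatus, now time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[status.ID] = entry{status: status, modified: now}
}

// Get returns the cached status, if any, so pod workers never have to query
// the container runtime directly.
func (c *Cache) Get(uid string) (*PodStatus, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	e, ok := c.entries[uid]
	if !ok {
		return nil, false
	}
	return e.status, true
}

// Delete evicts a pod that the container runtime no longer knows about.
func (c *Cache) Delete(uid string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.entries, uid)
}
```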
\ No newline at end of file diff --git a/contributors/design-proposals/node/seccomp.md b/contributors/design-proposals/node/seccomp.md index c2150b31..f0fbec72 100644 --- a/contributors/design-proposals/node/seccomp.md +++ b/contributors/design-proposals/node/seccomp.md @@ -1,270 +1,6 @@ -## Abstract +Design proposals have been archived. -A proposal for adding **alpha** support for -[seccomp](https://github.com/seccomp/libseccomp) to Kubernetes. Seccomp is a -system call filtering facility in the Linux kernel which lets applications -define limits on system calls they may make, and what should happen when -system calls are made. Seccomp is used to reduce the attack surface available -to applications. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Motivation - -Applications use seccomp to restrict the set of system calls they can make. -Recently, container runtimes have begun adding features to allow the runtime -to interact with seccomp on behalf of the application, which eliminates the -need for applications to link against libseccomp directly. Adding support in -the Kubernetes API for describing seccomp profiles will allow administrators -greater control over the security of workloads running in Kubernetes. - -Goals of this design: - -1. Describe how to reference seccomp profiles in containers that use them - -## Constraints and Assumptions - -This design should: - -* build upon previous security context work -* be container-runtime agnostic -* allow use of custom profiles -* facilitate containerized applications that link directly to libseccomp -* enable a default seccomp profile for containers - -## Use Cases - -1. As an administrator, I want to be able to grant access to a seccomp profile - to a class of users -2. As a user, I want to run an application with a seccomp profile similar to - the default one provided by my container runtime -3. As a user, I want to run an application which is already libseccomp-aware - in a container, and for my application to manage interacting with seccomp - unmediated by Kubernetes -4. As a user, I want to be able to use a custom seccomp profile and use - it with my containers -5. As a user and administrator I want kubernetes to apply a sane default - seccomp profile to containers unless I otherwise specify. - -### Use Case: Administrator access control - -Controlling access to seccomp profiles is a cluster administrator -concern. It should be possible for an administrator to control which users -have access to which profiles. - -The [Pod Security Policy](https://github.com/kubernetes/kubernetes/pull/7893) -API extension governs the ability of users to make requests that affect pod -and container security contexts. The proposed design should deal with -required changes to control access to new functionality. - -### Use Case: Seccomp profiles similar to container runtime defaults - -Many users will want to use images that make assumptions about running in the -context of their chosen container runtime. Such images are likely to -frequently assume that they are running in the context of the container -runtime's default seccomp settings. Therefore, it should be possible to -express a seccomp profile similar to a container runtime's defaults. - -As an example, all dockerhub 'official' images are compatible with the Docker -default seccomp profile. So, any user who wanted to run one of these images -with seccomp would want the default profile to be accessible. 
- -### Use Case: Applications that link to libseccomp - -Some applications already link to libseccomp and control seccomp directly. It -should be possible to run these applications unmodified in Kubernetes; this -implies there should be a way to disable seccomp control in Kubernetes for -certain containers, or to run with a "no-op" or "unconfined" profile. - -Sometimes, applications that link to seccomp can use the default profile for a -container runtime, and restrict further on top of that. It is important to -note here that in this case, applications can only place _further_ -restrictions on themselves. It is not possible to re-grant the ability of a -process to make a system call once it has been removed with seccomp. - -As an example, elasticsearch manages its own seccomp filters in its code. -Currently, elasticsearch is capable of running in the context of the default -Docker profile, but if in the future, elasticsearch needed to be able to call -`ioperm` or `iopr` (both of which are disallowed in the default profile), it -should be possible to run elasticsearch by delegating the seccomp controls to -the pod. - -### Use Case: Custom profiles - -Different applications have different requirements for seccomp profiles; it -should be possible to specify an arbitrary seccomp profile and use it in a -container. This is more of a concern for applications which need a higher -level of privilege than what is granted by the default profile for a cluster, -since applications that want to restrict privileges further can always make -additional calls in their own code. - -An example of an application that requires the use of a syscall disallowed in -the Docker default profile is Chrome, which needs `clone` to create a new user -namespace. Another example would be a program which uses `ptrace` to -implement a sandbox for user-provided code, such as -[eval.in](https://eval.in/). - -## Community Work - -### Docker / OCI - -Docker supports the open container initiative's API for -seccomp, which is very close to the libseccomp API. It allows full -specification of seccomp filters, with arguments, operators, and actions. - -Docker allows the specification of a single seccomp filter. There are -community requests for: - -* [docker/22109](https://github.com/docker/docker/issues/22109): composable - seccomp filters -* [docker/21105](https://github.com/docker/docker/issues/22105): custom - seccomp filters for builds - -Implementation details: - -* [docker/17989](https://github.com/moby/moby/pull/17989): initial - implementation -* [docker/18780](https://github.com/moby/moby/pull/18780): default blacklist - profile -* [docker/18979](https://github.com/moby/moby/pull/18979): default whitelist - profile - -### rkt / appcontainers - -The `rkt` runtime delegates to systemd for seccomp support; there is an open -issue to add support once `appc` supports it. The `appc` project has an open -issue to be able to describe seccomp as an isolator in an appc pod. - -The systemd seccomp facility is based on a whitelist of system calls that can -be made, rather than a full filter specification. - -Issues: - -* [appc/529](https://github.com/appc/spec/issues/529) -* [rkt/1614](https://github.com/coreos/rkt/issues/1614) - -### HyperContainer - -[HyperContainer](https://hypercontainer.io) does not support seccomp. - -### lxd - -[`lxd`](http://www.ubuntu.com/cloud/lxd) constrains containers using a default profile. 
- -Issues: - -* [lxd/1084](https://github.com/lxc/lxd/issues/1084): add knobs for seccomp - -### Other platforms and seccomp-like capabilities - -FreeBSD has a seccomp/capability-like facility called -[Capsicum](https://www.freebsd.org/cgi/man.cgi?query=capsicum&sektion=4). - -## Proposed Design - -### Seccomp API Resource? - -An earlier draft of this proposal described a new global API resource that -could be used to describe seccomp profiles. After some discussion, it was -determined that without a feedback signal from users indicating a need to -describe new profiles in the Kubernetes API, it is not possible to know -whether a new API resource is warranted. - -That being the case, we will not propose a new API resource at this time. If -there is strong community desire for such a resource, we may consider it in -the future. - -Instead of implementing a new API resource, we propose that pods be able to -reference seccomp profiles by name. Since this is an alpha feature, we will -use annotations instead of extending the API with new fields. - -In the alpha version of this feature we will use annotations to store the -names of seccomp profiles. The keys will be: - -`container.seccomp.security.alpha.kubernetes.io/<container name>` - -which will be used to set the seccomp profile of a container, and: - -`seccomp.security.alpha.kubernetes.io/pod` - -which will set the seccomp profile for the containers of an entire pod. If a -pod-level annotation is present, and a container-level annotation present for -a container, then the container-level profile takes precedence. - -The value of these keys should be container-runtime agnostic. We will -establish a format that expresses the conventions for distinguishing between -an unconfined profile, the container runtime's default, or a custom profile. -Since format of profile is likely to be runtime dependent, we will consider -profiles to be opaque to kubernetes for now. - -The following format is scoped as follows: - -1. `runtime/default` - the default profile for the container runtime, can be - overwritten by the following two. -2. `unconfined` - unconfined profile, ie, no seccomp sandboxing -3. `localhost/<profile-name>` - the profile installed to the node's local seccomp profile root - -Since seccomp profile schemes may vary between container runtimes, we will -treat the contents of profiles as opaque for now and avoid attempting to find -a common way to describe them. It is up to the container runtime to be -sensitive to the annotations proposed here and to interpret instructions about -local profiles. - -A new area on disk (which we will call the seccomp profile root) must be -established to hold seccomp profiles. A field will be added to the Kubelet -for the seccomp profile root and a knob (`--seccomp-profile-root`) exposed to -allow admins to set it. If unset, it should default to the `seccomp` -subdirectory of the kubelet root directory. - -### Pod Security Policy annotation - -The `PodSecurityPolicy` type should be annotated with the allowed seccomp -profiles using the key -`seccomp.security.alpha.kubernetes.io/allowedProfileNames`. The value of this -key should be a comma delimited list. 
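Since the examples below only cover the pod- and container-level keys, here is a small illustrative snippet (the profile names are placeholders) showing the `PodSecurityPolicy` allow-list annotation described in the previous paragraph as a Go map literal:

```go
package main

import "fmt"

func main() {
	// Comma-delimited allow-list of seccomp profiles on a PodSecurityPolicy,
	// using the annotation key proposed above. Profile names are placeholders.
	pspAnnotations := map[string]string{
		"seccomp.security.alpha.kubernetes.io/allowedProfileNames": "runtime/default,localhost/example-explorer-profile",
	}
	fmt.Println(pspAnnotations)
}
```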
- -## Examples - -### Unconfined profile - -Here's an example of a pod that uses the unconfined profile: - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: trustworthy-pod - annotations: - seccomp.security.alpha.kubernetes.io/pod: unconfined -spec: - containers: - - name: trustworthy-container - image: sotrustworthy:latest -``` - -### Custom profile - -Here's an example of a pod that uses a profile called `example-explorer- -profile` using the container-level annotation: - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: explorer - annotations: - container.seccomp.security.alpha.kubernetes.io/explorer: localhost/example-explorer-profile -spec: - containers: - - name: explorer - image: k8s.gcr.io/explorer:1.0 - args: ["-port=8080"] - ports: - - containerPort: 8080 - protocol: TCP - volumeMounts: - - mountPath: "/mount/test-volume" - name: test-volume - volumes: - - name: test-volume - emptyDir: {} -``` +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/node/secret-configmap-downwardapi-file-mode.md b/contributors/design-proposals/node/secret-configmap-downwardapi-file-mode.md index 1d5bd7b7..f0fbec72 100644 --- a/contributors/design-proposals/node/secret-configmap-downwardapi-file-mode.md +++ b/contributors/design-proposals/node/secret-configmap-downwardapi-file-mode.md @@ -1,182 +1,6 @@ -# Secrets, configmaps and downwardAPI file mode bits +Design proposals have been archived. -Author: Rodrigo Campos (@rata), Tim Hockin (@thockin) +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Date: July 2016 - -Status: Design in progress - -# Goal - -Allow users to specify permission mode bits for a secret/configmap/downwardAPI -file mounted as a volume. For example, if a secret has several keys, a user -should be able to specify the permission mode bits for any file, and they may -all have different modes. - -Let me say that with "permission" I only refer to the file mode here and I may -use them interchangeably. This is not about the file owners, although let me -know if you prefer to discuss that here too. - - -# Motivation - -There is currently no way to set permissions on secret files mounted as volumes. -This can be a problem for applications that enforce files to have permissions -only for the owner (like fetchmail, ssh, pgpass file in postgres[1], etc.) and -it's just not possible to run them without changing the file mode. Also, -in-house applications may have this restriction too. - -It doesn't seem totally wrong if someone wants to make a secret, that is -sensitive information, not world-readable (or group, too) as it is by default. -Although it's already in a container that is (hopefully) running only one -process and it might not be so bad. But people running more than one process in -a container asked for this too[2]. - -For example, my use case is that we are migrating to kubernetes, the migration -is in progress (and will take a while) and we have migrated our deployment web -interface to kubernetes. But this interface connects to the servers via ssh, so -it needs the ssh keys, and ssh will only work if the ssh key file mode is the -one it expects. - -This was asked on the mailing list here[2] and here[3], too. - -[1]: https://www.postgresql.org/docs/9.1/static/libpq-pgpass.html -[2]: https://groups.google.com/forum/#!topic/kubernetes-dev/eTnfMJSqmaM -[3]: https://groups.google.com/forum/#!topic/google-containers/EcaOPq4M758 - -# Alternatives considered - -Several alternatives have been considered: - - * Add a mode to the API definition when using secrets: this is backward - compatible as described [here](/contributors/devel/sig-architecture/api_changes.md) IIUC and seems like the - way to go. Also @thockin said in the ML that he would consider such an - approach. But it might be worth to consider if we want to do the same for - configmaps or owners, but there is no need to do it now either. - - * Change the default file mode for secrets: I think this is unacceptable as it - is stated in the api_changes doc. And besides it doesn't feel correct IMHO, it - is technically one option. The argument for this might be that world and group - readable for a secret is not a nice default, we already take care of not - writing it to disk, etc. but the file is created world-readable anyways. 
Such a - default change has been done recently: the default was 0444 in kubernetes <= 1.2 - and is now 0644 in kubernetes >= 1.3 (and the file is not a regular file, - it's a symlink now). This change was done here to minimize differences between - configmaps and secrets: https://github.com/kubernetes/kubernetes/pull/25285. But - doing it again, and changing to something more restrictive (now is 0644 and it - should be 0400 to work with ssh and most apps) seems too risky, it's even more - restrictive than in k8s 1.2. Specially if there is no way to revert to the old - permissions and some use case is broken by this. And if we are adding a way to - change it, like in the option above, there is no need to rush changing the - default. So I would discard this. - - * We don't want to people be able to change this, at least for now, and the - ones who do, suggest that do it as a "postStart" command. This is acceptable - if we don't want to change kubernetes core for some reason, although there - seem to be valid use cases. But if the user want's to use the "postStart" for - something else, then it is more disturbing to do both things (have a script - in the docker image that deals with this, but is not probably concern of the - project so it's not nice, or specify several commands by using "sh"). - -# Proposed implementation - -The proposed implementation goes with the first alternative: adding a `mode` -to the API. - -There will be a `defaultMode`, type `int`, in: `type SecretVolumeSource`, `type -ConfigMapVolumeSource` and `type DownwardAPIVolumeSource`. And a `mode`, type -`int` too, in `type KeyToPath` and `DownwardAPIVolumeFile`. - -The mask provided in any of these fields will be ANDed with 0777 to disallow -setting sticky and setuid bits. It's not clear that use case is needed nor -really understood. And directories within the volume will be created as before -and are not affected by this setting. - -In other words, the fields will look like this: - -```go -type SecretVolumeSource struct { - // Name of the secret in the pod's namespace to use. - SecretName string `json:"secretName,omitempty"` - // If unspecified, each key-value pair in the Data field of the referenced - // Secret will be projected into the volume as a file whose name is the - // key and content is the value. If specified, the listed keys will be - // projected into the specified paths, and unlisted keys will not be - // present. If a key is specified which is not present in the Secret, - // the volume setup will error. Paths must be relative and may not contain - // the '..' path or start with '..'. - Items []KeyToPath `json:"items,omitempty"` - // Mode bits to use on created files by default. The used mode bits will - // be the provided AND 0777. - // Directories within the path are not affected by this setting - DefaultMode int32 `json:"defaultMode,omitempty"` -} - -type ConfigMapVolumeSource struct { - LocalObjectReference `json:",inline"` - // If unspecified, each key-value pair in the Data field of the referenced - // ConfigMap will be projected into the volume as a file whose name is the - // key and content is the value. If specified, the listed keys will be - // projected into the specified paths, and unlisted keys will not be - // present. If a key is specified which is not present in the ConfigMap, - // the volume setup will error. Paths must be relative and may not contain - // the '..' path or start with '..'. - Items []KeyToPath `json:"items,omitempty"` - // Mode bits to use on created files by default. 
The used mode bits will - // be the provided AND 0777. - // Directories within the path are not affected by this setting - DefaultMode int32 `json:"defaultMode,omitempty"` -} - -type KeyToPath struct { - // The key to project. - Key string `json:"key"` - - // The relative path of the file to map the key to. - // May not be an absolute path. - // May not contain the path element '..'. - // May not start with the string '..'. - Path string `json:"path"` - // Mode bits to use on this file. The used mode bits will be the - // provided AND 0777. - Mode int32 `json:"mode,omitempty"` -} - -type DownwardAPIVolumeSource struct { - // Items is a list of DownwardAPIVolume file - Items []DownwardAPIVolumeFile `json:"items,omitempty"` - // Mode bits to use on created files by default. The used mode bits will - // be the provided AND 0777. - // Directories within the path are not affected by this setting - DefaultMode int32 `json:"defaultMode,omitempty"` -} - -type DownwardAPIVolumeFile struct { - // Required: Path is the relative path name of the file to be created. Must not be absolute or contain the '..' path. Must be utf-8 encoded. The first item of the relative path must not start with '..' - Path string `json:"path"` - // Required: Selects a field of the pod: only annotations, labels, name and namespace are supported. - FieldRef *ObjectFieldSelector `json:"fieldRef,omitempty"` - // Selects a resource of the container: only resources limits and requests - // (limits.cpu, limits.memory, requests.cpu and requests.memory) are currently supported. - ResourceFieldRef *ResourceFieldSelector `json:"resourceFieldRef,omitempty"` - // Mode bits to use on this file. The used mode bits will be the - // provided AND 0777. - Mode int32 `json:"mode,omitempty"` -} -``` - -Adding it there allows the user to change the mode bits of every file in the -object, so it achieves the goal, while having the option to have a default and -not specify all files in the object. - -There are two downsides: - - * The files are symlinks pointint to the real file, and the realfile - permissions are only set. The symlink has the classic symlink permissions. - This is something already present in 1.3, and it seems applications like ssh - work just fine with that. Something worth mentioning, but doesn't seem to be - an issue. - * If the secret/configMap/downwardAPI is mounted in more than one container, - the file permissions will be the same on all. This is already the case for - Key mappings and doesn't seem like a big issue either. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
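To show how the proposed fields compose, here is a short illustrative snippet using trimmed-down copies of the types defined above; the secret name, keys, and modes are placeholders, not a definitive usage.

```go
package main

import "fmt"

// Trimmed-down copies of the types proposed above, just enough to illustrate
// how defaultMode and a per-key mode compose.
type KeyToPath struct {
	Key  string
	Path string
	Mode int32
}

type SecretVolumeSource struct {
	SecretName  string
	Items       []KeyToPath
	DefaultMode int32
}

func main() {
	// An ssh key secret mounted owner-read-only by default, with one per-file
	// override. Mode values are ANDed with 0777, so setuid/setgid/sticky bits
	// can never be requested this way.
	src := SecretVolumeSource{
		SecretName:  "deploy-ssh-keys", // placeholder secret name
		DefaultMode: 0400,
		Items: []KeyToPath{
			{Key: "id_rsa", Path: "id_rsa"},                       // inherits DefaultMode (0400)
			{Key: "known_hosts", Path: "known_hosts", Mode: 0444}, // per-file override
		},
	}
	fmt.Printf("%+v\n", src)
}
```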
\ No newline at end of file diff --git a/contributors/design-proposals/node/selinux-enhancements.md b/contributors/design-proposals/node/selinux-enhancements.md index aec5533b..f0fbec72 100644 --- a/contributors/design-proposals/node/selinux-enhancements.md +++ b/contributors/design-proposals/node/selinux-enhancements.md @@ -1,205 +1,6 @@ -## Abstract +Design proposals have been archived. -Presents a proposal for enhancing the security of Kubernetes clusters using -SELinux and simplifying the implementation of SELinux support within the -Kubelet by removing the need to label the Kubelet directory with an SELinux -context usable from a container. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Motivation -The current Kubernetes codebase relies upon the Kubelet directory being -labeled with an SELinux context usable from a container. This means that a -container escaping namespace isolation will be able to use any file within the -Kubelet directory without defeating kernel -[MAC (mandatory access control)](https://en.wikipedia.org/wiki/Mandatory_access_control). -In order to limit the attack surface, we should enhance the Kubelet to relabel -any bind-mounts into containers into a usable SELinux context without depending -on the Kubelet directory's SELinux context. - -## Constraints and Assumptions - -1. No API changes allowed -2. Behavior must be fully backward compatible -3. No new admission controllers - make incremental improvements without huge - refactorings - -## Use Cases - -1. As a cluster operator, I want to avoid having to label the Kubelet - directory with a label usable from a container, so that I can limit the - attack surface available to a container escaping its namespace isolation -2. As a user, I want to run a pod without an SELinux context explicitly - specified and be isolated using MCS (multi-category security) on systems - where SELinux is enabled, so that the pods on each host are isolated from - one another -3. As a user, I want to run a pod that uses the host IPC or PID namespace and - want the system to do the right thing with regard to SELinux, so that no - unnecessary relabel actions are performed - -### Labeling the Kubelet directory - -As previously stated, the current codebase relies on the Kubelet directory -being labeled with an SELinux context usable from a container. The Kubelet -uses the SELinux context of this directory to determine what SELinux context -`tmpfs` mounts (provided by the EmptyDir memory-medium option) should receive. -The problem with this is that it opens an attack surface to a container that -escapes its namespace isolation; such a container would be able to use any -file in the Kubelet directory without defeating kernel MAC. - -### SELinux when no context is specified - -When no SELinux context is specified, Kubernetes should just do the right -thing, where doing the right thing is defined as isolating pods with a node- -unique set of categories. Node-uniqueness means unique among the pods -scheduled onto the node. Long-term, we want to have a cluster-wide allocator -for MCS labels. Node-unique MCS labels are a good middle ground that is -possible without a new, large, feature. - -### SELinux and host IPC and PID namespaces - -Containers in pods that use the host IPC or PID namespaces need access to -other processes and IPC mechanisms on the host. 
Therefore, these containers -should be run with the `spc_t` SELinux type by the container runtime. The -`spc_t` type is an unconfined type that other SELinux domains are allowed to -connect to. In the case where a pod uses one of these host namespaces, it -should be unnecessary to relabel the pod's volumes. - -## Analysis - -### Libcontainer SELinux library - -Docker and rkt both use the libcontainer SELinux library. This library -provides a method, `GetLxcContexts`, that returns the a unique SELinux -contexts for container processes and files used by them. `GetLxcContexts` -reads the base SELinux context information from a file at `/etc/selinux/<policy- -name>/contexts/lxc_contexts` and then adds a process-unique MCS label. - -Docker and rkt both leverage this call to determine the 'starting' SELinux -contexts for containers. - -### Docker - -Docker's behavior when no SELinux context is defined for a container is to -give the container a node-unique MCS label. - -#### Sharing IPC namespaces - -On the Docker runtime, the containers in a Kubernetes pod share the IPC and -PID namespaces of the pod's infra container. - -Docker's behavior for containers sharing these namespaces is as follows: if a -container B shares the IPC namespace of another container A, container B is -given the SELinux context of container A. Therefore, for Kubernetes pods -running on docker, in a vacuum the containers in a pod should have the same -SELinux context. - -[**Known issue**](https://bugzilla.redhat.com/show_bug.cgi?id=1377869): When -the seccomp profile is set on a docker container that shares the IPC namespace -of another container, that container will not receive the other container's -SELinux context. - -#### Host IPC and PID namespaces - -In the case of a pod that shares the host IPC or PID namespace, this flag is -simply ignored and the container receives the `spc_t` SELinux type. The -`spc_t` type is unconfined, and so no relabeling needs to be done for volumes -for these pods. Currently, however, there is code which relabels volumes into -explicitly specified SELinux contexts for these pods. This code is unnecessary -and should be removed. - -#### Relabeling bind-mounts - -Docker is capable of relabeling bind-mounts into containers using the `:Z` -bind-mount flag. However, in the current implementation of the docker runtime -in Kubernetes, the `:Z` option is only applied when the pod's SecurityContext -contains an SELinux context. We could easily implement the correct behaviors -by always setting `:Z` on systems where SELinux is enabled. - -### rkt - -rkt's behavior when no SELinux context is defined for a pod is similar to -Docker's -- an SELinux context with a node-unique MCS label is given to the -containers of a pod. - -#### Sharing IPC namespaces - -Containers (apps, in rkt terminology) in rkt pods share an IPC and PID -namespace by default. - -#### Relabeling bind-mounts - -Bind-mounts into rkt pods are automatically relabeled into the pod's SELinux -context. - -#### Host IPC and PID namespaces - -Using the host IPC and PID namespaces is not currently supported by rkt. - -## Proposed Changes - -### Refactor `pkg/util/selinux` - -1. The `selinux` package should provide a method `SELinuxEnabled` that returns - whether SELinux is enabled, and is built for all platforms (the - libcontainer SELinux is only built on linux) -2. 
The `SelinuxContextRunner` interface should be renamed to `SELinuxRunner` - and be changed to have the same method names and signatures as the - libcontainer methods its implementations wrap -3. The `SELinuxRunner` interface only needs `Getfilecon`, which is used by - the rkt code - -```go -package selinux - -// Note: the libcontainer SELinux package is only built for Linux, so it is -// necessary to have a NOP wrapper which is built for non-Linux platforms to -// allow code that links to this package not to differentiate its own methods -// for Linux and non-Linux platforms. -// -// SELinuxRunner wraps certain libcontainer SELinux calls. For more -// information, see: -// -// https://github.com/opencontainers/runc/blob/master/libcontainer/selinux/selinux.go -type SELinuxRunner interface { - // Getfilecon returns the SELinux context for the given path or returns an - // error. - Getfilecon(path string) (string, error) -} -``` - -### Kubelet Changes - -1. The `relabelVolumes` method in `kubelet_volumes.go` is not needed and can - be removed -2. The `GenerateRunContainerOptions` method in `kubelet_pods.go` should no - longer call `relabelVolumes` -3. The `makeHostsMount` method in `kubelet_pods.go` should set the - `SELinuxRelabel` attribute of the mount for the pod's hosts file to `true` - -### Changes to `pkg/kubelet/dockertools/` - -1. The `makeMountBindings` should be changed to: - 1. No longer accept the `podHasSELinuxLabel` parameter - 2. Always use the `:Z` bind-mount flag when SELinux is enabled and the mount - has the `SELinuxRelabel` attribute set to `true` -2. The `runContainer` method should be changed to always use the `:Z` - bind-mount flag on the termination message mount when SELinux is enabled - -### Changes to `pkg/kubelet/rkt` - -The should not be any required changes for the rkt runtime; we should test to -ensure things work as expected under rkt. - -### Changes to volume plugins and infrastructure - -1. The `VolumeHost` interface contains a method called `GetRootContext`; this - is an artifact of the old assumptions about the Kubelet directory's SELinux - context and can be removed -2. The `empty_dir.go` file should be changed to be completely agnostic of - SELinux; no behavior in this plugin needs to be differentiated when SELinux - is enabled - -### Changes to `pkg/controller/...` - -The `VolumeHost` abstraction is used in a couple of PV controllers as NOP -implementations. These should be altered to no longer include `GetRootContext`. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
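The dockertools change described in the deleted proposal above amounts to making the `:Z` relabel flag depend only on SELinux being enabled and the mount opting into relabeling, rather than on the pod declaring an explicit SELinux context. A minimal sketch of that decision, using assumed names rather than the actual kubelet helpers:

```go
package dockertools

// makeBindLabel illustrates the proposed behavior: on SELinux-enabled hosts,
// any bind-mount marked for relabeling gets the ":Z" flag, independent of
// whether the pod specifies an explicit SELinux context. The function name
// and signature are assumptions of this sketch, not the real kubelet code.
func makeBindLabel(hostPath, containerPath string, selinuxEnabled, relabel bool) string {
	bind := hostPath + ":" + containerPath
	if selinuxEnabled && relabel {
		// ":Z" asks Docker to relabel the mount with the container's
		// process-unique MCS label.
		bind += ":Z"
	}
	return bind
}
```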
\ No newline at end of file diff --git a/contributors/design-proposals/node/selinux.md b/contributors/design-proposals/node/selinux.md index 6cde471d..f0fbec72 100644 --- a/contributors/design-proposals/node/selinux.md +++ b/contributors/design-proposals/node/selinux.md @@ -1,312 +1,6 @@ -## Abstract +Design proposals have been archived. -A proposal for enabling containers in a pod to share volumes using a pod level SELinux context. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Motivation - -Many users have a requirement to run pods on systems that have SELinux enabled. Volume plugin -authors should not have to explicitly account for SELinux except for volume types that require -special handling of the SELinux context during setup. - -Currently, each container in a pod has an SELinux context. This is not an ideal factoring for -sharing resources using SELinux. - -We propose a pod-level SELinux context and a mechanism to support SELinux labeling of volumes in a -generic way. - -Goals of this design: - -1. Describe the problems with a container SELinux context -2. Articulate a design for generic SELinux support for volumes using a pod level SELinux context - which is backward compatible with the v1.0.0 API - -## Constraints and Assumptions - -1. We will not support securing containers within a pod from one another -2. Volume plugins should not have to handle setting SELinux context on volumes -3. We will not deal with shared storage - -## Current State Overview - -### Docker - -Docker uses a base SELinux context and calculates a unique MCS label per container. The SELinux -context of a container can be overridden with the `SecurityOpt` api that allows setting the different -parts of the SELinux context individually. - -Docker has functionality to relabel bind-mounts with a usable SElinux and supports two different -use-cases: - -1. The `:Z` bind-mount flag, which tells Docker to relabel a bind-mount with the container's - SELinux context -2. The `:z` bind-mount flag, which tells Docker to relabel a bind-mount with the container's - SElinux context, but remove the MCS labels, making the volume shareable between containers - -We should avoid using the `:z` flag, because it relaxes the SELinux context so that any container -(from an SELinux standpoint) can use the volume. - -### rkt - -rkt currently reads the base SELinux context to use from `/etc/selinux/*/contexts/lxc_contexts` -and allocates a unique MCS label per pod. - -### Kubernetes - - -There is a [proposed change](https://github.com/kubernetes/kubernetes/pull/9844) to the -EmptyDir plugin that adds SELinux relabeling capabilities to that plugin, which is also carried as a -patch in [OpenShift](https://github.com/openshift/origin). It is preferable to solve the problem -in general of handling SELinux in kubernetes to merging this PR. - -A new `PodSecurityContext` type has been added that carries information about security attributes -that apply to the entire pod and that apply to all containers in a pod. See: - -1. [Skeletal implementation](https://github.com/kubernetes/kubernetes/pull/13939) -1. [Proposal for inlining container security fields](https://github.com/kubernetes/kubernetes/pull/12823) - -## Use Cases - -1. As a cluster operator, I want to support securing pods from one another using SELinux when - SELinux integration is enabled in the cluster -2. 
As a user, I want volumes sharing to work correctly amongst containers in pods - -#### SELinux context: pod- or container- level? - -Currently, SELinux context is specifiable only at the container level. This is an inconvenient -factoring for sharing volumes and other SELinux-secured resources between containers because there -is no way in SELinux to share resources between processes with different MCS labels except to -remove MCS labels from the shared resource. This is a big security risk: _any container_ in the -system can work with a resource which has the same SELinux context as it and no MCS labels. Since -we are also not interested in isolating containers in a pod from one another, the SELinux context -should be shared by all containers in a pod to facilitate isolation from the containers in other -pods and sharing resources amongst all the containers of a pod. - -#### Volumes - -Kubernetes volumes can be divided into two broad categories: - -1. Unshared storage: - 1. Volumes created by the kubelet on the host directory: empty directory, git repo, secret, - downward api. All volumes in this category delegate to `EmptyDir` for their underlying - storage. - 2. Volumes based on network block devices: AWS EBS, iSCSI, RBD, etc, *when used exclusively - by a single pod*. -2. Shared storage: - 1. `hostPath` is shared storage because it is necessarily used by a container and the host - 2. Network file systems such as NFS, Glusterfs, Cephfs, etc. - 3. Block device based volumes in `ReadOnlyMany` or `ReadWriteMany` modes are shared because - they may be used simultaneously by multiple pods. - -For unshared storage, SELinux handling for most volumes can be generalized into running a `chcon` -operation on the volume directory after running the volume plugin's `Setup` function. For these -volumes, the Kubelet can perform the `chcon` operation and keep SELinux concerns out of the volume -plugin code. Some volume plugins may need to use the SELinux context during a mount operation in -certain cases. To account for this, our design must have a way for volume plugins to state that -a particular volume should or should not receive generic label management. - -For shared storage, the picture is murkier. Labels for existing shared storage will be managed -outside Kubernetes and administrators will have to set the SELinux context of pods correctly. -The problem of solving SELinux label management for new shared storage is outside the scope for -this proposal. - -## Analysis - -The system needs to be able to: - -1. Model correctly which volumes require SELinux label management -1. Relabel volumes with the correct SELinux context when required - -### Modeling whether a volume requires label management - -#### Unshared storage: volumes derived from `EmptyDir` - -Empty dir and volumes derived from it are created by the system, so Kubernetes must always ensure -that the ownership and SELinux context (when relevant) are set correctly for the volume to be -usable. - -#### Unshared storage: network block devices - -Volume plugins based on network block devices such as AWS EBS and RBS can be treated the same way -as local volumes. Since inodes are written to these block devices in the same way as `EmptyDir` -volumes, permissions and ownership can be managed on the client side by the Kubelet when used -exclusively by one pod. When the volumes are used outside of a persistent volume, or with the -`ReadWriteOnce` mode, they are effectively unshared storage. 
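For the unshared-storage cases above, generic label management reduces to relabeling the volume directory with the pod's SELinux context after the plugin's `Setup` succeeds. A hedged sketch of that step, shelling out to `chcon` purely for illustration (a real implementation would call an SELinux library instead):

```go
package volume

import "os/exec"

// relabelVolume recursively applies the pod-level SELinux context to a volume
// directory after Setup. The function name and the chcon shell-out are
// assumptions of this sketch, not the proposed kubelet implementation.
func relabelVolume(volumePath, seLinuxContext string) error {
	// chcon -R <context> <path> recursively sets the SELinux context.
	return exec.Command("chcon", "-R", seLinuxContext, volumePath).Run()
}
```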
- -When used by multiple pods, there are many additional use-cases to analyze before we can be -confident that we can support SELinux label management robustly with these file systems. The right -design is one that makes it easy to experiment and develop support for ownership management with -volume plugins to enable developers and cluster operators to continue exploring these issues. - -#### Shared storage: hostPath - -The `hostPath` volume should only be used by effective-root users, and the permissions of paths -exposed into containers via hostPath volumes should always be managed by the cluster operator. If -the Kubelet managed the SELinux labels for `hostPath` volumes, a user who could create a `hostPath` -volume could affect changes in the state of arbitrary paths within the host's filesystem. This -would be a severe security risk, so we will consider hostPath a corner case that the kubelet should -never perform ownership management for. - -#### Shared storage: network - -Ownership management of shared storage is a complex topic. SELinux labels for existing shared -storage will be managed externally from Kubernetes. For this case, our API should make it simple to -express whether a particular volume should have these concerns managed by Kubernetes. - -We will not attempt to address the concerns of new shared storage in this proposal. - -When a network block device is used as a persistent volume in `ReadWriteMany` or `ReadOnlyMany` -modes, it is shared storage, and thus outside the scope of this proposal. - -#### API requirements - -From the above, we know that label management must be applied: - -1. To some volume types always -2. To some volume types never -3. To some volume types *sometimes* - -Volumes should be relabeled with the correct SELinux context. Docker has this capability today; it -is desirable for other container runtime implementations to provide similar functionality. - -Relabeling should be an optional aspect of a volume plugin to accommodate: - -1. volume types for which generalized relabeling support is not sufficient -2. testing for each volume plugin individually - -## Proposed Design - -Our design should minimize code for handling SELinux labelling required in the Kubelet and volume -plugins. - -### Deferral: MCS label allocation - -Our short-term goal is to facilitate volume sharing and isolation with SELinux and expose the -primitives for higher level composition; making these automatic is a longer-term goal. Allocating -groups and MCS labels are fairly complex problems in their own right, and so our proposal will not -encompass either of these topics. There are several problems that the solution for allocation -depends on: - -1. Users and groups in Kubernetes -2. General auth policy in Kubernetes -3. [security policy](https://github.com/kubernetes/kubernetes/pull/7893) - -### API changes - -The [inline container security attributes PR (12823)](https://github.com/kubernetes/kubernetes/pull/12823) -adds a `pod.Spec.SecurityContext.SELinuxOptions` field. The change to the API in this proposal is -the addition of the semantics to this field: - -* When the `pod.Spec.SecurityContext.SELinuxOptions` field is set, volumes that support ownership -management in the Kubelet have their SELinuxContext set from this field. - -```go -package api - -type PodSecurityContext struct { - // SELinuxOptions captures the SELinux context for all containers in a Pod. If a container's - // SecurityContext.SELinuxOptions field is set, that setting has precedent for that container. 
- // - // This field will be used to set the SELinux of volumes that support SELinux label management - // by the kubelet. - SELinuxOptions *SELinuxOptions `json:"seLinuxOptions,omitempty"` -} -``` - -The V1 API is extended with the same semantics: - -```go -package v1 - -type PodSecurityContext struct { - // SELinuxOptions captures the SELinux context for all containers in a Pod. If a container's - // SecurityContext.SELinuxOptions field is set, that setting has precedent for that container. - // - // This field will be used to set the SELinux of volumes that support SELinux label management - // by the kubelet. - SELinuxOptions *SELinuxOptions `json:"seLinuxOptions,omitempty"` -} -``` - -#### API backward compatibility - -Old pods that do not have the `pod.Spec.SecurityContext.SELinuxOptions` field set will not receive -SELinux label management for their volumes. This is acceptable since old clients won't know about -this field and won't have any expectation of their volumes being managed this way. - -The existing backward compatibility semantics for SELinux do not change at all with this proposal. - -### Kubelet changes - -The Kubelet should be modified to perform SELinux label management when required for a volume. The -criteria to activate the kubelet SELinux label management for volumes are: - -1. SELinux integration is enabled in the cluster -2. SELinux is enabled on the node -3. The `pod.Spec.SecurityContext.SELinuxOptions` field is set -4. The volume plugin supports SELinux label management - -The `volume.Mounter` interface should have a new method added that indicates whether the plugin -supports SELinux label management: - -```go -package volume - -type Builder interface { - // other methods omitted - SupportsSELinux() bool -} -``` - -Individual volume plugins are responsible for correctly reporting whether they support label -management in the kubelet. In the first round of work, only `hostPath` and `emptyDir` and its -derivations will be tested with ownership management support: - -| Plugin Name | SupportsOwnershipManagement | -|-------------------------|-------------------------------| -| `hostPath` | false | -| `emptyDir` | true | -| `gitRepo` | true | -| `secret` | true | -| `downwardAPI` | true | -| `gcePersistentDisk` | false | -| `awsElasticBlockStore` | false | -| `nfs` | false | -| `iscsi` | false | -| `glusterfs` | false | -| `persistentVolumeClaim` | depends on underlying volume and PV mode | -| `rbd` | false | -| `cinder` | false | -| `cephfs` | false | - -Ultimately, the matrix will theoretically look like: - -| Plugin Name | SupportsOwnershipManagement | -|-------------------------|-------------------------------| -| `hostPath` | false | -| `emptyDir` | true | -| `gitRepo` | true | -| `secret` | true | -| `downwardAPI` | true | -| `gcePersistentDisk` | true | -| `awsElasticBlockStore` | true | -| `nfs` | false | -| `iscsi` | true | -| `glusterfs` | false | -| `persistentVolumeClaim` | depends on underlying volume and PV mode | -| `rbd` | true | -| `cinder` | false | -| `cephfs` | false | - -In order to limit the amount of SELinux label management code in Kubernetes, we propose that it be a -function of the container runtime implementations. Initially, we will modify the docker runtime -implementation to correctly set the `:Z` flag on the appropriate bind-mounts in order to accomplish -generic label management for docker containers. 
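Taken together, the four activation criteria listed above can be restated as a single predicate. The following is a sketch under assumed names and with simplified stand-in types, not the actual kubelet code:

```go
package kubelet

// Minimal stand-ins for the real API types, for illustration only.
type seLinuxOptions struct {
	User, Role, Type, Level string
}

type seLinuxAwareMounter interface {
	SupportsSELinux() bool
}

// shouldManageSELinuxLabels returns true only when all four criteria hold:
// cluster-level SELinux integration, SELinux enabled on the node, an SELinux
// context set on the pod, and a volume plugin that opts into label management.
func shouldManageSELinuxLabels(clusterIntegrationEnabled, nodeSELinuxEnabled bool,
	podOpts *seLinuxOptions, mounter seLinuxAwareMounter) bool {
	return clusterIntegrationEnabled &&
		nodeSELinuxEnabled &&
		podOpts != nil &&
		mounter.SupportsSELinux()
}
```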
- -Volume types that require SELinux context information at mount must be injected with and respect the -enablement setting for the labeling for the volume type. The proposed `VolumeConfig` mechanism -will be used to carry information about label management enablement to the volume plugins that have -to manage labels individually. - -This allows the volume plugins to determine when they do and don't want this type of support from -the Kubelet, and allows the criteria each plugin uses to evolve without changing the Kubelet. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
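The `VolumeConfig` mechanism mentioned at the end of the deleted proposal would let plugins that must handle SELinux at mount time decide for themselves what they report to the kubelet. A hedged sketch with assumed type and field names:

```go
package volume

// exampleVolumeConfig carries the label-management enablement flag to a
// plugin; the type and field names are assumptions of this sketch.
type exampleVolumeConfig struct {
	ManageSELinuxLabels bool
}

type examplePlugin struct {
	config exampleVolumeConfig
}

// SupportsSELinux lets the plugin opt in or out of generic label management
// by the kubelet based on its injected configuration.
func (p *examplePlugin) SupportsSELinux() bool {
	return p.config.ManageSELinuxLabels
}
```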
\ No newline at end of file diff --git a/contributors/design-proposals/node/sysctl.md b/contributors/design-proposals/node/sysctl.md index 7e397960..f0fbec72 100644 --- a/contributors/design-proposals/node/sysctl.md +++ b/contributors/design-proposals/node/sysctl.md @@ -1,705 +1,6 @@ -# Setting Sysctls on the Pod Level +Design proposals have been archived. -This proposal aims at extending the current pod specification with support -for namespaced kernel parameters (sysctls) set for each pod. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Roadmap -### v1.4 - -- [x] initial implementation for v1.4 https://github.com/kubernetes/kubernetes/pull/27180 - + node-level whitelist for safe sysctls: `kernel.shm_rmid_forced`, `net.ipv4.ip_local_port_range`, `net.ipv4.tcp_max_syn_backlog`, `net.ipv4.tcp_syncookies` - + (disabled by-default) unsafe sysctls: `kernel.msg*`, `kernel.sem`, `kernel.shm*`, `fs.mqueue.*`, `net.*` - + new kubelet flag: `--experimental-allowed-unsafe-sysctls` - + PSP default: `*` -- [x] document node-level whitelist with kubectl flags and taints/tolerations -- [x] document host-level sysctls with daemon sets + taints/tolerations -- in parallel: kernel upstream patches to fix ipc accounting for 4.5+ - + [x] submitted to mainline - + [x] merged into mainline, compare https://github.com/torvalds/linux/commit/8c8d4d45204902e144abc0f15b7c658828028fa1 - -### v1.5+ - -- pre-requisites for `kernel.sem`, `kernel.msg*`, `fs.mqueue.*` on the node-level whitelist - + [x] pod cgroups active by default (compare [Pod Resource Management](pod-resource-management.md#implementation-status)) - + [ ] kmem accounting active by default - + [x] kernel patches for 4.5+ (merged since 4.9) -- reconsider what to do with `kernel.shm*` and other resource-limit sysctls with proper isolation: (a) keep them in the API (b) set node-level defaults - -## Table of Contents - - -- [Setting Sysctls on the Pod Level](#setting-sysctls-on-the-pod-level) - - [Roadmap](#roadmap) - - [v1.4](#v14) - - [v1.5](#v15) - - [Table of Contents](#table-of-contents) - - [Abstract](#abstract) - - [Motivation](#motivation) - - [Abstract Use Cases](#abstract-use-cases) - - [Constraints and Assumptions](#constraints-and-assumptions) - - [Further work (out of scope for this proposal)](#further-work-out-of-scope-for-this-proposal) - - [Community Work](#community-work) - - [Docker support for sysctl](#docker-support-for-sysctl) - - [Runc support for sysctl](#runc-support-for-sysctl) - - [Rkt support for sysctl](#rkt-support-for-sysctl) - - [Design Alternatives and Considerations](#design-alternatives-and-considerations) - - [Analysis of Sysctls of Interest](#analysis-of-sysctls-of-interest) - - [Summary of Namespacing and Isolation](#summary-of-namespacing-and-isolation) - - [Classification](#classification) - - [Proposed Design](#proposed-design) - - [Pod API Changes](#pod-api-changes) - - [Apiserver Validation and Kubelet Admission](#apiserver-validation-and-kubelet-admission) - - [In the Apiserver](#in-the-apiserver) - - [In the Kubelet](#in-the-kubelet) - - [Error behavior](#error-behavior) - - [Kubelet Flags to Extend the Whitelist](#kubelet-flags-to-extend-the-whitelist) - - [SecurityContext Enforcement](#securitycontext-enforcement) - - [Alternative 1: by name](#alternative-1-by-name) - - [Alternative 2: SysctlPolicy](#alternative-2-sysctlpolicy) - - [Application of the given Sysctls](#application-of-the-given-sysctls) - - 
[Examples](#examples) - - [Use in a pod](#use-in-a-pod) - - [Allowing only certain sysctls](#allowing-only-certain-sysctls) - - -## Abstract - -In Linux, the sysctl interface allows an administrator to modify kernel -parameters at runtime. Parameters are available via `/proc/sys/` virtual -process file system. The parameters cover various subsystems such as: - -* kernel (common prefix: `kernel.`) -* networking (common prefix: `net.`) -* virtual memory (common prefix: `vm.`) -* MDADM (common prefix: `dev.`) - -More subsystems are described in [Kernel docs](https://www.kernel.org/doc/Documentation/sysctl/README). - -To get a list of basic prefixes on your system, you can run - -``` -$ sudo sysctl -a | cut -d' ' -f1 | cut -d'.' -f1 | sort -u -``` - -To get a list of all parameters, you can run - -``` -$ sudo sysctl -a -``` - -A number of them are namespaced and can therefore be set for a container -independently with today's Linux kernels. - -**Note**: This proposal - while sharing some use-cases - does not cover ulimits -(compare [Expose or utilize docker's rlimit support](https://github.com/kubernetes/kubernetes/issues/3595)). - -## Motivation - -A number of Linux applications need certain kernel parameter settings to - -- either run at all -- or perform well. - -In Kubernetes we want to allow to set these parameters within a pod specification -in order to enable the use of the platform for those applications. - -With Docker version 1.11.1 it is possible to change kernel parameters inside privileged containers. -However, the process is purely manual and the changes might be applied across all containers -affecting the entire host system. It is not possible to set the parameters within a non-privileged -container. - -With [docker#19265](https://github.com/docker/docker/pull/19265) docker-run as of 1.12.0 -supports setting a number of whitelisted sysctls during the container creation process. - -Some real-world examples for the use of sysctls: - -- PostgreSQL requires `kernel.shmmax` and `kernel.shmall` (among others) to be - set to reasonable high values (compare [PostgreSQL Manual 17.4.1. Shared Memory - and Semaphores](http://www.postgresql.org/docs/9.1/static/kernel-resources.html)). - The default of 32 MB for shared memory is not reasonable for a database. -- RabbitMQ proposes a number of sysctl settings to optimize networking: https://www.rabbitmq.com/networking.html. -- web applications with many concurrent connections require high values for - `net.core.somaxconn`. -- a containerized IPv6 routing daemon requires e.g. `/proc/sys/net/ipv6/conf/all/forwarding` and - `/proc/sys/net/ipv6/conf/all/accept_redirects` (compare - [docker#4717](https://github.com/docker/docker/issues/4717#issuecomment-98653017)) -- the [nginx ingress controller in kubernetes/contrib](https://git.k8s.io/contrib/ingress/controllers/nginx/examples/sysctl/change-proc-values-rc.yaml#L80) - uses a privileged sidekick container to set `net.core.somaxconn` and `net.ipv4.ip_local_port_range`. -- a huge software-as-a-service provider uses shared memory (`kernel.shm*`) and message queues (`kernel.msg*`) to - communicate between containers of their web-serving pods, configuring up to 20 GB of shared memory. - - For optimal network layer performance they set `net.core.rmem_max`, `net.core.wmem_max`, - `net.ipv4.tcp_rmem` and `net.ipv4.tcp_wmem` to much higher values than kernel defaults. 
- -- In [Linux Tuning guides for 10G ethernet](https://fasterdata.es.net/host-tuning/linux/) it is suggested to - set `net.core.rmem_max`/`net.core.wmem_max` to values as high as 64 MB and similar dimensions for - `net.ipv4.tcp_rmem`/`net.ipv4.tcp_wmem`. - - It is noted that - > tuning settings described here will actually decrease performance of hosts connected at rates of OC3 (155 Mbps) or less. - -- For integration of a web-backend with the load-balancer retry mechanics it is suggested in http://serverfault.com/questions/518862/will-increasing-net-core-somaxconn-make-a-difference: - - > Sometimes it's preferable to fail fast and let the load-balancer to do it's job(retry) than to make user wait - for that purpose we set net.core.somaxconn any value, and limit application backlog to e.g. 10 and set net.ipv4.tcp_abort_on_overflow to 1. - - In other words, sysctls change the observable application behavior from the view of the load-balancer radically. - -## Abstract Use Cases - -As an administrator I want to set customizable kernel parameters for a container - -1. To be able to limit consumed kernel resources - 1. so I can provide more resources to other containers - 1. to restrict system communication that slows down the host or other containers - 1. to protect against programming errors like resource leaks - 1. to protect against DDoS attacks. -1. To be able to increase limits for certain applications while not - changing the default for all containers on a host - 1. to enable resource hungry applications like databases to perform well - while the default limits for all other applications can be kept low - 1. to enable many network connections e.g. for web backends - 1. to allow special memory management like Java hugepages. -1. To be able to enable kernel features. - 1. to enable containerized execution of special purpose applications without - the need to enable those kernel features host wide, e.g. ip forwarding for - network router daemons - -## Constraints and Assumptions - -* Only namespaced kernel parameters can be modified -* Resource isolation is ensured for all safe sysctls. Sysctl with unclear, weak or not existing isolation are called unsafe sysctls. The later are disabled by default. -* Built on-top of the existing security context work -* Be container-runtime agnostic - - on the API level - - the implementation (and the set of supported sysctls) will depend on the runtime -* Kernel parameters can be set during a container creation process only. - -## Further work (out of scope for this proposal) - -* Update kernel parameters in running containers. -* Integration with new container runtime proposal: https://github.com/kubernetes/kubernetes/pull/25899. -* Hugepages support (compare [docker#4717](https://github.com/docker/docker/issues/4717#issuecomment-77426026)) - while also partly configured through sysctls (`vm.nr_hugepages`, compare http://andrigoss.blogspot.de/2008/02/jvm-performance-tuning.html) - is out-of-scope for this proposal as it is not namespaced and as a limited resource (similar to normal memory) needs deeper integration e.g. with the scheduler. 
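As the abstract of the deleted proposal notes, all of these parameters are exposed through the `/proc/sys` virtual filesystem, which is ultimately what a runtime writes to when applying pod-level sysctls. A minimal, illustrative sketch of that translation (not kubelet or runtime code):

```go
package sysctl

import (
	"os"
	"path/filepath"
	"strings"
)

// setSysctl writes a namespaced kernel parameter by translating the dotted
// sysctl name into a path under /proc/sys. Illustrative only; it must run in
// the target namespace (e.g. the pod's network or IPC namespace) to affect
// that namespace rather than the host.
func setSysctl(name, value string) error {
	path := filepath.Join("/proc/sys", strings.ReplaceAll(name, ".", "/"))
	return os.WriteFile(path, []byte(value), 0644)
}
```

For example, `setSysctl("net.ipv4.ip_local_port_range", "1024 65535")` writes to `/proc/sys/net/ipv4/ip_local_port_range`.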
- -## Community Work - -### Docker support for sysctl - -Supported sysctls (whitelist) as of Docker 1.12.0: - -- IPC namespace - - System V: `kernel.msgmax`, `kernel.msgmnb`, `kernel.msgmni`, `kernel.sem`, - `kernel.shmall`, `kernel.shmmax`, `kernel.shmmni`, `kernel.shm_rmid_forced` - - POSIX queues: `fs.mqueue.*` -- network namespace: `net.*` - -Error behavior: - -- not whitelisted sysctls are rejected: - -```shell -$ docker run --sysctl=foo=bla -it busybox /bin/sh -invalid value "foo=bla" for flag --sysctl: sysctl 'foo=bla' is not whitelisted -See 'docker run --help'. -``` - -Applied changes: - -* https://github.com/docker/docker/pull/19265 -* https://github.com/docker/engine-api/pull/38 - -Related issues: - -* https://github.com/docker/docker/issues/21126 -* https://github.com/ibm-messaging/mq-docker/issues/13 - -### Runc support for sysctl - -Supported sysctls (whitelist) as of RunC 0.1.1 (compare -[libcontainer config validator](https://github.com/opencontainers/runc/blob/master/libcontainer/configs/validate/validator.go#L107)): - -- IPC namespace - - System V: `kernel.msgmax`, `kernel.msgmnb`, `kernel.msgmni`, `kernel.sem`, - `kernel.shmall`, `kernel.shmmax`, `kernel.shmmni`, `kernel.shm_rmid_forced` - - POSIX queues: `fs.mqueue.*` -- network namespace: `net.*` - -Applied changes: - -* https://github.com/opencontainers/runc/pull/73 -* https://github.com/opencontainers/runc/pull/303 -* - -### Rkt support for sysctl - -The only sysctl support in rkt is through a [CNI plugin](https://github.com/containernetworking/plugins/blob/master/plugins/meta/tuning/README.md) plugin. The Kubernetes network plugin `kubenet` can easily be extended to call this with a given list of sysctls during pod launch. - -The default network plugin for rkt is `no-op` though. This mode leaves all network initialization to rkt itself. Rkt in turn uses the static CNI plugin configuration in `/etc/rkt/net.d`. This does not allow to customize the sysctls for a pod. Hence, in order to implement this proposal in `no-op` mode additional changes in rkt are necessary. - -Supported sysctls (whitelist): - -- network namespace: `net.*` - -Applied changes: - -* https://github.com/coreos/rkt/issues/2140 - -Issues: - -* https://github.com/coreos/rkt/issues/2075 - -## Design Alternatives and Considerations - -- Each pod has its own network stack that is shared among its containers. - A privileged side-kick or init container (compare https://git.k8s.io/contrib/ingress/controllers/nginx/examples/sysctl/change-proc-values-rc.yaml#L80) - is able to set `net.*` sysctls. - - Clearly, this is completely uncontrolled by the kubelet, but is a usable work-around if privileged - containers are permitted in the environment. As privileged container permissions (in the admission controller) are an all-or-nothing - decision and the actual code executed in them is not limited, allowing privileged container might be a security threat. - - The same work-around also works for shared memory and message queue sysctls as they are shared among the containers of a pod - in their ipc namespace. - -- Instead of giving the user a way to set sysctls for his pods, an alternative seems to be to set high values - for the limits of interest from the beginning inside the kubelet or the runtime. Then - so the theory - the - user's pods operate under quasi unlimited bounds. 
- - This might be true for some of the sysctls, which purely set limits for some host resources, but - - * some sysctls influence the behavior of the application, e.g.: - * `kernel.shm_rmid_forced` adds a garbage collection semantics to shared memory segments when possessing processes die. - This is against the System V standard though. - * `net.ipv4.tcp_abort_on_overflow` makes the kernel send RST packets when the application is overloaded, giving a load-balancer - the chance to reschedule a request to another backend. - * some sysctls lead to changed resource requirement characteristics, e.g.: - * `net.ipv4.tcp_rmem`/`net.ipv4.tcp_wmem` not only define min and max values, but also the default tcp window buffer size - for each socket. While large values are necessary for certain environments and applications, they lead to waste of resources - in the 90% case. - * some sysctls have a different error behavior, e.g.: - * creating a shared memory segment will fail immediately when `kernel.shmmax` is too small. - - With a large `kernel.shmmax` default, the creation of a segment always succeeds, but the OOM killer will - do its job when a shared memory segment exceeds the memory request of the container. - - The high values that could be set by the kubelet on launch might depend on the node's capacity and capabilities. But for - portability of workloads it is helpful to have a common baseline of sysctls settings one can expect on every node. The - kernel defaults (which are active if the kubelet does not change defaults) are such a (natural) baseline. - -- One could imagine to offer certain non-namespaced sysctls as well which - taint a host such that only containers with compatible sysctls settings are - scheduled there. This is considered *out of scope* to schedule pods with certain sysctls onto certain hosts according to some given rules. This must be done manually by the admin, e.g. by using taints and tolerations. - -- (Next to namespacing) *isolation* is the key requirement for a sysctl to be unconditionally allowed in a pod spec. There are the following alternatives: - - 1. allow only namespaced **and** isolated sysctls (= safe) in the API - 2. allow only namespaced **and** isolated sysctls **by-default** and make all other namespaced sysctls with unclear or weak isolation (= unsafe) opt-in by the cluster admin. - - For v1.4 only a handful of *safe* sysctls are defined. There are known, non-edge-case use-cases (see above) for a number of further sysctls. Some of them (especially the ipc sysctls) will probably be promoted onto the whitelist of safe sysctls in the near future when Kubernetes implements better resource isolation. - - On the other hand, especially in the `net.*` hierarchy there are a number of very low-level knobs to tune the network stack. They might be necessary for classes of applications requiring high-performance or realtime behavior. It is hard to forsee which knobs will be necessary in the future. At the same time the `net.*` hierarchy is huge making deep analysis on a 1-on-1 basis hard. If there is no way to use them at-your-own-risk, those users are forced into the use of privileged containers. This might be a security threat and a no-go for certain environments. Sysctls in the API (even if unsafe) in contrast allow finegrained control by the cluster admin without essentially opening up root access to the cluster nodes for some users. 
- - This requirement for a large number of accessible sysctls must be balanced though with the desire to have a minimal API surface: removing certain (unsafe) sysctls from an official API in a later version (e.g. because they turned out to be problematic for the node health) is problematic. - - To balance those two desires the API can be split in half: one official way to declare *safe* sysctls in a pod spec (this one will be promoted to beta and stable some day) and an alternative way to define *unsafe* sysctls. Possibly the second way will stay alpha forever to make it clear that unsafe sysctls are not a stable API of Kubernetes. Moreover, for all *unsafe* sysctls an opt-in policy is desirable, only controllable by the cluster admin, not by each cluster user. - -## Analysis of Sysctls of Interest - -**Note:** The kmem accounting has fundamentally changed in kernel 4.5 (compare https://github.com/torvalds/linux/commit/a9bb7e620efdfd29b6d1c238041173e411670996): older kernels (e.g. 4.4 from Ubuntu 16.04, 3.10 from CentOS 7.2) use a blacklist (`__GFP_NOACCOUNT`), newer kernels (e.g. 4.6.x from Fedora 24) use a whitelist (`__GFP_ACCOUNT`). **In the following the analysis is done for kernel >= 4.5:** - -- `kernel.shmall`, `kernel.shmmax`, `kernel.shmmni`: configure System V shared memory - * [x] **namespaced** in ipc ns - * [x] **accounted for** as user memory in memcg, using sparse allocation (like tmpfs) - uses [Resizable virtual memory filesystem](https://github.com/torvalds/linux/blob/master/mm/shmem.c) - * [x] hence **safe to customize** - * [x] **no application influence** with high values - * **defaults to** [unlimited pages, unlimited size, 4096 segments on today's kernels](https://github.com/torvalds/linux/blob/0e06f5c0deeef0332a5da2ecb8f1fcf3e024d958/include/uapi/linux/shm.h#L20). This makes **customization practically unnecessary**, at least for the segment sizes. IBM's DB2 suggests `256*GB of RAM` for `kernel.shmmni` (compare http://www.ibm.com/support/knowledgecenter/SSEPGG_10.1.0/com.ibm.db2.luw.qb.server.doc/doc/c0057140.html), exceeding the kernel defaults for machines with >16GB of RAM. -- `kernel.shm_rmid_forced`: enforce removal of shared memory segments on process shutdown - * [x] **namespaced** in ipc ns -- `kernel.msgmax`, `kernel.msgmnb`, `kernel.msgmni`: configure System V messages - * [x] **namespaced** in ipc ns - * [ ] [temporarily **allocated in kmem** in a linked message list](http://lxr.linux.no/linux+v4.7/ipc/msgutil.c#L58), but **not accounted for** in memcg **with kernel >= 4.5** - * [ ] **defaults to** [8kb max packet size, 16384 kb total queue size, 32000 queues](http://lxr.linux.no/linux+v4.7/include/uapi/linux/msg.h#L75), **which might be too small** for certain applications - * [ ] arbitrary values [up to INT_MAX](http://lxr.linux.no/linux+v4.7/ipc/ipc_sysctl.c#L135). Hence, **potential DoS attack vector** against the host. - - Even without using a sysctl the kernel default allows any pod to allocate 512 MB of message memory (compare https://github.com/sttts/kmem-ipc-msg-queues as a test-case). If kmem acconting is not active, this is outside of the pod resource limits. Then a node with 8 GB will not survive with >16 replicas of such a pod. - -- `fs.mqueue.*`: configure POSIX message queues. - * [x] **namespaced** in ipc ns - * [ ] uses the same [`load_msg`](http://lxr.linux.no/linux+v4.7/ipc/msgutil.c#L58) as System V messages, i.e. 
**no accounting for kernel >= 4.5** - * does [strict checking against rlimits](http://lxr.free-electrons.com/source/ipc/mqueue.c#L278) though - * [ ] **defaults to** [256 queues, max queue length 10, message size 8kb](http://lxr.free-electrons.com/source/include/linux/ipc_namespace.h#L102) - * [ ] can be customized via sysctls up to 64k max queue length, message size 16MB. Hence, **potential DoS attack vector** against the host -- `kernel.sem`: configure System V semaphores - * [x] **namespaced** in ipc ns - * [ ] uses [plain kmalloc and vmalloc](http://lxr.free-electrons.com/source/ipc/util.c#L404) **without accounting** - * [x] **defaults to** [32000 ids and 32000 semaphores per id](http://lxr.free-electrons.com/source/include/uapi/linux/sem.h#L78) (needing double digit number of bytes each), probably enough for all applications: - - > The values has been chosen to be larger than necessary for any known configuration. ([linux/sem.h](http://lxr.free-electrons.com/source/include/uapi/linux/sem.h#L69)) - -- `net.*`: configure the network stack - - `net.core.somaxconn`: maximum queue length specifiable by listen. - * [x] **namespaced** in net ns - * [ ] **might have application influence** for high values as it limits the socket queue length - * [?] **No real evidence found until now for accounting**. The limit is checked by `sk_acceptq_is_full` at http://lxr.free-electrons.com/source/net/ipv4/tcp_ipv4.c#L1276. After that a new socket is created. Probably, the tcp socket buffer sysctls apply then, with their accounting, see below. - * [ ] **very unreliable** tcp memory accounting. There have a been a number of attempts to drop that from the kernel completely, e.g. https://lkml.org/lkml/2014/9/12/401. On Fedora 24 (4.6.3) tcp accounting did not work at all, on Ubuntu 16.06 (4.4) it kind of worked in the root-cg, but in containers only values copied from the root-cg appeared. -e - `net.ipv4.tcp_wmem`/`net.ipv4.tcp_wmem`/`net.core.rmem_max`/`net.core.wmem_max`: socket buffer sizes - * [ ] **not namespaced in net ns**, and they are not even available under `/sys/net` - - `net.ipv4.ip_local_port_range`: local tcp/udp port range - * [x] **namespaced** in net ns - * [x] **no memory involved** - - `net.ipv4.tcp_max_syn_backlog`: number of half-open connections - * [ ] **not namespaced** - - `net.ipv4.tcp_syncookies`: enable syn cookies - * [x] **namespaced** in net ns - * [x] **no memory involved** - -### Summary of Namespacing and Isolation - -The individual analysis above leads to the following summary of: - -- namespacing (ns) - the sysctl is set in this namespace, independently from the parent/root namespace -- accounting (acc.) - the memory resources caused by the sysctl are accounted for by the given cgroup - - Kernel <= 4.4 and >= 4.5 fundamentally different kernel memory accounting (see note above). The two columns describe the two cases. - -| sysctl | ns | acc. 
for <= 4.4 | >= 4.5 | -| ---------------------------- | ---- | --------------- | ------------- | -| kernel.shm* | ipc | user memcg 1) | user memcg 1) | -| kernel.msg* | ipc | kmem memcg 3) | - 3) | -| fs.mqueue.* | ipc | kmem memcg | - | -| kernel.sem | ipc | kmem memcg | - | -| net.core.somaxconn | net | unreliable 4) | unreliable 4) | -| net.*.tcp_wmem/rmem | - 2) | unreliable 4) | unreliable 4) | -| net.core.wmem/rmem_max | - 2) | unreliable 4) | unreliable 4) | -| net.ipv4.ip_local_port_range | net | not needed 5) | not needed 5) | -| net.ipv4.tcp_syncookies | net | not needed 5) | not needed 5) | -| net.ipv4.tcp_max_syn_backlog | - 2) | ? | ? | - -Footnotes: - -1. a pod memory cgroup is necessary to catch segments from a dying process. -2. only available in root-ns, not even visible in a container -3. compare https://github.com/sttts/kmem-ipc-msg-queues as a test-case -4. in theory socket buffers should be accounted for by the kmem.tcp memcg counters. In practice this only worked very unreliably and not reproducibly, on some kernel not at all. kmem.tcp acconuting seems to be deprecated and on lkml patches has been posted to drop this broken feature. -5. b/c no memory is involved, i.e. purely functional difference - -**Note**: for all sysctls marked as "kmem memcg" kernel memory accounting must be enabled in the container for proper isolation. This will not be the case for 1.4, but is planned for 1.5. - -### Classification - -From the previous analysis the following classification is derived: - -| sysctl | ns | accounting | reclaim | pre-requisites | -| ---------------------------- | ----- | ---------- | --------- | -------------- | -| kernel.shm* | pod | container | pod | i 1) | -| kernel.msg* | pod | container | pod | i + ii + iii | -| fs.mqueue.* | pod | container | pod | i + ii + iii | -| kernel.sem | pod | container | pod | i + ii + iii | -| net.core.somaxconn | pod | container | container | i + ii + iv | -| net.*.tcp_wmem/rmem | host | container | container | i + ii + iv | -| net.core.wmem/rmem_max | host | container | container | i + ii + iv | -| net.ipv4.ip_local_port_range | pod | n/a | n/a | - | -| net.ipv4.tcp_syncookies | pod | n/a | n/a | - | -| net.ipv4.tcp_max_syn_backlog | pod | n/a | n/a | - | - -Explanation: - -- ns: value is namespaced on this level -- accounting: memory is accounted for against limits of this level -- reclaim: in the worst case, memory resources fall-through to this level and are accounted for there until they get destroyed -- pre-requisites: - 1. pod level cgroups - 2. kmem acconuting enabled in Kubernetes - 3. kmem accounting fixes for ipc namespace in Kernel >= 4.5 - 4. reliable kernel tcp net buffer accounting, which probably means to wait for cgroups v2. - -Footnote: - -1. Pod level cgroups don't exist today and pages are already re-parented on container deletion in v1.3. So supporting pod level sysctls in v1.4 that are tracked by user space memcg is not introducing any regression. - -**Note**: with the exception of `kernel.shm*` all of the listed pod-level sysctls depend on kernel memory accounting to be enabled for proper resource isolation. This will not be the case for 1.4 by default, but is planned in 1.5. - -**Note**: all the ipc objects persist when the originating containers dies. Their resources (if kmem accounting is enabled) fall back to the parent cgroup. As long as there is no pod level memory cgroup, the parent will be the container runtime, e.g. the docker daemon or the RunC process. 
It is [planned with v1.5 to introduce a pod level memory cgroup](pod-resource-management.md#implementation-status) which will fix this problem. - -**Note**: in general it is good practice to reserve special nodes for those pods which set sysctls which the kernel does not guarantee proper isolation for. - -## Proposed Design - -Sysctls in pods and `PodSecurityPolicy` are first introduced as an alpha feature for Kubernetes 1.4. This means that the API will model these as annotations, with the plan to turn those in first class citizens in a later release when the feature is promoted to beta. - -It is proposed to use a syntactical validation in the apiserver **and** a node-level whitelist of *safe sysctls* in the kubelet. The whitelist shall be fixed per version and might grow in the future when better resource isolation is in place in the kubelet. In addition a list of *allowed unsafe sysctls* will be configured per node by the cluster admin, with an empty list as the default. - -The following rules apply: - -- Only sysctls shall be whitelisted in the kubelet - + that are properly namespaced by the container or the pod (e.g. in the ipc or net namespace) - + **and** that cannot lead to resource consumption outside of the limits of the container or the pod. - These are called *safe*. -- The cluster admin shall only be able to manually enable sysctls in the kubelet - + that are properly namespaced by the container or the pod (e.g. in the ipc or net namespace). - These are call *unsafe*. - -This means that sysctls that are not namespaced must be set by the admin on host level at his own risk, e.g. by running a *privileged daemonset*, possibly limited to a restricted, special-purpose set of nodes, if necessary with the host network namespace. This is considered out-of-scope of this proposal and out-of-scope of what the kubelet will do for the admin. A section is going to be added to the documentation describing this. - -The *allowed unsafe sysctls* will be configurable on the node via a flag of the kubelet. - -### Pod API Changes - -Pod specification must be changed to allow the specification of kernel parameters: - -```go -// Sysctl defines a kernel parameter to be set -type Sysctl struct { - // Name of a property to set - Name string `json:"name"` - // Value of a property to set - Value intstr.IntOrString `json:"value"` - // Must be true for unsafe sysctls. - Unsafe bool `json:"unsafe,omitempty"` -} - -// PodSecurityContext holds pod-level security attributes and common container settings. -// Some fields are also present in container.securityContext. Field values of -// container.securityContext take precedence over field values of PodSecurityContext. -type PodSecurityContext struct { - ... - // Sysctls hold a list of namespaced sysctls used for the pod. Pods with unsupported - // sysctls (by the container runtime) might fail to launch. - Sysctls []Sysctl `json:"sysctls,omitempty"` -} -``` - -During alpha the extension of `PodSecurityContext` is modeled with annotations: - -``` -security.alpha.kubernetes.io/sysctls: kernel.shm_rmid_forced=1` -security.alpha.kubernetes.io/unsafe-sysctls: net.ipv4.route.min_pmtu=1000,kernel.msgmax=1 2 3` -``` - -The value is a comma separated list of key-value pairs separated by `=`. - -*Safe* sysctls may be declared with `unsafe: true` (or in the respective annotation), while for *unsafe* sysctls `unsafe: true` is mandatory. This guarantees backwards-compatibility in future versions when sysctls have been promoted to the whitelist: old pod specs will still work. 
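A hedged sketch of decoding the alpha annotation format described above (a comma-separated list of `name=value` pairs); the function is illustrative, not the actual kubelet parser, and it returns raw strings rather than the typed `Sysctl` values:

```go
package kubelet

import (
	"fmt"
	"strings"
)

// parseSysctlAnnotation splits an annotation value into name/value pairs,
// e.g. "kernel.shm_rmid_forced=1,net.ipv4.ip_local_port_range=1024 65535".
func parseSysctlAnnotation(annotation string) (map[string]string, error) {
	sysctls := map[string]string{}
	if annotation == "" {
		return sysctls, nil
	}
	for _, kv := range strings.Split(annotation, ",") {
		parts := strings.SplitN(kv, "=", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("invalid sysctl annotation entry %q", kv)
		}
		sysctls[parts[0]] = parts[1]
	}
	return sysctls, nil
}
```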
- -Possibly, the `security.alpha.kubernetes.io/unsafe-sysctls` annotation will stay as an alpha API (replacing the `Unsafe bool` field) even when `security.alpha.kubernetes.io/sysctls` has been promoted to beta or stable. This helps to make clear that unsafe sysctls are not a stable feature. - -**Note**: none of the whitelisted (and in general none with the exceptions of descriptive plain text ones) sysctls use anything else than numbers, possibly separated with spaces. - -**Note**: sysctls must be on the pod level because containers in a pod share IPC and network namespaces (if pod.spec.hostIPC and pod.spec.hostNetwork is false) and therefore cannot have conflicting sysctl values. Moreover, note that all namespaced sysctl supported by Docker/RunC are either in the IPC or network namespace. - -### Apiserver Validation and Kubelet Admission - -#### In the Apiserver - -The name of each sysctl in `PodSecurityContext.Sysctls[*].Name` (or the `annotation security.alpha.kubernetes.io/[unsafe-]sysctls` during alpha) is validated by the apiserver against: - -- 253 characters in length -- it matches `sysctlRegexp`: - -```go -const SysctlSegmentFmt string = "[a-z0-9]([-_a-z0-9]*[a-z0-9])?" -const SysctlFmt string = "(" + SysctlSegmentFmt + "\\.)*" + SysctlSegmentFmt -var sysctlRegexp = regexp.MustCompile("^" + SysctlFmt + "$") -``` - -#### In the Kubelet - -The name of each sysctl in `PodSecurityContext.Sysctls[*].Name` (or the `annotation security.alpha.kubernetes.io/[unsafe-]sysctls` during alpha) is checked by the kubelet against a static *whitelist*. - -The whitelist is defined under `pkg/kubelet` and to be maintained by the nodes team. - -The initial whitelist of safe sysctls will be: - -```go -var whitelist = []string{ - "kernel.shm_rmid_forced", - "net.ipv4.ip_local_port_range", - "net.ipv4.tcp_syncookies", - "net.ipv4.tcp_max_syn_backlog", -} -``` - -In parallel a namespace list is maintained with all sysctls and their respective, known kernel namespaces. This is initially derived from Docker's internal sysctl whitelist: - -```go -var namespaces = map[string]string{ - "kernel.sem": "ipc", -} - -var prefixNamespaces = map[string]string{ - "kernel.msg": "ipc", - "kenrel.shm": "ipc", - "fs.mqueue.": "ipc", - "net.": "net", -} -``` - -If a pod is created with host ipc or host network namespace, the respective sysctls are forbidden. - -### Error behavior - -Pods that do not comply with the syntactical sysctl format will be rejected by the apiserver. Pods that do not comply with the whitelist (or are not manually enabled as *allowed unsafe sysctls* for a node by the cluster admin) will fail to launch. An event will be created by the kubelet to notify the user. - -### Kubelet Flags to Extend the Whitelist - -The kubelet will get a new flag: - -``` ---experimental-allowed-unsafe-sysctls Comma-separated whitelist of unsafe - sysctls or unsafe sysctl patterns - (ending in *). Use these at your own - risk. -``` - -It defaults to the empty list. - -During kubelet launch the given value is checked against the list of known namespaces for sysctls or sysctl prefixes. If a namespace is not known, the kubelet will terminate with an error. - -### SecurityContext Enforcement - -#### Alternative 1: by name - -A list of permissible sysctls is to be added to `pkg/apis/extensions/types.go` (compare [pod-security-policy](pod-security-policy.md)): - -```go -// PodSecurityPolicySpec defines the policy enforced. -type PodSecurityPolicySpec struct { - ... - // Sysctls is a white list of allowed sysctls in a pod spec. 
Each entry - // is either a plain sysctl name or ends in "*" in which case it is considered - // as a prefix of allowed sysctls. - Sysctls []string `json:"sysctls,omitempty"` -} -``` - -The `simpleProvider` in `pkg.security.podsecuritypolicy` will validate the value of `PodSecurityPolicySpec.Sysctls` with the sysctls of a given pod in `ValidatePodSecurityContext`. - -The default policy will be `*`, i.e. all syntactly correct sysctls are admitted by the `PodSecurityPolicySpec`. - -The `PodSecurityPolicySpec` applies to safe and unsafe sysctls in the same way. - -During alpha the following annotation will be used: - -``` -security.alpha.kubernetes.io/sysctls: kernel.shmmax,kernel.msgmax,fs.mqueue.*` -``` - -on `PodSecurityPolicy` objects to customize the allowed sysctls. - -**Note**: This does not override the whitelist or the *allowed unsafe sysctls* on the nodes. They still apply. This only changes admission of pods in the apiserver. Pods can still fail to launch due to failed admission on the kubelet. - -#### Alternative 2: SysctlPolicy - -```go -// SysctlPolicy defines how a sysctl may be set. If neither Values, -// nor Min, Max are set, any value is allowed. -type SysctlPolicy struct { - // Name is the name of a sysctl or a pattern for a name. It consists of - // dot separated name segments. A name segment matches [a-z]+[a-z_-0-9]* or - // equals "*". The later is interpretated as a wildcard for that name - // segment. - Name string `json:"name"` - - // Values are allowed values to be set. Either Values is - // set or Min and Max. - Values []string `json:"values,omitempty"` - - // Min is the minimal value allowed to be set. - Min *int64 `json:"min,omitempty"` - - // Max is the maximum value allowed to be set. - Max *int64 `json:"max,omitempty"` -} - -// PodSecurityPolicySpec defines the policy enforced on sysctls. -type PodSecurityPolicySpec struct { - ... - // Sysctls is a white list of allowed sysctls in a pod spec. - Sysctls []SysctlPolicy `json:"sysctls,omitempty"` -} -``` - -During alpha the following annotation will be used: - -``` -security.alpha.kubernetes.io/sysctls: kernel.shmmax,kernel.msgmax=max:10:min:1,kernel.msgmni=values:1000 2000 3000` -``` - -This extended syntax is a natural extension of that of alternative 1 and therefore can be implemented any time during alpha. - -Alternative 1 or 2 has to be chosen for the external API once the feature is promoted to beta. - -### Application of the given Sysctls - -Finally, the container runtime will interpret `pod.spec.securityPolicy.sysctls`, -e.g. in the case of Docker the `DockerManager` will apply the given sysctls to the infra container in `createPodInfraContainer`. - -In a later implementation of a container runtime interface (compare https://github.com/kubernetes/kubernetes/pull/25899), sysctls will be part of `LinuxPodSandboxConfig` (compare https://github.com/kubernetes/kubernetes/pull/25899#discussion_r64867763) and to be applied by the runtime implementation to the `PodSandbox` by the `PodSandboxManager` implementation. - -## Examples - -### Use in a pod - -Here is an example of a pod that has the safe sysctl `net.ipv4.ip_local_port_range` set to `1024 65535` and the unsafe sysctl `net.ipv4.route.min_pmtu` to `1000`. 
- -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: nginx - labels: - name: nginx -spec: - containers: - - name: nginx - image: nginx - ports: - - containerPort: 80 - securityContext: - sysctls: - - name: net.ipv4.ip_local_port_range - value: "1024 65535" - - name: net.ipv4.route.min_pmtu - value: 1000 - unsafe: true -``` - -### Allowing only certain sysctls - -Here is an example of a `PodSecurityPolicy`, allowing `kernel.shmmax`, `kernel.shmall` and all `net.*` -sysctls to be set: - -```yaml -apiVersion: v1 -kind: PodSecurityPolicy -metadata: - name: database -spec: - sysctls: - - kernel.shmmax - - kernel.shmall - - net.* -``` - -and a restricted default `PodSecurityPolicy`: - -```yaml -apiVersion: v1 -kind: PodSecurityPolicy -metadata: - name: -spec: - sysctls: # none -``` - -in contrast to a permissive default `PodSecurityPolicy`: - -```yaml -apiVersion: v1 -kind: PodSecurityPolicy -metadata: - name: -spec: - sysctls: - - * -``` +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
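The `PodSecurityPolicy` whitelist semantics shown in the examples above (exact names, or entries ending in `*` treated as prefixes) can be sketched as follows; the package and function names are assumptions, not the real admission code:

```go
package podsecuritypolicy

import "strings"

// sysctlAllowed reports whether a sysctl name is admitted by a whitelist in
// which an entry either matches exactly or, when it ends in "*", matches as a
// prefix (a bare "*" therefore allows every sysctl).
func sysctlAllowed(name string, whitelist []string) bool {
	for _, entry := range whitelist {
		if strings.HasSuffix(entry, "*") {
			if strings.HasPrefix(name, strings.TrimSuffix(entry, "*")) {
				return true
			}
		} else if name == entry {
			return true
		}
	}
	return false
}
```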
\ No newline at end of file diff --git a/contributors/design-proposals/node/topology-manager.md b/contributors/design-proposals/node/topology-manager.md index eb7032ef..f0fbec72 100644 --- a/contributors/design-proposals/node/topology-manager.md +++ b/contributors/design-proposals/node/topology-manager.md @@ -1,322 +1,6 @@ -# Node Topology Manager +Design proposals have been archived. -_Authors:_ +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -* @ConnorDoyle - Connor Doyle <connor.p.doyle@intel.com> -* @balajismaniam - Balaji Subramaniam <balaji.subramaniam@intel.com> -* @lmdaly - Louise M. Daly <louise.m.daly@intel.com> -**Contents:** - -* [Overview](#overview) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) - * [User Stories](#user-stories) -* [Proposal](#proposal) - * [User Stories](#user-stories) - * [Proposed Changes](#proposed-changes) - * [New Component: Topology Manager](#new-component-topology-manager) - * [Computing Preferred Affinity](#computing-preferred-affinity) - * [New Interfaces](#new-interfaces) - * [Changes to Existing Components](#changes-to-existing-components) -* [Graduation Criteria](#graduation-criteria) - * [alpha (target v1.11)](#alpha-target-v1.11) - * [beta](#beta) - * [GA (stable)](#ga-stable) -* [Challenges](#challenges) -* [Limitations](#limitations) -* [Alternatives](#alternatives) -* [Reference](#reference) - -# Overview - -An increasing number of systems leverage a combination of CPUs and -hardware accelerators to support latency-critical execution and -high-throughput parallel computation. These include workloads in fields -such as telecommunications, scientific computing, machine learning, -financial services and data analytics. Such hybrid systems comprise a -high performance environment. - -In order to extract the best performance, optimizations related to CPU -isolation and memory and device locality are required. However, in -Kubernetes, these optimizations are handled by a disjoint set of -components. - -This proposal provides a mechanism to coordinate fine-grained hardware -resource assignments for different components in Kubernetes. - -# Motivation - -Multiple components in the Kubelet make decisions about system -topology-related assignments: - -- CPU manager - - The CPU manager makes decisions about the set of CPUs a container is -allowed to run on. The only implemented policy as of v1.8 is the static -one, which does not change assignments for the lifetime of a container. -- Device manager - - The device manager makes concrete device assignments to satisfy -container resource requirements. Generally devices are attached to one -peripheral interconnect. If the device manager and the CPU manager are -misaligned, all communication between the CPU and the device can incur -an additional hop over the processor interconnect fabric. -- Container Network Interface (CNI) - - NICs including SR-IOV Virtual Functions have affinity to one socket, -with measurable performance ramifications. - -*Related Issues:* - -- [Hardware topology awareness at node level (including NUMA)][k8s-issue-49964] -- [Discover nodes with NUMA architecture][nfd-issue-84] -- [Support VF interrupt binding to specified CPU][sriov-issue-10] -- [Proposal: CPU Affinity and NUMA Topology Awareness][proposal-affinity] - -Note that all of these concerns pertain only to multi-socket systems. 
Correct -behavior requires that the kernel receive accurate topology information from -the underlying hardware (typically via the SLIT table). See section 5.2.16 -and 5.2.17 of the -[ACPI Specification](http://www.acpi.info/DOWNLOADS/ACPIspec50.pdf) for more -information. - -## Goals - -- Arbitrate preferred socket affinity for containers based on input from - CPU manager and Device Manager. -- Provide an internal interface and pattern to integrate additional - topology-aware Kubelet components. - -## Non-Goals - -- _Inter-device connectivity:_ Decide device assignments based on direct - device interconnects. This issue can be separated from socket - locality. Inter-device topology can be considered entirely within the - scope of the Device Manager, after which it can emit possible - socket affinities. The policy to reach that decision can start simple - and iterate to include support for arbitrary inter-device graphs. -- _HugePages:_ This proposal assumes that pre-allocated HugePages are - spread among the available memory nodes in the system. We further assume - the operating system provides best-effort local page allocation for - containers (as long as sufficient HugePages are free on the local memory - node. -- _CNI:_ Changing the Container Networking Interface is out of scope for - this proposal. However, this design should be extensible enough to - accommodate network interface locality if the CNI adds support in the - future. This limitation is potentially mitigated by the possibility to - use the device plugin API as a stopgap solution for specialized - networking requirements. - -## User Stories - -*Story 1: Fast virtualized network functions* - -A user asks for a "fast network" and automatically gets all the various -pieces coordinated (hugepages, cpusets, network device) co-located on a -socket. - -*Story 2: Accelerated neural network training* - -A user asks for an accelerator device and some number of exclusive CPUs -in order to get the best training performance, due to socket-alignment of -the assigned CPUs and devices. - -# Proposal - -*Main idea: Two phase topology coherence protocol* - -Topology affinity is tracked at the container level, similar to devices and -CPU affinity. At pod admission time, a new component called the Topology -Manager collects possible configurations from the Device Manager and the -CPU Manager. The Topology Manager acts as an oracle for local alignment by -those same components when they make concrete resource allocations. We -expect the consulted components to use the inferred QoS class of each -pod in order to prioritize the importance of fulfilling optimal locality. - -## Proposed Changes - -### New Component: Topology Manager - -This proposal is focused on a new component in the Kubelet called the -Topology Manager. The Topology Manager implements the pod admit handler -interface and participates in Kubelet pod admission. When the `Admit()` -function is called, the Topology Manager collects topology hints from other -Kubelet components. - -If the hints are not compatible, the Topology Manager may choose to -reject the pod. Behavior in this case depends on a new Kubelet configuration -value to choose the topology policy. The Topology Manager supports two -modes: `strict` and `preferred` (default). In `strict` mode, the pod is -rejected if alignment cannot be satisfied. The Topology Manager could -use `softAdmitHandler` to keep the pod in `Pending` state. 
- -The Topology Manager component will be disabled behind a feature gate until -graduation from alpha to beta. - -#### Computing Preferred Affinity - -A topology hint indicates a preference for some well-known local resources. -Initially, the only supported reference resource is a mask of CPU socket IDs. -After collecting hints from all providers, the Topology Manager chooses some -mask that is present in all lists. Here is a sketch: - -1. Apply a partial order on each list: number of bits set in the - mask, ascending. This biases the result to be more precise if - possible. -1. Iterate over the permutations of preference lists and compute - bitwise-and over the masks in each permutation. -1. Store the first non-empty result and break out early. -1. If no non-empty result exists, return an error. - -The behavior when a match does not exist is configurable, as described -above. - -#### New Interfaces - -```go -package topologymanager - -// TopologyManager helps to coordinate local resource alignment -// within the Kubelet. -type Manager interface { - lifecycle.PodAdmitHandler - Store - AddHintProvider(HintProvider) - RemovePod(podName string) -} - -// SocketMask is a bitmask-like type denoting a subset of available sockets. -type SocketMask struct{} // TBD - -// TopologyHints encodes locality to local resources. -type TopologyHints struct { - Sockets []SocketMask -} - -// HintStore manages state related to the Topology Manager. -type Store interface { - // GetAffinity returns the preferred affinity for the supplied - // pod and container. - GetAffinity(podName string, containerName string) TopologyHints -} - -// HintProvider is implemented by Kubelet components that make -// topology-related resource assignments. The Topology Manager consults each -// hint provider at pod admission time. -type HintProvider interface { - // Returns hints if this hint provider has a preference; otherwise - // returns `_, false` to indicate "don't care". - GetTopologyHints(pod v1.Pod, containerName string) (TopologyHints, bool) -} -``` - -_Listing: Topology Manager and related interfaces (sketch)._ - - - -_Figure: Topology Manager components._ - - - -_Figure: Topology Manager instantiation and inclusion in pod admit lifecycle._ - -### Changes to Existing Components - -1. Kubelet consults Topology Manager for pod admission (discussed above.) -1. Add two implementations of Topology Manager interface and a feature gate. - 1. As much Topology Manager functionality as possible is stubbed when the - feature gate is disabled. - 1. Add a functional Topology Manager that queries hint providers in order - to compute a preferred socket mask for each container. -1. Add `GetTopologyHints()` method to CPU Manager. - 1. CPU Manager static policy calls `GetAffinity()` method of - Topology Manager when deciding CPU affinity. -1. Add `GetTopologyHints()` method to Device Manager. - 1. Add Socket ID to Device structure in the device plugin - interface. Plugins should be able to determine the socket - when enumerating supported devices. See the protocol diff below. - 1. Device Manager calls `GetAffinity()` method of Topology Manager when - deciding device allocation. 
- -```diff -diff --git a/pkg/kubelet/apis/deviceplugin/v1beta1/api.proto b/pkg/kubelet/apis/deviceplugin/v1beta1/api.proto -index efbd72c133..f86a1a5512 100644 ---- a/pkg/kubelet/apis/deviceplugin/v1beta1/api.proto -+++ b/pkg/kubelet/apis/deviceplugin/v1beta1/api.proto -@@ -73,6 +73,10 @@ message ListAndWatchResponse { - repeated Device devices = 1; - } - -+message TopologyInfo { -+ optional int32 socketID = 1 [default = -1]; -+} -+ - /* E.g: - * struct Device { - * ID: "GPU-fef8089b-4820-abfc-e83e-94318197576e", -@@ -85,6 +89,8 @@ message Device { - string ID = 1; - // Health of the device, can be healthy or unhealthy, see constants.go - string health = 2; -+ // Topology details of the device (optional.) -+ optional TopologyInfo topology = 3; - } -``` - -_Listing: Amended device plugin gRPC protocol._ - - - -_Figure: Topology Manager hint provider registration._ - - - -_Figure: Topology Manager fetches affinity from hint providers._ - -# Graduation Criteria - -## Phase 1: Alpha (target v1.13) - -* Feature gate is disabled by default. -* Alpha-level documentation. -* Unit test coverage. -* CPU Manager allocation policy takes topology hints into account. -* Device plugin interface includes socket ID. -* Device Manager allocation policy takes topology hints into account. - -## Phase 2: Beta (later versions) - -* Feature gate is enabled by default. -* Alpha-level documentation. -* Node e2e tests. -* Support hugepages alignment. -* User feedback. - -## GA (stable) - -* *TBD* - -# Challenges - -* Testing the Topology Manager in a continuous integration environment - depends on cloud infrastructure to expose multi-node topologies - to guest virtual machines. -* Implementing the `GetHints()` interface may prove challenging. - -# Limitations - -* *TBD* - -# Alternatives - -* [AutoNUMA][numa-challenges]: This kernel feature affects memory - allocation and thread scheduling, but does not address device locality. - -# References - -* *TBD* - -[k8s-issue-49964]: https://github.com/kubernetes/kubernetes/issues/49964 -[nfd-issue-84]: https://github.com/kubernetes-incubator/node-feature-discovery/issues/84 -[sriov-issue-10]: https://github.com/hustcat/sriov-cni/issues/10 -[proposal-affinity]: https://github.com/kubernetes/community/pull/171 -[numa-challenges]: https://queue.acm.org/detail.cfm?id=2852078 +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
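To make the hint-merging step above concrete, here is a rough sketch of the intersection logic under the simplifying assumption that `SocketMask` (left as TBD in the interface listing) is a plain bitmask of socket IDs; the real Topology Manager may represent masks and iterate preference lists differently.

```go
package main

import (
	"fmt"
	"math/bits"
	"sort"
)

// SocketMask is modeled here as a simple bitmask of socket IDs.
type SocketMask uint64

// mergeHints chooses one mask from each provider's preference list,
// intersects them, and returns the first non-empty result, with each
// list pre-sorted by number of set bits so narrower (more precise)
// masks are tried first.
func mergeHints(providerHints [][]SocketMask) (SocketMask, bool) {
	for _, hints := range providerHints {
		sort.Slice(hints, func(i, j int) bool {
			return bits.OnesCount64(uint64(hints[i])) < bits.OnesCount64(uint64(hints[j]))
		})
	}
	var best SocketMask
	found := false
	var walk func(depth int, acc SocketMask)
	walk = func(depth int, acc SocketMask) {
		if found || acc == 0 {
			return // prune empty intersections, stop after the first hit
		}
		if depth == len(providerHints) {
			best, found = acc, true
			return
		}
		for _, m := range providerHints[depth] {
			walk(depth+1, acc&m)
			if found {
				return
			}
		}
	}
	walk(0, ^SocketMask(0))
	return best, found
}

func main() {
	// Two providers: each can tolerate both sockets (0b11) but can be
	// satisfied by socket 0 alone (0b01).
	cpuManager := []SocketMask{0b01, 0b11}
	deviceManager := []SocketMask{0b11, 0b01}
	if mask, ok := mergeHints([][]SocketMask{cpuManager, deviceManager}); ok {
		fmt.Printf("preferred socket mask: %02b\n", mask)
	} else {
		fmt.Println("no common affinity; behavior depends on strict vs. preferred policy")
	}
}
```

In this example both providers prefer socket 0, so the merged mask selects socket 0; an empty intersection would be handled according to the `strict` or `preferred` policy described above.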
\ No newline at end of file diff --git a/contributors/design-proposals/node/troubleshoot-running-pods.md b/contributors/design-proposals/node/troubleshoot-running-pods.md index d6895c28..f0fbec72 100644 --- a/contributors/design-proposals/node/troubleshoot-running-pods.md +++ b/contributors/design-proposals/node/troubleshoot-running-pods.md @@ -1,8 +1,6 @@ -# Troubleshoot Running Pods +Design proposals have been archived. -* Status: Superseded -* Version: N/A -* Implementation Owner: @verb +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -The Troubleshooting Running Pods proposal has moved to the -[Ephemeral Containers KEP](https://git.k8s.io/enhancements/keps/sig-node/20190212-ephemeral-containers.md). + +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/release/OWNERS b/contributors/design-proposals/release/OWNERS deleted file mode 100644 index c414be94..00000000 --- a/contributors/design-proposals/release/OWNERS +++ /dev/null @@ -1,8 +0,0 @@ -# See the OWNERS docs at https://go.k8s.io/owners - -reviewers: - - sig-release-leads -approvers: - - sig-release-leads -labels: - - sig/release diff --git a/contributors/design-proposals/release/release-notes.md b/contributors/design-proposals/release/release-notes.md index 42f5ff24..f0fbec72 100644 --- a/contributors/design-proposals/release/release-notes.md +++ b/contributors/design-proposals/release/release-notes.md @@ -1,188 +1,6 @@ +Design proposals have been archived. -# Kubernetes Release Notes +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -[djmm@google.com](mailto:djmm@google.com)<BR> -Last Updated: 2016-04-06 - - -- [Kubernetes Release Notes](#kubernetes-release-notes) - - [Objective](#objective) - - [Background](#background) - - [The Problem](#the-problem) - - [The (general) Solution](#the-general-solution) - - [Then why not just list *every* change that was submitted, CHANGELOG-style?](#then-why-not-just-list-every-change-that-was-submitted-changelog-style) - - [Options](#options) - - [Collection Design](#collection-design) - - [Publishing Design](#publishing-design) - - [Location](#location) - - [Layout](#layout) - - [Alpha/Beta/Patch Releases](#alphabetapatch-releases) - - [Major/Minor Releases](#majorminor-releases) - - [Work estimates](#work-estimates) - - [Caveats / Considerations](#caveats--considerations) - - -## Objective - -Define a process and design tooling for collecting, arranging and publishing -release notes for Kubernetes releases, automating as much of the process as -possible. - -The goal is to introduce minor changes to the development workflow -in a way that is mostly frictionless and allows for the capture of release notes -as PRs are submitted to the repository. - -This direct association of release notes to PRs captures the intention of -release visibility of the PR at the point an idea is submitted upstream. -The release notes can then be more easily collected and published when the -release is ready. - -## Background - -### The Problem - -Release notes are often an afterthought and clarifying and finalizing them -is often left until the very last minute at the time the release is made. -This is usually long after the feature or bug fix was added and is no longer on -the mind of the author. Worse, the collecting and summarizing of the -release is often left to those who may know little or nothing about these -individual changes! - -Writing and editing release notes at the end of the cycle can be a rushed, -interrupt-driven and often stressful process resulting in incomplete, -inconsistent release notes often with errors and omissions. - -### The (general) Solution - -Like most things in the development/release pipeline, the earlier you do it, -the easier it is for everyone and the better the outcome. Gather your release -notes earlier in the development cycle, at the time the features and fixes are -added. - -#### Then why not just list *every* change that was submitted, CHANGELOG-style? - -On larger projects like Kubernetes, showing every single change (PR) would mean -hundreds of entries. The goal is to highlight the major changes for a release. - -## Options - -1. 
Use of pre-commit and other local git hooks - * Experiments here using `prepare-commit-msg` and `commit-msg` git hook files - were promising but less than optimal due to the fact that they would - require input/confirmation with each commit and there may be multiple - commits in a push and eventual PR. -1. Use of [github templates](https://github.com/blog/2111-issue-and-pull-request-templates) - * Templates provide a great way to pre-fill PR comments, but there are no - server-side hooks available to parse and/or easily check the contents of - those templates to ensure that checkboxes were checked or forms were filled - in. -1. Use of labels enforced by mungers/bots - * We already make great use of mungers/bots to manage labels on PRs and it - fits very nicely in the existing workflow - -## Collection Design - -The munger/bot option fits most cleanly into the existing workflow. - -All `release-note-*` labeling is managed on the master branch PR only. -No `release-note-*` labels are needed on cherry-pick PRs and no information -will be collected from that cherry-pick PR. - -The only exception to this rule is when a PR is not a cherry-pick and is -targeted directly to the non-master branch. In this case, a `release-note-*` -label is required for that non-master PR. - -1. New labels added to github: `release-note-none`, maybe others for new release note categories - see Layout section below -1. A [new munger](https://github.com/kubernetes/kubernetes/issues/23409) that will: - * Add a `release-note-label-needed` label to all new master branch PRs - * Block merge by the submit queue on all PRs labeled as `release-note-label-needed` - * Auto-remove `release-note-label-needed` when one of the `release-note-*` labels is added - -## Publishing Design - -### Location - -With v1.2.0, the release notes were moved from their previous [github releases](https://github.com/kubernetes/kubernetes/releases) -location to [CHANGELOG.md](../../CHANGELOG.md). Going forward this seems like a good plan. -Other projects do similarly. - -The kubernetes.tar.gz download link is also displayed along with the release notes -in [CHANGELOG.md](../../CHANGELOG.md). - -Is there any reason to continue publishing anything to github releases if -the complete release story is published in [CHANGELOG.md](../../CHANGELOG.md)? - -### Layout - -Different types of releases will generally have different requirements in -terms of layout. As expected, major releases like v1.2.0 are going -to require much more detail than the automated release notes will provide. - -The idea is that these mechanisms will provide 100% of the release note -content for alpha, beta and most minor releases and bootstrap the content -with a release note 'template' for the authors of major releases like v1.2.0. - -The authors can then collaborate and edit the higher level sections of the -release notes in a PR, updating [CHANGELOG.md](../../CHANGELOG.md) as needed. - -v1.2.0 demonstrated the need, at least for major releases like v1.2.0, for -several sections in the published release notes. -In order to provide a basic layout for release notes in the future, -new releases can bootstrap [CHANGELOG.md](../../CHANGELOG.md) with the following template types: - -#### Alpha/Beta/Patch Releases - -These are automatically generated from `release-note*` labels, but can be modified as needed. 
- -``` -Action Required -* PR titles from the release-note-action-required label - -Other notable changes -* PR titles from the release-note label -``` - -#### Major/Minor Releases - -``` -Major Themes -* Add to or delete this section - -Other notable improvements -* Add to or delete this section - -Experimental Features -* Add to or delete this section - -Action Required -* PR titles from the release-note-action-required label - -Known Issues -* Add to or delete this section - -Provider-specific Notes -* Add to or delete this section - -Other notable changes -* PR titles from the release-note label -``` - -## Work estimates - -* The [new munger](https://github.com/kubernetes/kubernetes/issues/23409) - * Owner: @eparis - * Time estimate: Mostly done -* Updates to the tool that collects, organizes, publishes and sends release - notifications. - * Owner: @david-mcmahon - * Time estimate: A few days - - -## Caveats / Considerations - -* As part of the planning and development workflow how can we capture - release notes for bigger features? - [#23070](https://github.com/kubernetes/kubernetes/issues/23070) - * For now contributors should simply use the first PR that enables a new - feature by default. We'll revisit if this does not work well. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
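As a sketch of how the publishing tooling could assemble the alpha/beta/patch template from labels, assume the merged PR titles and their labels have already been fetched from the GitHub API; the types below are illustrative, not the munger's real data model.

```go
package main

import "fmt"

// pr is an illustrative stand-in for the data the real tooling would
// pull from the GitHub API: a merged PR's title and its labels.
type pr struct {
	title  string
	labels []string
}

func hasLabel(p pr, label string) bool {
	for _, l := range p.labels {
		if l == label {
			return true
		}
	}
	return false
}

// renderNotes builds the alpha/beta/patch release-note template from
// the release-note-* labels described above.
func renderNotes(prs []pr) string {
	out := "Action Required\n"
	for _, p := range prs {
		if hasLabel(p, "release-note-action-required") {
			out += "* " + p.title + "\n"
		}
	}
	out += "\nOther notable changes\n"
	for _, p := range prs {
		if hasLabel(p, "release-note") {
			out += "* " + p.title + "\n"
		}
	}
	return out
}

func main() {
	prs := []pr{
		{title: "Remove deprecated flag --foo", labels: []string{"release-note-action-required"}},
		{title: "Fix kubelet restart race", labels: []string{"release-note"}},
		{title: "Typo fix", labels: []string{"release-note-none"}},
	}
	fmt.Print(renderNotes(prs))
}
```

PRs labeled `release-note-none` are simply omitted, which is what keeps the published notes from degenerating into a full CHANGELOG of every merge.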
\ No newline at end of file diff --git a/contributors/design-proposals/release/release-test-signal.md b/contributors/design-proposals/release/release-test-signal.md index 6975629d..f0fbec72 100644 --- a/contributors/design-proposals/release/release-test-signal.md +++ b/contributors/design-proposals/release/release-test-signal.md @@ -1,130 +1,6 @@ -# Overview +Design proposals have been archived. -Describes the process and tooling (`find_green_build`) used to find a -binary signal from the Kubernetes testing framework for the purposes of -selecting a release candidate. Currently this process is used to gate -all Kubernetes releases. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Motivation -Previously, the guidance in the [(now deprecated) release document](https://github.com/kubernetes/kubernetes/blob/fc3ef9320eb9d8211d85fbc404e4bbdd751f90af/docs/devel/releasing.md) -was to "look for green tests". That is, of course, decidedly insufficient. - -Software releases should have the goal of being primarily automated and -having a gating binary test signal is a key component to that ultimate goal. - -## Design - -### General - -The idea is to capture and automate the existing manual methods of -finding a green signal for testing. - -* Identify a green run from the primary job `ci-kubernetes-e2e-gce` -* Identify matching green runs from the secondary jobs - -The tooling should also have a simple and common interface whether using it -for a dashboard, to gate a release within anago or for an individual to use it -to check the state of testing at any time. - -Output looks like this: - -``` -$ find_green_build -find_green_build: BEGIN main on djmm Mon Dec 19 16:28:15 PST 2016 - -Checking for a valid github API token: OK -Checking required system packages: OK -Checking/setting cloud tools: OK - -Getting ci-kubernetes-e2e-gce build results from Jenkins... -Getting ci-kubernetes-e2e-gce-serial build results from Jenkins... -Getting ci-kubernetes-e2e-gce-slow build results from Jenkins... -Getting ci-kubernetes-kubemark-5-gce build results from Jenkins... -Getting ci-kubernetes-e2e-gce-reboot build results from Jenkins... -Getting ci-kubernetes-e2e-gce-scalability build results from Jenkins... -Getting ci-kubernetes-test-go build results from Jenkins... -Getting ci-kubernetes-cross-build build results from Jenkins... -Getting ci-kubernetes-e2e-gke-serial build results from Jenkins... -Getting ci-kubernetes-e2e-gke build results from Jenkins... -Getting ci-kubernetes-e2e-gke-slow build results from Jenkins... 
- -(*) Primary job (-) Secondary jobs - -Jenkins Job Run # Build # Time/Status -= ================================= ====== ======= =========== -* ci-kubernetes-e2e-gce #1668 #2347 [14:46 12/19] -* (--buildversion=v1.6.0-alpha.0.2347+9925b68038eacc) -- ci-kubernetes-e2e-gce-serial -- -- GIVE UP - -* ci-kubernetes-e2e-gce #1666 #2345 [13:23 12/19] -* (--buildversion=v1.6.0-alpha.0.2345+523ff93471b052) -- ci-kubernetes-e2e-gce-serial -- -- GIVE UP - -* ci-kubernetes-e2e-gce #1664 #2341 [09:38 12/19] -* (--buildversion=v1.6.0-alpha.0.2341+def802272904c0) -- ci-kubernetes-e2e-gce-serial -- -- GIVE UP - -* ci-kubernetes-e2e-gce #1662 #2339 [08:45 12/19] -* (--buildversion=v1.6.0-alpha.0.2339+ce67a03b81dee5) -- ci-kubernetes-e2e-gce-serial -- -- GIVE UP - -* ci-kubernetes-e2e-gce #1653 #2335 [07:42 12/19] -* (--buildversion=v1.6.0-alpha.0.2335+d6046aab0e0678) -- ci-kubernetes-e2e-gce-serial #192 #2335 PASSED -- ci-kubernetes-e2e-gce-slow #989 #2335 PASSED -- ci-kubernetes-kubemark-5-gce #2602 #2335 PASSED -- ci-kubernetes-e2e-gce-reboot #1523 #2335 PASSED -- ci-kubernetes-e2e-gce-scalability #460 #2335 PASSED -- ci-kubernetes-test-go #1266 #2335 PASSED -- ci-kubernetes-cross-build -- -- GIVE UP - -* ci-kubernetes-e2e-gce #1651 #2330 [06:43 12/19] -* (--buildversion=v1.6.0-alpha.0.2330+75dfb21018a7c3) -- ci-kubernetes-e2e-gce-serial #191 #2319 PASSED -- ci-kubernetes-e2e-gce-slow #988 #2330 PASSED -- ci-kubernetes-kubemark-5-gce #2599 #2330 PASSED -- ci-kubernetes-e2e-gce-reboot #1521 #2330 PASSED -- ci-kubernetes-e2e-gce-scalability #459 #2321 PASSED -- ci-kubernetes-test-go #1264 #2330 PASSED -- ci-kubernetes-cross-build #320 #2330 PASSED -- ci-kubernetes-e2e-gke-serial #233 #2319 PASSED -- ci-kubernetes-e2e-gke #1834 #2330 PASSED -- ci-kubernetes-e2e-gke-slow #1041 #2330 PASSED - -JENKINS_BUILD_VERSION=v1.6.0-alpha.0.2330+75dfb21018a7c3 -RELEASE_VERSION[alpha]=v1.6.0-alpha.1 -RELEASE_VERSION_PRIME=v1.6.0-alpha.1 -``` - -### v1 - -The initial release of this analyzer did everything on the client side. -This was slow to grab 100s of individual test results from GCS. -This was mitigated somewhat by building a local cache, but for those that -weren't using it regularly, the cache building step was a significant -(~1 minute) hit when just trying to check the test status. - -### v2 - -Building and storing that local cache on the jenkins server at build time -was the way to speed things up. Getting the cache from GCS is now consistent -for all users at ~10 seconds. After that the analyzer is running. - - -## Uses - -`find_green_build` and its functions are used in 3 ways: - -1. During the release process itself via `anago`. -1. When creating a pending release notes report via `relnotes --preview`, - used in creating dashboards -1. By an individual to get a quick check on the binary signal status of jobs - -## Future work - -1. There may be other ways to improve the performance of this check by - doing more work server side. -1. Using the `relnotes --preview` output to generate an external dashboard - will give more real-time visibility to both candidate release notes and - testing state. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
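The matching logic itself is straightforward. The sketch below assumes, for simplicity, that a secondary job must have a passing run at exactly the same build number as the primary run; as the sample output shows, the real `find_green_build` also accepts nearby passing builds for some secondary jobs, so this is only an approximation of its behavior.

```go
package main

import "fmt"

// result is an illustrative stand-in for what the real tooling reads
// from the job result cache in GCS: build number -> passed.
type result map[int]bool

// findGreenBuild walks the primary job's builds from newest to oldest
// and returns the first build number for which every secondary job
// also has a passing run at that build.
func findGreenBuild(primaryBuilds []int, primary result, secondaries map[string]result) (int, bool) {
	for _, build := range primaryBuilds { // assumed newest first
		if !primary[build] {
			continue
		}
		allGreen := true
		for _, sec := range secondaries {
			if !sec[build] {
				allGreen = false
				break
			}
		}
		if allGreen {
			return build, true
		}
	}
	return 0, false
}

func main() {
	primaryBuilds := []int{2347, 2345, 2335, 2330}
	primary := result{2347: true, 2345: true, 2335: true, 2330: true}
	secondaries := map[string]result{
		"ci-kubernetes-e2e-gce-serial": {2335: true, 2330: true},
		"ci-kubernetes-cross-build":    {2330: true},
	}
	if build, ok := findGreenBuild(primaryBuilds, primary, secondaries); ok {
		fmt.Printf("candidate build: #%d\n", build)
	} else {
		fmt.Println("GIVE UP")
	}
}
```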
\ No newline at end of file diff --git a/contributors/design-proposals/release/versioning.md b/contributors/design-proposals/release/versioning.md index 9290a1c7..f0fbec72 100644 --- a/contributors/design-proposals/release/versioning.md +++ b/contributors/design-proposals/release/versioning.md @@ -1,129 +1,6 @@ -**This design proposal does NOT reflect the current state of Kubernetes versioning.** +Design proposals have been archived. -**Up-to-date versioning policies can be found [here](https://git.k8s.io/sig-release/release-engineering/versioning.md).** +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -# Kubernetes Release Versioning - -Reference: [Semantic Versioning](http://semver.org) - -Legend: - -* **Kube X.Y.Z** refers to the version (git tag) of Kubernetes that is released. -This versions all components: apiserver, kubelet, kubectl, etc. (**X** is the -major version, **Y** is the minor version, and **Z** is the patch version.) - -## Release versioning - -### Minor version scheme and timeline - -* Kube X.Y.0-alpha.W, W > 0 (Branch: master) - * Alpha releases are released roughly every two weeks directly from the master -branch. - * No cherrypick releases. If there is a critical bugfix, a new release from -master can be created ahead of schedule. -* Kube X.Y.Z-beta.W (Branch: release-X.Y) - * When master is feature-complete for Kube X.Y, we will cut the release-X.Y -branch 2 weeks prior to the desired X.Y.0 date and cherrypick only PRs essential -to X.Y. - * This cut will be marked as X.Y.0-beta.0, and master will be revved to X.Y+1.0-alpha.0. - * If we're not satisfied with X.Y.0-beta.0, we'll release other beta releases, -(X.Y.0-beta.W | W > 0) as necessary. -* Kube X.Y.0 (Branch: release-X.Y) - * Final release, cut from the release-X.Y branch cut two weeks prior. - * X.Y.1-beta.0 will be tagged at the same commit on the same branch. - * X.Y.0 occur 3 to 4 months after X.(Y-1).0. -* Kube X.Y.Z, Z > 0 (Branch: release-X.Y) - * [Patch releases](#patch-releases) are released as we cherrypick commits into -the release-X.Y branch, (which is at X.Y.Z-beta.W,) as needed. - * X.Y.Z is cut straight from the release-X.Y branch, and X.Y.Z+1-beta.0 is -tagged on the followup commit that updates pkg/version/base.go with the beta -version. -* Kube X.Y.Z, Z > 0 (Branch: release-X.Y.Z) - * These are special and different in that the X.Y.Z tag is branched to isolate -the emergency/critical fix from all other changes that have landed on the -release branch since the previous tag - * Cut release-X.Y.Z branch to hold the isolated patch release - * Tag release-X.Y.Z branch + fixes with X.Y.(Z+1) - * Branched [patch releases](#patch-releases) are rarely needed but used for -emergency/critical fixes to the latest release - * See [#19849](https://issues.k8s.io/19849) tracking the work that is needed -for this kind of release to be possible. - -### Major version timeline - -There is no mandated timeline for major versions and there are currently no criteria -for shipping 2.0.0. We haven't so far applied a rigorous interpretation of semantic -versioning with respect to incompatible changes of any kind (e.g., component flag changes). -We previously discussed releasing 2.0.0 when removing the monolithic `v1` API -group/version, but there are no current plans to do so. - -### CI and dev version scheme - -* Continuous integration versions also exist, and are versioned off of alpha and -beta releases. 
X.Y.Z-alpha.W.C+aaaa is C commits after X.Y.Z-alpha.W, with an -additional +aaaa build suffix added; X.Y.Z-beta.W.C+bbbb is C commits after -X.Y.Z-beta.W, with an additional +bbbb build suffix added. Furthermore, builds -that are built off of a dirty build tree, (during development, with things in -the tree that are not checked it,) it will be appended with -dirty. - -### Supported releases and component skew - -We expect users to stay reasonably up-to-date with the versions of Kubernetes -they use in production, but understand that it may take time to upgrade, -especially for production-critical components. - -We expect users to be running approximately the latest patch release of a given -minor release; we often include critical bug fixes in -[patch releases](#patch-releases), and so encourage users to upgrade as soon as -possible. - -Different components are expected to be compatible across different amounts of -skew, all relative to the master version. Nodes may lag masters components by -up to two minor versions but should be at a version no newer than the master; a -client should be skewed no more than one minor version from the master, but may -lead the master by up to one minor version. For example, a v1.3 master should -work with v1.1, v1.2, and v1.3 nodes, and should work with v1.2, v1.3, and v1.4 -clients. - -Furthermore, we expect to "support" three minor releases at a time. "Support" -means we expect users to be running that version in production, though we may -not port fixes back before the latest minor version. For example, when v1.3 -comes out, v1.0 will no longer be supported: basically, that means that the -reasonable response to the question "my v1.0 cluster isn't working," is, "you -should probably upgrade it, (and probably should have some time ago)". With -minor releases happening approximately every three months, that means a minor -release is supported for approximately nine months. - -## Patch releases - -Patch releases are intended for critical bug fixes to the latest minor version, -such as addressing security vulnerabilities, fixes to problems affecting a large -number of users, severe problems with no workaround, and blockers for products -based on Kubernetes. - -They should not contain miscellaneous feature additions or improvements, and -especially no incompatibilities should be introduced between patch versions of -the same minor version (or even major version). - -Dependencies, such as Docker or Etcd, should also not be changed unless -absolutely necessary, and also just to fix critical bugs (so, at most patch -version changes, not new major nor minor versions). - -## Upgrades - -* Users can upgrade from any Kube 1.x release to any other Kube 1.x release as a -rolling upgrade across their cluster. (Rolling upgrade means being able to -upgrade the master first, then one node at a time. See [#4855](https://issues.k8s.io/4855) for details.) - * However, we do not recommend upgrading more than two minor releases at a -time (see [Supported releases and component skew](#Supported-releases-and-component-skew)), and do not recommend -running non-latest patch releases of a given minor release. -* No hard breaking changes over version boundaries. - * For example, if a user is at Kube 1.x, we may require them to upgrade to -Kube 1.x+y before upgrading to Kube 2.x. In others words, an upgrade across -major versions (e.g. Kube 1.x to Kube 2.x) should effectively be a no-op and as -graceful as an upgrade from Kube 1.x to Kube 1.x+1. 
But you can require someone -to go from 1.x to 1.x+y before they go to 2.x. - -There is a separate question of how to track the capabilities of a kubelet to -facilitate rolling upgrades. That is not addressed here. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
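The component-skew rules above reduce to a pair of checks on minor versions. This is only a sketch of the stated policy, not code taken from any Kubernetes component.

```go
package main

import "fmt"

// nodeSkewOK applies the node skew rule described above: nodes may lag
// the master by up to two minor versions and must not be newer than it.
func nodeSkewOK(masterMinor, nodeMinor int) bool {
	return nodeMinor <= masterMinor && masterMinor-nodeMinor <= 2
}

// clientSkewOK applies the client skew rule: a client may be at most
// one minor version away from the master, in either direction.
func clientSkewOK(masterMinor, clientMinor int) bool {
	diff := masterMinor - clientMinor
	if diff < 0 {
		diff = -diff
	}
	return diff <= 1
}

func main() {
	// The v1.3 master example from the text: v1.1-v1.3 nodes,
	// v1.2-v1.4 clients.
	for _, n := range []int{1, 2, 3, 4} {
		fmt.Printf("node v1.%d with v1.3 master: %v\n", n, nodeSkewOK(3, n))
	}
	for _, c := range []int{1, 2, 3, 4} {
		fmt.Printf("client v1.%d with v1.3 master: %v\n", c, clientSkewOK(3, c))
	}
}
```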
\ No newline at end of file diff --git a/contributors/design-proposals/resource-management/OWNERS b/contributors/design-proposals/resource-management/OWNERS deleted file mode 100644 index d717eba7..00000000 --- a/contributors/design-proposals/resource-management/OWNERS +++ /dev/null @@ -1,6 +0,0 @@ -# See the OWNERS docs at https://go.k8s.io/owners - -reviewers: - - wg-resource-management-leads -approvers: - - wg-resource-management-leads diff --git a/contributors/design-proposals/resource-management/admission_control_limit_range.md b/contributors/design-proposals/resource-management/admission_control_limit_range.md index 7dd454c7..f0fbec72 100644 --- a/contributors/design-proposals/resource-management/admission_control_limit_range.md +++ b/contributors/design-proposals/resource-management/admission_control_limit_range.md @@ -1,230 +1,6 @@ -# Admission control plugin: LimitRanger +Design proposals have been archived. -## Background +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -This document proposes a system for enforcing resource requirements constraints -as part of admission control. - -## Use cases - -1. Ability to enumerate resource requirement constraints per namespace -2. Ability to enumerate min/max resource constraints for a pod -3. Ability to enumerate min/max resource constraints for a container -4. Ability to specify default resource limits for a container -5. Ability to specify default resource requests for a container -6. Ability to enforce a ratio between request and limit for a resource. -7. Ability to enforce min/max storage requests for persistent volume claims - -## Data Model - -The **LimitRange** resource is scoped to a **Namespace**. - -### Type - -```go -// LimitType is a type of object that is limited -type LimitType string - -const ( - // Limit that applies to all pods in a namespace - LimitTypePod LimitType = "Pod" - // Limit that applies to all containers in a namespace - LimitTypeContainer LimitType = "Container" -) - -// LimitRangeItem defines a min/max usage limit for any resource that matches -// on kind. -type LimitRangeItem struct { - // Type of resource that this limit applies to. - Type LimitType `json:"type,omitempty"` - // Max usage constraints on this kind by resource name. - Max ResourceList `json:"max,omitempty"` - // Min usage constraints on this kind by resource name. - Min ResourceList `json:"min,omitempty"` - // Default resource requirement limit value by resource name if resource limit - // is omitted. - Default ResourceList `json:"default,omitempty"` - // DefaultRequest is the default resource requirement request value by - // resource name if resource request is omitted. - DefaultRequest ResourceList `json:"defaultRequest,omitempty"` - // MaxLimitRequestRatio if specified, the named resource must have a request - // and limit that are both non-zero where limit divided by request is less - // than or equal to the enumerated value; this represents the max burst for - // the named resource. - MaxLimitRequestRatio ResourceList `json:"maxLimitRequestRatio,omitempty"` -} - -// LimitRangeSpec defines a min/max usage limit for resources that match -// on kind. -type LimitRangeSpec struct { - // Limits is the list of LimitRangeItem objects that are enforced. - Limits []LimitRangeItem `json:"limits"` -} - -// LimitRange sets resource usage limits for each kind of resource in a -// Namespace. 
-type LimitRange struct { - TypeMeta `json:",inline"` - // Standard object's metadata. - // More info: - // http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata - ObjectMeta `json:"metadata,omitempty"` - - // Spec defines the limits enforced. - // More info: - // http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status - Spec LimitRangeSpec `json:"spec,omitempty"` -} - -// LimitRangeList is a list of LimitRange items. -type LimitRangeList struct { - TypeMeta `json:",inline"` - // Standard list metadata. - // More info: - // http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#types-kinds - ListMeta `json:"metadata,omitempty"` - - // Items is a list of LimitRange objects. - // More info: - // http://releases.k8s.io/HEAD/docs/design/admission_control_limit_range.md - Items []LimitRange `json:"items"` -} -``` - -### Validation - -Validation of a **LimitRange** enforces that for a given named resource the -following rules apply: - -Min (if specified) <= DefaultRequest (if specified) <= Default (if specified) -<= Max (if specified) - -### Default Value Behavior - -The following default value behaviors are applied to a LimitRange for a given -named resource. - -``` -if LimitRangeItem.Default[resourceName] is undefined - if LimitRangeItem.Max[resourceName] is defined - LimitRangeItem.Default[resourceName] = LimitRangeItem.Max[resourceName] -``` - -``` -if LimitRangeItem.DefaultRequest[resourceName] is undefined - if LimitRangeItem.Default[resourceName] is defined - LimitRangeItem.DefaultRequest[resourceName] = LimitRangeItem.Default[resourceName] - else if LimitRangeItem.Min[resourceName] is defined - LimitRangeItem.DefaultRequest[resourceName] = LimitRangeItem.Min[resourceName] -``` - -## AdmissionControl plugin: LimitRanger - -The **LimitRanger** plug-in introspects all incoming pod requests and evaluates -the constraints defined on a LimitRange. - -If a constraint is not specified for an enumerated resource, it is not enforced -or tracked. - -To enable the plug-in and support for LimitRange, the kube-apiserver must be -configured as follows: - -```console -$ kube-apiserver --admission-control=LimitRanger -``` - -### Enforcement of constraints - -**Type: Container** - -Supported Resources: - -1. memory -2. cpu - -Supported Constraints: - -Per container, the following must hold true: - -| Constraint | Behavior | -| ---------- | -------- | -| Min | Min <= Request (required) <= Limit (optional) | -| Max | Limit (required) <= Max | -| LimitRequestRatio | LimitRequestRatio <= ( Limit (required, non-zero) / Request (required, non-zero)) | - -Supported Defaults: - -1. Default - if the named resource has no enumerated values, the Limit is equal -to the Default -2. DefaultRequest - if the named resource has no enumerated values, the Request -is equal to the DefaultRequest - -**Type: Pod** - -Supported Resources: - -1. memory -2. cpu - -Supported Constraints: - -Across all containers in pod, the following must hold true - -| Constraint | Behavior | -| ---------- | -------- | -| Min | Min <= Request (required) <= Limit (optional) | -| Max | Limit (required) <= Max | -| LimitRequestRatio | LimitRequestRatio <= ( Limit (required, non-zero) / Request (non-zero) ) | - -**Type: PersistentVolumeClaim** - -Supported Resources: - -1. 
storage - -Supported Constraints: - -Across all claims in a namespace, the following must hold true: - -| Constraint | Behavior | -| ---------- | -------- | -| Min | Min >= Request (required) | -| Max | Max <= Request (required) | - -Supported Defaults: None. Storage is a required field in `PersistentVolumeClaim`, so defaults are not applied at this time. - -## Run-time configuration - -The default ```LimitRange``` that is applied via Salt configuration will be -updated as follows: - -``` -apiVersion: "v1" -kind: "LimitRange" -metadata: - name: "limits" - namespace: default -spec: - limits: - - type: "Container" - defaultRequest: - cpu: "100m" -``` - -## Example - -An example LimitRange configuration: - -| Type | Resource | Min | Max | Default | DefaultRequest | LimitRequestRatio | -| ---- | -------- | --- | --- | ------- | -------------- | ----------------- | -| Container | cpu | .1 | 1 | 500m | 250m | 4 | -| Container | memory | 250Mi | 1Gi | 500Mi | 250Mi | | - -Assuming an incoming container that specified no incoming resource requirements, -the following would happen. - -1. The incoming container cpu would request 250m with a limit of 500m. -2. The incoming container memory would request 250Mi with a limit of 500Mi -3. If the container is later resized, it's cpu would be constrained to between -.1 and 1 and the ratio of limit to request could not exceed 4. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
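The default value behavior described earlier can be expressed compactly. The sketch below uses a simplified `resourceList` of strings in place of the real `ResourceList` of quantities; it is an illustration of the stated rules, not the LimitRanger's implementation.

```go
package main

import "fmt"

// resourceList is an illustrative stand-in for api.ResourceList,
// mapping a resource name (e.g. "cpu") to a quantity string.
type resourceList map[string]string

// applyLimitRangeDefaults fills in Default and DefaultRequest for one
// named resource, following the default value behavior described above:
// an undefined Default falls back to Max, and an undefined
// DefaultRequest falls back to Default, then to Min.
func applyLimitRangeDefaults(name string, min, max, def, defRequest resourceList) {
	if _, ok := def[name]; !ok {
		if v, ok := max[name]; ok {
			def[name] = v
		}
	}
	if _, ok := defRequest[name]; !ok {
		if v, ok := def[name]; ok {
			defRequest[name] = v
		} else if v, ok := min[name]; ok {
			defRequest[name] = v
		}
	}
}

func main() {
	min := resourceList{"cpu": "100m"}
	max := resourceList{"cpu": "1"}
	def := resourceList{}
	defRequest := resourceList{}
	applyLimitRangeDefaults("cpu", min, max, def, defRequest)
	// With neither Default nor DefaultRequest enumerated, Default falls
	// back to Max and DefaultRequest then falls back to that value.
	fmt.Println("Default:", def["cpu"], "DefaultRequest:", defRequest["cpu"])
}
```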
\ No newline at end of file diff --git a/contributors/design-proposals/resource-management/admission_control_resource_quota.md b/contributors/design-proposals/resource-management/admission_control_resource_quota.md index 02c3afb6..f0fbec72 100644 --- a/contributors/design-proposals/resource-management/admission_control_resource_quota.md +++ b/contributors/design-proposals/resource-management/admission_control_resource_quota.md @@ -1,230 +1,6 @@ -# Admission control plugin: ResourceQuota +Design proposals have been archived. -## Background +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -This document describes a system for enforcing hard resource usage limits per -namespace as part of admission control. - -## Use cases - -1. Ability to enumerate resource usage limits per namespace. -2. Ability to monitor resource usage for tracked resources. -3. Ability to reject resource usage exceeding hard quotas. - -## Data Model - -The **ResourceQuota** object is scoped to a **Namespace**. - -```go -// The following identify resource constants for Kubernetes object types -const ( - // Pods, number - ResourcePods ResourceName = "pods" - // Services, number - ResourceServices ResourceName = "services" - // ReplicationControllers, number - ResourceReplicationControllers ResourceName = "replicationcontrollers" - // ResourceQuotas, number - ResourceQuotas ResourceName = "resourcequotas" - // ResourceSecrets, number - ResourceSecrets ResourceName = "secrets" - // ResourcePersistentVolumeClaims, number - ResourcePersistentVolumeClaims ResourceName = "persistentvolumeclaims" -) - -// ResourceQuotaSpec defines the desired hard limits to enforce for Quota -type ResourceQuotaSpec struct { - // Hard is the set of desired hard limits for each named resource - Hard ResourceList `json:"hard,omitempty" description:"hard is the set of desired hard limits for each named resource; see http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"` -} - -// ResourceQuotaStatus defines the enforced hard limits and observed use -type ResourceQuotaStatus struct { - // Hard is the set of enforced hard limits for each named resource - Hard ResourceList `json:"hard,omitempty" description:"hard is the set of enforced hard limits for each named resource; see http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"` - // Used is the current observed total usage of the resource in the namespace - Used ResourceList `json:"used,omitempty" description:"used is the current observed total usage of the resource in the namespace"` -} - -// ResourceQuota sets aggregate quota restrictions enforced per namespace -type ResourceQuota struct { - TypeMeta `json:",inline"` - ObjectMeta `json:"metadata,omitempty" description:"standard object metadata; see http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata"` - - // Spec defines the desired quota - Spec ResourceQuotaSpec `json:"spec,omitempty" description:"spec defines the desired quota; http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status"` - - // Status defines the actual enforced quota and its current usage - Status ResourceQuotaStatus `json:"status,omitempty" description:"status defines the actual enforced quota and current usage; http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status"` -} - -// ResourceQuotaList is a list of ResourceQuota 
items -type ResourceQuotaList struct { - TypeMeta `json:",inline"` - ListMeta `json:"metadata,omitempty" description:"standard list metadata; see http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata"` - - // Items is a list of ResourceQuota objects - Items []ResourceQuota `json:"items" description:"items is a list of ResourceQuota objects; see http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"` -} -``` - -## Quota Tracked Resources - -The following resources are supported by the quota system: - -| Resource | Description | -| ------------ | ----------- | -| cpu | Total requested cpu usage | -| memory | Total requested memory usage | -| pods | Total number of active pods where phase is pending or active. | -| services | Total number of services | -| replicationcontrollers | Total number of replication controllers | -| resourcequotas | Total number of resource quotas | -| secrets | Total number of secrets | -| persistentvolumeclaims | Total number of persistent volume claims | - -If a third-party wants to track additional resources, it must follow the -resource naming conventions prescribed by Kubernetes. This means the resource -must have a fully-qualified name (i.e. mycompany.org/shinynewresource) - -## Resource Requirements: Requests vs. Limits - -If a resource supports the ability to distinguish between a request and a limit -for a resource, the quota tracking system will only cost the request value -against the quota usage. If a resource is tracked by quota, and no request value -is provided, the associated entity is rejected as part of admission. - -For an example, consider the following scenarios relative to tracking quota on -CPU: - -| Pod | Container | Request CPU | Limit CPU | Result | -| --- | --------- | ----------- | --------- | ------ | -| X | C1 | 100m | 500m | The quota usage is incremented 100m | -| Y | C2 | 100m | none | The quota usage is incremented 100m | -| Y | C2 | none | 500m | The quota usage is incremented 500m since request will default to limit | -| Z | C3 | none | none | The pod is rejected since it does not enumerate a request. | - -The rationale for accounting for the requested amount of a resource versus the -limit is the belief that a user should only be charged for what they are -scheduled against in the cluster. In addition, attempting to track usage against -actual usage, where request < actual < limit, is considered highly volatile. - -As a consequence of this decision, the user is able to spread its usage of a -resource across multiple tiers of service. Let's demonstrate this via an -example with a 4 cpu quota. - -The quota may be allocated as follows: - -| Pod | Container | Request CPU | Limit CPU | Tier | Quota Usage | -| --- | --------- | ----------- | --------- | ---- | ----------- | -| X | C1 | 1 | 4 | Burstable | 1 | -| Y | C2 | 2 | 2 | Guaranteed | 2 | -| Z | C3 | 1 | 3 | Burstable | 1 | - -It is possible that the pods may consume 9 cpu over a given time period -depending on the nodes available cpu that held pod X and Z, but since we -scheduled X and Z relative to the request, we only track the requesting value -against their allocated quota. If one wants to restrict the ratio between the -request and limit, it is encouraged that the user define a **LimitRange** with -**LimitRequestRatio** to control burst out behavior. This would in effect, let -an administrator keep the difference between request and limit more in line with -tracked usage if desired. 
- -## Status API - -A REST API endpoint to update the status section of the **ResourceQuota** is -exposed. It requires an atomic compare-and-swap in order to keep resource usage -tracking consistent. - -## Resource Quota Controller - -A resource quota controller monitors observed usage for tracked resources in the -**Namespace**. - -If there is observed difference between the current usage stats versus the -current **ResourceQuota.Status**, the controller posts an update of the -currently observed usage metrics to the **ResourceQuota** via the /status -endpoint. - -The resource quota controller is the only component capable of monitoring and -recording usage updates after a DELETE operation since admission control is -incapable of guaranteeing a DELETE request actually succeeded. - -## AdmissionControl plugin: ResourceQuota - -The **ResourceQuota** plug-in introspects all incoming admission requests. - -To enable the plug-in and support for ResourceQuota, the kube-apiserver must be -configured as follows: - -``` -$ kube-apiserver --admission-control=ResourceQuota -``` - -It makes decisions by evaluating the incoming object against all defined -**ResourceQuota.Status.Hard** resource limits in the request namespace. If -acceptance of the resource would cause the total usage of a named resource to -exceed its hard limit, the request is denied. - -If the incoming request does not cause the total usage to exceed any of the -enumerated hard resource limits, the plug-in will post a -**ResourceQuota.Status** document to the server to atomically update the -observed usage based on the previously read **ResourceQuota.ResourceVersion**. -This keeps incremental usage atomically consistent, but does introduce a -bottleneck (intentionally) into the system. - -To optimize system performance, it is encouraged that all resource quotas are -tracked on the same **ResourceQuota** document in a **Namespace**. As a result, -it is encouraged to impose a cap on the total number of individual quotas that -are tracked in the **Namespace** to 1 in the **ResourceQuota** document. - -## kubectl - -kubectl is modified to support the **ResourceQuota** resource. - -`kubectl describe` provides a human-readable output of quota. - -For example: - -```console -$ kubectl create -f test/fixtures/doc-yaml/admin/resourcequota/namespace.yaml -namespace "quota-example" created -$ kubectl create -f test/fixtures/doc-yaml/admin/resourcequota/quota.yaml --namespace=quota-example -resourcequota "quota" created -$ kubectl describe quota quota --namespace=quota-example -Name: quota -Namespace: quota-example -Resource Used Hard --------- ---- ---- -cpu 0 20 -memory 0 1Gi -persistentvolumeclaims 0 10 -pods 0 10 -replicationcontrollers 0 20 -resourcequotas 1 1 -secrets 1 10 -services 0 5 -``` - -Simple object count quotas supported on all standard resources using`count/<resource>.<group>` syntax. 
- -For example: -```console -$ kubectl create quota test --hard=count/deployments.extensions=2,count/replicasets.extensions=4,count/pods=3,count/secrets=4 -resourcequota "test" created -$ kubectl run nginx --image=nginx --replicas=2 -$ kubectl describe quota -Name: test -Namespace: default -Resource Used Hard --------- ---- ---- -count/deployments.extensions 1 2 -count/pods 2 3 -count/replicasets.extensions 1 4 -count/secrets 1 4 -``` - -## More information - -See [resource quota document](https://kubernetes.io/docs/concepts/policy/resource-quotas/) and the [example of Resource Quota](https://kubernetes.io/docs/tasks/administer-cluster/quota-api-object/) for more information. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
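The admission check itself amounts to comparing the incremental usage of an incoming object against each enumerated hard limit. The following sketch uses plain integers rather than `resource.Quantity` and omits the compare-and-swap against `ResourceVersion` described above; it illustrates the rule, not the plugin's actual code.

```go
package main

import "fmt"

// quotaStatus is an illustrative stand-in for ResourceQuotaStatus,
// using plain integers (pods as a count, cpu in millicores).
type quotaStatus struct {
	Hard map[string]int64
	Used map[string]int64
}

// admit checks whether the additional usage requested by an incoming
// object fits under every enumerated hard limit and, if so, records it.
// The real plugin persists the new Used totals atomically via the
// /status endpoint, keyed on the previously read ResourceVersion.
func admit(status *quotaStatus, delta map[string]int64) error {
	for name, d := range delta {
		hard, tracked := status.Hard[name]
		if !tracked {
			continue // resources without an enumerated hard limit are not enforced
		}
		if status.Used[name]+d > hard {
			return fmt.Errorf("exceeded quota for %s: used %d, requested %d, hard %d",
				name, status.Used[name], d, hard)
		}
	}
	for name, d := range delta {
		if _, tracked := status.Hard[name]; tracked {
			status.Used[name] += d
		}
	}
	return nil
}

func main() {
	status := &quotaStatus{
		Hard: map[string]int64{"pods": 10, "cpu": 20000},
		Used: map[string]int64{"pods": 9, "cpu": 19900},
	}
	// A new pod whose containers request 100m of cpu in total; requests,
	// not limits, are charged against the quota.
	if err := admit(status, map[string]int64{"pods": 1, "cpu": 100}); err != nil {
		fmt.Println("denied:", err)
	} else {
		fmt.Println("admitted; used:", status.Used)
	}
}
```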
\ No newline at end of file diff --git a/contributors/design-proposals/resource-management/device-plugin-overview.png b/contributors/design-proposals/resource-management/device-plugin-overview.png Binary files differdeleted file mode 100644 index 4d3f90f8..00000000 --- a/contributors/design-proposals/resource-management/device-plugin-overview.png +++ /dev/null diff --git a/contributors/design-proposals/resource-management/device-plugin.md b/contributors/design-proposals/resource-management/device-plugin.md index 4cd2cc4e..f0fbec72 100644 --- a/contributors/design-proposals/resource-management/device-plugin.md +++ b/contributors/design-proposals/resource-management/device-plugin.md @@ -1,504 +1,6 @@ -Device Manager Proposal -=============== +Design proposals have been archived. -* [Motivation](#motivation) -* [Use Cases](#use-cases) -* [Objectives](#objectives) -* [Non Objectives](#non-objectives) -* [Vendor story](#vendor-story) -* [End User story](#end-user-story) -* [Device Plugin](#device-plugin) - * [Introduction](#introduction) - * [Registration](#registration) - * [Unix Socket](#unix-socket) - * [Protocol Overview](#protocol-overview) - * [API specification](#api-specification) - * [HealthCheck and Failure Recovery](#healthcheck-and-failure-recovery) - * [API Changes](#api-changes) -* [Upgrading your cluster](#upgrading-your-cluster) -* [Installation](#installation) -* [Versioning](#versioning) -* [References](#references) +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -_Authors:_ - -* @RenaudWasTaken - Renaud Gaubert <rgaubert@NVIDIA.com> -* @jiayingz - Jiaying Zhang <jiayingz@google.com> - -# Motivation - -Kubernetes currently supports discovery of CPU and Memory primarily to a -minimal extent. Very few devices are handled natively by Kubelet. - -It is not a sustainable solution to expect every hardware vendor to add their -vendor specific code inside Kubernetes to make their devices usable. - -Instead, we want a solution for vendors to be able to advertise their resources -to Kubelet and monitor them without writing custom Kubernetes code. -We also want to provide a consistent and portable solution for users to -consume hardware devices across k8s clusters. - -This document describes a vendor independent solution to: - * Discovering and representing external devices - * Making these devices available to the containers, using these devices, - scrubbing and securely sharing these devices. - * Health Check of these devices - -Because devices are vendor dependent and have their own sets of problems -and mechanisms, the solution we describe is a plugin mechanism that may run -in a container deployed through the DaemonSets mechanism or in bare metal mode. - -The targeted devices include GPUs, High-performance NICs, FPGAs, InfiniBand, -Storage devices, and other similar computing resources that require vendor -specific initialization and setup. - -The goal is for a user to be able to enable vendor devices (e.g: GPUs) through -the following simple steps: - * `kubectl create -f http://vendor.com/device-plugin-daemonset.yaml` - * When launching `kubectl describe nodes`, the devices appear in the node - status as `vendor-domain/vendor-device`. Note: naming - convention is discussed in PR [#844](https://github.com/kubernetes/community/pull/844) - -# Use Cases - - * I want to use a particular device type (GPU, InfiniBand, FPGA, etc.) - in my pod. 
- * I should be able to use that device without writing custom Kubernetes code. - * I want a consistent and portable solution to consume hardware devices - across k8s clusters. - -# Objectives - -1. Add support for vendor specific Devices in kubelet: - * Through an extension mechanism. - * Which allows discovery and health check of devices. - * Which allows hooking the runtime to make devices available in containers - and cleaning them up. -2. Define a deployment mechanism for this new API. -3. Define a versioning mechanism for this new API. - -# Non Objectives - -1. Handling heterogeneous nodes and topology related problems -2. Collecting metrics is not part of this proposal. We will only solve - Health Check. - -# TLDR - -At their core, device plugins are simple gRPC servers that may run in a -container deployed through the pod mechanism or in bare metal mode. - -These servers implement the gRPC interface defined later in this design -document and once the device plugin makes itself known to kubelet, kubelet -will interact with the device through two simple functions: - 1. A `ListAndWatch` function for the kubelet to Discover the devices and - their properties as well as notify of any status change (device - became unhealthy). - 2. An `Allocate` function which is called before creating a user container - consuming any exported devices - - - -# Vendor story - -Kubernetes provides to vendors a mechanism called device plugins to: - * advertise devices. - * monitor devices (currently perform health checks). - * hook into the runtime to execute device specific instructions - (e.g: Clean GPU memory) and - to take in order to make the device available in the container. - -```go -service DevicePlugin { - // returns a stream of []Device - rpc ListAndWatch(Empty) returns (stream ListAndWatchResponse) {} - rpc Allocate(AllocateRequest) returns (AllocateResponse) {} -} -``` - -The gRPC server that the device plugin must implement is expected to -be advertised on a unix socket in a mounted hostPath (e.g: -`/var/lib/kubelet/device-plugins/nvidiaGPU.sock`). - -Finally, to notify Kubelet of the existence of the device plugin, -the vendor's device plugin will have to make a request to Kubelet's -own gRPC server. -Only then will kubelet start interacting with the vendor's device plugin -through the gRPC apis. - -# End User story - -When setting up the cluster the admin knows what kind of devices are present -on the different machines and therefore can select what devices to enable. - -The cluster admin knows his cluster has NVIDIA GPUs therefore he deploys -the NVIDIA device plugin through: -`kubectl create -f nvidia.io/device-plugin.yml` - -The device plugin lands on all the nodes of the cluster and if it detects that -there are no GPUs it terminates (assuming `restart: OnFailure`). However, when -there are GPUs it reports them to Kubelet and starts its gRPC server to -monitor devices and hook into the container creation process. - -Devices reported by Device Plugins are advertised as Extended resources of -the shape `vendor-domain/vendor-device`. -E.g., Nvidia GPUs are advertised as `nvidia.com/gpu` - -Devices can be selected using the same process as for OIRs in the pod spec. -Devices have no impact on QOS. However, for the alpha, we expect the request -to have limits == requests. - -1. A user submits a pod spec requesting X GPUs (or devices) through - `vendor-domain/vendor-device` -2. The scheduler filters the nodes which do not match the resource requests -3. 
The pod lands on the node and Kubelet decides which device - should be assigned to the pod -4. Kubelet calls `Allocate` on the matching Device Plugins -5. The user deletes the pod or the pod terminates - -When receiving a pod which requests Devices kubelet is in charge of: - * deciding which device to assign to the pod's containers - * Calling the `Allocate` function with the list of devices - -The scheduler is still in charge of filtering the nodes which cannot -satisfy the resource requests. - -# Device Plugin - -## Introduction - -The device plugin is structured in 3 parts: -1. Registration: The device plugin advertises its presence to Kubelet -2. ListAndWatch: The device plugin advertises a list of Devices to Kubelet - and sends it again if the state of a Device changes -3. Allocate: When creating containers, Kubelet calls the device plugin's - `Allocate` function so that it can run device specific instructions (gpu - cleanup, QRNG initialization, ...) and instruct Kubelet how to make the - device available in the container. - -## Registration - -When starting the device plugin is expected to make a (client) gRPC call -to the `Register` function that Kubelet exposes. - -The communication between Kubelet is expected to happen only through Unix -sockets and follow this simple pattern: -1. The device plugins sends a `RegisterRequest` to Kubelet (through a - gRPC request) -2. Kubelet answers to the `RegisterRequest` with a `RegisterResponse` - containing any error Kubelet might have encountered -3. The device plugin start its gRPC server if it did not receive an - error - -## Unix Socket - -Device Plugins are expected to communicate with Kubelet through gRPC -on an Unix socket. -When starting the gRPC server, they are expected to create a unix socket -at the following host path: `/var/lib/kubelet/device-plugins/`. - -For non bare metal device plugin this means they will have to mount the folder -as a volume in their pod spec ([see Installation](#installation)). - -Device plugins can expect to find the socket to register themselves on -the host at the following path: -`/var/lib/kubelet/device-plugins/kubelet.sock`. - -## Protocol Overview - -When first registering themselves against Kubelet, the device plugin -will send: - * The name of their unix socket - * [The API version against which they were built](#versioning). - * Their `ResourceName` they want to advertise - -Kubelet answers with whether or not there was an error. -The errors may include (but not limited to): - * API version not supported - * A device plugin already registered this `ResourceName` - -After successful registration, Kubelet will interact with the plugin through -the following functions: - * ListAndWatch: The device plugin advertises a list of Devices to Kubelet - and sends it again if the state of a Device changes - * `Allocate`: Called when creating a container with a list of devices - - - - -## API Specification - -```go -// Registration is the service advertised by the Kubelet -// Only when Kubelet answers with a success code to a Register Request -// may Device Plugins start their service -// Registration may fail when device plugin version is not supported by -// Kubelet or the registered resourceName is already taken by another -// active device plugin. 
Device plugin is expected to terminate upon registration failure -service Registration { - rpc Register(RegisterRequest) returns (Empty) {} -} - -// DevicePlugin is the service advertised by Device Plugins -service DevicePlugin { - // ListAndWatch returns a stream of List of Devices - // Whenever a Device state change or a Device disappears, ListAndWatch - // returns the new list - rpc ListAndWatch(Empty) returns (stream ListAndWatchResponse) {} - - // Allocate is called during container creation so that the Device - // Plugin can run device specific operations and instruct Kubelet - // of the steps to make the Device available in the container - rpc Allocate(AllocateRequest) returns (AllocateResponse) {} -} - -message RegisterRequest { - // Version of the API the Device Plugin was built against - string version = 1; - // Name of the unix socket the device plugin is listening on - // PATH = path.Join(DevicePluginPath, endpoint) - string endpoint = 2; - // Schedulable resource name - string resource_name = 3; -} - -// - Allocate is expected to be called during pod creation since allocation -// failures for any container would result in pod startup failure. -// - Allocate allows kubelet to exposes additional artifacts in a pod's -// environment as directed by the plugin. -// - Allocate allows Device Plugin to run device specific operations on -// the Devices requested -message AllocateRequest { - repeated string devicesIDs = 1; -} - -// Failure Handling: -// if Kubelet sends an allocation request for dev1 and dev2. -// Allocation on dev1 succeeds but allocation on dev2 fails. -// The Device plugin should send a ListAndWatch update and fail the -// Allocation request -message AllocateResponse { - repeated DeviceRuntimeSpec spec = 1; -} - -// ListAndWatch returns a stream of List of Devices -// Whenever a Device state change or a Device disappears, ListAndWatch -// returns the new list -message ListAndWatchResponse { - repeated Device devices = 1; -} - -// The list to be added to the CRI spec -message DeviceRuntimeSpec { - string ID = 1; - - // List of environment variable to set in the container. - map<string, string> envs = 2; - // Mounts for the container. - repeated Mount mounts = 3; - // Devices for the container - repeated DeviceSpec devices = 4; -} - -// DeviceSpec specifies a host device to mount into a container. -message DeviceSpec { - // Path of the device within the container. - string container_path = 1; - // Path of the device on the host. - string host_path = 2; - // Cgroups permissions of the device, candidates are one or more of - // * r - allows container to read from the specified device. - // * w - allows container to write to the specified device. - // * m - allows container to create device files that do not yet exist. - string permissions = 3; -} - -// Mount specifies a host volume to mount into a container. -// where device library or tools are installed on host and container -message Mount { - // Path of the mount on the host. - string host_path = 1; - // Path of the mount within the container. - string mount_path = 2; - // If set, the mount is read-only. - bool read_only = 3; -} - -// E.g: -// struct Device { -// ID: "GPU-fef8089b-4820-abfc-e83e-94318197576e", -// State: "Healthy", -//} -message Device { - string ID = 2; - string health = 3; -} -``` - -### HealthCheck and Failure Recovery - -We want Kubelet as well as the Device Plugins to recover from failures -that may happen on any side of this protocol. 
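A minimal sketch of the health-reporting side of this protocol, assuming Go stubs generated from the service definition above under a hypothetical `pluginapi` package; the plugin struct and the monitoring channel are placeholders for illustration, not part of this proposal:

```go
package gpuplugin

import (
	pluginapi "example.com/device-plugin/api/v1alpha1" // hypothetical generated stubs for the proto above
)

type examplePlugin struct {
	devs      []*pluginapi.Device // current inventory, e.g. {ID: "GPU-...", Health: "Healthy"}
	unhealthy chan string         // IDs of devices that vendor-specific monitoring reports as failed
	stop      chan struct{}
}

// ListAndWatch sends the full device list once, then re-sends it whenever a
// device's health changes, per the protocol overview above.
func (p *examplePlugin) ListAndWatch(_ *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
	if err := s.Send(&pluginapi.ListAndWatchResponse{Devices: p.devs}); err != nil {
		return err
	}
	for {
		select {
		case <-p.stop:
			return nil
		case id := <-p.unhealthy:
			for _, d := range p.devs {
				if d.ID == id {
					d.Health = "Unhealthy"
				}
			}
			// Re-send the whole list so Kubelet can react to the failed device.
			if err := s.Send(&pluginapi.ListAndWatchResponse{Devices: p.devs}); err != nil {
				return err
			}
		}
	}
}
```

Kubelet's expected reaction to such an update is described in the failure-handling list below.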
- -At the communication level, gRPC is a very strong piece of software and -is able to ensure that if failure happens it will try its best to recover -through exponential backoff reconnection and Keep Alive checks. - -The proposed mechanism intends to replace any device specific handling in -Kubelet. Therefore in general, device plugin failure or upgrade means that -Kubelet is not able to accept any pod requesting a Device until the upgrade -or failure finishes. - -If a device fails, the Device Plugin should signal that through the -`ListAndWatch` gRPC stream. We then expect Kubelet to fail the Pod. - -If any Device Plugin fails the behavior we expect depends on the task Kubelet -is performing: -* In general we expect Kubelet to remove any devices that are owned by the failed - device plugin from the node capacity. We also expect node allocatable to be - equal to node capacity. -* We however do not expect Kubelet to fail or restart any pods or containers - running that are using these devices. -* If Kubelet is in the process of allocating a device, then it should fail - the container process. - -If the Kubelet fails or restarts, we expect the Device Plugins to know about -it through gRPC's Keep alive feature and try to reconnect to Kubelet. - -When Kubelet fails or restarts it should know what are the devices that are -owned by the different containers and be able to rebuild a list of available -devices. -We are expecting to implement this through a checkpointing mechanism that Kubelet -would write and read from. - - -## API Changes - -When discovering the devices, Kubelet will be in charge of advertising those -resources to the API server as part of the kubelet node update current protocol. - -We will be using extended resources to schedule, trigger and advertise these -Devices. -When a Device plugin registers two `foo-device` the node status will be -updated to advertise 2 `vendor-domain/foo-device`. - -If a user wants to trigger the device plugin he only needs to request this -through the same mechanism as OIRs in his Pod Spec. - -# Upgrading your cluster - -*TLDR:* -Given that we cannot guarantee that the Device Plugins are not running -a daemon providing a critical service to Devices and when stopped will -crash the running containers, it is up to the vendor to specify the -upgrading scheme of their device plugin. - -However, If you are upgrading either Kubelet or any device plugin the safest way -is to drain the node of all pods and upgrade. - -Depending on what you are upgrading and what changes happened then it -is completely possible to only restart just Kubelet or just the device plugin. - -## Upgrading Kubelet - -This assumes that the Device Plugins running on the nodes fully implement the -protocol and are able to recover from a Kubelet crash. - -Then, as long as the Device Plugin API does not change upgrading Kubelet can be done -seamlessly through a Kubelet restart. - -*Currently:* -As mentioned in the Versioning section, we currently expect the Device Plugin's -API version to match exactly the Kubelet's Device Plugin API version. -Therefore if the Device Plugin API version change then you will have to change -the Device Plugin too. - - -*Future:* -When the Device Plugin API becomes a stable feature, versioning should be -backward compatible and even if Kubelet has a different Device Plugin API, - -it should not require a Device Plugin upgrade. - -Refer to the versioning section for versioning scheme compatibility. 
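Since the compatibility story above hinges on the `version` string the plugin advertises at registration, here is a hedged sketch of that call; the `pluginapi` package, the version constant, and the grpc-go dial options are assumptions of this sketch rather than part of the proposal:

```go
package gpuplugin

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	pluginapi "example.com/device-plugin/api/v1alpha1" // hypothetical generated stubs
)

// register announces the plugin to Kubelet. Kubelet rejects the request if the
// advertised version is unsupported or the resource name is already taken.
func register() error {
	conn, err := grpc.Dial("unix:///var/lib/kubelet/device-plugins/kubelet.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return err
	}
	defer conn.Close()

	client := pluginapi.NewRegistrationClient(conn)
	_, err = client.Register(context.Background(), &pluginapi.RegisterRequest{
		Version:      "v1alpha1",       // must match Kubelet's device plugin API version exactly today
		Endpoint:     "nvidiaGPU.sock", // the plugin's own socket under /var/lib/kubelet/device-plugins/
		ResourceName: "nvidia.com/gpu",
	})
	return err
}
```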
- -## Upgrading Device Plugins - -Because we cannot enforce what the different Device Plugins will do, we cannot -say for certain that upgrading a device plugin will not crash any containers -on the node. - -It is therefore up to the Device Plugin vendors to specify if the Device Plugins -can be upgraded without impacting any running containers. - -As mentioned earlier, the safest way is to drain the node before upgrading -the Device Plugins. - -# Installation - -The installation process should be straightforward to the user, transparent -and similar to other regular Kubernetes actions. -The device plugin should also run in containers so that Kubernetes can -deploy them and restart the plugins when they fail. -However, we should not prevent the user from deploying a bare metal device -plugin. - -Deploying the device plugins through DemonSets makes sense as the cluster -admin would be able to specify which machines it wants the device plugins to -run on, the process is similar to any Kubernetes action and does not require -to change any parts of Kubernetes. - -Additionally, for integrated solutions such as `kubeadm` we can add support -to auto-deploy community vetted Device Plugins. -Thus not fragmenting once more the Kubernetes ecosystem. - -For users installing Kubernetes without using an integrated solution such -as `kubeadm` they would use the examples that we would provide at: -`https://github.com/vendor/device-plugin/tree/master/device-plugin.yaml` - -YAML example: - -```yaml -apiVersion: extensions/v1beta1 -kind: DaemonSet -metadata: -spec: - template: - metadata: - labels: - - name: device-plugin - spec: - containers: - name: device-plugin-ctr - image: NVIDIA/device-plugin:1.0 - volumeMounts: - - mountPath: /device-plugin - - name: device-plugin - volumes: - - name: device-plugin - hostPath: - path: /var/lib/kubelet/device-plugins -``` - -# Versioning - -Currently we require exact version match between Kubelet and Device Plugin. -API version is expected to be increased only upon incompatible API changes. - -Follow protobuf guidelines on versioning: - * Do not change ordering - * Do not remove fields or change types - * Add optional fields - * Introducing new fields with proper default values - * Freeze the package name to `apis/device-plugin/v1alpha1` - * Have kubelet and the Device Plugin negotiate versions if we do break the API - -# References - - * [Adding a proposal for hardware accelerators](https://github.com/kubernetes/community/pull/844) - * [Enable "kick the tires" support for NVIDIA GPUs in COS](https://github.com/kubernetes/kubernetes/pull/45136) - * [Extend experimental support to multiple NVIDIA GPUs](https://github.com/kubernetes/kubernetes/pull/42116) - * [Kubernetes Meeting notes](https://docs.google.com/document/d/1Qg42Nmv-QwL4RxicsU2qtZgFKOzANf8fGayw8p3lX6U/edit#) - * [Better Abstraction for Compute Resources in Kubernetes](https://docs.google.com/document/d/1666PPUs4Lz56TqKygcy6mXkNazde-vwA7q4e5H92sUc) - * [Extensible support for hardware devices in Kubernetes (join Kubernetes-dev@googlegroups.com for access)](https://docs.google.com/document/d/1LHeTPx_fWA1PdZkHuALPzYxR0AYXUiiXdo3S0g2VSlo/edit) +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/resource-management/device-plugin.png b/contributors/design-proposals/resource-management/device-plugin.png Binary files differdeleted file mode 100644 index 61ae6167..00000000 --- a/contributors/design-proposals/resource-management/device-plugin.png +++ /dev/null diff --git a/contributors/design-proposals/resource-management/gpu-support.md b/contributors/design-proposals/resource-management/gpu-support.md index 9c58cd8e..f0fbec72 100644 --- a/contributors/design-proposals/resource-management/gpu-support.md +++ b/contributors/design-proposals/resource-management/gpu-support.md @@ -1,273 +1,6 @@ -- [GPU support](#gpu-support) - - [Objective](#objective) - - [Background](#background) - - [Detailed discussion](#detailed-discussion) - - [Inventory](#inventory) - - [Scheduling](#scheduling) - - [The runtime](#the-runtime) - - [NVIDIA support](#nvidia-support) - - [Event flow](#event-flow) - - [Too complex for now: nvidia-docker](#too-complex-for-now-nvidia-docker) - - [Implementation plan](#implementation-plan) - - [V0](#v0) - - [Scheduling](#scheduling-1) - - [Runtime](#runtime) - - [Other](#other) - - [Future work](#future-work) - - [V1](#v1) - - [V2](#v2) - - [V3](#v3) - - [Undetermined](#undetermined) - - [Security considerations](#security-considerations) +Design proposals have been archived. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -# GPU support - -Author: @therc - -Date: Apr 2016 - -Status: Design in progress, early implementation of requirements - -## Objective - -Users should be able to request GPU resources for their workloads, as easily as -for CPU or memory. Kubernetes should keep an inventory of machines with GPU -hardware, schedule containers on appropriate nodes and set up the container -environment with all that's necessary to access the GPU. All of this should -eventually be supported for clusters on either bare metal or cloud providers. - -## Background - -An increasing number of workloads, such as machine learning and seismic survey -processing, benefits from offloading computations to graphic hardware. While not -as tuned as traditional, dedicated high performance computing systems such as -MPI, a Kubernetes cluster can still be a great environment for organizations -that need a variety of additional, "classic" workloads, such as database, web -serving, etc. - -GPU support is hard to provide extensively and will thus take time to tame -completely, because - -- different vendors expose the hardware to users in different ways -- some vendors require fairly tight coupling between the kernel driver -controlling the GPU and the libraries/applications that access the hardware -- it adds more resource types (whole GPUs, GPU cores, GPU memory) -- it can introduce new security pitfalls -- for systems with multiple GPUs, affinity matters, similarly to NUMA -considerations for CPUs -- running GPU code in containers is still a relatively novel idea - -## Detailed discussion - -Currently, this document is mostly focused on the basic use case: run GPU code -on AWS `g2.2xlarge` EC2 machine instances using Docker. It constitutes a narrow -enough scenario that it does not require large amounts of generic code yet. GCE -doesn't support GPUs at all; bare metal systems throw a lot of extra variables -into the mix. 
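For orientation, the end state this document works toward is a pod that simply asks for a GPU. The sketch below shows such a request using the `alpha.kubernetes.io/nvidia-gpu` resource name proposed in the Inventory section; the package paths are today's client-go layout, which postdates this proposal, and the image name is a placeholder:

```go
package gpuexample

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// gpuPod sketches a pod asking the scheduler for one whole GPU.
func gpuPod() *corev1.Pod {
	one := resource.MustParse("1")
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "cuda-example"},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:  "cuda",
				Image: "example.com/cuda-app:latest", // hypothetical image bundling the CUDA libraries
				Resources: corev1.ResourceRequirements{
					// Requests and limits are set equal; whole devices only in v0.
					Requests: corev1.ResourceList{"alpha.kubernetes.io/nvidia-gpu": one},
					Limits:   corev1.ResourceList{"alpha.kubernetes.io/nvidia-gpu": one},
				},
			}},
		},
	}
}
```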
- -Later sections will outline future work to support a broader set of hardware, -environments and container runtimes. - -### Inventory - -Before any scheduling can occur, we need to know what's available out there. In -v0, we'll hardcode capacity detected by the kubelet based on a flag, -`--experimental-nvidia-gpu`. This will result in the user-defined resource -`alpha.kubernetes.io/nvidia-gpu` to be reported for `NodeCapacity` and -`NodeAllocatable`, as well as a node label. - -### Scheduling - -GPUs will be visible as first-class resources. In v0, we'll only assign whole -devices; sharing among multiple pods is left to future implementations. It's -probable that GPUs will exacerbate the need for [a rescheduler](rescheduler.md) -or pod priorities, especially if the nodes in a cluster are not homogeneous. -Consider these two cases: - -> Only half of the machines have a GPU and they're all busy with other -workloads. The other half of the cluster is doing very little work. A GPU -workload comes, but it can't schedule, because the devices are sitting idle on -nodes that are running something else and the nodes with little load lack the -hardware. - -> Some or all the machines have two graphic cards each. A number of jobs get -scheduled, requesting one device per pod. The scheduler puts them all on -different machines, spreading the load, perhaps by design. Then a new job comes -in, requiring two devices per pod, but it can't schedule anywhere, because all -we can find, at most, is one unused device per node. - -### The runtime - -Once we know where to run the container, it's time to set up its environment. At -a minimum, we'll need to map the host device(s) into the container. Because each -manufacturer exposes different device nodes (`/dev/ati/card0`, `/dev/nvidia0`, -but also the required `/dev/nvidiactl` and `/dev/nvidia-uvm`), some of the logic -needs to be hardware-specific, mapping from a logical device to a list of device -nodes necessary for software to talk to it. - -Support binaries and libraries are often versioned along with the kernel module, -so there should be further hooks to project those under `/bin` and some kind of -`/lib` before the application is started. This can be done for Docker with the -use of a versioned [Docker -volume](https://docs.docker.com/engine/tutorials/dockervolumes/) or -with upcoming Kubernetes-specific hooks such as init containers and volume -containers. In v0, images are expected to bundle everything they need. - -#### NVIDIA support - -The first implementation and testing ground will be for NVIDIA devices, by far -the most common setup. - -In v0, the `--experimental-nvidia-gpu` flag will also result in the host devices -(limited to those required to drive the first card, `nvidia0`) to be mapped into -the container by the dockertools library. - -### Event flow - -This is what happens before and after a user schedules a GPU pod. - -1. Administrator installs a number of Kubernetes nodes with GPUs. The correct -kernel modules and device nodes under `/dev/` are present. - -1. Administrator makes sure the latest CUDA/driver versions are installed. - -1. Administrator enables `--experimental-nvidia-gpu` on kubelets - -1. Kubelets update node status with information about the GPU device, in addition -to cAdvisor's usual data about CPU/memory/disk - -1. User creates a Docker image compiling their application for CUDA, bundling -the necessary libraries. 
We ignore any versioning requirements in the image -using labels based on [NVIDIA's -conventions](https://github.com/NVIDIA/nvidia-docker/blob/64510511e3fd0d00168eb076623854b0fcf1507d/tools/src/nvidia-docker/utils.go#L13). - -1. User creates a pod using the image, requiring -`alpha.kubernetes.io/nvidia-gpu: 1` - -1. Scheduler picks a node for the pod - -1. The kubelet notices the GPU requirement and maps the three devices. In -Docker's engine-api, this means it'll add them to the Resources.Devices list. - -1. Docker runs the container to completion - -1. The scheduler notices that the device is available again - -### Too complex for now: nvidia-docker - -For v0, we discussed at length, but decided to leave aside initially the -[nvidia-docker plugin](https://github.com/NVIDIA/nvidia-docker). The plugin is -an officially supported solution, thus avoiding a lot of new low level code, as -it takes care of functionality such as: - -- creating a Docker volume with binaries such as `nvidia-smi` and shared -libraries -- providing HTTP endpoints that monitoring tools can use to collect GPU metrics -- abstracting details such as `/dev` entry names for each device, as well as -control ones like `nvidiactl` - -The `nvidia-docker` wrapper also verifies that the CUDA version required by a -given image is supported by the host drivers, through inspection of well-known -image labels, if present. We should try to provide equivalent checks, either -for CUDA or OpenCL. - -This is current sample output from `nvidia-docker-plugin`, wrapped for -readability: - - $ curl -s localhost:3476/docker/cli - --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 - --volume-driver=nvidia-docker - --volume=nvidia_driver_352.68:/usr/local/nvidia:ro - -It runs as a daemon listening for HTTP requests on port 3476. The endpoint above -returns flags that need to be added to the Docker command line in order to -expose GPUs to the containers. There are optional URL arguments to request -specific devices if more than one are present on the system, as well as specific -versions of the support software. An obvious improvement is an additional -endpoint for JSON output. - -The unresolved question is whether `nvidia-docker-plugin` would run standalone -as it does today (called over HTTP, perhaps with endpoints for a new Kubernetes -resource API) or whether the relevant code from its `nvidia` package should be -linked directly into kubelet. A partial list of tradeoffs: - -| | External binary | Linked in | -|---------------------|---------------------------------------------------------------------------------------------------|--------------------------------------------------------------| -| Use of cgo | Confined to binary | Linked into kubelet, but with lazy binding | -| Expandibility | Limited if we run the plugin, increased if library is used to build a Kubernetes-tailored daemon. | Can reuse the `nvidia` library as we prefer | -| Bloat | None | Larger kubelet, even for systems without GPUs | -| Reliability | Need to handle the binary disappearing at any time | Fewer headeaches | -| (Un)Marshalling | Need to talk over JSON | None | -| Administration cost | One more daemon to install, configure and monitor | No extra work required, other than perhaps configuring flags | -| Releases | Potentially on its own schedule | Tied to Kubernetes | - -## Implementation plan - -### V0 - -The first two tracks can progress in parallel. - -#### Scheduling - -1. 
Define new resource `alpha.kubernetes.io:nvidia-gpu` in `pkg/api/types.go` -and co. -1. Plug resource into feasability checks used by kubelet, scheduler and -schedulercache. Maybe gated behind a flag? -1. Plug resource into resource_helpers.go -1. Plug resource into the limitranger - -#### Runtime - -1. Add kubelet config parameter to enable the resource -1. Make kubelet's `setNodeStatusMachineInfo` report the resource -1. Add a Devices list to container.RunContainerOptions -1. Use it from DockerManager's runContainer -1. Do the same for rkt (stretch goal) -1. When a pod requests a GPU, add the devices to the container options - -#### Other - -1. Add new resource to `kubectl describe` output. Optional for non-GPU users? -1. Administrator documentation, with sample scripts -1. User documentation - -## Future work - -Above all, we need to collect feedback from real users and use that to set -priorities for any of the items below. - -### V1 - -- Perform real detection of the installed hardware -- Figure a standard way to avoid bundling of shared libraries in images -- Support fractional resources so multiple pods can share the same GPU -- Support bare metal setups -- Report resource usage - -### V2 - -- Support multiple GPUs with resource hierarchies and affinities -- Support versioning of resources (e.g. "CUDA v7.5+") -- Build resource plugins into the kubelet? -- Support other device vendors -- Support Azure? -- Support rkt? - -### V3 - -- Support OpenCL (so images can be device-agnostic) - -### Undetermined - -It makes sense to turn the output of this project (external resource plugins, -etc.) into a more generic abstraction at some point. - - -## Security considerations - -There should be knobs for the cluster administrator to only allow certain users -or roles to schedule GPU workloads. Overcommitting or sharing the same device -across different pods is not considered safe. It should be possible to segregate -such GPU-sharing pods by user, namespace or a combination thereof. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/resource-management/hugepages.md b/contributors/design-proposals/resource-management/hugepages.md index 9901a4bb..f0fbec72 100644 --- a/contributors/design-proposals/resource-management/hugepages.md +++ b/contributors/design-proposals/resource-management/hugepages.md @@ -1,308 +1,6 @@ -# HugePages support in Kubernetes +Design proposals have been archived. -**Authors** -* Derek Carr (@derekwaynecarr) -* Seth Jennings (@sjenning) -* Piotr Prokop (@PiotrProkop) +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -**Status**: In progress - -## Abstract - -A proposal to enable applications running in a Kubernetes cluster to use huge -pages. - -A pod may request a number of huge pages. The `scheduler` is able to place the -pod on a node that can satisfy that request. The `kubelet` advertises an -allocatable number of huge pages to support scheduling decisions. A pod may -consume hugepages via `hugetlbfs` or `shmget`. Huge pages are not -overcommitted. - -## Motivation - -Memory is managed in blocks known as pages. On most systems, a page is 4Ki. 1Mi -of memory is equal to 256 pages; 1Gi of memory is 256,000 pages, etc. CPUs have -a built-in memory management unit that manages a list of these pages in -hardware. The Translation Lookaside Buffer (TLB) is a small hardware cache of -virtual-to-physical page mappings. If the virtual address passed in a hardware -instruction can be found in the TLB, the mapping can be determined quickly. If -not, a TLB miss occurs, and the system falls back to slower, software based -address translation. This results in performance issues. Since the size of the -TLB is fixed, the only way to reduce the chance of a TLB miss is to increase the -page size. - -A huge page is a memory page that is larger than 4Ki. On x86_64 architectures, -there are two common huge page sizes: 2Mi and 1Gi. Sizes vary on other -architectures, but the idea is the same. In order to use huge pages, -application must write code that is aware of them. Transparent Huge Pages (THP) -attempts to automate the management of huge pages without application knowledge, -but they have limitations. In particular, they are limited to 2Mi page sizes. -THP might lead to performance degradation on nodes with high memory utilization -or fragmentation due to defragmenting efforts of THP, which can lock memory -pages. For this reason, some applications may be designed to (or recommend) -usage of pre-allocated huge pages instead of THP. - -Managing memory is hard, and unfortunately, there is no one-size fits all -solution for all applications. - -## Scope - -This proposal only includes pre-allocated huge pages configured on the node by -the administrator at boot time or by manual dynamic allocation. It does not -discuss how the cluster could dynamically attempt to allocate huge pages in an -attempt to find a fit for a pod pending scheduling. It is anticipated that -operators may use a variety of strategies to allocate huge pages, but we do not -anticipate the kubelet itself doing the allocation. Allocation of huge pages -ideally happens soon after boot time. - -This proposal defers issues relating to NUMA. 
- -## Use Cases - -The class of applications that benefit from huge pages typically have -- A large memory working set -- A sensitivity to memory access latency - -Example applications include: -- database management systems (MySQL, PostgreSQL, MongoDB, Oracle, etc.) -- Java applications can back the heap with huge pages using the - `-XX:+UseLargePages` and `-XX:LagePageSizeInBytes` options. -- packet processing systems (DPDK) - -Applications can generally use huge pages by calling -- `mmap()` with `MAP_ANONYMOUS | MAP_HUGETLB` and use it as anonymous memory -- `mmap()` a file backed by `hugetlbfs` -- `shmget()` with `SHM_HUGETLB` and use it as a shared memory segment (see Known - Issues). - -1. A pod can use huge pages with any of the prior described methods. -1. A pod can request huge pages. -1. A scheduler can bind pods to nodes that have available huge pages. -1. A quota may limit usage of huge pages. -1. A limit range may constrain min and max huge page requests. - -## Feature Gate - -The proposal introduces huge pages as an Alpha feature. - -It must be enabled via the `--feature-gates=HugePages=true` flag on pertinent -components pending graduation to Beta. - -## Node Specification - -Huge pages cannot be overcommitted on a node. - -A system may support multiple huge page sizes. It is assumed that most nodes -will be configured to primarily use the default huge page size as returned via -`grep Hugepagesize /proc/meminfo`. This defaults to 2Mi on most Linux systems -unless overridden by `default_hugepagesz=1g` in kernel boot parameters. - -For each supported huge page size, the node will advertise a resource of the -form `hugepages-<hugepagesize>`. On Linux, supported huge page sizes are -determined by parsing the `/sys/kernel/mm/hugepages/hugepages-{size}kB` -directory on the host. Kubernetes will expose a `hugepages-<hugepagesize>` -resource using binary notation form. It will convert `<hugepagesize>` into the -most compact binary notation using integer values. For example, if a node -supports `hugepages-2048kB`, a resource `hugepages-2Mi` will be shown in node -capacity and allocatable values. Operators may set aside pre-allocated huge -pages that are not available for user pods similar to normal memory via the -`--system-reserved` flag. - -There are a variety of huge page sizes supported across different hardware -architectures. It is preferred to have a resource per size in order to better -support quota. For example, 1 huge page with size 2Mi is orders of magnitude -different than 1 huge page with size 1Gi. We assume gigantic pages are even -more precious resources than huge pages. - -Pre-allocated huge pages reduce the amount of allocatable memory on a node. The -node will treat pre-allocated huge pages similar to other system reservations -and reduce the amount of `memory` it reports using the following formula: - -``` -[Allocatable] = [Node Capacity] - - [Kube-Reserved] - - [System-Reserved] - - [Pre-Allocated-HugePages * HugePageSize] - - [Hard-Eviction-Threshold] -``` - -The following represents a machine with 10Gi of memory. 1Gi of memory has been -reserved as 512 pre-allocated huge pages sized 2Mi. As you can see, the -allocatable memory has been reduced to account for the amount of huge pages -reserved. - -``` -apiVersion: v1 -kind: Node -metadata: - name: node1 -... -status: - capacity: - memory: 10Gi - hugepages-2Mi: 1Gi - allocatable: - memory: 9Gi - hugepages-2Mi: 1Gi -... 
-``` - -## Pod Specification - -A pod must make a request to consume pre-allocated huge pages using the resource -`hugepages-<hugepagesize>` whose quantity is a positive amount of memory in -bytes. The specified amount must align with the `<hugepagesize>`; otherwise, -the pod will fail validation. For example, it would be valid to request -`hugepages-2Mi: 4Mi`, but invalid to request `hugepages-2Mi: 3Mi`. - -The request and limit for `hugepages-<hugepagesize>` must match. Similar to -memory, an application that requests `hugepages-<hugepagesize>` resource is at -minimum in the `Burstable` QoS class. - -If a pod consumes huge pages via `shmget`, it must run with a supplemental group -that matches `/proc/sys/vm/hugetlb_shm_group` on the node. Configuration of -this group is outside the scope of this specification. - -Initially, a pod may not consume multiple huge page sizes in a single pod spec. -Attempting to use `hugepages-2Mi` and `hugepages-1Gi` in the same pod spec will -fail validation. We believe it is rare for applications to attempt to use -multiple huge page sizes. This restriction may be lifted in the future with -community presented use cases. Introducing the feature with this restriction -limits the exposure of API changes needed when consuming huge pages via volumes. - -In order to consume huge pages backed by the `hugetlbfs` filesystem inside the -specified container in the pod, it is helpful to understand the set of mount -options used with `hugetlbfs`. For more details, see "Using Huge Pages" here: -https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt - -``` -mount -t hugetlbfs \ - -o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\ - min_size=<value>,nr_inodes=<value> none /mnt/huge -``` - -The proposal recommends extending the existing `EmptyDirVolumeSource` to satisfy -this use case. A new `medium=HugePages` option would be supported. To write -into this volume, the pod must make a request for huge pages. The `pagesize` -argument is inferred from the `hugepages-<hugepagesize>` from the resource -request. If in the future, multiple huge page sizes are supported in a single -pod spec, we may modify the `EmptyDirVolumeSource` to provide an optional page -size. The existing `sizeLimit` option for `emptyDir` would restrict usage to -the minimum value specified between `sizeLimit` and the sum of huge page limits -of all containers in a pod. This keeps the behavior consistent with memory -backed `emptyDir` volumes whose usage is ultimately constrained by the pod -cgroup sandbox memory settings. The `min_size` option is omitted as its not -necessary. The `nr_inodes` mount option is omitted at this time in the same -manner it is omitted with `medium=Memory` when using `tmpfs`. - -The following is a sample pod that is limited to 1Gi huge pages of size 2Mi. It -can consume those pages using `shmget()` or via `mmap()` with the specified -volume. - -``` -apiVersion: v1 -kind: Pod -metadata: - name: example -spec: - containers: -... - volumeMounts: - - mountPath: /hugepages - name: hugepage - resources: - requests: - hugepages-2Mi: 1Gi - limits: - hugepages-2Mi: 1Gi - volumes: - - name: hugepage - emptyDir: - medium: HugePages -``` - -## CRI Updates - -The `LinuxContainerResources` message should be extended to support specifying -huge page limits per size. The specification for huge pages should align with -opencontainers/runtime-spec. 
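One possible shape for that extension, expressed here as the Go types a CRI implementation would see, is sketched below; the field names are assumptions of this sketch that mirror the runtime-spec's huge page limits rather than a settled API:

```go
package criapi

// HugepageLimit expresses a maximum usage, in bytes, for one huge page size,
// mirroring the runtime-spec's hugepage limits.
type HugepageLimit struct {
	// PageSize is the huge page size, e.g. "2MB" or "1GB".
	PageSize string
	// Limit is the maximum usage of huge pages of this size, in bytes.
	Limit uint64
}

// LinuxContainerResources would gain a repeated field alongside the existing
// cpu and memory settings:
//
//	HugepageLimits []*HugepageLimit
```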
- -see: -https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md#huge-page-limits - -The CRI changes are required before promoting this feature to Beta. - -## Cgroup Enforcement - -To use this feature, the `--cgroups-per-qos` must be enabled. In addition, the -`hugetlb` cgroup must be mounted. - -The `kubepods` cgroup is bounded by the `Allocatable` value. - -The QoS level cgroups are left unbounded across all huge page pool sizes. - -The pod level cgroup sandbox is configured as follows, where `hugepagesize` is -the system supported huge page size(s). If no request is made for huge pages of -a particular size, the limit is set to 0 for all supported types on the node. - -``` -pod<UID>/hugetlb.<hugepagesize>.limit_in_bytes = sum(pod.spec.containers.resources.limits[hugepages-<hugepagesize>]) -``` - -If the container runtime supports specification of huge page limits, the -container cgroup sandbox will be configured with the specified limit. - -The `kubelet` will ensure the `hugetlb` has no usage charged to the pod level -cgroup sandbox prior to deleting the pod to ensure all resources are reclaimed. - -## Limits and Quota - -The `ResourceQuota` resource will be extended to support accounting for -`hugepages-<hugepagesize>` similar to `cpu` and `memory`. The `LimitRange` -resource will be extended to define min and max constraints for `hugepages` -similar to `cpu` and `memory`. - -## Scheduler changes - -The scheduler will need to ensure any huge page request defined in the pod spec -can be fulfilled by a candidate node. - -## cAdvisor changes - -cAdvisor will need to be modified to return the number of pre-allocated huge -pages per page size on the node. It will be used to determine capacity and -calculate allocatable values on the node. - -## Roadmap - -### Version 1.8 - -Initial alpha support for huge pages usage by pods. - -### Version 1.9 - -Resource Quota support. Limit Range support. Beta support for huge pages -(pending community feedback) - -## Known Issues - -### Huge pages as shared memory - -For the Java use case, the JVM maps the huge pages as a shared memory segment -and memlocks them to prevent the system from moving or swapping them out. - -There are several issues here: -- The user running the Java app must be a member of the gid set in the - `vm.huge_tlb_shm_group` sysctl -- sysctl `kernel.shmmax` must allow the size of the shared memory segment -- The user's memlock ulimits must allow the size of the shared memory segment -- `vm.huge_tlb_shm_group` is not namespaced. - -### NUMA - -NUMA is complicated. To support NUMA, the node must support cpu pinning, -devices, and memory locality. Extending that requirement to huge pages is not -much different. It is anticipated that the `kubelet` will provide future NUMA -locality guarantees as a feature of QoS. In particular, pods in the -`Guaranteed` QoS class are expected to have NUMA locality preferences. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/resource-management/resource-quota-scoping.md b/contributors/design-proposals/resource-management/resource-quota-scoping.md index a6f3f947..f0fbec72 100644 --- a/contributors/design-proposals/resource-management/resource-quota-scoping.md +++ b/contributors/design-proposals/resource-management/resource-quota-scoping.md @@ -1,328 +1,6 @@ -# Resource Quota - Scoping resources +Design proposals have been archived. -## Problem Description +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -### Ability to limit compute requests and limits - -The existing `ResourceQuota` API object constrains the total amount of compute -resource requests. This is useful when a cluster-admin is interested in -controlling explicit resource guarantees such that there would be a relatively -strong guarantee that pods created by users who stay within their quota will find -enough free resources in the cluster to be able to schedule. The end-user creating -the pod is expected to have intimate knowledge on their minimum required resource -as well as their potential limits. - -There are many environments where a cluster-admin does not extend this level -of trust to their end-user because user's often request too much resource, and -they have trouble reasoning about what they hope to have available for their -application versus what their application actually needs. In these environments, -the cluster-admin will often just expose a single value (the limit) to the end-user. -Internally, they may choose a variety of other strategies for setting the request. -For example, some cluster operators are focused on satisfying a particular over-commit -ratio and may choose to set the request as a factor of the limit to control for -over-commit. Other cluster operators may defer to a resource estimation tool that -sets the request based on known historical trends. In this environment, the -cluster-admin is interested in exposing a quota to their end-users that maps -to their desired limit instead of their request since that is the value the user -manages. - -### Ability to limit impact to node and promote fair-use - -The current `ResourceQuota` API object does not allow the ability -to quota best-effort pods separately from pods with resource guarantees. -For example, if a cluster-admin applies a quota that caps requested -cpu at 10 cores and memory at 10Gi, all pods in the namespace must -make an explicit resource request for cpu and memory to satisfy -quota. This prevents a namespace with a quota from supporting best-effort -pods. - -In practice, the cluster-admin wants to control the impact of best-effort -pods to the cluster, but not restrict the ability to run best-effort pods -altogether. - -As a result, the cluster-admin requires the ability to control the -max number of active best-effort pods. In addition, the cluster-admin -requires the ability to scope a quota that limits compute resources to -exclude best-effort pods. - -### Ability to quota long-running vs. bounded-duration compute resources - -The cluster-admin may want to quota end-users separately -based on long-running vs. bounded-duration compute resources. - -For example, a cluster-admin may offer more compute resources -for long running pods that are expected to have a more permanent residence -on the node than bounded-duration pods. 
Many batch style workloads -tend to consume as much resource as they can until something else applies -the brakes. As a result, these workloads tend to operate at their limit, -while many traditional web applications may often consume closer to their -request if there is no active traffic. An operator that wants to control -density will offer lower quota limits for batch workloads than web applications. - -A classic example is a PaaS deployment where the cluster-admin may -allow a separate budget for pods that run their web application vs. pods that -build web applications. - -Another example is providing more quota to a database pod than a -pod that performs a database migration. - -## Use Cases - -* As a cluster-admin, I want the ability to quota - * compute resource requests - * compute resource limits - * compute resources for terminating vs. non-terminating workloads - * compute resources for best-effort vs. non-best-effort pods - -## Proposed Change - -### New quota tracked resources - -Support the following resources that can be tracked by quota. - -| Resource Name | Description | -| ------------- | ----------- | -| cpu | total cpu requests (backwards compatibility) | -| memory | total memory requests (backwards compatibility) | -| requests.cpu | total cpu requests | -| requests.memory | total memory requests | -| limits.cpu | total cpu limits | -| limits.memory | total memory limits | - -### Resource Quota Scopes - -Add the ability to associate a set of `scopes` to a quota. - -A quota will only measure usage for a `resource` if it matches -the intersection of enumerated `scopes`. - -Adding a `scope` to a quota limits the number of resources -it supports to those that pertain to the `scope`. Specifying -a resource on the quota object outside of the allowed set -would result in a validation error. - -| Scope | Description | -| ----- | ----------- | -| Terminating | Match `kind=Pod` where `spec.activeDeadlineSeconds >= 0` | -| NotTerminating | Match `kind=Pod` where `spec.activeDeadlineSeconds = nil` | -| BestEffort | Match `kind=Pod` where `status.qualityOfService in (BestEffort)` | -| NotBestEffort | Match `kind=Pod` where `status.qualityOfService not in (BestEffort)` | - -A `BestEffort` scope restricts a quota to tracking the following resources: - -* pod - -A `Terminating`, `NotTerminating`, `NotBestEffort` scope restricts a quota to -tracking the following resources: - -* pod -* memory, requests.memory, limits.memory -* cpu, requests.cpu, limits.cpu - -## Data Model Impact - -```go -// The following identify resource constants for Kubernetes object types -const ( - // CPU request, in cores. (500m = .5 cores) - ResourceRequestsCPU ResourceName = "requests.cpu" - // Memory request, in bytes. (500Gi = 500GiB = 500 * 1024 * 1024 * 1024) - ResourceRequestsMemory ResourceName = "requests.memory" - // CPU limit, in cores. (500m = .5 cores) - ResourceLimitsCPU ResourceName = "limits.cpu" - // Memory limit, in bytes. 
(500Gi = 500GiB = 500 * 1024 * 1024 * 1024) - ResourceLimitsMemory ResourceName = "limits.memory" -) - -// A scope is a filter that matches an object -type ResourceQuotaScope string -const ( - ResourceQuotaScopeTerminating ResourceQuotaScope = "Terminating" - ResourceQuotaScopeNotTerminating ResourceQuotaScope = "NotTerminating" - ResourceQuotaScopeBestEffort ResourceQuotaScope = "BestEffort" - ResourceQuotaScopeNotBestEffort ResourceQuotaScope = "NotBestEffort" -) - -// ResourceQuotaSpec defines the desired hard limits to enforce for Quota -// The quota matches by default on all objects in its namespace. -// The quota can optionally match objects that satisfy a set of scopes. -type ResourceQuotaSpec struct { - // Hard is the set of desired hard limits for each named resource - Hard ResourceList `json:"hard,omitempty"` - // A collection of filters that must match each object tracked by a quota. - // If not specified, the quota matches all objects. - Scopes []ResourceQuotaScope `json:"scopes,omitempty"` -} -``` - -## Rest API Impact - -None. - -## Security Impact - -None. - -## End User Impact - -The `kubectl` commands that render quota should display its scopes. - -## Performance Impact - -This feature will make having more quota objects in a namespace -more common in certain clusters. This impacts the number of quota -objects that need to be incremented during creation of an object -in admission control. It impacts the number of quota objects -that need to be updated during controller loops. - -## Developer Impact - -None. - -## Alternatives - -This proposal initially enumerated a solution that leveraged a -`FieldSelector` on a `ResourceQuota` object. A `FieldSelector` -grouped an `APIVersion` and `Kind` with a selector over its -fields that supported set-based requirements. It would have allowed -a quota to track objects based on cluster defined attributes. - -For example, a quota could do the following: - -* match `Kind=Pod` where `spec.restartPolicy in (Always)` -* match `Kind=Pod` where `spec.restartPolicy in (Never, OnFailure)` -* match `Kind=Pod` where `status.qualityOfService in (BestEffort)` -* match `Kind=Service` where `spec.type in (LoadBalancer)` - * see [#17484](https://github.com/kubernetes/kubernetes/issues/17484) - -Theoretically, it would enable support for fine-grained tracking -on a variety of resource types. While extremely flexible, there -are cons to this approach that make it premature to pursue -at this time. - -* Generic field selectors are not yet settled art - * see [#1362](https://github.com/kubernetes/kubernetes/issues/1362) - * see [#19084](https://github.com/kubernetes/kubernetes/pull/19804) -* Discovery API Limitations - * Not possible to discover the set of field selectors supported by kind. - * Not possible to discover if a field is readonly, readwrite, or immutable - post-creation. - -The quota system would want to validate that a field selector is valid, -and it would only want to select on those fields that are readonly/immutable -post creation to make resource tracking work during update operations. - -The current proposal could grow to support a `FieldSelector` on a -`ResourceQuotaSpec` and support a simple migration path to convert -`scopes` to the matching `FieldSelector` once the project has identified -how it wants to handle `fieldSelector` requirements longer term. - -This proposal previously discussed a solution that leveraged a -`LabelSelector` as a mechanism to partition quota. 
This is potentially -interesting to explore in the future to allow `namespace-admins` to -quota workloads based on local knowledge. For example, a quota -could match all kinds that match the selector -`tier=cache, environment in (dev, qa)` separately from quota that -matched `tier=cache, environment in (prod)`. This is interesting to -explore in the future, but labels are insufficient selection targets -for `cluster-administrators` to control footprint. In those instances, -you need fields that are cluster controlled and not user-defined. - -## Example - -### Scenario 1 - -The cluster-admin wants to restrict the following: - -* limit 2 best-effort pods -* limit 2 terminating pods that can not use more than 1Gi of memory, and 2 cpu cores -* limit 4 long-running pods that can not use more than 4Gi of memory, and 4 cpu cores -* limit 6 pods in total, 10 replication controllers - -This would require the following quotas to be added to the namespace: - -```sh -$ cat quota-best-effort -apiVersion: v1 -kind: ResourceQuota -metadata: - name: quota-best-effort -spec: - hard: - pods: "2" - scopes: - - BestEffort - -$ cat quota-terminating -apiVersion: v1 -kind: ResourceQuota -metadata: - name: quota-terminating -spec: - hard: - pods: "2" - memory.limit: 1Gi - cpu.limit: 2 - scopes: - - Terminating - - NotBestEffort - -$ cat quota-longrunning -apiVersion: v1 -kind: ResourceQuota -metadata: - name: quota-longrunning -spec: - hard: - pods: "2" - memory.limit: 4Gi - cpu.limit: 4 - scopes: - - NotTerminating - - NotBestEffort - -$ cat quota -apiVersion: v1 -kind: ResourceQuota -metadata: - name: quota -spec: - hard: - pods: "6" - replicationcontrollers: "10" -``` - -In the above scenario, every pod creation will result in its usage being -tracked by `quota` since it has no additional scoping. The pod will then -be tracked by at 1 additional quota object based on the scope it -matches. In order for the pod creation to succeed, it must not violate -the constraint of any matching quota. So for example, a best-effort pod -would only be created if there was available quota in `quota-best-effort` -and `quota`. - -## Implementation - -### Assignee - -@derekwaynecarr - -### Work Items - -* Add support for requests and limits -* Add support for scopes in quota-related admission and controller code - -## Dependencies - -None. - -Longer term, we should evaluate what we want to do with `fieldSelector` as -the requests around different quota semantics will continue to grow. - -## Testing - -Appropriate unit and e2e testing will be authored. - -## Documentation Impact - -Existing resource quota documentation and examples will be updated. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/rules-review-api.md b/contributors/design-proposals/rules-review-api.md index 1bc3d9d6..f0fbec72 100644 --- a/contributors/design-proposals/rules-review-api.md +++ b/contributors/design-proposals/rules-review-api.md @@ -1,128 +1,6 @@ -# "What can I do?" API +Design proposals have been archived. -Author: Eric Chiang (eric.chiang@coreos.com) +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Overview -Currently, to determine if a user is authorized to perform a set of actions, that user has to query each action individually through a `SelfSubjectAccessReview`. - -Beyond making the authorization layer hard to reason about, it means web interfaces such as the OpenShift Web Console, Tectonic Console, and Kubernetes Dashboard, have to perform individual calls for _every resource_ a page displays. There's no way for a user, or an application acting on behalf of a user, to ask for all the permissions a user can make in bulk. This makes its hard to build pages that are proactive about what's displayed or grayed out based on the end user's permissions. UIs can only handle 403 responses after a user has already performed a forbidden action. - -This is a proposal to add authorization APIs that allow a client to determine what actions they can make within a namespace. We expect this API to be used by UIs to show/hide actions, or to quickly let an end user reason about their permissions. This API should NOT be used by external systems to drive their own authorization decisions, as this raises confused deputy, cache lifetime/revocation, and correctness concerns. The `*AccessReview` APIs remain the correct way to defer authorization decisions to the API server. - -OpenShift adopted a [`RulesReview` API][openshift-rules-review] to accomplish this same goal, and this proposal is largely a port of that implementation. - -[kubernetes/kubernetes#48051](https://github.com/kubernetes/kubernetes/pull/48051) implements most of this proposal. - -## API additions - -Add a top level type to the `authorization.k8s.io` API group called `SelfSubjectRulesReview`. This mirrors the existing `SelfSubjectAccessReview`. - -``` -type SelfSubjectRulesReview struct { - metav1.TypeMeta - - Spec SelfSubjectRulesReviewSpec - - // Status is filled in by the server and represents the set of actions a user can perform. - Status SubjectRulesReviewStatus -} - -type SelfSubjectRulesReviewSpec struct { - // Namespace to evaluate rules for. Required. - Namespace string -} - -type SubjectRulesReviewStatus struct { - // ResourceRules is the list of actions the subject is allowed to perform on resources. - // The list ordering isn't significant, may contain duplicates, and possibly be incomplete. - ResourceRules []ResourceRule - // NonResourceRules is the list of actions the subject is allowed to perform on non-resources. - // The list ordering isn't significant, may contain duplicates, and possibly be incomplete. - NonResourceRules []NonResourceRule - // EvaluationError can appear in combination with Rules. It indicates an error occurred during - // rule evaluation, such as an authorizer that doesn't support rule evaluation, and that - // ResourceRules and/or NonResourceRules may be incomplete. - EvaluationError string - // Incomplete indicates that the returned list is known to be incomplete. 
- Incomplete bool -} -``` - -The `ResourceRules` and `NonResourceRules` rules are similar to the types use by RBAC and the internal authorization system. - -``` -# docstrings omitted for brevity. -type ResourceRule struct { - Verbs []string - APIGroups []string - Resources []string - ResourceNames []string -} - -type NonResourceRule struct { - Verbs []string - NonResourceURLs []string -} -``` - -All of these fields can include the string `*` to indicate all values are allowed. - -### Differences from OpenShift: user extras vs. scopes - -OpenShift `SelfSubjectRulesReviewSpec` takes a set of [`Scopes`][openshift-scopes]. This lets OpenShift clients use the API for queries such as _"what could I do if I provide this scope to limit my credentials?"_ - - In core kube, scopes are replaced by "user extras" field, a map of opaque strings that can be used for implementation specific user data. Unlike OpenShift, where scopes are always used to restrict credential powers, user extras are commonly used to expand powers. For example, the proposed [Keystone authentiator][keystone-authn] used them to include additional roles and project fields. - -Since user extras can be used to expand the power of users, instead of only restricting, this proposal argues that `SelfSubjectRulesReview` shouldn't let a client specify them like `Scopes`. It wouldn't be within the spirit of a `SelfSubject` resource to let a user determine information about other projects or roles. - -This could hopefully be solved by introducing a `SubjectRulesReview` API to query the rules for any user. An aggregated API server could use the `SubjectRulesReview` to back an API resource that let a user provide restrictive user extras, such as scopes. - -## Webhook authorizers - -Some authorizers live external to Kubernetes through an API server webhook and wouldn't immediately support a rules review query. - -To communicate with external authorizers, the following types will be defined to query the rules for an arbitrary user. This proposal does NOT propose adding these types to the API immediately, since clients can use user impersonation and a `SelfSubjectRulesReview` to accomplish something similar. - -``` -type SubjectRulesReview struct { - metav1.TypeMeta - - Spec SubjectRulesReviewSpec - - // Status is filled in by the server and indicates the set of actions a user can perform. - Status SubjectRulesReviewStatus -} - -type SubjectRulesReviewSpec struct { - // Namespace to evalue rules for. Required. - Namespace string - - // User to be evaluated for. - UID string - User string - Groups []string - Extras map[string][]string -} -``` - -Currently, external authorizers are configured through the following API server flag and which POSTs a `SubjectAccessReview` to determine a user's access: - -``` ---authorization-webhook-config-file -``` - -The config file uses the kubeconfig format. - -There are a few options to support a second kind of query. - -* Add another webhook flag with a second config file. -* Introduce a [kubeconfig extension][kubeconfig-extension] that indicates the server can handle either a `SubjectRulesReview` or a `SubjectAccessReview` -* Introduce a second context in the kubeconfig for the `SubjectRulesReview`. Have some way of indicating which context for `SubjectRulesReview` and which is for `SubjectAccessReview`, for example by well-known context names for each. - -The doc proposed adding a second webhook config for `RulesReview`, and not overloading the existing config passed to `--authorization-webhook-config-file`. 
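For completeness, here is a hedged client-side sketch of issuing the self-subject review described in this proposal and walking the returned rules; the clientset accessor shown assumes the API lands in the `authorization.k8s.io` group as proposed and uses current client-go conventions, which postdate this document:

```go
package rulesreviewexample

import (
	"context"
	"fmt"

	authorizationv1 "k8s.io/api/authorization/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// whatCanIDo asks the API server which actions the caller may perform in a
// namespace and prints the (possibly incomplete) rule list.
func whatCanIDo(ctx context.Context, client kubernetes.Interface, namespace string) error {
	review := &authorizationv1.SelfSubjectRulesReview{
		Spec: authorizationv1.SelfSubjectRulesReviewSpec{Namespace: namespace},
	}
	resp, err := client.AuthorizationV1().SelfSubjectRulesReviews().Create(ctx, review, metav1.CreateOptions{})
	if err != nil {
		return err
	}
	if resp.Status.Incomplete {
		fmt.Println("warning: rule list is incomplete:", resp.Status.EvaluationError)
	}
	for _, r := range resp.Status.ResourceRules {
		fmt.Printf("verbs=%v groups=%v resources=%v names=%v\n",
			r.Verbs, r.APIGroups, r.Resources, r.ResourceNames)
	}
	for _, r := range resp.Status.NonResourceRules {
		fmt.Printf("verbs=%v urls=%v\n", r.Verbs, r.NonResourceURLs)
	}
	return nil
}
```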
- -[openshift-rules-review]: https://github.com/openshift/origin/blob/v3.6.0/pkg/authorization/apis/authorization/types.go#L152 -[openshift-scopes]: https://github.com/openshift/origin/blob/v3.6.0/pkg/authorization/apis/authorization/types.go#L164-L168 -[keystone-authn]: https://github.com/kubernetes/kubernetes/pull/25624/files#diff-897f0cab87e784d9fc6813f04f128f62R40 -[kubeconfig-extension]: https://github.com/kubernetes/client-go/blob/v3.0.0/tools/clientcmd/api/v1/types.go#L51 +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
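And a sketch of how a UI, per the motivation above, might consume the returned rules to show or gray out actions. The wildcard handling follows the `*` convention described earlier; `ResourceNames` and the `Incomplete` flag are deliberately ignored for brevity, and, as the overview stresses, this must not be used to drive real authorization decisions.

```go
// Sketch: decide whether to display an action in a UI based on the rules
// returned by a SelfSubjectRulesReview.
package rulescheck

import authorizationv1 "k8s.io/api/authorization/v1"

// ruleCovers reports whether a single rule allows the given verb on the given
// group/resource, honoring the "*" wildcard.
func ruleCovers(rule authorizationv1.ResourceRule, verb, group, resource string) bool {
	return matches(rule.Verbs, verb) &&
		matches(rule.APIGroups, group) &&
		matches(rule.Resources, resource)
}

// allowed reports whether any rule in the status covers the given action.
func allowed(rules []authorizationv1.ResourceRule, verb, group, resource string) bool {
	for _, r := range rules {
		if ruleCovers(r, verb, group, resource) {
			return true
		}
	}
	return false
}

func matches(list []string, want string) bool {
	for _, v := range list {
		if v == "*" || v == want {
			return true
		}
	}
	return false
}
```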
\ No newline at end of file diff --git a/contributors/design-proposals/scalability/Kubemark_architecture.png b/contributors/design-proposals/scalability/Kubemark_architecture.png Binary files differdeleted file mode 100644 index 479ad8b1..00000000 --- a/contributors/design-proposals/scalability/Kubemark_architecture.png +++ /dev/null diff --git a/contributors/design-proposals/scalability/OWNERS b/contributors/design-proposals/scalability/OWNERS deleted file mode 100644 index 6b57aa45..00000000 --- a/contributors/design-proposals/scalability/OWNERS +++ /dev/null @@ -1,8 +0,0 @@ -# See the OWNERS docs at https://go.k8s.io/owners - -reviewers: - - sig-scalability-leads -approvers: - - sig-scalability-leads -labels: - - sig/scalability diff --git a/contributors/design-proposals/scalability/kubemark.md b/contributors/design-proposals/scalability/kubemark.md index 533295e0..f0fbec72 100644 --- a/contributors/design-proposals/scalability/kubemark.md +++ b/contributors/design-proposals/scalability/kubemark.md @@ -1,153 +1,6 @@ -# Kubemark proposal +Design proposals have been archived. -## Goal of this document +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -This document describes a design of Kubemark - a system that allows performance testing of a Kubernetes cluster. It describes the -assumption, high level design and discusses possible solutions for lower-level problems. It is supposed to be a starting point for more -detailed discussion. - -## Current state and objective - -Currently performance testing happens on ‘live’ clusters of up to 100 Nodes. It takes quite a while to start such cluster or to push -updates to all Nodes, and it uses quite a lot of resources. At this scale the amount of wasted time and used resources is still acceptable. -In the next quarter or two we're targeting 1000 Node cluster, which will push it way beyond ‘acceptable’ level. Additionally we want to -enable people without many resources to run scalability tests on bigger clusters than they can afford at given time. Having an ability to -cheaply run scalability tests will enable us to run some set of them on "normal" test clusters, which in turn would mean ability to run -them on every PR. - -This means that we need a system that will allow for realistic performance testing on (much) smaller number of “real” machines. First -assumption we make is that Nodes are independent, i.e. number of existing Nodes do not impact performance of a single Node. This is not -entirely true, as number of Nodes can increase latency of various components on Master machine, which in turn may increase latency of Node -operations, but we're not interested in measuring this effect here. Instead we want to measure how number of Nodes and the load imposed by -Node daemons affects the performance of Master components. - -## Kubemark architecture overview - -The high-level idea behind Kubemark is to write library that allows running artificial "Hollow" Nodes that will be able to simulate a -behavior of real Kubelet and KubeProxy in a single, lightweight binary. Hollow components will need to correctly respond to Controllers -(via API server), and preferably, in the fullness of time, be able to ‘replay’ previously recorded real traffic (this is out of scope for -initial version). To teach Hollow components replaying recorded traffic they will need to store data specifying when given Pod/Container -should die (e.g. observed lifetime). 
Such data can be extracted e.g. from etcd Raft logs, or it can be reconstructed from Events. In the -initial version we only want them to be able to fool Master components and put some configurable (in what way TBD) load on them. - -When we have Hollow Node ready, we'll be able to test performance of Master Components by creating a real Master Node, with API server, -Controllers, etcd and whatnot, and create number of Hollow Nodes that will register to the running Master. - -To make Kubemark easier to maintain when system evolves Hollow components will reuse real "production" code for Kubelet and KubeProxy, but -will mock all the backends with no-op or very simple mocks. We believe that this approach is better in the long run than writing special -"performance-test-aimed" separate version of them. This may take more time to create an initial version, but we think maintenance cost will -be noticeably smaller. - -### Option 1 - -For the initial version we will teach Master components to use port number to identify Kubelet/KubeProxy. This will allow running those -components on non-default ports, and in the same time will allow to run multiple Hollow Nodes on a single machine. During setup we will -generate credentials for cluster communication and pass them to HollowKubelet/HollowProxy to use. Master will treat all HollowNodes as -normal ones. - - -*Kubemark architecture diagram for option 1* - -### Option 2 - -As a second (equivalent) option we will run Kubemark on top of 'real' Kubernetes cluster, where both Master and Hollow Nodes will be Pods. -In this option we'll be able to use Kubernetes mechanisms to streamline setup, e.g. by using Kubernetes networking to ensure unique IPs for -Hollow Nodes, or using Secrets to distribute Kubelet credentials. The downside of this configuration is that it's likely that some noise -will appear in Kubemark results from either CPU/Memory pressure from other things running on Nodes (e.g. FluentD, or Kubelet) or running -cluster over an overlay network. We believe that it'll be possible to turn off cluster monitoring for Kubemark runs, so that the impact -of real Node daemons will be minimized, but we don't know what will be the impact of using higher level networking stack. Running a -comparison will be an interesting test in itself. - -### Discussion - -Before taking a closer look at steps necessary to set up a minimal Hollow cluster it's hard to tell which approach will be simpler. It's -quite possible that the initial version will end up as hybrid between running the Hollow cluster directly on top of VMs and running the -Hollow cluster on top of a Kubernetes cluster that is running on top of VMs. E.g. running Nodes as Pods in Kubernetes cluster and Master -directly on top of VM. - -## Things to simulate - -In real Kubernetes on a single Node we run two daemons that communicate with Master in some way: Kubelet and KubeProxy. - -### KubeProxy - -As a replacement for KubeProxy we'll use HollowProxy, which will be a real KubeProxy with injected no-op mocks everywhere it makes sense. - -### Kubelet - -As a replacement for Kubelet we'll use HollowKubelet, which will be a real Kubelet with injected no-op or simple mocks everywhere it makes -sense. - -Kubelet also exposes cadvisor endpoint which is scraped by Heapster, healthz to be read by supervisord, and we have FluentD running as a -Pod on each Node that exports logs to Elasticsearch (or Google Cloud Logging). 
Both Heapster and Elasticsearch are running in Pods in the -cluster so do not add any load on a Master components by themselves. There can be other systems that scrape Heapster through proxy running -on Master, which adds additional load, but they're not the part of default setup, so in the first version we won't simulate this behavior. - -In the first version we'll assume that all started Pods will run indefinitely if not explicitly deleted. In the future we can add a model -of short-running batch jobs, but in the initial version we'll assume only serving-like Pods. - -### Heapster - -In addition to system components we run Heapster as a part of cluster monitoring setup. Heapster currently watches Events, Pods and Nodes -through the API server. In the test setup we can use real heapster for watching API server, with mocked out piece that scrapes cAdvisor -data from Kubelets. - -### Elasticsearch and Fluentd - -Similarly to Heapster Elasticsearch runs outside the Master machine but generates some traffic on it. Fluentd “daemon” running on Master -periodically sends Docker logs it gathered to the Elasticsearch running on one of the Nodes. In the initial version we omit Elasticsearch, -as it produces only a constant small load on Master Node that does not change with the size of the cluster. - -## Necessary work - -There are three more or less independent things that needs to be worked on: -- HollowNode implementation, creating a library/binary that will be able to listen to Watches and respond in a correct fashion with Status -updates. This also involves creation of a CloudProvider that can produce such Hollow Nodes, or making sure that HollowNodes can correctly -self-register in no-provider Master. -- Kubemark setup, including figuring networking model, number of Hollow Nodes that will be allowed to run on a single “machine”, writing -setup/run/teardown scripts (in [option 1](#option-1)), or figuring out how to run Master and Hollow Nodes on top of Kubernetes -(in [option 2](#option-2)) -- Creating a Player component that will send requests to the API server putting a load on a cluster. This involves creating a way to -specify desired workload. This task is -very well isolated from the rest, as it is about sending requests to the real API server. Because of that we can discuss requirements -separately. - -## Concerns - -Network performance most likely won't be a problem for the initial version if running on directly on VMs rather than on top of a Kubernetes -cluster, as Kubemark will be running on standard networking stack (no cloud-provider software routes, or overlay network is needed, as we -don't need custom routing between Pods). Similarly we don't think that running Kubemark on Kubernetes virtualized cluster networking will -cause noticeable performance impact, but it requires testing. - -On the other hand when adding additional features it may turn out that we need to simulate Kubernetes Pod network. In such, when running -'pure' Kubemark we may try one of the following: - - running overlay network like Flannel or OVS instead of using cloud providers routes, - - write simple network multiplexer to multiplex communications from the Hollow Kubelets/KubeProxies on the machine. - -In case of Kubemark on Kubernetes it may turn that we run into a problem with adding yet another layer of network virtualization, but we -don't need to solve this problem now. 
- -## Work plan - -- Teach/make sure that Master can talk to multiple Kubelets on the same Machine [option 1](#option-1): - - make sure that Master can talk to a Kubelet on non-default port, - - make sure that Master can talk to all Kubelets on different ports, -- Write HollowNode library: - - new HollowProxy, - - new HollowKubelet, - - new HollowNode combining the two, - - make sure that Master can talk to two HollowKubelets running on the same machine -- Make sure that we can run Hollow cluster on top of Kubernetes [option 2](#option-2) -- Write a player that will automatically put some predefined load on Master, <- this is the moment when it's possible to play with it and is useful by itself for -scalability tests. Alternatively we can just use current density/load tests, -- Benchmark our machines - see how many Watch clients we can have before everything explodes, -- See how many HollowNodes we can run on a single machine by attaching them to the real master <- this is the moment it starts to useful -- Update kube-up/kube-down scripts to enable creating “HollowClusters”/write a new scripts/something, integrate HollowCluster with a Elasticsearch/Heapster equivalents, -- Allow passing custom configuration to the Player - -## Future work - -In the future we want to add following capabilities to the Kubemark system: -- replaying real traffic reconstructed from the recorded Events stream, -- simulating scraping things running on Nodes through Master proxy. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
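To make the "real component with no-op backends" idea above concrete, here is a purely illustrative sketch; the interface and type names are invented for this example and do not correspond to Kubemark's actual code.

```go
// Hypothetical sketch of injecting a no-op backend into otherwise real
// Kubelet code, as described in the Kubemark architecture overview.
package hollow

// containerRuntime stands in for whatever backend a real Kubelet would use to
// start and stop containers.
type containerRuntime interface {
	RunContainer(podUID, containerName, image string) error
	KillContainer(podUID, containerName string) error
	ListContainers() ([]string, error)
}

// noopRuntime reports instant success for every operation. A real Kubelet
// wired to it still watches the API server, reports node and pod status, and
// responds to controllers -- i.e. it generates realistic master-side load --
// but never starts a container.
type noopRuntime struct{}

var _ containerRuntime = noopRuntime{}

func (noopRuntime) RunContainer(podUID, containerName, image string) error { return nil }
func (noopRuntime) KillContainer(podUID, containerName string) error       { return nil }
func (noopRuntime) ListContainers() ([]string, error)                      { return nil, nil }
```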
\ No newline at end of file diff --git a/contributors/design-proposals/scalability/scalability-testing.md b/contributors/design-proposals/scalability/scalability-testing.md index 0195d7b1..f0fbec72 100644 --- a/contributors/design-proposals/scalability/scalability-testing.md +++ b/contributors/design-proposals/scalability/scalability-testing.md @@ -1,68 +1,6 @@ +Design proposals have been archived. -## Background +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -We have a goal to be able to scale to 1000-node clusters by end of 2015. -As a result, we need to be able to run some kind of regression tests and deliver -a mechanism so that developers can test their changes with respect to performance. - -Ideally, we would like to run performance tests also on PRs - although it might -be impossible to run them on every single PR, we may introduce a possibility for -a reviewer to trigger them if the change has non obvious impact on the performance -(something like "k8s-bot run scalability tests please" should be feasible). - -However, running performance tests on 1000-node clusters (or even bigger in the -future is) is a non-starter. Thus, we need some more sophisticated infrastructure -to simulate big clusters on relatively small number of machines and/or cores. - -This document describes two approaches to tackling this problem. -Once we have a better understanding of their consequences, we may want to -decide to drop one of them, but we are not yet in that position. - - -## Proposal 1 - Kubemark - -In this proposal we are focusing on scalability testing of master components. -We do NOT focus on node-scalability - this issue should be handled separately. - -Since we do not focus on the node performance, we don't need real Kubelet nor -KubeProxy - in fact we don't even need to start real containers. -All we actually need is to have some Kubelet-like and KubeProxy-like components -that will be simulating the load on apiserver that their real equivalents are -generating (e.g. sending NodeStatus updated, watching for pods, watching for -endpoints (KubeProxy), etc.). - -What needs to be done: - -1. Determine what requests both KubeProxy and Kubelet are sending to apiserver. -2. Create a KubeletSim that is generating the same load on apiserver that the - real Kubelet, but is not starting any containers. In the initial version we - can assume that pods never die, so it is enough to just react on the state - changes read from apiserver. - TBD: Maybe we can reuse a real Kubelet for it by just injecting some "fake" - interfaces to it? -3. Similarly create a KubeProxySim that is generating the same load on apiserver - as a real KubeProxy. Again, since we are not planning to talk to those - containers, it basically doesn't need to do anything apart from that. - TBD: Maybe we can reuse a real KubeProxy for it by just injecting some "fake" - interfaces to it? -4. Refactor kube-up/kube-down scripts (or create new ones) to allow starting - a cluster with KubeletSim and KubeProxySim instead of real ones and put - a bunch of them on a single machine. -5. Create a load generator for it (probably initially it would be enough to - reuse tests that we use in gce-scalability suite). - - -## Proposal 2 - Oversubscribing - -The other method we are proposing is to oversubscribe the resource, -or in essence enable a single node to look like many separate nodes even though -they reside on a single host. 
This is a well established pattern in many different -cluster managers (for more details see -http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS/doc.prd/index.html ). -There are a couple of different ways to accomplish this, but the most viable method -is to run privileged kubelet pods under a hosts kubelet process. These pods then -register back with the master via the introspective service using modified names -as not to collide. - -Complications may currently exist around container tracking and ownership in docker. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
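As a rough illustration of the KubeletSim idea in Proposal 1, the sketch below watches for pods bound to a simulated node and immediately reports them as Running, without starting any containers. It assumes a configured client-go clientset, and the modern client-go call signatures used here are an anachronism relative to this proposal; error handling and resync are omitted.

```go
// Sketch of a KubeletSim loop: react to pods assigned to this (simulated)
// node by faking their status, generating Kubelet-like API-server load.
package kubeletsim

import (
	"context"
	"log"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

func simulateNode(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	w, err := client.CoreV1().Pods(metav1.NamespaceAll).Watch(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	defer w.Stop()

	for event := range w.ResultChan() {
		pod, ok := event.Object.(*v1.Pod)
		if !ok || event.Type != watch.Added {
			continue
		}
		// Pretend the pod started instantly; this is the status traffic a real
		// Kubelet would generate, minus the containers.
		pod.Status.Phase = v1.PodRunning
		if _, err := client.CoreV1().Pods(pod.Namespace).UpdateStatus(ctx, pod, metav1.UpdateOptions{}); err != nil {
			log.Printf("kubeletsim: status update for %s/%s failed: %v", pod.Namespace, pod.Name, err)
		}
	}
	return nil
}
```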
\ No newline at end of file diff --git a/contributors/design-proposals/scheduling/OWNERS b/contributors/design-proposals/scheduling/OWNERS deleted file mode 100644 index f6155ab6..00000000 --- a/contributors/design-proposals/scheduling/OWNERS +++ /dev/null @@ -1,8 +0,0 @@ -# See the OWNERS docs at https://go.k8s.io/owners - -reviewers: - - sig-scheduling-leads -approvers: - - sig-scheduling-leads -labels: - - sig/scheduling diff --git a/contributors/design-proposals/scheduling/images/.gitignore b/contributors/design-proposals/scheduling/images/.gitignore deleted file mode 100644 index e69de29b..00000000 --- a/contributors/design-proposals/scheduling/images/.gitignore +++ /dev/null diff --git a/contributors/design-proposals/scheduling/images/OWNERS b/contributors/design-proposals/scheduling/images/OWNERS deleted file mode 100644 index 14c05899..00000000 --- a/contributors/design-proposals/scheduling/images/OWNERS +++ /dev/null @@ -1,8 +0,0 @@ -# See the OWNERS docs at https://go.k8s.io/owners - -reviewers: - - bsalamat - - michelleN -approvers: - - bsalamat - - michelleN diff --git a/contributors/design-proposals/scheduling/images/preemption_1.png b/contributors/design-proposals/scheduling/images/preemption_1.png Binary files differdeleted file mode 100644 index 6d1660b6..00000000 --- a/contributors/design-proposals/scheduling/images/preemption_1.png +++ /dev/null diff --git a/contributors/design-proposals/scheduling/images/preemption_2.png b/contributors/design-proposals/scheduling/images/preemption_2.png Binary files differdeleted file mode 100644 index 38fc9088..00000000 --- a/contributors/design-proposals/scheduling/images/preemption_2.png +++ /dev/null diff --git a/contributors/design-proposals/scheduling/images/preemption_3.png b/contributors/design-proposals/scheduling/images/preemption_3.png Binary files differdeleted file mode 100644 index 0f750edb..00000000 --- a/contributors/design-proposals/scheduling/images/preemption_3.png +++ /dev/null diff --git a/contributors/design-proposals/scheduling/images/preemption_4.png b/contributors/design-proposals/scheduling/images/preemption_4.png Binary files differdeleted file mode 100644 index f8343f3b..00000000 --- a/contributors/design-proposals/scheduling/images/preemption_4.png +++ /dev/null diff --git a/contributors/design-proposals/scheduling/images/preemption_flowchart.png b/contributors/design-proposals/scheduling/images/preemption_flowchart.png Binary files differdeleted file mode 100644 index fc72f422..00000000 --- a/contributors/design-proposals/scheduling/images/preemption_flowchart.png +++ /dev/null diff --git a/contributors/design-proposals/scheduling/multiple-schedulers.md b/contributors/design-proposals/scheduling/multiple-schedulers.md index 1581a0d9..f0fbec72 100644 --- a/contributors/design-proposals/scheduling/multiple-schedulers.md +++ b/contributors/design-proposals/scheduling/multiple-schedulers.md @@ -1,135 +1,6 @@ -# Multi-Scheduler in Kubernetes +Design proposals have been archived. -**Status**: Design & Implementation in progress. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -> Contact @HaiyangDING for questions & suggestions. - -## Motivation - -In current Kubernetes design, there is only one default scheduler in a Kubernetes cluster. 
-However it is common that multiple types of workload, such as traditional batch, DAG batch, streaming and user-facing production services, -are running in the same cluster and they need to be scheduled in different ways. For example, in -[Omega](http://research.google.com/pubs/pub41684.html) batch workload and service workload are scheduled by two types of schedulers: -the batch workload is scheduled by a scheduler which looks at the current usage of the cluster to improve the resource usage rate -and the service workload is scheduled by another one which considers the reserved resources in the -cluster and many other constraints since their performance must meet some higher SLOs. -[Mesos](http://mesos.apache.org/) has done a great work to support multiple schedulers by building a -two-level scheduling structure. This proposal describes how Kubernetes is going to support multi-scheduler -so that users could be able to run their user-provided scheduler(s) to enable some customized scheduling -behavior as they need. As previously discussed in [#11793](https://github.com/kubernetes/kubernetes/issues/11793), -[#9920](https://github.com/kubernetes/kubernetes/issues/9920) and [#11470](https://github.com/kubernetes/kubernetes/issues/11470), -the design of the multiple scheduler should be generic and includes adding a scheduler name annotation to separate the pods. -It is worth mentioning that the proposal does not address the question of how the scheduler name annotation gets -set although it is reasonable to anticipate that it would be set by a component like admission controller/initializer, -as the doc currently does. - -Before going to the details of this proposal, below lists a number of the methods to extend the scheduler: - -- Write your own scheduler and run it along with Kubernetes native scheduler. This is going to be detailed in this proposal -- Use the callout approach such as the one implemented in [#13580](https://github.com/kubernetes/kubernetes/issues/13580) -- Recompile the scheduler with a new policy -- Restart the scheduler with a new [scheduler policy config file](https://git.k8s.io/examples/staging/scheduler-policy-config.json) -- Or maybe in future dynamically link a new policy into the running scheduler - -## Challenges in multiple schedulers - -- Separating the pods - - Each pod should be scheduled by only one scheduler. As for implementation, a pod should - have an additional field to tell by which scheduler it wants to be scheduled. Besides, - each scheduler, including the default one, should have a unique logic of how to add unscheduled - pods to its to-be-scheduled pod queue. Details will be explained in later sections. - -- Dealing with conflicts - - Different schedulers are essentially separated processes. When all schedulers try to schedule - their pods onto the nodes, there might be conflicts. - - One example of the conflicts is resource racing: Suppose there be a `pod1` scheduled by - `my-scheduler` requiring 1 CPU's *request*, and a `pod2` scheduled by `kube-scheduler` (k8s native - scheduler, acting as default scheduler) requiring 2 CPU's *request*, while `node-a` only has 2.5 - free CPU's, if both schedulers all try to put their pods on `node-a`, then one of them would eventually - fail when Kubelet on `node-a` performs the create action due to insufficient CPU resources. - - This conflict is complex to deal with in api-server and etcd. 
Our current solution is to let Kubelet - to do the conflict check and if the conflict happens, effected pods would be put back to scheduler - and waiting to be scheduled again. Implementation details are in later sections. - -## Where to start: initial design - -We definitely want the multi-scheduler design to be a generic mechanism. The following lists the changes -we want to make in the first step. - -- Add an annotation in pod template: `scheduler.alpha.kubernetes.io/name: scheduler-name`, this is used to -separate pods between schedulers. `scheduler-name` should match one of the schedulers' `scheduler-name` -- Add a `scheduler-name` to each scheduler. It is done by hardcode or as command-line argument. The -Kubernetes native scheduler (now `kube-scheduler` process) would have the name as `kube-scheduler` -- The `scheduler-name` plays an important part in separating the pods between different schedulers. -Pods are statically dispatched to different schedulers based on `scheduler.alpha.kubernetes.io/name: scheduler-name` -annotation and there should not be any conflicts between different schedulers handling their pods, i.e. one pod must -NOT be claimed by more than one scheduler. To be specific, a scheduler can add a pod to its queue if and only if: - 1. The pod has no nodeName, **AND** - 2. The `scheduler-name` specified in the pod's annotation `scheduler.alpha.kubernetes.io/name: scheduler-name` - matches the `scheduler-name` of the scheduler. - - The only one exception is the default scheduler. Any pod that has no `scheduler.alpha.kubernetes.io/name: scheduler-name` - annotation is assumed to be handled by the "default scheduler". In the first version of the multi-scheduler feature, - the default scheduler would be the Kubernetes built-in scheduler with `scheduler-name` as `kube-scheduler`. - The Kubernetes build-in scheduler will claim any pod which has no `scheduler.alpha.kubernetes.io/name: scheduler-name` - annotation or which has `scheduler.alpha.kubernetes.io/name: kube-scheduler`. In the future, it may be possible to - change which scheduler is the default for a given cluster. - -- Dealing with conflicts. All schedulers must use predicate functions that are at least as strict as -the ones that Kubelet applies when deciding whether to accept a pod, otherwise Kubelet and scheduler -may get into an infinite loop where Kubelet keeps rejecting a pod and scheduler keeps re-scheduling -it back the same node. To make it easier for people who write new schedulers to obey this rule, we will -create a library containing the predicates Kubelet uses. (See issue [#12744](https://github.com/kubernetes/kubernetes/issues/12744).) - -In summary, in the initial version of this multi-scheduler design, we will achieve the following: - -- If a pod has the annotation `scheduler.alpha.kubernetes.io/name: kube-scheduler` or the user does not explicitly -sets this annotation in the template, it will be picked up by default scheduler -- If the annotation is set and refers to a valid `scheduler-name`, it will be picked up by the scheduler of -specified `scheduler-name` -- If the annotation is set but refers to an invalid `scheduler-name`, the pod will not be picked by any scheduler. -The pod will keep PENDING. - -### An example - -```yaml - kind: Pod - apiVersion: v1 - metadata: - name: pod-abc - labels: - foo: bar - annotations: - scheduler.alpha.kubernetes.io/name: my-scheduler -``` - -This pod will be scheduled by "my-scheduler" and ignored by "kube-scheduler". 
If there is no running scheduler -of name "my-scheduler", the pod will never be scheduled. - -## Next steps - -1. Use admission controller to add and verify the annotation, and do some modification if necessary. For example, the -admission controller might add the scheduler annotation based on the namespace of the pod, and/or identify if -there are conflicting rules, and/or set a default value for the scheduler annotation, and/or reject pods on -which the client has set a scheduler annotation that does not correspond to a running scheduler. -2. Dynamic launching scheduler(s) and registering to admission controller (as an external call). This also -requires some work on authorization and authentication to control what schedulers can write the /binding -subresource of which pods. -3. Optimize the behaviors of priority functions in multi-scheduler scenario. In the case where multiple schedulers have -the same predicate and priority functions (for example, when using multiple schedulers for parallelism rather than to -customize the scheduling policies), all schedulers would tend to pick the same node as "best" when scheduling identical -pods and therefore would be likely to conflict on the Kubelet. To solve this problem, we can pass -an optional flag such as `--randomize-node-selection=N` to scheduler, setting this flag would cause the scheduler to pick -randomly among the top N nodes instead of the one with the highest score. - -## Other issues/discussions related to scheduler design - -- [#13580](https://github.com/kubernetes/kubernetes/pull/13580): scheduler extension -- [#17097](https://github.com/kubernetes/kubernetes/issues/17097): policy config file in pod template -- [#16845](https://github.com/kubernetes/kubernetes/issues/16845): scheduling groups of pods -- [#17208](https://github.com/kubernetes/kubernetes/issues/17208): guide to writing a new scheduler +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
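A minimal sketch of the pod-claiming rule described above, assuming the alpha-era annotation key from this proposal (in later Kubernetes releases this became the `spec.schedulerName` field):

```go
// Sketch: which pending pods should a given scheduler add to its queue?
package multisched

import v1 "k8s.io/api/core/v1"

const (
	schedulerNameAnnotation = "scheduler.alpha.kubernetes.io/name"
	defaultSchedulerName    = "kube-scheduler"
)

// responsibleForPod reports whether a scheduler named schedulerName should add
// the pod to its to-be-scheduled queue.
func responsibleForPod(pod *v1.Pod, schedulerName string) bool {
	if pod.Spec.NodeName != "" {
		return false // already bound to a node
	}
	name, ok := pod.Annotations[schedulerNameAnnotation]
	if !ok {
		// Pods with no annotation are handled by the default scheduler.
		return schedulerName == defaultSchedulerName
	}
	return name == schedulerName
}
```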
\ No newline at end of file diff --git a/contributors/design-proposals/scheduling/nodeaffinity.md b/contributors/design-proposals/scheduling/nodeaffinity.md index 31fb520a..f0fbec72 100644 --- a/contributors/design-proposals/scheduling/nodeaffinity.md +++ b/contributors/design-proposals/scheduling/nodeaffinity.md @@ -1,267 +1,6 @@ -# Node affinity and NodeSelector +Design proposals have been archived. -## Introduction - -This document proposes a new label selector representation, called -`NodeSelector`, that is similar in many ways to `LabelSelector`, but is a bit -more flexible and is intended to be used only for selecting nodes. - -In addition, we propose to replace the `map[string]string` in `PodSpec` that the -scheduler currently uses as part of restricting the set of nodes onto which a -pod is eligible to schedule, with a field of type `Affinity` that contains one -or more affinity specifications. In this document we discuss `NodeAffinity`, -which contains one or more of the following: -* a field called `RequiredDuringSchedulingRequiredDuringExecution` that will be -represented by a `NodeSelector`, and thus generalizes the scheduling behavior of -the current `map[string]string` but still serves the purpose of restricting -the set of nodes onto which the pod can schedule. In addition, unlike the -behavior of the current `map[string]string`, when it becomes violated the system -will try to eventually evict the pod from its node. -* a field called `RequiredDuringSchedulingIgnoredDuringExecution` which is -identical to `RequiredDuringSchedulingRequiredDuringExecution` except that the -system may or may not try to eventually evict the pod from its node. -* a field called `PreferredDuringSchedulingIgnoredDuringExecution` that -specifies which nodes are preferred for scheduling among those that meet all -scheduling requirements. - -(In practice, as discussed later, we will actually *add* the `Affinity` field -rather than replacing `map[string]string`, due to backward compatibility -requirements.) - -The affinity specifications described above allow a pod to request various -properties that are inherent to nodes, for example "run this pod on a node with -an Intel CPU" or, in a multi-zone cluster, "run this pod on a node in zone Z." -([This issue](https://github.com/kubernetes/kubernetes/issues/9044) describes -some of the properties that a node might publish as labels, which affinity -expressions can match against.) They do *not* allow a pod to request to schedule -(or not schedule) on a node based on what other pods are running on the node. -That feature is called "inter-pod topological affinity/anti-affinity" and is -described [here](https://github.com/kubernetes/kubernetes/pull/18265). - -## API - -### NodeSelector - -```go -// A node selector represents the union of the results of one or more label queries -// over a set of nodes; that is, it represents the OR of the selectors represented -// by the nodeSelectorTerms. -type NodeSelector struct { - // nodeSelectorTerms is a list of node selector terms. The terms are ORed. - NodeSelectorTerms []NodeSelectorTerm `json:"nodeSelectorTerms,omitempty"` -} - -// An empty node selector term matches all objects. A null node selector term -// matches no objects. -type NodeSelectorTerm struct { - // matchExpressions is a list of node selector requirements. The requirements are ANDed. 
- MatchExpressions []NodeSelectorRequirement `json:"matchExpressions,omitempty"` -} - -// A node selector requirement is a selector that contains values, a key, and an operator -// that relates the key and values. -type NodeSelectorRequirement struct { - // key is the label key that the selector applies to. - Key string `json:"key" patchStrategy:"merge" patchMergeKey:"key"` - // operator represents a key's relationship to a set of values. - // Valid operators are In, NotIn, Exists, DoesNotExist. Gt, and Lt. - Operator NodeSelectorOperator `json:"operator"` - // values is an array of string values. If the operator is In or NotIn, - // the values array must be non-empty. If the operator is Exists or DoesNotExist, - // the values array must be empty. If the operator is Gt or Lt, the values - // array must have a single element, which will be interpreted as an integer. - // This array is replaced during a strategic merge patch. - Values []string `json:"values,omitempty"` -} - -// A node selector operator is the set of operators that can be used in -// a node selector requirement. -type NodeSelectorOperator string - -const ( - NodeSelectorOpIn NodeSelectorOperator = "In" - NodeSelectorOpNotIn NodeSelectorOperator = "NotIn" - NodeSelectorOpExists NodeSelectorOperator = "Exists" - NodeSelectorOpDoesNotExist NodeSelectorOperator = "DoesNotExist" - NodeSelectorOpGt NodeSelectorOperator = "Gt" - NodeSelectorOpLt NodeSelectorOperator = "Lt" -) -``` - -### NodeAffinity - -We will add one field to `PodSpec` - -```go -Affinity *Affinity `json:"affinity,omitempty"` -``` - -The `Affinity` type is defined as follows - -```go -type Affinity struct { - NodeAffinity *NodeAffinity `json:"nodeAffinity,omitempty"` -} - -type NodeAffinity struct { - // If the affinity requirements specified by this field are not met at - // scheduling time, the pod will not be scheduled onto the node. - // If the affinity requirements specified by this field cease to be met - // at some point during pod execution (e.g. due to a node label update), - // the system will try to eventually evict the pod from its node. - RequiredDuringSchedulingRequiredDuringExecution *NodeSelector `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"` - // If the affinity requirements specified by this field are not met at - // scheduling time, the pod will not be scheduled onto the node. - // If the affinity requirements specified by this field cease to be met - // at some point during pod execution (e.g. due to a node label update), - // the system may or may not try to eventually evict the pod from its node. - RequiredDuringSchedulingIgnoredDuringExecution *NodeSelector `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"` - // The scheduler will prefer to schedule pods to nodes that satisfy - // the affinity expressions specified by this field, but it may choose - // a node that violates one or more of the expressions. The node that is - // most preferred is the one with the greatest sum of weights, i.e. - // for each node that meets all of the scheduling requirements (resource - // request, RequiredDuringScheduling affinity expressions, etc.), - // compute a sum by iterating through the elements of this field and adding - // "weight" to the sum if the node matches the corresponding MatchExpressions; the - // node(s) with the highest sum are the most preferred. 
- PreferredDuringSchedulingIgnoredDuringExecution []PreferredSchedulingTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"` -} - -// An empty preferred scheduling term matches all objects with implicit weight 0 -// (i.e. it's a no-op). A null preferred scheduling term matches no objects. -type PreferredSchedulingTerm struct { - // weight is in the range 1-100 - Weight int `json:"weight"` - // matchExpressions is a list of node selector requirements. The requirements are ANDed. - MatchExpressions []NodeSelectorRequirement `json:"matchExpressions,omitempty"` -} -``` - -Unfortunately, the name of the existing `map[string]string` field in PodSpec is -`NodeSelector` and we can't change it since this name is part of the API. -Hopefully this won't cause too much confusion. - -## Examples - -Run a pod on a node with an Intel or AMD CPU and in availability zone Z: - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: pod-with-node-affinity -spec: - affinity: - nodeAffinity: - requiredDuringSchedulingIgnoredDuringExecution: - nodeSelectorTerms: - - matchExpressions: - - key: kubernetes.io/arch - operator: In - values: - - intel - - amd64 - preferredDuringSchedulingIgnoredDuringExecution: - - weight: 1 - preference: - matchExpressions: - - key: failure-domain.kubernetes.io/zone - operator: In - values: - - Z - containers: - - name: pod-with-node-affinity - image: tomcat:8 -``` - -## Backward compatibility - -When we add `Affinity` to PodSpec, we will deprecate, but not remove, the -current field in PodSpec - -```go -NodeSelector map[string]string `json:"nodeSelector,omitempty"` -``` - -Old version of the scheduler will ignore the `Affinity` field. New versions of -the scheduler will apply their scheduling predicates to both `Affinity` and -`nodeSelector`, i.e. the pod can only schedule onto nodes that satisfy both sets -of requirements. We will not attempt to convert between `Affinity` and -`nodeSelector`. - -Old versions of non-scheduling clients will not know how to do anything -semantically meaningful with `Affinity`, but we don't expect that this will -cause a problem. - -See [this comment](https://github.com/kubernetes/kubernetes/issues/341#issuecomment-140809259) -for more discussion. - -Users should not start using `NodeAffinity` until the full implementation has -been in Kubelet and the master for enough binary versions that we feel -comfortable that we will not need to roll back either Kubelet or master to a -version that does not support them. Longer-term we will use a programatic -approach to enforcing this ([#4855](https://github.com/kubernetes/kubernetes/issues/4855)). - -## Implementation plan - -1. Add the `Affinity` field to PodSpec and the `NodeAffinity`, -`PreferredDuringSchedulingIgnoredDuringExecution`, and -`RequiredDuringSchedulingIgnoredDuringExecution` types to the API. -2. Implement a scheduler predicate that takes -`RequiredDuringSchedulingIgnoredDuringExecution` into account. -3. Implement a scheduler priority function that takes -`PreferredDuringSchedulingIgnoredDuringExecution` into account. -4. At this point, the feature can be deployed and `PodSpec.NodeSelector` can be -marked as deprecated. -5. Add the `RequiredDuringSchedulingRequiredDuringExecution` field to the API. -6. Modify the scheduler predicate from step 2 to also take -`RequiredDuringSchedulingRequiredDuringExecution` into account. -7. Add `RequiredDuringSchedulingRequiredDuringExecution` to Kubelet's admission -decision. -8. 
Implement code in Kubelet *or* the controllers that evicts a pod that no -longer satisfies `RequiredDuringSchedulingRequiredDuringExecution` (see [this comment](https://github.com/kubernetes/kubernetes/issues/12744#issuecomment-164372008)). - -We assume Kubelet publishes labels describing the node's membership in all of -the relevant scheduling domains (e.g. node name, rack name, availability zone -name, etc.). See [#9044](https://github.com/kubernetes/kubernetes/issues/9044). - -## Extensibility - -The design described here is the result of careful analysis of use cases, a -decade of experience with Borg at Google, and a review of similar features in -other open-source container orchestration systems. We believe that it properly -balances the goal of expressiveness against the goals of simplicity and -efficiency of implementation. However, we recognize that use cases may arise in -the future that cannot be expressed using the syntax described here. Although we -are not implementing an affinity-specific extensibility mechanism for a variety -of reasons (simplicity of the codebase, simplicity of cluster deployment, desire -for Kubernetes users to get a consistent experience, etc.), the regular -Kubernetes annotation mechanism can be used to add or replace affinity rules. -The way this work would is: - -1. Define one or more annotations to describe the new affinity rule(s) -1. User (or an admission controller) attaches the annotation(s) to pods to -request the desired scheduling behavior. If the new rule(s) *replace* one or -more fields of `Affinity` then the user would omit those fields from `Affinity`; -if they are *additional rules*, then the user would fill in `Affinity` as well -as the annotation(s). -1. Scheduler takes the annotation(s) into account when scheduling. - -If some particular new syntax becomes popular, we would consider upstreaming it -by integrating it into the standard `Affinity`. - -## Future work - -Are there any other fields we should convert from `map[string]string` to -`NodeSelector`? - -## Related issues - -The review for this proposal is in [#18261](https://github.com/kubernetes/kubernetes/issues/18261). - -The main related issue is [#341](https://github.com/kubernetes/kubernetes/issues/341). -Issue [#367](https://github.com/kubernetes/kubernetes/issues/367) is also related. -Those issues reference other related issues. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
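As an illustration of the operator semantics defined in the API section above, here is a sketch (not the scheduler's actual code) of evaluating a single `NodeSelectorRequirement` against a node's labels. The type stubs simply mirror the proposal's definitions; how absent keys interact with `NotIn` and `Gt`/`Lt` is an interpretation of this sketch.

```go
// Sketch: evaluate one NodeSelectorRequirement against a node's labels.
package nodeaffinity

import "strconv"

// Stubs mirroring the types defined in the API section of this proposal.
type NodeSelectorOperator string

const (
	NodeSelectorOpIn           NodeSelectorOperator = "In"
	NodeSelectorOpNotIn        NodeSelectorOperator = "NotIn"
	NodeSelectorOpExists       NodeSelectorOperator = "Exists"
	NodeSelectorOpDoesNotExist NodeSelectorOperator = "DoesNotExist"
	NodeSelectorOpGt           NodeSelectorOperator = "Gt"
	NodeSelectorOpLt           NodeSelectorOperator = "Lt"
)

type NodeSelectorRequirement struct {
	Key      string
	Operator NodeSelectorOperator
	Values   []string
}

func requirementMatches(req NodeSelectorRequirement, nodeLabels map[string]string) bool {
	value, exists := nodeLabels[req.Key]
	switch req.Operator {
	case NodeSelectorOpIn:
		return exists && contains(req.Values, value)
	case NodeSelectorOpNotIn:
		return !exists || !contains(req.Values, value)
	case NodeSelectorOpExists:
		return exists
	case NodeSelectorOpDoesNotExist:
		return !exists
	case NodeSelectorOpGt, NodeSelectorOpLt:
		// Per the API comment, the single value is interpreted as an integer.
		if !exists || len(req.Values) != 1 {
			return false
		}
		nodeVal, err1 := strconv.ParseInt(value, 10, 64)
		reqVal, err2 := strconv.ParseInt(req.Values[0], 10, 64)
		if err1 != nil || err2 != nil {
			return false
		}
		if req.Operator == NodeSelectorOpGt {
			return nodeVal > reqVal
		}
		return nodeVal < reqVal
	}
	return false
}

func contains(values []string, s string) bool {
	for _, v := range values {
		if v == s {
			return true
		}
	}
	return false
}
```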
\ No newline at end of file diff --git a/contributors/design-proposals/scheduling/pod-preemption.md b/contributors/design-proposals/scheduling/pod-preemption.md index 02100c19..f0fbec72 100644 --- a/contributors/design-proposals/scheduling/pod-preemption.md +++ b/contributors/design-proposals/scheduling/pod-preemption.md @@ -1,431 +1,6 @@ -# Pod Preemption in Kubernetes +Design proposals have been archived. -_Status: Draft_ -_Author: @bsalamat_ +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -- [Pod Preemption in Kubernetes](#pod-preemption-in-kubernetes) -- [Objectives](#objectives) - - [Non-Goals](#non-goals) -- [Background](#background) - - [Terminology](#terminology) -- [Overview](#overview) -- [Detailed Design](#detailed-design) - - [Preemption scenario](#preemption-scenario) - - [Scheduler performs preemption](#scheduler-performs-preemption) - - [Preemption order](#preemption-order) - - [Preemption - Eviction workflow](#preemption---eviction-workflow) - - [Race condition in multi-scheduler clusters](#race-condition-in-multi-scheduler-clusters) - - [Starvation Problem](#starvation-problem) - - [Supporting PodDisruptionBudget](#supporting-poddisruptionbudget) - - [Supporting Inter-Pod Affinity on Lower Priority Pods](#supporting-inter-pod-affinity-on-lower-priority-pods?) - - [Supporting Cross Node Preemption](#supporting-cross-node-preemption?) -- [Interactions with Cluster Autoscaler](#interactions-with-cluster-autoscaler) -- [Alternatives Considered](#alternatives-considered) - - [Rescheduler or Kubelet performs preemption](#rescheduler-or-kubelet-performs-preemption) - - [Preemption order](#preemption-order) -- [References](#references) - -# Objectives - -- Define the concept of preemption in Kubernetes. -- Define how priority and other metrics affect preemption. -- Define scenarios under which a pod may get preempted. -- Define the interaction between scheduler preemption and Kubelet evictions. -- Define mechanics of preemption. -- Propose new changes to the scheduling algorithms. -- Propose new changes to the cluster auto-scaler. - -## Non-Goals - -- How **eviction** works in Kubernetes. (Please see [Background](#background) for the definition of "eviction".) -- How quota management works in Kubernetes. - -# Background - -Running various types of workloads with different priorities is a common practice in medium and large clusters to achieve higher resource utilization. In such scenarios, the amount of workload can be larger than what the total resources of the cluster can handle. If so, the cluster chooses the most important workloads and runs them. The importance of workloads are specified by a combination of [priority](https://github.com/bsalamat/community/blob/564ebff843532faf5dcb06a7e50b0db5c5b501cf/contributors/design-proposals/pod-priority-api.md), QoS, or other cluster-specific metrics. The potential to have more work than what cluster resources can handle is called "overcommitment". Overcommitment is very common in on-prem clusters where the number of nodes is fixed, but it can similarly happen in cloud as cloud customers may choose to run their clusters overcommitted/overloaded at times in order to save money. For example, a cloud customer may choose to run at most 100 nodes, knowing that all of their critical workloads fit on 100 nodes and if there is more work, they won't be critical and can wait until cluster load decreases. 
- -## Terminology - -When a new pod has certain scheduling requirements that makes it infeasible on any node in the cluster, scheduler may choose to kill lower priority pods to satisfy the scheduling requirements of the new pod. We call this operation "**preemption**". Preemption is distinguished from "[**eviction**](https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#eviction-policy)" where kubelet kills a pod on a node because that particular node is running out of resources. - -# Overview - -This document describes how preemption in Kubernetes works. Preemption is the action taken when an important pod requires resources or conditions which are not available in the cluster. So, one or more pods need to be killed to make room for the important pod. - -# Detailed Design - -## Preemption scenario - -In this proposal, the only scenario under which a group of pods in Kubernetes may be preempted is when a higher priority pod cannot be scheduled due to various unmet scheduling requirements, such as lack of resources, unsatisfied affinity or anti-affinity rules, etc., and the preemption of the lower priority pods allows the higher priority pod to be scheduled. So, if the preemption of the lower priority pods does not help with scheduling of the higher priority pod, those lower priority pods will keep running and the higher priority pod will stay pending. -Please note the terminology here. The above scenario does not include "evictions" that are performed by the Kubelet when a node runs out of resources. -Please also note that scheduler may preempt a pod on one node in order to meet the scheduling requirements of a pending pod on another node. For example, if there is a low-priority pod running on node N in rack R, and there is a high-priority pending pod, and one or both of the pods have a requiredDuringScheduling anti-affinity rule saying they can't run on the same rack, then the lower-priority pod might be preempted to enable the higher-priority pod to schedule onto some node M != N on rack R (or, of course, M == N, which is the more standard same-node preemption scenario). - -## Scheduler performs preemption - -We propose preemption to be done by the scheduler -- it does it by deleting the being preempted pods. The component that performs the preemption must have the logic to find the right nodes for the pending pod. It must also have the logic to check whether preempting the chosen pods allows scheduling of the pending pod. These require the component to have the knowledge of predicate and priority functions. -We believe having scheduler perform preemption has the following benefits: - -- Avoids replicating all of the scheduling logic in another component. -- Reduces the risk of race condition between pod preemption and pending pod scheduling. If both of these are performed by the scheduler, scheduler can perform them serially (although currently not atomically). However, if a different component performs preemption, scheduler may schedule a different pod (than the preemptor) on the node whose pods are preempted. -The race condition will still exist if we have multiple schedulers. More on this below. - -## Preemption order - -When scheduling a pending pod, scheduler tries to place the pod on a node that does not require preemption. If there is no such a node, scheduler may favor a node where the number and/or priority of victims (preempted pods) is smallest. 
After choosing the node, scheduler considers the lowest [priority](https://github.com/bsalamat/community/blob/564ebff843532faf5dcb06a7e50b0db5c5b501cf/contributors/design-proposals/pod-priority-api.md) pods for preemption first. Scheduler starts from the lowest priority and considers enough pods that should be preempted to allow the pending pod to schedule. Scheduler only considers pods that have lower priority than the pending pod. - -#### Important notes - -- When ordering the pods from lowest to highest priority for considering which pod(s) to preempt, among pods with equal priority the pods are ordered by their [QoS class](/contributors/design-proposals/node/resource-qos.md#qos-classes): Best Effort, Burstable, Guaranteed. -- Scheduler respects pods' disruption budget when considering them for preemption. -- Scheduler will try to minimize the number of preempted pods. As a result, it may preempt a pod while leaving lower priority pods running if preemption of those lower priority pods is not enough to schedule the pending pod while preemption of the higher priority pod(s) is enough to schedule the pending pod. For example, if node capacity is 10, and pending pod is priority 10 and requires 5 units of resource, and the running pods are {priority 0 request 3, priority 1 request 1, priority 2 request 5, priority 3 request 1}, scheduler will preempt the priority 2 pod only and leaves priority 1 and priority 0 running. -- Scheduler does not have the knowledge of resource usage of pods. It makes scheduling decisions based on the requested resources ("requests") of the pods and when it considers a pod for preemption, it assumes the "requests" to be freed on the node. - - This means that scheduler will never preempt a Best Effort pod to make more resources available. That's because the requests of Best Effort pods is zero and therefore, preempting them will not release any resources on the node from the scheduler's point of view. - - The scheduler may still preempt Best Effort pods for reasons other than releasing their resources. For example, it may preempt a Best Effort pod in order to satisfy affinity rules of the pending pod. - - The amount that needs to be freed (when the issue is resources) is request of the pending pod. - -## Preemption - Eviction workflow - -"Eviction" is the act of killing one or more pods on a node when the node is under resource pressure. Kubelet performs eviction. The eviction process is described in separate document by sig-node, but since it is closely related to the "preemption", we explain it briefly here. -Kubelet uses a function of priority, usage, and requested resources to determine -which pod(s) should be evicted. When pods with the same priority are considered for eviction, the one with the highest percentage of usage over "requests" is the one that is evicted first. -This implies that Best Effort pods are more likely to be evicted among a set of pods with the same priority. The reason is that any amount of resource usage by Best Effort pods translates into a very large percentage of usage over "requests", as Best Effort pods have zero requests for resources. So, while scheduler does not preempt Best Effort pods for releasing resources on a node, it is likely that these pods are evicted by the Kubelet after scheduler schedules a higher priority pod on the node. -Here is an example: - -1. Assume we have a node with 2GB of usable memory by pods. -1. The node is running a burstable pod that uses 1GB of memory and 500MB memory request. 
It also runs a best effort pod that uses 1GB of memory (and 0 memory request). Both of the pods have priority of 100. -1. A new pod with priority 200 arrives and asks for 2GB of memory. -1. Scheduler knows that it has to preempt the burstable pod. From scheduler's point of view, the best effort pod needs no resources and its preemption will not release any resources. -1. Scheduler preempts the burstable pod and schedules the high priority pending pod on the node. -1. The high priority pod uses more than 1GB of memory. Kubelet detects the resource pressure and kills the best effort pod. - -So, best effort pods may be killed to make room for higher priority pods, although the scheduler does not preempt them directly. -Now, assume everything in the above example, but the best effort pod has priority 2000. In this scenario, scheduler schedules the pending pod with priority 200 on the node, but it may be evicted by the Kubelet, because Kubelet's eviction function may determine that the best effort pod should stay given its high priority and despite its usage above request. Given this scenario, scheduler should avoid the node and should try scheduling the pod on a different node if the pod is evicted by the Kubelet. This is an optimization to prevent possible ping-pong behavior between Kubelet and Scheduler. - -## Race condition in multi-scheduler clusters - -Kubernetes allows a cluster to have more than one scheduler. This introduces a race condition where one scheduler (scheduler A) may perform preemption of one or more pods and another scheduler (scheduler B) schedules a different pod than the initial pending pod in the space opened after the preemption of pods and before the scheduler A has the chance to schedule the initial pending pod. In this case, scheduler A goes ahead and schedules the initial pending pod on the node thinking that the space is still available. However, the pod from A will be rejected by the kubelet admission process if there are not enough free resources on the node after the pod from B has been bound (or any other predicate that kubelet admission checks fails). This is not a major issue, as schedulers will try again to schedule the rejected pod. -Our assumption is that multiple schedulers cooperate with one another. If they don't, scheduler A may schedule pod A. Scheduler B preempts pod A to schedule pod B which is then preempted by scheduler A to schedule pod A and we go in a loop. - - -## Starvation Problem - -Evicting victim(s) and binding the pending Pod (P) are not transactional. -Preemption victims may have "`TerminationGracePeriodSeconds`" which will create -even a larger time gap between the eviction and binding points. When a victim -with termination grace period receives its termination signal, it keeps running -on the node until it terminates successfully or its grace period is over. This -creates a time gap between the point that the scheduler preempts Pods and the -time when the pending Pod (P) can be scheduled on the Node (N). Note that the -pending queue is a FIFO and when a Pod is considered for scheduling and it -cannot be scheduled, it goes to the end of the queue. When P is determined -unschedulable and it preempts victims, it goes to the end of the queue as well. -After preempting victims, the scheduler keeps scheduling other pending Pods. As -victims exit or get terminated, the scheduler tries to schedule Pods in the -pending queue, and one or more of them may be considered and scheduled to N -before the scheduler considers scheduling P again. 
In such a case, it is likely -that when all the victims exit, Pod P won't fit on Node N anymore. So, scheduler -will have to preempt other Pods on Node N or another Node so that P can be -scheduled. This scenario might be repeated again for the second and subsequent -rounds of preemption, and P might not get scheduled for a while. This scenario -can cause problems in various clusters, but is particularly problematic in -clusters with a high Pod creation rate. - - -### Solution - -#### Changes to the data structures - -1. Scheduler pending queue is changed from a FIFO to a priority queue (heap). -The head of the queue will always be the highest priority pending pod. -1. A new list is added to hold unschedulable pods. - - -#### New Scheduler Algorithm - -1. Pick the head of the pending queue (highest priority pending pod). -1. Try to schedule the pod. -1. If the pod is schedulable, assume and bind it. -1. If the pod is not schedulable, run preemption for the pod. -1. Move the pod to the list of unschedulable pods. -1. If a node was chosen to preempt pods, set the node name as an annotation with - the "scheduler.kubernetes.io/nominated-node-name" key to the pod. This key is referred to as - "NominatedNodeName" in this doc for brevity. - When this annotation exists, scheduler knows that the pod is destined to run on the - given node and takes it into account when making scheduling decisions for other pods. -1. When any pod is terminated, a node is added/removed, or when -pods or nodes updated, remove all the pods from the unschedulable pods -list and add them to the scheduling queue. (Scheduler should keep its existing rate -limiting.) We should also put the pending pods with inter-pod affinity back to -the scheduling queue when a new pod is scheduled. To be more efficient, we may check -if the newly scheduled pod matches any of the pending pods affinity rules before -putting the pending pods back into the scheduling queue. - - -#### Changes to predicate processing - -When determining feasibility of a pod on a node, assume that all the pods with -higher or equal priority in the unschedulable list are already running on their -respective "nominated" nodes. Pods in the unschedulable list that do not have a -nominated node are not considered running. - -If the pod was schedulable on the node in presence of the higher priority pods, -run predicates again without those higher priority pods on the nodes. If the pod -is still schedulable, then run it. This second step is needed, because those -higher priority pods are not actually running on the nodes yet. As a result, -certain predicates, like inter-pod affinity, may not be satisfied. - -This applies to preemption logic as well, i.e., preemption logic must follow the -two steps when it considers viability of preemption. - -#### Changes to the preemption workflow - -The alpha version of preemption already has a logic that performs preemption for -a pod only in one of the two scenarios: - -1. The pod does not have annotations["NominatedNodeName"]. -1. The pod has annotations["NominatedNodeName"], but there is no lower priority -pod on the nominated node in terminating state. - -The new changes are as follows: - -* If preemption is tried, but no node is chosen for preempting pods, preemption -function should remove annotations["NominatedNodeName"] of the pod if it already -exists. This is needed to give the pod another chance to be considered for -preemption in the next round. 
-* When a pod NominatedNodeName is set, scheduler reevaluates whether lower -priority pods whose NominatedNodeNames are the same still fit on the node. If they -no longer fit, scheduler clears their NominatedNodeNames and moves them to the -scheduling queue. This gives those pods another chance to preempt other pods on -other nodes. - -#### Notes - -* When scheduling a pod, scheduler ignores "NominatedNodeName" of the pod. So, - it may or may not schedule the pod on the nominated node. - -#### Flowchart of the new scheduling algorithm - - - -#### Examples - -##### **Example 1** - - - -* There is only 1 node (other than the master) in the cluster. The node -capacity is 10 units of resources. -* There are two pods, A and B, running on the node. Both have priority 100 and -each use 5 units of resources. Pod A has 60 seconds of graceful termination -period and pod B has 30 seconds. -* Scheduler has two pods, C and D, in its queue. Pod C needs 10 units of -resources and its priority is 1000. Pod D needs 2 units of resources and its priority is 50. -* Given that pod C's priority is 1000, scheduler preempts both of pods A and B -and sets the nominated node name of C to Node 1. Pod D cannot be scheduled -anywhere. Both are moved to the unschedulable list. -* After 30 seconds (or less) pod B terminates and 5 units of resources become -available. Scheduler removed C and D from the unschedulable list and puts them -back in Scheduling queue. Scheduler looks at pod C, but it cannot be scheduled -yet. Pod C has a nominated node name so it won't cause more preemption. It is -moved to unschedulable list again. -* Scheduler tries to schedule pod D, but since pod C in unschedulable list has -higher priority than D, scheduler assumes that it is bound to Node 1 when it -evaluates feasibility of pod D. With this assumption, scheduler determines that -the node does not have enough resources for pod D. So, D is moved to unschedulable -list as well. -* After 60 seconds (or less) pod A also terminates and scheduler schedules pod -C on the node. Scheduler then looks at pod D, but it cannot be scheduled. - - -##### Example 2 - - - -* Everything is similar to the previous example, but here we have two nodes. -Node 2 is running pod E with priority 2000 and request of 10 units. -* Similar to example 1, scheduler preempts pods A and B on Node 1 and sets the -nominated node name of pod C to Node 1. Pod D cannot be scheduled anywhere. C and -D are moved to unschedulable list. -* While waiting for the graceful termination of pods A and B, pod E terminates on Node 2. -* Termination of pod E brings C and D back to the scheduling queue and scheduler -finds Node 2 available for pod C. It schedules pod C on Node 2 (ignoring its -nominated node name). D cannot be scheduled. It goes back to unschedulable list. -* After 30 seconds (or less) pod B terminates. A scheduler pass is triggered -and scheduler schedules pod D on Node 1. -* **Important note:** This may make an observer think that scheduler preempted -pod B to schedule pod D which has a lower priority. Looking at the sequence of -events and the fact that pod D's nominated node name is not set to Node 1 may -help remove the confusion. - -##### Example 3 - - - -* Everything is similar to example 2, but pod E uses 8 units of resources. So, -2 units of resources are available on Node 2. -* Similar to example 2, scheduler preempts pods A and B and sets the future -node of pod C to Node 1. C is moved to unschedulable list. -* Scheduler looks at pod D in the queue. 
Pod D can fit on Node 2. -* Scheduler goes ahead and binds pod D to Node 2 while pods A and B are in -their graceful termination period and pod C is not bound yet. - -##### Example 4 - - - -* Everything is similar to example 1, but while scheduler is waiting for pods -A and B to gracefully terminate, a new higher priority pod F is created and goes -to the head of the queue. -* Scheduler evaluates the feasibility of pod F and determines that it can be -scheduled on Node 1. So, it sets the nominated node name of pod F to Node 1 and -places it in unschedulable list. -* Scheduler clears nominated node name of C and moves it to the scheduling queue. -* C is evaluated for scheduling, but it cannot be scheduled as pod F's nominated -node name is set to Node 1. -* When B terminates, scheduler brings F, C, and D back to the scheduling queue. -F is evaluated first. They cannot be scheduled. -* Eventually when pods A and B terminate, pod F is bound to Node 1 and pods C -and D remain unschedulable. - - -## Supporting PodDisruptionBudget - -Scheduler preemption will support PDB for Beta, but respecting PDB is not -guaranteed. Preemption will try to avoid violating PDB, but if it doesn't find -any lower priority pod to preempt without violating PDB, it goes ahead and -preempts victims despite violating PDB. This is to guarantee that higher priority -pods will always get precedence over lower priority pods in obtaining cluster resources. - -Here is what preemption will do: - -1. When choosing victims on any evaluated node, preemption logic will try to -reprieve pods whose PDBs are violated first. (In the alpha version, pods are -reprieved by their ascending priority and PDB is ignored.) -1. In scoring nodes and choosing one for preemption, the number of pods whose -PDBs are violated will be the most significant metric. So, a node with the lowest -number of victims whose PDBs are violated is the one chosen for preemption. In -the alpha version, most significant metric is the highest priority of victims. -If there are more than one node with the same smallest number of victims whose -PDBs are violated, lowest high priority victim will be used (as in alpha) and -the rest of the metrics remain the same as before. - -## Supporting Inter-Pod Affinity on lower priority Pods? - -The first step of preemption algorithm is to find whether a given Node (N) has -the potential to run the pending pod (P). In order to do so, preemption logic -simulates removal of all Pods with lower priority than P from N and then checks -whether P can be scheduled on N. If P still cannot be scheduled on N, then N is -considered infeasible. - -The problem in this approach is that if P has an inter-pod affinity to one of -those lower priority pods on N, then preemption logic determines N infeasible -for preemption, while N may be able to run both P and the other Pod(s) that P -has affinity to. - - -### Potential Solution - -In order to solve this problem, we propose the following algorithm. - -1. Preemption simulates removal of all lower priority pods from N. -1. It then tries scheduling P on N. -1. If P fails to schedule for any reason other than "pod affinity", N is infeasible for preemption. -1. If P fails to schedule because of "pod affinity", get the set of pods among -potential victims that match any of the affinity rules of P. -1. Find the permutation of pods that can satisfy affinity. -1. Reprieve each set of pod in the permutation and check whether P can be scheduled on N with these reprieved pods. -1. 
If found a set of pods that makes P schedulable, reprieve them first. -1. Perform the reprieval process as before for reprieving as many other pods as possible. - -**Considerations:** - -* Scheduler now has more detailed predicate failure reasons than what it had -in 1.8. So, in step 3, we can actually tell whether P is unschedulable due to -affinity, anti-affinity, or existing pod anti-affinity. Step 3 passes only if -the failure is due to pod affinity. -* If there are many pods that match one or more affinity rules of P (step 4) -their permutation may produce a large set. Trying them all in step 6 may cause -performance degradation. - - -### Decision - -Supporting inter-pod affinity on lower priority pods needs a fairly complex logic -which could degrade performance when there are many pods matching the pending -pod's affinity rules. We could have limited the maximum number of matching pods -supported in order to address the performance issue, but it would have been very -confusing to users and would have removed predictability of scheduling. Moreover, -inter-pod affinity is a way for users to define dependency among pods. Inter-pod -affinity to lower priority pods creates dependency on lower priority pods. Such -a dependency is probably not desired in most realistic scenarios. Given these -points, we decided not to implement this feature. - -## Supporting Cross Node Preemption? - -In certain scenarios, scheduling a pending pod (P) on a node (N1), requires -preemption of one or more pods on other nodes. An example of such scenarios is a -lower priority pod with anti-affinity to P running on a different node in the same zone and the -topology of the anti-affinity is zone. Another example is a lower priority pod -running on a different node than N1 and is consuming a non-local resource that P -needs. In all of such cases, preemption of one or more pods on nodes other than -N1 is required to make P schedulable on N1. Such a preemption is called "cross -node preemption". - -### Potential Solution - -When a pod P is not schedulable on a node N even after removal of all lower -priority pods from node N, there may be other pods on other nodes that are not -allowing it to schedule. Since scheduler preemption logic should not rely on -the internals of its predicate functions, it has to perform an exhaustive search -for other pods whose removal may allow P to be scheduled. Such an exhaustive -search will be prohibitively expensive in large clusters. - -### Decision - -Given that we do not have a solution with reasonable performance for supporting -cross node preemption, we have decided not to implement this feature. - -# Interactions with Cluster Autoscaler - -Preemption gives higher precedence to most important pods in the cluster and -tries to provide better availability of cluster resources for such pods. As a -result, we may not need to scale the cluster up for all pending pods. Particularly, -scaling up the cluster may not be necessary in two scenarios: - -1. The pending pod has already preempted pods and is going to run on a node soon. -1. The pending pod is very low priority and the owner of the cluster prefers to save -money by not scaling up the cluster for such a pod. - -In order to address these cases: -1. Cluster Autoscaler will not scale up the cluster for pods with -`scheduler.kubernetes.io/nominated-node-name` annotation. -1. Cluster Autoscaler ignores all the pods whose priority is below a certain value. 
-This value may be configured by a command line flag and will be zero by default. - -# Alternatives Considered - -## Rescheduler or Kubelet performs preemption - -There are two potential alternatives for the component that performs preemption: rescheduler and Kubelet. -Kubernetes has a "[rescheduler](https://kubernetes.io/docs/concepts/cluster-administration/guaranteed-scheduling-critical-addon-pods/)" that performs a rudimentary form of preemption today. The more sophisticated form of preemption that is proposed in this document would require many changes to the current rescheduler. The main drawbacks of using the rescheduler for preemption are - -- requires replicating the scheduler logic in another component. In particular, the rescheduler is responsible for choosing which node the pending pod should schedule onto, which requires it to know the predicate and priority functions. -- increases the race condition between pod preemption and pending pod scheduling. - -Another option is for the scheduler to send the pending pod to a node without doing any preemption, and relying on the kubelet to do the preemption(s). Similar to the rescheduler option, this option requires replicating the preemption and scheduling logic. Kubelet already has the logic to evict pods when a node is under resource pressure, but this logic is much simpler than the whole scheduling logic that considers various scheduling parameters, such as affinity, anti-affinity, PodDisruptionBudget, etc. That is why we believe the scheduler is the right component to perform preemption. - -## Preemption order - -An alternative to preemption by priority and breaking ties with QoS which was proposed earlier, is to preempt by QoS first and break ties with priority. We believe this could cause confusion for users and might reduce cluster utilization. Imagine the following scenario: - -- User runs a web server with a very high priority and is willing to give as much resources as possible to this web server. So, the users chooses a reasonable "requests" for resources and does not set any "limits" to let the web server use as much resources as it needs. - -If scheduler uses QoS as the first metric for preemption, the web server will be preempted by lower priority "Guaranteed" pods. This can be counter intuitive to users, as they probably don't expect a lower priority pod to preempt a higher priority one. -To solve the problem, the user might try running his web server as Guaranteed, but in that case, the user might have to set much higher "requests" than the web server normally uses. This would prevent other Guaranteed pods from being scheduled on the node running the web server and therefore, would lower resource utilization of the node. - -# References - -- [Controlled Rescheduling in Kubernetes](/contributors/design-proposals/scheduling/rescheduling.md) -- [Resource sharing architecture for batch and serving workloads in Kubernetes](https://docs.google.com/document/d/1-H2hnZap7gQivcSU-9j4ZrJ8wE_WwcfOkTeAGjzUyLA) -- [Design proposal for adding priority to Kubernetes API](https://github.com/kubernetes/community/pull/604/files) +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
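As a rough illustration of the data structure change described above (the pending queue becoming a priority queue whose head is always the highest priority pending pod), the following sketch builds such a queue on Go's `container/heap`. It is illustrative only and not the scheduler's implementation; `queuedPod` and `pendingQueue` are hypothetical names, and a real queue would also need a secondary key (for example, arrival time) to keep FIFO order among pods of equal priority, plus the unschedulable list and NominatedNodeName bookkeeping described earlier.

```go
package main

import (
	"container/heap"
	"fmt"
)

// queuedPod is a toy stand-in for a pending pod; only the fields needed to
// order the queue are shown.
type queuedPod struct {
	Name     string
	Priority int32
}

// pendingQueue implements heap.Interface so that the highest-priority pod is
// always at the head, replacing the previous FIFO behavior.
type pendingQueue []queuedPod

func (q pendingQueue) Len() int            { return len(q) }
func (q pendingQueue) Less(i, j int) bool  { return q[i].Priority > q[j].Priority }
func (q pendingQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *pendingQueue) Push(x interface{}) { *q = append(*q, x.(queuedPod)) }
func (q *pendingQueue) Pop() interface{} {
	old := *q
	n := len(old)
	item := old[n-1]
	*q = old[:n-1]
	return item
}

func main() {
	q := &pendingQueue{}
	heap.Init(q)
	heap.Push(q, queuedPod{Name: "batch-job", Priority: 100})
	heap.Push(q, queuedPod{Name: "web-frontend", Priority: 1000})
	heap.Push(q, queuedPod{Name: "best-effort", Priority: 0})

	// Pods are popped in decreasing priority order:
	// web-frontend, batch-job, best-effort.
	for q.Len() > 0 {
		fmt.Println(heap.Pop(q).(queuedPod).Name)
	}
}
```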
\ No newline at end of file diff --git a/contributors/design-proposals/scheduling/pod-priority-api.md b/contributors/design-proposals/scheduling/pod-priority-api.md index 28cd414a..f0fbec72 100644 --- a/contributors/design-proposals/scheduling/pod-priority-api.md +++ b/contributors/design-proposals/scheduling/pod-priority-api.md @@ -1,243 +1,6 @@ -# Priority in Kubernetes API +Design proposals have been archived. -@bsalamat +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -May 2017 - * [Objective](#objective) - * [Non-Goals](#non-goals) - * [Background](#background) - * [Overview](#overview) - * [Detailed Design](#detailed-design) - * [Effect of priority on scheduling](#effect-of-priority-on-scheduling) - * [Effect of priority on preemption](#effect-of-priority-on-preemption) - * [Priority in PodSpec](#priority-in-podspec) - * [Priority Classes](#priority-classes) - * [Resolving priority class names](#resolving-priority-class-names) - * [Ordering of priorities](#ordering-of-priorities) - * [System Priority Class Names](#system-priority-class-names) - * [Modifying Priority Classes](#modifying-priority-classes) - * [Drawbacks of changing priority names](#drawbacks-of-changing-priority-classes) - * [Priority and QoS classes](#priority-and-qos-classes) - -## Objective - - - -* How to specify priority for workloads in Kubernetes API. -* Define how the order of these priorities are specified. -* Define how new priority levels are added. -* Effect of priority on scheduling and preemption. - -### Non-Goals - - - -* How preemption works in Kubernetes. -* How quota allocation and accounting works for each priority. - -## Background - -It is fairly common in clusters to have more tasks than what the cluster -resources can handle. Often times the workload is a mix of high priority -critical tasks, and non-urgent tasks that can wait. Cluster management should be -able to distinguish these workloads in order to decide which ones should acquire -the resources sooner and which ones can wait. Priority of the workload is one of -the key metrics that provides the information to the cluster. This document is a -more detailed design proposal for part of the high-level architecture described -in [Resource sharing architecture for batch and serving workloads in Kubernetes](https://docs.google.com/document/d/1-H2hnZap7gQivcSU-9j4ZrJ8wE_WwcfOkTeAGjzUyLA). - -## Overview - -This design doc introduces the concept of priorities for pods in Kubernetes and -how the priority impacts scheduling and preemption of pods when the cluster -runs out of resources. A pod can specify a priority at the creation time. The -priority must be one of the valid values and there is a total order on the -values. The priority of a pod is independent of its workload type. The priority -is global and not specific to a particular namespace. - -## Detailed Design - -### Effect of priority on scheduling - -One could generally expect a pod with higher priority has a higher chance of -getting scheduled than the same pod with lower priority. However, there are -many other parameters that affect scheduling decisions. So, a high priority pod -may or may not be scheduled before lower priority pods. The details of -what determines the order at which pods are scheduled are beyond the scope of -this document. 
- -### Effect of priority on preemption - -Generally, lower priority pods are more likely to get preempted by higher -priority pods when cluster has reached a threshold. In such a case, scheduler -may decide to preempt lower priority pods to release enough resources for higher -priority pending pods. As mentioned before, there are many other parameters -that affect scheduling decisions, such as affinity and anti-affinity. If -scheduler determines that a high priority pod cannot be scheduled even if lower -priority pods are preempted, it will not preempt lower priority pods. Scheduler -may have other restrictions on preempting pods, for example, it may refuse to -preempt a pod if PodDisruptionBudget is violated. The details of scheduling and -preemption decisions are beyond the scope of this document. - -### Priority in PodSpec - -Pods may have priority in their pod spec. PodSpec will have two new fields -called "PriorityClassName" which is specified by user, and "Priority" which will -be populated by Kubernetes. User-specified priority (PriorityClassName) is a -string and all of the valid priority classes are defined by a system wide -mapping that maps each string to an integer. The PriorityClassName specified in -a pod spec must be found in this map or the pod creation request will be -rejected. If PriorityClassName is empty, it will resolve to the default -priority (See below for more info on name resolution). Once the -PriorityClassName is resolved to an integer, it is placed in "Priority" field of -PodSpec. - - -``` -type PodSpec struct { - ... - PriorityClassName string - Priority *int32 // Populated by Admission Controller. Users are not allowed to set it directly. -} -``` - -### Priority Classes - -The cluster may have many user defined priority classes for -various use cases. The following list is an example of how the priorities and -their values may look like. -Kubernetes will also have special priority class names reserved for critical system -pods. Please see [System Priority Class Names](#system-priority-class-names) for -more information. Any priority value above 1 billion is reserved for system use. -Aside from those system priority classes, Kubernetes is not shipped with predefined -priority classes usable by user pods. The main goal of having no built-in -priority classes for user pods is to avoid creating defacto standard names which -may be hard to change in the future. - -``` -system 2147483647 (int_max) -tier1 4000 -tier2 2000 -tier3 1000 -``` - -The following shows a list of example workloads in a Kubernetes cluster in decreasing order of priority: - -* Kubernetes system daemons (per-node like fluentd, and cluster-level like - Heapster) -* Critical user infrastructure (e.g. storage servers, monitoring system like - Prometheus, etc.) -* Components that are in the user-facing request serving path and must be able - to scale up arbitrarily in response to load spikes (web servers, middleware, - etc.) -* Important interruptible workloads that need strong guarantee of - schedulability and of not being interrupted -* Less important interruptible workloads that need a less strong guarantee of - schedulability and of not being interrupted -* Best effort / opportunistic - -### Resolving priority class names - -User requests sent to Kubernetes may have `PriorityClassName` in their PodSpec. -Admission controller resolves a PriorityClassName to its corresponding number -and populates the "Priority" field of the pod spec. 
The rest of Kubernetes -components look at the "Priority" field of pod status and work with the integer -value. In other words, `PriorityClassName` will be ignored by the rest of the -system. - -We are going to add a new API object called PriorityClass. The priority class -defines the mapping between the priority name and its value. It can have an -optional description. It is an arbitrary string and is provided -only as a guideline for users. - -A priority class can be marked as "Global Default" by setting its -`GlobalDefault` field to true. If a pod does not specify any `PriorityClassName`, -the system resolves it to the value of the global default priority class if -exists. If there is no global default, the pod's priority will be resolved to -zero. Priority admission controller ensures that there is only one global -default priority class. - -``` -type PriorityClass struct { - metav1.TypeMeta - // +optional - metav1.ObjectMeta - - // The value of this priority class. This is the actual priority that pods - // receive when they have the above name in their pod spec. - Value int32 - GlobalDefault bool - Description string -} -``` - -### Ordering of priorities - -As mentioned earlier, a PriorityClassName is resolved by the admission controller to -its integral value and Kubernetes components use the integral value. The higher -the value, the higher the priority. - -### System Priority Class Names -There will be special priority class names reserved for system use only. These -classes have a value larger than one billion. -Priority admission controller ensures that new priority classes will be not -created with those names. They are used for critical system pods that must not -be preempted. We set default policies that deny creation of pods with -PriorityClassNames corresponding to these priorities. Cluster admins can -authorize users or service accounts to create pods with these priorities. When -non-authorized users set PriorityClassName to one of these priority classes in -their pod spec, their pod creation request will be rejected. For pods created by -controllers, the service account must be authorized by cluster admins. - -### Modifying priority classes - -Priority classes can be added or removed, but their name and value cannot be -updated. We allow updating `GlobalDefault` and `Description` as long as there is -a maximum of one global default. While -Kubernetes can work fine if priority classes are changed at run-time, the change -can be confusing to users as pods with a priority class which were created -before the change will have a different priority value than those created after -the change. Deletion of priority classes is allowed, despite the fact that there -may be existing pods that have specified such priority class names in their pod -spec. In other words, there will be no referential integrity for priority -classes. This is another reason that all system components should only work with -the integer value of the priority and not with the `PriorityClassName`. - -One could delete an existing priority class and create another one with the same -name and a different value. By doing so, they can achieve the same effect as -updating a priority class, but we still do not allow updating priority classes -to prevent accidental changes. - -Newly added priority classes cannot have a value higher than what is reserved -for "system". 
The reason for this restriction -is that Kubernetes critical system pods will have one of the "system" priorities -and no pod should be able to preempt them. - -#### Drawbacks of changing priority classes - -While Kubernetes effectively allows changing priority classes (by deleting and -adding them with a different value), it should be done only when -absolutely needed. Changing priority classes has the following disadvantages: - - -* May remove config portability: pod specs written for one cluster are no - longer guaranteed to work on a different cluster if the same priority classes - do not exist in the second cluster. -* If quota is specified for existing priority classes (at the time of this writing, - we don't have this feature in Kubernetes), adding or deleting priority classes - will require reconfiguration of quota allocations. -* An existing pod may have an integer value of priority that does not reflect - the current value of its PriorityClass. - -### Priority and QoS classes - -Kubernetes has [three QoS -classes](/contributors/design-proposals/node/resource-qos.md#qos-classes) -which are derived from the requests and limits of pods. Priority is introduced as an -independent concept, meaning that any QoS class may have any valid priority. -When a node is out of resources and pods need to be preempted, we give -priority a higher weight over QoS classes. In other words, we preempt the lowest -priority pod and break ties with some other metrics, such as QoS class, usage -above request, etc. This is not finalized yet. We will discuss and finalize -preemption in a separate doc. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
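As a rough sketch of the name resolution behavior described in "Resolving priority class names" above: the admission step looks up the class by name, falls back to the global default when the name is empty, and otherwise resolves to zero. The code below is illustrative only; `resolvePriority` and `classesByName` are hypothetical names, and the real admission controller additionally enforces the system-reserved range and the single global default.

```go
package main

import "fmt"

// PriorityClass mirrors the API object proposed above, trimmed to the fields
// needed for name resolution.
type PriorityClass struct {
	Name          string
	Value         int32
	GlobalDefault bool
}

// resolvePriority returns the integer priority for a pod's PriorityClassName:
// the named class's value, otherwise the global default, otherwise zero.
// A non-empty name that does not resolve to a known class is rejected.
func resolvePriority(className string, classesByName map[string]PriorityClass) (int32, error) {
	if className != "" {
		c, ok := classesByName[className]
		if !ok {
			return 0, fmt.Errorf("no PriorityClass named %q", className)
		}
		return c.Value, nil
	}
	for _, c := range classesByName {
		if c.GlobalDefault {
			return c.Value, nil
		}
	}
	return 0, nil
}

func main() {
	classes := map[string]PriorityClass{
		"tier1": {Name: "tier1", Value: 4000},
		"tier2": {Name: "tier2", Value: 2000, GlobalDefault: true},
	}
	for _, name := range []string{"tier1", "", "does-not-exist"} {
		p, err := resolvePriority(name, classes)
		fmt.Println(name, p, err)
	}
}
```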
\ No newline at end of file diff --git a/contributors/design-proposals/scheduling/pod-priority-resourcequota.md b/contributors/design-proposals/scheduling/pod-priority-resourcequota.md index a638a825..f0fbec72 100644 --- a/contributors/design-proposals/scheduling/pod-priority-resourcequota.md +++ b/contributors/design-proposals/scheduling/pod-priority-resourcequota.md @@ -1,254 +1,6 @@ -# Priority in ResourceQuota +Design proposals have been archived. -Authors: +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Harry Zhang [@resouer](https://github.com/resouer) -Vikas Choudhary [@vikaschoudhary16](https://github.com/vikaschoudhary16) -Main Reviewers: - -Bobby [@bsalamat](https://github.com/bsalamat) -Derek [@derekwaynecarr](https://github.com/derekwaynecarr) - -Dec 2017 - - * [Objective](#objective) - * [Non-Goals](#non-goals) - * [Background](#background) - * [Overview](#overview) - * [Detailed Design](#detailed-design) - * [Changes in ResourceQuota](#changes-in-resourceQuota) - * [Changes in Admission Controller configuration](#changes-in-admission-controller-configuration) - * [Expected behavior of ResourceQuota admission controller and Quota system](#expected-behavior-of-resourcequota-admission-controller-and-resourcequota-system) - * [Backward Compatibility](#backward-compatibility) - * [Sample user story 1](#sample-user-story-1) - * [Sample user story 2](#sample-user-story-2) - - -## Objective - -This feature is designed to make `ResourceQuota` become priority aware, several sub-tasks are included. - -1. Expand `Scopes` in `ResourceQuotaSpec` to represent priority class names and corresponding behavior. -2. Incorporate corresponding behavior in quota checking process. -3. Update the `ResourceQuota` admission controller to check priority class name and perform expected admission. - -### Non-Goals - -* Add priority in Pod spec (this is implemented separately in: [45610](https://github.com/kubernetes/kubernetes/pull/45610)) - -## Background - -Since we already have [priority field in Pod spec](https://github.com/kubernetes/kubernetes/pull/45610), -Pods can now be classified into different priority classes. We would like to be able to create quota for various priority classes in order to manage cluster resources better and limit abuse scenarios. - -One approach to implement this is by adding priority class name field to `ResourceQuota` API definition. While this arbitrary field of API object will introduce inflexibility to potential change in future and also not adequate to express all semantics. - -Thus, we decide to reuse the existing `Scopes` of `ResourceQuotaSpec` to provide a richer semantics for quota to cooperate with priority classes. - -## Overview - -This design doc introduces how to define a priority class scope and scope selectors for the quota to match with and explains how quota enforcement logic is changed to apply the quota to pods with the given priority classes. - -## Detailed Design - -### Changes in ResourceQuota - -ResourceQuotaSpec contains an array of filters, `Scopes`, that if mentioned, must match each object tracked by a ResourceQuota. - -A new field `scopeSelector` will be introduced. -```go -// ResourceQuotaSpec defines the desired hard limits to enforce for Quota -type ResourceQuotaSpec struct { - ... - - // A collection of filters that must match each object tracked by a quota. - // If not specified, the quota matches all objects. 
- // +optional - Scopes []ResourceQuotaScope - // ScopeSelector is also a collection of filters like Scopes that must match each object tracked by a quota - // but expressed using ScopeSelectorOperator in combination with possible values. - // +optional - ScopeSelector *ScopeSelector -} - -// A scope selector represents the AND of the selectors represented -// by the scoped-resource selector terms. -type ScopeSelector struct { - // A list of scope selector requirements by scope of the resources. - // +optional - MatchExpressions []ScopedResourceSelectorRequirement -} - -// A scoped-resource selector requirement is a selector that contains values, a scope name, and an operator -// that relates the scope name and values. -type ScopedResourceSelectorRequirement struct { - // The name of the scope that the selector applies to. - ScopeName ResourceQuotaScope - // Represents a scope's relationship to a set of values. - // Valid operators are In, NotIn, Exists, DoesNotExist. - Operator ScopeSelectorOperator - // An array of string values. If the operator is In or NotIn, - // the values array must be non-empty. If the operator is Exists or DoesNotExist, - // the values array must be empty. - // This array is replaced during a strategic merge patch. - // +optional - Values []string -} - -// A scope selector operator is the set of operators that can be used in -// a scope selector requirement. -type ScopeSelectorOperator string - -const ( - ScopeSelectorOpIn ScopeSelectorOperator = "In" - ScopeSelectorOpNotIn ScopeSelectorOperator = "NotIn" - ScopeSelectorOpExists ScopeSelectorOperator = "Exists" - ScopeSelectorOpDoesNotExist ScopeSelectorOperator = "DoesNotExist" -) -``` -A new `ResourceQuotaScope` will be defined for matching pods based on priority class names. - -```go -// A ResourceQuotaScope defines a filter that must match each object tracked by a quota -type ResourceQuotaScope string - -const ( - ... - ResourceQuotaScopePriorityClass ResourceQuotaScope = "PriorityClass" -) -``` - -### Changes in Admission Controller Configuration - -A new field `MatchScopes` will be added to `Configuration.LimitedResource`. `MatchScopes` will be a collection of one or more of the four newly added priority class based `Scopes` that are explained in above section. - -```go -// Configuration provides configuration for the ResourceQuota admission controller. -type Configuration struct { - ... - LimitedResources []LimitedResource -} - -// LimitedResource matches a resource whose consumption is limited by default. -// To consume the resource, there must exist an associated quota that limits -// its consumption. -type LimitedResource struct { - ... - // For each intercepted request, the quota system will figure out if the input object - // satisfies a scope which is present in this listing, then - // quota system will ensure that there is a covering quota. In the - // absence of a covering quota, the quota system will deny the request. - // For example, if an administrator wants to globally enforce that - // a quota must exist to create a pod with "cluster-services" priorityclass - // the list would include "scopeName=PriorityClass, Operator=In, Value=cluster-services" - // +optional - MatchScopes []v1.ScopedResourceSelectorRequirement `json:"matchScopes,omitempty"` -} -``` - -### Expected Behavior of ResourceQuota Admission Controller and ResourceQuota System -`MatchScopes` will be configured in admission controller configuration to apply quota based on priority class names. 
If `MatchScopes` matches/selects an incoming pod request, request will be **denied if a Covering Quota is missing**. The meaning of Covering Quota is: any quota which has priority class based `Scopes` that matches/selects the pod in the request. - -Please note that this priority class based criteria will be an **additional** criteria that must be satisfied by covering quota. - -For more details, please refer to the `Sample user story` sections at the end of this doc. - -#### Backward Compatibility - -If a Pod's requested resources are not matched by any of the filters in admission controller configuration's `MatchScopes`, overall behavior for the pod will be same as it is today where `ResourceQuota` has no awareness of priority. In such a case, request will be allowed if no covering `ResourceQuota` is found. - -Couple of other noteworthy details: -1. If multiple `ResourceQuota` apply to a Pod, the pod must satisfy all of them. -2. We do not enforce referential integrity across objects. i.e. Creation or updating of ResourceQuota object, scopes of which names a PriorityClass that does not exist, are allowed. - -This design also tries to enable flexibility for its configuration. Here are several sample user stories. - -#### Sample User Story 1 -**As a cluster admin, I want `cluster-services` priority only apply to `kube-system` namespace , so that I can ensure those critical daemons on each node while normal user's workloads will not disrupt that ability.** - -To enforce above policy: -1. Admin will create admission controller configuration as below: -```yaml -apiVersion: apiserver.k8s.io/v1alpha1 -kind: AdmissionConfiguration -plugins: -- name: "ResourceQuota" - configuration: - apiVersion: resourcequota.admission.k8s.io/v1alpha1 - kind: Configuration - limitedResources: - - resource: pods - matchScopes: - - scopeName: PriorityClass - operator: In - values: ["cluster-services"] -``` - -2. Admin will then create a corresponding resource quota object in `kube-system` namespace: - -```shell -$ cat ./quota.yml -- apiVersion: v1 - kind: ResourceQuota - metadata: - name: pods-cluster-services - spec: - hard: - pods: "10" - scopeSelector: - matchExpressions: - - operator : In - scopeName: PriorityClass - values: ["cluster-services"] - -$ kubectl create -f ./quota.yml -n kube-system` -``` - -In this case, a pod creation will be allowed if: -1. Pod has no priority class and created in any namespace. -2. Pod has priority class other than `cluster-services` and created in any namespace. -3. Pod has priority class `cluster-services` and created in `kube-system` namespace, and passed resource quota check. - -Pod creation will be rejected if pod has priority class `cluster-services` and created in namespace other than `kube-system` - - -#### Sample User Story 2 -**As a cluster admin, I want a specific resource quota apply to any pod which has priority been set** - -To enforce above policy: -1. Create admission controller configuration: -```yaml -apiVersion: apiserver.k8s.io/v1alpha1 -kind: AdmissionConfiguration -plugins: -- name: "ResourceQuota" - configuration: - apiVersion: resourcequota.admission.k8s.io/v1alpha1 - kind: Configuration - limitedResources: - - resource: pods - matchScopes: - - operator : Exists - scopeName: PriorityClass -``` - -2. 
Create a resource quota that matches all pods that have a priority class set - -```shell -$ cat ./quota.yml -- apiVersion: v1 - kind: ResourceQuota - metadata: - name: pods-cluster-services - spec: - hard: - pods: "10" - scopeSelector: - matchExpressions: - - operator : Exists - scopeName: PriorityClass - -$ kubectl create -f ./quota.yml -n kube-system -``` +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
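To make the `ScopeSelectorOperator` semantics above concrete, the sketch below evaluates a single PriorityClass scope requirement against a pod's priority class name. It is illustrative only; `podMatchesPriorityScope` is a hypothetical helper rather than quota system code, and the exact treatment of pods with no priority class under `In`/`NotIn` is a detail of the real implementation.

```go
package main

import "fmt"

type ScopeSelectorOperator string

const (
	ScopeSelectorOpIn           ScopeSelectorOperator = "In"
	ScopeSelectorOpNotIn        ScopeSelectorOperator = "NotIn"
	ScopeSelectorOpExists       ScopeSelectorOperator = "Exists"
	ScopeSelectorOpDoesNotExist ScopeSelectorOperator = "DoesNotExist"
)

// podMatchesPriorityScope reports whether a pod with the given priority class
// name satisfies one PriorityClass scope requirement. An empty className means
// the pod has no priority class set. (In this sketch, In/NotIn require a class
// to be set; how unset names are treated is a detail of the real quota system.)
func podMatchesPriorityScope(className string, op ScopeSelectorOperator, values []string) bool {
	inValues := false
	for _, v := range values {
		if v == className {
			inValues = true
			break
		}
	}
	switch op {
	case ScopeSelectorOpIn:
		return className != "" && inValues
	case ScopeSelectorOpNotIn:
		return className != "" && !inValues
	case ScopeSelectorOpExists:
		return className != ""
	case ScopeSelectorOpDoesNotExist:
		return className == ""
	}
	return false
}

func main() {
	fmt.Println(podMatchesPriorityScope("cluster-services", ScopeSelectorOpIn, []string{"cluster-services"})) // true
	fmt.Println(podMatchesPriorityScope("", ScopeSelectorOpExists, nil))                                      // false
	fmt.Println(podMatchesPriorityScope("tier1", ScopeSelectorOpNotIn, []string{"cluster-services"}))         // true
}
```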
\ No newline at end of file diff --git a/contributors/design-proposals/scheduling/podaffinity.md b/contributors/design-proposals/scheduling/podaffinity.md index 89752150..f0fbec72 100644 --- a/contributors/design-proposals/scheduling/podaffinity.md +++ b/contributors/design-proposals/scheduling/podaffinity.md @@ -1,667 +1,6 @@ -# Inter-pod topological affinity and anti-affinity +Design proposals have been archived. -## Introduction +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -NOTE: It is useful to read about [node affinity](nodeaffinity.md) first. - -This document describes a proposal for specifying and implementing inter-pod -topological affinity and anti-affinity. By that we mean: rules that specify that -certain pods should be placed in the same topological domain (e.g. same node, -same rack, same zone, same power domain, etc.) as some other pods, or, -conversely, should *not* be placed in the same topological domain as some other -pods. - -Here are a few example rules; we explain how to express them using the API -described in this doc later, in the section "Examples." -* Affinity - * Co-locate the pods from a particular service or Job in the same availability -zone, without specifying which zone that should be. - * Co-locate the pods from service S1 with pods from service S2 because S1 uses -S2 and thus it is useful to minimize the network latency between them. -Co-location might mean same nodes and/or same availability zone. -* Anti-affinity - * Spread the pods of a service across nodes and/or availability zones, e.g. to -reduce correlated failures. - * Give a pod "exclusive" access to a node to guarantee resource isolation -- -it must never share the node with other pods. - * Don't schedule the pods of a particular service on the same nodes as pods of -another service that are known to interfere with the performance of the pods of -the first service. - -For both affinity and anti-affinity, there are three variants. Two variants have -the property of requiring the affinity/anti-affinity to be satisfied for the pod -to be allowed to schedule onto a node; the difference between them is that if -the condition ceases to be met later on at runtime, for one of them the system -will try to eventually evict the pod, while for the other the system may not try -to do so. The third variant simply provides scheduling-time *hints* that the -scheduler will try to satisfy but may not be able to. These three variants are -directly analogous to the three variants of [node affinity](nodeaffinity.md). - -Note that this proposal is only about *inter-pod* topological affinity and -anti-affinity. There are other forms of topological affinity and anti-affinity. -For example, you can use [node affinity](nodeaffinity.md) to require (prefer) -that a set of pods all be scheduled in some specific zone Z. Node affinity is -not capable of expressing inter-pod dependencies, and conversely the API we -describe in this document is not capable of expressing node affinity rules. For -simplicity, we will use the terms "affinity" and "anti-affinity" to mean -"inter-pod topological affinity" and "inter-pod topological anti-affinity," -respectively, in the remainder of this document. 
- -## API - -We will add one field to `PodSpec` - -```go -Affinity *Affinity `json:"affinity,omitempty"` -``` - -The `Affinity` type is defined as follows - -```go -type Affinity struct { - PodAffinity *PodAffinity `json:"podAffinity,omitempty"` - PodAntiAffinity *PodAntiAffinity `json:"podAntiAffinity,omitempty"` -} - -type PodAffinity struct { - // If the affinity requirements specified by this field are not met at - // scheduling time, the pod will not be scheduled onto the node. - // If the affinity requirements specified by this field cease to be met - // at some point during pod execution (e.g. due to a pod label update), the - // system will try to eventually evict the pod from its node. - // When there are multiple elements, the lists of nodes corresponding to each - // PodAffinityTerm are intersected, i.e. all terms must be satisfied. - RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"` - // If the affinity requirements specified by this field are not met at - // scheduling time, the pod will not be scheduled onto the node. - // If the affinity requirements specified by this field cease to be met - // at some point during pod execution (e.g. due to a pod label update), the - // system may or may not try to eventually evict the pod from its node. - // When there are multiple elements, the lists of nodes corresponding to each - // PodAffinityTerm are intersected, i.e. all terms must be satisfied. - RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"` - // The scheduler will prefer to schedule pods to nodes that satisfy - // the affinity expressions specified by this field, but it may choose - // a node that violates one or more of the expressions. The node that is - // most preferred is the one with the greatest sum of weights, i.e. - // for each node that meets all of the scheduling requirements (resource - // request, RequiredDuringScheduling affinity expressions, etc.), - // compute a sum by iterating through the elements of this field and adding - // "weight" to the sum if the node matches the corresponding MatchExpressions; the - // node(s) with the highest sum are the most preferred. - PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"` -} - -type PodAntiAffinity struct { - // If the anti-affinity requirements specified by this field are not met at - // scheduling time, the pod will not be scheduled onto the node. - // If the anti-affinity requirements specified by this field cease to be met - // at some point during pod execution (e.g. due to a pod label update), the - // system will try to eventually evict the pod from its node. - // When there are multiple elements, the lists of nodes corresponding to each - // PodAffinityTerm are intersected, i.e. all terms must be satisfied. - RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"` - // If the anti-affinity requirements specified by this field are not met at - // scheduling time, the pod will not be scheduled onto the node. - // If the anti-affinity requirements specified by this field cease to be met - // at some point during pod execution (e.g. due to a pod label update), the - // system may or may not try to eventually evict the pod from its node. 
- // When there are multiple elements, the lists of nodes corresponding to each - // PodAffinityTerm are intersected, i.e. all terms must be satisfied. - RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"` - // The scheduler will prefer to schedule pods to nodes that satisfy - // the anti-affinity expressions specified by this field, but it may choose - // a node that violates one or more of the expressions. The node that is - // most preferred is the one with the greatest sum of weights, i.e. - // for each node that meets all of the scheduling requirements (resource - // request, RequiredDuringScheduling anti-affinity expressions, etc.), - // compute a sum by iterating through the elements of this field and adding - // "weight" to the sum if the node matches the corresponding MatchExpressions; the - // node(s) with the highest sum are the most preferred. - PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"` -} - -type WeightedPodAffinityTerm struct { - // weight is in the range 1-100 - Weight int `json:"weight"` - PodAffinityTerm PodAffinityTerm `json:"podAffinityTerm"` -} - -type PodAffinityTerm struct { - LabelSelector *LabelSelector `json:"labelSelector,omitempty"` - // namespaces specifies which namespaces the LabelSelector applies to (matches against); - // nil list means "this pod's namespace," empty list means "all namespaces" - // The json tag here is not "omitempty" since we need to distinguish nil and empty. - // See https://golang.org/pkg/encoding/json/#Marshal for more details. - Namespaces []api.Namespace `json:"namespaces,omitempty"` - // empty topology key is interpreted by the scheduler as "all topologies" - TopologyKey string `json:"topologyKey,omitempty"` -} -``` - -Note that the `Namespaces` field is necessary because normal `LabelSelector` is -scoped to the pod's namespace, but we need to be able to match against all pods -globally. - -To explain how this API works, let's say that the `PodSpec` of a pod `P` has an -`Affinity` that is configured as follows (note that we've omitted and collapsed -some fields for simplicity, but this should sufficiently convey the intent of -the design): - -```go -PodAffinity { - RequiredDuringScheduling: {{LabelSelector: P1, TopologyKey: "node"}}, - PreferredDuringScheduling: {{LabelSelector: P2, TopologyKey: "zone"}}, -} -PodAntiAffinity { - RequiredDuringScheduling: {{LabelSelector: P3, TopologyKey: "rack"}}, - PreferredDuringScheduling: {{LabelSelector: P4, TopologyKey: "power"}} -} -``` - -Then when scheduling pod P, the scheduler: -* Can only schedule P onto nodes that are running pods that satisfy `P1`. -(Assumes all nodes have a label with key `node` and value specifying their node -name.) -* Should try to schedule P onto zones that are running pods that satisfy `P2`. -(Assumes all nodes have a label with key `zone` and value specifying their -zone.) -* Cannot schedule P onto any racks that are running pods that satisfy `P3`. -(Assumes all nodes have a label with key `rack` and value specifying their rack -name.) -* Should try not to schedule P onto any power domains that are running pods that -satisfy `P4`. (Assumes all nodes have a label with key `power` and value -specifying their power domain.) - -When `RequiredDuringScheduling` has multiple elements, the requirements are -ANDed. 
For `PreferredDuringScheduling` the weights are added for the terms that -are satisfied for each node, and the node(s) with the highest weight(s) are the -most preferred. - -In reality there are two variants of `RequiredDuringScheduling`: one suffixed -with `RequiredDuringExecution` and one suffixed with `IgnoredDuringExecution`. -For the first variant, if the affinity/anti-affinity ceases to be met at some -point during pod execution (e.g. due to a pod label update), the system will try -to eventually evict the pod from its node. In the second variant, the system may -or may not try to eventually evict the pod from its node. - -## A comment on symmetry - -One thing that makes affinity and anti-affinity tricky is symmetry. - -Imagine a cluster that is running pods from two services, S1 and S2. Imagine -that the pods of S1 have a RequiredDuringScheduling anti-affinity rule "do not -run me on nodes that are running pods from S2." It is not sufficient just to -check that there are no S2 pods on a node when you are scheduling a S1 pod. You -also need to ensure that there are no S1 pods on a node when you are scheduling -a S2 pod, *even though the S2 pod does not have any anti-affinity rules*. -Otherwise if an S1 pod schedules before an S2 pod, the S1 pod's -RequiredDuringScheduling anti-affinity rule can be violated by a later-arriving -S2 pod. More specifically, if S1 has the aforementioned RequiredDuringScheduling -anti-affinity rule, then: -* if a node is empty, you can schedule S1 or S2 onto the node -* if a node is running S1 (S2), you cannot schedule S2 (S1) onto the node - -Note that while RequiredDuringScheduling anti-affinity is symmetric, -RequiredDuringScheduling affinity is *not* symmetric. That is, if the pods of S1 -have a RequiredDuringScheduling affinity rule "run me on nodes that are running -pods from S2," it is not required that there be S1 pods on a node in order to -schedule a S2 pod onto that node. More specifically, if S1 has the -aforementioned RequiredDuringScheduling affinity rule, then: -* if a node is empty, you can schedule S2 onto the node -* if a node is empty, you cannot schedule S1 onto the node -* if a node is running S2, you can schedule S1 onto the node -* if a node is running S1+S2 and S1 terminates, S2 continues running -* if a node is running S1+S2 and S2 terminates, the system terminates S1 -(eventually) - -However, although RequiredDuringScheduling affinity is not symmetric, there is -an implicit PreferredDuringScheduling affinity rule corresponding to every -RequiredDuringScheduling affinity rule: if the pods of S1 have a -RequiredDuringScheduling affinity rule "run me on nodes that are running pods -from S2" then it is not required that there be S1 pods on a node in order to -schedule a S2 pod onto that node, but it would be better if there are. - -PreferredDuringScheduling is symmetric. If the pods of S1 had a -PreferredDuringScheduling anti-affinity rule "try not to run me on nodes that -are running pods from S2" then we would prefer to keep a S1 pod that we are -scheduling off of nodes that are running S2 pods, and also to keep a S2 pod that -we are scheduling off of nodes that are running S1 pods. Likewise if the pods of -S1 had a PreferredDuringScheduling affinity rule "try to run me on nodes that -are running pods from S2" then we would prefer to place a S1 pod that we are -scheduling onto a node that is running a S2 pod, and also to place a S2 pod that -we are scheduling onto a node that is running a S1 pod. 
- -## Examples - -Here are some examples of how you would express various affinity and -anti-affinity rules using the API we described. - -### Affinity - -In the examples below, the word "put" is intentionally ambiguous; the rules are -the same whether "put" means "must put" (RequiredDuringScheduling) or "try to -put" (PreferredDuringScheduling)--all that changes is which field the rule goes -into. Also, we only discuss scheduling-time, and ignore the execution-time. -Finally, some of the examples use "zone" and some use "node," just to make the -examples more interesting; any of the examples with "zone" will also work for -"node" if you change the `TopologyKey`, and vice-versa. - -* **Put the pod in zone Z**: -Tricked you! It is not possible express this using the API described here. For -this you should use node affinity. - -* **Put the pod in a zone that is running at least one pod from service S**: -`{LabelSelector: <selector that matches S's pods>, TopologyKey: "zone"}` - -* **Put the pod on a node that is already running a pod that requires a license -for software package P**: Assuming pods that require a license for software -package P have a label `{key=license, value=P}`: -`{LabelSelector: "license" In "P", TopologyKey: "node"}` - -* **Put this pod in the same zone as other pods from its same service**: -Assuming pods from this pod's service have some label `{key=service, value=S}`: -`{LabelSelector: "service" In "S", TopologyKey: "zone"}` - -This last example illustrates a small issue with this API when it is used with a -scheduler that processes the pending queue one pod at a time, like the current -Kubernetes scheduler. The RequiredDuringScheduling rule -`{LabelSelector: "service" In "S", TopologyKey: "zone"}` -only "works" once one pod from service S has been scheduled. But if all pods in -service S have this RequiredDuringScheduling rule in their PodSpec, then the -RequiredDuringScheduling rule will block the first pod of the service from ever -scheduling, since it is only allowed to run in a zone with another pod from the -same service. And of course that means none of the pods of the service will be -able to schedule. This problem *only* applies to RequiredDuringScheduling -affinity, not PreferredDuringScheduling affinity or any variant of -anti-affinity. There are at least three ways to solve this problem: -* **short-term**: have the scheduler use a rule that if the -RequiredDuringScheduling affinity requirement matches a pod's own labels, and -there are no other such pods anywhere, then disregard the requirement. This -approach has a corner case when running parallel schedulers that are allowed to -schedule pods from the same replicated set (e.g. a single PodTemplate): both -schedulers may try to schedule pods from the set at the same time and think -there are no other pods from that set scheduled yet (e.g. they are trying to -schedule the first two pods from the set), but by the time the second binding is -committed, the first one has already been committed, leaving you with two pods -running that do not respect their RequiredDuringScheduling affinity. There is no -simple way to detect this "conflict" at scheduling time given the current system -implementation. -* **longer-term**: when a controller creates pods from a PodTemplate, for -exactly *one* of those pods, it should omit any RequiredDuringScheduling -affinity rules that select the pods of that PodTemplate. 
-* **very long-term/speculative**: controllers could present the scheduler with a -group of pods from the same PodTemplate as a single unit. This is similar to the -first approach described above but avoids the corner case. No special logic is -needed in the controllers. Moreover, this would allow the scheduler to do proper -[gang scheduling](https://github.com/kubernetes/kubernetes/issues/16845) since -it could receive an entire gang simultaneously as a single unit. - -### Anti-affinity - -As with the affinity examples, the examples here can be RequiredDuringScheduling -or PreferredDuringScheduling anti-affinity, i.e. "don't" can be interpreted as -"must not" or as "try not to" depending on whether the rule appears in -`RequiredDuringScheduling` or `PreferredDuringScheduling`. - -* **Spread the pods of this service S across nodes and zones**: -`{{LabelSelector: <selector that matches S's pods>, TopologyKey: "node"}, -{LabelSelector: <selector that matches S's pods>, TopologyKey: "zone"}}` -(note that if this is specified as a RequiredDuringScheduling anti-affinity, -then the first clause is redundant, since the second clause will force the -scheduler to not put more than one pod from S in the same zone, and thus by -definition it will not put more than one pod from S on the same node, assuming -each node is in one zone. This rule is more useful as PreferredDuringScheduling -anti-affinity, e.g. one might expect it to be common in -[Cluster Federation](/contributors/design-proposals/multicluster/federation.md) clusters.) - -* **Don't co-locate pods of this service with pods from service "evilService"**: -`{LabelSelector: selector that matches evilService's pods, TopologyKey: "node"}` - -* **Don't co-locate pods of this service with any other pods including pods of this service**: -`{LabelSelector: empty, TopologyKey: "node"}` - -* **Don't co-locate pods of this service with any other pods except other pods of this service**: -Assuming pods from the service have some label `{key=service, value=S}`: -`{LabelSelector: "service" NotIn "S", TopologyKey: "node"}` -Note that this works because `"service" NotIn "S"` matches pods with no key -"service" as well as pods with key "service" and a corresponding value that is -not "S." - -## Algorithm - -An example algorithm a scheduler might use to implement affinity and -anti-affinity rules is as follows. There are certainly more efficient ways to -do it; this is just intended to demonstrate that the API's semantics are -implementable. - -Terminology definition: We say a pod P is "feasible" on a node N if P meets all -of the scheduler predicates for scheduling P onto N. Note that this algorithm is -only concerned about scheduling time, thus it makes no distinction between -RequiredDuringExecution and IgnoredDuringExecution. - -To make the algorithm slightly more readable, we use the term "HardPodAffinity" -as shorthand for "RequiredDuringSchedulingScheduling pod affinity" and -"SoftPodAffinity" as shorthand for "PreferredDuringScheduling pod affinity." -Analogously for "HardPodAntiAffinity" and "SoftPodAntiAffinity." 
- -**TODO: Update this algorithm to take weight for SoftPod{Affinity,AntiAffinity} into account; currently it assumes all terms have weight 1.** - -``` -Z = the pod you are scheduling -{N} = the set of all nodes in the system // this algorithm will reduce it to the set of all nodes feasible for Z -// Step 1a: Reduce {N} to the set of nodes satisfying Z's HardPodAffinity in the "forward" direction -X = {Z's PodSpec's HardPodAffinity} -foreach element H of {X} - P = {all pods in the system that match H.LabelSelector} - M map[string]int // topology value -> number of pods running on nodes with that topology value - foreach pod Q of {P} - L = {labels of the node on which Q is running, represented as a map from label key to label value} - M[L[H.TopologyKey]]++ - {N} = {N} intersect {all nodes of N with label [key=H.TopologyKey, value=any K such that M[K]>0]} -// Step 1b: Further reduce {N} to the set of nodes also satisfying Z's HardPodAntiAffinity -// This step is identical to Step 1a except the M[K] > 0 comparison becomes M[K] == 0 -X = {Z's PodSpec's HardPodAntiAffinity} -foreach element H of {X} - P = {all pods in the system that match H.LabelSelector} - M map[string]int // topology value -> number of pods running on nodes with that topology value - foreach pod Q of {P} - L = {labels of the node on which Q is running, represented as a map from label key to label value} - M[L[H.TopologyKey]]++ - {N} = {N} intersect {all nodes of N with label [key=H.TopologyKey, value=any K such that M[K]==0]} -// Step 2: Further reduce {N} by enforcing symmetry requirement for other pods' HardPodAntiAffinity -foreach node A of {N} - foreach pod B that is bound to A - if any of B's HardPodAntiAffinity are currently satisfied but would be violated if Z runs on A, then remove A from {N} -// At this point, all node in {N} are feasible for Z. -// Step 3a: Soft version of Step 1a -Y map[string]int // node -> number of Z's soft affinity/anti-affinity preferences satisfied by that node -Initialize the keys of Y to all of the nodes in {N}, and the values to 0 -X = {Z's PodSpec's SoftPodAffinity} -Repeat Step 1a except replace the last line with "foreach node W of {N} having label [key=H.TopologyKey, value=any K such that M[K]>0], Y[W]++" -// Step 3b: Soft version of Step 1b -X = {Z's PodSpec's SoftPodAntiAffinity} -Repeat Step 1b except replace the last line with "foreach node W of {N} not having label [key=H.TopologyKey, value=any K such that M[K]>0], Y[W]++" -// Step 4: Symmetric soft, plus treat forward direction of hard affinity as a soft -foreach node A of {N} - foreach pod B that is bound to A - increment Y[A] by the number of B's SoftPodAffinity, SoftPodAntiAffinity, and HardPodAffinity that are satisfied if Z runs on A but are not satisfied if Z does not run on A -// We're done. {N} contains all of the nodes that satisfy the affinity/anti-affinity rules, and Y is -// a map whose keys are the elements of {N} and whose values are how "good" of a choice N is for Z with -// respect to the explicit and implicit affinity/anti-affinity rules (larger number is better). -``` - -## Special considerations for RequiredDuringScheduling anti-affinity - -In this section we discuss three issues with RequiredDuringScheduling -anti-affinity: Denial of Service (DoS), co-existing with daemons, and -determining which pod(s) to kill. See issue [#18265](https://github.com/kubernetes/kubernetes/issues/18265) -for additional discussion of these topics. 
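
Before turning to those issues, a compact Go sketch of Steps 1a and 1b of the algorithm above may help make the filtering concrete. All types and helper names here are assumptions chosen for illustration; a real implementation would operate on the actual API objects and label selectors.

```go
package main

import "fmt"

// Simplified stand-ins for the objects in the pseudocode above.
type term struct {
	selector    map[string]string // exact-match label selector (simplified)
	topologyKey string            // e.g. "zone"
}

type pod struct {
	labels   map[string]string
	nodeName string
}

type node struct {
	name   string
	labels map[string]string
}

func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// filterNodes implements Steps 1a/1b: keep only nodes whose topology value
// hosts >0 matching pods (affinity) or ==0 matching pods (anti-affinity).
func filterNodes(nodes []node, pods []pod, h term, antiAffinity bool) []node {
	byName := map[string]node{}
	for _, n := range nodes {
		byName[n.name] = n
	}
	// m: topology value -> number of matching pods in that topology domain.
	m := map[string]int{}
	for _, p := range pods {
		if !matches(h.selector, p.labels) {
			continue
		}
		if n, ok := byName[p.nodeName]; ok {
			m[n.labels[h.topologyKey]]++
		}
	}
	var out []node
	for _, n := range nodes {
		count := m[n.labels[h.topologyKey]]
		if (antiAffinity && count == 0) || (!antiAffinity && count > 0) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	nodes := []node{
		{name: "n1", labels: map[string]string{"zone": "a"}},
		{name: "n2", labels: map[string]string{"zone": "b"}},
	}
	pods := []pod{{labels: map[string]string{"service": "S"}, nodeName: "n1"}}
	h := term{selector: map[string]string{"service": "S"}, topologyKey: "zone"}
	fmt.Println(filterNodes(nodes, pods, h, false)) // affinity: only the node in zone "a" survives
	fmt.Println(filterNodes(nodes, pods, h, true))  // anti-affinity: only the node in zone "b" survives
}
```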
- -### Denial of Service - -Without proper safeguards, a pod using RequiredDuringScheduling anti-affinity -can intentionally or unintentionally cause various problems for other pods, due -to the symmetry property of anti-affinity. - -The most notable danger is the ability for a pod that arrives first to some -topology domain, to block all other pods from scheduling there by stating a -conflict with all other pods. The standard approach to preventing resource -hogging is quota, but simple resource quota cannot prevent this scenario because -the pod may request very little resources. Addressing this using quota requires -a quota scheme that charges based on "opportunity cost" rather than based simply -on requested resources. For example, when handling a pod that expresses -RequiredDuringScheduling anti-affinity for all pods using a "node" `TopologyKey` -(i.e. exclusive access to a node), it could charge for the resources of the -average or largest node in the cluster. Likewise if a pod expresses -RequiredDuringScheduling anti-affinity for all pods using a "cluster" -`TopologyKey`, it could charge for the resources of the entire cluster. If node -affinity is used to constrain the pod to a particular topology domain, then the -admission-time quota charging should take that into account (e.g. not charge for -the average/largest machine if the PodSpec constrains the pod to a specific -machine with a known size; instead charge for the size of the actual machine -that the pod was constrained to). In all cases once the pod is scheduled, the -quota charge should be adjusted down to the actual amount of resources allocated -(e.g. the size of the actual machine that was assigned, not the -average/largest). If a cluster administrator wants to overcommit quota, for -example to allow more than N pods across all users to request exclusive node -access in a cluster with N nodes, then a priority/preemption scheme should be -added so that the most important pods run when resource demand exceeds supply. - -An alternative approach, which is a bit of a blunt hammer, is to use a -capability mechanism to restrict use of RequiredDuringScheduling anti-affinity -to trusted users. A more complex capability mechanism might only restrict it -when using a non-"node" TopologyKey. - -Our initial implementation will use a variant of the capability approach, which -requires no configuration: we will simply reject ALL requests, regardless of -user, that specify "all namespaces" with non-"node" TopologyKey for -RequiredDuringScheduling anti-affinity. This allows the "exclusive node" use -case while prohibiting the more dangerous ones. - -A weaker variant of the problem described in the previous paragraph is a pod's -ability to use anti-affinity to degrade the scheduling quality of another pod, -but not completely block it from scheduling. For example, a set of pods S1 could -use node affinity to request to schedule onto a set of nodes that some other set -of pods S2 prefers to schedule onto. If the pods in S1 have -RequiredDuringScheduling or even PreferredDuringScheduling pod anti-affinity for -S2, then due to the symmetry property of anti-affinity, they can prevent the -pods in S2 from scheduling onto their preferred nodes if they arrive first (for -sure in the RequiredDuringScheduling case, and with some probability that -depends on the weighting scheme for the PreferredDuringScheduling case). 
A very -sophisticated priority and/or quota scheme could mitigate this, or alternatively -we could eliminate the symmetry property of the implementation of -PreferredDuringScheduling anti-affinity. Then only RequiredDuringScheduling -anti-affinity could affect scheduling quality of another pod, and as we -described in the previous paragraph, such pods could be charged quota for the -full topology domain, thereby reducing the potential for abuse. - -We won't try to address this issue in our initial implementation; we can -consider one of the approaches mentioned above if it turns out to be a problem -in practice. - -### Co-existing with daemons - -A cluster administrator may wish to allow pods that express anti-affinity -against all pods, to nonetheless co-exist with system daemon pods, such as those -run by DaemonSet. In principle, we would like the specification for -RequiredDuringScheduling inter-pod anti-affinity to allow "toleration" of one or -more other pods (see [#18263](https://github.com/kubernetes/kubernetes/issues/18263) -for a more detailed explanation of the toleration concept). -There are at least two ways to accomplish this: - -* Scheduler special-cases the namespace(s) where daemons live, in the - sense that it ignores pods in those namespaces when it is - determining feasibility for pods with anti-affinity. The name(s) of - the special namespace(s) could be a scheduler configuration - parameter, and default to `kube-system`. We could allow - multiple namespaces to be specified if we want cluster admins to be - able to give their own daemons this special power (they would add - their namespace to the list in the scheduler configuration). And of - course this would be symmetric, so daemons could schedule onto a node - that is already running a pod with anti-affinity. - -* We could add an explicit "toleration" concept/field to allow the - user to specify namespaces that are excluded when they use - RequiredDuringScheduling anti-affinity, and use an admission - controller/defaulter to ensure these namespaces are always listed. - -Our initial implementation will use the first approach. - -### Determining which pod(s) to kill (for RequiredDuringSchedulingRequiredDuringExecution) - -Because anti-affinity is symmetric, in the case of -RequiredDuringSchedulingRequiredDuringExecution anti-affinity, the system must -determine which pod(s) to kill when a pod's labels are updated in such as way as -to cause them to conflict with one or more other pods' -RequiredDuringSchedulingRequiredDuringExecution anti-affinity rules. In the -absence of a priority/preemption scheme, our rule will be that the pod with the -anti-affinity rule that becomes violated should be the one killed. A pod should -only specify constraints that apply to namespaces it trusts to not do malicious -things. Once we have priority/preemption, we can change the rule to say that the -lowest-priority pod(s) are killed until all -RequiredDuringSchedulingRequiredDuringExecution anti-affinity is satisfied. - -## Special considerations for RequiredDuringScheduling affinity - -The DoS potential of RequiredDuringScheduling *anti-affinity* stemmed from its -symmetry: if a pod P requests anti-affinity, P cannot schedule onto a node with -conflicting pods, and pods that conflict with P cannot schedule onto the node -one P has been scheduled there. 
The design we have described says that the -symmetry property for RequiredDuringScheduling *affinity* is weaker: if a pod P -says it can only schedule onto nodes running pod Q, this does not mean Q can -only run on a node that is running P, but the scheduler will try to schedule Q -onto a node that is running P (i.e. treats the reverse direction as preferred). -This raises the same scheduling quality concern as we mentioned at the end of -the Denial of Service section above, and can be addressed in similar ways. - -The nature of affinity (as opposed to anti-affinity) means that there is no -issue of determining which pod(s) to kill when a pod's labels change: it is -obviously the pod with the affinity rule that becomes violated that must be -killed. (Killing a pod never "fixes" violation of an affinity rule; it can only -"fix" violation an anti-affinity rule.) However, affinity does have a different -question related to killing: how long should the system wait before declaring -that RequiredDuringSchedulingRequiredDuringExecution affinity is no longer met -at runtime? For example, if a pod P has such an affinity for a pod Q and pod Q -is temporarily killed so that it can be updated to a new binary version, should -that trigger killing of P? More generally, how long should the system wait -before declaring that P's affinity is violated? (Of course affinity is expressed -in terms of label selectors, not for a specific pod, but the scenario is easier -to describe using a concrete pod.) This is closely related to the concept of -forgiveness (see issue [#1574](https://github.com/kubernetes/kubernetes/issues/1574)). -In theory we could make this time duration be configurable by the user on a per-pod -basis, but for the first version of this feature we will make it a configurable -property of whichever component does the killing and that applies across all pods -using the feature. Making it configurable by the user would require a nontrivial -change to the API syntax (since the field would only apply to -RequiredDuringSchedulingRequiredDuringExecution affinity). - -## Implementation plan - -1. Add the `Affinity` field to PodSpec and the `PodAffinity` and -`PodAntiAffinity` types to the API along with all of their descendant types. -2. Implement a scheduler predicate that takes -`RequiredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity into -account. Include a workaround for the issue described at the end of the Affinity -section of the Examples section (can't schedule first pod). -3. Implement a scheduler priority function that takes -`PreferredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity -into account. -4. Implement admission controller that rejects requests that specify "all -namespaces" with non-"node" TopologyKey for `RequiredDuringScheduling` -anti-affinity. This admission controller should be enabled by default. -5. Implement the recommended solution to the "co-existing with daemons" issue -6. At this point, the feature can be deployed. -7. Add the `RequiredDuringSchedulingRequiredDuringExecution` field to affinity -and anti-affinity, and make sure the pieces of the system already implemented -for `RequiredDuringSchedulingIgnoredDuringExecution` also take -`RequiredDuringSchedulingRequiredDuringExecution` into account (e.g. the -scheduler predicate, the quota mechanism, the "co-existing with daemons" -solution). -8. Add `RequiredDuringSchedulingRequiredDuringExecution` for "node" -`TopologyKey` to Kubelet's admission decision. -9. 
Implement code in Kubelet *or* the controllers that evicts a pod that no -longer satisfies `RequiredDuringSchedulingRequiredDuringExecution`. If Kubelet -then only for "node" `TopologyKey`; if controller then potentially for all -`TopologyKeys`'s. (see [this comment](https://github.com/kubernetes/kubernetes/issues/12744#issuecomment-164372008)). -Do so in a way that addresses the "determining which pod(s) to kill" issue. - -We assume Kubelet publishes labels describing the node's membership in all of -the relevant scheduling domains (e.g. node name, rack name, availability zone -name, etc.). See [#9044](https://github.com/kubernetes/kubernetes/issues/9044). - -## Backward compatibility - -Old versions of the scheduler will ignore `Affinity`. - -Users should not start using `Affinity` until the full implementation has been -in Kubelet and the master for enough binary versions that we feel comfortable -that we will not need to roll back either Kubelet or master to a version that -does not support them. Longer-term we will use a programmatic approach to -enforcing this ([#4855](https://github.com/kubernetes/kubernetes/issues/4855)). - -## Extensibility - -The design described here is the result of careful analysis of use cases, a -decade of experience with Borg at Google, and a review of similar features in -other open-source container orchestration systems. We believe that it properly -balances the goal of expressiveness against the goals of simplicity and -efficiency of implementation. However, we recognize that use cases may arise in -the future that cannot be expressed using the syntax described here. Although we -are not implementing an affinity-specific extensibility mechanism for a variety -of reasons (simplicity of the codebase, simplicity of cluster deployment, desire -for Kubernetes users to get a consistent experience, etc.), the regular -Kubernetes annotation mechanism can be used to add or replace affinity rules. -The way this work would is: -1. Define one or more annotations to describe the new affinity rule(s) -1. User (or an admission controller) attaches the annotation(s) to pods to -request the desired scheduling behavior. If the new rule(s) *replace* one or -more fields of `Affinity` then the user would omit those fields from `Affinity`; -if they are *additional rules*, then the user would fill in `Affinity` as well -as the annotation(s). -1. Scheduler takes the annotation(s) into account when scheduling. - -If some particular new syntax becomes popular, we would consider upstreaming it -by integrating it into the standard `Affinity`. - -## Future work and non-work - -One can imagine that in the anti-affinity RequiredDuringScheduling case one -might want to associate a number with the rule, for example "do not allow this -pod to share a rack with more than three other pods (in total, or from the same -service as the pod)." We could allow this to be specified by adding an integer -`Limit` to `PodAffinityTerm` just for the `RequiredDuringScheduling` case. -However, this flexibility complicates the system and we do not intend to -implement it. - -It is likely that the specification and implementation of pod anti-affinity -can be unified with [taints and tolerations](taint-toleration-dedicated.md), -and likewise that the specification and implementation of pod affinity -can be unified with [node affinity](nodeaffinity.md). The basic idea is that pod -labels would be "inherited" by the node, and pods would only be able to specify -affinity and anti-affinity for a node's labels. 
Our main motivation for not -unifying taints and tolerations with pod anti-affinity is that we foresee taints -and tolerations as being a concept that only cluster administrators need to -understand (and indeed in some setups taints and tolerations wouldn't even be -directly manipulated by a cluster administrator, instead they would only be set -by an admission controller that is implementing the administrator's high-level -policy about different classes of special machines and the users who belong to -the groups allowed to access them). Moreover, the concept of nodes "inheriting" -labels from pods seems complicated; it seems conceptually simpler to separate -rules involving relatively static properties of nodes from rules involving which -other pods are running on the same node or larger topology domain. - -Data/storage affinity is related to pod affinity, and is likely to draw on some -of the ideas we have used for pod affinity. Today, data/storage affinity is -expressed using node affinity, on the assumption that the pod knows which -node(s) store(s) the data it wants. But a more flexible approach would allow the -pod to name the data rather than the node. - -## Related issues - -The review for this proposal is in [#18265](https://github.com/kubernetes/kubernetes/issues/18265). - -The topic of affinity/anti-affinity has generated a lot of discussion. The main -issue is [#367](https://github.com/kubernetes/kubernetes/issues/367) -but [#14484](https://github.com/kubernetes/kubernetes/issues/14484)/[#14485](https://github.com/kubernetes/kubernetes/issues/14485), -[#9560](https://github.com/kubernetes/kubernetes/issues/9560), [#11369](https://github.com/kubernetes/kubernetes/issues/11369), -[#14543](https://github.com/kubernetes/kubernetes/issues/14543), [#11707](https://github.com/kubernetes/kubernetes/issues/11707), -[#3945](https://github.com/kubernetes/kubernetes/issues/3945), [#341](https://github.com/kubernetes/kubernetes/issues/341), -[#1965](https://github.com/kubernetes/kubernetes/issues/1965), and [#2906](https://github.com/kubernetes/kubernetes/issues/2906) -all have additional discussion and use cases. - -As the examples in this document have demonstrated, topological affinity is very -useful in clusters that are spread across availability zones, e.g. to co-locate -pods of a service in the same zone to avoid a wide-area network hop, or to -spread pods across zones for failure tolerance. [#17059](https://github.com/kubernetes/kubernetes/issues/17059), -[#13056](https://github.com/kubernetes/kubernetes/issues/13056), [#13063](https://github.com/kubernetes/kubernetes/issues/13063), -and [#4235](https://github.com/kubernetes/kubernetes/issues/4235) are relevant. - -Issue [#15675](https://github.com/kubernetes/kubernetes/issues/15675) describes connection affinity, which is vaguely related. - -This proposal is to satisfy [#14816](https://github.com/kubernetes/kubernetes/issues/14816). - -## Related work - -**TODO: cite references** +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/scheduling/predicates-ordering.md b/contributors/design-proposals/scheduling/predicates-ordering.md index 4c569496..f0fbec72 100644 --- a/contributors/design-proposals/scheduling/predicates-ordering.md +++ b/contributors/design-proposals/scheduling/predicates-ordering.md @@ -1,94 +1,6 @@ -# predicates ordering +Design proposals have been archived. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/).
-
-Status: proposal
-
-Author: yastij
-
-Approvers:
-* gmarek
-* bsalamat
-* k82cn
-
-
-## Abstract
-
-This document describes how and why reordering predicates helps improve the performance of the Kubernetes scheduler.
-We will explain the motivation behind this proposal, the two-step solution we propose to tackle the problem, and the timeline for implementing it.
-
-
-## Motivation
-
-While working on a [Pull request](https://github.com/kubernetes/kubernetes/pull/50185) related to a proposal, we saw that the order in which predicates are run isn’t defined.
-
-This makes the scheduler perform extra computation that isn’t needed. As an example, we [outlined](https://github.com/kubernetes/kubernetes/pull/50185) that the Kubernetes scheduler runs predicates against nodes even if they are marked “unschedulable”.
-
-Reordering predicates allows us to avoid this problem by computing the most restrictive predicates first. To do so, we propose two reordering types.
-
-
-## Static ordering
-
-This ordering will be the default ordering. If a policy config is provided with a subset of predicates, only those predicates will be invoked, using the static ordering.
-
-
-| Position | Predicate | Comments (note, justification...) |
-| ----------------- | ---------------------------- | ------------------ |
-| 1 | `CheckNodeConditionPredicate` | we really don’t want to check predicates against unschedulable nodes. |
-| 2 | `PodFitsHost` | we check the pod.spec.nodeName. |
-| 3 | `PodFitsHostPorts` | we check the ports requested in the spec. |
-| 4 | `PodMatchNodeSelector` | check node labels after narrowing the search. |
-| 5 | `PodFitsResources` | this one comes here since it’s not restrictive enough: we do not try to match exact values but ranges. |
-| 6 | `NoDiskConflict` | following the resource predicate, we check disk conflicts |
-| 7 | `PodToleratesNodeTaints` | check tolerations here, as the node might have taints |
-| 8 | `PodToleratesNodeNoExecuteTaints` | check tolerations here, as the node might have taints |
-| 9 | `CheckNodeLabelPresence` | labels are easy to check, so this one goes earlier |
-| 10 | `checkServiceAffinity` | - |
-| 11 | `MaxPDVolumeCountPredicate` | - |
-| 12 | `VolumeNodePredicate` | - |
-| 13 | `VolumeZonePredicate` | - |
-| 14 | `CheckNodeMemoryPressurePredicate` | doesn’t happen often |
-| 15 | `CheckNodeDiskPressurePredicate` | doesn’t happen often |
-| 16 | `InterPodAffinityMatches` | most expensive predicate to compute |
-
-
-## End-user ordering
-
-Using a scheduling policy file, the cluster admin can override the default static ordering. This gives the administrator maximum flexibility regarding scheduler behaviour and enables the scheduler to adapt to cluster usage.
-Please note that the order must be a positive integer. Also, when several predicates are given equal order, the scheduler will determine the order between them and won't guarantee that it remains the same.
-
-Finally, updating the scheduling policy file will require a scheduler restart.
-
-As an example, the following is a scheduler policy file using end-user ordering:
-
-``` json
-{
-"kind" : "Policy",
-"apiVersion" : "v1",
-"predicates" : [
-  {"name" : "PodFitsHostPorts", "order": 2},
-  {"name" : "PodFitsResources", "order": 3},
-  {"name" : "NoDiskConflict", "order": 5},
-  {"name" : "PodToleratesNodeTaints", "order": 4},
-  {"name" : "MatchNodeSelector", "order": 6},
-  {"name" : "PodFitsHost", "order": 1}
-  ],
-"priorities" : [
-  {"name" : "LeastRequestedPriority", "weight" : 1},
-  {"name" : "BalancedResourceAllocation", "weight" : 1},
-  {"name" : "ServiceSpreadingPriority", "weight" : 1},
-  {"name" : "EqualPriority", "weight" : 1}
-  ],
-"hardPodAffinitySymmetricWeight" : 10
-}
-```
-
-
-## Timeline
-
-* static ordering: GA in 1.9
-* dynamic ordering: TBD based on customer feedback +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
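
Returning to the ordering mechanism described in the proposal above, the sketch below shows, in Go, how a scheduler could sort its registered predicates by the configured order, whether that order comes from the static default table or from a policy file. The types are assumptions for illustration only and are not the real scheduler code.

```go
package main

import (
	"fmt"
	"sort"
)

// orderedPredicate pairs a predicate name with the order assigned to it,
// either from the static default table or from the policy file.
type orderedPredicate struct {
	name  string
	order int // lower runs first; must be a positive integer
}

// sortPredicates returns the predicates sorted by their configured order.
// Ties keep whatever relative order the stable sort preserves, mirroring
// the proposal's note that equal orders give no guarantee.
func sortPredicates(preds []orderedPredicate) []orderedPredicate {
	out := append([]orderedPredicate(nil), preds...)
	sort.SliceStable(out, func(i, j int) bool { return out[i].order < out[j].order })
	return out
}

func main() {
	// The end-user ordering from the example policy file above.
	policy := []orderedPredicate{
		{"PodFitsHostPorts", 2},
		{"PodFitsResources", 3},
		{"NoDiskConflict", 5},
		{"PodToleratesNodeTaints", 4},
		{"MatchNodeSelector", 6},
		{"PodFitsHost", 1},
	}
	for _, p := range sortPredicates(policy) {
		fmt.Println(p.order, p.name)
	}
}
```

With equal-order entries, the scheduler is free to pick any order between them, which is why the sketch only promises a stable sort rather than a total guarantee.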
\ No newline at end of file diff --git a/contributors/design-proposals/scheduling/rescheduler.md b/contributors/design-proposals/scheduling/rescheduler.md index df36464a..f0fbec72 100644 --- a/contributors/design-proposals/scheduling/rescheduler.md +++ b/contributors/design-proposals/scheduling/rescheduler.md @@ -1,120 +1,6 @@ -# Rescheduler design space +Design proposals have been archived. -@davidopp, @erictune, @briangrant +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -July 2015 - -## Introduction and definition - -A rescheduler is an agent that proactively causes currently-running -Pods to be moved, so as to optimize some objective function for -goodness of the layout of Pods in the cluster. (The objective function -doesn't have to be expressed mathematically; it may just be a -collection of ad-hoc rules, but in principle there is an objective -function. Implicitly an objective function is described by the -scheduler's predicate and priority functions.) It might be triggered -to run every N minutes, or whenever some event happens that is known -to make the objective function worse (for example, whenever any Pod goes -PENDING for a long time.) - -## Motivation and use cases - -A rescheduler is useful because without a rescheduler, scheduling -decisions are only made at the time Pods are created. But later on, -the state of the cell may have changed in some way such that it would -be better to move the Pod to another node. - -There are two categories of movements a rescheduler might trigger: coalescing -and spreading. - -### Coalesce Pods - -This is the most common use case. Cluster layout changes over time. For -example, run-to-completion Pods terminate, producing free space in their wake, but that space -is fragmented. This fragmentation might prevent a PENDING Pod from scheduling -(there are enough free resource for the Pod in aggregate across the cluster, -but not on any single node). A rescheduler can coalesce free space like a -disk defragmenter, thereby producing enough free space on a node for a PENDING -Pod to schedule. In some cases it can do this just by moving Pods into existing -holes, but often it will need to evict (and reschedule) running Pods in order to -create a large enough hole. - -A second use case for a rescheduler to coalesce pods is when it becomes possible -to support the running Pods on a fewer number of nodes. The rescheduler can -gradually move Pods off of some set of nodes to make those nodes empty so -that they can then be shut down/removed. More specifically, -the system could do a simulation to see whether after removing a node from the -cluster, will the Pods that were on that node be able to reschedule, -either directly or with the help of the rescheduler; if the answer is -yes, then you can safely auto-scale down (assuming services will still -meeting their application-level SLOs). - -### Spread Pods - -The main use cases for spreading Pods revolve around relieving congestion on (a) highly -utilized node(s). For example, some process might suddenly start receiving a significantly -above-normal amount of external requests, leading to starvation of best-effort -Pods on the node. We can use the rescheduler to move the best-effort Pods off of the -node. (They are likely to have generous eviction SLOs, so are more likely to be movable -than the Pod that is experiencing the higher load, but in principle we might move either.) 
-Or even before any node becomes overloaded, we might proactively re-spread Pods from nodes -with high-utilization, to give them some buffer against future utilization spikes. In either -case, the nodes we move the Pods onto might have been in the system for a long time or might -have been added by the cluster auto-scaler specifically to allow the rescheduler to -rebalance utilization. - -A second spreading use case is to separate antagonists. -Sometimes the processes running in two different Pods on the same node -may have unexpected antagonistic -behavior towards one another. A system component might monitor for such -antagonism and ask the rescheduler to move one of the antagonists to a new node. - -### Ranking the use cases - -The vast majority of users probably only care about rescheduling for three scenarios: - -1. Move Pods around to get a PENDING Pod to schedule -1. Redistribute Pods onto new nodes added by a cluster auto-scaler when there are no PENDING Pods -1. Move Pods around when CPU starvation is detected on a node - -## Design considerations and design space - -Because rescheduling is disruptive--it causes one or more -already-running Pods to die when they otherwise wouldn't--a key -constraint on rescheduling is that it must be done subject to -disruption SLOs. There are a number of ways to specify these SLOs--a -global rate limit across all Pods, a rate limit across a set of Pods -defined by some particular label selector, a maximum number of Pods -that can be down at any one time among a set defined by some -particular label selector, etc. These policies are presumably part of -the Rescheduler's configuration. - -There are a lot of design possibilities for a rescheduler. To explain -them, it's easiest to start with the description of a baseline -rescheduler, and then describe possible modifications. The Baseline -rescheduler -* only kicks in when there are one or more PENDING Pods for some period of time; its objective function is binary: completely happy if there are no PENDING Pods, and completely unhappy if there are PENDING Pods; it does not try to optimize for any other aspect of cluster layout -* is not a scheduler -- it simply identifies a node where a PENDING Pod could fit if one or more Pods on that node were moved out of the way, and then kills those Pods to make room for the PENDING Pod, which will then be scheduled there by the regular scheduler(s). [obviously this killing operation must be able to specify "don't allow the killed Pod to reschedule back to whence it was killed" otherwise the killing is pointless] Of course it should only do this if it is sure the killed Pods will be able to reschedule into already-free space in the cluster. Note that although it is not a scheduler, the Rescheduler needs to be linked with the predicate functions of the scheduling algorithm(s) so that it can know (1) that the PENDING Pod would actually schedule into the hole it has identified once the hole is created, and (2) that the evicted Pod(s) will be able to schedule somewhere else in the cluster. - -Possible variations on this Baseline rescheduler are - -1. it can kill the Pod(s) whose space it wants **and also schedule the Pod that will take that space and reschedule the Pod(s) that were killed**, rather than just killing the Pod(s) whose space it wants and relying on the regular scheduler(s) to schedule the Pod that will take that space (and to reschedule the Pod(s) that were evicted) -1. 
it can run continuously in the background to optimize general cluster layout instead of just trying to get a PENDING Pod to schedule -1. it can try to move groups of Pods instead of using a one-at-a-time / greedy approach -1. it can formulate multi-hop plans instead of single-hop - -A key design question for a Rescheduler is how much knowledge it needs about the scheduling policies used by the cluster's scheduler(s). -* For the Baseline rescheduler, it needs to know the predicate functions used by the cluster's scheduler(s) else it can't know how to create a hole that the PENDING Pod will fit into, nor be sure that the evicted Pod(s) will be able to reschedule elsewhere. -* If it is going to run continuously in the background to optimize cluster layout but is still only going to kill Pods, then it still needs to know the predicate functions for the reason mentioned above. In principle it doesn't need to know the priority functions; it could just randomly kill Pods and rely on the regular scheduler to put them back in better places. However, this is a rather inexact approach. Thus it is useful for the rescheduler to know the priority functions, or at least some subset of them, so it can be sure that an action it takes will actually improve the cluster layout. -* If it is going to run continuously in the background to optimize cluster layout and is going to act as a scheduler rather than just killing Pods, then it needs to know the predicate functions and some compatible (but not necessarily identical) priority functions One example of a case where "compatible but not identical" might be useful is if the main scheduler(s) has a very simple scheduling policy optimized for low scheduling latency, and the Rescheduler having a more sophisticated/optimal scheduling policy that requires more computation time. The main thing to avoid is for the scheduler(s) and rescheduler to have incompatible priority functions, as this will cause them to "fight" (though it still can't lead to an infinite loop, since the scheduler(s) only ever touches a Pod once). - -## Appendix: Integrating rescheduler with cluster auto-scaler (scale up) - -For scaling up the cluster, a reasonable workflow might be: - -1. pod horizontal auto-scaler decides to add one or more Pods to a service, based on the metrics it is observing -1. the Pod goes PENDING due to lack of a suitable node with sufficient resources -1. rescheduler notices the PENDING Pod and determines that the Pod cannot schedule just by rearranging existing Pods (while respecting SLOs) -1. rescheduler triggers cluster auto-scaler to add a node of the appropriate type for the PENDING Pod -1. the PENDING Pod schedules onto the new node (and possibly the rescheduler also moves other Pods onto that node) +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
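
To tie the workflow above together, here is a schematic Go sketch of one pass of the Baseline rescheduler described earlier: it acts only on long-PENDING Pods and only evicts victims it believes can reschedule into existing free space. Every type and method name here is an assumption made for illustration, not an existing component or API.

```go
package main

import "fmt"

// Schematic stand-ins; all names are illustrative assumptions.
type podRef struct{ namespace, name string }

type cluster interface {
	PendingTooLong() []podRef                             // Pods PENDING for more than some threshold
	FindEvictionsThatFreeRoom(p podRef) ([]podRef, bool)  // victims on one node that would let p fit
	CanRescheduleElsewhere(victims []podRef) bool         // victims fit into existing free space
	Evict(p podRef) error                                 // evict; the regular scheduler places the replacement
}

// baselineReschedulerPass performs one pass of the baseline policy: it only
// acts when a Pod has been PENDING for a while, and it only evicts victims
// that it believes can reschedule into already-free space in the cluster.
func baselineReschedulerPass(c cluster) {
	for _, pending := range c.PendingTooLong() {
		victims, ok := c.FindEvictionsThatFreeRoom(pending)
		if !ok || !c.CanRescheduleElsewhere(victims) {
			continue // no safe plan for this pending Pod
		}
		for _, v := range victims {
			if err := c.Evict(v); err != nil {
				fmt.Println("eviction failed, abandoning plan:", err)
				break
			}
		}
	}
}

func main() {} // in practice the pass would be driven by a timer or a PENDING-Pod event
```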
\ No newline at end of file diff --git a/contributors/design-proposals/scheduling/rescheduling-for-critical-pods.md b/contributors/design-proposals/scheduling/rescheduling-for-critical-pods.md index f899a08d..f0fbec72 100644 --- a/contributors/design-proposals/scheduling/rescheduling-for-critical-pods.md +++ b/contributors/design-proposals/scheduling/rescheduling-for-critical-pods.md @@ -1,84 +1,6 @@ -# Rescheduler: guaranteed scheduling of critical addons +Design proposals have been archived.
-
-## Motivation +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/).
-
-In addition to Kubernetes core components like the api-server, scheduler, and controller-manager running on a master machine,
-there are a number of addons which, for various reasons, have to run on a regular cluster node rather than the master.
-Some of them are critical to having a fully functional cluster: Heapster, DNS, UI. Users can break their cluster
-by evicting a critical addon (either manually or as a side effect of another operation like an upgrade),
-which can then become pending (for example when the cluster is highly utilized).
-To avoid such a situation we want to have a mechanism which guarantees that
-critical addons are scheduled, assuming the cluster is big enough.
-This may affect other pods (including production users' applications).
-
-## Design
-
-Rescheduler will ensure that critical addons are always scheduled.
-In the first version it will implement only this policy, but later we may want to introduce other policies.
-It will be a standalone component running on the master machine, similarly to the scheduler.
-Those components will share common logic (initially the rescheduler will in fact import some of the scheduler's packages).
-
-### Guaranteed scheduling of critical addons
-
-Rescheduler will observe critical addons
-(those with the annotation `scheduler.alpha.kubernetes.io/critical-pod`).
-If one of them is marked by the scheduler as unschedulable (pod condition `PodScheduled` set to `false`, with the reason set to `Unschedulable`),
-the component will try to make space for the addon by evicting some pods; the scheduler will then schedule the addon.
-
-#### Scoring nodes
-
-Initially we want to choose a random node with enough capacity
-(chosen as described in [Evicting pods](rescheduling-for-critical-pods.md#evicting-pods)) to schedule the given addon.
-Later we may want to introduce some heuristics:
-* minimize the number of evicted pods whose disruption budget would be violated or whose termination grace period would be shortened
-* minimize the number of affected pods by choosing a node on which we have to evict fewer pods
-* increase the probability that evicted pods can reschedule by preferring the set of pods with the smallest total sum of requests
-* avoid nodes which are ‘non-drainable’ (according to the drain logic), for example nodes running a pod which doesn't belong to any RC/RS/Deployment
-
-#### Evicting pods
-
-There are 2 mechanisms which can possibly delay a pod eviction: Disruption Budget and Termination Grace Period.
-
-While removing a pod we will try to avoid violating its Disruption Budget, though we can't guarantee it,
-since there is a chance that doing so would block this operation for a longer period of time.
-We will also try to respect the Termination Grace Period, though without any guarantee.
-In case we have to remove a pod with a termination grace period longer than 10s, it will be shortened to 10s.
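
For reference, the detection step described earlier in this proposal (a critical addon that the scheduler has marked unschedulable) could look roughly like the following Go sketch. Only the annotation key and the `PodScheduled`/`Unschedulable` condition come from the proposal itself; the helper types are assumptions.

```go
package main

import "fmt"

// Minimal stand-ins for the relevant bits of a Pod; illustrative only.
type podCondition struct {
	conditionType string // e.g. "PodScheduled"
	status        string // "True" / "False"
	reason        string // e.g. "Unschedulable"
}

type podInfo struct {
	annotations map[string]string
	conditions  []podCondition
}

// isUnschedulableCriticalAddon reports whether a pod is a critical addon
// (per the annotation above) that the scheduler has marked unschedulable.
func isUnschedulableCriticalAddon(p podInfo) bool {
	if _, critical := p.annotations["scheduler.alpha.kubernetes.io/critical-pod"]; !critical {
		return false
	}
	for _, c := range p.conditions {
		if c.conditionType == "PodScheduled" && c.status == "False" && c.reason == "Unschedulable" {
			return true
		}
	}
	return false
}

func main() {
	dns := podInfo{
		annotations: map[string]string{"scheduler.alpha.kubernetes.io/critical-pod": ""},
		conditions:  []podCondition{{"PodScheduled", "False", "Unschedulable"}},
	}
	fmt.Println(isUnschedulableCriticalAddon(dns)) // true
}
```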
-
-The proposed order of preference while choosing a node on which to schedule a critical addon, and which pods to remove:
-1. a node where the critical addon pod can fit after evicting only pods satisfying both of the following:
-(1) their disruption budget will not be violated by such an eviction, and (2) they have a grace period <= 10 seconds
-1. a node where the critical addon pod can fit after evicting only pods whose disruption budget will not be violated by such an eviction
-1. any node where the critical addon pod can fit after evicting some pods
-
-### Interaction with Scheduler
-
-To avoid a situation where the Scheduler schedules another pod into the space prepared for the critical addon,
-the chosen node has to be temporarily excluded from the list of nodes the Scheduler considers while making decisions.
-For this purpose the node will get a temporary
-[Taint](../../docs/design/taint-toleration-dedicated.md) “CriticalAddonsOnly”,
-and each critical addon has to have a toleration defined for this taint.
-Once the Rescheduler has no more work to do (all critical addons are scheduled, or the cluster is too small for them),
-all such taints will be removed.
-
-### Interaction with Cluster Autoscaler
-
-Rescheduler can possibly duplicate the responsibility of Cluster Autoscaler:
-both components take action when there is an unschedulable pod.
-This may cause a situation where CA adds an extra node for a pending critical addon
-while the Rescheduler evicts some running pods to make space for the addon.
-This situation would be rare, and usually an extra node would be needed for the evicted pods anyway.
-In the worst case CA will add and then remove the node.
-To avoid complicating the architecture by introducing interaction between those 2 components, we accept this overlap.
-
-We want to ensure that CA won't remove nodes with critical addons, by adding appropriate logic there.
-
-### Rescheduler control loop
-
-The rescheduler control loop will be as follows:
-* while there is an unschedulable critical addon, do the following:
-  * choose a node on which the addon should be scheduled (as described in Evicting pods)
-  * add the taint to the node to prevent the scheduler from using it
-  * delete the pods which block the addon from being scheduled
-  * wait until the scheduler schedules the critical addon
-* if there are no more critical addons we can help, ensure there is no node with the taint +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
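
The control loop above might look schematically like the following Go sketch. Every method on the interface is an assumption introduced for illustration; the real component would of course operate on actual pods, nodes, and taints.

```go
package main

// criticalRescheduler is a schematic stand-in for the component described
// above; all method names are illustrative assumptions, not an existing API.
type criticalRescheduler interface {
	UnschedulableCriticalAddons() []string     // names of pending critical addons
	ChooseNodeFor(addon string) (string, bool) // pick a node per the eviction-order rules
	Taint(node string) error                   // add the "CriticalAddonsOnly" taint
	EvictBlockers(node string, addon string)   // delete pods blocking the addon
	WaitUntilScheduled(addon string)           // let the regular scheduler place the addon
	RemoveAllTaints()                          // cleanup when there is nothing left to do
}

// runOnce performs one pass of the control loop described in the bullets above.
func runOnce(r criticalRescheduler) {
	for _, addon := range r.UnschedulableCriticalAddons() {
		node, ok := r.ChooseNodeFor(addon)
		if !ok {
			continue // cluster too small for this addon; nothing we can do
		}
		if err := r.Taint(node); err != nil {
			continue
		}
		r.EvictBlockers(node, addon)
		r.WaitUntilScheduled(addon)
	}
	// No more critical addons we can help: ensure no node keeps the taint.
	r.RemoveAllTaints()
}

func main() {}
```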
\ No newline at end of file diff --git a/contributors/design-proposals/scheduling/rescheduling.md b/contributors/design-proposals/scheduling/rescheduling.md index fa06cdc4..f0fbec72 100644 --- a/contributors/design-proposals/scheduling/rescheduling.md +++ b/contributors/design-proposals/scheduling/rescheduling.md @@ -1,489 +1,6 @@ -# Controlled Rescheduling in Kubernetes +Design proposals have been archived. -## Overview +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Although the Kubernetes scheduler(s) try to make good placement decisions for pods, -conditions in the cluster change over time (e.g. jobs finish and new pods arrive, nodes -are removed due to failures or planned maintenance or auto-scaling down, nodes appear due -to recovery after a failure or re-joining after maintenance or auto-scaling up or adding -new hardware to a bare-metal cluster), and schedulers are not omniscient (e.g. there are -some interactions between pods, or between pods and nodes, that they cannot predict). As -a result, the initial node selected for a pod may turn out to be a bad match, from the -perspective of the pod and/or the cluster as a whole, at some point after the pod has -started running. - -Today (Kubernetes version 1.2) once a pod is scheduled to a node, it never moves unless -it terminates on its own, is deleted by the user, or experiences some unplanned event -(e.g. the node where it is running dies). Thus in a cluster with long-running pods, the -assignment of pods to nodes degrades over time, no matter how good an initial scheduling -decision the scheduler makes. This observation motivates "controlled rescheduling," a -mechanism by which Kubernetes will "move" already-running pods over time to improve their -placement. Controlled rescheduling is the subject of this proposal. - -Note that the term "move" is not technically accurate -- the mechanism used is that -Kubernetes will terminate a pod that is managed by a controller, and the controller will -create a replacement pod that is then scheduled by the pod's scheduler. The terminated -pod and replacement pod are completely separate pods, and no pod migration is -implied. However, describing the process as "moving" the pod is approximately accurate -and easier to understand, so we will use this terminology in the document. - -We use the term "rescheduling" to describe any action the system takes to move an -already-running pod. The decision may be made and executed by any component; we will -introduce the concept of a "rescheduler" component later, but it is not the only -component that can do rescheduling. - -This proposal primarily focuses on the architecture and features/mechanisms used to -achieve rescheduling, and only briefly discuss example policies. We expect that community -experimentation will lead to a significantly better understanding of the range, potential, -and limitations of rescheduling policies. - -## Example use cases - -Example use cases for rescheduling are - -* moving a running pod onto a node that better satisfies its scheduling criteria - * moving a pod onto an under-utilized node - * moving a pod onto a node that meets more of the pod's affinity/anti-affinity preferences -* moving a running pod off of a node in anticipation of a known or speculated future event - * draining a node in preparation for maintenance, decommissioning, auto-scale-down, etc. 
- * "preempting" a running pod to make room for a pending pod to schedule - * proactively/speculatively make room for large and/or exclusive pods to facilitate - fast scheduling in the future (often called "defragmentation") - * (note that these last two cases are the only use cases where the first-order intent - is to move a pod specifically for the benefit of another pod) -* moving a running pod off of a node from which it is receiving poor service - * anomalous crashlooping or other mysterious incompatibility between the pod and the node - * repeated out-of-resource killing (see #18724) - * repeated attempts by the scheduler to schedule the pod onto some node, but it is - rejected by Kubelet admission control due to incomplete scheduler knowledge - * poor performance due to interference from other containers on the node (CPU hogs, - cache thrashers, etc.) (note that in this case there is a choice of moving the victim - or the aggressor) - -## Some axes of the design space - -Among the key design decisions are - -* how does a pod specify its tolerance for these system-generated disruptions, and how - does the system enforce such disruption limits -* for each use case, where is the decision made about when and which pods to reschedule - (controllers, schedulers, an entirely new component e.g. "rescheduler", etc.) -* rescheduler design issues: how much does a rescheduler need to know about pods' - schedulers' policies, how does the rescheduler specify its rescheduling - requests/decisions (e.g. just as an eviction, an eviction with a hint about where to - reschedule, or as an eviction paired with a specific binding), how does the system - implement these requests, does the rescheduler take into account the second-order - effects of decisions (e.g. whether an evicted pod will reschedule, will cause - a preemption when it reschedules, etc.), does the rescheduler execute multi-step plans - (e.g. evict two pods at the same time with the intent of moving one into the space - vacated by the other, or even more complex plans) - -Additional musings on the rescheduling design space can be found [here](rescheduler.md). - -## Design proposal - -The key mechanisms and components of the proposed design are priority, preemption, -disruption budgets, the `/evict` subresource, and the rescheduler. - -### Priority - -#### Motivation - - -Just as it is useful to overcommit nodes to increase node-level utilization, it is useful -to overcommit clusters to increase cluster-level utilization. Scheduling priority (which -we abbreviate as *priority*, in combination with disruption budgets (described in the -next section), allows Kubernetes to safely overcommit clusters much as QoS levels allow -it to safely overcommit nodes. - -Today, cluster sharing among users, workload types, etc. is regulated via the -[quota](../admin/resourcequota/README.md) mechanism. When allocating quota, a cluster -administrator has two choices: (1) the sum of the quotas is less than or equal to the -capacity of the cluster, or (2) the sum of the quotas is greater than the capacity of the -cluster (that is, the cluster is overcommitted). (1) is likely to lead to cluster -under-utilization, while (2) is unsafe in the sense that someone's pods may go pending -indefinitely even though they are still within their quota. Priority makes cluster -overcommitment (i.e. 
case (2)) safe by allowing users and/or administrators to identify -which pods should be allowed to run, and which should go pending, when demand for cluster -resources exceeds supply to due to cluster overcommitment. - -Priority is also useful in some special-case scenarios, such as ensuring that system -DaemonSets can always schedule and reschedule onto every node where they want to run -(assuming they are given the highest priority), e.g. see #21767. - -#### Specifying priorities - -We propose to add a required `Priority` field to `PodSpec`. Its value type is string, and -the cluster administrator defines a total ordering on these strings (for example -`Critical`, `Normal`, `Preemptible`). We choose string instead of integer so that it is -easy for an administrator to add new priority levels in between existing levels, to -encourage thinking about priority in terms of user intent and avoid magic numbers, and to -make the internal implementation more flexible. - -When a scheduler is scheduling a new pod P and cannot find any node that meets all of P's -scheduling predicates, it is allowed to evict ("preempt") one or more pods that are at -the same or lower priority than P (subject to disruption budgets, see next section) from -a node in order to make room for P, i.e. in order to make the scheduling predicates -satisfied for P on that node. (Note that when we add cluster-level resources (#19080), -it might be necessary to preempt from multiple nodes, but that scenario is outside the -scope of this document.) The preempted pod(s) may or may not be able to reschedule. The -net effect of this process is that when demand for cluster resources exceeds supply, the -higher-priority pods will be able to run while the lower-priority pods will be forced to -wait. The detailed mechanics of preemption are described in a later section. - -In addition to taking disruption budget into account, for equal-priority preemptions the -scheduler will try to enforce fairness (across victim controllers, services, etc.) - -Priorities could be specified directly by users in the podTemplate, or assigned by an -admission controller using -properties of the pod. Either way, all schedulers must be configured to understand the -same priorities (names and ordering). This could be done by making them constants in the -API, or using ConfigMap to configure the schedulers with the information. The advantage of -the former (at least making the names, if not the ordering, constants in the API) is that -it allows the API server to do validation (e.g. to catch mis-spelling). - -In the future, which priorities are usable for a given namespace and pods with certain -attributes may be configurable, similar to ResourceQuota, LimitRange, or security policy. - -Priority and resource QoS are independent. - -The priority we have described here might be used to prioritize the scheduling queue -(i.e. the order in which a scheduler examines pods in its scheduling loop), but the two -priority concepts do not have to be connected. It is somewhat logical to tie them -together, since a higher priority generally indicates that a pod is more urgent to get -running. Also, scheduling low-priority pods before high-priority pods might lead to -avoidable preemptions if the high-priority pods end up preempting the low-priority pods -that were just scheduled. - -TODO: Priority and preemption are global or namespace-relative? See -[this discussion thread](https://github.com/kubernetes/kubernetes/pull/22217#discussion_r55737389). 
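
As a rough illustration of the scheme above, the following Go sketch encodes an administrator-defined total ordering over priority names and the "same or lower priority" preemption rule. The priority names reuse the examples above; the map-based ranking is an assumption for illustration, not a proposed API.

```go
package main

import "fmt"

// The administrator-defined total ordering over priority names, e.g. loaded
// from configuration; the ranks here are illustrative assumptions.
var priorityRank = map[string]int{
	"Preemptible": 0,
	"Normal":      1,
	"Critical":    2,
}

// mayPreempt reports whether a pending pod at priority p may preempt a
// running pod at priority q under the rule above (same or lower priority),
// ignoring disruption budgets, which are checked separately.
func mayPreempt(p, q string) bool {
	return priorityRank[p] >= priorityRank[q]
}

func main() {
	fmt.Println(mayPreempt("Critical", "Normal"))    // true
	fmt.Println(mayPreempt("Preemptible", "Normal")) // false
	fmt.Println(mayPreempt("Normal", "Normal"))      // true: equal priority is allowed
}
```

Because the ordering is defined over strings rather than magic numbers, an administrator can insert a new level between existing ones by extending the configured ranking, which is the flexibility the proposal calls out.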
- -#### Relationship of priority to quota - -Of course, if the decision of what priority to give a pod is solely up to the user, then -users have no incentive to ever request any priority less than the maximum. Thus -priority is intimately related to quota, in the sense that resource quotas must be -allocated on a per-priority-level basis (X amount of RAM at priority A, Y amount of RAM -at priority B, etc.). The "guarantee" that highest-priority pods will always be able to -schedule can only be achieved if the sum of the quotas at the top priority level is less -than or equal to the cluster capacity. This is analogous to QoS, where safety can only be -achieved if the sum of the limits of the top QoS level ("Guaranteed") is less than or -equal to the node capacity. In terms of incentives, an organization could "charge" -an amount proportional to the priority of the resources. - -The topic of how to allocate quota at different priority levels to achieve a desired -balance between utilization and probability of schedulability is an extremely complex -topic that is outside the scope of this document. For example, resource fragmentation and -RequiredDuringScheduling node and pod affinity and anti-affinity means that even if the -sum of the quotas at the top priority level is less than or equal to the total aggregate -capacity of the cluster, some pods at the top priority level might still go pending. In -general, priority provides a *probabilistic* guarantees of pod schedulability in the face -of overcommitment, by allowing prioritization of which pods should be allowed to run pods -when demand for cluster resources exceeds supply. - -### Disruption budget - -While priority can protect pods from one source of disruption (preemption by a -lower-priority pod), *disruption budgets* limit disruptions from all Kubernetes-initiated -causes, including preemption by an equal or higher-priority pod, or being evicted to -achieve other rescheduling goals. In particular, each pod is optionally associated with a -"disruption budget," a new API resource that limits Kubernetes-initiated terminations -across a set of pods (e.g. the pods of a particular Service might all point to the same -disruption budget object), regardless of cause. Initially we expect disruption budget -(e.g. `DisruptionBudgetSpec`) to consist of - -* a rate limit on disruptions (preemption and other evictions) across the corresponding - set of pods, e.g. no more than one disruption per hour across the pods of a particular Service -* a minimum number of pods that must be up simultaneously (sometimes called "shard - strength") (of course this can also be expressed as the inverse, i.e. the number of - pods of the collection that can be down simultaneously) - -The second item merits a bit more explanation. One use case is to specify a quorum size, -e.g. to ensure that at least 3 replicas of a quorum-based service with 5 replicas are up -at the same time. In practice, a service should ideally create enough replicas to survive -at least one planned and one unplanned outage. So in our quorum example, we would specify -that at least 4 replicas must be up at the same time; this allows for one intentional -disruption (bringing the number of live replicas down from 5 to 4 and consuming one unit -of shard strength budget) and one unplanned disruption (bringing the number of live -replicas down from 4 to 3) while still maintaining a quorum. 
Shard strength is also -useful for simpler replicated services; for example, you might not want more than 10% of -your front-ends to be down at the same time, so as to avoid overloading the remaining -replicas. - -Initially, disruption budgets will be specified by the user. Thus as with priority, -disruption budgets need to be tied into quota, to prevent users from saying none of their -pods can ever be disrupted. The exact way of expressing and enforcing this quota is TBD, -though a simple starting point would be to have an admission controller assign a default -disruption budget based on priority level (more liberal with decreasing priority). -We also likely need a quota that applies to Kubernetes *components*, to the limit the rate -at which any one component is allowed to consume disruption budget. - -Of course there should also be a `DisruptionBudgetStatus` that indicates the current -disruption rate that the collection of pods is experiencing, and the number of pods that -are up. - -For the purposes of disruption budget, a pod is considered to be disrupted as soon as its -graceful termination period starts. - -A pod that is not covered by a disruption budget but is managed by a controller, -gets an implicit disruption budget of infinite (though the system should try to not -unduly victimize such pods). How a pod that is not managed by a controller is -handled is TBD. - -TBD: In addition to `PodSpec`, where do we store pointer to disruption budget -(podTemplate in controller that managed the pod?)? Do we auto-generate a disruption -budget (e.g. when instantiating a Service), or require the user to create it manually -before they create a controller? Which objects should return the disruption budget object -as part of the output on `kubectl get` other than (obviously) `kubectl get` for the -disruption budget itself? - -TODO: Clean up distinction between "down due to voluntary action taken by Kubernetes" -and "down due to unplanned outage" in spec and status. - -For now, there is nothing to prevent clients from circumventing the disruption budget -protections. Of course, clients that do this are not being "good citizens." In the next -section we describe a mechanism that at least makes it easy for well-behaved clients to -obey the disruption budgets. - -See #12611 for additional discussion of disruption budgets. - -### /evict subresource and PreferAvoidPods - -Although we could put the responsibility for checking and updating disruption budgets -solely on the client, it is safer and more convenient if we implement that functionality -in the API server. Thus we will introduce a new `/evict` subresource on pod. It is similar to -today's "delete" on pod except - - * It will be rejected if the deletion would violate disruption budget. (See how - Deployment handles failure of /rollback for ideas on how clients could handle failure - of `/evict`.) There are two possible ways to implement this: - - * For the initial implementation, this will be accomplished by the API server just - looking at the `DisruptionBudgetStatus` and seeing if the disruption would violate the - `DisruptionBudgetSpec`. In this approach, we assume a disruption budget controller - keeps the `DisruptionBudgetStatus` up-to-date by observing all pod deletions and - creations in the cluster, so that an approved disruption is quickly reflected in the - `DisruptionBudgetStatus`. 
Of course this approach does allow a race in which one or - more additional disruptions could be approved before the first one is reflected in the - `DisruptionBudgetStatus`. - - * Thus a subsequent implementation will have the API server explicitly debit the - `DisruptionBudgetStatus` when it accepts an `/evict`. (There still needs to be a - controller, to keep the shard strength status up-to-date when replacement pods are - created after an eviction; the controller may also be necessary for the rate status - depending on how rate is represented, e.g. adding tokens to a bucket at a fixed rate.) - Once etcd support multi-object transactions (etcd v3), the debit and pod deletion will - be placed in the same transaction. - - * Note: For the purposes of disruption budget, a pod is considered to be disrupted as soon as its - graceful termination period starts (so when we say "delete" here we do not mean - "deleted from etcd" but rather "graceful termination period has started"). - - * It will allow clients to communicate additional parameters when they wish to delete a - pod. (In the absence of the `/evict` subresource, we would have to create a pod-specific - type analogous to `api.DeleteOptions`.) - -We will make `kubectl delete pod` use `/evict` by default, and require a command-line -flag to delete the pod directly. - -We will add to `NodeStatus` a bounded-sized list of signatures of pods that should avoid -that node (provisionally called `PreferAvoidPods`). One of the pieces of information -specified in the `/evict` subresource is whether the eviction should add the evicted -pod's signature to the corresponding node's `PreferAvoidPods`. Initially the pod -signature will be a -[controllerRef](https://github.com/kubernetes/kubernetes/issues/14961#issuecomment-183431648), -i.e. a reference to the pod's controller. Controllers are responsible for garbage -collecting, after some period of time, `PreferAvoidPods` entries that point to them, but the API -server will also enforce a bounded size on the list. All schedulers will have a -highest-weighted priority function that gives a node the worst priority if the pod it is -scheduling appears in that node's `PreferAvoidPods` list. Thus appearing in -`PreferAvoidPods` is similar to -[RequiredDuringScheduling node anti-affinity](../../docs/user-guide/node-selection/README.md) -but it takes precedence over all other priority criteria and is not explicitly listed in -the `NodeAffinity` of the pod. - -`PreferAvoidPods` is useful for the "moving a running pod off of a node from which it is -receiving poor service" use case, as it reduces the chance that the replacement pod will -end up on the same node (keep in mind that most of those cases are situations that the -scheduler does not have explicit priority functions for, for example it cannot know in -advance that a pod will be starved). Also, though we do not intend to implement any such -policies in the first version of the rescheduler, it is useful whenever the rescheduler evicts -two pods A and B with the intention of moving A into the space vacated by B (it prevents -B from rescheduling back into the space it vacated before A's scheduler has a chance to -reschedule A there). Note that these two uses are subtly different; in the first -case we want the avoidance to last a relatively long time, whereas in the second case we -may only need it to last until A schedules. - -See #20699 for more discussion. 
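The client-side mechanics of such an eviction call are easy to sketch. The example below is a minimal illustration, assuming the subresource as it eventually shipped (`pods/<name>/eviction`, driven by a policy-group `Eviction` object) and a recent client-go that exposes the `EvictV1` helper; the pod name, namespace, and kubeconfig path are placeholders.

```go
// Illustrative sketch only: the /evict subresource proposed above eventually
// shipped as the "pods/<name>/eviction" subresource. This shows how a client
// might request a budget-respecting eviction with client-go (assumes a recent
// client-go exposing EvictV1; names and the kubeconfig path are placeholders).
package main

import (
	"context"
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a clientset from a local kubeconfig (placeholder path).
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Ask the API server to evict the pod; in the shipped API the request is
	// rejected (HTTP 429) if it would violate the pod's disruption budget.
	eviction := &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: "example-pod", Namespace: "default"},
	}
	if err := client.CoreV1().Pods("default").EvictV1(context.TODO(), eviction); err != nil {
		fmt.Println("eviction refused or failed:", err)
		return
	}
	fmt.Println("eviction accepted; graceful termination has started")
}
```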
- -### Preemption mechanics - -**NOTE: We expect a fuller design doc to be written on preemption before it is implemented. -However, a sketch of some ideas are presented here, since preemption is closely related to the -concepts discussed in this doc.** - -Pod schedulers will decide and enact preemptions, subject to the priority and disruption -budget rules described earlier. (Though note that we currently do not have any mechanism -to prevent schedulers from bypassing either the priority or disruption budget rules.) -The scheduler does not concern itself with whether the evicted pod(s) can reschedule. The -eviction(s) use(s) the `/evict` subresource so that it is subject to the disruption -budget(s) of the victim(s), but it does not request to add the victim pod(s) to the -nodes' `PreferAvoidPods`. - -Evicting victim(s) and binding the pending pod that the evictions are intended to enable -to schedule, are not transactional. We expect the scheduler to issue the operations in -sequence, but it is still possible that another scheduler could schedule its pod in -between the eviction(s) and the binding, or that the set of pods running on the node in -question changed between the time the scheduler made its decision and the time it sent -the operations to the API server thereby causing the eviction(s) to be not sufficient to get the -pending pod to schedule. In general there are a number of race conditions that cannot be -avoided without (1) making the evictions and binding be part of a single transaction, and -(2) making the binding preconditioned on a version number that is associated with the -node and is incremented on every binding. We may or may not implement those mechanisms in -the future. - -Given a choice between a node where scheduling a pod requires preemption and one where it -does not, all other things being equal, a scheduler should choose the one where -preemption is not required. (TBD: Also, if the selected node does require preemption, the -scheduler should preempt lower-priority pods before higher-priority pods (e.g. if the -scheduler needs to free up 4 GB of RAM, and the node has two 2 GB low-priority pods and -one 4 GB high-priority pod, all of which have sufficient disruption budget, it should -preempt the two low-priority pods). This is debatable, since all have sufficient -disruption budget. But still better to err on the side of giving better disruption SLO to -higher-priority pods when possible?) - -Preemption victims must be given their termination grace period. One possible sequence -of events is - -1. The API server binds the preemptor to the node (i.e. sets `nodeName` on the -preempting pod) and sets `deletionTimestamp` on the victims -2. Kubelet sees that `deletionTimestamp` has been set on the victims; they enter their -graceful termination period -3. Kubelet sees the preempting pod. It runs the admission checks on the new pod -assuming all pods that are in their graceful termination period are gone and that -all pods that are in the waiting state (see (4)) are running. -4. If (3) fails, then the new pod is rejected. If (3) passes, then Kubelet holds the -new pod in a waiting state, and does not run it until the pod passes the -admission checks using the set of actually running pods. - -Note that there are a lot of details to be figured out here; above is just a very -hand-wavy sketch of one general approach that might work. - -See #22212 for additional discussion. 
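To make the victim-selection preference in the parenthetical above concrete, here is a purely illustrative sketch that greedily preempts lower-priority pods first until the needed memory is freed. The `candidate` type and `pickVictims` helper are invented for this example and are not the scheduler's actual data structures or algorithm.

```go
// Purely illustrative sketch of the victim-selection preference discussed
// above: free the requested amount of memory by preempting lower-priority
// pods first. Types and field names are invented for this example.
package main

import (
	"fmt"
	"sort"
)

type candidate struct {
	Name     string
	Priority int   // higher means more important
	MemBytes int64 // memory request that evicting the pod would free
}

// pickVictims returns the pods to preempt, preferring low-priority pods and
// stopping once enough memory is freed; it returns nil if the target cannot
// be reached even by evicting every candidate.
func pickVictims(pods []candidate, neededBytes int64) []candidate {
	sorted := append([]candidate(nil), pods...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Priority < sorted[j].Priority })

	var victims []candidate
	var freed int64
	for _, p := range sorted {
		if freed >= neededBytes {
			break
		}
		victims = append(victims, p)
		freed += p.MemBytes
	}
	if freed < neededBytes {
		return nil
	}
	return victims
}

func main() {
	gb := int64(1 << 30)
	pods := []candidate{
		{"low-a", 1, 2 * gb},
		{"low-b", 1, 2 * gb},
		{"high-c", 100, 4 * gb},
	}
	// Needing 4 GiB, the two low-priority pods are chosen before the
	// high-priority one, matching the example in the text.
	fmt.Println(pickVictims(pods, 4*gb))
}
```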
- -### Node drain - -Node drain will be handled by one or more components not described in this document. They -will respect disruption budgets. Initially, we will just make `kubectl drain` -respect disruption budgets. See #17393 for other discussion. - -### Rescheduler - -All rescheduling other than preemption and node drain will be decided and enacted by a -new component called the *rescheduler*. It runs continuously in the background, looking -for opportunities to move pods to better locations. It acts when the degree of -improvement meets some threshold and is allowed by the pod's disruption budget. The -action is eviction of a pod using the `/evict` subresource, with the pod's signature -enqueued in the node's `PreferAvoidPods`. It does not force the pod to reschedule to any -particular node. Thus it is really an "unscheduler"; only in combination with the evicted -pod's scheduler, which schedules the replacement pod, do we get true "rescheduling." See -the "Example use cases" section earlier for some example use cases. - -The rescheduler is a best-effort service that makes no guarantees about how quickly (or -whether) it will resolve a suboptimal pod placement. - -The first version of the rescheduler will not take into consideration where or whether an -evicted pod will reschedule. The evicted pod may go pending, consuming one unit of the -corresponding shard strength disruption budget by one indefinitely. By using the `/evict` -subresource, the rescheduler ensures that an evicted pod has sufficient budget for the -evicted pod to go and stay pending. We expect future versions of the rescheduler may be -linked with the "mandatory" predicate functions (currently, the ones that constitute the -Kubelet admission criteria), and will only evict if the rescheduler determines that the -pod can reschedule somewhere according to those criteria. (Note that this still does not -guarantee that the pod actually will be able to reschedule, for at least two reasons: (1) -the state of the cluster may change between the time the rescheduler evaluates it and -when the evicted pod's scheduler tries to schedule the replacement pod, and (2) the -evicted pod's scheduler may have additional predicate functions in addition to the -mandatory ones). - -(Note: see [this comment](https://github.com/kubernetes/kubernetes/pull/22217#discussion_r54527968)). - -The first version of the rescheduler will only implement two objectives: moving a pod -onto an under-utilized node, and moving a pod onto a node that meets more of the pod's -affinity/anti-affinity preferences than wherever it is currently running. (We assume that -nodes that are intentionally under-utilized, e.g. because they are being drained, are -marked unschedulable, thus the first objective will not cause the rescheduler to "fight" -a system that is draining nodes.) We assume that all schedulers sufficiently weight the -priority functions for affinity/anti-affinity and avoiding very packed nodes, -otherwise evicted pods may not actually move onto a node that is better according to -the criteria that caused it to be evicted. (But note that in all cases it will move to a -node that is better according to the totality of its scheduler's priority functions, -except in the case where the node where it was already running was the only node -where it can run.) 
As a general rule, the rescheduler should only act when it sees -particularly bad situations, since (1) an eviction for a marginal improvement is likely -not worth the disruption--just because there is sufficient budget for an eviction doesn't -mean an eviction is painless to the application, and (2) rescheduling the pod might not -actually mitigate the identified problem if it is minor enough that other scheduling -factors dominate the decision of where the replacement pod is scheduled. - -We assume schedulers' priority functions are at least vaguely aligned with the -rescheduler's policies; otherwise the rescheduler will never accomplish anything useful, -given that it relies on the schedulers to actually reschedule the evicted pods. (Even if -the rescheduler acted as a scheduler, explicitly rebinding evicted pods, we'd still want -this to be true, to prevent the schedulers and rescheduler from "fighting" one another.) - -The rescheduler will be configured using ConfigMap; the cluster administrator can enable -or disable policies and can tune the rescheduler's aggressiveness (aggressive means it -will use a relatively low threshold for triggering an eviction and may consume a lot of -disruption budget, while non-aggressive means it will use a relatively high threshold for -triggering an eviction and will try to leave plenty of buffer in disruption budgets). The -first version of the rescheduler will not be extensible or pluggable, since we want to -keep the code simple while we gain experience with the overall concept. In the future, we -anticipate a version that will be extensible and pluggable. - -We might want some way to force the evicted pod to the front of the scheduler queue, -independently of its priority. - -See #12140 for additional discussion. - -### Final comments - -In general, the design space for this topic is huge. This document describes some of the -design considerations and proposes one particular initial implementation. We expect -certain aspects of the design to be "permanent" (e.g. the notion and use of priorities, -preemption, disruption budgets, and the `/evict` subresource) while others may change over time -(e.g. the partitioning of functionality between schedulers, controllers, rescheduler, -horizontal pod autoscaler, and cluster autoscaler; the policies the rescheduler implements; -the factors the rescheduler takes into account when making decisions (e.g. knowledge of -schedulers' predicate and priority functions, second-order effects like whether and where -evicted pod will be able to reschedule, etc.); the way the rescheduler enacts its -decisions; and the complexity of the plans the rescheduler attempts to implement). - -## Implementation plan - -The highest-priority feature to implement is the rescheduler with the two use cases -highlighted earlier: moving a pod onto an under-utilized node, and moving a pod onto a -node that meets more of the pod's affinity/anti-affinity preferences. The former is -useful to rebalance pods after cluster auto-scale-up, and the latter is useful for -Ubernetes. This requires implementing disruption budgets and the `/evict` subresource, -but not priority or preemption. 
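For reference, the disruption budget sketched earlier eventually shipped in reduced form as the `PodDisruptionBudget` API, which covers the "shard strength" minimum but not the disruption-rate limit. A minimal manifest for the 5-replica quorum service discussed above might look like the following; the names and labels are placeholders.

```yaml
# Roughly how the "shard strength" half of the proposed disruption budget looks
# in the API that eventually shipped (policy/v1 PodDisruptionBudget); the rate
# limit described above was not carried over. Names and labels are placeholders.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: quorum-service-pdb
spec:
  minAvailable: 4          # of 5 replicas, mirroring the quorum example above
  selector:
    matchLabels:
      app: quorum-service
```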
- -Because the general topic of rescheduling is very speculative, we have intentionally -proposed that the first version of the rescheduler be very simple -- only uses eviction -(no attempt to guide replacement pod to any particular node), doesn't know schedulers' -predicate or priority functions, doesn't try to move two pods at the same time, and only -implements two use cases. As alluded to in the previous subsection, we expect the design -and implementation to evolve over time, and we encourage members of the community to -experiment with more sophisticated policies and to report their results from using them -on real workloads. - -## Alternative implementations - -TODO. - -## Additional references - -TODO. - -TODO: Add reference to this doc from docs/proposals/rescheduler.md +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/scheduling/resources.md b/contributors/design-proposals/scheduling/resources.md index 356f57e7..f0fbec72 100644 --- a/contributors/design-proposals/scheduling/resources.md +++ b/contributors/design-proposals/scheduling/resources.md @@ -1,366 +1,6 @@ -**Note: this is a design doc, which describes features that have not been -completely implemented. User documentation of the current state is -[here](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/). The tracking issue for -implementation of this model is [#168](http://issue.k8s.io/168). Currently, both -limits and requests of memory and cpu on containers (not pods) are supported. -"memory" is in bytes and "cpu" is in milli-cores.** +Design proposals have been archived. -# The Kubernetes resource model +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -To do good pod placement, Kubernetes needs to know how big pods are, as well as -the sizes of the nodes onto which they are being placed. The definition of "how -big" is given by the Kubernetes resource model — the subject of this -document. - -The resource model aims to be: -* simple, for common cases; -* extensible, to accommodate future growth; -* regular, with few special cases; and -* precise, to avoid misunderstandings and promote pod portability. - -## The resource model - -A Kubernetes _resource_ is something that can be requested by, allocated to, or -consumed by a pod or container. Examples include memory (RAM), CPU, disk-time, -and network bandwidth. - -Once resources on a node have been allocated to one pod, they should not be -allocated to another until that pod is removed or exits. This means that -Kubernetes schedulers should ensure that the sum of the resources allocated -(requested and granted) to its pods never exceeds the usable capacity of the -node. Testing whether a pod will fit on a node is called _feasibility checking_. - -Note that the resource model currently prohibits over-committing resources; we -will want to relax that restriction later. - -### Resource types - -All resources have a _type_ that is identified by their _typename_ (a string, -e.g., "memory"). Several resource types are predefined by Kubernetes (a full -list is below), although only two will be supported at first: CPU and memory. -Users and system administrators can define their own resource types if they wish -(e.g., Hadoop slots). - -A fully-qualified resource typename is constructed from a DNS-style _subdomain_, -followed by a slash `/`, followed by a name. -* The subdomain must conform to [RFC 1123](http://www.ietf.org/rfc/rfc1123.txt) -(e.g., `kubernetes.io`, `example.com`). -* The name must be not more than 63 characters, consisting of upper- or -lower-case alphanumeric characters, with the `-`, `_`, and `.` characters -allowed anywhere except the first or last character. -* As a shorthand, any resource typename that does not start with a subdomain and -a slash will automatically be prefixed with the built-in Kubernetes _namespace_, -`kubernetes.io/` in order to fully-qualify it. This namespace is reserved for -code in the open source Kubernetes repository; as a result, all user typenames -MUST be fully qualified, and cannot be created in this namespace. - -Some example typenames include `memory` (which will be fully-qualified as -`kubernetes.io/memory`), and `example.com/Shiny_New-Resource.Type`. 
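The qualification and validation rules above are simple enough to sketch directly. The following standalone example (not the actual validation code) defaults bare names into the `kubernetes.io/` namespace and checks the name part against the length and character rules; subdomain validation is omitted for brevity.

```go
// Illustrative sketch of the typename rules described above: names without a
// subdomain are qualified into the kubernetes.io/ namespace, and the name part
// must be at most 63 alphanumeric characters, with '-', '_', and '.' allowed
// anywhere except the first or last position. Standalone example only.
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var namePattern = regexp.MustCompile(`^[A-Za-z0-9]([A-Za-z0-9._-]{0,61}[A-Za-z0-9])?$`)

// qualify returns the fully-qualified typename and whether the name part is valid.
func qualify(typename string) (string, bool) {
	full := typename
	if !strings.Contains(typename, "/") {
		full = "kubernetes.io/" + typename
	}
	name := full[strings.LastIndex(full, "/")+1:]
	return full, namePattern.MatchString(name)
}

func main() {
	for _, t := range []string{"memory", "example.com/Shiny_New-Resource.Type", "example.com/-bad"} {
		full, ok := qualify(t)
		fmt.Printf("%-40s -> %-45s valid=%v\n", t, full, ok)
	}
}
```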
- -For future reference, note that some resources, such as CPU and network -bandwidth, are _compressible_, which means that their usage can potentially be -throttled in a relatively benign manner. All other resources are -_incompressible_, which means that any attempt to throttle them is likely to -cause grief. This distinction will be important if a Kubernetes implementation -supports over-committing of resources. - -### Resource quantities - -Initially, all Kubernetes resource types are _quantitative_, and have an -associated _unit_ for quantities of the associated resource (e.g., bytes for -memory, bytes per seconds for bandwidth, instances for software licences). The -units will always be a resource type's natural base units (e.g., bytes, not MB), -to avoid confusion between binary and decimal multipliers and the underlying -unit multiplier (e.g., is memory measured in MiB, MB, or GB?). - -Resource quantities can be added and subtracted: for example, a node has a fixed -quantity of each resource type that can be allocated to pods/containers; once -such an allocation has been made, the allocated resources cannot be made -available to other pods/containers without over-committing the resources. - -To make life easier for people, quantities can be represented externally as -unadorned integers, or as fixed-point integers with one of these SI suffices -(E, P, T, G, M, K, m) or their power-of-two equivalents (Ei, Pi, Ti, Gi, Mi, - Ki). For example, the following represent roughly the same value: 128974848, -"129e6", "129M" , "123Mi". Small quantities can be represented directly as -decimals (e.g., 0.3), or using milli-units (e.g., "300m"). - * "Externally" means in user interfaces, reports, graphs, and in JSON or YAML -resource specifications that might be generated or read by people. - * Case is significant: "m" and "M" are not the same, so "k" is not a valid SI -suffix. There are no power-of-two equivalents for SI suffixes that represent -multipliers less than 1. - * These conventions only apply to resource quantities, not arbitrary values. - -Internally (i.e., everywhere else), Kubernetes will represent resource -quantities as integers so it can avoid problems with rounding errors, and will -not use strings to represent numeric values. To achieve this, quantities that -naturally have fractional parts (e.g., CPU seconds/second) will be scaled to -integral numbers of milli-units (e.g., milli-CPUs) as soon as they are read in. -Internal APIs, data structures, and protobufs will use these scaled integer -units. Raw measurement data such as usage may still need to be tracked and -calculated using floating point values, but internally they should be rescaled -to avoid some values being in milli-units and some not. - * Note that reading in a resource quantity and writing it out again may change -the way its values are represented, and truncate precision (e.g., 1.0001 may -become 1.000), so comparison and difference operations (e.g., by an updater) -must be done on the internal representations. - * Avoiding milli-units in external representations has advantages for people -who will use Kubernetes, but runs the risk of developers forgetting to rescale -or accidentally using floating-point representations. That seems like the right -choice. We will try to reduce the risk by providing libraries that automatically -do the quantization for JSON/YAML inputs. 
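As a concrete illustration of these conventions, the sketch below uses the `resource.Quantity` type from `k8s.io/apimachinery`, the library that ended up implementing this style of parsing (current behavior assumed). Note how `123Mi` resolves to exactly 128974848 bytes, matching the example above, while `300m` of CPU is held internally as 300 milli-units.

```go
// Sketch of the external-vs-internal convention described above, using the
// resource.Quantity type from k8s.io/apimachinery (current library behavior
// assumed). "129M" is a decimal SI suffix, "123Mi" is the power-of-two
// equivalent, and "300m" expresses a fractional CPU in milli-units.
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	mem := resource.MustParse("129M")     // decimal SI: 129 * 10^6 bytes
	memBin := resource.MustParse("123Mi") // binary: 123 * 2^20 bytes
	cpu := resource.MustParse("300m")     // 0.3 CPU, stored internally as milli-units

	fmt.Println(mem.Value())      // 129000000
	fmt.Println(memBin.Value())   // 128974848
	fmt.Println(cpu.MilliValue()) // 300
}
```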
- -### Resource specifications - -Both users and a number of system components, such as schedulers, (horizontal) -auto-scalers, (vertical) auto-sizers, load balancers, and worker-pool managers -need to reason about resource requirements of workloads, resource capacities of -nodes, and resource usage. Kubernetes divides specifications of *desired state*, -aka the Spec, and representations of *current state*, aka the Status. Resource -requirements and total node capacity fall into the specification category, while -resource usage, characterizations derived from usage (e.g., maximum usage, -histograms), and other resource demand signals (e.g., CPU load) clearly fall -into the status category and are discussed in the Appendix for now. - -Resource requirements for a container or pod should have the following form: - -```yaml -resourceRequirementSpec: [ - request: [ cpu: 2.5, memory: "40Mi" ], - limit: [ cpu: 4.0, memory: "99Mi" ], -] -``` - -Where: -* _request_ [optional]: the amount of resources being requested, or that were -requested and have been allocated. Scheduler algorithms will use these -quantities to test feasibility (whether a pod will fit onto a node). -If a container (or pod) tries to use more resources than its _request_, any -associated SLOs are voided — e.g., the program it is running may be -throttled (compressible resource types), or the attempt may be denied. If -_request_ is omitted for a container, it defaults to _limit_ if that is -explicitly specified, otherwise to an implementation-defined value; this will -always be 0 for a user-defined resource type. If _request_ is omitted for a pod, -it defaults to the sum of the (explicit or implicit) _request_ values for the -containers it encloses. - -* _limit_ [optional]: an upper bound or cap on the maximum amount of resources -that will be made available to a container or pod; if a container or pod uses -more resources than its _limit_, it may be terminated. The _limit_ defaults to -"unbounded"; in practice, this probably means the capacity of an enclosing -container, pod, or node, but may result in non-deterministic behavior, -especially for memory. - -Total capacity for a node should have a similar structure: - -```yaml -resourceCapacitySpec: [ - total: [ cpu: 12, memory: "128Gi" ] -] -``` - -Where: -* _total_: the total allocatable resources of a node. Initially, the resources -at a given scope will bound the resources of the sum of inner scopes. - -#### Notes - - * It is an error to specify the same resource type more than once in each -list. - - * It is an error for the _request_ or _limit_ values for a pod to be less than -the sum of the (explicit or defaulted) values for the containers it encloses. -(We may relax this later.) - - * If multiple pods are running on the same node and attempting to use more -resources than they have requested, the result is implementation-defined. For -example: unallocated or unused resources might be spread equally across -claimants, or the assignment might be weighted by the size of the original -request, or as a function of limits, or priority, or the phase of the moon, -perhaps modulated by the direction of the tide. Thus, although it's not -mandatory to provide a _request_, it's probably a good idea. (Note that the -_request_ could be filled in by an automated system that is observing actual -usage and/or historical data.) - - * Internally, the Kubernetes master can decide the defaulting behavior and the -kubelet implementation may expected an absolute specification. 
For example, if -the master decided that "the default is unbounded" it would pass 2^64 to the -kubelet. - - -## Kubernetes-defined resource types - -The following resource types are predefined ("reserved") by Kubernetes in the -`kubernetes.io` namespace, and so cannot be used for user-defined resources. -Note that the syntax of all resource types in the resource spec is deliberately -similar, but some resource types (e.g., CPU) may receive significantly more -support than simply tracking quantities in the schedulers and/or the Kubelet. - -### Processor cycles - - * Name: `cpu` (or `kubernetes.io/cpu`) - * Units: Kubernetes Compute Unit seconds/second (i.e., CPU cores normalized to -a canonical "Kubernetes CPU") - * Internal representation: milli-KCUs - * Compressible? yes - * Qualities: this is a placeholder for the kind of thing that may be supported -in the future — see [#147](http://issue.k8s.io/147) - * [future] `schedulingLatency`: as per lmctfy - * [future] `cpuConversionFactor`: property of a node: the speed of a CPU -core on the node's processor divided by the speed of the canonical Kubernetes -CPU (a floating point value; default = 1.0). - -To reduce performance portability problems for pods, and to avoid worse-case -provisioning behavior, the units of CPU will be normalized to a canonical -"Kubernetes Compute Unit" (KCU, pronounced ˈko͝oko͞o), which will roughly be -equivalent to a single CPU hyperthreaded core for some recent x86 processor. The -normalization may be implementation-defined, although some reasonable defaults -will be provided in the open-source Kubernetes code. - -Note that requesting 2 KCU won't guarantee that precisely 2 physical cores will -be allocated — control of aspects like this will be handled by resource -_qualities_ (a future feature). - - -### Memory - - * Name: `memory` (or `kubernetes.io/memory`) - * Units: bytes - * Compressible? no (at least initially) - -The precise meaning of what "memory" means is implementation dependent, but the -basic idea is to rely on the underlying `memcg` mechanisms, support, and -definitions. - -Note that most people will want to use power-of-two suffixes (Mi, Gi) for memory -quantities rather than decimal ones: "64MiB" rather than "64MB". - - -## Resource metadata - -A resource type may have an associated read-only ResourceType structure, that -contains metadata about the type. For example: - -```yaml -resourceTypes: [ - "kubernetes.io/memory": [ - isCompressible: false, ... - ] - "kubernetes.io/cpu": [ - isCompressible: true, - internalScaleExponent: 3, ... - ] - "kubernetes.io/disk-space": [ ... ] -] -``` - -Kubernetes will provide ResourceType metadata for its predefined types. If no -resource metadata can be found for a resource type, Kubernetes will assume that -it is a quantified, incompressible resource that is not specified in -milli-units, and has no default value. - -The defined properties are as follows: - -| field name | type | contents | -| ---------- | ---- | -------- | -| name | string, required | the typename, as a fully-qualified string (e.g., `kubernetes.io/cpu`) | -| internalScaleExponent | int, default=0 | external values are multiplied by 10 to this power for internal storage (e.g., 3 for milli-units) | -| units | string, required | format: `unit* [per unit+]` (e.g., `second`, `byte per second`). An empty unit field means "dimensionless". 
| -| isCompressible | bool, default=false | true if the resource type is compressible | -| defaultRequest | string, default=none | in the same format as a user-supplied value | -| _[future]_ quantization | number, default=1 | smallest granularity of allocation: requests may be rounded up to a multiple of this unit; implementation-defined unit (e.g., the page size for RAM). | - - -# Appendix: future extensions - -The following are planned future extensions to the resource model, included here -to encourage comments. - -## Usage data - -Because resource usage and related metrics change continuously, need to be -tracked over time (i.e., historically), can be characterized in a variety of -ways, and are fairly voluminous, we will not include usage in core API objects, -such as [Pods](https://kubernetes.io/docs/concepts/workloads/pods/pod/) and Nodes, but will provide separate APIs -for accessing and managing that data. See the Appendix for possible -representations of usage data, but the representation we'll use is TBD. - -Singleton values for observed and predicted future usage will rapidly prove -inadequate, so we will support the following structure for extended usage -information: - -```yaml -resourceStatus: [ - usage: [ cpu: <CPU-info>, memory: <memory-info> ], - maxusage: [ cpu: <CPU-info>, memory: <memory-info> ], - predicted: [ cpu: <CPU-info>, memory: <memory-info> ], -] -``` - -where a `<CPU-info>` or `<memory-info>` structure looks like this: - -```yaml -{ - mean: <value> # arithmetic mean - max: <value> # maximum value - min: <value> # minimum value - count: <value> # number of data points - percentiles: [ # map from %iles to values - "10": <10th-percentile-value>, - "50": <median-value>, - "99": <99th-percentile-value>, - "99.9": <99.9th-percentile-value>, - ... - ] -} -``` - -All parts of this structure are optional, although we strongly encourage -including quantities for 50, 90, 95, 99, 99.5, and 99.9 percentiles. -_[In practice, it will be important to include additional info such as the -length of the time window over which the averages are calculated, the -confidence level, and information-quality metrics such as the number of dropped -or discarded data points.]_ and predicted - -## Future resource types - -### _[future] Network bandwidth_ - - * Name: "network-bandwidth" (or `kubernetes.io/network-bandwidth`) - * Units: bytes per second - * Compressible? yes - -### _[future] Network operations_ - - * Name: "network-iops" (or `kubernetes.io/network-iops`) - * Units: operations (messages) per second - * Compressible? yes - -### _[future] Storage space_ - - * Name: "storage-space" (or `kubernetes.io/storage-space`) - * Units: bytes - * Compressible? no - -The amount of secondary storage space available to a container. The main target -is local disk drives and SSDs, although this could also be used to qualify -remotely-mounted volumes. Specifying whether a resource is a raw disk, an SSD, a -disk array, or a file system fronting any of these, is left for future work. - -### _[future] Storage time_ - - * Name: storage-time (or `kubernetes.io/storage-time`) - * Units: seconds per second of disk time - * Internal representation: milli-units - * Compressible? yes - -This is the amount of time a container spends accessing disk, including actuator -and transfer time. A standard disk drive provides 1.0 diskTime seconds per -second. - -### _[future] Storage operations_ - - * Name: "storage-iops" (or `kubernetes.io/storage-iops`) - * Units: operations per second - * Compressible? 
yes +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/scheduling/schedule-DS-pod-by-scheduler.md b/contributors/design-proposals/scheduling/schedule-DS-pod-by-scheduler.md index c7038eac..f0fbec72 100644 --- a/contributors/design-proposals/scheduling/schedule-DS-pod-by-scheduler.md +++ b/contributors/design-proposals/scheduling/schedule-DS-pod-by-scheduler.md @@ -1,70 +1,6 @@ -# Schedule DaemonSet Pods by default scheduler, not DaemonSet controller +Design proposals have been archived. -[@k82cn](http://github.com/k82cn), Feb 2018, [#42002](https://github.com/kubernetes/kubernetes/issues/42002). +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Motivation -A DaemonSet ensures that all (or some) nodes run a copy of a pod. As nodes are added to the cluster, pods are added to them. As nodes are removed from the cluster, those pods are garbage collected. Normally, the machine that a pod runs on is selected by the Kubernetes scheduler; however, pods of DaemonSet are created and scheduled by DaemonSet controller who leveraged kube-scheduler’s predicates policy. That introduces the following issues: - -* DaemonSet can not respect Node’s resource changes, e.g. more resources after other Pods exit ([#46935](https://github.com/kubernetes/kubernetes/issues/46935), [#58868](https://github.com/kubernetes/kubernetes/issues/58868)) -* DaemonSet can not respect Pod Affinity and Pod AntiAffinity ([#29276](https://github.com/kubernetes/kubernetes/issues/29276)) -* Duplicated logic to respect scheduler features, e.g. critical pods ([#42028](https://github.com/kubernetes/kubernetes/issues/42028)), tolerant/taint -* Hard to debug why DaemonSet’s Pod is not created, e.g. not enough resources; it’s better to have a pending Pods with predicates’ event -* Hard to support preemption in different components, e.g. DS and default scheduler - -After [discussions](https://docs.google.com/document/d/1v7hsusMaeImQrOagktQb40ePbK6Jxp1hzgFB9OZa_ew/edit#), SIG scheduling approved changing DaemonSet controller to create DaemonSet Pods and set their node-affinity and let them be scheduled by default scheduler. After this change, DaemonSet controller will no longer schedule DaemonSet Pods directly. - -## Solutions - -Before the discussion of solutions/options, there’s some requirements/questions on DaemonSet: - -* **Q**: DaemonSet controller can make pods even if the network of node is unavailable, e.g. CNI network providers (Calico, Flannel), -Will this impact bootstrapping, such as in the case that a DaemonSet is being used to provide the pod network? - - **A**: This will be handled by supporting scheduling tolerating workloads on NotReady Nodes ([#45717](https://github.com/kubernetes/kubernetes/issues/45717)); after moving to check node’s taint, the DaemonSet pods will tolerate `NetworkUnavailable` taint. - -* **Q**: DaemonSet controller can make pods even if when the scheduler has not been started, which can help cluster bootstrap. - - **A**: As the scheduling logic is moved to default scheduler, the kube-scheduler must be started during cluster start-up. - -* **Q**: Will this change/constrain update strategies, such as scheduling an updated pod to a node before the previous pod is gone? - - **A**: no, this will NOT change update strategies. - -* **Q**: How would Daemons be integrated into Node lifecycle, such as being scheduled before any other nodes and/or remaining after all others are evicted? 
This isn't currently implemented, but was planned. - - **A**: Similar to the other Pods; DaemonSet Pods only has attributes to make sure one Pod per Node, DaemonSet controller will create Pods based on node number (by considering ‘nodeSelector’). - - -Currently, pods of DaemonSet are created and scheduled by DaemonSet controller: - -1. DS controller filter nodes by nodeSelector and scheduler’s predicates -2. For each node, create a Pod for it by setting spec.hostName directly; it’ll skip default scheduler - -This option is to leverage NodeAffinity feature to avoid introducing scheduler’s predicates in DS controller: - -1. DS controller filter nodes by nodeSelector, but does NOT check against scheduler’s predicates (e.g. PodFitHostResources) -2. For each node, DS controller creates a Pod for it with the following NodeAffinity - ```yaml - nodeAffinity: - requiredDuringSchedulingIgnoredDuringExecution: - - nodeSelectorTerms: - matchExpressions: - - key: kubernetes.io/hostname - operator: in - values: - - dest_hostname - ``` -3. When sync Pods, DS controller will map nodes and pods by this NodeAffinity to check whether Pods are started for nodes -4. In scheduler, DaemonSet Pods will stay pending if scheduling predicates fail. To avoid this, an appropriate priority must - be set to all critical DaemonSet Pods. Scheduler will preempt other pods to ensure critical pods were scheduled even when - the cluster is under resource pressure. - -## Reference - -* [DaemonsetController can't feel it when node has more resources, e.g. other Pod exits](https://github.com/kubernetes/kubernetes/issues/46935) -* [DaemonsetController can't feel it when node recovered from outofdisk state](https://github.com/kubernetes/kubernetes/issues/45628) -* [DaemonSet pods should be scheduled by default scheduler, not DaemonSet controller](https://github.com/kubernetes/kubernetes/issues/42002) -* [NodeController should add NoSchedule taints and we should get rid of getNodeConditionPredicate()](https://github.com/kubernetes/kubernetes/issues/42001) -* [DaemonSet should respect Pod Affinity and Pod AntiAffinity](https://github.com/kubernetes/kubernetes/issues/29276) -* [Make DaemonSet respect critical pods annotation when scheduling](https://github.com/kubernetes/kubernetes/pull/42028) +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
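As a hypothetical illustration of step 3 of the solution above (mapping pods back to nodes via the injected hostname affinity), the sketch below extracts the target node name from a pod's required node-affinity term. It is not the actual DaemonSet controller code, and the helper name is invented for this example.

```go
// Illustrative-only sketch: given a DaemonSet pod built with the per-node
// hostname affinity shown earlier, recover which node it targets so the
// controller can map pods to nodes. Not the real DaemonSet controller code.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// targetNode returns the hostname a pod is pinned to via the
// kubernetes.io/hostname node-affinity term, or "" if none is set.
func targetNode(pod *v1.Pod) string {
	aff := pod.Spec.Affinity
	if aff == nil || aff.NodeAffinity == nil ||
		aff.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution == nil {
		return ""
	}
	for _, term := range aff.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms {
		for _, expr := range term.MatchExpressions {
			if expr.Key == "kubernetes.io/hostname" && expr.Operator == v1.NodeSelectorOpIn && len(expr.Values) == 1 {
				return expr.Values[0]
			}
		}
	}
	return ""
}

func main() {
	pod := &v1.Pod{
		Spec: v1.PodSpec{
			Affinity: &v1.Affinity{
				NodeAffinity: &v1.NodeAffinity{
					RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
						NodeSelectorTerms: []v1.NodeSelectorTerm{{
							MatchExpressions: []v1.NodeSelectorRequirement{{
								Key:      "kubernetes.io/hostname",
								Operator: v1.NodeSelectorOpIn,
								Values:   []string{"node-1"},
							}},
						}},
					},
				},
			},
		},
	}
	fmt.Println(targetNode(pod)) // node-1
}
```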
\ No newline at end of file diff --git a/contributors/design-proposals/scheduling/scheduler-equivalence-class.md b/contributors/design-proposals/scheduling/scheduler-equivalence-class.md index 808de966..f0fbec72 100644 --- a/contributors/design-proposals/scheduling/scheduler-equivalence-class.md +++ b/contributors/design-proposals/scheduling/scheduler-equivalence-class.md @@ -1,326 +1,6 @@ -# Equivalence class based scheduling in Kubernetes +Design proposals have been archived. -**Authors**: +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -@resouer @wojtek-t @davidopp - -# Guideline - -- [Objectives](#objectives) - - [Goals](#goals) - - [Non-Goals](#non-goals) -- [Background](#background) - - [Terminology](#terminology) -- [Overview](#overview) -- [Detailed Design](#detailed-design) - - [Define equivalence class](#define-equivalence-class) - - [Equivalence class in predicate phase](#equivalence-class-in-predicate) - - [Keep equivalence class cache up-to-date](#keep-equivalence--class-cache-up-to-date) -- [Notes for scheduler developers](#notes-for-scheduler-developer) -- [References](#references) - -# Objectives - -## Goals - -- Define the equivalence class for pods during predicate phase in Kubernetes. -- Define how to use equivalence class to speed up predicate process. -- Define how to ensure information cached in equivalence class is up-to-date. - -## Non-Goals - -- Apply equivalence class to priorities. We have refactored priorities to a Map-Reduce style process, we need to re-evaluate whether equivalence design can or can not apply to this new model. - -# Background - -Pods in Kubernetes cluster usually have identical requirements and constraints, just think about a Deployment with a number of replications. So rather than determining feasibility for every pending pod on every node, we can only do predicates one pod per equivalence class – a group of tasks with identical requirements, and reuse the predicate results for other equivalent pods. - -We hope to use this mechanism to help to improve scheduler's scalability, especially in cases like Replication Controller with huge number of instances, or eliminate pressure caused by complex predicate functions. - -The concept of equivalence class in scheduling is a proven feature used originally in [Google Borg] [1]. - -## Terminology - -Equivalence class: a group of pods which has identical requirements and constraints. - -Equivalence class based scheduling: the scheduler will do predicate for only one pod per equivalence class, and reuse this result for all other equivalent pods. - -# Overview - -This document describes what is equivalence class, and how to do equivalence based scheduling in Kubernetes. The basic idea is when you apply the predicate functions to a pod, cache the results (namely, for each machine, whether the pod is feasible on that machine). - -Scheduler watches for API objects change like bindings and unbindings and node changes, and marks a cached value as invalid whenever there is a change that invalidates a cached value. (For example, if the labels on a node change, or a new pod gets bound to a machine, then all cached values related to that machine are invalidated.) In the future when we have in-place updates, some updates to pods running on the machine would also cause the node to be marked invalid. This is how we keep equivalence class cache up-to-date. 
- -When scheduling a new pod, check to see if the predicate result for an equivalent pod is already cached. If so, re-evaluate the predicate functions just for the "invalid" values (i.e. not for all nodes and predicates), and update the cache. - - -# Detailed Design - -## 1. Define equivalence class - -There are two options were proposed. - -Option 1: use the attributes of Pod API object to decide if given pods are equivalent, the attributes include labels, some annotations, affinity, resource limit etc. - -Option 2: use controller reference, i.e. simply consider pods belonging to same controller reference -to be equivalent. - -Regarding first option - The biggest concern in this approach is that if someone will add dependency on some new field at some point, we don't have good way to test it and ensure that equivalence pod will be updated at that point too. - -Regarding second option - In detail, using the "ControllerRef" which is defined as "OwnerReference (from ObjectMeta) with the "Controller" field set to true as the "equivalence class". In this approach, we would have all RC, RS, Job etc handled by exactly the same mechanism. Also, this would be faster to compute it. - -For example, two pods created by the same `ReplicaSets` will be considered as equivalent since they will have exactly the same resource requirements from one pod template. On the other hand, two pods created by two `ReplicaSets` will not be considered as equivalent regardless of whether they have same resource requirements or not. - -**Conclusion:** - -Choose option 2. And we will calculate a unique `uint64` hash for pods belonging to same equivalence class which known as `equivalenceHash`. - -## 2. Equivalence class in predicate phase - -Predicate is the first phase in scheduler to filter out nodes which are feasible to run the workload. In detail: - -1. Predicates functions are registered in scheduler -2. The predicates will be checked by `scheduler.findNodesThatFit(pod, nodes, predicateFuncs ...)`. -3. The check process `scheduler.podFitsOnNode(pod, node, predicateFuncs ...)` is executed in parallel for every node. - -### 2.1 Design an equivalence class cache - -The step 3 is where registered predicate functions will be called against given pod and node. This step includes: - -1. Check if given pod has equivalence class. -2. If yes, use equivalence class cache to do predicate. - -In detail, we need to have an equivalence class cache to store all predicates results per node. The data structure is a 3 level map with keys of the levels being: `nodeName`, `predicateKey` and `equivalenceHash`. - -```go -predicateMap := algorithmCache[nodeName].predicatesCache.Get(predicateKey) -hostPredicate := predicateMap[equivalenceHash] -``` -For example: the cached `GeneralPredicates` result for equivalence class `1000392826` on node `node_1` is: - -```go -algorithmCache["node_1"].predicatesCache.Get("GeneralPredicates")[1000392826] -``` - -This will return a `HostPredicate` struct: - -```go -type HostPredicate struct { - Fit bool - FailReasons []algorithm.PredicateFailureReason -} - -``` - -Please note we use predicate name as key in `predicatesCache`, so the number of entries in the cache is less or equal to the total number of registered predicates in scheduler. The cache size is limited. 
- -### 2.2 Use cached predicate result to do predicate - -The pseudo code of predicate process with equivalence class will be like: - -```go -func (ec *EquivalenceCache) PredicateWithECache( - podName, nodeName, predicateKey string, - equivalenceHash uint64, -) (bool, []algorithm.PredicateFailureReason, bool) { - if algorithmCache, exist := ec.algorithmCache[nodeName]; exist { - if predicateMap, exist := algorithmCache.predicatesCache.Get(predicateKey); exist { - if hostPredicate, ok := predicateMap[equivalenceHash]; ok { - // fit - if hostPredicate.Fit { - return true, []algorithm.PredicateFailureReason{}, false - } else { - // unfit - return false, hostPredicate.FailReasons, false - } - } else { - // cached result is invalid - return false, []algorithm.PredicateFailureReason{}, true - } - } - } - return false, []algorithm.PredicateFailureReason{}, true -} -``` - -One thing to note is, if the `hostPredicate` is not present in the logic above, it will be considered as `invalid`. That means although this pod has equivalence class, it does not have cached predicate result yet, or the cached data is not valid. It needs to go through normal predicate process and write the result into equivalence class cache. - -### 2.3 What if no equivalence class is found for pod? - -If no equivalence class is found for given pod, normal predicate process will be executed. - -## 3. Keep equivalence class cache up-to-date - -The key of this equivalence class based scheduling is how to keep the equivalence cache up-to-date. Since even one single pod been scheduled to a node will make the cached result not stand as the available resource on this node has changed. - -One approach is that we can invalidate the cached predicate result for this node. But in a heavy load cluster state change happens frequently and makes the design less meaningful. - -So in this design, we proposed the ability to invalidate cached result for specific predicate. For example, when a new pod is scheduled to a node, the cached result for `PodFitsResources` should be invalidated on this node while others can still be re-used. That's also another reason we use predicate name as key for the cached value. - -During the implementation, we need to consider all the cases which may affect the effectiveness of cached predicate result. The logic includes three dimensions: - -- **Operation**: - - what operation will cause this cache invalid. -- **Invalid predicates**: - - what predicate should be invalidated. -- **Scope**: - - the cache of which node should be invalidated, or all nodes. - -Please note with the change of predicates in subsequent development, this doc will become out-of-date, while you can always check the latest e-class cache update process in `pkg/scheduler/factory/factory.go`. - -### 3.1 Persistent Volume - -- **Operation:** - - ADD, DELETE - -- **Invalid predicates**: - - - `MaxEBSVolumeCount`, `MaxGCEPDVolumeCount`, `MaxAzureDiskVolumeCount` (only if the added/deleted PV is one of them) - -- **Scope**: - - - All nodes (we don't know which node this PV will be attached to) - - -### 3.2 Persistent Volume Claim - -- **Operation:** - - ADD, DELETE - -- **Invalid predicates:** - - - `MaxPDVolumeCountPredicate` (only if the added/deleted PVC as a bound volume so it drops to the PV change case, otherwise it should not affect scheduler). - -- **Scope:** - - All nodes (we don't know which node this PV will be attached to). 
- - -### 3.3 Service - -- **Operation:** - - ADD, DELETE - -- **Invalid predicates:** - - - `ServiceAffinity` - -- **Scope:** - - All nodes (`serviceAffinity` is a cluster scope predicate). - - - -- **Operation:** - - UPDATE - -- **Invalid predicates:** - - - `ServiceAffinity` (only if the `spec.Selector` filed is updated) - -- **Scope:** - - All nodes (`serviceAffinity` is a cluster scope predicate),. - - -### 3.4 Pod - -- **Operation:** - - ADD - -- **Invalid predicates:** - - `GeneralPredicates`. This invalidate should be done during `scheduler.assume(...)` because binding can be asynchronous. So we just optimistically invalidate predicate cached result there, and if later this pod failed to bind, the following pods will go through normal predicate functions and nothing breaks. - - - No `MatchInterPodAffinity`: the scheduler will make sure newly bound pod will not break the existing inter pod affinity. So we do not need to invalidate MatchInterPodAffinity when pod added. But when a pod is deleted, existing inter pod affinity may become invalid. (e.g. this pod was preferred by some else, or vice versa). - - - NOTE: assumptions above **will not** stand when we implemented features like `RequiredDuringSchedulingRequiredDuringExecution`. - - - No `NoDiskConflict`: the newly scheduled pod fits to existing pods on this node, it will also fits to equivalence class of existing pods. - -- **Scope:** - - The node where the pod is bound. - - - -- **Operation:** - - UPDATE - -- **Invalid predicates:** - - - Only if `pod.NodeName` did not change (otherwise it drops to add/delete case) - - - `GeneralPredicates` if the pod's resource requests are updated. - - - `MatchInterPodAffinity` if the pod's labels are updated. - -- **Scope:** - - The node where the pod is bound. - - - -- **Operation:** - - DELETE - -- **Invalid predicates:** - - `MatchInterPodAffinity` if the pod's labels are updated. - -- **Scope:** - - All nodes in the same failure domain - -- **Invalid predicates:** - - - `NoDiskConflict` if the pod has special volume like `RBD`, `ISCSI`, `GCEPersistentDisk` etc. - -- **Scope:** - - The node where the pod is bound. - - -### 3.5 Node - - -- **Operation:** - - UPDATE - -- **Invalid predicates:** - - - `GeneralPredicates`, if `node.Status.Allocatable` or node labels changed. - - - `ServiceAffinity`, if node labels changed, since selector result may change. - - - `MatchInterPodAffinity`, if value of label changed, since any node label can be topology key of pod. - - - `NoVolumeZoneConflict`, if zone related label change. - - - `PodToleratesNodeTaints`, if node taints changed. - - - `CheckNodeMemoryPressure`, `CheckNodeDiskPressure`, `CheckNodeCondition`, if related node condition changed. - -- **Scope:** - - The updated node. - -- **Operation:** - - DELETE - -- **Invalid predicates:** - - All predicates - -- **Scope:** - - The deleted node - - -# Notes for scheduler developers - -1. When implementing a new predicate, developers are expected to check how related API object changes (add/delete/update) affect the result of their new predicate function and invalidate cached results of the predicate function if necessary, in scheduler/factory/factory.go. - -2. When updating an existing predicate, developers should consider whether their changes introduce new dependency on attributes of any API objects like Pod, Node, Service, etc. If so, developer should consider invalidating caches results of this predicate in scheduler/factory/factory.go. 
- - -# References - -Main implementation PRs: - -- https://github.com/kubernetes/kubernetes/pull/31605 -- https://github.com/kubernetes/kubernetes/pull/34685 -- https://github.com/kubernetes/kubernetes/pull/36238 -- https://github.com/kubernetes/kubernetes/pull/41541 - - -[1]: http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf "Google Borg paper" +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/scheduling/scheduler_extender.md b/contributors/design-proposals/scheduling/scheduler_extender.md index bc65f9ba..f0fbec72 100644 --- a/contributors/design-proposals/scheduling/scheduler_extender.md +++ b/contributors/design-proposals/scheduling/scheduler_extender.md @@ -1,122 +1,6 @@ -# Scheduler extender +Design proposals have been archived. -There are three ways to add new scheduling rules (predicates and priority -functions) to Kubernetes: (1) by adding these rules to the scheduler and -recompiling, [described here](/contributors/devel/sig-scheduling/scheduler.md), -(2) implementing your own scheduler process that runs instead of, or alongside -of, the standard Kubernetes scheduler, (3) implementing a "scheduler extender" -process that the standard Kubernetes scheduler calls out to as a final pass when -making scheduling decisions. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -This document describes the third approach. This approach is needed for use -cases where scheduling decisions need to be made on resources not directly -managed by the standard Kubernetes scheduler. The extender helps make scheduling -decisions based on such resources. (Note that the three approaches are not -mutually exclusive.) -When scheduling a pod, the extender allows an external process to filter and -prioritize nodes. Two separate http/https calls are issued to the extender, one -for "filter" and one for "prioritize" actions. Additionally, the extender can -choose to bind the pod to apiserver by implementing the "bind" action. - -To use the extender, you must create a scheduler policy configuration file. The -configuration specifies how to reach the extender, whether to use http or https -and the timeout. - -```go -// Holds the parameters used to communicate with the extender. If a verb is unspecified/empty, -// it is assumed that the extender chose not to provide that extension. -type ExtenderConfig struct { - // URLPrefix at which the extender is available - URLPrefix string `json:"urlPrefix"` - // Verb for the filter call, empty if not supported. This verb is appended to the URLPrefix when issuing the filter call to extender. - FilterVerb string `json:"filterVerb,omitempty"` - // Verb for the prioritize call, empty if not supported. This verb is appended to the URLPrefix when issuing the prioritize call to extender. - PrioritizeVerb string `json:"prioritizeVerb,omitempty"` - // Verb for the bind call, empty if not supported. This verb is appended to the URLPrefix when issuing the bind call to extender. - // If this method is implemented by the extender, it is the extender's responsibility to bind the pod to apiserver. - BindVerb string `json:"bindVerb,omitempty"` - // The numeric multiplier for the node scores that the prioritize call generates. - // The weight should be a positive integer - Weight int `json:"weight,omitempty"` - // EnableHttps specifies whether https should be used to communicate with the extender - EnableHttps bool `json:"enableHttps,omitempty"` - // TLSConfig specifies the transport layer security config - TLSConfig *client.TLSClientConfig `json:"tlsConfig,omitempty"` - // HTTPTimeout specifies the timeout duration for a call to the extender. Filter timeout fails the scheduling of the pod. Prioritize - // timeout is ignored, k8s/other extenders priorities are used to select the node. 
- HTTPTimeout time.Duration `json:"httpTimeout,omitempty"` -} -``` - -A sample scheduler policy file with extender configuration: - -```json -{ - "predicates": [ - { - "name": "HostName" - }, - { - "name": "MatchNodeSelector" - }, - { - "name": "PodFitsResources" - } - ], - "priorities": [ - { - "name": "LeastRequestedPriority", - "weight": 1 - } - ], - "extenders": [ - { - "urlPrefix": "http://127.0.0.1:12345/api/scheduler", - "filterVerb": "filter", - "enableHttps": false - } - ] -} -``` - -Arguments passed to the FilterVerb endpoint on the extender are the set of nodes -filtered through the k8s predicates and the pod. Arguments passed to the -PrioritizeVerb endpoint on the extender are the set of nodes filtered through -the k8s predicates and extender predicates and the pod. - -```go -// ExtenderArgs represents the arguments needed by the extender to filter/prioritize -// nodes for a pod. -type ExtenderArgs struct { - // Pod being scheduled - Pod api.Pod `json:"pod"` - // List of candidate nodes where the pod can be scheduled - Nodes api.NodeList `json:"nodes"` -} -``` - -The "filter" call returns a list of nodes (schedulerapi.ExtenderFilterResult). The "prioritize" call -returns priorities for each node (schedulerapi.HostPriorityList). - -The "filter" call may prune the set of nodes based on its predicates. Scores -returned by the "prioritize" call are added to the k8s scores (computed through -its priority functions) and used for final host selection. - -"bind" call is used to delegate the bind of a pod to a node to the extender. It can -be optionally implemented by the extender. When it is implemented, it is the extender's -responbility to issue the bind call to the apiserver. Pod name, namespace, UID and Node -name are passed to the extender. -```go -// ExtenderBindingArgs represents the arguments to an extender for binding a pod to a node. -type ExtenderBindingArgs struct { - // PodName is the name of the pod being bound - PodName string - // PodNamespace is the namespace of the pod being bound - PodNamespace string - // PodUID is the UID of the pod being bound - PodUID types.UID - // Node selected by the scheduler - Node string -} -``` +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
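To make the extender contract above concrete, here is a minimal sketch of an external filter process that would pair with the sample policy file (`urlPrefix` of `http://127.0.0.1:12345/api/scheduler`, `filterVerb` of `filter`). The request/response structs are simplified local stand-ins for the real scheduler API types, and the "keep only nodes whose name contains gpu" rule is a placeholder policy, not a recommendation.

```go
// Minimal sketch of an external "filter" extender, based on the verb contract
// described above. ExtenderArgs and ExtenderFilterResult here are simplified
// local stand-ins for the real scheduler API types; the node-name check is a
// placeholder policy.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"strings"
)

type ExtenderArgs struct {
	Pod   json.RawMessage `json:"pod"` // pod being scheduled (unused in this sketch)
	Nodes struct {
		Items []struct {
			Metadata struct {
				Name string `json:"name"`
			} `json:"metadata"`
		} `json:"items"`
	} `json:"nodes"`
}

type ExtenderFilterResult struct {
	NodeNames   []string          `json:"nodenames,omitempty"`
	FailedNodes map[string]string `json:"failedNodes,omitempty"`
	Error       string            `json:"error,omitempty"`
}

func filterHandler(w http.ResponseWriter, r *http.Request) {
	var args ExtenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	result := ExtenderFilterResult{FailedNodes: map[string]string{}}
	for _, n := range args.Nodes.Items {
		// Placeholder policy: keep nodes whose name contains "gpu".
		if strings.Contains(n.Metadata.Name, "gpu") {
			result.NodeNames = append(result.NodeNames, n.Metadata.Name)
		} else {
			result.FailedNodes[n.Metadata.Name] = "node has no extender-managed resource"
		}
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(result)
}

func main() {
	// Matches the sample policy above: urlPrefix http://127.0.0.1:12345/api/scheduler, filterVerb "filter".
	http.HandleFunc("/api/scheduler/filter", filterHandler)
	log.Fatal(http.ListenAndServe("127.0.0.1:12345", nil))
}
```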
\ No newline at end of file diff --git a/contributors/design-proposals/scheduling/scheduling-framework.md b/contributors/design-proposals/scheduling/scheduling-framework.md index 1de43aab..f0fbec72 100644 --- a/contributors/design-proposals/scheduling/scheduling-framework.md +++ b/contributors/design-proposals/scheduling/scheduling-framework.md @@ -1,9 +1,6 @@ - -Status: Draft -Created: 2018-04-09 / Last updated: 2019-03-01 -Author: bsalamat -Contributors: misterikkit +Design proposals have been archived. ---- +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -The scheduling framework design has moved to https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/20180409-scheduling-framework.md + +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/scheduling/taint-node-by-condition.md b/contributors/design-proposals/scheduling/taint-node-by-condition.md index 2e352d4f..f0fbec72 100644 --- a/contributors/design-proposals/scheduling/taint-node-by-condition.md +++ b/contributors/design-proposals/scheduling/taint-node-by-condition.md @@ -1,40 +1,6 @@ -# Taints Node according to NodeConditions +Design proposals have been archived. -@k82cn, @gmarek, @jamiehannaford, Jul 15, 2017 +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Relevant issues: - -* https://github.com/kubernetes/kubernetes/issues/42001 -* https://github.com/kubernetes/kubernetes/issues/45717 - -## Motivation -In kubernetes 1.8 and before, there are six Node Conditions, each with three possible values: True, False or Unknown. Kubernetes components modify and check those node conditions without any consideration to pods and their specs. For example, the scheduler will filter out all nodes whose NetworkUnavailable condition is True, meaning that pods on the host network can not be scheduled to those nodes, even though a user might want that. The motivation of this proposal is to taint Nodes based on certain conditions, so that other components can leverage Tolerations for more advanced scheduling. - -## Functional Detail -Currently (1.8 and before), the conditions of nodes are updated by the kubelet and Node Controller. The kubelet updates the value to either True or False according to the node’s status. If the kubelet did not update the value, the Node Controller will set the value to Unknown after a specific grace period. - -In addition to this, with taint-based-eviction, the Node Controller already taints nodes with either NotReady and Unreachable if certain conditions are met. In this proposal, the Node Controller will use additional taints on Nodes. The new taints are described below: - -| ConditionType | Condition Status |Effect | Key | -| ------------------ | ------------------ | ------------ | -------- | -|Ready |True | - | | -| |False | NoExecute | node.kubernetes.io/not-ready | -| |Unknown | NoExecute | node.kubernetes.io/unreachable | -|OutOfDisk |True | NoSchedule | node.kubernetes.io/out-of-disk | -| |False | - | | -| |Unknown | - | | -|MemoryPressure |True | NoSchedule | node.kubernetes.io/memory-pressure | -| |False | - | | -| |Unknown | - | | -|DiskPressure |True | NoSchedule | node.kubernetes.io/disk-pressure | -| |False | - | | -| |Unknown | - | | -|NetworkUnavailable |True | NoSchedule | node.kubernetes.io/network-unavailable | -| |False | - | | -| |Unknown | - | | -|PIDPressure |True | NoSchedule | node.kubernetes.io/pid-pressure | -| |False | - | | -| |Unknown | - | | - -For example, if a CNI network is not detected on the node (e.g. a network is unavailable), the Node Controller will taint the node with `node.kubernetes.io/network-unavailable=:NoSchedule`. This will then allow users to add a toleration to their `PodSpec`, ensuring that the pod can be scheduled to this node if necessary. If the kubelet did not update the node’s status after a grace period, the Node Controller will only taint the node with `node.kubernetes.io/unreachable`; it will not taint the node with any unknown condition. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
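To illustrate the scheduling behaviour the archived taint-node-by-condition text describes, here is a sketch of a host-network pod that opts in to nodes tainted with `node.kubernetes.io/network-unavailable=:NoSchedule`, expressed with today's `k8s.io/api/core/v1` types. This only shows the pod-side toleration; the node-side taint is applied by the Node Controller as described above.

```go
// Sketch: tolerating the network-unavailable condition taint.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	podSpec := corev1.PodSpec{
		// A host-network pod does not need the CNI network, so it can opt in
		// to nodes whose NetworkUnavailable condition is True.
		HostNetwork: true,
		Tolerations: []corev1.Toleration{
			{
				Key:      "node.kubernetes.io/network-unavailable",
				Operator: corev1.TolerationOpExists,
				Effect:   corev1.TaintEffectNoSchedule,
			},
		},
	}
	fmt.Printf("%+v\n", podSpec.Tolerations)
}
```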
\ No newline at end of file diff --git a/contributors/design-proposals/scheduling/taint-toleration-dedicated.md b/contributors/design-proposals/scheduling/taint-toleration-dedicated.md index dc7a6483..f0fbec72 100644 --- a/contributors/design-proposals/scheduling/taint-toleration-dedicated.md +++ b/contributors/design-proposals/scheduling/taint-toleration-dedicated.md @@ -1,285 +1,6 @@ -# Taints, Tolerations, and Dedicated Nodes +Design proposals have been archived. -## Introduction +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -This document describes *taints* and *tolerations*, which constitute a generic -mechanism for restricting the set of pods that can use a node. We also describe -one concrete use case for the mechanism, namely to limit the set of users (or -more generally, authorization domains) who can access a set of nodes (a feature -we call *dedicated nodes*). There are many other uses--for example, a set of -nodes with a particular piece of hardware could be reserved for pods that -require that hardware, or a node could be marked as unschedulable when it is -being drained before shutdown, or a node could trigger evictions when it -experiences hardware or software problems or abnormal node configurations; see -issues [#17190](https://github.com/kubernetes/kubernetes/issues/17190) and -[#3885](https://github.com/kubernetes/kubernetes/issues/3885) for more discussion. -## Taints, tolerations, and dedicated nodes - -A *taint* is a new type that is part of the `NodeSpec`; when present, it -prevents pods from scheduling onto the node unless the pod *tolerates* the taint -(tolerations are listed in the `PodSpec`). Note that there are actually multiple -flavors of taints: taints that prevent scheduling on a node, taints that cause -the scheduler to try to avoid scheduling on a node but do not prevent it, taints -that prevent a pod from starting on Kubelet even if the pod's `NodeName` was -written directly (i.e. pod did not go through the scheduler), and taints that -evict already-running pods. -[This comment](https://github.com/kubernetes/kubernetes/issues/3885#issuecomment-146002375) -has more background on these different scenarios. We will focus on the first -kind of taint in this doc, since it is the kind required for the "dedicated -nodes" use case. - -Implementing dedicated nodes using taints and tolerations is straightforward: in -essence, a node that is dedicated to group A gets taint `dedicated=A` and the -pods belonging to group A get toleration `dedicated=A`. (The exact syntax and -semantics of taints and tolerations are described later in this doc.) This keeps -all pods except those belonging to group A off of the nodes. This approach -easily generalizes to pods that are allowed to schedule into multiple dedicated -node groups, and nodes that are a member of multiple dedicated node groups. - -Note that because tolerations are at the granularity of pods, the mechanism is -very flexible -- any policy can be used to determine which tolerations should be -placed on a pod. So the "group A" mentioned above could be all pods from a -particular namespace or set of namespaces, or all pods with some other arbitrary -characteristic in common. We expect that any real-world usage of taints and -tolerations will employ an admission controller to apply the tolerations. 
For -example, to give all pods from namespace A access to dedicated node group A, an -admission controller would add the corresponding toleration to all pods from -namespace A. Or to give all pods that require GPUs access to GPU nodes, an -admission controller would add the toleration for GPU taints to pods that -request the GPU resource. - -Everything that can be expressed using taints and tolerations can be expressed -using [node affinity](https://github.com/kubernetes/kubernetes/pull/18261), e.g. -in the example in the previous paragraph, you could put a label `dedicated=A` on -the set of dedicated nodes and a node affinity `dedicated NotIn A` on all pods *not* -belonging to group A. But it is cumbersome to express exclusion policies using -node affinity because every time you add a new type of restricted node, all pods -that aren't allowed to use those nodes need to start avoiding those nodes using -node affinity. This means the node affinity list can get quite long in clusters -with lots of different groups of special nodes (lots of dedicated node groups, -lots of different kinds of special hardware, etc.). Moreover, you need to also -update any Pending pods when you add new types of special nodes. In contrast, -with taints and tolerations, when you add a new type of special node, "regular" -pods are unaffected, and you just need to add the necessary toleration to the -pods you subsequent create that need to use the new type of special nodes. To -put it another way, with taints and tolerations, only pods that use a set of -special nodes need to know about those special nodes; with the node affinity -approach, pods that have no interest in those special nodes need to know about -all of the groups of special nodes. - -One final comment: in practice, it is often desirable to not only keep "regular" -pods off of special nodes, but also to keep "special" pods off of regular nodes. -An example in the dedicated nodes case is to not only keep regular users off of -dedicated nodes, but also to keep dedicated users off of non-dedicated (shared) -nodes. In this case, the "non-dedicated" nodes can be modeled as their own -dedicated node group (for example, tainted as `dedicated=shared`), and pods that -are not given access to any dedicated nodes ("regular" pods) would be given a -toleration for `dedicated=shared`. (As mentioned earlier, we expect tolerations -will be added by an admission controller.) In this case taints/tolerations are -still better than node affinity because with taints/tolerations each pod only -needs one special "marking", versus in the node affinity case where every time -you add a dedicated node group (i.e. a new `dedicated=` value), you need to add -a new node affinity rule to all pods (including pending pods) except the ones -allowed to use that new dedicated node group. - -## API - -```go -// The node this Taint is attached to has the effect "effect" on -// any pod that does not tolerate the Taint. -type Taint struct { - Key string `json:"key" patchStrategy:"merge" patchMergeKey:"key"` - Value string `json:"value,omitempty"` - Effect TaintEffect `json:"effect"` -} - -type TaintEffect string - -const ( - // Do not allow new pods to schedule unless they tolerate the taint, - // but allow all pods submitted to Kubelet without going through the scheduler - // to start, and allow all already-running pods to continue running. - // Enforced by the scheduler. 
- TaintEffectNoSchedule TaintEffect = "NoSchedule" - // Like TaintEffectNoSchedule, but the scheduler tries not to schedule - // new pods onto the node, rather than prohibiting new pods from scheduling - // onto the node. Enforced by the scheduler. - TaintEffectPreferNoSchedule TaintEffect = "PreferNoSchedule" - // Do not allow new pods to schedule unless they tolerate the taint, - // do not allow pods to start on Kubelet unless they tolerate the taint, - // but allow all already-running pods to continue running. - // Enforced by the scheduler and Kubelet. - TaintEffectNoScheduleNoAdmit TaintEffect = "NoScheduleNoAdmit" - // Do not allow new pods to schedule unless they tolerate the taint, - // do not allow pods to start on Kubelet unless they tolerate the taint, - // and try to eventually evict any already-running pods that do not tolerate the taint. - // Enforced by the scheduler and Kubelet. - TaintEffectNoScheduleNoAdmitNoExecute = "NoScheduleNoAdmitNoExecute" -) - -// The pod this Toleration is attached to tolerates any taint that matches -// the triple <key,value,effect> using the matching operator <operator>. -type Toleration struct { - Key string `json:"key" patchStrategy:"merge" patchMergeKey:"key"` - // operator represents a key's relationship to the value. - // Valid operators are Exists and Equal. Defaults to Equal. - // Exists is equivalent to wildcard for value, so that a pod can - // tolerate all taints of a particular category. - Operator TolerationOperator `json:"operator"` - Value string `json:"value,omitempty"` - Effect TaintEffect `json:"effect"` - // TODO: For forgiveness (#1574), we'd eventually add at least a grace period - // here, and possibly an occurrence threshold and period. -} - -// A toleration operator is the set of operators that can be used in a toleration. -type TolerationOperator string - -const ( - TolerationOpExists TolerationOperator = "Exists" - TolerationOpEqual TolerationOperator = "Equal" -) - -``` - -(See [this comment](https://github.com/kubernetes/kubernetes/issues/3885#issuecomment-146002375) -to understand the motivation for the various taint effects.) - -We will add: - -```go - // Multiple tolerations with the same key are allowed. - Tolerations []Toleration `json:"tolerations,omitempty"` -``` - -to `PodSpec`. A pod must tolerate all of a node's taints (except taints of type -TaintEffectPreferNoSchedule) in order to be able to schedule onto that node. - -We will add: - -```go - // Multiple taints with the same key are not allowed. - Taints []Taint `json:"taints,omitempty"` -``` - -to both `NodeSpec` and `NodeStatus`. The value in `NodeStatus` is the union -of the taints specified by various sources. For now, the only source is -the `NodeSpec` itself, but in the future one could imagine a node inheriting -taints from pods (if we were to allow taints to be attached to pods), from -the node's startup configuration, etc. The scheduler should look at the `Taints` -in `NodeStatus`, not in `NodeSpec`. - -Taints and tolerations are not scoped to namespace. - -## Implementation plan: taints, tolerations, and dedicated nodes - -Using taints and tolerations to implement dedicated nodes requires these steps: - -1. Add the API described above -1. Add a scheduler predicate function that respects taints and tolerations (for -TaintEffectNoSchedule) and a scheduler priority function that respects taints -and tolerations (for TaintEffectPreferNoSchedule). -1. 
Add to the Kubelet code to implement the "no admit" behavior of -TaintEffectNoScheduleNoAdmit and TaintEffectNoScheduleNoAdmitNoExecute -1. Implement code in Kubelet that evicts a pod that no longer satisfies -TaintEffectNoScheduleNoAdmitNoExecute. In theory we could do this in the -controllers instead, but since taints might be used to enforce security -policies, it is better to do in kubelet because kubelet can respond quickly and -can guarantee the rules will be applied to all pods. Eviction may need to happen -under a variety of circumstances: when a taint is added, when an existing taint -is updated, when a toleration is removed from a pod, or when a toleration is -modified on a pod. -1. Add a new `kubectl` command that adds/removes taints to/from nodes, -1. (This is the one step is that is specific to dedicated nodes) Implement an -admission controller that adds tolerations to pods that are supposed to be -allowed to use dedicated nodes (for example, based on pod's namespace). - -In the future one can imagine a generic policy configuration that configures an -admission controller to apply the appropriate tolerations to the desired class -of pods and taints to Nodes upon node creation. It could be used not just for -policies about dedicated nodes, but also other uses of taints and tolerations, -e.g. nodes that are restricted due to their hardware configuration. - -The `kubectl` command to add and remove taints on nodes will be modeled after -`kubectl label`. Examples usages: - -```sh -# Update node 'foo' with a taint with key 'dedicated' and value 'special-user' and effect 'NoScheduleNoAdmitNoExecute'. -# If a taint with that key already exists, its value and effect are replaced as specified. -$ kubectl taint nodes foo dedicated=special-user:NoScheduleNoAdmitNoExecute - -# Remove from node 'foo' the taint with key 'dedicated' if one exists. -$ kubectl taint nodes foo dedicated- -``` - -## Example: implementing a dedicated nodes policy - -Let's say that the cluster administrator wants to make nodes `foo`, `bar`, and `baz` available -only to pods in a particular namespace `banana`. First the administrator does - -```sh -$ kubectl taint nodes foo dedicated=banana:NoScheduleNoAdmitNoExecute -$ kubectl taint nodes bar dedicated=banana:NoScheduleNoAdmitNoExecute -$ kubectl taint nodes baz dedicated=banana:NoScheduleNoAdmitNoExecute - -``` - -(assuming they want to evict pods that are already running on those nodes if those -pods don't already tolerate the new taint) - -Then they ensure that the `PodSpec` for all pods created in namespace `banana` specify -a toleration with `key=dedicated`, `value=banana`, and `policy=NoScheduleNoAdmitNoExecute`. - -In the future, it would be nice to be able to specify the nodes via a `NodeSelector` rather than having -to enumerate them by name. - -## Future work - -At present, the Kubernetes security model allows any user to add and remove any -taints and tolerations. Obviously this makes it impossible to securely enforce -rules like dedicated nodes. We need some mechanism that prevents regular users -from mutating the `Taints` field of `NodeSpec` (probably we want to prevent them -from mutating any fields of `NodeSpec`) and from mutating the `Tolerations` -field of their pods. [#17549](https://github.com/kubernetes/kubernetes/issues/17549) -is relevant. - -Another security vulnerability arises if nodes are added to the cluster before -receiving their taint. 
Thus we need to ensure that a new node does not become -"Ready" until it has been configured with its taints. One way to do this is to -have an admission controller that adds the taint whenever a Node object is -created. - -A quota policy may want to treat nodes differently based on what taints, if any, -they have. For example, if a particular namespace is only allowed to access -dedicated nodes, then it may be convenient to give the namespace unlimited -quota. (To use finite quota, you'd have to size the namespace's quota to the sum -of the sizes of the machines in the dedicated node group, and update it when -nodes are added/removed to/from the group.) - -It's conceivable that taints and tolerations could be unified with -[pod anti-affinity](https://github.com/kubernetes/kubernetes/pull/18265). -We have chosen not to do this for the reasons described in the "Future work" -section of that doc. - -## Backward compatibility - -Old scheduler versions will ignore taints and tolerations. New scheduler -versions will respect them. - -Users should not start using taints and tolerations until the full -implementation has been in Kubelet and the master for enough binary versions -that we feel comfortable that we will not need to roll back either Kubelet or -master to a version that does not support them. Longer-term we will use a -programatic approach to enforcing this ([#4855](https://github.com/kubernetes/kubernetes/issues/4855)). - -## Related issues - -This proposal is based on the discussion in [#17190](https://github.com/kubernetes/kubernetes/issues/17190). -There are a number of other related issues, all of which are linked to from -[#17190](https://github.com/kubernetes/kubernetes/issues/17190). - -The relationship between taints and node drains is discussed in [#1574](https://github.com/kubernetes/kubernetes/issues/1574). - -The concepts of taints and tolerations were originally developed as part of the -Omega project at Google. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
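A minimal sketch of the admission-side logic the archived dedicated-nodes proposal expects: a controller that appends the `dedicated=banana` toleration to every pod created in the `banana` namespace. The `Toleration` type is a local copy of the struct proposed above, and the hard-coded namespace map is a hypothetical stand-in for a real policy store; note also that the effect name follows the proposal, while the taint effects that eventually shipped are `NoSchedule`, `PreferNoSchedule`, and `NoExecute`.

```go
// Sketch of a "dedicated nodes" admission controller's mutation step.
package main

import "fmt"

type Toleration struct {
	Key      string
	Operator string // "Exists" or "Equal"
	Value    string
	Effect   string
}

// dedicatedNamespaces maps a namespace to the toleration its pods should
// receive (hypothetical policy data for this sketch).
var dedicatedNamespaces = map[string]Toleration{
	"banana": {Key: "dedicated", Operator: "Equal", Value: "banana", Effect: "NoScheduleNoAdmitNoExecute"},
}

// admit returns the pod's tolerations with the namespace's dedicated-node
// toleration appended, mirroring step 6 of the implementation plan above.
func admit(namespace string, tolerations []Toleration) []Toleration {
	if t, ok := dedicatedNamespaces[namespace]; ok {
		tolerations = append(tolerations, t)
	}
	return tolerations
}

func main() {
	fmt.Println(admit("banana", nil))
}
```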
\ No newline at end of file diff --git a/contributors/design-proposals/service-catalog/OWNERS b/contributors/design-proposals/service-catalog/OWNERS deleted file mode 100644 index a4884d4d..00000000 --- a/contributors/design-proposals/service-catalog/OWNERS +++ /dev/null @@ -1,8 +0,0 @@ -# See the OWNERS docs at https://go.k8s.io/owners - -reviewers: - - sig-service-catalog-leads -approvers: - - sig-service-catalog-leads -labels: - - sig/service-catalog diff --git a/contributors/design-proposals/service-catalog/pod-preset.md b/contributors/design-proposals/service-catalog/pod-preset.md index 8991e9f8..f0fbec72 100644 --- a/contributors/design-proposals/service-catalog/pod-preset.md +++ b/contributors/design-proposals/service-catalog/pod-preset.md @@ -1,726 +1,6 @@ -# Pod Preset +Design proposals have been archived. - * [Abstract](#abstract) - * [Motivation](#motivation) - * [Constraints and Assumptions](#constraints-and-assumptions) - * [Use Cases](#use-cases) - * [Summary](#summary) - * [Prior Art](#prior-art) - * [Objectives](#objectives) - * [Proposed Changes](#proposed-changes) - * [PodPreset API object](#podpreset-api-object) - * [Validations](#validations) - * [AdmissionControl Plug-in: PodPreset](#admissioncontrol-plug-in-podpreset) - * [Behavior](#behavior) - * [PodPreset Exclude Annotation](#podpreset-exclude-annotation) - * [Examples](#examples) - * [Simple Pod Spec Example](#simple-pod-spec-example) - * [Pod Spec with `ConfigMap` Example](#pod-spec-with-configmap-example) - * [ReplicaSet with Pod Spec Example](#replicaset-with-pod-spec-example) - * [Multiple PodPreset Example](#multiple-podpreset-example) - * [Conflict Example](#conflict-example) +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Abstract - -**PodPresets did not progress out of alpha since 1.6. They were removed in 1.20** - -Describes a policy resource that allows for the loose coupling of a Pod's -definition from additional runtime requirements for that Pod. For example, -mounting of Secrets, or setting additional environment variables, -may not be known at Pod deployment time, but may be required at Pod creation -time. - -## Motivation - -Consuming a service involves more than just connectivity. In addition to -coordinates to reach the service, credentials and non-secret configuration -parameters are typically needed to use the service. The primitives for this -already exist, but a gap exists where loose coupling is desired: it should be -possible to inject pods with the information they need to use a service on a -service-by-service basis, without the pod authors having to incorporate the -information into every pod spec where it is needed. - -## Constraints and Assumptions - -1. Future work might require new mechanisms to be made to work with existing - controllers such as deployments and replicasets that create pods. Existing - controllers that create pods should recreate their pods when a new Pod Injection - Policy is added that would effect them. - -## Use Cases - -- As a user, I want to be able to provision a new pod - without needing to know the application configuration primitives the - services my pod will consume. -- As a cluster admin, I want specific configuration items of a service to be - withheld visibly from a developer deploying a service, but not to block the - developer from shipping. 
-- As an app developer, I want to provision a Cloud Spanner instance and then - access it from within my Kubernetes cluster. -- As an app developer, I want the Cloud Spanner provisioning process to - configure my Kubernetes cluster so the endpoints and credentials for my - Cloud Spanner instance are implicitly injected into Pods matching a label - selector (without me having to modify the PodSpec to add the specific - Configmap/Secret containing the endpoint/credential data). - - -**Specific Example:** - -1. Database Administrator provisions a MySQL service for their cluster. -2. Database Administrator creates secrets for the cluster containing the - database name, username, and password. -3. Database Administrator creates a `PodPreset` defining the database - port as an environment variable, as well as the secrets. See - [Examples](#examples) below for various examples. -4. Developer of an application can now label their pod with the specified - `Selector` the Database Administrator tells them, and consume the MySQL - database without needing to know any of the details from step 2 and 3. - -### Summary - -The use case we are targeting is to automatically inject into Pods the -information required to access non-Kubernetes-Services, such as accessing an -instances of Cloud Spanner. Accessing external services such as Cloud Spanner -may require the Pods to have specific credential and endpoint data. - -Using a Pod Preset allows pod template authors to not have to explicitly -set information for every pod. This way authors of pod templates consuming a -specific service do not need to know all the details about that service. - -### Prior Art - -Internally for Kubernetes we already support accessing the Kubernetes api from -all Pods by injecting the credentials and endpoint data automatically - e.g. -injecting the serviceaccount credentials into a volume (via secret) using an -[admission controller](https://github.com/kubernetes/kubernetes/blob/97212f5b3a2961d0b58a20bdb6bda3ccfa159bd7/plugin/pkg/admission/serviceaccount/admission.go), -and injecting the Service endpoints into environment -variables. This is done without the Pod explicitly mounting the serviceaccount -secret. - -### Objectives - -The goal of this proposal is to generalize these capabilities so we can introduce -similar support for accessing Services running external to the Kubernetes cluster. -We can assume that an appropriate Secret and Configmap have already been created -as part of the provisioning process of the external service. The need then is to -provide a mechanism for injecting the Secret and Configmap into Pods automatically. - -The [ExplicitServiceLinks proposal](https://github.com/kubernetes/community/pull/176), -will allow us to decouple where a Service's credential and endpoint information -is stored in the Kubernetes cluster from a Pod's intent to access that Service -(e.g. in declaring it wants to access a Service, a Pod is automatically injected -with the credential and endpoint data required to do so). - -## Proposed Changes - -### PodPreset API object - -This resource is alpha. The policy itself is immutable. The API group will be -added to new group `settings` and the version is `v1alpha1`. - -```go -// PodPreset is a policy resource that defines additional runtime -// requirements for a Pod. -type PodPreset struct { - unversioned.TypeMeta - ObjectMeta - - // +optional - Spec PodPresetSpec -} - -// PodPresetSpec is a description of a pod preset. 
-type PodPresetSpec struct { - // Selector is a label query over a set of resources, in this case pods. - // Required. - Selector unversioned.LabelSelector - // Env defines the collection of EnvVar to inject into containers. - // +optional - Env []EnvVar - // EnvFrom defines the collection of EnvFromSource to inject into - // containers. - // +optional - EnvFrom []EnvFromSource - // Volumes defines the collection of Volume to inject into the pod. - // +optional - Volumes []Volume `json:omitempty` - // VolumeMounts defines the collection of VolumeMount to inject into - // containers. - // +optional - VolumeMounts []VolumeMount -} -``` - -#### Validations - -In order for the Pod Preset to be valid it must fulfill the -following constraints: - -- The `Selector` field must be defined. This is how we know which pods - to inject so therefore it is required and cannot be empty. -- The policy must define _at least_ 1 of `Env`, `EnvFrom`, or `Volumes` with - corresponding `VolumeMounts`. -- If you define a `Volume`, it has to define a `VolumeMount`. -- For `Env`, `EnvFrom`, `Volumes`, and `VolumeMounts` all existing API - validations are applied. - -This resource will be immutable, if you want to change something you can delete -the old policy and recreate a new one. We can change this to be mutable in the -future but by disallowing it now, we will not break people in the future. - -#### Conflicts - -There are a number of edge conditions that might occur at the time of -injection. These are as follows: - -- Merging lists with no conflicts: if a pod already has a `Volume`, - `VolumeMount` or `EnvVar` defined **exactly** as defined in the - PodPreset. No error will occur since they are the exact same. The - motivation behind this is if services have no quite converted to using pod - injection policies yet and have duplicated information and an error should - obviously not be thrown if the items that need to be injected already exist - and are exactly the same. -- Merging lists with conflicts: if a PIP redefines an `EnvVar` or a `Volume`, - an event on the pod showing the error on the conflict will be thrown and - nothing will be injected. -- Conflicts between `Env` and `EnvFrom`: this would throw an error with an - event on the pod showing the error on the conflict. Nothing would be - injected. - -> **Note:** In the case of a conflict nothing will be injected. The entire -> policy is ignored and an event is thrown on the pod detailing the conflict. - -### AdmissionControl Plug-in: PodPreset - -The **PodPreset** plug-in introspects all incoming pod creation -requests and injects the pod based off a `Selector` with the desired -attributes, except when the [PodPreset Exclude Annotation](#podpreset-exclude-annotation) -is set to true. - -For the initial alpha, the order of precedence for applying multiple -`PodPreset` specs is from oldest to newest. All Pod Injection -Policies in a namespace should be order agnostic; the order of application is -unspecified. Users should ensure that policies do not overlap. -However we can use merge keys to detect some of the conflicts that may occur. - -This will not be enabled by default for all clusters, but once GA will be -a part of the set of strongly recommended plug-ins documented -[here](https://kubernetes.io/docs/admin/admission-controllers/#is-there-a-recommended-set-of-plug-ins-to-use). - -**Why not an Initializer?** - -This will be first implemented as an AdmissionControl plug-in then can be -converted to an Initializer once that is fully ready. 
The proposal for -Initializers can be found at [kubernetes/community#132](https://github.com/kubernetes/community/pull/132). - -#### PodPreset Exclude Annotation -There may be instances where you wish for a pod to not be altered by any pod -preset mutations. For these events, one can add an annotation in the pod spec -of the form: `podpreset.admission.kubernetes.io/exclude: "true"`. - -#### Behavior - -This will modify the pod spec. The supported changes to -`Env`, `EnvFrom`, and `VolumeMounts` apply to the container spec for -all containers in the pod with the specified matching `Selector`. The -changes to `Volumes` apply to the pod spec for all pods matching `Selector`. - -The resultant modified pod spec will be annotated to show that it was modified by -the `PodPreset`. This will be of the form -`podpreset.admission.kubernetes.io/podpreset-<pip name>": "<resource version>"`. - -*Why modify all containers in a pod?* - -Currently there is no concept of labels on specific containers in a pod which -would be necessary for per-container pod injections. We could add labels -for specific containers which would allow this and be the best solution to not -injecting all. Container labels have been discussed various times through -multiple issues and proposals, which all congregate to this thread on the -[kubernetes-sig-node mailing -list](https://groups.google.com/forum/#!topic/kubernetes-sig-node/gijxbYC7HT8). -In the future, even if container labels were added, we would need to be careful -about not making breaking changes to the current behavior. - -Other solutions include basing the container to inject based off -matching its name to another field in the `PodPreset` spec, but -this would not scale well and would cause annoyance with configuration -management. - -In the future we might question whether we need or want containers to express -that they expect injection. At this time we are deferring this issue. - -## Examples - -### Simple Pod Spec Example - -This is a simple example to show how a Pod spec is modified by the Pod -Injection Policy. - -**User submitted pod spec:** - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: website - labels: - app: website - role: frontend -spec: - containers: - - name: website - image: ecorp/website - ports: - - containerPort: 80 -``` - -**Example Pod Preset:** - -```yaml -kind: PodPreset -apiVersion: settings/v1alpha1 -metadata: - name: allow-database - namespace: myns -spec: - selector: - matchLabels: - role: frontend - env: - - name: DB_PORT - value: 6379 - volumeMounts: - - mountPath: /cache - name: cache-volume - volumes: - - name: cache-volume - emptyDir: {} -``` - -**Pod spec after admission controller:** - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: website - labels: - app: website - role: frontend - annotations: - podpreset.admission.kubernetes.io/allow-database: "resource version" -spec: - containers: - - name: website - image: ecorp/website - volumeMounts: - - mountPath: /cache - name: cache-volume - ports: - - containerPort: 80 - env: - - name: DB_PORT - value: 6379 - volumes: - - name: cache-volume - emptyDir: {} -``` - -### Pod Spec with `ConfigMap` Example - -This is an example to show how a Pod spec is modified by the Pod Injection -Policy that defines a `ConfigMap` for Environment Variables. 
- -**User submitted pod spec:** - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: website - labels: - app: website - role: frontend -spec: - containers: - - name: website - image: ecorp/website - ports: - - containerPort: 80 -``` - -**User submitted `ConfigMap`:** - -```yaml -apiVersion: v1 -kind: ConfigMap -metadata: - name: etcd-env-config -data: - number_of_members: "1" - initial_cluster_state: new - initial_cluster_token: DUMMY_ETCD_INITIAL_CLUSTER_TOKEN - discovery_token: DUMMY_ETCD_DISCOVERY_TOKEN - discovery_url: http://etcd_discovery:2379 - etcdctl_peers: http://etcd:2379 - duplicate_key: FROM_CONFIG_MAP - REPLACE_ME: "a value" -``` - -**Example Pod Preset:** - -```yaml -kind: PodPreset -apiVersion: settings/v1alpha1 -metadata: - name: allow-database - namespace: myns -spec: - selector: - matchLabels: - role: frontend - env: - - name: DB_PORT - value: 6379 - - name: duplicate_key - value: FROM_ENV - - name: expansion - value: $(REPLACE_ME) - envFrom: - - configMapRef: - name: etcd-env-config - volumeMounts: - - mountPath: /cache - name: cache-volume - - mountPath: /etc/app/config.json - readOnly: true - name: secret-volume - volumes: - - name: cache-volume - emptyDir: {} - - name: secret-volume - secretName: config-details -``` - -**Pod spec after admission controller:** - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: website - labels: - app: website - role: frontend - annotations: - podpreset.admission.kubernetes.io/allow-database: "resource version" -spec: - containers: - - name: website - image: ecorp/website - volumeMounts: - - mountPath: /cache - name: cache-volume - - mountPath: /etc/app/config.json - readOnly: true - name: secret-volume - ports: - - containerPort: 80 - env: - - name: DB_PORT - value: 6379 - - name: duplicate_key - value: FROM_ENV - - name: expansion - value: $(REPLACE_ME) - envFrom: - - configMapRef: - name: etcd-env-config - volumes: - - name: cache-volume - emptyDir: {} - - name: secret-volume - secretName: config-details -``` - -### ReplicaSet with Pod Spec Example - -The following example shows that only the pod spec is modified by the Pod -Injection Policy. 
- -**User submitted ReplicaSet:** - -```yaml -apiVersion: settings/v1alpha1 -kind: ReplicaSet -metadata: - name: frontend -spec: - replicas: 3 - selector: - matchLabels: - tier: frontend - matchExpressions: - - {key: tier, operator: In, values: [frontend]} - template: - metadata: - labels: - app: guestbook - tier: frontend - spec: - containers: - - name: php-redis - image: gcr.io/google_samples/gb-frontend:v3 - resources: - requests: - cpu: 100m - memory: 100Mi - env: - - name: GET_HOSTS_FROM - value: dns - ports: - - containerPort: 80 -``` - -**Example Pod Preset:** - -```yaml -kind: PodPreset -apiVersion: settings/v1alpha1 -metadata: - name: allow-database - namespace: myns -spec: - selector: - matchLabels: - tier: frontend - env: - - name: DB_PORT - value: 6379 - volumeMounts: - - mountPath: /cache - name: cache-volume - volumes: - - name: cache-volume - emptyDir: {} -``` - -**Pod spec after admission controller:** - -```yaml -kind: Pod - metadata: - labels: - app: guestbook - tier: frontend - annotations: - podpreset.admission.kubernetes.io/allow-database: "resource version" - spec: - containers: - - name: php-redis - image: gcr.io/google_samples/gb-frontend:v3 - resources: - requests: - cpu: 100m - memory: 100Mi - volumeMounts: - - mountPath: /cache - name: cache-volume - env: - - name: GET_HOSTS_FROM - value: dns - - name: DB_PORT - value: 6379 - ports: - - containerPort: 80 - volumes: - - name: cache-volume - emptyDir: {} -``` - -### Multiple PodPreset Example - -This is an example to show how a Pod spec is modified by multiple Pod -Injection Policies. - -**User submitted pod spec:** - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: website - labels: - app: website - role: frontend -spec: - containers: - - name: website - image: ecorp/website - ports: - - containerPort: 80 -``` - -**Example Pod Preset:** - -```yaml -kind: PodPreset -apiVersion: settings/v1alpha1 -metadata: - name: allow-database - namespace: myns -spec: - selector: - matchLabels: - role: frontend - env: - - name: DB_PORT - value: 6379 - volumeMounts: - - mountPath: /cache - name: cache-volume - volumes: - - name: cache-volume - emptyDir: {} -``` - -**Another Pod Preset:** - -```yaml -kind: PodPreset -apiVersion: settings/v1alpha1 -metadata: - name: proxy - namespace: myns -spec: - selector: - matchLabels: - role: frontend - volumeMounts: - - mountPath: /etc/proxy/configs - name: proxy-volume - volumes: - - name: proxy-volume - emptyDir: {} -``` - -**Pod spec after admission controller:** - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: website - labels: - app: website - role: frontend - annotations: - podpreset.admission.kubernetes.io/allow-database: "resource version" - podpreset.admission.kubernetes.io/proxy: "resource version" -spec: - containers: - - name: website - image: ecorp/website - volumeMounts: - - mountPath: /cache - name: cache-volume - - mountPath: /etc/proxy/configs - name: proxy-volume - ports: - - containerPort: 80 - env: - - name: DB_PORT - value: 6379 - volumes: - - name: cache-volume - emptyDir: {} - - name: proxy-volume - emptyDir: {} -``` - -### Conflict Example - -This is a example to show how a Pod spec is not modified by the Pod Injection -Policy when there is a conflict. 
- -**User submitted pod spec:** - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: website - labels: - app: website - role: frontend -spec: - containers: - - name: website - image: ecorp/website - volumeMounts: - - mountPath: /cache - name: cache-volume - ports: - volumes: - - name: cache-volume - emptyDir: {} - - containerPort: 80 -``` - -**Example Pod Preset:** - -```yaml -kind: PodPreset -apiVersion: settings/v1alpha1 -metadata: - name: allow-database - namespace: myns -spec: - selector: - matchLabels: - role: frontend - env: - - name: DB_PORT - value: 6379 - volumeMounts: - - mountPath: /cache - name: other-volume - volumes: - - name: other-volume - emptyDir: {} -``` - -**Pod spec after admission controller will not change because of the conflict:** - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: website - labels: - app: website - role: frontend -spec: - containers: - - name: website - image: ecorp/website - volumeMounts: - - mountPath: /cache - name: cache-volume - ports: - volumes: - - name: cache-volume - emptyDir: {} - - containerPort: 80 -``` - -**If we run `kubectl describe...` we can see the event:** - -``` -$ kubectl describe ... -.... -Events: - FirstSeen LastSeen Count From SubobjectPath Reason Message - Tue, 07 Feb 2017 16:56:12 -0700 Tue, 07 Feb 2017 16:56:12 -0700 1 {podpreset.admission.kubernetes.io/allow-database } conflict Conflict on pod preset. Duplicate mountPath /cache. -``` +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
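The conflict example above hinges on two volume mounts claiming the same `mountPath` with different volume names. A small sketch of that check, under the proposal's rules (identical entries are a no-op, any real conflict skips the whole preset and surfaces as a pod event); the types are local simplifications of `core/v1` VolumeMount, not the actual admission plug-in code.

```go
// Sketch of the volumeMounts merge/conflict rule used by the PodPreset plug-in.
package main

import "fmt"

type VolumeMount struct {
	Name      string
	MountPath string
}

// mergeVolumeMounts merges preset mounts into the container's existing mounts.
// An entry that already exists verbatim is skipped; a different volume on the
// same mountPath is a conflict and aborts the injection.
func mergeVolumeMounts(existing, preset []VolumeMount) ([]VolumeMount, error) {
	byPath := map[string]VolumeMount{}
	for _, m := range existing {
		byPath[m.MountPath] = m
	}
	merged := append([]VolumeMount{}, existing...)
	for _, m := range preset {
		if cur, ok := byPath[m.MountPath]; ok {
			if cur == m {
				continue // exact duplicate: nothing to inject, no error
			}
			return nil, fmt.Errorf("conflict on pod preset: duplicate mountPath %s", m.MountPath)
		}
		merged = append(merged, m)
		byPath[m.MountPath] = m
	}
	return merged, nil
}

func main() {
	existing := []VolumeMount{{Name: "cache-volume", MountPath: "/cache"}}
	preset := []VolumeMount{{Name: "other-volume", MountPath: "/cache"}}
	_, err := mergeVolumeMounts(existing, preset)
	fmt.Println(err) // duplicate /cache, so the pod spec is left unchanged
}
```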
\ No newline at end of file diff --git a/contributors/design-proposals/storage/OWNERS b/contributors/design-proposals/storage/OWNERS deleted file mode 100644 index 6dd5158f..00000000 --- a/contributors/design-proposals/storage/OWNERS +++ /dev/null @@ -1,8 +0,0 @@ -# See the OWNERS docs at https://go.k8s.io/owners - -reviewers: - - sig-storage-leads -approvers: - - sig-storage-leads -labels: - - sig/storage diff --git a/contributors/design-proposals/storage/attacher-detacher-refactor-for-local-storage.md b/contributors/design-proposals/storage/attacher-detacher-refactor-for-local-storage.md index 0833aa0a..f0fbec72 100644 --- a/contributors/design-proposals/storage/attacher-detacher-refactor-for-local-storage.md +++ b/contributors/design-proposals/storage/attacher-detacher-refactor-for-local-storage.md @@ -1,281 +1,6 @@ ---- +Design proposals have been archived. -title: Attacher/Detacher refactor for local storage +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -authors: -- "@NickrenREN" -owning-sig: sig-storage - -participating-sigs: - - nil - -reviewers: - - "@msau42" - - "@jsafrane" - -approvers: - - "@jsafrane" - - "@msau42" - - "@saad-ali" - -editor: TBD - -creation-date: 2018-07-30 - -last-updated: 2018-07-30 - -status: provisional - ---- - -## Table of Contents - * [Table of Contents](#table-of-contents) -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) -* [Proposal](#proposal) -* [Implementation](#implementation) - * [Volume plugin interface change](#volume-plugin-interface-change) - * [MountVolume/UnmountDevice generation function change](#MountVolume/UnmountDevice-generation-function-change) - * [Volume plugin change](#volume-plugin-change) -* [Future](#future) - -## Summary - -Today, the workflow for a volume to be used by pod is: - -- attach a remote volume to the node instance (if it is attachable) -- wait for the volume to be attached (if it is attachable) -- mount the device to a global path (if it is attachable) -- mount the global path to a pod directory - -It is ok for remote block storage plugins which have a remote attach api,such as `GCE PD`, `AWS EBS` -and remote fs storage plugins such as `NFS`, and `Cephfs`. - -But it is not so good for plugins which need local attach such as `fc`, `iscsi` and `RBD`. - -It is not so good for local storage neither which is not attachable but needs `MountDevice` - - -## Motivation - -### Goals - - Update Attacher/Detacher interfaces for local storage - -### Non-Goals - - Update `fc`, `iscsi` and `RBD` implementation according to the new interfaces - -## Proposal - -Here we propose to only update the Attacher/Detacher interfaces for local storage. -We may expand it in future to `iscsi`, `RBD` and `fc`, if we figure out how to prevent multiple local attach without implementing attacher interface. - -## Implementation - -### Volume plugin interface change - -We can create a new interface `DeviceMounter`, move `GetDeviceMountPath` and `MountDevice` from `Attacher`to it. - -We can put `DeviceMounter` in `Attacher` which means any one who implements the `Attacher` interface must implement `DeviceMounter`. - -``` -// Attacher can attach a volume to a node. -type Attacher interface { - DeviceMounter - - // Attaches the volume specified by the given spec to the node with the given Name. - // On success, returns the device path where the device was attached on the - // node. 
- Attach(spec *Spec, nodeName types.NodeName) (string, error) - - // VolumesAreAttached checks whether the list of volumes still attached to the specified - // node. It returns a map which maps from the volume spec to the checking result. - // If an error is occurred during checking, the error will be returned - VolumesAreAttached(specs []*Spec, nodeName types.NodeName) (map[*Spec]bool, error) - - // WaitForAttach blocks until the device is attached to this - // node. If it successfully attaches, the path to the device - // is returned. Otherwise, if the device does not attach after - // the given timeout period, an error will be returned. - WaitForAttach(spec *Spec, devicePath string, pod *v1.Pod, timeout time.Duration) (string, error) -} - -// DeviceMounter can mount a block volume to a global path. -type DeviceMounter interface { - // GetDeviceMountPath returns a path where the device should - // be mounted after it is attached. This is a global mount - // point which should be bind mounted for individual volumes. - GetDeviceMountPath(spec *Spec) (string, error) - - // MountDevice mounts the disk to a global path which - // individual pods can then bind mount - // Note that devicePath can be empty if the volume plugin does not implement any of Attach and WaitForAttach methods. - MountDevice(spec *Spec, devicePath string, deviceMountPath string) error -} - -``` - -Note: we also need to make sure that if our plugin implements the `DeviceMounter` interface, -then executing mount operation from multiple pods referencing the same volume in parallel should be avoided, -even if it does not implement the `Attacher` interface. - -Since `NestedPendingOperations` can achieve this by setting the same volumeName and same or empty podName in one operation, -we just need to add another check in `MountVolume`: check if the volume is DeviceMountable. - -We also need to create another new interface `DeviceUmounter`, and move `UnmountDevice` to it. -``` -// Detacher can detach a volume from a node. -type Detacher interface { - DeviceUnmounter - - // Detach the given volume from the node with the given Name. - // volumeName is name of the volume as returned from plugin's - // GetVolumeName(). - Detach(volumeName string, nodeName types.NodeName) error -} - -// DeviceUnmounter can unmount a block volume from the global path. -type DeviceUnmounter interface { - // UnmountDevice unmounts the global mount of the disk. This - // should only be called once all bind mounts have been - // unmounted. - UnmountDevice(deviceMountPath string) error -} -``` -Accordingly, we need to create a new interface `DeviceMountableVolumePlugin` and move `GetDeviceMountRefs` to it. -``` -// AttachableVolumePlugin is an extended interface of VolumePlugin and is used for volumes that require attachment -// to a node before mounting. -type AttachableVolumePlugin interface { - DeviceMountableVolumePlugin - NewAttacher() (Attacher, error) - NewDetacher() (Detacher, error) -} - -// DeviceMountableVolumePlugin is an extended interface of VolumePlugin and is used -// for volumes that requires mount device to a node before binding to volume to pod. 
-type DeviceMountableVolumePlugin interface { - VolumePlugin - NewDeviceMounter() (DeviceMounter, error) - NewDeviceUmounter() (DeviceUmounter, error) - GetDeviceMountRefs(deviceMountPath string) ([]string, error) -} -``` - -### MountVolume/UnmountDevice generation function change - -Currently we will check if the volume plugin is attachable in `GenerateMountVolumeFunc`, if it is, we need to call `WaitForAttach` ,`GetDeviceMountPath` and `MountDevice` first, and then set up the volume. - -After the refactor, we can split that into three sections: check if volume is attachable, check if it is deviceMountable and set up the volume. -``` -devicePath := volumeToMount.DevicePath -if volumeAttacher != nil { - devicePath, err = volumeAttacher.WaitForAttach( - volumeToMount.VolumeSpec, devicePath, volumeToMount.Pod, waitForAttachTimeout) - if err != nil { - // On failure, return error. Caller will log and retry. - return volumeToMount.GenerateError("MountVolume.WaitForAttach failed", err) - } - // Write the attached device path back to volumeToMount, which can be used for MountDevice. - volumeToMount.DevicePath = devicePath -} - -if volumeDeviceMounter != nil { - deviceMountPath, err := - volumeDeviceMounter.GetDeviceMountPath(volumeToMount.VolumeSpec) - if err != nil { - // On failure, return error. Caller will log and retry. - return volumeToMount.GenerateError("MountVolume.GetDeviceMountPath failed", err) - } - deviceMountPath, err := volumeDeviceMounter.MountDevice(volumeToMount.VolumeSpec, devicePath, deviceMountPath) - if err != nil { - // On failure, return error. Caller will log and retry. - return volumeToMount.GenerateError("MountVolume.MountDevice failed", err) - } - - glog.Infof(volumeToMount.GenerateMsgDetailed("MountVolume.MountDevice succeeded", fmt.Sprintf("device mount path %q", deviceMountPath))) - - // Update actual state of world to reflect volume is globally mounted - markDeviceMountedErr := actualStateOfWorld.MarkDeviceAsMounted( - volumeToMount.VolumeName) - if markDeviceMountedErr != nil { - // On failure, return error. Caller will log and retry. - return volumeToMount.GenerateError("MountVolume.MarkDeviceAsMounted failed", markDeviceMountedErr) - } -} -``` -Note that since local storage plugin will not implement the Attacher interface, we can get the device path directly from `spec.PersistentVolume.Spec.Local.Path` when we run `MountDevice` - -The device unmounting operation will be executed in `GenerateUnmountDeviceFunc`, we can update the device unmounting generation function as below: -``` -// Get DeviceMounter plugin -deviceMountableVolumePlugin, err := - og.volumePluginMgr.FindDeviceMountablePluginByName(deviceToDetach.PluginName) -if err != nil || deviceMountableVolumePlugin == nil { - return volumetypes.GeneratedOperations{}, deviceToDetach.GenerateErrorDetailed("UnmountDevice.FindDeviceMountablePluginByName failed", err) -} - -volumeDeviceUmounter, err := deviceMountablePlugin.NewDeviceUmounter() -if err != nil { - return volumetypes.GeneratedOperations{}, deviceToDetach.GenerateErrorDetailed("UnmountDevice.NewDeviceUmounter failed", err) -} - -volumeDeviceMounter, err := deviceMountableVolumePlugin.NewDeviceMounter() -if err != nil { - return volumetypes.GeneratedOperations{}, deviceToDetach.GenerateErrorDetailed("UnmountDevice.NewDeviceMounter failed", err) -} - -unmountDeviceFunc := func() (error, error) { - deviceMountPath, err := - volumeDeviceMounter.GetDeviceMountPath(deviceToDetach.VolumeSpec) - if err != nil { - // On failure, return error. 
Caller will log and retry. - return deviceToDetach.GenerateError("GetDeviceMountPath failed", err) - } - refs, err := deviceMountablePlugin.GetDeviceMountRefs(deviceMountPath) - - if err != nil || mount.HasMountRefs(deviceMountPath, refs) { - if err == nil { - err = fmt.Errorf("The device mount path %q is still mounted by other references %v", deviceMountPath, refs) - } - return deviceToDetach.GenerateError("GetDeviceMountRefs check failed", err) - } - // Execute unmount - unmountDeviceErr := volumeDeviceUmounter.UnmountDevice(deviceMountPath) - if unmountDeviceErr != nil { - // On failure, return error. Caller will log and retry. - return deviceToDetach.GenerateError("UnmountDevice failed", unmountDeviceErr) - } - // Before logging that UnmountDevice succeeded and moving on, - // use mounter.PathIsDevice to check if the path is a device, - // if so use mounter.DeviceOpened to check if the device is in use anywhere - // else on the system. Retry if it returns true. - deviceOpened, deviceOpenedErr := isDeviceOpened(deviceToDetach, mounter) - if deviceOpenedErr != nil { - return nil, deviceOpenedErr - } - // The device is still in use elsewhere. Caller will log and retry. - if deviceOpened { - return deviceToDetach.GenerateError( - "UnmountDevice failed", - fmt.Errorf("the device is in use when it was no longer expected to be in use")) - } - - ... - - return nil, nil - } - -``` - -### Volume plugin change - -We need to olny implement the DeviceMounter/DeviceUnmounter interface for local storage since it is not attachable. -And we can keep `fc`,`iscsi` and `RBD` unchanged at the first stage. - -## Future -Update `iscsi`, `RBD` and `fc` volume plugins accordingly, if we figure out how to prevent multiple local attach without implementing attacher interface. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
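To make the intended plugin shape concrete, here is a sketch of a local-storage plugin after the refactor described above: it satisfies the proposed DeviceMounter/DeviceUnmounter split but deliberately implements no Attacher/Detacher. The interfaces and `Spec` type are pared-down stand-ins for the real volume plugin API, and the mount path and printed actions are placeholders rather than real mount syscalls.

```go
// Sketch: a device-mountable, non-attachable local volume plugin.
package main

import "fmt"

type Spec struct {
	LocalPath string // stands in for spec.PersistentVolume.Spec.Local.Path
}

type DeviceMounter interface {
	GetDeviceMountPath(spec *Spec) (string, error)
	MountDevice(spec *Spec, devicePath, deviceMountPath string) error
}

type DeviceUnmounter interface {
	UnmountDevice(deviceMountPath string) error
}

// localVolumePlugin implements only the device mount/unmount half of the API.
type localVolumePlugin struct{}

func (p *localVolumePlugin) GetDeviceMountPath(spec *Spec) (string, error) {
	return "/var/lib/kubelet/plugins/kubernetes.io/local-volume/mounts/example", nil
}

func (p *localVolumePlugin) MountDevice(spec *Spec, devicePath, deviceMountPath string) error {
	// devicePath may be empty because the plugin has no Attach/WaitForAttach;
	// the source comes straight from the local PV spec instead.
	fmt.Printf("bind mounting %s onto %s\n", spec.LocalPath, deviceMountPath)
	return nil
}

func (p *localVolumePlugin) UnmountDevice(deviceMountPath string) error {
	fmt.Printf("unmounting %s\n", deviceMountPath)
	return nil
}

// Compile-time checks: the plugin is device-mountable but not attachable.
var _ DeviceMounter = &localVolumePlugin{}
var _ DeviceUnmounter = &localVolumePlugin{}

func main() {
	var m DeviceMounter = &localVolumePlugin{}
	path, _ := m.GetDeviceMountPath(&Spec{LocalPath: "/mnt/disks/ssd0"})
	_ = m.MountDevice(&Spec{LocalPath: "/mnt/disks/ssd0"}, "", path)
}
```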
\ No newline at end of file diff --git a/contributors/design-proposals/storage/container-storage-interface-pod-information.md b/contributors/design-proposals/storage/container-storage-interface-pod-information.md index 872f9d45..f0fbec72 100644 --- a/contributors/design-proposals/storage/container-storage-interface-pod-information.md +++ b/contributors/design-proposals/storage/container-storage-interface-pod-information.md @@ -1,48 +1,6 @@ -# Pod in CSI NodePublish request -Author: @jsafrane +Design proposals have been archived. -## Goal -* Pass Pod information (pod name/namespace/UID + service account) to CSI drivers in `NodePublish` request as CSI volume attributes. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Motivation -We'd like to move away from exec based Flex to gRPC based CSI volumes. In Flex, kubelet always passes `pod.namespace`, `pod.name`, `pod.uid` and `pod.spec.serviceAccountName` ("pod information") in every `mount` call. In Kubernetes community we've seen some Flex drivers that use pod or service account information to authorize or audit usage of a volume or generate content of the volume tailored to the pod (e.g. https://github.com/Azure/kubernetes-keyvault-flexvol). -CSI is agnostic to container orchestrators (such as Kubernetes, Mesos or CloudFoundry) and as such does not understand concept of pods and service accounts. [Enhancement of CSI protocol](https://github.com/container-storage-interface/spec/pull/252) to pass "workload" (~pod) information from Kubernetes to CSI driver has met some resistance. - -## High-level design -We decided to pass the pod information as `NodePublishVolumeRequest.volume_attributes`. - -* Kubernetes passes pod information only to CSI drivers that explicitly require that information in their [`CSIDriver` instance](https://github.com/kubernetes/community/pull/2523). These drivers are tightly coupled to Kubernetes and may not work or may require reconfiguration on other cloud orchestrators. It is expected (but not limited to) that these drivers will provide ephemeral volumes similar to Secrets or ConfigMap, extending Kubernetes secret or configuration sources. -* Kubernetes will not pass pod information to CSI drivers that don't know or don't care about pods and service accounts. It is expected (but not limited to) that these drivers will provide real persistent storage. Such CSI driver would reject a CSI call with pod information as invalid. This is current behavior of Kubernetes and it will be the default behavior. - -## Detailed design - -### API changes -No API changes. - -### CSI enhancement -We don't need to change CSI protocol in any way. It allows kubelet to pass `pod.name`, `pod.uid` and `pod.spec.serviceAccountName` in [`NodePublish` call as `volume_attributes`]((https://github.com/container-storage-interface/spec/blob/master/spec.md#nodepublishvolume)). `NodePublish` is roughly equivalent to Flex `mount` call. - -The only thing we need to do is to **define** names of the `volume_attributes` keys that CSI drivers can expect: - * `csi.storage.k8s.io/pod.name`: name of the pod that wants the volume. - * `csi.storage.k8s.io/pod.namespace`: namespace of the pod that wants the volume. - * `csi.storage.k8s.io/pod.uid`: uid of the pod that wants the volume. - * `csi.storage.k8s.io/serviceAccount.name`: name of the service account under which the pod operates. Namespace of the service account is the same as `pod.namespace`. 
- -Note that these attribute names are very similar to [parameters we pass to flex volume plugin](https://github.com/kubernetes/kubernetes/blob/10688257e63e4d778c499ba30cddbc8c6219abe9/pkg/volume/flexvolume/driver-call.go#L55). - -### Kubelet -Kubelet needs to create informer to cache `CSIDriver` instances. It passes the informer to CSI volume plugin as a new argument of [`ProbeVolumePlugins`](https://github.com/kubernetes/kubernetes/blob/43f805b7bdda7a5b491d34611f85c249a63d7f97/pkg/volume/csi/csi_plugin.go#L58). - -### CSI volume plugin -In `SetUpAt()`, the CSI volume plugin checks the `CSIDriver` informer if `CSIDriver` instance exists for a particular CSI driver that handles the volume. If the instance exists and has `PodInfoRequiredOnMount` set, the volume plugin adds `csi.storage.k8s.io/*` attributes to `volume_attributes` of the CSI volume. It blindly overwrites any existing values there. - -Kubelet and the volume plugin must tolerate when CRD for `CSIDriver` is not created (yet). Kubelet and CSI volume plugin falls back to original behavior, i.e. does not pass any pod information to CSI. We expect that CSI drivers will return reasonable error code instead of mounting a wrong volume. - -TODO(jsafrane): check what (shared?) informer does when it's created for non-existing CRD. Will it start working automatically when the CRD is created? Or shall we retry creation of the informer every X seconds until the CRD is created? Alternatively, we may GEt fresh `CSIDriver` from API server in `SetUpAt()`, without any informer. - -## Implementation - -* Alpha in 1.12 (behind `CSIPodInfo` feature gate) -* Beta in 1.13 (behind `CSIPodInfo` feature gate) -* GA 1.14? +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
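A sketch of the `SetUpAt()` behaviour the archived pod-information proposal describes: when the driver's `CSIDriver` object opts in to pod information, the plugin overlays the `csi.storage.k8s.io/*` keys onto the volume attributes passed to `NodePublish`, overwriting any existing values. The `Pod` struct and the boolean flag are simplified stand-ins for the real API objects and the `CSIDriver` informer lookup.

```go
// Sketch: injecting pod information into CSI NodePublish volume attributes.
package main

import "fmt"

type Pod struct {
	Name, Namespace, UID, ServiceAccountName string
}

// addPodInfo augments the volume attributes only when the driver asked for
// pod information via its CSIDriver object.
func addPodInfo(attrs map[string]string, pod Pod, podInfoRequired bool) map[string]string {
	if !podInfoRequired {
		return attrs // driver is pod-agnostic: pass attributes through unchanged
	}
	if attrs == nil {
		attrs = map[string]string{}
	}
	// Existing values under these keys are overwritten, as the proposal states.
	attrs["csi.storage.k8s.io/pod.name"] = pod.Name
	attrs["csi.storage.k8s.io/pod.namespace"] = pod.Namespace
	attrs["csi.storage.k8s.io/pod.uid"] = pod.UID
	attrs["csi.storage.k8s.io/serviceAccount.name"] = pod.ServiceAccountName
	return attrs
}

func main() {
	attrs := addPodInfo(map[string]string{"share": "vol1"},
		Pod{Name: "web-0", Namespace: "default", UID: "1234", ServiceAccountName: "default"}, true)
	fmt.Println(attrs)
}
```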
\ No newline at end of file diff --git a/contributors/design-proposals/storage/container-storage-interface-skip-attach.md b/contributors/design-proposals/storage/container-storage-interface-skip-attach.md index 6e956e92..f0fbec72 100644 --- a/contributors/design-proposals/storage/container-storage-interface-skip-attach.md +++ b/contributors/design-proposals/storage/container-storage-interface-skip-attach.md @@ -1,80 +1,6 @@ -# Skip attach for non-attachable CSI volumes +Design proposals have been archived. -Author: @jsafrane +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Goal -* Non-attachable CSI volumes should not require external attacher and `VolumeAttachment` instance creation. This will speed up pod startup. -## Motivation -Currently, CSI requires admin to start external CSI attacher for **all** CSI drivers, including those that don't implement attach/detach operation (such as NFS or all ephemeral Secrets-like volumes). Kubernetes Attach/Detach controller always creates `VolumeAttachment` objects for them and always waits until they're reported as "attached" by external CSI attacher. - -We want to skip creation of `VolumeAttachment` objects in A/D controller for CSI volumes that don't require 3rd party attach/detach. - -## Dependencies -In order to skip both A/D controller attaching a volume and kubelet waiting for the attachment, both of them need to know if a particular CSI driver is attachable or not. In this document we expect that proposal #2514 is implemented and both A/D controller and kubelet has informer on `CSIDriver` so they can check if a volume is attachable easily. - -## Design -### CSI volume plugin -* Rework [`Init`](https://github.com/kubernetes/kubernetes/blob/43f805b7bdda7a5b491d34611f85c249a63d7f97/pkg/volume/csi/csi_plugin.go#L58) to get or create informer to cache CSIDriver instances. - * Depending on where the API for CSIDriver ends up, we may: - * Rework VolumeHost to either provide the informer. This leaks CSI implementation details to A/D controller and kubelet - * Or the CSI volume plugin can create and run CSIDriver informer by itself. No other component in controller-manager or kubelet needs the informer right now, so a non-shared informer is viable option. Depending on when the API for CSIDriver ends up, `VolumeHost` may need to be extended to provide client interface to the API and kubelet and A/D controller may need to be updated to create the interface (somewhere in `cmd/`, where RESTConfig is still available to create new clients ) and pass it to their `VolumeHost` implementations. -* Rework `Attach`, `Detach`, `VolumesAreAttached` and `WaitForAttach` to check for `CSIDriver` instance using the informer. - * If CSIDriver for the driver exists and it's attachable, perform usual logic. - * If CSIDriver for the driver exists and it's not attachable, return success immediately (basically NOOP). A/D controller will still mark the volume as attached in `Node.Status.VolumesAttached`. - * If CSIDriver for the driver does not exist, perform usual logic (i.e. treat the volume as attachable). - * This keeps the behavior the same as in old Kubernetes version without CSIDriver object. - * This also happens when CSIDriver informer has not been quick enough. It is suggested that CSIDriver instance is created **before** any pod that uses corresponding CSI driver can run. 
- * In case that CSIDriver informer (or user) is too slow, CSI volume plugin `Attach()` will create `VolumeAttachment` instance and wait for (non-existing) external attacher to fulfill it. The CSI plugin shall recover when `CSIDriver` instance is created and skip attach. Any `VolumeAttachment` instance created here will be deleted on `Detach()`, see the next bullet. -* In addition to the above, `Detach()` removes `VolumeAttachment` instance even if the volume is not attachable. This deletes `VolumeAttachment` instances created by old A/D controller or before `CSIDriver` instance was created. - - -### Authorization -* A/D controller and kubelet must be allowed to list+watch CSIDriver instances. Updating RBAC rules should be enough. - -## API -No API changes. - -## Upgrade -This chapter covers: -* Upgrade from old Kubernetes that has `CSISkipAttach` disabled to new Kubernetes with `CSISkipAttach` enabled. -* Update from Kubernetes that has `CSISkipAttach` disabled to the same Kubernetes with `CSISkipAttach` enabled. -* Creation of CSIDriver instance with non-attachable CSI driver. - -In all cases listed above, an "attachable" CSI driver becomes non-attachable. Upgrade does not affect attachable CSI drivers, both "old" and "new" Kubernetes processes them in the same way. - -For non-attachable volumes, if the volume was attached by "old" Kubernetes (or "new" Kubernetes before CSIDriver instance was created), it has `VolumeAttachment` instance. It will be deleted by `Detach()`, as it deletes `VolumeAttachment` instance also for non-attachable volumes. - -## Downgrade -This chapter covers: -* Downgrade from new Kubernetes that has `CSISkipAttach` enabled to old Kubernetes with `CSISkipAttach disabled. -* Update from Kubernetes that has `CSISkipAttach` feature enabled to the same Kubernetes with `CSISkipAttach` disabled. -* Deletion of CSIDriver instance with non-attachable CSI driver. - -In all cases listed above, a non-attachable CSI driver becomes "attachable" (i.e. requires external attacher). Downgrade does not affect attachable CSI drivers, both "old" and "new" Kubernetes processes them in the same way. - -For non-attachable volumes, if the volume was mounted by "new" Kubernetes, it has no VolumeAttachment instance. "Old" A/D controller does not know about it. However, it will periodically call plugin's `VolumesAreAttached()` that checks for `VolumeAttachment` presence. Volumes without `VolumeAttachment` will be reported as not attached and A/D controller will call `Attach()` on these. Since "old" Kubernetes required an external attacher even for non-attachable CSI drivers, the external attacher will pick the `VolumeAttachment` instances and fulfil them in the usual way. - - -## Performance considerations - -* Flow suggested in this proposal adds new `CSIDriver` informer both to A/D controller and kubelet. We don't expect any high amount of instances of `CSIDriver` nor any high frequency of updates. `CSIDriver` should have negligible impact on performance. - -* A/D controller will not create `VolumeAttachment` instances for non-attachable volumes. Etcd load will be reduced. - -* On the other hand, all CSI volumes still must go though A/D controller. A/D controller **must** process every CSI volume and kubelet **must** wait until A/D controller marks a volume as attached, even if A/D controller basically does nothing. All CSI volumes must be added to `Node.Status.VolumesInUse` and `Node.Status.VolumesAttached`. 
This does not introduce any new API calls, all this is already implemented, however this proposal won't reduce `Node.Status` update frequency in any way. - * If *all* volumes move to CSI eventually, pod startup will be slower than when using in-tree volume plugins that don't go through A/D controller and `Node.Status` will grow in size. - -## Implementation - -Expected timeline: -* Alpha: 1.12 (behind feature gate `CSISkipAttach`) -* Beta: 1.13 (enabled by default) -* GA: 1.14 - -## Alternatives considered -A/D controller and kubelet can be easily extended to check if a given volume is attachable. This would make mounting of non-attachable volumes easier, as kubelet would not need to wait for A/D controller to mark the volume as attached. However, there would be issues when upgrading or downgrading Kubernetes (or marking CSIDriver as attachable or non-attachable, which has basically the same handling). -* On upgrade (i.e. a previously attachable CSI volume becomes non-attachable, e.g. when user creates CSIDriver instance while corresponding CSI driver is already running), A/D controller could discover that an attached volume is not attachable any longer. A/D controller could clean up `Node.Status.VolumesAttached`, but since A/D controller does not know anything about `VolumeAttachment`, we would either need to introduce a new volume plugin call to clean it up in CSI volume plugin, or something else would need to clean it. -* On downgrade (i.e. a previously non-attachable CSI volume becomes attachable, e.g. when user deletes CSIDriver instance or downgrades to old Kubernetes without this feature), kubelet must discover that already mounted volume has changed from non-attachable to attachable and put it into `Node.Status.VolumesInUse`. This would race with A/D controller detaching the volume when a pod was deleted at the same time a CSIDriver instance was made attachable. - -Passing all volumes through A/D controller saves us from these difficulties and even races. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
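The attach-skipping decision described in this proposal reduces to a small lookup in the CSI volume plugin. The sketch below assumes the `CSIDriver` object exposes an attachment-required flag (named `attachRequired` here only for illustration) and, as the proposal requires, treats a missing `CSIDriver` instance as "attachable" to preserve the old behavior.

```go
package csiplugin

// needsVolumeAttachment sketches the Attach()/Detach() gating: skip the
// VolumeAttachment round trip only when a CSIDriver object explicitly says the
// driver does not need attachment.
func needsVolumeAttachment(csiDrivers map[string]bool, // driverName -> attachRequired
	driverName string) bool {

	attachRequired, found := csiDrivers[driverName]
	if !found {
		// No CSIDriver instance (not created yet, CRD missing, informer lagging):
		// fall back to the original behavior and go through the external attacher.
		return true
	}
	return attachRequired
}
```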
\ No newline at end of file diff --git a/contributors/design-proposals/storage/container-storage-interface.md b/contributors/design-proposals/storage/container-storage-interface.md index 0337dd9e..f0fbec72 100644 --- a/contributors/design-proposals/storage/container-storage-interface.md +++ b/contributors/design-proposals/storage/container-storage-interface.md @@ -1,798 +1,6 @@ -# CSI Volume Plugins in Kubernetes Design Doc +Design proposals have been archived. -***Status:*** Pending +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -***Version:*** Alpha -***Author:*** Saad Ali ([@saad-ali](https://github.com/saad-ali), saadali@google.com) - -*This document was drafted [here](https://docs.google.com/document/d/10GDyPWbFE5tQunKMlTXbcWysUttMFhBFJRX8ntaS_4Y/edit?usp=sharing).* - -## Terminology - -Term | Definition ----|--- -Container Storage Interface (CSI) | A specification attempting to establish an industry standard interface that Container Orchestration Systems (COs) can use to expose arbitrary storage systems to their containerized workloads. -in-tree | Code that exists in the core Kubernetes repository. -out-of-tree | Code that exists somewhere outside the core Kubernetes repository. -CSI Volume Plugin | A new, in-tree volume plugin that acts as an adapter and enables out-of-tree, third-party CSI volume drivers to be used in Kubernetes. -CSI Volume Driver | An out-of-tree CSI compatible implementation of a volume plugin that can be used in Kubernetes through the Kubernetes CSI Volume Plugin. - - -## Background & Motivations - -Kubernetes volume plugins are currently “in-tree” meaning they are linked, compiled, built, and shipped with the core kubernetes binaries. Adding a new storage system to Kubernetes (a volume plugin) requires checking code into the core Kubernetes code repository. This is undesirable for many reasons including: - -1. Volume plugin development is tightly coupled and dependent on Kubernetes releases. -2. Kubernetes developers/community are responsible for testing and maintaining all volume plugins, instead of just testing and maintaining a stable plugin API. -3. Bugs in volume plugins can crash critical Kubernetes components, instead of just the plugin. -4. Volume plugins get full privileges of kubernetes components (kubelet and kube-controller-manager). -5. Plugin developers are forced to make plugin source code available, and can not choose to release just a binary. - -The existing [Flex Volume] plugin attempted to address this by exposing an exec based API for mount/unmount/attach/detach. Although it enables third party storage vendors to write drivers out-of-tree, it requires access to the root filesystem of node and master machines in order to deploy the third party driver files. - -Additionally, it doesn’t address another pain of in-tree volumes plugins: dependencies. Volume plugins tend to have many external requirements: dependencies on mount and filesystem tools, for example. These dependencies are assumed to be available on the underlying host OS, which often is not the case, and installing them requires direct machine access. There are efforts underway, for example https://github.com/kubernetes/community/pull/589, that are hoping to address this for in-tree volume plugins. But, enabling volume plugins to be completely containerized will make dependency management much easier. 
- -While Kubernetes has been dealing with these issues, the broader storage community has also been dealing with a fragmented story for how to make their storage system available in different Container Orchestration Systems (COs). Storage vendors have to either write and support multiple volume drivers for different COs or choose to not support some COs. - -The Container Storage Interface (CSI) is a specification that resulted from cooperation between community members from various COs--including Kubernetes, Mesos, Cloud Foundry, and Docker. The goal of this interface is to establish a standardized mechanism for COs to expose arbitrary storage systems to their containerized workloads. - -The primary motivation for Storage vendors to adopt the interface is a desire to make their system available to as many users as possible with as little work as possible. The primary motivation for COs to adopt the interface is to invest in a mechanism that will enable their users to use as many different storage systems as possible. In addition, for Kubernetes, adopting CSI will have the added benefit of moving volume plugins out of tree, and enabling volume plugins to be containerized. - -### Links - -* [Container Storage Interface (CSI) Spec](https://github.com/container-storage-interface/spec/blob/master/spec.md) - -## Objective - -The objective of this document is to document all the requirements for enabling a CSI compliant volume plugin (a CSI volume driver) in Kubernetes. - -## Goals - -* Define Kubernetes API for interacting with an arbitrary, third-party CSI volume drivers. -* Define mechanism by which Kubernetes master and node components will securely communicate with an arbitrary, third-party CSI volume drivers. -* Define mechanism by which Kubernetes master and node components will discover and register an arbitrary, third-party CSI volume driver deployed on Kubernetes. -* Recommend packaging requirements for Kubernetes compatible, third-party CSI Volume drivers. -* Recommend deployment process for Kubernetes compatible, third-party CSI Volume drivers on a Kubernetes cluster. - -## Non-Goals -* Replace [Flex Volume plugin] - * The Flex volume plugin exists as an exec based mechanism to create “out-of-tree” volume plugins. - * Because Flex drivers exist and depend on the Flex interface, it will continue to be supported with a stable API. - * The CSI Volume plugin will co-exist with Flex volume plugin. - -## Design Overview - -To support CSI Compliant Volume plugins, a new in-tree CSI Volume plugin will be introduced in Kubernetes. This new volume plugin will be the mechanism by which Kubernetes users (application developers and cluster admins) interact with external CSI volume drivers. - -The `SetUp`/`TearDown` calls for the new in-tree CSI volume plugin will directly invoke `NodePublishVolume` and `NodeUnpublishVolume` CSI RPCs through a unix domain socket on the node machine. - -Provision/delete and attach/detach must be handled by some external component that monitors the Kubernetes API on behalf of a CSI volume driver and invokes the appropriate CSI RPCs against it. - -To simplify integration, the Kubernetes team will offer a containers that captures all the Kubernetes specific logic and act as adapters between third-party containerized CSI volume drivers and Kubernetes (each deployment of a CSI driver would have it’s own instance of the adapter). 
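As a rough illustration of the kubelet-to-driver path in this overview, the sketch below dials a driver's unix domain socket and issues a `NodePublishVolume` call. It assumes the CSI 1.x Go bindings (`github.com/container-storage-interface/spec/lib/go/csi`) even though this proposal predates CSI 1.0, and the driver name, volume ID, and paths are placeholders, not values defined by the design.

```go
package main

import (
	"context"
	"fmt"
	"time"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
)

func main() {
	// Hypothetical socket for a driver named "com.example.csi"; the design only
	// fixes the /var/lib/kubelet/plugins/[SanitizedCSIDriverName]/csi.sock layout.
	target := "unix:///var/lib/kubelet/plugins/com.example.csi/csi.sock"

	// Recent grpc-go versions resolve the unix:// scheme directly; older ones
	// need a custom dialer. No transport credentials on a local domain socket.
	conn, err := grpc.Dial(target, grpc.WithInsecure())
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// SetUp maps to NodePublishVolume; TargetPath is the unique per-pod,
	// per-volume directory generated by Kubernetes.
	node := csi.NewNodeClient(conn)
	_, err = node.NodePublishVolume(ctx, &csi.NodePublishVolumeRequest{
		VolumeId:   "vol-123",
		TargetPath: "/var/lib/kubelet/pods/<pod-uid>/volumes/<...>/mount",
		VolumeCapability: &csi.VolumeCapability{
			AccessType: &csi.VolumeCapability_Mount{Mount: &csi.VolumeCapability_MountVolume{}},
			AccessMode: &csi.VolumeCapability_AccessMode{Mode: csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER},
		},
	})
	fmt.Println("NodePublishVolume error:", err)
}
```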
- -## Design Details - -### Third-Party CSI Volume Drivers - -Kubernetes is as minimally prescriptive on the packaging and deployment of a CSI Volume Driver as possible. Use of the *Communication Channels* (documented below) is the only requirement for enabling an arbitrary external CSI compatible storage driver in Kubernetes. - -This document recommends a standard mechanism for deploying an arbitrary containerized CSI driver on Kubernetes. This can be used by a Storage Provider to simplify deployment of containerized CSI compatible volume drivers on Kubernetes (see the “Recommended Mechanism for Deploying CSI Drivers on Kubernetes” section below). This mechanism, however, is strictly optional. - -### Communication Channels - -#### Kubelet to CSI Driver Communication - -Kubelet (responsible for mount and unmount) will communicate with an external “CSI volume driver” running on the same host machine (whether containerized or not) via a Unix Domain Socket. - -CSI volume drivers should create a socket at the following path on the node machine: `/var/lib/kubelet/plugins/[SanitizedCSIDriverName]/csi.sock`. For alpha, kubelet will assume this is the location for the Unix Domain Socket to talk to the CSI volume driver. For the beta implementation, we can consider using the [Device Plugin Unix Domain Socket Registration](/contributors/design-proposals/resource-management/device-plugin.md#unix-socket) mechanism to register the Unix Domain Socket with kubelet. This mechanism would need to be extended to support registration of both CSI volume drivers and device plugins independently. - -`Sanitized CSIDriverName` is CSI driver name that does not contain dangerous character and can be used as annotation name. It can follow the same pattern that we use for [volume plugins](https://git.k8s.io/utils/strings/escape.go#L28). Too long or too ugly driver names can be rejected, i.e. all components described in this document will report an error and won't talk to this CSI driver. Exact sanitization method is implementation detail (SHA in the worst case). - -Upon initialization of the external “CSI volume driver”, kubelet must call the CSI method `NodeGetInfo` to get the mapping from Kubernetes Node names to CSI driver NodeID and the associated `accessible_topology`. It must: - - * Create/update a `CSINodeInfo` object instance for the node with the NodeID and topology keys from `accessible_topology`. - * This will enable the component that will issue `ControllerPublishVolume` calls to use the `CSINodeInfo` as a mapping from cluster node ID to storage node ID. - * This will enable the component that will issue `CreateVolume` to reconstruct `accessible_topology` and provision a volume that is accesible from specific node. - * Each driver must completely overwrite its previous version of NodeID and topology keys, if they exist. - * If the `NodeGetInfo` call fails, kubelet must delete any previous NodeID and topology keys for this driver. - * When kubelet plugin unregistration mechanism is implemented, delete NodeID and topology keys when a driver is unregistered. - - * Update Node API object with the CSI driver NodeID as the `csi.volume.kubernetes.io/nodeid` annotation. The value of the annotation is a JSON blob, containing key/value pairs for each CSI driver. For example: - ``` - csi.volume.kubernetes.io/nodeid: "{ \"driver1\": \"name1\", \"driver2\": \"name2\" } - ``` - - *This annotation is deprecated and will be removed according to deprecation policy (1 year after deprecation). 
TODO mark deprecation date.* - * If the `NodeGetInfo` call fails, kubelet must delete any previous NodeID for this driver. - * When kubelet plugin unregistration mechanism is implemented, delete NodeID and topology keys when a driver is unregistered. - - * Create/update Node API object with `accessible_topology` as labels. - There are no hard restrictions on the label format, but for the format to be used by the recommended setup, please refer to [Topology Representation in Node Objects](#topology-representation-in-node-objects). - -To enable easy deployment of an external containerized CSI volume driver, the Kubernetes team will provide a sidecar "Kubernetes CSI Helper" container that can manage the unix domain socket registration and NodeId initialization. This is detailed in the “Suggested Mechanism for Deploying CSI Drivers on Kubernetes” section below. - -The new API object called `CSINodeInfo` will be defined as follows: - -```go -// CSINodeInfo holds information about status of all CSI drivers installed on a node. -type CSINodeInfo struct { - metav1.TypeMeta - // ObjectMeta.Name must be node name. - metav1.ObjectMeta - - // List of CSI drivers running on the node and their properties. - CSIDrivers []CSIDriverInfo -} - -// Information about one CSI driver installed on a node. -type CSIDriverInfo struct { - // CSI driver name. - Name string - - // ID of the node from the driver point of view. - NodeID string - - // Topology keys reported by the driver on the node. - TopologyKeys []string -} -``` - -A new object type `CSINodeInfo` is chosen instead of `Node.Status` field because Node is already big enough and there are issues with its size. `CSINodeInfo` is CRD installed by TODO (jsafrane) on cluster startup and defined in `kubernetes/kubernetes/pkg/apis/storage-csi/v1alpha1/types.go`, so k8s.io/client-go and k8s.io/api are generated automatically. All users of `CSINodeInfo` will tolerate if the CRD is not installed and retry anything they need to do with it with exponential backoff and proper error reporting. Especially kubelet is able to serve its usual duties when the CRD is missing. - -Each node must have zero or one `CSINodeInfo` instance. This is ensured by `CSINodeInfo.Name == Node.Name`. TODO: how to validate this? Each `CSINodeInfo` is "owned" by corresponding Node for garbage collection. - - -#### Master to CSI Driver Communication - -Because CSI volume driver code is considered untrusted, it might not be allowed to run on the master. Therefore, the Kube controller manager (responsible for create, delete, attach, and detach) can not communicate via a Unix Domain Socket with the “CSI volume driver” container. Instead, the Kube controller manager will communicate with the external “CSI volume driver” through the Kubernetes API. - -More specifically, some external component must watch the Kubernetes API on behalf of the external CSI volume driver and trigger the appropriate operations against it. This eliminates the problems of discovery and securing a channel between the kube-controller-manager and the CSI volume driver. - -To enable easy deployment of an external containerized CSI volume driver on Kubernetes, without making the driver Kubernetes aware, Kubernetes will provide a sidecar “Kubernetes to CSI” proxy container that will watch the Kubernetes API and trigger the appropriate operations against the “CSI volume driver” container. This is detailed in the “Suggested Mechanism for Deploying CSI Drivers on Kubernetes” section below. 
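Any component that issues `ControllerPublishVolume` needs the driver's NodeID for the target node. `CSINodeInfo` is the preferred source, but while the deprecated `csi.volume.kubernetes.io/nodeid` annotation described above is still populated, resolving the ID is just a JSON lookup; the helper below is a hypothetical sketch, not code from any sidecar.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// driverNodeID resolves the per-driver NodeID from the deprecated
// csi.volume.kubernetes.io/nodeid node annotation, whose value is a JSON
// object mapping driver name to NodeID.
func driverNodeID(annotationValue, driverName string) (string, error) {
	ids := map[string]string{}
	if err := json.Unmarshal([]byte(annotationValue), &ids); err != nil {
		return "", err
	}
	id, ok := ids[driverName]
	if !ok {
		return "", fmt.Errorf("no NodeID recorded for driver %q", driverName)
	}
	return id, nil
}

func main() {
	// Example value in the format shown earlier.
	val := `{"driver1": "name1", "driver2": "name2"}`
	id, err := driverNodeID(val, "driver1")
	fmt.Println(id, err)
}
```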
- -The external component watching the Kubernetes API on behalf of the external CSI volume driver must handle provisioning, deleting, attaching, and detaching. - -##### Provisioning and Deleting - -Provisioning and deletion operations are handled using the existing [external provisioner mechanism](https://github.com/kubernetes-incubator/external-storage/tree/master/docs), where the external component watching the Kubernetes API on behalf of the external CSI volume driver will act as an external provisioner. - -In short, to dynamically provision a new CSI volume, a cluster admin would create a `StorageClass` with the provisioner corresponding to the name of the external provisioner handling provisioning requests on behalf of the CSI volume driver. - -To provision a new CSI volume, an end user would create a `PersistentVolumeClaim` object referencing this `StorageClass`. The external provisioner will react to the creation of the PVC and issue the `CreateVolume` call against the CSI volume driver to provision the volume. The `CreateVolume` name will be auto-generated as it is for other dynamically provisioned volumes. The `CreateVolume` capacity will be taken from the `PersistentVolumeClaim` object. The `CreateVolume` parameters will be passed through from the `StorageClass` parameters (opaque to Kubernetes). - -If the `PersistentVolumeClaim` has the `volume.alpha.kubernetes.io/selected-node` annotation set (only added if delayed volume binding is enabled in the `StorageClass`), the provisioner will get relevant topology keys from the corresponding `CSINodeInfo` instance and the topology values from `Node` labels and use them to generate preferred topology in the `CreateVolume()` request. If the annotation is unset, preferred topology will not be specified (unless the PVC follows StatefulSet naming format, discussed later in this section). `AllowedTopologies` from the `StorageClass` is passed through as requisite topology. If `AllowedTopologies` is unspecified, the provisioner will pass in a set of aggregated topology values across the whole cluster as requisite topology. - -To perform this topology aggregation, the external provisioner will cache all existing Node objects. In order to prevent a compromised node from affecting the provisioning process, it will pick a single node as the source of truth for keys, instead of relying on keys stored in `CSINodeInfo` for each node object. For PVCs to be provisioned with late binding, the selected node is the source of truth; otherwise a random node is picked. The provisioner will then iterate through all cached nodes that contain a node ID from the driver, aggregating labels using those keys. Note that if topology keys are different across the cluster, only a subset of nodes matching the topology keys of the chosen node will be considered for provisioning. - -To generate preferred topology, the external provisioner will generate N segments for preferred topology in the `CreateVolume()` call, where N is the size of requisite topology. Multiple segments are included to support volumes that are available across multiple topological segments. The topology segment from the selected node will always be the first in preferred topology. All other segments are some reordering of remaining requisite topologies such that given a requisite topology (or any arbitrary reordering of it) and a selected node, the set of preferred topology is guaranteed to always be the same. 
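The requisite-topology aggregation described above can be sketched with simplified types: take the topology keys reported for the driver on the chosen "source of truth" node, then collect the distinct label combinations seen on cached nodes that carry all of those keys. This is an illustrative sketch, not the external-provisioner's actual implementation.

```go
package main

import "fmt"

// aggregateRequisiteTopology collects distinct topology segments across nodes,
// considering only nodes that expose every topology key of the chosen node.
func aggregateRequisiteTopology(topologyKeys []string, nodeLabels []map[string]string) []map[string]string {
	seen := map[string]bool{}
	var requisite []map[string]string
	for _, labels := range nodeLabels {
		segment := map[string]string{}
		complete := true
		for _, k := range topologyKeys {
			v, ok := labels[k]
			if !ok {
				complete = false // node does not match the chosen node's keys
				break
			}
			segment[k] = v
		}
		if !complete {
			continue
		}
		key := fmt.Sprint(segment) // maps print with sorted keys, so this deduplicates
		if !seen[key] {
			seen[key] = true
			requisite = append(requisite, segment)
		}
	}
	return requisite
}

func main() {
	keys := []string{"com.example.topology/zone"}
	nodes := []map[string]string{
		{"com.example.topology/zone": "a"},
		{"com.example.topology/zone": "b"},
		{"com.example.topology/zone": "a"},
	}
	fmt.Println(aggregateRequisiteTopology(keys, nodes))
}
```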
- -If immediate volume binding mode is set and the PVC follows StatefulSet naming format, then the provisioner will choose, as the first segment in preferred topology, a segment from requisite topology based on the PVC name that ensures an even spread of topology across the StatefulSet's volumes. The logic will be similar to the name hashing logic inside the GCE Persistent Disk provisioner. Other segments in preferred topology are ordered the same way as described above. This feature will be flag-gated in the external provisioner provided as part of the recommended deployment method. - -Once the operation completes successfully, the external provisioner creates a `PersistentVolume` object to represent the volume using the information returned in the `CreateVolume` response. The topology of the returned volume is translated to the `PersistentVolume` `NodeAffinity` field. The `PersistentVolume` object is then bound to the `PersistentVolumeClaim` and available for use. - -The format of topology key/value pairs is defined by the user and must match among the following locations: -* `Node` topology labels -* `PersistentVolume` `NodeAffinity` field -* `StorageClass` `AllowedTopologies` field -When a `StorageClass` has delayed volume binding enabled, the scheduler uses the topology information of a `Node` in the following ways: - 1. During dynamic provisioning, the scheduler selects a candidate node for the provisioner by comparing each `Node`'s topology with the `AllowedTopologies` in the `StorageClass`. - 1. During volume binding and pod scheduling, the scheduler selects a candidate node for the pod by comparing `Node` topology with `VolumeNodeAffinity` in `PersistentVolume`s. - -A more detailed description can be found in the [topology-aware volume scheduling design doc](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/volume-topology-scheduling.md). See [Topology Representation in Node Objects](#topology-representation-in-node-objects) for the format used by the recommended deployment approach. - -To delete a CSI volume, an end user would delete the corresponding `PersistentVolumeClaim` object. The external provisioner will react to the deletion of the PVC and based on its reclamation policy it will issue the `DeleteVolume` call against the CSI volume driver commands to delete the volume. It will then delete the `PersistentVolume` object. - -##### Attaching and Detaching - -Attach/detach operations must also be handled by an external component (an “attacher”). The attacher watches the Kubernetes API on behalf of the external CSI volume driver for new `VolumeAttachment` objects (defined below), and triggers the appropriate calls against the CSI volume driver to attach the volume. The attacher must watch for `VolumeAttachment` object and mark it as attached even if the underlying CSI driver does not support `ControllerPublishVolume` call, as Kubernetes has no knowledge about it. - -More specifically, an external “attacher” must watch the Kubernetes API on behalf of the external CSI volume driver to handle attach/detach requests. - -Once the following conditions are true, the external-attacher should call `ControllerPublishVolume` against the CSI volume driver to attach the volume to the specified node: - -1. A new `VolumeAttachment` Kubernetes API objects is created by Kubernetes attach/detach controller. -2. The `VolumeAttachment.Spec.Attacher` value in that object corresponds to the name of the external attacher. -3. 
The `VolumeAttachment.Status.Attached` value is not yet set to true. -4. * Either a Kubernetes Node API object exists with the name matching `VolumeAttachment.Spec.NodeName` and that object contains a `csi.volume.kubernetes.io/nodeid` annotation. This annotation contains a JSON blob, a list of key/value pairs, where one of they keys corresponds with the CSI volume driver name, and the value is the NodeID for that driver. This NodeId mapping can be retrieved and used in the `ControllerPublishVolume` calls. - * Or a `CSINodeInfo` API object exists with the name matching `VolumeAttachment.Spec.NodeName` and the object contains `CSIDriverInfo` for the CSI volume driver. The `CSIDriverInfo` contains NodeID for `ControllerPublishVolume` call. -5. The `VolumeAttachment.Metadata.DeletionTimestamp` is not set. - -Before starting the `ControllerPublishVolume` operation, the external-attacher should add these finalizers to these Kubernetes API objects: - -* To the `VolumeAttachment` so that when the object is deleted, the external-attacher has an opportunity to detach the volume first. External attacher removes this finalizer once the volume is fully detached from the node. -* To the `PersistentVolume` referenced by `VolumeAttachment` so the PV cannot be deleted while the volume is attached. External attacher needs information from the PV to perform detach operation. The attacher will remove the finalizer once all `VolumeAttachment` objects that refer to the PV are deleted, i.e. the volume is detached from all nodes. - -If the operation completes successfully, the external-attacher will: - -1. Set `VolumeAttachment.Status.Attached` field to true to indicate the volume is attached. -2. Update the `VolumeAttachment.Status.AttachmentMetadata` field with the contents of the returned `PublishVolumeInfo`. -3. Clear the `VolumeAttachment.Status.AttachError` field. - -If the operation fails, the external-attacher will: - -1. Ensure the `VolumeAttachment.Status.Attached` field to still false to indicate the volume is not attached. -2. Set the `VolumeAttachment.Status.AttachError` field detailing the error. -3. Create an event against the Kubernetes API associated with the `VolumeAttachment` object to inform users what went wrong. - -The external-attacher may implement it’s own error recovery strategy, and retry as long as conditions specified for attachment above are valid. It is strongly recommended that the external-attacher implement an exponential backoff strategy for retries. - -The detach operation will be triggered by the deletion of the `VolumeAttachment` Kubernetes API objects. Since the `VolumeAttachment` Kubernetes API object will have a finalizer added by the external-attacher, it will wait for confirmation from the external-attacher before deleting the object. - -Once all the following conditions are true, the external-attacher should call `ControllerUnpublishVolume` against the CSI volume driver to detach the volume from the specified node: -1. A `VolumeAttachment` Kubernetes API object is marked for deletion: the value for the `VolumeAttachment.metadata.deletionTimestamp` field is set. - -If the operation completes successfully, the external-attacher will: -1. Remove its finalizer from the list of finalizers on the `VolumeAttachment` object permitting the delete operation to continue. - -If the operation fails, the external-attacher will: - -1. Ensure the `VolumeAttachment.Status.Attached` field remains true to indicate the volume is not yet detached. -2. 
Set the `VolumeAttachment.Status.DetachError` field detailing the error. -3. Create an event against the Kubernetes API associated with the `VolumeAttachment` object to inform users what went wrong. - -The new API object called `VolumeAttachment` will be defined as follows: - -```GO - -// VolumeAttachment captures the intent to attach or detach the specified volume -// to/from the specified node. -// -// VolumeAttachment objects are non-namespaced. -type VolumeAttachment struct { - metav1.TypeMeta `json:",inline"` - - // Standard object metadata. - // More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata - // +optional - metav1.ObjectMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"` - - // Specification of the desired attach/detach volume behavior. - // Populated by the Kubernetes system. - Spec VolumeAttachmentSpec `json:"spec" protobuf:"bytes,2,opt,name=spec"` - - // Status of the VolumeAttachment request. - // Populated by the entity completing the attach or detach - // operation, i.e. the external-attacher. - // +optional - Status VolumeAttachmentStatus `json:"status,omitempty" protobuf:"bytes,3,opt,name=status"` -} - -// The specification of a VolumeAttachment request. -type VolumeAttachmentSpec struct { - // Attacher indicates the name of the volume driver that MUST handle this - // request. This is the name returned by GetPluginName() and must be the - // same as StorageClass.Provisioner. - Attacher string `json:"attacher" protobuf:"bytes,1,opt,name=attacher"` - - // AttachedVolumeSource represents the volume that should be attached. - VolumeSource AttachedVolumeSource `json:"volumeSource" protobuf:"bytes,2,opt,name=volumeSource"` - - // Kubernetes node name that the volume should be attached to. - NodeName string `json:"nodeName" protobuf:"bytes,3,opt,name=nodeName"` -} - -// VolumeAttachmentSource represents a volume that should be attached. -// Right now only PersistentVolumes can be attached via external attacher, -// in future we may allow also inline volumes in pods. -// Exactly one member can be set. -type AttachedVolumeSource struct { - // Name of the persistent volume to attach. - // +optional - PersistentVolumeName *string `json:"persistentVolumeName,omitempty" protobuf:"bytes,1,opt,name=persistentVolumeName"` - - // Placeholder for *VolumeSource to accommodate inline volumes in pods. -} - -// The status of a VolumeAttachment request. -type VolumeAttachmentStatus struct { - // Indicates the volume is successfully attached. - // This field must only be set by the entity completing the attach - // operation, i.e. the external-attacher. - Attached bool `json:"attached" protobuf:"varint,1,opt,name=attached"` - - // Upon successful attach, this field is populated with any - // information returned by the attach operation that must be passed - // into subsequent WaitForAttach or Mount calls. - // This field must only be set by the entity completing the attach - // operation, i.e. the external-attacher. - // +optional - AttachmentMetadata map[string]string `json:"attachmentMetadata,omitempty" protobuf:"bytes,2,rep,name=attachmentMetadata"` - - // The most recent error encountered during attach operation, if any. - // This field must only be set by the entity completing the attach - // operation, i.e. the external-attacher. 
- // +optional - AttachError *VolumeError `json:"attachError,omitempty" protobuf:"bytes,3,opt,name=attachError,casttype=VolumeError"` - - // The most recent error encountered during detach operation, if any. - // This field must only be set by the entity completing the detach - // operation, i.e. the external-attacher. - // +optional - DetachError *VolumeError `json:"detachError,omitempty" protobuf:"bytes,4,opt,name=detachError,casttype=VolumeError"` -} - -// Captures an error encountered during a volume operation. -type VolumeError struct { - // Time the error was encountered. - // +optional - Time metav1.Time `json:"time,omitempty" protobuf:"bytes,1,opt,name=time"` - - // String detailing the error encountered during Attach or Detach operation. - // This string may be logged, so it should not contain sensitive - // information. - // +optional - Message string `json:"message,omitempty" protobuf:"bytes,2,opt,name=message"` -} - -``` - -### Kubernetes In-Tree CSI Volume Plugin - -A new in-tree Kubernetes CSI Volume plugin will contain all the logic required for Kubernetes to communicate with an arbitrary, out-of-tree, third-party CSI compatible volume driver. - -The existing Kubernetes volume components (attach/detach controller, PVC/PV controller, Kubelet volume manager) will handle the lifecycle of the CSI volume plugin operations (everything from triggering volume provisioning/deleting, attaching/detaching, and mounting/unmounting) just as they do for existing in-tree volume plugins. - -#### Proposed API - -A new `CSIPersistentVolumeSource` object will be added to the Kubernetes API. It will be part of the existing `PersistentVolumeSource` object and thus can be used only via PersistentVolumes. CSI volumes will not be allow referencing directly from Pods without a `PersistentVolumeClaim`. - -```GO -type CSIPersistentVolumeSource struct { - // Driver is the name of the driver to use for this volume. - // Required. - Driver string `json:"driver" protobuf:"bytes,1,opt,name=driver"` - - // VolumeHandle is the unique volume name returned by the CSI volume - // plugin’s CreateVolume to refer to the volume on all subsequent calls. - VolumeHandle string `json:"volumeHandle" protobuf:"bytes,2,opt,name=volumeHandle"` - - // Optional: The value to pass to ControllerPublishVolumeRequest. - // Defaults to false (read/write). - // +optional - ReadOnly bool `json:"readOnly,omitempty" protobuf:"varint,5,opt,name=readOnly"` -} -``` - -#### Internal Interfaces - -The in-tree CSI volume plugin will implement the following internal Kubernetes volume interfaces: - -1. `VolumePlugin` - * Mounting/Unmounting of a volume to a specific path. -2. `AttachableVolumePlugin` - * Attach/detach of a volume to a given node. - -Notably, `ProvisionableVolumePlugin` and `DeletableVolumePlugin` are not implemented because provisioning and deleting for CSI volumes is handled by an external provisioner. - -#### Mount and Unmount - -The in-tree volume plugin’s SetUp and TearDown methods will trigger the `NodePublishVolume` and `NodeUnpublishVolume` CSI calls via Unix Domain Socket. Kubernetes will generate a unique `target_path` (unique per pod per volume) to pass via `NodePublishVolume` for the CSI plugin to mount the volume. Upon successful completion of the `NodeUnpublishVolume` call (once volume unmount has been verified), Kubernetes will delete the directory. - -The Kubernetes volume sub-system does not currently support block volumes (only file), so for alpha, the Kubernetes CSI volume plugin will only support file. 
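The design only requires that the generated `target_path` be unique per pod and per volume; the exact directory layout is an implementation detail. The sketch below shows one possible layout and is purely illustrative, not a path mandated by this proposal.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// targetPath derives a per-pod, per-volume publish path under the kubelet
// directory. The layout is hypothetical; only uniqueness matters to the design.
func targetPath(kubeletDir, podUID, sanitizedDriverName, volumeID string) string {
	return filepath.Join(kubeletDir, "pods", podUID, "volumes", sanitizedDriverName, volumeID, "mount")
}

func main() {
	fmt.Println(targetPath("/var/lib/kubelet", "pod-uid-1234", "com.example.csi", "vol-123"))
}
```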
- -#### Attaching and Detaching - -The attach/detach controller,running as part of the kube-controller-manager binary on the master, decides when a CSI volume must be attached or detached from a particular node. - -When the controller decides to attach a CSI volume, it will call the in-tree CSI volume plugin’s attach method. The in-tree CSI volume plugin’s attach method will do the following: - -1. Create a new `VolumeAttachment` object (defined in the “Communication Channels” section) to attach the volume. - * The name of the `VolumeAttachment` object will be `pv-<SHA256(PVName+NodeName)>`. - * `pv-` prefix is used to allow using other scheme(s) for inline volumes in the future, with their own prefix. - * SHA256 hash is to reduce length of `PVName` plus `NodeName` string, each of which could be max allowed name length (hexadecimal representation of SHA256 is 64 characters). - * `PVName` is `PV.name` of the attached PersistentVolume. - * `NodeName` is `Node.name` of the node where the volume should be attached to. - * If a `VolumeAttachment` object with the corresponding name already exists, the in-tree volume plugin will simply begin to poll it as defined below. The object is not modified; only the external-attacher should change the status fields; and the external-attacher is responsible for it’s own retry and error handling logic. -2. Poll the `VolumeAttachment` object waiting for one of the following conditions: - * The `VolumeAttachment.Status.Attached` field to become `true`. - * The operation completes successfully. - * An error to be set in the `VolumeAttachment.Status.AttachError` field. - * The operation terminates with the specified error. - * The operation to timeout. - * The operation terminates with timeout error. - * The `VolumeAttachment.DeletionTimestamp` is set. - * The operation terminates with an error indicating a detach operation is in progress. - * The `VolumeAttachment.Status.Attached` value must not be trusted. The attach/detach controller has to wait until the object is deleted by the external-attacher before creating a new instance of the object. - -When the controller decides to detach a CSI volume, it will call the in-tree CSI volume plugin’s detach method. The in-tree CSI volume plugin’s detach method will do the following: - -1. Delete the corresponding `VolumeAttachment` object (defined in the “Communication Channels” section) to indicate the volume should be detached. -2. Poll the `VolumeAttachment` object waiting for one of the following conditions: - * The `VolumeAttachment.Status.Attached` field to become false. - * The operation completes successfully. - * An error to be set in the `VolumeAttachment.Status.DetachError` field. - * The operation terminates with the specified error. - * The object to no longer exists. - * The operation completes successfully. - * The operation to timeout. - * The operation terminates with timeout error. - -### Recommended Mechanism for Deploying CSI Drivers on Kubernetes - -Although Kubernetes does not dictate the packaging for a CSI volume driver, it offers the following recommendations to simplify deployment of a containerized CSI volume driver on Kubernetes. - - - -To deploy a containerized third-party CSI volume driver, it is recommended that storage vendors: - - * Create a “CSI volume driver” container that implements the volume plugin behavior and exposes a gRPC interface via a unix domain socket, as defined in the CSI spec (including Controller, Node, and Identity services). 
- * Bundle the “CSI volume driver” container with helper containers (external-attacher, external-provisioner, node-driver-registrar, cluster-driver-registrar, external-resizer, external-snapshotter, livenessprobe) that the Kubernetes team will provide (these helper containers will assist the “CSI volume driver” container in interacting with the Kubernetes system). More specifically, create the following Kubernetes objects: - * To facilitate communication with the Kubernetes controllers, a `StatefulSet` or a `Deployment` (depending on the user's need; see [Cluster-Level Deployment](#cluster-level-deployment)) that has: - * The following containers - * The “CSI volume driver” container created by the storage vendor. - * Containers provided by the Kubernetes team (all of which are optional): - * `cluster-driver-registrar` (refer to the README in `cluster-driver-registrar` repository for when the container is required) - * `external-provisioner` (required for provision/delete operations) - * `external-attacher` (required for attach/detach operations. If you wish to skip the attach step, CSISkipAttach feature must be enabled in Kubernetes in addition to omitting this container) - * `external-resizer` (required for resize operations) - * `external-snapshotter` (required for volume-level snapshot operations) - * `livenessprobe` - * The following volumes: - * `emptyDir` volume - * Mounted by all containers, including the “CSI volume driver”. - * The “CSI volume driver” container should create its Unix Domain Socket in this directory to enable communication with the Kubernetes helper container(s). - * A `DaemonSet` (to facilitate communication with every instance of kubelet) that has: - * The following containers - * The “CSI volume driver” container created by the storage vendor. - * Containers provided by the Kubernetes team: - * `node-driver-registrar` - Responsible for registering the unix domain socket with kubelet. - * `livenessprobe` (optional) - * The following volumes: - * `hostpath` volume - * Expose `/var/lib/kubelet/plugins_registry` from the host. - * Mount only in `node-driver-registrar` container at `/registration` - * `node-driver-registrar` will use this unix domain socket to register the CSI driver’s unix domain socket with kubelet. - * `hostpath` volume - * Expose `/var/lib/kubelet/` from the host. - * Mount only in “CSI volume driver” container at `/var/lib/kubelet/` - * Ensure [bi-directional mount propagation](https://kubernetes.io/docs/concepts/storage/volumes/#mount-propagation) is enabled, so that any mounts setup inside this container are propagated back to the host machine. - * `hostpath` volume - * Expose `/var/lib/kubelet/plugins/[SanitizedCSIDriverName]/` from the host as `hostPath.type = "DirectoryOrCreate"`. - * Mount inside “CSI volume driver” container at the path the CSI gRPC socket will be created. - * This is the primary means of communication between Kubelet and the “CSI volume driver” container (gRPC over UDS). - * Have cluster admins deploy the above `StatefulSet` and `DaemonSet` to add support for the storage system in their Kubernetes cluster. - -Alternatively, deployment could be simplified by having all components (including external-provisioner and external-attacher) in the same pod (DaemonSet). Doing so, however, would consume more resources, and require a leader election protocol (likely https://git.k8s.io/contrib/election) in the `external-provisioner` and `external-attacher` components. 
- -Containers provided by Kubernetes are maintained in [GitHub kubernetes-csi organization](https://github.com/kubernetes-csi). - -#### Cluster-Level Deployment -Containers in the cluster-level deployment may be deployed in one of the following configurations: - -1. StatefulSet with single replica. Good for clusters with a single dedicated node to run the cluster-level pod. A StatefulSet guarantees that no more than 1 instance of the pod will be running at once. One downside is that if the node becomes unresponsive, the replica will never be deleted and recreated. -1. Deployment with multiple replicas and leader election enabled (if supported by the container). Good for admins who prefer faster recovery time in case the main replica fails, at a cost of higher resource usage (especially memory). -1. Deployment with a single replica and leader election enabled (if supported by the container). A compromise between the above two options. If the replica is detected to be failed, a new replica can be scheduled almost immediately. - -Note that certain cluster-level containers, such as `external-provisioner`, `external-attacher`, `external-resizer`, and `external-snapshotter`, may require credentials to the storage backend, and as such, admins may choose to run them on dedicated "infrastructure" nodes (such as master nodes) that don't run user pods. - -#### Topology Representation in Node Objects -Topology information will be represented as labels. - -Requirements: -* Must adhere to the [label format](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#syntax-and-character-set). -* Must support different drivers on the same node. -* The format of each key/value pair must match those in `PersistentVolume` and `StorageClass` objects, as described in the [Provisioning and Deleting](#provisioning-and-deleting) section. - -Proposal: `"com.example.topology/rack": "rack1"` -The list of topology keys known to the driver is stored separately in the `CSINodeInfo` object. - -Justifications: -* No strange separators needed, comparing to the alternative. Cleaner format. -* The same topology key could be used across different components (different storage plugin, network plugin, etc.) -* Once NodeRestriction is moved to the newer model (see [here](https://github.com/kubernetes/community/pull/911) for context), for each new label prefix introduced in a new driver, the cluster admin has to configure NodeRestrictions to allow the driver to update labels with the prefix. Cluster installations could include certain prefixes for pre-installed drivers by default. This is less convenient compared to the alternative, which can allow editing of all CSI drivers by default using the “csi.kubernetes.io” prefix, but often times cluster admins have to whitelist those prefixes anyway (for example ‘cloud.google.com’) - -Considerations: -* Upon driver deletion/upgrade/downgrade, stale labels will be left untouched. It’s difficult for the driver to decide whether other components outside CSI rely on this label. -* During driver installation/upgrade/downgrade, controller deployment must be brought down before node deployment, and node deployment must be deployed before the controller deployment, because provisioning relies on up-to-date node information. One possible issue is if only topology values change while keys remain the same, and if AllowedTopologies is not specified, requisite topology will contain both old and new topology values, and CSI driver may fail the CreateVolume() call. 
Given that CSI driver should be backward compatible, this is more of an issue when a node rolling upgrade happens before the controller update. It's not an issue if keys are changed as well since requisite and preferred topology generation handles it appropriately. -* During driver installation/upgrade/downgrade, if a version of the controller (either old or new) is running while there is an ongoing rolling upgrade with the node deployment, and the new version of the CSI driver reports different topology information, nodes in the cluster may have different versions of topology information. However, this doesn't pose an issue. If AllowedTopologies is specified, a subset of nodes matching the version of topology information in AllowedTopologies will be used as provisioning candidate. If AllowedTopologies is not specified, a single node is used as the source of truth for keys -* Topology keys inside `CSINodeInfo` must reflect the topology keys from drivers currently installed on the node. If no driver is installed, the collection must be empty. However, due to the possible race condition between kubelet (the writer) and the external provisioner (the reader), the provisioner must gracefully handle the case where `CSINodeInfo` is not up-to-date. In the current design, the provisioner will erroneously provision a volume on a node where it's inaccessible. - -Alternative: -1. `"csi.kubernetes.io/topology.example.com_rack": "rack1"` - -#### Topology Representation in PersistentVolume Objects -There exists multiple ways to represent a single topology as NodeAffinity. For example, suppose a `CreateVolumeResponse` contains the following accessible topology: - -```yaml -- zone: "a" - rack: "1" -- zone: "b" - rack: "1" -- zone: "b" - rack: "2" -``` - -There are at least 3 ways to represent this in NodeAffinity (excluding `nodeAffinity`, `required`, and `nodeSelectorTerms` for simplicity): - -Form 1 - `values` contain exactly 1 element. -```yaml -- matchExpressions: - - key: zone - operator: In - values: - - "a" - - key: rack - operator: In - values: - - "1" -- matchExpressions: - - key: zone - operator: In - values: - - "b" - - key: rack - operator: In - values: - - "1" -- matchExpressions: - - key: zone - operator: In - values: - - "b" - - key: rack - operator: In - values: - - "2" -``` - -Form 2 - Reduced by `rack`. -```yaml -- matchExpressions: - - key: zone - operator: In - values: - - "a" - - "b" - - key: rack - operator: In - values: - - "1" -- matchExpressions: - - key: zone - operator: In - values: - - "b" - - key: rack - operator: In - values: - - "2" -``` -Form 3 - Reduced by `zone`. -```yaml -- matchExpressions: - - key: zone - operator: In - values: - - "a" - - key: rack - operator: In - values: - - "1" -- matchExpressions: - - key: zone - operator: In - values: - - "b" - - key: rack - operator: In - values: - - "1" - - "2" -``` -The provisioner will always choose Form 1, i.e. all `values` will have at most 1 element. Reduction logic could be added in future versions to arbitrarily choose a valid and simpler form like Forms 2 & 3. - -#### Upgrade & Downgrade Considerations -When drivers are uninstalled, topology information stored in Node labels remain untouched. The recommended label format allows multiple sources (such as CSI, networking resources, etc.) to share the same label key, so it's nontrivial to accurately determine whether a label is still used. 
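For completeness, the "Form 1" node affinity shown earlier is mechanical to produce from CSI accessible-topology segments: one `NodeSelectorTerm` per segment, each requirement carrying exactly one value. The sketch below assumes the `k8s.io/api/core/v1` selector types and is not the external-provisioner's actual code.

```go
package main

import (
	"fmt"
	"sort"

	corev1 "k8s.io/api/core/v1"
)

// form1Terms converts topology segments into Form 1 node selector terms:
// one term per segment, single-valued In requirements.
func form1Terms(segments []map[string]string) []corev1.NodeSelectorTerm {
	var terms []corev1.NodeSelectorTerm
	for _, segment := range segments {
		keys := make([]string, 0, len(segment))
		for k := range segment {
			keys = append(keys, k)
		}
		sort.Strings(keys) // stable ordering for readability
		var reqs []corev1.NodeSelectorRequirement
		for _, k := range keys {
			reqs = append(reqs, corev1.NodeSelectorRequirement{
				Key:      k,
				Operator: corev1.NodeSelectorOpIn,
				Values:   []string{segment[k]},
			})
		}
		terms = append(terms, corev1.NodeSelectorTerm{MatchExpressions: reqs})
	}
	return terms
}

func main() {
	segments := []map[string]string{
		{"zone": "a", "rack": "1"},
		{"zone": "b", "rack": "1"},
		{"zone": "b", "rack": "2"},
	}
	fmt.Printf("%+v\n", form1Terms(segments))
}
```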
- -In order to upgrade drivers using the recommended driver deployment mechanism, the user is recommended to tear down the StatefulSet (controller components) before the DaemonSet (node components), and deploy the DaemonSet before the StatefulSet. There may be design improvements to eliminate this constraint, but it will be evaluated at a later iteration. - -### Example Walkthrough - -#### Provisioning Volumes - -1. A cluster admin creates a `StorageClass` pointing to the CSI driver’s external-provisioner and specifying any parameters required by the driver. -2. A user creates a `PersistentVolumeClaim` referring to the new `StorageClass`. -3. The persistent volume controller realizes that dynamic provisioning is needed, and marks the PVC with a `volume.beta.kubernetes.io/storage-provisioner` annotation. -4. The external-provisioner for the CSI driver sees the `PersistentVolumeClaim` with the `volume.beta.kubernetes.io/storage-provisioner` annotation so it starts dynamic volume provisioning: - 1. It dereferences the `StorageClass` to collect the opaque parameters to use for provisioning. - 2. It calls `CreateVolume` against the CSI driver container with parameters from the `StorageClass` and `PersistentVolumeClaim` objects. -5. Once the volume is successfully created, the external-provisioner creates a `PersistentVolume` object to represent the newly created volume and binds it to the `PersistentVolumeClaim`. - -#### Deleting Volumes - -1. A user deletes a `PersistentVolumeClaim` object bound to a CSI volume. -2. The external-provisioner for the CSI driver sees the `PersistentVolumeClaim` was deleted and triggers the retention policy: - 1. If the retention policy is `delete` - 1. The external-provisioner triggers volume deletion by issuing a `DeleteVolume` call against the CSI volume plugin container. - 2. Once the volume is successfully deleted, the external-provisioner deletes the corresponding `PersistentVolume` object. - 2. If the retention policy is `retain` - 1. The external-provisioner does not delete the `PersistentVolume` object. - -#### Attaching Volumes - -1. The Kubernetes attach/detach controller, running as part of the `kube-controller-manager` binary on the master, sees that a pod referencing a CSI volume plugin is scheduled to a node, so it calls the in-tree CSI volume plugin’s attach method. -2. The in-tree volume plugin creates a new `VolumeAttachment` object in the kubernetes API and waits for its status to change to completed or error. -3. The external-attacher sees the `VolumeAttachment` object and triggers a `ControllerPublish` against the CSI volume driver container to fulfil it (meaning the external-attacher container issues a gRPC call via underlying UNIX domain socket to the CSI driver container). -4. Upon successful completion of the `ControllerPublish` call the external-attacher updates the status of the `VolumeAttachment` object to indicate the volume is successfully attached. -5. The in-tree volume plugin watching the status of the `VolumeAttachment` object in the Kubernetes API, sees the `Attached` field set to true indicating the volume is attached, so it updates the attach/detach controller’s internal state to indicate the volume is attached. - -#### Detaching Volumes - -1. The Kubernetes attach/detach controller, running as part of the `kube-controller-manager` binary on the master, sees that a pod referencing an attached CSI volume plugin is terminated or deleted, so it calls the in-tree CSI volume plugin’s detach method. -2. 
The in-tree volume plugin deletes the corresponding `VolumeAttachment` object. -3. The external-attacher sees a `deletionTimestamp` set on the `VolumeAttachment` object and triggers a `ControllerUnpublish` against the CSI volume driver container to detach it. -4. Upon successful completion of the `ControllerUnpublish` call, the external-attacher removes the finalizer from the `VolumeAttachment` object to indicate successful completion of the detach operation allowing the `VolumeAttachment` object to be deleted. -5. The in-tree volume plugin waiting for the `VolumeAttachment` object sees it deleted and assumes the volume was successfully detached, so It updates the attach/detach controller’s internal state to indicate the volume is detached. - -#### Mounting Volumes - -1. The volume manager component of kubelet notices a new volume, referencing a CSI volume, has been scheduled to the node, so it calls the in-tree CSI volume plugin’s `WaitForAttach` method. -2. The in-tree volume plugin’s `WaitForAttach` method watches the `Attached` field of the `VolumeAttachment` object in the kubernetes API to become `true`, it then returns without error. -3. Kubelet then calls the in-tree CSI volume plugin’s `MountDevice` method which is a no-op and returns immediately. -4. Finally kubelet calls the in-tree CSI volume plugin’s mount (setup) method, which causes the in-tree volume plugin to issue a `NodePublishVolume` call via the registered unix domain socket to the local CSI driver. -5. Upon successful completion of the `NodePublishVolume` call the specified path is mounted into the pod container. - -#### Unmounting Volumes -1. The volume manager component of kubelet, notices a mounted CSI volume, referenced by a pod that has been deleted or terminated, so it calls the in-tree CSI volume plugin’s `UnmountDevice` method which is a no-op and returns immediately. -2. Next kubelet calls the in-tree CSI volume plugin’s unmount (teardown) method, which causes the in-tree volume plugin to issue a `NodeUnpublishVolume` call via the registered unix domain socket to the local CSI driver. If this call fails from any reason, kubelet re-tries the call periodically. -3. Upon successful completion of the `NodeUnpublishVolume` call the specified path is unmounted from the pod container. - - -### CSI Credentials - -CSI allows specifying credentials in CreateVolume/DeleteVolume, ControllerPublishVolume/ControllerUnpublishVolume, NodeStageVolume/NodeUnstageVolume, and NodePublishVolume/NodeUnpublishVolume operations. - -Kubernetes will enable cluster admins and users deploying workloads on the cluster to specify these credentials by referencing Kubernetes secret object(s). Kubernetes (either the core components or helper containers) will fetch the secret(s) and pass them to the CSI volume plugin. - -If a secret object contains more than one secret, all secrets are passed. - -#### Secret to CSI Credential Encoding - -CSI accepts credentials for all the operations specified above as a map of string to string (e.g. `map<string, string> controller_create_credentials`). - -Kubernetes, however, defines secrets as a map of string to byte-array (e.g. `Data map[string][]byte`). It also allows specifying text secret data in string form via a write-only convenience field `StringData` which is a map of string to string. 
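A minimal sketch of the bytes-to-string conversion this section describes, assuming the secret values are UTF-8 text (or use a driver-documented text encoding such as base64); the helper name is illustrative, not an actual Kubernetes or sidecar function:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// secretToCSICredentials converts a Kubernetes secret's Data (map[string][]byte)
// into the map<string, string> shape CSI credentials fields expect.
func secretToCSICredentials(secret *v1.Secret) map[string]string {
	creds := make(map[string]string, len(secret.Data))
	for key, value := range secret.Data {
		// Kubernetes stores secret values as []byte; CSI expects strings.
		// The cast assumes valid UTF-8 (or a text encoding the driver documents).
		creds[key] = string(value)
	}
	return creds
}

func main() {
	s := &v1.Secret{Data: map[string][]byte{"password": []byte("s3cr3t")}}
	fmt.Println(secretToCSICredentials(s)["password"])
}
```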
- -Therefore, before passing secret data to CSI, Kubernetes (either the core components or helper containers) will convert the secret data from bytes to string (Kubernetes does not specify the character encoding, but Kubernetes internally uses golang to cast from string to byte and vice versa which assumes UTF-8 character set). - -Although CSI only accepts string data, a plugin MAY dictate in its documentation that a specific secret contain binary data and specify a binary-to-text encoding to use (base64, quoted-printable, etc.) to encode the binary data and allow it to be passed in as a string. It is the responsibility of the entity (cluster admin, user, etc.) that creates the secret to ensure its content is what the plugin expects and is encoded in the format the plugin expects. - -#### CreateVolume/DeleteVolume Credentials - -The CSI CreateVolume/DeleteVolume calls are responsible for creating and deleting volumes. -These calls are executed by the CSI external-provisioner. -Credentials for these calls will be specified in the Kubernetes `StorageClass` object. - -```yaml -kind: StorageClass -apiVersion: storage.k8s.io/v1 -metadata: - name: fast-storage -provisioner: com.example.team.csi-driver -parameters: - type: pd-ssd - csiProvisionerSecretName: mysecret - csiProvisionerSecretNamespace: mynamespaace -``` - -The CSI external-provisioner will reserve the parameter keys `csiProvisionerSecretName` and `csiProvisionerSecretNamespace`. If specified, the CSI Provisioner will fetch the secret `csiProvisionerSecretName` in the Kubernetes namespace `csiProvisionerSecretNamespace` and pass it to: -1. The CSI `CreateVolumeRequest` in the `controller_create_credentials` field. -2. The CSI `DeleteVolumeRequest` in the `controller_delete_credentials` field. - -See "Secret to CSI Credential Encoding" section above for details on how secrets will be mapped to CSI credentials. - -It is assumed that since `StorageClass` is a non-namespaced field, only trusted users (e.g. cluster administrators) should be able to create a `StorageClass` and, thus, specify which secret to fetch. - -The only Kubernetes component that needs access to this secret is the CSI external-provisioner, which would fetch this secret. The permissions for the external-provisioner may be limited to the specified (external-provisioner specific) namespace to prevent a compromised provisioner from gaining access to other secrets. - -#### ControllerPublishVolume/ControllerUnpublishVolume Credentials - -The CSI ControllerPublishVolume/ControllerUnpublishVolume calls are responsible for attaching and detaching volumes. -These calls are executed by the CSI external-attacher. -Credentials for these calls will be specified in the Kubernetes `CSIPersistentVolumeSource` object. - -```go -type CSIPersistentVolumeSource struct { - - // ControllerPublishSecretRef is a reference to the secret object containing - // sensitive information to pass to the CSI driver to complete the CSI - // ControllerPublishVolume and ControllerUnpublishVolume calls. - // This secret will be fetched by the external-attacher. - // This field is optional, and may be empty if no secret is required. If the - // secret object contains more than one secret, all secrets are passed. - // +optional - ControllerPublishSecretRef *SecretReference -} -``` - -If specified, the CSI external-attacher will fetch the Kubernetes secret referenced by `ControllerPublishSecretRef` and pass it to: -1. The CSI `ControllerPublishVolume` in the `controller_publish_credentials` field. -2. 
The CSI `ControllerUnpublishVolume` in the `controller_unpublish_credentials` field. - -See "Secret to CSI Credential Encoding" section above for details on how secrets will be mapped to CSI credentials. - -It is assumed that since `PersistentVolume` objects are non-namespaced and `CSIPersistentVolumeSource` can only be referenced via a `PersistentVolume`, only trusted users (e.g. cluster administrators) should be able to create a `PersistentVolume` objects and, thus, specify which secret to fetch. - -The only Kubernetes component that needs access to this secret is the CSI external-attacher, which would fetch this secret. The permissions for the external-attacher may be limited to the specified (external-attacher specific) namespace to prevent a compromised attacher from gaining access to other secrets. - -#### NodeStageVolume/NodeUnstageVolume Credentials - -The CSI NodeStageVolume/NodeUnstageVolume calls are responsible for mounting (setup) and unmounting (teardown) volumes. -These calls are executed by the Kubernetes node agent (kubelet). -Credentials for these calls will be specified in the Kubernetes `CSIPersistentVolumeSource` object. - -```go -type CSIPersistentVolumeSource struct { - - // NodeStageSecretRef is a reference to the secret object containing sensitive - // information to pass to the CSI driver to complete the CSI NodeStageVolume - // and NodeStageVolume and NodeUnstageVolume calls. - // This secret will be fetched by the kubelet. - // This field is optional, and may be empty if no secret is required. If the - // secret object contains more than one secret, all secrets are passed. - // +optional - NodeStageSecretRef *SecretReference -} -``` - -If specified, the kubelet will fetch the Kubernetes secret referenced by `NodeStageSecretRef` and pass it to: -1. The CSI `NodeStageVolume` in the `node_stage_credentials` field. -2. The CSI `NodeUnstageVolume` in the `node_unstage_credentials` field. - -See "Secret to CSI Credential Encoding" section above for details on how secrets will be mapped to CSI credentials. - -It is assumed that since `PersistentVolume` objects are non-namespaced and `CSIPersistentVolumeSource` can only be referenced via a `PersistentVolume`, only trusted users (e.g. cluster administrators) should be able to create a `PersistentVolume` objects and, thus, specify which secret to fetch. - -The only Kubernetes component that needs access to this secret is the kubelet, which would fetch this secret. The permissions for the kubelet may be limited to the specified (kubelet specific) namespace to prevent a compromised attacher from gaining access to other secrets. - -The Kubernetes API server's node authorizer must be updated to allow kubelet to access the secrets referenced by `CSIPersistentVolumeSource.NodeStageSecretRef`. - -#### NodePublishVolume/NodeUnpublishVolume Credentials - -The CSI NodePublishVolume/NodeUnpublishVolume calls are responsible for mounting (setup) and unmounting (teardown) volumes. -These calls are executed by the Kubernetes node agent (kubelet). -Credentials for these calls will be specified in the Kubernetes `CSIPersistentVolumeSource` object. - -```go -type CSIPersistentVolumeSource struct { - - // NodePublishSecretRef is a reference to the secret object containing - // sensitive information to pass to the CSI driver to complete the CSI - // NodePublishVolume and NodeUnpublishVolume calls. - // This secret will be fetched by the kubelet. - // This field is optional, and may be empty if no secret is required. 
If the - // secret object contains more than one secret, all secrets are passed. - // +optional - NodePublishSecretRef *SecretReference -} -``` - -If specified, the kubelet will fetch the Kubernetes secret referenced by `NodePublishSecretRef` and pass it to: -1. The CSI `NodePublishVolume` in the `node_publish_credentials` field. -2. The CSI `NodeUnpublishVolume` in the `node_unpublish_credentials` field. - -See "Secret to CSI Credential Encoding" section above for details on how secrets will be mapped to CSI credentials. - -It is assumed that since `PersistentVolume` objects are non-namespaced and `CSIPersistentVolumeSource` can only be referenced via a `PersistentVolume`, only trusted users (e.g. cluster administrators) should be able to create a `PersistentVolume` objects and, thus, specify which secret to fetch. - -The only Kubernetes component that needs access to this secret is the kubelet, which would fetch this secret. The permissions for the kubelet may be limited to the specified (kubelet specific) namespace to prevent a compromised attacher from gaining access to other secrets. - -The Kubernetes API server's node authorizer must be updated to allow kubelet to access the secrets referenced by `CSIPersistentVolumeSource.NodePublishSecretRef`. - -## Alternatives Considered - -### Extending PersistentVolume Object - -Instead of creating a new `VolumeAttachment` object, another option we considered was extending the existing `PersistentVolume` object. - -`PersistentVolumeSpec` would be extended to include: -* List of nodes to attach the volume to (initially empty). - -`PersistentVolumeStatus` would be extended to include: -* List of nodes the volume was successfully attached to. - -We dismissed this approach because having attach/detach triggered by the creation/deletion of an object is much easier to manage (for both external-attacher and Kubernetes) and more robust (fewer corner cases to worry about). - - -[Flex Volume]: /contributors/devel/sig-storage/flexvolume.md -[Flex Volume plugin]: /contributors/devel/sig-storage/flexvolume.md +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/storage/container-storage-interface_diagram1.png b/contributors/design-proposals/storage/container-storage-interface_diagram1.png Binary files differdeleted file mode 100644 index 93230659..00000000 --- a/contributors/design-proposals/storage/container-storage-interface_diagram1.png +++ /dev/null diff --git a/contributors/design-proposals/storage/containerized-mounter-pod.md b/contributors/design-proposals/storage/containerized-mounter-pod.md index 00eb3884..f0fbec72 100644 --- a/contributors/design-proposals/storage/containerized-mounter-pod.md +++ b/contributors/design-proposals/storage/containerized-mounter-pod.md @@ -1,149 +1,6 @@ -# Containerized mounter using volume utilities in pods +Design proposals have been archived. -## Goal -Kubernetes should be able to run all utilities that are needed to provision/attach/mount/unmount/detach/delete volumes in *pods* instead of running them on *the host*. The host can be a minimal Linux distribution without tools to create e.g. Ceph RBD or mount GlusterFS volumes. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Secondary objectives -These are not requirements per se, just things to consider before drawing the final design. -* CNCF designs Container Storage Interface (CSI). So far, this CSI expects that "volume plugins" on each host are long-running processes with a fixed gRPC API. We should aim the same direction, hoping to switch to CSI when it's ready. In other words, there should be one long-running container for a volume plugin that serves all volumes of given type on a host. -* We should try to avoid complicated configuration. The system should work out of the box or with very limited configuration. -## Terminology - -**Mount utilities** for a volume plugin are all tools that are necessary to use a volume plugin. This includes not only utilities needed to *mount* the filesystem (e.g. `mount.glusterfs` for Gluster), but also utilities needed to attach, detach, provision or delete the volume, such as `/usr/bin/rbd` for Ceph RBD. - -## User story -Admin wants to run Kubernetes on a distro that does not ship `mount.glusterfs` that's needed for GlusterFS volumes. -1. Admin installs and runs Kubernetes in any way. -1. Admin deploys a DaemonSet that runs a pod with `mount.glusterfs` on each node. In future, this could be done by installer. -1. User creates a pod that uses a GlusterFS volume. Kubelet finds a pod with mount utilities on the node and uses it to mount the volume instead of expecting that `mount.glusterfs` is available on the host. - -- User does not need to configure anything and sees the pod Running as usual. -- Admin just needs to deploy the DaemonSet. -- It's quite hard to update the DaemonSet, see below. - -## Alternatives -### Sidecar containers -We considered this user story: -* Admin installs Kubernetes. -* Admin configures Kubernetes to use sidecar container with template XXX for glusterfs mount/unmount operations and pod with template YYY for glusterfs provision/attach/detach/delete operations. These templates would be yaml files stored somewhere. -* User creates a pod that uses a GlusterFS volume. Kubelet find a sidecar template for gluster, injects it into the pod and runs it before any mount operation. It then uses `docker exec mount <what> <where>` to mount Gluster volumes for the pod. After that, it starts init containers and the "real" pod containers. 
-* User deletes the pod. Kubelet kills all "real" containers in the pod and uses the sidecar container to unmount gluster volumes. Finally, it kills the sidecar container. - --> User does not need to configure anything and sees the pod Running as usual. --> Admin needs to set up the templates. - -Similarly, when attaching/detaching a volume, attach/detach controller would spawn a pod on a random node and the controller would then use `kubectl exec <the pod> <any attach/detach utility>` to attach/detach the volume. E.g. Ceph RBD volume plugin needs to execute things during attach/detach. After the volume is attached, the controller would kill the pod. - -Advantages: -* It's probably easier to update the templates than update the DaemonSet. - -Drawbacks: -* Admin needs to store the templates somewhere. Where? -* Short-living processes instead of long-running ones that would mimic CSI (so we could catch bugs early or even redesign CSI). -* Needs some refactoring in kubelet - now kubelet mounts everything and then starts containers. We would need kubelet to start some container(s) first, then mount, then run the rest. This is probably possible, but needs better analysis (and I got lost in kubelet...) - -### Infrastructure containers - -Mount utilities could be also part of infrastructure container that holds network namespace (when using Docker). Now it's typically simple `pause` container that does not do anything, it could hold mount utilities too. - -Advantages: -* Easy to set up -* No extra container running - -Disadvantages: -* One container for all mount utilities. Admin needs to make a single container that holds utilities for e.g. both gluster and nfs and whatnot. -* Needs some refactoring in kubelet - now kubelet mounts everything and then starts containers. We would need kubelet to start some container(s) first, then mount, then run the rest. This is probably possible, but needs better analysis (and I got lost in kubelet...) -* Short-living processes instead of long-running ones that would mimic CSI (so we could catch bugs early or even redesign CSI). -* Infrastructure container is implementation detail and CRI does not even allow executing binaries in it. - -**We've decided to go with long running DaemonSet pod as described below.** - -## Design - -* Pod with mount utilities puts a registration JSON file into `/var/lib/kubelet/plugin-containers/<plugin name>.json` on the host with name of the container where mount utilities should be executed: - ```json - { - "podNamespace": "kubernetes-storage", - "podName": "gluster-daemon-set-xtzwv", - "podUID": "5d1942bd-7358-40e8-9547-a04345c85be9", - "containerName": "gluster" - } - ``` - * Pod UID is used to avoid situation when a pod with mount utilities is terminated and leaves its registration file on the host. Kubelet should not assume that newly started pod with the same namespace+name has the same mount utilities. - * All slashes in `<plugin name>` must be replaced with tilde, e.g. `kubernetes.io~glusterfs.json`. - * Creating the file must be atomic so kubelet cannot accidentally read partly written file. - - * All volume plugins use `VolumeHost.GetExec` to get the right exec interface when running their utilities. - - * Kubelet's implementation of `VolumeHost.GetExec` looks at `/var/lib/kubelet/plugin-containers/<plugin name>.json` if it has a container for given volume plugin. - * If the file exists and referred container is running, it returns `Exec` interface implementation that leads to CRI's `ExecSync` into the container (i.e. 
`docker exec <container> ...`) - * If the file does not exist or referred container is not running, it returns `Exec` interface implementation that leads to `os.Exec`. This way, pods do not need to remove the registration file when they're terminated. - * Kubelet does not cache content of `plugin-containers/`, one extra `open()`/`read()` with each exec won't harm and it makes Kubelet more robust to changes in the directory. - -* In future, this registration of volume plugin pods should be replaced by a gRPC interface based on Device Plugin registration. - -## Requirements on DaemonSets with mount utilities -These are rules that need to be followed by DaemonSet authors: -* One DaemonSet can serve mount utilities for one or more volume plugins. We expect that one volume plugin per DaemonSet will be the most popular choice. -* One DaemonSet must provide *all* utilities that are needed to provision, attach, mount, unmount, detach and delete a volume for a volume plugin, including `mkfs` and `fsck` utilities if they're needed. - * E.g. `mkfs.ext4` is likely to be available on all hosts, but a pod with mount utilities should not depend on that nor use it. - * Kernel modules should be available in the pod with mount utilities too. "Available" does not imply that they need to be shipped in a container, we expect that binding `/lib/modules` from host to `/lib/modules` in the pod will be enough for all modules that are needed by Kubernetes internal volume plugins (all distros I checked incl. the "minimal" ones ship scsi.ko, rbd.ko, nfs.ko and fuse). This will allow future flex volumes ship vendor-specific kernel modules. It's up to the vendor to ensure that any kernel module matches the kernel on the host. - * The only exception is udev (or similar device manager). Only one udev can run on a system, therefore it should run on the host. If a volume plugin needs to talk to udev (e.g. by calling `udevadm trigger`), they must do it on the host and not in a container with mount utilities. -* It is expected that these daemon sets will run privileged pods that will see host's `/proc`, `/dev`, `/sys`, `/var/lib/kubelet` and such. Especially `/var/lib/kubelet` must be mounted with shared mount propagation so kubelet can see mounts created by the pods. -* The pods with mount utilities should run some simple init as PID 1 that reaps zombies of potential fuse daemons. -* The pods with mount utilities must put a file into `/var/lib/kubelet/plugin-containers/<plugin name>.json` for each volume plugin it supports. It should overwrite any existing file - it's probably leftover from older pod. - * Admin is responsible to run only one pod with utilities for one volume plugin on a single host. When two pods for say GlusterFS are scheduled on the same node they will overwrite the registration file of each other. - * Downward API can be used to get pod's name and namespace. - * Root privileges (or CAP_DAC_OVERRIDE) are needed to write to `/var/lib/kubelet/plugin-containers/`. - -To sum it up, it's just a daemon set that spawns privileged pods, running a simple init and registering itself into Kubernetes by placing a file into well-known location. - -**Note**: It may be quite difficult to create a pod that see's host's `/dev` and `/sys`, contains necessary kernel modules, does the initialization right and reaps zombies. We're going to provide a template with all this. - -### Upgrade -Upgrade of DaemonSets with pods with fuse-based mount utilities needs to be done node by node and with extra care. 
Killing a pod with fuse daemon(s) inside will un-mount all volumes that are used by other pods on the host and may result in data loss. - -In order to update the fuse-based DaemonSet (=GlusterFS or CephFS), admin must do for every node: -* Mark the node as tainted. Only the pod with mount utilities can tolerate the taint, all other pods are evicted. As result, all volumes are unmounted and detached. -* Update the pod. -* Remove the taint. - -Is there a way how to make it with DaemonSet rolling update? Is there any other way how to do this upgrade better? - -### Containerized kubelet - -Kubelet should behave the same when it runs inside a container: -* Use `os.Exec` to run mount utilities inside its own container when no pod with mount utilities is registered. This is current behavior, `mkfs.ext4`, `lsblk`, `rbd` and such are executed in context of the kubelet's container now. -* Use `nsenter <host> mount` to mount things when no pod with mount utilities is registered. Again, this is current behavior. -* Use CRI's `ExecSync` to execute both utilities and the final `mount` when a pod with mount utilities is registered so everything is executed in this pod. - -## Open items - -* How will controller-manager talk to pods with mount utilities? - - 1. Mount pods expose a gRPC service. - * controller-manager must be configured with the service namespace + name. - * Some authentication must be implemented (=additional configuration of certificates and whatnot). - * -> seems to be complicated. - - 2. Mount pods run in a dedicated namespace and have labels that tell which volume plugins they can handle. - * controller manager scans a namespace with a labelselector and does `kubectl exec <pod>` to execute anything in the pod. - * Needs configuration of the namespace. - * Admin must make sure that nothing else can run in the namespace (e.g. rogue pods that would steal volumes). - * Admin/installer must configure access to the namespace so only pv-controller and attach-detach-controller can do `exec` there. - - 3. We allow pods to run on hosts that run controller-manager. - - * Usual socket in `/var/lib/kubelet/plugin-sockets` will work. - * Can it work on GKE? - -We do not implement any of these approaches, as we expect that most volume plugins are going to be moved to CSI soon-ish. The only affected volume plugins are: - -* Ceph dynamic provisioning - we can use external provisioner during tests. -* Flex - it has its own dynamic registration of flex drivers. - -## Implementation notes -As we expect that most volume plugins are going to be moved to CSI soon, all implementation of this proposal will be guarded by alpha feature gate "MountContainers" which is never going leave alpha. Whole implementation of this proposal is going to be removed when the plugins are fully moved to CSI. - -Corresponding e2e tests for internal volume plugins will initially run only when with the feature gate is enabled and they will continue running when we move the volume plugins to CSI to ensure we won't introduce regressions. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
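A minimal sketch of the registration lookup described in this proposal — kubelet reading `/var/lib/kubelet/plugin-containers/<plugin name>.json` to decide whether to exec in a registered mount-utility container or fall back to the host; types and helper names are illustrative, not the actual kubelet implementation:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

type pluginRegistration struct {
	PodNamespace  string `json:"podNamespace"`
	PodName       string `json:"podName"`
	PodUID        string `json:"podUID"`
	ContainerName string `json:"containerName"`
}

// lookupMountContainer returns the registration for pluginName, or nil if no
// mount-utility container is registered (the caller then execs on the host).
func lookupMountContainer(kubeletDir, pluginName string) (*pluginRegistration, error) {
	// Slashes in the plugin name are replaced with a tilde in the file name,
	// e.g. kubernetes.io/glusterfs -> kubernetes.io~glusterfs.json.
	file := strings.ReplaceAll(pluginName, "/", "~") + ".json"
	path := filepath.Join(kubeletDir, "plugin-containers", file)

	data, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return nil, nil // no registration file: execute on the host
	}
	if err != nil {
		return nil, err
	}
	var reg pluginRegistration
	if err := json.Unmarshal(data, &reg); err != nil {
		return nil, err
	}
	return &reg, nil
}

func main() {
	reg, err := lookupMountContainer("/var/lib/kubelet", "kubernetes.io/glusterfs")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if reg == nil {
		fmt.Println("no mount container registered; exec on host")
		return
	}
	fmt.Printf("exec in pod %s/%s container %s\n", reg.PodNamespace, reg.PodName, reg.ContainerName)
}
```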
\ No newline at end of file
diff --git a/contributors/design-proposals/storage/containerized-mounter.md b/contributors/design-proposals/storage/containerized-mounter.md
index b1c8f298..f0fbec72 100644
--- a/contributors/design-proposals/storage/containerized-mounter.md
+++ b/contributors/design-proposals/storage/containerized-mounter.md
@@ -1,43 +1,6 @@
-# Containerized Mounter with Chroot for Container-Optimized OS
-
-## Goal
-
-Due to security and management overhead, our new Container-Optimized OS used by GKE
-does not carry certain storage drivers and tools needed for volume types such as
-nfs and glusterfs. This project takes a containerized mount approach to package
-mount binaries into a container. The volume plugin will execute mount inside a
-container and share the mount with the host.
-
-## Design
-
-1. A docker image has storage tools (nfs and glusterfs) pre-installed and is uploaded
-   to GCS.
-2. During GKE cluster configuration, the docker image is pulled and installed on
-   the cluster node.
-3. When an nfs or glusterfs type mount is invoked by kubelet, it will run the mount
-   command inside a container with the pre-installed docker image and the mount
-   propagation set to “shared”. In this way, the mount inside the container will be
-   visible to the host node too.
-4. As a special case for NFSv3, an rpcbind process is started before running the
-   mount command.
-
-## Implementation details
-
-* In the first version of the containerized mounter, we use rkt fly to dynamically
-  start a container during mount. When the mount command finishes, the container
-  exits normally and will be garbage-collected. However, in the case of a glusterfs
-  mount, a gluster daemon keeps running after the mount command finishes until the
-  volume is unmounted, so the container started for the mount will continue to run
-  until the glusterfs client finishes. The container cannot be garbage-collected
-  right away and multiple containers might be running for some time. Due to shared
-  mount propagation, with more containers running, the number of mounts will
-  increase significantly and might cause a kernel panic. To solve this problem, a
-  chroot approach is proposed and implemented.
-* In the second version, instead of running a container on the host, the docker
-  container’s file system is exported as a tar archive and pre-installed on the host.
-  The kubelet directory is a shared mount between the host and the container’s
-  rootfs. When a gluster/nfs mount is issued, a mounter script will use chroot to
-  change into the container’s rootfs and run the mount. This approach is very clean
-  since there is no need to manage a container’s lifecycle, and it avoids having a
-  large number of mounts.
+Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/storage/csi-migration.md b/contributors/design-proposals/storage/csi-migration.md index 7e392486..f0fbec72 100644 --- a/contributors/design-proposals/storage/csi-migration.md +++ b/contributors/design-proposals/storage/csi-migration.md @@ -1,934 +1,6 @@ -# In-tree Storage Plugin to CSI Migration Design Doc +Design proposals have been archived. -Authors: @davidz627, @jsafrane +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -This document presents a detailed design for migrating in-tree storage plugins -to CSI. This will be an opt-in feature turned on at cluster creation time that -will redirect in-tree plugin operations to a corresponding CSI Driver. -## Glossary - -* ADC (Attach Detach Controller): Controller binary that handles Attach and Detach portion of a volume lifecycle -* Kubelet: Kubernetes component that runs on each node, it handles the Mounting and Unmounting portion of volume lifecycle -* CSI (Container Storage Interface): An RPC interface that Kubernetes uses to interface with arbitrary 3rd party storage drivers -* In-tree: Code that is compiled into native Kubernetes binaries -* Out-of-tree: Code that is not compiled into Kubernetes binaries, but can be run as Deployments on Kubernetes - -## Background and Motivations - -The Kubernetes volume plugins are currently in-tree meaning all logic and -handling for each plugin lives in the Kubernetes codebase itself. With the -Container Storage Interface (CSI) the goal is to move those plugins out-of-tree. -CSI defines a standard interface for communication between the Container -Orchestrator (CO), Kubernetes in our case, and the storage plugins. - -As the CSI Spec moves towards GA and more storage plugins are being created and -becoming production ready, we will want to migrate our in-tree plugin logic to -use CSI plugins instead. This is motivated by the fact that we are currently -supporting two versions of each plugin (one in-tree and one CSI), and that we -want to eventually migrate all storage users to CSI. - -In order to do this we need to migrate the internals of the in-tree plugins to -call out to CSI Plugins because we will be unable to deprecate the current -internal plugin API’s due to Kubernetes API deprecation policies. This will -lower cost of development as we only have to maintain one version of each -plugin, as well as ease the transition to CSI when we are able to deprecate the -internal APIs. - -## Roadmap - -The migration from in-tree plugins to CSI plugins will involve the following phases: - -Phase 1: Typically, a CSI plugin (that an in-tree plugin has been migrated to) -will be invoked for operations on persistent volumes backed by a specific in-tree -plugin under the following conditions: -1. An overall feature flag: CSIMigration is enabled for the Kubernetes Controller -Manager and Kubelet. -2. A feature flag for the specific in-tree plugin around migration (e.g. CSIMigrationGCE, -CSIMigrationAWS) is enabled for the Kubernetes Controller Manager and Kubelet. -In case the Kubelet on a specific node does not have the above feature flags enabled -(or running an old version that does not support the above feature flags), the in-tree -plugin code will be executed for operations like attachment/detachment -and mount/dismount of volumes. 
To support this, ProbeVolumePlugins function for -in-tree plugin packages will continue to be invoked (as is the case today). This -will result in all in-tree plugins added to the list of plugins whose methods -can be invoked by the Kubelet and Kubernetes cluster-wide volume -controllers when necessary. - -Phase 2: ProbeVolumePlugins function for specific migrated in-tree plugin packages -will no longer be invoked by the Kubernetes Controller Manager and Kubelet under -the following conditions: -1. An overall feature flag: CSIMigration is enabled for the Kubernetes Controller -Manager and Kubelets on all nodes. -2. A feature flag for the specific in-tree plugin around migration (e.g. CSIMigrationGCE, -CSIMigrationAWS) is enabled for the Kubernetes Controller Manager and Kubelets on -all nodes. -3. An overall feature flag: CSIMigrationInTreeOff is enabled for the Kubernetes -Controller Manager and Kubelet. -All nodes in a cluster must satisfy at least [1] and [2] above in the Kubelet -configuration for [3] to take effect and function correctly. This requires that -all nodes in the cluster must have migrated CSI plugins installed and configured. - -Phase 3: Files containing in-tree plugin code are no longer compiled as part of -Kubernetes components using golang build tag: nolegacyproviders in preparation -for the final Phase 4 below. This may only be in effect in test environments. - -Phase 4: In-tree code for specific plugins is removed from Kubernetes. - -## Goals - -* Compile all requirements for a successful transition of the in-tree plugins to - CSI - * In-tree plugin code for migrated plugins can be completely removed from Kubernetes - * In-tree plugin API is untouched, user Pods and PVs continue working after - upgrades - * Minimize user visible changes -* Design a robust mechanism for redirecting in-tree plugin usage to appropriate - CSI drivers, while supporting seamless upgrade and downgrade between new - Kubernetes version that uses CSI drivers for in-tree volume plugins to an old - Kubernetes version that uses old-fashioned volume plugins without CSI. -* Design framework for migration that allows for easy interface extension by - in-tree plugin authors to “migrate” their plugins. - * Migration must be modular so that each plugin can have migration turned on - and off separately - -## Non-Goals - -* Design a mechanism for deploying CSI drivers on all systems so that users can - use the current storage system the same way they do today without having to do - extra set up. 
-* Implementing CSI Drivers for existing plugins -* Define set of volume plugins that should be migrated to CSI - -## Implementation Schedule - -Alpha [1.16] -* Feature flag for Phase 1, CSIMigration, disabled by default -* Proof of concept migration of at least 2 storage plugins [AWS, GCE] -* Framework for plugin migration built for Dynamic provisioning, pre-provisioned - volumes, and in-tree volumes - -Beta [Target 1.17] -* Feature flags for Phase 1, CSIMigration, disabled by default -* Feature flags for Phase 2, CSIMigrationInTreeOff disabled by default -* Feature flag for migrated in-tree plugins disabled by default -* Translations of a subset of the cloud provider plugins to CSI in progress - -GA [TBD] -* Feature flags for Phase 1 and 2 enabled by default, per-plugin toggle on for - relevant cloud provider by default -* CSI Drivers for migrated plugins available on related cloud provider cluster - by default - -## Milestones - -* Translation Library implemented in Kubernetes staging -* Translation of volumes in volume controllers to support Provision, Attach, - Detach, Mount, Unmount (including Inline Volumes) using migrated CSI plugins. -* Translation of volumes in volume controllers to support Resize, Block using - migrated CSI plugins. -* CSI Driver lifecycle manager -* GCE PD feature parity in CSI with in-tree implementation -* AWS EBS feature parity in CSI with in-tree implementation -* Cloud Driver feature parity in CSI with in-tree implementation -* Skip ProbeVolumePlugins of migrated in-tree plugin code (Phase 2). - -## Dependency Graph - - - -## Feature Gating - -We will have two feature gates for the overall feature: CSIMigration and CSIMigrationInTreeOff -corresponding to Phase 1 and 2 respectively. Additionally, plugin-specific feature flags -(e.g. CSIMigrationGCE, CSIMigrationEBS) will determine whether a migration phase -is enabled for a specific in-tree plugin. This allows administrators to enable a specific -phase of migration functionality on the cluster as a whole as well as the flexibility -to toggle migration functionality for each legacy in-tree plugin individually. - -With CSIMigration feature flag enabled on Kubernetes Controller Manager and Kubelet, -several volume actions associated with in-tree plugins (that have plugin specific -migration feature flags enabled) will be handled by CSI plugins that the in-tree -plugins have migrated to. If the Kubelet on a cluster node does not have CSIMigration -and plugin-specific migration feature flags enabled or running an old version of Kubelet -before the CSI migration feature flags were introduced, the in-tree plugin code -will continue to handle actions like attach/detach and mount/unmount of volumes on that node. - -During initialization, with CSIMigrationInTreeOff feature flag enabled, Kubernetes -Controller Manager and Kubelet will skip invocation of ProbeVolumePlugins for migrated -in-tree plugins (that have plugin-specific migration feature flags enabled). -As a result, all nodes in the cluster must have: [1] CSI plugins (that in-tree -plugins have been migrated to) configured and installed and [2] CSIMigration and -plugin specific feature flags enabled for the Kubelet. If these requirements are -not fulfilled on each node, operations involving volumes backed by in-tree plugins -will fail with errors. - - -The new feature gates for alpha are: -``` -// Enables the in-tree storage to CSI Plugin migration feature. 
-CSIMigration utilfeature.Feature = "CSIMigration" - -// Disables the in-tree storage plugin code -CSIMigrationInTreeOff utilfeature.Feature = "CSIMigrationInTreeOff" - -// Enables the GCE PD in-tree driver to GCE CSI Driver migration feature. -CSIMigrationGCE utilfeature.Feature = "CSIMigrationGCE" - -// Enables the AWS in-tree driver to AWS CSI Driver migration feature. -CSIMigrationAWS utilfeature.Feature = "CSIMigrationAWS" - -// Enables the Azure Disk in-tree driver to Azure Disk Driver migration feature. -CSIMigrationAzureDisk featuregate.Feature = "CSIMigrationAzureDisk" - -// Enables the Azure File in-tree driver to Azure File Driver migration feature. -CSIMigrationAzureFile featuregate.Feature = "CSIMigrationAzureFile" - -// Enables the OpenStack Cinder in-tree driver to OpenStack Cinder CSI Driver migration feature. -CSIMigrationOpenStack featuregate.Feature = "CSIMigrationOpenStack" -``` - -## Translation Layer - -The main mechanism we will use to migrate plugins is redirecting in-tree -operation calls to the CSI Driver instead of the in-tree driver, the external -components will pick up these in-tree PV's and use a translation library to -translate to CSI Source. - -Pros: -* Keeps old API objects as they are -* Facilitates gradual roll-over to CSI - -Cons: -* Somewhat complicated and error prone. -* Bespoke translation logic for each in-tree plugin - -### Dynamically Provisioned Volumes - -#### Kubernetes Changes - -Dynamically Provisioned volumes will continue to be provisioned with the in-tree -`PersistentVolumeSource`. The CSI external-provisioner to pick up the -in-tree PVC's when migration is turned on and provision using the CSI Drivers; -it will then use the imported translation library to return with a PV that contains an equivalent of the original -in-tree PV. The PV will then go through all the same steps outlined below in the -"Non-Dynamic Provisioned Volumes" for the rest of the volume lifecycle. - -#### Leader Election - -There will have to be some mechanism to switch between in-tree and external -provisioner when the migration feature is turned on/off. The two should be -compatible as they both will create the same volume and PV based on the same -PVC, as well as both be able to delete the same PV/PVCs. The in-tree provisioner -will have logic added so that it will stand down and mark the PV as "migrated" -with an annotation when the migration is turned on and the external provisioner -will take care of the PV when it sees the annotation. - -### Translation Library - -In order to make this on-the-fly translation work we will develop a separate -translation library. This library will have to be able to translate from in-tree -PV Source to the equivalent CSI Source. This library can then be imported by -both Kubernetes and the external CSI Components to translate Volume Sources when -necessary. The cost of doing this translation will be very low as it will be an -imported library and part of whatever binary needs the translation (no extra -API or RPC calls). - -#### Library Interface - -``` -type CSITranslator interface { - // TranslateInTreePVToCSI takes a persistent volume and will translate - // the in-tree source to a CSI Source if the translation logic - // has been implemented. 
The input persistent volume will not - // be modified - TranslateInTreePVToCSI(pv *v1.PersistentVolume) (*v1.PersistentVolume, error) { - - // TranslateCSIPVToInTree takes a PV with a CSI PersistentVolume Source and will translate - // it to a in-tree Persistent Volume Source for the specific in-tree volume specified - // by the `Driver` field in the CSI Source. The input PV object will not be modified. - TranslateCSIPVToInTree(pv *v1.PersistentVolume) (*v1.PersistentVolume, error) { - - // TranslateInTreeInlineVolumeToPVSpec takes an inline intree volume and will translate - // the in-tree volume source to a PersistentVolumeSpec containing a CSIPersistentVolumeSource - TranslateInTreeInlineVolumeToPVSpec(volume *v1.Volume) (*v1.PersistentVolumeSpec, error) { - - // IsMigratableByName tests whether there is Migration logic for the in-tree plugin - // for the given `pluginName` - IsMigratableByName(pluginName string) bool { - - // GetCSINameFromIntreeName maps the name of a CSI driver to its in-tree version - GetCSINameFromIntreeName(pluginName string) (string, error) { - - // IsPVMigratable tests whether there is Migration logic for the given Persistent Volume - IsPVMigratable(pv *v1.PersistentVolume) bool { - - // IsInlineMigratable tests whether there is Migration logic for the given Inline Volume - IsInlineMigratable(vol *v1.Volume) bool { -} -``` - -#### Library Versioning - -Since the library will be imported by various components it is imperative that -all components import a version of the library that supports in-tree driver x -before the migration feature flag for x is turned on. If not, the TranslateToCSI -function will return an error when the translation is attempted. - - -### Pre-Provisioned Volumes (and volumes provisioned before migration) - -In the OperationGenerator at the start of each volume operation call we will -check to see whether the plugin has been migrated. - -For Controller calls, we will call the CSI calls instead of the in-tree calls. -The OperationGenerator can do the translation of the PV Source before handing it -to the CSI calls, therefore the CSI in-tree plugin will only have to deal with -what it sees as a CSI Volume. Special care must be taken that `volumeHandle` is -unique and also deterministic so that we can always find the correct volume. -We also foresee that future controller calls such as resize and snapshot will use a similar mechanism. All these external components -will also need to be updated to accept PV's of any source type when it is given -and use the translation library to translate the in-tree PV Source into a CSI -Source when necessary. - -For Node calls, the VolumeToMount object will contain the in-tree PV Source, -this can then be translated by the translation library when needed and -information can be fed to the CSI components when necessary. - -Then the rest of the code in the Operation Generator can execute as normal with -the CSI Plugin and the annotation in the requisite locations. - -Caveat: For ALL detach calls of plugins that MAY have already been migrated we -have to attempt to DELETE the VolumeAttachment object that would have been -created if that plugin was migrated. This is because Attach after migration -creates a VolumeAttachment object, and if for some reason we are doing a detach -with the in-tree plugin, the VolumeAttachment object becomes orphaned. - - -### In-line Volumes - -In-line controller calls are a special case because there is no PV. 
In this case, -we will translate the in-line Volume into a PersistentVolumeSpec using -plugin-specific translation logic in the CSI translation library method, -`TranslateInTreeInlineVolumeToPVSpec`. The resulting PersistentVolumeSpec will -be stored in a new field `VolumeAttachment.Spec.Source.VolumeAttachmentSource.InlineVolumeSpec`. - -The plugin-specific CSI translation logic invoked by `TranslateInTreeInlineVolumeToPVSpec` -will need to populate the `CSIPersistentVolumeSource` field along with appropriate -values for `AccessModes` and `MountOptions` fields in -`VolumeAttachment.Spec.Source.VolumeAttachmentSource.InlineVolumeSpec`. Since -`AccessModes` and `MountOptions` are not specified for inline volumes, default values -for these fields suitable for the CSI plugin will need to be populated in addition -to translation logic to populate `CSIPersistentVolumeSource`. - -The VolumeAttachment name must be made with the CSI translated version of the -VolumeSource in order for it to be discoverable by Detach and WaitForAttach -(described in more detail below). - -The CSI Attacher will have to be modified to also check for `InlineVolumeSpec` -besides the `PersistentVolumeName`. Only one of the two may be specified. If `PersistentVolumeName` -is empty and `InlineVolumeSpec` is set, the CSI Attacher will not look for -an associated PV in it's PV informer cache as it implies the inline volume scenario -(where no PVs are created). - -The CSI Attacher will have access to all the data it requires for handling in-line -volumes attachment (through the CSI plugins) from fields in the `InlineVolumeSpec`. - -The new VolumeAttachmentSource API will look as such: -``` -// VolumeAttachmentSource represents a volume that should be attached. -// Inline volumes and Persistent volumes can be attached via external attacher. -// Exactly one member can be set. -type VolumeAttachmentSource struct { - // Name of the persistent volume to attach. - // +optional - PersistentVolumeName *string `json:"persistentVolumeName,omitempty" protobuf:"bytes,1,opt,name=persistentVolumeName"` - - // A PersistentVolumeSpec whose fields contain translated data from a pod's inline - // VolumeSource to support shimming of in-tree inline volumes to a CSI backend. - // This field is alpha-level and is only honored by servers that - // enable the CSIMigration feature. - // +optional - InlineVolumeSpec *v1.PersistentVolumeSpec `json:"inlineVolumeSpec,omitempty" protobuf:"bytes,2,opt,name=inlineVolumeSpec"` -} -``` - -We need to be careful with naming VolumeAttachments for in-line volumes. The -name needs to be unique and ADC must be able to find the right VolumeAttachment -when a pod is deleted (i.e. using only info in Node.Status). CSI driver in -kubelet must be able to find the VolumeAttachment too to call WaitForAttach and -VolumesAreAttached. - -The attachment name is usually a hash of the volume name, CSI Driver name, and -Node name. We are able to get all this information for Detach and WaitForAttach -by translating the in-tree inline volume source to a CSI volume source before -passing it to to the volume operations. - -There is currently a race condition in in-tree inline volumes where if a pod -object is deleted and the ADC restarts we lose the information for the inline -volume and will not be able to detach the volume. This is a known issue and we -will retain the same behavior with migrated inline volumes. 
However, we may be -able to solve this in the future by reconciling the VolumeAttachment object with -existing Pods in the ADC. - - -### Volume Resize -#### Offline Resizing -For controller expansion, in the in-tree resize controller, we will create a new PVC annotation `volume.kubernetes.io/storage-resizer` -and set the value to the name of resizer. If the PV is CSI PV or migrated in-tree PV, the annotation will be set to -the name of CSI driver; otherwise, it will be set to the name of in-tree plugin. - -For migrated volume, The CSI resizer name will be derived from translating in-tree plugin name -to CSI driver name by translation library. We will also add an event to PVC about resizing being handled -by external controller. - -For external resizer, we will update it to expand volume for both CSI volume and in-tree -volume (only if migration is enabled). For migrated in-tree volume, it will update in-tree PV object -with new volume size and mark in-tree PVC as resizing finished. - -To synchronize between in-tree resizer and external resizer, external resizer will find resizer name -using PVC annotation `volume.kubernetes.io/storage-resizer`. Since `volume.kubernetes.io/storage-resizer` -annotation defines the CSI plugin name which will handle external resizing, it should -match driver running with external-resizer, hence external resizer will proceed with volume resizing. Otherwise, -it will yield to in-tree resizer. - -For filesystem expansion, in the OperationGenerator, `GenerateMountVolumeFunc` is used to expand file system after volume -is expanded and staged/mounted. The migration logic is covered by previous migration of volume mount. - -#### Online Resizing -Handling online resizing does not require anything special in control plane. The behaviour will be -same as offline resizing. - -To handle expansion on kubelet - we will convert volume spec to CSI spec before handling the call -to volume plugin inside `GenerateExpandVolumeFSWithoutUnmountingFunc`. - -### Raw Block -In the OperationGenerator, `GenerateMapVolumeFunc`, `GenerateUnmapVolumeFunc` and -`GenerateUnmapDeviceFunc` are used to prepare and mount/umount block devices. At the -beginning of each API, we will check whether migration is enabled for the plugin. If -enabled, volume spec will be translated from the in-tree spec to out-of-tree spec using -CSI as the persistence volume source. - -Caveat: the original spec needs to be used when setting the state of `actualStateOfWorld` -for where is it used before the translation. - -### Volume Reconstruction - -Volume Reconstruction is currently a routine in the reconciler that runs on the -nodes when a Kubelet restarts and loses its cached state (`desiredState` and -`actualState`). It is kicked off in `syncStates()` in -`pkg/kubeletvolumemanager/reconciler/reconciler.go` and attempts to reconstruct -a volume based on the mount path on the host machine. - -When CSI Migration is turned on, when the reconstruction code is run and it -finds a CSI mounted volume we currently do not know whether it was mounted as a -native CSI volume or migrated from in-tree. To solve this issue we will save a -`migratedVolume` boolean in the `saveVolumeData` function when the `NewMounter` -is created during the `MountVolume` call for that particular volume in the -Operation generator. 
- -When the Kubelet is restarted and we lose state the Kubelet will call -`reconstructVolume` we can `loadVolumeData` and determine whether that CSI -volume was migrated or not, as well as get the information about the original -plugin requested. With that information we should be able to call the -`ReconstructVolumeOperation` with the correct in-tree plugin to get the original -in-tree spec that we can then pass to the rest of volume reconstruction. The -rest of the volume reconstruction code will then use this in-tree spec passed to -the `desiredState`, `actualState`, and `operationGenerator` and the volume will -go through the standard volume pathways and go through the standard migrated -volume lifecycles described above in the "Pre-Provisioned Volumes" section. - -### Volume Limit - -TODO: Design - -## Interactions with PV-PVC Protection Finalizers - -PV-PVC Protection finalizers prevent deletion of a PV when it is bound to a PVC, -and prevent deletion of a PVC when it is in use by a pod. - -There is no known issue with interaction here. The finalizers will still work in -the same ways as we are not removing/adding PV’s or PVC’s in out of the ordinary -ways. - -## Dealing with CSI Driver Failures - -Plugin should fail if the CSI Driver is down and migration is turned on. When -the driver recovers we should be able to resume gracefully. - -We will also create a playbook entry for how to turn off the CSI Driver -migration gracefully, how to tell when the CSI Driver is broken or non-existent, -and how to redeploy a CSI Driver in a cluster. - -## API Changes - -### CSINodeInfo API - -Changes in: https://github.com/kubernetes/kubernetes/pull/70515 - -#### Old CSINodeInfo API - -``` -// CSINodeInfo holds information about all CSI drivers installed on a node. -type CSINodeInfo struct { - metav1.TypeMeta `json:",inline"` - - // metadata.name must be the Kubernetes node name. - metav1.ObjectMeta `json:"metadata,omitempty"` - - // List of CSI drivers running on the node and their properties. - // +patchMergeKey=driver - // +patchStrategy=merge - CSIDrivers []CSIDriverInfo `json:"csiDrivers" patchStrategy:"merge" patchMergeKey:"driver"` -} - -// CSIDriverInfo contains information about one CSI driver installed on a node. -type CSIDriverInfo struct { - // driver is the name of the CSI driver that this object refers to. - // This MUST be the same name returned by the CSI GetPluginName() call for - // that driver. - Driver string `json:"driver"` - - // nodeID of the node from the driver point of view. - // This field enables Kubernetes to communicate with storage systems that do - // not share the same nomenclature for nodes. For example, Kubernetes may - // refer to a given node as "node1", but the storage system may refer to - // the same node as "nodeA". When Kubernetes issues a command to the storage - // system to attach a volume to a specific node, it can use this field to - // refer to the node name using the ID that the storage system will - // understand, e.g. "nodeA" instead of "node1". - NodeID string `json:"nodeID"` - - // topologyKeys is the list of keys supported by the driver. - // When a driver is initialized on a cluster, it provides a set of topology - // keys that it understands (e.g. "company.com/zone", "company.com/region"). - // When a driver is initialized on a node it provides the same topology keys - // along with values that kubelet applies to the coresponding node API - // object as labels. 
- // When Kubernetes does topology aware provisioning, it can use this list to - // determine which labels it should retrieve from the node object and pass - // back to the driver. - TopologyKeys []string `json:"topologyKeys"` -} -``` - -#### New CSINodeInfo API - -``` -// +genclient -// +genclient:nonNamespaced -// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object - -// CSINodeInfo holds information about all CSI drivers installed on a node. -// CSI drivers do not need to create the CSINodeInfo object directly. As long as -// they use the node-driver-registrar sidecar container, the kubelet will -// automatically populate the CSINodeInfo object for the CSI driver as part of -// kubelet plugin registration. -// CSINodeInfo has the same name as a node. If it is missing, it means either -// there are no CSI Drivers available on the node, or the Kubelet version is low -// enough that it doesn't create this object. -// CSINodeInfo has an OwnerReference that points to the corresponding node object. -type CSINodeInfo struct { - metav1.TypeMeta - - // metadata.name must be the Kubernetes node name. - metav1.ObjectMeta - - // spec is the specification of CSINodeInfo - Spec CSINodeInfoSpec -} - -// CSINodeInfoSpec holds information about the specification of all CSI drivers installed on a node -type CSINodeInfoSpec struct { - // drivers is a list of information of all CSI Drivers existing on a node. - // It can be empty on initialization. - // +patchMergeKey=name - // +patchStrategy=merge - Drivers []CSIDriverInfoSpec -} - -// CSIDriverInfoSpec holds information about the specification of one CSI driver installed on a node -type CSIDriverInfoSpec struct { - // This is the name of the CSI driver that this object refers to. - // This MUST be the same name returned by the CSI GetPluginName() call for - // that driver. - Name string - - // nodeID of the node from the driver point of view. - // This field enables Kubernetes to communicate with storage systems that do - // not share the same nomenclature for nodes. For example, Kubernetes may - // refer to a given node as "node1", but the storage system may refer to - // the same node as "nodeA". When Kubernetes issues a command to the storage - // system to attach a volume to a specific node, it can use this field to - // refer to the node name using the ID that the storage system will - // understand, e.g. "nodeA" instead of "node1". - // This field must be populated. An empty string means NodeID is not initialized - // by the driver and it is invalid. - NodeID string - - // topologyKeys is the list of keys supported by the driver. - // When a driver is initialized on a cluster, it provides a set of topology - // keys that it understands (e.g. "company.com/zone", "company.com/region"). - // When a driver is initialized on a node, it provides the same topology keys - // along with values. Kubelet will expose these topology keys as labels - // on its own node object. - // When Kubernetes does topology aware provisioning, it can use this list to - // determine which labels it should retrieve from the node object and pass - // back to the driver. - // It is possible for different nodes to use different topology keys. - // This can be empty if driver does not support topology. - // +optional - TopologyKeys []string -} -``` - -#### API Lifecycle - -A new `CSINodeInfo` API object is created for each node by the Kubelet on -Kubelet initialization before pods are able to be scheduled. 
A driver will be
added with all of its information populated when a driver is registered through
the plugin registration mechanism. When the driver is unregistered through the
plugin registration mechanism, its entry will be removed from the `Drivers` list
in the `CSINodeInfoSpec`.

#### Kubelet Initialization & Migration Annotation

On Kubelet initialization we will also pre-populate an annotation for that
node's `CSINodeInfo`. The key will be
`storage.alpha.kubernetes.io/migrated-plugins` and the value will be the list of
in-tree plugin names for which the Kubelet has the migration shim turned on
(through feature flags). This must be populated before the Kubelet becomes
schedulable in order to achieve the synchronization described in the "ADC and
Kubelet CSI/In-tree Sync" section below.

## Upgrade/Downgrade, Migrate/Un-migrate

### Feature Flags

The ADC and Kubelet use the "same" feature flags, but in reality they are passed
to each binary separately. There will be a feature flag per driver as well as
one for CSIMigration in general.

The Kubelet will use its own feature flags to determine whether to use the
in-tree or CSI backend for Kubelet storage lifecycle operations, as well as to
add the plugins whose feature flags are on to the
`storage.alpha.kubernetes.io/migrated-plugins` annotation of `CSINodeInfo` for
the node that the Kubelet is running on.

The ADC will also use its own feature flags to help determine whether to use the
in-tree or CSI backend for ADC storage lifecycle operations. The other component
that helps determine which backend to use is outlined below in the "ADC and
Kubelet CSI/In-tree Sync" section.

### ADC and Kubelet CSI/In-tree Sync

Some plugins have subtly different behavior on the ADC and Kubelet side between
the in-tree and CSI implementations. It is therefore important that if the ADC
uses the in-tree implementation, the Kubelet does as well, and if the ADC uses
the CSI-migrated implementation, the Kubelet does as well. We will therefore
implement a mechanism to keep the ADC and the Kubelet in sync about the
Kubelet's capabilities as well as the feature gates active in each.

In order for the ADC to have the requisite information from the Kubelet to make
informed decisions, the Kubelet must propagate the
`storage.alpha.kubernetes.io/migrated-plugins` annotation information for each
potentially migrated driver on Kubelet startup and be considered `NotReady`
until that information is synced to the API server; a sketch of such an object
is shown below.
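To make this concrete, here is a rough sketch of what a `CSINodeInfo` object
might look like once the Kubelet has registered a CSI driver and populated the
migration annotation. It is illustrative only; the API group/version, driver
name, node ID, topology key, and the exact annotation value format are
assumptions for the example rather than prescribed values.

```yaml
apiVersion: csi.storage.k8s.io/v1alpha1  # illustrative; the served group/version may differ
kind: CSINodeInfo
metadata:
  # metadata.name must match the Kubernetes node name
  name: node-1
  annotations:
    # in-tree plugins for which this Kubelet has the migration shim enabled
    storage.alpha.kubernetes.io/migrated-plugins: "kubernetes.io/gce-pd"
spec:
  drivers:
    # populated by the Kubelet when the driver registers through plugin registration
  - name: pd.csi.storage.gke.io  # must match the name returned by the driver's GetPluginName() call
    nodeID: gce-instance-1234
    topologyKeys:
    - topology.gke.io/zone
```

The ADC then uses the existence of this object and the contents of the
`migrated-plugins` annotation to decide, per node, whether to run the in-tree or
the CSI code path.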
This gives us the following
guarantees:
* If `CSINodeInfo` for the node does not exist, then the ADC can infer that the
  Kubelet is not at a version with migration logic and should therefore fall
  back to the in-tree implementation
* If `CSINodeInfo` exists, and `storage.alpha.kubernetes.io/migrated-plugins`
  does not include the plugin name, then the ADC can infer that the Kubelet has
  migration logic but the feature flag for that particular plugin is `off`, and
  the ADC should therefore fall back to the in-tree storage implementation
* If `CSINodeInfo` exists, and `storage.alpha.kubernetes.io/migrated-plugins`
  does include the plugin name, then the ADC can infer that the Kubelet has
  migration logic and the feature flag for that particular plugin is `on`, and
  the ADC should therefore use the CSI-plugin migration implementation
* If `CSINodeInfo` exists, and `storage.alpha.kubernetes.io/migrated-plugins`
  does include the plugin name, but the ADC feature flags for that driver are
  off (`in-tree`), then an error should be thrown notifying users that the
  Kubelet requested the `csi-plugin` volume plugin mechanism but it was not
  specified on the ADC

In each of the above cases, the decision the ADC makes to use the in-tree or CSI
migration implementation will mirror the Kubelet's logic, guaranteeing that the
entire lifecycle of a volume from controller to Kubelet is handled by the same
implementation.

### Node Drain Requirement

We require nodes to be drained whenever the Kubelet is upgraded/downgraded or
migrated/unmigrated to ensure that the entire volume lifecycle is maintained
inside one code branch (CSI or in-tree). This simplifies upgrade/downgrade
significantly and reduces the chance of errors and races.

### Upgrade/Downgrade Migrate/Unmigrate Scenarios

For upgrade, starting from a non-migrated cluster you must turn on migration for
the ADC first, then drain your node before turning on migration for the
Kubelet. The workflow is as follows:
1. ADC and Kubelet are both not migrated
2. ADC restarted and migrated (flags flipped)
3. ADC continues to use in-tree code for this node because
   `storage.alpha.kubernetes.io/migrated-plugins` does NOT include the plugin
   name
4. Node drained and made unschedulable. All volumes unmounted/detached with
   in-tree code
5. Kubelet restarted and migrated (flags flipped)
6. Kubelet updates CSINodeInfo to tell the ADC (without an informer) whether
   each node/driver has been migrated by adding the plugin to the
   `storage.alpha.kubernetes.io/migrated-plugins` annotation
7. Kubelet is made schedulable
8. Both ADC & Kubelet migrated; the node is in a "fresh" state so the lifecycle
   of all new volumes is CSI

For downgrade, starting from a fully migrated cluster you must drain your node
first, then turn off migration for your Kubelet, then turn off migration for the
ADC. The workflow is as follows:
1. ADC and Kubelet are both migrated
2. Kubelet drained and made unschedulable, all volumes unmounted/detached with
   CSI code
3. Kubelet restarted and un-migrated (flags flipped)
4. Kubelet removes the plugin in question from
   `storage.alpha.kubernetes.io/migrated-plugins`. In case the kubelet does not
   have `storage.alpha.kubernetes.io/migrated-plugins` update code, the admin
   must update the field manually.
5. Kubelet is made schedulable.
6. At this point all volumes going onto the node would be using in-tree code for
   both the ADC (because of the annotation) and the Kubelet
7. 
Restart and un-migrate ADC - -With these workflows a volume attached with CSI will be handled by CSI code for -its entire lifecycle, and a volume attached with in-tree code will be handled by -in-tree code for its entire lifecycle. - -## Cloud Provider Requirements - -There is a push to remove CloudProvider code from kubernetes. - -There will not be any general auto-deployment mechanism for ALL CSI drivers -covered in this document so the timeline to remove CloudProvider code using this -design is undetermined: For example: At some point GKE could auto-deploy the GCE -PD CSI driver and have migration for that turned on by default, however it may -not deploy any other drivers by default. And at this point we can only remove -the code for the GCE In-tree plugin (this would still break anyone doing their -own deployments while using GCE unless they install the GCE PD CSI Driver). - -We could have auto-deploy depending on what cloud provider kubernetes is running -on. But AFAIK there is no standard mechanism to guarantee this on all Cloud -Providers. - -For example the requirements for just the GCE Cloud Provider code for storage -with minimal disruption to users would be: -* In-tree to CSI Plugin migration goes GA -* GCE PD CSI Driver deployed on GCE/GKE by default (resource requirements of - driver need to be determined) -* GCE PD CSI Migration turned on by default -* Remove in-tree plugin code and cloud provider code - -And at this point users doing their own deployment and not installing the GCE PD -CSI driver encounter an error. - -## Disabling in-tree plugin code - -Before we can stop compiling and prepare to remove the code associated with migrated -in-tree plugins, we need to make sure all persistent volume operations involving -in-tree plugins continue to function in a backward compatible way when the -in-tree plugin code paths are disabled. When CSIMigrationInTreeOff feature flag -is enabled, we will not invoke ProbeVolumePlugins() for the in-tree plugins (that -have plugin-specific migration feature flag enabled) in appendAttachableLegacyProviderVolumes() -and appendLegacyProviderVolumes() in the Kubernetes Controller Manager and Kubelet. -All functions in Kubernetes code base that depend on probed in-tree plugins need to be -audited and refactored to handle errors returned (due to absence of a probed plugin). - -### Enhancements in Probing/Registration of in-tree plugins - -Functions appendLegacyProviderVolumes (in the Kubernetes Controller Manager and Kubelet) -and appendAttachableLegacyProviderVolumes (in the Kubernetes Controller Manager) -will be enhanced to [1] check plugin specific migration feature flags as well as -[2] the overall CSIMigrationInTreeOff feature flag to determine whether ProbeVolumePlugins -function of a legacy in-tree plugin will get invoked. - -Once CSIMigrationInTreeOff feature flag and the plugin specific migration -flags get enabled by default, the build tag `nolegacyproviders` can be enabled -for testing purposes. - -### Detection of migration status for a plugin - -Code paths that need to check migration status of a plugin, for example, -in provisionClaimOperationExternal, findDeletablePlugin, etc. 
will need to -depend on an instance of the CSIMigratedPluginManager that provides the following -utilities: -``` -func (pm CSIMigratedPluginManager) IsCSIMigrationEnabledForPluginByName(pluginName string) bool -func (pm CSIMigratedPluginManager) IsPluginMigratableToCSIBySpec(spec *Spec) (bool, error) -``` -The CSIMigratedPluginManager will be introduced in pkg/volume/csi_migration.go - -Note that a per-plugin member function of the form IsMigratedToCSI cannot be used -since [1] a plugin object for in-tree plugins will typically be nil and [2] the -code implementing IsMigratedToCSI will be removed as part of disabling plugin code. - -### Handling of errors returned by FindPluginBySpec/FindPluginByName for legacy plugins - -Once an in-tree plugin is no longer probed (through ProbeVolumePlugins), all the VolumePluginMgr -functions of the form Find*PluginBySpec/Find*PluginByName will return an error. -For example, invocation of: - -1. FindProvisionablePluginByName in the pv controller will return error for a migrated -plugin that is no longer probed in ProbeControllerVolumePlugins. -2. FindExpandablePluginBySpec in the expand controller will return error for a migrated -plugin that is no longer probed in ProbeExpandableVolumePlugins. -3. FindAttachablePluginBySpec in the attach/detach controller will return error for -a migrated plugin that is no longer probed in ProbeAttachableVolumePlugins. - -Code invoking the above functions will need to check for migration status of a plugin -based on Volume spec or Plugin name before invoking the above functions so that -an error is never encountered due to missing plugins. - -### Enhancements in Controllers handling Persistent Volumes - -#### AttachDetach Controller -The AttachDetach Controller will translate volume specs for an in-tree plugin to -a migrated CSI plugin at the points where Desired State of World cache gets -populated (through CreateVolumeSpec) when the following conditions are true: -[1] CSIMigration feature flag is enabled for Kubernetes Controller Manager and -the Kubelet where the pod with references to volumes got scheduled. -[2] A plugin-specific migration feature flag is enabled for Kubernetes Controller -Manager and the Kubelet where the pod with references to volumes got scheduled. -Translation during population of Desired State of World avoids down-stream functions -at the operation generator/executor stages from having to handle translation of -volume specs for migrated PVs whose in-tree plugins are not probed. - -Translation of volume specs for an in-tree plugin to a migrated CSI plugin as -described above will be skipped during Desired State of World population if: -[1] CSIMigration or plugin specific migration feature flags are disabled in Kubernetes -Controller Manager. -[2] CSIMigration or plugin specific migration flags are disabled for Kubelet -in a specific node where a pod with volumes got scheduled and CSIMigrationInTreeOff -is disabled in Kubernetes Controller Manager. - -Determination of whether migration feature flags are enabled in the Kubelet is -described earlier in the section: Kubelet CSI/In-tree Sync. - -#### Expansion Controller -The Expansion Controller will set the Storage Resizer annotation (volume.kubernetes.io/storage-resizer) -on a PVC (referring to a storage class associated with a legacy in-tree plugin) -with the name of a migrated CSI plugin when the following conditions are true: -[1] CSIMigration feature flag is enabled in Kubernetes Controller Manager. 
-[2] A plugin-specific migration feature flag is enabled in Kubernetes Controller -Manager. -This allows a migrated CSI plugin to process the resizing of the volume associated -with a PVC that refers to a migrated in-tree plugin. - -When the above conditions are not met, the Expansion Controller will use FindExpandablePluginBySpec -to determine the in-tree plugin that can be used for expanding a volume (as is -the case today). - -#### Persistent Volume Controller -The Persistent Volume Controller will set the Storage Provisioner annotation (volume.beta.kubernetes.io/storage-provisioner) -on a PVC (referring to a storage class associated with a legacy in-tree plugin) -with the name of a migrated CSI plugin when the following conditions are true: -[1] CSIMigration feature flag is enabled in Kubernetes Controller Manager. -[2] A plugin-specific migration feature flag is enabled in Kubernetes Controller -Manager. -This allows a migrated CSI plugin to process the provisioning as well as deleting -of a volume for a PVC that refers to a migrated in-tree plugin. - -When the above conditions are not met, the PV Controller will use FindProvisionablePluginByName -to determine the in-tree plugin that can be used for provisioning a volume (as is -the case today). - -Update: 1/13/2020 - -In Beta we discovered issue: https://github.com/kubernetes/kubernetes/issues/79043 - -In order to resolve this the design needs to be modified. When migration is "on" -the Persistent Volume Controller will still set the Storage Provisioner -annotation `volume.beta.kubernetes.io/storage-provisioner` with the name of the -migrated CSI driver. However, it will also set an additional annotation -`volume.beta.kubernetes.io/migrated-to` to the CSI Driver name. - -The PV Controller will be modified so that when it does a `syncClaim`, -`syncVolume`, or `provisionClaim` operation it will check -`volume.beta.kubernetes.io/storage-provisioner` and -`pv.kubernetes.io/provisioned-by` annotations respectively to set the correct -`volume.beta.kubernetes.io/migrated-to` annotation. Doing this on each `sync` -operation will incur an additional cost of checking the annotation each time we -process a claim or volume but allows the controller to re-try on error. - -Following is an example of the operation done to a PV object with -`volume.beta.kubernetes.io/storage-provisioner=kubernetes.io/gce-pd`. When the -PV controller has `CSIMigrationGCE=true` the controller will additionally -annotate the PV with -`volume.beta.kubernetes.io/migrated-to=pd.csi.storage.gke.io`. The PV controller -will also remove `migrated-to` annotations on PV/PVCs with migration OFF to -support rollback scenarios. - -On cluster start-up there is a potential for there to be a race between the PV -Controller removing the `migrated-to` annotation and the external provisioner -deleting the volume. This is migitated by relying on idempotency requirements of -both CSI Drivers and in-tree volume plugins. One component attempting to delete -a volume already deleted or being deleted should return as a success. - -This `migrated-to` annotation will be used by `v1.6.0+` of the CSI External -provisioner to determine whether a PV or PVC should be operated on by the -provisioner. The annotation will be set (and removed on rollback) for Kubernetes -`v1.17.2+`, we will carefully document the fact that rollback with migration on -may not be successful to a version before `v1.17.2`. 
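To illustrate the annotation handling described above, the sketch below shows a
hypothetical PVC for the GCE PD example after the PV controller has processed it
with `CSIMigrationGCE=true`. The claim name, storage class, and requested size
are placeholders; the annotation keys and values follow the design described in
this section.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-claim            # placeholder name
  annotations:
    # set at provisioning time; intentionally left as the in-tree plugin name
    # for compatibility with older external provisioners
    volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/gce-pd
    # added on each sync while CSIMigrationGCE=true, and removed again if
    # migration is rolled back
    volume.beta.kubernetes.io/migrated-to: pd.csi.storage.gke.io
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: standard     # placeholder storage class
  resources:
    requests:
      storage: 10Gi
```

An analogous `migrated-to` annotation is applied to PVs, keyed off their
`pv.kubernetes.io/provisioned-by` annotation as described above.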
The benefit being that PV -Controller start-up annealing of this annotation will allow the PV Controller to -stand down and the CSI External Provisioner to pick up a PV that was dynamically -provisioned before migration was enabled. These newer external provisioners will -still be compatible with older versions of Kubernetes with migration on even if -they do not set the `migrated-to` annotation. However, without the `migrated-to` -annotation a newer provisioner with a Kubernetes cluster `<1.17.2` will not be -able to delete volumes provisioned before migration until the Kubenetes cluster -is upgraded to `v1.17.2+`. - -We are intentionally not changing the original implementation of "set\[ting\] -the Storage Provisioner annotation on a PVC with the name of a migrated CSI -plugin" so that the Kubernetes implementation is backward compatible with older -versions of the external provisioner. Because the Storage Provisioner annotation -remains in the CSI Driver, older external provisioners will continue to provision -and delete those volumes. - -## Testing - -### Migration Shim Testing -Run all existing in-tree plugin driver tests -* If migration is on for that plugin, add infrastructure piece that inspects CSI - Drivers logs to make sure that the driver is servicing the operations -* Also observer that none of the in-tree code is being called - -Additionally, we must test that a PV created from migrated dynamic provisioning -is identical to the PV created from the in-tree plugin - -This should cover all use cases of volume operations, including volume -reconstruction. - -### Upgrade/Downgrade/Skew Testing -We need to have test clusters brought up that have different feature flags -enabled on different components (ADC and Kubelet). Once these feature flag skew -configurations are brought up the test itself would have to know what -configuration it’s running in and validate the expected result. - -Configurations to test: - -| ADC | Kubelet | Expected Result | -|-------------------|----------------------------------------------------|--------------------------------------------------------------------------| -| ADC Migration On | Kubelet Migration On | Fully migrated - result should be same as “Migration Shim Testing” above | -| ADC Migration On | Kubelet Migration Off (or Kubelet version too low) | No calls made to driver. All operations serviced by in-tree plugin | -| ADC Migration Off | Kubelet Migration On | Not supported config - Undefined behavior | -| ADC Migration Off | Kubelet Migration Off | No calls made to driver. All operations service by in-tree plugin | - -### CSI Driver Feature Parity Testing - -We will need some way to automatically qualify drivers have feature parity -before promoting their migration features to Beta (on by default). - -This is as simple as on the feature flags and run through our “Migration Shim -Testing” tests. If the driver passes all of them then they have parity. If not, -we need to revisit in-tree plugin tests and make sure they test the entire suite -of possible tests. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/storage/csi-migration_dependencies.png b/contributors/design-proposals/storage/csi-migration_dependencies.png Binary files differdeleted file mode 100644 index 36c1c2f2..00000000 --- a/contributors/design-proposals/storage/csi-migration_dependencies.png +++ /dev/null diff --git a/contributors/design-proposals/storage/csi-snapshot.md b/contributors/design-proposals/storage/csi-snapshot.md index db9abf4f..f0fbec72 100644 --- a/contributors/design-proposals/storage/csi-snapshot.md +++ b/contributors/design-proposals/storage/csi-snapshot.md @@ -1,376 +1,6 @@ -Kubernetes CSI Snapshot Proposal -================================ +Design proposals have been archived. -**Authors:** [Jing Xu](https://github.com/jingxu97), [Xing Yang](https://github.com/xing-yang), [Tomas Smetana](https://github.com/tsmetana), [Huamin Chen ](https://github.com/rootfs) +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Background -Many storage systems (GCE PD, Amazon EBS, etc.) provide the ability to create "snapshots" of persistent volumes to protect against data loss. Snapshots can be used in place of a traditional backup system to back up and restore primary and critical data. Snapshots allow for quick data backup (for example, it takes a fraction of a second to create a GCE PD snapshot) and offer fast recovery time objectives (RTOs) and recovery point objectives (RPOs). Snapshots can also be used for data replication, distribution and migration. - -As the initial effort to support snapshot in Kubernetes, volume snapshotting has been released as a prototype in Kubernetes 1.8. An external controller and provisioner (i.e. two separate binaries) have been added in the [external storage repo](https://github.com/kubernetes-incubator/external-storage/tree/master/snapshot). The prototype currently supports GCE PD, AWS EBS, OpenStack Cinder, GlusterFS, and Kubernetes hostPath volumes. Volume snapshots APIs are using [CRD](https://kubernetes.io/docs/tasks/access-kubernetes-api/extend-api-custom-resource-definitions/). - -To continue that effort, this design is proposed to add the snapshot support for CSI Volume Drivers. Because the overall trend in Kubernetes is to keep the core APIs as small as possible and use CRD for everything else, this proposal adds CRD definitions to represent snapshots, and an external snapshot controller to handle volume snapshotting. Out-of-tree external provisioner can be upgraded to support creating volume from snapshot. In this design, only CSI volume drivers will be supported. The CSI snapshot spec is proposed [here](https://github.com/container-storage-interface/spec/pull/224). - - -## Objectives - -For the first version of snapshotting support in Kubernetes, only on-demand snapshots for CSI Volume Drivers will be supported. - - -### Goals - -* Goal 1: Expose standardized snapshotting operations to create, list, and delete snapshots in Kubernetes REST API. -Currently the APIs will be implemented with CRD (CustomResourceDefinitions). - -* Goal 2: Implement CSI volume snapshot support. -An external snapshot controller will be deployed with other external components (e.g., external-attacher, external-provisioner) for each CSI Volume Driver. - -* Goal 3: Provide a convenient way of creating new and restoring existing volumes from snapshots. 
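As a rough sketch of what Goal 3 could look like from the user's point of view
(using the alpha API group and field names proposed later in this document;
object names, the snapshot class, the storage class, and the requested size are
placeholders), a user would first snapshot an existing claim and then request a
new volume pre-populated from it:

```yaml
# Take a snapshot of an existing PersistentVolumeClaim...
apiVersion: snapshot.storage.k8s.io/v1alpha1
kind: VolumeSnapshot
metadata:
  name: data-snapshot
spec:
  snapshotClassName: csi-snapclass
  source:
    kind: PersistentVolumeClaim
    name: data-claim
---
# ...then restore it by creating a new claim that references the snapshot
# through its dataSource field.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-claim
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: csi-example-sc
  dataSource:
    kind: VolumeSnapshot
    name: data-snapshot
  resources:
    requests:
      storage: 10Gi
```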
- - -### Non-Goals - -The following are non-goals for the current phase, but will be considered at a later phase. - -* Goal 4: Offer application-consistent snapshots by providing pre/post snapshot hooks to freeze/unfreeze applications and/or unmount/mount file system. - -* Goal 5: Provide higher-level management, such as backing up and restoring a pod and statefulSet, and creating a consistent group of snapshots. - - -## Design Details - -In this proposal, volume snapshots are considered as another type of storage resources managed by Kubernetes. Therefore the snapshot API and controller follow the design pattern of existing volume management. There are three APIs, VolumeSnapshot and VolumeSnapshotContent, and VolumeSnapshotClass which are similar to the structure of PersistentVolumeClaim and PersistentVolume, and storageClass. The external snapshot controller functions similar to the in-tree PV controller. With the snapshots APIs, we also propose to add a new data source struct in PersistentVolumeClaim (PVC) API in order to support restore snapshots to volumes. The following section explains in more details about the APIs and the controller design. - - -### Snapshot API Design - -The API design of VolumeSnapshot and VolumeSnapshotContent is modeled after PersistentVolumeClaim and PersistentVolume. In the first version, the VolumeSnapshot lifecycle is completely independent of its volumes source (PVC). When PVC/PV is deleted, the corresponding VolumeSnapshot and VolumeSnapshotContents objects will continue to exist. However, for some volume plugins, snapshots have a dependency on their volumes. In a future version, we plan to have a complete lifecycle management which can better handle the relationship between snapshots and their volumes. (e.g., a finalizer to prevent deleting volumes while there are snapshots depending on them). - -#### The `VolumeSnapshot` Object - -```GO - -// +genclient -// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object - -// VolumeSnapshot is a user's request for taking a snapshot. Upon successful creation of the actual -// snapshot by the volume provider it is bound to the corresponding VolumeSnapshotContent. -// Only the VolumeSnapshot object is accessible to the user in the namespace. -type VolumeSnapshot struct { - metav1.TypeMeta `json:",inline"` - // Standard object's metadata. - // More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata - // +optional - metav1.ObjectMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"` - - // Spec defines the desired characteristics of a snapshot requested by a user. - Spec VolumeSnapshotSpec `json:"spec" protobuf:"bytes,2,opt,name=spec"` - - // Status represents the latest observed state of the snapshot - // +optional - Status VolumeSnapshotStatus `json:"status,omitempty" protobuf:"bytes,3,opt,name=status"` -} - -// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object - -// VolumeSnapshotList is a list of VolumeSnapshot objects -type VolumeSnapshotList struct { - metav1.TypeMeta `json:",inline"` - // +optional - metav1.ListMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"` - - // Items is the list of VolumeSnapshots - Items []VolumeSnapshot `json:"items" protobuf:"bytes,2,rep,name=items"` -} - -// VolumeSnapshotSpec describes the common attributes of a volume snapshot -type VolumeSnapshotSpec struct { - // Source has the information about where the snapshot is created from. 
- // In Alpha version, only PersistentVolumeClaim is supported as the source. - // If not specified, user can create VolumeSnapshotContent and bind it with VolumeSnapshot manually. - // +optional - Source *TypedLocalObjectReference `json:"source" protobuf:"bytes,1,opt,name=source"` - - // SnapshotContentName binds the VolumeSnapshot object with the VolumeSnapshotContent - // +optional - SnapshotContentName string `json:"snapshotContentName" protobuf:"bytes,2,opt,name=snapshotContentName"` - - // Name of the VolumeSnapshotClass used by the VolumeSnapshot. If not specified, a default snapshot class will - // be used if it is available. - // +optional - VolumeSnapshotClassName *string `json:"snapshotClassName" protobuf:"bytes,3,opt,name=snapshotClassName"` -} - -// VolumeSnapshotStatus is the status of the VolumeSnapshot -type VolumeSnapshotStatus struct { - // CreationTime is the time the snapshot was successfully created. If it is set, - // it means the snapshot was created; Otherwise the snapshot was not created. - // +optional - CreationTime *metav1.Time `json:"creationTime" protobuf:"bytes,1,opt,name=creationTime"` - - // When restoring volume from the snapshot, the volume size should be equal or - // larger than the Restoresize if it is specified. If RestoreSize is set to nil, it means - // that the storage plugin does not have this information available. - // +optional - RestoreSize *resource.Quantity `json:"restoreSize" protobuf:"bytes,2,opt,name=restoreSize"` - - // Ready is set to true only if the snapshot is ready to use (e.g., finish uploading if - // there is an uploading phase) and also VolumeSnapshot and its VolumeSnapshotContent - // bind correctly with each other. If any of the above condition is not true, Ready is - // set to false - // +optional - Ready bool `json:"ready" protobuf:"varint,3,opt,name=ready"` - - // The last error encountered during create snapshot operation, if any. - // This field must only be set by the entity completing the create snapshot - // operation, i.e. the external-snapshotter. - // +optional - Error *storage.VolumeError -} - -``` - -Note that if an error occurs before the snapshot is cut, `Error` will be set and none of `CreatedAt`/`AvailableAt` will be set. If an error occurs after the snapshot is cut but before it is available, `Error` will be set and `CreatedAt` should still be set, but `AvailableAt` will not be set. If an error occurs after the snapshot is available, `Error` will be set and `CreatedAt` should still be set, but `AvailableAt` will no longer be set. - -#### The `VolumeSnapshotContent` Object - -```GO - -// +genclient -// +genclient:nonNamespaced -// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object - -// VolumeSnapshotContent represents the actual snapshot object -type VolumeSnapshotContent struct { - metav1.TypeMeta `json:",inline"` - // Standard object's metadata. 
- // More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata - // +optional - metav1.ObjectMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"` - - // Spec defines a specification of a volume snapshot - Spec VolumeSnapshotContentSpec `json:"spec" protobuf:"bytes,2,opt,name=spec"` -} - -// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object - -// VolumeSnapshotContentList is a list of VolumeSnapshotContent objects -type VolumeSnapshotContentList struct { - metav1.TypeMeta `json:",inline"` - // +optional - metav1.ListMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"` - - // Items is the list of VolumeSnapshotContents - Items []VolumeSnapshotContent `json:"items" protobuf:"bytes,2,rep,name=items"` -} - -// VolumeSnapshotContentSpec is the spec of the volume snapshot content -type VolumeSnapshotContentSpec struct { - // Source represents the location and type of the volume snapshot - VolumeSnapshotSource `json:",inline" protobuf:"bytes,1,opt,name=volumeSnapshotSource"` - - // VolumeSnapshotRef is part of bi-directional binding between VolumeSnapshot - // and VolumeSnapshotContent. It becomes non-nil when bound. - // +optional - VolumeSnapshotRef *core_v1.ObjectReference `json:"volumeSnapshotRef" protobuf:"bytes,2,opt,name=volumeSnapshotRef"` - - // PersistentVolumeRef represents the PersistentVolume that the snapshot has been - // taken from. It becomes non-nil when VolumeSnapshot and VolumeSnapshotContent are bound. - // +optional - PersistentVolumeRef *core_v1.ObjectReference `json:"persistentVolumeRef" protobuf:"bytes,3,opt,name=persistentVolumeRef"` - // Name of the VolumeSnapshotClass used by the VolumeSnapshotContent. If not specified, a default snapshot class will - // be used if it is available. - // +optional - VolumeSnapshotClassName *string `json:"snapshotClassName" protobuf:"bytes,4,opt,name=snapshotClassName"` -} - -// VolumeSnapshotSource represents the actual location and type of the snapshot. Only one of its members may be specified. -type VolumeSnapshotSource struct { - // CSI (Container Storage Interface) represents storage that handled by an external CSI Volume Driver (Alpha feature). - // +optional - CSI *CSIVolumeSnapshotSource `json:"csiVolumeSnapshotSource,omitempty"` -} - -// Represents the source from CSI volume snapshot -type CSIVolumeSnapshotSource struct { - // Driver is the name of the driver to use for this snapshot. - // Required. - Driver string `json:"driver"` - - // SnapshotHandle is the unique snapshot id returned by the CSI volume - // plugin’s CreateSnapshot to refer to the snapshot on all subsequent calls. - // Required. - SnapshotHandle string `json:"snapshotHandle"` - - // Timestamp when the point-in-time snapshot is taken on the storage - // system. This timestamp will be generated by the CSI volume driver after - // the snapshot is cut. The format of this field should be a Unix nanoseconds - // time encoded as an int64. On Unix, the command `date +%s%N` returns - // the current time in nanoseconds since 1970-01-01 00:00:00 UTC. - CreationTime *int64 `json:"creationTime,omitempty" protobuf:"varint,3,opt,name=creationTime"` - - // When restoring volume from the snapshot, the volume size should be equal or - // larger than the Restoresize if it is specified. If RestoreSize is set to nil, it means - // that the storage plugin does not have this information available. 
- // +optional - RestoreSize *resource.Quantity `json:"restoreSize" protobuf:"bytes,2,opt,name=restoreSize"` -} - -``` - -#### The `VolumeSnapshotClass` Object - -A new VolumeSnapshotClass API object will be added instead of reusing the existing StorageClass, in order to avoid mixing parameters between snapshots and volumes. Each CSI Volume Driver can have its own default VolumeSnapshotClass. If VolumeSnapshotClass is not provided, a default will be used. It allows to add new parameters for snapshots. - -``` - -// +genclient -// +genclient:nonNamespaced -// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object - -// VolumeSnapshotClass describes the parameters used by storage system when -// provisioning VolumeSnapshots from PVCs. -// The name of a VolumeSnapshotClass object is significant, and is how users can request a particular class. -type VolumeSnapshotClass struct { - metav1.TypeMeta `json:",inline"` - // Standard object's metadata. - // More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata - // +optional - metav1.ObjectMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"` - - // Snapshotter is the driver expected to handle this VolumeSnapshotClass. - Snapshotter string `json:"snapshotter" protobuf:"bytes,2,opt,name=snapshotter"` - - // Parameters holds parameters for the snapshotter. - // These values are opaque to the system and are passed directly - // to the snapshotter. - // +optional - Parameters map[string]string `json:"parameters,omitempty" protobuf:"bytes,3,rep,name=parameters"` -} - - -``` -### Volume API Changes - -With Snapshot API available, users could provision volumes from snapshot and data will be pre-populated to the volumes. Also considering clone and other possible storage operations, there could be many different types of sources used for populating the data to the volumes. In this proposal, we add a general "DataSource" which could be used to represent different types of data sources. - -#### The `DataSource` Object in PVC - -Add a new `DataSource` field into PVC to represent the source of the data which is populated to the provisioned volume. External-provisioner will check `DataSource` field and try to provision volume from the sources. In the first version, only VolumeSnapshot is the supported `Type` for data source object reference. Other types will be added in a future version. If unsupported `Type` is used, the PV Controller SHALL fail the operation. Please see more details in [here](https://github.com/kubernetes/community/pull/2495) - -Possible `DataSource` types may include the following: - - * VolumeSnapshot: restore snapshot to a new volume - * PersistentVolumeClaim: clone volume which is represented by PVC - -``` -type PersistentVolumeClaimSpec struct { - // If specified when creating, volume will be prepopulated with data from the DataSource. - // +optional - DataSource *TypedLocalObjectReference `json:"dataSource" protobuf:"bytes,2,opt,name=dataSource"` -} - -``` - -Add a TypedLocalObjectReference in core API. - -``` - -// TypedLocalObjectReference contains enough information to let you locate the referenced object inside the same namespace. -type TypedLocalObjectReference struct { - // Name of the object reference. - Name string - // Kind indicates the type of the object reference. 
- Kind string -} - -``` - -### Snapshot Controller Design -As the figure below shows, the CSI snapshot controller architecture consists of an external snapshotter which talks to out-of-tree CSI Volume Driver over socket (/run/csi/socket by default, configurable by -csi-address). External snapshotter is part of Kubernetes implementation of [Container Storage Interface (CSI)](https://github.com/container-storage-interface/spec). It is an external controller that monitors `VolumeSnapshot` and `VolumeSnapshotContent` objects and creates/deletes snapshot. - - -* External snapshotter uses ControllerGetCapabilities to find out if CSI driver supports CREATE_DELETE_SNAPSHOT calls. It degrades to trivial mode if not. - -* External snapshotter is responsible for creating/deleting snapshots and binding snapshot and SnapshotContent objects. It follows [controller](/contributors/devel/sig-api-machinery/controllers.md) pattern and uses informers to watch for `VolumeSnapshot` and `VolumeSnapshotContent` create/update/delete events. It filters out `VolumeSnapshot` instances with `Snapshotter==<CSI driver name>` and processes these events in workqueues with exponential backoff. - -* For dynamically created snapshot, it should have a VolumeSnapshotClass associated with it. User can explicitly specify a VolumeSnapshotClass in the VolumeSnapshot API object. If user does not specify a VolumeSnapshotClass, a default VolumeSnapshotClass created by the admin will be used. This is similar to how a default StorageClass created by the admin will be used for the provisioning of a PersistentVolumeClaim. - -* For statically binding snapshot, user/admin must specify bi-pointers correctly for both VolumeSnapshot and VolumeSnapshotContent, so that the controller knows how to bind them. Otherwise, if VolumeSnapshot points to a non-exist VolumeSnapshotContent, or VolumeSnapshotContent does not point back to the VolumeSnapshot, the Error status will be set for VolumeSnapshot - -* External snapshotter is running in the sidecar along with external-attacher and external-provisioner for each CSI Volume Driver. - -* In current design, when the storage system fails to create snapshot, retry will not be performed in the controller. This is because users may not want to retry when taking consistent snapshots or scheduled snapshots when the timing of the snapshot creation is important. In a future version, a maxRetries flag or retry termination timestamp will be added to allow users to control whether retries are needed. - - -#### Changes in CSI External Provisioner - -`DataSource` is available in `PersistentVolumeClaim` to represent the source of the data which is prepopulated to the provisioned volume. The operation of the provisioning of a volume from a snapshot data source will be handled by the out-of-tree CSI External Provisioner. The in-tree PV Controller will handle the binding of the PV and PVC once they are ready. - - -#### CSI Volume Driver Snapshot Support - -The out-of-tree CSI Volume Driver creates a snapshot on the backend storage system or cloud provider, and calls CreateSnapshot through CSI ControllerServer and returns CreateSnapshotResponse. The out-of-tree CSI Volume Driver needs to implement the following functions: - -* CreateSnapshot, DeleteSnapshot, and create volume from snapshot if it supports CREATE_DELETE_SNAPSHOT. -* ListSnapshots if it supports LIST_SNAPSHOTS. - -ListSnapshots can be an expensive operation because it will try to list all snapshots on the storage system. 
For a storage system that takes nightly periodic snapshots, the total number of snapshots on the system can be huge. Kubernetes should try to avoid this call if possible. Instead, calling ListSnapshots with a specific snapshot_id as filtering to query the status of the snapshot will be more desirable and efficient. - -CreateSnapshot is a synchronous function and it must be blocking until the snapshot is cut. For cloud providers that support the uploading of a snapshot as part of creating snapshot operation, CreateSnapshot function must also be blocking until the snapshot is cut and after that it shall return an operation pending gRPC error code until the uploading process is complete. - -Refer to [Container Storage Interface (CSI)](https://github.com/container-storage-interface/spec) for detailed instructions on how CSI Volume Driver shall implement snapshot functions. - - -## Transition to the New Snapshot Support - -### Existing Implementation in External Storage Repo - -For the snapshot implementation in [external storage repo](https://github.com/kubernetes-incubator/external-storage/tree/master/snapshot), an external snapshot controller and an external provisioner need to be deployed. - -* The old implementation does not support CSI volume drivers. -* VolumeSnapshotClass concept does not exist in the old design. -* To restore a volume from the snapshot, however, user needs to create a new StorageClass that is different from the original one for the PVC. - -Here is an example yaml file to create a snapshot in the old design: - -```GO - -apiVersion: volumesnapshot.external-storage.k8s.io/v1 -kind: VolumeSnapshot -metadata: - name: hostpath-test-snapshot -spec: - persistentVolumeClaimName: pvc-test-hostpath - -``` - -### New Snapshot Design for CSI - -For the new snapshot model, a sidecar "Kubernetes to CSI" proxy container called "external-snapshotter" needs to be deployed in addition to the sidecar container for the external provisioner. This deployment model is shown in the CSI Snapshot Diagram in the CSI External Snapshot Controller section. - -* The new design supports CSI volume drivers. -* To create a snapshot for CSI, a VolumeSnapshotClass can be created and specified in the spec of VolumeSnapshot. -* To restore a volume from the snapshot, users could use the same StorageClass that is used for the original PVC. - -Here is an example to create a VolumeSnapshotClass and to create a snapshot in the new design: - -```GO - -apiVersion: snapshot.storage.k8s.io/v1alpha1 -kind: VolumeSnapshotClass -metadata: - name: csi-hostpath-snapclass -snapshotter: csi-hostpath ---- -apiVersion:snapshot.storage.k8s.io/v1alpha1 -kind: VolumeSnapshot -metadata: - name: snapshot-demo -spec: - snapshotClassName: csi-hostpath-snapclass - source: - name: hpvc - kind: PersistentVolumeClaim - -``` +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/storage/csi-snapshot_diagram.png b/contributors/design-proposals/storage/csi-snapshot_diagram.png Binary files differdeleted file mode 100644 index e040126e..00000000 --- a/contributors/design-proposals/storage/csi-snapshot_diagram.png +++ /dev/null diff --git a/contributors/design-proposals/storage/data-source.md b/contributors/design-proposals/storage/data-source.md index 1cd56caa..f0fbec72 100644 --- a/contributors/design-proposals/storage/data-source.md +++ b/contributors/design-proposals/storage/data-source.md @@ -1,121 +1,6 @@ -# Add DataSource for Volume Operations +Design proposals have been archived. -Note: this proposal is part of [Volume Snapshot](https://github.com/kubernetes/community/pull/2335) feature design, and also relevant to recently proposed [Volume Clone](https://github.com/kubernetes/community/pull/2533) feature. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Goal -Currently in Kubernetes, volume plugin only supports to provision an empty volume. With the new storage features (including [Volume Snapshot](https://github.com/kubernetes/community/pull/2335) and [volume clone](https://github.com/kubernetes/community/pull/2533)) being proposed, there is a need to support data population for volume provisioning. For example, volume can be created from a snapshot source, or volume could be cloned from another volume source. Depending on the sources for creating the volume, there are two scenarios -1. Volume provisioner can recognize the source and be able to create the volume from the source directly (e.g., restore snapshot to a volume or clone volume). -2. Volume provisioner does not recognize the volume source, and create an empty volume. Another external component (data populator) could watch the volume creation and implement the logic to populate/import the data to the volume provisioned. Only after data is populated to the volume, the PVC is ready for use. -There could be many different types of sources used for populating the data to the volumes. In this proposal, we propose to add a generic "DataSource" field to PersistentVolumeClaimSpec to represent different types of data sources. - -## Design -### API Change -A new DataSource field is proposed to be added to PVC to represent the source of the data which is pre-populated to the provisioned volume. For DataSource field, we propose to define a new type “TypedLocalObjectReference”. It is similar to “LocalObjectReference” type with additional Kind field in order to support multiple data source types. In the alpha version, this data source is restricted in the same namespace of the PVC. The following are the APIs we propose to add. - -``` - -type PersistentVolumeClaimSpec struct { - // If specified, volume will be pre-populated with data from the specified data source. - // +optional - DataSource *TypedLocalObjectReference `json:"dataSource" protobuf:"bytes,2,opt,name=dataSource"` -} - -// TypedLocalObjectReference contains enough information to let you locate the referenced object inside the same namespace. -type TypedLocalObjectReference struct { - // Name of the object reference. - Name string - // Kind indicates the type of the object reference. - Kind string - // APIGroup is the group for the resource being referenced - APIGroup string -} - -``` -### Design Details -In the first alpha version, we only support data source from Snapshot. 
So the expected Kind in DataSource has to be "VolumeSnapshot". In this case, provisioner should provision volume and populate data in one step. There is no need for external data populator yet. - -For other types of data sources that require external data populator, volume creation and data population are two separate steps. Only when data is ready, PVC/PV can be marked as ready (Bound) so that users can start to use them. We are working on a separate proposal to address this using similar idea from ["Pod Ready++"](https://github.com/kubernetes/community/blob/master/keps/sig-network/0007-pod-ready%2B%2B.md). - -Note: In order to use this data source feature, user/admin needs to update to the new external provisioner which can recognize snapshot data source. Otherwise, data source will be ignored and an empty volume will be created - -## Use cases -* Use snapshot to backup data: Alice wants to take a snapshot of her Mongo database, and accidentally delete her tables, she wants to restore her volumes from the snapshot. -To create a snapshot for a volume (represented by PVC), use the snapshot.yaml - -``` -apiVersion: snapshot.storage.k8s.io/v1alpha1 -kind: VolumeSnapshot -metadata: - name: snapshot-pd-1 - namespace: mynamespace -spec: - source: - kind: PersistentVolumeClaim - name: podpvc - snapshotClassName: snapshot-class - - ``` - After snapshot is ready, create a new volume from the snapshot - -``` -kind: PersistentVolumeClaim -apiVersion: v1 -metadata: - name: snapshot-pvc - Namespace: mynamespace -spec: - accessModes: - - ReadWriteOnce - storageClassName: csi-gce-pd - dataSource: - kind: VolumeSnapshot - name: snapshot-pd-1 - resources: - requests: - storage: 6Gi -``` - -* Clone volume: Bob want to copy the data from one volume to another by cloning the volume. - -``` -kind: PersistentVolumeClaim -apiVersion: v1 -metadata: - name: clone-pvc - Namespace: mynamespace -spec: - accessModes: - - ReadWriteOnce - storageClassName: csi-gce-pd - dataSource: - kind: PersistentVolumeClaim - name: pvc-1 - resources: - requests: - storage: 10Gi -``` - -* Import data from Github repo: Alice want to import data from a github repo to her volume. The github repo is represented by a PVC (gitrepo-1). Compare with the user case 2 is that the data source should be the same kind of volume as the provisioned volume for cloning. - -``` -kind: PersistentVolumeClaim -apiVersion: v1 -metadata: - name: clone-pvc - Namespace: mynamespace -spec: - accessModes: - - ReadWriteOnce - storageClassName: csi-gce-pd - dataSource: - kind: PersistentVolumeClaim - name: gitrepo-1 - resources: - requests: - storage: 100Gi -``` - - - - +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/storage/default-storage-class.md b/contributors/design-proposals/storage/default-storage-class.md index c454ce62..f0fbec72 100644 --- a/contributors/design-proposals/storage/default-storage-class.md +++ b/contributors/design-proposals/storage/default-storage-class.md @@ -1,69 +1,6 @@ -# Deploying a default StorageClass during installation +Design proposals have been archived. -## Goal +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Usual Kubernetes installation tools should deploy a default StorageClass -where it makes sense. -"*Usual installation tools*" are: - -* cluster/kube-up.sh -* kops -* kubeadm - -Other "installation tools" can (and should) deploy default StorageClass -following easy steps described in this document, however we won't touch them -during implementation of this proposal. - -"*Where it makes sense*" are: - -* AWS -* Azure -* GCE -* Photon -* OpenStack -* vSphere - -Explicitly, there is no default storage class on bare metal. - -## Motivation - -In Kubernetes 1.5, we had "alpha" dynamic provisioning on aforementioned cloud -platforms. In 1.6 we want to deprecate this alpha provisioning. In order to keep -the same user experience, we need a default StorageClass instance that would -provision volumes for PVCs that do not request any special class. As -consequence, this default StorageClass would provision volumes for PVCs with -"alpha" provisioning annotation - this annotation would be ignored in 1.6 and -default storage class would be assumed. - -## Design - -1. Kubernetes will ship yaml files for default StorageClasses for each platform - as `cluster/addons/storage-class/<platform>/default.yaml` and all these - default classes will distributed together with all other addons in - `kubernetes.tar.gz`. - -2. An installation tool will discover on which platform it runs and installs - appropriate yaml file into usual directory for addon manager (typically - `/etc/kubernetes/addons/storage-class/default.yaml`). - -3. Addon manager will deploy the storage class into installed cluster in usual - way. We need to update addon manager not to overwrite any existing object - in case cluster admin has manually disabled this default storage class! - -## Implementation - -* AWS, GCE and OpenStack has a default StorageClass in - `cluster/addons/storage-class/<platform>/` - already done in 1.5 - -* We need a default StorageClass for vSphere, Azure and Photon in `cluster/addons/storage-class/<platform>` - -* cluster/kube-up.sh scripts need to be updated to install the storage class on appropriate platforms - * Already done on GCE, AWS and OpenStack. - -* kops needs to be updated to install the storage class on appropriate platforms - * already done for kops on AWS and kops does not support other platforms yet. - -* kubeadm needs to be updated to install the storage class on appropriate platforms (if it is cloud-provider aware) - -* addon manager fix: https://github.com/kubernetes/kubernetes/issues/39561 +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/storage/flexvolume-deployment.md b/contributors/design-proposals/storage/flexvolume-deployment.md index 19b7ea63..f0fbec72 100644 --- a/contributors/design-proposals/storage/flexvolume-deployment.md +++ b/contributors/design-proposals/storage/flexvolume-deployment.md @@ -1,168 +1,6 @@ -# **Dynamic Flexvolume Plugin Discovery** +Design proposals have been archived. -## **Objective** +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Kubelet and controller-manager do not need to be restarted manually in order for new Flexvolume plugins to be recognized. -## **Background** - -Beginning in version 1.8, the Kubernetes Storage SIG is putting a stop to accepting in-tree volume plugins and advises all storage providers to implement out-of-tree plugins. Currently, there are two recommended implementations: Container Storage Interface (CSI) and Flexvolume. - -[CSI](https://github.com/container-storage-interface/spec/blob/master/spec.md) provides a single interface that storage vendors can implement in order for their storage solutions to work across many different container orchestrators, and volume plugins are out-of-tree by design. This is a large effort, the full implementation of CSI is several quarters away, and there is a need for an immediate solution for storage vendors to continue adding volume plugins. - -[Flexvolume] is an in-tree plugin that has the ability to run any storage solution by executing volume commands against a user-provided driver on the Kubernetes host, and this currently exists today. However, the process of setting up Flexvolume is very manual, pushing it out of consideration for many users. Problems include having to copy the driver to a specific location in each node, manually restarting kubelet, and user's limited access to machines. - -An automated deployment technique is discussed in [Recommended Driver Deployment Method](#recommended-driver-deployment-method). The crucial change required to enable this method is allowing kubelet and controller manager to dynamically discover plugin changes. - - -## **Overview** - -When there is a modification of the driver directory, a notification is sent to the filesystem watch from kubelet or controller manager. When kubelet or controller-manager searches for plugins (such as when a volume needs to be mounted), if there is a signal from the watch, it probes the driver directory and loads currently installed drivers as volume plugins. - -The modification can be a driver install (addition), upgrade/downgrade (update), or uninstall (deletion). If a volume depends on an existing driver, it can be *updated* but not *deleted*. - -## **Detailed Design** - -In the volume plugin code, introduce a `PluginStub` interface containing a single method `Init()`, and have `VolumePlugin` extend it. Create a `PluginProber` type which extends `PluginStub` and includes methods `Init()` and `Probe()`. Change the type of plugins inside the volume plugin manager's plugin list to `PluginStub`. - -`Init()` initializes fsnotify, creates a watch on the driver directory as well as its subdirectories (if any), and spawn a goroutine listening to the signal. When the goroutine receives signal that a new directory is created, create a watch for the directory so that driver changes can be seen. - -`Probe()` scans the driver directory only when the goroutine sets a flag. 
If the flag is set, return true (indicating that new plugins are available) and the list of plugins. Otherwise, return false and nil. After the scan, the watch is refreshed to include the new list of subdirectories. The goroutine should only record a signal if there has been a 1-second delay since the last signal (see [Security Considerations](#security-considerations)). Because inotify (used by fsnotify) can only be used to watch an existing directory, the goroutine needs to maintain the invariant that the driver directory always exists. - -Iterating through the list of plugins inside `InitPlugins()` from `volume/plugins.go`, if the plugin is an instance of `PluginProber`, only call its `Init()` and nothing else. Add an additional field, `flexVolumePluginList`, in `VolumePluginMgr` as a cache. For every iteration of the plugin list, call `Probe()` and update `flexVolumePluginList` if true is returned, and iterate through the new plugin list. If the return value is false, iterate through the existing `flexVolumePluginList`. If `Probe()` fails, use the cached plugin instead. However, if the plugin fails to initialize, log the error but do not use the cached version. The user needs to be aware that their driver implementation has a problem initializing, so the system should not silently use an older version. - -Because Flexvolume has two separate plugin instantiations (attachable and non-attachable), it's worth considering the case when a driver that implements attach/detach is replaced with a driver that does not, or vice versa. This does not cause an issue because plugins are recreated every time the driver directory is changed. - -There is a possibility that a Flexvolume command execution occurs at the same time as the driver is updated, which leads to a bad execution. This cannot be solved within the Kubernetes system without an overhaul. Instead, this is discussed in [Atomic Driver Installation](#atomic-driver-installation) as part of the deployment mechanism. As part of the solution, the Prober will **ignore all files that begins with "."** in the driver directory. - -Word of caution about symlinks in the Flexvolume plugin directory: as a result of the recursive filesystem watch implementation, if a symlink links to a directory, unless the directory is visible to the prober (i.e. it's inside the Flexvolume plugin directory and does not start with '.'), the directory's files and subdirectories are not added to filesystem watch, thus their change will not trigger a probe. - - -## **Alternative Designs** - -1) Make `PluginProber` a separate component, and pass it around as a dependency. - -Pros: Avoids the common `PluginStub` interface. There isn't much shared functionality between `VolumePlugin` and `PluginProber`. The only purpose this shared abstraction serves is for `PluginProber` to reuse the existing machinery of plugins list. - -Cons: Would have to increase dependency surface area, notably `KubeletDeps`. - -I'm currently undecided whether to use this design or the `PluginStub` design. - -2) Use a polling model instead of a watch for probing for driver changes. - -Pros: Simpler to implement. - -Cons: Kubelet or controller manager iterates through the plugin list many times, so Probe() is called very frequently. Using this model would increase unnecessary disk usage. This issue is mitigated if we guarantee that `PluginProber` is the last `PluginStub` in the iteration, and only `Probe()` if no other plugin is matched, but this logic adds additional complexity. 
- -3) Use a polling model + cache. Poll every x seconds/minutes. - -Pros: Mostly mitigates issues with the previous approach. - -Cons: Depending on the polling period, either it's needlessly frequent, or it's too infrequent to pick up driver updates quickly. - -4) Have the `flexVolumePluginList` cache live in `PluginProber` instead of `VolumePluginMgr`. - -Pros: `VolumePluginMgr` doesn't need to treat Flexvolume plugins any differently from other plugins. - -Cons: `PluginProber` doesn't have the function to validate a plugin. This function lives in `VolumePluginMgr`. Alternatively, the function can be passed into `PluginProber`. - - -## **Security Considerations** - -The Flexvolume driver directory can be continuously modified (accidentally or maliciously), making every `Probe()` call trigger a disk read, and `Probe()` calls could happen every couple of milliseconds and in bursts (i.e. lots of calls at first and then silence for some time). This may increase kubelet's or controller manager's disk IO usage, impacting the performance of other system operations. - -As a safety measure, add a 1-second minimum delay between the processing of filesystem watch signals. - - -## **Testing Plan** - -Add new unit tests in `plugin_tests.go` to cover new probing functionality and the heterogeneous plugin types in the plugins list. - -Add e2e tests that follow the user story. Write one for initial driver installation, one for an update for the same driver, one for adding another driver, and one for removing a driver. - -## **Recommended Driver Deployment Method** - -This section describes one possible method to automatically deploy Flexvolume drivers. The goal is that drivers must be deployed on nodes (and master when attach is required) without having to manually access any machine instance. - -Driver Installation: - -* Alice is a storage plugin author and would like to deploy a Flexvolume driver on all node instances. She creates an image by copying her driver and the [deployment script](#driver-deployment-script) to a busybox base image, and makes her image available to Bob, a cluster admin. -* Bob modifies the existing deployment DaemonSet spec with the name of the given image, and creates the DaemonSet. -* Charlie, an end user, creates volumes using the installed plugin. - -The user story for driver update is similar: Alice creates a new image with her new drivers, and Bob deploys it using the DaemonSet spec. - -### Driver Deployment Script - -This script assumes that only a *single driver file* is necessary, and is located at `/$DRIVER` on the deployment image. - -``` bash -#!/bin/sh - -set -o errexit -set -o pipefail - -VENDOR=k8s.io -DRIVER=nfs - -# Assuming the single driver file is located at /$DRIVER inside the DaemonSet image. - -driver_dir=$VENDOR${VENDOR:+"~"}${DRIVER} -if [ !
-d "/flexmnt/$driver_dir" ]; then - mkdir "/flexmnt/$driver_dir" -fi - -cp "/$DRIVER" "/flexmnt/$driver_dir/.$DRIVER" -mv -f "/flexmnt/$driver_dir/.$DRIVER" "/flexmnt/$driver_dir/$DRIVER" - -while : ; do - sleep 3600 -done -``` - -### Deployment DaemonSet -``` yaml -apiVersion: extensions/v1beta1 -kind: DaemonSet -metadata: - name: flex-set -spec: - template: - metadata: - name: flex-deploy - labels: - app: flex-deploy - spec: - containers: - - image: <deployment_image> - name: flex-deploy - securityContext: - privileged: true - volumeMounts: - - mountPath: /flexmnt - name: flexvolume-mount - volumes: - - name: flexvolume-mount - hostPath: - path: <host_driver_directory> -``` - -### Atomic Driver Installation -Regular file copy is not an atomic file operation, so if it were used to install the driver, it's possible that kubelet or controller manager executes the driver when it's partially installed, or the driver gets modified while it's being executed. Care must be taken to ensure the installation operation is atomic. - -The deployment script provided above uses renaming, which is atomic, to ensure that from the perspective of kubelet or controller manager, the driver file is completely written to disk in a single operation. The file is first installed with a name prefixed with '.', which the prober ignores. - -### Alternatives - -* Using Jobs instead of DaemonSets to deploy. - -Pros: Designed for containers that eventually terminate. No need to have the container go into an infinite loop. - -Cons: Does not guarantee every node has a pod running. Pod anti-affinity can be used to ensure no more than one pod runs on the same node, but since the Job spec requests a constant number of pods to run to completion, Jobs cannot ensure that pods are scheduled on new nodes. - -## **Open Questions** - -* How does this system work with containerized kubelet? -* Are there any SELinux implications? - -[Flexvolume]: /contributors/devel/sig-storage/flexvolume.md
\ No newline at end of file +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
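The archived Flexvolume discovery design above centers on a `Probe()` that rescans the driver directory only after a debounced filesystem-watch signal. As a reading aid, here is a minimal Go sketch of that probe-on-change pattern; it assumes the `github.com/fsnotify/fsnotify` library, and names such as `flexProber` are illustrative rather than the actual kubelet code.

```go
package flexprobe

import (
	"os"
	"path/filepath"
	"strings"
	"sync/atomic"
	"time"

	"github.com/fsnotify/fsnotify"
)

// flexProber rescans the Flexvolume driver directory only when the watcher
// goroutine has flagged a change, as the archived proposal describes.
type flexProber struct {
	pluginDir string
	dirty     int32 // set to 1 by the watcher goroutine when a rescan is needed
}

// Init starts the filesystem watch and the goroutine that debounces events.
func (p *flexProber) Init() error {
	w, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	if err := w.Add(p.pluginDir); err != nil {
		return err
	}
	go func() {
		var last time.Time
		for ev := range w.Events {
			// Ignore dot-prefixed files so partially installed drivers are skipped.
			if strings.HasPrefix(filepath.Base(ev.Name), ".") {
				continue
			}
			// 1-second debounce, matching the proposal's safety measure.
			if time.Since(last) < time.Second {
				continue
			}
			last = time.Now()
			atomic.StoreInt32(&p.dirty, 1)
		}
	}()
	return nil
}

// Probe returns (true, drivers) only if the directory changed since the last call.
func (p *flexProber) Probe() (bool, []string, error) {
	if !atomic.CompareAndSwapInt32(&p.dirty, 1, 0) {
		return false, nil, nil
	}
	entries, err := os.ReadDir(p.pluginDir)
	if err != nil {
		return false, nil, err
	}
	var drivers []string
	for _, e := range entries {
		if e.IsDir() && !strings.HasPrefix(e.Name(), ".") {
			drivers = append(drivers, e.Name())
		}
	}
	return true, drivers, nil
}
```

The debounce in the watcher goroutine mirrors the 1-second minimum delay the proposal recommends against a continuously modified driver directory, and the dot-prefix filter matches its rule that partially installed drivers are ignored.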
\ No newline at end of file diff --git a/contributors/design-proposals/storage/grow-flexvolume-size.md b/contributors/design-proposals/storage/grow-flexvolume-size.md index 01cffba8..f0fbec72 100644 --- a/contributors/design-proposals/storage/grow-flexvolume-size.md +++ b/contributors/design-proposals/storage/grow-flexvolume-size.md @@ -1,170 +1,6 @@ -# Proposal for Growing FlexVolume Size +Design proposals have been archived. -**Authors:** [xingzhou](https://github.com/xingzhou) +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Goals -Since PVC resizing is introduced in Kubernetes v1.8, several volume plugins have already supported this feature, e.g. GlusterFS, AWS EBS. In this proposal, we are proposing to support FlexVolume expansion. So when user uses FlexVolume and corresponding volume driver to connect to his/her backend storage system, he/she can expand the PV size by updating PVC in Kubernetes. - -## Non Goals - -* We only consider expanding FlexVolume size in this proposal. Decreasing size of FlexVolume will be designed in the future. -* In this proposal, user can only expand the FlexVolume size manually by updating PVC. Auto-expansion of FlexVolume based on specific meterings is not considered. -* The proposal only contains the changes made in FlexVolume, volume driver changes which should be made by user are not included. - -## Implementation Designs - -### Prerequisites - -* Kubernetes should be at least v1.8. -* Enable resizing by setting feature gate `ExpandPersistentVolumeGate` to `true`. -* Enable `PersistentVolumeClaimResize` admission plugin(optional). -* Follow the UI of PV resizing, including: - * Only dynamic provisioning supports volume resizing - * Set StorageClass attribute `allowVolumeExpansion` to `true` - -### Admission Control Changes - -Whether or not a specific volume plugin supports volume expansion is validated and checked in PV resize admission plugin. In general, we can list FlexVolume as the ones that support volume expansion and leave the actual expansion capability check to the underneath volume driver when PV resize controller calls the `ExpandVolumeDevice` method of FlexVolume. - -In PV resize admission plugin, add the following check to `checkVolumePlugin` method: -``` -// checkVolumePlugin checks whether the volume plugin supports resize -func (pvcr *persistentVolumeClaimResize) checkVolumePlugin(pv *api.PersistentVolume) bool { - ... - if pv.Spec.FlexVolume != nil { - return true - } - ... -} -``` - -### FlexVolume Plugin Changes - -FlexVolume relies on underneath volume driver to implement various volume functions, e.g. attach/detach. As a result, volume driver will decide whether volume can be expanded or not. - -By default, we assume all kinds of flex volume drivers support resizing. If they do not, flex volume plugin can detect this during resizing call to flex volume driver and always throw out error to stop the resizing process. So as a result, to implement resizing feature in flex volume plugin, the plugin itself must implement the following `ExpandableVolumePlugin` interfaces: - -#### ExpandVolumeDevice - -Volume resizing controller invokes this method while receiving a valid PVC resizing request. FlexVolume plugin calls the underneath volume driver’s corresponding `expandvolume` method with three parameters, including new size of volume(number in bytes), old size of volume(number in bytes) and volume spec, to expand PV. 
Once the expansion is done, volume driver should return the new size (number in bytes) of the volume to FlexVolume. - -A sample implementation of the `ExpandVolumeDevice` method looks like: -``` -func (plugin *flexVolumePlugin) ExpandVolumeDevice(spec *volume.Spec, newSize resource.Quantity, oldSize resource.Quantity) (resource.Quantity, error) { - const timeout = 10*time.Minute - - call := plugin.NewDriverCallWithTimeout(expandVolumeCmd, timeout) - call.Append(newSize.Value()) - call.Append(oldSize.Value()) - call.AppendSpec(spec, plugin.host, nil) - - // If the volume driver does not support resizing, Flex Volume Plugin can throw out error here - // to stop expand controller's resizing process. - ds, err := call.Run() - if err != nil { - return resource.NewQuantity(0, resource.BinarySI), err - } - - return resource.NewQuantity(ds.ActualVolumeSize, resource.BinarySI), nil -} -``` - -Add a new field in type `DriverStatus` named `ActualVolumeSize` to identify the new expanded size of the volume returned by underneath volume driver: -``` -// DriverStatus represents the return value of the driver callout. -type DriverStatus struct { - ... - ActualVolumeSize int64 `json:"volumeNewSize,omitempty"` -} -``` - -#### RequiresFSResize - -`RequiresFSResize` is a method to implement `ExpandableVolumePlugin` interface. The return value of this method identifies whether or not a file system resize is required once the physical volume gets expanded. If the return value is `false`, PV resize controller will consider the volume resize operation done and then update the PV object's capacity in K8s directly; if the return value is `true`, PV resize controller will leave kubelet to do the file system resize, and kubelet on worker node will call `ExpandFS` method of FlexVolume to finish the file system resize step (at present, only offline FS resize is supported; online resize support is under community discussion [here](https://github.com/kubernetes/community/pull/1535)). - -The return value of `RequiresFSResize` is collected from underneath volume driver when FlexVolume invokes `init` method of volume driver. The sample code of `RequiresFSResize` in FlexVolume looks like: -``` -func (plugin *flexVolumePlugin) RequiresFSResize() bool { - return plugin.capabilities.RequiresFSResize -} -``` - -And as a result, the FlexVolume type `DriverCapability` can be redefined as: -``` -type DriverCapabilities struct { - Attach bool `json:"attach"` - RequiresFSResize bool `json:"requiresFSResize"` - SELinuxRelabel bool `json:"selinuxRelabel"` -} - -func defaultCapabilities() *DriverCapabilities { - return &DriverCapabilities{ - Attach: true, - RequiresFSResize: true, //By default, we require file system resize which will be done by kubelet - SELinuxRelabel: true, - } -} -``` - -#### ExpandFS - -`ExpandFS` is another method to implement `ExpandableVolumePlugin` interface. This method allows the volume plugin itself, instead of kubelet, to resize the file system. If volume plugin returns `true` for `RequiresFSResize`, PV resize controller will leave FS resize to kubelet on worker node. Kubelet then will call FlexVolume `ExpandFS` to resize file system once physical volume expansion is done. - -As `ExpandFS` is called on worker node, volume driver can also take this chance to do physical volume resize together with file system resize as well. Also, current code only supports offline FS resize, online resize support is under discussion [here](https://github.com/kubernetes/community/pull/1535).
Once online resize is implemented, we can also leverage online resize for FlexVolume by `ExpandFS` method. - -Note that `ExpandFS` is a new API for `ExpandableVolumeDriver`, the community ticket can be found [here](https://github.com/kubernetes/kubernetes/issues/58786). - -`ExpandFS` will call underneath volume driver `expandfs` method to finish FS resize. The sample code looks like: -``` -func (plugin *flexVolumePlugin) ExpandFS(spec *volume.Spec, newSize resource.Quantity, oldSize resource.Quantity) error { - const timeout = 10*time.Minute - - call := plugin.NewDriverCallWithTimeout(expandFSCmd, timeout) - call.Append(newSize.Value()) - call.Append(oldSize.Value()) - call.AppendSpec(spec, plugin.host, nil) - - _, err := call.Run() - - return err -} -``` - -For more design and details on how kubelet resizes volume file system, please refer to volume resizing proposal at: -https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/grow-volume-size.md - - -Based on the above design, the resizing process for flex volume can be summarized as: -* If flex volume driver does not support resizing, driver shall not implement `expandvolume` method and flex volume plugin will throw out error to stop the expand volume controller's resizing process. -* If flex volume driver supports resizing, it shall implement `expandvolume` method and at least, the volume driver shall be installed on master node. -* If flex volume driver supports resizing and does not need file system resizing, it shall set "requiresFSResize" capability to `false`. Otherwise kubelet on worker node will call `ExpandFS` to resize the file system. -* If flex volume driver supports resizing and requires file system resizing(`RequiresFSResize` returns `true`), after the physical volume resizing is done, `ExpandFS` will be called from kubelet on worker node. -* If flex volume driver supports resizing and requires to resize the physical volume from worker node, the driver shall be installed on both master node and worker node. The driver on master node can do a non-op process for `ExpandVolumeDevice` and returns success message. For `RequiresFSResize`, driver on master node must return `true`. This process gives drivers on worker nodes a chance to make `physical volume resize` and `file system resize` together through `ExpandFS` call from kubelet. This scenario is useful for some local storage resizing cases. - -### Volume Driver Changes - -Volume driver needs to implement two new interfaces: `expandvolume` and `expandfs` to support volume resizing. - -For `expandvolume`, it takes three parameters: new size of volume(number in bytes), old size of volume(number in bytes) and volume spec json string. `expandvolume` expands the physical backend volume size and return the new size(number in bytes) of volume. - -For those volume plugins who need file system resize after physical volume is expanded, the `expandfs` method can take the FS resize work. If volume driver set the `requiresFSResize` capability to true, this method will be called from kubelet on worker node. 
Volume driver can do the file system resize (or physical volume resize together with file system resize) inside this method. - -In addition, those volume drivers who support resizing but do not require file system resizing shall set `requiresFSResize` capability to `false`: -``` -if [ "$op" = "init" ]; then - log '{"status": "Success", "capabilities": {"requiresFSResize": false}}' - exit 0 -fi -``` - -### UI - -Expanding FlexVolume size follows the same process as expanding other volume plugins, like GlusterFS. The user creates and binds a PVC and PV first. Then, using the `kubectl edit pvc xxx` command, the user can update the PVC with the new size. - -## References - -* [Proposal for Growing Persistent Volume Size](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/grow-volume-size.md) -* [PR for Volume Resizing Controller](https://github.com/kubernetes/kubernetes/commit/cd2a68473a5a5966fa79f455415cb3269a3f7462) -* [Online FS resize support](https://github.com/kubernetes/community/pull/1535) -* [Add “ExpandFS” method to “ExpandableVolumePlugin” interface](https://github.com/kubernetes/kubernetes/issues/58786) +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
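To make the driver-side contract of the archived FlexVolume resize proposal concrete, the following is a hedged Go sketch of an `expandvolume` entry point (a real driver would more likely be a shell script, as in the snippet above). The argument order of new size, old size, and volume spec JSON, and the `volumeNewSize` status field, come from the proposal text; everything else is illustrative.

```go
// Hypothetical FlexVolume driver implementing only the "expandvolume" call.
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strconv"
)

type driverStatus struct {
	Status           string `json:"status"`
	Message          string `json:"message,omitempty"`
	ActualVolumeSize int64  `json:"volumeNewSize,omitempty"`
}

// reply prints a DriverStatus-style JSON object and exits.
func reply(s driverStatus, code int) {
	out, _ := json.Marshal(s)
	fmt.Println(string(out))
	os.Exit(code)
}

func main() {
	// Expected invocation per the proposal: <driver> expandvolume <newSizeBytes> <oldSizeBytes> <spec-json>
	if len(os.Args) < 5 || os.Args[1] != "expandvolume" {
		reply(driverStatus{Status: "Not supported"}, 1)
	}
	newSize, err := strconv.ParseInt(os.Args[2], 10, 64)
	if err != nil {
		reply(driverStatus{Status: "Failure", Message: "cannot parse new size: " + err.Error()}, 1)
	}
	// ... grow the backing volume to newSize bytes here ...
	reply(driverStatus{Status: "Success", ActualVolumeSize: newSize}, 0)
}
```

A driver built this way would be invoked by the FlexVolume plugin roughly as `./nfs expandvolume 10737418240 1073741824 '{...spec...}'` and would print a single JSON status object on stdout.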
\ No newline at end of file diff --git a/contributors/design-proposals/storage/grow-volume-size.md b/contributors/design-proposals/storage/grow-volume-size.md index a968d91c..f0fbec72 100644 --- a/contributors/design-proposals/storage/grow-volume-size.md +++ b/contributors/design-proposals/storage/grow-volume-size.md @@ -1,310 +1,6 @@ -# Growing Persistent Volume size +Design proposals have been archived. -## Goals +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Enable users to increase size of PVs that their pods are using. The user will update PVC for requesting a new size. Underneath we expect that - a controller will apply the change to PV which is bound to the PVC. -## Non Goals - -* Reducing size of Persistent Volumes: We realize that, reducing size of PV is way riskier than increasing it. Reducing size of a PV could be a destructive operation and it requires support from underlying file system and volume type. In most cases it also requires that file system being resized is unmounted. - -* Rebinding PV and PVC: Kubernetes will only attempt to resize the currently bound PV and PVC and will not attempt to relocate data from a PV to a new PV and rebind the PVC to newly created PV. - -## Use Cases - -* As a user I am running Mysql on a 100GB volume - but I am running out of space, I should be able to increase size of volume mysql is using without losing all my data. (*online and with data*) -* As a user I created a PVC requesting 2GB space. I am yet to start a pod with this PVC but I realize that I probably need more space. Without having to create a new PVC, I should be able to request more size with same PVC. (*offline and no data on disk*) -* As a user I was running a rails application with 5GB of assets PVC. I have taken my application offline for maintenance but I would like to grow asset PVC to 10GB in size. (*offline but with data*) -* As a user I am running an application on glusterfs. I should be able to resize the gluster volume without losing data or mount point. (*online and with data and without taking pod offline*) -* In the logging project we run on dedicated clusters, we start out with 187Gi PVs for each of the elastic search pods. However, the amount of logs being produced can vary greatly from one cluster to another and its not uncommon that these volumes fill and we need to grow them. - -## Volume Plugin Matrix - - -| Volume Plugin | Supports Resize | Requires File system Resize | Supported in 1.8 Release | -| ----------------| :---------------: | :--------------------------:| :----------------------: | -| EBS | Yes | Yes | Yes | -| GCE PD | Yes | Yes | Yes | -| GlusterFS | Yes | No | Yes | -| Cinder | Yes | Yes | Yes | -| Vsphere | Yes | Yes | No | -| Ceph RBD | Yes | Yes | No | -| Host Path | No | No | No | -| Azure Disk | Yes | Yes | No | -| Azure File | No | No | No | -| Cephfs | No | No | No | -| NFS | No | No | No | -| Flex | Yes | Maybe | No | -| LocalStorage | Yes | Yes | No | - - -## Implementation Design - -For volume type that requires both file system expansion and a volume plugin based modification, growing persistent volumes will be two -step process. - - -For volume types that only require volume plugin based api call, this will be one step process. - -### Prerequisite - -* `pvc.spec.resources.requests.storage` field of pvc object will become mutable after this change. 
-* #sig-api-machinery has agreed to allow pvc's status update from kubelet as long as pvc and node relationship - can be validated by node authorizer. -* This feature will be protected by an alpha feature gate, so as API changes needed for it. - - -### Admission Control and Validations - -* Resource quota code has to be updated to take into account PVC expand feature. -* In case volume plugin doesn’t support resize feature. The resize API request will be rejected and PVC object will not be saved. This check will be performed via an admission controller plugin. -* In case requested size is smaller than current size of PVC. A validation will be used to reject the API request. (This could be moved to admission controller plugin too.) -* Not all PVCs will be resizable even if underlying volume plugin allows that. Only dynamically provisioned volumes -which are explicitly enabled by an admin will be allowed to be resized. A plugin in admission controller will forbid -size update for PVCs for which resizing is not enabled by the admin. -* The design proposal for raw block devices should make sure that, users aren't able to resize raw block devices. - - -### Controller Manager resize - -A new controller called `volume_expand_controller` will listen for pvc size expansion requests and take action as needed. The steps performed in this -new controller will be: - -* Watch for pvc update requests and add pvc to controller's work queue if a increase in volume size was requested. Once PVC is added to - controller's work queue - `pvc.Status.Conditions` will be updated with `ResizeStarted: True`. -* For unbound or pending PVCs - resize will trigger no action in `volume_expand_controller`. -* If `pv.Spec.Capacity` already is of size greater or equal than requested size, similarly no action will be performed by the controller. -* A separate goroutine will read work queue and perform corresponding volume resize operation. If there is a resize operation in progress - for same volume then resize request will be pending and retried once previous resize request has completed. -* Controller resize in effect will be level based rather than edge based. If there are more than one pending resize request for same PVC then - new resize requests for same PVC will replace older pending request. -* Resize will be performed via volume plugin interface, executed inside a goroutine spawned by `operation_executor`. -* A new plugin interface called `volume.Expander` will be added to volume plugin interface. 
The `Expander` interface - will also define if volume requires a file system resize: - - ```go - type Expander interface { - // ExpandVolume expands the volume - ExpandVolumeDevice(spec *Spec, newSize resource.Quantity, oldSize resource.Quantity) error - RequiresFSResize() bool - } - ``` - -* The controller call to expand the PVC will look like: - -```go -func (og *operationGenerator) GenerateExpandVolumeFunc( - pvcWithResizeRequest *expandcache.PvcWithResizeRequest, - resizeMap expandcache.VolumeResizeMap) (func() error, error) { - - volumePlugin, err := og.volumePluginMgr.FindExpandablePluginBySpec(pvcWithResizeRequest.VolumeSpec) - expanderPlugin, err := volumePlugin.NewExpander(pvcWithResizeRequest.VolumeSpec) - - - expandFunc := func() error { - expandErr := expanderPlugin.ExpandVolumeDevice(pvcWithResizeRequest.ExpectedSize, pvcWithResizeRequest.CurrentSize) - - if expandErr != nil { - og.recorder.Eventf(pvcWithResizeRequest.PVC, v1.EventTypeWarning, kevents.VolumeResizeFailed, expandErr.Error()) - resizeMap.MarkResizeFailed(pvcWithResizeRequest, expandErr.Error()) - return expandErr - } - - // CloudProvider resize succeeded - lets mark api objects as resized - if expanderPlugin.RequiresFSResize() { - err := resizeMap.MarkForFileSystemResize(pvcWithResizeRequest) - if err != nil { - og.recorder.Eventf(pvcWithResizeRequest.PVC, v1.EventTypeWarning, kevents.VolumeResizeFailed, err.Error()) - return err - } - } else { - err := resizeMap.MarkAsResized(pvcWithResizeRequest) - - if err != nil { - og.recorder.Eventf(pvcWithResizeRequest.PVC, v1.EventTypeWarning, kevents.VolumeResizeFailed, err.Error()) - return err - } - } - return nil - - } - return expandFunc, nil -} -``` - -* Once volume expand is successful, the volume will be marked as expanded and new size will be updated in `pv.spec.capacity`. Any errors will be reported as *events* on PVC object. -* If resize failed in above step, in addition to events - `pvc.Status.Conditions` will be updated with `ResizeFailed: True`. Corresponding error will be added to condition field as well. -* Depending on volume type next steps would be: - - * If volume is of type that does not require file system resize, then `pvc.status.capacity` will be immediately updated to reflect new size. This would conclude the volume expand operation. Also `pvc.Status.Conditions` will be updated with `Ready: True`. - * If volume is of type that requires file system resize then a file system resize will be performed on kubelet. Read below for steps that will be performed for file system resize. - -* If volume plugin is of type that can not do resizing of attached volumes (such as `Cinder`) then `ExpandVolumeDevice` can return error by checking for - volume status with its own API (such as by making Openstack Cinder API call in this case). Controller will keep trying to resize the volume until it is - successful. - -* To consider cases of missed PVC update events, an additional loop will reconcile bound PVCs with PVs. This additional loop will loop through all PVCs - and match `pvc.spec.resources.requests` with `pv.spec.capacity` and add PVC in `volume_expand_controller`'s work queue if `pv.spec.capacity` is less - than `pvc.spec.resources.requests`. - -* There will be additional checks in controller that grows PV size - to ensure that we do not make volume plugin API calls that can reduce size of PV. - -### File system resize on kubelet - -A File system resize will be pending on PVC until a new pod that uses this volume is scheduled somewhere. 
While theoretically we *can* perform -online file system resize if volume type and file system supports it - we are leaving it for next iteration of this feature. - -#### Prerequisite of File system resize - -* `pv.spec.capacity` must be greater than `pvc.status.spec.capacity`. -* A fix in pv_controller has to made to fix `claim.Status.Capacity` only during binding. See comment by jan here - https://github.com/kubernetes/community/pull/657#discussion_r128008128 -* A fix in attach_detach controller has to be made to prevent fore detaching of volumes that are undergoing resize. -This can be done by checking `pvc.Status.Conditions` during force detach. `AttachedVolume` struct doesn't hold a reference to PVC - so PVC info can either be directly cached in `AttachedVolume` along with PV spec or it can be fetched from PersistentVolume's ClaimRef binding info. - -#### Steps for resizing file system available on Volume - -* When calling `MountDevice` or `Setup` call of volume plugin, volume manager will in addition compare `pv.spec.capacity` and `pvc.status.capacity` and if `pv.spec.capacity` is greater - than `pvc.status.spec.capacity` then volume manager will additionally resize the file system of volume. -* The call to resize file system will be performed inside `operation_generator.GenerateMountVolumeFunc`. `VolumeToMount` struct will be enhanced to store PVC as well. -* The flow of file system resize will be as follow: - * Perform a resize based on file system used inside block device. - * If resize succeeds, proceed with mounting the device as usual. - * If resize failed with an error that shows no file system exists on the device, then log a warning and proceed with format and mount. - * If resize failed with any other error then fail the mount operation. -* Any errors during file system resize will be added as *events* to Pod object and mount operation will be failed. -* If there are any errors during file system resize `pvc.Status.Conditions` will be updated with `ResizeFailed: True`. Any errors will be added to - `Conditions` field. -* File System resize will not be performed on kubelet where volume being attached is ReadOnly. This is similar to pattern being used for performing formatting. -* After file system resize is successful, `pvc.status.capacity` will be updated to match `pv.spec.capacity` and volume expand operation will be considered complete. Also `pvc.Status.Conditions` will be updated with `Ready: True`. - -#### Reduce coupling between resize operation and file system type - -A file system resize in general requires presence of tools such as `resize2fs` or `xfs_growfs` on the host where kubelet is running. There is a concern -that open coding call to different resize tools directly in Kubernetes will result in coupling between file system and resize operation. To solve this problem -we have considered following options: - -1. Write a library that abstracts away various file system operations, such as - resizing, formatting etc. - - Pros: - * Relatively well known pattern - - Cons: - * Depending on version with which Kubernetes is compiled with, we are still tied to which file systems are supported in which version - of kubernetes. -2. Ship a wrapper shell script that encapsulates various file system operations and as long as the shell script supports particular file system - the resize operation is supported. - Pros: - * Kubernetes Admin can easily replace default shell script with her own version and thereby adding support for more file system types. 
- - Cons: - * I don't know if there is a pattern that exists in kube today for shipping shell scripts that are called out from code in Kubernetes. Flex is - different because, none of the flex scripts are shipped with Kubernetes. -3. Ship resizing tools in a container. - - -Of all options - #3 is our best bet but we are not quite there yet. Hence, I would like to propose that we ship with support for -most common file systems in current release and we revisit this coupling and solve it in next release. - -## API and UI Design - -Given a PVC definition: - -```yaml -kind: PersistentVolumeClaim -apiVersion: v1 -metadata: - name: volume-claim - annotations: - volume.beta.kubernetes.io/storage-class: "generalssd" -spec: - accessModes: - - ReadWriteOnce - resources: - requests: - storage: 1Gi -``` - -Users can request new size of underlying PV by simply editing the PVC and requesting new size: - -``` -~> kubectl edit pvc volume-claim -kind: PersistentVolumeClaim -apiVersion: v1 -metadata: - name: volume-claim - annotations: - volume.beta.kubernetes.io/storage-class: "generalssd" -spec: - accessModes: - - ReadWriteOnce - resources: - requests: - storage: 10Gi -``` - -## API Changes - -### PVC API Change - -`pvc.spec.resources.requests.storage` field of pvc object will become mutable after this change. - -In addition to that PVC's status will have a `Conditions []PvcCondition` - which will be used -to communicate the status of PVC to the user. - -The API change will be protected by Alpha feature gate and api-server will not allow PVCs with -`Status.Conditions` field if feature is not enabled. `omitempty` in serialization format will -prevent presence of field if not set. - -So the `PersistentVolumeClaimStatus` will become: - -```go -type PersistentVolumeClaimStatus struct { - Phase PersistentVolumeClaimPhase - AccessModes []PersistentVolumeAccessMode - Capacity ResourceList - // New Field added as part of this Change - Conditions []PVCCondition -} - -// new API type added -type PVCCondition struct { - Type PVCConditionType - Status ConditionStatus - LastProbeTime metav1.Time - LastTransitionTime metav1.Time - Reason string - Message string -} - -// new API type -type PVCConditionType string - -// new Constants -const ( - PVCReady PVCConditionType = "Ready" - PVCResizeStarted PVCConditionType = "ResizeStarted" - PVCResizeFailed PVCResizeFailed = "ResizeFailed" -) -``` - -### StorageClass API change - -A new field called `AllowVolumeExpand` will be added to StorageClass. The default of this value -will be `false` and only if it is true - PVC expansion will be allowed. - -```go -type StorageClass struct { - metav1.TypeMeta - metav1.ObjectMeta - Provisioner string - Parameters map[string]string - // New Field added - // +optional - AllowVolumeExpand bool -} -``` - -### Other API changes - -This proposal relies on ability to update PVC status from kubelet. While updating PVC's status -a PATCH request must be made from kubelet to update the status. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
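One small piece of the archived grow-volume-size design that may help orientation: kubelet decides whether a file system resize is still pending by comparing the capacity recorded on the PV with the capacity last reported in the PVC status. A sketch of that check, assuming the `k8s.io/apimachinery/pkg/api/resource` package (the helper name `fsResizePending` is made up), could look like:

```go
package resizecheck

import "k8s.io/apimachinery/pkg/api/resource"

// fsResizePending reports whether the file system on the node still has to be
// grown: the controller has already expanded the volume (pv.spec.capacity),
// but the PVC status has not yet caught up (pvc.status.capacity).
func fsResizePending(pvCapacity, pvcStatusCapacity resource.Quantity) bool {
	return pvCapacity.Cmp(pvcStatusCapacity) > 0
}
```

For example, after the controller grows `pv.spec.capacity` to 10Gi while `pvc.status.capacity` still reads 1Gi, the check returns true and kubelet would resize the file system during mount before updating the PVC status.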
\ No newline at end of file diff --git a/contributors/design-proposals/storage/local-storage-overview.md b/contributors/design-proposals/storage/local-storage-overview.md index 708f6600..f0fbec72 100644 --- a/contributors/design-proposals/storage/local-storage-overview.md +++ b/contributors/design-proposals/storage/local-storage-overview.md @@ -1,611 +1,6 @@ -# Local Storage Management -Authors: vishh@, msau42@ +Design proposals have been archived. -This document presents a strawman for managing local storage in Kubernetes. We expect to provide a UX and high level design overview for managing most user workflows. More detailed design and implementation will be added once the community agrees with the high level design presented here. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -# Goals -* Enable ephemeral & durable access to local storage -* Support storage requirements for all workloads supported by Kubernetes -* Provide flexibility for users/vendors to utilize various types of storage devices -* Define a standard partitioning scheme for storage drives for all Kubernetes nodes -* Provide storage usage isolation for shared partitions -* Support random access storage devices only (e.g., hard disks and SSDs) -# Non Goals -* Provide storage usage isolation for non-shared partitions. -* Support all storage devices natively in upstream Kubernetes. Non standard storage devices are expected to be managed using extension mechanisms. -* Support for I/O isolation using CFS & blkio cgroups. - * IOPS isn't safe to be a schedulable resource. IOPS on rotational media is very limited compared to other resources like CPU and Memory. This leads to severe resource stranding. - * Blkio cgroup + CFS based I/O isolation doesn't provide deterministic behavior compared to memory and cpu cgroups. Years of experience at Google with Borg has taught that relying on blkio or I/O scheduler isn't suitable for multi-tenancy. - * Blkio cgroup based I/O isolation isn't suitable for SSDs. Turning on CFQ on SSDs will hamper performance. Its better to statically partition SSDs and share them instead of using CFS. - * I/O isolation can be achieved by using a combination of static partitioning and remote storage. This proposal recommends this approach with illustrations below. - * Pod level resource isolation extensions will be made available in the Kubelet which will let vendors add support for CFQ if necessary for their deployments. - -# Use Cases - -## Ephemeral Local Storage -Today, ephemeral local storage is exposed to pods via the container’s writable layer, logs directory, and EmptyDir volumes. Pods use ephemeral local storage for scratch space, caching and logs. There are many issues related to the lack of local storage accounting and isolation, including: - -* Pods do not know how much local storage is available to them. -* Pods cannot request “guaranteed” local storage. -* Local storage is a “best-effort” resource. -* Pods can get evicted due to other pods filling up the local storage, after which no new pods will be admitted until sufficient storage has been reclaimed. - -## Persistent Local Storage -Distributed filesystems and databases are the primary use cases for persistent local storage due to the following factors: - -* Performance: On cloud providers, local SSDs give better performance than remote disks. 
-* Cost: On bare metal, in addition to performance, local storage is typically cheaper and using it is a necessity to provision distributed filesystems. - -Distributed systems often use replication to provide fault tolerance, and can therefore tolerate node failures. However, data gravity is preferred for reducing replication traffic and cold startup latencies. - -# Design Overview - -A node’s local storage can be broken into primary and secondary partitions. - -## Primary Partitions -Primary partitions are shared partitions that can provide ephemeral local storage. The two supported primary partitions are: - -### Root -This partition holds the kubelet’s root directory (`/var/lib/kubelet` by default) and `/var/log` directory. This partition may be shared between user pods, OS and Kubernetes system daemons. This partition can be consumed by pods via EmptyDir volumes, container logs, image layers and container writable layers. Kubelet will manage shared access and isolation of this partition. This partition is “ephemeral” and applications cannot expect any performance SLAs (Disk IOPS for example) from this partition. - -### Runtime -This is an optional partition which runtimes can use for overlay filesystems. Kubelet will attempt to identify and provide shared access along with isolation to this partition. Container image layers and writable later is stored here. If the runtime partition exists, `root` partition will not hold any image layer or writable layers. - -## Secondary Partitions -All other partitions are exposed as local persistent volumes. The PV interface allows for varying storage configurations to be supported, while hiding specific configuration details to the pod. All the local PVs can be queried and viewed from a cluster level using the existing PV object. Applications can continue to use their existing PVC specifications with minimal changes to request local storage. - -The local PVs can be precreated by an addon DaemonSet that discovers all the secondary partitions at well-known directories, and can create new PVs as partitions are added to the node. A default addon can be provided to handle common configurations. - -Local PVs can only provide semi-persistence, and are only suitable for specific use cases that need performance, data gravity and can tolerate data loss. If the node or PV fails, then either the pod cannot run, or the pod has to give up on the local PV and find a new one. Failure scenarios can be handled by unbinding the PVC from the local PV, and forcing the pod to reschedule and find a new PV. - -Since local PVs are only accessible from specific nodes, the scheduler needs to take into account a PV's node constraint when placing pods. This can be generalized to a storage topology constraint, which can also work with zones, and in the future: racks, clusters, etc. - -The term `Partitions` are used here to describe the main use cases for local storage. However, the proposal doesn't require a local volume to be an entire disk or a partition - it supports arbitrary directory. This implies that cluster administrator can create multiple local volumes in one partition, each has the capacity of the partition, or even create local volume under primary partitions. Unless strictly required, e.g. if you have only one partition in your host, this is strongly discouraged. For this reason, following description will use `partition` or `mount point` exclusively. - -# User Workflows - -### Alice manages a deployment and requires “Guaranteed” ephemeral storage - -1. 
Kubelet running across all nodes will identify primary partition and expose capacity and allocatable for the primary partitions. This allows primary partitions' storage capacity to be considered as a first class resource when scheduling. - - ```yaml - apiVersion: v1 - kind: Node - metadata: - name: foo - status: - capacity: - storage.kubernetes.io/overlay: 100Gi - storage.kubernetes.io/scratch: 100Gi - allocatable: - storage.kubernetes.io/overlay: 100Gi - storage.kubernetes.io/scratch: 90Gi - ``` - -2. Alice adds new storage resource requirements to her pod, specifying limits for the container's writeable and overlay layers, and emptyDir volumes. - - ```yaml - apiVersion: v1 - kind: Pod - metadata: - name: foo - spec: - containers: - - name: fooc - resources: - limits: - storage.kubernetes.io/logs: 500Mi - storage.kubernetes.io/overlay: 1Gi - volumeMounts: - - name: myEmptyDir - mountPath: /mnt/data - volumes: - - name: myEmptyDir - emptyDir: - sizeLimit: 20Gi - ``` - -3. Alice’s pod “foo” is Guaranteed a total of “21.5Gi” of local storage. The container “fooc” in her pod cannot consume more than 1Gi for writable layer and 500Mi for logs, and “myEmptyDir” volume cannot consume more than 20Gi. -4. For the pod resources, `storage.kubernetes.io/logs` resource is meant for logs. `storage.kubernetes.io/overlay` is meant for writable layer. -5. `storage.kubernetes.io/logs` is satisfied by `storage.kubernetes.io/scratch`. -6. `storage.kubernetes.io/overlay` resource can be satisfied by `storage.kubernetes.io/overlay` if exposed by nodes or by `storage.kubernetes.io/scratch` otherwise. The scheduler follows this policy to find an appropriate node which can satisfy the storage resource requirements of the pod. -7. EmptyDir.size is both a request and limit that is satisfied by `storage.kubernetes.io/scratch`. -8. Kubelet will rotate logs to keep scratch space usage of “fooc” under 500Mi -9. Kubelet will track the usage of pods across logs and overlay filesystem and restart the container if it's total usage exceeds it's storage limits. If usage on `EmptyDir` volume exceeds its `limit`, then the pod will be evicted by the kubelet. By performing soft limiting, users will be able to easily identify pods that run out of storage. -10. Primary partition health is monitored by an external entity like the “Node Problem Detector” which is expected to place appropriate taints. -11. If a primary partition becomes unhealthy, the node is tainted and all pods running in it will be evicted by default, unless they tolerate that taint. Kubelet’s behavior on a node with unhealthy primary partition is undefined. Cluster administrators are expected to fix unhealthy primary partitions on nodes. - -### Bob runs batch workloads and is unsure of “storage” requirements - -1. Bob can create pods without any “storage” resource requirements. - - ```yaml - apiVersion: v1 - kind: Pod - metadata: - name: foo - namespace: myns - spec: - containers: - - name: fooc - volumeMounts: - - name: myEmptyDir - mountPath: /mnt/data - volumes: - - name: myEmptyDir - emptyDir: - ``` - -2. His cluster administrator being aware of the issues with disk reclamation latencies has intelligently decided not to allow overcommitting primary partitions. The cluster administrator has installed a [LimitRange](https://kubernetes.io/docs/user-guide/compute-resources/) to “myns” namespace that will set a default storage size. Note: A cluster administrator can also specify burst ranges and a host of other features supported by LimitRange for local storage. 
- - ```yaml - apiVersion: v1 - kind: LimitRange - metadata: - name: mylimits - spec: - - default: - storage.kubernetes.io/logs: 200Mi - storage.kubernetes.io/overlay: 200Mi - type: Container - - default: - sizeLimit: 1Gi - type: EmptyDir - ``` - -3. The limit range will update the pod specification as follows: - - ```yaml - apiVersion: v1 - kind: Pod - metadata: - name: foo - spec: - containers: - - name: fooc - resources: - limits: - storage.kubernetes.io/logs: 200Mi - storage.kubernetes.io/overlay: 200Mi - volumeMounts: - - name: myEmptyDir - mountPath: /mnt/data - volumes: - - name: myEmptyDir - emptyDir: - sizeLimit: 1Gi - ``` - -4. Bob’s “foo” pod can use upto “200Mi” for its containers logs and writable layer each, and “1Gi” for its “myEmptyDir” volume. -5. If Bob’s pod “foo” exceeds the “default” storage limits and gets evicted, then Bob can set a minimum storage requirement for his containers and a higher `sizeLimit` for his EmptyDir volumes. - - ```yaml - apiVersion: v1 - kind: Pod - metadata: - name: foo - spec: - containers: - - name: fooc - resources: - requests: - storage.kubernetes.io/logs: 500Mi - storage.kubernetes.io/overlay: 500Mi - volumeMounts: - - name: myEmptyDir - mountPath: /mnt/data - volumes: - - name: myEmptyDir - emptyDir: - sizeLimit: 2Gi - ``` - -6. It is recommended to require `limits` to be specified for `storage` in all pods. `storage` will not affect the `QoS` Class of a pod since no SLA is intended to be provided for storage capacity isolation. It is recommended to use Persistent Volumes as much as possible and avoid primary partitions. - -### Alice manages a Database which needs access to “durable” and fast scratch space - -1. Cluster administrator provisions machines with local SSDs and brings up the cluster -2. When a new node instance starts up, an addon DaemonSet discovers local “secondary” partitions which are mounted at a well known location and creates Local PVs for them if one doesn’t exist already. The PVs will include a path to the secondary device mount points, and a node affinity ties the volume to a specific node. The node affinity specification tells the scheduler to filter PVs with the same affinity key/value on the node. For the local storage case, the key is `kubernetes.io/hostname`, but the same mechanism could be used for zone constraints as well. - - ```yaml - kind: StorageClass - apiVersion: storage.k8s.io/v1 - metadata: - name: local-fast - topologyKey: kubernetes.io/hostname - ``` - ```yaml - kind: PersistentVolume - apiVersion: v1 - metadata: - name: local-pv-1 - spec: - nodeAffinity: - requiredDuringSchedulingIgnoredDuringExecution: - nodeSelectorTerms: - - matchExpressions: - - key: kubernetes.io/hostname - operator: In - values: - - node-1 - capacity: - storage: 100Gi - local: - path: /var/lib/kubelet/storage-partitions/local-pv-1 - accessModes: - - ReadWriteOnce - persistentVolumeReclaimPolicy: Delete - storageClassName: local-fast - ``` - ``` - $ kubectl get pv - NAME CAPACITY ACCESSMODES RECLAIMPOLICY STATUS CLAIM … NODE - local-pv-1 100Gi RWO Delete Available node-1 - local-pv-2 10Gi RWO Delete Available node-1 - local-pv-1 100Gi RWO Delete Available node-2 - local-pv-2 10Gi RWO Delete Available node-2 - local-pv-1 100Gi RWO Delete Available node-3 - local-pv-2 10Gi RWO Delete Available node-3 - ``` -3. Alice creates a StatefulSet that requests local storage from StorageClass "local-fast". The PVC will only be bound to PVs that match the StorageClass name. 
- - ```yaml - apiVersion: apps/v1beta1 - kind: StatefulSet - metadata: - name: web - spec: - serviceName: "nginx" - replicas: 3 - template: - metadata: - labels: - app: nginx - spec: - terminationGracePeriodSeconds: 10 - containers: - - name: nginx - image: k8s.gcr.io/nginx-slim:0.8 - ports: - - containerPort: 80 - name: web - volumeMounts: - - name: www - mountPath: /usr/share/nginx/html - - name: log - mountPath: /var/log/nginx - volumeClaimTemplates: - - metadata: - name: www - spec: - accessModes: [ "ReadWriteOnce" ] - storageClassName: local-fast - resources: - requests: - storage: 100Gi - - metadata: - name: log - spec: - accessModes: [ "ReadWriteOnce" ] - storageClassName: local-slow - resources: - requests: - storage: 1Gi - ``` - -4. The scheduler identifies nodes for each pod that can satisfy all the existing predicates. -5. The nodes list is further filtered by looking at the PVC's StorageClass, and checking if there is available PV of the same StorageClass on a node. -6. The scheduler chooses a node for the pod based on a ranking algorithm. -7. Once the pod is assigned to a node, then the pod’s local PVCs get bound to specific local PVs on the node. - - ``` - $ kubectl get pvc - NAME STATUS VOLUME CAPACITY ACCESSMODES … NODE - www-local-pvc-1 Bound local-pv-1 100Gi RWO node-1 - www-local-pvc-2 Bound local-pv-1 100Gi RWO node-2 - www-local-pvc-3 Bound local-pv-1 100Gi RWO node-3 - log-local-pvc-1 Bound local-pv-2 10Gi RWO node-1 - log-local-pvc-2 Bound local-pv-2 10Gi RWO node-2 - log-local-pvc-3 Bound local-pv-2 10Gi RWO node-3 - ``` - ``` - $ kubectl get pv - NAME CAPACITY … STATUS CLAIM NODE - local-pv-1 100Gi Bound www-local-pvc-1 node-1 - local-pv-2 10Gi Bound log-local-pvc-1 node-1 - local-pv-1 100Gi Bound www-local-pvc-2 node-2 - local-pv-2 10Gi Bound log-local-pvc-2 node-2 - local-pv-1 100Gi Bound www-local-pvc-3 node-3 - local-pv-2 10Gi Bound log-local-pvc-3 node-3 - ``` - -8. If a pod dies and is replaced by a new one that reuses existing PVCs, the pod will be placed on the same node where the corresponding PVs exist. Stateful Pods are expected to have a high enough priority which will result in such pods preempting other low priority pods if necessary to run on a specific node. -9. Forgiveness policies can be specified as tolerations in the pod spec for each failure scenario. No toleration specified means that the failure is not tolerated. In that case, the PVC will immediately be unbound, and the pod will be rescheduled to obtain a new PV. If a toleration is set, by default, it will be tolerated forever. `tolerationSeconds` can be specified to allow for a timeout period before the PVC gets unbound. - - Node taints already exist today. Pod scheduling failures are specified separately as a timeout. - ```yaml - apiVersion: v1 - kind: Pod - metadata: - name: foo - spec: - <snip> - nodeTolerations: - - key: node.alpha.kubernetes.io/not-ready - operator: TolerationOpExists - tolerationSeconds: 600 - - key: node.alpha.kubernetes.io/unreachable - operator: TolerationOpExists - tolerationSeconds: 1200 - schedulingFailureTimeoutSeconds: 600 - ``` - - A new PV taint will be introduced to handle unhealthy volumes. The addon or another external entity can monitor the volumes and add a taint when it detects that it is unhealthy. - ```yaml - apiVersion: v1 - kind: PersistentVolumeClaim - metadata: - name: foo - spec: - <snip> - pvTolerations: - - key: storage.kubernetes.io/pvUnhealthy - operator: TolerationOpExists - ``` -10. 
Once Alice decides to delete the database, she destroys the StatefulSet, and then destroys the PVCs. The PVs will then get deleted and cleaned up according to the reclaim policy, and the addon adds it back to the cluster. - -### Bob manages a distributed filesystem which needs access to all available storage on each node - -1. The cluster that Bob is using is provisioned with nodes that contain one or more secondary partitions -2. The cluster administrator runs a DaemonSet addon that discovers secondary partitions across all nodes and creates corresponding PVs for them. -3. The addon will monitor the health of secondary partitions and mark PVs as unhealthy whenever the backing local storage devices have failed. -4. Bob creates a specialized controller (Operator) for his distributed filesystem and deploys it. -5. The operator will identify all the nodes that it can schedule pods onto and discovers the PVs available on each of those nodes. The operator has a label selector that identifies the specific PVs that it can use (this helps preserve fast PVs for Databases for example). -6. The operator will then create PVCs and manually bind to individual local PVs across all its nodes. -7. It will then create pods, manually place them on specific nodes (similar to a DaemonSet) with high enough priority and have them use all the PVCs created by the Operator on those nodes. -8. If a pod dies, it will get replaced with a new pod that uses the same set of PVCs that the old pod had used. -9. If a PV gets marked as unhealthy, the Operator is expected to delete pods if they cannot tolerate device failures - -### Phippy manages a cluster and intends to mitigate storage I/O abuse - -1. Phippy creates a dedicated partition with a separate device for her system daemons. She achieves this by making `/var/log/containers`, `/var/lib/kubelet`, `/var/lib/docker` (with the docker runtime) all reside on a separate partition. -2. Phippy is aware that pods can cause abuse to each other. -3. Whenever a pod experiences I/O issues with it's EmptyDir volume, Phippy reconfigures those pods to use an inline Persistent Volume, whose lifetime is tied to the pod. - ```yaml - apiVersion: v1 - kind: Pod - metadata: - name: foo - spec: - containers: - - name: fooc - resources: - limits: - storage.kubernetes.io/logs: 500Mi - storage.kubernetes.io/overlay: 1Gi - volumeMounts: - - name: myEphemeralPersistentVolume - mountPath: /mnt/tmpdata - volumes: - - name: myEphemeralPersistentVolume - inline: - spec: - accessModes: [ "ReadWriteOnce" ] - storageClassName: local-fast - resources: - limits: - size: 1Gi - ``` - -4. Phippy notices some of her pods are experiencing spurious downtimes. With the help of monitoring (`iostat`), she notices that the nodes pods are running on are overloaded with I/O operations. She then updates her pods to use Logging Volumes which are backed by persistent storage. If a logging volumeMount is associated with a container, Kubelet will place log data from stdout & stderr of the container under the volume mount path within the container. Kubelet will continue to expose stdout/stderr log data to external logging agents using symlinks as it does already. 
- - ```yaml - apiVersion: v1 - kind: Pod - metadata: - name: foo - spec: - containers: - - name: fooc - volumeMounts: - - name: myLoggingVolume - mountPath: /var/log/ - policy: - logDir: - subDir: foo - glob: *.log - - name: barc - volumeMounts: - - name: myInMemoryLoggVolume - mountPath: /var/log/ - policy: - logDir: - subDir: bar - glob: *.log - volumes: - - name: myLoggingVolume - inline: - spec: - accessModes: [ "ReadWriteOnce" ] - storageClassName: local-slow - resources: - requests: - storage: 1Gi - - name: myInMemoryLogVolume - emptyDir: - medium: memory - resources: - limits: - size: 100Mi - ``` - -5. Phippy notices some of her pods are suffering hangs by while writing to their writable layer. Phippy again notices that I/O contention is the root cause and then updates her Pod Spec to use memory backed or persistent volumes for her pods writable layer. Kubelet will instruct the runtimes to overlay the volume with `overlay` policy over the writable layer of the container. - - ```yaml - apiVersion: v1 - kind: Pod - metadata: - name: foo - spec: - containers: - - name: fooc - volumeMounts: - - name: myWritableLayer - policy: - overlay: - subDir: foo - - name: barc - volumeMounts: - - name: myDurableWritableLayer - policy: - overlay: - subDir: bar - volumes: - - name: myWritableLayer - emptyDir: - medium: memory - resources: - limits: - storage: 100Mi - - name: myDurableWritableLayer - inline: - spec: - accessModes: [ "ReadWriteOnce" ] - storageClassName: local-fast - resources: - requests: - storage: 1Gi - ``` - -### Bob manages a specialized application that needs access to Block level storage -Note: Block access will be considered as a separate feature because it can work for both remote and local storage. The examples here are a suggestion on how such a feature can be applied to this local storage model, but is subject to change based on the final design for block access. - -1. The cluster that Bob uses has nodes that contain raw block devices that have not been formatted yet. -2. The same addon DaemonSet can also discover block devices and creates corresponding PVs for them with the `volumeType: block` spec. `path` is overloaded here to mean both fs path and block device path. - - ```yaml - kind: PersistentVolume - apiVersion: v1 - metadata: - name: foo - labels: - kubernetes.io/hostname: node-1 - spec: - capacity: - storage: 100Gi - volumeType: block - local: - path: /var/lib/kubelet/storage-raw-devices/foo - accessModes: - - ReadWriteOnce - persistentVolumeReclaimPolicy: Delete - storageClassName: local-fast - ``` - -3. Bob creates a pod with a PVC that requests for block level access and similar to a Stateful Set scenario the scheduler will identify nodes that can satisfy the pods request. The block devices will not be formatted to allow the application to handle the device using their own methods. - - ```yaml - kind: PersistentVolumeClaim - apiVersion: v1 - metadata: - name: myclaim - spec: - volumeType: block - storageClassName: local-fast - accessModes: - - ReadWriteOnce - resources: - requests: - storage: 80Gi - ``` - -4. It is also possible for a PVC that requests `volumeType: block` to also use file-based volume. In this situation, the block device would get formatted with the filesystem type specified in the PVC spec. And when the PVC gets destroyed, then the filesystem also gets destroyed to return back to the original block state. 
- - ```yaml - kind: PersistentVolumeClaim - apiVersion: v1 - metadata: - name: myclaim - spec: - volumeType: block - fsType: ext4 - storageClassName: local-fast - accessModes: - - ReadWriteOnce - resources: - requests: - storage: 80Gi - ``` - -*The lifecycle of the block level PV will be similar to that of the scenarios explained earlier.* - -# Open Questions & Discussion points -* Single vs split “limit” for storage across writable layer and logs - * Split allows for enforcement of hard quotas - * Single is a simpler UI -* Local Persistent Volume bindings happening in the scheduler vs in PV controller - * Should the PV controller fold into the scheduler - * This will help spread PVs and pods across matching zones. -* Repair/replace scenarios. - * What are the implications of removing a disk and replacing it with a new one? - * We may not do anything in the system, but may need a special workflow -* Volume-level replication use cases where there is no pod associated with a volume. How could forgiveness/data gravity be handled there? - -# Related Features -* Support for encrypted secondary partitions in order to make wiping more secure and reduce latency -* Co-locating PVs and pods across zones. Binding PVCs in the scheduler will help with this feature. - -# Recommended Storage best practices -* Have the primary partition on a reliable storage device -* Have a dedicated storage device for system daemons. -* Consider using RAID and SSDs (for performance) -* Partition the rest of the storage devices based on the application needs - * SSDs can be statically partitioned and they might still meet IO requirements of apps. - * TODO: Identify common durable storage requirements for most databases -* Avoid having multiple logical partitions on hard drives to avoid IO isolation issues -* Run a reliable cluster level logging service to drain logs from the nodes before they get rotated or deleted -* The runtime partition for overlayfs is optional. You do not **need** one. -* Alert on primary partition failures and act on it immediately. Primary partition failures will render your node unusable. -* Use EmptyDir for all scratch space requirements of your apps when IOPS isolation is not of concern. -* Make the container’s writable layer `readonly` if possible. -* Another option is to keep the writable layer on tmpfs. Such a setup will allow you to eventually migrate from using local storage for anything but super fast caching purposes or distributed databases leading to higher reliability & uptime for nodes. - -# FAQ - -### Why is the kubelet managing logs? - -Kubelet is managing access to shared storage on the node. Container logs outputted via it's stdout and stderr ends up on the shared storage that kubelet is managing. So, kubelet needs direct control over the log data to keep the containers running (by rotating logs), store them long enough for break glass situations and apply different storage policies in a multi-tenant cluster. All of these features are not easily expressible through external logging agents like journald for example. - - -### Master are upgraded prior to nodes. How should storage as a new compute resource be rolled out on to existing clusters? - -Capacity isolation of shared partitions (ephemeral storage) will be controlled using a feature gate. Do not enable this feature gate until all the nodes in a cluster are running a kubelet version that supports capacity isolation. 
-Since older kubelets will not surface capacity of shared partitions, the scheduler will ignore those nodes when attempting to schedule pods that request storage capacity explicitly. - - -### What happens if storage usage is unavailable for writable layer? - -Kubelet will attempt to enforce capacity limits on a best effort basis. If the underlying container runtime cannot surface usage metrics for the writable layer, then kubelet will not provide capacity isolation for the writable layer. - - -### Are LocalStorage PVs required to be a whole partition? - -No, but it is the recommended way to ensure capacity and performance isolation. For HDDs, a whole disk is recommended for performance isolation. In some environments, multiple storage partitions are not available, so the only option is to share the same filesystem. In that case, directories in the same filesystem can be specified, and the administrator could configure group quota to provide capacity isolation. - -# Features & Milestones - -#### Features with owners - -1. Support for durable Local PVs -2. Support for capacity isolation - -Alpha support for these two features are targeted for v1.7. Beta and GA timelines are TBD. -Currently, msau42@, jinxu@ and vishh@ will be developing these features. - -#### Features needing owners - -1. Support for persistent volumes tied to the lifetime of a pod (`inline PV`) -2. Support for Logging Volumes -3. Support for changing the writable layer type of containers -4. Support for Block Level Storage +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
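To make the capacity-isolation FAQ above concrete: once the feature gate is enabled on every node, pods ask for scratch-space capacity the same way they ask for CPU and memory, and the scheduler only places them on nodes that advertise enough shared-partition capacity. The sketch below uses the `ephemeral-storage` resource name that this work eventually shipped under in later Kubernetes releases; the exact resource names were still being settled when this proposal was written, so treat it as illustrative rather than part of this design.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: scratch-demo              # illustrative name
spec:
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    resources:
      requests:
        ephemeral-storage: "2Gi"  # considered by the scheduler when picking a node
      limits:
        ephemeral-storage: "4Gi"  # kubelet evicts the pod if writable layer + logs + emptyDir exceed this
```

Nodes running older kubelets do not advertise this capacity at all, so, as described above, the scheduler simply skips them for pods that request storage explicitly.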
\ No newline at end of file diff --git a/contributors/design-proposals/storage/local-storage-pv.md b/contributors/design-proposals/storage/local-storage-pv.md index 60638dbb..f0fbec72 100644 --- a/contributors/design-proposals/storage/local-storage-pv.md +++ b/contributors/design-proposals/storage/local-storage-pv.md @@ -1,763 +1,6 @@ -# Local Storage Persistent Volumes +Design proposals have been archived. -Authors: @msau42, @vishh, @dhirajh, @ianchakeres +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -This document presents a detailed design for supporting persistent local storage, -as outlined in [Local Storage Overview](local-storage-overview.md). -Supporting all the use cases for persistent local storage will take many releases, -so this document will be extended for each new release as we add more features. -## Goals - -* Allow pods to mount any local block or filesystem based volume. -* Allow pods to mount dedicated local disks, or channeled partitions as volumes for -IOPS isolation. -* Allow pods do access local volumes without root privileges. -* Allow pods to access local volumes without needing to understand the storage -layout on every node. -* Persist local volumes and provide data gravity for pods. Any pod -using the local volume will be scheduled to the same node that the local volume -is on. -* Allow pods to release their local volume bindings and lose that volume's data -during failure conditions, such as node, storage or scheduling failures, where -the volume is not accessible for some user-configurable time. -* Allow pods to specify local storage as part of a Deployment or StatefulSet. -* Allow administrators to set up and configure local volumes with simple methods. -* Do not require administrators to manage the local volumes once provisioned -for a node. - -## Non-Goals - -* Provide data availability for a local volume beyond its local node. -* Support the use of HostPath volumes and Local PVs on the same volume. - -## Background - -In Kubernetes, there are two main types of storage: remote and local. - -Remote storage is typically used with persistent volumes where the data can -persist beyond the lifetime of the pod. - -Local storage is typically used with ephemeral volumes where the data only -persists during the lifetime of the pod. - -There is increasing demand for using local storage as persistent volumes, -especially for distributed filesystems and databases such as GlusterFS and -Cassandra. The main motivations for using persistent local storage, instead -of persistent remote storage include: - -* Performance: Local SSDs achieve higher IOPS and throughput than many -remote storage solutions. - -* Cost: Operational costs may be reduced by leveraging existing local storage, -especially in bare metal environments. Network storage can be expensive to -setup and maintain, and it may not be necessary for certain applications. - -## Use Cases - -### Distributed filesystems and databases - -Many distributed filesystem and database implementations, such as Cassandra and -GlusterFS, utilize the local storage on each node to form a storage cluster. -These systems typically have a replication feature that sends copies of the data -to other nodes in the cluster in order to provide fault tolerance in case of -node failures. Non-distributed, but replicated databases, like MySQL, can also -utilize local storage to store replicas. 
- -The main motivations for using local persistent storage are performance and -cost. Since the application handles data replication and fault tolerance, these -application pods do not need networked storage to provide shared access to data. -In addition, installing a high-performing NAS or SAN solution can be more -expensive, and more complex to configure and maintain than utilizing local -disks, especially if the node was already pre-installed with disks. Datacenter -infrastructure and operational costs can be reduced by increasing storage -utilization. - -These distributed systems are generally stateful, infrastructure applications -that provide data services to higher-level applications. They are expected to -run in a cluster with many other applications potentially sharing the same -nodes. Therefore, they expect to have high priority and node resource -guarantees. They typically are deployed using StatefulSets, custom -controllers, or operators. - -### Caching - -Caching is one of the recommended use cases for ephemeral local storage. The -cached data is backed by persistent storage, so local storage data durability is -not required. However, there is a use case for persistent local storage to -achieve data gravity for large caches. For large caches, if a pod restarts, -rebuilding the cache can take a long time. As an example, rebuilding a 100GB -cache from a hard disk with 150MB/s read throughput can take around 10 minutes. -If the service gets restarted and all the pods have to restart, then performance -and availability can be impacted while the pods are rebuilding. If the cache is -persisted, then cold startup latencies are reduced. - -Content-serving applications and producer/consumer workflows commonly utilize -caches for better performance. They are typically deployed using Deployments, -and could be isolated in its own cluster, or shared with other applications. - -## Environments - -### Baremetal - -In a baremetal environment, nodes may be configured with multiple local disks of -varying capacity, speeds and mediums. Mediums include spinning disks (HDDs) and -solid-state drives (SSDs), and capacities of each disk can range from hundreds -of GBs to tens of TB. Multiple disks may be arranged in JBOD or RAID configurations -to consume as persistent storage. - -Currently, the methods to use the additional disks are to: - -* Configure a distributed filesystem -* Configure a HostPath volume - -It is also possible to configure a NAS or SAN on a node as well. Speeds and -capacities will widely vary depending on the solution. - -### GCE/GKE - -GCE and GKE both have a local SSD feature that can create a VM instance with up -to 8 fixed-size 375GB local SSDs physically attached to the instance host and -appears as additional disks in the instance. The local SSDs have to be -configured at the VM creation time and cannot be dynamically attached to an -instance later. If the VM gets shutdown, terminated, pre-empted, or the host -encounters a non-recoverable error, then the SSD data will be lost. If the -guest OS reboots, or a live migration occurs, then the SSD data will be -preserved. - -### EC2 - -In EC2, the instance store feature attaches local HDDs or SSDs to a new instance -as additional disks. HDD capacities can go up to 24 2TB disks for the largest -configuration. SSD capacities can go up to 8 800GB disks or 2 2TB disks for the -largest configurations. Data on the instance store only persists across -instance reboot. 
- -## Limitations of current volumes - -The following is an overview of existing volume types in Kubernetes, and how -they cannot completely address the use cases for local persistent storage. - -* EmptyDir: A temporary directory for a pod that is created under the kubelet -root directory. The contents are deleted when a pod dies. Limitations: - - * Volume lifetime is bound to the pod lifetime. Pod failure is more likely -than node failure, so there can be increased network and storage activity to -recover data via replication and data backups when a replacement pod is started. - * Multiple disks are not supported unless the administrator aggregates them -into a spanned or RAID volume. In this case, all the storage is shared, and -IOPS guarantees cannot be provided. - * There is currently no method of distinguishing between HDDs and SDDs. The -“medium” field could be expanded, but it is not easily generalizable to -arbitrary types of mediums. - -* HostPath: A direct mapping to a specified directory on the node. The -directory is not managed by the cluster. Limitations: - - * Admin needs to manually setup directory permissions for the volume’s users. - * Admin has to manage the volume lifecycle manually and do cleanup of the data and -directories. - * All nodes have to have their local storage provisioned the same way in order to -use the same pod template. - * There can be path collision issues if multiple pods get scheduled to the same -node that want the same path - * If node affinity is specified, then the user has to do the pod scheduling -manually. - -* Provider’s block storage (GCE PD, AWS EBS, etc): A remote disk that can be -attached to a VM instance. The disk’s lifetime is independent of the pod’s -lifetime. Limitations: - - * Doesn’t meet performance requirements. -[Performance benchmarks on GCE](https://cloud.google.com/compute/docs/disks/performance) -show that local SSD can perform better than SSD persistent disks: - - * 16x read IOPS - * 11x write IOPS - * 6.5x read throughput - * 4.5x write throughput - -* Networked filesystems (NFS, GlusterFS, etc): A filesystem reachable over the -network that can provide shared access to data. Limitations: - - * Requires more configuration and setup, which adds operational burden and -cost. - * Requires a high performance network to achieve equivalent performance as -local disks, especially when compared to high-performance SSDs. - -Due to the current limitations in the existing volume types, a new method for -providing persistent local storage should be considered. - -## Feature Plan - -A detailed implementation plan can be found in the -[Storage SIG planning spreadsheet](https://docs.google.com/spreadsheets/d/1t4z5DYKjX2ZDlkTpCnp18icRAQqOE85C1T1r2gqJVck/view#gid=1566770776). -The following is a high level summary of the goals in each phase. - -### Phase 1 - -* Support Pod, Deployment, and StatefulSet requesting a single local volume -* Support pre-configured, statically partitioned, filesystem-based local volumes - -### Phase 2 - -* Block devices and raw partitions -* Smarter PV binding to consider local storage and pod scheduling constraints, -such as pod affinity/anti-affinity, and requesting multiple local volumes - -### Phase 3 - -* Support common partitioning patterns -* Volume taints and tolerations for unbinding volumes in error conditions - -### Phase 4 - -* Dynamic provisioning - -## Design - -A high level proposal with user workflows is available in the -[Local Storage Overview](local-storage-overview.md). 
- -This design section will focus on one phase at a time. Each new release will -extend this section. - -### Phase 1: 1.7 alpha - -#### Local Volume Plugin - -A new volume plugin will be introduced to represent logical block partitions and -filesystem mounts that are local to a node. Some examples include whole disks, -disk partitions, RAID volumes, LVM volumes, or even directories in a shared -partition. Multiple Local volumes can be created on a node, and is -accessed through a local mount point or path that is bind-mounted into the -container. It is only consumable as a PersistentVolumeSource because the PV -interface solves the pod spec portability problem and provides the following: - -* Abstracts volume implementation details for the pod and expresses volume -requirements in terms of general concepts, like capacity and class. This allows -for portable configuration, as the pod is not tied to specific volume instances. -* Allows volume management to be independent of the pod lifecycle. The volume can -survive container, pod and node restarts. -* Allows volume classification by StorageClass. -* Is uniquely identifiable within a cluster and is managed from a cluster-wide -view. - -There are major changes in PV and pod semantics when using Local volumes -compared to the typical remote storage volumes. - -* Since Local volumes are fixed to a node, a pod using that volume has to -always be scheduled on that node. -* Volume availability is tied to the node’s availability. If the node is -unavailable, then the volume is also unavailable, which impacts pod -availability. -* The volume’s data durability characteristics are determined by the underlying -storage system, and cannot be guaranteed by the plugin. A Local volume -in one environment can provide data durability, but in another environment may -only be ephemeral. As an example, in the GCE/GKE/AWS cloud environments, the -data in directly attached, physical SSDs is immediately deleted when the VM -instance terminates or becomes unavailable. - -Due to these differences in behaviors, Local volumes are not suitable for -general purpose use cases, and are only suitable for specific applications that -need storage performance and data gravity, and can tolerate data loss or -unavailability. Applications need to be aware of, and be able to handle these -differences in data durability and availability. - -Local volumes are similar to HostPath volumes in the following ways: - -* Partitions need to be configured by the storage administrator beforehand. -* Volume is referenced by the path to the partition. -* Provides the same underlying partition’s support for IOPS isolation. -* Volume is permanently attached to one node. -* Volume can be mounted by multiple pods on the same node. - -However, Local volumes will address these current issues with HostPath -volumes: - -* Security concerns allowing a pod to access any path in a node. Local -volumes cannot be consumed directly by a pod. They must be specified as a PV -source, so only users with storage provisioning privileges can determine which -paths on a node are available for consumption. -* Difficulty in permissions setup. Local volumes will support fsGroup so -that the admins do not need to setup the permissions beforehand, tying that -particular volume to a specific user/group. During the mount, the fsGroup -settings will be applied on the path. However, multiple pods -using the same volume should use the same fsGroup. 
-* Volume lifecycle is not clearly defined, and the volume has to be manually -cleaned up by users. For Local volumes, the PV has a clearly defined -lifecycle. Upon PVC deletion, the PV will be released (if it has the Delete -policy), and all the contents under the path will be deleted. In the future, -advanced cleanup options, like zeroing can also be specified for a more -comprehensive cleanup. - -##### API Changes - -All new changes are protected by a new feature gate, `PersistentLocalVolumes`. - -A new `LocalVolumeSource` type is added as a `PersistentVolumeSource`. For this -initial phase, the path can only be a mount point or a directory in a shared -filesystem. - -``` -type LocalVolumeSource struct { - // The full path to the volume on the node - // For alpha, this path must be a directory - // Once block as a source is supported, then this path can point to a block device - Path string -} - -type PersistentVolumeSource struct { - <snip> - // Local represents directly-attached storage with node affinity. - // +optional - Local *LocalVolumeSource -} -``` - -The relationship between a Local volume and its node will be expressed using -PersistentVolume node affinity, described in the following section. - -Users request Local volumes using PersistentVolumeClaims in the same manner as any -other volume type. The PVC will bind to a matching PV with the appropriate capacity, -AccessMode, and StorageClassName. Then the user specifies that PVC in their -Pod spec. There are no special annotations or fields that need to be set in the Pod -or PVC to distinguish between local and remote storage. It is abstracted by the -StorageClass. - -#### PersistentVolume Node Affinity - -PersistentVolume node affinity is a new concept and is similar to Pod node affinity, -except instead of specifying which nodes a Pod has to be scheduled to, it specifies which nodes -a PersistentVolume can be attached and mounted to, influencing scheduling of Pods that -use local volumes. - -For a Pod that uses a PV with node affinity, a new scheduler predicate -will evaluate that node affinity against the node's labels. For this initial phase, the -PV node affinity is only considered by the scheduler for already-bound PVs. It is not -considered during the initial PVC/PV binding, which will be addressed in a future release. - -Only the `requiredDuringSchedulingIgnoredDuringExecution` field will be supported. - -##### API Changes - -For the initial alpha phase, node affinity is expressed as an optional -annotation in the PersistentVolume object. - -``` -// AlphaStorageNodeAffinityAnnotation defines node affinity policies for a PersistentVolume. -// Value is a string of the json representation of type NodeAffinity -AlphaStorageNodeAffinityAnnotation = "volume.alpha.kubernetes.io/node-affinity" -``` - -#### Local volume initial configuration - -There are countless ways to configure local storage on a node, with different patterns to -follow depending on application requirements and use cases. Some use cases may require -dedicated disks; others may only need small partitions and are ok with sharing disks. -Instead of forcing a partitioning scheme on storage administrators, the Local volume -is represented by a path, and lets the administrators partition their storage however they -like, with a few minimum requirements: - -* The paths to the mount points are always consistent, even across reboots or when storage -is added or removed. 
-* The paths are backed by a filesystem (block devices or raw partitions are not supported for -the first phase) -* The directories have appropriate permissions for the provisioner to be able to set owners and -cleanup the volume. - -#### Local volume management - -Local PVs are statically created and not dynamically provisioned for the first phase. -To mitigate the amount of time an administrator has to spend managing Local volumes, -a Local static provisioner application will be provided to handle common scenarios. For -uncommon scenarios, a specialized provisioner can be written. - -The Local static provisioner will be developed in the -[kubernetes-incubator/external-storage](https://github.com/kubernetes-incubator) -repository, and will loosely follow the external provisioner design, with a few differences: - -* A provisioner instance needs to run on each node and only manage the local storage on its node. -* For phase 1, it does not handle dynamic provisioning. Instead, it performs static provisioning -by discovering available partitions mounted under configurable discovery directories. - -The basic design of the provisioner will have two separate handlers: one for PV deletion and -cleanup, and the other for static PV creation. A PersistentVolume informer will be created -and its cache will be used by both handlers. - -PV deletion will operate on the Update event. If the PV it provisioned changes to the “Released” -state, and if the reclaim policy is Delete, then it will cleanup the volume and then delete the PV, -removing it from the cache. - -PV creation does not operate on any informer events. Instead, it periodically monitors the discovery -directories, and will create a new PV for each path in the directory that is not in the PV cache. It -sets the "pv.kubernetes.io/provisioned-by" annotation so that it can distinguish which PVs it created. - -For phase 1, the allowed discovery file types are directories and mount points. The PV capacity -will be the capacity of the underlying filesystem. Therefore, PVs that are backed by shared -directories will report its capacity as the entire filesystem, potentially causing overcommittment. -Separate partitions are recommended for capacity isolation. - -The name of the PV needs to be unique across the cluster. The provisioner will hash the node name, -StorageClass name, and base file name in the volume path to generate a unique name. - -##### Packaging - -The provisioner is packaged as a container image and will run on each node in the cluster as part of -a DaemonSet. It needs to be run with a user or service account with the following permissions: - -* Create/delete/list/get PersistentVolumes - Can use the `system:persistentvolumeprovisioner` ClusterRoleBinding -* Get ConfigMaps - To access user configuration for the provisioner -* Get Nodes - To get the node's UID and labels - -These are broader permissions than necessary (a node's access to PVs should be restricted to only -those local to the node). A redesign will be considered in a future release to address this issue. - -In addition, it should run with high priority so that it can reliably handle all the local storage -partitions on each node, and with enough permissions to be able to cleanup volume contents upon -deletion. 
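As a rough sketch of the permissions listed above (assuming an RBAC-enabled cluster; the API version, role names, and exact rules are illustrative and may differ from the manifests actually shipped with the external provisioner), the service account could be set up along these lines:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: local-storage-admin                 # referenced by the provisioner DaemonSet
  namespace: kube-system
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: local-storage-provisioner           # illustrative name
rules:
- apiGroups: [""]
  resources: ["persistentvolumes"]
  verbs: ["get", "list", "watch", "create", "delete"]  # create/delete/list/get PVs; watch feeds the informer cache
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get"]                                       # read the provisioner configuration
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get"]                                       # read the node's UID and labels
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: local-storage-provisioner           # illustrative name
subjects:
- kind: ServiceAccount
  name: local-storage-admin
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: local-storage-provisioner
  apiGroup: rbac.authorization.k8s.io
```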
- -The provisioner DaemonSet requires the following configuration: - -* The node's name set as the MY_NODE_NAME environment variable -* ConfigMap with StorageClass -> discovery directory mappings -* Each mapping in the ConfigMap needs a hostPath volume -* User/service account with all the required permissions - -Here is an example ConfigMap: - -``` -kind: ConfigMap -metadata: - name: local-volume-config - namespace: kube-system -data: - storageClassMap: | - local-fast: - hostDir: "/mnt/ssds" - mountDir: "/local-ssds" - local-slow: - hostDir: "/mnt/hdds" - mountDir: "/local-hdds" -``` - -The `hostDir` is the discovery path on the host, and the `mountDir` is the path it is mounted to in -the provisioner container. The `hostDir` is required because the provisioner needs to create Local PVs -with the `Path` based off of `hostDir`, not `mountDir`. - -The DaemonSet for this example looks like: -``` - -apiVersion: extensions/v1beta1 -kind: DaemonSet -metadata: - name: local-storage-provisioner - namespace: kube-system -spec: - template: - metadata: - labels: - system: local-storage-provisioner - spec: - containers: - - name: provisioner - image: "k8s.gcr.io/local-storage-provisioner:v1.0" - imagePullPolicy: Always - volumeMounts: - - name: vol1 - mountPath: "/local-ssds" - - name: vol2 - mountPath: "/local-hdds" - env: - - name: MY_NODE_NAME - valueFrom: - fieldRef: - fieldPath: spec.nodeName - volumes: - - name: vol1 - hostPath: - path: "/mnt/ssds" - - name: vol2 - hostPath: - path: "/mnt/hdds" - serviceAccount: local-storage-admin -``` - -##### Provisioner Boostrapper - -Manually setting up this DaemonSet spec can be tedious and it requires duplicate specification -of the StorageClass -> directory mappings both in the ConfigMap and as hostPath volumes. To -make it simpler and less error prone, a boostrapper application will be provided to generate -and launch the provisioner DaemonSet based off of the ConfigMap. It can also create a service -account with all the required permissions. - -The boostrapper accepts the following optional arguments: - -* -image: Name of local volume provisioner image (default -"quay.io/external_storage/local-volume-provisioner:latest") -* -volume-config: Name of the local volume configuration configmap. The configmap must reside in the same -namespace as the bootstrapper. (default "local-volume-default-config") -* -serviceaccount: Name of the service account for local volume provisioner (default "local-storage-admin") - -The boostrapper requires the following permissions: - -* Get/Create/Update ConfigMap -* Create ServiceAccount -* Create ClusterRoleBindings -* Create DaemonSet - -Since the boostrapper generates the DaemonSet spec, the ConfigMap can be simplified to just specify the -host directories: - -``` -kind: ConfigMap -metadata: - name: local-volume-config - namespace: kube-system -data: - storageClassMap: | - local-fast: - hostDir: "/mnt/ssds" - local-slow: - hostDir: "/mnt/hdds" -``` - -The boostrapper will update the ConfigMap with the generated `mountDir`. It generates the `mountDir` -by stripping off the initial "/" in `hostDir`, replacing the remaining "/" with "~", and adding the -prefix path "/mnt/local-storage". - -In the above example, the generated `mountDir` is `/mnt/local-storage/mnt ~ssds` and -`/mnt/local-storage/mnt~hdds`, respectively. 
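Putting the pieces above together, a PV that the provisioner creates for a mount point discovered under `/mnt/ssds` on `node-1` would look roughly like the following. The object name, capacity, and annotation values are illustrative (the real name is a hash of the node name, StorageClass, and base file name, as described earlier); the node-affinity annotation carries the JSON form of `NodeAffinity` from the alpha API changes section.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-a1b2c3d4                    # illustrative; actually a hash of node name, class, and base file name
  annotations:
    pv.kubernetes.io/provisioned-by: local-volume-provisioner   # lets the provisioner recognize PVs it created
    volume.alpha.kubernetes.io/node-affinity: |
      {
        "requiredDuringSchedulingIgnoredDuringExecution": {
          "nodeSelectorTerms": [
            { "matchExpressions": [
              { "key": "kubernetes.io/hostname", "operator": "In", "values": ["node-1"] }
            ] }
          ]
        }
      }
spec:
  capacity:
    storage: 368Gi                           # capacity of the underlying filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete      # released PVs are cleaned up and recreated by the provisioner
  storageClassName: local-fast               # from the ConfigMap mapping above
  local:
    path: /mnt/ssds/vol1                     # the hostDir-based path, not the container's mountDir
```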
- -#### Use Case Deliverables - -This alpha phase for Local PV support will provide the following capabilities: - -* Local directories to be specified as Local PVs with node affinity -* Pod using a PVC that is bound to a Local PV will always be scheduled to that node -* External static provisioner DaemonSet that discovers local directories, creates, cleans up, -and deletes Local PVs - -#### Limitations - -However, some use cases will not work: - -* Specifying multiple Local PVCs in a pod. Most likely, the PVCs will be bound to Local PVs on -different nodes, -making the pod unschedulable. -* Specifying Pod affinity/anti-affinity with Local PVs. PVC binding does not look at Pod scheduling -constraints at all. -* Using Local PVs in a highly utilized cluster. PVC binding does not look at Pod resource requirements -and Node resource availability. - -These issues will be solved in a future release with advanced storage topology scheduling. - -As a workaround, PVCs can be manually prebound to Local PVs to essentially manually schedule Pods to -specific nodes. - -#### Test Cases - -##### API unit tests - -* LocalVolumeSource cannot be specified without the feature gate -* Non-empty PV node affinity is required for LocalVolumeSource -* Preferred node affinity is not allowed -* Path is required to be non-empty -* Invalid json representation of type NodeAffinity returns error - -##### PV node affinity unit tests - -* Nil or empty node affinity evaluates to true for any node -* Node affinity specifying existing node labels evaluates to true -* Node affinity specifying non-existing node label keys evaluates to false -* Node affinity specifying non-existing node label values evaluates to false - -##### Local volume plugin unit tests - -* Plugin can support PersistentVolumeSource -* Plugin cannot support VolumeSource -* Plugin supports ReadWriteOnce access mode -* Plugin does not support remaining access modes -* Plugin supports Mounter and Unmounter -* Plugin does not support Provisioner, Recycler, Deleter -* Plugin supports readonly -* Plugin GetVolumeName() returns PV name -* Plugin ConstructVolumeSpec() returns PV info -* Plugin disallows backsteps in the Path - -##### Local volume provisioner unit tests - -* Directory not in the cache and PV should be created -* Directory is in the cache and PV should not be created -* Directories created later are discovered and PV is created -* Unconfigured directories are ignored -* PVs are created with the configured StorageClass -* PV name generation hashed correctly using node name, storageclass and filename -* PV creation failure should not add directory to cache -* Non-directory type should not create a PV -* PV is released, PV should be deleted -* PV should not be deleted for any other PV phase -* PV deletion failure should not remove PV from cache -* PV cleanup failure should not delete PV or remove from cache - -##### E2E tests - -* Pod that is bound to a Local PV is scheduled to the correct node -and can mount, read, and write -* Two pods serially accessing the same Local PV can mount, read, and write -* Two pods simultaneously accessing the same Local PV can mount, read, and write -* Test both directory-based Local PV, and mount point-based Local PV -* Launch local volume provisioner, create some directories under the discovery path, -and verify that PVs are created and a Pod can mount, read, and write. -* After destroying a PVC managed by the local volume provisioner, it should cleanup -the volume and recreate a new PV. 
-* Pod using a Local PV with non-existent path fails to mount -* Pod that sets nodeName to a different node than the PV node affinity cannot schedule. - - -### Phase 2: 1.9 alpha - -#### Smarter PV binding - -The issue of PV binding not taking into account pod scheduling requirements affects any -type of volume that imposes topology constraints, such as local storage and zonal disks. - -Because this problem affects more than just local volumes, it will be treated as a -separate feature with a separate proposal. Once that feature is implemented, then the -limitations outlined above will be fixed. - -#### Block devices and raw partitions - -Pods accessing raw block storage is a new alpha feature in 1.9. Changes are required in -the Local volume plugin and provisioner to be able to support raw block devices. The local -volume provisioner will be enhanced to support discovery of block devices and creation of -PVs corresponding to those block devices. In addition, when a block device based PV is -released, the local volume provisioner will cleanup the block devices. The cleanup -mechanism will be configurable and also customizable as no single mechanism covers all use -cases. - - -##### Discovery - -Much like the current file based PVs, the local volume provisioner will look for block devices -under designated directories that have been mounted on the provisioner container. Currently, for -each storage class, the provisioner has a configmap entry that looks like this: - -``` -data: - storageClassMap: | - local-fast: - hostDir: "/mnt/disks" - mountDir: "/local-ssds" -``` - -With this current approach, filesystems that were meant to be exposed as PVs are supposed to be -mounted on sub-directories under hostDir and the provisioner running in a container would walk -through the corresponding "mountDir" to find all the PVs. - -For block discovery, we will extend the same approach to enable discovering block devices. The -admin can create symbolic links under hostDir for each block device that should be discovered -under that storage class. The provisioner would use the same configMap and its logic will be -enhanced to auto detect if the entry under the directory is a block device or a file system. If -it is a block device, then a block based PV is created, otherwise a file based PV is created. - -##### Cleanup after Release - -Cleanup of a block device can be a bit more involved for the following reasons: - -* With file based PVs, a quick deletion of all files (inode information) was sufficient, with -block devices one might want to wipe all current content. -* Overwriting SSDs is not guaranteed to securely cleanup all previous content as there is a -layer of indirection in SSDs called the FTL (flash translation layer) and also wear leveling -techniques in SSDs that prevent reliable overwrite of all previous content. -* SSDs can also suffer from wear if they are repeatedly subjected to zeroing out, so one would -need different tools and strategies for HDDs vs SSDs -* A cleanup process which favors overwriting every block in the disk can take several hours. - -For this reason, the cleanup process has been made configurable and extensible, so that admin -can use the most appropriate method for their environment. - -Block device cleanup logic will be encapsulated in separate scripts or binaries. 
There will be -several scripts that will be made available out of the box, for example: - - -| Cleanup Method | Description | Suitable for Device | -|:--------------:|-------------|:-------------------:| -|dd-zero| Used for zeroing the device repeatedly | HDD | -|blkdiscard| Discards sectors on the device. This cleanup method may not be supported by all devices.| SSD | -|fs-reset| A non-secure overwrite of any existing filesystem with mkfs, followed by wipefs to remove the signature of the file system | SSD/HDD | -|shred|Repeatedly writes random values to the block device. Less effective with wear levelling in SSDs.| HDD | -| hdparm| Issues [ATA secure erase](https://ata.wiki.kernel.org/index.php/ATA_Secure_Erase) command to erase data on device. See ATA Secure Erase. Please note that the utility has to be supported by the device in question. | SSD/HDD | - -The fs-reset method is a quick and minimal approach as it does a reset of any file system, which -works for both SSD and HDD and will be the default choice for cleaning. For SSDs, admins could -opt for either blkdiscard which is also quite fast or hdparm. For HDDs they could opt for -dd-zeroing or shred, which can take some time to run. Finally, the user is free to create new -cleanup scripts of their own and have them specified in the configmap of the provisioner. - -The configmap from earlier section will be enhanced as follows -``` -data: - storageClassMap: | - local-fast: - hostDir: "/mnt/disks" - mountDir: "/local-ssds" - blockCleanerCommand: - - "/scripts/dd_zero.sh" - - "2" - ``` - -The block cleaner command will specify the script and any arguments that need to be passed to it. -The actual block device being cleaned will be supplied to the script as an environment variable -(LOCAL_PV_BLKDEVICE) as opposed to command line, so that the script command line has complete -freedom on its structure. The provisioner will validate that the block device path is actually -within the directory managed by the provisioner, to prevent destructive operations on arbitrary -paths. - -The provisioner logic currently does each volume’s cleanup as a synchronous serial activity. -However, with cleanup now potentially being a multi hour activity, the processes will have to -be asynchronous and capable of being executed in parallel. The provisioner will ensure that all -current asynchronous cleanup processes are tracked. Special care needs to be taken to ensure that -when a disk has only been partially cleaned. This scenario can happen if some impatient user -manually deletes a PV and the provisioner ends up re-creating pv ready for use (but only partially -cleaned). This issue will be addressed in the re-design of the provisioner (details will be provided -in the re-design section). The re-design will ensure that all disks being cleaned will be tracked -through custom resources, so no disk being cleaned will be re-created as a PV. - -The provisioner will also log events to let the user know that cleaning is in progress and it can -take some time to complete. - -##### Testing - -The unit tests in the provisioner will be enhanced to test all the new block discover, block cleaning -and asynchronous cleaning logic. The tests include -* Validating that a discovery directory containing both block and file system volumes are appropriately discovered and have PVs created. 
-* Validate that both success and failure of asynchronous cleanup processes are properly tracked by the provisioner -* Ensure a new PV is not created while cleaning of volume behind the PV is still in progress -* Ensure two simultaneous cleaning operations on the same PV do not occur - - In addition, end to end tests will be added to support block cleaning. The tests include: -* Validate block PV are discovered and created -* Validate cleaning of released block PV using each of the block cleaning scripts included. -* Validate that file and block volumes in the same discovery path have correct PVs created, and that they are appropriately cleaned up. -* Leverage block PV via PVC and validate that serially writes data in one pod, then reads and validates the data from a second pod. -* Restart of the provisioner during cleaning operations, and validate that the PV is not recreated by the provisioner until cleaning has occurred. - -#### Provisioner redesign for stricter K8s API access control - -In 1.7, each instance of the provisioner on each node has full permissions to create and -delete all PVs in the system. This is unnecessary and potentially a vulnerability if the -node gets compromised. - -To address this issue, the provisioner will be redesigned into two major components: - -1. A central manager pod that handles the creation and deletion of PV objects. -This central pod can run on a trusted node and be given PV create/delete permissions. -2. Worker pods on each node, run as a DaemonSet, that discovers and cleans up the local -volumes on that node. These workers do not interact with PV objects, however -they still require permissions to be able to read the `Node.Labels` on their node. - -The central manager will poll each worker for their discovered volumes and create PVs for -them. When a PV is released, then it will send the cleanup request to the worker. - -Detailed design TBD +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
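For reference, raw block access ended up being expressed in the core API through a `volumeMode` field rather than anything specific to the local volume plugin, so a block-backed local PV and its consumer might look roughly like the sketch below. This assumes the `volumeMode: Block` and `volumeDevices` fields from that separate raw-block feature; names and paths are illustrative.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-block-pv            # illustrative
spec:
  capacity:
    storage: 100Gi
  volumeMode: Block               # expose the device to pods without a filesystem on it
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-fast
  local:
    path: /mnt/disks/vol-blk1     # symlink to the block device created by the admin under the discovery directory
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: raw-claim
spec:
  volumeMode: Block
  storageClassName: local-fast
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: raw-consumer
spec:
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    volumeDevices:                # block claims are attached with volumeDevices instead of volumeMounts
    - name: data
      devicePath: /dev/xvda       # where the raw device appears inside the container
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: raw-claim
```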
\ No newline at end of file diff --git a/contributors/design-proposals/storage/mount-options.md b/contributors/design-proposals/storage/mount-options.md index 56097269..f0fbec72 100644 --- a/contributors/design-proposals/storage/mount-options.md +++ b/contributors/design-proposals/storage/mount-options.md @@ -1,113 +1,6 @@ -# Mount options for mountable volume types +Design proposals have been archived. -## Goal +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Enable Kubernetes admins to specify mount options with mountable volumes -such as - `nfs`, `glusterfs` or `aws-ebs` etc. -## Motivation - -We currently support network filesystems: NFS, Glusterfs, Ceph FS, SMB (Azure file), Quobytes, and local filesystems such as ext[3|4] and XFS. - -Mount time options that are operationally important and have no security implications should be supported. Examples are NFS's TCP mode, versions, lock mode, caching mode; Glusterfs's caching mode; SMB's version, locking, id mapping; and more. - -## Design - -### Mount option support in Persistent Volume Objects - -Mount options can be specified as a field on PVs. For example: - -``` yaml -apiVersion: v1 -kind: PersistentVolume -metadata: - name: pv0003 -spec: - capacity: - storage: 5Gi - accessModes: - - ReadWriteOnce - persistentVolumeReclaimPolicy: Recycle - mountOptions: - - hard - - nolock - - nfsvers=3 - nfs: - path: /tmp - server: 172.17.0.2 -``` - - -Beta support for mount options introduced via `mount-options` annotation will be supported for near future -and deprecated in future. - - -``` yaml -apiVersion: v1 -kind: PersistentVolume -metadata: - name: pv0003 - annotations: - volume.beta.kubernetes.io/mount-options: "hard,nolock,nfsvers=3" -spec: - capacity: - storage: 5Gi - accessModes: - - ReadWriteOnce - persistentVolumeReclaimPolicy: Recycle - nfs: - path: /tmp - server: 172.17.0.2 -``` - -### Mount option support in Storage Classes - -Kubernetes admin can also specify mount option as a parameter in storage class. - -```yaml -kind: StorageClass -apiVersion: storage.k8s.io/v1 -metadata: - name: slow -provisioner: kubernetes.io/glusterfs -parameters: - type: gp2 -mountOptions: - - auto_mount -``` - -The mount option specified in Storage Class will be used while provisioning persistent volumes -and added as a field to PVs. - -If admin has configured mount option for a storage type that does not support mount options, -then a "provisioning failed" event will be added to PVC and PVC will stay in pending state. - -Also, if configured mount option is invalid then corresponding mount time failure error will be added to pod object. - - -## Preventing users from specifying mount options in inline volume specs of Pod - -While mount options enable more flexibility in how volumes are mounted, it can result -in user specifying options that are not supported or are known to be problematic when -using inline volume specs. - -After much deliberation it was decided that - `mountOptions` as an API parameter will not be supported -for inline volume specs. - -### Error handling and plugins that don't support mount option - -Kubernetes ships with volume plugins that don't support any kind of mount options. 
For example, `configmaps` and `secrets` do not accept mount options; in those cases, to prevent users from
-submitting volume definitions with bogus mount options, plugins can define an interface function such as:
-
-```go
-func SupportsMountOption() bool {
-	return false
-}
-```
-
-which will be used to validate the PV definition; the API object will *only* be created if it passes validation. Additionally,
-support for user-specified mount options will also be checked when volumes are being mounted.
-
-In other cases, where a plugin supports mount options (such as `NFS` or `GlusterFS`) but mounting fails because of an invalid mount
-option or otherwise, an Event API object will be created and attached to the appropriate object.
+Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/storage/persistent-storage.md b/contributors/design-proposals/storage/persistent-storage.md index d91ee256..f0fbec72 100644 --- a/contributors/design-proposals/storage/persistent-storage.md +++ b/contributors/design-proposals/storage/persistent-storage.md @@ -1,288 +1,6 @@ -# Persistent Storage +Design proposals have been archived. -This document proposes a model for managing persistent, cluster-scoped storage -for applications requiring long lived data. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -### Abstract - -Two new API kinds: - -A `PersistentVolume` (PV) is a storage resource provisioned by an administrator. -It is analogous to a node. See [Persistent Volume Guide](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) -for how to use it. - -A `PersistentVolumeClaim` (PVC) is a user's request for a persistent volume to -use in a pod. It is analogous to a pod. - -One new system component: - -`PersistentVolumeClaimBinder` is a singleton running in master that watches all -PersistentVolumeClaims in the system and binds them to the closest matching -available PersistentVolume. The volume manager watches the API for newly created -volumes to manage. - -One new volume: - -`PersistentVolumeClaimVolumeSource` references the user's PVC in the same -namespace. This volume finds the bound PV and mounts that volume for the pod. A -`PersistentVolumeClaimVolumeSource` is, essentially, a wrapper around another -type of volume that is owned by someone else (the system). - -Kubernetes makes no guarantees at runtime that the underlying storage exists or -is available. High availability is left to the storage provider. - -### Goals - -* Allow administrators to describe available storage. -* Allow pod authors to discover and request persistent volumes to use with pods. -* Enforce security through access control lists and securing storage to the same -namespace as the pod volume. -* Enforce quotas through admission control. -* Enforce scheduler rules by resource counting. -* Ensure developers can rely on storage being available without being closely -bound to a particular disk, server, network, or storage device. - -#### Describe available storage - -Cluster administrators use the API to manage *PersistentVolumes*. A custom store -`NewPersistentVolumeOrderedIndex` will index volumes by access modes and sort by -storage capacity. The `PersistentVolumeClaimBinder` watches for new claims for -storage and binds them to an available volume by matching the volume's -characteristics (AccessModes and storage size) to the user's request. - -PVs are system objects and, thus, have no namespace. - -Many means of dynamic provisioning will be eventually be implemented for various -storage types. 
- - -##### PersistentVolume API - -| Action | HTTP Verb | Path | Description | -| ---- | ---- | ---- | ---- | -| CREATE | POST | /api/{version}/persistentvolumes/ | Create instance of PersistentVolume | -| GET | GET | /api/{version}persistentvolumes/{name} | Get instance of PersistentVolume with {name} | -| UPDATE | PUT | /api/{version}/persistentvolumes/{name} | Update instance of PersistentVolume with {name} | -| DELETE | DELETE | /api/{version}/persistentvolumes/{name} | Delete instance of PersistentVolume with {name} | -| LIST | GET | /api/{version}/persistentvolumes | List instances of PersistentVolume | -| WATCH | GET | /api/{version}/watch/persistentvolumes | Watch for changes to a PersistentVolume | - - -#### Request Storage - -Kubernetes users request persistent storage for their pod by creating a -```PersistentVolumeClaim```. Their request for storage is described by their -requirements for resources and mount capabilities. - -Requests for volumes are bound to available volumes by the volume manager, if a -suitable match is found. Requests for resources can go unfulfilled. - -Users attach their claim to their pod using a new -```PersistentVolumeClaimVolumeSource``` volume source. - - -##### PersistentVolumeClaim API - - -| Action | HTTP Verb | Path | Description | -| ---- | ---- | ---- | ---- | -| CREATE | POST | /api/{version}/namespaces/{ns}/persistentvolumeclaims/ | Create instance of PersistentVolumeClaim in namespace {ns} | -| GET | GET | /api/{version}/namespaces/{ns}/persistentvolumeclaims/{name} | Get instance of PersistentVolumeClaim in namespace {ns} with {name} | -| UPDATE | PUT | /api/{version}/namespaces/{ns}/persistentvolumeclaims/{name} | Update instance of PersistentVolumeClaim in namespace {ns} with {name} | -| DELETE | DELETE | /api/{version}/namespaces/{ns}/persistentvolumeclaims/{name} | Delete instance of PersistentVolumeClaim in namespace {ns} with {name} | -| LIST | GET | /api/{version}/namespaces/{ns}/persistentvolumeclaims | List instances of PersistentVolumeClaim in namespace {ns} | -| WATCH | GET | /api/{version}/watch/namespaces/{ns}/persistentvolumeclaims | Watch for changes to PersistentVolumeClaim in namespace {ns} | - - - -#### Scheduling constraints - -Scheduling constraints are to be handled similar to pod resource constraints. -Pods will need to be annotated or decorated with the number of resources it -requires on a node. Similarly, a node will need to list how many it has used or -available. - -TBD - - -#### Events - -The implementation of persistent storage will not require events to communicate -to the user the state of their claim. The CLI for bound claims contains a -reference to the backing persistent volume. This is always present in the API -and CLI, making an event to communicate the same unnecessary. - -Events that communicate the state of a mounted volume are left to the volume -plugins. - -### Example - -#### Admin provisions storage - -An administrator provisions storage by posting PVs to the API. Various ways to -automate this task can be scripted. Dynamic provisioning is a future feature -that can maintain levels of PVs. - -```yaml -POST: - -kind: PersistentVolume -apiVersion: v1 -metadata: - name: pv0001 -spec: - capacity: - storage: 10 - persistentDisk: - pdName: "abc123" - fsType: "ext4" -``` - -```console -$ kubectl get pv - -NAME LABELS CAPACITY ACCESSMODES STATUS CLAIM REASON -pv0001 map[] 10737418240 RWO Pending -``` - -#### Users request storage - -A user requests storage by posting a PVC to the API. 
Their request contains the -AccessModes they wish their volume to have and the minimum size needed. - -The user must be within a namespace to create PVCs. - -```yaml -POST: - -kind: PersistentVolumeClaim -apiVersion: v1 -metadata: - name: myclaim-1 -spec: - accessModes: - - ReadWriteOnce - resources: - requests: - storage: 3 -``` - -```console -$ kubectl get pvc - -NAME LABELS STATUS VOLUME -myclaim-1 map[] pending -``` - - -#### Matching and binding - -The ```PersistentVolumeClaimBinder``` attempts to find an available volume that -most closely matches the user's request. If one exists, they are bound by -putting a reference on the PV to the PVC. Requests can go unfulfilled if a -suitable match is not found. - -```console -$ kubectl get pv - -NAME LABELS CAPACITY ACCESSMODES STATUS CLAIM REASON -pv0001 map[] 10737418240 RWO Bound myclaim-1 / f4b3d283-c0ef-11e4-8be4-80e6500a981e - - -kubectl get pvc - -NAME LABELS STATUS VOLUME -myclaim-1 map[] Bound b16e91d6-c0ef-11e4-8be4-80e6500a981e -``` - -A claim must request access modes and storage capacity. This is because internally PVs are -indexed by their `AccessModes`, and target PVs are, to some degree, sorted by their capacity. -A claim may request one of more of the following attributes to better match a PV: volume name, selectors, -and volume class (currently implemented as an annotation). - -A PV may define a `ClaimRef` which can greatly influence (but does not absolutely guarantee) which -PVC it will match. -A PV may also define labels, annotations, and a volume class (currently implemented as an -annotation) to better target PVCs. - -As of Kubernetes version 1.4, the following algorithm describes in more details how a claim is -matched to a PV: - -1. Only PVs with `accessModes` equal to or greater than the claim's requested `accessModes` are considered. -"Greater" here means that the PV has defined more modes than needed by the claim, but it also defines -the mode requested by the claim. - -1. The potential PVs above are considered in order of the closest access mode match, with the best case -being an exact match, and a worse case being more modes than requested by the claim. - -1. Each PV above is processed. If the PV has a `claimRef` matching the claim, *and* the PV's capacity -is not less than the storage being requested by the claim then this PV will bind to the claim. Done. - -1. Otherwise, if the PV has the "volume.alpha.kubernetes.io/storage-class" annotation defined then it is -skipped and will be handled by Dynamic Provisioning. - -1. Otherwise, if the PV has a `claimRef` defined, which can specify a different claim or simply be a -placeholder, then the PV is skipped. - -1. Otherwise, if the claim is using a selector but it does *not* match the PV's labels (if any) then the -PV is skipped. But, even if a claim has selectors which match a PV that does not guarantee a match -since capacities may differ. - -1. Otherwise, if the PV's "volume.beta.kubernetes.io/storage-class" annotation (which is a placeholder -for a volume class) does *not* match the claim's annotation (same placeholder) then the PV is skipped. -If the annotations for the PV and PVC are empty they are treated as being equal. - -1. Otherwise, what remains is a list of PVs that may match the claim. Within this list of remaining PVs, -the PV with the smallest capacity that is also equal to or greater than the claim's requested storage -is the matching PV and will be bound to the claim. Done. 
In the case of two or more PVCs matching all -of the above criteria, the first PV (remember the PV order is based on `accessModes`) is the winner. - -*Note:* if no PV matches the claim and the claim defines a `StorageClass` (or a default -`StorageClass` has been defined) then a volume will be dynamically provisioned. - -#### Claim usage - -The claim holder can use their claim as a volume. The ```PersistentVolumeClaimVolumeSource``` knows to fetch the PV backing the claim -and mount its volume for a pod. - -The claim holder owns the claim and its data for as long as the claim exists. -The pod using the claim can be deleted, but the claim remains in the user's -namespace. It can be used again and again by many pods. - -```yaml -POST: - -kind: Pod -apiVersion: v1 -metadata: - name: mypod -spec: - containers: - - image: nginx - name: myfrontend - volumeMounts: - - mountPath: "/var/www/html" - name: mypd - volumes: - - name: mypd - source: - persistentVolumeClaim: - accessMode: ReadWriteOnce - claimRef: - name: myclaim-1 -``` - -#### Releasing a claim and Recycling a volume - -When a claim holder is finished with their data, they can delete their claim. - -```console -$ kubectl delete pvc myclaim-1 -``` - -The ```PersistentVolumeClaimBinder``` will reconcile this by removing the claim -reference from the PV and change the PVs status to 'Released'. - -Admins can script the recycling of released volumes. Future dynamic provisioners -will understand how a volume should be recycled. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
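To illustrate the matching rules above, the following pair would bind: the claim's selector matches the PV's labels, the access modes are compatible, the `volume.beta.kubernetes.io/storage-class` annotations agree, and the PV's capacity is at least the requested size. Names, labels, and addresses are illustrative.

```yaml
kind: PersistentVolume
apiVersion: v1
metadata:
  name: pv0002                                      # illustrative
  labels:
    tier: gold                                      # targeted by the claim's selector
  annotations:
    volume.beta.kubernetes.io/storage-class: slow   # volume class placeholder; must match the claim's annotation
spec:
  capacity:
    storage: 10Gi                                   # smallest available PV that is >= the 5Gi request wins
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Recycle
  nfs:
    path: /exports/vol2
    server: 172.17.0.3
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: myclaim-2
  annotations:
    volume.beta.kubernetes.io/storage-class: slow
spec:
  accessModes:
  - ReadWriteOnce
  selector:
    matchLabels:
      tier: gold
  resources:
    requests:
      storage: 5Gi
```

Had the PV carried a `claimRef` naming a different claim, or a different class annotation, it would have been skipped by the corresponding steps above.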
\ No newline at end of file diff --git a/contributors/design-proposals/storage/pod-safety.md b/contributors/design-proposals/storage/pod-safety.md index 3976e47b..f0fbec72 100644 --- a/contributors/design-proposals/storage/pod-safety.md +++ b/contributors/design-proposals/storage/pod-safety.md @@ -1,402 +1,6 @@ -# Pod Safety, Consistency Guarantees, and Storage Implications +Design proposals have been archived. -@smarterclayton @bprashanth +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -October 2016 - -## Proposal and Motivation - -A pod represents the finite execution of one or more related processes on the -cluster. In order to ensure higher level consistent controllers can safely -build on top of pods, the exact guarantees around its lifecycle on the cluster -must be clarified, and it must be possible for higher order controllers -and application authors to correctly reason about the lifetime of those -processes and their access to cluster resources in a distributed computing -environment. - -To run most clustered software on Kubernetes, it must be possible to guarantee -**at most once** execution of a particular pet pod at any time on the cluster. -This allows the controller to prevent multiple processes having access to -shared cluster resources believing they are the same entity. When a node -containing a pet is partitioned, the Pet Set must remain consistent (no new -entity will be spawned) but may become unavailable (cluster no longer has -a sufficient number of members). The Pet Set guarantee must be strong enough -for an administrator to reason about the state of the cluster by observing -the Kubernetes API. - -In order to reconcile partitions, an actor (human or automated) must decide -when the partition is unrecoverable. The actor may be informed of the failure -in an unambiguous way (e.g. the node was destroyed by a meteor) allowing for -certainty that the processes on that node are terminated, and thus may -resolve the partition by deleting the node and the pods on the node. -Alternatively, the actor may take steps to ensure the partitioned node -cannot return to the cluster or access shared resources - this is known -as **fencing** and is a well understood domain. - -This proposal covers the changes necessary to ensure: - -* Pet Sets can ensure **at most one** semantics for each individual pet -* Other system components such as the node and namespace controller can - safely perform their responsibilities without violating that guarantee -* An administrator or higher level controller can signal that a node - partition is permanent, allowing the Pet Set controller to proceed. -* A fencing controller can take corrective action automatically to heal - partitions - -We will accomplish this by: - -* Clarifying which components are allowed to force delete pods (as opposed - to merely requesting termination) -* Ensuring system components can observe partitioned pods and nodes - correctly -* Defining how a fencing controller could safely interoperate with - partitioned nodes and pods to safely heal partitions -* Describing how shared storage components without innate safety - guarantees can be safely shared on the cluster. 
- - -### Current Guarantees for Pod lifecycle - -The existing pod model provides the following guarantees: - -* A pod is executed on exactly one node -* A pod has the following lifecycle phases: - * Creation - * Scheduling - * Execution - * Init containers - * Application containers - * Termination - * Deletion -* A pod can only move through its phases in order, and may not return - to an earlier phase. -* A user may specify an interval on the pod called the **termination - grace period** that defines the minimum amount of time the pod will - have to complete the termination phase, and all components will honor - this interval. -* Once a pod begins termination, its termination grace period can only - be shortened, not lengthened. - -Pod termination is divided into the following steps: - -* A component requests the termination of the pod by issuing a DELETE - to the pod resource with an optional **grace period** - * If no grace period is provided, the default from the pod is leveraged -* When the kubelet observes the deletion, it starts a timer equal to the - grace period and performs the following actions: - * Executes the pre-stop hook, if specified, waiting up to **grace period** - seconds before continuing - * Sends the termination signal to the container runtime (SIGTERM or the - container image's STOPSIGNAL on Docker) - * Waits 2 seconds, or the remaining grace period, whichever is longer - * Sends the force termination signal to the container runtime (SIGKILL) -* Once the kubelet observes the container is fully terminated, it issues - a status update to the REST API for the pod indicating termination, then - issues a DELETE with grace period = 0. - -If the kubelet crashes during the termination process, it will restart the -termination process from the beginning (grace period is reset). This ensures -that a process is always given **at least** grace period to terminate cleanly. - -A user may re-issue a DELETE to the pod resource specifying a shorter grace -period, but never a longer one. - -Deleting a pod with grace period 0 is called **force deletion** and will -update the pod with a `deletionGracePeriodSeconds` of 0, and then immediately -remove the pod from etcd. Because all communication is asynchronous, -force deleting a pod means that the pod processes may continue -to run for an arbitrary amount of time. If a higher level component like the -StatefulSet controller treats the existence of the pod API object as a strongly -consistent entity, deleting the pod in this fashion will violate the -at-most-one guarantee we wish to offer for pet sets. - - -### Guarantees provided by replica sets and replication controllers - -ReplicaSets and ReplicationControllers both attempt to **preserve availability** -of their constituent pods over ensuring at most one (of a pod) semantics. So a -replica set to scale 1 will immediately create a new pod when it observes an -old pod has begun graceful deletion, and as a result at many points in the -lifetime of a replica set there will be 2 copies of a pod's processes running -concurrently. Only access to exclusive resources like storage can prevent that -simultaneous execution. - -Deployments, being based on replica sets, can offer no stronger guarantee. - - -### Concurrent access guarantees for shared storage - -A persistent volume that references a strongly consistent storage backend -like AWS EBS, GCE PD, OpenStack Cinder, or Ceph RBD can rely on the storage -API to prevent corruption of the data due to simultaneous access by multiple -clients. 
However, many commonly deployed storage technologies in the -enterprise offer no such consistency guarantee, or much weaker variants, and -rely on complex systems to control which clients may access the storage. - -If a PV is assigned a iSCSI, Fibre Channel, or NFS mount point and that PV -is used by two pods on different nodes simultaneously, concurrent access may -result in corruption, even if the PV or PVC is identified as "read write once". -PVC consumers must ensure these volume types are *never* referenced from -multiple pods without some external synchronization. As described above, it -is not safe to use persistent volumes that lack RWO guarantees with a -replica set or deployment, even at scale 1. - - -## Proposed changes - -### Avoid multiple instances of pods - -To ensure that the Pet Set controller can safely use pods and ensure at most -one pod instance is running on the cluster at any time for a given pod name, -it must be possible to make pod deletion strongly consistent. - -To do that, we will: - -* Give the Kubelet sole responsibility for normal deletion of pods - - only the Kubelet in the course of normal operation should ever remove a - pod from etcd (only the Kubelet should force delete) - * The kubelet must not delete the pod until all processes are confirmed - terminated. - * The kubelet SHOULD ensure all consumed resources on the node are freed - before deleting the pod. -* Application owners must be free to force delete pods, but they *must* - understand the implications of doing so, and all client UI must be able - to communicate those implications. - * Force deleting a pod may cause data loss (two instances of the same - pod process may be running at the same time) -* All existing controllers in the system must be limited to signaling pod - termination (starting graceful deletion), and are not allowed to force - delete a pod. - * The node controller will no longer be allowed to force delete pods - - it may only signal deletion by beginning (but not completing) a - graceful deletion. - * The GC controller may not force delete pods - * The namespace controller used to force delete pods, but no longer - does so. This means a node partition can block namespace deletion - indefinitely. - * The pod GC controller may continue to force delete pods on nodes that - no longer exist if we treat node deletion as confirming permanent - partition. If we do not, the pod GC controller must not force delete - pods. -* It must be possible for an administrator to effectively resolve partitions - manually to allow namespace deletion. -* Deleting a node from etcd should be seen as a signal to the cluster that - the node is permanently partitioned. We must audit existing components - to verify this is the case. - * The PodGC controller has primary responsibility for this - it already - owns the responsibility to delete pods on nodes that do not exist, and - so is allowed to force delete pods on nodes that do not exist. - * The PodGC controller must therefore always be running and will be - changed to always be running for this responsibility in a >=1.5 - cluster. - -In the above scheme, force deleting a pod releases the lock on that pod and -allows higher level components to proceed to create a replacement. - -It has been requested that force deletion be restricted to privileged users. -That limits the application owner in resolving partitions when the consequences -of force deletion are understood, and not all application owners will be -privileged users. 
For example, a user may be running a 3 node etcd cluster in a -pet set. If pet 2 becomes partitioned, the user can instruct etcd to remove -pet 2 from the cluster (via direct etcd membership calls), and because a quorum -exists pets 0 and 1 can safely accept that action. The user can then force -delete pet 2 and the pet set controller will be able to recreate that pet on -another node and have it join the cluster safely (pets 0 and 1 constitute a -quorum for membership change). - -This proposal does not alter the behavior of finalizers - instead, it makes -finalizers unnecessary for common application cases (because the cluster only -deletes pods when safe). - -### Fencing - -The changes above allow Pet Sets to ensure at-most-one pod, but provide no -recourse for the automatic resolution of cluster partitions during normal -operation. For that, we propose a **fencing controller** which exists above -the current controller plane and is capable of detecting and automatically -resolving partitions. The fencing controller is an agent empowered to make -similar decisions as a human administrator would make to resolve partitions, -and to take corresponding steps to prevent a dead machine from coming back -to life automatically. - -Fencing controllers most benefit services that are not innately replicated -by reducing the amount of time it takes to detect a failure of a node or -process, isolate that node or process so it cannot initiate or receive -communication from clients, and then spawn another process. It is expected -that many StatefulSets of size 1 would prefer to be fenced, given that most -applications in the real world of size 1 have no other alternative for HA -except reducing mean-time-to-recovery. - -While the methods and algorithms may vary, the basic pattern would be: - -1. Detect a partitioned pod or node via the Kubernetes API or via external - means. -2. Decide whether the partition justifies fencing based on priority, policy, or - service availability requirements. -3. Fence the node or any connected storage using appropriate mechanisms. - -For this proposal we only describe the general shape of detection and how existing -Kubernetes components can be leveraged for policy, while the exact implementation -and mechanisms for fencing are left to a future proposal. A future fencing controller -would be able to leverage a number of systems including but not limited to: - -* Cloud control plane APIs such as machine force shutdown -* Additional agents running on each host to force kill process or trigger reboots -* Agents integrated with or communicating with hypervisors running hosts to stop VMs -* Hardware IPMI interfaces to reboot a host -* Rack level power units to power cycle a blade -* Network routers, backplane switches, software defined networks, or system firewalls -* Storage server APIs to block client access - -to appropriately limit the ability of the partitioned system to impact the cluster. -Fencing agents today use many of these mechanisms to allow the system to make -progress in the event of failure. The key contribution of Kubernetes is to define -a strongly consistent pattern whereby fencing agents can be plugged in. - -To allow users, clients, and automated systems like the fencing controllers to -observe partitions, we propose an additional responsibility to the node controller -or any future controller that attempts to detect partition. 
The node controller should -add an additional condition to pods that have been terminated due to a node failing -to heartbeat that indicates that the cause of the deletion was node partition. - -It may be desirable for users to be able to request fencing when they suspect a -component is malfunctioning. It is outside the scope of this proposal but would -allow administrators to take an action that is safer than force deletion, and -decide at the end whether to force delete. - -How the fencing controller decides to fence is left undefined, but it is likely -it could use a combination of pod forgiveness (as a signal of how much disruption -a pod author is likely to accept) and pod disruption budget (as a measurement of -the amount of disruption already undergone) to measure how much latency between -failure and fencing the app is willing to tolerate. Likewise, it can use its own -understanding of the latency of the various failure detectors - the node controller, -any hypothetical information it gathers from service proxies or node peers, any -heartbeat agents in the system - to describe an upper bound on reaction. - - -### Storage Consistency - -To ensure that shared storage without implicit locking be safe for RWO access, the -Kubernetes storage subsystem should leverage the strong consistency available through -the API server and prevent concurrent execution for some types of persistent volumes. -By leveraging existing concepts, we can allow the scheduler and the kubelet to enforce -a guarantee that an RWO volume can be used on at-most-one node at a time. - -In order to properly support region and zone specific storage, Kubernetes adds node -selector restrictions to pods derived from the persistent volume. Expanding this -concept to volume types that have no external metadata to read (NFS, iSCSI) may -result in adding a label selector to PVs that defines the allowed nodes the storage -can run on (this is a common requirement for iSCSI, FibreChannel, or NFS clusters). - -Because all nodes in a Kubernetes cluster possess a special node name label, it would -be possible for a controller to observe the scheduling decision of a pod using an -unsafe volume and "attach" that volume to the node, and also observe the deletion of -the pod and "detach" the volume from the node. The node would then require that these -unsafe volumes be "attached" before allowing pod execution. Attach and detach may -be recorded on the PVC or PV as a new field or materialized via the selection labels. - -Possible sequence of operations: - -1. Cluster administrator creates a RWO iSCSI persistent volume, available only to - nodes with the label selector `storagecluster=iscsi-1` -2. User requests an RWO volume and is bound to the iSCSI volume -3. The user creates a pod referencing the PVC -4. The scheduler observes the pod must schedule on nodes with `storagecluster=iscsi-1` - (alternatively this could be enforced in admission) and binds to node `A` -5. The kubelet on node `A` observes the pod references a PVC that specifies RWO which - requires "attach" to be successful -6. The attach/detach controller observes that a pod has been bound with a PVC that - requires "attach", and attempts to execute a compare and swap update on the PVC/PV - attaching it to node `A` and pod 1 -7. The kubelet observes the attach of the PVC/PV and executes the pod -8. The user terminates the pod -9. The user creates a new pod that references the PVC -10. 
The scheduler binds this new pod to node `B`, which also has `storagecluster=iscsi-1` -11. The kubelet on node `B` observes the new pod, but sees that the PVC/PV is bound - to node `A` and so must wait for detach -12. The kubelet on node `A` completes the deletion of pod 1 -13. The attach/detach controller observes the first pod has been deleted and that the - previous attach of the volume to pod 1 is no longer valid - it performs a CAS - update on the PVC/PV clearing its attach state. -14. The attach/detach controller observes the second pod has been scheduled and - attaches it to node `B` and pod 2 -15. The kubelet on node `B` observes the attach and allows the pod to execute. - -If a partition occurred after step 11, the attach controller would block waiting -for the pod to be deleted, and prevent node `B` from launching the second pod. -The fencing controller, upon observing the partition, could signal the iSCSI servers -to firewall node `A`. Once that firewall is in place, the fencing controller could -break the PVC/PV attach to node `A`, allowing steps 13 onwards to continue. - - -### User interface changes - -Clients today may assume that force deletions are safe. We must appropriately -audit clients to identify this behavior and improve the messages. For instance, -`kubectl delete --grace-period=0` could print a warning and require `--confirm`: - -``` -$ kubectl delete pod foo --grace-period=0 -warning: Force deleting a pod does not wait for the pod to terminate, meaning - your containers will be stopped asynchronously. Pass --confirm to - continue -``` - -Likewise, attached volumes would require new semantics to allow the attachment -to be broken. - -Clients should communicate partitioned state more clearly - changing the status -column of a pod list to contain the condition indicating NodeDown would help -users understand what actions they could take. - - -## Backwards compatibility - -On an upgrade, pet sets would not be "safe" until the above behavior is implemented. -All other behaviors should remain as-is. - - -## Testing - -All of the above implementations propose to ensure pods can be treated as components -of a strongly consistent cluster. Since formal proofs of correctness are unlikely in -the foreseeable future, Kubernetes must empirically demonstrate the correctness of -the proposed systems. Automated testing of the mentioned components should be -designed to expose ordering and consistency flaws in the presence of - -* Master-node partitions -* Node-node partitions -* Master-etcd partitions -* Concurrent controller execution -* Kubelet failures -* Controller failures - -A test suite that can perform these tests in combination with real world pet sets -would be desirable, although possibly non-blocking for this proposal. - - -## Documentation - -We should document the lifecycle guarantees provided by the cluster in a clear -and unambiguous way to end users. - - -## Deferred issues - -* Live migration continues to be unsupported on Kubernetes for the foreseeable - future, and no additional changes will be made to this proposal to account for - that feature. 
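To make the partition signal described above concrete, the following is a hypothetical sketch of how a controller could attach such a condition to a pod. The condition type `NodeDown` is borrowed from the kubectl status-column discussion above and is illustrative rather than a settled API; the client-go calls assume a recent library version.

```go
// Sketch only: one possible way for the node controller (or a future
// partition detector) to record on a pod that its node stopped heartbeating,
// so users and a fencing controller can observe the partition via the API.
package podsafety

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func markPodNodeDown(ctx context.Context, c kubernetes.Interface, pod *v1.Pod) error {
	cond := v1.PodCondition{
		Type:               v1.PodConditionType("NodeDown"), // hypothetical condition type
		Status:             v1.ConditionTrue,
		Reason:             "NodeMissedHeartbeats",
		Message:            "node stopped reporting status; pod state is unknown",
		LastTransitionTime: metav1.Now(),
	}
	pod.Status.Conditions = append(pod.Status.Conditions, cond)
	_, err := c.CoreV1().Pods(pod.Namespace).UpdateStatus(ctx, pod, metav1.UpdateOptions{})
	return err
}
```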
- - -## Open Questions - -* Should node deletion be treated as "node was down and all processes terminated" - * Pro: it's a convenient signal that we use in other places today - * Con: the kubelet recreates its Node object, so if a node is partitioned and - the admin deletes the node, when the partition is healed the node would be - recreated, and the processes are *definitely* not terminated - * Implies we must alter the pod GC controller to only signal graceful deletion, - and only to flag pods on nodes that don't exist as partitioned, rather than - force deleting them. - * Decision: YES - captured above. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
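For the storage-consistency sequence above, the "attach" compare-and-swap on the PVC/PV could be prototyped with nothing more than the API server's optimistic concurrency. The sketch below is illustrative only: the annotation keys are invented for the example, and a real design would likely use dedicated fields on the PV or PVC rather than annotations.

```go
// Illustrative sketch of the compare-and-swap step: record which node and pod
// an unsafe RWO volume is attached to, relying on the API server's
// resourceVersion conflict check so two controllers cannot attach the same
// volume concurrently.
package podsafety

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func attachUnsafeVolume(ctx context.Context, c kubernetes.Interface, pvName, node, podUID string) error {
	pv, err := c.CoreV1().PersistentVolumes().Get(ctx, pvName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if owner := pv.Annotations["example.kubernetes.io/attached-node"]; owner != "" && owner != node {
		return fmt.Errorf("volume %s is still attached to node %s; waiting for detach", pvName, owner)
	}
	if pv.Annotations == nil {
		pv.Annotations = map[string]string{}
	}
	pv.Annotations["example.kubernetes.io/attached-node"] = node
	pv.Annotations["example.kubernetes.io/attached-pod"] = podUID
	// Update fails with a conflict if another controller modified the PV since
	// the Get above, which is what makes this a compare-and-swap.
	_, err = c.CoreV1().PersistentVolumes().Update(ctx, pv, metav1.UpdateOptions{})
	return err
}
```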
\ No newline at end of file diff --git a/contributors/design-proposals/storage/postpone-pv-deletion.md b/contributors/design-proposals/storage/postpone-pv-deletion.md index d990ecc4..f0fbec72 100644 --- a/contributors/design-proposals/storage/postpone-pv-deletion.md +++ b/contributors/design-proposals/storage/postpone-pv-deletion.md @@ -1,71 +1,6 @@ -# Postpone deletion of a Persistent Volume if it is bound by a PVC +Design proposals have been archived. -Status: Pending +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Version: Beta -Implementation Owner: NickrenREN@ - -## Motivation - -Admin can delete a Persistent Volume (PV) that is being used by a PVC. It may result in data loss. - -## Proposal - -Postpone the PV deletion until the PV is not used by any PVC. - - -## User Experience -### Use Cases - -* Admin deletes a PV that is being used by a PVC and a pod referring that PVC is not aware of this. -This may result in data loss. As a user, I do not want to experience data loss. - -## Implementation - -### API Server, PV Admission Controller, PV Create: - -We can rename and reuse the PVC admission controller, let it automatically add finalizer information into newly created PV's and PVC's metadata. - -PVC protection proposal: [PVC protection](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/postpone-pvc-deletion-if-used-in-a-pod.md) - -#### PV controller: - -When we look for best matched PV for the PVC, if the PV has the deletionTimestamp set, we will not choose it even if the PV satisfies all the PVC’s requirement. - -For pre-bound PVC/PV, if the PV has the deletionTimestamp set, we will not perform the `bind` operation and keep the PVC `Pending`. - -#### PV Protection Controller: - -PV protection controller is a new internal controller. - -Since we already have PV controller which is responsible for synchronizing PVs and PVCs, here in PV protection controller, -we can just watch PV events. - -PV protection controller watches for PV events that are processed as described below: - -* PV add/update/delete events: - * If deletionTimestamp is nil and finalizer is missing, the PV is added to PV queue. - * If deletionTimestamp is non-nil and finalizer is present, the PV is added to PV queue. - -PV information is kept in a cache that is done inherently for an informer. - -The PV queue holds PVs that need to be processed according to the below rules: - -* If PV is not found in cache, the PV is skipped. -* If PV is in cache with nil deletionTimestamp and missing finalizer, finalizer is added to the PV. In case the adding finalizer operation fails, the PV is re-queued into the PV queue. -* If PV is in cache with non-nil deletionTimestamp and finalizer is present, we try to get the PV's status, if it is not `bound`(synchronized by PV controller), the finalizer removal is attempted. The PV will be re-queued if the finalizer removal operation fails. - -#### CLI: - -If a PV’s deletionTimestamp is set, the commands kubectl get pv and kubectl describe pv will display that the PV is in terminating state. - - -### Client/Server Backwards/Forwards compatibility - -N/A - -## Alternatives considered - -### Add this logic to the existing PV controller instead of creating a new admission and protection controller -When we bind PV to PVC, we add finalizer for PV and remove finalizer when PV is no longer bound to a PVC. 
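A minimal sketch of the PV protection queue rules described above, assuming a recent client-go. The finalizer name matches the one the in-tree controller eventually adopted (`kubernetes.io/pv-protection`), but the code is illustrative rather than the actual implementation.

```go
// Sketch of the PV protection sync loop: add the finalizer to live PVs, and
// remove it only once a PV marked for deletion is no longer Bound, so the
// deletion can complete. A failed update simply re-queues the PV.
package pvprotection

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const pvProtectionFinalizer = "kubernetes.io/pv-protection"

func syncPV(ctx context.Context, c kubernetes.Interface, pv *v1.PersistentVolume) error {
	hasFinalizer := containsString(pv.Finalizers, pvProtectionFinalizer)

	switch {
	case pv.DeletionTimestamp == nil && !hasFinalizer:
		// Live PV without the finalizer: add it so deletion is postponed later.
		pv.Finalizers = append(pv.Finalizers, pvProtectionFinalizer)
	case pv.DeletionTimestamp != nil && hasFinalizer && pv.Status.Phase != v1.VolumeBound:
		// PV is being deleted and is no longer bound: let deletion finish.
		pv.Finalizers = removeString(pv.Finalizers, pvProtectionFinalizer)
	default:
		return nil
	}
	_, err := c.CoreV1().PersistentVolumes().Update(ctx, pv, metav1.UpdateOptions{})
	return err
}

func containsString(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

func removeString(list []string, s string) []string {
	out := list[:0]
	for _, v := range list {
		if v != s {
			out = append(out, v)
		}
	}
	return out
}
```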
+Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/storage/postpone-pvc-deletion-if-used-in-a-pod.md b/contributors/design-proposals/storage/postpone-pvc-deletion-if-used-in-a-pod.md index 4885bcbf..f0fbec72 100644 --- a/contributors/design-proposals/storage/postpone-pvc-deletion-if-used-in-a-pod.md +++ b/contributors/design-proposals/storage/postpone-pvc-deletion-if-used-in-a-pod.md @@ -1,106 +1,6 @@ -# Postpone Deletion of a Persistent Volume Claim in case It Is Used by a Pod +Design proposals have been archived. -Status: Proposal +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Version: GA -Implementation Owner: @pospispa - -## Motivation - -User can delete a Persistent Volume Claim (PVC) that is being used by a pod. This may have negative impact on the pod and it may result in data loss. - -For more details see issue https://github.com/kubernetes/kubernetes/issues/45143 - -## Proposal - -Postpone the PVC deletion until the PVC is not used by any pod. - -## User Experience - -### Use Cases - -1. User deletes a PVC that is being used by a pod. This may have negative impact on the pod and may result in data loss. As a user, I want that any PVC deletion does not have any negative impact on any pod. As a user, I do not want to experience data loss. - -#### Scenarios for data loss -Depending on the storage type the data loss occurs in one of the below scenarios: -- in case the dynamic provisioning is used and reclaim policy is `Delete` the PVC deletion triggers deletion of the associated storage asset and PV. -- the same as above applies for the static provisioning and `Delete` reclaim policy. - -## Implementation - -### API Server, PVC Admission Controller, PVC Create -A new plugin for PVC admission controller will be created. The plugin will automatically add finalizer information into newly created PVC's metadata. - -### Scheduler -Scheduler will check if a pod uses a PVC and if any of the PVCs has `deletionTimestamp` set. In case this is true an error will be logged: "PVC (%pvcName) is in scheduled for deletion state" and scheduler will behave as if PVC was not found. - -### Kubelet -Kubelet does currently live lookup of PVC(s) that are used by a pod. - -In case any of the PVC(s) used by the pod has the `deletionTimestamp` set kubelet won't start the pod but will report and error: "can't start pod (%pod) because it's using PVC (%pvcName) that is being deleted". Kubelet will follow the same code path as if PVC(s) do not exist. - -### PVC Finalizing Controller -PVC finalizing controller is a new internal controller. - -PVC finalizing controller watches for both PVC and pod events that are processed as described below: -1. PVC add/update/delete events: - - If `deletionTimestamp` is `nil` and finalizer is missing, the PVC is added to PVC queue. - - If `deletionTimestamp` is `non-nil` and finalizer is present, the PVC is added to PVC queue. -2. Pod add events: - - If pod is terminated, all referenced PVCs are added to PVC queue. -3. Pod update events: - - If pod is changing from non-terminated to terminated state, all referenced PVCs are added to PVC queue. -4. Pod delete events: - - All referenced PVCs are added to PVC queue. - -PVC and pod information are kept in a cache that is done inherently for an informer. - -The PVC queue holds PVCs that need to be processed according to the below rules: -- If PVC is not found in cache, the PVC is skipped. 
-- If PVC is in cache with `nil` `deletionTimestamp` and missing finalizer, finalizer is added to the PVC. In case the adding finalizer operation fails, the PVC is re-queued into the PVC queue.
-- If PVC is in cache with `non-nil` `deletionTimestamp` and finalizer is present, a live pod list is done for the PVC namespace. If all pods referencing the PVC are not yet bound to a node or are terminated, the finalizer removal is attempted. In case the finalizer removal operation fails, the PVC is re-queued.
-
-### CLI
-In case a PVC has the `deletionTimestamp` set, the commands `kubectl get pvc` and `kubectl describe pvc` will display that the PVC is in terminating state.
-
-### Client/Server Backwards/Forwards compatibility
-
-N/A
-
-## Alternatives considered
-
-1. Check in the admission controller whether a PVC can be deleted by listing all pods and checking if the PVC is used by a pod. This was discussed and rejected in PR https://github.com/kubernetes/kubernetes/pull/46573
-
-Further alternatives were discussed in issue https://github.com/kubernetes/kubernetes/issues/45143
-
-### Scheduler Live Lookups PVC(s) Instead of Kubelet
-The implementation proposes that the kubelet does a live lookup of the PVC(s) used by a pod before it starts the pod, so that it does not start a pod that uses a PVC that has the `deletionTimestamp` set.
-
-An alternative is that the scheduler does the live lookup of the PVC(s) used by a pod, so that it does not schedule a pod that uses a PVC that has the `deletionTimestamp` set.
-
-Either way, the live lookup is a performance penalty. Because the kubelet already pays this penalty today, it is better to keep the live lookup in the kubelet.
-
-### Scheduler Maintains PVCUsedByPod Information in PVC
-The scheduler will maintain information on both pods and PVCs from the API server.
-
-In case a pod is being scheduled and uses PVCs that do not have the PVCUsedByPod condition set, the scheduler will set this condition for these PVCs.
-
-In case a pod is terminated and was using PVCs, the scheduler will update the PVCUsedByPod condition for these PVCs accordingly.
-
-The PVC finalizing controller won't watch pods because the information on whether a PVC is used by a pod is now maintained by the scheduler.
-
-In case the PVC finalizing controller gets an update of a PVC and this PVC has the `deletionTimestamp` set, it will do a live PVC lookup to get the up-to-date value of its PVCUsedByPod condition. In case PVCUsedByPod is not true, it will remove the finalizer information from this PVC.
-
-### Scheduler In the Role of PVC Finalizing Controller
-The scheduler will be responsible for removing the finalizer information from PVCs that are being deleted.
-
-So the scheduler will watch pods and PVCs and will maintain an internal cache of pods and PVCs.
-
-In case a PVC is deleted, the scheduler will do one of the below:
-- In case the PVC is used by a pod, it will add the PVC to its internal set of PVCs that are waiting for deletion.
-- In case the PVC is not used by a pod, it will remove the finalizer information from the PVC metadata.
-
-Note: the scheduler is the source of truth for pods that are being started. Its information on active pods may be slightly outdated, which can postpone the deletion of a PVC (the pod status in the scheduler is active while the pod is terminated in the API server), but this does not cause any harm.
-
-The disadvantage is that the scheduler would become responsible for postponing PVC deletion, which would make the scheduler bigger.
+Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
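The live lookup the scheduler and kubelet are described as performing above reduces to a small check. The sketch below is illustrative only: it assumes a recent client-go API shape, and the error text is invented.

```go
// Sketch: reject a pod if any PVC it references is already marked for
// deletion, i.e. the PVC is only kept alive by the protection finalizer and
// must not be handed to a new pod.
package pvcprotection

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func checkPVCsNotTerminating(ctx context.Context, c kubernetes.Interface, pod *v1.Pod) error {
	for _, vol := range pod.Spec.Volumes {
		src := vol.PersistentVolumeClaim
		if src == nil {
			continue
		}
		// Live lookup against the API server, not the informer cache.
		pvc, err := c.CoreV1().PersistentVolumeClaims(pod.Namespace).Get(ctx, src.ClaimName, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if pvc.DeletionTimestamp != nil {
			return fmt.Errorf("PVC %s/%s is scheduled for deletion", pod.Namespace, pvc.Name)
		}
	}
	return nil
}
```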
\ No newline at end of file diff --git a/contributors/design-proposals/storage/pv-to-rbd-mapping.md b/contributors/design-proposals/storage/pv-to-rbd-mapping.md index 8071cbbe..f0fbec72 100644 --- a/contributors/design-proposals/storage/pv-to-rbd-mapping.md +++ b/contributors/design-proposals/storage/pv-to-rbd-mapping.md @@ -1,128 +1,6 @@ -# RBD Volume to PV Mapping +Design proposals have been archived. -Authors: krmayankk@ +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -### Problem -The RBD Dynamic Provisioner currently generates rbd volume names which are random. -The current implementation generates a UUID and the rbd image name becomes -image := fmt.Sprintf("kubernetes-dynamic-pvc-%s", uuid.NewUUID()). This RBD image -name is stored in the PV. The PV also has a reference to the PVC to which it binds. -The problem with this approach is that if there is a catastrophic etcd data loss -and all PV's are gone, there is no way to recover the mapping from RBD to PVC. The -RBD volumes for the customer still exist, but we have no way to tell which rbd -volumes belong to which customer. - -## Goal -We want to store some information about the PVC in RBD image name/metadata, so that -in catastrophic situations, we can derive the PVC name from rbd image name/metadata -and allow customer the following options: -- Backup RBD volume data for specific customers and hand them their copy before deleting - the RBD volume. Without knowing from rbd image name/metadata, which customers they - belong to we cannot hand those customers their data. -- Create PV with the given RBD name and pre-bind it to the desired PVC so that customer - can get its data back. - -## Non Goals -This proposal doesnt attempt to undermine the importance of etcd backups to restore -data in catastrophic situations. This is one additional line of defense in case our -backups are not working. - -## Motivation - -We recently had an etcd data loss which resulted in loss of this rbd to pv mapping -and there was no way to restore customer data. This proposal aims to store pvc name -as metadata in the RBD image so that in catastrophic scenarios, the mapping can be -restored by just looking at the RBD's. - -## Current Implementation - -```go -func (r *rbdVolumeProvisioner) Provision() (*v1.PersistentVolume, error) { -... - - // create random image name - image := fmt.Sprintf("kubernetes-dynamic-pvc-%s", uuid.NewUUID()) - r.rbdMounter.Image = image -``` -## Finalized Proposal -Use `rbd image-meta set` command to store additional metadata in the RBD image about the PVC which owns -the RBD image. - -`rbd image-meta set --pool hdd kubernetes-dynamic-pvc-fabd715f-0d24-11e8-91fa-1418774b3e9d pvcname <pvcname>` -`rbd image-meta set --pool hdd kubernetes-dynamic-pvc-fabd715f-0d24-11e8-91fa-1418774b3e9d pvcnamespace <pvcnamespace>` - -### Pros -- Simple to implement -- Does not cause regression in RBD image names, which remains same as earlier. -- The metadata information is not immediately visible to RBD admins - -### Cons -- NA - -Since this Proposal does not change the RBD image name and is able to store additional metadata about -the PVC to which it belongs, this is preferred over other two proposals. Also it does a better job -of hiding the PVC name in the metadata rather than making it more obvious in the RBD image name. The -metadata can only be seen by admins with appropriate permissions to run the rbd image-meta command. 
In -addition, this Proposal , doesnt impose any limitations on the length of metadata that can be stored -and hence can accommodate any pvc names and namespaces which are stored as arbitrary key value pairs. -It also leaves room for storing any other metadata about the PVC. - - -### Upgrade/Downgrade Behavior - -#### Upgrading from a K8s version without this metadata to a version with this metadata -The metadata for image is populated on CreateImage. After an upgrade, existing RBD Images will not have that -metadata set. When the next AttachDisk happens, we can check if the metadata is not set, set it. Cluster -administrators could also run a one time script to set this manually. For all newly created RBD images, -the rbd image metadata will be set properly. - -#### Downgrade from a K8s version with this metadata to a version without this metadata -After a downgrade, all existing RBD images will have the metadata set. New RBD images created after the -downgrade will not have this metadata. - -## Proposal 1 - -Make the RBD Image name as base64 encoded PVC name(namespace+name) - -```go -import b64 "encoding/base64" -... - - -func (r *rbdVolumeProvisioner) Provision() (*v1.PersistentVolume, error) { -... - - // Create a base64 encoding of the PVC Namespace and Name - rbdImageName := b64.StdEncoding.EncodeToString([]byte(r.options.PVC.Name+"/"+r.options.PVC.Namespace)) - - // Append the base64 encoding to the string `kubernetes-dynamic-pvc-` - rbdImageName = fmt.Sprintf("kubernetes-dynamic-pvc-%s", rbdImageName) - r.rbdMounter.Image = rbdImageName - -``` - -### Pros -- Simple scheme which encodes the fully qualified PVC name in the RBD image name - -### Cons -- Causes regression since RBD image names will change from one version of K8s to another. -- Some older versions of librbd/krbd start having issues with names longer than 95 characters. - - -## Proposal 2 - -Make the RBD Image name as the stringified PVC namespace plus PVC name. - -### Pros -- Simple to implement. - -### Cons -- Causes regression since RBD image names will change from one version of K8s to another. -- This exposes the customer name directly to Ceph Admins. Earlier it was hidden as base64 encoding - - -## Misc -- Document how Pre-Binding of PV to PVC works in dynamic provisioning -- Document/Test if there are other issues with restoring PVC/PV after a - etcd backup is restored +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
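A minimal sketch of the finalized proposal above: after the provisioner creates the image, it stamps the owning PVC onto it with the `rbd image-meta set` commands quoted earlier. Only the flags shown in those examples are used here; a real provisioner would also pass monitors, user, and secret arguments, so treat this as an outline rather than the in-tree change.

```go
// Sketch: record which PVC owns an RBD image so the mapping can be recovered
// even if every PV object is lost from etcd.
package rbdmeta

import (
	"fmt"
	"os/exec"
)

func setPVCMetadata(pool, image, pvcNamespace, pvcName string) error {
	pairs := map[string]string{
		"pvcname":      pvcName,
		"pvcnamespace": pvcNamespace,
	}
	for key, value := range pairs {
		// Mirrors: rbd image-meta set --pool <pool> <image> <key> <value>
		out, err := exec.Command("rbd", "image-meta", "set", "--pool", pool, image, key, value).CombinedOutput()
		if err != nil {
			return fmt.Errorf("rbd image-meta set %s failed: %v: %s", key, err, out)
		}
	}
	return nil
}
```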
\ No newline at end of file diff --git a/contributors/design-proposals/storage/raw-block-pv.md b/contributors/design-proposals/storage/raw-block-pv.md index f5ced0a1..f0fbec72 100644 --- a/contributors/design-proposals/storage/raw-block-pv.md +++ b/contributors/design-proposals/storage/raw-block-pv.md @@ -1,844 +1,6 @@ -# Raw Block Consumption in Kubernetes +Design proposals have been archived. -Authors: erinboyd@, screeley44@, mtanino@ +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -This document presents a proposal for managing raw block storage in Kubernetes using the persistent volume source API as a consistent model of consumption. -# Terminology -* Raw Block Device - a physically attached device devoid of a filesystem -* Raw Block Volume - a logical abstraction of the raw block device as defined by a path -* Filesystem on Block - a formatted (ie xfs) filesystem on top of a raw block device - -# Goals -* Enable durable access to block storage -* Provide flexibility for users/vendors to utilize various types of storage devices -* Agree on API changes for block -* Provide a consistent security model for block devices -* Provide a means for running containerized block storage offerings as non-privileged container - -# Non Goals -* Support all storage devices natively in upstream Kubernetes. Non-standard storage devices are expected to be managed using extension - mechanisms. -* Provide a means for full integration into the scheduler based on non-storage related requests (CPU, etc.) -* Provide a means of ensuring specific topology to ensure co-location of the data - -# Value add to Kubernetes - - By extending the API for volumes to specifically request a raw block device, we provide an explicit method for volume consumption, - whereas previously any request for storage was always fulfilled with a formatted filesystem, even when the underlying storage was - block. In addition, the ability to use a raw block device without a filesystem will allow - Kubernetes better support of high performance applications that can utilize raw block devices directly for their storage. - Block volumes are critical to applications like databases (MongoDB, Cassandra) that require consistent I/O performance - and low latency. For mission critical applications, like SAP, block storage is a requirement. - - For applications that use block storage natively (like MongoDB) no additional configuration is required as the mount path passed - to the application provides the device which MongoDB then uses for the storage path in the configuration file (dbpath). Specific - tuning for each application to achieve the highest possibly performance is provided as part of its recommended configurations. 
- - Specific use cases around improved usage of storage consumption are included in the use cases listed below as follows: - * An admin wishes to expose a block volume to be consumed as a block volume for the user - * An admin wishes to expose a block volume to be consumed as a block volume for an administrative function such - as bootstrapping - * A user wishes to utilize block storage to fully realize the performance of an application tuned to using block devices - * A user wishes to read from a block storage device and write to a filesystem (big data analytics processing) - Also use cases include dynamically provisioning and intelligent discovery of existing devices, which this proposal sets the - foundation for more fully developing these methods. - - -# Design Overview - - The proposed design is based on the idea of leveraging well defined concepts for storage in Kubernetes. The consumption and - definitions for the block devices will be driven through the PVC and PV definitions. Along with Storage - Resource definitions, this will provide the admin with a consistent way of managing all storage. - The API changes proposed in the following section are minimal with the idea of defining a volumeMode to indicate both the definition - and consumption of the devices. Since it's possible to create a volume as a block device and then later consume it by provisioning - a filesystem on top, the design requires explicit intent for how the volume will be used. - The additional benefit of explicitly defining how the volume is to be consumed will provide a means for indicating the method - by which the device should be scrubbed when the claim is deleted, as this method will differ from a raw block device compared to a - filesystem. The ownership and responsibility of defining the retention policy shall be up to the plugin method being utilized and is - not covered in this proposal. - - Limiting use of the volumeMode to block can be executed through the use of storage resource quotas and storageClasses defined by the - administrator. - - To ensure backwards compatibility and a phased transition of this feature, the consensus from the community is to intentionally disable - the volumeMode: Block for both in-tree and external provisioners until a suitable implementation for provisioner versioning has been - accepted and implemented in the community. In addition, in-tree provisioners should be able to gracefully ignore volumeMode API objects - for plugins that haven't been updated to accept this value. - - It is important to note that when a PV is bound, it is either bound as a raw block device or formatted with a filesystem. Therefore, - the PVC drives the request and intended usage of the device by specifying the volumeMode as part of the API. This design lends itself - to support of dynamic provisioning by also letting the request initiate from the PVC defining the role for the PV. It also - allows flexibility in the implementation and storage plugins to determine their support of this feature. Acceptable values for - volumeMode are 'Block' and 'Filesystem'. Where 'Filesystem' is the default value today and not required to be set in the PV/PVC. - -# Proposed API Changes - -## Persistent Volume Claim API Changes: -In the simplest case of static provisioning, a user asks for a volumeMode of block. The binder will only bind to a PV defined -with the same volumeMode. 
- -``` -kind: PersistentVolumeClaim -apiVersion: v1 -metadata: - name: myclaim -spec: - volumeMode: Block #proposed API change - accessModes: - - ReadWriteOnce - resources: - requests: - storage: 80Gi -``` - -For dynamic provisioning and the use of the storageClass, the admin also specifically defines the intent of the volume by -indicating the volumeMode as block. The provisioner for this class will validate whether or not it supports block and return -an error if it does not. - -``` -kind: PersistentVolumeClaim -apiVersion: v1 -metadata: - name: myclaim -spec: - storageClassName: local-fast - volumeMode: Block #proposed API change - accessModes: - - ReadWriteOnce - resources: - requests: - storage: 80Gi -``` - -## Persistent Volume API Changes: -For static provisioning the admin creates the volume and also is intentional about how the volume should be consumed. For backwards -compatibility, the absence of volumeMode will default to filesystem which is how volumes work today, which are formatted with a filesystem depending on the plug-in chosen. Recycling will not be a supported reclaim policy as it has been deprecated. The path value in the local PV definition would be overloaded to define the path of the raw block device rather than the filesystem path. -``` -kind: PersistentVolume -apiVersion: v1 -metadata: - name: local-raw-pv - annotations: - "volume.alpha.kubernetes.io/node-affinity": '{ - "requiredDuringSchedulingIgnoredDuringExecution": { - "nodeSelectorTerms": [ - { "matchExpressions": [ - { "key": "kubernetes.io/hostname", - "operator": "In", - "values": ["ip-172-18-11-174.ec2.internal"] - } - ]} - ]} - }' -spec: - volumeMode: Block - capacity: - storage: 10Gi - local: - path: /dev/xvdf - accessModes: - - ReadWriteOnce - persistentVolumeReclaimPolicy: Retain -``` -## Pod API Changes: -This change intentionally calls out the use of a block device (volumeDevices) rather than the mount point on a filesystem. -``` -apiVersion: v1 -kind: Pod -metadata: - name: my-db -spec: - containers: - - name: mysql - image: mysql - volumeDevices: #proposed API change - - name: my-db-data - devicePath: /dev/xvda #proposed API change - volumes: - - name: my-db-data - persistentVolumeClaim: - claimName: raw-pvc -``` -## Storage Class non-API Changes: -For dynamic provisioning, it is assumed that values passed in the parameter section are opaque, thus the introduction of utilizing -fstype in the StorageClass can be used by the provisioner to indicate how to create the volume. The proposal for this value is -defined here: -https://github.com/kubernetes/kubernetes/pull/45345 -This section is provided as a general guideline, but each provisioner may implement their parameters independent of what is defined -here. It is our recommendation that the volumeMode in the PVC be the guidance for the provisioner and overrides the value given in the fstype. Therefore a provisioner should be able to ignore the fstype and provision a block device if that is what the user requested via the PVC and the provisioner can support this. - -``` -kind: StorageClass -apiVersion: storage.k8s.io/v1 -metadata: - name: block-volume - provisioner: kubernetes.io/scaleio - parameters: - gateway: https://192.168.99.200:443/api - system: scaleio - protectionDomain: default - storagePool: default - storageMode: ThinProvisionned - secretRef: sio-secret - readOnly: false -``` -The provisioner (if applicable) should validate the parameters and return an error if the combination specified is not supported. 
-This also allows the use case for leveraging a Storage Class for utilizing pre-defined static volumes. By labeling the Persistent Volumes -with the Storage Class, volumes can be grouped and used according to how they are defined in the class. -``` -kind: StorageClass -apiVersion: storage.k8s.io/v1 -metadata: - name: block-volume -provisioner: no-provisioning -parameters: -``` - -# Use Cases - -## UC1: - -DESCRIPTION: An admin wishes to pre-create a series of local raw block devices to expose as PVs for consumption. The admin wishes to specify the purpose of these devices by specifying 'block' as the volumeMode for the PVs. - -WORKFLOW: - -ADMIN: - -``` -kind: PersistentVolume -apiVersion: v1 -metadata: - name: local-raw-pv -spec: - volumeMode: Block - capacity: - storage: 100Gi - local: - path: /dev/xvdc - accessModes: - - ReadWriteOnce - persistentVolumeReclaimPolicy: Delete -``` - -## UC2: - -DESCRIPTION: -* A user uses a raw block device for database applications such as MariaDB. -* User creates a persistent volume claim with "volumeMode: Block" option to bind pre-created iSCSI PV. - -WORKFLOW: - -ADMIN: -* Admin creates a disk and exposes it to all kubelet worker nodes. (This is done by storage operation). -* Admin creates an iSCSI persistent volume using storage information such as portal IP, iqn and lun. - -``` -kind: PersistentVolume -apiVersion: v1 -metadata: - name: raw-pv -spec: - volumeMode: Block - capacity: - storage: 100Gi - accessModes: - - ReadWriteOnce - persistentVolumeReclaimPolicy: Delete - iscsi: - targetPortal: 1.2.3.4:3260 - iqn: iqn.2017-05.com.example:test - lun: 0 -``` - -USER: - -* User creates a persistent volume claim with volumeMode: Block option to bind pre-created iSCSI PV. - -``` -kind: PersistentVolumeClaim -apiVersion: v1 -metadata: - name: raw-pvc -spec: - volumeMode: Block - accessModes: - - ReadWriteOnce - resources: - requests: - storage: 80Gi -``` - -* User creates a Pod yaml which uses raw-pvc PVC. - -``` -apiVersion: v1 -kind: Pod -metadata: - name: my-db -spec: - containers - - namee: mysql - image: mysql - volumeDevices: - - name: my-db-data - devicePath: /dev/xvda - volumes: - - name: my-db-data - persistentVolumeClaim: - claimName: raw-pvc -``` -* During Pod creation, iSCSI Plugin attaches iSCSI volume to the kubelet worker node using storage information. - - -## UC3: - -DESCRIPTION: - -A developer wishes to enable their application to use a local raw block device as the volume for the container. The admin has already created PVs that the user will bind to by specifying 'block' as the volume type of their PVC. - -BACKGROUND: - -For example, an admin has already created the devices locally and wishes to expose them to the user in a consistent manner through the -Persistent Volume API. - -WORKFLOW: - -USER: - -``` -kind: PersistentVolumeClaim -apiVersion: v1 -metadata: - name: local-raw-pvc -spec: - volumeMode: Block - accessModes: - - ReadWriteOnce - resources: - requests: - storage: 80Gi -``` - -## UC4: - -DESCRIPTION: StorageClass with non-dynamically created volumes - -BACKGROUND: The admin wishes to create a storage class that will identify pre-provisioned block PVs based on a user's PVC request for volumeMode: Block. 
- -WORKFLOW: - -ADMIN: - -``` -kind: StorageClass -apiVersion: storage.k8s.io/v1 -metadata: - name: block-volume -provisioner: no-provisioning -parameters: -``` -* Sample of pre-created volume definition: - -``` -apiVersion: v1 -kind: PersistentVolume -metadata: - name: pv-block-volume -spec: - volumeMode: Block - storageClassName: block-volume - capacity: - storage: 35Gi - accessModes: - - ReadWriteOnce - local: - path: /dev/xvdc -``` -## UC5: - -DESCRIPTION: StorageClass with dynamically created volumes - -BACKGROUND: The admin wishes to create a storage class that will dynamically create block PVs based on a user's PVC request for volumeMode: Block. The admin desires the volumes be created dynamically and deleted when the PV definition is deleted. - -WORKFLOW: - -ADMIN: - -``` -kind: StorageClass -apiVersion: storage.k8s.io/v1 -metadata: - name: local-fast -provisioner: kubernetes.io/local-block-ssd -parameters: -``` - -## UC6: - -DESCRIPTION: The developer wishes to request a block device via a Storage Class. - -WORKFLOW: - -USER: - -``` -apiVersion: v1 -kind: PersistentVolumeClaim -metadata: - name: pvc-local-block -spec: - volumeMode: Block - storageClassName: local-fast - accessModes: - - ReadWriteOnce - resources: - requests: - storage: 10Gi -``` - -**Since the PVC object is passed to the provisioner, it will be responsible for validating and handling whether or not it supports the volumeMode being passed** - -## UC7: - -DESCRIPTION: Admin creates network raw block devices - -BACKGROUND: Admin wishes to pre-create Persistent Volumes in GCE as raw block devices - -WORKFLOW: - -ADMIN: - -``` -apiVersion: "v1" -kind: "PersistentVolume" -metadata: - name: gce-disk-1 -Spec: - volumeMode: Block - capacity: - storage: "10Gi" - accessModes: - - "ReadWriteOnce" - gcePersistentDisk: - pdName: "gce-disk-1" -``` - -## UC8: - -DESCRIPTION: -* A user uses a raw block device for database applications such as mysql to read data from and write the results to a disk that - has a formatted filesystem to be displayed via nginx web server. - -ADMIN: -* Admin creates a 2 block devices and formats one with a filesystem - -``` -kind: PersistentVolume -apiVersion: v1 -metadata: - name: raw-pv -spec: - volumeMode: Block - capacity: - storage: 100Gi - accessModes: - - ReadWriteOnce - persistentVolumeReclaimPolicy: Delete - gcePersistentDisk: - pdName: "gce-disk-1" - -``` -``` -kind: PersistentVolume -apiVersion: v1 -metadata: - name: gluster-pv -spec: - volumeMode: Filesystem - capacity: - storage: 100Gi - accessModes: - - ReadWriteMany - persistentVolumeReclaimPolicy: Delete - glusterfs: - endpoints: glusterfs-cluster - path: glusterVol -``` -USER: - -* User creates a persistent volume claim with volumeMode: Block option to bind pre-created block volume. - -``` -kind: PersistentVolumeClaim -apiVersion: v1 -metadata: - name: raw-pvc -spec: - volumeMode: Block - accessModes: - - ReadWriteOnce - resources: - requests: - storage: 80Gi -``` -* User creates a persistent volume claim with volumeMode: Filesystem to the pre-created gluster volume. - -``` -kind: PersistentVolumeClaim -apiVersion: v1 -metadata: - name: gluster-pvc -spec: - volumeMode: Filesystem - accessModes: - - ReadWriteMany - resources: - requests: - storage: 50Gi -``` -* User creates a Pod yaml which will utilize both block and filesystem storage by its containers. 
- -``` -apiVersion: v1 -kind: Pod -metadata: - name: my-db -spec: - volumes: - - name: my-db-data - persistentVolumeClaim: - claimName: raw-pvc - - name: my-nginx-data - persistentVolumeClaim: - claimName: gluster-pvc - containers - - name: mysql - image: mysql - volumeDevices: - - name: my-db-data - devicePath: /var/lib/mysql/data - - name: nginx - image: nginx - ports: - - containerPort: 80 - volumeMounts: - - mountPath: /usr/share/nginx/html - name: my-nginx-data - readOnly: false -``` -## UC9: - -DESCRIPTION: -* A user wishes to read data from a read-only raw block device, an example might be a database for analytics processing. - -USER: -* User creates pod and specifies 'readOnly' as a parameter in the persistent volume claim to indicate they would -like to be bound to a PV with this setting enabled. - -``` -apiVersion: v1 -kind: Pod -metadata: - name: nginx-pod-block-001 -spec: - containers: - - name: nginx-container - image: nginx:latest - ports: - - containerPort: 80 - volumeDevices: - - name: data - devicePath: /dev/xvda - volumes: - - name: data - persistentVolumeClaim: - claimName: block-pvc001 - readOnly: true #flag indicating read-only for container runtime -``` -**Note: the readOnly field already exists in the PersistentVolumeClaimVolumeSource above and will dictate the values set by the container runtime options** - - -# Container Runtime considerations -It is important the values that are passed to the container runtimes are valid and support the current implementation of these various runtimes. Listed below are a table of various runtime and the mapping of their values to what is passed from the kubelet. - -| runtime engine | runtime options | accessMode | -| -------------- |:----------------:| ----------------:| -| docker/runc/rkt | mknod / RWM | RWO | -| docker/runc/rkt | R | ROX | - -The accessModes would be passed as part of the options array and would need validate against the specific runtime engine. -Since rkt doesn't use the CRI, the config values would need to be passed in the legacy method. -Note: the container runtime doesn't require a privileged pod to enable the device as RWX (RMW), but still requires privileges to mount as is consistent with the filesystem implementation today. - -The runtime option would be placed in the DeviceInfo as such: -devices = append(devices, kubecontainer.DeviceInfo{PathOnHost: path, PathInContainer: path, Permissions: "XXX"}) - -The implementation plan would be to rename the current makeDevices to makeGPUDevices and create a separate function to add the raw block devices to the option array to be passed to the container runtime. This would iterate on the paths passed in for the pod/container. - -Since the future of this in Kubernetes for GPUs and other plug-able devices is migrating to a device plugin architecture, there are -still differentiating components of storage that are enough to not to enforce alignment to their convention. Two factors when -considering the usage of device plugins center around discoverability and topology of devices. Since neither of these are requirements -for using raw block devices, the legacy method of populating the devices and appending it to the device array is sufficient. - - -# Plugin interface changes -## New BlockVolume interface proposed design - -``` -// BlockVolume interface provides methods to generate global map path -// and pod device map path. -type BlockVolume interface { - // GetGlobalMapPath returns a global map path which contains - // symbolic links associated to a block device. 
- // ex. plugins/kubernetes.io/{PluginName}/{DefaultKubeletVolumeDevicesDirName}/{volumePluginDependentPath}/{pod uuid} - GetGlobalMapPath(spec *Spec) (string, error) - // GetPodDeviceMapPath returns a pod device map path - // and name of a symbolic link associated to a block device. - // ex. pods/{podUid}}/{DefaultKubeletVolumeDevicesDirName}/{escapeQualifiedPluginName}/{volumeName} - GetPodDeviceMapPath() (string, string) -} -``` - -## New BlockVolumePlugin interface proposed design - -``` -// BlockVolumePlugin is an extend interface of VolumePlugin and is used for block volumes support. -type BlockVolumePlugin interface { - VolumePlugin - // NewBlockVolumeMapper creates a new volume.BlockVolumeMapper from an API specification. - // - spec: The v1.Volume spec - // - pod: The enclosing pod - NewBlockVolumeMapper(spec *Spec, podRef *v1.Pod, opts VolumeOptions) (BlockVolumeMapper, error) - // NewBlockVolumeUnmapper creates a new volume.BlockVolumeUnmapper from recoverable state. - // - name: The volume name, as per the v1.Volume spec. - // - podUID: The UID of the enclosing pod - NewBlockVolumeUnmapper(name string, podUID types.UID) (BlockVolumeUnmapper, error) - // ConstructBlockVolumeSpec constructs a volume spec based on the given - // pod name, volume name and a pod device map path. - // The spec may have incomplete information due to limited information - // from input. This function is used by volume manager to reconstruct - // volume spec by reading the volume directories from disk. - ConstructBlockVolumeSpec(podUID types.UID, volumeName, mountPath string) (*Spec, error) -} -``` - -## New BlockVolumeMapper/BlockVolumeUnmapper interface proposed design - -``` -// BlockVolumeMapper interface provides methods to set up/map the volume. -type BlockVolumeMapper interface { - BlockVolume - // SetUpDevice prepares the volume to a self-determined directory path, - // which may or may not exist yet and returns combination of physical - // device path of a block volume and error. - // If the plugin is non-attachable, it should prepare the device - // in /dev/ (or where appropriate) and return unique device path. - // Unique device path across kubelet node reboot is required to avoid - // unexpected block volume destruction. - // If the plugin is attachable, it should not do anything here, - // just return empty string for device path. - // Instead, attachable plugin have to return unique device path - // at attacher.Attach() and attacher.WaitForAttach(). - // This may be called more than once, so implementations must be idempotent. - SetUpDevice() (string, error) -} - -// BlockVolumeUnmapper interface provides methods to cleanup/unmap the volumes. -type BlockVolumeUnmapper interface { - BlockVolume - // TearDownDevice removes traces of the SetUpDevice procedure under - // a self-determined directory. - // If the plugin is non-attachable, this method detaches the volume - // from devicePath on kubelet node. - TearDownDevice(mapPath string, devicePath string) error -} -``` - -## Changes for volume mount points - -Currently, a volume which has filesystem is mounted to the following two paths on a kubelet node when the volumes is in-use. -The purpose of those mount points are that Kubernetes manages volume attach/detach status using these mount points and number -of references to these mount points. 
- -``` -- Global mount path -/var/lib/kubelet/plugins/kubernetes.io/{pluginName}/{volumePluginDependentPath}/ - -- Volume mount path -/var/lib/kubelet/pods/{podUID}/volumes/{escapeQualifiedPluginName}/{volumeName}/ -``` - -Even if the volumeMode is "Block", similar scheme is needed. However, the volume which -doesn't have filesystem can't be mounted. -Therefore, instead of volume mount, we use symbolic link which is associated to raw block device. -Kubelet creates a new symbolic link under the new `global map path` and `pod device map path`. - -#### Global map path for "Block" volumeMode volume -Kubelet creates a new symbolic link under the new global map path when volume is attached to a Pod. -Number of symbolic links are equal to the number of Pods which use the same volume. Kubelet needs -to manage both creation and deletion of symbolic links under the global map path. The name of the -symbolic link is same as pod uuid. -There are two usages of Global map path. - -1. Manage number of references from multiple pods -1. Retrieve `{volumePluginDependentPath}` during `Block volume reconstruction` - -``` -/var/lib/kubelet/plugins/kubernetes.io/{pluginName}/volumeDevices/{volumePluginDependentPath}/{pod uuid1} -/var/lib/kubelet/plugins/kubernetes.io/{pluginName}/volumeDevices/{volumePluginDependentPath}/{pod uuid2} -... -``` - -- {volumePluginDependentPath} example: -``` -FC plugin: {wwn}-lun-{lun} or {wwid} -ex. /var/lib/kubelet/plugins/kubernetes.io/fc/volumeDevices/500a0982991b8dc5-lun-0/f527ca5b-6d87-11e5-aa7e-080027ff6387 -iSCSI plugin: {portal ip}-{iqn}-lun-{lun} -ex. /var/lib/kubelet/plugins/kubernetes.io/iscsi/volumeDevices/1.2.3.4:3260-iqn.2001-04.com.example:storage.kube.sys1.xyz-lun-1/f527ca5b-6d87-11e5-aa7e-080027ff6387 - ``` - -#### Pod device map path for "Block" volumeMode volume -Kubelet creates a symbolic link under the new pod device map path. The file of {volumeName} is -symbolic link and the link is associated to raw block device. If a Pod has multiple block volumes, -multiple symbolic links under the pod device map path will be created with each volume name. -The usage of pod device map path is; - -1. Retrieve raw block device path(ex. /dev/sdX) during `Container initialization` and `Block volume reconstruction` - -``` -/var/lib/kubelet/pods/{podUID}/volumeDevices/{escapeQualifiedPluginName}/{volumeName1} -/var/lib/kubelet/pods/{podUID}/volumeDevices/{escapeQualifiedPluginName}/{volumeName2} -... -``` - -# Volume binding matrix for statically provisioned volumes: - -| PV volumeMode | PVC volumeMode | Result | -| --------------|:---------------:| ----------------:| -| unspecified | unspecified | BIND | -| unspecified | Block | NO BIND | -| unspecified | Filesystem | BIND | -| Block | unspecified | NO BIND | -| Block | Block | BIND | -| Block | Filesystem | NO BIND | -| Filesystem | Filesystem | BIND | -| Filesystem | Block | NO BIND | -| Filesystem | unspecified | BIND | - - - -* unspecified defaults to 'file/ext4' today for backwards compatibility and in mount_linux.go - -# Dynamically provisioning - -Using dynamic provisioning, user is able to create block volume via provisioners. Currently, -we have two types of provisioners, internal provisioner and external provisioner. -During volume creation via dynamic provisioner, user passes persistent volume claim which -contains `volumeMode` parameter, then the persistent volume claim object is passed to -provisioners. 
Therefore, in order to create block volume, provisioners need to support -`volumeMode` and then create persistent volume with `volumeMode`. - -If a storage and plugin don't have an ability to create raw block type of volume, -then `both internal and external provisioner don't need any update` to support `volumeMode` -because `volumeMode` in PV and PVC are automatically set to `Filesystem` as a default when -these volume object are created. -However, there is a case that use specifies `volumeMode` as `Block` even if both plugin and -provisioner don't support. As a result, PVC will be created, PV will be provisioned -but both of them will stuck Pending status since `volumeMode` between them don't match. -For this situation, we will add error propagation into persistent volume controller to make -it more clear to the user what's wrong. - -If admin provides external provisioner to provision both filesystem and block volume, -admin have to carefully prepare Kubernetes environment for their users because both -Kubernetes itself and external provisioner have to support block volume functionality. -This means Kubernetes v1.9 or later must be used to provide block volume with external -provisioner which supports block volume. - -Regardless of the volumeMode, provisioner can set `FSType` into the plugin's volumeSource -but the value will be ignored at the volume plugin side if `volumeMode` is `Block`. - -## Internal provisioner - -If internal plugin has own provisioner, the plugin needs to support `volumeMode` to provision -block volume. This is the example implementation of `volumeMode` support for GCE PD plugin. - -``` -// Obtain volumeMode from PVC Spec VolumeMode -var volumeMode v1.PersistentVolumeMode -if options.PVC.Spec.VolumeMode != nil { - volumeMode = *options.PVC.Spec.VolumeMode -} - -// Set volumeMode into PersistentVolumeSpec -pv := &v1.PersistentVolume{ - Spec: v1.PersistentVolumeSpec{ - VolumeMode: &volumeMode, - PersistentVolumeSource: v1.PersistentVolumeSource{ - GCEPersistentDisk: &v1.GCEPersistentDiskVolumeSource{ - PDName: options.Parameters["pdName"], - FSType: options.Parameters["fsType"], - ... - }, - }, - }, -} -``` - - -## External provisioner - -We have a "protocol" to allow dynamic provisioning by external software called external provisioner. -In order to support block volume via external provisioner, external provisioner needs to support -`volumeMode` and then create persistent volume with `volumeMode`. This is the example implementation -of `volumeMode` support for external provisioner of Local volume plugin. - -``` -// Obtain volumeMode from PVC Spec VolumeMode -var volumeMode v1.PersistentVolumeMode -if options.PVC.Spec.VolumeMode != nil { - volumeMode = *options.PVC.Spec.VolumeMode -} - -// Set volumeMode into PersistentVolumeSpec -pv := &v1.PersistentVolume{ - Spec: v1.PersistentVolumeSpec{ - VolumeMode: &volumeMode, - PersistentVolumeSource: v1.PersistentVolumeSource{ - Local: &v1.LocalVolumeSource{ - Path: options.Parameters["Path"], - }, - }, - }, -} -``` - - -# Volume binding considerations for dynamically provisioned volumes: -The value used for the plugin to indicate is it provisioning block will be plugin dependent and is an opaque parameter. Binding will also be plugin dependent and must handle the parameter being passed and indicate whether or not it supports block. 
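To make the binding matrix and the dynamic-provisioning considerations above concrete, here is a minimal sketch, assuming the current `k8s.io/api/core/v1` types, of how the claim/volume `volumeMode` comparison could be expressed. The helper name `volumeModesMatch` is hypothetical and not part of this proposal; an unset mode defaults to `Filesystem` for backwards compatibility, as in the matrix.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// volumeModesMatch is a hypothetical helper that applies the binding matrix:
// an unset volumeMode on either side defaults to Filesystem, and a claim only
// binds to a volume whose (defaulted) mode is identical to its own.
func volumeModesMatch(pvMode, pvcMode *v1.PersistentVolumeMode) bool {
	defaulted := func(m *v1.PersistentVolumeMode) v1.PersistentVolumeMode {
		if m == nil {
			return v1.PersistentVolumeFilesystem
		}
		return *m
	}
	return defaulted(pvMode) == defaulted(pvcMode)
}

func main() {
	block := v1.PersistentVolumeBlock
	fmt.Println(volumeModesMatch(nil, nil))       // true:  both default to Filesystem
	fmt.Println(volumeModesMatch(nil, &block))    // false: unspecified PV, Block PVC
	fmt.Println(volumeModesMatch(&block, &block)) // true:  Block on both sides
}
```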
-
-# Implementation Plan, Features & Milestones
-
-Phase 1: v1.9
-Feature: Pre-provisioned PVs to pre-created devices
- - Milestone 1: API changes
- - Milestone 2: Restricted Access
- - Milestone 3: Changes to the mounter interface, which today assumes 'file' as the default
- - Milestone 4: Expose volumeMode to users via kubectl
- - Milestone 5: PV controller binding changes for block devices
- - Milestone 6: Container Runtime changes
- - Milestone 7: Initial Plugin changes (FC & Local storage)
-
-Phase 2: v1.10
-Feature: Discovery of block devices
- - Milestone 1: Dynamically provisioned PVs to dynamically allocated devices
- - Milestone 2: Plugin changes with dynamic provisioning support (RBD, iSCSI, GCE, AWS & GlusterFS)
+Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
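As a usage-side illustration of the raw block support planned above, the following sketch constructs a pod that consumes a block claim through the container-level `volumeDevices` field, the block counterpart of `volumeMounts`, which shipped alongside raw block support. The claim name `block-pvc`, the device path, and the image are placeholders, and the snippet assumes the current `k8s.io/api/core/v1` and `k8s.io/apimachinery` packages.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// A pod that attaches the raw device backing the claim "block-pvc"
	// at /dev/xvda inside the container, instead of mounting a filesystem.
	pod := &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "block-consumer"},
		Spec: v1.PodSpec{
			Containers: []v1.Container{{
				Name:  "app",
				Image: "busybox",
				VolumeDevices: []v1.VolumeDevice{{
					Name:       "data",
					DevicePath: "/dev/xvda", // path of the device node inside the container
				}},
			}},
			Volumes: []v1.Volume{{
				Name: "data",
				VolumeSource: v1.VolumeSource{
					PersistentVolumeClaim: &v1.PersistentVolumeClaimVolumeSource{
						ClaimName: "block-pvc",
					},
				},
			}},
		},
	}
	fmt.Println(pod.Name)
}
```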
\ No newline at end of file diff --git a/contributors/design-proposals/storage/svcacct-token-volume-source.md b/contributors/design-proposals/storage/svcacct-token-volume-source.md index 3069e677..f0fbec72 100644 --- a/contributors/design-proposals/storage/svcacct-token-volume-source.md +++ b/contributors/design-proposals/storage/svcacct-token-volume-source.md @@ -1,148 +1,6 @@ -# Service Account Token Volumes +Design proposals have been archived. -Authors: - @smarterclayton - @liggitt - @mikedanese +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Summary -Kubernetes is able to provide pods with unique identity tokens that can prove -the caller is a particular pod to a Kubernetes API server. These tokens are -injected into pods as secrets. This proposal proposes a new mechanism of -distribution with support for [improved service account tokens][better-tokens] -and explores how to migrate from the existing mechanism backwards compatibly. - -## Motivation - -Many workloads running on Kubernetes need to prove to external parties who they -are in order to participate in a larger application environment. This identity -must be attested to by the orchestration system in a way that allows a third -party to trust that an arbitrary container on the cluster is who it says it is. -In addition, infrastructure running on top of Kubernetes needs a simple -mechanism to communicate with the Kubernetes APIs and to provide more complex -tooling. Finally, a significant set of security challenges are associated with -storing service account tokens as secrets in Kubernetes and limiting the methods -whereby malicious parties can get access to these tokens will reduce the risk of -platform compromise. - -As a platform, Kubernetes should evolve to allow identity management systems to -provide more powerful workload identity without breaking existing use cases, and -provide a simple out of the box workload identity that is sufficient to cover -the requirements of bootstrapping low-level infrastructure running on -Kubernetes. We expect that other systems to cover the more advanced scenarios, -and see this effort as necessary glue to allow more powerful systems to succeed. - -With this feature, we hope to provide a backwards compatible replacement for -service account tokens that strengthens the security and improves the -scalability of the platform. - -## Proposal - -Kubernetes should implement a ServiceAccountToken volume projection that -maintains a service account token requested by the node from the TokenRequest -API. - -### Token Volume Projection - -A new volume projection will be implemented with an API that closely matches the -TokenRequest API. - -```go -type ProjectedVolumeSource struct { - Sources []VolumeProjection - DefaultMode *int32 -} - -type VolumeProjection struct { - Secret *SecretProjection - DownwardAPI *DownwardAPIProjection - ConfigMap *ConfigMapProjection - ServiceAccountToken *ServiceAccountTokenProjection -} - -// ServiceAccountTokenProjection represents a projected service account token -// volume. This projection can be used to insert a service account token into -// the pods runtime filesystem for use against APIs (Kubernetes API Server or -// otherwise). -type ServiceAccountTokenProjection struct { - // Audience is the intended audience of the token. 
A recipient of a token - // must identify itself with an identifier specified in the audience of the - // token, and otherwise should reject the token. The audience defaults to the - // identifier of the apiserver. - Audience string - // ExpirationSeconds is the requested duration of validity of the service - // account token. As the token approaches expiration, the kubelet volume - // plugin will proactively rotate the service account token. The kubelet will - // start trying to rotate the token if the token is older than 80 percent of - // its time to live or if the token is older than 24 hours.Defaults to 1 hour - // and must be at least 10 minutes. - ExpirationSeconds int64 - // Path is the relative path of the file to project the token into. - Path string -} -``` - -A volume plugin implemented in the kubelet will project a service account token -sourced from the TokenRequest API into volumes created from -ProjectedVolumeSources. As the token approaches expiration, the kubelet volume -plugin will proactively rotate the service account token. The kubelet will start -trying to rotate the token if the token is older than 80 percent of its time to -live or if the token is older than 24 hours. - -To replace the current service account token secrets, we also need to inject the -clusters CA certificate bundle. Initially we will deploy to data in a configmap -per-namespace and reference it using a ConfigMapProjection. - -A projected volume source that is equivalent to the current service account -secret: - -```yaml -sources: -- serviceAccountToken: - expirationSeconds: 3153600000 # 100 years - path: token -- configMap: - name: kube-cacrt - items: - - key: ca.crt - path: ca.crt -- downwardAPI: - items: - - path: namespace - fieldRef: metadata.namespace -``` - - -This fixes one scalability issue with the current service account token -deployment model where secret GETs are a large portion of overall apiserver -traffic. - -A projected volume source that requests a token for vault and Istio CA: - -```yaml -sources: -- serviceAccountToken: - path: vault-token - audience: vault -- serviceAccountToken: - path: istio-token - audience: ca.istio.io -``` - -### Alternatives - -1. Instead of implementing a service account token volume projection, we could - implement all injection as a flex volume or CSI plugin. - 1. Both flex volume and CSI are alpha and are unlikely to graduate soon. - 1. Virtual kubelets (like Fargate or ACS) may not be able to run flex - volumes. - 1. Service account tokens are a fundamental part of our API. -1. Remove service accounts and service account tokens completely from core, use - an alternate mechanism that sits outside the platform. - 1. Other core features need service account integration, leading to all - users needing to install this extension. - 1. Complicates installation for the majority of users. - - -[better-tokens]: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/auth/bound-service-account-tokens.md +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
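The rotation behaviour described in the proposal above, refreshing a projected token once it has lived past 80 percent of its TTL or is older than 24 hours, can be captured in a small predicate. This is a minimal sketch only; the function name `shouldRotate` and the way the timestamps are obtained are assumptions, not the kubelet's actual implementation.

```go
package main

import (
	"fmt"
	"time"
)

// shouldRotate reports whether a projected service account token should be
// refreshed, following the heuristic in the proposal: rotate once the token
// has lived for more than 80% of its TTL, or is older than 24 hours.
func shouldRotate(issuedAt, expiresAt, now time.Time) bool {
	ttl := expiresAt.Sub(issuedAt)
	age := now.Sub(issuedAt)
	return age > time.Duration(float64(ttl)*0.8) || age > 24*time.Hour
}

func main() {
	issued := time.Now().Add(-50 * time.Minute)
	expires := issued.Add(time.Hour) // 1 hour TTL, the proposal's default
	fmt.Println(shouldRotate(issued, expires, time.Now())) // true: past 80% of TTL
}
```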
\ No newline at end of file diff --git a/contributors/design-proposals/storage/volume-hostpath-qualifiers.md b/contributors/design-proposals/storage/volume-hostpath-qualifiers.md index 8afcc4e0..f0fbec72 100644 --- a/contributors/design-proposals/storage/volume-hostpath-qualifiers.md +++ b/contributors/design-proposals/storage/volume-hostpath-qualifiers.md @@ -1,146 +1,6 @@ -# Support HostPath volume existence qualifiers +Design proposals have been archived. -## Introduction +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -A Host volume source is probably the simplest volume type to define, needing -only a single path. However, that simplicity comes with many assumptions and -caveats. - -This proposal describes one of the issues associated with Host volumes — -their silent and implicit creation of directories on the host — and -proposes a solution. - -## Problem - -Right now, under Docker, when a bindmount references a hostPath, that path will -be created as an empty directory, owned by root, if it does not already exist. -This is rarely what the user actually wants because hostPath volumes are -typically used to express a dependency on an existing external file or -directory. -This concern was raised during the [initial -implementation](https://github.com/docker/docker/issues/1279#issuecomment-22965058) -of this behavior in Docker and it was suggested that orchestration systems -could better manage volume creation than Docker, but Docker does so as well -anyways. - -To fix this problem, I propose allowing a pod to specify whether a given -hostPath should exist prior to the pod running, whether it should be created, -and what it should exist as. -I also propose the inclusion of a default value which matches the current -behavior to ensure backwards compatibility. - -To understand exactly when this behavior will or won't be correct, it's -important to look at the use-cases of Host Volumes. -The table below broadly classifies the use-case of Host Volumes and asserts -whether this change would be of benefit to that use-case. - -### HostPath volume Use-cases - -| Use-case | Description | Examples | Benefits from this change? | Why? | -|:---------|:------------|:---------|:--------------------------:|:-----| -| Accessing an external system, data, or configuration | Data or a unix socket is created by a process on the host, and a pod within kubernetes consumes it | [fluentd-es-addon](https://github.com/kubernetes/kubernetes/blob/74b01041cc3feb2bb731cc243ab0e4515bef9a84/cluster/saltbase/salt/fluentd-es/fluentd-es.yaml#L30), [addon-manager](https://github.com/kubernetes/kubernetes/blob/808f3ecbe673b4127627a457dc77266ede49905d/cluster/gce/coreos/kube-manifests/kube-addon-manager.yaml#L23), [kube-proxy](https://github.com/kubernetes/kubernetes/blob/010c976ce8dd92904a7609483c8e794fd8e94d4e/cluster/saltbase/salt/kube-proxy/kube-proxy.manifest#L65), etc | :white_check_mark: | Fails faster and with more useful messages, and won't run when basic assumptions are false (e.g. that docker is the runtime and the docker.sock exists) | -| Providing data to external systems | Some pods wish to publish data to the host for other systems to consume, sometimes to a generic directory and sometimes to more component-specific ones | Kubelet core components which bindmount their logs out to `/var/log/*.log` so logrotate and other tools work with them | :white_check_mark: | Sometimes, but not always. 
It's directory-specific whether it not existing will be a problem. | -| Communicating between instances and versions of yourself | A pod can use a hostPath directory as a sort of cache and, as opposed to an emptyDir, persist the directory between versions of itself | [etcd](https://github.com/kubernetes/kubernetes/blob/fac54c9b22eff5c5052a8e3369cf8416a7827d36/cluster/saltbase/salt/etcd/etcd.manifest#L84), caches | :x: | It's pretty much always okay to create them | - - -### Other motivating factors - -One additional motivating factor for this change is that under the rkt runtime -paths are not created when they do not exist. This change moves the management -of these volumes into the Kubelet to the benefit of the rkt container runtime. - - -## Proposed API Change - -### Host Volume - -I propose that the -[`v1.HostPathVolumeSource`](https://github.com/kubernetes/kubernetes/blob/d26b4ca2859aa667ad520fb9518e0db67b74216a/pkg/api/types.go#L447-L451) -object be changed to include the following additional field: - -`Type` - An optional string of `exists|file|device|socket|directory` - If not -set, it will default to a backwards-compatible default behavior described -below. - -| Value | Behavior | -|:------|:---------| -| *unset* | If nothing exists at the given path, an empty directory will be created there. Otherwise, behaves like `exists` | -| `exists` | If nothing exists at the given path, the pod will fail to run and provide an informative error message | -| `file` | If a file does not exist at the given path, the pod will fail to run and provide an informative error message | -| `device` | If a block or character device does not exist at the given path, the pod will fail to run and provide an informative error message | -| `socket` | If a socket does not exist at the given path, the pod will fail to run and provide an informative error message | -| `directory` | If a directory does not exist at the given path, the pod will fail to run and provide an informative error message | - -Additional possible values, which are proposed to be excluded: - -|Value | Behavior | Reason for exclusion | -|:-----|:---------|:---------------------| -| `new-directory` | Like `auto`, but the given path must be a directory if it exists | `auto` mostly fills this use-case | -| `character-device` | | Granularity beyond `device` shouldn't matter often | -| `block-device` | | Granularity beyond `device` shouldn't matter often | -| `new-file` | Like file, but if nothing exist an empty file is created instead | In general, bindmounting the parent directory of the file you intend to create addresses this usecase | -| `optional` | If a path does not exist, then do not create any container-mount at all | This would better be handled by a new field entirely if this behavior is desirable | - - -### Why not as part of any other volume types? - -This feature does not make sense for any of the other volume types simply -because all of the other types are already fully qualified. For example, NFS -volumes are known to always be in existence else they will not mount. -Similarly, EmptyDir volumes will always exist as a directory. - -Only the HostVolume and SubPath means of referencing a path have the potential -to reference arbitrary incorrect or nonexistent things without erroring out. - -### Alternatives - -One alternative is to augment Host Volumes with a `MustExist` bool and provide -no further granularity. This would allow toggling between the `auto` and -`exists` behaviors described above. 
This would likely cover the "90%" use-case -and would be a simpler API. It would be sufficient for all of the examples -linked above in my opinion. - -## Kubelet implementation - -It's proposed that prior to starting a pod, the Kubelet validates that the -given path meets the qualifications of its type. Namely, if the type is `auto` -the Kubelet will create an empty directory if none exists there, and for each -of the others the Kubelet will perform the given validation prior to running -the pod. This validation might be done by a volume plugin, but further -technical consideration (out of scope of this proposal) is needed. - - -## Possible concerns - -### Permissions - -This proposal does not attempt to change the state of volume permissions. Currently, a HostPath volume is created with `root` ownership and `755` permissions. This behavior will be retained. An argument for this behavior is given [here](volumes.md#shared-storage-hostpath). - -### SELinux - -This proposal should not impact SELinux relabeling. Verifying the presence and -type of a given path will be logically separate from SELinux labeling. -Similarly, creating the directory when it doesn't exist will happen before any -SELinux operations and should not impact it. - - -### Containerized Kubelet - -A containerized kubelet would have difficulty creating directories. The -implementation will likely respect the `containerized` flag, or similar, -allowing it to either break out or be "/rootfs/" aware and thus operate as -desired. - -### Racy Validation - -Ideally the validation would be done at the time the bindmounts are created, -else it's possible for a given path or directory to change in the duration from -when it's validated and the container runtime attempts to create said mount. - -The only way to solve this problem is to integrate these sorts of qualification -into container runtimes themselves. - -I don't think this problem is severe enough that we need to push to solve it; -rather I think we can simply accept this minor race, and if runtimes eventually -allow this we can begin to leverage them. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/storage/volume-metrics.md b/contributors/design-proposals/storage/volume-metrics.md index 4bea9645..f0fbec72 100644 --- a/contributors/design-proposals/storage/volume-metrics.md +++ b/contributors/design-proposals/storage/volume-metrics.md @@ -1,141 +1,6 @@ -# Volume operation metrics +Design proposals have been archived. -## Goal +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Capture high level metrics for various volume operations in Kubernetes. -## Motivation - -Currently we don't have high level metrics that captures time taken -and success/failures rates of various volume operations. - -This proposal aims to implement capturing of these metrics at a level -higher than individual volume plugins. - -## Implementation - -### Metric format and collection - -Volume metrics emitted will fall under category of service metrics -as defined in [Kubernetes Monitoring Architecture](/contributors/design-proposals/instrumentation/monitoring_architecture.md). - - -The metrics will be emitted using [Prometheus format](https://prometheus.io/docs/instrumenting/exposition_formats/) and available for collection -from `/metrics` HTTP endpoint of kubelet and controller-manager. - - -Any collector which can parse Prometheus metric format should be able to collect -metrics from these endpoints. - -A more detailed description of monitoring pipeline can be found in [Monitoring architecture](/contributors/design-proposals/instrumentation/monitoring_architecture.md#monitoring-pipeline) document. - -### Metric Types - -Since we are interested in count(or rate) and time it takes to perform certain volume operation - we will use [Histogram](https://prometheus.io/docs/practices/histograms/) type for -emitting these metrics. - -We will be using `HistogramVec` type so as we can attach dimensions at runtime. All -the volume operation metrics will be named `storage_operation_duration_seconds`. -Name of operation and volume plugin's name will be emitted as dimensions. If for some reason -volume plugin's name is not available when operation is performed - label's value can be set -to `<n/a>`. - - -We are also interested in count of volume operation failures and hence a metric of type `NewCounterVec` -will be used for keeping track of errors. The error metric will be similarly named `storage_operation_errors_total`. 
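A minimal sketch of how these two metric vectors could be declared with the Prometheus Go client (`github.com/prometheus/client_golang`) is shown below. The variable names and help strings are illustrative only, and the default histogram buckets are left in place.

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
)

var (
	// storage_operation_duration_seconds{volume_plugin, operation_name}
	storageOperationDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name: "storage_operation_duration_seconds",
			Help: "Time taken by volume operations, in seconds.",
		},
		[]string{"volume_plugin", "operation_name"},
	)
	// storage_operation_errors_total{volume_plugin, operation_name}
	storageOperationErrors = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "storage_operation_errors_total",
			Help: "Number of failed volume operations.",
		},
		[]string{"volume_plugin", "operation_name"},
	)
)

func init() {
	// Registering makes the metrics available to whatever handler serves /metrics.
	prometheus.MustRegister(storageOperationDuration, storageOperationErrors)
}

func main() {
	// Record a successful 2.5s attach for the aws-ebs plugin.
	storageOperationDuration.WithLabelValues("aws-ebs", "volume_attach").Observe(2.5)
	// Record a failed detach.
	storageOperationErrors.WithLabelValues("aws-ebs", "volume_detach").Inc()
}
```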
- -Following is a sample of metrics (not exhaustive) that will be added by this proposal: - - -``` -storage_operation_duration_seconds { volume_plugin = "aws-ebs", operation_name = "volume_attach" } -storage_operation_duration_seconds { volume_plugin = "aws-ebs", operation_name = "volume_detach" } -storage_operation_duration_seconds { volume_plugin = "glusterfs", operation_name = "volume_provision" } -storage_operation_duration_seconds { volume_plugin = "gce-pd", operation_name = "volume_delete" } -storage_operation_duration_seconds { volume_plugin = "vsphere", operation_name = "volume_mount" } -storage_operation_duration_seconds { volume_plugin = "iscsi" , operation_name = "volume_unmount" } -storage_operation_duration_seconds { volume_plugin = "aws-ebs", operation_name = "unmount_device" } -storage_operation_duration_seconds { volume_plugin = "cinder" , operation_name = "verify_volumes_are_attached" } -storage_operation_duration_seconds { volume_plugin = "<n/a>" , operation_name = "verify_volumes_are_attached_per_node" } -``` - -Similarly errors will be named: - -``` -storage_operation_errors_total { volume_plugin = "aws-ebs", operation_name = "volume_attach" } -storage_operation_errors_total { volume_plugin = "aws-ebs", operation_name = "volume_detach" } -storage_operation_errors_total { volume_plugin = "glusterfs", operation_name = "volume_provision" } -storage_operation_errors_total { volume_plugin = "gce-pd", operation_name = "volume_delete" } -storage_operation_errors_total { volume_plugin = "vsphere", operation_name = "volume_mount" } -storage_operation_errors_total { volume_plugin = "iscsi" , operation_name = "volume_unmount" } -storage_operation_errors_total { volume_plugin = "aws-ebs", operation_name = "unmount_device" } -storage_operation_errors_total { volume_plugin = "cinder" , operation_name = "verify_volumes_are_attached" } -storage_operation_errors_total { volume_plugin = "<n/a>" , operation_name = "verify_volumes_are_attached_per_node" } -``` - -### Implementation Detail - -We propose following changes as part of implementation details. - -1. All volume operations are executed via `goroutinemap.Run` or `nestedpendingoperations.Run`. -`Run` function interface of these two types can be changed to include a `operationComplete` callback argument. - - For example: - - ```go - // nestedpendingoperations.go - Run(v1.UniqueVolumeName, types.UniquePodName, func() error, opComplete func(error)) error - // goroutinemap - Run(string, func() error, opComplete func(error)) error - ``` - - This will enable us to know when a volume operation is complete. - -2. All `GenXXX` functions in `operation_generator.go` should return plugin name in addition to function and error. - - for example: - - ```go - GenerateMountVolumeFunc(waitForAttachTimeout time.Duration, - volumeToMount VolumeToMount, - actualStateOfWorldMounterUpdater - ActualStateOfWorldMounterUpdater, isRemount bool) (func() error, pluginName string, err error) - ``` - - Similarly `pv_controller.scheduleOperation` will take plugin name as additional parameter: - - ```go - func (ctrl *PersistentVolumeController) scheduleOperation( - operationName string, - pluginName string, - operation func() error) - ``` - -3. Above changes will enable us to gather required metrics in `operation_executor` or when scheduling a operation in -pv controller. 
- - For example, metrics for time it takes to attach Volume can be captured via: - - ```go - func operationExecutorHook(plugin, operationName string) func(error) { - requestTime := time.Now() - opComplete := func(err error) { - timeTaken := time.Since(requestTime).Seconds() - // Create metric with operation name and plugin name - } - return onComplete - } - attachFunc, plugin, err := - oe.operationGenerator.GenerateAttachVolumeFunc(volumeToAttach, actualStateOfWorld) - opCompleteFunc := operationExecutorHook(plugin, "volume_attach") - return oe.pendingOperations.Run( - volumeToAttach.VolumeName, "" /* podName */, attachFunc, opCompleteFunc) - ``` - - `operationExecutorHook` function is a hook that is registered in operation_executor and it will - initialize necessary metric params and will return a function. This will be called when - operation is complete and will finalize metric creation and finally emit the metrics. - -### Conclusion - -Collection of metrics at operation level ensures almost no code change to volume plugin interface and a very minimum change to controllers. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
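For completeness, here is a self-contained variant of the `operationExecutorHook` sketch above (note the original defines `opComplete` but returns `onComplete`). The metric emission is left as comments because the exact histogram and counter objects are an implementation detail; everything else uses only the standard library, and the hook name is illustrative.

```go
package main

import (
	"fmt"
	"time"
)

// operationCompleteHook returns a callback that records how long a volume
// operation took and whether it failed, keyed by plugin and operation name.
func operationCompleteHook(plugin, operationName string) func(error) {
	requestTime := time.Now()
	return func(err error) {
		timeTaken := time.Since(requestTime).Seconds()
		if err != nil {
			// e.g. storageOperationErrors.WithLabelValues(plugin, operationName).Inc()
			fmt.Printf("%s/%s failed after %.3fs: %v\n", plugin, operationName, timeTaken, err)
			return
		}
		// e.g. storageOperationDuration.WithLabelValues(plugin, operationName).Observe(timeTaken)
		fmt.Printf("%s/%s succeeded in %.3fs\n", plugin, operationName, timeTaken)
	}
}

func main() {
	done := operationCompleteHook("aws-ebs", "volume_attach")
	time.Sleep(10 * time.Millisecond) // stand-in for the real attach operation
	done(nil)
}
```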
\ No newline at end of file diff --git a/contributors/design-proposals/storage/volume-ownership-management.md b/contributors/design-proposals/storage/volume-ownership-management.md index a9fb1cfe..f0fbec72 100644 --- a/contributors/design-proposals/storage/volume-ownership-management.md +++ b/contributors/design-proposals/storage/volume-ownership-management.md @@ -1,108 +1,6 @@ -## Volume plugins and idempotency +Design proposals have been archived. -Currently, volume plugins have a `SetUp` method which is called in the context of a higher-level -workflow within the kubelet which has externalized the problem of managing the ownership of volumes. -This design has a number of drawbacks that can be mitigated by completely internalizing all concerns -of volume setup behind the volume plugin `SetUp` method. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -### Known issues with current externalized design -1. The ownership management is currently repeatedly applied, which breaks packages that require - special permissions in order to work correctly -2. There is a gap between files being mounted/created by volume plugins and when their ownership - is set correctly; race conditions exist around this -3. Solving the correct application of ownership management in an externalized model is difficult - and makes it clear that the a transaction boundary is being broken by the externalized design - -### Additional issues with externalization - -Fully externalizing any one concern of volumes is difficult for a number of reasons: - -1. Many types of idempotence checks exist, and are used in a variety of combinations and orders -2. Workflow in the kubelet becomes much more complex to handle: - 1. composition of plugins - 2. correct timing of application of ownership management - 3. callback to volume plugins when we know the whole `SetUp` flow is complete and correct - 4. callback to touch sentinel files - 5. etc etc -3. We want to support fully external volume plugins -- would require complex orchestration / chatty - remote API - -## Proposed implementation - -Since all of the ownership information is known in advance of the call to the volume plugin `SetUp` -method, we can easily internalize these concerns into the volume plugins and pass the ownership -information to `SetUp`. - -The volume `Builder` interface's `SetUp` method changes to accept the group that should own the -volume. Plugins become responsible for ensuring that the correct group is applied. The volume -`Attributes` struct can be modified to remove the `SupportsOwnershipManagement` field. - -```go -package volume - -type Builder interface { - // other methods omitted - - // SetUp prepares and mounts/unpacks the volume to a self-determined - // directory path and returns an error. The group ID that should own the volume - // is passed as a parameter. Plugins may choose to ignore the group ID directive - // in the event that they do not support it (example: NFS). A group ID of -1 - // indicates that the group ownership of the volume should not be modified by the plugin. - // - // SetUp will be called multiple times and should be idempotent. - SetUp(gid int64) error -} -``` - -Each volume plugin will have to change to support the new `SetUp` signature. The existing -ownership management code will be refactored into a library that volume plugins can use: - -```go -package volume - -func ManageOwnership(path string, fsGroup int64) error { - // 1. 
recursive chown of path - // 2. make path +setgid -} -``` - -The workflow from the Kubelet's perspective for handling volume setup and refresh becomes: - -```go -// go-ish pseudocode -func mountExternalVolumes(pod) error { - podVolumes := make(kubecontainer.VolumeMap) - for i := range pod.Spec.Volumes { - volSpec := &pod.Spec.Volumes[i] - var fsGroup int64 = 0 - if pod.Spec.SecurityContext != nil && - pod.Spec.SecurityContext.FSGroup != nil { - fsGroup = *pod.Spec.SecurityContext.FSGroup - } else { - fsGroup = -1 - } - - // Try to use a plugin for this volume. - plugin := volume.NewSpecFromVolume(volSpec) - builder, err := kl.newVolumeBuilderFromPlugins(plugin, pod) - if err != nil { - return err - } - if builder == nil { - return errUnsupportedVolumeType - } - - err := builder.SetUp(fsGroup) - if err != nil { - return nil - } - } - - return nil -} -``` - -<!-- BEGIN MUNGE: GENERATED_ANALYTICS --> -[]() -<!-- END MUNGE: GENERATED_ANALYTICS --> +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
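A minimal, illustrative implementation of the `ManageOwnership` library function described above could look like the following. It assumes a POSIX filesystem, leaves the owning UID untouched, and treats an `fsGroup` of -1 as a no-op to match the `SetUp` contract; it is a sketch, not the proposed implementation.

```go
package main

import (
	"os"
	"path/filepath"
)

// ManageOwnership recursively sets the group of everything under path to
// fsGroup and marks directories setgid so that files created later inherit
// the group. A fsGroup of -1 means "do not modify ownership".
func ManageOwnership(path string, fsGroup int64) error {
	if fsGroup == -1 {
		return nil
	}
	return filepath.Walk(path, func(p string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		// -1 for uid leaves the owning user unchanged.
		if err := os.Chown(p, -1, int(fsGroup)); err != nil {
			return err
		}
		if info.IsDir() {
			// setgid on directories so new files inherit fsGroup.
			return os.Chmod(p, info.Mode().Perm()|os.ModeSetgid)
		}
		return nil
	})
}

func main() {
	// Illustrative path only; a real caller would pass the volume's directory.
	_ = ManageOwnership("/var/lib/kubelet/pods/example/volumes/foo", 1000)
}
```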
\ No newline at end of file diff --git a/contributors/design-proposals/storage/volume-provisioning.md b/contributors/design-proposals/storage/volume-provisioning.md index 316ec4f0..f0fbec72 100644 --- a/contributors/design-proposals/storage/volume-provisioning.md +++ b/contributors/design-proposals/storage/volume-provisioning.md @@ -1,546 +1,6 @@ -## Abstract +Design proposals have been archived. -Real Kubernetes clusters have a variety of volumes which differ widely in -size, iops performance, retention policy, and other characteristics. -Administrators need a way to dynamically provision volumes of these different -types to automatically meet user demand. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -A new mechanism called 'storage classes' is proposed to provide this -capability. -## Motivation - -In Kubernetes 1.2, an alpha form of limited dynamic provisioning was added -that allows a single volume type to be provisioned in clouds that offer -special volume types. - -In Kubernetes 1.3, a label selector was added to persistent volume claims to -allow administrators to create a taxonomy of volumes based on the -characteristics important to them, and to allow users to make claims on those -volumes based on those characteristics. This allows flexibility when claiming -existing volumes; the same flexibility is needed when dynamically provisioning -volumes. - -After gaining experience with dynamic provisioning after the 1.2 release, we -want to create a more flexible feature that allows configuration of how -different storage classes are provisioned and supports provisioning multiple -types of volumes within a single cloud. - -### Out-of-tree provisioners - -One of our goals is to enable administrators to create out-of-tree -provisioners, that is, provisioners whose code does not live in the Kubernetes -project. - -## Design - -This design represents the minimally viable changes required to provision based on storage class configuration. Additional incremental features may be added as a separate effort. - -We propose that: - -1. Both for in-tree and out-of-tree storage provisioners, the PV created by the - provisioners must match the PVC that led to its creations. If a provisioner - is unable to provision such a matching PV, it reports an error to the - user. - -2. The above point applies also to PVC label selector. If user submits a PVC - with a label selector, the provisioner must provision a PV with matching - labels. This directly implies that the provisioner understands meaning - behind these labels - if user submits a claim with selector that wants - a PV with label "region" not in "[east,west]", the provisioner must - understand what label "region" means, what available regions are there and - choose e.g. "north". - - In other words, provisioners should either refuse to provision a volume for - a PVC that has a selector, or select few labels that are allowed in - selectors (such as the "region" example above), implement necessary logic - for their parsing, document them and refuse any selector that references - unknown labels. - -3. An api object will be incubated in storage.k8s.io/v1beta1 to hold the a `StorageClass` - API resource. Each StorageClass object contains parameters required by the provisioner to provision volumes of that class. These parameters are opaque to the user. - -4. `PersistentVolume.Spec.Class` attribute is added to volumes. 
This attribute - is optional and specifies which `StorageClass` instance represents - storage characteristics of a particular PV. - - During incubation, `Class` is an annotation and not - actual attribute. - -5. `PersistentVolume` instances do not require labels by the provisioner. - -6. `PersistentVolumeClaim.Spec.Class` attribute is added to claims. This - attribute specifies that only a volume with equal - `PersistentVolume.Spec.Class` value can satisfy a claim. - - During incubation, `Class` is just an annotation and not - actual attribute. - -7. The existing provisioner plugin implementations be modified to accept - parameters as specified via `StorageClass`. - -8. The persistent volume controller modified to invoke provisioners using `StorageClass` configuration and bind claims with `PersistentVolumeClaim.Spec.Class` to volumes with equivalent `PersistentVolume.Spec.Class` - -9. The existing alpha dynamic provisioning feature be phased out in the - next release. - -### Controller workflow for provisioning volumes - -0. Kubernetes administrator can configure name of a default StorageClass. This - StorageClass instance is then used when user requests a dynamically - provisioned volume, but does not specify a StorageClass. In other words, - `claim.Spec.Class == ""` - (or annotation `volume.beta.kubernetes.io/storage-class == ""`). - -1. When a new claim is submitted, the controller attempts to find an existing - volume that will fulfill the claim. - - 1. If the claim has non-empty `claim.Spec.Class`, only PVs with the same - `pv.Spec.Class` are considered. - - 2. If the claim has empty `claim.Spec.Class`, only PVs with an unset `pv.Spec.Class` are considered. - - All "considered" volumes are evaluated and the - smallest matching volume is bound to the claim. - -2. If no volume is found for the claim and `claim.Spec.Class` is not set or is - empty string dynamic provisioning is disabled. - -3. If `claim.Spec.Class` is set the controller tries to find instance of StorageClass with this name. If no - such StorageClass is found, the controller goes back to step 1. and - periodically retries finding a matching volume or storage class again until - a match is found. The claim is `Pending` during this period. - -4. With StorageClass instance, the controller updates the claim: - * `claim.Annotations["volume.beta.kubernetes.io/storage-provisioner"] = storageClass.Provisioner` - -* **In-tree provisioning** - - The controller tries to find an internal volume plugin referenced by - `storageClass.Provisioner`. If it is found: - - 5. The internal provisioner implements interface`ProvisionableVolumePlugin`, - which has a method called `NewProvisioner` that returns a new provisioner. - - 6. The controller calls volume plugin `Provision` with Parameters - from the `StorageClass` configuration object. - - 7. If `Provision` returns an error, the controller generates an event on the - claim and goes back to step 1., i.e. it will retry provisioning - periodically. - - 8. If `Provision` returns no error, the controller creates the returned - `api.PersistentVolume`, fills its `Class` attribute with `claim.Spec.Class` - and makes it already bound to the claim - - 1. If the create operation for the `api.PersistentVolume` fails, it is - retried - - 2. If the create operation does not succeed in reasonable time, the - controller attempts to delete the provisioned volume and creates an event - on the claim - -Existing behavior is unchanged for claims that do not specify -`claim.Spec.Class`. 
- -* **Out of tree provisioning** - - Following step 4. above, the controller tries to find internal plugin for the - `StorageClass`. If it is not found, it does not do anything, it just - periodically goes to step 1., i.e. tries to find available matching PV. - - The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", - "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be - interpreted as described in RFC 2119. - - External provisioner must have these features: - - * It MUST have a distinct name, following Kubernetes plugin naming scheme - `<vendor name>/<provisioner name>`, e.g. `gluster.org/gluster-volume`. - - * The provisioner SHOULD send events on a claim to report any errors - related to provisioning a volume for the claim. This way, users get the same - experience as with internal provisioners. - - * The provisioner MUST implement also a deleter. It must be able to delete - storage assets it created. It MUST NOT assume that any other internal or - external plugin is present. - - The external provisioner runs in a separate process which watches claims, be - it an external storage appliance, a daemon or a Kubernetes pod. For every - claim creation or update, it implements these steps: - - 1. The provisioner inspects if - `claim.Annotations["volume.beta.kubernetes.io/storage-provisioner"] == <provisioner name>`. - All other claims MUST be ignored. - - 2. The provisioner MUST check that the claim is unbound, i.e. its - `claim.Spec.VolumeName` is empty. Bound volumes MUST be ignored. - - *Race condition when the provisioner provisions a new PV for a claim and - at the same time Kubernetes binds the same claim to another PV that was - just created by admin is discussed below.* - - 3. It tries to find a StorageClass instance referenced by annotation - `claim.Annotations["volume.beta.kubernetes.io/storage-class"]`. If not - found, it SHOULD report an error (by sending an event to the claim) and it - SHOULD retry periodically with step i. - - 4. The provisioner MUST parse arguments in the `StorageClass` and - `claim.Spec.Selector` and provisions appropriate storage asset that matches - both the parameters and the selector. - When it encounters unknown parameters in `storageClass.Parameters` or - `claim.Spec.Selector` or the combination of these parameters is impossible - to achieve, it SHOULD report an error and it MUST NOT provision a volume. - All errors found during parsing or provisioning SHOULD be send as events - on the claim and the provisioner SHOULD retry periodically with step i. - - As parsing (and understanding) claim selectors is hard, the sentence - "MUST parse ... `claim.Spec.Selector`" will in typical case lead to simple - refusal of claims that have any selector: - - ```go - if pvc.Spec.Selector != nil { - return Error("can't parse PVC selector!") - } - ``` - - 5. When the volume is provisioned, the provisioner MUST create a new PV - representing the storage asset and save it in Kubernetes. When this fails, - it SHOULD retry creating the PV again few times. If all attempts fail, it - MUST delete the storage asset. All errors SHOULD be sent as events to the - claim. - - The created PV MUST have these properties: - - * `pv.Spec.ClaimRef` MUST point to the claim that led to its creation - (including the claim UID). - - *This way, the PV will be bound to the claim.* - - * `pv.Annotations["pv.kubernetes.io/provisioned-by"]` MUST be set to name - of the external provisioner. This provisioner will be used to delete the - volume. 
- - *The provisioner/delete should not assume there is any other - provisioner/deleter available that would delete the volume.* - - * `pv.Annotations["volume.beta.kubernetes.io/storage-class"]` MUST be set - to name of the storage class requested by the claim. - - *So the created PV matches the claim.* - - * The provisioner MAY store any other information to the created PV as - annotations. It SHOULD save any information that is needed to delete the - storage asset there, as appropriate StorageClass instance may not exist - when the volume will be deleted. However, references to Secret instance - or direct username/password to a remote storage appliance MUST NOT be - stored there, see issue #34822. - - * `pv.Labels` MUST be set to match `claim.spec.selector`. The provisioner - MAY add additional labels. - - *So the created PV matches the claim.* - - * `pv.Spec` MUST be set to match requirements in `claim.Spec`, especially - access mode and PV size. The provisioned volume size MUST NOT be smaller - than size requested in the claim, however it MAY be larger. - - * Kubernetes v1.9 or later have functionality to deploy raw block volume - instead of filesystem volume as a new feature. To support the feature, - we added `volumeMode` parameter which takes values `Filesystem` and - `Block` to `pv.Spec` and `pvc.Spec`. In order to deploy block volume - via external provisioner, following conditions are REQUIRED. - * A storage has ability to create raw block type of volume - * Block volume feature has been supported by the volume plugin - * External-provisioner MUST set `volumeMode` which matches requirements - in `claim.Spec` into `pv.Spec`. - - *So the created PV matches the claim.* - - * `pv.Spec.PersistentVolumeSource` MUST be set to point to the created - storage asset. - - * `pv.Spec.PersistentVolumeReclaimPolicy` SHOULD be set to `Delete` unless - user manually configures other reclaim policy. - - * `pv.Name` MUST be unique. Internal provisioners use name based on - `claim.UID` to produce conflicts when two provisioners accidentally - provision a PV for the same claim, however external provisioners can use - any mechanism to generate an unique PV name. - - Example 1) a claim that is to be provisioned by an external provisioner for - `foo.org/foo-volume`: - - ```yaml - apiVersion: v1 - kind: PersistentVolumeClaim - metadata: - annotations: - volume.beta.kubernetes.io/storage-class: myClass - volume.beta.kubernetes.io/storage-provisioner: foo.org/foo-volume - name: fooclaim - namespace: default - resourceVersion: "53" - uid: 5a294561-7e5b-11e6-a20e-0eb6048532a3 - spec: - accessModes: - - ReadWriteOnce - volumeMode: Filesystem - resources: - requests: - storage: 4Gi - # volumeName: must be empty! 
- ``` - - Example 1) the created PV: - - ```yaml - apiVersion: v1 - kind: PersistentVolume - metadata: - annotations: - pv.kubernetes.io/provisioned-by: foo.org/foo-volume - volume.beta.kubernetes.io/storage-class: myClass - foo.org/provisioner: "any other annotations as needed" - labels: - foo.org/my-label: "any labels as needed" - generateName: "foo-volume-" - spec: - accessModes: - - ReadWriteOnce - volumeMode: Filesystem - awsElasticBlockStore: - fsType: ext4 - volumeID: aws://us-east-1d/vol-de401a79 - capacity: - storage: 4Gi - claimRef: - apiVersion: v1 - kind: PersistentVolumeClaim - name: fooclaim - namespace: default - resourceVersion: "53" - uid: 5a294561-7e5b-11e6-a20e-0eb6048532a3 - persistentVolumeReclaimPolicy: Delete - ``` - - Example 2) a claim that provisions `volumeMode: Block` volume: - - ```yaml - apiVersion: v1 - kind: PersistentVolumeClaim - metadata: - ... - spec: - accessModes: - - ReadWriteOnce - volumeMode: Block - resources: - requests: - storage: 4Gi - # volumeName: must be empty! - ``` - - Example 2) the created PV: - - ```yaml - apiVersion: v1 - kind: PersistentVolume - metadata: - ... - spec: - accessModes: - - ReadWriteOnce - volumeMode: Block - awsElasticBlockStore: - volumeID: aws://us-east-1d/vol-de401a79 - capacity: - storage: 4Gi - claimRef: - ... - persistentVolumeReclaimPolicy: Delete - ``` - - - As result, Kubernetes has a PV that represents the storage asset and is bound - to the claim. When everything went well, Kubernetes completed binding of the - claim to the PV. - - Kubernetes was not blocked in any way during the provisioning and could - either bound the claim to another PV that was created by user or even the - claim may have been deleted by the user. In both cases, Kubernetes will mark - the PV to be delete using the protocol below. - - The external provisioner MAY save any annotations to the claim that is - provisioned, however the claim may be modified or even deleted by the user at - any time. - - -### Controller workflow for deleting volumes - -When the controller decides that a volume should be deleted it performs these -steps: - -1. The controller changes `pv.Status.Phase` to `Released`. - -2. The controller looks for `pv.Annotations["pv.kubernetes.io/provisioned-by"]`. - If found, it uses this provisioner/deleter to delete the volume. - -3. If the volume is not annotated by `pv.kubernetes.io/provisioned-by`, the - controller inspects `pv.Spec` and finds in-tree deleter for the volume. - -4. If the deleter found by steps 2. or 3. is internal, it calls it and deletes - the storage asset together with the PV that represents it. - -5. If the deleter is not known to Kubernetes, it does not do anything. - -6. External deleters MUST watch for PV changes. When - `pv.Status.Phase == Released && pv.Annotations['pv.kubernetes.io/provisioned-by'] == <deleter name>`, - the deleter: - - * It MUST check reclaim policy of the PV and ignore all PVs whose - `Spec.PersistentVolumeReclaimPolicy` is not `Delete`. - - * It MUST delete the storage asset. - - * Only after the storage asset was successfully deleted, it MUST delete the - PV object in Kubernetes. - - * Any error SHOULD be sent as an event on the PV being deleted and the - deleter SHOULD retry to delete the volume periodically. - - * The deleter SHOULD NOT use any information from StorageClass instance - referenced by the PV. 
This is different to internal deleters, which - need to be StorageClass instance present at the time of deletion to read - Secret instances (see Gluster provisioner for example), however we would - like to phase out this behavior. - - Note that watching `pv.Status` has been frowned upon in the past, however in - this particular case we could use it quite reliably to trigger deletion. - It's not trivial to find out if a PV is not needed and should be deleted. - *Alternatively, an annotation could be used.* - -### Security considerations - -Both internal and external provisioners and deleters may need access to -credentials (e.g. username+password) of an external storage appliance to -provision and delete volumes. - -* For internal provisioners, a Secret instance in a well secured namespace -should be used. Pointer to the Secret instance shall be parameter of the -StorageClass and it MUST NOT be copied around the system e.g. in annotations -of PVs. See issue #34822. - -* External provisioners running in pod should have appropriate credentials -mounted as Secret inside pods that run the provisioner. Namespace with the pods -and Secret instance should be well secured. - -### `StorageClass` API - -A new API group should hold the API for storage classes, following the pattern -of autoscaling, metrics, etc. To allow for future storage-related APIs, we -should call this new API group `storage.k8s.io` and incubate in storage.k8s.io/v1beta1. - -Storage classes will be represented by an API object called `StorageClass`: - -```go -package storage - -// StorageClass describes the parameters for a class of storage for -// which PersistentVolumes can be dynamically provisioned. -// -// StorageClasses are non-namespaced; the name of the storage class -// according to etcd is in ObjectMeta.Name. -type StorageClass struct { - unversioned.TypeMeta `json:",inline"` - ObjectMeta `json:"metadata,omitempty"` - - // Provisioner indicates the type of the provisioner. - Provisioner string `json:"provisioner,omitempty"` - - // Parameters for dynamic volume provisioner. - Parameters map[string]string `json:"parameters,omitempty"` -} - -``` - -`PersistentVolumeClaimSpec` and `PersistentVolumeSpec` both get Class attribute -(the existing annotation is used during incubation): - -```go -type PersistentVolumeClaimSpec struct { - // Name of requested storage class. If non-empty, only PVs with this - // pv.Spec.Class will be considered for binding and if no such PV is - // available, StorageClass with this name will be used to dynamically - // provision the volume. - Class string -... -} - -type PersistentVolumeSpec struct { - // Name of StorageClass instance that this volume belongs to. - Class string -... -} -``` - -Storage classes are natural to think of as a global resource, since they: - -1. Align with PersistentVolumes, which are a global resource -2. Are administrator controlled - -### Provisioning configuration - -With the scheme outlined above the provisioner creates PVs using parameters specified in the `StorageClass` object. - -### Provisioner interface changes - -`struct volume.VolumeOptions` (containing parameters for a provisioner plugin) -will be extended to contain StorageClass.Parameters. - -The existing provisioner implementations will be modified to accept the StorageClass configuration object. - -### PV Controller Changes - -The persistent volume controller will be modified to implement the new -workflow described in this proposal. 
The changes will be limited to the -`provisionClaimOperation` method, which is responsible for invoking the -provisioner and to favor existing volumes before provisioning a new one. - -## Examples - -### AWS provisioners with distinct QoS - -This example shows two storage classes, "aws-fast" and "aws-slow". - -```yaml -apiVersion: v1 -kind: StorageClass -metadata: - name: aws-fast -provisioner: kubernetes.io/aws-ebs -parameters: - zone: us-east-1b - type: ssd - - -apiVersion: v1 -kind: StorageClass -metadata: - name: aws-slow -provisioner: kubernetes.io/aws-ebs -parameters: - zone: us-east-1b - type: spinning -``` - -# Additional Implementation Details - -0. Annotation `volume.alpha.kubernetes.io/storage-class` is used instead of `claim.Spec.Class` and `volume.Spec.Class` during incubation. - -1. `claim.Spec.Selector` and `claim.Spec.Class` are mutually exclusive for now (1.4). User can either match existing volumes with `Selector` XOR match existing volumes with `Class` and get dynamic provisioning by using `Class`. This simplifies initial PR and also provisioners. This limitation may be lifted in future releases. - -# Cloud Providers - -Since the `volume.alpha.kubernetes.io/storage-class` is in use a `StorageClass` must be defined to support provisioning. No default is assumed as before. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
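To summarize the external-provisioner contract above in code, the sketch below shows the claim-filtering checks an out-of-tree provisioner performs before provisioning, including the simple selector refusal suggested in the text. The function name `shouldProvision` is hypothetical; the annotation keys are the beta keys used during incubation, and the types are the current `k8s.io/api/core/v1` ones.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

const (
	annStorageProvisioner = "volume.beta.kubernetes.io/storage-provisioner"
	annStorageClass       = "volume.beta.kubernetes.io/storage-class"
)

// shouldProvision applies the claim-filtering checks from the external
// provisioner contract and, when the claim qualifies, returns the name of the
// storage class whose parameters should drive provisioning.
func shouldProvision(claim *v1.PersistentVolumeClaim, provisionerName string) (bool, string, error) {
	// Only act on claims dispatched to this provisioner.
	if claim.Annotations[annStorageProvisioner] != provisionerName {
		return false, "", nil
	}
	// Ignore claims that are already bound.
	if claim.Spec.VolumeName != "" {
		return false, "", nil
	}
	// Parsing arbitrary selectors is hard; simply refuse claims that have one.
	if claim.Spec.Selector != nil {
		return false, "", fmt.Errorf("claim %s/%s: can't parse PVC selector", claim.Namespace, claim.Name)
	}
	// The storage class named here is looked up next for its parameters.
	return true, claim.Annotations[annStorageClass], nil
}

func main() {
	claim := &v1.PersistentVolumeClaim{}
	claim.Name, claim.Namespace = "fooclaim", "default"
	claim.Annotations = map[string]string{
		annStorageProvisioner: "foo.org/foo-volume",
		annStorageClass:       "myClass",
	}
	fmt.Println(shouldProvision(claim, "foo.org/foo-volume")) // true myClass <nil>
}
```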
\ No newline at end of file diff --git a/contributors/design-proposals/storage/volume-selectors.md b/contributors/design-proposals/storage/volume-selectors.md index 5af92d0c..f0fbec72 100644 --- a/contributors/design-proposals/storage/volume-selectors.md +++ b/contributors/design-proposals/storage/volume-selectors.md @@ -1,262 +1,6 @@ -## Abstract +Design proposals have been archived. -Real Kubernetes clusters have a variety of volumes which differ widely in -size, iops performance, retention policy, and other characteristics. A -mechanism is needed to enable administrators to describe the taxonomy of these -volumes, and for users to make claims on these volumes based on their -attributes within this taxonomy. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -A label selector mechanism is proposed to enable flexible selection of volumes -by persistent volume claims. -## Motivation - -Currently, users of persistent volumes have the ability to make claims on -those volumes based on some criteria such as the access modes the volume -supports and minimum resources offered by a volume. In an organization, there -are often more complex requirements for the storage volumes needed by -different groups of users. A mechanism is needed to model these different -types of volumes and to allow users to select those different types without -being intimately familiar with their underlying characteristics. - -As an example, many cloud providers offer a range of performance -characteristics for storage, with higher performing storage being more -expensive. Cluster administrators want the ability to: - -1. Invent a taxonomy of logical storage classes using the attributes - important to them -2. Allow users to make claims on volumes using these attributes - -## Constraints and Assumptions - -The proposed design should: - -1. Deal with manually-created volumes -2. Not necessarily require users to know or understand the differences between - volumes (ie, Kubernetes should not dictate any particular set of - characteristics to administrators to think in terms of) - -We will focus **only** on the barest mechanisms to describe and implement -label selectors in this proposal. We will address the following topics in -future proposals: - -1. An extension resource or third party resource for storage classes -1. Dynamically provisioning new volumes for based on storage class - -## Use Cases - -1. As a user, I want to be able to make a claim on a persistent volume by - specifying a label selector as well as the currently available attributes - -### Use Case: Taxonomy of Persistent Volumes - -Kubernetes offers volume types for a variety of storage systems. Within each -of those storage systems, there are numerous ways in which volume instances -may differ from one another: iops performance, retention policy, etc. -Administrators of real clusters typically need to manage a variety of -different volumes with different characteristics for different groups of -users. - -Kubernetes should make it possible for administrators to flexibly model the -taxonomy of volumes in their clusters and to label volumes with their storage -class. This capability must be optional and fully backward-compatible with -the existing API. - -Let's look at an example. This example is *purely fictitious* and the -taxonomies presented here are not a suggestion of any sort. 
In the case of -AWS EBS there are four different types of volume (in ascending order of cost): - -1. Cold HDD -2. Throughput optimized HDD -3. General purpose SSD -4. Provisioned IOPS SSD - -Currently, there is no way to distinguish between a group of 4 PVs where each -volume is of one of these different types. Administrators need the ability to -distinguish between instances of these types. An administrator might decide -to think of these volumes as follows: - -1. Cold HDD - `tin` -2. Throughput optimized HDD - `bronze` -3. General purpose SSD - `silver` -4. Provisioned IOPS SSD - `gold` - -This is not the only dimension that EBS volumes can differ in. Let's simplify -things and imagine that AWS has two availability zones, `east` and `west`. Our -administrators want to differentiate between volumes of the same type in these -two zones, so they create a taxonomy of volumes like so: - -1. `tin-west` -2. `tin-east` -3. `bronze-west` -4. `bronze-east` -5. `silver-west` -6. `silver-east` -7. `gold-west` -8. `gold-east` - -Another administrator of the same cluster might label things differently, -choosing to focus on the business role of volumes. Say that the data -warehouse department is the sole consumer of the cold HDD type, and the DB as -a service offering is the sole consumer of provisioned IOPS volumes. The -administrator might decide on the following taxonomy of volumes: - -1. `warehouse-east` -2. `warehouse-west` -3. `dbaas-east` -4. `dbaas-west` - -There are any number of ways an administrator may choose to distinguish -between volumes. Labels are used in Kubernetes to express the user-defined -properties of API objects and are a good fit to express this information for -volumes. In the examples above, administrators might differentiate between -the classes of volumes using the labels `business-unit`, `volume-type`, or -`region`. - -Label selectors are used through the Kubernetes API to describe relationships -between API objects using flexible, user-defined criteria. It makes sense to -use the same mechanism with persistent volumes and storage claims to provide -the same functionality for these API objects. - -## Proposed Design - -We propose that: - -1. A new field called `Selector` be added to the `PersistentVolumeClaimSpec` - type -2. The persistent volume controller be modified to account for this selector - when determining the volume to bind to a claim - -### Persistent Volume Selector - -Label selectors are used throughout the API to allow users to express -relationships in a flexible manner. The problem of selecting a volume to -match a claim fits perfectly within this metaphor. Adding a label selector to -`PersistentVolumeClaimSpec` will allow users to label their volumes with -criteria important to them and select volumes based on these criteria. 
- -```go -// PersistentVolumeClaimSpec describes the common attributes of storage devices -// and allows a Source for provider-specific attributes -type PersistentVolumeClaimSpec struct { - // Contains the types of access modes required - AccessModes []PersistentVolumeAccessMode `json:"accessModes,omitempty"` - // Selector is a selector which must be true for the claim to bind to a volume - Selector *unversioned.Selector `json:"selector,omitempty"` - // Resources represents the minimum resources required - Resources ResourceRequirements `json:"resources,omitempty"` - // VolumeName is the binding reference to the PersistentVolume backing this claim - VolumeName string `json:"volumeName,omitempty"` -} -``` - -### Labeling volumes - -Volumes can already be labeled: - -```yaml -apiVersion: v1 -kind: PersistentVolume -metadata: - name: ebs-pv-1 - labels: - ebs-volume-type: iops - aws-availability-zone: us-east-1 -spec: - capacity: - storage: 100Gi - accessModes: - - ReadWriteMany - persistentVolumeReclaimPolicy: Retain - awsElasticBlockStore: - volumeID: vol-12345 - fsType: xfs -``` - -### Controller Changes - -At the time of this writing, the various controllers for persistent volumes -are in the process of being refactored into a single controller (see -[kubernetes/24331](https://github.com/kubernetes/kubernetes/pull/24331)). - -The resulting controller should be modified to use the new -`selector` field to match a claim to a volume. In order to -match to a volume, all criteria must be satisfied; ie, if a label selector is -specified on a claim, a volume must match both the label selector and any -specified access modes and resource requirements to be considered a match. - -## Examples - -Let's take a look at a few examples, revisiting the taxonomy of EBS volumes and regions: - -Volumes of the different types might be labeled as follows: - -```yaml -apiVersion: v1 -kind: PersistentVolume -metadata: - name: ebs-pv-west - labels: - ebs-volume-type: iops-ssd - aws-availability-zone: us-west-1 -spec: - capacity: - storage: 150Gi - accessModes: - - ReadWriteMany - persistentVolumeReclaimPolicy: Retain - awsElasticBlockStore: - volumeID: vol-23456 - fsType: xfs - -apiVersion: v1 -kind: PersistentVolume -metadata: - name: ebs-pv-east - labels: - ebs-volume-type: gp-ssd - aws-availability-zone: us-east-1 -spec: - capacity: - storage: 150Gi - accessModes: - - ReadWriteMany - persistentVolumeReclaimPolicy: Retain - awsElasticBlockStore: - volumeID: vol-34567 - fsType: xfs -``` - -...claims on these volumes would look like: - -```yaml -kind: PersistentVolumeClaim -apiVersion: v1 -metadata: - name: ebs-claim-west -spec: - accessModes: - - ReadWriteMany - resources: - requests: - storage: 1Gi - selector: - matchLabels: - ebs-volume-type: iops-ssd - aws-availability-zone: us-west-1 - -kind: PersistentVolumeClaim -apiVersion: v1 -metadata: - name: ebs-claim-east -spec: - accessModes: - - ReadWriteMany - resources: - requests: - storage: 1Gi - selector: - matchLabels: - ebs-volume-type: gp-ssd - aws-availability-zone: us-east-1 -``` +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
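The matching rule described in the archived volume-selectors proposal above (a claim's selector, access modes, and resource requests must all be satisfied by a volume) can be illustrated with a short sketch. This is not the actual PV controller code: the helper name is invented for this example, it uses today's `k8s.io/api` and `k8s.io/apimachinery` types rather than the `unversioned.Selector` field shown in the archived text, and the access-mode and capacity checks are omitted.

```go
// Illustrative sketch only: one way the selector check described in the
// proposal could be evaluated when matching a claim to a volume. The helper
// name and package are assumptions made for this example.
package volumematch

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// claimMatchesVolume returns true if the claim's label selector (when present)
// matches the volume's labels. Access-mode and capacity checks, which must
// also pass, are left out for brevity.
func claimMatchesVolume(claim *v1.PersistentVolumeClaim, volume *v1.PersistentVolume) (bool, error) {
	if claim.Spec.Selector == nil {
		// No selector: only the existing matching criteria apply.
		return true, nil
	}
	selector, err := metav1.LabelSelectorAsSelector(claim.Spec.Selector)
	if err != nil {
		return false, fmt.Errorf("invalid selector on claim %s: %v", claim.Name, err)
	}
	return selector.Matches(labels.Set(volume.Labels)), nil
}
```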
\ No newline at end of file diff --git a/contributors/design-proposals/storage/volume-snapshotting.md b/contributors/design-proposals/storage/volume-snapshotting.md index df0aa1a7..f0fbec72 100644 --- a/contributors/design-proposals/storage/volume-snapshotting.md +++ b/contributors/design-proposals/storage/volume-snapshotting.md @@ -1,518 +1,6 @@ -Kubernetes Snapshotting Proposal -================================ +Design proposals have been archived. -**Authors:** [Cindy Wang](https://github.com/ciwang) +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Background - -Many storage systems (GCE PD, Amazon EBS, etc.) provide the ability to create "snapshots" of a persistent volumes to protect against data loss. Snapshots can be used in place of a traditional backup system to back up and restore primary and critical data. Snapshots allow for quick data backup (for example, it takes a fraction of a second to create a GCE PD snapshot) and offer fast recovery time objectives (RTOs) and recovery point objectives (RPOs). - -Typical existing backup solutions offer on demand or scheduled snapshots. - -An application developer using a storage may want to create a snapshot before an update or other major event. Kubernetes does not currently offer a standardized snapshot API for creating, listing, deleting, and restoring snapshots on an arbitrary volume. - -Existing solutions for scheduled snapshotting include [cron jobs](https://forums.aws.amazon.com/message.jspa?messageID=570265) and [external storage drivers](http://rancher.com/introducing-convoy-a-docker-volume-driver-for-backup-and-recovery-of-persistent-data/). Some cloud storage volumes can be configured to take automatic snapshots, but this is specified on the volumes themselves. - -## Objectives - -For the first version of snapshotting support in Kubernetes, only on-demand snapshots will be supported. Features listed in the roadmap for future versions are also nongoals. - -* Goal 1: Enable *on-demand* snapshots of Kubernetes persistent volumes by application developers. - - * Nongoal: Enable *automatic* periodic snapshotting for direct volumes in pods. - -* Goal 2: Expose standardized snapshotting operations Create and List in Kubernetes REST API. - - * Nongoal: Support Delete and Restore snapshot operations in API. - -* Goal 3: Implement snapshotting interface for GCE PDs. - - * Nongoal: Implement snapshotting interface for non GCE PD volumes. - -### Feature Roadmap - -Major features, in order of priority (bold features are priorities for v1): - -* **On demand snapshots** - - * **API to create new snapshots and list existing snapshots** - - * API to restore a disk from a snapshot and delete old snapshots - -* Scheduled snapshots - -* Support snapshots for non-cloud storage volumes (i.e. plugins that require actions to be triggered from the node) - -## Requirements - -### Performance - -* Time SLA from issuing a snapshot to completion: - -* The period we are interested is the time between the scheduled snapshot time and the time the snapshot is finishes uploading to its storage location. - -* This should be on the order of a few minutes. - -### Reliability - -* Data corruption - - * Though it is generally recommended to stop application writes before executing the snapshot command, we will not do this for several reasons: - - * GCE and Amazon can create snapshots while the application is running. 
- - * Stopping application writes cannot be done from the master and varies by application, so doing so will introduce unnecessary complexity and permission issues in the code. - - * Most file systems and server applications are (and should be) able to restore inconsistent snapshots the same way as a disk that underwent an unclean shutdown. - -* Snapshot failure - - * Case: Failure during external process, such as during API call or upload - - * Log error, retry until success (indefinitely) - - * Case: Failure within Kubernetes, such as controller restarts - - * If the master restarts in the middle of a snapshot operation, then the controller does not know whether or not the operation succeeded. However, since the annotation has not been deleted, the controller will retry, which may result in a crash loop if the first operation has not yet completed. This issue will not be addressed in the alpha version, but future versions will need to address it by persisting state. - -## Solution Overview - -Snapshot operations will be triggered by [annotations](http://kubernetes.io/docs/user-guide/annotations/) on PVC API objects. - -* **Create:** - - * Key: create.snapshot.volume.alpha.kubernetes.io - - * Value: [snapshot name] - -* **List:** - - * Key: snapshot.volume.alpha.kubernetes.io/[snapshot name] - - * Value: [snapshot timestamp] - -A new controller responsible solely for snapshot operations will be added to the controllermanager on the master. This controller will watch the API server for new annotations on PVCs. When a create snapshot annotation is added, it will trigger the appropriate snapshot creation logic for the underlying persistent volume type. The list annotation will be populated by the controller and only identify all snapshots created for that PVC by Kubernetes. - -The snapshot operation is a no-op for volume plugins that do not support snapshots via an API call (i.e. non-cloud storage). - -## Detailed Design - -### API - -* Create snapshot - - * Usage: - - * Users create annotation with key "create.snapshot.volume.alpha.kubernetes.io", value does not matter - - * When the annotation is deleted, the operation has succeeded. The snapshot will be listed in the value of snapshot-list. - - * API is declarative and guarantees only that it will begin attempting to create the snapshot once the annotation is created and will complete eventually. - - * PVC control loop in master - - * If annotation on new PVC, search for PV of volume type that implements SnapshottableVolumePlugin. If one is available, use it. Otherwise, reject the claim and post an event to the PV. - - * If annotation on existing PVC, if PV type implements SnapshottableVolumePlugin, continue to SnapshotController logic. Otherwise, delete the annotation and post an event to the PV. - -* List existing snapshots - - * Only displayed as annotations on PVC object. - - * Only lists unique names and timestamps of snapshots taken using the Kubernetes API. - - * Usage: - - * Get the PVC object - - * Snapshots are listed as key-value pairs within the PVC annotations - -### SnapshotController - - - -**PVC Informer:** A shared informer that stores (references to) PVC objects, populated by the API server. The annotations on the PVC objects are used to add items to SnapshotRequests. - -**SnapshotRequests:** An in-memory cache of incomplete snapshot requests that is populated by the PVC informer. This maps unique volume IDs to PVC objects. 
Volumes are added when the create snapshot annotation is added, and deleted when snapshot requests are completed successfully. - -**Reconciler:** Simple loop that triggers asynchronous snapshots via the OperationExecutor. Deletes create snapshot annotation if successful. - -The controller will have a loop that does the following: - -* Fetch State - - * Fetch all PVC objects from the API server. - -* Act - - * Trigger snapshot: - - * Loop through SnapshotRequests and trigger create snapshot logic (see below) for any PVCs that have the create snapshot annotation. - -* Persist State - - * Once a snapshot operation completes, write the snapshot ID/timestamp to the PVC Annotations and delete the create snapshot annotation in the PVC object via the API server. - -Snapshot operations can take a long time to complete, so the primary controller loop should not block on these operations. Instead the reconciler should spawn separate threads for these operations via the operation executor. - -The controller will reject snapshot requests if the unique volume ID already exists in the SnapshotRequests. Concurrent operations on the same volume will be prevented by the operation executor. - -### Create Snapshot Logic - -To create a snapshot: - -* Acquire operation lock for volume so that no other attach or detach operations can be started for volume. - - * Abort if there is already a pending operation for the specified volume (main loop will retry, if needed). - -* Spawn a new thread: - - * Execute the volume-specific logic to create a snapshot of the persistent volume reference by the PVC. - - * For any errors, log the error, and terminate the thread (the main controller will retry as needed). - - * Once a snapshot is created successfully: - - * Make a call to the API server to delete the create snapshot annotation in the PVC object. - - * Make a call to the API server to add the new snapshot ID/timestamp to the PVC Annotations. - -*Brainstorming notes below, read at your own risk!* - -* * * - - -Open questions: - -* What has more value: scheduled snapshotting or exposing snapshotting/backups as a standardized API? - - * It seems that the API route is a bit more feasible in implementation and can also be fully utilized. - - * Can the API call methods on VolumePlugins? Yeah via controller - - * The scheduler gives users functionality that doesn't already exist, but required adding an entirely new controller - -* Should the list and restore operations be part of v1? - -* Do we call them snapshots or backups? - - * From the SIG email: "The snapshot should not be suggested to be a backup in any documentation, because in practice is necessary, but not sufficient, when conducting a backup of a stateful application." - -* At what minimum granularity should snapshots be allowed? - -* How do we store information about the most recent snapshot in case the controller restarts? - -* In case of error, do we err on the side of fewer or more snapshots? - -Snapshot Scheduler - -1. PVC API Object - -A new field, backupSchedule, will be added to the PVC API Object. The value of this field must be a cron expression. 
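As a rough illustration of the brainstormed `backupSchedule` idea above (this field was never implemented; the manifest below is only a sketch of what such a claim might look like), a daily schedule could be written as a cron expression that keeps a 0 in the minutes place, consistent with the validation notes that follow:

```yaml
# Sketch only: backupSchedule is the brainstormed field discussed above, not
# an implemented Kubernetes API. The cron expression fires daily at 02:00 and
# keeps a 0 in the minutes place, per the validation rule below.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  backupSchedule: "0 2 * * *"
```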
- -* CRUD operations on snapshot schedules - - * Create: Specify a snapshot within a PVC spec as a [cron expression](http://crontab-generator.org/) - - * The cron expression provides flexibility to decrease the interval between snapshots in future versions - - * Read: Display snapshot schedule to user via kubectl get pvc - - * Update: Do not support changing the snapshot schedule for an existing PVC - - * Delete: Do not support deleting the snapshot schedule for an existing PVC - - * In v1, the snapshot schedule is tied to the lifecycle of the PVC. Update and delete operations are therefore not supported. In future versions, this may be done using kubectl edit pvc/name - -* Validation - - * Cron expressions must have a 0 in the minutes place and use exact, not interval syntax - - * [EBS](http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/TakeScheduledSnapshot.html) appears to be able to take snapshots at the granularity of minutes, GCE PD takes at most minutes. Therefore for v1, we ensure that snapshots are taken at most hourly and at exact times (rather than at time intervals). - - * If Kubernetes cannot find a PV that supports snapshotting via its API, reject the PVC and display an error message to the user - - Objective - -Goal: Enable automatic periodic snapshotting (NOTE: A snapshot is a read-only copy of a disk.) for all kubernetes volume plugins. - -Goal: Implement snapshotting interface for GCE PDs. - -Goal: Protect against data loss by allowing users to restore snapshots of their disks. - -Nongoal: Implement snapshotting support on Kubernetes for non GCE PD volumes. - -Nongoal: Use snapshotting to provide additional features such as migration. - - Background - -Many storage systems (GCE PD, Amazon EBS, NFS, etc.) provide the ability to create "snapshots" of a persistent volumes to protect against data loss. Snapshots can be used in place of a traditional backup system to back up and restore primary and critical data. Snapshots allow for quick data backup (for example, it takes a fraction of a second to create a GCE PD snapshot) and offer fast recovery time objectives (RTOs) and recovery point objectives (RPOs). - -Currently, no container orchestration software (i.e. Kubernetes and its competitors) provide snapshot scheduling for application storage. - -Existing solutions for automatic snapshotting include [cron jobs](https://forums.aws.amazon.com/message.jspa?messageID=570265)/shell scripts. Some volumes can be configured to take automatic snapshots, but this is specified on the volumes themselves, not via their associated applications. Snapshotting support gives Kubernetes clear competitive advantage for users who want automatic snapshotting on their volumes, and particularly those who want to configure application-specific schedules. - - what is the value case? Who wants this? What do we enable by implementing this? - -I think it introduces a lot of complexity, so what is the pay off? That should be clear in the document. Do mesos, or swarm or our competition implement this? AWS? Just curious. - -Requirements - -Functionality - -Should this support PVs, direct volumes, or both? - -Should we support deletion? - -Should we support restores? - -Automated schedule -- times or intervals? Before major event? - -Performance - -Snapshots are supposed to provide timely state freezing. What is the SLA from issuing one to it completing? 
- -* GCE: The snapshot operation takes [a fraction of a second](https://cloudplatform.googleblog.com/2013/10/persistent-disk-backups-using-snapshots.html). If file writes can be paused, they should be paused until the snapshot is created (but can be restarted while it is pending). If file writes cannot be paused, the volume should be unmounted before snapshotting then remounted afterwards. - - * Pending = uploading to GCE - -* EBS is the same, but if the volume is the root device the instance should be stopped before snapshotting - -Reliability - -How do we ascertain that deletions happen when we want them to? - -For the same reasons that Kubernetes should not expose a direct create-snapshot command, it should also not allow users to delete snapshots for arbitrary volumes from Kubernetes. - -We may, however, want to allow users to set a snapshotExpiryPeriod and delete snapshots once they have reached certain age. At this point we do not see an immediate need to implement automatic deletion (re:Saad) but may want to revisit this. - -What happens when the snapshot fails as these are async operations? - -Retry (for some time period? indefinitely?) and log the error - -Other - -What is the UI for seeing the list of snapshots? - -In the case of GCE PD, the snapshots are uploaded to cloud storage. They are visible and manageable from the GCE console. The same applies for other cloud storage providers (i.e. Amazon). Otherwise, users may need to ssh into the device and access a ./snapshot or similar directory. In other words, users will continue to access snapshots in the same way as they have been while creating manual snapshots. - -Overview - -There are several design options for the design of each layer of implementation as follows. - -1. **Public API:** - -Users will specify a snapshotting schedule for particular volumes, which Kubernetes will then execute automatically. There are several options for where this specification can happen. In order from most to least invasive: - - 1. New Volume API object - - 1. Currently, pods, PVs, and PVCs are API objects, but Volume is not. A volume is represented as a field within pod/PV objects and its details are lost upon destruction of its enclosing object. - - 2. We define Volume to be a brand new API object, with a snapshot schedule attribute that specifies the time at which Kubernetes should call out to the volume plugin to create a snapshot. - - 3. The Volume API object will be referenced by the pod/PV API objects. The new Volume object exists entirely independently of the Pod object. - - 4. Pros - - 1. Snapshot schedule conflicts: Since a single Volume API object ideally refers to a single volume, each volume has a single unique snapshot schedule. In the case where the same underlying PD is used by different pods which specify different snapshot schedules, we have a straightforward way of identifying and resolving the conflicts. Instead of using extra space to create duplicate snapshots, we can decide to, for example, use the most frequent snapshot schedule. - - 5. Cons - - 2. Heavyweight codewise; involves changing and touching a lot of existing code. - - 3. Potentially bad UX: How is the Volume API object created? - - 1. By the user independently of the pod (i.e. with something like my-volume.yaml). In order to create 1 pod with a volume, the user needs to create 2 yaml files and run 2 commands. - - 2. When a unique volume is specified in a pod or PV spec. - - 2. Directly in volume definition in the pod/PV object - - 6. 
When specifying a volume as part of the pod or PV spec, users have the option to include an extra attribute, e.g. ssTimes, to denote the snapshot schedule. - - 7. Pros - - 4. Easy for users to implement and understand - - 8. Cons - - 5. The same underlying PD may be used by different pods. In this case, we need to resolve when and how often to take snapshots. If two pods specify the same snapshot time for the same PD, we should not perform two snapshots at that time. However, there is no unique global identifier for a volume defined in a pod definition--its identifying details are particular to the volume plugin used. - - 6. Replica sets have the same pod spec and support needs to be added so that underlying volume used does not create new snapshots for each member of the set. - - 3. Only in PV object - - 9. When specifying a volume as part of the PV spec, users have the option to include an extra attribute, e.g. ssTimes, to denote the snapshot schedule. - - 10. Pros - - 7. Slightly cleaner than (b). It logically makes more sense to specify snapshotting at the time of the persistent volume definition (as opposed to in the pod definition) since the snapshot schedule is a volume property. - - 11. Cons - - 8. No support for direct volumes - - 9. Only useful for PVs that do not already have automatic snapshotting tools (e.g. Schedule Snapshot Wizard for iSCSI) -- many do and the same can be achieved with a simple cron job - - 10. Same problems as (b) with respect to non-unique resources. We may have 2 PV API objects for the same underlying disk and need to resolve conflicting/duplicated schedules. - - 4. Annotations: key value pairs on API object - - 12. User experience is the same as (b) - - 13. Instead of storing the snapshot attribute on the pod/PV API object, save this information in an annotation. For instance, if we define a pod with two volumes we might have {"ssTimes-vol1": [1,5], “ssTimes-vol2”: [2,17]} where the values are slices of integer values representing UTC hours. - - 14. Pros - - 11. Less invasive to the codebase than (a-c) - - 15. Cons - - 12. Same problems as (b-c) with non-unique resources. The only difference here is the API object representation. - -2. **Business logic:** - - 5. Does this go on the master, node, or both? - - 16. Where the snapshot is stored - - 13. GCE, Amazon: cloud storage - - 14. Others stored on volume itself (gluster) or external drive (iSCSI) - - 17. Requirements for snapshot operation - - 15. Application flush, sync, and fsfreeze before creating snapshot - - 6. Suggestion: - - 18. New SnapshotController on master - - 16. Controller keeps a list of active pods/volumes, schedule for each, last snapshot - - 17. If controller restarts and we miss a snapshot in the process, just skip it - - 3. Alternatively, try creating the snapshot up to the time + retryPeriod (see 5) - - 18. If snapshotting call fails, retry for an amount of time specified in retryPeriod - - 19. Timekeeping mechanism: something similar to [cron](http://stackoverflow.com/questions/3982957/how-does-cron-internally-schedule-jobs); keep list of snapshot times, calculate time until next snapshot, and sleep for that period - - 19. Logic to prepare the disk for snapshotting on node - - 20. Application I/Os need to be flushed and the filesystem should be frozen before snapshotting (on GCE PD) - - 7. Alternatives: login entirely on node - - 20. Problems: - - 21. If pod moves from one node to another - - 4. A different node is in now in charge of snapshotting - - 5. 
If the volume plugin requires external memory for snapshots, we need to move the existing data - - 22. If the same pod exists on two different nodes, which node is in charge - -3. **Volume plugin interface/internal API:** - - 8. Allow VolumePlugins to implement the SnapshottableVolumePlugin interface (structure similar to AttachableVolumePlugin) - - 9. When logic is triggered for a snapshot by the SnapshotController, the SnapshottableVolumePlugin calls out to volume plugin API to create snapshot - - 10. Similar to volume.attach call - -4. **Other questions:** - - 11. Snapshot period - - 12. Time or period - - 13. What is our SLO around time accuracy? - - 21. Best effort, but no guarantees (depends on time or period) -- if going with time. - - 14. What if we miss a snapshot? - - 22. We will retry (assuming this means that we failed) -- take at the nearest next opportunity - - 15. Will we know when an operation has failed? How do we report that? - - 23. Get response from volume plugin API, log in kubelet log, generate Kube event in success and failure cases - - 16. Will we be responsible for GCing old snapshots? - - 24. Maybe this can be explicit non-goal, in the future can automate garbage collection - - 17. If the pod dies do we continue creating snapshots? - - 18. How to communicate errors (PD doesn't support snapshotting, time period unsupported) - - 19. Off schedule snapshotting like before an application upgrade - - 20. We may want to take snapshots of encrypted disks. For instance, for GCE PDs, the encryption key must be passed to gcloud to snapshot an encrypted disk. Should Kubernetes handle this? - -Options, pros, cons, suggestion/recommendation - -Example 1b - -During pod creation, a user can specify a pod definition in a yaml file. As part of this specification, users should be able to denote a [list of] times at which an existing snapshot command can be executed on the pod's associated volume. - -For a simple example, take the definition of a [pod using a GCE PD](http://kubernetes.io/docs/user-guide/volumes/#example-pod-2): - -apiVersion: v1 -kind: Pod -metadata: - name: test-pd -spec: - containers: - - image: k8s.gcr.io/test-webserver - name: test-container - volumeMounts: - - mountPath: /test-pd - name: test-volume - volumes: - - name: test-volume - # This GCE PD must already exist. - gcePersistentDisk: - pdName: my-data-disk - fsType: ext4 - -Introduce a new field into the volume spec: - -apiVersion: v1 -kind: Pod -metadata: - name: test-pd -spec: - containers: - - image: k8s.gcr.io/test-webserver - name: test-container - volumeMounts: - - mountPath: /test-pd - name: test-volume - volumes: - - name: test-volume - # This GCE PD must already exist. - gcePersistentDisk: - pdName: my-data-disk - fsType: ext4 - -** ssTimes: ****[1, 5]** - - Caveats - -* Snapshotting should not be exposed to the user through the Kubernetes API (via an operation such as create-snapshot) because - - * this does not provide value to the user and only adds an extra layer of indirection/complexity. - - * ? - - Dependencies - -* Kubernetes - -* Persistent volume snapshot support through API - - * POST https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-f/disks/example-disk/createSnapshot +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/storage/volume-snapshotting.png b/contributors/design-proposals/storage/volume-snapshotting.png Binary files differdeleted file mode 100644 index 1b1ea748..00000000 --- a/contributors/design-proposals/storage/volume-snapshotting.png +++ /dev/null diff --git a/contributors/design-proposals/storage/volume-topology-scheduling.md b/contributors/design-proposals/storage/volume-topology-scheduling.md index 230398c7..f0fbec72 100644 --- a/contributors/design-proposals/storage/volume-topology-scheduling.md +++ b/contributors/design-proposals/storage/volume-topology-scheduling.md @@ -1,1369 +1,6 @@ -# Volume Topology-aware Scheduling +Design proposals have been archived. -Authors: @msau42, @lichuqiang +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -This document presents a detailed design for making the default Kubernetes -scheduler aware of volume topology constraints, and making the -PersistentVolumeClaim (PVC) binding aware of scheduling decisions. -## Definitions -* Topology: Rules to describe accessibility of an object with respect to - location in a cluster. -* Domain: A grouping of locations within a cluster. For example, 'node1', - 'rack10', 'zone5'. -* Topology Key: A description of a general class of domains. For example, - 'node', 'rack', 'zone'. -* Hierarchical domain: Domain that can be fully encompassed in a larger domain. - For example, the 'zone1' domain can be fully encompassed in the 'region1' - domain. -* Failover domain: A domain that a workload intends to run in at a later time. - -## Goals -* Allow topology to be specified for both pre-provisioned and dynamic - provisioned PersistentVolumes so that the Kubernetes scheduler can correctly - place a Pod using such a volume to an appropriate node. -* Support arbitrary PV topology domains (i.e. node, rack, zone, foo, bar) - without encoding each as first class objects in the Kubernetes API. -* Allow the Kubernetes scheduler to influence where a volume is provisioned or - which pre-provisioned volume to bind to based on scheduling constraints on the - Pod requesting a volume, such as Pod resource requirements and - affinity/anti-affinity policies. -* No scheduling latency performance regression for Pods that do not use - PVs with topology. -* Allow administrators to restrict allowed topologies per StorageClass. - -## Non Goals -* Fitting a pod after the initial PVC binding has been completed. - * The more constraints you add to your pod, the less flexible it becomes -in terms of placement. Because of this, tightly constrained storage, such as -local storage, is only recommended for specific use cases, and the pods should -have higher priority in order to preempt lower priority pods from the node. -* Binding decision considering scheduling constraints from two or more pods -sharing the same PVC. - * The scheduler itself only handles one pod at a time. It’s possible the -two pods may not run at the same time either, so there’s no guarantee that you -will know both pod’s requirements at once. - * For two+ pods simultaneously sharing a PVC, this scenario may require an -operator to schedule them together. Another alternative is to merge the two -pods into one. - * For two+ pods non-simultaneously sharing a PVC, this scenario could be -handled by pod priorities and preemption. -* Provisioning multi-domain volumes where all the domains will be able to run - the workload. 
For example, provisioning a multi-zonal volume and making sure - the pod can run in all zones. - * Scheduler cannot make decisions based off of future resource requirements, - especially if those resources can fluctuate over time. For applications that - use such multi-domain storage, the best practice is to either: - * Configure cluster autoscaling with enough resources to accommodate - failing over the workload to any of the other failover domains. - * Manually configure and overprovision the failover domains to - accommodate the resource requirements of the workload. -* Scheduler supporting volume topologies that are independent of the node's - topologies. - * The Kubernetes scheduler only handles topologies with respect to the - workload and the nodes it runs on. If a storage system is deployed on an - independent topology, it will be up to provisioner to correctly spread the - volumes for a workload. This could be facilitated as a separate feature - by: - * Passing the Pod's OwnerRef to the provisioner, and the provisioner - spreading volumes for Pods with the same OwnerRef - * Adding Volume Anti-Affinity policies, and passing those to the - provisioner. - - -## Problem -Volumes can have topology constraints that restrict the set of nodes that the -volume can be accessed on. For example, a GCE PD can only be accessed from a -single zone, and a local disk can only be accessed from a single node. In the -future, there could be other topology domains, such as rack or region. - -A pod that uses such a volume must be scheduled to a node that fits within the -volume’s topology constraints. In addition, a pod can have further constraints -and limitations, such as the pod’s resource requests (cpu, memory, etc), and -pod/node affinity and anti-affinity policies. - -Currently, the process of binding and provisioning volumes are done before a pod -is scheduled. Therefore, it cannot take into account any of the pod’s other -scheduling constraints. This makes it possible for the PV controller to bind a -PVC to a PV or provision a PV with constraints that can make a pod unschedulable. - -### Examples -* In multizone clusters, the PV controller has a hardcoded heuristic to provision -PVCs for StatefulSets spread across zones. If that zone does not have enough -cpu/memory capacity to fit the pod, then the pod is stuck in pending state because -its volume is bound to that zone. -* Local storage exasperates this issue. The chance of a node not having enough -cpu/memory is higher than the chance of a zone not having enough cpu/memory. -* Local storage PVC binding does not have any node spreading logic. So local PV -binding will very likely conflict with any pod anti-affinity policies if there is -more than one local PV on a node. -* A pod may need multiple PVCs. As an example, one PVC can point to a local SSD for -fast data access, and another PVC can point to a local HDD for logging. Since PVC -binding happens without considering if multiple PVCs are related, it is very likely -for the two PVCs to be bound to local disks on different nodes, making the pod -unschedulable. -* For multizone clusters and deployments requesting multiple dynamically provisioned -zonal PVs, each PVC is provisioned independently, and is likely to provision each PV -in different zones, making the pod unschedulable. - -To solve the issue of initial volume binding and provisioning causing an impossible -pod placement, volume binding and provisioning should be more tightly coupled with -pod scheduling. 
- - -## Volume Topology Specification -First, volumes need a way to express topology constraints against nodes. Today, it -is done for zonal volumes by having explicit logic to process zone labels on the -PersistentVolume. However, this is not easily extendable for volumes with other -topology keys. - -Instead, to support a generic specification, the PersistentVolume -object will be extended with a new NodeAffinity field that specifies the -constraints. It will closely mirror the existing NodeAffinity type used by -Pods, but we will use a new type so that we will not be bound by existing and -future Pod NodeAffinity semantics. - -``` -type PersistentVolumeSpec struct { - ... - - NodeAffinity *VolumeNodeAffinity -} - -type VolumeNodeAffinity struct { - // The PersistentVolume can only be accessed by Nodes that meet - // these required constraints - Required *NodeSelector -} -``` - -The `Required` field is a hard constraint and indicates that the PersistentVolume -can only be accessed from Nodes that satisfy the NodeSelector. - -In the future, a `Preferred` field can be added to handle soft node constraints with -weights, but will not be included in the initial implementation. - -The advantages of this NodeAffinity field vs the existing method of using zone labels -on the PV are: -* We don't need to expose first-class labels for every topology key. -* Implementation does not need to be updated every time a new topology key - is added to the cluster. -* NodeSelector is able to express more complex topology with ANDs and ORs. -* NodeAffinity aligns with how topology is represented with other Kubernetes - resources. - -Some downsides include: -* You can have a proliferation of Node labels if you are running many different - kinds of volume plugins, each with their own topology labeling scheme. -* The NodeSelector is more expressive than what most storage providers will - need. Most storage providers only need a single topology key with - one or more domains. Non-hierarchical domains may present implementation - challenges, and it will be difficult to express all the functionality - of a NodeSelector in a non-Kubernetes specification like CSI. - - -### Example PVs with NodeAffinity -#### Local Volume -In this example, the volume can only be accessed from nodes that have the -label key `kubernetes.io/hostname` and label value `node-1`. -``` -apiVersion: v1 -kind: PersistentVolume -metadata: - Name: local-volume-1 -spec: - capacity: - storage: 100Gi - storageClassName: my-class - local: - path: /mnt/disks/ssd1 - nodeAffinity: - required: - nodeSelectorTerms: - - matchExpressions: - - key: kubernetes.io/hostname - operator: In - values: - - node-1 -``` - -#### Zonal Volume -In this example, the volume can only be accessed from nodes that have the -label key `failure-domain.beta.kubernetes.io/zone` and label value -`us-central1-a`. -``` -apiVersion: v1 -kind: PersistentVolume -metadata: - Name: zonal-volume-1 -spec: - capacity: - storage: 100Gi - storageClassName: my-class - gcePersistentDisk: - diskName: my-disk - fsType: ext4 - nodeAffinity: - required: - nodeSelectorTerms: - - matchExpressions: - - key: failure-domain.beta.kubernetes.io/zone - operator: In - values: - - us-central1-a -``` - -#### Multi-Zonal Volume -In this example, the volume can only be accessed from nodes that have the -label key `failure-domain.beta.kubernetes.io/zone` and label value -`us-central1-a` OR `us-central1-b`. 
-``` -apiVersion: v1 -kind: PersistentVolume -metadata: - Name: multi-zonal-volume-1 -spec: - capacity: - storage: 100Gi - storageClassName: my-class - gcePersistentDisk: - diskName: my-disk - fsType: ext4 - nodeAffinity: - required: - nodeSelectorTerms: - - matchExpressions: - - key: failure-domain.beta.kubernetes.io/zone - operator: In - values: - - us-central1-a - - us-central1-b -``` - -#### Multi Label Volume -In this example, the volume needs two labels to uniquely identify the topology. -``` -apiVersion: v1 -kind: PersistentVolume -metadata: - Name: rack-volume-1 -spec: - capacity: - storage: 100Gi - storageClassName: my-class - csi: - driver: my-rack-storage-driver - volumeHandle: my-vol - volumeAttributes: - foo: bar - nodeAffinity: - required: - nodeSelectorTerms: - - matchExpressions: - - key: failure-domain.beta.kubernetes.io/zone - operator: In - values: - - us-central1-a - - key: foo.io/rack - operator: In - values: - - rack1 -``` - -### Zonal PV Upgrade and Downgrade -Upgrading of zonal PVs to use the new PV.NodeAffinity API can be phased in as -follows: - -1. Update PV label admission controllers to specify the new PV.NodeAffinity. New - PVs created will automatically use the new PV.NodeAffinity. Existing PVs are - not updated yet, so on a downgrade, existing PVs are unaffected. New PVCs - should be deleted and recreated if there were problems with this feature. -2. Once PV.NodeAffinity is GA, deprecate the VolumeZoneChecker scheduler - predicate. Add a zonal PV upgrade controller to convert existing PVs. At this - point, if there are issues with this feature, then on a downgrade, the - VolumeScheduling feature would also need to be disabled. -3. After deprecation period, remove VolumeZoneChecker predicate and PV upgrade - controller. - -The zonal PV upgrade controller will convert existing PVs leveraging the -existing zonal scheduling logic using labels to PV.NodeAffinity. It will keep -the existing labels for backwards compatibility. - -For example, this zonal volume: -``` -apiVersion: v1 -kind: PersistentVolume -metadata: - name: zonal-volume-1 - labels: - failure-domain.beta.kubernetes.io/zone: us-central1-a - failure-domain.beta.kubernetes.io/region: us-central1 -spec: - capacity: - storage: 100Gi - storageClassName: my-class - gcePersistentDisk: - diskName: my-disk - fsType: ext4 -``` - -will be converted to: -``` -apiVersion: v1 -kind: PersistentVolume -metadata: - name: zonal-volume-1 - labels: - failure-domain.beta.kubernetes.io/zone: us-central1-a - failure-domain.beta.kubernetes.io/region: us-central1 -spec: - capacity: - storage: 100Gi - storageClassName: my-class - gcePersistentDisk: - diskName: my-disk - fsType: ext4 - nodeAffinity: - required: - nodeSelectorTerms: - - matchExpressions: - - key: failure-domain.beta.kubernetes.io/zone - operator: In - values: - - us-central1-a - - key: failure-domain.beta.kubernetes.io/region - operator: In - values: - - us-central1 -``` - -### Multi-Zonal PV Upgrade -The zone label for multi-zonal volumes need to be specially parsed. 
- -For example, this multi-zonal volume: -``` -apiVersion: v1 -kind: PersistentVolume -metadata: - name: multi-zonal-volume-1 - labels: - failure-domain.beta.kubernetes.io/zone: us-central1-a__us-central1-b - failure-domain.beta.kubernetes.io/region: us-central1 -spec: - capacity: - storage: 100Gi - storageClassName: my-class - gcePersistentDisk: - diskName: my-disk - fsType: ext4 -``` - -will be converted to: -``` -apiVersion: v1 -kind: PersistentVolume -metadata: - name: zonal-volume-1 - labels: - failure-domain.beta.kubernetes.io/zone: us-central1-a__us-central1-b - failure-domain.beta.kubernetes.io/region: us-central1 -spec: - capacity: - storage: 100Gi - storageClassName: my-class - gcePersistentDisk: - diskName: my-disk - fsType: ext4 - nodeAffinity: - required: - nodeSelectorTerms: - - matchExpressions: - - key: failure-domain.beta.kubernetes.io/zone - operator: In - values: - - us-central1-a - - us-central1-b - - key: failure-domain.beta.kubernetes.io/region - operator: In - values: - - us-central1 -``` - -### Bound PVC Enforcement -For PVCs that are already bound to a PV with NodeAffinity, enforcement is -simple and will be done at two places: -* Scheduler predicate: if a Pod references a PVC that is bound to a PV with -NodeAffinity, the predicate will evaluate the `Required` NodeSelector against -the Node's labels to filter the nodes that the Pod can be schedule to. The -existing VolumeZone scheduling predicate will coexist with this new predicate -for several releases until PV NodeAffinity becomes GA and we can deprecate the -old predicate. -* Kubelet: PV NodeAffinity is verified against the Node when mounting PVs. - -### Unbound PVC Binding -As mentioned in the problem statement, volume binding occurs without any input -about a Pod's scheduling constraints. To fix this, we will delay volume binding -and provisioning until a Pod is created. This behavior change will be opt-in as a -new StorageClass parameter. - -Both binding decisions of: -* Selecting a precreated PV with NodeAffinity -* Dynamically provisioning a PV with NodeAffinity - -will be considered by the scheduler, so that all of a Pod's scheduling -constraints can be evaluated at once. - -The detailed design for implementing this new volume binding behavior will be -described later in the scheduler integration section. - -## Delayed Volume Binding -Today, volume binding occurs immediately once a PersistentVolumeClaim is -created. In order for volume binding to take into account all of a pod's other scheduling -constraints, volume binding must be delayed until a Pod is being scheduled. - -A new StorageClass field `BindingMode` will be added to control the volume -binding behavior. - -``` -type StorageClass struct { - ... - - BindingMode *BindingMode -} - -type BindingMode string - -const ( - BindingImmediate BindingMode = "Immediate" - BindingWaitForFirstConsumer BindingMode = "WaitForFirstConsumer" -) -``` - -`BindingImmediate` is the default and current binding method. - -This approach allows us to: -* Introduce the new binding behavior gradually. -* Maintain backwards compatibility without deprecation of previous - behavior. Any automation that waits for PVCs to be bound before scheduling Pods - will not break. -* Support scenarios where volume provisioning for globally-accessible volume - types could take a long time, where volume provisioning is a planned - event well in advance of workload deployment. 
- -However, it has a few downsides: -* StorageClass will be required to get the new binding behavior, even if dynamic - provisioning is not used (in the case of local storage). -* We have to maintain two different code paths for volume binding. -* We will be depending on the storage admin to correctly configure the - StorageClasses for the volume types that need the new binding behavior. -* User experience can be confusing because PVCs could have different binding - behavior depending on the StorageClass configuration. We will mitigate this by - adding a new PVC event to indicate if binding will follow the new behavior. - - -## Dynamic Provisioning with Topology -To make dynamic provisioning aware of pod scheduling decisions, delayed volume -binding must also be enabled. The scheduler will pass its selected node to the -dynamic provisioner, and the provisioner will create a volume in the topology -domain that the selected node is part of. The domain depends on the volume -plugin. Zonal volume plugins will create the volume in the zone where the -selected node is in. The local volume plugin will create the volume on the -selected node. - -### End to End Zonal Example -This is an example of the most common use case for provisioning zonal volumes. -For this use case, the user's specs are unchanged. Only one change -to the StorageClass is needed to enable delayed volume binding. - -1. Admin sets up StorageClass, setting up delayed volume binding. -``` -apiVersion: storage.k8s.io/v1 -kind: StorageClass -metadata: - name: standard -provisioner: kubernetes.io/gce-pd -bindingMode: WaitForFirstConsumer -parameters: - type: pd-standard -``` -2. Admin launches provisioner. For in-tree plugins, nothing needs to be done. -3. User creates PVC. Nothing changes in the spec, although now the PVC won't be - immediately bound. -``` -apiVersion: v1 -kind: PersistentVolumeClaim -metadata: - name: my-pvc -spec: - storageClassName: standard - accessModes: - - ReadWriteOnce - resources: - requests: - storage: 100Gi -``` -4. User creates Pod. Nothing changes in the spec. -``` -apiVersion: v1 -kind: Pod -metadata: - name: my-pod -spec: - containers: - ... - volumes: - - name: my-vol - persistentVolumeClaim: - claimName: my-pvc -``` -5. Scheduler picks a node that can satisfy the Pod and - [passes it](#pv-controller-changes) to the provisioner. -6. Provisioner dynamically provisions a PV that can be accessed from - that node. -``` -apiVersion: v1 -kind: PersistentVolume -metadata: - Name: volume-1 -spec: - capacity: - storage: 100Gi - storageClassName: standard - gcePersistentDisk: - diskName: my-disk - fsType: ext4 - nodeAffinity: - required: - nodeSelectorTerms: - - matchExpressions: - - key: failure-domain.beta.kubernetes.io/zone - operator: In - values: - - us-central1-a -``` -7. Pod gets scheduled to the node. - - -### Restricting Topology -For the common use case, volumes will be provisioned in whatever topology domain -the scheduler has decided is best to run the workload. Users may impose further -restrictions by setting label/node selectors, and pod affinity/anti-affinity -policies on their Pods. All those policies will be taken into account when -dynamically provisioning a volume. - -While less common, administrators may want to further restrict what topology -domains are available to a StorageClass. To support these administrator -policies, an AllowedTopologies field can also be specified in the -StorageClass to restrict the topology domains for dynamic provisioning. 
-This is not expected to be a common use case, and there are some caveats, -described below. - -``` -type StorageClass struct { - ... - - // Restrict the node topologies where volumes can be dynamically provisioned. - // Each volume plugin defines its own supported topology specifications. - // Each entry in AllowedTopologies is ORed. - AllowedTopologies []TopologySelector -} - -type TopologySelector struct { - // Topology must meet all of the TopologySelectorLabelRequirements - // These requirements are ANDed. - MatchLabelExpressions []TopologySelectorLabelRequirement -} - -// Topology requirement expressed as Node labels. -type TopologySelectorLabelRequirement struct{ - // Topology label key - Key string - // Topology must match at least one of the label Values for the given label Key. - // Each entry in Values is ORed. - Values []string -} -``` - -A nil value means there are no topology restrictions. A scheduler predicate -will evaluate a non-nil value when considering dynamic provisioning for a node. - -The AllowedTopologies will also be provided to provisioners as a new field, detailed in -the provisioner section. Provisioners can use the allowed topology information -in the following scenarios: -* StorageClass is using the default immediate binding mode. This is the - legacy topology-unaware behavior. In this scenario, the volume could be - provisioned in a domain that cannot run the Pod since it doesn't take any - scheduler input. -* For volumes that span multiple domains, the AllowedTopologies can restrict those - additional domains. However, special care must be taken to avoid specifying - conflicting topology constraints in the Pod. For example, the administrator could - restrict a multi-zonal volume to zones 'zone1' and 'zone2', but the Pod could have - constraints that restrict it to 'zone1' and 'zone3'. If 'zone1' - fails, the Pod cannot be scheduled to the intended failover zone. - -Note that if delayed binding is enabled and the volume spans only a single domain, -then the AllowedTopologies can be ignored by the provisioner because the -scheduler would have already taken it into account when it selects the node. - -Kubernetes will leave validation and enforcement of the AllowedTopologies content up -to the provisioner. - -Support in the GCE PD and AWS EBS provisioners for the existing `zone` and `zones` -parameters will be deprecated. CSI in-tree migration will handle translation of -`zone` and `zones` parameters to CSI topology. - -Admins must already create a new StorageClass with delayed volume binding to use -this feature, so the documentation can encourage use of the AllowedTopologies -instead of existing zone parameters. A plugin-specific admission controller -can also validate that both zone and AllowedTopologies are not specified, -although the CSI plugin should still be robust to handle this configuration -error. - -##### Alternatives -A new restricted TopologySelector is used here instead of reusing -VolumeNodeAffinity because the provisioning operation requires -allowed topologies to be explicitly enumerated, while NodeAffinity and -NodeSelectors allow for non-explicit expressions of topology values (i.e., -operators NotIn, Exists, DoesNotExist, Gt, Lt). It would be difficult for -provisioners to evaluate all the expressions without having to enumerate all the -Nodes in the cluster. - -Another alternative is to have a list of allowed PV topologies, where each PV -topology is exactly the same as a single PV topology. 
This expression can become -very verbose for volume types that have multi-dimensional topologies or multiple -selections. As an example, for a multi-zonal volume that needs to select -two zones, if an administrator wants to restrict the selection to 4 zones, then -all 6 combinations need to be explicitly enumerated. - -Another alternative is to expand ResourceQuota to support topology constraints. -However, ResourceQuota is currently only evaluated during admission, and not -scheduling. - -#### Zonal Example -This example restricts the volumes provisioned to zones us-central1-a and -us-central1-b. -``` -apiVersion: storage.k8s.io/v1 -kind: StorageClass -metadata: - name: zonal-class -provisioner: kubernetes.io/gce-pd -parameters: - type: pd-standard -allowedTopologies: -- matchLabelExpressions: - - key: failure-domain.beta.kubernetes.io/zone - values: - - us-central1-a - - us-central1-b -``` - -#### Multi-Zonal Example -This example restricts the volume's primary and failover zones -to us-central1-a, us-central1-b and us-central1-c. The regional PD -provisioner will pick two out of the three zones to provision in. -``` -apiVersion: storage.k8s.io/v1 -kind: StorageClass -metadata: - name: multi-zonal-class -provisioner: kubernetes.io/gce-pd -parameters: - type: pd-standard - replication-type: regional-pd -allowedTopologies: -- matchLabelExpressions: - - key: failure-domain.beta.kubernetes.io/zone - values: - - us-central1-a - - us-central1-b - - us-central1-c -``` - -Topologies that are incompatible with the storage provider parameters -will be enforced by the provisioner. For example, dynamic provisioning -of regional PDs will fail if provisioning is restricted to fewer than -two zones in all regions. This configuration will cause provisioning to fail: -``` -apiVersion: storage.k8s.io/v1 -kind: StorageClass -metadata: - name: multi-zonal-class -provisioner: kubernetes.io/gce-pd -parameters: - type: pd-standard - replication-type: regional-pd -allowedTopologies: -- matchLabelExpressions: - - key: failure-domain.beta.kubernetes.io/zone - values: - - us-central1-a -``` - -#### Multi Label Example -This example restricts the volume's topology to nodes that -have the following labels: - -* "zone: us-central1-a" and "rack: rack1" or, -* "zone: us-central1-b" and "rack: rack1" or, -* "zone: us-central1-b" and "rack: rack2" - -``` -apiVersion: storage.k8s.io/v1 -kind: StorageClass -metadata: - name: something-fancy -provisioner: rack-based-provisioner -parameters: -allowedTopologies: -- matchLabelExpressions: - - key: zone - values: - - us-central1-a - - key: rack - values: - - rack1 -- matchLabelExpressions: - - key: zone - values: - - us-central1-b - - key: rack - values: - - rack1 - - rack2 -``` - - -## Feature Gates -All functionality is controlled by the VolumeScheduling feature gate, -and must be configured in the -kube-scheduler, kube-controller-manager, and all kubelets. - -## Integrating volume binding with pod scheduling -For the new volume binding mode, the proposed new workflow is: -1. Admin pre-provisions PVs and/or StorageClasses. -2. User creates unbound PVC and there are no prebound PVs for it. -3. **NEW:** PVC binding and provisioning is delayed until a pod is created that -references it. -4. User creates a pod that uses the PVC. -5. Pod starts to get processed by the scheduler. -6. Scheduler processes predicates. -7. **NEW:** A new predicate function, called CheckVolumeBinding, will process -both bound and unbound PVCs of the Pod. 
It will validate the VolumeNodeAffinity -for bound PVCs. For unbound PVCs, it will try to find matching PVs for that node -based on the PV NodeAffinity. If there are no matching PVs, then it checks if -dynamic provisioning is possible for that node based on StorageClass -AllowedTopologies. -8. The scheduler continues to evaluate priority functions -9. **NEW:** A new priority -function, called PrioritizeVolumes, will get the PV matches per PVC per -node, and compute a priority score based on various factors. -10. After evaluating all the predicates and priorities, the -scheduler will pick a node. -11. **NEW:** A new assume function, AssumePodVolumes, is called by the scheduler. -The assume function will check if any binding or -provisioning operations need to be done. If so, it will update the PV cache to -mark the PVs with the chosen PVCs and queue the Pod for volume binding. -12. AssumePod is done by the scheduler. -13. **NEW:** If PVC binding or provisioning is required, a new bind function, -BindPodVolumes, will be called asynchronously, passing -in the selected node. The bind function will prebind the PV to the PVC, or -trigger dynamic provisioning. Then, it waits for the binding or provisioning -operation to complete. -14. In the same async thread, scheduler binds the Pod to a Node. -15. Kubelet starts the Pod. - -This diagram depicts the new additions to the default scheduler: - - - -This new workflow will have the scheduler handle unbound PVCs by choosing PVs -and prebinding them to the PVCs. The PV controller completes the binding -transaction, handling it as a prebound PV scenario. - -Prebound PVCs and PVs will still immediately be bound by the PV controller. - -Manual recovery by the user will be required in following error conditions: -* A Pod has multiple PVCs, and only a subset of them successfully bind. - -The primary cause for these errors is if a user or external entity -binds a PV between the time that the scheduler chose the PV and when the -scheduler actually made the API update. Some workarounds to -avoid these error conditions are to: -* Prebind the PV instead. -* Separate out volumes that the user prebinds from the volumes that are -available for the system to choose from by StorageClass. - -### PV Controller Changes -When the feature gate is enabled, the PV controller needs to skip binding -unbound PVCs with VolumBindingWaitForFirstConsumer and no prebound PVs -to let it come through the scheduler path. - -Dynamic provisioning will also be skipped if -VolumBindingWaitForFirstConsumer is set. The scheduler will signal to -the PV controller to start dynamic provisioning by setting the -`annSelectedNode` annotation in the PVC. If provisioning fails, the PV -controller can signal back to the scheduler to retry dynamic provisioning by -removing the `annSelectedNode` annotation. For external provisioners, the -external provisioner needs to remove the annotation. - -No other state machine changes are required. The PV controller continues to -handle the remaining scenarios without any change. - -The methods to find matching PVs for a claim and prebind PVs need to be -refactored for use by the new scheduler functions. 
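A minimal sketch of the skip check described above, assuming the helper name and the StorageClass lookup are handled elsewhere; it uses the `storage.k8s.io/v1` `volumeBindingMode` field names rather than the `BindingMode` spelling proposed earlier, and the selected-node annotation value given for external provisioners in the next section. This is illustrative, not the actual controller implementation.

```go
// Illustrative sketch only: the kind of check the PV controller needs in order
// to skip binding and provisioning until the scheduler has selected a node.
// The helper name and the annSelectedNode constant are assumptions for this
// example.
package pvcontroller

import (
	v1 "k8s.io/api/core/v1"
	storagev1 "k8s.io/api/storage/v1"
)

const annSelectedNode = "volume.alpha.kubernetes.io/selectedNode"

// shouldDelayBinding returns true when the claim's StorageClass requests
// WaitForFirstConsumer and the scheduler has not yet annotated the claim
// with a selected node.
func shouldDelayBinding(claim *v1.PersistentVolumeClaim, class *storagev1.StorageClass) bool {
	if class == nil || class.VolumeBindingMode == nil {
		return false // immediate binding is the default behavior
	}
	if *class.VolumeBindingMode != storagev1.VolumeBindingWaitForFirstConsumer {
		return false
	}
	// Once the scheduler sets the selected-node annotation, the controller
	// proceeds with binding or provisioning for that claim.
	_, nodeSelected := claim.Annotations[annSelectedNode]
	return !nodeSelected
}
```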
- -### Dynamic Provisioning interface changes -The dynamic provisioning interfaces will be updated to pass in: -* selectedNode, when late binding is enabled on the StorageClass -* allowedTopologies, when it is set in the StorageClass - -If selectedNode is set, the provisioner should get its appropriate topology -labels from the Node object, and provision a volume based on those topology -values. In the common use case for a volume supporting a single topology domain, -if nodeName is set, then allowedTopologies can be ignored by the provisioner. -However, multi-domain volume provisioners may still need to look at -allowedTopologies to restrict the remaining domains. - -In-tree provisioners: -``` -Provision(selectedNode *v1.Node, allowedTopologies *storagev1.VolumeProvisioningTopology) (*v1.PersistentVolume, error) -``` - -External provisioners: -* selectedNode will be represented by the PVC annotation "volume.alpha.kubernetes.io/selectedNode". - Value is the name of the node. -* allowedTopologies must be obtained by looking at the StorageClass for the PVC. - -#### New Permissions -Provisioners will need to be able to get Node and StorageClass objects. - -### Scheduler Changes - -#### Predicate -A new predicate function checks all of a Pod's unbound PVCs can be satisfied -by existing PVs or dynamically provisioned PVs that are -topologically-constrained to the Node. -``` -CheckVolumeBinding(pod *v1.Pod, node *v1.Node) (canBeBound bool, err error) -``` -1. If all the Pod’s PVCs are bound, return true. -2. Otherwise try to find matching PVs for all of the unbound PVCs in order of -decreasing requested capacity. -3. Walk through all the PVs. -4. Find best matching PV for the PVC where PV topology is satisfied by the Node. -5. Temporarily cache this PV choice for the PVC per Node, for fast -processing later in the priority and bind functions. -6. Return true if all PVCs are matched. -7. If there are still unmatched PVCs, check if dynamic provisioning is possible, - by evaluating StorageClass.AllowedTopologies. If so, - temporarily cache this decision in the PVC per Node. -8. Otherwise return false. - -Note that we should consider all the cases which may affect predicate cached -results of CheckVolumeBinding and other scheduler predicates, this will be -explained later. - -#### Priority -After all the predicates run, there is a reduced set of Nodes that can fit a -Pod. A new priority function will rank the remaining nodes based on the -unbound PVCs and their matching PVs. -``` -PrioritizeVolumes(pod *v1.Pod, filteredNodes HostPriorityList) (rankedNodes HostPriorityList, err error) -``` -1. For each Node, get the cached PV matches for the Pod’s PVCs. -2. Compute a priority score for the Node using the following factors: - 1. How close the PVC’s requested capacity and PV’s capacity are. - 2. Matching pre-provisioned PVs is preferred over dynamic provisioning because we - assume that the administrator has specifically created these PVs for - the Pod. - -TODO (beta): figure out weights and exact calculation - -#### Assume -Once all the predicates and priorities have run, then the scheduler picks a -Node. Then we can bind or provision PVCs for that Node. For better scheduler -performance, we’ll assume that the binding will likely succeed, and update the -PV and PVC caches first. Then the actual binding API update will be made -asynchronously, and the scheduler can continue processing other Pods. - -For the alpha phase, the AssumePodVolumes function will be directly called by the -scheduler. 
We’ll consider creating a generic scheduler interface in a -subsequent phase. - -``` -AssumePodVolumes(pod *v1.pod, node *v1.node) (pvcbindingrequired bool, err error) -``` -1. If all the Pod’s PVCs are bound, return false. -2. For pre-provisioned PV binding: - 1. Get the cached matching PVs for the PVCs on that Node. - 2. Validate the actual PV state. - 3. Mark PV.ClaimRef in the PV cache. - 4. Cache the PVs that need binding in the Pod object. -3. For in-tree and external dynamic provisioning: - 1. Mark the PVC annSelectedNode in the PVC cache. - 2. Cache the PVCs that need provisioning in the Pod object. -4. Return true - -#### Bind -A separate go routine performs the binding operation for the Pod. - -If AssumePodVolumes returns pvcBindingRequired, then BindPodVolumes is called -first in this go routine. It will handle binding and provisioning of PVCs that -were assumed, and wait for the operations to complete. - -Once complete, or if no volumes need to be bound, then the scheduler continues -binding the Pod to the Node. - -For the alpha phase, the BindPodVolumes function will be directly called by the -scheduler. We’ll consider creating a generic scheduler interface in a subsequent -phase. - -``` -BindPodVolumes(pod *v1.Pod, node *v1.Node) (err error) -``` -1. For pre-provisioned PV binding: - 1. Prebind the PV by updating the `PersistentVolume.ClaimRef` field. - 2. If the prebind fails, revert the cache updates. -2. For in-tree and external dynamic provisioning: - 1. Set `annSelectedNode` on the PVC. -3. Wait for binding and provisioning to complete. - 1. In the case of failure, error is returned and the Pod will retry - scheduling. Failure scenarios include: - * PV or PVC got deleted - * PV.ClaimRef got cleared - * PVC selectedNode annotation got cleared or is set to the wrong node - -TODO: pv controller has a high resync frequency, do we need something similar -for the scheduler too - -#### Access Control -Scheduler will need PV update permissions for prebinding pre-provisioned PVs, and PVC -update permissions for triggering dynamic provisioning. - -#### Pod preemption considerations -The CheckVolumeBinding predicate does not need to be re-evaluated for pod -preemption. Preempting a pod that uses a PV will not free up capacity on that -node because the PV lifecycle is independent of the Pod’s lifecycle. - -#### Other scheduler predicates -Currently, there are a few existing scheduler predicates that require the PVC -to be bound. The bound assumption needs to be changed in order to work with -this new workflow. - -TODO: how to handle race condition of PVCs becoming bound in the middle of -running predicates? One possible way is to mark at the beginning of scheduling -a Pod if all PVCs were bound. Then we can check if a second scheduler pass is -needed. - -##### Max PD Volume Count Predicate -This predicate checks the maximum number of PDs per node is not exceeded. It -needs to be integrated into the binding decision so that we don’t bind or -provision a PV if it’s going to cause the node to exceed the max PD limit. But -until it is integrated, we need to make one more pass in the scheduler after all -the PVCs are bound. The current copy of the predicate in the default scheduler -has to remain to account for the already-bound volumes. - -##### Volume Zone Predicate -This predicate makes sure that the zone label on a PV matches the zone label of -the node. If the volume is not bound, this predicate can be ignored, as the -binding logic will take into account zone constraints on the PV. 
- -However, this assumes that zonal PVs like GCE PDs and AWS EBS have been updated -to use the new PV topology specification, which is not the case as of 1.8. So -until those plugins are updated, the binding and provisioning decisions will be -topology-unaware, and we need to make one more pass in the scheduler after all -the PVCs are bound. - -This predicate needs to remain in the default scheduler to handle the -already-bound volumes using the old zonal labeling, but must be updated to skip -unbound PVC if StorageClass binding mode is WaitForFirstConsumer. It can be -removed once that mechanism is deprecated and unsupported. - -##### Volume Node Predicate -This is a new predicate added in 1.7 to handle the new PV node affinity. It -evaluates the node affinity against the node’s labels to determine if the pod -can be scheduled on that node. If the volume is not bound, this predicate can -be ignored, as the binding logic will take into account the PV node affinity. - -#### Caching -There are two new caches needed in the scheduler. - -The first cache is for handling the PV/PVC API binding updates occurring -asynchronously with the main scheduler loop. `AssumePodVolumes` needs to store -the updated API objects before `BindPodVolumes` makes the API update, so -that future binding decisions will not choose any assumed PVs. In addition, -if the API update fails, the cached updates need to be reverted and restored -with the actual API object. The cache will return either the cached-only -object, or the informer object, whichever one is latest. Informer updates -will always override the cached-only object. The new predicate and priority -functions must get the objects from this cache instead of from the informer cache. -This cache only stores pointers to objects and most of the time will only -point to the informer object, so the memory footprint per object is small. - -The second cache is for storing temporary state as the Pod goes from -predicates to priorities and then assume. This all happens serially, so -the cache can be cleared at the beginning of each pod scheduling loop. This -cache is used for: -* Indicating if all the PVCs are already bound at the beginning of the pod -scheduling loop. This is to handle situations where volumes may have become -bound in the middle of processing the predicates. We need to ensure that -all the volume predicates are fully run once all PVCs are bound. -* Caching PV matches per node decisions that the predicate had made. This is -an optimization to avoid walking through all the PVs again in priority and -assume functions. -* Caching PVC dynamic provisioning decisions per node that the predicate had - made. - -#### Event handling - -##### Move pods into active queue -When a pod is tried and determined to be unschedulable, it will be placed in -the unschedulable queue by scheduler. It will not be scheduled until being -moved to active queue. For volume topology scheduling, we need to move -pods to active queue in following scenarios: - -- on PVC add - - Pod which references nonexistent PVCs is unschedulable for now, we need to - move pods to active queue when a PVC is added. - -- on PVC update - - The proposed design has the scheduler initiating the binding transaction by - prebinding the PV and waiting for PV controller to finish binding and put it - back in the schedule queue. To achieve this, we need to move pods to active - queue on PVC update. - -- on PV add - - Pods created when there are no PVs available will be stuck in unschedulable - queue. 
But unbound PVs created for static provisioning and delay binding - storage class are skipped in PV controller dynamic provisioning and binding - process, will not trigger events to schedule pod again. So we need to move - pods to active queue on PV add for this scenario. - -- on PV update - - In scheduler assume process, if volume binding is required, scheduler will - put pod to unschedulable queue and wait for asynchronous volume binding - updates are made. But binding volumes worker may fail to update assumed pod - volume bindings due to conflicts if PVs are updated by PV controller or other - entities. So we need to move pods to active queue on PV update for this - scenario. - -- on Storage Class add - - CheckVolumeBindingPred will fail if pod has unbound immediate PVCs. If these - PVCs have specified StorageClass name, creating StorageClass objects with - late binding for these PVCs will cause predicates to pass, so we need to move - pods to active queue when a StorageClass with WaitForFirstConsumer is added. - -##### Invalidate predicate equivalence cache -Scheduler now have an optional [equivalence -cache](../scheduling/scheduler-equivalence-class.md#goals) to improve -scheduler's scalability. We need to invalidate -CheckVolumeBinding/NoVolumeZoneConflict predicate cached results in following -scenarios to keep equivalence class cache up to date: - -- on PVC add/delete - - When PVCs are created or deleted, available PVs to choose from for volume - scheduling may change, we need to invalidate CheckVolumeBinding predicate. - -- on PVC update - - PVC volume binding may change on PVC update, we need to invalidate - CheckVolumeBinding predicate. - -- on PV add/delete - - When PVs are created or deleted, available PVs to choose from for volume - scheduling will change, we need to invalidate CheckVolumeBinding - predicate. - -- on PV update - - CheckVolumeBinding predicate may cache PVs in pod binding cache. When PV got - updated, we should invalidate cache, otherwise assume process will fail - with out of sync error. - -- on StorageClass delete - - When a StorageClass with WaitForFirstConsumer is deleted, PVCs which references - this storage class will be in immediate binding mode. We need to invalidate - CheckVolumeBinding and NoVolumeZoneConflict. - -#### Performance and Optimizations -Let: -* N = number of nodes -* V = number of all PVs -* C = number of claims in a pod - -C is expected to be very small (< 5) so shouldn’t factor in. - -The current PV binding mechanism just walks through all the PVs once, so its -running time O(V). - -Without any optimizations, the new PV binding mechanism has to run through all -PVs for every node, so its running time is O(NV). - -A few optimizations can be made to improve the performance: - -1. PVs that don’t use node affinity should not be using delayed binding. -2. Optimizing for PVs that have node affinity: - 1. When a static PV is created, if node affinity is present, evaluate it -against all the nodes. For each node, keep an in-memory map of all its PVs -keyed by StorageClass. When finding matching PVs for a particular node, try to -match against the PVs in the node’s PV map instead of the cluster-wide PV list. - -For the alpha phase, the optimizations are not required. However, they should -be required for beta and GA. 
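To make optimization 2 above more concrete, here is a rough Go sketch of a per-node PV index keyed by StorageClass. The types are invented for illustration, and the node-affinity evaluation is assumed to have already happened when a PV is added to the index.

```go
// Sketch only: an in-memory index of PVs per node and StorageClass, so that
// per-node matching does not have to rescan every PV in the cluster. Types
// are invented; evaluating PV node affinity against node labels is assumed
// to have been done before a PV is added here.
package main

import "fmt"

type PV struct {
	Name      string
	Class     string
	NodeNames []string // nodes whose labels satisfy this PV's node affinity
}

// nodePVIndex maps node name -> StorageClass name -> PVs usable on that node.
type nodePVIndex map[string]map[string][]PV

func (idx nodePVIndex) add(pv PV) {
	for _, node := range pv.NodeNames {
		if idx[node] == nil {
			idx[node] = map[string][]PV{}
		}
		idx[node][pv.Class] = append(idx[node][pv.Class], pv)
	}
}

// candidates returns the PVs of the given class that can be used on the node,
// replacing a walk over the cluster-wide PV list.
func (idx nodePVIndex) candidates(node, class string) []PV {
	return idx[node][class]
}

func main() {
	idx := nodePVIndex{}
	idx.add(PV{Name: "local-pv-1", Class: "local-fast", NodeNames: []string{"node-a"}})
	fmt.Println(idx.candidates("node-a", "local-fast"))
	fmt.Println(idx.candidates("node-b", "local-fast")) // empty: nothing indexed for node-b
}
```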
- -### Packaging -The new bind logic that is invoked by the scheduler can be packaged in a few -ways: -* As a library to be directly called in the default scheduler -* As a scheduler extender - -We propose taking the library approach, as this method is simplest to release -and deploy. Some downsides are: -* The binding logic will be executed using two different caches, one in the -scheduler process, and one in the PV controller process. There is the potential -for more race conditions due to the caches being out of sync. -* Refactoring the binding logic into a common library is more challenging -because the scheduler’s cache and PV controller’s cache have different interfaces -and private methods. - -#### Extender cons -However, the cons of the extender approach outweighs the cons of the library -approach. - -With an extender approach, the PV controller could implement the scheduler -extender HTTP endpoint, and the advantage is the binding logic triggered by the -scheduler can share the same caches and state as the PV controller. - -However, deployment of this scheduler extender in a master HA configuration is -extremely complex. The scheduler has to be configured with the hostname or IP of -the PV controller. In a HA setup, the active scheduler and active PV controller -could run on the same, or different node, and the node can change at any time. -Exporting a network endpoint in the controller manager process is unprecedented -and there would be many additional features required, such as adding a mechanism -to get a stable network name, adding authorization and access control, and -dealing with DDOS attacks and other potential security issues. Adding to those -challenges is the fact that there are countless ways for users to deploy -Kubernetes. - -With all this complexity, the library approach is the most feasible in a single -release time frame, and aligns better with the current Kubernetes architecture. - -### Downsides - -#### Unsupported Use Cases -The following use cases will not be supported for PVCs with a StorageClass with -BindingWaitForFirstConsumer: -* Directly setting Pod.Spec.NodeName -* DaemonSets - -These two use cases will bypass the default scheduler and thus will not -trigger PV binding. - -#### Custom Schedulers -Custom schedulers, controllers and operators that handle pod scheduling and want -to support this new volume binding mode will also need to handle the volume -binding decision. - -There are a few ways to take advantage of this feature: -* Custom schedulers could be implemented through the scheduler extender -interface. This allows the default scheduler to be run in addition to the -custom scheduling logic. -* The new code for this implementation will be packaged as a library to make it -easier for custom schedulers to include in their own implementation. - -In general, many advanced scheduling features have been added into the default -scheduler, such that it is becoming more difficult to run without it. - -#### HA Master Upgrades -HA masters adds a bit of complexity to this design because the active scheduler -process and active controller-manager (PV controller) process can be on different -nodes. That means during an HA master upgrade, the scheduler and controller-manager -can be on different versions. - -The scenario where the scheduler is newer than the PV controller is fine. PV -binding will not be delayed and in successful scenarios, all PVCs will be bound -before coming to the scheduler. 
- -However, if the PV controller is newer than the scheduler, then PV binding will -be delayed, and the scheduler does not have the logic to choose and prebind PVs. -That will cause PVCs to remain unbound and the Pod will remain unschedulable. - -TODO: One way to solve this is to have some new mechanism to feature gate system -components based on versions. That way, the new feature is not turned on until -all dependencies are at the required versions. - -For alpha, this is not concerning, but it needs to be solved by GA. - -### Other Alternatives Considered - -#### One scheduler function -An alternative design considered was to do the predicate, priority and bind -functions all in one function at the end right before Pod binding, in order to -reduce the number of passes we have to make over all the PVs. However, this -design does not work well with pod preemption. Pod preemption needs to be able -to evaluate if evicting a lower priority Pod will make a higher priority Pod -schedulable, and it does this by re-evaluating predicates without the lower -priority Pod. - -If we had put the MatchUnboundPVCs predicate at the end, then pod preemption -wouldn’t have an accurate filtered nodes list, and could end up preempting pods -on a Node that the higher priority pod still cannot run on due to PVC -requirements. For that reason, the PVC binding decision needs to be have its -predicate function separated out and evaluated with the rest of the predicates. - -#### Pull entire PVC binding into the scheduler -The proposed design only has the scheduler initiating the binding transaction -by prebinding the PV. An alternative is to pull the whole two-way binding -transaction into the scheduler, but there are some complex scenarios that -scheduler’s Pod sync loop cannot handle: -* PVC and PV getting unexpectedly unbound or lost -* PVC and PV state getting partially updated -* PVC and PV deletion and cleanup - -Handling these scenarios in the scheduler’s Pod sync loop is not possible, so -they have to remain in the PV controller. - -#### Keep all PVC binding in the PV controller -Instead of initiating PV binding in the scheduler, have the PV controller wait -until the Pod has been scheduled to a Node, and then try to bind based on the -chosen Node. A new scheduling predicate is still needed to filter and match -the PVs (but not actually bind). - -The advantages are: -* Existing scenarios where scheduler is bypassed will work. -* Custom schedulers will continue to work without any changes. -* Most of the PV logic is still contained in the PV controller, simplifying HA -upgrades. - -Major downsides of this approach include: -* Requires PV controller to watch Pods and potentially change its sync loop -to operate on pods, in order to handle the multiple PVCs in a pod scenario. -This is a potentially big change that would be hard to keep separate and -feature-gated from the current PV logic. -* Both scheduler and PV controller processes have to make the binding decision, -but because they are done asynchronously, it is possible for them to choose -different PVs. The scheduler has to cache its decision so that it won't choose -the same PV for another PVC. But by the time PV controller handles that PVC, -it could choose a different PV than the scheduler. - * Recovering from this inconsistent decision and syncing the two caches is -very difficult. 
The scheduler could have made a cascading sequence of decisions -based on the first inconsistent decision, and they would all have to somehow be -fixed based on the real PVC/PV state. -* If the scheduler process restarts, it loses all its in-memory PV decisions and -can make a lot of wrong decisions after the restart. -* All the volume scheduler predicates that require PVC to be bound will not get -evaluated. To solve this, all the volume predicates need to also be built into -the PV controller when matching possible PVs. - -#### Move PVC binding to kubelet -Looking into the future, with the potential for NUMA-aware scheduling, you could -have a sub-scheduler on each node to handle the pod scheduling within a node. It -could make sense to have the volume binding as part of this sub-scheduler, to make -sure that the volume selected will have NUMA affinity with the rest of the -resources that the pod requested. - -However, there are potential security concerns because kubelet would need to see -unbound PVs in order to bind them. For local storage, the PVs could be restricted -to just that node, but for zonal storage, it could see all the PVs in that zone. - -In addition, the sub-scheduler is just a thought at this point, and there are no -concrete proposals in this area yet. - -## Binding multiple PVCs in one transaction -There are no plans to handle this, but a possible solution is presented here if the -need arises in the future. Since the scheduler is serialized, a partial binding -failure should be a rare occurrence and would only be caused if there is a user or -other external entity also trying to bind the same volumes. - -One possible approach to handle this is to rollback previously bound PVCs on -error. However, volume binding cannot be blindly rolled back because there could -be user's data on the volumes. - -For rollback, PersistentVolumeClaims will have a new status to indicate if it's -clean or dirty. For backwards compatibility, a nil value is defaulted to dirty. -The PV controller will set the status to clean if the PV is Available and unbound. -Kubelet will set the PV status to dirty during Pod admission, before adding the -volume to the desired state. - -If scheduling fails, update all bound PVCs with an annotation, -"pv.kubernetes.io/rollback". The PV controller will only unbind PVCs that -are clean. Scheduler and kubelet needs to reject pods with PVCs that are -undergoing rollback. - -## Recovering from kubelet rejection of pod -We can use the same rollback mechanism as above to handle this case. -If kubelet rejects a pod, it will go back to scheduling. If the scheduler -cannot find a node for the pod, then it will encounter scheduling failure and -initiate the rollback. - - -## Testing - -### E2E tests -* StatefulSet, replicas=3, specifying pod anti-affinity - * Positive: Local PVs on each of the nodes - * Negative: Local PVs only on 2 out of the 3 nodes -* StatefulSet specifying pod affinity - * Positive: Multiple local PVs on a node - * Negative: Only one local PV available per node -* Multiple PVCs specified in a pod - * Positive: Enough local PVs available on a single node - * Negative: Not enough local PVs available on a single node -* Fallback to dynamic provisioning if unsuitable pre-provisioned PVs - -### Unit tests -* All PVCs found a match on first node. Verify match is best suited based on -capacity. -* All PVCs found a match on second node. Verify match is best suited based on -capacity. -* Only 2 out of 3 PVCs have a match. 
-* Priority scoring doesn’t change the given priorityList order. -* Priority scoring changes the priorityList order. -* Don’t match PVs that are prebound - - -## Implementation Plan - -### Alpha -* New feature gate for volume topology scheduling -* StorageClass API change -* Refactor PV controller methods into a common library -* PV controller: Delay binding and provisioning unbound PVCs -* Predicate: Filter nodes and find matching PVs -* Predicate: Check if provisioner exists for dynamic provisioning -* Update existing predicates to skip unbound PVC -* Bind: Trigger PV binding -* Bind: Trigger dynamic provisioning -a Pod (only if alpha is enabled) - -### Beta -* Scheduler cache: Optimizations for no PV node affinity -* Priority: capacity match score -* Plugins: Convert all zonal volume plugins to use new PV node affinity (GCE PD, -AWS EBS, what else?) -* Make dynamic provisioning topology aware - -### GA -* Predicate: Handle max PD per node limit -* Scheduler cache: Optimizations for PV node affinity - - -## Open Issues -* Can generic device resource API be leveraged at all? Probably not, because: - * It will only work for local storage (node specific devices), and not zonal -storage. - * Storage already has its own first class resources in K8s (PVC/PV) with an -independent lifecycle. The current resource API proposal does not have an a way to -specify identity/persistence for devices. -* Will this be able to work with the node sub-scheduler design for NUMA-aware -scheduling? - * It’s still in a very early discussion phase. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/storage/volume-topology-scheduling.png b/contributors/design-proposals/storage/volume-topology-scheduling.png Binary files differdeleted file mode 100644 index 0e724aef..00000000 --- a/contributors/design-proposals/storage/volume-topology-scheduling.png +++ /dev/null diff --git a/contributors/design-proposals/storage/volumes.md b/contributors/design-proposals/storage/volumes.md index a963b279..f0fbec72 100644 --- a/contributors/design-proposals/storage/volumes.md +++ b/contributors/design-proposals/storage/volumes.md @@ -1,478 +1,6 @@ -## Abstract +Design proposals have been archived. -A proposal for sharing volumes between containers in a pod using a special supplemental group. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Motivation -Kubernetes volumes should be usable regardless of the UID a container runs as. This concern cuts -across all volume types, so the system should be able to handle them in a generalized way to provide -uniform functionality across all volume types and lower the barrier to new plugins. - -Goals of this design: - -1. Enumerate the different use-cases for volume usage in pods -2. Define the desired goal state for ownership and permission management in Kubernetes -3. Describe the changes necessary to achieve desired state - -## Constraints and Assumptions - -1. When writing permissions in this proposal, `D` represents a don't-care value; example: `07D0` - represents permissions where the owner has `7` permissions, all has `0` permissions, and group - has a don't-care value -2. Read-write usability of a volume from a container is defined as one of: - 1. The volume is owned by the container's effective UID and has permissions `07D0` - 2. The volume is owned by the container's effective GID or one of its supplemental groups and - has permissions `0D70` -3. Volume plugins should not have to handle setting permissions on volumes -5. Preventing two containers within a pod from reading and writing to the same volume (by choosing - different container UIDs) is not something we intend to support today -6. We will not design to support multiple processes running in a single container as different - UIDs; use cases that require work by different UIDs should be divided into different pods for - each UID - -## Current State Overview - -### Kubernetes - -Kubernetes volumes can be divided into two broad categories: - -1. Unshared storage: - 1. Volumes created by the kubelet on the host directory: empty directory, git repo, secret, - downward api. All volumes in this category delegate to `EmptyDir` for their underlying - storage. These volumes are created with ownership `root:root`. - 2. Volumes based on network block devices: AWS EBS, iSCSI, RBD, etc, *when used exclusively - by a single pod*. -2. Shared storage: - 1. `hostPath` is shared storage because it is necessarily used by a container and the host - 2. Network file systems such as NFS, Glusterfs, Cephfs, etc. For these volumes, the ownership - is determined by the configuration of the shared storage system. - 3. Block device based volumes in `ReadOnlyMany` or `ReadWriteMany` modes are shared because - they may be used simultaneously by multiple pods. - -The `EmptyDir` volume was recently modified to create the volume directory with `0777` permissions -from `0750` to support basic usability of that volume as a non-root UID. 
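As a small illustration of the read-write usability definition from the constraints above, the following Go sketch (with invented types; not kubelet code) checks whether a container identity can use a volume based on its ownership and permission bits:

```go
// Sketch only: the read-write usability check from the constraints above,
// where "D" is a don't-care digit. Types are invented for illustration.
package main

import "fmt"

type volumeOwnership struct {
	UID, GID int
	Mode     uint32 // e.g. 02770
}

type containerIdentity struct {
	UID, GID           int
	SupplementalGroups []int
}

// usable reports whether the container can read and write the volume: the
// owner matches the effective UID and the owner bits are rwx (07D0), or the
// owning group matches the effective GID or a supplemental group and the
// group bits are rwx (0D70).
func usable(v volumeOwnership, c containerIdentity) bool {
	if v.UID == c.UID && v.Mode&0700 == 0700 {
		return true
	}
	if v.Mode&0070 != 0070 {
		return false
	}
	if v.GID == c.GID {
		return true
	}
	for _, g := range c.SupplementalGroups {
		if v.GID == g {
			return true
		}
	}
	return false
}

func main() {
	vol := volumeOwnership{UID: 0, GID: 1001, Mode: 02770} // root:1001, setgid + 0770
	fmt.Println(usable(vol, containerIdentity{UID: 1009, GID: 1009, SupplementalGroups: []int{1001}})) // true
	fmt.Println(usable(vol, containerIdentity{UID: 1010, GID: 1010}))                                  // false
}
```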
- -### Docker - -Docker recently added supplemental group support. This adds the ability to specify additional -groups that a container should be part of, and will be released with Docker 1.8. - -There is a [proposal](https://github.com/docker/docker/pull/14632) to add a bind-mount flag to tell -Docker to change the ownership of a volume to the effective UID and GID of a container, but this has -not yet been accepted. - -### rkt - -rkt -[image manifests](https://github.com/appc/spec/blob/master/spec/aci.md#image-manifest-schema) can -specify users and groups, similarly to how a Docker image can. A rkt -[pod manifest](https://github.com/appc/spec/blob/master/spec/pods.md#pod-manifest-schema) can also -override the default user and group specified by the image manifest. - -rkt does not currently support supplemental groups or changing the owning UID or -group of a volume, but it has been [requested](https://github.com/coreos/rkt/issues/1309). - -## Use Cases - -1. As a user, I want the system to set ownership and permissions on volumes correctly to enable - reads and writes with the following scenarios: - 1. All containers running as root - 2. All containers running as the same non-root user - 3. Multiple containers running as a mix of root and non-root users - -### All containers running as root - -For volumes that only need to be used by root, no action needs to be taken to change ownership or -permissions, but setting the ownership based on the supplemental group shared by all containers in a -pod will also work. For situations where read-only access to a shared volume is required from one -or more containers, the `VolumeMount`s in those containers should have the `readOnly` field set. - -### All containers running as a single non-root user - -In use cases whether a volume is used by a single non-root UID the volume ownership and permissions -should be set to enable read/write access. - -Currently, a non-root UID will not have permissions to write to any but an `EmptyDir` volume. -Today, users that need this case to work can: - -1. Grant the container the necessary capabilities to `chown` and `chmod` the volume: - - `CAP_FOWNER` - - `CAP_CHOWN` - - `CAP_DAC_OVERRIDE` -2. Run a wrapper script that runs `chown` and `chmod` commands to set the desired ownership and - permissions on the volume before starting their main process - -This workaround has significant drawbacks: - -1. It grants powerful kernel capabilities to the code in the image and thus is not securing, - defeating the reason containers are run as non-root users -2. The user experience is poor; it requires changing Dockerfile, adding a layer, or modifying the - container's command - -Some cluster operators manage the ownership of shared storage volumes on the server side. -In this scenario, the UID of the container using the volume is known in advance. The ownership of -the volume is set to match the container's UID on the server side. - -### Containers running as a mix of root and non-root users - -If the list of UIDs that need to use a volume includes both root and non-root users, supplemental -groups can be applied to enable sharing volumes between containers. The ownership and permissions -`root:<supplemental group> 2770` will make a volume usable from both containers running as root and -running as a non-root UID and the supplemental group. The setgid bit is used to ensure that files -created in the volume will inherit the owning GID of the volume. 
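As a hedged sketch of how the `root:<supplemental group>` ownership with mode `2770` described above could be applied to a volume directory (illustrative only, not kubelet code; the path and GID are placeholders):

```go
// Sketch only: apply root:<supplemental group> ownership with mode 2770 to a
// volume directory, so both root and members of the supplemental group can
// read and write, and new files inherit the group via the setgid bit.
package main

import (
	"log"
	"os"
)

func setVolumeOwnership(path string, fsGroup int) error {
	// root (uid 0) stays the owner; the supplemental group owns the volume.
	if err := os.Chown(path, 0, fsGroup); err != nil {
		return err
	}
	// 0770 plus the setgid bit so files created inside inherit fsGroup.
	return os.Chmod(path, 0770|os.ModeSetgid)
}

func main() {
	// Placeholder path and GID; running this requires sufficient privileges.
	if err := setVolumeOwnership("/var/lib/kubelet/example-volume", 1001); err != nil {
		log.Fatal(err)
	}
}
```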
- -## Community Design Discussion - -- [kubernetes/2630](https://github.com/kubernetes/kubernetes/issues/2630) -- [kubernetes/11319](https://github.com/kubernetes/kubernetes/issues/11319) -- [kubernetes/9384](https://github.com/kubernetes/kubernetes/pull/9384) - -## Analysis - -The system needs to be able to: - -1. Model correctly which volumes require ownership management -1. Determine the correct ownership of each volume in a pod if required -1. Set the ownership and permissions on volumes when required - -### Modeling whether a volume requires ownership management - -#### Unshared storage: volumes derived from `EmptyDir` - -Since Kubernetes creates `EmptyDir` volumes, it should ensure the ownership is set to enable the -volumes to be usable for all of the above scenarios. - -#### Unshared storage: network block devices - -Volume plugins based on network block devices such as AWS EBS and RBS can be treated the same way -as local volumes. Since inodes are written to these block devices in the same way as `EmptyDir` -volumes, permissions and ownership can be managed on the client side by the Kubelet when used -exclusively by one pod. When the volumes are used outside of a persistent volume, or with the -`ReadWriteOnce` mode, they are effectively unshared storage. - -When used by multiple pods, there are many additional use-cases to analyze before we can be -confident that we can support ownership management robustly with these file systems. The right -design is one that makes it easy to experiment and develop support for ownership management with -volume plugins to enable developers and cluster operators to continue exploring these issues. - -#### Shared storage: hostPath - -The `hostPath` volume should only be used by effective-root users, and the permissions of paths -exposed into containers via hostPath volumes should always be managed by the cluster operator. If -the Kubelet managed the ownership for `hostPath` volumes, a user who could create a `hostPath` -volume could affect changes in the state of arbitrary paths within the host's filesystem. This -would be a severe security risk, so we will consider hostPath a corner case that the kubelet should -never perform ownership management for. - -#### Shared storage - -Ownership management of shared storage is a complex topic. Ownership for existing shared storage -will be managed externally from Kubernetes. For this case, our API should make it simple to express -whether a particular volume should have these concerns managed by Kubernetes. - -We will not attempt to address the ownership and permissions concerns of new shared storage -in this proposal. - -When a network block device is used as a persistent volume in `ReadWriteMany` or `ReadOnlyMany` -modes, it is shared storage, and thus outside the scope of this proposal. - -#### Plugin API requirements - -From the above, we know that some volume plugins will 'want' ownership management from the Kubelet -and others will not. Plugins should be able to opt in to ownership management from the Kubelet. To -facilitate this, there should be a method added to the `volume.Plugin` interface that the Kubelet -uses to determine whether to perform ownership management for a volume. - -### Determining correct ownership of a volume - -Using the approach of a pod-level supplemental group to own volumes solves the problem in any of the -cases of UID/GID combinations within a pod. Since this is the simplest approach that handles all -use-cases, our solution will be made in terms of it. 
- -Eventually, Kubernetes should allocate a unique group for each pod so that a pod's volumes are -usable by that pod's containers, but not by containers of another pod. The supplemental group used -to share volumes must be unique in a multitenant cluster. If uniqueness is enforced at the host -level, pods from one host may be able to use shared filesystems meant for pods on another host. - -Eventually, Kubernetes should integrate with external identity management systems to populate pod -specs with the right supplemental groups necessary to use shared volumes. In the interim until the -identity management story is far enough along to implement this type of integration, we will rely -on being able to set arbitrary groups. (Note: as of this writing, a PR is being prepared for -setting arbitrary supplemental groups). - -An admission controller could handle allocating groups for each pod and setting the group in the -pod's security context. - -#### A note on the root group - -Today, by default, all docker containers are run in the root group (GID 0). This is relied on by -image authors that make images to run with a range of UIDs: they set the group ownership for -important paths to be the root group, so that containers running as GID 0 *and* an arbitrary UID -can read and write to those paths normally. - -It is important to note that the changes proposed here will not affect the primary GID of -containers in pods. Setting the `pod.Spec.SecurityContext.FSGroup` field will not -override the primary GID and should be safe to use in images that expect GID 0. - -### Setting ownership and permissions on volumes - -For `EmptyDir`-based volumes and unshared storage, `chown` and `chmod` on the node are sufficient to -set ownership and permissions. Shared storage is different because: - -1. Shared storage may not live on the node a pod that uses it runs on -2. Shared storage may be externally managed - -## Proposed design: - -Our design should minimize code for handling ownership required in the Kubelet and volume plugins. - -### API changes - -We should not interfere with images that need to run as a particular UID or primary GID. A pod -level supplemental group allows us to express a group that all containers in a pod run as in a way -that is orthogonal to the primary UID and GID of each container process. - -```go -package api - -type PodSecurityContext struct { - // FSGroup is a supplemental group that all containers in a pod run under. This group will own - // volumes that the Kubelet manages ownership for. If this is not specified, the Kubelet will - // not set the group ownership of any volumes. - FSGroup *int64 `json:"fsGroup,omitempty"` -} -``` - -The V1 API will be extended with the same field: - -```go -package v1 - -type PodSecurityContext struct { - // FSGroup is a supplemental group that all containers in a pod run under. This group will own - // volumes that the Kubelet manages ownership for. If this is not specified, the Kubelet will - // not set the group ownership of any volumes. - FSGroup *int64 `json:"fsGroup,omitempty"` -} -``` - -The values that can be specified for the `pod.Spec.SecurityContext.FSGroup` field are governed by -[pod security policy](https://github.com/kubernetes/kubernetes/pull/7893). - -#### API backward compatibility - -Pods created by old clients will have the `pod.Spec.SecurityContext.FSGroup` field unset; -these pods will not have their volumes managed by the Kubelet. Old clients will not be able to set -or read the `pod.Spec.SecurityContext.FSGroup` field. 
- -### Volume changes - -The `volume.Mounter` interface should have a new method added that indicates whether the plugin -supports ownership management: - -```go -package volume - -type Mounter interface { - // other methods omitted - - // SupportsOwnershipManagement indicates that this volume supports having ownership - // and permissions managed by the Kubelet; if true, the caller may manipulate UID - // or GID of this volume. - SupportsOwnershipManagement() bool -} -``` - -In the first round of work, only `hostPath` and `emptyDir` and its derivations will be tested with -ownership management support: - -| Plugin Name | SupportsOwnershipManagement | -|-------------------------|-------------------------------| -| `hostPath` | false | -| `emptyDir` | true | -| `gitRepo` | true | -| `secret` | true | -| `downwardAPI` | true | -| `gcePersistentDisk` | false | -| `awsElasticBlockStore` | false | -| `nfs` | false | -| `iscsi` | false | -| `glusterfs` | false | -| `persistentVolumeClaim` | depends on underlying volume and PV mode | -| `rbd` | false | -| `cinder` | false | -| `cephfs` | false | - -Ultimately, the matrix will theoretically look like: - -| Plugin Name | SupportsOwnershipManagement | -|-------------------------|-------------------------------| -| `hostPath` | false | -| `emptyDir` | true | -| `gitRepo` | true | -| `secret` | true | -| `downwardAPI` | true | -| `gcePersistentDisk` | true | -| `awsElasticBlockStore` | true | -| `nfs` | false | -| `iscsi` | true | -| `glusterfs` | false | -| `persistentVolumeClaim` | depends on underlying volume and PV mode | -| `rbd` | true | -| `cinder` | false | -| `cephfs` | false | - -### Kubelet changes - -The Kubelet should be modified to perform ownership and label management when required for a volume. - -For ownership management the criteria are: - -1. The `pod.Spec.SecurityContext.FSGroup` field is populated -2. The volume builder returns `true` from `SupportsOwnershipManagement` - -Logic should be added to the `mountExternalVolumes` method that runs a local `chgrp` and `chmod` if -the pod-level supplemental group is set and the volume supports ownership management: - -```go -package kubelet - -type ChgrpRunner interface { - Chgrp(path string, gid int) error -} - -type ChmodRunner interface { - Chmod(path string, mode os.FileMode) error -} - -type Kubelet struct { - chgrpRunner ChgrpRunner - chmodRunner ChmodRunner -} - -func (kl *Kubelet) mountExternalVolumes(pod *api.Pod) (kubecontainer.VolumeMap, error) { - podFSGroup = pod.Spec.PodSecurityContext.FSGroup - podFSGroupSet := false - if podFSGroup != 0 { - podFSGroupSet = true - } - - podVolumes := make(kubecontainer.VolumeMap) - - for i := range pod.Spec.Volumes { - volSpec := &pod.Spec.Volumes[i] - - rootContext, err := kl.getRootDirContext() - if err != nil { - return nil, err - } - - // Try to use a plugin for this volume. 
- internal := volume.NewSpecFromVolume(volSpec) - builder, err := kl.newVolumeMounterFromPlugins(internal, pod, volume.VolumeOptions{RootContext: rootContext}, kl.mounter) - if err != nil { - glog.Errorf("Could not create volume builder for pod %s: %v", pod.UID, err) - return nil, err - } - if builder == nil { - return nil, errUnsupportedVolumeType - } - err = builder.SetUp() - if err != nil { - return nil, err - } - - if builder.SupportsOwnershipManagement() && - podFSGroupSet { - err = kl.chgrpRunner.Chgrp(builder.GetPath(), podFSGroup) - if err != nil { - return nil, err - } - - err = kl.chmodRunner.Chmod(builder.GetPath(), os.FileMode(1770)) - if err != nil { - return nil, err - } - } - - podVolumes[volSpec.Name] = builder - } - - return podVolumes, nil -} -``` - -This allows the volume plugins to determine when they do and don't want this type of support from -the Kubelet, and allows the criteria each plugin uses to evolve without changing the Kubelet. - -The docker runtime will be modified to set the supplemental group of each container based on the -`pod.Spec.SecurityContext.FSGroup` field. Theoretically, the `rkt` runtime could support this -feature in a similar way. - -### Examples - -#### EmptyDir - -For a pod that has two containers sharing an `EmptyDir` volume: - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: test-pod -spec: - securityContext: - fsGroup: 1001 - containers: - - name: a - securityContext: - runAsUser: 1009 - volumeMounts: - - mountPath: "/example/hostpath/a" - name: empty-vol - - name: b - securityContext: - runAsUser: 1010 - volumeMounts: - - mountPath: "/example/hostpath/b" - name: empty-vol - volumes: - - name: empty-vol -``` - -When the Kubelet runs this pod, the `empty-vol` volume will have ownership root:1001 and permissions -`0770`. It will be usable from both containers a and b. - -#### HostPath - -For a volume that uses a `hostPath` volume with containers running as different UIDs: - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: test-pod -spec: - securityContext: - fsGroup: 1001 - containers: - - name: a - securityContext: - runAsUser: 1009 - volumeMounts: - - mountPath: "/example/hostpath/a" - name: host-vol - - name: b - securityContext: - runAsUser: 1010 - volumeMounts: - - mountPath: "/example/hostpath/b" - name: host-vol - volumes: - - name: host-vol - hostPath: - path: "/tmp/example-pod" -``` - -The cluster operator would need to manually `chgrp` and `chmod` the `/tmp/example-pod` on the host -in order for the volume to be usable from the pod. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/testing/OWNERS b/contributors/design-proposals/testing/OWNERS deleted file mode 100644 index 541bac08..00000000 --- a/contributors/design-proposals/testing/OWNERS +++ /dev/null @@ -1,8 +0,0 @@ -# See the OWNERS docs at https://go.k8s.io/owners - -reviewers: - - sig-testing-leads -approvers: - - sig-testing-leads -labels: - - sig/testing diff --git a/contributors/design-proposals/testing/flakiness-sla.md b/contributors/design-proposals/testing/flakiness-sla.md index efed5e96..f0fbec72 100644 --- a/contributors/design-proposals/testing/flakiness-sla.md +++ b/contributors/design-proposals/testing/flakiness-sla.md @@ -1,109 +1,6 @@ -# Kubernetes Testing Flakiness SLA +Design proposals have been archived. -This document captures the expectations of the community about flakiness in -our tests and our test infrastructure. It sets out an SLA (Service Level -Agreement) for flakiness in our tests, as well as actions that we will -take when we are out of SLA. +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -## Definition of "We" -Throughout the document the term _we_ is used. This is intended to refer -to the Kubernetes project as a whole, and any governance structures the -project puts in place. It is not intended to refer to any specific group -of individuals. - -## Definition of a "Flake" - -We'll start by the definition of a _flake_. _Flakiness_ is defined for a -complete run of one required job in the pre-submit testing infrastructure -(e.g. pull-kubernetes-e2e-gke-gci) for a given pull request (PR). We will -not measure flakiness SLA for e2e jobs that are not required for code merge. - -A pre-submit job's test result is considered to -be a flake according to two criteria: - -1) it both fails and passes without any changes in the code -or environment being tested. - -2) the PR in question doesn't cause the flake itself. - -### Measuring flakiness -There are a number of challenges in monitoring flakiness. We expect that -the metric will be heuristic in nature and we will iterate on it over time. -Identifying all of the potential problems and ways to measure the metric -are out of the scope of the document, but some currently known challenges -are listed at the end of the document. - -## Flakiness SLA -We will measure flakiness based on pull requests that are run through pre-submit -PR testing. The metric that we will monitor is: Pre-submit flakes per PR. This metric -will be calculated on a daily and weekly basis using: - -Sum(Flakes not caused by the PR) / Total(PRs) - -Our current SLA is that this metric will be less than 0.01 -(1% of all PRs have flakes that are not caused by the PR itself). - -## Activities on SLA violation -When the project is out of SLA for flakiness on either the daily or weekly metric -we will determine an appropriate actions to take to bring the project back -within SLA. For now these specific actions are to be determined. - -## Non-goals: Flakiness at HEAD -We will consider flakiness for PRs only, we will _not_ currently -measure flakiness for continuous e2e that are run at HEAD independent of PRs. - -There a few reasons for this: - * The volume of testing on PRs is significantly higher than at HEAD, so flakes are more readily apparent. - * The flakiness of these tests is already being measured in other places. 
- * The goal of this proposal is to improve contributor experience, which is - governed by PR behavior, rather than by the comprehensive e2e suite, which is - intended to ensure project stability. - -Going forward, if the e2e suite at HEAD is showing increased instability, we -may choose to update our flakiness SLA. - -## Infrastructure, enforcement and success metrics - -### Monitoring and enforcement infrastructure -SIG-Testing is currently responsible for the submit-queue infrastructure -and will be responsible for designing, implementing and deploying the -relevant monitoring and enforcement mechanisms for this proposal. - -### Assessing success -Ultimately, the goal of this effort is to decrease flaky tests and -improve the contributor experience. To that end, SIG-contributor-experience -will assess whether this proposal is successful, and will refine -or eliminate it as new evidence is obtained. - -Success for this proposal will be measured in terms of the overall flakiness -of tests, the number of times the SLA is violated in a given time period, -and the number of PRs needed to fix test flakes. If this proposal is successful, -all of these metrics should decrease over time. - -## Approaches to measuring the flakiness metric - -Currently, measuring this metric is tricky, since distinguishing flakes -caused by PRs from pre-existing flakes is challenging. We will use a variety -of heuristics to determine this, including looking at PRs that contain changes -that could not cause flakes (e.g. typos, docs), as well as looking at the -past history of test failures (e.g. if a test fails across many different PRs, -it is likely a flake). - -### Detecting changes in the environment (e.g. GCE, Azure, AWS, etc.) -Changes in the environment are notably hard to measure or control for, but we'll do our best. - -### Detecting changes in code -Changes in code are easy to control for. The GitHub API can tell us whether the commit SHA -has changed between different runs of the test. - -### Detecting whether a flake is caused by the PR or pre-existing -Measuring whether a flake is caused by a PR or by existing problems is a challenge, but we will use -observation across multiple PRs to judge the probability that it is an existing problem or caused by the PR. -If a test flakes across multiple PRs, the test itself is likely at fault. If it flakes only in a single PR, that PR is -likely the cause. - -### Using retest requests as a signal -When a user requests a re-test of a PR, it is a signal that the user believes the test to be flaky. -We will treat this as a strong indication of flakiness: if a user requests a retest of the suite -and the suite then passes, that is a strong indication that a flaky test is involved. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file diff --git a/contributors/design-proposals/vault-based-kms-class-diagram.png b/contributors/design-proposals/vault-based-kms-class-diagram.png Binary files differdeleted file mode 100644 index 13d57884..00000000 --- a/contributors/design-proposals/vault-based-kms-class-diagram.png +++ /dev/null diff --git a/contributors/design-proposals/vault-based-kms-provider.md b/contributors/design-proposals/vault-based-kms-provider.md index 4f97ef5a..f0fbec72 100644 --- a/contributors/design-proposals/vault-based-kms-provider.md +++ b/contributors/design-proposals/vault-based-kms-provider.md @@ -1,275 +1,6 @@ -# Vault based KMS provider for envelope encryption of secrets in etcd3 +Design proposals have been archived. -## Abstract +To view the last version of this document, see the [Design Proposals Archive Repo](https://github.com/kubernetes/design-proposals-archive/). -Kubernetes, starting with the release 1.7, adds Alpha support ( via PRs -[41939](https://github.com/kubernetes/kubernetes/pull/41939) and -[46460](https://github.com/kubernetes/kubernetes/pull/46460)) to encrypt secrets -and resources in etcd3 via a configured Provider. This release supports three -providers viz. aesgcm, aescbc, secretbox. These providers store the encryption -key(s) locally in a server configuration file. The provider encrypts and -decrypts secrets in-process. Building upon these, a KMS provider framework with -an option to support different KMS providers like google cloud KMS is being -added via PRs [48574](https://github.com/kubernetes/kubernetes/pull/48575) and -[49350](https://github.com/kubernetes/kubernetes/pull/49350). The new KMS -provider framework uses an envelope encryption scheme. - -This proposal adopts the KMS provider framework and adds a new KMS provider that -uses Hashicorp Vault with a transit backend, to encrypt and decrypt the DEK -stored in encrypted form in etcd3 along with encrypted secrets. - - -Vault is widely used for Data encryption and securely storing secrets. -Externalizing encryption/decryption of kubernetes secrets to vault provides -various benefits - -* Choice of industry standard encryption algorithms and strengths without having -to implement specific providers for each (in K8S). -* Reduced risk of encryption key compromise. - * encryption key is stored and managed in Vault. - * encryption key does not need to leave the Vault. -* Vault provides ability to define access control suitable for a wide range of deployment scenarios and security needs. -* Vault provides In-built auditing of vault API calls. -* Ability for a customer already using Vault to leverage the instance to also -secure keys used to encrypt secrets managed within a Kubernetes cluster -* Separation of Kubernetes cluster management responsibilities from encryption key -management and administration allowing an organization to better leverage -competencies and skills within the DevOps teams. - -Note, that the Vault Provider in this proposal - -1. **requires** Vault transit backend. -2. supports a wide range of authentication backends supported by vault (see below -for exact list). -3. does not depend on specific storage backend or any other specific configuration. - -This proposal assumes familiarity with Vault and the transit back-end. - -## High level design -As with existing providers, the Vault based provider will implement the -interface ``envelope.Service``. 
Based on value of *name* in the KMS provider -configuration, the ``EnvelopeTransformer`` module will use an instance of the -Vault provider for decryption and encryption of DEK before storing and after -reading from the storage. - -The KEK will be stored and managed in Vault backend. The Vault based provider -configured in KMS Transformer configuration will make REST requests to encrypt -and decrypt DEKs over a secure channel (must enable TLS). KMS Transformer will -store the DEKs in etcd in encrypted form along with encrypted secrets. As with -existing providers, encrypted DEKs will be stored with metadata used to identify -the provider and KEK to be used for decryption. - -The provider will support following authentication back-ends - -* Vault token based, -* TLS cert based, -* Vault AppRole based. - -Deployers can choose an authentication mechanism best suited to their -requirements. -The provider will work with vault REST APIs and will not require Vault to be -configured or deployed in any specific way other than requiring a Transit -Backend. - -### Diagram illustrating interfaces and implementations - - - -### Pseudocode -#### Prefix Metadata -Every encrypted secret will have the following metadata prefixed. -```json -k8s:enc:kms:<api-version>:vault:len(<KEK-key-name>:<KEK-key-version>:<DEK -encrypted with KEK>):<KEK-key-name>:<KEK-key-version>:<DEK encrypted with KEK> -``` - -* ``<api-version>`` represents api version in the providers configuration file. -* ``vault`` represents the KMS service *kind* value. It is a fixed value for Vault -based provider. -* ``KEK-key-name`` is determined from the vault service configuration in providers -configuration file -* ``KEK-key-version`` is an internal identifier used by vault to identify specific -key version used to encrypt and decrypt. Vault sends ``kek-key-version`` -prefixed with encrypted data in the response to an encrypt request. The -``kek-key-version`` will be stored as part of prefix and returned back to Vault -during a decrypt request. - -Of the above metadata, - -* ``EnvelopeTransformer`` will add -``k8s:enc:kms:<api-version>:vault:len(<KEK-key-name>:<KEK-key-version>:<DEK -encrypted with KEK>)`` -* while the ``vaultEnvelopeService`` will add -``<KEK-key-name>:<KEK-key-version>:<DEK encrypted with KEK>``. - - -#### For each write of DEK -``EnvelopeTransformer`` will write encrypted DEK along with encrypted secret in -etcd. - -Here's the pseudocode for ``vaultEnvelopeService.encrypt()``, invoked on each -write of DEK. - - KEY_NAME = <first key-name from vault provider config> - PLAIN_DEK = <value of DEK> - ENCRYPTED_DEK_WITH_KEY_VERSION = encrypt(base64(PLAIN_DEK), KEY_NAME) - - // output from vault will have an extra prefix "vault" (other than key version) which will be stripped. - - STORED_DEK = KEY_NAME:<ENCRYPTED_DEK_WITH_KEY_VERSION> - -#### For each read of DEK -``EnvelopeTransformer`` will read encrypted DEK along with encrypted secret from -etcd - -Here's the pseudocode ``vaultEnvelopeService.decrypt()`` invoked on each read of -DEK. 
- - // parse the provider kind, key name and encrypted DEK prefixed with the key version - KEY_NAME = // key-name from the prefix - ENCRYPTED_DEK_WITH_KEY_VERSION = // <key version>:<encrypted DEK> from the stored value - - // add the "vault" prefix to ENCRYPTED_DEK_WITH_KEY_VERSION as required by the vault decrypt API - - base64Encoded = decrypt(vault:ENCRYPTED_DEK_WITH_KEY_VERSION, KEY_NAME) - - PLAIN_DEK = base64.Decode(base64Encoded) - -#### Example - - DEK = "the quick brown fox" - provider kind = "vault" - api version = "v1" - key name = "kube-secret-enc-key" - key version = v1 - ciphertext returned from vault = vault:v1:aNOTZn0aUDMDbWAQL1E31tH/7zr7oslRjkSpRW0+BPdMfSJntyXZNCAwIbkTtn0= - prefixed DEK used to tag secrets = vault:kube-secret-enc-key:v1:aNOTZn0aUDMDbWAQL1E31tH/7zr7oslRjkSpRW0+BPdMfSJntyXZNCAwIbkTtn0= - -### Configuration - -No new configuration file or startup parameter will be introduced. - -The Vault provider will be specified in the existing configuration file used to -configure any of the encryption providers. The location of this configuration -file is identified by the existing startup parameter: -`--experimental-encryption-provider-config`. - -The Vault provider configuration will be identified by the value "**vault**" for the -``name`` attribute in the ``kms`` provider. - -The actual configuration of the Vault provider will be in a separate -configuration file identified by the ``configfile`` attribute in the KMS provider. - -Here is a sample configuration file with the Vault provider configured: - - kind: EncryptionConfig - apiVersion: v1 - resources: - - resources: - - secrets - providers: - - kms: - name: vault - cachesize: 10 - configfile: /home/myvault/vault-config.yaml - -#### Minimal required Configuration -The Vault based provider needs the following configuration elements, at a -minimum: - -1. ``addr``: the Vault service base endpoint, e.g. https://example.com:8200 -2. ``key-names``: list of names of the keys in Vault to be used, e.g. -kube-secret-enc-key. - -Note: the key name does not need to be changed if the key is rotated in Vault; the -rotated key is identified by the key version, which is prefixed to the ciphertext. - -A new key can be added to the list. Encryption will be done using the first key -in the list. Decryption can happen using any of the keys in the list, based on -the prefix to the encrypted DEK stored in etcd. - -#### Authentication Configuration -##### Vault Server Authentication - -For the Kubernetes cluster to authenticate the Vault server, TLS must be enabled: -1. ``ca-cert``: location of the x509 certificate used to authenticate the Vault server, e.g. -``/var/run/kubernetes/ssl/vault.crt`` - -##### Client Authentication Choices - -For client authentication, one of the following **must** be used (the provider will -reject the configuration if parameters for more than one authentication backend -are specified): - -###### X509 based authentication -1. ``client-cert``: location of the x509 certificate used to authenticate the Kubernetes API -server to the Vault server, e.g. ``/var/run/kubernetes/ssl/vault-client-cert.pem`` -2. ``client-key``: location of the x509 private key used to authenticate the Kubernetes API -server to the Vault server, e.g. 
``/var/run/kubernetes/ssl/vault-client-key.pem`` - -Here's a sample ``vault-config.yaml`` configuration with ``client-cert``: -``` - key-names: - - kube-secret-enc-key - addr: https://example.com:8200 - ca-cert: /var/run/kubernetes/ssl/vault.crt - client-cert: /var/run/kubernetes/ssl/vault-client-cert.pem - client-key: /var/run/kubernetes/ssl/vault-client-key.pem -``` - -###### Vault token based authentication -1. ``token``: limited-access Vault token required by the Kubernetes API server to -authenticate itself while making requests to Vault, e.g. -8dad1053-4a4e-f359-2eab-d57968eb277f - -Here's a sample ``vault-config.yaml`` configuration using a Vault token for authenticating -the Kubernetes cluster as a client to Vault: -``` - key-names: - - kube-secret-enc-key - addr: https://example.com:8200 - ca-cert: /var/run/kubernetes/ssl/vault.crt - token: 8dad1053-4a4e-f359-2eab-d57968eb277f -``` - -###### Vault AppRole based authentication -1. ``role-id``: RoleID of the AppRole. -2. ``secret-id``: SecretID, required only if one is associated with the AppRole. - -Here's a sample configuration file using a Vault AppRole for authentication: -``` - key-names: - - kube-secret-enc-key - addr: https://localhost:8200 - ca-cert: /var/run/kubernetes/ssl/vault.crt - role-id: db02de05-fa39-4855-059b-67221c5c2f63 -``` - -## Key Generation and rotation -The KEK is generated in Vault and rotated using a direct API call or the CLI to Vault -itself. The key never leaves Vault. - -Note that when a key is rotated, Vault does not allow choosing a different -encryption algorithm or key size. If a key with a different encryption algorithm or -a different key size is desired, a new key needs to be generated in Vault and the -corresponding key name added to the configuration. Subsequent encryption will -be done using the first key in the list. Decryption can happen using any of the -keys in the list, based on the prefix to the encrypted DEK. - -## Backward compatibility -1. Unencrypted secrets and secrets encrypted using other non-KMS providers will -continue to be readable after adding Vault as a new KMS provider. -2. If a Vault KMS is added as the first provider, the secrets created or modified -thereafter will be encrypted by the Vault provider. - -## Performance -1. The KMS provider framework uses an LRU cache to minimize requests to the KMS for -encryption and decryption of DEKs. -2. Note that there will be a request to the KMS for every cache miss; hence, depending on -the cache size, there will be a performance impact. -3. Response time: - * will depend on the choice of encryption algorithm and strength. - * will depend on specific Vault configurations like the storage backend, -authentication mechanism, token policies, etc. +Please remove after 2022-04-01 or the release of Kubernetes 1.24, whichever comes first.
\ No newline at end of file
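For readers of the archived proposal above: the prefix layout and the ``vaultEnvelopeService.encrypt()``/``decrypt()`` pseudocode map naturally onto Vault's transit REST API. Below is a minimal, illustrative Go sketch of that flow, assuming the standard transit ``encrypt``/``decrypt`` endpoints and token authentication. The type, field, and method names here are stand-ins, not the code proposed for or merged into Kubernetes.

```go
// Illustrative sketch only; not the Kubernetes implementation.
package vaultkms

import (
	"bytes"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// vaultEnvelopeService is a stand-in for the provider described in the proposal.
type vaultEnvelopeService struct {
	addr    string // Vault base endpoint, e.g. https://example.com:8200
	token   string // Vault token (one of the supported auth backends)
	keyName string // first entry of key-names in vault-config.yaml
	client  *http.Client
}

// transitCall POSTs to the Vault transit backend and returns one field from
// the "data" object of the JSON response.
func (s *vaultEnvelopeService) transitCall(op, keyName string, body map[string]string, field string) (string, error) {
	payload, err := json.Marshal(body)
	if err != nil {
		return "", err
	}
	req, err := http.NewRequest("POST", s.addr+"/v1/transit/"+op+"/"+keyName, bytes.NewReader(payload))
	if err != nil {
		return "", err
	}
	req.Header.Set("X-Vault-Token", s.token)
	resp, err := s.client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("vault returned %s", resp.Status)
	}
	var out struct {
		Data map[string]string `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.Data[field], nil
}

// Encrypt mirrors vaultEnvelopeService.encrypt(): base64-encode the DEK, let
// the transit backend wrap it, strip the leading "vault:" marker, and return
// <key-name>:<key-version>:<ciphertext> for storage alongside the secret.
func (s *vaultEnvelopeService) Encrypt(plainDEK []byte) (string, error) {
	ciphertext, err := s.transitCall("encrypt", s.keyName,
		map[string]string{"plaintext": base64.StdEncoding.EncodeToString(plainDEK)}, "ciphertext")
	if err != nil {
		return "", err
	}
	// Vault returns e.g. "vault:v1:...."; keep only "v1:....".
	return s.keyName + ":" + strings.TrimPrefix(ciphertext, "vault:"), nil
}

// Decrypt mirrors vaultEnvelopeService.decrypt(): split off the key name from
// the stored prefix, re-add the "vault:" marker expected by the transit
// decrypt API, and base64-decode the returned plaintext.
func (s *vaultEnvelopeService) Decrypt(storedDEK string) ([]byte, error) {
	parts := strings.SplitN(storedDEK, ":", 2) // <key-name> and <key-version>:<ciphertext>
	if len(parts) != 2 {
		return nil, fmt.Errorf("malformed stored DEK %q", storedDEK)
	}
	plaintext, err := s.transitCall("decrypt", parts[0],
		map[string]string{"ciphertext": "vault:" + parts[1]}, "plaintext")
	if err != nil {
		return nil, err
	}
	return base64.StdEncoding.DecodeString(plaintext)
}
```

Note that decryption uses the key name parsed from the stored prefix rather than the configured default, which is what lets secrets wrapped by any key in the ``key-names`` list remain readable after a new key is put first in the list.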

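The configuration rules in the proposal (``addr`` and at least one entry in ``key-names`` required, and exactly one of the token, TLS cert, or AppRole client auth backends) can likewise be illustrated with a small loader. This is a hedged sketch under assumptions: the struct, the ``loadVaultConfig`` function, and the use of ``gopkg.in/yaml.v2`` are illustrative choices; only the YAML keys come from the document.

```go
// Illustrative sketch of parsing vault-config.yaml; not the proposed implementation.
package vaultkms

import (
	"fmt"
	"os"

	"gopkg.in/yaml.v2"
)

// vaultConfig mirrors the keys documented for vault-config.yaml.
type vaultConfig struct {
	Addr       string   `yaml:"addr"`
	KeyNames   []string `yaml:"key-names"`
	CACert     string   `yaml:"ca-cert"`
	ClientCert string   `yaml:"client-cert"`
	ClientKey  string   `yaml:"client-key"`
	Token      string   `yaml:"token"`
	RoleID     string   `yaml:"role-id"`
	SecretID   string   `yaml:"secret-id"`
}

// loadVaultConfig reads the file referenced by the kms provider's configfile
// attribute and rejects configurations that mix client auth backends.
func loadVaultConfig(path string) (*vaultConfig, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var cfg vaultConfig
	if err := yaml.Unmarshal(raw, &cfg); err != nil {
		return nil, err
	}
	if cfg.Addr == "" || len(cfg.KeyNames) == 0 {
		return nil, fmt.Errorf("addr and at least one key name are required")
	}
	// Count the configured client auth backends: TLS cert, token, AppRole.
	backends := 0
	if cfg.ClientCert != "" || cfg.ClientKey != "" {
		backends++
	}
	if cfg.Token != "" {
		backends++
	}
	if cfg.RoleID != "" || cfg.SecretID != "" {
		backends++
	}
	if backends != 1 {
		return nil, fmt.Errorf("exactly one client authentication backend must be configured, got %d", backends)
	}
	return &cfg, nil
}
```

Encryption would then use the first entry of ``KeyNames``, while decryption may use any listed key, matching the rotation behaviour described in the proposal.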