| field | value | date |
|---|---|---|
| author | Stephen Augustus <foo@agst.us> | 2018-12-01 02:40:42 -0500 |
| committer | Stephen Augustus <foo@agst.us> | 2018-12-01 02:40:42 -0500 |
| commit | 1004e56177eb12d85b6e0f6cf1ccd00431f7336b (patch) | |
| tree | e2a87f95b32e046ed32a2eea6cde661704e61fbd | |
| parent | 973b19523840d207ae206175ac2093d3b564668c (diff) | |
Add KEP tombstones
Signed-off-by: Stephen Augustus <foo@agst.us>
66 files changed, 264 insertions, 13615 deletions
diff --git a/keps/0000-kep-template.md b/keps/0000-kep-template.md index d8a65189..cfd1f5fa 100644 --- a/keps/0000-kep-template.md +++ b/keps/0000-kep-template.md @@ -1,174 +1,4 @@ ---- -kep-number: 0 -title: My First KEP -authors: - - "@janedoe" -owning-sig: sig-xxx -participating-sigs: - - sig-aaa - - sig-bbb -reviewers: - - TBD - - "@alicedoe" -approvers: - - TBD - - "@oscardoe" -editor: TBD -creation-date: yyyy-mm-dd -last-updated: yyyy-mm-dd -status: provisional -see-also: - - KEP-1 - - KEP-2 -replaces: - - KEP-3 -superseded-by: - - KEP-100 ---- - -# Title - -This is the title of the KEP. -Keep it simple and descriptive. -A good title can help communicate what the KEP is and should be considered as part of any review. - -The *filename* for the KEP should include the KEP number along with the title. -The title should be lowercased and spaces/punctuation should be replaced with `-`. -As the KEP is approved and an official KEP number is allocated, the file should be renamed. - -To get started with this template: -1. **Pick a hosting SIG.** - Make sure that the problem space is something the SIG is interested in taking up. - KEPs should not be checked in without a sponsoring SIG. -1. **Allocate a KEP number.** - Do this by (a) taking the next number in the `NEXT_KEP_NUMBER` file and (b) incrementing that number. - Include the updated `NEXT_KEP_NUMBER` file in your PR. -1. **Make a copy of this template.** - Name it `NNNN-YYYYMMDD-my-title.md` where `NNNN` is the KEP number that was allocated. -1. **Fill out the "overview" sections.** - This includes the Summary and Motivation sections. - These should be easy if you've preflighted the idea of the KEP with the appropriate SIG. -1. **Create a PR.** - Assign it to folks in the SIG that are sponsoring this process. -1. **Merge early.** - Avoid getting hung up on specific details and instead aim to get the goal of the KEP merged quickly. - The best way to do this is to just start with the "Overview" sections and fill out details incrementally in follow on PRs. - View anything marked as a `provisional` as a working document and subject to change. - Aim for single topic PRs to keep discussions focused. - If you disagree with what is already in a document, open a new PR with suggested changes. - -The canonical place for the latest set of instructions (and the likely source of this file) is [here](/keps/0000-kep-template.md). - -The `Metadata` section above is intended to support the creation of tooling around the KEP process. -This will be a YAML section that is fenced as a code block. -See the KEP process for details on each of these items. - -## Table of Contents - -A table of contents is helpful for quickly jumping to sections of a KEP and for highlighting any additional information provided beyond the standard KEP template. -[Tools for generating][] a table of contents from markdown are available. 
- -* [Table of Contents](#table-of-contents) -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) -* [Proposal](#proposal) - * [User Stories [optional]](#user-stories-optional) - * [Story 1](#story-1) - * [Story 2](#story-2) - * [Implementation Details/Notes/Constraints [optional]](#implementation-detailsnotesconstraints-optional) - * [Risks and Mitigations](#risks-and-mitigations) -* [Graduation Criteria](#graduation-criteria) -* [Implementation History](#implementation-history) -* [Drawbacks [optional]](#drawbacks-optional) -* [Alternatives [optional]](#alternatives-optional) - -[Tools for generating]: https://github.com/ekalinin/github-markdown-toc - -## Summary - -The `Summary` section is incredibly important for producing high quality user focused documentation such as release notes or a development road map. -It should be possible to collect this information before implementation begins in order to avoid requiring implementors to split their attention between writing release notes and implementing the feature itself. -KEP editors, SIG Docs, and SIG PM should help to ensure that the tone and content of the `Summary` section is useful for a wide audience. - -A good summary is probably at least a paragraph in length. - -## Motivation - -This section is for explicitly listing the motivation, goals and non-goals of this KEP. -Describe why the change is important and the benefits to users. -The motivation section can optionally provide links to [experience reports][] to demonstrate the interest in a KEP within the wider Kubernetes community. - -[experience reports]: https://github.com/golang/go/wiki/ExperienceReports - -### Goals - -List the specific goals of the KEP. -How will we know that this has succeeded? - -### Non-Goals - -What is out of scope for his KEP? -Listing non-goals helps to focus discussion and make progress. - -## Proposal - -This is where we get down to the nitty gritty of what the proposal actually is. - -### User Stories [optional] - -Detail the things that people will be able to do if this KEP is implemented. -Include as much detail as possible so that people can understand the "how" of the system. -The goal here is to make this feel real for users without getting bogged down. - -#### Story 1 - -#### Story 2 - -### Implementation Details/Notes/Constraints [optional] - -What are the caveats to the implementation? -What are some important details that didn't come across above. -Go in to as much detail as necessary here. -This might be a good place to talk about core concepts and how they releate. - -### Risks and Mitigations - -What are the risks of this proposal and how do we mitigate. -Think broadly. -For example, consider both security and how this will impact the larger kubernetes ecosystem. - -## Graduation Criteria - -How will we know that this has succeeded? -Gathering user feedback is crucial for building high quality experiences and SIGs have the important responsibility of setting milestones for stability and completeness. -Hopefully the content previously contained in [umbrella issues][] will be tracked in the `Graduation Criteria` section. - -[umbrella issues]: https://github.com/kubernetes/kubernetes/issues/42752 - -## Implementation History - -Major milestones in the life cycle of a KEP should be tracked in `Implementation History`. 
-Major milestones might include - -- the `Summary` and `Motivation` sections being merged signaling SIG acceptance -- the `Proposal` section being merged signaling agreement on a proposed design -- the date implementation started -- the first Kubernetes release where an initial version of the KEP was available -- the version of Kubernetes where the KEP graduated to general availability -- when the KEP was retired or superseded - -## Drawbacks [optional] - -Why should this KEP _not_ be implemented. - -## Alternatives [optional] - -Similar to the `Drawbacks` section the `Alternatives` section is used to highlight and record other possible approaches to delivering the value proposed by a KEP. - -## Infrastructure Needed [optional] - -Use this section if you need things from the project/SIG. -Examples include a new subproject, repos requested, github details. -Listing these here allows a SIG to get the process for these resources started right away.
\ No newline at end of file +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/0001-kubernetes-enhancement-proposal-process.md b/keps/0001-kubernetes-enhancement-proposal-process.md index 9dc65553..cfd1f5fa 100644 --- a/keps/0001-kubernetes-enhancement-proposal-process.md +++ b/keps/0001-kubernetes-enhancement-proposal-process.md @@ -1,362 +1,4 @@ ---- -kep-number: 1 -title: Kubernetes Enhancement Proposal Process -authors: - - "@calebamiles" - - "@jbeda" -owning-sig: sig-architecture -participating-sigs: - - kubernetes-wide -reviewers: - - name: "@timothysc" -approvers: - - name: "@bgrant0607" -editor: - name: "@jbeda" -creation-date: 2017-08-22 -status: implementable ---- - -# Kubernetes Enhancement Proposal Process - -## Table of Contents - -* [Kubernetes Enhancement Proposal Process](#kubernetes-enhancement-proposal-process) - * [Metadata](#metadata) - * [Table of Contents](#table-of-contents) - * [Summary](#summary) - * [Motivation](#motivation) - * [Reference-level explanation](#reference-level-explanation) - * [What type of work should be tracked by a KEP](#what-type-of-work-should-be-tracked-by-a-kep) - * [KEP Template](#kep-template) - * [KEP Metadata](#kep-metadata) - * [KEP Workflow](#kep-workflow) - * [Git and GitHub Implementation](#git-and-github-implementation) - * [KEP Editor Role](#kep-editor-role) - * [Important Metrics](#important-metrics) - * [Prior Art](#prior-art) - * [Graduation Criteria](#graduation-criteria) - * [Drawbacks](#drawbacks) - * [Alternatives](#alternatives) - * [Unresolved Questions](#unresolved-questions) - * [Mentors](#mentors) - -## Summary - -A standardized development process for Kubernetes is proposed in order to - -- provide a common structure for proposing changes to Kubernetes -- ensure that the motivation for a change is clear -- allow for the enumeration stability milestones and stability graduation - criteria -- persist project information in a Version Control System (VCS) for future - Kubernauts -- support the creation of _high value user facing_ information such as: - - an overall project development roadmap - - motivation for impactful user facing changes -- reserve GitHub issues for tracking work in flight rather than creating "umbrella" - issues -- ensure community participants are successfully able to drive changes to - completion across one or more releases while stakeholders are adequately - represented throughout the process - -This process is supported by a unit of work called a Kubernetes Enhancement Proposal or KEP. -A KEP attempts to combine aspects of a - -- feature, and effort tracking document -- a product requirements document -- design document - -into one file which is created incrementally in collaboration with one or more -Special Interest Groups (SIGs). - -## Motivation - -For cross project SIGs such as SIG PM and SIG Release an abstraction beyond a -single GitHub Issue or Pull request seems to be required in order to understand -and communicate upcoming changes to Kubernetes. In a blog post describing the -[road to Go 2][], Russ Cox explains - -> that it is difficult but essential to describe the significance of a problem -> in a way that someone working in a different environment can understand - -as a project it is vital to be able to track the chain of custody for a proposed -enhancement from conception through implementation. - -Without a standardized mechanism for describing important enhancements our -talented technical writers and product managers struggle to weave a coherent -narrative explaining why a particular release is important. 
Additionally for -critical infrastructure such as Kubernetes adopters need a forward looking road -map in order to plan their adoption strategy. - -The purpose of the KEP process is to reduce the amount of "tribal knowledge" in -our community. By moving decisions from a smattering of mailing lists, video -calls and hallway conversations into a well tracked artifact this process aims -to enhance communication and discoverability. - -A KEP is broken into sections which can be merged into source control -incrementally in order to support an iterative development process. An important -goal of the KEP process is ensuring that the process for submitting the content -contained in [design proposals][] is both clear and efficient. The KEP process -is intended to create high quality uniform design and implementation documents -for SIGs to deliberate. - -[road to Go 2]: https://blog.golang.org/toward-go2 -[design proposals]: /contributors/design-proposals - - -## Reference-level explanation - -### What type of work should be tracked by a KEP - -The definition of what constitutes an "enhancement" is a foundational concern -for the Kubernetes project. Roughly any Kubernetes user or operator facing -enhancement should follow the KEP process: if an enhancement would be described -in either written or verbal communication to anyone besides the KEP author or -developer then consider creating a KEP. - -Similarly, any technical effort (refactoring, major architectural change) that -will impact a large section of the development community should also be -communicated widely. The KEP process is suited for this even if it will have -zero impact on the typical user or operator. - -As the local bodies of governance, SIGs should have broad latitude in describing -what constitutes an enhancement which should be tracked through the KEP process. -SIGs may find that helpful to enumerate what _does not_ require a KEP rather -than what does. SIGs also have the freedom to customize the KEP template -according to their SIG specific concerns. For example the KEP template used to -track API changes will likely have different subsections than the template for -proposing governance changes. However, as changes start impacting other SIGs or -the larger developer community outside of a SIG, the KEP process should be used -to coordinate and communicate. - -Enhancements that have major impacts on multiple SIGs should use the KEP process. -A single SIG will own the KEP but it is expected that the set of approvers will span the impacted SIGs. -The KEP process is the way that SIGs can negotiate and communicate changes that cross boundaries. - -KEPs will also be used to drive large changes that will cut across all parts of the project. -These KEPs will be owned by SIG-architecture and should be seen as a way to communicate the most fundamental aspects of what Kubernetes is. - -### KEP Template - -The template for a KEP is precisely defined [here](0000-kep-template.md) - -### KEP Metadata - -There is a place in each KEP for a YAML document that has standard metadata. -This will be used to support tooling around filtering and display. It is also -critical to clearly communicate the status of a KEP. - -Metadata items: -* **kep-number** Required - * Each proposal has a number. This is to make all references to proposals as - clear as possible. This is especially important as we create a network - cross references between proposals. - * Before having the `Approved` status, the number for the KEP will be in the - form of `draft-YYYYMMDD`. 
The `YYYYMMDD` is replaced with the current date - when first creating the KEP. The goal is to enable fast parallel merges of - pre-acceptance KEPs. - * On acceptance a sequential dense number will be assigned. This will be done - by the editor and will be done in such a way as to minimize the chances of - conflicts. The final number for a KEP will have no prefix. -* **title** Required - * The title of the KEP in plain language. The title will also be used in the - KEP filename. See the template for instructions and details. -* **status** Required - * The current state of the KEP. - * Must be one of `provisional`, `implementable`, `implemented`, `deferred`, `rejected`, `withdrawn`, or `replaced`. -* **authors** Required - * A list of authors for the KEP. - This is simply the github ID. - In the future we may enhance this to include other types of identification. -* **owning-sig** Required - * The SIG that is most closely associated with this KEP. If there is code or - other artifacts that will result from this KEP, then it is expected that - this SIG will take responsibility for the bulk of those artifacts. - * Sigs are listed as `sig-abc-def` where the name matches up with the - directory in the `kubernetes/community` repo. -* **participating-sigs** Optional - * A list of SIGs that are involved or impacted by this KEP. - * A special value of `kubernetes-wide` will indicate that this KEP has impact - across the entire project. -* **reviewers** Required - * Reviewer(s) chosen after triage according to proposal process - * If not yet chosen replace with `TBD` - * Same name/contact scheme as `authors` - * Reviewers should be a distinct set from authors. -* **approvers** Required - * Approver(s) chosen after triage according to proposal process - * Approver(s) are drawn from the impacted SIGs. - It is up to the individual SIGs to determine how they pick approvers for KEPs impacting them. - The approvers are speaking for the SIG in the process of approving this KEP. - The SIGs in question can modify this list as necessary. - * The approvers are the individuals that make the call to move this KEP to the `approved` state. - * Approvers should be a distinct set from authors. - * If not yet chosen replace with `TBD` - * Same name/contact scheme as `authors` -* **editor** Required - * Someone to keep things moving forward. - * If not yet chosen replace with `TBD` - * Same name/contact scheme as `authors` -* **creation-date** Required - * The date that the KEP was first submitted in a PR. - * In the form `yyyy-mm-dd` - * While this info will also be in source control, it is helpful to have the set of KEP files stand on their own. -* **last-updated** Optional - * The date that the KEP was last changed significantly. - * In the form `yyyy-mm-dd` -* **see-also** Optional - * A list of other KEPs that are relevant to this KEP. - * In the form `KEP-123` -* **replaces** Optional - * A list of KEPs that this KEP replaces. Those KEPs should list this KEP in - their `superseded-by`. - * In the form `KEP-123` -* **superseded-by** - * A list of KEPs that supersede this KEP. Use of this should be paired with - this KEP moving into the `Replaced` status. - * In the form `KEP-123` - - -### KEP Workflow - -A KEP has the following states - -- `provisional`: The KEP has been proposed and is actively being defined. - This is the starting state while the KEP is being fleshed out and actively defined and discussed. - The owning SIG has accepted that this is work that needs to be done. 
-- `implementable`: The approvers have approved this KEP for implementation. -- `implemented`: The KEP has been implemented and is no longer actively changed. -- `deferred`: The KEP is proposed but not actively being worked on. -- `rejected`: The approvers and authors have decided that this KEP is not moving forward. - The KEP is kept around as a historical document. -- `withdrawn`: The KEP has been withdrawn by the authors. -- `replaced`: The KEP has been replaced by a new KEP. - The `superseded-by` metadata value should point to the new KEP. - -### Git and GitHub Implementation - -KEPs are checked into the community repo under the `/kep` directory. -In the future, as needed we can add SIG specific subdirectories. -KEPs in SIG specific subdirectories have limited impact outside of the SIG and can leverage SIG specific OWNERS files. - -New KEPs can be checked in with a file name in the form of `draft-YYYYMMDD-my-title.md`. -As significant work is done on the KEP the authors can assign a KEP number. -This is done by taking the next number in the NEXT_KEP_NUMBER file, incrementing that number, and renaming the KEP. -No other changes should be put in that PR so that it can be approved quickly and minimize merge conflicts. -The KEP number can also be done as part of the initial submission if the PR is likely to be uncontested and merged quickly. - -### KEP Editor Role - -Taking a cue from the [Python PEP process][], we define the role of a KEP editor. -The job of an KEP editor is likely very similar to the [PEP editor responsibilities][] and will hopefully provide another opportunity for people who do not write code daily to contribute to Kubernetes. - -In keeping with the PEP editors which - -> Read the PEP to check if it is ready: sound and complete. The ideas must make -> technical sense, even if they don't seem likely to be accepted. -> The title should accurately describe the content. -> Edit the PEP for language (spelling, grammar, sentence structure, etc.), markup -> (for reST PEPs), code style (examples should match PEP 8 & 7). - -KEP editors should generally not pass judgement on a KEP beyond editorial corrections. -KEP editors can also help inform authors about the process and otherwise help things move smoothly. - -[Python PEP process]: https://www.python.org/dev/peps/pep-0001/ -[PEP editor responsibilities]: https://www.python.org/dev/peps/pep-0001/#pep-editor-responsibilities-workflow - -### Important Metrics - -It is proposed that the primary metrics which would signal the success or -failure of the KEP process are - -- how many "enhancements" are tracked with a KEP -- distribution of time a KEP spends in each state -- KEP rejection rate -- PRs referencing a KEP merged per week -- number of issues open which reference a KEP -- number of contributors who authored a KEP -- number of contributors who authored a KEP for the first time -- number of orphaned KEPs -- number of retired KEPs -- number of superseded KEPs - -### Prior Art - -The KEP process as proposed was essentially stolen from the [Rust RFC process][] which -itself seems to be very similar to the [Python PEP process][] - -[Rust RFC process]: https://github.com/rust-lang/rfcs - -## Drawbacks - -Any additional process has the potential to engender resentment within the -community. There is also a risk that the KEP process as designed will not -sufficiently address the scaling challenges we face today. 
PR review bandwidth is -already at a premium and we may find that the KEP process introduces an unreasonable -bottleneck on our development velocity. - -It certainly can be argued that the lack of a dedicated issue/defect tracker -beyond GitHub issues contributes to our challenges in managing a project as large -as Kubernetes, however, given that other large organizations, including GitHub -itself, make effective use of GitHub issues perhaps the argument is overblown. - -The centrality of Git and GitHub within the KEP process also may place too high -a barrier to potential contributors, however, given that both Git and GitHub are -required to contribute code changes to Kubernetes today perhaps it would be reasonable -to invest in providing support to those unfamiliar with this tooling. - -Expanding the proposal template beyond the single sentence description currently -required in the [features issue template][] may be a heavy burden for non native -English speakers and here the role of the KEP editor combined with kindness and -empathy will be crucial to making the process successful. - -[features issue template]: https://git.k8s.io/features/ISSUE_TEMPLATE.md - -## Alternatives - -This KEP process is related to -- the generation of a [architectural roadmap][] -- the fact that the [what constitutes a feature][] is still undefined -- [issue management][] -- the difference between an [accepted design and a proposal][] -- [the organization of design proposals][] - -this proposal attempts to place these concerns within a general framework. - -[architectural roadmap]: https://github.com/kubernetes/community/issues/952 -[what constitutes a feature]: https://github.com/kubernetes/community/issues/531 -[issue management]: https://github.com/kubernetes/community/issues/580 -[accepted design and a proposal]: https://github.com/kubernetes/community/issues/914 -[the organization of design proposals]: https://github.com/kubernetes/community/issues/918 - -### GitHub issues vs. KEPs - -The use of GitHub issues when proposing changes does not provide SIGs good -facilities for signaling approval or rejection of a proposed change to Kubernetes -since anyone can open a GitHub issue at any time. Additionally managing a proposed -change across multiple releases is somewhat cumbersome as labels and milestones -need to be updated for every release that a change spans. These long lived GitHub -issues lead to an ever increasing number of issues open against -`kubernetes/features` which itself has become a management problem. - -In addition to the challenge of managing issues over time, searching for text -within an issue can be challenging. The flat hierarchy of issues can also make -navigation and categorization tricky. While not all community members might -not be comfortable using Git directly, it is imperative that as a community we -work to educate people on a standard set of tools so they can take their -experience to other projects they may decide to work on in the future. While -git is a fantastic version control system (VCS), it is not a project management -tool nor a cogent way of managing an architectural catalog or backlog; this -proposal is limited to motivating the creation of a standardized definition of -work in order to facilitate project management. This primitive for describing -a unit of work may also allow contributors to create their own personalized -view of the state of the project while relying on Git and GitHub for consistency -and durable storage. 
- -## Unresolved Questions - -- How reviewers and approvers are assigned to a KEP -- Example schedule, deadline, and time frame for each stage of a KEP -- Communication/notification mechanisms -- Review meetings and escalation procedure +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/0001a-meta-kep-implementation.md b/keps/0001a-meta-kep-implementation.md index 2a12db54..cfd1f5fa 100644 --- a/keps/0001a-meta-kep-implementation.md +++ b/keps/0001a-meta-kep-implementation.md @@ -1,221 +1,4 @@ ---- -kep-number: 1a -title: Meta KEP Implementation -authors: - - "@justaugustus" - - "@calebamiles" - - "@jdumars" -owning-sig: sig-pm -participating-sigs: - - sig-architecture -reviewers: - - "@erictune" - - "@idvoretskyi" - - "@spiffxp" - - "@timothysc" -approvers: - - "@bgrant0607" - - "@jbeda" -editor: -creation-date: 2018-09-08 -last-updated: 2018-09-29 -status: implementable -see-also: - - KEP-1 -replaces: -superseded-by: ---- - -# Meta KEP Implementation - -## Table of Contents - -- [Table of Contents](#table-of-contents) -- [Summary](#summary) -- [Motivation](#motivation) - - [Why another KEP?](#why-another-kep) - - [Non-Goals](#non-goals) -- [Proposal](#proposal) - - [Implementation Details / Notes / Constraints](#implementation-details--notes--constraints) - - [Define](#define) - - [Organize](#organize) - - [Visibility and Automation](#visibility-and-automation) - - [Constraints](#constraints) -- [Graduation Criteria](#graduation-criteria) -- [Implementation History](#implementation-history) - -## Summary - -Drive KEP adoption through improved process, documentation, visibility, and automation. - -## Motivation - -The KEP process is the standardized structure for proposing changes to the Kubernetes project. - -In order to graduate KEPs to GA, we must iterate over the implementation details. - -This KEP seeks to define actionable / delegable items to move the process forward. - -Finally, by submitting a KEP, we gain an opportunity to dogfood the process and further identify areas for improvement. - -### Why another KEP? - -As [KEP-1] is currently de facto for the project, we must be careful to make changes to it in an iterative and atomic fashion. - -When proposing a KEP, we action over some unit of work, usually some area of code. - -In this instance, we treat [KEP-1] as the unit of work. That said, this would be considered a meta-KEP of the meta-KEP. - -### Non-Goals - -- API Review process -- Feature request triage -- Developer guide - -## Proposal - -### Implementation Details / Notes / Constraints - -#### Define - -- Refine existing KEP documentation -- Define KEP [DACI] -- Glossary of terms (enhancement, KEP, feature, etc.) -- KEP Workflow - - KEP states - - Entry / exit criteria - -#### Organize - -- Move KEPs from flat-files to a directory structure: -``` -├── keps # top-level directory -│ ├── sig-beard # SIG directory -| | ├── 9000-beard-implementation-api # KEP directory -| | | ├── kep.md (required) # KEP (multi-release work) -| | | ├── experience_reports (required) # user feedback -| | | │ ├── alpha-feedback.md -| | | │ └── beta-feedback.md -| | | ├── features # units of work that span approximately one release cycle -| | | │ ├── feature-01.md -| | | │ ├── ... -| | | │ └── feature-n.md -| | | ├── guides -| | | | ├── guide-for-developers.md (required) -| | | | ├── guide-for-teachers.md (required) -| | | | ├── guide-for-operators.md (required) -| | | | └── guide-for-project-maintainers.md -| | | ├── index.json (required) # used for site generation e.g., Hugo -| | | ├── metadata.yaml (required) # used for automation / project tracking -| | └── └── OWNERS -│ ├── sig-foo -| ├── ... 
-| └── sig-n -``` - -metadata.yaml would contain information that was previously in a KEP's YAML front-matter: - -``` ---- -authors: # required - - "calebamiles" # just a GitHub handle for now - - "jbeda" -title: "Kubernetes Enhancement Proposal process" -number: 42 # required -owning-sig: "sig-pm" # required -participating-sigs: - - "sig-architecture" - - "sig-contributor-experience" -approvers: # required - - "bgrant0607" # just a GitHub handle for now -reviewers: - - "justaugustus" # just a GitHub handle for now - - "jdumars" -editors: - - null # generally omit empty/null fields -status: "active" # required -github: - issues: - - null # GitHub url - pull_requests: - - null # GitHub url - projects: - - project_id: null - card_id: null -releases: # required - - k8s_version: v1.9 - kep_status: "active" - k8s_status: "alpha" # one of alpha|beta|GA - - k8s_version: v1.10 - kep_status: "active" - k8s_status: "alpha" -replaces: - - kep_location: null -superseded-by: - - kep_location: null -created: 2018-01-22 # in YYYY-MM-DD -updated: 2018-09-04 -``` - -- Move existing KEPs into [k/features] -- Create a `kind/kep` label for [k/community] and [k/features] - - For `k/community`: - - Label incoming KEPs as `kind/kep` - - Enable searches of `org:kubernetes label:kind/kep`, so we can identify active PRs to `k/community` and reroute the PR authors to `k/enhancements` (depending on the state) - - For `k/enhancements` (fka `k/features`): - - Label incoming KEPs as `kind/kep` - - Classify KEP submissions / tracking issues as `kind/kep`, differentiating them from `kind/feature` -- Move existing design proposals into [k/features] -- Move existing architectural documents into [k/features] (process TBD) -- Deprecate design proposals -- Rename [k/features] to [k/enhancements] -- Create tombstones / redirects to [k/enhancements] -- Prevent new KEPs and design proposals from landing in [k/community] -- Remove `kind/kep` from [k/community] once KEP migration is complete -- Correlate existing Feature tracking issues with links to KEPs -- Fix [KEP numbering races] by using the GitHub issue number of the KEP tracking issue -- Coordination of existing KEPs to use new directory structure (with SIG PM guidance per SIG) - -#### Visibility and Automation - -- Create tooling to: - - Generate KEP directories and associated metadata - - Present KEPs, through some easy to use mechanism e.g., https://enhancements.k8s.io. This would be a redesigned version of https://contributor.kubernetes.io/keps/. We envision this site / repo having at least three directories: - - `keps/` (KEPs) - - `design-proposals/` (historical design proposals from https://git.k8s.io/community/contributors/design-proposals) - - `arch[itecture]|design/` (design principles of Kubernetes, derived from reorganizing https://git.k8s.io/community/contributors/devel, mentioned [here](https://github.com/kubernetes/community/issues/2565#issuecomment-419185591)) - - Enable project tracking across SIGs - -#### Constraints - -- Preserve git history -- Preserve issues -- Preserve PRs - -## Graduation Criteria - -Throughout implementation, we will be reaching out across the project to SIG leadership, approvers, and reviewers to capture feedback. - -While graduation criteria has not strictly been defined at this stage, we will define it in future updates to this KEP. 
- -## Implementation History - -- 2018-08-20: (@timothysc) Issue filed about repo separation: https://github.com/kubernetes/community/issues/2565 -- 2018-08-30: SIG Architecture meeting mentioning the need for a clearer KEP process - https://youtu.be/MMJ-zAR_GbI -- 2018-09-06: SIG Architecture meeting agreeing to move forward with a KEP process improvement effort to be co-led with SIG PM (@justaugustus / @jdumars) - https://youtu.be/fmlXkN4DJy0 -- 2018-09-10: KEP-1a submitted for review -- 2018-09-25: Rationale discussion in SIG PM meeting -- 2018-09-28: Merged as `provisional` -- 2018-09-29: KEP implementation started -- 2018-09-29: [KEP Implementation Tracking issue] created -- 2018-09-29: [KEP Implementation Tracking board] created -- 2018-09-29: Submitted as `implementable` - -[DACI]: https://www.atlassian.com/team-playbook/plays/daci -[KEP-1]: 0001-kubernetes-enhancement-proposal-process.md -[KEP Implementation Tracking board]: https://github.com/orgs/kubernetes/projects/5 -[KEP Implementation Tracking issue]: https://github.com/kubernetes/features/issues/617 -[KEP numbering races]: https://github.com/kubernetes/community/issues/2245 -[k/community]: http://git.k8s.io/community -[k/enhancements]: http://git.k8s.io/enhancements -[k/features]: http://git.k8s.io/features +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/README.md b/keps/README.md index c514dc17..cfd1f5fa 100644 --- a/keps/README.md +++ b/keps/README.md @@ -1,59 +1,4 @@ -# Kubernetes Enhancement Proposals (KEPs) - -A Kubernetes Enhancement Proposal (KEP) is a way to propose, communicate and coordinate on new efforts for the Kubernetes project. -You can read the full details of the project in [KEP-1](0001-kubernetes-enhancement-proposal-process.md). - -This process is still in a _beta_ state and is opt-in for those that want to provide feedback for the process. - -## Quick start for the KEP process - -1. Socialize an idea with a sponsoring SIG. - Make sure that others think the work is worth taking up and will help review the KEP and any code changes required. -2. Follow the process outlined in the [KEP template](0000-kep-template.md) - -## FAQs - -### Do I have to use the KEP process? - -No... but we hope that you will. -Over time having a rich set of KEPs in one place will make it easier for people to track what is going in the community and find a structured historic record. - -KEPs are only required when the changes are wide ranging and impact most of the project. -These changes are usually coordinated through SIG-Architecture. -It is up to any specific SIG if they want to use the KEP process and when. -The process is available to SIGs to use but not required. - -### Why would I want to use the KEP process? - -Our aim with KEPs is to clearly communicate new efforts to the Kubernetes contributor community. -As such, we want to build a well curated set of clear proposals in a common format with useful metadata. - -Benefits to KEP users (in the limit): -* Exposure on a kubernetes blessed web site that is findable via web search engines. -* Cross indexing of KEPs so that users can find connections and the current status of any KEP. -* A clear process with approvers and reviewers for making decisions. - This will lead to more structured decisions that stick as there is a discoverable record around the decisions. - -We are inspired by IETF RFCs, Pyton PEPs and Rust RFCs. -See [KEP-1](0001-kubernetes-enhancement-proposal-process.md) for more details. - -### Do I put my KEP in the root KEP directory or a SIG subdirectory? - -If the KEP is mainly restricted to one SIG's purview then it should be in a KEP directory for that SIG. -If the KEP is widely impacting much of Kubernetes, it should be put at the root of this directory. -If in doubt ask [SIG-Architecture](../sig-architecture/README.md) and they can advise. - -### What will it take for KEPs to "graduate" out of "beta"? - -Things we'd like to see happen to consider KEPs well on their way: -* A set of KEPs that show healthy process around describing an effort and recording decisions in a reasonable amount of time. -* KEPs exposed on a searchable and indexable web site. -* Presubmit checks for KEPs around metadata format and markdown validity. - -Even so, the process can evolve. As we find new techniques we can improve our processes. - -### My FAQ isn't answered here! - -The KEP process is still evolving! -If something is missing or not answered here feel free to reach out to [SIG-Architecture](../sig-architecture/README.md). -If you want to propose a change to the KEP process you can open a PR on [KEP-1](0001-kubernetes-enhancement-proposal-process.md) with your proposal. +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. 
Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-api-machinery/0006-apply.md b/keps/sig-api-machinery/0006-apply.md index 980f89e8..cfd1f5fa 100644 --- a/keps/sig-api-machinery/0006-apply.md +++ b/keps/sig-api-machinery/0006-apply.md @@ -1,171 +1,4 @@ ---- -kep-number: 6 -title: Apply -authors: - - "@lavalamp" -owning-sig: sig-api-machinery -participating-sigs: - - sig-api-machinery - - sig-cli -reviewers: - - "@pwittrock" - - "@erictune" -approvers: - - "@bgrant0607" -editor: TBD -creation-date: 2018-03-28 -last-updated: 2018-03-28 -status: provisional -see-also: - - n/a -replaces: - - n/a -superseded-by: - - n/a ---- - -# Apply - -## Table of Contents - -- [Apply](#apply) - - [Table of Contents](#table-of-contents) - - [Summary](#summary) - - [Motivation](#motivation) - - [Goals](#goals) - - [Non-Goals](#non-goals) - - [Proposal](#proposal) - - [Implementation Details/Notes/Constraints [optional]](#implementation-detailsnotesconstraints-optional) - - [Risks and Mitigations](#risks-and-mitigations) - - [Graduation Criteria](#graduation-criteria) - - [Implementation History](#implementation-history) - - [Drawbacks](#drawbacks) - - [Alternatives](#alternatives) - -## Summary - -`kubectl apply` is a core part of the Kubernetes config workflow, but it is -buggy and hard to fix. This functionality will be regularized and moved to the -control plane. - -## Motivation - -Example problems today: - -* User does POST, then changes something and applies: surprise! -* User does an apply, then `kubectl edit`, then applies again: surprise! -* User does GET, edits locally, then apply: surprise! -* User tweaks some annotations, then applies: surprise! -* Alice applies something, then Bob applies something: surprise! - -Why can't a smaller change fix the problems? Why hasn't it already been fixed? - -* Too many components need to change to deliver a fix -* Organic evolution and lack of systematic approach - * It is hard to make fixes that cohere instead of interfere without a clear model of the feature -* Lack of API support meant client-side implementation - * The client sends a PATCH to the server, which necessitated strategic merge patch--as no patch format conveniently captures the data type that is actually needed. - * Tactical errors: SMP was not easy to version, fixing anything required client and server changes and a 2 release deprecation period. -* The implications of our schema were not understood, leading to bugs. - * e.g., non-positional lists, sets, undiscriminated unions, implicit context - * Complex and confusing defaulting behavior (e.g., Always pull policy from :latest) - * Non-declarative-friendly API behavior (e.g., selector updates) - -### Goals - -"Apply" is intended to allow users and systems to cooperatively determine the -desired state of an object. The resulting system should: - -* Be robust to changes made by other users, systems, defaulters (including mutating admission control webhooks), and object schema evolution. -* Be agnostic about prior steps in a CI/CD system (and not require such a system). -* Have low cognitive burden: - * For integrators: a single API concept supports all object types; integrators - have to learn one thing total, not one thing per operation per api object. - Client side logic should be kept to a minimum; CURL should be sufficient to - use the apply feature. - * For users: looking at a config change, it should be intuitive what the - system will do. The “magic” is easy to understand and invoke. 
- * Error messages should--to the extent possible--tell users why they had a - conflict, not just what the conflict was. - * Error messages should be delivered at the earliest possible point of - intervention. - -Goal: The control plane delivers a comprehensive solution. - -Goal: Apply can be called by non-go languages and non-kubectl clients. (e.g., -via CURL.) - -### Non-Goals - -* Multi-object apply will not be changed: it remains client side for now -* Providing an API for just performing merges (without affecting state in the - cluster) is left as future work. -* Some sources of user confusion will not be addressed: - * Changing the name field makes a new object rather than renaming an existing object - * Changing fields that can’t really be changed (e.g., Service type). - -## Proposal - -Some highlights of things we intend to change: - -* Apply will be moved to the control plane: [overall design](https://goo.gl/UbCRuf). - * It will be invoked by sending a certain Content-Type with the verb PATCH. -* The last-applied annotation will be promoted to a first-class citizen under - metadata. Multiple appliers will be allowed. -* Apply will have user-targeted and controller-targeted variants. -* The Go IDL will be fixed: [design](https://goo.gl/EBGu2V). OpenAPI data models will be fixed. Result: 2-way and - 3-way merges can be implemented correctly. -* 2-way and 3-way merges will be implemented correctly: [design](https://goo.gl/nRZVWL). -* Dry-run will be implemented on control plane verbs (POST and PUT). - * Admission webhooks will have their API appended accordingly. -* The defaulting and conversion stack will be solidified to allow converting - partially specified objects. -* An upgrade path will be implemented so that version skew between kubectl and - the control plane will not have disastrous results. -* Strategic Merge Patch and the existing merge key annotations will be - deprecated. Development on these will stop, but they will not be removed until - the v1 API goes away (i.e., likely 3+ years). - -The linked documents should be read for a more complete picture. - -### Implementation Details/Notes/Constraints [optional] - -What are the caveats to the implementation? -What are some important details that didn't come across above. -Go in to as much detail as necessary here. -This might be a good place to talk about core concepts and how they releate. - -### Risks and Mitigations - -There are many things that will need to change. We are considering using a -feature branch. - -## Graduation Criteria - -This can be promoted to beta when it is a drop-in replacement for the existing -kubectl apply, which has no regressions (which aren't bug fixes). - -This will be promoted to GA once it's gone a sufficient amount of time as beta -with no changes. - -## Implementation History - -Major milestones in the life cycle of a KEP should be tracked in `Implementation History`. -Major milestones might include - -- the `Summary` and `Motivation` sections being merged signaling SIG acceptance -- the `Proposal` section being merged signaling agreement on a proposed design -- the date implementation started -- the first Kubernetes release where an initial version of the KEP was available -- the version of Kubernetes where the KEP graduated to general availability -- when the KEP was retired or superseded - -## Drawbacks - -Why should this KEP _not_ be implemented: many bugs in kubectl apply will go -away. Users might be depending on the bugs. 
- -## Alternatives - -It's our belief that all routes to fixing the user pain involve -centralizing this functionality in the control plane. +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
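For orientation, the removed `0006-apply.md` proposes moving apply to the control plane, "invoked by sending a certain Content-Type with the verb PATCH" and callable from plain HTTP clients such as curl. The sketch below shows what such a request could look like in Go; the endpoint, token, and manifest are placeholders, and the media type shown is the one server-side apply later adopted, not something this KEP itself names.

```golang
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// The applied configuration travels as the PATCH body, much like `kubectl apply` today.
	manifest := []byte("apiVersion: v1\nkind: ConfigMap\nmetadata:\n  name: example\ndata:\n  key: value\n")

	// Hypothetical cluster endpoint for the object being applied.
	url := "https://apiserver.example.com/api/v1/namespaces/default/configmaps/example"

	req, err := http.NewRequest(http.MethodPatch, url, bytes.NewReader(manifest))
	if err != nil {
		panic(err)
	}
	// The KEP only says "a certain Content-Type"; this value is the media type
	// server-side apply eventually settled on, shown purely for illustration.
	req.Header.Set("Content-Type", "application/apply-patch+yaml")
	req.Header.Set("Authorization", "Bearer <token>") // placeholder credential

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```

A dedicated media type lets the apiserver tell apply requests apart from the JSON-patch and strategic-merge-patch requests that share the same PATCH endpoint.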
\ No newline at end of file diff --git a/keps/sig-api-machinery/0015-dry-run.md b/keps/sig-api-machinery/0015-dry-run.md index 22efc5ad..cfd1f5fa 100644 --- a/keps/sig-api-machinery/0015-dry-run.md +++ b/keps/sig-api-machinery/0015-dry-run.md @@ -1,143 +1,4 @@ ---- -kep-number: 15 -title: Dry-run -authors: - - "@apelisse" -owning-sig: sig-api-machinery -participating-sigs: - - sig-api-machinery - - sig-cli -reviewers: - - "@lavalamp" - - "@deads2k" -approvers: - - "@erictune" -editor: apelisse -creation-date: 2018-06-21 -last-updated: 2018-06-21 -status: implementable ---- -# Kubernetes Dry-run - -Dry-run is a new feature that we intend to implement in the api-server. The goal -is to be able to send requests to modifying endpoints, and see if the request -would have succeeded (admission chain, validation, merge conflicts, ...) and/or -what would have happened without having it actually happen. The response body -for the request should be as close as possible to a non dry-run response. - -## Specifying dry-run - -Dry-run is triggered by setting the “dryRun” query parameter on modifying -verbs: POST, PUT, PATCH and DELETE. - -This parameter is a string, working as an enum: -- All: Everything will run as normal, except for the storage that won’t be - modified. Everything else should work as expected: admission controllers will - be run to check that the request is valid, mutating controllers will change - the object, merge will be performed on PATCH. The storage layer will be - informed not to save, and the final object will be returned to the user with - normal status code. -- Leave the value empty, or don't specify the parameter at all to keep the - default modifying behavior. - -No other values are supported yet, but this gives us an opportunity to create a -finer-grained mechanism later, if necessary. - -## Admission controllers - -Admission controllers need to be modified to understand that the request is a -“dry-run” request. Admission controllers are allowed to have side-effects -when triggered, as long as there is a reconciliation system, because it is not -guaranteed that subsequent validating will permit the request to finish. -Quotas for example uses the current request values to change the available quotas. -The ```admission.Attributes``` interface will be edited like this, to inform the -built-in admission controllers if a request is a dry-run: -```golang -type Attributes interface { - ... - // IsDryRun indicates that modifications will definitely not be persisted for this request. This is to prevent - // admission controllers with side effects and a method of reconciliation from being overwhelmed. - // However, a value of false for this does not mean that the modification will be persisted, because it - // could still be rejected by a subsequent validation step. - IsDryRun() bool - ... -} -``` - -All built-in admission controllers will then have to be checked, and the ones with side -effects will have to be changed to handle the dry-run case correctly. 
Some examples of -built-in admission controllers with the possibility for side-effects are: -- ResourceQuota -- EventRateLimit -- NamespaceAutoProvision -- (Valid|Mut)atingAdmissionWebhook - -To address the possibility of webhook authors [relying on side effects](https://github.com/kubernetes/website/blame/836629cb118e0f74545cc7d6d97aa6b9edfa1a16/content/en/docs/reference/access-authn-authz/admission-controllers.md#L582-L584), a new field -will be added to ```admissionregistration.k8s.io/v1beta1.ValidatingWebhookConfiguration``` and -```admissionregistration.k8s.io/v1beta1.MutatingWebhookConfiguration``` so that webhooks -can explicitly register as having dry-run support. -If dry-run is requested on a non-supported webhook, the request will be completely rejected, -as a 400: Bad Request. This field will be defaulted to true and deprecated in v1, and completely removed in v2. -All webhooks registered with v2 will be assumed to support dry run. The [api conventions](https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md) advise -against bool fields because "many ideas start as boolean but eventually trend towards a small set -of mutually exclusive options" but in this case, we plan to remove the field in a future version. -```golang -// admissionregistration.k8s.io/v1beta1 -... -type Webhook struct { - ... - // DryRunnable defines whether this webhook will correctly handle dryRun requests. - // If false, any dryRun requests to resources/subresources this webhook applies to - // will be completely rejected and the webhook will not be called. - // Defaults to false. - // +optional - DryRunnable *bool `json:"dryRunnable,omitempty" protobuf:"varint,6,number,opt,name=dryRunnable"` -} -``` - -Additionally, a new field will be added to ```admission.k8s.io/v1beta1.AdmissionReview``` -API object to reflect the changes to the ```admission.Attributes``` interface, indicating -whether or not the request being reviewed is for a dry-run: -```golang -// admission.k8s.io/v1beta1 -... -type AdmissionRequest struct { - ... - // DryRun indicates that modifications will definitely not be persisted for this request. - // Defaults to false. - // +optional - DryRun *bool `json:"dryRun,omitempty" protobuf:"varint,11,number,opt,name=dryRun"` -} -``` - -## Generated values - -Some values of the object are typically generated before the object is persisted: -- generateName can be used to assign a unique random name to the object, -- creationTimestamp/deletionTimestamp records the time of creation/deletion, -- UID uniquely identifies the object and is randomly generated (non-deterministic), -- resourceVersion tracks the persisted version of the object. - -Most of these values are not useful in the context of dry-run, and could create -some confusion. The UID and the generated name would have a different value in a -dry-run and non-dry-run creation. These values will be left empty when -performing a dry-run. - -CreationTimestamp and DeletionTimestamp are also generated on creation/deletion, -but there are less ways to abuse them so they will be generated as they for a -regular request. - -ResourceVersion will also be left empty on creation. On updates, the value will -stay unchanged. - -## Storage - -The storage layer will be modified, so that it can know if request is dry-run, -most likely by looking for the field in the “Options” structure (missing for -some handlers, to be added). If it is, it will NOT store the object, but return -success. 
That success can be forwarded back to the user. - -A dry-run request should behave as close as possible to a regular -request. Attempting to dry-run create an existing object will result in an -`AlreadyExists` error to be returned. Similarly, if a dry-run update is -performed on a non-existing object, a `NotFound` error will be returned. +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
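Likewise, the removed `0015-dry-run.md` specifies that dry-run is requested by adding a `dryRun=All` query parameter to a modifying verb, with the response mirroring a normal one while nothing is persisted. A minimal Go sketch under those assumptions (hypothetical endpoint, token, and Pod manifest; only the query parameter comes from the KEP):

```golang
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	manifest := []byte(`{"apiVersion":"v1","kind":"Pod","metadata":{"generateName":"example-"},"spec":{"containers":[{"name":"c","image":"nginx"}]}}`)

	// dryRun=All runs admission, validation, and merge logic but skips persistence;
	// omitting the parameter keeps the normal, persisting behavior.
	url := "https://apiserver.example.com/api/v1/namespaces/default/pods?dryRun=All"

	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(manifest))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer <token>") // placeholder credential

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	// The body is the would-be final object; per the KEP, generated fields such as
	// name, uid, and resourceVersion come back empty because nothing is stored.
	fmt.Println(resp.Status, string(body))
}
```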
\ No newline at end of file diff --git a/keps/sig-api-machinery/0030-storage-migration.md b/keps/sig-api-machinery/0030-storage-migration.md index 88d57852..cfd1f5fa 100644 --- a/keps/sig-api-machinery/0030-storage-migration.md +++ b/keps/sig-api-machinery/0030-storage-migration.md @@ -1,289 +1,4 @@ ---- -kep-number: 30 -title: Migrating API objects to latest storage version -authors: - - "@xuchao" -owning-sig: sig-api-machinery -reviewers: - - "@deads2k" - - "@yliaog" - - "@lavalamp" -approvers: - - "@deads2k" - - "@lavalamp" -creation-date: 2018-08-06 -last-updated: 2018-10-11 -status: provisional ---- - -# Migrating API objects to latest storage version - -## Table of Contents - - * [Migrating API objects to latest storage version](#migrating-api-objects-to-latest-storage-version) - * [Table of Contents](#table-of-contents) - * [Summary](#summary) - * [Motivation](#motivation) - * [Goals](#goals) - * [Proposal](#proposal) - * [Alpha workflow](#alpha-workflow) - * [Alpha API](#alpha-api) - * [Failure recovery](#failure-recovery) - * [Beta workflow - Automation](#beta-workflow---automation) - * [Risks and Mitigations](#risks-and-mitigations) - * [Graduation Criteria](#graduation-criteria) - * [Alternatives](#alternatives) - * [update-storage-objects.sh](#update-storage-objectssh) - -## Summary - -We propose a solution to migrate the stored API objects in Kubernetes clusters. -In 2018 Q4, we will deliver a tool of alpha quality. The tool extends and -improves based on the [oc adm migrate storage][] command. We will integrate the -storage migration into the Kubernetes upgrade process in 2019 Q1. We will make -the migration automatically triggered in 2019. - -[oc adm migrate storage]:https://www.mankier.com/1/oc-adm-migrate-storage - -## Motivation - -"Today it is possible to create API objects (e.g., HPAs) in one version of -Kubernetes, go through multiple upgrade cycles without touching those objects, -and eventually arrive at a version of Kubernetes that can’t interpret the stored -resource and crashes. See k8s.io/pr/52185."[1][]. We propose a solution to the -problem. - -[1]:https://docs.google.com/document/d/1eoS1K40HLMl4zUyw5pnC05dEF3mzFLp5TPEEt4PFvsM - -### Goals - -A successful storage version migration tool must: -* work for Kubernetes built-in APIs, custom resources (CR), and aggregated APIs. -* do not add burden to cluster administrators or Kubernetes distributions. -* only cause insignificant load to apiservers. For example, if the master has - 10GB memory, the migration tool should generate less than 10 qps of single - object operations(TODO: measure the memory consumption of PUT operations; - study how well the default 10 Mbps bandwidth limit in the oc command work). -* work for big clusters that have ~10^6 instances of some resource types. -* make progress in flaky environment, e.g., flaky apiservers, or the migration - process get preempted. -* allow system administrators to track the migration progress. - -As to the deliverables, -* in the short term, providing system administrators with a tool to migrate - the Kubernetes built-in API objects to the proper storage versions. -* in the long term, automating the migration of Kubernetes built-in APIs, CR, - aggregated APIs without further burdening system administrators or Kubernetes - distributions. - -## Proposal - -### Alpha workflow - -At the alpha stage, the migrator needs to be manually launched, and does not -handle custom resources or aggregated resources. 
- -After all the kube-apiservers are at the desired version, the cluster -administrator runs `kubectl apply -f migrator-initializer-<k8s-versio>.yaml`. -The apply command -* creates a *kube-storage-migration* namespace -* creates a *storage-migrator* service account -* creates a *system:storage-migrator* cluster role that can *get*, *list*, and - *update* all resources, and in addition, *create* and *delete* CRDs. -* creates a cluster role binding to bind the created service account with the - cluster role -* creates a **migrator-initializer** job running with the - *storage-migrator* service account. - -The **migrator-initializer** job -* deletes any existing deployment of **kube-migrator controller** -* creates a **kube-migrator controller** deployment running with the - *storage-migrator* service account. -* generates a comprehensive list of resource types via the discovery API -* discovers all custom resources via listing CRDs -* discovers all aggregated resources via listing all `apiservices` that have - `.spec.service != null` -* removes the custom resources and aggregated resources from the comprehensive - resource list. The list now only contains Kubernetes built-in resources. -* removes resources that share the same storage. At the alpha stage, the - information is hard-coded, like in this [list][]. -* creates `migration` CRD (see the [API section][] for the schema) if it does - not exist. -* creates `migration` CRs for all remaining resources in the list. The - `ownerReferences` of the `migration` objects are set to the **kube-migrator - controller** deployment. Thus, the old `migration`s are deleted with the old - deployment in the first step. - -The control loop of **kube-migrator controller** does the following: -* runs a reflector to watch for the instances of the `migration` CR. The list - function used to construct the reflector sorts the `migration`s so that the - *Running* `migration` will be processed first. -* syncs one `migration` at a time to avoid overloading the apiserver, - * if `migration.status` is nil, or `migration.status.conditions` shows - *Running*, it creates a **migration worker** goroutine to migrate the - resource type. - * adds the *Running* condition to `migration.status.conditions`. - * waits until the **migration worker** goroutine finishes, adds either the - *Succeeded* or *Failed* condition to `migration.status.conditions` and sets - the *Running* condition to false. - -The **migration worker** runs the equivalence of `oc adm migrate storage ---include=<resource type>` to migrate a resource type. The **migration worker** -uses API chunking to retrieve partial lists of a resource type and thus can -migrate a small chunk at a time. It stores the [continue token] in the owner -`migration.spec.continueToken`. With the inconsistent continue token -introduced in [#67284][], the **migration worker** does not need to worry about -expired continue token. - -[list]:https://github.com/openshift/origin/blob/2a8633598ef0dcfa4589d1e9e944447373ac00d7/pkg/oc/cli/admin/migrate/storage/storage.go#L120-L184 -[#67284]:https://github.com/kubernetes/kubernetes/pull/67284 -[API section]:#alpha-api - -The cluster admin can run the `kubectl wait --for=condition=Succeeded -migrations` to wait for all migrations to succeed. - -Users can run `kubectl create` to create `migration`s to request migrating -custom resources and aggregated resources. - -### Alpha API - -We introduce the `storageVersionMigration` API to record the intention and the -progress of a migration. 
Throughout this doc, we abbreviated it as `migration` -for simplicity. The API will be a CRD defined in the `migration.k8s.io` group. - -Read the [workflow section][] to understand how the API is used. - -```golang -type StorageVersionMigration struct { - metav1.TypeMeta - // For readers of this KEP, metadata.generateName will be "<resource>.<group>" - // of the resource being migrated. - metav1.ObjectMeta - Spec StorageVersionMigrationSpec - Status StorageVersionMigrationStatus -} - -// Note that the spec only contains an immutable field in the alpha version. To -// request another round of migration for the resource, clients need to create -// another `migration` CR. -type StorageVersionMigrationSpec { - // Resource is the resource that is being migrated. The migrator sends - // requests to the endpoint tied to the Resource. - // Immutable. - Resource GroupVersionResource - // ContinueToken is the token to use in the list options to get the next chunk - // of objects to migrate. When the .status.conditions indicates the - // migration is "Running", users can use this token to check the progress of - // the migration. - // +optional - ContinueToken string -} - -type MigrationConditionType string - -const ( - // MigrationRunning indicates that a migrator job is running. - MigrationRunning MigrationConditionType = "Running" - // MigrationSucceed indicates that the migration has completed successfully. - MigrationSucceeded MigrationConditionType = "Succeeded" - // MigrationFailed indicates that the migration has failed. - MigrationFailed MigrationConditionType = "Failed" -) - -type MigrationCondition struct { - // Type of the condition - Type MigrationConditionType - // Status of the condition, one of True, False, Unknown. - Status corev1.ConditionStatus - // The last time this condition was updated. - LastUpdateTime metav1.Time - // The reason for the condition's last transition. - Reason string - // A human readable message indicating details about the transition. - Message string -} - -type StorageVersionMigrationStatus { - // Conditions represents the latest available observations of the migration's - // current state. - Conditions []MigrationCondition -} -``` - -[continue token]:https://github.com/kubernetes/kubernetes/blob/972e1549776955456d9808b619d136ee95ebb388/staging/src/k8s.io/apimachinery/pkg/apis/meta/v1/types.go#L82 -[workflow section]:#alpha-workflow - -### Failure recovery - -As stated in the goals section, the migration has to make progress even if the -environment is flaky. This section describes how the migrator recovers from -failure. - -Kubernetes **replicaset controller** restarts the **migration controller** `pod` -if it fails. Because the migration states, including the continue token, are - stored in the `migration` object, the **migration controller** can resume from - where it left off. - -[workflow section]:#alpha-workflow - -### Beta workflow - Automation - -It is a beta goal to automate the migration workflow. That is, migration does -not need to be triggered manually by cluster admins, or by custom control loops -of Kubernetes distributions. - -The automated migration should work for Kubernetes built-in resource types, -custom resources, and aggregated resources. - -The trigger can be implemented as a separate control loop. It watches for the -triggering signal, and creates `migration` to notify the **kube-migrator -controller** to migrate a resource. - -We haven't reached consensus on what signal would trigger storage migration. 
We -will revisit this section during beta design. - -### Risks and Mitigations - -The migration process does not change the objects, so it will not pollute -existing data. - -If the rate limiting is not tuned well, the migration can overload the -apiserver. Users can delete the migration controller and the migration -jobs to mitigate. - -Before upgrading or downgrading the cluster, the cluster administrator must run -`kubectl wait --for=condition=Succeeded migrations` to make sure all -migrations have completed. Otherwise the apiserver can crash, because it cannot -interpret the serialized data in etcd. To mitigate, the cluster administrator -can rollback the apiserver to the old version, and wait for the migration to -complete. Even if the apiserver does not crash after upgrading or downgrading, -the `migration` objects are not accurate anymore, because the default storage -versions might have changed after upgrading or downgrading, but no one -increments the `migration.spec.generation`. Administrator needs to re-run the -`kubectl run migrate --image=migrator-initializer --restart=OnFailure` command -to recover. - -TODO: it is safe to rollback an apiserver to the previous configuration without -waiting for the migration to complete. It is only unsafe to roll-forward or -rollback twice. We need to design how to record the previous configuration. - -## Graduation Criteria - -* alpha: delivering a tool that implements the "alpha workflow" and "failure - recovery" sections. ETA is 2018 Q4. - -* beta: implementing the "beta workflow" and integrating the storage migration - into Kubernetes upgrade tests. - -* GA: TBD. - -We will revisit this section in 2018 Q4. - -## Alternatives - -### update-storage-objects.sh - -The Kubernetes repo has an update-storage-objects.sh script. It is not -production ready: no rate limiting, hard-coded resource types, no persisted -migration states. We will delete it, leaving a breadcrumb for any users to -follow to the new tool. +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
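The migration worker described above migrates one resource type in bounded chunks, persisting the list continue token between chunks. As a rough illustration only (this is not the migrator's actual code; `migrateOneChunk` and the chunk size of 500 are made up), one iteration of that loop with the client-go dynamic client could look like:

```go
// Assumes the usual client-go imports: context, metav1 (apis/meta/v1),
// schema (apimachinery/pkg/runtime/schema), and k8s.io/client-go/dynamic.

// migrateOneChunk lists one chunk of a resource type and writes every object
// back unchanged, which re-encodes it at the current storage version. The
// returned continue token is what the controller would persist in
// migration.spec.continueToken before processing the next chunk.
func migrateOneChunk(ctx context.Context, client dynamic.Interface,
	gvr schema.GroupVersionResource, continueToken string) (string, error) {

	list, err := client.Resource(gvr).Namespace(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		Limit:    500, // illustrative chunk size
		Continue: continueToken,
	})
	if err != nil {
		return "", err
	}
	for i := range list.Items {
		item := list.Items[i]
		// A no-op update is enough; the apiserver rewrites the stored encoding.
		if _, err := client.Resource(gvr).Namespace(item.GetNamespace()).
			Update(ctx, &item, metav1.UpdateOptions{}); err != nil {
			return "", err
		}
	}
	// An empty continue token means this resource type has been fully migrated.
	return list.GetContinue(), nil
}
```

A real worker would additionally rate-limit these writes and treat update conflicts as retryable, per the goals and failure-recovery sections above.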
\ No newline at end of file diff --git a/keps/sig-apps/0026-ttl-after-finish.md b/keps/sig-apps/0026-ttl-after-finish.md index 7bde7a9b..cfd1f5fa 100644 --- a/keps/sig-apps/0026-ttl-after-finish.md +++ b/keps/sig-apps/0026-ttl-after-finish.md @@ -1,296 +1,4 @@ ---- -kep-number: 26 -title: TTL After Finished -authors: - - "@janetkuo" -owning-sig: sig-apps -participating-sigs: - - sig-api-machinery -reviewers: - - "@enisoc" - - "@tnozicka" -approvers: - - "@kow3ns" -editor: TBD -creation-date: 2018-08-16 -last-updated: 2018-08-16 -status: provisional -see-also: - - n/a -replaces: - - n/a -superseded-by: - - n/a ---- - -# TTL After Finished Controller - -## Table of Contents - -A table of contents is helpful for quickly jumping to sections of a KEP and for highlighting any additional information provided beyond the standard KEP template. -[Tools for generating][] a table of contents from markdown are available. - - * [TTL After Finished Controller](#ttl-after-finished-controller) - * [Table of Contents](#table-of-contents) - * [Summary](#summary) - * [Motivation](#motivation) - * [Goals](#goals) - * [Proposal](#proposal) - * [Concrete Use Cases](#concrete-use-cases) - * [Detailed Design](#detailed-design) - * [Feature Gate](#feature-gate) - * [API Object](#api-object) - * [Validation](#validation) - * [User Stories](#user-stories) - * [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints) - * [TTL Controller](#ttl-controller) - * [Finished Jobs](#finished-jobs) - * [Finished Pods](#finished-pods) - * [Owner References](#owner-references) - * [Risks and Mitigations](#risks-and-mitigations) - * [Graduation Criteria](#graduation-criteria) - * [Implementation History](#implementation-history) - -[Tools for generating]: https://github.com/ekalinin/github-markdown-toc - -## Summary - -We propose a TTL mechanism to limit the lifetime of finished resource objects, -including Jobs and Pods, to make it easy for users to clean up old Jobs/Pods -after they finish. The TTL timer starts when the Job/Pod finishes, and the -finished Job/Pod will be cleaned up after the TTL expires. - -## Motivation - -In Kubernetes, finishable resources, such as Jobs and Pods, are often -frequently-created and short-lived. If a Job or Pod isn't controlled by a -higher-level resource (e.g. CronJob for Jobs or Job for Pods), or owned by some -other resources, it's difficult for the users to clean them up automatically, -and those Jobs and Pods can accumulate and overload a Kubernetes cluster very -easily. Even if we can avoid the overload issue by implementing a cluster-wide -(global) resource quota, users won't be able to create new resources without -cleaning up old ones first. See [#64470][]. - -The design of this proposal can be later generalized to other finishable -frequently-created, short-lived resources, such as completed Pods or finished -custom resources. - -[#64470]: https://github.com/kubernetes/kubernetes/issues/64470 - -### Goals - -Make it easy to for the users to specify a time-based clean up mechanism for -finished resource objects. -* It's configurable at resource creation time and after the resource is created. - -## Proposal - -[K8s Proposal: TTL controller for finished Jobs and Pods][] - -[K8s Proposal: TTL controller for finished Jobs and Pods]: https://docs.google.com/document/d/1U6h1DrRJNuQlL2_FYY_FdkQhgtTRn1kEylEOHRoESTc/edit - -### Concrete Use Cases - -* [Kubeflow][] needs to clean up old finished Jobs (K8s Jobs, TF Jobs, Argo - workflows, etc.), see [#718][]. 
- -* [Prow][] needs to clean up old completed Pods & finished Jobs. Currently implemented with Prow sinker. - -* [Apache Spark on Kubernetes][] needs proper cleanup of terminated Spark executor Pods. - -* Jenkins Kubernetes plugin creates slave pods that execute builds. It needs a better way to clean up old completed Pods. - -[Kubeflow]: https://github.com/kubeflow -[#718]: https://github.com/kubeflow/tf-operator/issues/718 -[Prow]: https://github.com/kubernetes/test-infra/tree/master/prow -[Apache Spark on Kubernetes]: http://spark.apache.org/docs/latest/running-on-kubernetes.html - -### Detailed Design - -#### Feature Gate - -This will be launched as an alpha feature first, with feature gate -`TTLAfterFinished`. - -#### API Object - -We will add the following API fields to `JobSpec` (`Job`'s `.spec`). - -```go -type JobSpec struct { - // ttlSecondsAfterFinished limits the lifetime of a Job that has finished - // execution (either Complete or Failed). If this field is set, once the Job - // finishes, it will be deleted after ttlSecondsAfterFinished expires. When - // the Job is being deleted, its lifecycle guarantees (e.g. finalizers) will - // be honored. If this field is unset, ttlSecondsAfterFinished will not - // expire. If this field is set to zero, ttlSecondsAfterFinished expires - // immediately after the Job finishes. - // This field is alpha-level and is only honored by servers that enable the - // TTLAfterFinished feature. - // +optional - TTLSecondsAfterFinished *int32 -} -``` - -This allows Jobs to be cleaned up after they finish and provides time for -asynchronous clients to observe Jobs' final states before they are deleted. - - -Similarly, we will add the following API fields to `PodSpec` (`Pod`'s `.spec`). - -```go -type PodSpec struct { - // ttlSecondsAfterFinished limits the lifetime of a Pod that has finished - // execution (either Succeeded or Failed). If this field is set, once the Pod - // finishes, it will be deleted after ttlSecondsAfterFinished expires. When - // the Pod is being deleted, its lifecycle guarantees (e.g. finalizers) will - // be honored. If this field is unset, ttlSecondsAfterFinished will not - // expire. If this field is set to zero, ttlSecondsAfterFinished expires - // immediately after the Pod finishes. - // This field is alpha-level and is only honored by servers that enable the - // TTLAfterFinished feature. - // +optional - TTLSecondsAfterFinished *int32 -} -``` - -##### Validation - -Because Job controller depends on Pods to exist to work correctly. In Job -validation, `ttlSecondsAfterFinished` of its pod template shouldn't be set, to -prevent users from breaking their Jobs. Users should set TTL seconds on a Job, -instead of Pods owned by a Job. - -It is common for higher level resources to call generic PodSpec validation; -therefore, in PodSpec validation, `ttlSecondsAfterFinished` is only allowed to -be set on a PodSpec with a `restartPolicy` that is either `OnFailure` or `Never` -(i.e. not `Always`). - -### User Stories - -The users keep creating Jobs in a small Kubernetes cluster with 4 nodes. -The Jobs accumulates over time, and 1 year later, the cluster ended up with more -than 100k old Jobs. This caused etcd hiccups, long high latency etcd requests, -and eventually made the cluster unavailable. - -The problem could have been avoided easily with TTL controller for Jobs. - -The steps are as easy as: - -1. When creating Jobs, the user sets Jobs' `.spec.ttlSecondsAfterFinished` to - 3600 (i.e. 1 hour). -1. The user deploys Jobs as usual. 
-1. After a Job finishes, the result is observed asynchronously within an hour - and stored elsewhere. -1. The TTL collector cleans up Jobs 1 hour after they complete. - -### Implementation Details/Notes/Constraints - -#### TTL Controller -We will add a TTL controller for finished Jobs and finished Pods. We considered -adding it in Job controller, but decided not to, for the following reasons: - -1. Job controller should focus on managing Pods based on the Job's spec and pod - template, but not cleaning up Jobs. -1. We also need the TTL controller to clean up finished Pods, and we consider - generalizing TTL controller later for custom resources. - -The TTL controller utilizes informer framework, watches all Jobs and Pods, and -read Jobs and Pods from a local cache. - -#### Finished Jobs - -When a Job is created or updated: - -1. Check its `.status.conditions` to see if it has finished (`Complete` or - `Failed`). If it hasn't finished, do nothing. -1. Otherwise, if the Job has finished, check if Job's - `.spec.ttlSecondsAfterFinished` field is set. Do nothing if the TTL field is - not set. -1. Otherwise, if the TTL field is set, check if the TTL has expired, i.e. - `.spec.ttlSecondsAfterFinished` + the time when the Job finishes - (`.status.conditions.lastTransitionTime`) > now. -1. If the TTL hasn't expired, delay re-enqueuing the Job after a computed amount - of time when it will expire. The computed time period is: - (`.spec.ttlSecondsAfterFinished` + `.status.conditions.lastTransitionTime` - - now). -1. If the TTL has expired, `GET` the Job from API server to do final sanity - checks before deleting it. -1. Check if the freshly got Job's TTL has expired. This field may be updated - before TTL controller observes the new value in its local cache. - * If it hasn't expired, it is not safe to delete the Job. Delay re-enqueue - the Job after a computed amount of time when it will expire. -1. Delete the Job if passing the sanity checks. - -#### Finished Pods - -When a Pod is created or updated: -1. Check its `.status.phase` to see if it has finished (`Succeeded` or `Failed`). - If it hasn't finished, do nothing. -1. Otherwise, if the Pod has finished, check if Pod's - `.spec.ttlSecondsAfterFinished` field is set. Do nothing if the TTL field is - not set. -1. Otherwise, if the TTL field is set, check if the TTL has expired, i.e. - `.spec.ttlSecondsAfterFinished` + the time when the Pod finishes (max of all - of its containers termination time - `.containerStatuses.state.terminated.finishedAt`) > now. -1. If the TTL hasn't expired, delay re-enqueuing the Pod after a computed amount - of time when it will expire. The computed time period is: - (`.spec.ttlSecondsAfterFinished` + the time when the Pod finishes - now). -1. If the TTL has expired, `GET` the Pod from API server to do final sanity - checks before deleting it. -1. Check if the freshly got Pod's TTL has expired. This field may be updated - before TTL controller observes the new value in its local cache. - * If it hasn't expired, it is not safe to delete the Pod. Delay re-enqueue - the Pod after a computed amount of time when it will expire. -1. Delete the Pod if passing the sanity checks. - -#### Owner References - -We have considered making TTL controller leave a Job/Pod around even after its -TTL expires, if the Job/Pod has any owner specified in its -`.metadata.ownerReferences`. 
- -We decided not to block deletion on owners, because the purpose of -`.metadata.ownerReferences` is for cascading deletion, but not for keeping an -owner's dependents alive. If the Job is owned by a CronJob, the Job can be -cleaned up based on CronJob's history limit (i.e. the number of dependent Jobs -to keep), or CronJob can choose not to set history limit but set the TTL of its -Job template to clean up Jobs after TTL expires instead of based on the history -limit capacity. - -Therefore, a Job/Pod can be deleted after its TTL expires, even if it still has -owners. - -Similarly, the TTL won't block deletion from generic garbage collector. This -means that when a Job's or Pod's owners are gone, generic garbage collector will -delete it, even if it hasn't finished or its TTL hasn't expired. - -### Risks and Mitigations - -Risks: -* Time skew may cause TTL controller to clean up resource objects at the wrong - time. - -Mitigations: -* In Kubernetes, it's required to run NTP on all nodes ([#6159][]) to avoid time - skew. We will also document this risk. - -[#6159]: https://github.com/kubernetes/kubernetes/issues/6159#issuecomment-93844058 - -## Graduation Criteria - -We want to implement this feature for Pods/Jobs first to gather feedback, and -decide whether to generalize it to custom resources. This feature can be -promoted to beta after we finalize the decision for whether to generalize it or -not, and when it satisfies users' need for cleaning up finished resource -objects, without regressions. - -This will be promoted to GA once it's gone a sufficient amount of time as beta -with no changes. - -[umbrella issues]: https://github.com/kubernetes/kubernetes/issues/42752 - -## Implementation History - -TBD +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
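Both the Finished Jobs and Finished Pods flows above come down to the same arithmetic: the object expires at its finish time plus the TTL, and the re-enqueue delay is whatever remains until then. A minimal sketch of that helper (the function name is ours, not the controller's):

```go
// Assumes: import "time"

// ttlExpiry reports whether a finished object's TTL has run out. finishedAt
// is the Job's Complete/Failed condition lastTransitionTime, or the latest
// container finishedAt for a Pod; requeueAfter is how long the controller
// should wait before looking at the object again.
func ttlExpiry(finishedAt time.Time, ttlSeconds int32, now time.Time) (expired bool, requeueAfter time.Duration) {
	expireAt := finishedAt.Add(time.Duration(ttlSeconds) * time.Second)
	if !now.Before(expireAt) {
		return true, 0 // do the final GET-and-recheck before deleting
	}
	return false, expireAt.Sub(now)
}
```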
\ No newline at end of file diff --git a/keps/sig-apps/0028-20180925-optional-service-environment-variables.md b/keps/sig-apps/0028-20180925-optional-service-environment-variables.md index f9ec352e..cfd1f5fa 100644 --- a/keps/sig-apps/0028-20180925-optional-service-environment-variables.md +++ b/keps/sig-apps/0028-20180925-optional-service-environment-variables.md @@ -1,93 +1,4 @@ ---- -kep-number: 28 -title: Optional Service Environment Variables -authors: - - "@bradhoekstra" - - "@kongslund" -owning-sig: sig-apps -participating-sigs: -reviewers: - - TBD -approvers: - - TBD -editor: TBD -creation-date: 2018-09-25 -last-updated: 2018-09-25 -status: provisional -see-also: - - https://github.com/kubernetes/community/pull/1249 -replaces: -superseded-by: ---- - -# Optional Service Environment Variables - -## Table of Contents - -* [Table of Contents](#table-of-contents) -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) -* [Proposal](#proposal) - * [User Stories](#user-stories) - * [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints) - * [Risks and Mitigations](#risks-and-mitigations) -* [Graduation Criteria](#graduation-criteria) -* [Implementation History](#implementation-history) - -## Summary - -This enhancement allows application developers to choose whether their Pods will receive [environment variables](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables) from services in their namespace. They can choose to disable them via the new `enableServiceLinks` field in `PodSpec`. The current behaviour will continue to be the default behaviour, but the developer may choose to disable these environment variables for certain workloads for reasons such as incompatibilities with other expected environment variables or scalability issues. - -## Motivation - -Today, a list of all services that were running when a pod's containers are created is automatically injected to those containers as environment variables matching the syntax of Docker links. There is no way to disable this. - -Docker links have long been considered as a [deprecated legacy feature](https://docs.docker.com/engine/userguide/networking/default_network/dockerlinks/) of Docker since the introduction of networks and DNS. Likewise, in Kubernetes, DNS is to be preferred over service links. - -Possible issues with injected service links are: - -* Accidental coupling. -* Incompatibilities with container images that no longer utilize service links and explicitly fail at startup time if certain service links are defined. -* Performance penalty in starting up pods [for namespaces with many services](https://github.com/kubernetes/kubernetes/issues/1768#issuecomment-330778184) - -### Goals - -* Allow users to choose whether to inject service environment variables in their Pods. -* Do this in a backwards-compatible, non-breaking way. Default to the current behaviour. - -### Non-Goals - -N/A - -## Proposal - -### User Stories - -* As an application developer, I want to be able to disable service link injection since the injected environment variables interfere with a Docker image that I am trying to run on Kubernetes. -* As an application developer, I want to be able to disable service link injection since I don't need it and it takes increasingly longer time to start pods as services are added to the namespace. 
-* As an application developer, I want to be able to disable service link injection since pods can fail to start if the environment variable list becomes too long. This can happen when there are >5,000 services in the same namespace. - -### Implementation Details/Notes/Constraints - -`PodSpec` is extended with an additional field, `enableServiceLinks`. The field should be a pointer to a boolean and default to true if nil. - -In `kubelet_pods.go`, the value of that field is passed along to the function `getServiceEnvVarMap` where it is used to decide which services will be propogated into environment variables. In case `enableServiceLinks` is false then only the `kubernetes` service in the `kl.masterServiceNamespace` should be injected. The latter is needed in order to preserve Kubernetes variables such as `KUBERNETES_SERVICE_HOST` since a lot of code depends on it. - -### Risks and Mitigations - -The current behaviour is being kept as the default as much existing code and documentation depends on these environment variables. - -## Graduation Criteria - -N/A - -## Implementation History - -- 2017-10-21: First draft of original proposal [PR](https://github.com/kubernetes/community/pull/1249) -- 2018-02-22: First draft of implementation [PR](https://github.com/kubernetes/kubernetes/pull/60206) -- 2018-08-31: General consensus of implementation plan -- 2018-09-17: First draft of new implementation [PR](https://github.com/kubernetes/kubernetes/pull/68754) -- 2018-09-24: Implementation merged into master -- 2018-09-25: Converting proposal into this KEP +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
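The kubelet behaviour described above, where only the master `kubernetes` service keeps contributing environment variables once `enableServiceLinks` is false, can be pictured roughly as follows. This is a sketch rather than the real `getServiceEnvVarMap` code, and the `service` type is trimmed to what the example needs:

```go
// service carries only the fields this sketch uses.
type service struct {
	Name      string
	Namespace string
}

// servicesForEnvVars returns the services whose environment variables should
// be injected into a pod's containers.
func servicesForEnvVars(all []service, masterServiceNamespace string, enableServiceLinks bool) []service {
	if enableServiceLinks {
		return all // default behaviour: link every service in scope
	}
	var kept []service
	for _, s := range all {
		// Keep the master "kubernetes" service so variables such as
		// KUBERNETES_SERVICE_HOST remain available.
		if s.Name == "kubernetes" && s.Namespace == masterServiceNamespace {
			kept = append(kept, s)
		}
	}
	return kept
}
```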
\ No newline at end of file diff --git a/keps/sig-apps/0032-portable-service-definitions.md b/keps/sig-apps/0032-portable-service-definitions.md index e80d0c50..cfd1f5fa 100644 --- a/keps/sig-apps/0032-portable-service-definitions.md +++ b/keps/sig-apps/0032-portable-service-definitions.md @@ -1,152 +1,4 @@ -# KEP: Portable Service Definitions - ---- -kep-number: 31 -title: Portable Service Definitions -authors: - - "@mattfarina" -owning-sig: sig-apps -participating-sigs: - - sig-service-catalog -reviewers: - - "@carolynvs" - - "@kibbles-n-bytes" - - "@duglin" - - "@jboyd01" - - "@prydonius" - - "@kow3ns" -approvers: - - "@mattfarina" - - "@prydonius" - - "@kow3ns" -editor: TBD -creation-date: 2018-11-13 -last-updated: 2018-11-19 -status: provisional -see-also: -replaces: -superseded-by: - ---- - -# Portable Service Definitions - -## Table of Contents - -* [Table of Contents](#table-of-contents) -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) -* [Proposal](#proposal) - * [User Stories [optional]](#user-stories-optional) - * [Story 1](#story-1) - * [Story 2](#story-2) - * [Implementation Details/Notes/Constraints [optional]](#implementation-detailsnotesconstraints-optional) - * [Risks and Mitigations](#risks-and-mitigations) -* [Graduation Criteria](#graduation-criteria) -* [Implementation History](#implementation-history) -* [Drawbacks [optional]](#drawbacks-optional) -* [Alternatives [optional]](#alternatives-optional) - -## Summary - -The goal of this feature is to enable an application to be deployed into multiple environments while relying on external services that are not part of the application and using the same objects in all environments. That includes service instances that may or may not be managed by Kubernetes. For example, take a WordPress application that relies on MySQL. If the application is running in GKE or AKS it may leverage a Google or Azure MySQL as a Service. If it is running in Minikube or in a bare metal cluster it may use MySQL managed by the Kubernetes cluster. In all of these cases, the same resource is declared asking for the service (e.g., MySQL) and the credentials to use the service are returned in a common manner (e.g., a secret with the same type and schema). This enables portability from one environment to another because working with services in all environments leverages the same Kubernetes objects with the same schemas. - -## Motivation - -Workload portability is commonly cited as a goal for those deploying workloads. The Kubernetes API can provide a common API for deploying workloads across varying environments enabling some level of portability. For example, a deployment can be run in a cluster running on GKE, AKS, EKS, or clusters running elsewhere. - -But, many applications rely on software as a service (SaaS). The reason for this is to push the operational details on to someone else who specializes in that particular service so the application developers and operators can focus on their application and business logic. - -The problem is that one cannot deploy the same application in two different environments by two different providers, if the applications leverages services, with the same set of resources. This includes cases where the service being leveraged is common (e.g., MySQL as a Service). This problem limits application portability and sharing (e.g., in open source). 
- -This KEP is looking to solve this problem by providing Kubernetes compatible objects, via CRDs and Secrets, that can be used in many environments by many providers to make working with common services easier. This can be used for services like database (e.g., MySQL, PostgreSQL), DNS, SMTP, and many others. - -### Goals - -* Provide a common way to request common services (e.g., MySQL) -* Provide a common means to obtain credentials to use the service -* Provide a common method to detect which services are available -* Provide a system that can be implemented for the major public clouds, on-premise clusters, and local cluster (e.g., Docker for Mac) - -### Non-Goals - -* Provide an out of the box solution for every bespoke service provided by everyone -* Replace Kubernetes Service Catalog - -## Proposal - -### User Stories - -#### Story 1 - -As a user of Kubernetes, I can query the services I can declaratively request using a Kubernetes native API. For example, using the command `kubectl get crds` where I can see services alongside other resources that can be created. - -#### Story 2 - -As a user of Kubernetes, I can declaratively request an instance of a service using a custom resource. When the service is provided the means to use that service (e.g., credentials in a secret) are provided in a common and consistent manner. The same resource and secret can be used in clusters running in different locations and the way the service is provided may be different. - -#### Story 3 - -As a cluster operator or application operator, I can discover controllers implementing the CRDs and secrets to support the application portability in my cluster. - -#### Story 4 - -As a cluster operator or application operator, I can set default values and provider custom settings for a service. - -### Implementation Details/Notes/Constraints - -To solve the two user stories there are two types of Kubernetes resources that can be leveraged. - -1. Custom resource definitions (CRDs) can be used to describe a service. The CRDs can be implemented by controllers for different environments and the list of installed CRDs can be queried to see what is supported in a cluster -2. Secrets with a specific type and schema can be used to handle credentials and other relevant information for services that have them (e.g., a database). Not all services will require a secret (e.g., DNS) - -This subproject will list and document the resources and how controllers can implement them. This provides for interoperability including that for controllers and other tools, like validators, and a canonical listing held by a vendor neutral party. - -In addition to the resources, this subproject will also provide a controller implementing the defined services to support testing, providing an example implementation, and to support other Kubernetes subprojects (e.g., Minikube). Controllers produced by this project are _not_ meant to be used in production. - -3rd party controllers implementing the CRDs and secrets can use a variety of methods to implement the service handling. This is where the Kubernetes Service Catalog can be an option. This subproject will not host or support 3rd party controllers but will list them to aide in users discovering them. This is in support of the 3rd user story. - -To support custom settings for services by a service provider and to add the ability to add default settings (user story 4) we are considering a pattern of using CRDs for a controller with configuration on a cluster wide and namespace based level. 
An example of this in existence today is the cert manager issuer and cluster issuer resources. How to support this pattern will be worked out as part of the process in the next step of building a working system. This pattern is tentative until worked out in practice. - -The next step is to work out the details on a common service and an initial process by which future services can be added. To work out the details we will start with MySQL and go through the process to make it work as managed by Kubernetes and each of the top three public clouds as defined by adoption. Public clouds and other cloud platforms following the top three are welcome to be involved in the process but are not required in the process. - -Before this KEP can move to being implementable at least 2 services need to go through the process of being implemented to prove out the process elements of this system. - -### Risks and Mitigations - -Two known risks include: - -1. The access controls and relationships between accounts and services. How will proper user and tenancy information be passed to clouds that require this form of information? -2. Details required, when requesting a service, can vary between cloud providers that will implement this as a SaaS. How can that information be commonized or otherwise handled? - -## Graduation Criteria - -The following are the graduation criteria: - -- 5 organizations have adopted using the portable service definitions -- Service definitions for at least 3 services have been created and are in use -- Documentation exists explaining how a controller implementer can use the CRDs and secrets to create a controller of their own -- Service consumer documentation exists explaining how to use portable service definitions within their applications -- A documented process for bringing a new service from suggestion to an implementable solution - -## Implementation History - -- the _Summary_, _Motivation_, _Proposal_ sections developed - -## Drawbacks [optional] - -Why should this KEP _not_ be implemented. - -## Alternatives [optional] - -An alternative is to modify the service catalog to leverage CRDs and return common secret credentials. Drawbacks to this are that it would be a complicated re-write to the service catalog, according to the service catalog team, and the solution would still require the open service broker (OSB) to be implemented in all environments (e.g., Minikube) even where simpler models (e.g., a controller) could be used instead. The solution proposed here could work with the service catalog in environments it makes sense and use other models in other environments. The focus here is more on the application operator experience working with services than all of the implementations required to power it. - -## Infrastructure Needed [optional] - -The following infrastructure elements are needed: - -- A new subproject under SIG Apps for organizational purposes -- A git repository, in the `kubernetes-sigs` organization, to host the CRD and Secrets schemas along with the Kubernetes provided controller -- Testing infrastructure to continuously test the codebase +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
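To make the CRD-plus-Secret pattern above concrete, a purely hypothetical MySQL service resource might be shaped as follows; none of these type or field names are defined by the KEP, which deliberately defers the real schemas to the subproject:

```go
// Assumes: metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// MySQL is a hypothetical portable service resource. Controllers for GKE,
// AKS, or a local cluster would fulfil it differently, but applications only
// depend on this shape and on the referenced credentials Secret's schema.
type MySQL struct {
	metav1.TypeMeta
	metav1.ObjectMeta
	Spec   MySQLSpec
	Status MySQLStatus
}

type MySQLSpec struct {
	// Version the application requires, e.g. "5.7".
	Version string
}

type MySQLStatus struct {
	// CredentialsSecret names a Secret whose keys (e.g. host, port,
	// username, password) follow a common documented schema in every
	// environment, which is what makes the workload portable.
	CredentialsSecret string
}
```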
\ No newline at end of file diff --git a/keps/sig-apps/README.md b/keps/sig-apps/README.md index f7f6b320..cfd1f5fa 100644 --- a/keps/sig-apps/README.md +++ b/keps/sig-apps/README.md @@ -1,3 +1,4 @@ -# SIG Apps KEPs - -This directory contains KEPs related to SIG Apps. +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-apps/sidecarcontainers.md b/keps/sig-apps/sidecarcontainers.md index f3cab7e2..cfd1f5fa 100644 --- a/keps/sig-apps/sidecarcontainers.md +++ b/keps/sig-apps/sidecarcontainers.md @@ -1,150 +1,4 @@ ---- -title: Sidecar Containers -authors: - - "@joseph-irving" -owning-sig: sig-apps -participating-sigs: - - sig-apps - - sig-node -reviewers: - - "@fejta" -approvers: - - "@enisoc" - - "@kow3ns" -editor: TBD -creation-date: 2018-05-14 -last-updated: 2018-11-20 -status: provisional ---- - -# Sidecar Containers - -## Table of Contents - -* [Table of Contents](#table-of-contents) -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) -* [Proposal](#proposal) - * [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints) - * [Risks and Mitigations](#risks-and-mitigations) -* [Graduation Criteria](#graduation-criteria) -* [Implementation History](#implementation-history) -* [Alternatives](#alternatives) - -## Summary - -To solve the problem of container lifecycle dependency we can create a new class of container: a "sidecar container" that behaves primarily like a normal container but is handled differently during termination and startup. - -## Motivation - -SideCar containers have always been used in some ways but just not formally identified as such, they are becoming more common in a lot of applications and as more people have used them, more issues have cropped up. - -Here are some examples of the main problems: - -### Jobs - If you have a Job with two containers one of which is actually doing the main processing of the job and the other is just facilitating it, you encounter a problem when the main process finishes; your sidecar container will carry on running so the job will never finish. - -The only way around this problem is to manage the sidecar container's lifecycle manually and arrange for it to exit when the main container exits. This is typically achieved by building an ad-hoc signalling mechanism to communicate completion status between containers. Common implementations use a shared scratch volume mounted into all pods, where lifecycle status can be communicated by creating and watching for the presence of files. This pattern has several disadvantages: - -* Repetitive lifecycle logic must be rewritten in each instance a sidecar is deployed. -* Third-party containers typically require a wrapper to add this behaviour, normally provided via an entrypoint wrapper script implemented in the k8s container spec. This adds undesirable overhead and introduces repetition between the k8s and upstream container image specs. -* The wrapping typically requires the presence of a shell in the container image, so this pattern does not work for minimal containers which ship without a toolchain. - -### Startup -An application that has a proxy container acting as a sidecar may fail when it starts up as it's unable to communicate until its proxy has started up successfully. Readiness probes don't help if the application is trying to talk outbound. - -### Shutdown -Applications that rely on sidecars may experience a high amount of errors when shutting down as the sidecar may terminate before the application has finished what it's doing. 
- - -## Goals - -Solve issues so that they don't require application modification: -* [25908](https://github.com/kubernetes/kubernetes/issues/25908) - Job completion -* [65502](https://github.com/kubernetes/kubernetes/issues/65502) - Container startup dependencies - -## Non-Goals - -Allowing multiple containers to run at once during the init phase. //TODO See if we can solve this problem with this proposal - -## Proposal - -Create a way to define containers as sidecars, this will be an additional field to the Container Spec: `sidecar: true`. //TODO Decide on the API (see [Alternatives](#alternatives)) - -e.g: -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: myapp-pod - labels: - app: myapp -spec: - containers: - - name: myapp - image: myapp - command: ['do something'] - - name: sidecar - image: sidecar-image - sidecar: true - command: ["do something to help my app"] - -``` -Sidecars will be started before normal containers but after init, so that they are ready before your main processes start. - -This will change the Pod startup to look like this: -* Init containers start -* Init containers finish -* Sidecars start -* Containers start - -During pod termination sidecars will be terminated last: -* Containers sent SIGTERM -* Once all Containers have exited: Sidecars sent SIGTERM - -If Containers don't exit before the end of the TerminationGracePeriod then they will be sent a SIGKIll as normal, Sidecars will then be sent a SIGTERM with a short grace period of 5/10 Seconds (up for debate) to give them a chance to cleanly exit. - -PreStop Hooks will be sent to sidecars and containers at the same time. -This will be useful in scenarios such as when your sidecar is a proxy so that it knows to no longer accept inbound requests but can continue to allow outbound ones until the the primary containers have shut down. //TODO Discuss whether this is a valid use case (dropping inbound requests can cause problems with load balancers) - -To solve the problem of Jobs that don't complete: When RestartPolicy!=Always if all normal containers have reached a terminal state (Succeeded for restartPolicy=OnFailure, or Succeeded/Failed for restartPolicy=Never), then all sidecar containers will be sent a SIGTERM. - -### Implementation Details/Notes/Constraints - -As this is a fairly large change I think it make sense to break this proposal down and phase in more functionality as we go, potential roadmap could look like: - -* Add sidecar field, use it for the shutdown triggering when RestartPolicy!=Always -* Pre-stop hooks sent to sidecars before non sidecar containers -* Sidecars are terminated after normal containers -* Sidecars start before normal containers - - -As this is a change to the Container spec we will be using feature gating, you will be required to explicitly enable this feature on the api server as recommended [here](https://github.com/kubernetes/community/blob/master/contributors/devel/api_changes.md#adding-unstable-features-to-stable-versions). - -### Risks and Mitigations - -You could set all containers to be `sidecar: true`, this seems wrong, so maybe the api should do a validation check that at least one container is not a sidecar. - -Init containers would be able to have `sidecar: true` applied to them as it's an additional field to the container spec, this doesn't currently make sense as init containers are ran sequentially. We could get around this by having the api throw a validation error if you try to use this field on an init container or just ignore the field. 
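A sketch of the validation discussed in this section, with a trimmed-down container type since the `sidecar` field does not exist in the API yet; whether init containers should be rejected or silently ignored is still an open question above, and this sketch picks rejection:

```go
// Assumes: import "fmt"

// container carries only what this sketch needs; Sidecar is the field this
// KEP proposes, not an existing API field.
type container struct {
	Name    string
	Init    bool
	Sidecar bool
}

func validateSidecars(containers []container) error {
	sawNonSidecar := false
	for _, c := range containers {
		if c.Init && c.Sidecar {
			return fmt.Errorf("init container %q may not set sidecar: true", c.Name)
		}
		if !c.Init && !c.Sidecar {
			sawNonSidecar = true
		}
	}
	if !sawNonSidecar {
		return fmt.Errorf("at least one container must not be marked as a sidecar")
	}
	return nil
}
```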
- -Older Kubelets that don't implement the sidecar logic could have a pod scheduled on them that has the sidecar field. As this field is just an addition to the Container Spec the Kubelet would still be able to schedule the pod, treating the sidecars as if they were just a normal container. This could potentially cause confusion to a user as their pod would not behave in the way they expect, but would avoid pods being unable to schedule. - - -## Graduation Criteria - -//TODO - -## Implementation History - -- 14th May 2018: Proposal Submitted - - -## Alternatives - -One alternative would be to have a new field in the Pod Spec of `sidecarContainers:` where you could define a list of sidecar containers, however this would require more work in terms of updating tooling to support this. - -Another alternative would be to change the Job Spec to have a `primaryContainer` field to tell it which containers are important. However I feel this is perhaps too specific to job when this Sidecar concept could be useful in other scenarios. - -Having it as a boolean could cause problems later down the line if more lifecycle related flags were added, perhaps it makes more sense to have something like `lifecycle: Sidecar` to make it more future proof. +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-auth/0000-20170814-bounding-self-labeling-kubelets.md b/keps/sig-auth/0000-20170814-bounding-self-labeling-kubelets.md index 73b3344a..cfd1f5fa 100644 --- a/keps/sig-auth/0000-20170814-bounding-self-labeling-kubelets.md +++ b/keps/sig-auth/0000-20170814-bounding-self-labeling-kubelets.md @@ -1,141 +1,4 @@ ---- -kep-number: 0 -title: Bounding Self-Labeling Kubelets -authors: - - "@mikedanese" - - "@liggitt" -owning-sig: sig-auth -participating-sigs: - - sig-node - - sig-storage -reviewers: - - "@saad-ali" - - "@tallclair" -approvers: - - "@thockin" - - "@smarterclayton" -creation-date: 2017-08-14 -last-updated: 2018-10-31 -status: implementable ---- - -# Bounding Self-Labeling Kubelets - -## Motivation - -Today the node client has total authority over its own Node labels. -This ability is incredibly useful for the node auto-registration flow. -The kubelet reports a set of well-known labels, as well as additional -labels specified on the command line with `--node-labels`. - -While this distributed method of registration is convenient and expedient, it -has two problems that a centralized approach would not have. Minorly, it makes -management difficult. Instead of configuring labels in a centralized -place, we must configure `N` kubelet command lines. More significantly, the -approach greatly compromises security. Below are two straightforward escalations -on an initially compromised node that exhibit the attack vector. - -### Capturing Dedicated Workloads - -Suppose company `foo` needs to run an application that deals with PII on -dedicated nodes to comply with government regulation. A common mechanism for -implementing dedicated nodes in Kubernetes today is to set a label or taint -(e.g. `foo/dedicated=customer-info-app`) on the node and to select these -dedicated nodes in the workload controller running `customer-info-app`. - -Since the nodes self reports labels upon registration, an intruder can easily -register a compromised node with label `foo/dedicated=customer-info-app`. The -scheduler will then bind `customer-info-app` to the compromised node potentially -giving the intruder easy access to the PII. - -This attack also extends to secrets. Suppose company `foo` runs their outward -facing nginx on dedicated nodes to reduce exposure to the company's publicly -trusted server certificates. They use the secret mechanism to distribute the -serving certificate key. An intruder captures the dedicated nginx workload in -the same way and can now use the node certificate to read the company's serving -certificate key. - -## Proposal - -1. Modify the `NodeRestriction` admission plugin to prevent Kubelets from self-setting labels -within the `k8s.io` and `kubernetes.io` namespaces *except for these specifically allowed labels/prefixes*: - - ``` - kubernetes.io/hostname - kubernetes.io/instance-type - kubernetes.io/os - kubernetes.io/arch - - beta.kubernetes.io/instance-type - beta.kubernetes.io/os - beta.kubernetes.io/arch - - failure-domain.beta.kubernetes.io/zone - failure-domain.beta.kubernetes.io/region - - failure-domain.kubernetes.io/zone - failure-domain.kubernetes.io/region - - [*.]kubelet.kubernetes.io/* - [*.]node.kubernetes.io/* - ``` - -2. Reserve and document the `node-restriction.kubernetes.io/*` label prefix for cluster administrators -that want to label their `Node` objects centrally for isolation purposes. - - > The `node-restriction.kubernetes.io/*` label prefix is reserved for cluster administrators - > to isolate nodes. 
These labels cannot be self-set by kubelets when the `NodeRestriction` - > admission plugin is enabled. - -This accomplishes the following goals: - -- continues allowing people to use arbitrary labels under their own namespaces any way they wish -- supports legacy labels kubelets are already adding -- provides a place under the `kubernetes.io` label namespace for node isolation labeling -- provide a place under the `kubernetes.io` label namespace for kubelets to self-label with kubelet and node-specific labels - -## Implementation Timeline - -v1.13: - -* Kubelet deprecates setting `kubernetes.io` or `k8s.io` labels via `--node-labels`, -other than the specifically allowed labels/prefixes described above, -and warns when invoked with `kubernetes.io` or `k8s.io` labels outside that set. -* NodeRestriction admission prevents kubelets from adding/removing/modifying `[*.]node-restriction.kubernetes.io/*` labels on Node *create* and *update* -* NodeRestriction admission prevents kubelets from adding/removing/modifying `kubernetes.io` or `k8s.io` -labels other than the specifically allowed labels/prefixes described above on Node *update* only - -v1.15: - -* Kubelet removes the ability to set `kubernetes.io` or `k8s.io` labels via `--node-labels` -other than the specifically allowed labels/prefixes described above (deprecation period -of 6 months for CLI elements of admin-facing components is complete) - -v1.17: - -* NodeRestriction admission prevents kubelets from adding/removing/modifying `kubernetes.io` or `k8s.io` -labels other than the specifically allowed labels/prefixes described above on Node *update* and *create* -(oldest supported kubelet running against a v1.17 apiserver is v1.15) - -## Alternatives Considered - -### File or flag-based configuration of the apiserver to allow specifying allowed labels - -* A fixed set of labels and label prefixes is simpler to reason about, and makes every cluster behave consistently -* File-based config isn't easily inspectable to be able to verify enforced labels -* File-based config isn't easily kept in sync in HA apiserver setups - -### API-based configuration of the apiserver to allow specifying allowed labels - -* A fixed set of labels and label prefixes is simpler to reason about, and makes every cluster behave consistently -* An API object that controls the allowed labels is a potential escalation path for a compromised node - -### Allow kubelets to add any labels they wish, and add NoSchedule taints if disallowed labels are added - -* To be robust, this approach would also likely involve a controller to automatically inspect labels and remove the NoSchedule taint. This seemed overly complex. Additionally, it was difficult to come up with a tainting scheme that preserved information about which labels were the cause. - -### Forbid all labels regardless of namespace except for a specifically allowed set - -* This was much more disruptive to existing usage of `--node-labels`. -* This was much more difficult to integrate with other systems allowing arbitrary topology labels like CSI. -* This placed restrictions on how labels outside the `kubernetes.io` and `k8s.io` label namespaces could be used, which didn't seem proper. +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
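For illustration, the label check implied by the NodeRestriction changes above could look roughly like this; it is a sketch rather than the actual admission plugin code, and the allow list is abbreviated to a few entries from the list above:

```go
// Assumes: import "strings"

// kubeletLabelAllowed reports whether a kubelet may self-set the given label
// key: keys outside the kubernetes.io/k8s.io namespaces are unrestricted,
// and inside them only the explicit allow list plus the
// [*.]kubelet.kubernetes.io/* and [*.]node.kubernetes.io/* prefixes pass.
func kubeletLabelAllowed(key string) bool {
	ns := ""
	if i := strings.Index(key, "/"); i >= 0 {
		ns = key[:i]
	}
	reserved := ns == "kubernetes.io" || strings.HasSuffix(ns, ".kubernetes.io") ||
		ns == "k8s.io" || strings.HasSuffix(ns, ".k8s.io")
	if !reserved {
		return true
	}
	allowed := map[string]bool{
		"kubernetes.io/hostname":            true,
		"kubernetes.io/os":                  true,
		"kubernetes.io/arch":                true,
		"beta.kubernetes.io/instance-type":  true,
		"failure-domain.kubernetes.io/zone": true,
		// ... the remaining entries from the allow list above
	}
	if allowed[key] {
		return true
	}
	return ns == "kubelet.kubernetes.io" || strings.HasSuffix(ns, ".kubelet.kubernetes.io") ||
		ns == "node.kubernetes.io" || strings.HasSuffix(ns, ".node.kubernetes.io")
}
```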
\ No newline at end of file diff --git a/keps/sig-auth/0014-dynamic-audit-configuration.md b/keps/sig-auth/0014-dynamic-audit-configuration.md index 8a7df026..cfd1f5fa 100644 --- a/keps/sig-auth/0014-dynamic-audit-configuration.md +++ b/keps/sig-auth/0014-dynamic-audit-configuration.md @@ -1,280 +1,4 @@ ---- -kep-number: 14 -title: Dynamic Audit Configuration -authors: - - "@pbarker" -owning-sig: sig-auth -participating-sigs: - - sig-api-machinery -reviewers: - - "@tallclair" - - "@yliaog" - - "@caesarxuchao" - - "@liggitt" -approvers: - - "@tallclair" - - "@liggitt" - - "@yliaog" -editor: TBD -creation-date: 2018-05-18 -last-updated: 2018-07-31 -status: implementable ---- - -# Dynamic Audit Control - -## Table of Contents - -* [Dynamic Audit Control](#dynamic-audit-control) - * [Table of Contents](#table-of-contents) - * [Summary](#summary) - * [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) - * [Proposal](#proposal) - * [Dynamic Configuration](#dynamic-configuration) - * [Cluster Scoped Configuration](#cluster-scoped-configuration) - * [User Stories](#user-stories) - * [Story 1](#story-1) - * [Story 2](#story-2) - * [Story 3](#story-3) - * [Story 4](#story-4) - * [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints) - * [Feature Gating](#feature-gating) - * [Policy Enforcement](#policy-enforcement) - * [Aggregated Servers](#aggregated-servers) - * [Risks and Mitigations](#risks-and-mitigations) - * [Privilege Escalation](#privilege-escalation) - * [Leaked Resources](#leaked-resources) - * [Webhook Authentication](#webhook-authentication) - * [Performance](#performance) - * [Graduation Criteria](#graduation-criteria) - * [Implementation History](#implementation-history) - * [Alternatives](#alternatives) - * [Generalized Dynamic Configuration](#generalized-dynamic-configuration) - * [Policy Override](#policy-override) - -## Summary - -We want to allow the advanced auditing features to be dynamically configured. Following in the same vein as -[Dynamic Admission Control](https://kubernetes.io/docs/admin/extensible-admission-controllers/) we would like to provide -a means of configuring the auditing features post cluster provisioning. - -## Motivation - -The advanced auditing features are a powerful tool, yet difficult to configure. The configuration requires deep insight -into the deployment mechanism of choice and often takes many iterations to configure properly requiring a restart of -the apiserver each time. Moreover, the ability to install addon tools that configure and enhance auditing is hindered -by the overhead in configuration. Such tools frequently run on the cluster requiring future knowledge of how to reach -them when the cluster is live. These tools could enhance the security and conformance of the cluster and its applications. - -### Goals -- Provide an api and set of objects to configure the advanced auditing kube-apiserver configuration dynamically - -### Non-Goals -- Provide a generic interface to configure all kube-apiserver flags -- configuring non-webhook backends -- configuring audit output (format or per-field filtering) -- authorization of audit output - -## Proposal - -### Dynamic Configuration -A new dynamic audit backend will be introduced that follows suit with the existing [union backend](https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/audit/union.go). It will hold a map of configuration objects that it syncs with an informer. 
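A minimal sketch of what such a dynamic union backend could look like; the real apiserver audit interfaces differ, so the types here are stand-ins and only the fan-out and informer-sync shape is the point:

```go
// Assumes: import "sync"

// auditEvent stands in for the real audit event type.
type auditEvent struct{ /* fields elided */ }

// sink stands in for the apiserver's audit backend interface.
type sink interface {
	ProcessEvents(events ...*auditEvent)
}

// dynamicBackend fans each event out to the backends built from whatever
// AuditConfiguration objects the informer has most recently synced.
type dynamicBackend struct {
	mu        sync.RWMutex
	delegates map[string]sink // keyed by configuration name
}

func (d *dynamicBackend) ProcessEvents(events ...*auditEvent) {
	d.mu.RLock()
	defer d.mu.RUnlock()
	for _, s := range d.delegates {
		s.ProcessEvents(events...)
	}
}

// setDelegate is what the informer's add/update handler would call after
// turning an AuditConfiguration into a policy-enforcing webhook backend;
// the delete handler would remove the entry.
func (d *dynamicBackend) setDelegate(name string, s sink) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.delegates == nil {
		d.delegates = map[string]sink{}
	}
	d.delegates[name] = s
}
```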
- -#### Cluster Scoped Configuration -A cluster scoped configuration object will be provided that applies to all events in the cluster. - -```golang -// AuditConfiguration represents a dynamic audit configuration -type AuditConfiguration struct { - metav1.TypeMeta - - v1.ObjectMeta - - // Policy is the current audit v1beta1 Policy object - // if undefined it will default to the statically configured cluster policy if available - // if neither exist the backend will fail - Policy *Policy - - // Backend to send events - Backend *Backend -} - -// Backend holds the configuration for the backend -type Backend struct { - // Webhook holds the webhook backend - Webhook *WebhookBackend -} - -// WebhookBackend holds the configuration of the webhooks -type WebhookBackend struct { - // InitialBackoff is amount of time to wait before retrying the first failed request in seconds - InitialBackoff *int - - // ThrottleBurst is the maximum number of events sent at the same moment - ThrottleBurst *int - - // ThrottleEnabled determines whether throttling is enabled - ThrottleEnabled *bool - - // ThrottleQPS maximum number of batches per second - ThrottleQPS *float32 - - // ClientConfig holds the connection parameters for the webhook - ClientConfig WebhookClientConfig -} - -// WebhookClientConfig contains the information to make a TLS -// connection with the webhook; this follows: -// https://github.com/kubernetes/api/blob/master/admissionregistration/v1beta1/types.go#L222 -// but may require some additive auth parameters -type WebhookClientConfig struct { - // URL of the server - URL *string - - // Service name to send to - Service *ServiceReference - - // `caBundle` is a PEM encoded CA bundle which will be used to validate - // the webhook's server certificate. - CABundle []byte -} -``` - -Multiple definitions can exist as independent solutions. These updates will require the audit API to be registered with the apiserver. The dynamic configurations will be wrapped by truncate and batch options, which are set statically through existing flags. Dynamic configuration will be enabled by a feature gate for pre-stable releases. If existing flags are provided to configure the audit backend they will be taken as a separate backend configuration. - -Example configuration yaml config: -```yaml -apiVersion: audit.k8s.io/v1beta1 -kind: AuditConfiguration -metadata: - name: <name> -policy: - rules: - - level: <level> - omitStages: - - stage: <stage> -backend: - webhook: - - initialBackoff: <10s> - throttleBurst: <15> - throttleEnabled: <true> - throttleQPS: <10> - clientConfig: - url: <backend url> - service: <optional service name> - caBundle: <ca bundle> -``` -A configuration flag will be added that enables dynamic auditing `--audit-dynamic-configuration`, which will default to false. - -### User Stories - -#### Story 1 -As a cluster admin, I will easily be able to enable the internal auditing features of an existing cluster, and tweak the configurations as necessary. I want to prevent privilege escalation from being able to tamper with a root audit configuration. - -#### Story 2 -As a Kubernetes extension developer, I will be able to provide drop in extensions that utilize audit data. - -#### Story 3 -As a cluster admin, I will be able configure multiple audit-policies and webhook endpoints to provide independent auditing facilities. - -#### Story 4 -As a kubernetes developer, I will be able to quickly turn up the audit level on a certain area to debug my application. 
- -### Implementation Details/Notes/Constraints - -#### Feature Gating -Introduction of dynamic policy requires changes to the current audit pipeline. Care must be taken that these changes are -properly gated and do not affect the stability or performance of the current features as they progress to GA. A new decorated -handler will be provisioned similar to the [existing handlers](https://github.com/kubernetes/apiserver/blob/master/pkg/endpoints/filters/audit.go#L41) -called `withDynamicAudit`. Another conditional clause will be added where the handlers are -[provisioned](https://github.com/kubernetes/apiserver/blob/master/pkg/server/config.go#L536) allowing for the proper feature gating. - -#### Policy Enforcement -This addition will move policy enforcement from the main handler to the backends. From the `withDynamicAudit` handler, -the full event will be generated and then passed to the backends. Each backend will copy the event and then be required to -drop any pieces that do not conform to its policy. A new sink interface will be required for these changes called `EnforcedSink`, -this will largely follow suite with the existing sink but take a fully formed event and the authorizer attributes as its -parameters. It will then utilize the `LevelAndStages` method in the policy -[checker](https://github.com/kubernetes/apiserver/blob/master/pkg/audit/policy/checker.go) to enforce its policy on the event, -and drop any unneeded sections. The new dynamic backend will implement the `EnforcedSink` interface, and update its state -based on a shared informer. For the existing backends to comply, a method will be added that implements the `EnforcedSink` interface. - -Implementing the [attribute interface](https://github.com/kubernetes/kubernetes/tree/master/staging/src/k8s.io/apiserver/pkg/authorization/authorizer/interfaces.go) -based on the Event struct was also explored. This would allow us to keep the existing `Sink` interfaces, however it would -require parsing the request URI twice in the pipeline due to how that field is represented in the Event. This was determined -to not be worth the cost. - -#### Aggregated Servers -Inherently apiserver aggregates and HA apiserver setups will work off the same dynamic configuration object. If separate -audit configuration objects are needed they should be configured as static objects on the node and set through the runtime flags. Aggregated servers will implement the same audit handling mechanisms. A conformance test should be provided as assurance. Metadata level -logging will happen by default at the main api server as it proxies the traffic. The aggregated server will then watch the same -configuration objects and only log on resource types that it handles. This will duplicate the events sent to the receiving servers -so they should not expect to key off `{ Audit-ID x Stage }`. - -### Risks and Mitigations - -#### Privilege Escalation -This does open up the attack surface of the audit mechanisms. Having them strictly configured through the api server has the advantage of limiting the access of those configurations to those that have access to the master node. 
This opens a number of potential attack vectors: - -* privileged user changes audit policy to hide (not audit) malicious actions -* privileged user changes audit policy to DoS audit endpoint (with malintent, or ignorance) -* privileged user changes webhook configuration to hide malicious actions - -As a mitigation strategy policy configured through a static file on the api server will not be accessible through the api. This file ensures that an escalation attack cannot tamper with a root configuration, but works independently of any dynamically configured objects. - -#### Leaked Resources -A user with permissions to create audit policies effectively has read access to the entire cluster (including all secrets data). - -A mitigation strategy will be to document the exposure space granted with this resource. Advice will be provided to only allow access to cluster admin level roles. - -#### Webhook Authentication -With Dynamic Admission control today any authentication mechanism must be provided through a static kubeconfig file on the node. This hinders a lot of the advances in this proposal. All webhooks would require authentication as an unauthenticated endpoint would allow a bad actor to push phony events. Lack of dynamic credential provisioning is problematic to the drop-in extension use case, and difficult to configure. - -The reason for static configuration today is that a single configured credential would have no way of differentiating apiserver replicas or their aggregates. There is a possible mitigation by providing a bound service account token and using the calling server's dns name as the audience. - -It may also be reasonable to provide a dynamic auth configuration from secrets, with the understanding that it is shared by the api servers. - -This needs further discussion. - -#### Performance - -These changes will likely have an O(n) performance impact on the api server per policy. A `DeepCopy` of the event will be -required for each backend. Also, the request/response object would now be serialized on every [request](https://github.com/kubernetes/kubernetes/blob/cef2d325ee1be894e883d63013f75cfac5cb1246/staging/src/k8s.io/apiserver/pkg/audit/request.go#L150-L152). -Benchmark testing will be required to understand the scope of the impact and what optimizations may be required. This impact -is gated by opt-in feature flags, which allows it to move to alpha but these concerns must be tested and reconciled before it -progresses to beta. - -## Graduation Criteria - -Success will be determined by stability of the provided mechanisms and ease of understanding for the end user. - -* alpha: Api server flags can be dynamically configured, known issues are tested and resolved. -* beta: Mechanisms have been hardened against any known bugs and the process is validated by the community - -## Implementation History - -- 05/18/2018: initial design -- 06/13/2018: updated design -- 07/31/2018: dynamic policy addition - -## Alternatives - -### Generalized Dynamic Configuration - -We could strive for all kube-apiserver flags to be able to be dynamically provisioned in a common way. This is likely a large -task and out of the scope of the intentions of this feature. - -### Policy Override - -There has been discussion over whether the policy configured by api server flags should limit the policies configured dynamically. -This would allow a cluster admin to narrowly define what is allowed to be logged by the dynamic configurations. 
While this has upsides -it was ruled out for the following reasons: - -* It would undercut user story #4, the ability to quickly turn up logging when needed -* It could prove difficult to understand, as the policies themselves are fairly complex -* The use of CRDs would be difficult to bound - -The dynamic policy feature is gated by runtime flags. This still provides the cluster provisioner with a means to limit audit logging to the -single runtime object if needed.
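Referring back to the Policy Enforcement section above, the `EnforcedSink` interface could look roughly like the sketch below. The method name, parameter order, and package layout are assumptions for illustration; the proposal only fixes the requirement that a fully formed event and the authorizer attributes are passed in together.

```go
package audit

import (
	auditinternal "k8s.io/apiserver/pkg/apis/audit"
	"k8s.io/apiserver/pkg/authorization/authorizer"
)

// EnforcedSink is a sketch of the interface described in the Policy
// Enforcement section: it receives the fully formed event together with the
// request's authorizer attributes, and each backend is responsible for
// applying its own policy (e.g. via the policy checker's LevelAndStages)
// before emitting the event.
type EnforcedSink interface {
	// ProcessEnforcedEvents trims or drops each event according to the
	// backend's policy and forwards whatever remains. The method name is
	// illustrative only.
	ProcessEnforcedEvents(attrs authorizer.Attributes, events ...*auditinternal.Event)
}
```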
\ No newline at end of file +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-autoscaling/0032-enhance-hpa-metrics-specificity.md b/keps/sig-autoscaling/0032-enhance-hpa-metrics-specificity.md index a5fbe05b..cfd1f5fa 100644 --- a/keps/sig-autoscaling/0032-enhance-hpa-metrics-specificity.md +++ b/keps/sig-autoscaling/0032-enhance-hpa-metrics-specificity.md @@ -1,277 +1,4 @@ ---- -kep-number: 0032 -title: Enhance HPA Metrics Specificity -authors: - - "@directxman12" -owning-sig: sig-autoscaling -participating-sigs: - - sig-instrumentation -reviewers: - - "@brancz" - - "@maciekpytel" -approvers: - - "@brancz" - - "@maciekpytel" - - "@directxman12" -editor: "@directxman12" -creation-date: 2018-04-19 -status: implemented ---- - -# Enhance HPA Metrics Specificity - -## Table of Contents - -* [Table of Contents](#table-of-contents) -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) -* [Proposal](#proposal) - * [Risks and Mitigations](#risks-and-mitigations) -* [Graduation Criteria](#graduation-criteria) -* [Implementation History](#implementation-history) -* [Alternatives](#alternatives) - -## Summary - -The `External` metric source type in the HPA currently supports passing -a metric label selectors, which is passed to the custom metrics API -(custom.metrics.k8s.io) to select a more specific metrics series. This -allows users to more easily make use of existing metrics structure, -without need to manipulate their metrics labeling and ingestion -externally. - -Additionally, it supports the `targetAverageValue` field, which allows -artificially dividing an external metric by the number of replicas in the -target scalable. - -This proposal brings both of those fields to the `Object` metric source -type, and further brings the selector field to the `Pods` metric source -type, making both types more flexible and bringing them in line with the -`External` metrics types. - -## Motivation - -With custom-metrics-based autoscaling, users frequently ask how to select -more specific metrics in their metric storage. For instance, a user might -have message queue statefulset with several queues producing metrics: - -``` -queue_length{statefulset="foo",pod="foo-1",queue="some-jobs"} -queue_length{statefulset="foo",pod="foo-1",queue="other-jobs"} -``` - -Suppose they have a pool of works that they wish to scale for each queue. -In the current-day HPA, it's non-trivial to allow selecting the metric for -a specific queue. Current suggestions are metric-backend-specific (for -instance, you could create a Prometheus recording rule to relabel or -rename the metric), and often involve making external changes to the -metrics pipeline. - -With the addition of the metrics label selector, users could simply select -the queue using the label selector: - -```yaml -- type: Object - object: - describedObject: - kind: StatefulSet - apiVersion: apps/v1 - name: foo - target: - type: Value - value: 2 - metric: - name: queue_length - selector: {matchLabels: {queue: some-jobs}} -``` - -Similarly, in discussions of scaling on queues, being able to divide -a target backlog length by the number of available pods is often useful -- -for instance, a backlog length of 3 might be acceptable if there are three -pods processing items, but not if there is only one. - -### Goals - -- The autoscaling/v2 API is updated with the additional fields - described below. -- A corresponding change is made to the custom metrics API to support the - additional label selector. 
-- The testing adapter is updated to support these changes (for e2e - purposes). - -### Non-Goals - -It is outside of the purview of the KEP to ensure that current custom -metrics adapters support the new changes -- this is up to those adapters -maintainers. - -## Proposal - -The autoscaling/v2 API will be updated the following way: - -```go -type ObjectMetricSource struct { - DescribedObject CrossVersionObjectReference - Target MetricTarget - Metric MetricIdentifier -} - -type PodsMetricSource struct { - Target MetricTarget - Metric MetricIdentifier -} - -type ExternalMetricSouce struct { - Metric MetricIdentifier - Target MetricTarget -} - -type ResourceMetricSource struct { - Name v1.ResourceName - Target MetricTarget -} - -type MetricIdentifier struct { - // name is the name of the given metric - Name string - // selector is the selector for the given metric - // +optional - Selector *metav1.LabelSelector -} - -type MetricTarget struct { - Type MetricTargetType // Utilization, Value, AverageValue - // value is the raw value of the single metric (valid for object metrics) - Value *resource.Quantity - - // averageValue is the raw value or values averaged across the number - // of pods targeted by the HPA (valid for all metric types). - AverageValue *resource.Quantity - - // averageUtilization is the average value (as defined above) as - // a percentage of the corresponding average pod request (valid - // for resource metrics). - AverageUtilization *int32 -} - -// and similarly for the statuses: - -type MetricValueStatus struct { - // value is the current value of the metric (as a quantity). - // +optional - Value *resource.Quantity - // averageValue is the current value of the average of the - // metric across all relevant pods (as a quantity) - // (always reported for resource metrics) - // +optional - AverageValue *resource.Quantity - // currentAverageUtilization is the current value of the average of the - // resource metric across all relevant pods, represented as a percentage of - // the requested value of the resource for the pods. - // +optional - AverageUtilization *int32 -} - -``` - -Notice that the `metricName` field is replaced with a new `metric` field, -which encapsulates both the metric name, and an optional label selector, -which takes the form of a standard kubernetes label selector. - -The `targetXXX` fields are replaced by a unified `Target` field that -contains the different target types. The `target` field in the Object -metric source type is renamed to `describedObject`, since the `target` -field is now taken, and to more accurately describe its purpose. - -The `External` source is updated slightly to match the new form of the -`Pods` and `Object` sources. - -These changes necessitate a second beta of `autoscaling/v2`: -`autoscaling/v2beta2`. - -Similarly, corresponding changes need to be made to the custom metrics -API: - -```go -type MetricValue struct { - metav1.TypeMeta - DescribedObject ObjectReference - - Metric MetricIdentifier - - Timestamp metav1.Time - WindowSeconds *int64 - Value resource.Quantity -} - -type MetricIdentifier struct { - // name is the name of the given metric - Name string - // selector represents the label selector that could be used to select - // this metric, and will generally just be the selector passed in to - // the query used to fetch this metric. - // +optional - Selector *metav1.LabelSelector -} -``` - -This will also require bumping the custom metrics API to -`custom.metrics.k8s.io/v1beta2`. 
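To illustrate how the reshaped fields would surface to users, a `Pods` metric for the motivating queue example might be written as follows in an `autoscaling/v2beta2` manifest. This is a sketch against the proposed types above; the HPA and Deployment names are placeholders.

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: queue-workers
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-workers
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      # metric carries the name plus the new label selector.
      metric:
        name: queue_length
        selector:
          matchLabels:
            queue: some-jobs
      # target replaces the old targetXXX fields.
      target:
        type: AverageValue
        averageValue: "2"
```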
- -**Note that if a metrics pipeline works in such a way that multiple series -are matched by a label selector, it's the metrics adapter's job to deal -with it, similarly to the way things current work with the custom metrics -API.** - -### Risks and Mitigations - -The main risk around this proposal revolves around metric backend support. -When crafting the initial API, there were two constraints: a) limit -ourselves to an API surface that could be limited in adapters without any -additional processing of metrics, and b) avoid creating a new query -language. - -There are currently three adapter implementations (known to SIG -Autoscaling): Prometheus, Stackdriver, and Sysdig. Of those three, both -Prometheus and Stackdriver map nicely to the `name+labels` abstraction, -while Sysdig does not seem to natively have a concept of labels. However, -this simply means that users of sysdig metrics will not make use of labels --- there should be no need for the sysdig adapter to do anything special -with the labels besides ignore them. The "name+label" paradigm also seems -to match nicely with other metric solutions (InfluxDB, DataDog, etc) used -with Kubernetes. - -As for moving closer to a query language, this change is still very -structured and very limitted. It requires no additional parsing logic -(since it uses standard kubernetes label selectors), and translation to -underlying APIs and query languages should be relatively simple. - -## Graduation Criteria - -In general, we'll want to graduate the autoscaling/v2 and -custom.metrics.k8s.io APIs to GA once we have a release with at least one -adapter up to date, and positive user feedback that does not suggest -urgent need for further changes. - -## Implementation History - -- (2018/4/19) Proposal proposed -- (2018/8/27) Implementation (kubernetes/kubernetes#64097) merged for Kubernetes 1.12 - -## Alternatives - -- Continuing to require out-of-band changes to support more complex metric - environments: this induces a lot of friction with traditional - Prometheus-style monitoring setups, which favor selecting on labels. - Furthermore, the changes required often involve admin intervention, - which is not always simple or scalable in larger environments. - -- Allow passing full queries instead of metric names: this would make the - custom metrics API significantly more scalable, at the cost of adapter - complexity, security issues, and lesser portability. Effectively, - adapters would have to implement query rewriting to inject extra labels - in to scope metrics down to their target objects, which could in turn - cause security issues. Additionally, it makes it a lot hard to port the - HPAs between different metrics solutions. +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-autoscaling/README.md b/keps/sig-autoscaling/README.md index d29c197f..cfd1f5fa 100644 --- a/keps/sig-autoscaling/README.md +++ b/keps/sig-autoscaling/README.md @@ -1,3 +1,4 @@ -# SIG Autoscaling KEPs - -This directory contains KEPs related to SIG Autoscaling. +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-aws/20181126-aws-k8s-tester.md b/keps/sig-aws/20181126-aws-k8s-tester.md index b9cc17d1..cfd1f5fa 100644 --- a/keps/sig-aws/20181126-aws-k8s-tester.md +++ b/keps/sig-aws/20181126-aws-k8s-tester.md @@ -1,101 +1,4 @@ ---- -title: aws-k8s-tester -authors: - - "@gyuho" -owning-sig: sig-aws -reviewers: - - "@d-nishi" - - "@shyamjvs" -approvers: - - "@d-nishi" - - "@shyamjvs" -editor: TBD -creation-date: 2018-11-26 -last-updated: 2018-11-29 -status: provisional ---- - -# aws-k8s-tester - kubetest plugin for AWS and EKS - -## Table of Contents - -* [Table of Contents](#table-of-contents) -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) -* [Proposal](#proposal) - * [User Stories](#user-stories) - * [Kubernetes E2E test workflow: upstream, Prod EKS builds](#kubernetes-e2e-test-workflow-upstream-prod-eks-builds) - * [Sub-project E2E test workflow: upstream, ALB Ingress Controller](#sub-project-e2e-test-workflow-upstream-alb-ingress-controller) - * [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints) - * [Risks and Mitigations](#risks-and-mitigations) -* [Graduation Criteria](#graduation-criteria) -* [Implementation History](#implementation-history) - -## Summary - -All e2e tests maintained by AWS uses `aws-k8s-tester` as a plugin to kubetest. `aws-k8s-tester` runs various Kubernetes testing operations (e.g. create a temporary EKS cluster) mainly to implement [`kubernetes/test-infra/kubetest.deployer`](https://github.com/kubernetes/test-infra/blob/40b4010f8e38582a5786adedd4e04cf4e1fc5a36/kubetest/main.go#L222-L229) interface. - -## Motivation - -Many AWS tests are run by community projects such as [kops](https://github.com/kubernetes/kops), which does not test EKS. Does EKS Service check out-of-range NodePort? Does EKS Service prevent NodePort collisions? Can EKS support 5,000 nodes? There were thousands more to cover. Incomplete test coverage feeds production issues: one component failure may evolve into cluster-wide outage, Kubernetes CSI driver might be incompatible with Amazon EBS, customers may experience scalability problems from untested features, etc. Complete test coverage will unearth such issues beforehand, which leads to better customer experience. This work alone will make a huge impact on improving EKS reliability and its release cadence. - -### Goals - -The following key features are in scope: - -* EKS cluster creation/teardown -* Test open-source Kubernetes distribution (e.g. wrap kubeadm to run Kubernetes e2e tests) -* Create AWS resources to test AWS sub-projects (e.g. create EC2 instances to test CSI driver) - -These are the key design principles that guides EKS testing development: - -- *Platform Uniformity*: EKS provides a native Kubernetes experience. To keep two platforms in sync, EKS must be tested with upstream Kubernetes. Whenever a new feature is added to Kubernetes, customer should assume it will be tested against EKS. For example, encryption provider feature was added to Kubernetes 1.10 as an alpha feature, but customers have to wait until it becomes a stable feature. Rigorous test coverage will enable more new features to customers at earliest. -- *Maximize Productivity*: Test automation is essential to EKS development productivity at scale. EKS team should do the minimum amount of work possible to upgrade Kubernetes (including etcd). 
To this end, pre-prod and prod EKS builds will be continuously tested for every PR created in upstream Kubernetes. For example, etcd client upgrade in API server must be tested against EKS control plane components. If the upgrade test fails EKS, we should block the change. -- *Transparency for Community*: We want to contribute EKS tests to upstream and make test results visible to the whole communities. Users should be able to see how EKS performs with 5,000 worker nodes and compare it with other providers, just by looking at upstream performance dashboard. - -### Non-Goals - -* The project does not replace kops or kubeadm. -* This project is only meant for testing. - -## Proposal - -### User Stories - -#### Kubernetes E2E test workflow: upstream, Prod EKS builds - -EKS uses `aws-k8s-tester` as a plugin to kubetest. `aws-k8s-tester` is a broker that creates and deletes AWS resources on behalf of kubetest, connects to pre-prod EKS clusters, reports test results back to dashboards, etc. Every upstream change will be tested against EKS cluster. - -Figure 1 shows how AWS would run Kubernetes e2e tests inside EKS (e.g. ci-kubernetes-e2e-aws-eks). - - - -#### Sub-project E2E test workflow: upstream, ALB Ingress Controller - -Let's take ALB Ingress Controller for example. Since Kubernetes cluster is a prerequisite to ALB Ingress Controller, `aws-k8s-tester` first creates EKS cluster. Then ALB Ingress Controller plug-in deploys and creates Ingress objects, with sample web server and client. awstester is configured through YAML rather than POSIX flags. This makes it easier to implement sub-project add-ons (e.g. add “alb-ingress-controller” field to set up ingress add-on). Cluster status, ingress controller states, and testing results are persisted to disk for status report and debugging purposes. - -Figure 2 shows how `aws-k8s-tester` plugin creates and tests ALB Ingress Controller. - - - -### Implementation Details/Notes/Constraints - -We implement kubetest plugin, out-of-tree and provided as a single binary file. Separate code base speeds up development and makes dependency management easier. For example, kops in kubetest uses AWS SDK [v1.12.53](https://github.com/aws/aws-sdk-go/releases/tag/v1.12.53), which was released at December 2017. Upgrading SDK to latest would break existing kops. Packaging everything in a separate binary gives us freedom to choose whatever SDK version we need. - -### Risks and Mitigations - -* *“aws-k8s-tester” creates a key-pair using EC2 API. Is the private key safely managed?* Each test run creates a temporary key pair and stores the private key on disk. The private key is used to SSH access into Kubernetes worker nodes and read service logs from the EC2 instance. “aws-k8s-tester” safely [deletes the private key on disk](https://github.com/aws/aws-k8s-tester/blob/cde0484f0ae167d8831442a48b4b5e447481af45/internal/ec2/key_pair.go#L65) and [destroys all associated AWS resources](https://github.com/aws/aws-k8s-tester/blob/cde0484f0ae167d8831442a48b4b5e447481af45/internal/ec2/key_pair.go#L71-L73), whether the test completes or get interrupted. For instance, when it deletes the key pair object from EC2, the public key is also deleted, which means the local private key has no use for any threat. -* *Does “aws-k8s-tester” store any sensitive information?* “aws-k8s-tester” maintains a test cluster state in [`ClusterState`](https://godoc.org/github.com/aws/awstester/eksconfig#ClusterState), which is periodically synced to local disk and S3. 
It does not contain any sensitive data such as private key blobs. -* *Upstream Kubernetes test-infra team mounts our AWS test credential to their Prow cluster. Can anyone access the credential?* Upstream Prow cluster schedules all open-source Kubernetes test runs. In order to test EKS from upstream Kubernetes, AWS credential must be accessible from each test job. Currently, it is mounted as a Secret object (https://kubernetes.io/docs/concepts/configuration/secret/) in upstream Prow cluster. Which means our AWS credential is still stored as base64-encoded plaintext in etcd. Then, there are two ways to access this data. One is to read from Prow testing pod (see [test-infra/PR#9940](https://github.com/kubernetes/test-infra/pull/9940/files)). In theory, any test job has access to “eks-aws-credentials” secret object, thus can maliciously mount it to steal the credential. In practice, every single job needs an approval before it runs any tests. So, if anybody tries to exploit the credential, the change should be rejected beforehand. Two, read the non-encrypted credential data from etcd. This is unlikely as well. We can safely assume that Google GKE deploys etcd in a trusted environment, where the access is restricted to Google test-infra team. See https://kubernetes.io/docs/concepts/configuration/secret/#risks for more. - -## Graduation Criteria - -`aws-k8s-tester` will be considered successful when it is used by the majority of AWS Kubernetes e2e tests. - -## Implementation History - -* Initial integration with upstream has been tracked -* Initial proposal to SIG 2018-11-26 -* Initial KEP draft 2018-11-26 +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-aws/20181127-aws-ebs-csi-driver.md b/keps/sig-aws/20181127-aws-ebs-csi-driver.md index 81cb1e7b..cfd1f5fa 100644 --- a/keps/sig-aws/20181127-aws-ebs-csi-driver.md +++ b/keps/sig-aws/20181127-aws-ebs-csi-driver.md @@ -1,73 +1,4 @@ ---- -title: aws-ebs-csi-driver -authors: - - "@leakingtapan" -owning-sig: sig-aws -reviewers: - - "@d-nishi" - - "@jsafrane" -approvers: - - "@d-nishi" - - "@jsafrane" -editor: TBD -creation-date: 2018-11-27 -last-updated: 2018-11-27 -status: provisional ---- - -# AWS Elastic Block Store (EBS) CSI Driver - -## Table of Contents - -* [Table of Contents](#table-of-contents) -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) -* [Proposal](#proposal) - * [User Stories](#user-stories) - * [Static Provisioning](#static-provisioning) - * [Volume Schduling](#volume-scheduling) - * [Risks and Mitigations](#risks-and-mitigations) -* [Graduation Criteria](#graduation-criteria) -* [Implementation History](#implementation-history) - -## Summary -AWS EBS CSI Driver implements [Container Storage Interface](https://github.com/container-storage-interface/spec/tree/master) which is the standard of storage interface for container. It provides the same in-tree AWS EBS plugin features including volume creation, volume attachment, volume mounting and volume scheduling. It is also configurable on what is the EBS volume type to create, what is the file system file should be formatted, which KMS key to use to create encrypted volume, etc. - -## Motivation -Similar to CNI plugins, AWS EBS CSI driver will be a stand alone plugin that lives out-of-tree of kuberenetes. Being out-of-tree, it will be benefit from being modularized, maintained and optimized without affecting kubernetes core code base. Aside from those benefits, it could also be consumed by other container orchestrators such as ECS. - -### Goals -AWS EBS CSI driver will provide similar user experience as in-tree EBS plugin: -* As an application developer, he will not even notice any difference between EBS CSI driver and in-tree plugin. His workflow will stay the same as current. -* As an infrastructure operator, he just need to create/update storage class to use CSI driver to manage underlying storage backend. - -List of driver features include volume creation/deletion, volume attach/detach, volume mount/unmount, volume scheduling, create volume configurations, volume snapshotting, mount options, raw block volume, etc. - -### Non-Goals -* Supporting non AWS block storage -* Supporting other AWS storage serivces such as Dynamodb, S3, etc. - -## Proposal - -### User Stories - -#### Static Provisioning -Operator creates a pre-created EBS volume on AWS and a PV that refer the EBS volume on cluster. Developer creates PVC and a Pod that uses the PVC. Then developer deploys the Pod during which time the PV will be attached to container inside Pod after PVC bonds to PV successfully. - -#### Volume Scheduling -Operation creates StorageClass with volumeBindingMode = WaitForFirstConsumer. When developer deploys a Pod that has PVC that is trying to claim for a PV, a new PV will be created, attached, formatted and mounted inside Pod's container by the EBS CSI driver. Topology information provided by EBS CSI driver will be used during Pod scheduling to guarantee that both Pod and volume are collocated in the same availability zone. 
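A minimal sketch of the StorageClass described in the volume scheduling story is shown below. The provisioner name `ebs.csi.aws.com` and the `type` parameter are assumptions for illustration and are not fixed by this KEP.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-csi-gp2
# Assumed CSI driver registration name; not defined in this KEP.
provisioner: ebs.csi.aws.com
parameters:
  # EBS volume type, passed through to the driver.
  type: gp2
# Delay binding so the volume is created in the zone of the scheduled Pod.
volumeBindingMode: WaitForFirstConsumer
```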
- -### Risks and Mitigations -* *Information disclosure* - The AWS EBS CSI driver requires permission to perform AWS operations on users' behalf. The driver will make sure none of the credentials are logged, and we will instruct users to grant only the required permissions to the driver as a security best practice. -* *Escalation of Privileges* - Since the EBS CSI driver formats and mounts volumes, it requires root privileges to perform these operations, so the driver will have higher privileges than other containers in the cluster. The driver will not execute arbitrary commands provided by untrusted users. All of its interfaces are only provided for Kubernetes system components to interact with. The driver will also validate requests to make sure they align with its assumptions. - -## Graduation Criteria -The AWS EBS CSI driver provides the same features as the in-tree plugin. - -## Implementation History -* 2018-11-26 Initial proposal to SIG -* 2018-11-26 Initial KEP draft -* 2018-12-03 Alpha release with Kubernetes 1.13 - +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-aws/README.md b/keps/sig-aws/README.md index 525cb618..cfd1f5fa 100644 --- a/keps/sig-aws/README.md +++ b/keps/sig-aws/README.md @@ -1,3 +1,4 @@ -# SIG AWS KEPs - -This directory contains KEPs related to [SIG AWS](/sig-aws) +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-aws/aws-lb-prefix-annotation.md b/keps/sig-aws/aws-lb-prefix-annotation.md index 433e2a4b..cfd1f5fa 100644 --- a/keps/sig-aws/aws-lb-prefix-annotation.md +++ b/keps/sig-aws/aws-lb-prefix-annotation.md @@ -1,65 +1,4 @@ ---- -kep-number: TBD -title: AWS LoadBalancer Prefix -authors: - - "@minherz" -owning-sig: sig-aws -participating-sigs: -reviewers: - - TBD -approvers: - - TBD -editor: TBD -creation-date: 2018-11-02 -last-updated: 2018-11-02 -status: provisional -see-also: -replaces: -superseded-by: ---- - -# AWS LoadBalancer Prefix Annotation Proposal - -## Table of Contents - -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) -* [Proposal](#proposal) - * [User Stories [optional]](#user-stories) - * [Implementation Details/Notes/Constraints [optional]](#implementation-detailsnotesconstraints-optional) - * [Risks and Mitigations](#risks-and-mitigations) -* [Graduation Criteria](#graduation-criteria) -* [Implementation History](#implementation-history) -* [Drawbacks [optional]](#drawbacks-optional) -* [Alternatives [optional]](#alternatives-optional) - -## Summary -AWS load balancer prefix annotation adds a control over the naming of the AWS ELB resources that are being generated when provisioning a Kubernetes service of type `LoadBalancer`. The current implementation provisions AWS ELB with a unique name based on the resource UID. The resulted unpredicted name makes it impossible to integrate the provisioning with existing IAM policies in situations when these two operations are controlled by two different groups. For example, IAM policies are defined and controlled by InfoSec team while provisioning of resources is under CloudOps team. The AWS IAM policies allow definition when only a prefix of the resource identifier is known. Using Kubernetes service with this annotation when it is provisioned in AWS, will allow an integration with existing IAM policies. - -## Motivation -Current way of provisioning load balancer (for a Kubernetes service of the type `LoadBalancer`) is to use the service's UID and to follow Cloud naming conventions for load balancers (for AWS it is a 32 character sequence of alphanumeric characters or hyphens that cannot begin or end with hypen [link1](https://docs.aws.amazon.com/elasticloadbalancing/2012-06-01/APIReference/API_CreateLoadBalancer.html), [link2](https://docs.aws.amazon.com/cli/latest/reference/elbv2/create-load-balancer.html)). When it is provisioned on AWS account with predefined IAM policies that limit access to ELB resources using wildcarded paths (IAM identifiers), the Kubernetes service cannot be provisioned. Providing a way to define a short known prefix to ELB resource makes it possible to match IAM policies conditions regarding the resource identifiers. - -### Goals -* Support provisioning of AWS ELB resources for Kubernetes services of the type `LoadBalancer` that match AWS IAM policies -### Non-Goals -* Provide meaningful names for AWS ELB resources generated for Kubernetes services of the type `LoadBalancer` - -## Proposal - -### User Stories [optional] - -### Implementation Details/Notes/Constraints [optional] - -### Risks and Mitigations - -## Graduation Criteria - -## Implementation History - -## Drawbacks [optional] - -## Alternatives [optional] - -## Infrastructure Needed [optional]
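The proposal sections above are still placeholders, so purely as an illustration of the usage the Summary describes, a Service might carry a prefix annotation along the following lines. The annotation key is hypothetical; the KEP has not settled on a name.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
  annotations:
    # Hypothetical annotation key -- the KEP does not define the final name.
    service.beta.kubernetes.io/aws-load-balancer-name-prefix: "team-a"
spec:
  type: LoadBalancer
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 8080
```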
\ No newline at end of file +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-aws/draft-20181127-aws-alb-ingress-controller.md b/keps/sig-aws/draft-20181127-aws-alb-ingress-controller.md index a98b00ca..cfd1f5fa 100644 --- a/keps/sig-aws/draft-20181127-aws-alb-ingress-controller.md +++ b/keps/sig-aws/draft-20181127-aws-alb-ingress-controller.md @@ -1,80 +1,4 @@ ---- -kep-number: draft-20181127 -title: AWS ALB Ingress Controller -authors: - - "@M00nF1sh" -owning-sig: sig-aws -reviewers: - - TBD - - "@d-nishi" -approvers: - - TBD - - "@d-nishi" -editor: TBD -creation-date: 2018-11-27 -last-updated: 2018-11-27 -status: provisional ---- - -# AWS ALB Ingress Controller - -## Table of Contents -- [Table of Contents](#table-of-contents) -- [Summary](#summary) -- [Motivation](#motivation) - - [Goals](#goals) - - [Non-Goals](#non-goals) -- [Proposal](#proposal) - - [User Stories](#user-stories) - - [Expose HTTP[s] API backed by kubernetes services](#expose-https-api-backed-by-kubernetes-services) - - [Adjust ALB settings via annotation](#adjust-alb-settings-via-annotation) - - [Leverage WAF & Cognito](#leverage-waf--cognito) - - [Sharing single ALB among Ingresses across namespace](#sharing-single-alb-among-ingresses-across-namespace) -- [Graduation Criteria](#graduation-criteria) -- [Implementation History](#implementation-history) - -## Summary - -This proposal introduces [AWS ALB Ingress Controller](https://github.com/kubernetes-sigs/aws-alb-ingress-controller/) as Ingress controller for kubernetes cluster on AWS. Which use [Amazon Elastic Load Balancing Application Load Balancer](https://aws.amazon.com/elasticloadbalancing/features/#Details_for_Elastic_Load_Balancing_Products)(ALB) to fulfill [Ingress resources](https://kubernetes.io/docs/concepts/services-networking/ingress/), and provides integration with various AWS services. - -## Motivation - -In order for the Ingress resource to work, the cluster must have an Ingress controller runnings. However, existing Ingress controllers like [nginx](https://github.com/kubernetes/ingress-nginx/blob/master/README.md) didn't take advantage of native AWS features. -AWS ALB Ingress Controller aims to enhance Ingress resource on AWS by leveraging rich feature set of ALB, such as host/path based routing, TLS termination, WebSockets, HTTP/2. Also, it will provide close integration with other AWS services such as WAF(web application firewall) and Cognito. - -### Goals - -* Support running multiple Ingress controllers in cluster -* Support portable Ingress resource(no annotations) -* Support leverage feature set of ALB via custom annotations -* Support integration with WAF -* Support integration with Cognito - -### Non-Goals - -* This project does not replacing nginx ingress controller - -## Proposal - -### User Stories - -#### Expose HTTP[s] API backed by kubernetes services -Developers create an Ingress resources to specify rules for how to routing HTTP[s] traffic to different services. -AWS ALB Ingress Controller will monitor such Ingress resources and create ALB and other necessary supporting AWS resources to match the Ingress resource specification. - -#### Adjust ALB settings via annotation -Developers specifies custom annotations on their Ingress resource to adjust ALB settings, such as enable deletion protection, enable access logs to specific S3 bucket. - -#### Leverage WAF & Cognito -Developers specifies custom annotations on their Ingress resource to denote WAF and Cognito integrations. Which provides web application firewall and authentication support for their exposed API. 
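A hedged sketch of how the first two stories combine in practice is shown below; the annotation keys are assumed from the controller's documentation and may differ, and the service names are placeholders.

```yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: api
  annotations:
    # Route this Ingress to the ALB Ingress Controller.
    kubernetes.io/ingress.class: alb
    # ALB setting adjusted via annotation (story 2).
    alb.ingress.kubernetes.io/scheme: internet-facing
spec:
  rules:
  - http:
      paths:
      # Path-based routing to different backing services (story 1).
      - path: /orders
        backend:
          serviceName: orders
          servicePort: 80
      - path: /users
        backend:
          serviceName: users
          servicePort: 80
```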
- -#### Sharing single ALB among Ingresses across namespace -Developers from different teams create Ingress resources in different namespaces that route traffic to services within their own namespaces. However, a single ALB is shared by these Ingresses to expose a single DNS name to customers. - -## Graduation Criteria - -* AWS ALB Ingress Controller is widely used as the Ingress controller for Kubernetes clusters on AWS - -## Implementation History -- [community#2841](https://github.com/kubernetes/community/pull/2841) Design proposal -- [aws-alb-ingress-controller#738](https://github.com/kubernetes-sigs/aws-alb-ingress-controller/pull/738) First stable release: v1.0.0
\ No newline at end of file +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-azure/0018-20180711-azure-availability-zones.md b/keps/sig-azure/0018-20180711-azure-availability-zones.md index 46b17daa..cfd1f5fa 100644 --- a/keps/sig-azure/0018-20180711-azure-availability-zones.md +++ b/keps/sig-azure/0018-20180711-azure-availability-zones.md @@ -1,292 +1,4 @@ ---- -kep-number: 18 -title: Azure Availability Zones -authors: - - "@feiskyer" -owning-sig: sig-azure -participating-sigs: - - sig-azure - - sig-storage -reviewers: - - name: "@khenidak" - - name: "@colemickens" -approvers: - - name: "@brendanburns" -editor: - - "@feiskyer" -creation-date: 2018-07-11 -last-updated: 2018-09-29 -status: implementable ---- - -# Azure Availability Zones - -## Table of Contents - -- [Azure Availability Zones](#azure-availability-zones) - - [Summary](#summary) - - [Scopes and Non-scopes](#scopes-and-non-scopes) - - [Scopes](#scopes) - - [Non-scopes](#non-scopes) - - [AZ label format](#az-label-format) - - [Cloud provider options](#cloud-provider-options) - - [Node registration](#node-registration) - - [Get by instance metadata](#get-by-instance-metadata) - - [Get by Go SDK](#get-by-go-sdk) - - [LoadBalancer and PublicIP](#loadbalancer-and-publicip) - - [AzureDisk](#azuredisk) - - [PVLabeler](#pvlabeler) - - [PersistentVolumeLabel](#persistentvolumelabel) - - [StorageClass](#storageclass) - - [Appendix](#appendix) - -## Summary - -This proposal aims to add [Azure Availability Zones (AZ)](https://azure.microsoft.com/en-us/global-infrastructure/availability-zones/) support to Kubernetes. - -## Scopes and Non-scopes - -### Scopes - -The proposal includes required changes to support availability zones for various functions in Azure cloud provider and AzureDisk volumes: - -- Detect availability zones automatically when registering new nodes (by kubelet or node controller) and node's label `failure-domain.beta.kubernetes.io/zone` will be replaced with AZ instead of fault domain -- LoadBalancer and PublicIP will be provisioned with zone redundant -- `GetLabelsForVolume` interface will be implemented for Azure managed disks so that PV label controller in cloud-controller-manager can appropriately add `Labels` and `NodeAffinity` to the Azure managed disk PVs. Additionally, `PersistentVolumeLabel` admission controller will be enhanced to achieve the same for Azure managed disks. -- Azure Disk's `Provision()` function will be enhanced to take into account the zone of the node as well as `allowedTopologies` when determining the zone to create a disk in. - -> Note that unlike most cases, fault domain and availability zones mean different on Azure: -> -> - A Fault Domain (FD) is essentially a rack of servers. It consumes subsystems like network, power, cooling etc. -> - Availability Zones are unique physical locations within an Azure region. Each zone is made up of one or more data centers equipped with independent power, cooling, and networking. -> -> An Availability Zone in an Azure region is a combination of a fault domain and an update domain (Same like FD, but for updates. When upgrading a deployment, it is carried out one update domain at a time). For example, if you create three or more VMs across three zones in an Azure region, your VMs are effectively distributed across three fault domains and three update domains. - -### Non-scopes - -Provisioning Kubernetes masters and nodes with availability zone support is not included in this proposal. It should be done in the provisioning tools (e.g. acs-engine). 
Azure cloud provider will auto-detect the node's availability zone if `availabilityZones` option is configured for the Azure cloud provider. - -## AZ label format - -Currently, Azure nodes are registered with label `failure-domain.beta.kubernetes.io/zone=faultDomain`. - -The format of fault domain is numbers (e.g. `1` or `2`), which is in same format with AZ (e.g. `1` or `3`). If AZ is using same format with faultDomain, then there'll be scheduler issues for clusters with both AZ and non-AZ nodes. So AZ will use a different format in kubernetes: `<region>-<AZ>`, e.g. `centralus-1`. - -The AZ label will be applied in multiple Kubernetes resources, e.g. - -- Nodes -- AzureDisk PersistentVolumes -- AzureDisk StorageClass - -## Cloud provider options - -Because only standard load balancer is supported with AZ, it is a prerequisite to enable AZ for the cluster. - -Standard load balancer has been added in Kubernetes v1.11, related options include: - -| Option | Default | **AZ Value** | Releases | Notes | -| --------------------------- | ------- | ------------- | -------- | ------------------------------------- | -| loadBalancerSku | basic | **standard** | v1.11 | Enable standard LB | -| excludeMasterFromStandardLB | true | true or false | v1.11 | Exclude master nodes from LB backends | - -These options should be configured in Azure cloud provider configure file (e.g. `/etc/kubernetes/azure.json`): - -```json -{ - ..., - "loadBalancerSku": "standard", - "excludeMasterFromStandardLB": true -} -``` - -Note that with standard SKU LoadBalancer, `primaryAvailabitySetName` and `primaryScaleSetName` is not required because all available nodes (with configurable masters via `excludeMasterFromStandardLB`) are added to LoadBalancer backend pools. - -## Node registration - -When registering new nodes, kubelet (with build in cloud provider) or node controller (with external cloud provider) automatically adds labels to them with region and zone information: - -- Region: `failure-domain.beta.kubernetes.io/region=centralus` -- Zone: `failure-domain.beta.kubernetes.io/zone=centralus-1` - -```sh -$ kubectl get nodes --show-labels -NAME STATUS AGE VERSION LABELS -kubernetes-node12 Ready 6m v1.11 failure-domain.beta.kubernetes.io/region=centralus,failure-domain.beta.kubernetes.io/zone=centralus-1,... -``` - -Azure cloud providers sets fault domain for label `failure-domain.beta.kubernetes.io/zone` today. With AZ enabled, we should set the node's availability zone instead. To keep backward compatibility and distinguishing from fault domain, `<region>-<AZ>` is used here. - -The node's zone could get by ARM API or instance metadata. This will be added in `GetZoneByProviderID()` and `GetZoneByNodeName()`. - -### Get by instance metadata - -This method is used in kube-controller-manager. - -```sh -# Instance metadata API should be upgraded to 2017-12-01. -$ curl -H Metadata:true "http://169.254.169.254/metadata/instance/compute/zone?api-version=2017-12-01&format=text" -2 -``` - -### Get by Go SDK - -This method is used in cloud-controller-manager. - -No `zones` property is included in `VirtualMachineScaleSetVM` yet in Azure Go SDK (including latest 2018-04-01 compute API). - -We need to ask Azure Go SDK to add `zones` for `VirtualMachineScaleSetVM`. Opened the issue https://github.com/Azure/azure-sdk-for-go/issues/2183 for tracking it. - -> Note: there's already `zones` property in `VirtualMachineScaleSet`, `VirtualMachine` and `Disk`. 
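As an illustrative sketch (not the actual `GetZoneByProviderID()`/`GetZoneByNodeName()` implementation), the zone value obtained from ARM or instance metadata could be folded into the `<region>-<AZ>` label format as follows; the helper name is assumed for illustration.

```go
package azure

import (
	"fmt"
	"strings"

	"k8s.io/kubernetes/pkg/cloudprovider"
)

// availabilityZone builds the "<region>-<AZ>" zone label described above,
// e.g. ("centralus", "1") -> "centralus-1". Illustrative sketch only.
func availabilityZone(region, zoneID string) cloudprovider.Zone {
	return cloudprovider.Zone{
		FailureDomain: fmt.Sprintf("%s-%s", strings.ToLower(region), zoneID),
		Region:        strings.ToLower(region),
	}
}
```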
- -## LoadBalancer and PublicIP - -LoadBalancer with standard SKU will be created and all available nodes (including VirtualMachines and VirtualMachineScaleSetVms, together with optional masters configured via excludeMasterFromStandardLB) are added to LoadBalancer backend pools. - -PublicIPs will also be created with standard SKU, and they are zone redundant by default. - -Note that zonal PublicIPs are not supported. We may add this easily if there’re clear use-cases in the future. - -## AzureDisk - -When Azure managed disks are created, the `PersistentVolumeLabel` admission controller or PV label controller automatically adds zone labels and node affinity to them. The scheduler (via `VolumeZonePredicate` or `PV.NodeAffinity`) will then ensure that pods that claim a given volume are only placed into the same zone as that volume, as volumes cannot be attached across zones. In addition, admission controller - -Note that - -- Only managed disks are supported. Blob disks don't support availability zones on Azure. -- Node affinity is enabled by feature gate `VolumeScheduling`. - -### PVLabeler interface - -To setup AzureDisk's zone label correctly (required by cloud-controller-manager's PersistentVolumeLabelController), Azure cloud provider's [PVLabeler](https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/cloud.go#L212) interface should be implemented: - -```go -// PVLabeler is an abstract, pluggable interface for fetching labels for volumes -type PVLabeler interface { - GetLabelsForVolume(ctx context.Context, pv *v1.PersistentVolume) (map[string]string, error) -} -``` - -It should return the region and zone for the AzureDisk, e.g. - -- `failure-domain.beta.kubernetes.io/region=centralus` -- `failure-domain.beta.kubernetes.io/zone=centralus-1` - -so that the PV will be created with labels: - -```sh -$ kubectl get pv --show-labels -NAME CAPACITY ACCESSMODES STATUS CLAIM REASON AGE LABELS -pv-managed-abc 5Gi RWO Bound default/claim1 46s failure-domain.beta.kubernetes.io/region=centralus,failure-domain.beta.kubernetes.io/zone=centralus-1 -``` - -### PersistentVolumeLabel admission controller - -Cloud provider's `PVLabeler` interface is only applied when cloud-controller-manager is used. For build in Azure cloud provider, [PersistentVolumeLabel](https://github.com/kubernetes/kubernetes/blob/master/plugin/pkg/admission/storage/persistentvolume/label/admission.go) admission controller should also updated with AzureDisk support, so that new PVs could also be applied with above labels. - -```go -func (l *persistentVolumeLabel) Admit(a admission.Attributes) (err error) { - ... - if volume.Spec.AzureDisk != nil { - labels, err := l.findAzureDiskLabels(volume) - if err != nil { - return admission.NewForbidden(a, fmt.Errorf("error querying AzureDisk volume %s: %v", volume.Spec.AzureDisk.DiskName, err)) - } - volumeLabels = labels - } - ... -} -``` - -> Note: the PersistentVolumeLabel admission controller will be deprecated, and cloud-controller-manager is preferred after its GA (probably v1.13 or v1.14). - -### StorageClass - -Note that the above interfaces are only applied to AzureDisk persistent volumes, not StorageClass. For AzureDisk StorageClass, we should add a few new options for zone-aware and [topology-aware](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/volume-topology-scheduling.md) provisioning. The following three new options will be added in AzureDisk StorageClass: - -- `zoned`: indicates whether new disks are provisioned with AZ. 
Default is `true`. -- `zone` and `zones`: indicates which zones should be used to provision new disks (zone-aware provisioning). Only can be set if `zoned` is not false and `allowedTopologies` is not set. -- `allowedTopologies`: indicates which topologies are allowed for [topology-aware](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/volume-topology-scheduling.md) provisioning. Only can be set if `zoned` is not false and `zone`/`zones` are not set. - -An example of zone-aware provisioning storage class is: - -```yaml -apiVersion: storage.k8s.io/v1 -kind: StorageClass -metadata: - annotations: - labels: - kubernetes.io/cluster-service: "true" - name: managed-premium -parameters: - kind: Managed - storageaccounttype: Premium_LRS - # only one of zone and zones are allowed - zone: "centralus-1" - # zones: "centralus-1,centralus-2,centralus-3" -provisioner: kubernetes.io/azure-disk -``` - -Another example of topology-aware provisioning storage class is: - -```yaml -apiVersion: storage.k8s.io/v1 -kind: StorageClass -metadata: - annotations: - labels: - kubernetes.io/cluster-service: "true" - name: managed-premium -parameters: - kind: Managed - storageaccounttype: Premium_LRS -provisioner: kubernetes.io/azure-disk -allowedTopologies: -- matchLabelExpressions: - - key: failure-domain.beta.kubernetes.io/zone - values: - - centralus-1 - - centralus-2 -``` - -AzureDisk can only be created with one specific zone, so if multiple zones are specified in the storage class, then new disks will be provisioned with zone chosen by following rules: - -- If `DynamicProvisioningScheduling` is enabled and `VolumeBindingMode: WaitForFirstConsumer` is specified in the storage class, zone of the disk should be set to the zone of the node passed to `Provision()`. Specifying zone/zones in storage class should be considered an error in this scenario. -- If `DynamicProvisioningScheduling` is enabled and `VolumeBindingMode: WaitForFirstConsumer` is not specified in StorageClass, zone of disk should be chosen from `allowedTopologies` or zones depending on which is specified. Specifying both `allowedTopologies` and `zones` should lead to error. -- If `DynamicProvisioningScheduling` is disabled and `zones` are specified, then the zone maybe arbitrarily chosen as specified by arbitrarily choosing from the zones specified in the storage class. -- If `DynamicProvisioningScheduling` is disabled and no zones are specified and `zoned` is `true`, then new disks will be provisioned with zone chosen by round-robin across all active zones, which means - - If there are no zoned nodes, then an `no zoned nodes` error will be reported - - Zoned AzureDisk will only be provisioned when there are zoned nodes - - If there are multiple zones, then those zones are chosen by round-robin - -Note that - -- active zones means there're nodes in that zone. -- there are risks if the cluster is running with both zoned and non-zoned nodes. In such case, zoned AzureDisk can't be attached to non-zoned nodes. So - - new pods with zoned AzureDisks are always scheduled to zoned nodes - - old pods using non-zoned AzureDisks can't be scheduled to zoned nodes - -So if users are planning to migrate workloads to zoned nodes, old AzureDisks should be recreated (probably backup first and restore to the new one). - -## Implementation History - -- [kubernetes#66242](https://github.com/kubernetes/kubernetes/pull/66242): Adds initial availability zones support for Azure nodes. 
-- [kubernetes#66553](https://github.com/kubernetes/kubernetes/pull/66553): Adds avaialability zones support for Azure managed disks. -- [kubernetes#67121](https://github.com/kubernetes/kubernetes/pull/67121): Adds DynamicProvisioningScheduling and VolumeScheduling support for Azure managed disks. -- [cloud-provider-azure#57](https://github.com/kubernetes/cloud-provider-azure/pull/57): Adds documentation for Azure availability zones. - -## Appendix - -Kubernetes will automatically spread the pods in a replication controller or service across nodes in a single-zone cluster (to reduce the impact of failures). - -With multiple-zone clusters, this spreading behavior is extended across zones (to reduce the impact of zone failures.) (This is achieved via `SelectorSpreadPriority`). This is a best-effort placement, and so if the zones in your cluster are heterogeneous (e.g. different numbers of nodes, different types of nodes, or different pod resource requirements), this might prevent perfectly even spreading of your pods across zones. If desired, you can use homogeneous zones (same number and types of nodes) to reduce the probability of unequal spreading. - -There's also some [limitations of availability zones of various Kubernetes functions](https://kubernetes.io/docs/setup/multiple-zones/#limitations), e.g. - -- No zone-aware network routing -- Volume zone-affinity will only work with a `PersistentVolume`, and will not work if you directly specify an AzureDisk volume in the pod spec. -- Clusters cannot span clouds or regions (this functionality will require full federation support). -- StatefulSet volume zone spreading when using dynamic provisioning is currently not compatible with pod affinity or anti-affinity policies. -- If the name of the StatefulSet contains dashes (“-”), volume zone spreading may not provide a uniform distribution of storage across zones. -- When specifying multiple PVCs in a Deployment or Pod spec, the StorageClass needs to be configured for a specific, single zone, or the PVs need to be statically provisioned in a specific zone. Another workaround is to use a StatefulSet, which will ensure that all the volumes for a replica are provisioned in the same zone. - -See more at [running Kubernetes in multiple zones](https://kubernetes.io/docs/setup/multiple-zones/). +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-azure/0025-20180809-cross-resource-group-nodes.md b/keps/sig-azure/0025-20180809-cross-resource-group-nodes.md index 9f2f1337..cfd1f5fa 100644 --- a/keps/sig-azure/0025-20180809-cross-resource-group-nodes.md +++ b/keps/sig-azure/0025-20180809-cross-resource-group-nodes.md @@ -1,181 +1,4 @@ ---- -kep-number: 25 -title: Cross resource group nodes -authors: - - "@feiskyer" -owning-sig: sig-azure -participating-sigs: - - sig-azure -reviewers: - - name: "@khenidak" - - name: "@justaugustus" -approvers: - - name: "@brendanburns" -editor: - - "@feiskyer" -creation-date: 2018-08-09 -last-updated: 2018-09-29 -status: implementable ---- - -# Cross resource group nodes - -## Table of Contents - -<!-- TOC --> - -- [Cross resource group nodes](#cross-resource-group-nodes) - - [Table of Contents](#table-of-contents) - - [Summary](#summary) - - [Motivation](#motivation) - - [Assumptions](#assumptions) - - [Non-Goals](#non-goals) - - [Design](#design) - - [Implementation](#implementation) - - [Cross-RG nodes](#cross-rg-nodes) - - [On-prem nodes](#on-prem-nodes) - - [Alternatives](#alternatives) - -<!-- /TOC --> - -## Summary - -This KEP aims to add support for cross resource group (RG) and on-prem nodes to the Azure cloud provider. - -## Motivation - -Today, the Azure cloud provider only supports nodes from a specified RG (which is set in the cloud provider configuration file). For nodes in a different RG, Azure cloud provider reports `InstanceNotFound` error and thus they would be removed by controller manager. The same holds true for on-prem nodes. - -With managed clusters, like [AKS](https://docs.microsoft.com/en-us/azure/aks/), there is limited access to configure nodes. There are instances where users may need to customize nodes in ways that are not possible in a managed service. This document proposes support for joining arbitrary nodes to a cluster and the required changes to make in both the Azure cloud provider and provisioned setups, which include: - -- Provisioning tools should setup kubelet with required labels (e.g. via `--node-labels`) -- Azure cloud provider would fetch RG from those labels and then get node information based on that - -### Assumptions - -While new nodes (either from different RGs or on-prem) would be supported in this proposal, not all features would be supported for them. For example, AzureDisk will not work for on-prem nodes. - -This proposal makes following assumptions for those new nodes: - -- Nodes are in same region and set with required labels (as clarified in the following design part) -- Nodes will not be part of the load balancer managed by cloud provider -- Both node and container networking are properly configured -- AzureDisk is supported for Azure cross-RG nodes, but not for on-prem nodes - -In addition, feature gate [ServiceNodeExclusion](https://github.com/kubernetes/kubernetes/blob/master/pkg/features/kube_features.go#L174) must also be enabled for Kubernetes cluster. - -### Non-Goals - -Note that provisioning the Kubernetes cluster, setting up networking and provisioning new nodes are out of this proposal scope. Those could be done by external provisioning tools (e.g. acs-engine). - -## Design - -Instance metadata is a general way to fetch node information for Azure, but it doesn't work if cloud-controller-manager is used (`kubelet --cloud-provider=external`). So it won't be used in this proposal. 
Instead, the following labels are proposed for providing the required information:
-
-- `alpha.service-controller.kubernetes.io/exclude-balancer=true`, which excludes the node from load balancers. Required.
-- `kubernetes.azure.com/resource-group=<rg-name>`, which provides the external RG and is used to look up node information. Required for cross-RG nodes.
-- `kubernetes.azure.com/managed=true|false`, which indicates whether the node is on-prem (unmanaged) or not. Required for on-prem nodes, with the value `false`.
-
-When initializing nodes, the applicable labels should be set for the kubelet by provisioning tools, e.g.
-
-```sh
-# For cross-RG nodes
-kubelet --node-labels=alpha.service-controller.kubernetes.io/exclude-balancer=true,kubernetes.azure.com/resource-group=<rg-name> ...
-
-# For on-prem nodes
-kubelet --node-labels=alpha.service-controller.kubernetes.io/exclude-balancer=true,kubernetes.azure.com/managed=false ...
-```
-
-The node label `alpha.service-controller.kubernetes.io/exclude-balancer=true` is already supported in Kubernetes and is controlled by the `ServiceNodeExclusion` feature gate. Cluster admins should ensure the `ServiceNodeExclusion` feature gate is enabled when provisioning the cluster.
-
-Note that:
-
-- Azure resource group names support a [wider range of valid characters](https://docs.microsoft.com/en-us/azure/architecture/best-practices/naming-conventions#naming-rules-and-restrictions) than [Kubernetes labels](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#syntax-and-character-set). **Only resource group names that are valid Kubernetes label values** are supported: they must be 63 characters or less, and must be empty or begin and end with an alphanumeric character (`[a-z0-9A-Z]`), with dashes (`-`), underscores (`_`), dots (`.`), and alphanumerics between.
-- If the label `kubernetes.azure.com/managed` is not provided, the Azure cloud provider will assume the node is managed.
-
-## Implementation
-
-### Cross-RG nodes
-
-Cross-RG nodes should register themselves with the required labels together with the appropriate cloud provider setting:
-
-- `--cloud-provider=azure` when using kube-controller-manager
-- `--cloud-provider=external` when using cloud-controller-manager
-
-For example,
-
-```sh
-kubelet ... \
-  --cloud-provider=azure \
-  --cloud-config=/etc/kubernetes/azure.json \
-  --node-labels=alpha.service-controller.kubernetes.io/exclude-balancer=true,kubernetes.azure.com/resource-group=<rg-name>
-```
-
-[LoadBalancer](https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/cloud.go#L92) is not required for cross-RG nodes, hence only the following features will be implemented for them:
-
-- [Instances](https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/cloud.go#L121)
-- [Zones](https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/cloud.go#L194)
-- [Routes](https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/cloud.go#L169)
-- [Azure managed disks](https://github.com/kubernetes/kubernetes/tree/master/pkg/volume/azure_dd)
-
-Most operations for those features are similar to those for existing nodes, except for the RG name: existing nodes use the RG from the cloud provider configuration, while cross-RG nodes will get the RG from the node label `kubernetes.azure.com/resource-group=<rg-name>`.
-
-To achieve this, [Informers](https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/cloud.go#L52-L55) will be used to watch node labels, and the RGs will be cached in `nodeResourceGroups map[string]string`.
-
-```go
-type Cloud struct {
-  ...
-
-  // nodeResourceGroups is a mapping from Node's name to resource group name.
-  // It will be updated by the nodeInformer.
-  nodeResourceGroups map[string]string
-}
-```
-
-### On-prem nodes
-
-On-prem nodes are different from Azure nodes: none of the Azure-coupled features (including Instances, LoadBalancer, Zones, Routes and Azure managed disks) are supported for them. To prevent the node from being deleted, the Azure cloud provider will always assume the node exists and will use a providerID in the format `azure://<node-name>`.
-
-On-prem nodes should register themselves with the labels `alpha.service-controller.kubernetes.io/exclude-balancer=true` and `kubernetes.azure.com/managed=false`, e.g.
-
-```sh
-kubelet --node-labels=alpha.service-controller.kubernetes.io/exclude-balancer=true,kubernetes.azure.com/managed=false ...
-```
-
-Because AzureDisk is also not supported, and we don't expect Pods using AzureDisk to be scheduled to on-prem nodes, a new taint `kubernetes.azure.com/managed:NoSchedule` will be added to those nodes.
-
-To run workloads on them, a nodeSelector and tolerations should be provided. For example,
-
-```yaml
-apiVersion: v1
-kind: Pod
-metadata:
-  name: nginx
-spec:
-  containers:
-  - image: nginx
-    name: nginx
-    ports:
-    - containerPort: 80
-      name: http
-      protocol: TCP
-  dnsPolicy: ClusterFirst
-  nodeSelector:
-    kubernetes.azure.com/resource-group: on-prem
-  tolerations:
-  - key: kubernetes.azure.com/managed
-    effect: NoSchedule
-```
-
-## Implementation History
-
-- [kubernetes#67604](https://github.com/kubernetes/kubernetes/pull/67604): Adds initial support for Azure cross resource group nodes.
-- [kubernetes#67984](https://github.com/kubernetes/kubernetes/pull/67984): Adds unmanaged nodes support for Azure cloud provider.
-- [cloud-provider-azure#58](https://github.com/kubernetes/cloud-provider-azure/pull/58): Adds documentation for Azure cross resource group nodes.
-
-## Alternatives
-
-Annotations, additional cloud provider options, and querying the Azure API directly are three alternative ways to provide resource group information. They are not preferred because:
-
-- The kubelet doesn't support registering itself with annotations, so it would require admins to annotate the node afterwards. The extra steps add complexity to cluster operations.
-- Cloud provider options are not flexible compared to labels and annotations. They would require configuration file updates and controller manager restarts whenever new nodes use previously unknown resource groups.
-- Querying node information directly from the Azure API is also not feasible, because it would require listing all resource groups, all virtual machine scale sets and all virtual machines. That operation is time-consuming and likely to hit rate limits.
+KEPs have moved to https://git.k8s.io/enhancements/.
+<!--
+This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first.
+-->
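For illustration, the informer-backed resource-group cache described in the removed text above could look roughly like the following Go sketch; the handler wiring and helper names (`SetInformers`, `updateNode`) are assumptions for this sketch, not the actual Azure cloud provider code.

```go
// Minimal sketch of the nodeResourceGroups cache described above; helper names
// and wiring are assumptions for illustration, not the actual provider code.
package azure

import (
	"sync"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/tools/cache"
)

const (
	resourceGroupLabel = "kubernetes.azure.com/resource-group"
	managedLabel       = "kubernetes.azure.com/managed"
)

type Cloud struct {
	lock               sync.Mutex
	nodeResourceGroups map[string]string // node name -> resource group
}

// SetInformers registers node event handlers that keep nodeResourceGroups current.
func (az *Cloud) SetInformers(factory informers.SharedInformerFactory) {
	factory.Core().V1().Nodes().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { az.updateNode(obj.(*v1.Node)) },
		UpdateFunc: func(_, newObj interface{}) { az.updateNode(newObj.(*v1.Node)) },
		DeleteFunc: func(obj interface{}) {
			if node, ok := obj.(*v1.Node); ok {
				az.lock.Lock()
				delete(az.nodeResourceGroups, node.Name)
				az.lock.Unlock()
			}
		},
	})
}

func (az *Cloud) updateNode(node *v1.Node) {
	// Unmanaged (on-prem) nodes are ignored entirely.
	if node.Labels[managedLabel] == "false" {
		return
	}
	if rg, ok := node.Labels[resourceGroupLabel]; ok {
		az.lock.Lock()
		az.nodeResourceGroups[node.Name] = rg
		az.lock.Unlock()
	}
}
```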
\ No newline at end of file diff --git a/keps/sig-cli/0008-kustomize.md b/keps/sig-cli/0008-kustomize.md index cb47f8d7..cfd1f5fa 100644 --- a/keps/sig-cli/0008-kustomize.md +++ b/keps/sig-cli/0008-kustomize.md @@ -1,222 +1,4 @@ ---- -kep-number: 8 -title: Kustomize -authors: - - "@pwittrock" - - "@monopole" -owning-sig: sig-cli -participating-sigs: - - sig-cli -reviewers: - - "@droot" -approvers: - - "@soltysh" -editor: "@droot" -creation-date: 2018-05-05 -last-updated: 2018-05-23 -status: implemented -see-also: - - n/a -replaces: - - kinflate # Old name for kustomize -superseded-by: - - n/a ---- - -# Kustomize - -## Table of Contents - -- [Kustomize](#kustomize) - - [Table of Contents](#table-of-contents) - - [Summary](#summary) - - [Motivation](#motivation) - - [Goals](#goals) - - [Non-Goals](#non-goals) - - [Proposal](#proposal) - - [Implementation Details/Notes/Constraints [optional]](#implementation-detailsnotesconstraints-optional) - - [Risks and Mitigations](#risks-and-mitigations) - - [Risks of Not Having a Solution](#risks-of-not-having-a-solution) - - [Graduation Criteria](#graduation-criteria) - - [Implementation History](#implementation-history) - - [Drawbacks](#drawbacks) - - [Alternatives](#alternatives) - - [FAQ](#faq) - -## Summary - -Declarative specification of Kubernetes objects is the recommended way to manage Kubernetes -production workloads, however gaps in the kubectl tooling force users to write their own scripting and -tooling to augment the declarative tools with preprocessing transformations. -While most of these transformations already exist as imperative kubectl commands, they are not natively accessible -from a declarative workflow. - -This KEP describes how `kustomize` addresses this problem by providing a declarative format for users to access -the imperative kubectl commands they are already familiar natively from declarative workflows. - -## Motivation - -The kubectl command provides a cli for: - -- accessing the Kubernetes apis through json or yaml configuration -- porcelain commands for generating and transforming configuration off of command line flags. - -Examples: - -- Generate a configmap or secret from a text or binary file - - `kubectl create configmap`, `kubectl create secret` - - Users can manage their configmaps and secrets text and binary files - -- Create or update fields that cut across other fields and objects - - `kubectl label`, `kubectl annotate` - - Users can add and update labels for all objects composing an application - -- Transform an existing declarative configuration without forking it - - `kubectl patch` - - Users may generate multiple variations of the same workload - -- Transform live resources arbitrarily without auditing - - `kubectl edit` - -To create a Secret from a binary file, users must first base64 encode the binary file and then create a Secret yaml -config from the resulting data. Because the source of truth is actually the binary file, not the config, -users must write scripting and tooling to keep the 2 sources consistent. - -Instead, users should be able to access the simple, but necessary, functionality available in the imperative -kubectl commands from their declarative workflow. - -#### Long standing issues - -Kustomize addresses a number of long standing issues in kubectl. 
- -- Declarative enumeration of multiple files [kubernetes/kubernetes#24649](https://github.com/kubernetes/kubernetes/issues/24649) -- Declarative configmap and secret creation: [kubernetes/kubernetes#24744](https://github.com/kubernetes/kubernetes/issues/24744), [kubernetes/kubernetes#30337](https://github.com/kubernetes/kubernetes/issues/30337) -- Configmap rollouts: [kubernetes/kubernetes#22368](https://github.com/kubernetes/kubernetes/issues/22368) - - [Example in kustomize](https://github.com/kubernetes-sigs/kustomize/tree/master/examples/helloWorld#how-this-works-with-kustomize) -- Name/label scoping and safer pruning: [kubernetes/kubernetes#1698](https://github.com/kubernetes/kubernetes/issues/1698) - - [Example in kustomize](https://github.com/kubernetes-sigs/kustomize/blob/master/examples/breakfast.md#demo-configure-breakfast) -- Template-free add-on customization: [kubernetes/kubernetes#23233](https://github.com/kubernetes/kubernetes/issues/23233) - - [Example in kustomize](https://github.com/kubernetes-sigs/kustomize/tree/master/examples/helloWorld#staging-kustomization) - -### Goals - -- Declarative support for defining ConfigMaps and Secrets generated from binary and text files -- Declarative support for adding or updating cross-cutting fields - - labels & selectors - - annotations - - names (as transformation of the original name) -- Declarative support for applying patches to transform arbitrary fields - - use strategic-merge-patch format -- Ease of integration with CICD systems that maintain configuration in a version control repository - as a single source of truth, and take action (build, test, deploy, etc.) when that truth changes (gitops). - -### Non-Goals - -#### Exposing every imperative kubectl command in a declarative fashion - -The scope of kustomize is limited only to functionality gaps that would otherwise prevent users from -defining their workloads in a purely declarative manner (e.g. without writing scripts to perform pre-processing -or linting). Commands such as `kubectl run`, `kubectl create deployment` and `kubectl edit` are unnecessary -in a declarative workflow because a Deployment can easily be managed as declarative config. - -#### Providing a simpler facade on top of the Kubernetes APIs - -The community has developed a number of facades in front of the Kubernetes APIs using -templates or DSLs. Attempting to provide an alternative interface to the Kubernetes API is -a non-goal. Instead the focus is on: - -- Facilitating simple cross-cutting transformations on the raw config that would otherwise require other tooling such - as *sed* -- Generating configuration when the source of truth resides elsewhere -- Patching existing configuration with transformations - -## Proposal - -### Capabilities - -**Note:** This proposal has already been implemented in `github.com/kubernetes/kubectl`. - -Define a new meta config format called *kustomization.yaml*. 
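For illustration only, the kind of information such a meta config carries could be modeled roughly as below, based on the capabilities listed in the sections that follow; the field names are assumptions for this sketch, not the actual kustomize schema.

```go
// Illustrative only: a rough Go model of what kustomization.yaml carries.
package sketch

// ConfigMapArgs describes a ConfigMap to generate from files or literals,
// mirroring `kubectl create configmap`.
type ConfigMapArgs struct {
	Name     string   `yaml:"name"`
	Files    []string `yaml:"files,omitempty"`
	Literals []string `yaml:"literals,omitempty"`
}

// Kustomization is the meta config read from a kustomization.yaml file.
type Kustomization struct {
	// Paths or URLs to config files, or to other kustomization directories.
	Resources []string `yaml:"resources,omitempty"`

	// Generators for ConfigMaps (and, analogously, Secrets).
	ConfigMapGenerator []ConfigMapArgs `yaml:"configMapGenerator,omitempty"`

	// Cross-cutting transformations applied to every resource.
	NamePrefix        string            `yaml:"namePrefix,omitempty"`
	CommonLabels      map[string]string `yaml:"commonLabels,omitempty"`
	CommonAnnotations map[string]string `yaml:"commonAnnotations,omitempty"`

	// Strategic-merge-patch files applied to matching resources.
	PatchesStrategicMerge []string `yaml:"patchesStrategicMerge,omitempty"`
}
```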
- -#### *kustomization.yaml* will allow users to reference config files - -- Path to config yaml file (similar to `kubectl apply -f <file>`) -- Urls to config yaml file (similar to `kubectl apply -f <url>`) -- Path to *kustomization.yaml* file (takes the output of running kustomize) - -#### *kustomization.yaml* will allow users to generate configs from files - -- ConfigMap (`kubectl create configmap`) -- Secret (`kubectl create secret`) - -#### *kustomization.yaml* will allow users to apply transformations to configs - -- Label (`kubectl label`) -- Annotate (`kubectl annotate`) -- Strategic-Merge-Patch (`kubectl patch`) -- Name-Prefix - -### UX - -Kustomize will also contain subcommands to facilitate authoring *kustomization.yaml*. - -#### Edit - -The edit subcommands will allow users to modify the *kustomization.yaml* through cli commands containing -helpful messaging and documentation. - -- Add ConfigMap - like `kubectl create configmap` but declarative in *kustomization.yaml* -- Add Secret - like `kubectl create secret` but declarative in *kustomization.yaml* -- Add Resource - adds a file reference to *kustomization.yaml* -- Set NamePrefix - adds NamePrefix declaration to *kustomization.yaml* - -#### Diff - -The diff subcommand will allow users to see a diff of the original and transformed configuration files - -- Generated config (configmap) will show the files as created -- Transformations (name prefix) will show the files as modified - -### Implementation Details/Notes/Constraints [optional] - -Kustomize has already been implemented in the `github.com/kubernetes/kubectl` repo, and should be moved to a -separate repo for the subproject. - -Kustomize was initially developed as its own cli, however once it has matured, it should be published -as a subcommand of kubectl or as a statically linked plugin. It should also be more tightly integrated with apply. - -- Create the *kustomize* sig-cli subproject and update sigs.yaml -- Move the existing kustomize code from `github.com/kubernetes/kubectl` to `github.com/kubernetes-sigs/kustomize` - -### Risks and Mitigations - - -### Risks of Not Having a Solution - -By not providing a viable option for working directly with Kubernetes APIs as json or -yaml config, we risk the ecosystem becoming fragmented with various bespoke API facades. -By ensuring the raw Kubernetes API json or yaml is a usable approach for declaratively -managing applications, even tools that do not use the Kubernetes API as their native format can -better work with one another through transformation to a common format. - -## Graduation Criteria - -- Dogfood kustomize by either: - - moving one or more of our own (OSS Kubernetes) services to it. - - getting user feedback from one or more mid or large application deployments using kustomize. -- Publish kustomize as a subcommand of kubectl. - -## Implementation History - -kustomize was implemented in the kubectl repo before subprojects became a first class thing in Kubernetes. -The code has been fully implemented, but it must be moved to a proper location. - -## Drawbacks - - -## Alternatives - -1. Users write their own bespoke scripts to generate and transform the config before it is applied. -2. Users don't work with the API directly, and use or develop DSLs for interacting with Kubernetes. - -## FAQs +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-cli/0024-kubectl-plugins.md b/keps/sig-cli/0024-kubectl-plugins.md index a79fcc4e..cfd1f5fa 100644 --- a/keps/sig-cli/0024-kubectl-plugins.md +++ b/keps/sig-cli/0024-kubectl-plugins.md @@ -1,234 +1,4 @@ ---- -kep-number: 24 -title: Kubectl Plugins -authors: - - "@juanvallejo" -owning-sig: sig-cli -participating-sigs: - - sig-cli -reviewers: - - "@pwittrock" - - "@deads2k" - - "@liggitt" - - "@soltysh" -approvers: - - "@pwittrock" - - "@soltysh" -editor: juanvallejo -creation-date: 2018-07-24 -last-updated: 2018-08-09 -status: provisional -see-also: - - n/a -replaces: - - "https://github.com/kubernetes/community/blob/master/contributors/design-proposals/cli/kubectl-extension.md" - - "https://github.com/kubernetes/community/pull/481" -superseded-by: - - n/a ---- - -# Kubectl Plugins - -## Table of Contents - -* [Table of Contents](#table-of-contents) -* [Summary](#summary) -* [Motivation](#motivation) - * [Limitations of the Existing Design](#limitations-of-the-existing-design) - * [Goals](#goals) - * [Non-Goals](#non-goals) -* [Proposal](#proposal) - * [Scenarios](#scenarios) - * [Implementation Details/Design/Constraints](#implementation-detailsdesign) - * [Naming Conventions](#naming-conventions) - * [Implementation Notes/Constraints](#implementation-notesconstraints) - * [Risks and Mitigations](#risks-and-mitigations) -* [Graduation Criteria](#graduation-criteria) -* [Implementation History](#implementation-history) -* [Drawbacks](#drawbacks) -* [Future Improvements/Considerations](#future-improvementsconsiderations) - -## Summary - -This proposal introduces the main design for a plugin mechanism in `kubectl`. -The mechanism is a git-style system, that looks for executables on a user's `$PATH` whose name begins with `kubectl-`. -This allows plugin binaries to override existing command paths and add custom commands and subcommands to `kubectl`. - -## Motivation - -The main motivation behind a plugin system for `kubectl` stems from being able to provide users with a way to extend -the functionality of `kubectl`, beyond what is offered by its core commands. - -By picturing the core commands provided by `kubectl` as essential building blocks for interacting with a Kubernetes -cluster, we can begin to think of plugins as a means of using these building blocks to provide more complex functionality. -A new command, `kubectl set-ns`, for example, could take advantage of the rudimentary functionality already provided by -the `kubectl config` command, and build on top of it to provide users with a powerful, yet easy-to-use way of switching -to a new namespace. - -For example, the user experience for switching namespaces could go from: - -```bash -kubectl config set-context $(kubectl config current-context) --namespace=mynewnamespace -``` - -to: - -``` -kubectl set-ns mynewnamespace -``` - -where `set-ns` would be a user-provided plugin which would call the initial `kubectl config set-context ...` command -and set the namespace flag according to the value provided as the plugin's first parameter. - -The `set-ns` command above could have multiple variations, or be expanded to support subcommands with relative ease. -Since plugins would be distributed by their authors, independent from the core Kubernetes repository, plugins could -release updates and changes at their own pace. - -### Limitations of the Existing Design - -The existing alpha plugin system in `kubectl` presents a few limitations with its current design. 
-It forces plugin scripts and executables to exist in a pre-determined location, requires a per-plugin metadata file for -interpretation, and does not provide a clear way to override existing command paths or provide additional subcommands -without having to override a top-level command. - -The proposed git-style re-design of the plugin system allows us to implement extensibility requests from users that the -current system is unable to address. -See https://github.com/kubernetes/kubernetes/issues/53640 and https://github.com/kubernetes/kubernetes/issues/55708. - -### Goals - -* Avoid any kind of installation process (no additional config, users drop an executable in their `PATH`, for example, - and they are then able to use that plugin with `kubectl`). - No additional configuration is needed, only the plugin executable. - A plugin's filename determines the plugin's intention, such as which path in the command tree it applies to: - `/usr/bin/kubectl-educate-dolphins` would, for example be invoked under the command `kubectl educate dolphins --flag1 --flag2`. - It is up to a plugin to parse any arguments and flags given to it. A plugin decides when an argument is a - subcommand, as well as any limitations or constraints that its flags should have. -* Relay all information given to `kubectl` (via command line args) to plugins as-is. - Plugins receive all arguments and flags provided by users and are responsible for adjusting their behavior - accordingly. -* Provide a way to limit which command paths can and cannot be overridden by plugins in the command tree. - -### Non-Goals - -* The new plugin mechanism will not be a "plugin installer" or wizard. It will not have specific or baked-in knowledge - regarding a plugin's location or composition, nor will it provide a way to download or unpack plugins in a correct - location. -* Plugin discovery is not a main focus of this mechanism. As such, it will not attempt to collect data about every - plugin that exists in an environment. -* Plugin management is out of the scope of this design. A mechanism for updating and managing lifecycle of existing - plugins should be covered as a separate design (See https://github.com/kubernetes/community/pull/2340). -* Provide a standard package of common cli utilities that is consumed by `kubectl` and plugins alike. - This should be done as an independent effort of this plugin mechanism. - -## Proposal - -### Scenarios - -* Developer wants to create and expose a plugin to `kubectl`. - They use a programming language of their choice and create an executable file. - The executable's filename consists of the command path to implement, and is prefixed with `kubectl-`. - The executable file is placed on the user's `PATH`. - -### Implementation Details/Design - -The proposed design passes through all environment variables, flags, input, and output streams exactly as they are given -to the parent `kubectl` process. This has the effect of letting plugins run without the need for any special parsing -or case-handling in `kubectl`. - -In essence, a plugin binary must be able to run as a standalone process, completely independent of `kubectl`. - -* When `kubectl` is executed with a subcommand _foo_ that does not exist in the command tree, it will attempt to look -for a filename `kubectl-foo` (`kubectl-foo.exe` on Windows) in the user's `PATH` and execute it, relaying all arguments given -as well as all environment variables to the plugin child-process. 
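Because the whole contract is "an executable on the user's `PATH` whose name starts with `kubectl-`", a complete plugin can be a tiny standalone program. A minimal sketch, using the `kubectl-educate-dolphins` example from the goals above (an illustration, not part of the proposal itself):

```go
// Built as a binary named kubectl-educate-dolphins and placed on PATH, this
// would be invoked via `kubectl educate dolphins --flag1 ...`, receiving the
// remaining arguments and the environment verbatim.
package main

import (
	"fmt"
	"os"
)

func main() {
	// All arguments after the matched command path are relayed as-is;
	// the plugin is responsible for its own flag and subcommand parsing.
	fmt.Println("args:", os.Args[1:])

	// Environment variables (e.g. KUBECONFIG) are passed through unchanged.
	if kubeconfig, ok := os.LookupEnv("KUBECONFIG"); ok {
		fmt.Println("using kubeconfig:", kubeconfig)
	}
}
```

As described above, the plugin runs as a completely independent child process; `kubectl` itself does no special parsing on its behalf.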
- -A brief example (not an actual prototype) is provided below to clarify the core logic of the proposed design: - -```go -// treat all args given by the user as pieces of a plugin binary's filename -// and short-circuit once we find an arg that appears to be a flag. -remainingArgs := []string{} // all "non-flag" arguments - -for idx := range cmdArgs { - if strings.HasPrefix(cmdArgs[idx], "-") { - break - } - remainingArgs = append(remainingArgs, strings.Replace(cmdArgs[idx], "-", "_", -1)) -} - -foundBinaryPath := "" - -// find binary in the user's PATH, starting with the longest possible filename -// based on the given non-flag arguments by the user -for len(remainingArgs) > 0 { - path, err := exec.LookPath(fmt.Sprintf("kubectl-%s", strings.Join(remainingArgs, "-"))) - if err != nil || len(path) == 0 { - remainingArgs = remainingArgs[:len(remainingArgs)-1] - continue - } - - foundBinaryPath = path - break -} - -// if we are able to find a suitable plugin executable, perform a syscall.Exec call -// and relay all remaining arguments (in order given), as well as environment vars. -syscall.Exec(foundBinaryPath, append([]string{foundBinaryPath}, cmdArgs[len(remainingArgs):]...), os.Environ()) -``` - -#### Naming Conventions - -Under this proposal, `kubectl` would identify plugins by looking for filenames beginning with the `kubectl-` prefix. -A search for these names would occur on a user's `PATH`. Only files that are executable and begin with this prefix -would be identified. - -### Implementation Notes/Constraints - -The current implementation details for the proposed design rely on using a plugin executable's name to determine what -command the plugin is adding. -For a given command `kubectl foo --bar baz`, an executable `kubectl-foo` will be matched on a user's `PATH`, -and the arguments `--bar baz` will be passed to it in that order. - -A potential limitation of this could present itself in the order of arguments provided by a user. -A user could intend to run a plugin `kubectl-foo-bar` with the flag `--baz` with the following command -`kubectl foo --baz bar`, but instead end up matching `kubectl-foo` with the flag `--baz` and the argument `bar` based -on the placement of the flag `--baz`. - -A notable constraint of this design is that it excludes any form of plugin lifecycle management, or version compatibility. -A plugin may depend on other plugins based on the decision of a plugin author, however the proposed design does nothing -to facilitate such dependencies. It is up to the plugin's author (or a separate / independent plugin management system) to -provide documentation or instructions on how to meet any dependencies required by a plugin. - -Further, with the proposed design, plugins that rely on multiple "helper" files to properly function, should provide an -"entrypoint" executable (which is placed on a user's `PATH`), with any additional files located elsewhere (e.g. ~/.kubeplugins/myplugin/helper1.py). - -### Risks and Mitigations - -Unlike the existing alpha plugin mechanism, the proposed design does not constrain commands added by plugins to exist as subcommands of the -`kubectl plugin` design. Commands provided by plugins under the new mechanism can be invoked as first-class commands (`/usr/bin/kubectl-foo` provides the `kubectl foo` parent command). - -A potential risk associated with this could present in the form of a "land-rush" by plugin providers. -Multiple plugin authors would be incentivized to provide their own version of plugin `foo`. 
-Users would be at the mercy of whichever variation of `kubectl-foo` is discovered in their `PATH` first when executing that command. - -A way to mitigate the above scenario would be to have users take advantage of the proposed plugin mechanism's design by renaming multiple variations of `kubectl-foo` -to include the provider's name, for example: `kubectl-acme-foo`, or `kubectl-companyB-foo`. - -Conflicts such as this one could further be mitigated by a plugin manager, which could perform conflict resolution among similarly named plugins on behalf of a user. - -## Graduation Criteria - -* Make this mechanism a part of `kubectl`'s command-lookup logic. - -## Implementation History - -This plugin design closely follows major aspects of the plugin system design for `git`. - -## Drawbacks - -Implementing this design could potentially conflict with any ongoing work that depends on the current alpha plugin system. - -## Future Improvements/Considerations - -The proposed design is flexible enough to accommodate future updates that could allow certain command paths to be overwritten -or extended (with the addition of subcommands) via plugins. +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-cli/0031-datadrivencommands.md b/keps/sig-cli/0031-datadrivencommands.md index 4f386c5f..cfd1f5fa 100644 --- a/keps/sig-cli/0031-datadrivencommands.md +++ b/keps/sig-cli/0031-datadrivencommands.md @@ -1,445 +1,4 @@ ---- -kep-number: 32 -title: Data Driven Commands for Kubectl -authors: - - "@pwittrock" -owning-sig: sig-cli -participating-sigs: -reviewers: - - "@soltysh" - - "@juanvallejo" - - "@seans3 " -approvers: - - "@soltysh" -editor: TBD -creation-date: 2018-11-13 -last-updated: 2018-11-13 -status: provisional -see-also: -replaces: -superseded-by: ---- - -# data driven commands - -## Table of Contents - -* [Table of Contents](#table-of-contents) -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) -* [Proposal](#proposal) - * [Implementation Details](#implementation-details) - * [Risks and Mitigations](#risks-and-mitigations) -* [Graduation Criteria](#graduation-criteria) -* [Alternatives](#alternatives) - -## Summary - -Many Kubectl commands make requests to specific Resource endpoints. The request bodies are populated by flags -provided by the user. - -Examples: - -- `create <resource>` -- `set <field> <resource>` -- `logs` - -Although these commands are compiled into the kubectl binary, their workflow is similar to a form on a webpage and -could be complete driven by the server providing the client with the request (endpoint + body) and a set of flags -to populate the request body. - -Publishing commands as data from the server addresses cli integration with API extensions as well as client-server -version skew. - -**Note:** No server-side changes are required for this, all Request and Response template expansion is performed on -the client side. - -## Motivation - -Kubectl provides a number of commands to simplify working with Kubernetes by making requests to -Resources and SubResources. These requests are mostly static, with fields filled in by user -supplied flags. Today the commands are compiled into the client, which as the following challenges: - -- Extension APIs cannot be compiled into the client -- Version-Skewed clients (old client) may be missing commands for new APIs or send outdated requests -- Version-Skewed clients (new client) may have commands for APIs that are not present in the server or expose - fields not present in older API versions - -### Goals - -Allow client commands that make a single request to a specific resource and output the result to be data driven -from the server. - -- Address cli support for extension APIs -- Address user experience for version skewed clients - -### Non-Goals - -Allow client commands that have complex client-side logic to be data driven. - -- Require a TTY -- Are Agnostic to Specific Resources - -## Proposal - -Define a format for publishing simple cli commands as data. 
CLI commands would be limited to: - -- Sending one or more requests to Resource or SubResource Endpoints -- Populating requests from command line flags and arguments -- Writing output populated from the Responses - -**Proof of Concept:** [cnctl](https://github.com/pwittrock/kubectl/tree/cnctl/cmd/cnctl) - -Instructions to run PoC: - -- `go run ./main.go` (no commands show up) -- `kubectl apply` the `cli_v1alpha1_clitestresource.yaml` (add the CRD with the commands) -- `go run ./main.go` (create command shows up) -- `go run ./main create deployment -h` (view create command help) -- `go run ./main create deploy --image nginx --name nginx` (create a deployment) -- `kubectl get deployments` - -### Implementation Details - -**Publishing Data:** - -Alpha: No apimachinery changes required - -- Alpha: publish extension Resource Commands as an annotation on CRDs. -- Alpha: publish core Resource Commands as openapi extension. - -Beta: apimachinery changes required - -- Beta: publish extension Resource Commands a part of the CRD Spec. -- Beta: publish core Resource Commands from new endpoint (like *swagger.json*) - -**Data Command Structure:** - -```go -/* -Copyright 2018 The Kubernetes Authors. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. -*/ - -package v1alpha1 - -type OutputType string - -const ( - // Use the outputTemplate field to format the response on the client-side - OUTPUT_TEMPLATE OutputType = "TEMPLATE" - - // Use Server-Side Printing and output the response table in a columnar format - OUTPUT_TABLE = "TABLE" -) - -// ResourceCommand defines a command that is dynamically defined as an annotation on a CRD -type ResourceCommand struct { - // Command is the cli Command - Command Command `json:"command"` - - // Requests are the requests the command will send to the apiserver. - // +optional - Requests []ResourceRequest `json:"requests,omitempty"` - - // OutputType is used to determine what output type to print - // +optional - OutputType OutputType `json:"outputTemplate,omitempty"` - - // OutputTemplate is a go-template used by the kubectl client to format the server responses as command output - // (STDOUT). - // - // The template may reference values specified as flags using - // {{index .Flags.Strings "flag-name"}}, {{index .Flags.Ints "flag-name"}}, {{index .Flags.Bools "flag-name"}}, - // {{index .Flags.Floats "flag-name"}}. - // - // The template may also reference values from the responses that were saved using saveResponseValues - // {{index .Responses.Strings "response-value-name"}}. 
- // - // Example: - // deployment.apps/{{index .Responses.Strings "responsename"}} created - // - // +optional - OutputTemplate string `json:"outputTemplate,omitempty"` -} - -type ResourceOperation string - -const ( - CREATE_RESOURCE ResourceOperation = "CREATE" - UPDATE_RESOURCE = "UPDATE" - DELETE_RESOURCE = "DELETE" - GET_RESOURCE = "GET" - PATCH_RESOURCE = "PATCH" -) - -type ResourceRequest struct { - // Group is the API group of the request endpoint - // - // Example: apps - Group string `json:"group"` - - // Version is the API version of the request endpoint - // - // Example: v1 - Version string `json:"version"` - - // Resource is the API resource of the request endpoint - // - // Example: deployments - Resource string `json:"resource"` - - // Operation is the type of operation to perform for the request. One of: Create, Update, Delete, Get, Patch - Operation ResourceOperation `json:"operation"` - - // BodyTemplate is a go-template for the request Body. It may reference values specified as flags using - // {{index .Flags.Strings "flag-name"}}, {{index .Flags.Ints "flag-name"}}, {{index .Flags.Bools "flag-name"}}, - // {{index .Flags.Floats "flag-name"}} - // - // Example: - // apiVersion: apps/v1 - // kind: Deployment - // metadata: - // name: {{index .Flags.Strings "name"}} - // namespace: {{index .Flags.Strings "namespace"}} - // labels: - // app: nginx - // spec: - // replicas: {{index .Flags.Ints "replicas"}} - // selector: - // matchLabels: - // app: {{index .Flags.Strings "name"}} - // template: - // metadata: - // labels: - // app: {{index .Flags.Strings "name"}} - // spec: - // containers: - // - name: {{index .Flags.Strings "name"}} - // image: {{index .Flags.Strings "image"}} - // - // +optional - BodyTemplate string `json:"bodyTemplate,omitempty"` - - // SaveResponseValues are values read from the response and saved in {{index .Responses.Strings "flag-name"}}. - // They may be used in the ResourceCommand.Output go-template. - // - // Example: - // - name: responsename - // jsonPath: "{.metadata.name}" - // - // +optional - SaveResponseValues []ResponseValue `json:"saveResponseValues,omitempty"` -} - -// Flag defines a cli flag that should be registered and available in request / output templates. -// -// Flag is used only by the client to expand Request and Response templates with user defined values provided -// as command line flags. -type Flag struct { - Type FlagType `json:"type"` - - Name string `json:"name"` - - Description string `json:"description"` - - // +optional - StringValue string `json:"stringValue,omitempty"` - - // +optional - StringSliceValue []string `json:"stringSliceValue,omitempty"` - - // +optional - BoolValue bool `json:"boolValue,omitempty"` - - // +optional - IntValue int32 `json:"intValue,omitempty"` - - // +optional - FloatValue float64 `json:"floatValue,omitempty"` -} - -// ResponseValue defines a value that should be parsed from a response and available in output templates -type ResponseValue struct { - Name string `json:"name"` - JsonPath string `json:"jsonPath"` -} - -type FlagType string - -const ( - STRING FlagType = "String" - BOOL = "Bool" - FLOAT = "Float" - INT = "Int" - STRING_SLICE = "StringSlice" -) - -type Command struct { - // Use is the one-line usage message. - Use string `json:"use"` - - // Path is the path to the sub-command. Omit if the command is directly under the root command. - // +optional - Path []string `json:"path,omitempty"` - - // Short is the short description shown in the 'help' output. 
- // +optional - Short string `json:"short,omitempty"` - - // Long is the long message shown in the 'help <this-command>' output. - // +optional - Long string `json:"long,omitempty"` - - // Example is examples of how to use the command. - // +optional - Example string `json:"example,omitempty"` - - // Deprecated defines, if this command is deprecated and should print this string when used. - // +optional - Deprecated string `json:"deprecated,omitempty"` - - // Flags are the command line flags. - // - // Flags are used by the client to expose command line flags to users and populate the Request go-templates - // with the user provided values. - // - // Example: - // - name: namespace - // type: String - // stringValue: "default" - // description: "deployment namespace" - // - // +optional - Flags []Flag `json:"flags,omitempty"` - - // SuggestFor is an array of command names for which this command will be suggested - - // similar to aliases but only suggests. - SuggestFor []string `json:"suggestFor,omitempty"` - - // Aliases is an array of aliases that can be used instead of the first word in Use. - Aliases []string `json:"aliases,omitempty"` - - // Version defines the version for this command. If this value is non-empty and the command does not - // define a "version" flag, a "version" boolean flag will be added to the command and, if specified, - // will print content of the "Version" variable. - // +optional - Version string `json:"version,omitempty"` -} - -// ResourceCommandList contains a list of Commands -type ResourceCommandList struct { - Items []ResourceCommand `json:"items"` -} -``` - -**Example Command:** - -```go -# Set Label: "cli.sigs.k8s.io/cli.v1alpha1.CommandList": "" -# Set Annotation: "cli.sigs.k8s.io/cli.v1alpha1.CommandList": '<json>' ---- -items: -- command: - path: - - "create" # Command is a subcommand of this path - use: "deployment" # Command use - aliases: # Command alias' - - "deploy" - - "deployments" - short: Create a deployment with the specified name. - long: Create a deployment with the specified name. - example: | - # Create a new deployment named my-dep that runs the busybox image. - kubectl create deployment --name my-dep --image=busybox - flags: - - name: name - type: String - stringValue: "" - description: deployment name - - name: image - type: String - stringValue: "" - description: Image name to run. - - name: replicas - type: Int - intValue: 1 - description: Image name to run. - - name: namespace - type: String - stringValue: "default" - description: deployment namespace - requests: - - group: apps - version: v1 - resource: deployments - operation: Create - bodyTemplate: | - apiVersion: apps/v1 - kind: Deployment - metadata: - name: {{index .Flags.Strings "name"}} - namespace: {{index .Flags.Strings "namespace"}} - labels: - app: nginx - spec: - replicas: {{index .Flags.Ints "replicas"}} - selector: - matchLabels: - app: {{index .Flags.Strings "name"}} - template: - metadata: - labels: - app: {{index .Flags.Strings "name"}} - spec: - containers: - - name: {{index .Flags.Strings "name"}} - image: {{index .Flags.Strings "image"}} - saveResponseValues: - - name: responsename - jsonPath: "{.metadata.name}" - outputTemplate: | - deployment.apps/{{index .Responses.Strings "responsename"}} created -``` - -### Risks and Mitigations - -- Command name collisions: CRD publishes command that conflicts with another command - - Initially require the resource name to be the command name (e.g. 
`create foo`, `set image foo`)
-  - Mitigation: Use the discovery service to manage preference (as it does for the K8s APIs)
-- Command makes requests on behalf of the user that may be undesirable
-  - Mitigation: Automatically output the Resource APIs that a command uses as part of the command description
-  - Mitigation: Support dry-run to emit the requests that would be made to the server without actually making them
-  - Mitigation: Possibly restrict the APIs commands can use (e.g. CRD-published commands can only use the APIs for that
-    Resource).
-- Approach is hard to maintain, complex, etc.
-  - Mitigation: Initially restrict this to `create` commands only and gather feedback
-- Doesn't work well with auto-complete
-  - TODO: Investigate whether this is true and how much it matters.
-
-## Graduation Criteria
-
-- Simple commands for Core Resources have been migrated to be data driven
-- In use by high profile extension APIs - e.g. Istio
-- Published as a first class item for Extension and Core Resources
-
-## Alternatives
-
-- Use plugins for these cases
-  - Still suffer from version skew
-  - Require users to download and install binaries
-  - Hard to keep in sync with the set of Resources for each cluster
-- Don't support cli commands for Extension Resources
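To make the client-side mechanics concrete, here is a minimal sketch of how the client could expand a command's `bodyTemplate` with user-supplied flag values before issuing the request. The `Flags`/`Responses` layout mirrors the template keys used in the removed text above (`{{index .Flags.Strings "name"}}` etc.); the helper itself is an assumption for illustration.

```go
// Illustrative only: client-side expansion of a ResourceRequest bodyTemplate.
package sketch

import (
	"bytes"
	"text/template"
)

// TemplateData is the dot passed to bodyTemplate and outputTemplate.
type TemplateData struct {
	Flags struct {
		Strings map[string]string
		Ints    map[string]int32
		Bools   map[string]bool
		Floats  map[string]float64
	}
	Responses struct {
		Strings map[string]string
	}
}

// ExpandBodyTemplate renders the request body for one ResourceRequest.
func ExpandBodyTemplate(bodyTemplate string, data TemplateData) (string, error) {
	tmpl, err := template.New("body").Parse(bodyTemplate)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}
```

Rendering the example deployment `bodyTemplate` above with `name=nginx`, `image=nginx` and `replicas=1` would yield the Deployment manifest sent to the `apps/v1` deployments endpoint, with no server-side changes required.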
\ No newline at end of file +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-cli/0031-kustomize-integration.md b/keps/sig-cli/0031-kustomize-integration.md index 1c2af9c7..cfd1f5fa 100644 --- a/keps/sig-cli/0031-kustomize-integration.md +++ b/keps/sig-cli/0031-kustomize-integration.md @@ -1,233 +1,4 @@ ---- -kep-number: 31 -title: Enable kustomize in kubectl -authors: - - "@Liujingfang1" -owning-sig: sig-cli -participating-sigs: - - sig-cli -reviewers: - - "@pwittrock" - - "@seans3" - - "@soltysh" -approvers: - - "@pwittrock" - - "@seans3" - - "@soltysh" -editor: TBD -creation-date: 2018-11-07 -last-updated: yyyy-mm-dd -status: pending -see-also: - - [KEP-0008](https://github.com/kubernetes/community/blob/master/keps/sig-cli/0008-kustomize.md) -replaces: - - n/a -superseded-by: - - n/a ---- - -# Enable kustomize in kubectl - -## Table of Contents -* [Table of Contents](#table-of-contents) -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) -* [Kustomize Introduction](#kustomize-introduction) -* [Proposal](#proposal) - * [UX](#UX) - * [apply](#apply) - * [get](#get) - * [delete](#delete) - * [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints) - * [Risks and Mitigations](#risks-and-mitigations) -* [Graduation Criteria](#graduation-criteria) -* [Implementation History](#implementation-history) -* [Alternatives](#alternatives) - -[Tools for generating]: https://github.com/ekalinin/github-markdown-toc - -## Summary -[Kustomize](https://github.com/kubernetes-sigs/kustomize) is a tool developed to provide declarative support for kubernetes objects. It has been adopted by many projects and users. Having kustomize enabled in kubectl will address a list of long standing issues. This KEP describes how `kustomize` is integrated into kubectl subcommands with consistent UX. - -## Motivation - -Declarative specification of Kubernetes objects is the recommended way to manage Kubernetes applications or workloads. There is some gap in kubectl on declarative support. To eliminate the gap, a [KEP](https://github.com/kubernetes/community/blob/master/keps/sig-cli/0008-kustomize.md#faq) was proposed months ago and Kustomize was developed. After more than 10 iterations, Kustomize has a complete set of features and reached a good state to be integrated into kubectl. - -### Goals - -Integrate kustomize with kubectl so that kubectl can recognize kustomization directories and expand resources from kustomization.yaml before running kubectl subcommands. This integration should be transparent. It doesn't change kubectl UX. This integration should also be backward compatible. For non kustomization directories, kubectl behaves the same as current. The integration shouldn't have any impact on those parts. - - -### Non-Goals -- provide an editing functionality of kustomization.yaml from kubectl -- further integration with other kubectl flags - -## Kustomize Introduction - -Kustomize has following subcommands: -- build -- edit, edit also has subcommands - - Set - - imagetag - - namespace - - nameprefix - - Add - - base - - resource - - patch - - label - - annotation - - configmap -- config -- version -- help - -`edit` and `build` are most commonly used subcommands. - -`edit` is to modify the fields in `kustomization.yaml`. A `kustomization.yaml` includes configurations that are consumed by Kustomize. Here is an example of `kustomization.yaml` file. - -`build` is to perform a set of pre-processing transformations on the resources inside one kustomization. 
Those transformations include: -- Get objects from the base -- Apply patches -- Add name prefix to all resources -- Add common label and annotation to all resources -- Replace imageTag is specified -- Update objects’ names where they are referenced -- Resolve variables and substitute them - -``` -# TODO: currently kustomization.yaml is not versioned -# Need to version this with apiVersion and Kind -# https://github.com/kubernetes-sigs/kustomize/issues/588 -namePrefix: alices- - -commonAnnotations: - oncallPager: 800-555-1212 - -configMapGenerator: -- name: myJavaServerEnvVars - literals: - - JAVA_HOME=/opt/java/jdk - - JAVA_TOOL_OPTIONS=-agentlib:hprof - -secretGenerator: -- name: app-sec - commands: - username: "echo admin" - password: "echo secret" -``` -The build output of this sample kustomizaiton.yaml file is -``` -apiVersion: v1 -data: - JAVA_HOME: /opt/java/jdk - JAVA_TOOL_OPTIONS: -agentlib:hprof -kind: ConfigMap -metadata: - annotations: - oncallPager: 800-555-1212 - name: alices-myJavaServerEnvVars-7bc9c27cmf ---- -apiVersion: v1 -data: - password: c2VjcmV0Cg== - username: YWRtaW4K -kind: Secret -metadata: - annotations: - oncallPager: 800-555-1212 - name: alices-app-sec-c7c5tbh526 -type: Opaque -``` - -## Proposal - -### UX - -When apply, get or delete is run on a directory, check if it contains a kustomization.yaml file. If there is, apply, get or delete the output of kustomize build. Kubectl behaves the same as current for directories without kustomization.yaml. - -#### apply -The command visible to users is -``` -kubectl apply -f <dir> -``` -To view the objects in a kustomization without applying them to the cluster -``` -kubectl apply -f <dir> --dry-run -o yaml|json -``` - -#### get -The command visible to users is -``` -kubectl get -f <dir> -``` -To get the detailed objects in a kustomization -``` -kubectl get -f <dir> --dry-run -o yaml|json -``` - -#### delete -The command visible to users is -``` -kubectl delete -f <dir> -``` - -### Implementation Details/Notes/Constraints - -To enable kustomize in kubectl, the function `FilenameParam` inside Builder type will be updated to recognize kustomization directories. The Builder will expand the sources in a kustomization directory and pass them to a subcommand. - - Since kustomization directories themselves have a recursive structure, `-R` will be ignored on those directories. Allowing recursive visit to the same files will lead to duplicate resources. - -The examples and descriptions for apply, get and delete will be updated to include the support of kustomization directories. - -### Risks and Mitigations - -This KEP doesn't provide a editing command for kustomization.yaml file. Users will either manually edit this file or use `Kusotmize edit`. - -## Graduation Criteria - -There are two signals that can indicate the success of this integration. -- Kustomize users drop the piped commands `kustomize build <dir> | kubectl apply -f - ` and start to use apply directly. -- Kubectl users put their configuration files in kustomization directories. 
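For illustration, the directory check described in the implementation notes above could be as small as the following sketch; the helper name is an assumption, not the actual cli-runtime code.

```go
// Illustrative only: deciding whether `-f <dir>` should be expanded via kustomize.
package sketch

import (
	"os"
	"path/filepath"
)

// isKustomizationDir reports whether path is a directory containing a
// kustomization.yaml, in which case the resources should come from the
// equivalent of `kustomize build <path>` rather than a plain file walk.
func isKustomizationDir(path string) bool {
	info, err := os.Stat(path)
	if err != nil || !info.IsDir() {
		return false
	}
	_, err = os.Stat(filepath.Join(path, "kustomization.yaml"))
	return err == nil
}
```

Because a kustomization directory is expanded as a single unit, this is also the point at which `-R` would be ignored, as noted above.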
-
-
-## Implementation History
-
-Most of the implementation will be in cli-runtime:
-
-- vendor `kustomize/pkg` into kubernetes
-- copy `kustomize/k8sdeps` into cli-runtime
-- implement a Visitor for kustomization directories which
-  - executes kustomize build to get a list of resources
-  - writes the output to a StreamVisitor
-- when parsing filename parameters in FilenameParam, look for kustomization directories
-- update the examples in kubectl commands
-- improve help messages and documentation to list the kubectl subcommands that can work with kustomization directories
-
-## Alternatives
-
-The approaches in this section were considered but rejected.
-
-### Copy kustomize into staging
-
-Copy the kustomize code into kubernetes/staging and make the staging kustomize the source of truth. The public kustomize repo would be synced automatically with the staging kustomize.
-- Pros
-  - Issues can be managed in the public repo
-  - The public repo can provide a kustomize binary
-  - The public repo can be used as a kustomize library
-  - Empowers further integration of kubectl with kustomize
-- Cons
-  - The staging repo is designed for libraries that will be separated out; moving kustomize into staging sounds controversial
-  - Kustomize would live in staging, so development would have to happen in the k/k repository
-  - Development velocity would be reduced by being tied to every release
-
-### Add kustomize as a subcommand in kubectl
-
-Adding kustomize as a subcommand of kubectl was the first way we tried to enable kustomize in kubectl. The PR was [add kustomize as a subcommand of kubectl](https://github.com/kubernetes/kubernetes/pull/70213).
-- Pros
-  - the kustomize command is visible to users
-  - the code change is straightforward
-  - easy to test
-- Cons
-  - the UX is not consistent with other kubectl subcommands
-  - the apply workflow still consists of two parts - `kubectl kustomize build dir | kubectl apply -f -`
+KEPs have moved to https://git.k8s.io/enhancements/.
+<!--
+This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first.
+-->
\ No newline at end of file diff --git a/keps/sig-cloud-provider/0002-cloud-controller-manager.md b/keps/sig-cloud-provider/0002-cloud-controller-manager.md index cb5a4073..cfd1f5fa 100644 --- a/keps/sig-cloud-provider/0002-cloud-controller-manager.md +++ b/keps/sig-cloud-provider/0002-cloud-controller-manager.md @@ -1,473 +1,4 @@ ---- -kep-number: 2 -title: Cloud Provider Controller Manager -authors: - - "@cheftako" - - "@calebamiles" - - "@hogepodge" -owning-sig: sig-apimachinery -participating-sigs: - - sig-apps - - sig-aws - - sig-azure - - sig-cloud-provider - - sig-gcp - - sig-network - - sig-openstack - - sig-storage -reviewers: - - "@andrewsykim" - - "@calebamiles" - - "@hogepodge" - - "@jagosan" -approvers: - - "@thockin" -editor: TBD -status: provisional -replaces: - - contributors/design-proposals/cloud-provider/cloud-provider-refactoring.md ---- - -# Remove Cloud Provider Code From Kubernetes Core - -## Table of Contents - -- [Remove Cloud Provider Code From Kubernetes Core](#remove-cloud-provider-code-from-kubernetes-core) - - [Table of Contents](#table-of-contents) - - [Summary](#summary) - - [Motivation](#motivation) - - [Goals](#goals) - - [Intermediary Goals](#intermediary-goals) - - [Non-Goals](#non-goals) - - [Proposal](#proposal) - - [Controller Manager Changes](#controller-manager-changes) - - [Kubelet Changes](#kubelet-changes) - - [API Server Changes](#api-server-changes) - - [Volume Management Changes](#volume-management-changes) - - [Deployment Changes](#deployment-changes) - - [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints) - - [Repository Requirements](#repository-requirements) - - [Notes for Repository Requirements](#notes-for-repository-requirements) - - [Repository Timeline](#repository-timeline) - - [Security Considerations](#security-considerations) - - [Graduation Criteria](#graduation-criteria) - - [Graduation to Beta](#graduation-to-beta) - - [Process Goals](#process-goals) - - [Implementation History](#implementation-history) - - [Alternatives](#alternatives) - -## Terms - -- **CCM**: Cloud Controller Manager - The controller manager responsible for running cloud provider dependent logic, -such as the service and route controllers. -- **KCM**: Kubernetes Controller Manager - The controller manager responsible for running generic Kubernetes logic, -such as job and node_lifecycle controllers. -- **KAS**: Kubernetes API Server - The core api server responsible for handling all API requests for the Kubernetes -control plane. This includes things like namespace, node, pod and job resources. -- **K8s/K8s**: The core kubernetes github repository. -- **K8s/cloud-provider**: Any or all of the repos for each cloud provider. Examples include [cloud-provider-gcp](https://github.com/kubernetes/cloud-provider-gcp), -[cloud-provider-aws](https://github.com/kubernetes/cloud-provider-aws) and [cloud-provider-azure](https://github.com/kubernetes/cloud-provider-azure). -We have created these repos for each of the in-tree cloud providers. This document assumes in various places that the -cloud providers will place the relevant code in these repos. Whether this is a long-term solution to which additional -cloud providers will be added, or an incremental step toward moving out of the Kubernetes org is out of scope of this -document, and merits discussion in a broader forum and input from SIG-Architecture and Steering Committee. -- **K8s SIGs/library**: Any SIG owned repository. 
-- **Staging**: Staging: Separate repositories which are currently visible under the K8s/K8s repo, which contain code -considered to be safe to be vendored outside of the K8s/K8s repo and which should eventually be fully separated from -the K8s/K8s repo. Contents of Staging are prevented from depending on code in K8s/K8s which are not in Staging. -Controlled by [publishing kubernetes-rules-configmap](https://github.com/kubernetes/publishing-bot/blob/master/configs/kubernetes-rules-configmap.yaml) - -## Summary - -We want to remove any cloud provider specific logic from the kubernetes/kubernetes repo. We want to restructure the code -to make it easy for any cloud provider to extend the kubernetes core in a consistent manner for their cloud. New cloud -providers should look at the [Creating a Custom Cluster from Scratch](https://kubernetes.io/docs/getting-started-guides/scratch/#cloud-provider) -and the [cloud provider interface](https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/cloud.go#L31) -which will need to be implemented. - -## Motivation - -We are trying to remove any dependencies from Kubernetes Core to any specific cloud provider. Currently we have seven -such dependencies. To prevent this number from growing we have locked Kubernetes Core to the addition of any new -dependencies. This means all new cloud providers have to implement all their pieces outside of the Core. -However everyone still ends up consuming the current set of seven in repo dependencies. For the seven in repo cloud -providers any changes to their specific cloud provider code requires OSS PR approvals and a deployment to get those -changes in to an official build. The relevant dependencies require changes in the following areas. - -- [Kube Controller Manager](https://kubernetes.io/docs/reference/generated/kube-controller-manager/) - Track usages of [CMServer.CloudProvider](https://github.com/kubernetes/kubernetes/blob/master/cmd/kube-controller-manager/app/options/options.go) -- [API Server](https://kubernetes.io/docs/reference/generated/kube-apiserver/) - Track usages of [ServerRunOptions.CloudProvider](https://github.com/kubernetes/kubernetes/blob/master/cmd/kube-apiserver/app/options/options.go) -- [Kubelet](https://kubernetes.io/docs/reference/generated/kubelet/) - Track usages of [KubeletFlags.CloudProvider](https://github.com/kubernetes/kubernetes/blob/master/cmd/kubelet/app/options/options.go) -- [How Cloud Provider Functionality is deployed to and enabled in the cluster](https://kubernetes.io/docs/setup/pick-right-solution/#hosted-solutions) - Track usage from [PROVIDER_UTILS](https://github.com/kubernetes/kubernetes/blob/master/cluster/kube-util.sh) - -For the cloud providers who are in repo, moving out would allow them to more quickly iterate on their solution and -decouple cloud provider fixes from open source releases. Moving the cloud provider code out of the open source -processes means that these processes do not need to load/run unnecessary code for the environment they are in. -We would like to abstract a core controller manager library so help standardize the behavior of the cloud -controller managers produced by each cloud provider. We would like to minimize the number and scope of controllers -running in the cloud controller manager so as to minimize the surface area for per cloud provider deviation. - -### Goals - -- Get to a point where we do not load the cloud interface for any of kubernetes core processes. 
-- Remove all cloud provider specific code from kubernetes/kubernetes. -- Have a generic controller manager library available for use by the per cloud provider controller managers. -- Move the cloud provider specific controller manager logic into repos appropriate for those cloud providers. - -### Intermediary Goals - -Have a cloud controller manager in the kubernetes main repo which hosts all of -the controller loops for the in repo cloud providers. -Do not run any cloud provider logic in the kube controller manager, the kube apiserver or the kubelet. -At intermediary points we may just move some of the cloud specific controllers out. (Eg. volumes may be later than the rest) - -### Non-Goals - -Forcing cloud providers to use the generic cloud manager. - -## Proposal - -### Controller Manager Changes - -For the controller manager we would like to create a set of common code which can be used by both the cloud controller -manager and the kube controller manager. The cloud controller manager would then be responsible for running controllers -whose function is specific to cloud provider functionality. The kube controller manager would then be responsible -for running all controllers whose function was not related to a cloud provider. - -In order to create a 100% cloud independent controller manager, the controller-manager will be split into multiple binaries. - -1. Cloud dependent controller-manager binaries -2. Cloud independent controller-manager binaries - This is the existing `kube-controller-manager` that is being shipped -with kubernetes releases. - -The cloud dependent binaries will run those loops that rely on cloudprovider in a separate process(es) within the kubernetes control plane. -The rest of the controllers will be run in the cloud independent controller manager. -The decision to run entire controller loops, rather than only the very minute parts that rely on cloud provider was made -because it makes the implementation simple. Otherwise, the shared data structures and utility functions have to be -disentangled, and carefully separated to avoid any concurrency issues. This approach among other things, prevents code -duplication and improves development velocity. - -Note that the controller loop implementation will continue to reside in the core repository. It takes in -cloudprovider.Interface as an input in its constructor. Vendor maintained cloud-controller-manager binary could link -these controllers in, as it serves as a reference form of the controller implementation. - -There are four controllers that rely on cloud provider specific code. These are node controller, service controller, -route controller and attach detach controller. Copies of each of these controllers have been bundled together into -one binary. The cloud dependent binary registers itself as a controller, and runs the cloud specific controller loops -with the user-agent named "external-controller-manager". - -RouteController and serviceController are entirely cloud specific. Therefore, it is really simple to move these two -controller loops out of the cloud-independent binary and into the cloud dependent binary. - -NodeController does a lot more than just talk to the cloud. It does the following operations - - -1. CIDR management -2. Monitor Node Status -3. Node Pod Eviction - -While Monitoring Node status, if the status reported by kubelet is either 'ConditionUnknown' or 'ConditionFalse', then -the controller checks if the node has been deleted from the cloud provider. 
If it has already been deleted from the
-cloud provider, then it deletes the node object without waiting for the `monitorGracePeriod` amount of time. This is the
-only operation that needs to be moved into the cloud dependent controller manager.
-
-Finally, the attachDetachController is tricky, and it is not simple to disentangle it from the controller-manager
-easily; therefore, this will be addressed with Flex Volumes (discussed under a separate section below).
-
-
-The kube-controller-manager has many controller loops. [See NewControllerInitializers](https://github.com/kubernetes/kubernetes/blob/release-1.9/cmd/kube-controller-manager/app/controllermanager.go#L332)
-
- - [nodeIpamController](https://github.com/kubernetes/kubernetes/tree/release-1.10/pkg/controller/nodeipam)
- - [nodeLifecycleController](https://github.com/kubernetes/kubernetes/tree/release-1.10/pkg/controller/nodelifecycle)
- - [volumeController](https://github.com/kubernetes/kubernetes/tree/release-1.9/pkg/controller/volume)
- - [routeController](https://github.com/kubernetes/kubernetes/tree/release-1.9/pkg/controller/route)
- - [serviceController](https://github.com/kubernetes/kubernetes/tree/release-1.9/pkg/controller/service)
- - replicationController
- - endpointController
- - resourceQuotaController
- - namespaceController
- - deploymentController
- - etc..
-
-Among these controller loops, the following are cloud provider dependent.
-
- - [nodeIpamController](https://github.com/kubernetes/kubernetes/tree/release-1.10/pkg/controller/nodeipam)
- - [nodeLifecycleController](https://github.com/kubernetes/kubernetes/tree/release-1.10/pkg/controller/nodelifecycle)
- - [volumeController](https://github.com/kubernetes/kubernetes/tree/release-1.9/pkg/controller/volume)
- - [routeController](https://github.com/kubernetes/kubernetes/tree/release-1.9/pkg/controller/route)
- - [serviceController](https://github.com/kubernetes/kubernetes/tree/release-1.9/pkg/controller/service)
-
-The nodeIpamController uses the cloudprovider to handle cloud specific CIDR assignment of a node. Currently the only
-cloud provider using this functionality is GCE. So the current plan is to break this functionality out of the common
-version of the nodeIpamController. Most cloud providers can just run the default version of this controller. However, any
-cloud provider which needs a cloud specific version of this functionality can disable the default version running in the
-KCM and run their own version in the CCM.
-
-The nodeLifecycleController uses the cloudprovider to check if a node has been deleted from/exists in the cloud.
-If the cloud provider reports a node as deleted, then this controller immediately deletes the node from kubernetes.
-This check removes the need to wait for a specific amount of time to conclude that an inactive node is actually dead.
-The current plan is to move this functionality into its own controller, allowing the nodeLifecycleController to remain in
-K8s/K8s and the Kube Controller Manager.
-
-The volumeController uses the cloudprovider to create, delete, attach and detach volumes to nodes. For instance, the
-logic for provisioning, attaching, and detaching an EBS volume resides in the AWS cloudprovider. The volumeController
-uses this code to perform its operations.
-
-The routeController configures routes for hosts in the cloud provider.
-
-The serviceController maintains a list of currently active nodes, and is responsible for creating and deleting
-LoadBalancers in the underlying cloud.
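-
-To make the constructor-injection pattern described above concrete, here is a minimal, self-contained Go sketch. It is
-not the actual kubernetes code; the `CloudProvider` interface and `ServiceController` type below are simplified
-stand-ins for `cloudprovider.Interface` and the real service controller, used only to illustrate how a cloud dependent
-controller loop takes a cloud provider implementation in its constructor and can be linked into a provider-maintained
-controller manager binary.
-
-```
-package main
-
-import "fmt"
-
-// CloudProvider is a simplified stand-in for the real cloud provider interface.
-type CloudProvider interface {
-    // EnsureLoadBalancer creates or updates a load balancer for a service.
-    EnsureLoadBalancer(serviceName string) error
-}
-
-// ServiceController is a cloud dependent controller loop. It only knows about
-// the interface, never about a concrete cloud.
-type ServiceController struct {
-    cloud CloudProvider
-}
-
-// NewServiceController takes the cloud provider implementation in its
-// constructor, mirroring the pattern described above.
-func NewServiceController(cloud CloudProvider) *ServiceController {
-    return &ServiceController{cloud: cloud}
-}
-
-// Run would normally watch Service objects; here it just exercises the cloud.
-func (c *ServiceController) Run() error {
-    return c.cloud.EnsureLoadBalancer("my-service")
-}
-
-// fakeCloud stands in for a vendor implementation that would live out of tree.
-type fakeCloud struct{}
-
-func (fakeCloud) EnsureLoadBalancer(serviceName string) error {
-    fmt.Println("ensuring load balancer for", serviceName)
-    return nil
-}
-
-func main() {
-    // A provider-specific cloud-controller-manager wires its own implementation in here.
-    controller := NewServiceController(fakeCloud{})
-    if err := controller.Run(); err != nil {
-        fmt.Println("error:", err)
-    }
-}
-```
-
-Because the controller only depends on the interface, the loop itself can stay in the core repository while each
-vendor's implementation lives in its own repo, which is the split this proposal is driving toward.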
-
-### Kubelet Changes
-
-Moving on to the kubelet, the following cloud provider dependencies exist in the kubelet.
-
- - Find the cloud nodename of the host that kubelet is running on, for the following reasons:
-    1. To obtain the config map for the kubelet, if one already exists
-    2. To uniquely identify the current node using the nodeInformer
-    3. To instantiate a reference to the current node object
- - Find the InstanceID, ProviderID, ExternalID, Zone Info of the node object while initializing it
- - Periodically poll the cloud provider to figure out if the node has any new IP addresses associated with it
- - It sets a condition that makes the node unschedulable until cloud routes are configured.
- - It allows the cloud provider to post-process DNS settings
-
-The majority of the calls by the kubelet to the cloud are made during the initialization of the Node Object. The other
-uses are for configuring Routes (in case of GCE), scrubbing DNS, and periodically polling for IP addresses.
-
-All of the above steps, except the Node initialization step, can be moved into a controller. Specifically, IP address
-polling and configuration of Routes can be moved into the cloud dependent controller manager.
-
-[Scrubbing DNS was found to be redundant](https://github.com/kubernetes/kubernetes/pull/36785). So, it can be disregarded. It is being removed.
-
-Finally, Node initialization needs to be addressed. This is the trickiest part. Pods will be scheduled even on
-uninitialized nodes. This can lead to scheduling pods in incompatible zones, and other weird errors. Therefore, an
-approach is needed where kubelet can create a Node, but mark it as "NotReady". Then, some asynchronous process can
-update it and mark it as ready. This is now possible because of the concept of Taints.
-
-This approach requires kubelet to be started with known taints. This will make the node unschedulable until these
-taints are removed. The external cloud controller manager will asynchronously update the node objects and remove the
-taints.
-
-### API Server Changes
-
-Finally, in the kube-apiserver, the cloud provider is used for transferring SSH keys to all of the nodes, and within an
-admission controller for setting labels on persistent volumes.
-
-Kube-apiserver uses the cloud provider for two purposes:
-
-1. Distribute SSH Keys - This can be moved to the cloud dependent controller manager
-2. Admission Controller for PV - This can be refactored using the taints approach used in Kubelet
-
-### Volume Management Changes
-
-Volumes need cloud providers, but they only need **specific** cloud providers. The majority of volume management logic
-resides in the controller manager. These controller loops need to be moved into the cloud-controller manager. The cloud
-controller manager also needs a mechanism to read parameters for initialization from cloud config. This can be done via
-config maps.
-
-There are two entirely different approaches to refactoring volumes:
-[Flex Volumes](https://github.com/kubernetes/community/blob/master/contributors/devel/flexvolume.md) and
-[CSI Container Storage Interface](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/container-storage-interface.md). There is an ongoing effort to move all
-of the volume logic from the controller-manager into plugins called Flex Volumes. In the Flex volumes world, all of the
-vendor specific code will be packaged in a separate binary as a plugin.
After discussing with @thockin, this was
-decided to be the best approach for removing all cloud provider dependencies for volumes from kubernetes core. Some of the discovery
-information for this can be found at [https://goo.gl/CtzpVm](https://goo.gl/CtzpVm).
-
-### Deployment Changes
-
-This change will introduce new binaries to the list of binaries required to run kubernetes. The change will be designed
-such that these binaries can be installed via `kubectl apply -f` and the appropriate instances of the binaries will be
-running.
-
-Issues such as monitoring and configuring the new binaries will generally be left to the cloud provider. However they should
-ensure that test runs upload the logs for these new processes to [test grid](https://k8s-testgrid.appspot.com/).
-
-Applying the cloud controller manager is the only step that is different in the upgrade process.
-In order to complete the upgrade process, you need to apply the cloud-controller-manager deployment to the setup.
-A deployment descriptor file will be provided with this change. You need to apply this change using
-
-```
-kubectl apply -f cloud-controller-manager.yml
-```
-
-This will start the cloud specific controller manager in your kubernetes setup.
-
-The downgrade steps are also the same as before for all the components except the cloud-controller-manager.
-In case of the cloud-controller-manager, the deployment should be deleted using
-
-```
-kubectl delete -f cloud-controller-manager.yml
-```
-
-### Implementation Details/Notes/Constraints
-
-#### Repository Requirements
-
-**This is a proposed structure, and may change during the 1.11 release cycle.
-WG-Cloud-Provider will work with individual sigs to refine these requirements
-to maintain consistency while meeting the technical needs of the provider
-maintainers**
-
-Each cloud provider hosted within the `kubernetes` organization shall have a
-single repository named `kubernetes/cloud-provider-<provider_name>`. Those
-repositories shall have the following structure:
-
-* A `cloud-controller-manager` subdirectory that contains the implementation
-  of the provider-specific cloud controller.
-* A `docs` subdirectory.
-* A `docs/cloud-controller-manager.md` file that describes the options and
-  usage of the cloud controller manager code.
-* A `docs/testing.md` file that describes how the provider code is tested.
-* A `Makefile` with a `test` entrypoint to run the provider tests.
-
-Additionally, the repository should have:
-
-* A `docs/getting-started.md` file that describes the installation and basic
-  operation of the cloud controller manager code.
-
-Where the provider has additional capabilities, the repository should have
-the following subdirectories that contain the common features:
-
-* `dns` for DNS provider code.
-* `cni` for the Container Network Interface (CNI) driver.
-* `csi` for the Container Storage Interface (CSI) driver.
-* `flex` for the Flex Volume driver.
-* `installer` for custom installer code.
-
-Each repository may have additional directories and files that are used for
-additional features that include but are not limited to:
-
-* Other provider specific testing.
-* Additional documentation, including examples and developer documentation.
-* Dependencies on provider-hosted or other external code.
-
-
-##### Notes for Repository Requirements
-
-The purpose of these requirements is to define a common structure for the
-cloud provider repositories owned by current and future cloud provider SIGs.
-In accordance with the -[WG-Cloud-Provider Charter](https://docs.google.com/document/d/1m4Kvnh_u_9cENEE9n1ifYowQEFSgiHnbw43urGJMB64/edit#) -to "define a set of common expected behaviors across cloud providers", this -proposal defines the location and structure of commonly expected code. - -As each provider can and will have additional features that go beyond expected -common code, requirements only apply to the location of the -following code: - -* Cloud Controller Manager implementations. -* Documentation. - -This document may be amended with additional locations that relate to enabling -consistent upstream testing, independent storage drivers, and other code with -common integration hooks may be added - -The development of the -[Cloud Controller Manager](https://github.com/kubernetes/kubernetes/tree/master/cmd/cloud-controller-manager) -and -[Cloud Provider Interface](https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/cloud.go) -has enabled the provider SIGs to develop external providers that -capture the core functionality of the upstream providers. By defining the -expected locations and naming conventions of where the external provider code -is, we will create a consistent experience for: - -* Users of the providers, who will have easily understandable conventions for - discovering and using all of the providers. -* SIG-Docs, who will have a common hook for building or linking to externally - managed documentation -* SIG-Testing, who will be able to use common entry points for enabling - provider-specific e2e testing. -* Future cloud provider authors, who will have a common framework and examples - from which to build and share their code base. - -##### Repository Timeline - -To facilitate community development, providers named in the -[Makes SIGs responsible for implementations of `CloudProvider`](https://github.com/kubernetes/community/pull/1862) -patch can immediately migrate their external provider work into their named -repositories. - -Each provider will work to implement the required structure during the -Kubernetes 1.11 development cycle, with conformance by the 1.11 release. -WG-Cloud-Provider may actively change repository requirements during the -1.11 release cycle to respond to collective SIG technical needs. - -After the 1.11 release all current and new provider implementations must -conform with the requirements outlined in this document. - -### Security Considerations - -Make sure that you consider the impact of this feature from the point of view of Security. - -## Graduation Criteria - -How will we know that this has succeeded? -Gathering user feedback is crucial for building high quality experiences and SIGs have the important responsibility of -setting milestones for stability and completeness. -Hopefully the content previously contained in [umbrella issues][] will be tracked in the `Graduation Criteria` section. - -[umbrella issues]: https://github.com/kubernetes/kubernetes/issues/42752 - -### Graduation to Beta - -As part of the graduation to `stable` or General Availability (GA), we have set -both process and technical goals. 
-
-#### Process Goals
-
-- 
-
-We propose the following repository structure for the cloud providers which
-currently live in `kubernetes/pkg/cloudprovider/providers/*`
-
-```
-git@github.com:kubernetes/cloud-provider-wg
-git@github.com:kubernetes/cloud-provider-aws
-git@github.com:kubernetes/cloud-provider-azure
-git@github.com:kubernetes/cloud-provider-gcp
-git@github.com:kubernetes/cloud-provider-openstack
-```
-
-We propose this structure in order to obtain
-
-- ease of contributor onboarding and offboarding by creating repositories under
-  the existing `kubernetes` GitHub organization
-- ease of automation turn-up using existing tooling
-- unambiguous ownership of assets by the CNCF
-
-The use of a tracking repository `git@github.com:kubernetes/wg-cloud-provider`
-is proposed to
-
-- create an index of all cloud providers which WG Cloud Provider believes
-  should be highlighted based on defined criteria for quality, usage, and other
-  requirements deemed necessary by the working group
-- serve as a location for tracking issues which affect all Cloud Providers
-- serve as a repository for user experience reports related to Cloud Providers
-  which live within the Kubernetes GitHub organization or desire to do so
-
-Major milestones:
-
-- March 18, 2018: Accepted proposal for repository requirements.
-
-*Major milestones in the life cycle of a KEP should be tracked in `Implementation History`.
-Major milestones might include
-
-- the `Summary` and `Motivation` sections being merged signaling SIG acceptance
-- the `Proposal` section being merged signaling agreement on a proposed design
-- the date implementation started
-- the first Kubernetes release where an initial version of the KEP was available
-- the version of Kubernetes where the KEP graduated to general availability
-- when the KEP was retired or superseded*
-
-The ultimate intention of WG Cloud Provider is to prevent multiple classes
-of software purporting to be an implementation of the Cloud Provider interface
-from fracturing the Kubernetes Community, while also ensuring that new Cloud
-Providers adhere to standards of quality and that their management follows Kubernetes
-Community norms.
-
-## Alternatives
-
-One alternative to consider is the use of a side-car. The cloud-interface in tree could then be a [GRPC](https://github.com/grpc/grpc-go)
-call out to that side-car. We could then leave the Kube API Server, Kube Controller Manager and Kubelet pretty much as is.
-We would still need separate repos to hold the code for the side-car and to handle cluster setup for the cloud provider.
-However we believe that different cloud providers will (already) want different control loops. As such we are likely to need
-something like the cloud controller manager anyway. From that perspective it seems easier to centralize the effort in that
-direction. In addition it should limit the proliferation of new processes across the entire cluster.
+KEPs have moved to https://git.k8s.io/enhancements/.
+<!--
+This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first.
+-->
\ No newline at end of file diff --git a/keps/sig-cloud-provider/0013-build-deploy-ccm.md b/keps/sig-cloud-provider/0013-build-deploy-ccm.md index e0775180..cfd1f5fa 100644 --- a/keps/sig-cloud-provider/0013-build-deploy-ccm.md +++ b/keps/sig-cloud-provider/0013-build-deploy-ccm.md @@ -1,494 +1,4 @@ ---- -kep-number: 13 -title: Switching To Cloud Provider Repo And Builds -authors: - - "@cheftako" - - "@calebamiles" - - "@nckturner" -owning-sig: sig-apimachinery -participating-sigs: - - sig-apps - - sig-aws - - sig-azure - - sig-cloud-provider - - sig-gcp - - sig-network - - sig-openstack - - sig-storage -reviewers: - - "@andrewsykim" - - "@calebamiles" - - "@nckturner" -approvers: - - "@thockin" -editor: TBD -status: provisional ---- - -# Switching To Cloud Provider Repo And Builds - -## How To Remove Cloud Provider Code From Kubernetes Core - -## Table of Contents - -- [Switching To Cloud Provider Repo And Builds](#switching-to-cloud-provider-repo-and-builds) - - [Table of Contents](#table-of-contents) - - [Terms](#terms) - - [Summary](#summary) - - [Motivation](#motivation) - - [Goals](#goals) - - [Intermediary Goals](#intermediary-goals) - - [Non-Goals](#non-goals) - - [Proposal](#proposal) - - [Building CCM For Cloud Providers](#building-ccm-for-cloud-providers) - - [Background](#background) - - [Design Options](#design-options) - - [Staging](#staging) - - [Cloud Provider Instances](#cloud-provider-instances) - - [Build Targets](#build-targets) - - [K8s/K8s Releases and K8s/CP Releases](#k8s/k8s-releases-and-k8s/cp-releases) - - [Migrating to a CCM Build](#migrating-to-a-ccm-build) - - [Flags, Service Accounts, etc](#flags,-service-accounts,-etc) - - [Deployment Scripts](#deployment-scripts) - - [kubeadm](#kubeadm) - - [CI, TestGrid and Other Testing Issues](#ci,-testgrid-and-other-testing-issues) - - [Alternatives](#alternatives) - - [Staging Alternatives](#staging-alternatives) - - [Git Filter-Branch](#Git Filter-Branch) - - [Build Location Alternatives](#Build Location Alternatives) - - [Build K8s/K8s from within K8s/Cloud-provider](#build-k8s/k8s-from-within-k8s/cloud-provider) - - [Build K8s/Cloud-provider within K8s/K8s](#build-k8s/cloud-provider-within-k8s/k8s) - - [Config Alternatives](#config-alternatives) - - [Use component config to determine where controllers run](#use-component-config-to-determine-where-controllers-run) - -## Terms - -- **CCM**: Cloud Controller Manager - The controller manager responsible for running cloud provider dependent logic, -such as the service and route controllers. -- **KCM**: Kubernetes Controller Manager - The controller manager responsible for running generic Kubernetes logic, -such as job and node_lifecycle controllers. -- **KAS**: Kubernetes API Server - The core api server responsible for handling all API requests for the Kubernetes -control plane. This includes things like namespace, node, pod and job resources. -- **K8s/K8s**: The core kubernetes github repository. -- **K8s/cloud-provider**: Any or all of the repos for each cloud provider. Examples include [cloud-provider-gcp](https://github.com/kubernetes/cloud-provider-gcp), -[cloud-provider-aws](https://github.com/kubernetes/cloud-provider-aws) and [cloud-provider-azure](https://github.com/kubernetes/cloud-provider-azure). -We have created these repos for each of the in-tree cloud providers. This document assumes in various places that the -cloud providers will place the relevant code in these repos. 
Whether this is a long-term solution to which additional -cloud providers will be added, or an incremental step toward moving out of the Kubernetes org is out of scope of this -document, and merits discussion in a broader forum and input from SIG-Architecture and Steering Committee. -- **K8s SIGs/library**: Any SIG owned repository. -- **Staging**: Staging: Separate repositories which are currently visible under the K8s/K8s repo, which contain code -considered to be safe to be vendored outside of the K8s/K8s repo and which should eventually be fully separated from -the K8s/K8s repo. Contents of Staging are prevented from depending on code in K8s/K8s which are not in Staging. -Controlled by [publishing kubernetes-rules-configmap](https://github.com/kubernetes/publishing-bot/blob/master/configs/kubernetes-rules-configmap.yaml) - - -## Summary - -We want to remove any cloud provider specific logic from the kubernetes/kubernetes repo. We want to restructure the code -to make it easy for any cloud provider to extend the kubernetes core in a consistent manner for their cloud. New cloud -providers should look at the [Creating a Custom Cluster from Scratch](https://kubernetes.io/docs/getting-started-guides/scratch/#cloud-provider) -and the [cloud provider interface](https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/cloud.go#L31) -which will need to be implemented. - -## Motivation - -We are trying to remove any dependencies from Kubernetes Core to any specific cloud provider. Currently we have seven -such dependencies. To prevent this number from growing we have locked Kubernetes Core to the addition of any new -dependencies. This means all new cloud providers have to implement all their pieces outside of the Core. -However everyone still ends up consuming the current set of seven in repo dependencies. For the seven in repo cloud -providers any changes to their specific cloud provider code requires OSS PR approvals and a deployment to get those -changes in to an official build. The relevant dependencies require changes in the following areas. - -- [Kube Controller Manager](https://kubernetes.io/docs/reference/generated/kube-controller-manager/) - Track usages of [CMServer.CloudProvider](https://github.com/kubernetes/kubernetes/blob/master/cmd/kube-controller-manager/app/options/options.go) -- [API Server](https://kubernetes.io/docs/reference/generated/kube-apiserver/) - Track usages of [ServerRunOptions.CloudProvider](https://github.com/kubernetes/kubernetes/blob/master/cmd/kube-apiserver/app/options/options.go) -- [kubelet](https://kubernetes.io/docs/reference/generated/kubelet/) - Track usages of [KubeletFlags.CloudProvider](https://github.com/kubernetes/kubernetes/blob/master/cmd/kubelet/app/options/options.go) -- [How Cloud Provider Functionality is deployed to and enabled in the cluster](https://kubernetes.io/docs/setup/pick-right-solution/#hosted-solutions) - Track usage from [PROVIDER_UTILS](https://github.com/kubernetes/kubernetes/blob/master/cluster/kube-util.sh) - -For the cloud providers who are in repo, moving out would allow them to more quickly iterate on their solution and -decouple cloud provider fixes from open source releases. Moving the cloud provider code out of the open source -processes means that these processes do not need to load/run unnecessary code for the environment they are in. -We would like to abstract a core controller manager library so help standardize the behavior of the cloud -controller managers produced by each cloud provider. 
We would like to minimize the number and scope of controllers
-running in the cloud controller manager so as to minimize the surface area for per cloud provider deviation.
-
-### Goals
-
-- Get to a point where we do not load the cloud interface for any of the kubernetes core processes.
-- Remove all cloud provider specific code from kubernetes/kubernetes.
-- Have a generic controller manager library available for use by the per cloud provider controller managers.
-- Move the cloud provider specific controller manager logic into repos appropriate for those cloud providers.
-
-### Intermediary Goals
-
-Have a cloud controller manager in the kubernetes main repo which hosts all of
-the controller loops for the in repo cloud providers.
-Do not run any cloud provider logic in the kube controller manager, the kube apiserver or the kubelet.
-At intermediary points we may just move some of the cloud specific controllers out. (Eg. volumes may be later than the rest)
-
-### Non-Goals
-
-Forcing cloud providers to use the generic cloud manager.
-
-## Proposal
-
-## Building CCM For Cloud Providers
-
-### Background
-
-The CCM in K8s/K8s links in all eight of the in-tree providers (aws, azure, cloudstack, gce, openstack, ovirt, photon
-and vsphere). Each of these providers has an implementation of the Cloud Provider Interface in the K8s/K8s repo. CNCF
-has created a repo for each of the cloud providers to extract their cloud specific code into. The assumption is that
-this will include the CCM executable, their Cloud Provider Implementation and various build and deploy scripts. Until
-we have extracted every in-tree provider and removed cloud provider dependencies from other binaries, we need to
-maintain the Cloud Provider Implementation in the K8s/K8s repo. After the Cloud Provider specific code has been
-extracted, Golang imports for things like the service controller, prometheus code, utilities etc will still require each
-cloud provider specific repository to vendor in a significant portion of the code in k8s/k8s.
-
-We need a solution which meets the objective in both the short and long term. In the short term we cannot delete CCM or
-cloud provider code from the K8s/K8s repo. We need to keep this code in K8s/K8s while we still support cloud provider
-deployments from K8s/K8s. In the long term this code should be part of each cloud provider's repo and that code should
-be removed from K8s/K8s. This suggests that in the short term that code should have one source of truth. However it
-should probably not end up in the vendor directory as that is not its intended final home. Other code such as the
-controllers should end up in the vendor directory. Additionally each provider will need their own copy of
-pkg/cloudprovider/providers/providers.go and a related build file to properly control which Cloud Provider Implementations
-get linked in (see the registration sketch below).
-
-We also need to be able to package a combination of binaries from K8s/K8s and K8s/cloud-provider-<cp> into a deployable
-package. The code for this will need to accommodate things like differing side cars for each cloud provider’s CSI
-implementation and the possible desire to run additional controller managers or extension api servers. As such it seems
-better to have this code live in the cloud provider specific repo. This also allows this code to be simpler as it does
-not have to attempt to support all the different cloud provider configurations. This is separate from things like the
-local deployment option, which K8s/K8s should continue to support.
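-
-A note on how that per-provider control over which Cloud Provider Implementations get linked in works: the in-tree
-providers rely on Go's blank-import and registration idiom, where each implementation registers itself from an `init`
-function, so the set of available providers is decided purely by the import list in the wrapper file. The following is
-a minimal, self-contained sketch of that idiom; the names here (`RegisterCloudProvider`, the `registry` map and the
-"fake" provider) are simplified placeholders rather than the actual cloudprovider package API.
-
-```
-package main
-
-import "fmt"
-
-// registry maps provider names to factory functions; the real code keeps an
-// equivalent map inside the shared cloudprovider package.
-var registry = map[string]func() string{}
-
-// RegisterCloudProvider is what a provider package would call from its init
-// function when it is pulled in via a blank import such as
-// `_ "k8s.io/cloudprovider-gce"`.
-func RegisterCloudProvider(name string, factory func() string) {
-    registry[name] = factory
-}
-
-// In a real build this init would live in the provider's own package, so only
-// imported providers ever appear in the registry.
-func init() {
-    RegisterCloudProvider("fake", func() string { return "fake cloud initialized" })
-}
-
-func main() {
-    // The --cloud-provider flag selects the name at runtime; only providers
-    // that were imported (and therefore registered) can be selected.
-    factory, ok := registry["fake"]
-    if !ok {
-        fmt.Println("provider not linked into this binary")
-        return
-    }
-    fmt.Println(factory())
-}
-```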
-
-Lastly there are specific flags which need to be set on various binaries for this to work. Kubernetes API-Server,
-Kubernetes Controller-Manager and kubelet should all have the --cloud-provider flag set to external. For the Cloud
-Controller-Manager the --cloud-provider flag should be set appropriately for that cloud provider. In addition we need
-to set the set of controllers running in the Kubernetes Controller-Manager. More on that later.
-
-For further background material please look at [running cloud controller](https://kubernetes.io/docs/tasks/administer-cluster/running-cloud-controller/).
-
-### Design Options
-
-For the short term problem of sharing code between the K8s/K8s repo, the K8s/cloud-provider repos and K8s SIGs/library repos,
-there are 2 fundamental solutions. We can push more code into Staging to make the code available to the
-K8s/cloud-provider repos. Currently, code which needs to be shared between K8s/K8s itself and other projects is put
-in Staging. This allows the build machinery to detect things like attempts to make the shared code depend on
-code which is not shared (i.e. disallowing code in Staging from depending on code in K8s/K8s but not in Staging).
-In addition, using Staging means that we can benefit from work to properly break out the libraries/code which exists in
-Staging. Most of the repositories we are adding to staging should end up as K8s SIGs/library repos. The K8s/cloud-provider
-repos should be the top of any build dependency trees. Code which needs to be consumed by other repos (Eg CSI plugins,
-shared controllers, ...) should not be in the K8s/cloud-provider repo but in an appropriate K8s SIGs/library repo.
-
-The code which needs to be shared can be broken into several types.
-
-There is code which properly belongs in the various cloud-provider repos. The classic example of this would be the
-implementations of the cloud provider interface. (Eg. [gce](https://github.com/kubernetes/kubernetes/tree/master/pkg/cloudprovider/providers/gce)
-or [aws](https://github.com/kubernetes/kubernetes/tree/master/pkg/cloudprovider/providers/aws)) These sections of code
-need to be shared until we can remove all cloud provider dependencies from K8s/K8s. (i.e. When KAS, KCM and kubelet no
-longer contain a cloud-provider flag and no longer depend on either the cloud provider interface or any cloud provider
-implementations) At that point they should be permanently moved to the individual provider repos. I would suggest that
-as long as the code is shared it be in vendor for K8s/cloud-provider. We would want to create a separate Staging repo
-in K8s/K8s for each cloud-provider.
-
-The majority of the controller-manager framework and its dependencies also needs to be shared. This is code that would
-be shared even after cloud providers are removed from K8s/K8s. As such it probably makes sense to make the controller
-manager framework its own K8s/K8s Staging repo.
-
-It should be generally possible for cloud providers to determine where a controller runs and even over-ride specific
-controller functionality. Please note that if a cloud provider exercises this possibility it is up to that cloud provider
-to keep their custom controller conformant to the K8s/K8s standard. This means any controllers may be run in either KCM
-or CCM. As an example, the NodeIpamController will be shared across K8s/K8s and K8s/cloud-provider-gce, both in the
-short and long term. Currently it needs to take a cloud provider to allow it to do GCE CIDR management.
We could handle
-this by leaving the cloud provider interface with the controller manager framework code. The GCE controller manager could
-then inject the cloud provider for that controller. For everyone else (especially the KCM) the NodeIpamController is
-interesting because currently everyone needs its generic behavior for things like ranges. However Google then wires in
-the cloud provider to provide custom functionality on things like CIDRs. The thought then is that in the short term we
-allow it to be run in either KCM or CCM. Most cloud providers would run it in the KCM, while Google would run it in the CCM.
-When we are ready to move cloud provider code out of K8s/K8s, we remove the cloud provider code from the version which
-is in K8s/K8s and continue to have flags to control if it runs. K8s/Cloud-Provider-Google could then have an enhanced
-version which is run in the CCM. Other controllers such as Route and Service need to run in either the KCM or CCM. For
-things like K8s/K8s e2e tests we will always want these controllers in the K8s/K8s repo. Having them in the K8s/K8s repo
-is also useful for keeping the behavior of these sorts of core systems consistent.
-
-#### Staging
-
-There are several sections of code which need to be shared between the K8s/K8s repo and the K8s/Cloud-provider repos.
-The plan for doing that sharing is to move the relevant code into the Staging directory as that is where we share code
-today. The current Staging repo has the following packages in it.
-- Api
-- Apiextensions-apiserver
-- Apimachinery
-- Apiserver
-- Client-go
-- Code-generator
-- Kube-aggregator
-- Metrics
-- Sample-apiserver
-- Sample-Controller
-
-With the additions needed in the short term to make this work, the Staging area would now need to look as follows.
-- Api
-- Apiextensions-apiserver
-- Apimachinery
-- Apiserver
-- Client-go
-- **Controller**
-  - **Cloud**
-  - **Service**
-  - **NodeIpam**
-  - **Route**
-  - **?Volume?**
-- **Controller-manager**
-- **Cloud-provider-aws**
-- **Cloud-provider-azure**
-- **Cloud-provider-cloudstack**
-- **Cloud-provider-gce**
-- **Cloud-provider-openstack**
-- **Cloud-provider-ovirt**
-- **Cloud-provider-photon**
-- **Cloud-provider-vsphere**
-- Code-generator
-- Kube-aggregator
-- **Kube-utilities**
-- Metrics
-- Sample-apiserver
-- Sample-Controller
-
-When we complete the cloud provider work, several of the new modules in staging should be moving to their permanent new
-home in the appropriate K8s/Cloud-provider repos; they will no longer be needed in the K8s/K8s repo. There are however
-other new modules we will add which continue to be needed by both K8s/K8s and K8s/Cloud-provider. Those modules will
-remain in Staging until the Staging initiative completes and they are moved into some other Kubernetes shared code repo.
-
-- Api
-- Apiextensions-apiserver
-- Apimachinery
-- Apiserver
-- Client-go
-- **Controller**
-  - **Cloud**
-  - **Service**
-  - **NodeIpam**
-  - **Route**
-  - **?Volume?**
-- **Controller-manager**
-- ~~Cloud-provider-aws~~
-- ~~Cloud-provider-azure~~
-- ~~Cloud-provider-cloudstack~~
-- ~~Cloud-provider-gce~~
-- ~~Cloud-provider-openstack~~
-- ~~Cloud-provider-ovirt~~
-- ~~Cloud-provider-photon~~
-- ~~Cloud-provider-vsphere~~
-- Code-generator
-- Kube-aggregator
-- **Kube-utilities**
-- Metrics
-- Sample-apiserver
-- Sample-Controller
-
-#### Cloud Provider Instances
-
-Currently in K8s/K8s the cloud providers are actually included by including the [providers.go](https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/providers.go)
-file, which then includes each of the in-tree cloud providers. In the short term we would leave that file where it is
-and adjust it to point at the new homes under Staging. Each K8s/cloud-provider repo would have the following CCM
-wrapper file. (Essentially a modified copy of cmd/cloud-controller-manager/controller-manager.go) The wrapper for each
-cloud provider would import just their vendored cloud-provider implementation rather than the providers.go file.
-
-k8s/k8s: pkg/cloudprovider/providers/providers.go
-```
-package cloudprovider
-
-import (
-    // Prior to cloud providers having been moved to Staging
-    _ "k8s.io/cloudprovider-aws"
-    _ "k8s.io/cloudprovider-azure"
-    _ "k8s.io/cloudprovider-cloudstack"
-    _ "k8s.io/cloudprovider-gce"
-    _ "k8s.io/cloudprovider-openstack"
-    _ "k8s.io/cloudprovider-ovirt"
-    _ "k8s.io/cloudprovider-photon"
-    _ "k8s.io/cloudprovider-vsphere"
-)
-```
-
-k8s/cloud-provider-gcp: pkg/cloudprovider/providers/providers.go
-```
-package cloudprovider
-
-import (
-    // Cloud providers
-    _ "k8s.io/cloudprovider-gce"
-)
-```
-
-#### Build Targets
-
-We then get to the issue of creating a deployable artifact for each cloud provider. There are several artifacts beyond
-the CCM which are cloud provider specific. These artifacts include things like the deployment scripts themselves, the
-contents of the add-on manager and the sidecar needed to get CSI/cloud-specific persistent storage to work. Ideally
-these would then be packaged with a version of the Kubernetes core components (KAS, KCM, kubelet, …) which have not
-been statically linked against the cloud provider libraries. However in the short term the K8s/K8s deployable builds
-will still need to link these binaries against all of the in-tree plugins. We need a way for the K8s/cloud-provider
-repo to consume artifacts generated by the K8s/K8s repo. For official releases these artifacts should be published to a
-package repository. From there the cloud providers can pull the cloud agnostic Kubernetes artifact and decorate it
-appropriately for their cloud. The K8s/Cloud-Provider should be using some sort of manifest file to determine which
-official K8s/K8s artifact to pull. This allows for things like repeatability in builds, for hot fix builds and also a
-mechanism for rolling out changes which span K8s/K8s and K8s/Cloud-Provider. For local cloud releases we can use a
-build convention. We can expect the K8s/K8s and K8s/Cloud-Provider repos to be checked out under the same GOPATH.
-
-The K8s/Cloud-Provider can then have a local build target which expects the K8s/K8s artifacts to be in the
-appropriate place under the GOPATH (or at a location pointed to by a K8s home environment variable). This allows for
-both official builds and for developers to easily work on running changes which span K8s/K8s and K8s/Cloud-Provider.
-
-#### K8s/K8s Releases and K8s/CP Releases
-
-One of the goals for this project is to free cloud providers to generate releases as they want. This implies that
-K8s/CP releases are largely decoupled from K8s/K8s releases. The cadence of K8s/K8s releases should not change. We are
-assuming that K8s/CP releases will be on a similar or faster cadence as needed. It is desirable for the community for
-all cloud providers to support recent releases. As such it would be good for them to minimize the lag between K8s/K8s
-releases and K8s/CP releases. A K8s/CP cannot however release a Kubernetes version prior to that version having been
-released by K8s/K8s. (So for example the cloud provider cannot make a release with a 1.13 Kubernetes core, prior to
-K8s/K8s having released 1.13.) The ownership and responsibility of publishing K8s/K8s will not change with this project.
-The K8s/CP releases must necessarily move to being the responsibility of that cloud provider or a set of owners
-delegated by that cloud provider.
-
-#### Migrating to a CCM Build
-
-Migrating to a CCM build requires some thought. When dealing with a new cluster, things are fairly simple; we install
-everything at the same version and then the customer begins to customize. Migrating an existing cluster, which is a
-generic K8s/K8s cluster with the cloud-provider flag set, to a K8s/Cloud-Provider build is a bit trickier. The system
-needs to work during the migration where disparate pieces can be on different versions. While we specify that the exact
-upgrade steps are cloud provider specific, we do provide guidance that the control plane (master) should be upgraded
-first (master version >= kubelet version) and that the system should be able to handle up to a 2 revision difference
-between the control plane and the kubelets. In addition, with disruption budgets etc, there will not be a consistent
-version of the kubelets until the upgrade completes. So we need to ensure that our cloud provider/CCM builds work with
-existing clusters. This means that we need to account for things like older kubelets having the cloud provider enabled
-and using it for things like direct volume mount/unmount, IP discovery, … We can even expect that scaling events such
-as increases in the size of a replica set may cause us to deploy old kubelet images which directly use the cloud
-provider implementation in clusters which are controlled by a CCM build. We need to make sure we test these sorts of
-scenarios and ensure they work (get their IP, can mount cloud specific volume types, …).
-
-HA migrations present some special issues. HA masters are composed of multiple master nodes in which components like
-the controller managers use leader election to determine which is the currently operating instance. Before the
-migration begins there is no guarantee which instances will be leading or that all the leaders will be on the same
-master instance. So we will begin by taking down 1 master instance and upgrading it. At this point there will be some
-well-known conditions. The Kube-Controller-Manager in the lead will be one of the old build instances.
If the leader had
-been the instance running on that master it will lose its lease to one of the older instances. At the same time there
-will only be one Cloud-Controller-Manager, which is the one running on the master running the new code. This implies that we
-will have a few controllers running in both the new lead cloud-controller-manager and the old kube-controller-manager.
-I would suggest that as an initial part of an HA upgrade we disable these controllers in the kube-controller-managers.
-This can be accomplished by providing the `--controllers` flag. If the only controllers we wanted to disable were service
-and route then we would set the flag as follows:
-```
-kube-controller-manager --controllers="*,-service,-route"
-```
-This assumes that upgrading the first master instance can be accomplished inside of the SLO for these controllers being
-down.
-
-#### Flags, Service Accounts, etc
-
-The correct set of flags, service accounts etc, which will be needed for each cloud provider, is expected to be
-different and is, at some level, left as an exercise for each cloud provider. That having been said, there are a few
-common guidelines which are worth mentioning. It is expected that all the core components (kube-apiserver,
-kube-controller-manager, kubelet) should have their --cloud-provider flag set to “external”. Cloud-providers, who have
-their own volume type (eg. gce-pd) but do not yet have the CSI plugin (& side car) enabled in their CCM build, will
-need to set the --external-cloud-volume-plugin flag to their cloud provider implementation key (eg. gce). There are also a
-considerable number of roles and bindings which are needed to get the CCM working. For core components this is handled
-through a bootstrapping process inside of the kube-apiserver. However the CCM and cloud-provider pieces are not
-considered core components. The expectation then is that the appropriate objects will be created by deploying yaml
-files for them in the addons directory. The add-on manager (or cloud provider equivalent system) will then cause the
-objects to be created as the system comes up. The set of objects which we know we will need includes:
-- ServiceAccount
-  - cloud-controller-manager
-- User
-  - system:cloud-controller-manager
-- Role
-  - system::leader-locking-cloud-controller-manager
-- ClusterRole
-  - system:controller:cloud-node-controller
-  - system:cloud-controller-manager
-  - system:controller:pvl-controller
-- RoleBinding
-  - cloud-controller-manager to system::leader-locking-cloud-controller-manager
-- ClusterRoleBinding
-  - cloud-node-controller to system:controller:cloud-node-controller
-
-#### Deployment Scripts
-
-Currently there is a lot of common code in the K8s/K8s cluster directory. Each of the cloud providers today builds on
-top of that common code to do their deployment. The code running in K8s/Cloud-provider will necessarily be different.
-We have new files (addon) and executables (CCM and CSI) to be deployed. The existing executables need to be started
-with different flags (--cloud-provider=external). We also have additional executables which need to be started. This
-may then result in different resource requirements for the master and the kubelets. So it is clear that there will need
-to be at least some changes between the deployment scripts going from K8s/K8s to K8s/Cloud-provider. There is also
-likely to be some desire to streamline and simplify these scripts in K8s/Cloud-provider.
A lot of the generic handling
-and flexibility in K8s/K8s is not needed in K8s/Cloud-provider. It is also worth looking at
-[CCM Repo Requirements](#repository-requirements) for some suggestions on common K8s/Cloud-provider
-layout. These include an installer directory for custom installer code.
-
-#### kubeadm [WIP]
-
-kubeadm is a tool for creating clusters. For reference see [creating cluster with kubeadm](https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/).
-We need to determine how kubeadm and K8s/Cloud-providers should interact. More planning clearly needs to be done on
-how cloud providers will integrate with kubeadm.
-
-#### CI, TestGrid and Other Testing Issues [WIP]
-
-As soon as we have more than one repo involved in a product/build we get some interesting problems in running tests. A
-K8s/Cloud-provider deployment is the result of a combination of code from K8s/K8s and K8s/Cloud-provider. If something
-is broken it could be the result of a code change in either, and modifications by an engineer testing against one cloud
-provider could have unintended consequences in another.
-
-To address this issue, SIG-Testing has invested in a process by which participating cloud providers can report results
-from their own CI testing processes back to TestGrid, the Kubernetes community tool for tracking project health.
-Results become part of a central dashboard to display test results across multiple cloud providers.
-
-More thought should be put into managing multiple, independently evolving components.
-
-## Alternatives
-
-### Staging Alternatives
-
-#### Git Filter-Branch
-
-One possible alternative is to make use of a Git Filter Branch to extract a sub-directory into a virtual repo. The repo
-needs to be sync'd on an ongoing basis with K8s/K8s as we want one source of truth until K8s/K8s does not pull in the
-code. This has issues such as not giving K8s/K8s developers any indication of what dependencies the various
-K8s/Cloud-providers have. Without that information it becomes very easy to accidentally break various cloud providers
-any time you change dependencies in the K8s/K8s repo. With staging the dependency line is simple and [automatically
-enforced](https://github.com/kubernetes/kubernetes/blob/master/hack/verify-no-vendor-cycles.sh). Things in Staging are
-not allowed to depend on things outside of Staging. If you want to add such a dependency you need to add the dependent
-code to Staging. The act of doing this means that the code should get synced, which solves the problem. In addition the usage
-of a second, different library and repo movement mechanism will make things more difficult for everyone.
-
-Trying to share code through the git filter will not provide this protection. In addition it means that we now have
-two code sharing mechanisms, which increases complexity for the community and build tooling. As such I think it is better
-to continue to use the Staging mechanisms.
-
-### Build Location Alternatives
-
-#### Build K8s/K8s from within K8s/Cloud-provider
-
-The idea here is to avoid having to add a new build target to K8s/K8s. The K8s/Cloud-provider could have its own
-custom targets for building things like KAS without other cloud-providers' implementations linked in. It would also
-allow other customizations of the standard binaries to be created. While a powerful tool, this mechanism seems to
-encourage customization of these core binaries and as such should be discouraged.
Providing the appropriate generic
-binaries cuts down on the need to duplicate build logic for these core components and allows each provider to optimize its build:
-download prebuilt images at a given version and then just build the appropriate add-ons.
-
-#### Build K8s/Cloud-provider within K8s/K8s
-
-The idea here would be to treat the various K8s/Cloud-provider repos as libraries. You would specify a build flavor and
-we would pull in the relevant code based on what you specified. This would put tight restrictions on how the
-K8s/Cloud-provider repos would work as they would need to be consumed by the K8s/K8s build system. This seems less
-extensible and removes the nice loose coupling which the other systems have. It also makes it difficult for the cloud
-providers to control their release cadence.
-
-### Config Alternatives
-
-#### Use component config to determine where controllers run
-
-Currently KCM and CCM have their configuration passed in as command line flags. If their configuration were obtained
-from a configuration server (component config) then we could have a single source of truth about where each controller
-should be run. This solves both the HA migration issue and other concerns about making sure that a controller only runs
-in 1 controller manager. Rather than having the controllers as on or off, controllers would now be configured to state
-where they should run: KCM, CCM, Nowhere, … If the KCM could handle this as a run-time change nothing would need to
-change. Otherwise it becomes a slight variant of the proposed solution. This is probably the correct long term
-solution. However for the timeline we are currently working with we should use the proposed solution.
+KEPs have moved to https://git.k8s.io/enhancements/.
+<!--
+This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first.
+-->
\ No newline at end of file diff --git a/keps/sig-cloud-provider/0018-testgrid-conformance-e2e.md b/keps/sig-cloud-provider/0018-testgrid-conformance-e2e.md index 7ca64d01..cfd1f5fa 100644 --- a/keps/sig-cloud-provider/0018-testgrid-conformance-e2e.md +++ b/keps/sig-cloud-provider/0018-testgrid-conformance-e2e.md @@ -1,271 +1,4 @@ ---- -kep-number: 0018 -title: Reporting Conformance Test Results to Testgrid -authors: - - "@andrewsykim" -owning-sig: sig-cloud-provider -participating-sigs: - - sig-testing - - sig-release - - sig-aws - - sig-azure - - sig-gcp - - sig-ibmcloud - - sig-openstack - - sig-vmware -reviewers: - - TBD -approvers: - - TBD -editor: TBD -creation-date: 2018-06-06 -last-updated: 2018-11-16 -status: implementable - ---- - -# Reporting Conformance Test Results to Testgrid - -## Table of Contents - -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) -* [Proposal](#proposal) - * [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints) - * [Risks and Mitigations](#risks-and-mitigations) -* [Graduation Criteria](#graduation-criteria) -* [Implementation History](#implementation-history) - -## Summary - -This is a KEP outlining the motivation behind why cloud providers should periodically upload E2E conformance test results to [Testgrid](https://github.com/kubernetes/test-infra/tree/master/testgrid) and how a cloud provider can go about doing this. - -## Motivation - -The primary motivation behind collecting conformance test results from various cloud providers on a regular basis is to inform sig-release of any critical bugs. It's important to collect results from various cloud providers to increase coverage and inform other cloud providers of bugs they may be impacted by. - -### Goals - -* All SIGs / Subprojects owners representing a cloud provider is aware of the importance of frequently uploading conformance test results to testgrid -* There is a clear and detailed process in place for any cloud provider to upload conformance test results to Testgrid. - -### Non-Goals - -* Test coverage - increasing test coverage is outside the scope of this KEP -* CI/CD - what CI/CD tool is used to run E2E tests and upload results is outside the scope of this KEP. It is up to the cloud provider to decide where/how tests are run. -* Cluster Provisioning - how a cluster is provisioned is outside the scope of this KEP. - -## Proposal - -We would like to propose that every Kubernetes cloud provider reports conformance test results for every patch version of Kubernetes at the minimum. Running conformance tests against master and on pre-release versions are highly encouraged but will not be a requirement. - -### Implementation Details/Notes/Constraints - -Before continuing, it is highly recommended you read the following documentation provided by sig-testing: -* [Testgrid Configuration](https://github.com/kubernetes/test-infra/tree/master/testgrid#testgrid) -* [Display Conformance Tests with Testgrid](https://github.com/kubernetes/test-infra/blob/master/testgrid/conformance/README.md) - -#### How to Run E2E Conformance Tests - -This KEP outlines two ways of running conformance tests, the first using [Sonobuoy](https://github.com/heptio/sonobuoy) and the second using [kubetest](https://github.com/kubernetes/test-infra/tree/master/kubetest). Though Sonobuoy is easier to setup, it does not guarantee that you will run the latest set of conformance tests. 
Kubetest though requiring a bit more work to setup, can ensure that you are running the latest set of conformance tests. - -At this point we will assume that you have a running cluster and that `kubectl` is configured to point to that cluster with admin access. - -#### Sonobuoy - -You should use Sonobuoy if you would like to run the standard set of CNCF conformance tests. This may exclude any new tests added by the latest versions of Kubernetes. - -##### Installing Sonobuoy - -The following is mostly a [copy of the sonobuoy documentation](https://github.com/heptio/sonobuoy#download-and-run). - -First install sonobuoy. The following command adds it to the `$GOBIN` environment variable which is expected to be part of your `$PATH` environment variable. -``` -$ go get -u github.com/heptio/sonobuoy -``` - -##### Running Conformance Tests with Sonobuoy - -You can then start e2e tests by simply running: -``` -$ sonobuoy run -Running plugins: e2e, systemd-logs -INFO[0001] created object name=heptio-sonobuoy namespace= resource=namespaces -INFO[0001] created object name=sonobuoy-serviceaccount namespace=heptio-sonobuoy resource=serviceaccounts -INFO[0001] created object name=sonobuoy-serviceaccount-heptio-sonobuoy namespace= resource=clusterrolebindings -INFO[0001] created object name=sonobuoy-serviceaccount namespace= resource=clusterroles -INFO[0001] created object name=sonobuoy-config-cm namespace=heptio-sonobuoy resource=configmaps -INFO[0002] created object name=sonobuoy-plugins-cm namespace=heptio-sonobuoy resource=configmaps -INFO[0002] created object name=sonobuoy namespace=heptio-sonobuoy resource=pods -INFO[0002] created object name=sonobuoy-master namespace=heptio-sonobuoy resource=services -``` - -You can then check the status of your e2e tests by running -``` -$ sonobuoy status -PLUGIN STATUS COUNT -e2e running 1 -systemd_logs running 3 - -Sonobuoy is still running. Runs can take up to 60 minutes. -``` - -E2E tests can take up to an hour. Once your tests are done, you can download a snapshot of your results like so: -``` -$ sonobuoy retrieve . -``` - -Once you have the following snapshot, extract it's contents like so -``` -$ mkdir ./results; tar xzf *.tar.gz -C ./results -``` - -At this point you should have the log file and JUnit results from your tests: -``` -results/plugins/e2e/results/e2e.log -results/plugins/e2e/results/junit_01.xml -``` - -#### Kubetest - -You should use kubetest if you want to run the latest set of tests in upstream Kubernetes. This is highly recommended by sig-testing so that new tests can be accounted for in new releases. - -##### Installing Kubetest - -Install kubetest using the following command which adds it to your `$GOBIN` environment variable which is expected to be part of your `$PATH` environment variable. -``` -go get -u k8s.io/test-infra/kubetest -``` - -##### Running Conformance Test with kubetest - -Now you can run conformance test with the following: -``` -cd /path/to/k8s.io/kubernetes -export KUBERNETES_CONFORMANCE_TEST=y -kubetest \ - # conformance tests aren't supposed to be aware of providers - --provider=skeleton \ - # tell ginkgo to only run conformance tests - --test --test_args="--ginkgo.focus=\[Conformance\]" \ - # grab the most recent CI tarball of kubernetes 1.10, including the tests - --extract=ci/latest-1.10 \ - # directory to store junit results - --dump=$(pwd)/_artifacts | tee ./e2e.log -``` - -Note that `--extract=ci/latest-1.10` indicates that we want to use the binaries/tests on the latest version of 1.10. 
You can use `--extract=ci/latest` to run the latest set of conformance tests from master. - -Once the tests have finished (takes about an hour) you should have the log file and JUnit results from your tests: -``` -e2e.log -_artifacts/junit_01.xml -``` - -#### How to Upload Conformance Test Results to Testgrid - -##### Requesting a GCS Bucket - -Testgrid requires that you store results in a publicly readable GCS bucket. If for whatever reason you cannot set up a GCS bucket, please contact @BenTheElder or more generally the [gke-kubernetes-engprod](mailto:gke-kubernetes-engprod@google.com) team to arrange for a Google [GKE](https://cloud.google.com/kubernetes-engine/) EngProd provided / maintained bucket for hosting your results. - -##### Authenticating to your Testgrid Bucket - -Assuming that you have a publicly readable bucket provided by the GKE team, you should have been provided a service account JSON file which you can use with [gcloud](https://cloud.google.com/sdk/downloads) to authenticate with your GCS bucket. - -``` -$ gcloud auth activate-service-account --key-file /path/to/k8s-conformance-serivce-accout.json -Activated service account credentials for: [demo-bucket-upload@k8s-federated-conformance.iam.gserviceaccount.com] -``` - -##### Uploading results to Testgrid - -At this point you should be able to upload your testgrid results to your GCS bucket. You can do so by running a python script availabile [here](https://github.com/kubernetes/test-infra/tree/master/testgrid/conformance). For this example, we upload results for v1.10 into it's own GCS prefix. -``` -git clone https://github.com/kubernetes/test-infra -cd test-infra/testgrid/conformance -./upload_e2e.py --junit /path/to/junit_01.xml \ - --log /path/to/e2e.log \ - --bucket=gs://k8s-conformance-demo/cloud-provider-demo/e2e-conformance-release-v1.10 -Uploading entry to: gs://k8s-conformance-demo/cloud-provider-demo/e2e-conformance-release-v1.10/1528333637 -Run: ['gsutil', '-q', '-h', 'Content-Type:text/plain', 'cp', '-', 'gs://k8s-conformance-demo/cloud-provider-demo/e2e-conformance-release-v1.10/1528333637/started.json'] stdin={"timestamp": 1528333637} -Run: ['gsutil', '-q', '-h', 'Content-Type:text/plain', 'cp', '-', 'gs://k8s-conformance-demo/cloud-provider-demo/e2e-conformance-release-v1.10/1528333637/finished.json'] stdin={"timestamp": 1528337316, "result": "SUCCESS"} -Run: ['gsutil', '-q', '-h', 'Content-Type:text/plain', 'cp', '~/go/src/k8s.io/kubernetes/results/plugins/e2e/results/e2e.log', 'gs://k8s-conformance-demo/cloud-provider-demo/e2e-conformance-release-v1.10/1528333637/build-log.txt'] -Run: ['gsutil', '-q', '-h', 'Content-Type:text/plain', 'cp', '~/go/src/k8s.io/kubernetes/results/plugins/e2e/results/e2e.log', 'gs://k8s-conformance-demo/cloud-provider-demo/e2e-conformance-release-v1.10/1528333637/artifacts/e2e.log'] -Done. -``` - -##### Testgrid Configuration - -Next thing you want to do is configure testgrid to read results from your GCS bucket. There are two [configuration](https://github.com/kubernetes/test-infra/tree/master/testgrid#configuration) steps required. One for your [test group](https://github.com/kubernetes/test-infra/tree/master/testgrid#test-groups) and one for your [dashboard](https://github.com/kubernetes/test-infra/tree/master/testgrid#dashboards). - -To add a test group update [config.yaml](https://github.com/kubernetes/test-infra/blob/master/testgrid/config.yaml) with something like the following: -``` -test_groups: -... -... 
-- name: cloud-provider-demo-e2e-conformance-release-v1.10 - gcs_prefix: k8s-conformance-demo/cloud-provider-demo/e2e-conformance-release-v1.10 -``` - -To add a link to your results in the testgrid dashboard, update [config.yaml](https://github.com/kubernetes/test-infra/blob/master/testgrid/config.yaml) with something like the following: -``` -dashboards: -... -... -- name: conformance-demo-cloud-provider - dashboard_tab: - - name: "Demo Cloud Provider, v1.10" - description: "Runs conformance tests for cloud provider demo on release v1.10" - test_group_name: cloud-provider-demo-e2e-conformance-release-v1.10 -``` - -Once you've made these changes, open a PR against the test-infra repo adding the sig testing label (`/sig testing`) and cc'ing @kubernetes/sig-testing-pr-reviews. Once your PR merges you should be able to view your results on https://k8s-testgrid.appspot.com/ which should be ready to be consumed by the necessary stakeholders (sig-release, sig-testing, etc.). - -#### Lifecycle of Test Results - -You can configure the lifecycle of testgrid results by specifying fields like `days_of_results` on your test group configuration. More details are available in the [Testgrid Advanced Configuration](https://github.com/kubernetes/test-infra/tree/master/testgrid#advanced-configuration) docs. If for whatever reason you urgently need to delete testgrid results, you can contact someone from sig-testing. - -#### Examples - -Here are some more concrete examples of how other cloud providers are running conformance tests and uploading results to testgrid: -* OpenStack - * [OpenLab zuul job for running/uploading testgrid results](https://github.com/theopenlab/openlab-zuul-jobs/tree/master/playbooks/cloud-provider-openstack-acceptance-test-e2e-conformance) - * [OpenStack testgrid config](https://github.com/kubernetes/test-infra/pull/7670) - * [OpenStack conformance tests dashboard](https://github.com/kubernetes/test-infra/pull/8154) - - -### Risks and Mitigations - -#### Operational Overhead - -Operating a CI/CD system to run conformance tests on a regular basis may require extra work from every cloud provider. Though we anticipate that the benefits of running conformance tests will outweigh the operational overhead, in some cases they may not. - -Mitigation: TODO - -#### Misconfigured Tests - -There are various scenarios where cloud providers may mistakenly upload incorrect conformance test results. One example is uploading results for the wrong Kubernetes version. - -Mitigation: TODO - -#### Flaky Tests - -Tests can fail for various reasons in any cloud environment and may raise false negatives for the release team. - -Mitigation: TODO - - -## Graduation Criteria - -All providers are periodically uploading conformance test results using at least one of the methods outlined in this KEP. - - -[umbrella issues]: TODO - -## Implementation History - -- Jun 6th 2018: KEP is merged as a signal of acceptance. Cloud providers should now be looking to report their conformance test results to testgrid. -- Nov 19th 2018: KEP has been in the implementation stage for roughly 5 months with Alibaba Cloud, Baidu Cloud, DigitalOcean, GCE, OpenStack and vSphere reporting conformance test results to testgrid. - +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
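Taken together, the steps above (run the tests, wait for completion, retrieve the tarball, then push the junit file and log with `upload_e2e.py`) are easy to automate for periodic runs. The following is only an illustrative Go sketch of that loop, not official tooling: it shells out to the commands already shown in this KEP, the bucket is the demo bucket used above, the location of `upload_e2e.py` is assumed to be the working directory, and matching on the word `complete` in the `sonobuoy status` output is an assumption about its format.

```go
// conformance_loop.go - illustrative sketch only, not official tooling.
// Assumes sonobuoy, an authenticated gcloud/gsutil setup, and test-infra's
// upload_e2e.py are already installed and reachable from the working directory.
package main

import (
	"fmt"
	"log"
	"os/exec"
	"strings"
	"time"
)

// bucket is the demo bucket from this KEP; substitute your provider's bucket.
const bucket = "gs://k8s-conformance-demo/cloud-provider-demo/e2e-conformance-release-v1.10"

func run(name string, args ...string) string {
	out, err := exec.Command(name, args...).CombinedOutput()
	if err != nil {
		log.Fatalf("%s %v failed: %v\n%s", name, args, err, out)
	}
	return string(out)
}

func main() {
	// Start the standard conformance plugins.
	run("sonobuoy", "run")

	// Poll until the run finishes (runs can take up to 60 minutes).
	// NOTE: checking for "complete" is an assumption about the status output.
	for !strings.Contains(run("sonobuoy", "status"), "complete") {
		time.Sleep(2 * time.Minute)
	}

	// Download and unpack the results snapshot.
	run("sonobuoy", "retrieve", ".")
	run("bash", "-c", "mkdir -p ./results && tar xzf *.tar.gz -C ./results")

	// Upload the junit results and build log to the Testgrid bucket.
	fmt.Println(run("./upload_e2e.py",
		"--junit", "results/plugins/e2e/results/junit_01.xml",
		"--log", "results/plugins/e2e/results/e2e.log",
		"--bucket="+bucket))
}
```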
\ No newline at end of file diff --git a/keps/sig-cloud-provider/0019-cloud-provider-documentation.md b/keps/sig-cloud-provider/0019-cloud-provider-documentation.md index 0e12c4fe..cfd1f5fa 100644 --- a/keps/sig-cloud-provider/0019-cloud-provider-documentation.md +++ b/keps/sig-cloud-provider/0019-cloud-provider-documentation.md @@ -1,182 +1,4 @@ ---- -kep-number: 0019 -title: Cloud Provider Documentation -authors: - - "@d-nishi" - - "@hogepodge" -owning-sig: sig-cloud-provider -participating-sigs: - - sig-docs - - sig-cluster-lifecycle - - sig-aws - - sig-azure - - sig-gcp - - sig-openstack - - sig-vmware -reviewers: - - "@andrewsykim" - - "@calebamiles" - - "@hogepodge" - - "@jagosan" -approvers: - - "@andrewsykim" - - "@hogepodge" - - "@jagosan" -editor: TBD -creation-date: 2018-07-31 -last-updated: 2018-11-16 -status: implementable ---- -## Transfer the responsibility of maintaining valid documentation for Cloud Provider Code to the Cloud Provider - -### Table of Contents - -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) -* [Proposal](#proposal) -* [User Stories [optional]](#user-stories) - * [Story 1](#story-1) - * [Story 2](#story-2) -* [Implementation Details/Notes/Constraints [optional]](#implementation-detailsnotesconstraints) -* [Risks and Mitigations](#risks-and-mitigations) -* [Graduation Criteria](#graduation-criteria) -* [Implementation History](#implementation-history) -* [Alternatives [optional]](#alternatives) - -### Summary -This KEP describes the documentation requirements for both in-tree and out-of-tree cloud controller managers. -These requirements are meant to capture critical usage documentation that is common between providers, set requirements for individual documentation, and create consistent standards across provider documentation. The scope of this document is limited to in-tree code that interfaces with kube-controller-manager, and out-of-tree code that interfaces with cloud-controller-manager - -### Motivation -Currently documentation for cloud providers for both in-tree and out-of-tree managers is limited in both scope, consistency, and quality. This KEP describes requirements, to be reached in the 1.12 release cycle, to create and maintain consistent documentation across all cloud provider manager code. By establishing these standards, SIG-Cloud-Provider will benefit the user-community by offering a single discoverable source of reliable documentation while relieving the SIG-Docs team from the burden of maintaining out-dated duplicated documentation. - -#### Goals -* Produce a common document that describes how to configure any in-tree cloud provider that can be reused by tools such as kubeadm, to create minimum viable Kubernetes clusters. - * Create documentation requirements on how to configure in-tree cloud providers. - * Produce documentation for every in-tree cloud provider. -* Provide a common document that describes how to configure any out-of-tree cloud-controller-manager by provider. - * Create documentation requirements on how to configure out-of-tree cloud providers. - * Produce documentation for every out-of-tree cloud provider. -* Maintain developer documentation for anyone wanting to build a new cloud-controller-manager. -* Generate confidence in SIG-docs to confidently link to SIG-Cloud-Provider documentation for all future releases. 
- -#### Non-Goals -This KEP is limited to documenting requirements for control plane components for in-tree implementation and cloud-controller-manager for out-of-tree implementation. It is not currently meant to document provider-specific drivers or code (example: Identity & access management: Keystone for Openstack, IAM for AWS etc). -SIG-Docs is not expected to produce or maintain any of this documentation. - -### Proposal - -#### In-Tree Documetation -Produce common documentation that describes how to configure any in-tree cloud provider that can be reused by tools such as kubeadm, to create minimum viable Kubernetes clusters. - -Kubernetes documentation lists details of current cloud-provider [here](https://kubernetes.io/docs/concepts/cluster-administration/cloud-providers/). Additional documentation [(1),](https://kubernetes.io/docs/concepts/services-networking/service/) [(2)](https://kubernetes.io/docs/tasks/administer-cluster/developing-cloud-controller-manager/) that link to cloud-provider code currently remains detached and poorly maintained. - -#### Requirement 1: -Provide validated manifests for kube-controller-manager, kubelet and kube-apiserver to enable a Kubernetes administrator to run cloud-provider=<providername> in-tree as is feasible today. Example manifests should be in the following directories: - -* kubernetes/kubernetes/pkg/cloudprovider/myprovider/docs/example-manifests/ - * [kube-apiserver.manifest](https://gist.github.com/d-nishi/1109fec153930e8de04a1bf160cacffb) - * [kube-controller-manager.manifest](https://gist.github.com/d-nishi/a41691cdf50239986d1e725af4d20033) - * [kubelet.manifest](https://gist.github.com/d-nishi/289cb82367580eb0cb129c9f967d903d) with [kubelet flags](https://gist.github.com/d-nishi/d7f9a1b59c0441d476646dc7cce7e811) - -The examples above are from a cluster running on AWS. - -#### Requirement 2: -Provide validated/tested descriptions with examples of controller features (annotations or labels) that are cloud-provider dependent that can be reused by any Kubernetes administrator to run `cloud-provider-<providername>` in-tree with `kube-controller-manager` as is described in the code <cloudprovider.go> Example: aws.go -These manifests should be regularly tested and updated post testing in the relevant provider location: - - -* kubernetes/kubernetes/pkg/cloudprovider/myprovider/docs/controllers/ - * node/ - * annotations.md - outlines what annotations the controller sets or reads from a node resource - * labels.md - outlines what labels the controller sets or read from a node resource - * README.md - outlines the purpose of this controller - * service/ - * annotations.md - outlines what annotations the controller sets or reads when managing a load balancer - * labels.md - outlines what labels the controller sets or read when managing a load balancer - * README.md - outlines the purpose of this controller - * persistentvolumelabel/ - * annotations.md - outlines what annotations the controller sets or read when managing persistent volumes - * labels.md - outlines what labels the controller sets when managing persistent volumes (previously known as PersistentVolumeLabel admission controller) - * README.md - outlines the purpose of this controller - * ... - -#### Out-of-Tree Documetation -Provide a common document that describes how to configure a Kubernetes cluster on any out-of-tree cloud provider. 
- -#### Requirement 1: - -Provide validated manifests for kube-controller-manager, kubelet, kube-apiserver and cloud-controller-manager to enable a Kubernetes administrator to run cloud-provider=<providername> out-of-tree as is feasible today. Example manifests should be in the following directories: - -* /path/to/out-of-tree-provider/docs/example-manifests/ - * [apiserver manifest](https://gist.github.com/andrewsykim/a7938e185d45e1c0ef760c375005fdef) - * [kube-controller-manager manifest](https://gist.github.com/andrewsykim/56ee2da95ade8386d3123e982d72aca9) - * [kubelet manifest](https://gist.github.com/andrewsykim/ac954b1657eb0e6a2e95af516594e2bd) - * [cloud controller manager DaemonSet](https://gist.github.com/andrewsykim/26e22e36471c1774e3626a70d2b7465f) - -The following examples are from provisioning a cluster on DigitalOcean using kops. - -#### Requirement 2: -List out the latest annotations or tags that are cloud-provider dependent and will be used by the Kubernetes administrator to run `cloud-provider-<providername>` out-of-tree with `cloud-controller-manager`. These manifests should be regularly tested and updated in the relevant provider location: - -* /path/to/out-of-tree-provider/docs/controllers/ - * node/ - * annotations.md - outlines what annotations the controller sets or reads from a node resource - * labels.md - outlines what labels the controller sets or read from a node resource - * README.md - outlines the purpose of this controller - * service/ - * annotations.md - outlines what annotations the controller sets or reads when managing a load balancer - * labels.md - outlines what labels the controller sets or read when managing a load balancer - * README.md - outlines the purpose of this controller - * persistentvolumelabel/ - * annotations.md - outlines what annotations the controller sets or read when managing persistent volumes - * labels.md - outlines what labels the controller sets when managing persistent volumes (previously known as PersistentVolumeLabel admission controller) - * README.md - outlines the purpose of this controller - * Other provider-specific-Controller e.g. Route controller for GCP - -### User Stories [optional] - -#### Story 1 -Sally is a devops engineer wants to run Kubernetes clouds across her on-premise environment and public cloud sites. She wants to use ansible or terraform to bring up Kubernetes v1.11. She references the cloud-provider documentation to understand how to enable in-tree provider code, and has a consistent set of documentation to help her write automation to target each individual cloud. - -#### Story 2 -Sam wants to add advanced features to external cloud provider. By consulting the external cloud provider documents, they are able to set up a development and test environment. Where previously documentation was inconsistent and spread across multiple sources, there is a single document that allows them to immediately launch provider code within their target cloud. - -### Implementation Details/Notes/Constraints [optional] -The requirements set forward need to accomplish several things: -* Identify and abstract common documentation across all providers. -* Create a consistent format that makes it easy to switch between providers. -* Allow for provider-specific documentation, quirks, and features. - -### Risks and Mitigations -This proposal relies heavily on individual cloud-provider developers to provide expertise in document generation and maintenance. 
Documentation can easily drift from implementation, making for a negative user experience. -To mitigate this, SIG-Cloud-Provider membership will work with developers to keep their documentation up to date. This will include a review of documents along release-cycle boundaries, and adherence to release-cycle deadlines. -SIG-Cloud-Provider will work with SIG-Docs to establish quality standards and with SIG-Node and SIG Cluster Lifecycle to keep common technical documentation up-to-date. - -### Graduation Criteria -This KEP represents an ongoing effort for the SIG-Cloud-Provider team. -* Immediate success is measured by the delivery of all goals outlined in the Goals (1) section. -* Long Term success is measured by the delivery of goals outlined in the Goals (2) section. -* Long Term success is also measured by the regular upkeep of all goals in Goals (1) and (2) sections. - -### Implementation History -Major milestones in the life cycle of a KEP should be tracked in Implementation History. Major milestones might include: -* the Summary and Motivation sections being merged signaling SIG acceptance -* the Proposal section being merged signaling agreement on a proposed design -* the date implementation started - July 25 2018 -* the first Kubernetes release where an initial version of the KEP was available - v1.12 -* the version of Kubernetes where the KEP graduated to general availability - v1.14 -* the date when the KEP was retired or superseded - NA - -### Alternatives [optional] -The Alternatives section is used to highlight and record other possible approaches to delivering the value proposed by a KEP. -* SIG docs could tag cloudprovider documentation as a blocking item for Kubernetes releases -* SIG docs could also assign SIG-<provider> leads to unblock cloudprovider documentation in the planning phase for the release. - -## Implementation History - -- July 31st 2018: KEP is merged as a signal of acceptance. Cloud providers should now be looking to add documentation for their provider according to this KEP. -- Nov 19th 2018: KEP has been in implementation stage for roughly 4 months with Alibaba Cloud, Azure, DigitalOcean, OpenStack and vSphere having written documentation for their providers according to this KEP. - +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
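The two directory layouts listed in the requirements above are mechanical enough to check automatically. As a rough illustration only (this KEP does not prescribe any tooling, and the repository path passed on the command line is hypothetical), the sketch below walks a provider repository and reports which of the listed documentation files are missing.

```go
// docscheck.go - illustrative sketch; not part of this KEP's requirements.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: docscheck /path/to/provider/repo")
		os.Exit(1)
	}
	root := os.Args[1]

	// The example-manifests directory plus the per-controller docs listed above.
	required := []string{"docs/example-manifests"}
	for _, controller := range []string{"node", "service", "persistentvolumelabel"} {
		for _, doc := range []string{"annotations.md", "labels.md", "README.md"} {
			required = append(required, filepath.Join("docs/controllers", controller, doc))
		}
	}

	missing := 0
	for _, rel := range required {
		if _, err := os.Stat(filepath.Join(root, rel)); err != nil {
			fmt.Println("missing:", rel)
			missing++
		}
	}
	if missing == 0 {
		fmt.Println("all documentation paths from this KEP are present")
	}
}
```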
\ No newline at end of file diff --git a/keps/sig-cloud-provider/providers/0004-cloud-provider-template.md b/keps/sig-cloud-provider/providers/0004-cloud-provider-template.md index e5dde39a..cfd1f5fa 100644 --- a/keps/sig-cloud-provider/providers/0004-cloud-provider-template.md +++ b/keps/sig-cloud-provider/providers/0004-cloud-provider-template.md @@ -1,105 +1,4 @@ ---- -kep-number: 4 -title: Cloud Provider Template -authors: - - "@janedoe" -owning-sig: sig-cloud-provider -participating-sigs: - - sig-aaa - - sig-bbb -reviewers: - - TBD - - "@alicedoe" -approvers: - - "@andrewsykim" - - "@hogepodge" - - "@jagosan" -editor: TBD -creation-date: yyyy-mm-dd -last-updated: yyyy-mm-dd -status: provisional -see-also: - - KEP-1 - - KEP-2 -replaces: - - KEP-3 -superseded-by: - - KEP-100 ---- - -# Cloud Provider FooBar - -This is a KEP template, outlining how to propose a new cloud provider into the Kubernetes ecosystem. - -## Table of Contents - -* [Table of Contents](#table-of-contents) -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) -* [Requirements](#requirements) -* [Proposal](#proposal) - -## Summary - -This is where you add a summary of your cloud provider and other additional information about your cloud provider that others may find useful. - -## Motivation - -### Goals - -This is where you can specify any goals you may have for your cloud provider. - -### Non-Goals - -This is where you can specify any work that you think is outside the scope of your cloud provider. - -## Prerequisites - -This is where you outline all the prerequisites for new providers that have been met. - -### Repository Requirements - -For [repository requirements](https://github.com/kubernetes/community/blob/master/keps/sig-cloud-provider/0002-cloud-controller-manager.md#repository-requirements) you are expected to have a repo (belonging to any organization, ideally owned by your cloud provider) that has a working implementation of the [Kubernetes Cloud Controller Manager](https://kubernetes.io/docs/tasks/administer-cluster/running-cloud-controller/). Note that the list of requirements are subject to change. - -### User Experience Reports - -There must be a reasonable amount of user feedback about running Kubernetes for this cloud provider. You may want to link to sources that indicate this such as github issues, product data, customer tesitimonials, etc. - -### Testgrid Integration - -Your cloud provider is reporting conformance test results to TestGrid as per the [Reporting Conformance Test Results to Testgrid KEP](https://github.com/kubernetes/community/blob/master/keps/sig-cloud-provider/0018-testgrid-conformance-e2e.md). - -### CNCF Certified Kubernetes - -Your cloud provider is accepted as part of the [Certified Kubernetes Conformance Program](https://github.com/cncf/k8s-conformance). - -### Documentation - -There is documentation on running Kubernetes on your cloud provider as per the [cloud provider documentation KEP](https://github.com/kubernetes/community/blob/master/keps/sig-cloud-provider/0019-cloud-provider-documentation.md). - -### Technical Leads are members of the Kubernetes Organization - -All proposed technical leads for this provider must be members of the Kubernetes organization. Membership is used as a signal for technical ability, commitment to the project, and compliance to the [CNCF Code of Conduct](https://github.com/cncf/foundation/blob/master/code-of-conduct.md) which we believe are important traits for subproject technical leads. 
Learn more about Kubernetes community membership [here](https://github.com/kubernetes/community/blob/master/community-membership.md). - -## Proposal - -This is where you can talk about what resources from the Kubernetes community you would like, such as a repository in the Kubernetes organization to host your provider code. - -### Subproject Leads - -This is where you indicate the leads for the subproject. Make sure you include their github handles. See the [SIG Charter](https://github.com/kubernetes/community/blob/master/sig-cloud-provider/CHARTER.md#subprojectprovider-owners) for more details on expectations from subproject leads. - -### Repositories - -This is where you propose a repository within the Kubernetes org; it's important that you specify the name of the repository you would like. Cloud providers typically have at least 1 repository named `kubernetes/cloud-provider-foobar`. It's also important to indicate who the initial owners of the repositories will be. These owners will be added to the initial OWNERS file. The owners of the subproject must be owners of the repositories but you can add more owners in the repo if you'd like. If you are requesting any repositories, be sure to add them to the SIG Cloud Provider [subproject list](https://github.com/kubernetes/community/tree/master/sig-cloud-provider#subprojects). - -### Meetings - -This is where you specify when you will have meetings to discuss the development of your cloud provider. SIG Cloud Provider will provide zoom/youtube channels as required. Note that these meetings are in addition to the biweekly SIG Cloud Provider meetings that subproject leads are strongly encouraged to attend. - - -### Others - -Feel free to add anything else you may need. +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-cloud-provider/providers/0020-cloud-provider-alibaba-cloud.md b/keps/sig-cloud-provider/providers/0020-cloud-provider-alibaba-cloud.md index 81453922..cfd1f5fa 100644 --- a/keps/sig-cloud-provider/providers/0020-cloud-provider-alibaba-cloud.md +++ b/keps/sig-cloud-provider/providers/0020-cloud-provider-alibaba-cloud.md @@ -1,117 +1,4 @@ ---- -kep-number: 20 -title: Cloud Provider for Alibaba Cloud -authors: - - "@aoxn" -owning-sig: sig-cloud-provider -reviewers: - - "@andrewsykim" -approvers: - - "@andrewsykim" - - "@hogepodge" - - "@jagosan" -editor: TBD -creation-date: 2018-06-20 -last-updated: 2018-06-20 -status: provisional - ---- - -# Cloud Provider for Alibaba Cloud - -This is a KEP for adding ```Cloud Provider for Alibaba Cloud``` into the Kubernetes ecosystem. - -## Table of Contents - -* [Table of Contents](#table-of-contents) -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) -* [Requirements](#requirements) -* [Proposal](#proposal) - -## Summary - -Alibaba Cloud provides the Cloud Provider interface implementation as an out-of-tree cloud-controller-manager. It allows Kubernetes clusters to leverage the infrastructure services of Alibaba Cloud. -The original open source project is [https://github.com/AliyunContainerService/alicloud-controller-manager](https://github.com/AliyunContainerService/alicloud-controller-manager). - -## Motivation - -### Goals - -Cloud Provider of Alibaba Cloud implements interoperability between Kubernetes clusters and Alibaba Cloud. In this project, we will focus on: -- Providing reliable, secure and optimized integration with Alibaba Cloud for Kubernetes - -- Helping to improve the decoupling of cloud provider specifics from the Kubernetes implementation. - - - -### Non-Goals - -The networking and storage support of Alibaba Cloud for Kubernetes will be provided by other projects. - -E.g. - -* [Flannel network for Alibaba Cloud VPC](https://github.com/coreos/flannel) -* [FlexVolume for Alibaba Cloud](https://github.com/AliyunContainerService/flexvolume) - - -## Prerequisites - -1. The VPC network is supported in this project. Support for the classic network or non-ECS environments is out of scope. -2. When using the instance profile for authentication, an instance role must first be attached to the ECS instance. -3. Kubernetes version v1.7 or higher - -### Repository Requirements - -[Alibaba Cloud Controller Manager](https://github.com/AliyunContainerService/alicloud-controller-manager) is a working implementation of the [Kubernetes Cloud Controller Manager](https://kubernetes.io/docs/tasks/administer-cluster/running-cloud-controller/). - -The repository requirements are mainly a copy of the [cloudprovider KEP](https://github.com/kubernetes/community/blob/master/keps/sig-cloud-provider/0002-cloud-controller-manager.md#repository-requirements). See the link for more detail. - -### User Experience Reports -As a CNCF Platinum member, Alibaba Cloud is dedicated to providing users with highly secure, stable and efficient cloud services.
-Usage of Aliyun container services can be seen from the GitHub issues in the existing alicloud controller manager repo: https://github.com/AliyunContainerService/alicloud-controller-manager/issues - -### Testgrid Integration - Alibaba cloud provider is reporting conformance test results to TestGrid as per the [Reporting Conformance Test Results to Testgrid KEP](https://github.com/kubernetes/community/blob/master/keps/sig-cloud-provider/0018-testgrid-conformance-e2e.md). - See [report](https://k8s-testgrid.appspot.com/conformance-alibaba-cloud-provider#Alibaba%20Cloud%20Provider,%20v1.10) for more details. - -### CNCF Certified Kubernetes - Alibaba cloud provider is accepted as part of the [Certified Kubernetes Conformance Program](https://github.com/cncf/k8s-conformance). - For v1.11, see [https://github.com/cncf/k8s-conformance/tree/master/v1.11/alicloud](https://github.com/cncf/k8s-conformance/tree/master/v1.11/alicloud) - For v1.10, see [https://github.com/cncf/k8s-conformance/tree/master/v1.10/alicloud](https://github.com/cncf/k8s-conformance/tree/master/v1.10/alicloud) - For v1.9, see [https://github.com/cncf/k8s-conformance/tree/master/v1.9/alicloud](https://github.com/cncf/k8s-conformance/tree/master/v1.9/alicloud) - For v1.8, see [https://github.com/cncf/k8s-conformance/tree/master/v1.8/alicloud](https://github.com/cncf/k8s-conformance/tree/master/v1.8/alicloud) - -### Documentation - - The Alibaba cloud provider provides users with documentation on building, deploying and using the CCM. Please refer to [https://github.com/AliyunContainerService/alicloud-controller-manager/tree/master/docs](https://github.com/AliyunContainerService/alicloud-controller-manager/tree/master/docs) for more details. - -### Technical Leads are members of the Kubernetes Organization - -The Leads run operations and processes governing this subproject. - -- @cheyang Special Tech Leader, Alibaba Cloud. Kubernetes Member - -## Proposal - -Here we propose a repository in the Kubernetes organization to host our cloud provider implementation. Cloud Provider of Alibaba Cloud would be a subproject under the Kubernetes community. - -### Repositories - -Cloud Provider of Alibaba Cloud will need a repository under the Kubernetes org named ```kubernetes/cloud-provider-alibaba-cloud``` to host any cloud-specific code. -The initial owners will be indicated in the initial OWNER files. - -Additionally, SIG Cloud Provider takes ownership of the repo, but Alibaba Cloud should have full autonomy to operate this subproject. - -### Meetings - -Cloud provider meetings are expected to be held biweekly. SIG Cloud Provider will provide zoom/youtube channels as required. We will have our first meeting after the repo has been settled. - -Recommended Meeting Time: Wednesdays at 20:00 PT (Pacific Time) (biweekly). [Convert to your timezone](http://www.thetimezoneconverter.com/?t=20:00&tz=PT%20%28Pacific%20Time%29). -- Meeting notes and Agenda. -- Meeting recordings. - - -### Others +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-cloud-provider/providers/0021-cloud-provider-digitalocean.md b/keps/sig-cloud-provider/providers/0021-cloud-provider-digitalocean.md index c9254fee..cfd1f5fa 100644 --- a/keps/sig-cloud-provider/providers/0021-cloud-provider-digitalocean.md +++ b/keps/sig-cloud-provider/providers/0021-cloud-provider-digitalocean.md @@ -1,91 +1,4 @@ ---- -kep-number: 21 -title: Cloud Provider DigitalOcean -authors: - - "@andrewsykim" -owning-sig: sig-cloud-provider -reviewers: - - "@hogepodge" - - "@jagosan" -approvers: - - "@andrewsykim" - - "@hogepodge" - - "@jagosan" -editor: TBD -creation-date: 2018-07-23 -last-updated: 2018-07-23 -status: provisional - ---- - -# Cloud Provider DigitalOcean - -## Table of Contents - -* [Table of Contents](#table-of-contents) -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) -* [Requirements](#requirements) -* [Proposal](#proposal) - -## Summary - -DigitalOcean is a cloud computing platform built for developers and businesses who want a simple way to deploy & manage their infrastructure. This is a KEP proposing DigitalOcean as a cloud provider within the Kubernetes ecosystem. - -## Motivation - -### Goals - -* Supporting DigitalOcean within the Kubernetes ecosystem. This involves: - * providing an open community to promote discussions for Kubernetes on DigitalOcean - * providing an open environment for developing Kubernetes on DigitalOcean - -### Non-Goals - -* Using the Kubernetes ecosystem/community to onboard more users onto our platform. - -## Prerequisites - -### Repository Requirements - -The existing repository hosting the [DigitalOcean cloud controller manager](https://github.com/digitalocean/digitalocean-cloud-controller-manager) satisfies requirements as outlined in KEP 0002. - -### User Experience Reports - -DigitalOcean recently announced a [Kubernetes offering](https://www.digitalocean.com/products/kubernetes/). Many users have already signed up for early access. DigitalOcean is also a gold member of the CNCF. - -### Testgrid Integration - -TODO - -### CNCF Certified Kubernetes - -TODO - -### Documentation - -TODO - -### Technical Leads are members of the Kubernetes Organization - -TODO - -## Proposal - -### Subproject Leads - -Initially there will be one subproject lead. In the future the goal is to have 3 subproject leads. - -* Andrew Sy Kim (@andrewsykim) - - -### Repositories - -Please create a repository `kubernetes/cloud-provider-digitalocean`. - -### Meetings - -TBD. - +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-cloud-provider/providers/0022-cloud-provider-baiducloud.md b/keps/sig-cloud-provider/providers/0022-cloud-provider-baiducloud.md index 2ce5f974..cfd1f5fa 100644 --- a/keps/sig-cloud-provider/providers/0022-cloud-provider-baiducloud.md +++ b/keps/sig-cloud-provider/providers/0022-cloud-provider-baiducloud.md @@ -1,104 +1,4 @@ ---- -kep-number: 22 -title: Cloud Provider BaiduCloud -authors: - - "@tizhou86" -owning-sig: sig-cloud-provider -reviewers: - - "@andrewsykim" -approvers: - - "@andrewsykim" - - "@hogepodge" - - "@jagosan" -editor: TBD -creation-date: 2018-07-23 -last-updated: 2018-07-23 -status: provisional - ---- -# Cloud Provider BaiduCloud - -## Table of Contents - -* [Table of Contents](#table-of-contents) -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) -* [Requirements](#requirements) -* [Proposal](#proposal) - -## Summary - -Baidu is a gold member of the CNCF, and we have a large team working on Kubernetes and related projects such as complex scheduling, heterogeneous computing, and auto-scaling. We build a cloud platform, leveraging Kubernetes, to support Baidu's emerging businesses, including autonomous driving, deep learning, and blockchain. We also provide a public container service named Cloud Container Engine (CCE). - -## Motivation - -### Goals - -- Building, deploying, maintaining, supporting, and using Kubernetes on Baidu Cloud Container Engine (CCE) and Baidu Private Cloud (BPC). Both projects are built on Kubernetes and related CNCF projects. - -- Designing, discussing, and maintaining the cloud-provider-baidu repository under the Kubernetes project on GitHub. - -### Non-Goals - -- Identify domain knowledge and work that can be contributed back to Kubernetes and related CNCF projects. - -- Mentor CCE and BPC developers to contribute to CNCF projects. - -- The SIG will focus on Kubernetes and CNCF-related projects; discussion of development issues for CCE and BPC will not be included in the SIG. - -## Prerequisites - -### Repository Requirements - -The repository URL which meets all the requirements is: https://github.com/baidu/cloud-provider-baiducloud - -### User Experience Reports - - -CCE-ticket-1: User wants to get the Kubernetes cluster config file by using the account's aksk. - - -CCE-ticket-2: User wants to modify the image repository's username. - - -CCE-ticket-3: User wants to have multi-tenant ability in a large shared CCE cluster. - -### Testgrid Integration - -TODO - -### CNCF Certified Kubernetes - -TODO - -### Documentation - -TODO - -### Technical Leads are members of the Kubernetes Organization - -TODO - -## Proposal - -### Subproject Leads - -The subproject will have 3 leads at any given time. For now I will be the subproject lead and the initial point of contact as we work on creating the subproject. My GitHub account is: tizhou86. - -### Repositories - -The repository we propose at this moment is kubernetes/cloud-provider-baiducloud; I'll be the initial point of contact. - -### Meetings - -We plan to have a biweekly online meeting at https://zoom.us/j/5134183949 every other Wednesday at 6pm PST. - - -### Others - -NA at this moment. - +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-cluster-lifecycle/0003-cluster-api.md b/keps/sig-cluster-lifecycle/0003-cluster-api.md index 66dc0edf..cfd1f5fa 100644 --- a/keps/sig-cluster-lifecycle/0003-cluster-api.md +++ b/keps/sig-cluster-lifecycle/0003-cluster-api.md @@ -1,231 +1,4 @@ ---- -kep-number: 3 -title: Kubernetes Cluster Management API -status: provisional -authors: - - "@roberthbailey" - - "@pipejakob" -owning-sig: sig-cluster-lifecycle -reviewers: - - "@thockin" -approvers: - - "@roberthbailey" -editor: - - "@roberthbailey" -creation-date: 2018-01-19 -last-updated: 2018-01-22 ---- - -# Kubernetes Cluster Management API - -## Table of Contents - -* [Kubernetes Cluster Management API](#kubernetes-cluster-management-api) - * [Metadata](#metadata) - * [Table of Contents](#table-of-contents) - * [Summary](#summary) - * [Motivation](#motivation) - * [Goals](#goals) - * [Non\-goals](#non-goals) - * [Challenges and Open Questions](#challenges-and-open-questions) - * [Proposal](#proposal) - * [Driving Use Cases](#driving-use-cases) - * [Cluster\-level API](#cluster-level-api) - * [Machine API](#machine-api) - * [Capabilities](#capabilities) - * [Overview](#overview) - * [In\-place vs\. Replace](#in-place-vs-replace) - * [Omitted Capabilities](#omitted-capabilities) - * [Conditions](#conditions) - * [Types](#types) - * [Graduation Criteria](#graduation-criteria) - * [Implementation History](#implementation-history) - * [Drawbacks](#drawbacks) - * [Alternatives](#alternatives) - -## Summary - -We are building a set of Kubernetes cluster management APIs to enable common cluster lifecycle operations (install, upgrade, repair, delete) across disparate environments. -We represent nodes and other infrastructure in Kubernetes-style APIs to enable higher level controllers to update the desired state of the cluster (e.g. the autoscaling controller requesting additional machines) and reconcile the world with that state (e.g. communicating with cloud providers to create or delete virtual machines). -With the full state of the cluster represented as API objects, Kubernetes installers can use them as a common configuration language, and more sophisticated tooling can be built in an environment-agnostic way. - -## Motivation - -Kubernetes has a common set of APIs (see the [Kubernetes API Conventions](/contributors/devel/api-conventions.md)) to orchestrate containers regardless of deployment mechanism or cloud provider. -Kubernetes also has APIs for handling some infrastructure, like load-balancers, ingress rules, or persistent volumes, but not for creating new machines. -As a result, the deployment mechanisms that manage Kubernetes clusters each have unique APIs and implementations for how to handle lifecycle events like cluster creation or deletion, master upgrades, and node upgrades. -Additionally, the cluster-autoscaler is responsible not only for determining when the cluster should be scaled, but also responsible for adding capacity to the cluster by interacting directly with the cloud provider to perform the scaling. -When another component needs to create or destroy virtual machines, like the node auto provisioner, it would similarly need to reimplement the logic for interacting with the supported cloud providers (or reuse the same code to prevent duplication). - -### Goals - -* The cluster management APIs should be declarative, Kubernetes-style APIs that follow our existing [API Conventions](/contributors/devel/api-conventions.md). 
-* To the extent possible, we should separate state that is environment-specific from environment-agnostic. - * However, we still want the design to be able to utilize environment-specific functionality, or else it likely won’t gain traction in favor of other tooling that is more powerful. - -### Non-goals - -* To add these cluster management APIs to Kubernetes core. -* To support infrastructure that is irrelevant to Kubernetes clusters. - * We are not aiming to create terraform-like capabilities of creating any arbitrary cloud resources, nor are we interested in supporting infrastructure used solely by applications deployed on Kubernetes. The goal is to support the infrastructure necessary for the cluster itself. -* To convince every Kubernetes lifecycle product ([kops](https://github.com/kubernetes/kops), [kubespray](https://github.com/kubernetes-incubator/kubespray), [Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine/), [Azure Container Service](https://azure.microsoft.com/en-us/services/container-service/), [Elastic Container Service for Kubernetes](https://aws.amazon.com/eks/), etc.) to support these APIs. - * There is value in having consistency between installers and broad support for the cluster management APIs and in having common infrastructure reconcilers used post-installation, but 100% adoption isn't an immediate goal. -* To model state that is purely internal to a deployer. - * Many Kubernetes deployment tools have intermediate representations of resources and other internal state to keep track of. They should continue to use their existing methods to track internal state, rather than attempting to model it in these APIs. - -### Challenges and Open Questions - -* Should a single Kubernetes cluster only house definitions for itself? - * If so, that removes the ability to have a single cluster control the reconciliation of infrastructure for other clusters. - * However, with the concurrent [Cluster Registry](/contributors/design-proposals/multicluster/cluster-registry/api-design.md) project, a good separation of responsibilities would be that the Cluster Registry API is responsible for indexing multiple clusters, each of which would only have to know about itself. In order to achieve cross-cluster reconciliation, a controller would need to integrate with a Cluster Registry for discovery. -* Should a cluster’s control plane definition should be housed within that same cluster. - * If the control plane becomes unhealthy, then it won’t be able to rectify itself without external intervention. If the control plane configuration lives elsewhere, and the controllers reconciling its state are able to act in the face of control plane failure, then this API could be used to fix a misconfigured control plane that is unresponsive. -* Should our representation of Nodes allow declarative versioning of non-Kubernetes packages, like the container runtime, the Linux kernel, etc.? - * It potentially enables the use case of smaller, in-place upgrades to nodes without changing the node image. - * We may be able to leverage cloud-init to some extent, but since it isn’t supported across all cloud/distributions, and doesn’t support upgrades (or any actions beyond initialization), this may devolve into rolling our own solution. -* Should the Cluster API bother with control plane configuration, or expect each component to use component config? 
- * One option is to allow arbitrary API objects to be defined during cluster initialization, which will be a combination of Cluster objects, NodeSet objects, and ConfigMaps for relevant component config. This makes the Cluster API less comprehensive, but avoids redundancy and more accurately reflects the desired state of the cluster. - * Another option is to have key component config embedded in the Cluster API, which will then be created as the appropriate ConfigMaps during creation. This would be used as a convenience during cluster creation, and then the separate ConfigMaps become the authoritative configuration, potentially with a control loop to propagate changes from the embedded component config in the Cluster API to the appropriate (authoritative) ConfigMaps on an ongoing basis. -* Do we want to allow for arbitrary node boot scripts? - * Some existing tools like kubicorn support this, but the user demand isn’t clear yet. - * Also see https://github.com/kubernetes/kops/issues/387 - * Kops now has hooks -* Are there any environments in which it only makes sense to refer to a group of homogeneous nodes, instead of individual ones? - * The current proposal is to start with individual objects to represent each declarative node (called a “Machine”), which allows us to build support for Sets and Deployments on top of them in the future. However, does this simplification break for any environment we want to support? - - -## Proposal - -### Driving Use Cases - -_TODO_: Separate out the use cases that are focused on the control plane vs. those focused on nodes. - - -These use cases are in scope for our v1alpha1 API design and initial prototype implementation: - -* Initial cluster creation using these API objects in yaml files (implemented via client-side bootstrapping of resources) - * Rather than each Kubernetes installer having its own custom APIs and cluster definitions, they could be fed the definition of the cluster via serialized API objects. This would lower the friction of moving between different lifecycle products. -* Declarative Kubernetes upgrades for the control plane and kubelets -* Declarative upgrades for node OS images -* Maintaining consistency of control plane and machine configuration across different clusters / clouds - * By representing important cluster configuration via declarative objects, operations like “diffing” the configuration of two clusters becomes very straightforward. Also, reconcilers can be written to ensure that important cluster configuration is kept in sync between different clusters by simply copying objects. -* Cloud adoption / lift and shift / liberation - -These use cases are in scope for the project, but post-v1alpha1: - -* Server-side node draining -* Autoscaling - * Currently, the OSS cluster autoscaler has the responsibility of determining the right size of the cluster and calling the cloud provider to perform the scaling (supporting every cloud provider directly). Modeling groups of nodes in a declarative way would allow autoscalers to only need to worry about the correct cluster size and error handling when that can’t be achieved (e.g. in the case of stockouts), and then separate cloud controllers can be responsible for creating and deleting nodes to reconcile that state and report any errors encountered. -* Integration with the Cluster Registry API - * Automatically add a new cluster to a registry, support tooling that works across multiple clusters using a registry, delete a cluster from a registry. 
-* Supporting other common tooling, like monitoring - -These use cases are out of scope entirely: - -* Creating arbitrary cloud resources - -### Cluster-level API - -This level of the Cluster Management API describes the global configuration of a cluster. It should be capable of representing the versioning and configuration of the entire control plane, irrespective of the representation of nodes. - -Given the recent efforts of SIG Cluster Lifecycle to make kubeadm the de facto standard toolkit for cloud- and vendor-agnostic cluster initialization, and because kubeadm has [an existing API](https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/apis/kubeadm/v1alpha3/types.go) to define the global configuration for a cluster, it makes sense to coalesce the global portion of the Cluster API with the API used by “kubeadm init” to configure a cluster master. - -A current goal is to make these APIs as cloud-agnostic as possible, so that the entire definition of a Cluster could remain reasonably in-sync across different deployments potentially in different cloud providers, which would help enable hybrid usecases where it’s desirable to have key configuration stay in sync across different clusters potentially in different clouds/environments. However, this goal is balanced against making the APIs coherent and usable, which strict separation may harm. - -The full types for this API can be seen and were initially discussed in [kube-deploy#306](https://github.com/kubernetes/kube-deploy/pull/306). - -### Machine API - -#### Capabilities - -The set of node capabilities that this proposal is targeting for v1alpha1 are: -1. A new Node can be created in a declarative way, including Kubernetes version and container runtime version. It should also be able to specify provider-specific information such as OS image, instance type, disk configuration, etc., though this will not be portable. -1. A specific Node can be deleted, freeing external resources associated with it. -1. A specific Node can have its kubelet version upgraded or downgraded in a declarative way\*. -1. A specific Node can have its container runtime changed, or its version upgraded or downgraded, in a declarative way\*. -1. A specific Node can have its OS image upgraded or downgraded in a declarative way\*. - -\* It is an implementation detail of the provider if these operations are performed in-place or via Node replacement. - -#### Overview - -This proposal introduces a new API type: **Machine**. - -A "Machine" is the declarative spec for a Node, as represented in Kubernetes core. If a new Machine object is created, a provider-specific controller will handle provisioning and installing a new host to register as a new Node matching the Machine spec. If the Machine's spec is updated, a provider-specific controller is responsible for updating the Node in-place or replacing the host with a new one matching the updated spec. If a Machine object is deleted, the corresponding Node should have its external resources released by the provider-specific controller, and should be deleted as well. - -Fields like the kubelet version, the container runtime to use, and its version, are modeled as fields on the Machine's spec. Any other information that is provider-specific, though, is part of an opaque ProviderConfig string that is not portable between different providers. 
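The canonical v1alpha1 types are the ones discussed in kube-deploy#298; the fragment below is only a simplified sketch of the shape described in the preceding paragraph, with portable fields such as the kubelet and container runtime versions on the Machine spec and everything provider-specific collapsed into an opaque `ProviderConfig` string. The field names here are illustrative, not the actual API.

```go
// Illustrative only - the canonical types are discussed in kube-deploy#298.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// Machine is the declarative spec for a single Node.
type Machine struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   MachineSpec   `json:"spec"`
	Status MachineStatus `json:"status,omitempty"`
}

// MachineSpec holds the portable, provider-agnostic fields.
type MachineSpec struct {
	// Versions that a provider controller must reconcile, either in-place
	// or by replacing the backing host.
	KubeletVersion          string `json:"kubeletVersion"`
	ContainerRuntimeName    string `json:"containerRuntimeName"`
	ContainerRuntimeVersion string `json:"containerRuntimeVersion"`

	// ProviderConfig is an opaque, provider-owned serialized object
	// (instance type, OS image, network, disks, ...). It is not portable
	// between providers.
	ProviderConfig string `json:"providerConfig"`
}

// MachineStatus reports observed state, e.g. a reference to the Node that
// was created for this Machine.
type MachineStatus struct {
	NodeRef string `json:"nodeRef,omitempty"`
}
```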
- -The ProviderConfig is recommended to be a serialized API object in a format owned by that provider, akin to the [Component Config](https://docs.google.com/document/d/1arP4T9Qkp2SovlJZ_y790sBeiWXDO6SG10pZ_UUU-Lc/edit) pattern. This will allow the configuration to be strongly typed, versioned, and have as much nested depth as appropriate. These provider-specific API definitions are meant to live outside of the Machines API, which will allow them to evolve independently of it. Attributes like instance type, which network to use, and the OS image all belong in the ProviderConfig. - -#### In-place vs. Replace - -One simplification that might be controversial in this proposal is the lack of API control over "in-place" versus "replace" reconciliation strategies. For instance, if a Machine's spec is updated with a different version of kubelet than is actually running, it is up to the provider-specific controller whether the request would best be fulfilled by performing an in-place upgrade on the Node, or by deleting the Node and creating a new one in its place (or reporting an error if this particular update is not supported). One can force a Node replacement by deleting and recreating the Machine object rather than updating it, but no similar mechanism exists to force an in-place change. - -Another approach considered was that modifying an existing Machine should only ever attempt an in-place modification to the Node, and Node replacement should only occur by deleting and creating a new Machine. In that case, a provider would set an error field in the status if it wasn't able to fulfill the requested in-place change (such as changing the OS image or instance type in a cloud provider). - -The reason this approach wasn't used was because most cluster upgrade tools built on top of the Machines API would follow the same pattern: - -``` -for machine in machines: - attempt to upgrade machine in-place - if error: - create new machine - delete old machine -``` - -Since updating a Node in-place is likely going to be faster than completely replacing it, most tools would opt to use this pattern to attempt an in-place modification first, before falling back to a full replacement. - -It seems like a much more powerful concept to allow every tool to instead say: - -``` -for machine in machines: - update machine -``` - -and allow the provider to decide if it is capable of performing an in-place update, or if a full Node replacement is necessary. - -#### Omitted Capabilities - -**A scalable representation of a group of nodes** - -Given the existing targeted capabilities, this functionality could easily be built client-side via label selectors to find groups of Nodes and using (1) and (2) to add or delete instances to simulate this scaling. - -It is natural to extend this API in the future to introduce the concepts of MachineSets and MachineDeployments that mirror ReplicaSets and Deployments, but an initial goal is to first solidify the definition and behavior of a single Machine, similar to how Kubernetes first solidifed Pods. - -A nice property of this proposal is that if provider controllers are written solely against Machines, the concept of MachineSets can be implemented in a provider-agnostic way with a generic controller that uses the MachineSet template to create and delete Machine instances. All Machine-based provider controllers will continue to work, and will get full MachineSet functionality for free without modification. 
Similarly, a MachineDeployment controller could then be introduced to generically operate on MachineSets without having to know about Machines or providers. Provider-specific controllers that are actually responsible for creating and deleting hosts would only ever have to worry about individual Machine objects, unless they explicitly opt into watching higher-level APIs like MachineSets in order to take advantage of provider-specific features like AutoScalingGroups or Managed Instance Groups. - -However, this leaves the barrier to entry very low for adding new providers: simply implement creation and deletion of individual Nodes, and get Sets and Deployments for free. - -**A provider-agnostic mechanism to request new nodes** - -In this proposal, only certain attributes of Machines are provider-agnostic and can be operated on in a generic way. In other iterations of similar proposals, much care had been taken to allow the creation of truly provider-agnostic Machines that could be mapped to provider-specific attributes in order to better support usecases around automated Machine scaling. This introduced a lot of upfront complexity in the API proposals. - -This proposal starts much more minimalistic, but doesn't preclude the option of extending the API to support these advanced concepts in the future. - -**Dynamic API endpoint** - -This proposal lacks the ability to declaratively update the kube-apiserver endpoint for the kubelet to register with. This feature could be added later, but doesn't seem to have demand now. Rather than modeling the kube-apiserver endpoint in the Machine object, it is expected that the cluster installation tool resolves the correct endpoint to use, starts a provider-specific Machines controller configured with this endpoint, and that the controller injects the endpoint into any hosts it provisions. - -#### Conditions - -[bgrant0607](https://github.com/bgrant0607) and [erictune](https://github.com/erictune) have indicated that the API pattern of having "Conditions" lists in object statuses is soon to be deprecated. These have generally been used as a timeline of state transitions for the object's reconciliation, and difficult to consume for clients that just want a meaningful representation of the object's current state. There are no existing examples of the new pattern to follow instead, just the guidance that we should use top-level fields in the status to represent meaningful information. We can revisit the specifics when new patterns start to emerge in core. - -#### Types - -The full Machine API types can be found and discussed in [kube-deploy#298](https://github.com/kubernetes/kube-deploy/pull/298). - -## Graduation Criteria - -__TODO__ - -## Implementation History - -* **December 2017 (KubeCon Austin)**: Prototype implementation on Google Compute Engine using Custom Resource Definitions - -## Drawbacks - -__TODO__ - -## Alternatives - -__TODO__ +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-cluster-lifecycle/0004-bootstrap-checkpointing.md b/keps/sig-cluster-lifecycle/0004-bootstrap-checkpointing.md index 1a39243b..cfd1f5fa 100644 --- a/keps/sig-cluster-lifecycle/0004-bootstrap-checkpointing.md +++ b/keps/sig-cluster-lifecycle/0004-bootstrap-checkpointing.md @@ -1,143 +1,4 @@ ---- -kep-number: 4 -title: Kubernetes Bootstrap Checkpointing Proposal -status: implemented -authors: - - "@timothysc" -owning-sig: sig-cluster-lifecycle -participating-sigs: - - sig-node -reviewers: - - "@yujuhong" - - "@luxas" - - "@roberthbailey" -approvers: - - "@yujuhong" - - "@roberthbailey" -editor: - name: "@timothysc" -creation-date: 2017-10-20 -last-updated: 2018-01-23 ---- - -# Kubernetes Bootstrap Checkpointing Proposal - -## Table of Contents - -* [Summary](#summary) -* [Objectives](#objectives) - * [Goals](#goals) - * [Non-Goals](#non-goals) -* [Proposal](#proposal) - * [User Stories](#user-stories) -* [Graduation Criteria](#graduation-criteria) -* [Implementation History](#implementation-history) -* [Unresolved Questions](#unresolved-questions) - -## Summary - -There are several methods to deploy a kubernetes cluster, one method that -offers some unique advantages is self hosting. The purpose of this proposal -is to outline a method to checkpoint specific annotated pods, namely the -control plane components, for the purpose of enabling self hosting. - -The details of self hosting are beyond the scope of this proposal, and are -outlined in the references listed below: - - - [Self Hosted Kubernetes][0] - - [Kubeadm Upgrades][1] - -Extra details on this proposal, and its history, can be found in the links -below: - - - [Bootstrap Checkpointing Draft 1][2] - - [Bootstrap Checkpointing Draft 2][3] - - [WIP Implementation][4] - -## Objectives - -The scope of this proposal is **bounded**, but has the potential for broader -reuse in the future. The reader should be mindful of the explicitly stated -[Non-Goals](#non-goals) that are listed below. - -### Goals - - - Provide a basic framework for recording annotated *Pods* to the filesystem. - - Ensure that a restart of the kubelet checks for existence of these files - and loads them on startup. - -### Non-Goals - -- This is not a generic checkpointing mechanism for arbitrary resources. -(e.g. Secrets) Such changes require wider discussions. -- This will not checkpoint internal kubelet state. -- This proposal does not cover self hosted kubelet(s). It is beyond the -scope of this proposal, and comes with it's own unique set of challenges. - -## Proposal -The enablement of this feature is gated by a single command line flag that -is passed to the kubelet on startup, ```--bootstrap-checkpoint-path``` , -and will be denoted that it is ```[Alpha]```. - -### User Stories - -#### Pod Submission to Running -- On submission of a Pod, via kubeadm or an operator, an annotation -```node.kubernetes.io/bootstrap-checkpoint=true``` is added to that Pod, which -indicates that it should be checkpointed by the kubelet. When the kubelet -receives a notification from the apiserver that a new pod is to run, it will -inspect the ```--bootstrap-checkpoint-path``` flag to determine if -checkpointing is enabled. Finally, the kubelet will perform an atomic -write of a ```Pod_UID.yaml``` file when the afore mentioned annotation exists. -The scope of this annotation is bounded and will not be promoted to a field. 
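As a rough sketch of the write path described above (not the kubelet implementation), checkpointing amounts to an atomic write of the pod manifest to `Pod_UID.yaml` under the configured path, gated on both the flag and the annotation. The package name, the shape of the helper, and the use of `sigs.k8s.io/yaml` for serialization are assumptions made for illustration.

```go
// Illustrative sketch of the checkpoint write described in this proposal;
// this is not the kubelet implementation.
package checkpoint

import (
	"fmt"
	"os"
	"path/filepath"

	v1 "k8s.io/api/core/v1"
	"sigs.k8s.io/yaml"
)

// maybeCheckpointPod writes <checkpointPath>/Pod_<UID>.yaml for pods carrying
// the bootstrap-checkpoint annotation, using a temp-file-plus-rename so the
// write is atomic.
func maybeCheckpointPod(checkpointPath string, pod *v1.Pod) error {
	// Checkpointing is opt-in twice: the --bootstrap-checkpoint-path flag
	// and the per-pod annotation.
	if checkpointPath == "" || pod.Annotations["node.kubernetes.io/bootstrap-checkpoint"] != "true" {
		return nil
	}

	data, err := yaml.Marshal(pod)
	if err != nil {
		return err
	}

	final := filepath.Join(checkpointPath, fmt.Sprintf("Pod_%s.yaml", pod.UID))
	tmp := final + ".tmp"
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, final)
}
```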
- -#### Pod Deletion -- On detected deletion of a Pod, the kubelet will remove the associated -checkpoint from the filesystem. Any failure to remove a pod, or file, will -result in an error notification in the kubelet logs. - -#### Cold Start -- On a cold start, the kubelet will check the value of -```--bootstrap-checkpoint-path```. If the value is specified, it will read in -the contents of the that directory and startup the appropriate Pod. Lastly, -the kubelet will then pull the list of pods from the api-server and rectify -what is supposed to be running according to what is bound, and will go through -its normal startup procedure. - -### Implementation Constraints -Due to its opt-in behavior, administrators will need to take the same precautions -necessary in segregating master nodes, when enabling the bootstrap annotation. - -Please see [WIP Implementation][4] for more details. - -## Graduation Criteria - -Graduating this feature is a responsibility of sig-cluster-lifecycle and -sig-node to determine over the course of the 1.10 and 1.11 releases. History -has taught us that initial implementations often have a tendency overlook use -cases and require refinement. It is the goal of this proposal to have an -initial alpha implementation of bootstrap checkpoining in the 1.9 cycle, -and further refinement will occur after we have validated it across several -deployments. - -## Testing -Testing of this feature will occur in three parts. -- Unit testing of standard code behavior -- Simple node-e2e test to ensure restart recovery -- (TODO) E2E test w/kubeadm self hosted master restart recovery of an apiserver. - -## Implementation History - -- 20171020 - 1.9 draft proposal -- 20171101 - 1.9 accepted proposal -- 20171114 - 1.9 alpha implementation code complete - -## Unresolved Questions - -* None at this time. - -[0]: /contributors/design-proposals/cluster-lifecycle/self-hosted-kubernetes.md -[1]: https://github.com/kubernetes/community/pull/825 -[2]: https://docs.google.com/document/d/1hhrCa_nv0Sg4O_zJYOnelE8a5ClieyewEsQM6c7-5-o/edit?ts=5988fba8# -[3]: https://docs.google.com/document/d/1qmK0Iq4fqxnd8COBFZHpip27fT-qSPkOgy1x2QqjYaQ/edit?ts=599b797c# -[4]: https://github.com/kubernetes/kubernetes/pull/50984 +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-cluster-lifecycle/0008-kubeadm-config-versioning.md b/keps/sig-cluster-lifecycle/0008-kubeadm-config-versioning.md index fe3bde8c..cfd1f5fa 100644 --- a/keps/sig-cluster-lifecycle/0008-kubeadm-config-versioning.md +++ b/keps/sig-cluster-lifecycle/0008-kubeadm-config-versioning.md @@ -1,145 +1,4 @@ ---- -kep-number: draft-20180412 -title: Kubeadm Config versioning -authors: - - "@liztio" -owning-sig: sig-cluster-lifecycle -participating-sigs: [] -reviewers: - - "@timothysc" -approvers: - - TBD -editor: TBD -creation-date: 2018-04-12 -last-updated: 2018-04-12 -status: draft -see-also: [] -replaces: [] -superseded-by: [] ---- - -# Kubeadm Config Versioning - -## Table of Contents - -A table of contents is helpful for quickly jumping to sections of a KEP and for highlighting any additional information provided beyond the standard KEP template. - -<!-- markdown-toc start - Don't edit this section. Run M-x markdown-toc-refresh-toc --> -**Table of Contents** - -- [Kubeadm Config to Beta](#kubeadm-config-to-beta) - - [Table of Contents](#table-of-contents) - - [Summary](#summary) - - [Motivation](#motivation) - - [Goals](#goals) - - [Non-Goals](#non-goals) - - [Proposal](#proposal) - - [User Stories [optional]](#user-stories-optional) - - [As a user upgrading with Kubeadm, I want the upgrade process to not fail with unfamiliar configuration.](#as-a-user-upgrading-with-kubeadm-i-want-the-upgrade-process-to-not-fail-with-unfamiliar-configuration) - - [As a infrastructure system using kubeadm, I want to be able to write configuration files that always work.](#as-a-infrastructure-system-using-kubeadm-i-want-to-be-able-to-write-configuration-files-that-always-work) - - [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints) - - [Risks and Mitigations](#risks-and-mitigations) - - [Graduation Criteria](#graduation-criteria) - - [Implementation History](#implementation-history) - - [Alternatives](#alternatives) - -<!-- markdown-toc end --> - -## Summary - -Kubeadm uses MasterConfiguraton for two distinct but similar operations: Initialising a new cluster and upgrading an existing cluster. -The former is typically created by hand by an administrator. -It is stored on disk and passed to `kubeadm init` via command line flag. -The latter is produced by kubeadm using supplied configuration files, command line options, and internal defaults. -It will be stored in a ConfigMap so upgrade operations can find. - -Right now the configuration format is unversioned. -This means configuration file formats can change between kubeadm versions and there's no safe way to update the configuration format. - -We propose a stable versioning of this configuration, `v1alpha2` and eventually `v1beta1`. -Version information will be _mandatory_ going forward, both for user-generated configuration files and machine-generated configuration maps. - -There as an [existing document][config] describing current Kubernetes best practices around component configuration. - -[config]: https://docs.google.com/document/d/1FdaEJUEh091qf5B98HM6_8MS764iXrxxigNIdwHYW9c/edit#heading=h.nlhhig66a0v6 - -## Motivation - -After 1.10.0, we discovered a bug in the upgrade process. -The `MasterConfiguraton` embedded a [struct that had changed][proxyconfig], which caused a backwards-incompatible change to the configuration format. -This caused `kubeadm upgrade` to fail, because a newer version of kubeadm was attempting to deserialise an older version of the struct. 
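For illustration, a versioned configuration file makes the expected schema explicit in the file itself, so a newer kubeadm can detect and convert an older document instead of mis-deserialising it. In the sketch below, only the `MasterConfiguration` kind and the `v1alpha2` version come from this KEP; the exact API group string and the sample fields are assumptions.

```yaml
# Sketch of a versioned kubeadm configuration header. Kind and version are
# the ones discussed in this KEP; group string and fields are illustrative.
apiVersion: kubeadm.k8s.io/v1alpha2
kind: MasterConfiguration
kubernetesVersion: v1.11.0
api:
  advertiseAddress: 192.168.0.10
```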
- -Because the configuration is often written and read by different versions of kubeadm compiled by different versions of kubernetes, -it's very important for this configuration file to be well-versioned. - -[proxyconfig]: https://github.com/kubernetes/kubernetes/commit/57071d85ee2c27332390f0983f42f43d89821961 - -### Goals - -* kubeadm init fails if a configuration file isn't versioned -* the config map written out contains a version -* the configuration struct does not embed any other structs -* existing configuration files are converted on upgrade to a known, stable version -* structs should be sparsely populated -* all structs should have reasonable defaults so an empty config is still sensible - -### Non-Goals - -* kubeadm is able to read and write configuration files for older and newer versions of kubernetes than it was compiled with -* substantially changing the schema of the `MasterConfiguration` - -## Proposal - -The concrete proposal is as follows. - -1. Immediately start writing Kind and Version information into the `MasterConfiguraton` struct. -2. Define the previous (1.9) version of the struct as `v1alpha1`. -3. Duplicate the KubeProxyConfig struct that caused the schema change, adding the old version to the `v1alpha1` struct. -3. Create a new `v1alpha2` directory mirroring the existing [`v1alpha1`][v1alpha1], which matches the 1.10 schema. - This version need not duplicate the file as well. -2. Warn users if their configuration files do not have a version and kind -4. Use [apimachinery's conversion][conversion] library to design migrations from the old (v1alpha1) versions to the new (v1alpha2) versions -5. Determine the changes for v1beta1 -6. With v1beta1, enforce presence of version numbers in config files and ConfigMaps, erroring if not present. - -[conversion]: https://godoc.org/k8s.io/apimachinery/pkg/conversion -[v1alpha1]: https://github.com/kubernetes/kubernetes/tree/d7d4381961f4eb2a4b581160707feb55731e324e/cmd/kubeadm/app/apis/kubeadm - -### User Stories [optional] - -#### As a user upgrading with Kubeadm, I want the upgrade process to not fail with unfamiliar configuration. - -In the past, the haphazard nature of the versioning system has meant it was hard to provide strong guarantees between versions. -Implementing strong version guarantees mean any given configuration generated in the past by kubeadm will work with a future version of kubeadm. -Deprecations can happen in the future in well-regulated ways. - -#### As a infrastructure system using kubeadm, I want to be able to write configuration files that always work. - -Having a configuration file that changes without notice makes it very difficult to write software that integrates with kubeadm. -By providing strong version guarantees, we can guarantee that the files these tools produce will work with a given version of kubeadm. - -### Implementation Details/Notes/Constraints - -The incident that caused the breakage in alpha wasn't a field changed it Kubeadm, it was a struct [referenced][struct] inside the `MasterConfiguration` struct. -By completely owning our own configuration, changes in the rest of the project can't unknowingly affect us. -When we do need to interface with the rest of the project, we will do so explicitly in code and be protected by the compiler. 
- -[struct]: https://github.com/kubernetes/kubernetes/blob/d7d4381961f4eb2a4b581160707feb55731e324e/cmd/kubeadm/app/apis/kubeadm/v1alpha1/types.go#L285 - -### Risks and Mitigations - -Moving to a strongly versioned configuration from a weakly versioned one must be done carefully so as not break kubeadm for existing users. -We can start requiring versions of the existing `v1alpha1` format, issuing warnings to users when Version and Kind aren't present. -These fields can be used today, they're simply ignored. -In the future, we could require them, and transition to using `v1alpha1`. - -## Graduation Criteria - -This KEP can be considered complete once all currently supported versions of Kubeadm write out `v1beta1`-version structs. - -## Implementation History - -## Alternatives - -Rather than creating our own copies of all structs in the `MasterConfiguration` struct, we could instead continue embedding the structs. -To provide our guarantees, we would have to invest a lot more in automated testing for upgrades. +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-cluster-lifecycle/0014-20180707-componentconfig-api-types-to-staging.md b/keps/sig-cluster-lifecycle/0014-20180707-componentconfig-api-types-to-staging.md index ec221294..cfd1f5fa 100644 --- a/keps/sig-cluster-lifecycle/0014-20180707-componentconfig-api-types-to-staging.md +++ b/keps/sig-cluster-lifecycle/0014-20180707-componentconfig-api-types-to-staging.md @@ -1,344 +1,4 @@ ---- -kep-number: 17 -title: Moving ComponentConfig API types to staging repos -status: implementable -authors: - - "@luxas" - - "@sttts" -owning-sig: sig-cluster-lifecycle -participating-sigs: - - sig-api-machinery - - sig-node - - sig-network - - sig-scheduling - - sig-cloud-provider -reviewers: - - "@thockin" - - "@liggitt" - - "@wojtek-t" - - "@stewart-yu" - - "@dixudx" -approvers: - - "@thockin" - - "@jbeda" - - "@deads2k" -editor: - name: "@luxas" -creation-date: 2018-07-07 -last-updated: 2018-08-10 ---- - -# Moving ComponentConfig API types to staging repos - -**How we can start supporting reading versioned configuration for all our components after a code move for ComponentConfig to staging** - -## Table of Contents - - * [Moving ComponentConfig API types to staging repos](#moving-componentconfig-api-types-to-staging-repos) - * [Table of Contents](#table-of-contents) - * [Summary](#summary) - * [The current state of the world](#the-current-state-of-the-world) - * [Current kubelet](#current-kubelet) - * [Current kube-proxy](#current-kube-proxy) - * [Current kube-scheduler](#current-kube-scheduler) - * [Current kube-controller-manager](#current-kube-controller-manager) - * [Current kube-apiserver](#current-kube-apiserver) - * [Current cloud-controller-manager](#current-cloud-controller-manager) - * [Goals](#goals) - * [Non-goals](#non-goals) - * [Related proposals and further references](#related-proposals-and-further-references) - * [Proposal](#proposal) - * [Migration strategy per component or k8s.io repo](#migration-strategy-per-component-or-k8sio-repo) - * [k8s.io/apimachinery changes](#k8sioapimachinery-changes) - * [k8s.io/apiserver changes](#k8sioapiserver-changes) - * [kubelet changes](#kubelet-changes) - * [kube-proxy changes](#kube-proxy-changes) - * [kube-scheduler changes](#kube-scheduler-changes) - * [k8s.io/controller-manager changes](#k8siocontroller-manager-changes) - * [kube-controller-manager changes](#kube-controller-manager-changes) - * [cloud-controller-manager changes](#cloud-controller-manager-changes) - * [kube-apiserver changes](#kube-apiserver-changes) - * [Timeframe and Implementation Order](#timeframe-and-implementation-order) - * [OWNERS files for new packages and repos](#owners-files-for-new-packages-and-repos) - -## Summary - -Currently all ComponentConfiguration API types are in the core Kubernetes repo. This makes them practically inaccessible for any third-party tool. With more and more generated code being removed from the core Kubernetes repo, vendoring gets even more complicated. Last but not least, efforts to move out kubeadm from the core repo are blocked by this. - -This KEP is about creating new staging repos, `k8s.io/{component}`, which will host the external -types of the core components’ ComponentConfig in a top-level `config/` package. Internal types will *eventually* be stored in -`k8s.io/{component}/pkg/apis/config` (but a non-goal for this KEP). Shared types will go to `k8s.io/{apimachinery,apiserver,controller-manager}/pkg/apis/config`. 
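To ground what consuming these external types would look like: once the kubelet's external ComponentConfig lives in a staging repo, a third-party tool could decode a file such as the sketch below while depending only on `k8s.io/apimachinery` and that repo. The GroupVersionKind matches the kubelet entry listed later in this document; the fields shown are illustrative, not a recommended configuration.

```yaml
# Sketch: a versioned kubelet ComponentConfig document of the kind that
# external consumers could decode from the proposed staging repo.
# Field values are illustrative only.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
clusterDomain: cluster.local
clusterDNS:
  - 10.96.0.10
```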
- -### The current state of the world - -#### Current kubelet - -* **Package**: [k8s.io/kubernetes/pkg/kubelet/apis/kubeletconfig](https://github.com/kubernetes/kubernetes/blob/release-1.11/pkg/kubelet/apis/kubeletconfig/types.go) -* **GroupVersionKind:** `kubelet.config.k8s.io/v1beta.KubeletConfiguration` -* **Supports** reading **config from file** with **flag precedence**, **well-tested**. - -#### Current kube-proxy - -* **Package**: [k8s.io/kubernetes/pkg/proxy/apis/kubeproxyconfig](https://github.com/kubernetes/kubernetes/blob/release-1.11/pkg/proxy/apis/kubeproxyconfig/types.go) -* **GroupVersionKind**: `kubeproxy.config.k8s.io/v1alpha1.KubeProxyConfiguration` -* **Supports** reading **config from file**, **without flag precedence**, **not tested**. -* This API group has its own copy of `ClientConnectionConfiguration` instead of a shared type. - -#### Current kube-scheduler - -* **Package**: [k8s.io/kubernetes/pkg/apis/componentconfig](https://github.com/kubernetes/kubernetes/blob/release-1.11/pkg/apis/componentconfig/types.go) -* **GroupVersionKind**: `componentconfig/v1alpha1.KubeSchedulerConfiguration` -* **Supports** reading **config from file**, **without flag precedence**, **not tested** -* This API group has its own copies of `ClientConnectionConfiguration` & `LeaderElectionConfiguration` instead of shared types. - -#### Current kube-controller-manager - -* **Package**: [k8s.io/kubernetes/pkg/apis/componentconfig](https://github.com/kubernetes/kubernetes/blob/release-1.11/pkg/apis/componentconfig/types.go) -* **GroupVersionKind**: `componentconfig/v1alpha1.KubeControllerManagerConfiguration` -* **No support for config from file** -* This API group has its own copies of `ClientConnectionConfiguration` & `LeaderElectionConfiguration` instead of shared types. - -#### Current kube-apiserver - -* **Doesn’t expose component configuration anywhere** -* **No support for config from file** -* The most similar thing to componentconfig for the API server is the `ServerRunOptions` struct in - [k8s.io/kubernetes/cmd/kube-apiserver/app/options/options.go](https://github.com/kubernetes/kubernetes/blob/release-1.11/cmd/kube-apiserver/app/options/options.go) - -#### Current cloud-controller-manager - -* **Package**: [k8s.io/kubernetes/pkg/apis/componentconfig](https://github.com/kubernetes/kubernetes/blob/release-1.11/pkg/apis/componentconfig/types.go) -* **GroupVersionKind**: `componentconfig/v1alpha1.CloudControllerManagerConfiguration` -* **No support for config from file** -* This API group has its own copies of `ClientConnectionConfiguration` & `LeaderElectionConfiguration` instead of shared types. - -### Goals - -* Find a home for the ComponentConfig API types, hosted as a staging repo in the "core" repo that is kubernetes/kubernetes -* Make ComponentConfig API types consumable from *projects outside of kube* and from different parts of kube itself - * Resolve dependencies from the external ComponentConfig API types so that everything can depend on them - * The only dependency of the ComponentConfig API types should be `k8s.io/apimachinery` -* Split internal types from versioned types -* Remove the monolithic `componentconfig/v1alpha1` API group -* Enable the staging bot so that a `[https://github.com/kubernetes/](https://github.com/kubernetes/){component}` - (imported as `k8s.io/{component}`) repos are published regularly. 
-* The future API server componentconfig code should be compatible with the proposed structure - -### Non-goals - -* Graduate the API versions - * For v1.12, we’re working incrementally and will keep the API versions of the existing ComponentConfigs. -* Do major refactoring of the ComponentConfigs. This PR is about code moves, not about re-defining the structure. We will do the latter in follow-ups. -* Change the components to support reading a config file, do flag precedence correctly or add e2e testing - * Further, the "load-versioned-config-from-flag" feature in this proposal *should not* be confused with the - [Dynamic Kubelet Configuration](https://kubernetes.io/docs/tasks/administer-cluster/reconfigure-kubelet/) feature. There is nothing in this proposal advocating for that - a component should support a similar feature. This is all about the making the one-off “read bytes from a source and unmarshal into internal config state” possible for - both the internal components and external consumers of these APIs - * This is work to be done after this proposal is implemented (for every component but the kubelet which has this implemented already), and might or might not require - further, more component-specific proposals/KEPs -* Create a net-new ComponentConfiguration struct for the API server -* Publish the internal types to the new `k8s.io/{component}` repo -* Support ComponentConfiguration for the cloud-controller-manager, as it’s a stop-gap for the cloud providers to move out of tree. This effort is in progress. - * When the currently in-tree cloud providers have move out of tree, e.g. to `k8s.io/cloud-provider-gcp`, they should create their own external and internal types and - make the command support loading configuration files. - * The new repo can reuse the generic types from the to-be-created, `k8s.io/controller-manager` repo eventually. - * Meanwhile, the cloud-controller-manager will reference the parts it needs from the main repo, and live privately in `cmd/cloud-controller-manager` -* Expose defaulting functions for the external ComponentConfig types in `k8s.io/{component}/config` packages. - * Defaulting functions will still live local to the component in e.g. `k8s.io/kubernetes/pkg/{component}/apis/config/{version}` and be - registered in the default scheme, but won't be publicly exposed in the `k8s.io/{component}` repo. - * The only defaulting functions that are published to non-core repos are for the shared config types, in other words in - `k8s.io/{apimachinery,apiserver,controller-manager}/pkg/apis/config/{version}`, but **they are not registered in the scheme by default** - (with the normal `SetDefault_Foo` method and the `addDefaultingFunc(scheme *runtime.Scheme) { return RegisterDefaults(scheme) }` function). - Instead, there will be `RecommendedDefaultFoo` methods exposed, which the consumer of the shared types may or may not manually run in - `SetDefaults_Bar` functions (where `Bar` wraps `Foo` as a field). 
- -### Related proposals and further references - -* [Original Google Docs version of this KEP](https://docs.google.com/document/d/1-u2y03ufX7FzBDWv9dVI_HyiIQZL8iz3o3vabeOpP5Y/edit) -* [Kubernetes Component Configuration](https://docs.google.com/document/d/1arP4T9Qkp2SovlJZ_y790sBeiWXDO6SG10pZ_UUU-Lc/edit) by [@mikedanese](https://github.com/mikedanese) -* [Versioned Component Configuration Files](https://docs.google.com/document/d/1FdaEJUEh091qf5B98HM6_8MS764iXrxxigNIdwHYW9c/edit#) by [@mtaufen](https://github.com/mtaufen) -* [Creating a ComponentConfig struct for the API server](https://docs.google.com/document/d/1fcStTcdS2Foo6dVdI787Dilr0snNbqzXcsYdF74JIrg/edit) by [@luxas](https://github.com/luxas) & [@sttts](https://github.com/sttts) -* Related tracking issues in kubernetes/kubernetes: - * [Move `KubeControllerManagerConfiguration` to `pkg/controller/apis/`](https://github.com/kubernetes/kubernetes/issues/57618) - * [kube-proxy config should move out of ComponentConfig apigroup](https://github.com/kubernetes/kubernetes/issues/53577) - * [Move ClientConnectionConfiguration struct to its own api group](https://github.com/kubernetes/kubernetes/issues/54318) - * [Move `CloudControllerManagerConfiguration` to `cmd/cloud-controller-manager/apis`](https://github.com/kubernetes/kubernetes/issues/65458) - -## Proposal - -* for component in [kubelet kubeproxy kubecontrollermanager kubeapiserver kubescheduler] - * API group name: `{component}.config.k8s.io` - * Kind name: `{Component}Configuration` - * Code location: - * External types: `k8s.io/{component}/config/{version}/types.go` - * Like `k8s.io/api` - * Internal types: `k8s.io/kubernetes/pkg/{component}/apis/config` - * Alternatives, if applicable - * `k8s.io/{component}/pkg/apis/config` (preferred, in the future) - * `k8s.io/kubernetes/cmd/{component}/app/apis/config` - * If dependencies allow it, we can move them to `k8s.io/{component}/pkg/apis/config/types.go`. Not having the external types there is intentional because the `pkg/` package tree is considered as "on-your-own-risks / no code compatibility guarantees", while `config/` is considered as a code API. - * Internal scheme package: `k8s.io/kubernetes/pkg/{component}/apis/config/scheme/scheme.go` - * The scheme package should expose `Scheme *runtime.Scheme`, `Codecs *serializer.CodecFactory`, and `AddToScheme(*runtime.Scheme)`, and have an `init()` method that runs `AddToScheme(Scheme)` - * For the move to a staging repo to be possible, the external API package must not depend on the core repo. - * Hence, all non-staging repo dependencies need to be removed/resolved before the package move. - * Conversions from the external type to the internal type will be kept in `{internal_api_path}/{external_version}`, like for `k8s.io/api` - * Defaulting code will be kept in this package, besides the conversion functions. - * The defaulting code here is specific for the usage of the component, and internal by design. If there are defaulting functions we - feel would be generally useful, they might be exposed in `k8s.io/{component}/config/{version}/defaults.go` as `RecommendedDefaultFoo` - functions that can be used by various consumers optionally. - * Add at least some kind of minimum validation coverage for the types (e.g. ranges for integer values, URL scheme/hostname/port parsing, - DNS name/label/domain validation) in the `{internal_api_path}/validation` package, targeting the internal API version. 
The consequence of - these validations targeting the internal type is that they can't be exposed in the `k8s.io/{component}` repo, but publishing more - functionality like that is out of scope for this KEP and left as a future task. -* Create a "shared types"-package with structs generic to all or many componentconfig API groups, in the `k8s.io/apimachinery`, `k8s.io/apiserver` and `k8s.io/controller-manager` repos, depending on the struct. - * Location: `k8s.io/{apimachinery,apiserver,controller-manager}/pkg/apis/config/{,v1alpha1}` - * These aren’t "real" API groups, but they have both internal and external versions - * Conversions and internal types are published to the staging repo. - * Defaulting functions are of the `RecommendedDefaultFoo` format and opt-ins for consumers. No defaulting functions are registered in the scheme. -* Remove the monolithic `componentconfig/v1alpha1` API group (`pkg/apis/componentconfig`) -* Enable the staging bot to create the Github repos -* Add API roundtrip (fuzzing), defaulting, conversion, JSON tag consistency and validation tests. - -### Migration strategy per component or k8s.io repo - -#### k8s.io/apimachinery changes - -* **Not a "real" API group, instead shared packages only with both external and internal types.** -* **External Package with defaulting (where absolutely necessary) & conversions**: `k8s.io/apimachinery/pkg/apis/config/v1alpha1/types.go` -* **Internal Package**: `k8s.io/apimachinery/pkg/apis/config/types.go` -* Structs to be hosted initially: - * ClientConnectionConfiguration -* Assignee: @hanxiaoshuai - -#### k8s.io/apiserver changes - -* **Not a "real" API group, instead shared packages only with both external and internal types.** -* **External Package with defaulting (where absolutely necessary) & conversions**: `k8s.io/apiserver/pkg/apis/config/v1alpha1/types.go` -* **Internal Package**: `k8s.io/apiserver/pkg/apis/config/types.go` -* Structs to be hosted initially: - * `LeaderElectionConfiguration` - * `DebuggingConfiguration` - * later to be created: SecureServingConfiguration, AuthenticationConfiguration, AuthorizationConfiguration, etc. 
-* Assignee: @hanxiaoshuai - -#### kubelet changes - -* **GroupVersionKind:** `kubelet.config.k8s.io/v1beta.KubeletConfiguration` -* **External Package:** `k8s.io/kubelet/config/v1beta1/types.go` -* **Internal Package:** `k8s.io/kubernetes/pkg/kubelet/apis/config/types.go` -* **Internal Scheme:** `k8s.io/kubernetes/pkg/kubelet/apis/config/scheme/scheme.go` -* **Conversions & defaulting (where absolutely necessary) Package:** `k8s.io/kubernetes/pkg/kubelet/apis/config/v1beta1` -* **Future Internal Package:** `k8s.io/kubelet/pkg/apis/config/types.go` -* Assignee: @mtaufen - -#### kube-proxy changes - -* **GroupVersionKind**: `kubeproxy.config.k8s.io/v1alpha1.KubeProxyConfiguration` -* **External Package**: `k8s.io/kube-proxy/config/v1alpha1/types.go` -* **Internal Package**: `k8s.io/kubernetes/pkg/proxy/apis/config/types.go` -* **Internal Scheme**: `k8s.io/kubernetes/pkg/proxy/apis/config/scheme/scheme.go` -* **Conversions & defaulting (where absolutely necessary) Package:** `k8s.io/kubernetes/pkg/proxy/apis/config/v1alpha1` -* **Future Internal Package:** `k8s.io/kube-proxy/pkg/apis/config/types.go` -* Start referencing `ClientConnectionConfiguration` from the generic ComponentConfig packages -* Assignee: @m1093782566 - -#### kube-scheduler changes - -* **GroupVersionKind**: `kubescheduler.config.k8s.io/v1alpha1.KubeSchedulerConfiguration` -* **External Package**: `k8s.io/kube-scheduler/config/v1alpha1/types.go` -* **Internal Package**: `k8s.io/kubernetes/pkg/scheduler/apis/config/types.go` -* **Internal Scheme**: `k8s.io/kubernetes/pkg/scheduler/apis/config/scheme/scheme.go` -* **Conversions & defaulting (where absolutely necessary) Package:** `k8s.io/kubernetes/pkg/scheduler/apis/config/v1alpha1` -* **Future Internal Package:** `k8s.io/kube-scheduler/pkg/apis/config/types.go` -* Start referencing `ClientConnectionConfiguration` & `LeaderElectionConfiguration` from the generic ComponentConfig packages -* Assignee: @dixudx - -#### k8s.io/controller-manager changes - -* **Not a "real" API group, instead shared packages only with both external and internal types.** -* **External Package with defaulting (where absolutely necessary) & conversions**: `k8s.io/controller-manager/pkg/apis/config/v1alpha1/types.go` -* **Internal Package**: `k8s.io/controller-manager/pkg/apis/config/types.go` -* Will host structs: - * `GenericComponentConfiguration` (which will be renamed to `GenericControllerManagerConfiguration`) -* Assignee: @stewart-yu - -#### kube-controller-manager changes - -* **GroupVersionKind**: `kubecontrollermanager.config.k8s.io/v1alpha1.KubeControllerManagerConfiguration` -* **External Package**: `k8s.io/kube-controller-manager/config/v1alpha1/types.go` -* **Internal Package**: `k8s.io/kubernetes/pkg/controller/apis/config/types.go` -* **Internal Scheme**: `k8s.io/kubernetes/pkg/controller/apis/config/scheme/scheme.go` -* **Conversions & defaulting (where absolutely necessary) Package:** `k8s.io/kubernetes/pkg/controller/apis/config/v1alpha1` -* **Future Internal Package:** `k8s.io/kube-controller-manager/pkg/apis/config/types.go` -* Start referencing `ClientConnectionConfiguration` & `LeaderElectionConfiguration` from the generic ComponentConfig packages -* Assignee: @stewart-yu - -#### cloud-controller-manager changes - -* **Not a "real" API group, instead only internal types in `cmd/`.** -* **Internal Package:** `cmd/cloud-controller-manager/app/apis/config/types.go` -* We do not plan to publish any external types for this in a staging repo. 
-* The internal cloud-controller-manager ComponentConfiguration types will reference both `k8s.io/controller-manager/pkg/apis/config` - and `k8s.io/kubernetes/pkg/controller/apis/config/` -* Assignee: @stewart-yu - -#### kube-apiserver changes - -* **Doesn’t have a ComponentConfig struct at the moment, so there is nothing to move around.** -* Eventually, we want to create this ComponentConfig struct, but exactly how to do that is out of scope for this specific proposal. -* See [Creating a ComponentConfig struct for the API server](https://docs.google.com/document/d/1fcStTcdS2Foo6dVdI787Dilr0snNbqzXcsYdF74JIrg/edit) for a proposal on how to refactor the API server code to be able to expose the final ComponentConfig structure. - -### Vendoring - -#### Vgo - -Vgo – as the future standard vendoring mechanism in Golang – supports [vgo modules](https://research.swtch.com/vgo-module) using a `k8s.io/{component}/config/go.mod` file. Tags of the shape `config/vX.Y` on `k8s.io/{component}` will define a version of the component config of that component. Such a tagged module can be imported into a 3rd-party program without inheriting dependencies outside of the `k8s.io/{component}/config` package. - -The `k8s.io/{component}/config/go.mod` file will look like this: - -``` -module "k8s.io/{component}/config" -require ( - "k8s.io/apimachinery" v1.12.0 -) -``` - -The exact vgo semver versioning scheme we will use is out of scope of this document. We will be able to version the config package independently from the main package `k8s.io/{component}` if we want to, e.g. to implement correct semver semantics. - -Other 3rd-party code can import the config module as usual. Vgo does not add the dependencies from code outside of `k8s.io/{component}/config` (actually, vgo creates a separate `vgo.sum` for the config package with the transitive dependencies). - -Compare http://github.com/sttts/kubeadm for a test project using latest vgo. - -#### Dep - -Dep supports the import of sub-packages without inheriting dependencies from outside of the sub-package. - -### Timeframe and Implementation Order - -Objective: Done for v1.12 - -Implementation order: -* Start with copying over the necessary structs to the `k8s.io/apiserver` and `k8s.io/apimachinery ` shared config packages, with external and internal API versions. The defaulting pattern for these types are of the `RecommendedDefaultFoo` form, in other words defaulting is not part of the scheme. -* Remove as many unnecessary references to `pkg/apis/componentconfig` from the rest of the core repo as possible -* Make the types in `pkg/apis/componentconfig` reuse the newly-created types in `k8s.io/apiserver` and `k8s.io/apimachinery `. -* Start with the scheduler as the first component to be moved out. - * One PR for moving `KubeSchedulerConfiguration` to `staging/src/k8s.io/kube-scheduler/config/v1alpha1/types.go`, and the internal type to `pkg/scheduler/apis/config/types.go`. - * Set up the conversion for the external type by creating the package `pkg/scheduler/apis/config/v1alpha1`, without `types.go`, like how `k8s.io/api` is set up. - * This should be a pure code move. -* Set up staging publishing bot (async, non-critical) -* The kubelet, kube-proxy and kube-controller-manager types follow, each one independently. 
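Since the scheduler is the first component in the implementation order above, a sketch of its config file may help show where the shared types surface: the `clientConnection` and `leaderElection` sections below would map to the shared `ClientConnectionConfiguration` and `LeaderElectionConfiguration` packages rather than to scheduler-local copies. The GroupVersionKind is the one listed for the scheduler in this KEP; the field names and values are illustrative assumptions.

```yaml
# Sketch: a KubeSchedulerConfiguration document in which the clientConnection
# and leaderElection blocks correspond to the shared config types this KEP
# moves to k8s.io/apimachinery and k8s.io/apiserver. Fields are illustrative.
apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
leaderElection:
  leaderElect: true
```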
- -### OWNERS files for new packages and repos - -* Approvers: - * @kubernetes/api-approvers - * @sttts - * @luxas - * @mtaufen -* Reviewers: - * @kubernetes/api-reviewers - * @sttts - * @luxas - * @mtaufen - * @dixudx - * @stewart-yu +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-cluster-lifecycle/0015-kubeadm-join-control-plane.md b/keps/sig-cluster-lifecycle/0015-kubeadm-join-control-plane.md index 78c2546f..cfd1f5fa 100644 --- a/keps/sig-cluster-lifecycle/0015-kubeadm-join-control-plane.md +++ b/keps/sig-cluster-lifecycle/0015-kubeadm-join-control-plane.md @@ -1,444 +1,4 @@ -# kubeadm join --control-plane workflow - -## Metadata - -```yaml ---- -kep-number: 15 -title: kubeadm join --control-plane workflow -status: accepted -authors: - - "@fabriziopandini" -owning-sig: sig-cluster-lifecycle -reviewers: - - "@chuckha” - - "@detiber" - - "@luxas" -approvers: - - "@luxas" - - "@timothysc" -editor: - - "@fabriziopandini" -creation-date: 2018-01-28 -last-updated: 2018-06-29 -see-also: - - KEP 0004 -``` - -## Table of Contents - -<!-- TOC --> - -- [kubeadm join --control-plane workflow](#kubeadm-join---control-plane-workflow) - - [Metadata](#metadata) - - [Table of Contents](#table-of-contents) - - [Summary](#summary) - - [Motivation](#motivation) - - [Goals](#goals) - - [Non-goals](#non-goals) - - [Challenges and Open Questions](#challenges-and-open-questions) - - [Proposal](#proposal) - - [User Stories](#user-stories) - - [Create a cluster with more than one control plane instance (static workflow)](#create-a-cluster-with-more-than-one-control-plane-instance-static-workflow) - - [Add a new control-plane instance (dynamic workflow)](#add-a-new-control-plane-instance-dynamic-workflow) - - [Implementation Details](#implementation-details) - - [Initialize the Kubernetes cluster](#initialize-the-kubernetes-cluster) - - [Preparing for execution of kubeadm join --control-plane](#preparing-for-execution-of-kubeadm-join---control-plane) - - [The kubeadm join --control-plane workflow](#the-kubeadm-join---control-plane-workflow) - - [dynamic workflow (advertise-address == `controlplaneAddress`)](#dynamic-workflow-advertise-address--controlplaneaddress) - - [Static workflow (advertise-address != `controlplaneAddress`)](#static-workflow-advertise-address--controlplaneaddress) - - [Strategies for deploying control plane components](#strategies-for-deploying-control-plane-components) - - [Strategies for distributing cluster certificates](#strategies-for-distributing-cluster-certificates) - - [`kubeadm upgrade` for HA clusters](#kubeadm-upgrade-for-ha-clusters) - - [Graduation Criteria](#graduation-criteria) - - [Implementation History](#implementation-history) - - [Drawbacks](#drawbacks) - - [Alternatives](#alternatives) - -<!-- /TOC --> - -## Summary - -We are extending the kubeadm distinctive `init` and `join` workflow, introducing the -capability to add more than one control plane instance to an existing cluster by means of the -new `kubeadm join --control-plane` option (in alpha release the flag will be named --experimental-control-plane) - -As a consequence, kubeadm will provide a best-practice, “fast path” for creating a -minimum viable, conformant Kubernetes cluster with one or more nodes hosting control-plane instances and -zero or more worker nodes; as better detailed in following paragraphs, please note that -this proposal doesn't solve every possible use case or even the full end-to-end flow automatically. - -## Motivation - -Support for high availability is one of the most requested features for kubeadm. - -Even if, as of today, there is already the possibility to create an HA cluster -using kubeadm in combination with some scripts and/or automation tools (e.g. 
-[this](https://kubernetes.io/docs/setup/independent/high-availability/)), this KEP was -designed with the objective to introduce an upstream simple and reliable solution for -achieving the same goal. - -Such solution will provide a consistent and repeatable base for implementing additional -capabilities like e.g. kubeadm upgrade for HA clusters. - -### Goals - -- "Divide and conquer” - - This proposal - at least in its initial release - does not address all the possible - user stories for creating an highly available Kubernetes cluster, but instead - focuses on: - - - Defining a generic and extensible flow for bootstrapping a cluster with multiple control plane instances, - the `kubeadm join --control-plane` workflow. - - Providing a solution *only* for well defined user stories. see - [User Stories](#user-stories) and [Non-goals](#non-goals). - -- Enable higher-level tools integration - - We expect higher-level and tooling will leverage on kubeadm for creating HA clusters; - accordingly, the `kubeadm join --control-plane` workflow should provide support for - the following operational practices used by higher level tools: - - - Parallel node creation - - Higher-level tools could create nodes in parallel (both nodes hosting control-plane instances and workers) - for reducing the overall cluster startup time. - `kubeadm join --control-plane` should support natively this practice without requiring - the implementation of any synchronization mechanics by higher-level tools. - -- Provide support both for dynamic and static bootstrap flow - - At the time a user is running `kubeadm init`, they might not know what - the cluster setup will look like eventually. For instance, the user may start with - only one control plane instance + n nodes, and then add further control plane instances with - `kubeadm join --control-plane` or add more worker nodes with `kubeadm join` (in any order). - This kind of workflow, where the user doesn’t know in advance the final layout of the control plane - instances, into this document is referred as “dynamic bootstrap workflow”. - - Nevertheless, kubeadm should support also more “static bootstrap flow”, where a user knows - in advance the target layout of the control plane instances (the number, the name and the IP - of nodes hosting control plane instances). - -- Support different etcd deployment scenarios, and more specifically run control plane components - and the etcd cluster on the same machines (stacked control plane nodes) or run the etcd - cluster on dedicated machines. - -### Non-goals - -- Installing a control-plane instance on an existing workers node. - The nodes must be created as a control plane instance or as workers and then are supposed to stick to the - assigned role for their entire life cycle. - -- This proposal doesn't include a solution for etcd cluster management (but nothing in this proposal should - prevent to address this in future). - -- This proposal doesn't include a solution for API server load balancing (Nothing in this proposal - should prevent users from choosing their preferred solution for API server load balancing). - -- This proposal doesn't address the ongoing discussion about kubeadm self-hosting; in light of - divide and conquer goal stated before, it is not planned to provide support for self-hosted clusters - neither in the initial proposal nor in the foreseeable future (but nothing in this proposal should - explicitly prevent to reconsider this in future as well). 
- -- This proposal doesn't provide an automated solution for transferring the CA key and other required - certs from one control-plane instance to the other. More specifically, this proposal doesn't address - the ongoing discussion about storage of kubeadm TLS assets in secrets and it is not planned - to provide support for clusters with TLS stored in secrets (but nothing in this - proposal should explicitly prevent to reconsider this in future). - -- Nothing in this proposal should prevent practices that exist today. - -### Challenges and Open Questions - -- Keep the UX simple. - - - _What are the acceptable trade-offs between the need to have a clean and simple - UX and the variety/complexity of possible kubernetes HA deployments?_ - -- Create a cluster without knowing its final layout - - Supporting a dynamic workflow implies that some information about the cluster are - not available at init time, like e.g. the number of control plane instances, the IP of - nodes candidates for hosting control-plane instances etc. etc. - - - _How to configure a Kubernetes cluster in order to easily adapt to future change - of its own control plane layout like e.g. add a new control-plane instance, remove a - control plane instance?_ - - - _What are the "pivotal" cluster settings that must be defined before initializing - the cluster?_ - - - _How to combine into a single UX support for both static and dynamic bootstrap - workflows?_ - -- Kubeadm limited scope of action - - - Kubeadm binary can execute actions _only_ on the machine where it is running - e.g. it is not possible to execute actions on other nodes, to copy files across - nodes etc. - - During the join workflow, kubeadm can access the cluster _only_ using identities - with limited grants, namely `system:unauthenticated` or `system:node-bootstrapper`. - -- Upgradability - - - How to setup an high available cluster in order to simplify the execution - of cluster version upgrades, both manually or with the support of `kubeadm upgrade`?_ - -## Proposal - -### User Stories - -#### Create a cluster with more than one control plane instance (static workflow) - -As a kubernetes administrator, I want to create a Kubernetes cluster with more than one -control-plane instances, of which I know in advance the name and the IP. - -\* A new "control plane instance" is a new kubernetes node with -`node-role.kubernetes.io/master=""` label and -`node-role.kubernetes.io/master:NoSchedule` taint; a new instance of control plane -components will be deployed on the new node. -As described in goals/non goals, in this first release of the proposal -creating a new control plane instance doesn't trigger the creation of a new etcd member on the -same machine. - -#### Add a new control-plane instance (dynamic workflow) - -As a kubernetes administrator, (_at any time_) I want to add a new control-plane instance* to -an existing Kubernetes cluster. - -### Implementation Details - -#### Initialize the Kubernetes cluster - -As of today, a Kubernetes cluster should be initialized by running `kubeadm init` on a -first node, afterward referred as the bootstrap control plane. - -in order to support the `kubeadm join --control-plane` workflow a new Kubernetes cluster is -expected to satisfy following conditions : - -- The cluster must have a stable `controlplaneAddress` endpoint (aka the IP/DNS of the - external load balancer) -- The cluster must use an external etcd. - -All the above conditions/settings could be set by passing a configuration file to `kubeadm init`. 
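As a hedged sketch of those two preconditions expressed in a `kubeadm init` configuration file: the stable `controlplaneAddress` endpoint and the external etcd cluster would be declared roughly as below. The exact field names depend on the kubeadm config version in use, so treat them as illustrative; the endpoint and etcd addresses are placeholders.

```yaml
# Sketch: a kubeadm init configuration satisfying the preconditions above.
# Field names are illustrative and depend on the config version in use.
apiVersion: kubeadm.k8s.io/v1alpha2
kind: MasterConfiguration
# Stable endpoint (e.g. an external load balancer) shared by all
# kube-apiserver instances -- the `controlplaneAddress` of this proposal.
api:
  controlPlaneEndpoint: "lb.example.com:6443"
# External etcd, so control-plane instances can be joined or removed
# without touching etcd membership.
etcd:
  endpoints:
    - https://etcd-0.example.com:2379
    - https://etcd-1.example.com:2379
  caFile: /etc/kubernetes/pki/etcd/ca.crt
  certFile: /etc/kubernetes/pki/etcd/client.crt
  keyFile: /etc/kubernetes/pki/etcd/client.key
```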
- -#### Preparing for execution of kubeadm join --control-plane - -Before invoking `kubeadm join --control-plane`, the user/higher level tools -should copy control plane certificates from an existing control plane instance, e.g. the bootstrap control plane - -> NB. kubeadm is limited to execute actions *only* -> in the machine where it is running, so it is not possible to copy automatically -> certificates from remote locations. - -Please note that strictly speaking only ca, front-proxy-ca certificate and service account key pair -are required to be equal among all control plane instances. Accordingly: - -- `kubeadm join --control-plane` will check for the mandatory certificates and fail fast if - they are missing -- given the required certificates exists, if some/all of the other certificates are provided - by the user as well, `kubeadm join --control-plane` will use them without further checks. -- If any other certificates are missing, `kubeadm join --control-plane` will create them. - -> see "Strategies for distributing cluster certificates" paragraph for -> additional info about this step. - -#### The kubeadm join --control-plane workflow - -The `kubeadm join --control-plane` workflow will be implemented as an extension of the -existing `kubeadm join` flow. - -`kubeadm join --control-plane` will accept an additional parameter, that is the apiserver advertise -address of the joining node; as detailed in following paragraphs, the value assigned to -this parameter depends on the user choice between a dynamic bootstrap workflow or a static -bootstrap workflow. - -The updated join workflow will be the following: - -1. Discovery cluster info [No changes to this step] - - > NB This step waits for a first instance of the kube-apiserver to become ready - > (the bootstrap control plane); And thus it acts as embedded mechanism for handling the sequence - > `kubeadm init` and `kubeadm join` actions in case of parallel node creation. - -2. Executes the kubelet TLS bootstrap process [No changes to this step]: - -3. In case of `join --control-plane` [New step] - - 1. Using the bootstrap token as identity, read the `kubeadm-config` configMap - in `kube-system` namespace. - - > This requires to grant access to the above configMap for - > `system:bootstrappers` group. - - 2. Check if the cluster/the node is ready for joining a new control plane instance: - - a. Check if the cluster has a stable `controlplaneAddress` - a. Check if the cluster uses an external etcd - a. Checks if the mandatory certificates exists on the file system - - 3. Prepare the node for hosting a control plane instance: - - a. Create missing certificates (in any). - > please note that by creating missing certificates kubeadm can adapt seamlessly - > to a dynamic workflow or to a static workflow (and to apiserver advertise address - > of the joining node). see following paragraphs for more details for additional info. - - a. In case of control plane deployed as static pods, create related kubeconfig files - and static pod manifests. - - > see "Strategies for deploying control plane components" paragraph - > for additional info about this step. - - 4. Create the admin.conf kubeconfig file - - > This operation creates an additional root certificate that enables management of the cluster - > from the joining node and allows a simple and clean UX for the final steps of this workflow - > (similar to the what happen for `kubeadm init`). 
- > However, it is important to notice that this certificate should be treated securely - > for avoiding to compromise the cluster. - - 5. Apply master taint and label to the node. - - 6. Update the `kubeadm-config` configMap with the information about the new control plane instance. - -#### dynamic workflow (advertise-address == `controlplaneAddress`) - -There are many ways to configure an highly available cluster. - -Among them, the approach best suited for a dynamic bootstrap workflow requires the -user to set the `--apiserver-advertise-address` of each kube-apiserver instance, including the in on the -bootstrap control plane, _equal to the `controlplaneAddress` endpoint_ provided during kubeadm init -(the IP/DNS of the external load balancer). - -By using the same advertise address for all the kube-apiserver instances, `kubeadm init` can create -a unique API server serving certificate that could be shared across many control plane instances; -no changes will be required to this certificate when adding/removing kube-apiserver instances. - -Please note that: - -- if the user is not planning to distribute the apiserver serving certificate among control plane instances, - kubeadm will generate a new apiserver serving certificate “almost equal” to the certificate - created on the bootstrap control plane (it differs only for the domain name of the joining node) - -#### Static workflow (advertise-address != `controlplaneAddress`) - -In case of a static bootstrap workflow the final layout of the control plane - the number, the -name and the IP of control plane nodes - is know in advance. - -Given such information, the user can choose a different approach where each kube-apiserver instance has a -specific apiserver advertise address different from the `controlplaneAddress`. - -Please note that: - -- if the user is not planning to distribute the apiserver certificate among control plane instances, kubeadm - will generate a new apiserver serving certificate with the SANS required for the joining control plane instance -- if the user is planning to distribute the apiserver certificate among control plane instances, the - operator is required to provide during `kubeadm init` the list of the list of IP - addresses for all the kube-apiserver instances as alternative names for the API servers certificate, thus - allowing the proper functioning of all the API server instances that will join - -#### Strategies for deploying control plane components - -As of today kubeadm supports two solutions for deploying control plane components: - -1. Control plane deployed as static pods (current kubeadm default) -2. Self-hosted control plane (currently alpha) - -The proposed solution for case 1. "Control plane deployed as static pods", assumes -that the `kubeadm join --control plane` flow will take care of creating required kubeconfig -files and required static pod manifests. - -As stated above, supporting for Self-hosted control plane is non goal for this -proposal. - -#### Strategies for distributing cluster certificates - -As of today kubeadm supports two solutions for storing cluster certificates: - -1. Cluster certificates stored on file system (current kubeadm default) -2. Cluster certificates stored in secrets (currently alpha) - -The proposed solution for case 1. "Cluster certificates stored on file system", -requires the user/the higher level tools to execute an additional action _before_ -invoking `kubeadm join --control plane`. 
- -More specifically, in case of cluster with "cluster certificates stored on file -system", before invoking `kubeadm join --control plane`, the user/higher level tools -should copy control plane certificates from an existing node, e.g. the bootstrap control plane - -> NB. kubeadm is limited to execute actions *only* -in the machine where it is running, so it is not possible to copy automatically -certificates from remote locations. - -Then, the `kubeadm join --control plane` flow will take care of checking certificates -existence and conformance. - -As stated above, supporting for Cluster certificates stored in secrets is a non goal -for this proposal. - -#### `kubeadm upgrade` for HA clusters - -The `kubeadm upgrade` workflow as of today is composed by two high level phases, upgrading the -control plane and upgrading nodes. - -The above hig-level workflow will remain the same also in case of clusters with more than -one control plane instances, but with a new sub-step to be executed on secondary control-plane instances: - -1. Upgrade the control plane - - 1. Run `kubeadm upgrade apply` on a first control plane instance [No changes to this step] - 1. Run `kubeadm upgrade node experimental-control-plane` on secondary control-plane instances [new step] - -1. Upgrade nodes/kubelet [No changes to this step] - -Further detail might be provided in a subsequent release of this KEP when all the detail -of the `v1beta1` release of kubeadm api will be available (including a proper modeling -of many control plane instances). - -## Graduation Criteria - -- To create a periodic E2E test that bootstraps an HA cluster with kubeadm - and exercise the static bootstrap workflow -- To create a periodic E2E test that bootstraps an HA cluster with kubeadm - and exercise the dynamic bootstrap workflow -- To ensure upgradability of HA clusters (possibly with another E2E test) -- To document the kubeadm support for HA in kubernetes.io - -## Implementation History - -- original HA proposals [#1](https://goo.gl/QNtj5T) and [#2](https://goo.gl/C8V8PV) -- merged [Kubeadm HA design doc](https://goo.gl/QpD5h8) -- HA prototype [demo](https://goo.gl/2WLUUc) and [notes](https://goo.gl/NmTahy) -- [PR #58261](https://github.com/kubernetes/kubernetes/pull/58261) with the showcase implementation of the first release of this KEP - -## Drawbacks - -The `kubeadm join --control-plane` workflow requires that some condition are satisfied at `kubeadm init` time, -that is to use a `controlplaneAddress` and use an external etcd. - -## Alternatives - -1) Execute `kubeadm init` on many nodes - -The approach based on execution of `kubeadm init` on each node candidate for hosting a control plane instance -was considered as well, but not chosen because it seems to have several drawbacks: - -- There is no real control on parameters passed to `kubeadm init` executed on different nodes, - and this might lead to unpredictable inconsistent configurations. -- The init sequence for above nodes won't go through the TLS bootstrap process, - and this might be perceived as a security concern. -- The init sequence executes a lot of steps which are un-necessary (on an existing cluster); now those steps are - mostly idempotent, so basically now no harm is done by executing them two or three times. Nevertheless, to - maintain this contract in future could be complex. 
- -Additionally, by having a separated `kubeadm join --control-plane` workflow instead of a single `kubeadm init` -workflow we can provide better support for: - -- Steps that should be done in a slightly different way on a secondary control plane instances with respect - to the bootstrap control plane (e.g. updating the kubeadm-config map adding info about the new control plane - instance instead of creating a new configMap from scratch). -- Checking that the cluster/the kubeadm-config is properly configured for many control plane instances -- Blocking users trying to create secondary control plane instances on clusters with configurations - we don't want to support as a SIG (e.g. HA with self-hosted control plane) +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-cluster-lifecycle/0023-documentation-for-images.md b/keps/sig-cluster-lifecycle/0023-documentation-for-images.md index c0ae3b8d..cfd1f5fa 100644 --- a/keps/sig-cluster-lifecycle/0023-documentation-for-images.md +++ b/keps/sig-cluster-lifecycle/0023-documentation-for-images.md @@ -1,148 +1,4 @@ - - -# Documentation for images - -Open https://www.websequencediagrams.com/ and paste the spec for the desired image: - -- [kubeadm init](#kubeadm-init) -- [kubeadm join (and join --control-plane)](#kubeadm-join-and-join---control-plane) -- [kubeadm reset](#kubeadm-reset) -- [kubeadm upgrade](#kubeadm-upgrade) -- [kubeadm upgrade node](#kubeadm-upgrade-node) - -## kubeadm init - -``` -title kubeadm init (interactions with the v1beta1 configuration) - -participant "user" as u -participant "kubeadm" as k -participant "kubelet" as kk -participant "node\n(api object)" as n -participant "kubeadm-config\nConfigMap" as cm -participant "kubeproxy-config\nConfigMap" as kpcm -participant "kubelet-config\nConfigMap-1.*" as kcm - -u->k:provide\nInitConfiguration (with NodeRegistrationOptions, ControlPlaneConfiguration)\nClusterConfiguration\nkube-proxy component configuration\nkubelet component configuration - -k->kk:write kubelet component configuration\nto /var/lib/kubelet/config.yaml -k->kk:write NodeRegistrationOptions\nto /var/lib/kubelet/kubeadm-flags.env -kk->n:start node -k->n:save NodeRegistrationOptions.CRISocket\nto kubeadm.alpha.kubernetes.io/cri-socket annotation - -k->k:use InitConfiguration\n(e.g. tokens) - -k->cm:save ClusterConfiguration -k->cm:add Current ControlPlaneConfiguration to ClusterConfiguration.Status - -k->kpcm:save kube-proxy component configuration -k->kcm:save kubelet component configuration -``` - -## kubeadm join (and join --control-plane) - -``` -title kubeadm join and join --control-plane (interactions with the v1beta1 configuration) - -participant "user" as u -participant "kubeadm" as k -participant "kubeadm-config\nConfigMap" as cm -participant "kubelet-config\nConfigMap-1.*" as kcm -participant "kubelet" as kk -participant "node\n(api object)" as n - -u->k:provide\nJoinConfiguration\n(with NodeRegistrationOptions) - -k->cm:read ClusterConfiguration -cm->k: -k->k:use ClusterConfiguration\n(e.g. 
ClusterName) - -k->kcm:read kubelet\ncomponent configuration -kcm->k: -k->kk:write kubelet component configuration\nto /var/lib/kubelet/config.yaml -k->kk:write NodeRegistrationOptions\nto /var/lib/kubelet/kubeadm-flags.env -kk->n:start node -k->n:save NodeRegistrationOptions.CRISocket\nto kubeadm.alpha.kubernetes.io/cri-socket annotation - -k->cm:add new ControlPlaneConfiguration\nto ClusterConfiguration.Status\n(only for join --control-plane) -``` - -## kubeadm reset - -``` -title kubeadm reset (interactions with the v1beta1 configuration) - -participant "user" as u -participant "kubeadm" as k -participant "kubeadm-config\nConfigMap" as cm -participant "node\n(api object)" as n - - -u->k: - -k->cm:read ClusterConfiguration -cm->k: -k->cm:remove ControlPlaneConfiguration\nfrom ClusterConfiguration.Status\n(only if the node hosts a control plane instance) - -k->n:read kubeadm.alpha.kubernetes.io/cri-socket annotation -n->k: -k->k:use CRIsocket\nto delete containers -``` - -## kubeadm upgrade - -``` -title kubeadm upgrade apply (interactions with the v1beta1 configuration) - -participant "user" as u -participant "kubeadm" as k -participant "kubeadm-config\nConfigMap" as cm -participant "kubeproxy-config\nConfigMap" as kpcm -participant "kubelet-config\nConfigMap-1.*+1" as kcm -participant "kubelet" as kk -participant "node\n(api object)" as n - -u->k: UpgradeConfiguration -note over u, n:Upgrade configuration should allow only well known changes to the cluster e.g. the change of custom images if used - - -k->cm:read ClusterConfiguration -cm->k: -k->k:update\nClusterConfiguration\nusing api machinery -k->cm:save updated ClusterConfiguration - -k->kpcm:read kube-proxy component configuration -kpcm->k: -k->k:update kube-proxy\ncomponent configuration\nusing api machinery -k->kpcm:save updated kube-proxy component configuration -note over kpcm, n:the updated kube-proxy component configuration will\nbe used by the updated kube-proxy DaemonSet - -k->kcm:read kubelet component configuration -kcm->k: -k->k:update kubelet\ncomponent configuration\nusing api machinery -k->kcm:save updated kubelet component configuration -k->kk:write kubelet component configuration\nto /var/lib/kubelet/config.yaml -k->kk:write NodeRegistrationOptions\nto /var/lib/kubelet/kubeadm-flags.env -kk->n:start node - -note over kcm, n:the updated kubelet component configuration\nwill be used by other nodes\nwhen running\nkubeadm upgrade nodes locally - -``` - -## kubeadm upgrade node - -``` -title kubeadm upgrade node (interactions with the v1beta1 configuration) - -participant "user" as u -participant "kubeadm" as k -participant "kubelet-config\nConfigMap-1.*" as kcm -participant "kubelet" as kk - -u->k: - -k->kcm:read kubelet\ncomponent configuration -kcm->k: -k->kk:write kubelet component configuration\nto /var/lib/kubelet/config.yaml -``` - +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-cluster-lifecycle/0023-kubeadm-config-v1beta1.md b/keps/sig-cluster-lifecycle/0023-kubeadm-config-v1beta1.md index e5b988eb..cfd1f5fa 100644 --- a/keps/sig-cluster-lifecycle/0023-kubeadm-config-v1beta1.md +++ b/keps/sig-cluster-lifecycle/0023-kubeadm-config-v1beta1.md @@ -1,244 +1,4 @@ ---- -kep-number: 23 -title: Kubeadm config file graduation to v1beta1 -authors: - - "@fabriziopandini" - - "@luxas" -owning-sig: sig-cluster-lifecycle -reviewers: - - "@chuckha" - - "@detiber" - - "@liztio" - - "@neolit123" -approvers: - - "@luxas" - - "@timothysc" -editor: - - "@fabriziopandini" -creation-date: 2018-08-01 -last-updated: -see-also: - - KEP 0008 ---- - -# kubeadm Config file graduation to v1beta1 - -## Table of Contents - -<!-- TOC --> - -- [kubeadm Config file graduation to v1beta1](#kubeadm-config-file-graduation-to-v1beta1) - - [Table of Contents](#table-of-contents) - - [Summary](#summary) - - [Motivation](#motivation) - - [Goals](#goals) - - [Non-Goals](#non-goals) - - [Proposal](#proposal) - - [Decoupling the kubeadm types from other ComponentConfig types](#decoupling-the-kubeadm-types-from-other-componentconfig-types) - - [Re-design how kubeadm configurations are persisted](#re-design-how-kubeadm-configurations-are-persisted) - - [Use substructures instead of the current "single flat object"](#use-substructures-instead-of-the-current-single-flat-object) - - [Risks and Mitigations](#risks-and-mitigations) - - [Graduation Criteria](#graduation-criteria) - - [Implementation History](#implementation-history) - - [Drawbacks](#drawbacks) - - [Alternatives](#alternatives) - -<!-- /TOC --> - -## Summary - -This KEP is meant to describe design goal and the proposed solution for implementing the kubeadm -config file `v1beta1` version. - -The kubeadm config file today is one of the first touch points with Kubernetes for many users and -also for higher level tools leveraging kubeadm; as a consequence, providing a more stable and -reliable config file format is considered one of the top priorities for graduating kubeadm itself -to GA and for the future of kubeadm itself. - -## Motivation - -The kubeadm config file is a set of YAML documents with versioned structs that follow the Kubernetes -API conventions with regards to `apiVersion` and `kind`, but these types aren’t exposed as an -API endpoint in the API server. kubeadm follows the ComponentConfig conventions. - -The kubeadm config file was originally created as alternative to command line flags for `kubeadm init` -and `kubeadm join` actions, but over time the number of options supported by the kubeadm config file -has grown continuously, while the number of command line flags is intentionally kept under control -and limited to the most common and simplest use cases. - -As a consequence today the kubeadm config file is the only viable way for implementing many use cases -like e.g. usage of an external etcd, customizing Kubernetes control plane components or kube-proxy -and kubelet parameters. - -Additionally, the kubeadm config file today acts also as a persistent representation of the cluster -specification that can be used at a any points in time after `kubeadm init` e.g. for executing -`kubeadm upgrade` actions. 
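As a concrete illustration of one such config-file-only use case (an external etcd), a cluster operator today passes something like the following file to `kubeadm init --config`. This is a sketch: the API version, kind, and field names below are indicative only and must be checked against the versioned kubeadm API actually in use.

```bash
# Sketch: driving kubeadm through its config file rather than CLI flags.
cat > kubeadm-config.yaml <<'EOF'
apiVersion: kubeadm.k8s.io/v1alpha2   # illustrative; use the version your kubeadm supports
kind: MasterConfiguration
api:
  controlPlaneEndpoint: "lb.example.com:6443"
etcd:
  external:
    endpoints:
      - "https://etcd-0.example.com:2379"
    caFile: /etc/kubernetes/pki/etcd/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
EOF

kubeadm init --config kubeadm-config.yaml
```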
- -The `v1beta1` version of kubeadm config file is a required, important consolidation step of the -current config file format, aimed at rationalize the considerable number of attributes added in the -past, provide a more robust and clean integration with the component config API, address the -weakness of the current design for representing multi master clusters, and ultimately lay down -a more sustainable foundation for the evolution of kubeadm itself. - -### Goals - -- To provide a solution for decoupling the kubeadm ComponentConfig types from other Kubernetes - components’ ComponentConfig types. - -- To re-design how kubeadm configurations are persisted, addressing the known limitations/weakness - of the current design; more in detail, with the aim to provide a better support for high - availability clusters, it should be provided a clear separation between cluster wide settings, - control plane instance settings, node/kubelet settings and runtime settings (setting used - by the current command but not persisted). - -- Improve the current kubeadm config file format by using specialized substructures instead - of the current "single flat object with only fields". - -### Non-Goals - -- To steer/coordinate all the implementation efforts for adoption of ComponentConfig across all - the different Kubernetes components. - -- To define a new home for the Bootstrap Token Go structs - -## Proposal - -### Decoupling the kubeadm types from other ComponentConfig types - -The `v1alpha2` kubeadm config types currently embeds the ComponentConfig for -kube-proxy and kubelet into the **MasterConfiguration** object; it is expected that also -ComponentConfig for kube-controller manager, kube-scheduler and kube-apiserver will be added -in future (non goal of this KEP). - -This strong type of dependency - embeds - already created some problem in the v1.10 cycle, and -despite some improvements, the current situation is not yet ideal, because e.g embedded dependency -could impact the kubeadm config file life cycle, forcing kubeadm to change its own config file -version every time one of the embedded component configurations changes. - -`v1beta1` config file version is going to address this problem by removing embedded dependencies -from the _external_ kubeadm config types. - -Instead, the user will be allowed to pass other component’s ComponentConfig in separated YAML -documents inside of the same YAML file given to `kubeadm init --config`. - -> please note that the _kubeadm internal config_ will continue to embed components config -> for the foreseeable future because kubeadm requires the knowledge of such data structures e.g. -> for propagating network configuration settings to kubelet, setting defaults, validating -> or manipulating YAML etc. - -### Re-design how kubeadm configurations are persisted - -Currently the kubeadm **MasterConfiguration** struct is persisted as a whole into the -`kubeadm-config` ConfigMap, but this situation has well know limitations/weaknesses: - -- There is no clear distinction between cluster wide settings (e.g. the kube-apiserver server - extra-args that should be consistent across all instances) and control plane instance settings - (e.g. the API server advertise address of a kube-apiserver instance). - NB. This is currently the main blocker for implementing support for high availability clusters - in kubeadm. - -- There is no clear distinction between cluster wide settings and node/kubelet specific - settings (e.g. 
the node name of the current node) - -- There is no clear distinction between cluster wide settings and runtime configurations - (e.g. the token that should be created by kubeadm init) - -- ComponentConfigs are stored both in the `kubeadm-config` and in the `kubeproxy-config` and - `kubelet-config-vX.Y` ConfigMaps, with the first used as authoritative source for updates, - while the others are the one effectively used by components. - -Considering all the above points, and also the split of the other components ComponentConfigs -from the kubeadm **MasterConfiguration** type described in the previous paragraph, it should -be re-designed how kubeadm configuration is persisted. - -The proposed solution leverage on the new kubeadm capability to handle separated YAML documents -inside of the same kubeadm-config YAML file. More in detail: - -- **MasterConfiguration** will be split into two other top-level kinds: **InitConfiguration** - and **ClusterConfiguration**. -- **InitConfiguration** will host the node-specific options like the node name, kubelet CLI flag - overrides locally, and ephemeral, init-only configuration like the Bootstrap Tokens to initialize - the cluster with. -- **ClusterConfiguration** will host the cluster-wide configuration, and **ClusterConfiguration** - is the object that will be stored in the `kubeadm-config` ConfigMap. -- Additionally, **NodeConfiguration** will be renamed to **JoinConfiguration** to be consistent with - **InitConfiguration** and highlight the coupling to the `kubeadm join` command and its - ephemeral nature. - -The new `kubeadm init` flow configuration-wise is summarized by the attached schema. - - - -[link](0023-kubeadm-init.png) - -As a consequence, also how the kubeadm configuration is consumed by kubeadm commands should -be adapted as described by following schemas: - -- [kubeadm join and kubeadm join --master](0023-kubeadm-join.png) -- [kubeadm upgrade apply](0023-kubeadm-upgrade-apply.png) -- [kubeadm upgrade node](0023-kubeadm-upgrade-node.png) -- [kubeadm reset](0023-kubeadm-reset.png) - -### Use substructures instead of the current "single flat object" - -Even if with few exceptions, the kubeadm **MasterConfiguration** and **NodeConfiguration** types -in `v1alpha1` and `v1alpha2` are basically single, flat objects that holds all the configuration -settings, and this fact e.g. doesn’t allow to a clearly/easily understand which configuration -options relate to each other or apply to the different control plane components. - -While redesigning the config file for addressing the main issues described in previous paragraphs, -kubeadm will provide also a cleaner representation of attributes belonging to single component/used -for a specific goal by creating dedicated objects, similarly to what’s already improved for -etcd configuration in the `v1alpha2` version. - -### Risks and Mitigations - -This is a change mostly driven by kubeadm maintainers, without an explicit buy-in from customers -using kubeadm in large installations - -The differences from the current config file are relevant and kubeadm users can get confused. - -Above risks will be mitigated by: - -- providing a fully automated conversion mechanism and a set of utilities under the kubeadm - config command (a goal and requirement for this KEP) -- The new structure could potentially make configuration options less discoverable as they’re - buried deeper in the code. Sufficient documentation for common and advanced tasks will help - mitigate this. 
-
-- writing a blog post before the release cut
-- providing adequate instructions in the release notes
-
-The impact on the code is considerable.
-
-This risk will be mitigated by implementing the change according to the following approach:
-
-- introducing a new `v1alpha3` config file version as an intermediate step before `v1beta1`
-- implementing all the new machinery, e.g. for managing multiple YAML documents in one file, early
-  in the cycle
-- ensuring full test coverage of the conversion from `v1alpha2` to `v1alpha3`, early in the cycle
-- postponing the final rename from `v1alpha3` to `v1beta1` until all the graduation criteria
-  are met, or if this is not the case, iterating on the above steps in following release cycles
-
-## Graduation Criteria
-
-The kubeadm API group primarily used in kubeadm is `v1beta1` or higher. There is an upgrade path
-from earlier versions. The primary kinds that can be serialized/deserialized are `InitConfiguration`,
-`JoinConfiguration` and `ClusterConfiguration`. ComponentConfig structs for other Kubernetes
-components are supplied besides `ClusterConfiguration` in different YAML documents.
-SIG Cluster Lifecycle is happy with the structure of the types.
-
-## Implementation History
-
-TBD
-
-## Drawbacks
-
-The differences from the current kubeadm config are significant, and kubeadm users can get confused.
-
-The impact on the current codebase is considerable and requires a high commitment from
-the SIG. This comes with a real opportunity cost.
-
-## Alternatives
-
-Graduate kubeadm to GA with the current kubeadm config and eventually change it afterwards
-(respecting GA contract rules).
+KEPs have moved to https://git.k8s.io/enhancements/.
+<!--
+This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first.
+-->
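For illustration, the separation proposed in the KEP above (kubeadm-specific kinds plus other components' ComponentConfigs as separate YAML documents in one file) could look roughly like the sketch below. The kinds and fields shown are indicative of the proposal's direction, not a finalized `v1beta1` schema.

```bash
# Sketch of the proposed multi-document kubeadm config file.
cat > kubeadm-init.yaml <<'EOF'
apiVersion: kubeadm.k8s.io/v1beta1
kind: InitConfiguration          # node-local and init-only settings
nodeRegistration:
  name: cp-0
bootstrapTokens:
  - token: "abcdef.0123456789abcdef"
---
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration       # cluster-wide settings, persisted in kubeadm-config
kubernetesVersion: v1.13.0
controlPlaneEndpoint: "lb.example.com:6443"
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration       # a separate ComponentConfig document
rotateCertificates: true
EOF

kubeadm init --config kubeadm-init.yaml
```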
\ No newline at end of file diff --git a/keps/sig-cluster-lifecycle/0029-20180918-kubeadm-phases-beta.md b/keps/sig-cluster-lifecycle/0029-20180918-kubeadm-phases-beta.md index 88707e6a..cfd1f5fa 100644 --- a/keps/sig-cluster-lifecycle/0029-20180918-kubeadm-phases-beta.md +++ b/keps/sig-cluster-lifecycle/0029-20180918-kubeadm-phases-beta.md @@ -1,178 +1,4 @@ ---- -kep-number: 29 -title: kubeadm phases to beta -authors: - - "@fabriziopandini" -owning-sig: sig-cluster-lifecycle -reviewers: - - "@chuckha" - - "@detiber" - - "@liztio" - - "@neolit123" -approvers: - - "@luxas" - - "@timothysc" -editor: - - "@fabriziopandini" -creation-date: 2018-03-16 -last-updated: 2018-09-10 -status: provisional -see-also: - - KEP 0008 ---- - -# kubeadm phases to beta - -## Table of Contents - -<!-- TOC --> - -- [kubeadm phases to beta](#kubeadm-phases-to-beta) - - [Table of Contents](#table-of-contents) - - [Summary](#summary) - - [Motivation](#motivation) - - [Goals](#goals) - - [Non-goals](#non-goals) - - [Proposal](#proposal) - - [User Stories](#user-stories) - - [Implementation Details](#implementation-details) - - [Graduation Criteria](#graduation-criteria) - - [Implementation History](#implementation-history) - - [Drawbacks](#drawbacks) - - [Alternatives](#alternatives) - -<!-- /TOC --> - -## Summary - -We are defining the road map for graduating `kubeadm alpha phase` commands to -beta, addressing concerns/lessons learned so far about the additional -effort for maintenance of this feature. - -## Motivation - -The `kubeadm phase` command was introduced in v1.8 cycle under the `kubeadm alpha phase` -command with the goal of providing users with an interface for invoking individually -any task/phase of the `kubeadm init` workflow. - -During this period, `kubeadm phase` proved to be a valuable and composable -API/toolbox that can be used by any IT automation tool or by an advanced user for -creating custom clusters. - -However, the existing separation of `kubeadm init` and `kubeadm phase` in the code base -required a continuous effort to keep the two things in sync, with a proliferation of flags, -duplication of code or in some case inconsistencies between the init and phase implementation. - -### Goals - -- To define the approach for graduating to beta the `kubeadm phase` user - interface. -- To significantly reduces the effort for maintaining `phases` up to date - with `kubeadm init`. -- To support extension of the "phases" concept to other kubeadm workflows. -- To enable re-use of phases across different workflows e.g. the cert phase - used by the `kubeadm init` and by the `kubeadm join` workflows. - -### Non-goals - -- This proposal doesn't include any changes of improvements to the actual `kubeadm init` - workflow. -- This proposal doesn't include implementation of workflows different than `kubeadm init`; - nevertheless, this proposal introduces a framework that will allow such implementation in future. - -## Proposal - -### User Stories - -- As a kubernetes administrator/IT automation tool, I want to run all the phases of - the `kubeadm init` workflow. -- As a kubernetes administrator/IT automation tool, I want to run only one or some phases - of the `kubeadm init` workflow. -- As a kubernetes administrator/IT automation tool, I want to run all the phases of - the `kubeadm init` workflow except some phases. - -### Implementation Details - -The core of the new phase design consist into a simple, lightweight workflow manager to be used -for implementing composable kubeadm workflows. 
- -Composable kubeadm workflows are build by an ordered sequence of phases; each phase can have it's -own, nested, ordered sequence of phases. For instance: - -```bash - preflight Run master pre-flight checks - certs Generates all PKI assets necessary to establish the control plane - /ca Generates a self-signed kubernetes CA - /apiserver Generates an API server serving certificate and key - ... - kubeconfig Generates all kubeconfig files necessary to establish the control plane - /admin Generates a kubeconfig file for the admin to use and for kubeadm itself - /kubelet Generates a kubeconfig file for the kubelet to use. - ... - ... -```` - -The above list of ordered phases should be made accessible from all the command supporting phases -via the command help, e.g. `kubeadm init --help` (and eventually in the future `kubeadm join --help` etc.) - -Additionally we are going to improve consistency between the command outputs/logs with the name of phases -defined in the above list. This will be achieved by enforcing that the prefix of each output/log should match -the name of the corresponding phase, e.g. `[certs/ca] Generated ca certificate and key.` instead of the current -`[certificates] Generated ca certificate and key.`. - -Single phases will be made accessible to the users via a new `phase` sub command that will be nested in the -command supporting phases, e.g. `kubeadm init phase` (and eventually in the future `kubeadm join phase` etc.). e.g. - -```bash -kubeadm init phases certs [flags] - -kubeadm init phases certs ca [flags] -``` - -Additionally we are going also to allow users to skip phases from the main workflow, via the `--skip-phases` flag. e.g. - -```bash -kubeadm init --skip-phases addons/proxy -``` - -The above UX will be supported by a new components, the `PhaseRunner` that will be responsible -of running phases according to the given order; nested phases will be executed -immediately after their parent phase. - -The `PhaseRunner` will be instantiated by kubeadm commands with the configuration of the specific list of ordered -phases; the `PhaseRunner` in turn will dynamically generate all the `phase` sub commands for the phases. - -Phases invoked by the `PhaseRunner` should be designed in order to ensure reuse across different -workflows e.g. reuse of phase `certs` in both `kubeadm init` and `kubeadm join` workflows. - -## Graduation Criteria - -* To create a periodic E2E test that bootstraps a cluster using phases -* To document the new user interface for phases in kubernetes.io - -## Implementation History - -* [#61631](https://github.com/kubernetes/kubernetes/pull/61631) First prototype implementation - (now outdated) - -## Drawbacks - -By merging phases into kubeadm workflows derives a reduced capability to customize -the user interface for each phase. More specifically: - -- It would not be possible to provide any kind of advice to the user about which - flags are relevant for one specific phase (the help will always show all the flags). -- It would not be possible to add long description and/or examples to each phase -- It would not be possible to provide additional flags specific for one phase - (the flags are shared between init and all the phases). -- It would not be possible to expose to the users phases which are not part of kubeadm workflows - ("extra" phases should be hosted on dedicated commands). - -This is considered an acceptable trade-off in light of the benefits of the suggested -approach. 
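For illustration, the composition this design enables (and that motivates accepting the trade-off above) might look like the following; the phase names come from the example listing earlier in this KEP, and the exact `phase`/`phases` subcommand spelling is an assumption, since it was still being settled.

```bash
# Run only selected phases up front...
kubeadm init phase certs ca
kubeadm init phase kubeconfig admin

# ...then run the full init workflow, skipping the work already done.
kubeadm init --skip-phases certs/ca,kubeconfig/admin
```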
-
-
-## Alternatives
-
-It is possible to graduate phases by simply moving the corresponding commands to the top level,
-but this approach provides fewer opportunities for reducing the effort
-of keeping phases up to date with the changes in kubeadm.
\ No newline at end of file +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-cluster-lifecycle/0031-20181022-etcdadm.md b/keps/sig-cluster-lifecycle/0031-20181022-etcdadm.md index 8ef0d9c5..cfd1f5fa 100644 --- a/keps/sig-cluster-lifecycle/0031-20181022-etcdadm.md +++ b/keps/sig-cluster-lifecycle/0031-20181022-etcdadm.md @@ -1,211 +1,4 @@ ---- -kep-number: 31 -title: etcdadm -authors: - - "@justinsb" -owning-sig: sig-cluster-lifecycle -#participating-sigs: -#- sig-apimachinery -reviewers: - - "@roberthbailey" - - "@timothysc" -approvers: - - "@roberthbailey" - - "@timothysc" -editor: TBD -creation-date: 2018-10-22 -last-updated: 2018-10-22 -status: provisional -#see-also: -# - KEP-1 -# - KEP-2 -#replaces: -# - KEP-3 -#superseded-by: -# - KEP-100 ---- - -# etcdadm - automation for etcd clusters - -## Table of Contents - -* [Table of Contents](#table-of-contents) -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) -* [Proposal](#proposal) - * [User Stories](#user-stories) - * [Manual Cluster Creation](#manual-cluster-creation) - * [Automatic Cluster Creation](#automatic-cluster-creation) - * [Automatic Cluster Creation with EBS volumes](#automatic-cluster-creation-with-ebs-volumes) - * [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints) - * [Risks and Mitigations](#risks-and-mitigations) -* [Graduation Criteria](#graduation-criteria) -* [Implementation History](#implementation-history) -* [Infrastructure Needed](#infrastructure-needed) - -## Summary - -etcdadm makes operation of etcd for the Kubernetes control plane easy, on clouds -and on bare-metal, including both single-node and HA configurations. - -It is able to perform cluster reconfigurations, upgrades / downgrades, and -backups / restores. - -## Motivation - -Today each installation tool must reimplement etcd operation, and this is -difficult. It also leads to ecosystem fragmentation - e.g. etcd backups from -one tool are not necessarily compatible with the backups from other tools. The -failure modes are subtle and rare, and thus the kubernetes project benefits from -having more collaboration. - - -### Goals - -The following key tasks are in scope: - -* Cluster creation -* Cluster teardown -* Cluster resizing / membership changes -* Cluster backups -* Disaster recovery or restore from backup -* Cluster upgrades -* Cluster downgrades -* PKI management - -We will implement this functionality both as a base layer of imperative (manual -CLI) operation, and a self-management layer which should enable automated -in "safe" scenarios (with fallback to manual operation). - -We'll also optionally support limited interaction with cloud infrastructure, for -example for mounting volumes and peer-discovery. This is primarily for the -self-management layer, but we'll expose it via etcdadm for consistency and for -power-users. The tasks are limited today to listing & mounting a persistent -volume, and listing instances to find peers. A full solution for management of -machines or networks (for example) is out of scope, though we might share some -example configurations for exposition. We expect kubernetes installation -tooling to configure the majority of the cloud infrastructure here, because both -the configurations and the configuration tooling varies widely. - -The big reason that volume mounting is in scope is that volume mounting acts as -a simple mutex on most clouds - it is a cheap way to boost the safety of our -leader/gossip algorithms, because we have an external source of truth. 
- -We'll also support reading & writing backups to S3 / GCS etc. - -### Non-Goals - -* The project is not targeted at operation of an etcd cluster for use other than - by Kubernetes apiserver. We are not building a general-purpose etcd operation - toolkit. Likely it will work well for other use-cases, but other tools may be - more suitable. -* As described above, we aren't building a full "turn up an etcd cluster on a - cloud solution"; we expect this to be a building block for use by kubernetes - installation tooling (e.g. cluster API solutions). - -## Proposal - -We will combine the [etcdadm](https://github.com/platform9/etcdadm) from -Platform9 with the [etcd-manager](https://github.com/kopeio/etcd-manager) -project from kopeio / @justinsb. - -etcdadm gives us easy to use CLI commands, which will form the base layer of -operation. Automation should ideally describe what it is doing in terms of -etcdadm commands, though we will also expose etcdadm as a go-library for easier -consumption, following the kubectl pattern of a `cmd/` layer calling into a -`pkg/` layer. This means the end-user can understand the operation of the -tooling, and advanced users can feel confident that they can use the CLI tooling -for advanced operations. - -etcd-manager provides automation of the common scenarios, particularly when -running on a cloud. It will be rebased to work in terms of etcdadm CLI -operations (which will likely require some functionality to be added to etcdadm -itself). Where automation is not known to be safe, etcd-manager can stop and -allow for manual intervention using the CLI. - -kops is currently using etcd-manager, and we aim to switch to the (new) etcadm asap. - -We expect other tooling (e.g. cluster-api implementations) to adopt this project -for etcd management going forwards, and do a first integration or two if it -hasn't happened already. - -### User Stories - -#### Manual Cluster Creation - -A cluster operator setting up a cluster manually will be able to do so using etcdadm and kubeadm. - -The basic flow looks like: - -* On a master machine, run `etcdadm init`, making note of the `etcdadm join - <endpoint>` command -* On each other master machine, copy the CA certificate and key from one of the - other masters, then run the `etcdadm join <endpoint>` command. -* Run kubeadm following the [external etcd procedure](https://kubernetes.io/docs/setup/independent/high-availability/#external-etcd) - -This results in an multi-node ("HA") etcd cluster. - -#### Automatic Cluster Creation - -etcd-manager works by coordinating via a shared filesystem-like store (e.g. S3 -or GCS) and/or via cloud APIs (e.g. EC2 or GCE). In doing so it is able to -automate the manual commands, which is very handy for running in a cloud -environment like AWS or GCE. - -The basic flow would look like: - -* The user writes a configuration file to GCS using `etcdadm seed - gs://mybucket/cluster1/etcd1 version=3.2.12 nodes=3` -* On each master machine, run `etcdadm auto gs://mybucket/cluster1/etcd1`. - (Likely the user will have to run that persistently, either as a systemd - service or a static pod.) - -`etcdadm auto` downloads the target configuration from GCS, discovers other -peers also running etcdadm, gossips with them to do basic leader election. When -sufficient nodes are available to form a quorum, it starts etcd. - -#### Automatic Cluster Creation with EBS volumes - -etcdadm can also automatically mount EBS volumes. 
The workflow looks like this: - -* As before, write a configuration file using `etcadm seed ...`, but this time - passing additional arguments "--volume-tag cluster=mycluster" -* Create EBS volumes with the matching tags -* On each master machine, run `etcdadm auto ...` as before. Now etcdadm will - try to mount a volume with the correct tags before acting as a member of the - cluster. - -### Implementation Details/Notes/Constraints - -* There will be some changes needed to both platform9/etcdadm (e.g. etcd2 - support) and kopeio/etcd-manager (to rebase on top of etcdadm). -* It is unlikely that e.g. GKE / EKS will use etcdadm (at least initially), - which limits the pool of contributors. - -### Risks and Mitigations - -* Automatic mode may make incorrect decisions and break a cluster. Mitigation: - automated backups, and a willingness to stop and wait for a fix / operator - intervention (CLI mode). -* Automatic mode relies on peer-to-peer discovery and gossiping, which is less - reliable than Raft. Mitigation: rely on Raft as much as possible, be very - conservative in automated operations (favor correctness over availability or - speed). etcd non-voting members will make this much more reliable. - -## Graduation Criteria - -etcdadm will be considered successful when it is used by the majority of OSS -cluster installations. - -## Implementation History - -* Much SIG discussion -* Initial proposal to SIG 2018-10-09 -* Initial KEP draft 2018-10-22 -* Added clarification of cloud interaction 2018-10-23 - -## Infrastructure Needed - -* etcdadm will be a subproject under sig-cluster-lifecycle +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
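For illustration, the manual cluster creation story above might translate into roughly these commands; host names, certificate paths, and the exact argument spellings are assumptions rather than a committed CLI surface.

```bash
# On the first etcd machine: bootstrap a single-member cluster and note the
# join command that etcdadm prints.
etcdadm init

# On each additional etcd machine: copy the CA certificate and key from the
# first member (path assumed), then join the existing cluster.
scp etcd-0:/etc/etcd/pki/ca.{crt,key} /etc/etcd/pki/
etcdadm join https://etcd-0.example.com:2379

# Finally, run kubeadm following the external etcd procedure, pointing it at
# the endpoints of the cluster that was just created.
kubeadm init --config kubeadm-config.yaml   # config lists the external etcd endpoints
```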
\ No newline at end of file diff --git a/keps/sig-cluster-lifecycle/0032-create-a-k8s-io-component-repo.md b/keps/sig-cluster-lifecycle/0032-create-a-k8s-io-component-repo.md index a605aa31..cfd1f5fa 100644 --- a/keps/sig-cluster-lifecycle/0032-create-a-k8s-io-component-repo.md +++ b/keps/sig-cluster-lifecycle/0032-create-a-k8s-io-component-repo.md @@ -1,245 +1,4 @@ ---- -kep-number: 32 -title: Create a `k8s.io/component` repo -status: implementable -authors: - - "@luxas" - - "@sttts" -owning-sig: sig-cluster-lifecycle -participating-sigs: - - sig-api-machinery - - sig-cloud-provider -reviewers: - - "@thockin" - - "@jbeda" - - "@bgrant0607" - - "@smarterclayton" - - "@liggitt" - - "@lavalamp" - - "@andrewsykim" - - "@cblecker" -approvers: - - "@thockin" - - "@jbeda" - - "@bgrant0607" - - "@smarterclayton" -editor: - name: "@luxas" -creation-date: 2018-11-27 -last-updated: 2018-11-27 ---- - -# Create a `k8s.io/component` repo - -**How we can consolidate the look and feel of core and non-core components with regards to ComponentConfiguration, flag handling, and common functionality with a new repository** - -## Table of Contents - -- [Create a `k8s.io/component` repo](#create-a--k8sio-component--repo) - - [Table of Contents](#table-of-contents) - - [Abstract](#abstract) - - [History and Motivation](#history-and-motivation) - - ["Component" definition](#-component--definition) - - [Goals](#goals) - - [Success metrics](#success-metrics) - - [Non-goals](#non-goals) - - [Related proposals / references](#related-proposals---references) - - [Proposal](#proposal) - - [Part 1: ComponentConfig](#part-1--componentconfig) - - [Standardized encoding/decoding](#standardized-encoding-decoding) - - [Testing helper methods](#testing-helper-methods) - - [Generate OpenAPI specifications](#generate-openapi-specifications) - - [Part 2: Command building / flag parsing](#part-2--command-building---flag-parsing) - - [Wrapper around cobra.Command](#wrapper-around-cobracommand) - - [Flag precedence over config file](#flag-precedence-over-config-file) - - [Standardized logging](#standardized-logging) - - [Part 3: HTTPS serving](#part-3--https-serving) - - [Common endpoints](#common-endpoints) - - [Standardized authentication / authorization](#standardized-authentication---authorization) - - [Part 4: Sample implementation in k8s.io/sample-component](#part-4--sample-implementation-in-k8sio-sample-component) - - [Code structure](#code-structure) - - [Timeframe and Implementation Order](#timeframe-and-implementation-order) - - [OWNERS file for new packages](#owners-file-for-new-packages) - -## Abstract - -The proposal is about preparing the Kubernetes core package structure in a way that all core component can share common code around - -- ComponentConfig implementation -- flag and command handling -- HTTPS serving -- delegated authn/z -- logging. - -Today this code is spread over the k8s.io/kubernetes repository, staging repository or pieces of code are in locations they don't belong to (example: k8s.io/apiserver/pkg/util/logs is the for general logging, totally independent of API servers). We miss a repository far enough in the dependency hierarchy for code that is or should be common among core Kubernetes component (neither k8s.io/apiserver, k8s.io/apimachinery or k8s.io/client-go are right for that). - -### History and Motivation - -By this time in the Kubernetes development, we know pretty well how we want a Kubernetes component to work, function, and look. 
But achieving this requires a fair amount of more or less advanced code. As we scale the ecosystem, and evolve Kubernetes to work more as a kernel, it's increasingly important to make writing extensions and custom Kubernetes-aware components relatively easy. As it stands today, this is anything but straightforward. In fact, even the in-core components diverge in terms of configurability (Can it be declaratively configured? Do flag names follow a consistent pattern? Are configuration sources consistently merged?), common functionality (Does it support the common "/version," "/healthz," "/configz," "/pprof," and "/metrics" endpoints? Does it utilize Kubernetes' authentication/authorization mechanisms? Does it write logs in a consistent manner? Does it handle signals as others do?), and testability (Do the internal configuration structs set up correctly to conform with the Kubernetes API machinery, and have roundtrip, defaulting, validation unit tests in place? Does it merge flags and the config file correctly? Is the logging mechanism set up in a testable manner? Can it be verified that the HTTP server has the standard endpoints registered and working? Can it be verified that authentication and authorization is set up correctly?). - -This document proposes to create a new Kubernetes staging repository with minimal dependencies (_k8s.io/apimachinery_, _k8s.io/client-go_, and _k8s.io/api_) and good documentation on how to write a Kubernetes-aware component that follows best practices. The code and best practices in this repo would be used by all the core components as well. Unifying the core components would be great progress in terms of the internal code structure, capabilities, and test coverage. Most significantly, this would lead to an adoption of ComponentConfig for all internal components as both a "side effect" and a desired outcome, which is long time overdue. - -The current inconsistency is a headache for many Kubernetes developers, and confusing for end users. Implementing this proposal will lead to better code quality, higher test coverage in these specific areas of the code, and better reusability possibilities as we grow the ecosystem (e.g. breaking out the _cloud provider_ code, building Cluster API controllers, etc.). This work consists of three major pillars, and we hope to complete at least the ComponentConfig part of it—if not (ideally) all three pieces of work—in v1.14. - -### "Component" definition - -In this case, when talking about a "component", I mean: "a CLI tool or a long-running server process that consumes configuration from a versioned configuration file (with apiVersion/kind) and optionally overriding flags". The component's implementation of ComponentConfig and command & flag setup is well unit-tested. The component is to some extent Kubernetes-aware. The component follows Kubernetes' conventions for config serialization and merging, logging, and common HTTPS endpoints (in the server case). _To begin with_, this proposal will **only focus on the core Kubernetes components** (kube-apiserver, kube-controller-manager, kube-scheduler, kubelet, kube-proxy, kubeadm), but as we go, this library will probably be generic enough to be usable by cloud provider and Cluster API controller extensions, as well as aggregated API servers. - -### Goals - -- Make it easy for a component to correctly adopt ComponentConfig. -- Avoid moving code into _k8s.io/apiserver_ which does not strictly belong to an etcd-based, API group-serving apiserver. 
Corollary: remove etcd dependency from components. -- Components should be consistent in how they load (and write) configuration, and merge config with CLI flags. -- Factor out command- and flag-building code to a shared place. -- Factor out common HTTPS endpoints describing a component's status. -- Make the core Kubernetes components utilize these new packages. -- Have good documentation about how to build a component with a similar look and feel as core Kubernetes components. -- Increase test coverage for the configuration, command building, and HTTPS server areas of the component code. -- Break out OpenAPI definitions and violations for the current ComponentConfigs from the core repo to a dedicated place per-component. With auto-generated OpenAPI specs for each component ComponentConfig consumers can validate/vet their configs without running the component. - -### Success metrics - -- All core Kubernetes components (kube-apiserver, kube-controller-manager, kube-scheduler, kubelet, kube-proxy, kubeadm) are using these shared packages in a consistent manner. -- Cloud providers can be moved out of core without having to depend on the core repository. - - Related issue: [https://github.com/kubernetes/kubernetes/issues/69585](https://github.com/kubernetes/kubernetes/issues/69585) -- It's easier for _kubeadm_ to move out of the core repo when these component-related packages are in a "public" staging repository. - -### Non-goals - -- Graduate any ComponentConfig API versions (in this proposal). -- Make this library toolkit a "generic" cloud-native component builder. Such a toolbox, if ever created, could instead consume these packages. In other words, this repository is solely focused on Kubernetes' components' needs. -- Fixing _all the problems_ our components have right now, and expanding this beyond what's really necessary. Instead working incrementally, and starting to break out some basic stuff we _know_ every component must handle (e.g. configuration and flag parsing) - -### Related proposals / references - -- [Kubernetes Component Configuration](https://docs.google.com/document/d/1arP4T9Qkp2SovlJZ_y790sBeiWXDO6SG10pZ_UUU-Lc/edit) by [@mikedanese](https://github.com/mikedanese) -- [Versioned Component Configuration Files](https://docs.google.com/document/d/1FdaEJUEh091qf5B98HM6_8MS764iXrxxigNIdwHYW9c/edit#) by [@mtaufen](https://github.com/mtaufen) -- [Moving ComponentConfig API types to staging repos](https://github.com/kubernetes/community/blob/master/keps/sig-cluster-lifecycle/0014-20180707-componentconfig-api-types-to-staging.md) by [@luxas](https://github.com/luxas) and [@sttts](https://github.com/sttts) - -## Proposal - -This proposal contains three logical units of work. Each subsection is explained in more detail below. - -### Part 1: ComponentConfig - -#### Standardized encoding/decoding - -- Encoding/decoding helper methods in `k8s.io/component/config/serializer` that would be referenced in every scheme package -- Warn or (if desired, error) on unknown fields - - This has the benefit that it makes it possible for the user to spot e.g. config typos. More high-level, this can be used for e.g. a `--validate-config` flag -- Support both JSON and YAML -- Support multiple YAML documents if needed - -#### Testing helper methods - -- Conversion / roundtrip testing -- API group testing - - External types must have JSON tags - - Internal types must not have any JSON tags - - ... 
-- Defaulting testing - -#### Generate OpenAPI specifications - -Provide a common way to generate OpenAPI specifications local to the component, so that external consumers can access it, and the component can expose it via e.g. a CLI flag or HTTPS endpoint. - -### Part 2: Command building / flag parsing - -#### Wrapper around cobra.Command - -See the `cmd/kubelet` code for how much extra setup a Kubernetes component needs to do for building commands and flag sets. This code can be refactored into a generic wrapper around _cobra_ for use with Kubernetes. - -#### Flag precedence over config file - -If the component supports both ComponentConfiguration and flags, flags should override fields set in the ComponentConfiguration. This is not straightforward to implement in code, and only the kubelet does this at the moment. Refactoring this code in a generic helper library in this new repository will make adoption of the feature easy and testable. The details of flag versus ComponentConfig semantics are to be decided later in a different proposal. Meanwhile, this flag precedence feature will be opt-in, so the kubelet and kubeadm can directly adopt this code, until the details have been decided on for all components. - -#### Standardized logging - -Use the _k8s.io/klog_ package in a standardized way. - -### Part 3: HTTPS serving - -Many Kubernetes controllers are clients to the API server and run as daemons. In order to expose information on how the component is doing (e.g. profiling, metrics, current configuration, etc.), an HTTPS server is run. - -#### Common endpoints - -In order to make it easy to expose this kind of information, a package is made in this new repo that hosts this common code. Initially targeted endpoints are "/version," "/healthz," "/configz," "/pprof," and "/metrics." - -#### Standardized authentication / authorization - -In order to not expose this kind of information (e.g. metrics) to anyone that can talk to the component, it may utilize SubjectAccessReview requests to the API server, and hence delegate authentication and authorization to the API server. It should be easy to add this functionality to your component. - -### Part 4: Sample implementation in k8s.io/sample-component - -Provides an example usage of the three main functions of the _k8s.io/component_ repo, implementing ComponentConfig, the CLI wrapper tooling and the common HTTPS endpoints with delegated auth. - -### Code structure - -- k8s.io/component - - config/ - - Would hold internal, shared ComponentConfig types across core components - - {v1,v1beta1,v1alpha1} - - Would hold external, shared ComponentConfig types across core components - - serializer/ - - Would hold common methods for encoding/decoding ComponentConfig - - testing/ - - Would hold common testing code for use in unit tests local to the implementation of ComponentConfig. 
- - cli/ - - Would hold common methods and types for building a k8s component command (building on top of github.com/spf13/{pflag,cobra}) - - options/ - - Would hold flag definitions - - testing/ - - Would hold common testing code for use in unit tests local to the implementation of the code - - logging/ - - Would hold common code for using _k8s.io/klog_ - - server/ - - auth/ - - Would hold code for implementing delegated authentication and authorization to Kubernetes - - configz/ - - Would hold code for implementing a `/configz` endpoint in the component - - healthz/ - - Would hold code for implementing a `/healthz` endpoint in the component - - metrics/ - - Would hold code for implementing a `/metrics` endpoint in the component - - pprof/ - - Would hold code for implementing a `/pprof` endpoint in the component - - version/ - - Would hold code for implementing a `/version` endpoint in the component - -### Timeframe and Implementation Order - -**Objective:** The ComponentConfig part done for v1.14 - -**Stretch goal:** Get the CLI and HTTPS server parts done for v1.14. - -**Implementation order:** - -1. Create the k8s.io/component repo with the initial ComponentConfig shared code -2. Move shared v1alpha1 ComponentConfig types and references from `k8s.io/api{server,machinery}/pkg/apis/config` to `k8s.io/component/config` -3. Set up good unit testing for all core ComponentConfig usage, by writing the `k8s.io/component/config/testing` package -4. Move server-related util packages from `k8s.io/kubernetes/pkg/util/` to `k8s.io/component/server`. e.g. delegated authn/authz "/configz", "/healthz", and "/metrics" packages are suitable -5. Move common flag parsing / cobra.Command setup code to `k8s.io/component/cli` from (mainly) the kubelet codebase. -6. Start using the command- and server-related code in all core components. - -In parallel to all the steps above, a _k8s.io/sample-component_ repo is built up with an example and documentation how to consume the _k8s.io/component_ code - -### OWNERS file for new packages - -- Approvers for the config/{v1,v1beta1,v1alpha1} packages - - @kubernetes/api-approvers -- Approvers for staging/src/k8s.io/{sample-,}component - - @sttts - - @luxas - - @jbeda - - @lavalamp -- Approvers for subpackages: - - those who owned packages before code move -- Reviewers for staging/src/k8s.io/{sample-,}component: - - @sttts - - @luxas - - @dixudx - - @rosti - - @stewart-yu - - @dims -- Reviewers for subpackages: - - those who owned packages before code move +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
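For illustration, a consumer of the common HTTPS serving packages sketched in Part 3 would probe a component roughly like this; the host, port, and credential source are placeholders, not values defined by the proposal above.

```bash
# Probe the standard endpoints of a component built on the shared server packages.
TOKEN="$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"   # placeholder credential
for path in version healthz configz metrics; do
  echo "== /${path}"
  curl -sk -H "Authorization: Bearer ${TOKEN}" \
    "https://<component-host>:<secure-port>/${path}" | head -n 3
done
```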
\ No newline at end of file diff --git a/keps/sig-cluster-lifecycle/README.md b/keps/sig-cluster-lifecycle/README.md index 75b7a9d0..cfd1f5fa 100644 --- a/keps/sig-cluster-lifecycle/README.md +++ b/keps/sig-cluster-lifecycle/README.md @@ -1,3 +1,4 @@ -# SIG Cluster Lifecycle KEPs - -This directory contains KEPs related to [SIG Cluster Lifecycle](../../sig-cluster-lifecycle).
\ No newline at end of file +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-contributor-experience/0005-contributor-site.md b/keps/sig-contributor-experience/0005-contributor-site.md index 803d648a..cfd1f5fa 100644 --- a/keps/sig-contributor-experience/0005-contributor-site.md +++ b/keps/sig-contributor-experience/0005-contributor-site.md @@ -1,153 +1,4 @@ ---- -kep-number: 5 -title: Contributor Site -authors: - - "@jbeda" -owning-sig: sig-contributor-experience -participating-sigs: - - sig-architecture - - sig-docs -reviewers: - - "@castrojo" -approvers: - - "@parispittman" -editor: TBD -creation-date: "2018-02-19" -last-updated: "2018-03-07" -status: implementable ---- - -# Contributor Site - -## Table of Contents - -* [Table of Contents](#table-of-contents) -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) -* [Proposal](#proposal) - * [Risks and Mitigations](#risks-and-mitigations) -* [Graduation Criteria](#graduation-criteria) -* [Implementation History](#implementation-history) -* [Drawbacks](#drawbacks) -* [Alternatives](#alternatives) - -## Summary - -We need a way to organize and publish information targeted at contributors. -In order to continue to scale the Kubernetes contributor community we need a convenient, scalable and findable way to publish information. - -## Motivation - -While the current kubernetes.io site is great for end users, it isn't often used by or aimed at project contributors. -Instead, most contributors look at documentation in markdown files that are spread throughout a wide set of repos and orgs. -It is difficult for users to find this documentation. - -Furthermore, this documentation is often duplicated and out of date. -The fact that it isn't collected in one place and presented as a whole leads to fragmentation. -Often times documentation will be duplicated because the authors themselves can't find the relevant docs. - -This site will also serve as a starting point for those that are looking to contribute. -This site (and the contributor guide) can provide a soft introduction to the main processes and groups. - - -Finally, some simple domain specific indexing could go a long way to make it easier to discover and cross link information. -Specifically, building a site that can take advantage of the KEP metadata will both make KEPs more discoverable and encourage those in the community to publish information in a way that *can* be discovered. - -### Goals - -* A contributor community facing portal to collect information for those actively working on upstream Kubernetes. -* An easy to remember URL. (`contrib.kubernetes.io`? `contributors.kubernetes.io`? `c.kubernetes.io`?) -* A streamlined process to update and share this information. - Ownership should be delegated using the existing OWNERS mechanisms. -* A site that will be indexed well on Google to collect markdown files from the smattering of repos that we currently have. - This includes information that is currently in the [community repo](https://github.com/kubernetes/community). -* Provide a place to launch and quickly evolve the contributor handbook. -* Build some simple tools to enhance discoverability within the site. - This could include features such as automatically linking KEP and SIG names. -* Over time, add an index of events, meetups, and other forums for those that are actively contributing to k8s. - -### Non-Goals - -* Actively migrate information from multiple orgs/repos. 
- This should be a place that people in the contributor community choose to use to communicate vs. being forced. -* Create a super dynamic back end. This is most likely served best with a static site. -* Other extended community functions like a job board or a list of vendors. - -## Proposal - -We will build a new static site out of the [community repo](https://github.com/kubernetes/community). - -This site will be focused on communicating with and being a place to publish information for those that are looking to contribute to Kubernetes. - -We will use Hugo and netlify to build and host the site, respectively. (Details TBD) - -The main parts of the site that will be built out first: -* A main landing page describing the purpose of the site. -* A guide on how to contribute/update the site. -* A list and index of KEPs -* A place to start publishing and building the contributor guide. - -### Risks and Mitigations - -The main risk here is abandonment and rot. -If the automation for updating the site breaks then someone will have to fix it. -If the people or the skillset doesn't exist to do so then the site will get out of sync with the source and create more confusion. - -To mitigate this we will (a) ensure that SIG-contributor-experience is signed up to own this site moving forward and (b) keep it simple. -By relying on off the shelf tooling with many users (Hugo and Netlify) we can ensure that there are fewer custom processes and code to break. -The current generation scripts in the community repo haven't proven to be too much for us to handle. - -## Graduation Criteria - -This effort will have succeeded if: - -* The contributor site becomes the de-facto way to publish information for the community. -* People consistently refer to the contributor site when answering questions about "how do I do X" or "what is the status of X". -* The amount of confusion over where to find information is reduced. -* Others in the contributor community actively look to expand the information on the contributor site and move information from islands to this site. - -## Implementation History - -## Drawbacks - -The biggest drawback is that this is yet another thing to keep running. -Currently the markdown files are workable but not super discoverable. -However, they utilize the familiar mechanisms and do not require extra effort or understanding to publish. - -The current mechanisms also scale across orgs and repos. -This is a strength as the information is close to the code but also a big disadvantage as it ends up being much less discoverable. - -## Alternatives - -One alternative is to do nothing. -However, the smattering of markdown through many repos is not scaling and is not discoverable via Google or for other members of the contributor community. - -The main alternative here is to build something that is integrated into the user facing kubernetes.io site. -This is not preferred for a variety of reasons. - -* **Workflow.** Currently there is quite a bit of process for getting things merged into the main site. - That process involves approval from someone on SIG-Docs from an editorial point of view along with approval for technical accuracy. - The two stage approval slows down contributions and creates a much larger barrier than the current markdown based flow. - In addition, SIG-Docs is already stretched thin dealing with the (more important) user facing content that is their main charter. -* **Quality standards.** The bar for the user facing site is higher than that of the contributor site. 
- Speed and openness of communication dominates for the contributor facing site. - Our bar here is the current pile of Markdown. -* **Different tooling.** We may want to create specialized preprocessors as part of the contributor site build process. - This could include integrating our current expansion of sigs.yaml into Markdown files. - It may also include recognizing specific patterns (KEP-N) and creating automatic linkages. - Applying these to a part of a site or validating them across a larger site will slow creation of these tools. - -An alternative to building directly into the website repo is to build in some other repo and do some sort of import into the main website repo. -There are serious downsides to this approach. - -* **No pre-commit visualization.** Netlifies capability to show a preview per PR won't work with a custom cross repo workflow. -* **Higher latency for changes.** If the merges are batched and manually approved then there could be a significant time gap between when something is changed and when it is published. - This is a significant change from the current "pile of markdown in github" process. -* **Opportunity for more complex build breaks.** If something is checked into a satellite repo it may pass all of the presubmit tests there but then fail presubmits on the parent repo. - This creates a situation where manual intervention is required. - Complicated pre-submit tests could be built for the satellite repo but those need to be maintained and debugged themselves. -* **New tooling.** New tooling would need to be built that doesn't directly benefit the target audience. - This tooling will have to be documented and supported vs. using an off the shelf service like netlify. - +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-contributor-experience/0007-20180403-community-forum.md b/keps/sig-contributor-experience/0007-20180403-community-forum.md index 04ceb38b..cfd1f5fa 100644 --- a/keps/sig-contributor-experience/0007-20180403-community-forum.md +++ b/keps/sig-contributor-experience/0007-20180403-community-forum.md @@ -1,198 +1,4 @@ ---- -kep-number: 0007 -title: A community forum for Kubernetes -authors: - - "@castrojo" -owning-sig: sig-contributor-experience -participating-sigs: -reviewers: - - "@jberkus" - - "@joebeda" - - "@cblecker" -approvers: - - "@parispittman" - - "@grodrigues3" - -editor: TBD -creation-date: 2018-04-03 -last-updated: 2018-04-17 -status: implemented ---- - -# A community forum for Kubernetes - -## Table of Contents - -* [Table of Contents](#table-of-contents) -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) -* [Proposal](#proposal) - * [User Stories [optional]](#user-stories) - * [Story 1](#story-1) - * [Story 2](#story-2) - * [Implementation Details/Notes/Constraints](#implementation-details) - * [Risks and Mitigations](#risks-and-mitigations) -* [Graduation Criteria](#graduation-criteria) -* [Implementation History](#implementation-history) -* [Drawbacks](#drawbacks) - - -## Summary - -Kubernetes is large enough that we should take a more active role in growing our community. We need a place to call our own that can encompass users, contributors, meetups, and other groups in the community. Is there a need for something between email and real time chat that can fulfill this? The primary purpose of this KEP is to determine whether we can provide a better community forum experience and perhaps improve our mailing list workflow. - -The site would be forum.k8s.io, and would be linked to from the homepage and major properties. [See KEP005](https://github.com/kubernetes/community/blob/master/keps/sig-contributor-experience/0005-contributor-site.md) for related information on a contributor website. - -## Motivation - -- We're losing too much information in the Slack ether, and most of it does not show up in search engines. -- Mailing lists remain mostly the domain of existing developers and require subscribing, whereas an open forum allows people to drive by and participate with minimal effort. - - There's an entire universe of users and developers that we could be reaching that didn't grow up on mailing lists and emacs. :D - - Specifically, hosting our lists on google groups has some issues: - - Automated filtering traps Zoom invites for SIG/WG leads - - Cannot use non-google accounts as first class citizens (Google Account required to create/manage group, join a group) - - Hard to search across multiple lists - - There's no way to see all the kubernetes lists in one view, we have to keep them indexed in sigs.yaml - - Filtering issues with the webui with countries that block Google - - Non-kubernetes branding -- As part of a generic community portal, this gives people a place to go where they can discuss Kubernetes, and a sounding board for developers to make announcements of things happening around Kubernetes that can reach a wider audience. -- We would be in charge of our own destiny, (aka. The Slack XMPP/IRC gateway removal-style concerns can be partially addressed) - - Software is 100% Open Source - - We'd have full access to our data, including the ability to export all of it. 
- - Kubernetes branded experience that would look professional and match the website and user docs look and feel, providing us a more consistent look across k8s properties. - -### Goals - -- Set up a prototype at discuss.k8s.io/discuss.kubernetes.io - - Determine if the mailing list feature is robust enough to replace our google groups - - References: [Mailing list roadmap](https://meta.discourse.org/t/moss-roadmap-mailing-lists/36432), [Discourse and email lists](https://meta.discourse.org/t/discourse-and-email-lists-like-google-groups/39915) -- Heavy user engagement within 6 months. - - Clear usage growth in metrics - - Number of active users - - Number of threads - - Amount of traffic - - SIG Contributor Experience can analyze analytics regularly to determine growth and health. - - A feedback subforum would enable us to move quickly in addressing the needs of the community for the site. - -### Non-Goals - -- This is not a proposal to replace Slack, this is a proposal for a community forum. - - The motivation of having searchable information that is owned by the Kubernetes community comes from voiced concerns about having so much of Kubernetes depend on Slack. - - You are encouraged to propose a KEP for real-time communication if you would like to champion that, this KEP is not about Slack. -- This does not replace Stack Overflow or kubernetes-users for user support. - - However inevitably users who prefer forums will undoubtedly use it for support. - - Strictly policing "this should be posted here, that should be posted there" won't work. - - I believe our community is large enough where we can have a support section, and there are enough people to make that self sustaining, we can also encourage cross posting from StackOverflow to integrate things better as both sites have good integration points. - - Over time as community interaction and knowledge base builds people will end up with a better experience than in #kubernetes-users on slack and will naturally gravitate there. - - Other large OSS communities have a presence on both StackOverflow and do support on their user forums already and it doesn't appear to be a big issue. -- This will not replace kubernetes-devel or SIG mailing lists. - - They work, but we could experiment with mailing list integration. (See below) - - Let's concentrate on an end-user experience for now, and allow SIGs and working groups who want a more user-facing experience to opt-in if they wish. - -## Proposal - -### User Stories - -- A place for open ended discussion. For example "What CI/CD tools is everyone using?" - - This would be closed as offtopic on StackOverflow, but would be perfect for a forum. -- Post announcements about kubernetes core that are important for end users - - An announcements subforum can be connected to Slack so that we have a single place for us to post announcements that get's propagated to other services. -- Post announcements about related kubernetes projects - - Give the ecosystem of tools around k8s a place to go and build communities around all the tools people are building. - - "Jill's neat K8s project on github" is too small to have it's own official k8s presence, but it could be a post on a forum. 
-- Events section for meetups and KubeCon/CloudNativeCon -- Sub boards for meetup groups -- Sub boards for non-english speaking community members -- Developer section can include: - - Updated posting of community meeting notes - - We can inline the youtube videos as well - - Steering Committee announcements - - Link to important SIG announcements - - Any related user-facing announcements for our mentorship programs -- Job board - - This might be difficult to do properly and keep classy, but leaving it here as a discussion point. -- reddit.com/r/kubernetes has some great examples - - "What are you working on this week?" to spur activity - - "So, what's the point of containerization?" would be hard to scope on StackOverflow, but watercooly enough for a forum. - - [Top discussions](https://www.reddit.com/r/kubernetes/top/?t=month) over the last month. - -### Implementation Details - -- Software - - Discourse.org - https://www.discourse.org/features - - Free Software - - Vibrant community with lots of integrations with services we already use, like Slack and Github - - Rich API would allow us to build fun participatory integrations (User flair, contributor badges, etc.) - - Facilities for running polls and other plugins - - SSO with common login methods our community is already using. - - Moderation system is user-based on trust, so we would only need to choose 5 people as admins and then as people participate they build trust and get more admin responsibilities. - - Other developer and OSS communities such as Docker, Mozilla, Ubuntu, Twitter, Rust, and Atom are already effectively, software is mature and commonly used. - - Friendly upstream with known track record of working with other OSS projects. -- Hosting - - We should host with a Discourse SaaS paid plan so we can concentrate on building the community and k8s itself. - - [Pricing information](https://payments.discourse.org/pricing) - - If other CNCF projects are interested in this we could help document best practices and do bulk pricing. - - Schedule regular dumps of our data to cloud provider buckets - - Exporting/Self Hosting is always available as an option - - Google Analytics integration would allow us to see what users are interested in and give us better insight on what is interesting to them. - - Data explorer so we can run ad hoc SQL queries and reports on the live data. - - Mailing list import would allow us to immediately have a searchable resource of our past activity. - -### Risks and Mitigations - -- One more thing to check everyday(tm) - - User fatigue with mailing lists, discourse, slack, stackoverflow, youtube channel, KubeCon/CloudNativeCon, your local meetup, etc. - - This is why I am proposing we investigate if we can replace the lists as well, two birds with one stone. -- Lack of developer participation - - The mailing lists work, how suitable is Discourse to replace a mailing list these days? CNCF has tried Discourse in the past. See [@cra's post](https://twitter.com/cra/status/981548716405547008) - - [Discussion on the pros and cons of each](https://meta.discourse.org/t/discourse-vs-email-mailing-lists/54298) - - We have enough churn and new Working Groups that we could pilot a few, opt-in for SIGs that want to try it? -- A community forum is asynchronous, whereas chat is realtime. - - This doesn't solve our Slack lock-in concerns, but can be a good first step in being more active in running our own community properties so that we can build out own resources. 
- - Ghost has [totally migrated to Discourse](https://twitter.com/johnonolan/status/980872508395188224?s=12) and shut down its Slack. - - We should keep an eye on this and see what data we can glean from this. Engage with Ghost community folks to see what lessons they've learned. - - Not sure if getting rid of real-time chat entirely is a good idea either. -- [GDPR Compliance](https://www.eugdpr.org/) - - Lots of data retention options in Discourse. - - We'd need to engage with upstream on their plans for this; we would want to avoid having to manage this ourselves. - -#### References from other projects - -- [Chef RFC](https://github.com/chef/chef-rfc/blob/master/rfc028-mailing-list-migration.md) - - [Blog post](https://coderanger.net/chef-mailing-list/) from a community member - good mailing list and community feedback here. -- [Swift's Plan](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20170206/031657.html) - Long discussion, worth reading. -- [HTM Forum](https://discourse.numenta.org/t/guidelines-for-using-discourse-via-email/314) -- [Julia](https://discourse.julialang.org/t/discourse-as-a-mailing-list/57) - It might be useful for us to investigate pregenerating the mail addresses. -- [How's Discourse working out for Ghost](https://forum.ghost.org/t/hows-discourse-working-out-for-ghost/947) - We asked them for some direct feedback on their progress so far. - - -## Graduation Criteria - -There will be a feedback subforum where users can directly give us feedback on what they'd like to see. Metrics and site usage should determine if this will be viable in the long term. - -After a _three-month_ prototyping period, SIG Contributor Experience will: - -- Determine if this is a better solution than what we have, and figure out where this would fit in the ecosystem. - - There is a strong desire that this would replace an existing support venue; SIG Contributor Experience will weigh the options. -- If this solution is not better than what we have and we don't want to support yet another tool, we would shut the project down. -- If we don't have enough information to draw a conclusion, we may decide to extend the evaluation period. -- The site should have moderation and administrative policies written down. - - -## Implementation History - -Major milestones in the life cycle of a KEP should be tracked in `Implementation History`. -Major milestones might include - -- the `Summary` and `Motivation` sections being merged signaling SIG acceptance -- the `Proposal` section being merged signaling agreement on a proposed design -- the date implementation started -- the first Kubernetes release where an initial version of the KEP was available -- the version of Kubernetes where the KEP graduated to general availability -- when the KEP was retired or superseded - -## Drawbacks - -- Kubernetes has seen explosive growth without having a forum at all. +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
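The announcement propagation discussed in the Implementation Details above would sit on top of the Discourse REST API. The sketch below is only an assumption-laden illustration, not part of the original proposal: the forum URL, category ID, and credentials are placeholders, and the exact request fields for `/posts.json` should be checked against the Discourse API documentation before relying on them.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// announcement carries the minimal fields assumed for creating a new topic
// via Discourse's /posts.json endpoint.
type announcement struct {
	Title    string `json:"title"`
	Raw      string `json:"raw"`
	Category int    `json:"category"`
}

func postAnnouncement(baseURL, apiKey, apiUser string, a announcement) error {
	body, err := json.Marshal(a)
	if err != nil {
		return err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/posts.json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Api-Key", apiKey) // admin-scoped API key (assumed header names)
	req.Header.Set("Api-Username", apiUser)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("discourse returned %s", resp.Status)
	}
	return nil
}

func main() {
	// Placeholder values for illustration only.
	err := postAnnouncement("https://discuss.kubernetes.io", "API_KEY", "announce-bot", announcement{
		Title:    "Kubernetes 1.x released",
		Raw:      "Release notes: https://kubernetes.io/...",
		Category: 1,
	})
	if err != nil {
		fmt.Println(err)
	}
}
```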
\ No newline at end of file diff --git a/keps/sig-instrumentation/0031-kubernetes-metrics-overhaul.md b/keps/sig-instrumentation/0031-kubernetes-metrics-overhaul.md index b90ee6d6..cfd1f5fa 100644 --- a/keps/sig-instrumentation/0031-kubernetes-metrics-overhaul.md +++ b/keps/sig-instrumentation/0031-kubernetes-metrics-overhaul.md @@ -1,113 +1,4 @@ ---- -kep-number: 0031 -title: Kubernetes Metrics Overhaul -authors: - - "@brancz" -owning-sig: sig-instrumentation -participating-sigs: - - sig-aaa - - sig-bbb -reviewers: - - "@piosz" - - "@DirectXMan12" -approvers: - - "@piosz" - - "@DirectXMan12" -editor: @DirectXMan12 -creation-date: 2018-11-06 -last-updated: 2018-11-06 -status: provisional ---- - -# Kubernetes Metrics Overhaul - -## Table of Contents - -* [Table of Contents](#table-of-contents) -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) -* [Proposal](#proposal) - * [cAdvisor instrumentation changes](#cadvisor-instrumentation-changes) - * [Consistent labeling](#consistent-labeling) - * [Changing API latency histogram buckets](#changing-api-latency-histogram-buckets) - * [Kubelet metric changes](#kubelet-metric-changes) - * [Make metrics aggregatable](#make-metrics-aggregatable) - * [Export less metrics](#export-less-metrics) - * [Prevent apiserver's metrics from accidental registration](#prevent-apiservers-metrics-from-accidental-registration) - * [Risks and Mitigations](#risks-and-mitigations) -* [Graduation Criteria](#graduation-criteria) -* [Implementation History](#implementation-history) - -## Summary - -This Kubernetes Enhancement Proposal (KEP) outlines the changes planned in the scope of an overhaul of all metrics instrumented in the main kubernetes/kubernetes repository. This is a living document and as existing metrics, that are planned to change are added to the scope, they will be added to this document. As this initiative is going to affect all current users of Kubernetes metrics, this document will also be a source for migration documentation coming out of this effort. - -This KEP is targeted to land in Kubernetes 1.14. The aim is to get all changes into one Kubernetes minor release, to have only a migration be necessary. We are preparing a number of changes, but intend to only start merging them once the 1.14 development window opens. - -## Motivation - -A number of metrics that Kubernetes is instrumented with do not follow the [official Kubernetes instrumentation guidelines](https://github.com/kubernetes/community/blob/master/contributors/devel/instrumentation.md). This is for a number of reasons, such as the metrics having been created before the instrumentation guidelines were put in place (around two years ago), and just missing it in code reviews. Beyond the Kubernetes instrumentation guidelines, there are several violations of the [Prometheus instrumentation best practices](https://prometheus.io/docs/practices/instrumentation/). In order to have consistently named and high quality metrics, this effort aims to make working with metrics exposed by Kubernetes consistent with the rest of the ecosystem. In fact even metrics exposed by Kubernetes are inconsistent in themselves, making joining of metrics difficult. - -Kubernetes also makes extensive use of a global metrics registry to register metrics to be exposed. 
Aside from general shortcomings of global variables, Kubernetes is seeing actual effects of this, causing a number of components to use `sync.Once` or other mechanisms to ensure to not panic, when registering metrics. Instead a metrics registry should be passed to each component in order to explicitly register metrics instead of through `init` methods or other global, non-obvious executions. Within the scope of this KEP, we want to explore other ways, however, it is not blocking for its success, as the primary goal is to make the metrics exposed themselves more consistent and stable. - -While uncertain at this point, once cleaned up, this effort may put us a step closer to having stability guarantees for Kubernetes around metrics. Currently metrics are excluded from any kind of stability requirements. - -### Goals - -* Provide consistently named and high quality metrics in line with the rest of the Prometheus ecosystem. -* Consistent labeling in order to allow straightforward joins of metrics. - -### Non-Goals - -* Add/remove metrics. The scope of this effort just concerns the existing metrics. As long as the same or higher value is presented, adding/removing may be in scope (this is handled on a case by case basis). -* This effort does not concern logging or tracing instrumentation. - -## Proposal - -### cAdvisor instrumentation changes - -#### Consistent labeling - -Change the container metrics exposed through cAdvisor (which is compiled into the Kubelet) to [use consistent labeling according to the instrumentation guidelines](https://github.com/kubernetes/kubernetes/pull/69099). Concretely what that means is changing all the occurrences of the labels: -`pod_name` to `pod` -`container_name` to `container` - -As Kubernetes currently rewrites meta labels of containers to “well-known” `pod_name`, and `container_name` labels, this code is [located in the Kubernetes source](https://github.com/kubernetes/kubernetes/blob/097f300a4d8dd8a16a993ef9cdab94c1ef1d36b7/pkg/kubelet/cadvisor/cadvisor_linux.go#L96-L98), so it does not concern the cAdvisor code base. - -### Changing API latency histogram buckets - -API server histogram latency buckets run from 125ms to 8s. This range does not accurately model most API server request latencies, which could run as low as 1ms for GETs or as high as 60s before hitting the API server global timeout. - -https://github.com/kubernetes/kubernetes/pull/67476 - -### Kubelet metric changes - -#### Make metrics aggregatable - -Currently, all Kubelet metrics are exposed as summary data types. This means that it is impossible to calculate certain metrics in aggregate across a cluster, as summaries cannot be aggregated meaningfully. For example, currently one cannot calculate the [pod start latency in a given percentile on a cluster](https://github.com/kubernetes/kubernetes/issues/66791). - -Hence, where possible, we should change summaries to histograms, or provide histograms in addition to summaries like with the API server metrics. - -#### Export less metrics - -https://github.com/kubernetes/kubernetes/issues/68522 - -#### Prevent apiserver's metrics from accidental registration - -https://github.com/kubernetes/kubernetes/pull/63924 - -### Risks and Mitigations - -Risks include users upgrading Kubernetes, but not updating their usage of Kubernetes exposed metrics in alerting and dashboarding potentially causing incidents to go unnoticed. 
- -To prevent this, we will implement recording rules for Prometheus that allow best-effort backward compatibility, and update breaking metric usages in the [Kubernetes monitoring mixin](https://github.com/kubernetes-monitoring/kubernetes-mixin), a widely used collection of Prometheus alerts and Grafana dashboards for Kubernetes. - -## Graduation Criteria - -All metrics exposed by components in kubernetes/kubernetes follow Prometheus best practices, and (as a nice-to-have) tooling is built and enabled in CI to prevent simple violations of those best practices. - -## Implementation History - -Multiple pull requests have already been opened, but not merged, as of the writing of this document.
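To make the global-registry concern from the Motivation above concrete: the sketch below shows the pattern of passing a `prometheus.Registerer` into a component instead of registering metrics through package-level `init()` side effects. It is only an illustration of the pattern using the Prometheus Go client; the metric and label names are hypothetical, not metrics this KEP defines.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// componentMetrics groups the metrics a component exposes; the registry is
// passed in explicitly instead of being a package-level global.
type componentMetrics struct {
	podStartLatency *prometheus.HistogramVec
}

func newComponentMetrics(reg prometheus.Registerer) *componentMetrics {
	m := &componentMetrics{
		podStartLatency: prometheus.NewHistogramVec(prometheus.HistogramOpts{
			Name:    "example_pod_start_duration_seconds", // hypothetical metric name
			Help:    "Time taken to start a pod, aggregatable across nodes.",
			Buckets: prometheus.DefBuckets,
		}, []string{"pod", "container"}), // consistent label names, no *_name suffix
	}
	reg.MustRegister(m.podStartLatency) // explicit registration; no init() side effects
	return m
}

func main() {
	reg := prometheus.NewRegistry()
	m := newComponentMetrics(reg)
	m.podStartLatency.WithLabelValues("nginx-1", "nginx").Observe(1.3)

	// Serve only what was explicitly registered with this registry.
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	_ = http.ListenAndServe(":9090", nil)
}
```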
\ No newline at end of file +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-network/0007-pod-ready++.md b/keps/sig-network/0007-pod-ready++.md index ddc34b55..cfd1f5fa 100644 --- a/keps/sig-network/0007-pod-ready++.md +++ b/keps/sig-network/0007-pod-ready++.md @@ -1,173 +1,4 @@ ---- -kep-number: 1 -title: Pod Ready++ -authors: - - "freehan@" -owning-sig: sig-network -participating-sigs: - - sig-node - - sig-cli -reviewers: - - thockin@ - - dchen1107@ -approvers: - - thockin@ - - dchen1107@ -editor: freehan@ -creation-date: 2018-04-01 -last-updated: 2018-04-01 -status: provisional - ---- - -# Pod Ready++ - - -## Table of Contents - -* [Table of Contents](#table-of-contents) -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) -* [Proposal](#proposal) - * [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints-optional) - * [Risks and Mitigations](#risks-and-mitigations) -* [Graduation Criteria](#graduation-criteria) -* [Implementation History](#implementation-history) -* [Alternatives](#alternatives-optional) - - -## Summary - -This proposal aims to add extensibility to pod readiness. Besides container readiness, external feedback can be injected into PodStatus and influence pod readiness. Thus, achieving pod “ready++”. - -## Motivation - -Pod readiness indicates whether the pod is ready to serve traffic. Pod readiness is dictated by kubelet with user specified readiness probe. On the other hand, pod readiness determines whether pod address shows up on the address list on related endpoints object. K8s primitives that manage pods, such as Deployment, only takes pod status into account for decision making, such as advancement during rolling update. - -For example, during deployment rolling update, a new pod becomes ready. On the other hand, service, network policy and load-balancer are not yet ready for the new pod due to whatever reason (e.g. slowness in api machinery, endpoints controller, kube-proxy, iptables or infrastructure programming). This may cause service disruption or lost of backend capacity. In extreme cases, if rolling update completes before any new replacement pod actually start serving traffic, this will cause service outage. - - -### Goals - -- Allow extra signals for pod readiness. - -### Non-Goals - -- Provide generic framework to solve all transition problems in k8s (e.g. blue green deployment). - -## Proposal - -[K8s Proposal: Pod Ready++](https://docs.google.com/document/d/1VFZbc_IqPf_Msd-jul7LKTmGjvQ5qRldYOFV0lGqxf8/edit#) - -### PodSpec -Introduce an extra field called ReadinessGates in PodSpec. The field stores a list of ReadinessGate structure as follows: -```yaml -type ReadinessGate struct { - conditionType string -} -``` -The ReadinessGate struct has only one string field called ConditionType. ConditionType refers to a condition in the PodCondition list in PodStatus. And the status of conditions specified in the ReadinessGates will be evaluated for pod readiness. If the condition does not exist in the PodCondition list, its status will be default to false. - -#### Constraints: -- ReadinessGates can only be specified at pod creation. -- No Update allowed on ReadinessGates. -- ConditionType must conform to the naming convention of custom pod condition. - -### Pod Readiness -Change the pod readiness definition to as follows: -``` -Pod is ready == containers are ready AND conditions in ReadinessGates are True -``` -Kubelet will evaluate conditions specified in ReadinessGates and update the pod “Ready” status. 
For example, in the following pod spec, two readinessGates are specified. The status of “www.example.com/feature-1” is false, hence the pod is not ready. - -```yaml -Kind: Pod -… -spec: - readinessGates: - - conditionType: www.example.com/feature-1 - - conditionType: www.example.com/feature-2 -… -status: - conditions: - - lastProbeTime: null - lastTransitionTime: 2018-01-01T00:00:00Z - status: "False" - type: Ready - - lastProbeTime: null - lastTransitionTime: 2018-01-01T00:00:00Z - status: "False" - type: www.example.com/feature-1 - - lastProbeTime: null - lastTransitionTime: 2018-01-01T00:00:00Z - status: "True" - type: www.example.com/feature-2 - containerStatuses: - - containerID: docker://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx - ready : true -… -``` - -Another pod condition `ContainerReady` will be introduced to capture the old pod `Ready` condition. -``` -ContainerReady is true == containers are ready -``` - -### Custom Pod Condition -Custom pod condition can be injected thru PATCH action using KubeClient. Please be noted that “kubectl patch” does not support patching object status. Need to use client-go or other KubeClient implementations. - -Naming Convention: -The type of custom pod condition must comply with k8s label key format. For example, “www.example.com/feature-1”. - - -### Implementation Details/Notes/Constraints - -##### Workloads -To conform with this proposals, workload controllers MUST take pod “Ready” condition as the final signal to proceed during transitions. - -For the workloads that take pod readiness as a critical signal for its decision making, they will automatically comply with this proposal without any change. Majority, if not all, of the workloads satisfy this condition. - -##### Kubelet -- Use PATCH instead of PUT to update PodStatus fields that are dictated by kubelet. -- Only compare the fields that managed by kubelet for PodStatus reconciliation . -- Watch PodStatus changes and evaluate ReadinessGates for pod readiness. - -### Feature Integration -In this section, we will discuss how to make ReadinessGates transparent to K8s API user. In order words, a K8s API user does not need to specify ReadinessGates to use specific features. This allows existing manifests to just work with features that require ReadinessGate. -Each feature will bear the burden of injecting ReadinessGate and keep its custom pod condition in sync. ReadinessGate can be injected using mutating webhook at pod creation time. After pod creation, each feature is responsible for keeping its custom pod condition in sync as long as its ReadinessGate exists in the PodSpec. This can be achieved by running k8s controller to sync conditions on relevant pods. This is to ensure that PodStatus is observable and recoverable even when catastrophic failure (e.g. loss of data) occurs at API server. - - - -### Risks and Mitigations - -Risks: -- Features that utilize the extension point from this proposal may abuse the API. -- User confusion on pod ready++ - -Mitigations: -- Better specification and API validation. -- Better CLI/UI/UX - - -## Graduation Criteria - -- Kubelet changes should not have any impact on kubelet reliability. -- Feature integration with the pod ready++ extension. - - -## Implementation History - -TBD - - -## Alternatives - -##### Why not fix the workloads? - -There are a lot of workloads including core workloads such as deployment and 3rd party workloads such as spark operator. 
Most, if not all, of them take pod readiness as a critical signal for decision making, while ignoring higher level abstractions (e.g. service, network policy and ingress). To complicate the problem further, label selectors make membership implicit and dynamic. Solving this problem in all workload controllers would require a much bigger change than this proposal. - -##### Why not extend container readiness? - -Container readiness is tied to low level constructs such as the runtime. This inherently implies that the kubelet and the underlying system have full knowledge of container status. Injecting external feedback into container status would complicate the abstraction and control flow. Meanwhile, higher level abstractions (e.g. service) generally take the pod, not the container, as the atomic unit. +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
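The "Custom Pod Condition" section above notes that conditions must be injected via a PATCH using a Kubernetes client rather than `kubectl patch`. As a minimal illustration of that mechanism, the sketch below uses client-go to set one of the example readiness-gate conditions on the status subresource. It assumes a recent client-go release (the `Patch` signature has gained a context argument since this KEP was written); the namespace, pod name, and condition values are placeholders.

```go
package main

import (
	"context"
	"encoding/json"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// setCustomCondition patches a single custom condition (e.g. one named in a
// readinessGate) into the pod's status via a strategic merge patch.
func setCustomCondition(ctx context.Context, cs kubernetes.Interface, ns, pod string, condType corev1.PodConditionType, ready bool) error {
	status := corev1.ConditionFalse
	if ready {
		status = corev1.ConditionTrue
	}
	patch := map[string]interface{}{
		"status": map[string]interface{}{
			"conditions": []corev1.PodCondition{{
				Type:               condType,
				Status:             status,
				LastTransitionTime: metav1.NewTime(time.Now()),
			}},
		},
	}
	data, err := json.Marshal(patch)
	if err != nil {
		return err
	}
	// Patch the "status" subresource; strategic merge uses "type" as the merge
	// key for the conditions list, so only this condition is added or updated.
	_, err = cs.CoreV1().Pods(ns).Patch(ctx, pod, types.StrategicMergePatchType, data, metav1.PatchOptions{}, "status")
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	// "www.example.com/feature-1" matches the readinessGate from the example above.
	if err := setCustomCondition(context.Background(), cs, "default", "my-pod", "www.example.com/feature-1", true); err != nil {
		panic(err)
	}
}
```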
\ No newline at end of file diff --git a/keps/sig-network/0010-20180314-coredns-GA-proposal.md b/keps/sig-network/0010-20180314-coredns-GA-proposal.md index 54494eea..cfd1f5fa 100644 --- a/keps/sig-network/0010-20180314-coredns-GA-proposal.md +++ b/keps/sig-network/0010-20180314-coredns-GA-proposal.md @@ -1,126 +1,4 @@ ---- -kep-number: 10 -title: Graduate CoreDNS to GA -authors: - - "@johnbelamaric" - - "@rajansandeep" -owning-sig: sig-network -participating-sigs: - - sig-cluster-lifecycle -reviewers: - - "@bowei" - - "@thockin" -approvers: - - "@thockin" -editor: "@rajansandeep" -creation-date: 2018-03-21 -last-updated: 2018-05-18 -status: provisional -see-also: https://github.com/kubernetes/community/pull/2167 ---- - -# Graduate CoreDNS to GA - -## Table of Contents - -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) -* [Proposal](#proposal) - * [User Cases](#use-cases) -* [Graduation Criteria](#graduation-criteria) -* [Implementation History](#implementation-history) - -## Summary - -CoreDNS is sister CNCF project and is the successor to SkyDNS, on which kube-dns is based. It is a flexible, extensible -authoritative DNS server and directly integrates with the Kubernetes API. It can serve as cluster DNS, -complying with the [dns spec](https://git.k8s.io/dns/docs/specification.md). As an independent project, -it is more actively developed than kube-dns and offers performance and functionality beyond what kube-dns has. For more details, see the [introductory presentation](https://docs.google.com/presentation/d/1v6Coq1JRlqZ8rQ6bv0Tg0usSictmnN9U80g8WKxiOjQ/edit#slide=id.g249092e088_0_181), or [coredns.io](https://coredns.io), or the [CNCF webinar](https://youtu.be/dz9S7R8r5gw). - -Currently, we are following the road-map defined [here](https://github.com/kubernetes/features/issues/427). CoreDNS is Beta in Kubernetes v1.10, which can be installed as an alternate to kube-dns. -The purpose of this proposal is to graduate CoreDNS to GA. - -## Motivation - -* CoreDNS is more flexible and extensible than kube-dns. -* CoreDNS is easily extensible and maintainable using a plugin architecture. -* CoreDNS has fewer moving parts than kube-dns, taking advantage of the plugin architecture, making it a single executable and single process. -* It is written in Go, making it memory-safe (kube-dns includes dnsmasq which is not). -* CoreDNS has [better performance](https://github.com/kubernetes/community/pull/1100#issuecomment-337747482) than [kube-dns](https://github.com/kubernetes/community/pull/1100#issuecomment-338329100) in terms of greater QPS, lower latency, and lower memory consumption. - -### Goals - -* Bump up CoreDNS to be GA. -* Make CoreDNS available as an image in a Kubernetes repository (To Be Defined) and ensure a workflow/process to update the CoreDNS versions in the future. - May be deferred to [next KEP](https://github.com/kubernetes/community/pull/2167) if goal not achieved in time. -* Provide a kube-dns to CoreDNS upgrade path with configuration translation in `kubeadm`. -* Provide a CoreDNS to CoreDNS upgrade path in `kubeadm`. - -### Non-Goals - -* Translation of CoreDNS ConfigMap back to kube-dns (i.e., downgrade). -* Translation configuration of kube-dns to equivalent CoreDNS that is defined outside of the kube-dns ConfigMap. For example, modifications to the manifest or `dnsmasq` configuration. -* Fate of kube-dns in future releases, i.e. deprecation path. 
-* Making [CoreDNS the default](https://github.com/kubernetes/community/pull/2167) in every installer. - -## Proposal - -The proposed solution is to enable the selection of CoreDNS as a GA cluster service discovery DNS for Kubernetes. -Some of the most used deployment tools have been upgraded by the CoreDNS team, in cooperation of the owners of these tools, to be able to deploy CoreDNS: -* kubeadm -* kube-up -* minikube -* kops - -For other tools, each maintainer would have to add the upgrade to CoreDNS. - -### Use Cases - -* CoreDNS supports all functionality of kube-dns and also addresses [several use-cases kube-dns lacks](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/network/coredns.md#use-cases). Some of the Use Cases are as follows: - * Supporting [Autopath](https://coredns.io/plugins/autopath/), which reduces the high query load caused by the long DNS search path in Kubernetes. - * Making an alias for an external name [#39792](https://github.com/kubernetes/kubernetes/issues/39792) - -* By default, the user experience would be unchanged. For more advanced uses, existing users would need to modify the ConfigMap that contains the CoreDNS configuration file. -* Since CoreDNS has more supporting features than kube-dns, there will be no path to retain the CoreDNS configuration in case a user wants to switch to kube-dns. - -#### Configuring CoreDNS - -The CoreDNS configuration file is called a `Corefile` and syntactically is the same as a [Caddyfile](https://caddyserver.com/docs/caddyfile). The file consists of multiple stanzas called _server blocks_. -Each of these represents a set of zones for which that server block should respond, along with the list of plugins to apply to a given request. More details on this can be found in the -[Corefile Explained](https://coredns.io/2017/07/23/corefile-explained/) and [How Queries Are Processed](https://coredns.io/2017/06/08/how-queries-are-processed-in-coredns/) blog entries. - -The following can be expected when CoreDNS is graduated to GA. - -#### Kubeadm - -* The CoreDNS feature-gates flag will be marked as GA. -* As Kubeadm maintainers chose to deploy CoreDNS as the default Cluster DNS for Kubernetes 1.11: - * CoreDNS will be installed by default in a fresh install of Kubernetes via kubeadm. - * For users upgrading Kubernetes via kubeadm, it will install CoreDNS by default whether the user had kube-dns or CoreDNS in a previous kubernetes version. - * In case a user wants to install kube-dns instead of CoreDNS, they have to set the feature-gate of CoreDNS to false. `--feature-gates=CoreDNS=false` -* When choosing to install CoreDNS, the configmap of a previously installed kube-dns will be automatically translated to the equivalent CoreDNS configmap. - -#### Kube-up - -* CoreDNS will be installed when the environment variable `CLUSTER_DNS_CORE_DNS` is set to `true`. The default value is `false`. - -#### Minikube - -* CoreDNS to be an option in the add-on manager, with CoreDNS disabled by default. - -## Graduation Criteria - -* Verify that all e2e conformance and DNS related tests (xxx-kubernetes-e2e-gce, ci-kubernetes-e2e-gce-gci-ci-master and filtered by `--ginkgo.skip=\\[Slow\\]|\\[Serial\\]|\\[Disruptive\\]|\\[Flaky\\]|\\[Feature:.+\\]`) run successfully for CoreDNS. - None of the tests successful with Kube-DNS should be failing with CoreDNS. -* Add CoreDNS as part of the e2e Kubernetes scale runs and ensure tests are not failing. 
-* Extend [perf-tests](https://github.com/kubernetes/perf-tests/tree/master/dns) for CoreDNS. -* Add dedicated DNS-related tests to the e2e scalability tests [Feature:performance]. - -## Implementation History - -* 20170912 - [Feature proposal](https://github.com/kubernetes/features/issues/427) for CoreDNS to be implemented as the default DNS in Kubernetes. -* 20171108 - Successfully released [CoreDNS as an Alpha feature-gate in Kubernetes v1.9](https://github.com/kubernetes/kubernetes/pull/52501). -* 20180226 - CoreDNS graduated to Incubation in the CNCF. -* 20180305 - Support for kube-dns ConfigMap translation and promotion of [CoreDNS to Beta](https://github.com/kubernetes/kubernetes/pull/58828) for Kubernetes v1.10. +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
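For context on what "complying with the dns spec" exercised by the graduation tests above looks like from a client's perspective, the short sketch below resolves the A and SRV records that the Kubernetes DNS specification defines for a Service. It is only an illustration run from inside a cluster, not part of the test plan; the service and port names are the standard `kubernetes.default` ones and are assumed here for convenience.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// A/AAAA record: <service>.<namespace>.svc.<cluster-domain>
	addrs, err := net.DefaultResolver.LookupHost(ctx, "kubernetes.default.svc.cluster.local")
	fmt.Println("A records:", addrs, err)

	// SRV record: _<port-name>._<proto>.<service>.<namespace>.svc.<cluster-domain>
	cname, srvs, err := net.DefaultResolver.LookupSRV(ctx, "https", "tcp", "kubernetes.default.svc.cluster.local")
	fmt.Println("SRV:", cname, err)
	for _, srv := range srvs {
		fmt.Printf("  %s:%d\n", srv.Target, srv.Port)
	}
}
```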
\ No newline at end of file diff --git a/keps/sig-network/0011-ipvs-proxier.md b/keps/sig-network/0011-ipvs-proxier.md index 7e6e2328..cfd1f5fa 100644 --- a/keps/sig-network/0011-ipvs-proxier.md +++ b/keps/sig-network/0011-ipvs-proxier.md @@ -1,574 +1,4 @@ ---- -kep-number: TBD -title: IPVS Load Balancing Mode in Kubernetes -status: implemented -authors: - - "@rramkumar1" -owning-sig: sig-network -reviewers: - - "@thockin" - - "@m1093782566" -approvers: - - "@thockin" - - "@m1093782566" -editor: - - "@thockin" - - "@m1093782566" -creation-date: 2018-03-21 ---- - -# IPVS Load Balancing Mode in Kubernetes - -**Note: This is a retroactive KEP. Credit goes to @m1093782566, @haibinxie, and @quinton-hoole for all information & design in this KEP.** - -**Important References: https://github.com/kubernetes/community/pull/692/files** - -## Table of Contents - -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non\-goals](#non-goals) -* [Proposal](#proposal) - * [Kube-Proxy Parameter Changes](#kube-proxy-parameter-changes) - * [Build Changes](#build-changes) - * [Deployment Changes](#deployment-changes) - * [Design Considerations](#design-considerations) - * [IPVS service network topology](#ipvs-service-network-topology) - * [Port remapping](#port-remapping) - * [Falling back to iptables](#falling-back-to-iptables) - * [Supporting NodePort service](#supporting-nodeport-service) - * [Supporting ClusterIP service](#supporting-clusterip-service) - * [Supporting LoadBalancer service](#supporting-loadbalancer-service) - * [Session Affinity](#session-affinity) - * [Cleaning up inactive rules](#cleaning-up-inactive-rules) - * [Sync loop pseudo code](#sync-loop-pseudo-code) -* [Graduation Criteria](#graduation-criteria) -* [Implementation History](#implementation-history) -* [Drawbacks](#drawbacks) -* [Alternatives](#alternatives) - -## Summary - -We are building a new implementation of kube proxy built on top of IPVS (IP Virtual Server). - -## Motivation - -As Kubernetes grows in usage, the scalability of its resources becomes more and more -important. In particular, the scalability of services is paramount to the adoption of Kubernetes -by developers/companies running large workloads. Kube Proxy, the building block of service routing -has relied on the battle-hardened iptables to implement the core supported service types such as -ClusterIP and NodePort. However, iptables struggles to scale to tens of thousands of services because -it is designed purely for firewalling purposes and is based on in-kernel rule chains. On the -other hand, IPVS is specifically designed for load balancing and uses more efficient data structures -under the hood. For more information on the performance benefits of IPVS vs. iptables, take a look -at these [slides](https://docs.google.com/presentation/d/1BaIAywY2qqeHtyGZtlyAp89JIZs59MZLKcFLxKE6LyM/edit?usp=sharing). - -### Goals - -* Improve the performance of services - -### Non-goals - -None - -### Challenges and Open Questions [optional] - -None - - -## Proposal - -### Kube-Proxy Parameter Changes - -***Parameter: --proxy-mode*** -In addition to existing userspace and iptables modes, IPVS mode is configured via --proxy-mode=ipvs. In the initial implementation, it implicitly uses IPVS [NAT](http://www.linuxvirtualserver.org/VS-NAT.html) mode. - -***Parameter: --ipvs-scheduler*** -A new kube-proxy parameter will be added to specify the IPVS load balancing algorithm, with the parameter being --ipvs-scheduler. 
If it’s not configured, then round-robin (rr) is default value. If it’s incorrectly configured, then kube-proxy will exit with error message. - * rr: round-robin - * lc: least connection - * dh: destination hashing - * sh: source hashing - * sed: shortest expected delay - * nq: never queue -For more details, refer to http://kb.linuxvirtualserver.org/wiki/Ipvsadm - -In future, we can implement service specific scheduler (potentially via annotation), which has higher priority and overwrites the value. - -***Parameter: --cleanup-ipvs*** -Similar to the --cleanup-iptables parameter, if true, cleanup IPVS configuration and IPTables rules that are created in IPVS mode. - -***Parameter: --ipvs-sync-period*** -Maximum interval of how often IPVS rules are refreshed (e.g. '5s', '1m'). Must be greater than 0. - -***Parameter: --ipvs-min-sync-period*** -Minimum interval of how often the IPVS rules are refreshed (e.g. '5s', '1m'). Must be greater than 0. - - -### Build Changes - -No changes at all. The IPVS implementation is built on [docker/libnetwork](https://godoc.org/github.com/docker/libnetwork/ipvs) IPVS library, which is a pure-golang implementation and talks to kernel via socket communication. - -### Deployment Changes - -IPVS kernel module installation is beyond Kubernetes. It’s assumed that IPVS kernel modules are installed on the node before running kube-proxy. When kube-proxy starts, if the proxy mode is IPVS, kube-proxy would validate if IPVS modules are installed on the node. If it’s not installed, then kube-proxy will fall back to the iptables proxy mode. - -### Design Considerations - -#### IPVS service network topology - -We will create a dummy interface and assign all kubernetes service ClusterIP's to the dummy interface (default name is `kube-ipvs0`). For example, - -```shell -# ip link add kube-ipvs0 type dummy -# ip addr -... -73: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN qlen 1000 - link/ether 26:1f:cc:f8:cd:0f brd ff:ff:ff:ff:ff:ff - -#### Assume 10.102.128.4 is service Cluster IP -# ip addr add 10.102.128.4/32 dev kube-ipvs0 -... -73: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN qlen 1000 - link/ether 1a:ce:f5:5f:c1:4d brd ff:ff:ff:ff:ff:ff - inet 10.102.128.4/32 scope global kube-ipvs0 - valid_lft forever preferred_lft forever -``` - -Note that the relationship between a Kubernetes service and an IPVS service is `1:N`. Consider a Kubernetes service that has more than one access IP. For example, an External IP type service has 2 access IP's (ClusterIP and External IP). Then the IPVS proxier will create 2 IPVS services - one for Cluster IP and the other one for External IP. - -The relationship between a Kubernetes endpoint and an IPVS destination is `1:1`. -For instance, deletion of a Kubernetes service will trigger deletion of the corresponding IPVS service and address bound to dummy interface. - - -#### Port remapping - -There are 3 proxy modes in ipvs - NAT (masq), IPIP and DR. Only NAT mode supports port remapping. We will use IPVS NAT mode in order to supporting port remapping. The following example shows ipvs mapping service port `3080` to container port `8080`. 
- -```shell -# ipvsadm -ln -IP Virtual Server version 1.2.1 (size=4096) -Port LocalAddress:Port Scheduler Flags - -> RemoteAddress:Port Forward Weight ActiveConn InActConn -TCP 10.102.128.4:3080 rr - -> 10.244.0.235:8080 Masq 1 0 0 - -> 10.244.1.237:8080 Masq 1 0 0 - -``` - -#### Falling back to iptables - -IPVS proxier will employ iptables in doing packet filtering, SNAT and supporting NodePort type service. Specifically, ipvs proxier will fall back on iptables in the following 4 scenarios. - -* kube-proxy start with --masquerade-all=true -* Specify cluster CIDR in kube-proxy startup -* Load Balancer Source Ranges is specified for LB type service -* Support NodePort type service - -And, IPVS proxier will maintain 5 kubernetes-specific chains in nat table - -- KUBE-POSTROUTING -- KUBE-MARK-MASQ -- KUBE-MARK-DROP - -`KUBE-POSTROUTING`, `KUBE-MARK-MASQ`, ` KUBE-MARK-DROP` are maintained by kubelet and ipvs proxier won't create them. IPVS proxier will make sure chains `KUBE-MARK-SERVICES` and `KUBE-NODEPORTS` exist in its sync loop. - -**1. kube-proxy start with --masquerade-all=true** - -If kube-proxy starts with `--masquerade-all=true`, the IPVS proxier will masquerade all traffic accessing service ClusterIP, which behaves same as what iptables proxier does. -Suppose there is a service with Cluster IP `10.244.5.1` and port `8080`: - -```shell -# iptables -t nat -nL - -Chain PREROUTING (policy ACCEPT) -target prot opt source destination -KUBE-SERVICES all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service portals */ - -Chain OUTPUT (policy ACCEPT) -target prot opt source destination -KUBE-SERVICES all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service portals */ - -Chain POSTROUTING (policy ACCEPT) -target prot opt source destination -KUBE-POSTROUTING all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes postrouting rules */ - -Chain KUBE-POSTROUTING (1 references) -target prot opt source destination -MASQUERADE all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000 - -Chain KUBE-MARK-DROP (0 references) -target prot opt source destination -MARK all -- 0.0.0.0/0 0.0.0.0/0 MARK or 0x8000 - -Chain KUBE-MARK-MASQ (6 references) -target prot opt source destination -MARK all -- 0.0.0.0/0 0.0.0.0/0 MARK or 0x4000 - -Chain KUBE-SERVICES (2 references) -target prot opt source destination -KUBE-MARK-MASQ tcp -- 0.0.0.0/0 10.244.5.1 /* default/foo:http cluster IP */ tcp dpt:8080 -``` - -**2. Specify cluster CIDR in kube-proxy startup** - -If kube-proxy starts with `--cluster-cidr=<cidr>`, the IPVS proxier will masquerade off-cluster traffic accessing service ClusterIP, which behaves same as what iptables proxier does. 
-Suppose kube-proxy is provided with the cluster cidr `10.244.16.0/24`, and service Cluster IP is `10.244.5.1` and port is `8080`: - -```shell -# iptables -t nat -nL - -Chain PREROUTING (policy ACCEPT) -target prot opt source destination -KUBE-SERVICES all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service portals */ - -Chain OUTPUT (policy ACCEPT) -target prot opt source destination -KUBE-SERVICES all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service portals */ - -Chain POSTROUTING (policy ACCEPT) -target prot opt source destination -KUBE-POSTROUTING all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes postrouting rules */ - -Chain KUBE-POSTROUTING (1 references) -target prot opt source destination -MASQUERADE all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000 - -Chain KUBE-MARK-DROP (0 references) -target prot opt source destination -MARK all -- 0.0.0.0/0 0.0.0.0/0 MARK or 0x8000 - -Chain KUBE-MARK-MASQ (6 references) -target prot opt source destination -MARK all -- 0.0.0.0/0 0.0.0.0/0 MARK or 0x4000 - -Chain KUBE-SERVICES (2 references) -target prot opt source destination -KUBE-MARK-MASQ tcp -- !10.244.16.0/24 10.244.5.1 /* default/foo:http cluster IP */ tcp dpt:8080 -``` - -**3. Load Balancer Source Ranges is specified for LB type service** - -When service's `LoadBalancerStatus.ingress.IP` is not empty and service's `LoadBalancerSourceRanges` is specified, IPVS proxier will install iptables rules which looks like what is shown below. - -Suppose service's `LoadBalancerStatus.ingress.IP` is `10.96.1.2` and service's `LoadBalancerSourceRanges` is `10.120.2.0/24`: - -```shell -# iptables -t nat -nL - -Chain PREROUTING (policy ACCEPT) -target prot opt source destination -KUBE-SERVICES all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service portals */ - -Chain OUTPUT (policy ACCEPT) -target prot opt source destination -KUBE-SERVICES all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service portals */ - -Chain POSTROUTING (policy ACCEPT) -target prot opt source destination -KUBE-POSTROUTING all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes postrouting rules */ - -Chain KUBE-POSTROUTING (1 references) -target prot opt source destination -MASQUERADE all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000 - -Chain KUBE-MARK-DROP (0 references) -target prot opt source destination -MARK all -- 0.0.0.0/0 0.0.0.0/0 MARK or 0x8000 - -Chain KUBE-MARK-MASQ (6 references) -target prot opt source destination -MARK all -- 0.0.0.0/0 0.0.0.0/0 MARK or 0x4000 - -Chain KUBE-SERVICES (2 references) -target prot opt source destination -ACCEPT tcp -- 10.120.2.0/24 10.96.1.2 /* default/foo:http loadbalancer IP */ tcp dpt:8080 -DROP tcp -- 0.0.0.0/0 10.96.1.2 /* default/foo:http loadbalancer IP */ tcp dpt:8080 -``` - -**4. Support NodePort type service** - -Please check the section below. - -#### Supporting NodePort service - -For supporting NodePort type service, iptables will recruit the existing implementation in the iptables proxier. For example, - -```shell -# kubectl describe svc nginx-service -Name: nginx-service -... 
-Type: NodePort -IP: 10.101.28.148 -Port: http 3080/TCP -NodePort: http 31604/TCP -Endpoints: 172.17.0.2:80 -Session Affinity: None - -# iptables -t nat -nL - -[root@100-106-179-225 ~]# iptables -t nat -nL -Chain PREROUTING (policy ACCEPT) -target prot opt source destination -KUBE-SERVICES all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service portals */ - -Chain OUTPUT (policy ACCEPT) -target prot opt source destination -KUBE-SERVICES all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service portals */ - -Chain KUBE-SERVICES (2 references) -target prot opt source destination -KUBE-MARK-MASQ tcp -- !172.16.0.0/16 10.101.28.148 /* default/nginx-service:http cluster IP */ tcp dpt:3080 -KUBE-SVC-6IM33IEVEEV7U3GP tcp -- 0.0.0.0/0 10.101.28.148 /* default/nginx-service:http cluster IP */ tcp dpt:3080 -KUBE-NODEPORTS all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL - -Chain KUBE-NODEPORTS (1 references) -target prot opt source destination -KUBE-MARK-MASQ tcp -- 0.0.0.0/0 0.0.0.0/0 /* default/nginx-service:http */ tcp dpt:31604 -KUBE-SVC-6IM33IEVEEV7U3GP tcp -- 0.0.0.0/0 0.0.0.0/0 /* default/nginx-service:http */ tcp dpt:31604 - -Chain KUBE-SVC-6IM33IEVEEV7U3GP (2 references) -target prot opt source destination -KUBE-SEP-Q3UCPZ54E6Q2R4UT all -- 0.0.0.0/0 0.0.0.0/0 /* default/nginx-service:http */ -Chain KUBE-SEP-Q3UCPZ54E6Q2R4UT (1 references) -target prot opt source destination -KUBE-MARK-MASQ all -- 172.17.0.2 0.0.0.0/0 /* default/nginx-service:http */ -DNAT -``` - -#### Supporting ClusterIP service - -When creating a ClusterIP type service, IPVS proxier will do 3 things: - -* make sure dummy interface exists in the node -* bind service cluster IP to the dummy interface -* create an IPVS service whose address corresponds to the Kubernetes service Cluster IP. - -For example, - -```shell -# kubectl describe svc nginx-service -Name: nginx-service -... -Type: ClusterIP -IP: 10.102.128.4 -Port: http 3080/TCP -Endpoints: 10.244.0.235:8080,10.244.1.237:8080 -Session Affinity: None - -# ip addr -... -73: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN qlen 1000 - link/ether 1a:ce:f5:5f:c1:4d brd ff:ff:ff:ff:ff:ff - inet 10.102.128.4/32 scope global kube-ipvs0 - valid_lft forever preferred_lft forever - -# ipvsadm -ln -IP Virtual Server version 1.2.1 (size=4096) -Prot LocalAddress:Port Scheduler Flags - -> RemoteAddress:Port Forward Weight ActiveConn InActConn -TCP 10.102.128.4:3080 rr - -> 10.244.0.235:8080 Masq 1 0 0 - -> 10.244.1.237:8080 Masq 1 0 0 -``` - -### Support LoadBalancer service - -IPVS proxier will NOT bind LB's ingress IP to the dummy interface. When creating a LoadBalancer type service, ipvs proxier will do 4 things: - -- Make sure dummy interface exists in the node -- Bind service cluster IP to the dummy interface -- Create an ipvs service whose address corresponding to kubernetes service Cluster IP -- Iterate LB's ingress IPs, create an ipvs service whose address corresponding LB's ingress IP - -For example, - -```shell -# kubectl describe svc nginx-service -Name: nginx-service -... -IP: 10.102.128.4 -Port: http 3080/TCP -Endpoints: 10.244.0.235:8080 -Session Affinity: None - -#### Only bind Cluster IP to dummy interface -# ip addr -... 
-73: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN qlen 1000 - link/ether 1a:ce:f5:5f:c1:4d brd ff:ff:ff:ff:ff:ff - inet 10.102.128.4/32 scope global kube-ipvs0 - valid_lft forever preferred_lft forever - -#### Suppose LB's ingress IPs {10.96.1.2, 10.93.1.3}. IPVS proxier will create 1 ipvs service for cluster IP and 2 ipvs services for LB's ingree IP. Each ipvs service has its destination. -# ipvsadm -ln -IP Virtual Server version 1.2.1 (size=4096) -Prot LocalAddress:Port Scheduler Flags - -> RemoteAddress:Port Forward Weight ActiveConn InActConn -TCP 10.102.128.4:3080 rr - -> 10.244.0.235:8080 Masq 1 0 0 -TCP 10.96.1.2:3080 rr - -> 10.244.0.235:8080 Masq 1 0 0 -TCP 10.96.1.3:3080 rr - -> 10.244.0.235:8080 Masq 1 0 0 -``` - -Since there is a need of supporting access control for `LB.ingress.IP`. IPVS proxier will fall back on iptables. Iptables will drop any packet which is not from `LB.LoadBalancerSourceRanges`. For example, - -```shell -# iptables -A KUBE-SERVICES -d {ingress.IP} --dport {service.Port} -s {LB.LoadBalancerSourceRanges} -j ACCEPT -``` - -When the packet reach the end of chain, ipvs proxier will drop it. - -```shell -# iptables -A KUBE-SERVICES -d {ingress.IP} --dport {service.Port} -j KUBE-MARK-DROP -``` - -### Support Only NodeLocal Endpoints - -Similar to iptables proxier, when a service has the "Only NodeLocal Endpoints" annotation, ipvs proxier will only proxy traffic to endpoints in the local node. - -```shell -# kubectl describe svc nginx-service -Name: nginx-service -... -IP: 10.102.128.4 -Port: http 3080/TCP -Endpoints: 10.244.0.235:8080, 10.244.1.235:8080 -Session Affinity: None - -#### Assume only endpoint 10.244.0.235:8080 is in the same host with kube-proxy - -#### There should be 1 destination for ipvs service. -[root@SHA1000130405 home]# ipvsadm -ln -IP Virtual Server version 1.2.1 (size=4096) -Prot LocalAddress:Port Scheduler Flags - -> RemoteAddress:Port Forward Weight ActiveConn InActConn -TCP 10.102.128.4:3080 rr - -> 10.244.0.235:8080 Masq 1 0 0 -``` - -#### Session affinity - -IPVS support client IP session affinity (persistent connection). When a service specifies session affinity, the IPVS proxier will set a timeout value (180min=10800s by default) in the IPVS service. For example, - -```shell -# kubectl describe svc nginx-service -Name: nginx-service -... -IP: 10.102.128.4 -Port: http 3080/TCP -Session Affinity: ClientIP - -# ipvsadm -ln -IP Virtual Server version 1.2.1 (size=4096) -Prot LocalAddress:Port Scheduler Flags - -> RemoteAddress:Port Forward Weight ActiveConn InActConn -TCP 10.102.128.4:3080 rr persistent 10800 -``` - -#### Cleaning up inactive rules - -It seems difficult to distinguish if an IPVS service is created by the IPVS proxier or other processes. Currently we assume IPVS rules will be created only by the IPVS proxier on a node, so we can clear all IPVSrules on a node. We should add warnings in documentation and flag comments. - -#### Sync loop pseudo code - -Similar to the iptables proxier, the IPVS proxier will do a full sync loop in a configured period. Also, each update on a Kubernetes service or endpoint will trigger an IPVS service or destination update. For example, - -* Creating a Kubernetes service will trigger creating a new IPVS service. -* Updating a Kubernetes service(for instance, change session affinity) will trigger updating an existing IPVS service. -* Deleting a Kubernetes service will trigger deleting an IPVS service. 
-* Adding an endpoint for a Kubernetes service will trigger adding a destination for an existing IPVS service. -* Updating an endpoint for a Kubernetes service will trigger updating a destination for an existing IPVS service. -* Deleting an endpoint for a Kubernetes service will trigger deleting a destination for an existing IPVS service. - -Any IPVS service or destination updates will send an update command to kernel via socket communication, which won't take a service down. - -The sync loop pseudo code is shown below: - -```go -func (proxier *Proxier) syncProxyRules() { - When service or endpoint update, begin sync ipvs rules and iptables rules if needed. - ensure dummy interface exists, if not, create one. - for svcName, svcInfo := range proxier.serviceMap { - // Capture the clusterIP. - construct ipvs service from svcInfo - Set session affinity flag and timeout value for ipvs service if specified session affinity - bind Cluster IP to dummy interface - call libnetwork API to create ipvs service and destinations - - // Capture externalIPs. - if externalIP is local then hold the svcInfo.Port so that can install ipvs rules on it - construct ipvs service from svcInfo - Set session affinity flag and timeout value for ipvs service if specified session affinity - call libnetwork API to create ipvs service and destinations - - // Capture load-balancer ingress. - for _, ingress := range svcInfo.LoadBalancerStatus.Ingress { - if ingress.IP != "" { - if len(svcInfo.LoadBalancerSourceRanges) != 0 { - install specific iptables - } - construct ipvs service from svcInfo - Set session affinity flag and timeout value for ipvs service if specified session affinity - call libnetwork API to create ipvs service and destinations - } - } - - // Capture nodeports. - if svcInfo.NodePort != 0 { - fall back on iptables, recruit existing iptables proxier implementation - } - - call libnetwork API to clean up legacy ipvs services which is inactive any longer - unbind service address from dummy interface - clean up legacy iptables chains and rules - } -} -``` - -## Graduation Criteria - -### Beta -> GA - -The following requirements should be met before moving from Beta to GA. It is -suggested to file an issue which tracks all the action items. - -- [ ] Testing - - [ ] 48 hours of green e2e tests. - - [ ] Flakes must be identified and filed as issues. - - [ ] Integrate with scale tests and. Failures should be filed as issues. -- [ ] Development work - - [ ] Identify all pending changes/refactors. Release blockers must be prioritized and fixed. - - [ ] Identify all bugs. Release blocking bugs must be identified and fixed. -- [ ] Docs - - [ ] All user-facing documentation must be updated. - -### GA -> Future - -__TODO__ - -## Implementation History - -**In chronological order** - -1. https://github.com/kubernetes/kubernetes/pull/46580 - -2. https://github.com/kubernetes/kubernetes/pull/52528 - -3. https://github.com/kubernetes/kubernetes/pull/54219 - -4. https://github.com/kubernetes/kubernetes/pull/57268 - -5. https://github.com/kubernetes/kubernetes/pull/58052 - - -## Drawbacks [optional] - -None - -## Alternatives [optional] - -None +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
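Since the proxier above is described as building on the pure-Go docker/libnetwork IPVS bindings, the sketch below shows roughly what programming the ClusterIP example (10.102.128.4:3080 → 10.244.0.235:8080, round-robin, NAT) looks like through that library. Treat it as a rough sketch under the assumption that the library exposes `New`, `NewService`, and `NewDestination` with the field names shown; check the package's godoc before relying on it. It must run as root on a Linux node with the IPVS kernel modules loaded.

```go
package main

import (
	"log"
	"net"
	"syscall"

	"github.com/docker/libnetwork/ipvs"
)

func main() {
	// The handle talks to the kernel IPVS subsystem via netlink; "" means the
	// current network namespace. Requires root and the ip_vs modules.
	handle, err := ipvs.New("")
	if err != nil {
		log.Fatal(err)
	}
	defer handle.Close()

	// Virtual service: the Kubernetes ClusterIP and service port.
	svc := &ipvs.Service{
		AddressFamily: syscall.AF_INET,
		Address:       net.ParseIP("10.102.128.4"),
		Protocol:      syscall.IPPROTO_TCP,
		Port:          3080,
		SchedName:     "rr",       // round-robin, the proposed default scheduler
		Netmask:       0xffffffff, // /32 for an IPv4 virtual service
	}
	if err := handle.NewService(svc); err != nil {
		log.Fatal(err)
	}

	// Real server: one endpoint. No forwarding flags are set; in the kernel,
	// IPVS NAT (masquerade) corresponds to connection-flag value 0.
	dst := &ipvs.Destination{
		AddressFamily: syscall.AF_INET,
		Address:       net.ParseIP("10.244.0.235"),
		Port:          8080,
		Weight:        1,
	}
	if err := handle.NewDestination(svc, dst); err != nil {
		log.Fatal(err)
	}
}
```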
\ No newline at end of file diff --git a/keps/sig-network/0012-20180518-coredns-default-proposal.md b/keps/sig-network/0012-20180518-coredns-default-proposal.md index f4540704..cfd1f5fa 100644 --- a/keps/sig-network/0012-20180518-coredns-default-proposal.md +++ b/keps/sig-network/0012-20180518-coredns-default-proposal.md @@ -1,88 +1,4 @@ ---- -kep-number: 11 -title: Switch CoreDNS to the default DNS -authors: - - "@johnbelamaric" - - "@rajansandeep" -owning-sig: sig-network -participating-sigs: - - sig-cluster-lifecycle -reviewers: - - "@bowei" - - "@thockin" -approvers: - - "@thockin" -editor: "@rajansandeep" -creation-date: 2018-05-18 -last-updated: 2018-05-18 -status: provisional ---- - -# Switch CoreDNS to the default DNS - -## Table of Contents - -* [Summary](#summary) -* [Goals](#goals) -* [Proposal](#proposal) - * [User Cases](#use-cases) -* [Graduation Criteria](#graduation-criteria) -* [Implementation History](#implementation-history) - -## Summary - -CoreDNS is now well-established in Kubernetes as the DNS service, with CoreDNS starting as an alpha feature from Kubernetes v1.9 to now being GA in v1.11. -After successfully implementing the road-map defined [here](https://github.com/kubernetes/features/issues/427), CoreDNS is GA in Kubernetes v1.11, which can be installed as an alternate to kube-dns in tools like kubeadm, kops, minikube and kube-up. -Following the [KEP to graduate CoreDNS to GA](https://github.com/kubernetes/community/pull/1956), the purpose of this proposal is to make CoreDNS as the default DNS for Kubernetes, replacing kube-dns. - -## Goals -* Make CoreDNS the default DNS for Kubernetes for all the remaining install tools (kube-up, kops, minikube). -* Make CoreDNS available as an image in a Kubernetes repository (To Be Defined) and ensure a workflow/process to update the CoreDNS versions in the future. - This goal is carried over from the [previous KEP](https://github.com/kubernetes/community/pull/1956), in case it cannot be completed there. - -## Proposal - -The proposed solution is to enable CoreDNS as the default cluster service discovery DNS for Kubernetes. -Some of the most used deployment tools will be upgraded by the CoreDNS team, in cooperation with the owners of these tools, to be able to deploy CoreDNS as default: -* kubeadm (already done for Kubernetes v1.11) -* kube-up -* minikube -* kops - -For other tools, each maintainer would have to add the upgrade to CoreDNS. - -### Use Cases - -Use cases for CoreDNS has been well defined in the [previous KEP](https://github.com/kubernetes/community/pull/1956). -The following can be expected when CoreDNS is made the default DNS. - -#### Kubeadm - -* CoreDNS is already the default DNS from Kubernetes v1.11 and shall continue be the default DNS. -* In case users want to install kube-dns instead of CoreDNS, they have to set the feature-gate of CoreDNS to false. `--feature-gates=CoreDNS=false` - -#### Kube-up - -* CoreDNS will now become the default DNS. -* To install kube-dns in place of CoreDNS, set the environment variable `CLUSTER_DNS_CORE_DNS` to `false`. - -#### Minikube - -* CoreDNS to be enabled by default in the add-on manager, with kube-dns disabled by default. - -#### Kops - -* CoreDNS will now become the default DNS. - -## Graduation Criteria - -* Add CoreDNS image in a Kubernetes repository (To Be Defined) and ensure a workflow/process to update the CoreDNS versions in the future. 
-* Have a certain number (To Be Defined) of clusters of significant size (To Be Defined) adopting and running CoreDNS as their default DNS. - -## Implementation History - -* 20170912 - [Feature proposal](https://github.com/kubernetes/features/issues/427) for CoreDNS to be implemented as the default DNS in Kubernetes. -* 20171108 - Successfully released [CoreDNS as an Alpha feature-gate in Kubernetes v1.9](https://github.com/kubernetes/kubernetes/pull/52501). -* 20180226 - CoreDNS graduation to Incubation in CNCF. -* 20180305 - Support for Kube-dns configmap translation and move up [CoreDNS to Beta](https://github.com/kubernetes/kubernetes/pull/58828) for Kubernetes v1.10. -* 20180515 - CoreDNS was added as [GA and the default DNS in kubeadm](https://github.com/kubernetes/kubernetes/pull/63509) for Kubernetes v1.11. +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
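For reference, the per-installer opt-outs described in the CoreDNS proposal above can be sketched as shell commands. This is only a sketch: the kubeadm feature gate and the `CLUSTER_DNS_CORE_DNS` variable come from the proposal text, while attaching the gate to `kubeadm init` and the `cluster/kube-up.sh` script path are assumptions about typical usage.

```shell
# Keep kube-dns instead of CoreDNS when bootstrapping with kubeadm (per the proposal text)
kubeadm init --feature-gates=CoreDNS=false

# For kube-up based clusters, the proposal's opt-out is an environment variable
# (the script path assumes a standard kubernetes/kubernetes checkout)
CLUSTER_DNS_CORE_DNS=false ./cluster/kube-up.sh
```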
\ No newline at end of file diff --git a/keps/sig-network/0015-20180614-SCTP-support.md b/keps/sig-network/0015-20180614-SCTP-support.md index 4c16aaf4..cfd1f5fa 100644 --- a/keps/sig-network/0015-20180614-SCTP-support.md +++ b/keps/sig-network/0015-20180614-SCTP-support.md @@ -1,293 +1,4 @@ ---- -kep-number: 15 -title: SCTP support -authors: - - "@janosi" -owning-sig: sig-network -participating-sigs: - - sig-network -reviewers: - - "@thockin" -approvers: - - "@thockin" -editor: - - "@janosi" -creation-date: 2018-06-14 -last-updated: 2018-09-14 -status: implemented -see-also: - - PR64973 -replaces: -superseded-by: ---- - -# SCTP support - -## Table of Contents - -* [Table of Contents](#table-of-contents) -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) -* [Proposal](#proposal) - * [User Stories [optional]](#user-stories-optional) - * [Story 1](#story-1) - * [Story 2](#story-2) - * [Implementation Details/Notes/Constraints [optional]](#implementation-detailsnotesconstraints-optional) - * [Risks and Mitigations](#risks-and-mitigations) -* [Graduation Criteria](#graduation-criteria) -* [Implementation History](#implementation-history) -* [Drawbacks [optional]](#drawbacks-optional) -* [Alternatives [optional]](#alternatives-optional) - -## Summary - -The goal of the SCTP support feature is to enable the usage of the SCTP protocol in Kubernetes [Service][], [NetworkPolicy][], and [ContainerPort][]as an additional protocol value option beside the current TCP and UDP options. -SCTP is an IETF protocol specified in [RFC4960][], and it is used widely in telecommunications network stacks. -Once SCTP support is added as a new protocol option those applications that require SCTP as L4 protocol on their interfaces can be deployed on Kubernetes clusters on a more straightforward way. For example they can use the native kube-dns based service discovery, and their communication can be controlled on the native NetworkPolicy way. - -[Service]: https://kubernetes.io/docs/concepts/services-networking/service/ -[NetworkPolicy]: -https://kubernetes.io/docs/concepts/services-networking/network-policies/ -[ContainerPort]:https://kubernetes.io/docs/concepts/services-networking/connect-applications-service/#exposing-pods-to-the-cluster -[RFC4960]: https://tools.ietf.org/html/rfc4960 - - -## Motivation - -SCTP is a widely used protocol in telecommunications. It would ease the management and execution of telecommunication applications on Kubernetes if SCTP were added as a protocol option to Kubernetes. - -### Goals - -Add SCTP support to Kubernetes ContainerPort, Service and NetworkPolicy, so applications running in pods can use the native kube-dns based service discovery for SCTP based services, and their communication can be controlled via the native NetworkPolicy way. - -It is also a goal to enable ingress SCTP connections from clients outside the Kubernetes cluster, and to enable egress SCTP connections to servers outside the Kubernetes cluster. - -### Non-Goals - -It is not a goal here to add SCTP support to load balancers that are provided by cloud providers. The Kubernetes side implementation will not restrict the usage of SCTP as the protocol for the Services with type=LoadBalancer, but we do not implement the support of SCTP into the cloud specific load balancer implementations. - -It is not a goal to support multi-homed SCTP associations. 
Such a support also depends on the ability to manage multiple IP addresses for a pod, and in the case of Services with ClusterIP or NodePort the support of multi-homed associations would also require the support of NAT for multihomed associations in the SCTP related NF conntrack modules. - -## Proposal - -### User Stories [optional] - -#### Service with SCTP and Virtual IP -As a user of Kubernetes I want to define Services with Virtual IPs for my applications that use SCTP as L4 protocol on their interfaces,so client applications can use the services of my applications on top of SCTP via that Virtual IP. - -Example: -``` -kind: Service -apiVersion: v1 -metadata: - name: my-service -spec: - selector: - app: MyApp - ports: - - protocol: SCTP - port: 80 - targetPort: 9376 -``` - -#### Headless Service with SCTP -As a user of Kubernetes I want to define headless Services for my applications that use SCTP as L4 protocol on their interfaces, so client applications can discover my applications in kube-dns, or via any other service discovery methods that get information about endpoints via the Kubernetes API. - -Example: -``` -kind: Service -apiVersion: v1 -metadata: - name: my-service -spec: - selector: - app: MyApp - ClusterIP: "None" - ports: - - protocol: SCTP - port: 80 - targetPort: 9376 -``` -#### Service with SCTP without selector -As a user of Kubernetes I want to define Services without selector for my applications that use SCTP as L4 protocol on their interfaces, so I can implement my own service controllers if I want to extend the basic functionality of Kubernetes. - -Example: -``` -kind: Service -apiVersion: v1 -metadata: - name: my-service -spec: - ClusterIP: "None" - ports: - - protocol: SCTP - port: 80 - targetPort: 9376 -``` - -#### SCTP as container port protocol in Pod definition -As a user of Kubernetes I want to define hostPort for the SCTP based interfaces of my applications -Example: -``` -apiVersion: v1 -kind: Pod -metadata: - name: mypod -spec: - containers: - - name: container-1 - image: mycontainerimg - ports: - - name: diameter - protocol: SCTP - containerPort: 80 - hostPort: 80 -``` - -#### SCTP port accessible from outside the cluster - -As a user of Kubernetes I want to have the option that client applications that reside outside of the cluster can access my SCTP based services that run in the cluster. - -Example: -``` -kind: Service -apiVersion: v1 -metadata: - name: my-service -spec: - type: NodePort - selector: - app: MyApp - ports: - - protocol: SCTP - port: 80 - targetPort: 9376 -``` - -Example: -``` -kind: Service -apiVersion: v1 -metadata: - name: my-service -spec: - selector: - app: MyApp - ports: - - protocol: SCTP - port: 80 - targetPort: 9376 - externalIPs: - - 80.11.12.10 -``` - -#### NetworkPolicy with SCTP -As a user of Kubernetes I want to define NetworkPolicies for my applications that use SCTP as L4 protocol on their interfaces, so the network plugins that support SCTP can control the accessibility of my applications on the SCTP based interfaces, too. 
- -Example: -``` -apiVersion: networking.k8s.io/v1 -kind: NetworkPolicy -metadata: - name: myservice-network-policy - namespace: myapp -spec: - podSelector: - matchLabels: - role: myservice - policyTypes: - - Ingress - ingress: - - from: - - ipBlock: - cidr: 172.17.0.0/16 - except: - - 172.17.1.0/24 - - namespaceSelector: - matchLabels: - project: myproject - - podSelector: - matchLabels: - role: myclient - ports: - - protocol: SCTP - port: 7777 -``` -#### Userspace SCTP stack -As a user of Kubernetes I want to deploy and run my applications that use a userspace SCTP stack, and at the same time I want to define SCTP Services in the same cluster. I use a userspace SCTP stack because of the limitations of the kernel's SCTP support. For example: it's not possible to write an SCTP server that proxies/filters arbitrary SCTP streams using the sockets APIs and kernel SCTP. - -### Implementation Details/Notes/Constraints [optional] - -#### SCTP in Services -##### Kubernetes API modification -The Kubernetes API modification for Services to support SCTP is obvious. - -##### Services with host level ports - -The kube-proxy and the kubelet starts listening on the defined TCP or UDP port in case of Servies with ClusterIP or NodePort or externalIP, and in case of containers with HostPort defined. The goal of this is to reserve the port in question so no other host level process can use that by accident. When it comes to SCTP the agreement is that we do not follow this pattern. That is, Kubernetes will not listen on host level ports with SCTP as protocol. The reason for this decision is, that the current TCP and UDP related implementation is not perfect either, it has known gaps in some use cases, and in those cases this listening is not started. But no one complained about those gaps so most probably this port reservation via listening logic is not needed at all. - -##### Services with type=LoadBalancer - -For Services with type=LoadBalancer we expect that the cloud provider's load balancer API client in Kubernetes rejects the requests with unsupported protocol. - -#### SCTP support in Kube DNS -Kube DNS shall support SRV records with "_sctp" as "proto" value. According to our investigations, the DNS controller is very flexible from this perspective, and it can create SRV records with any protocol name. I.e. there is no need for additional implementation to achieve this goal. - -Example: - -``` -_diameter._sctp.my-service.default.svc.cluster.local. 30 IN SRV 10 100 1234 my-service.default.svc.cluster.local. -``` -#### SCTP in the Pod's ContainerPort -The Kubernetes API modification for the Pod is obvious. - -We support SCTP as protocol for any combinations of containerPort and hostPort. - -#### SCTP in NetworkPolicy -The Kubernetes API modification for the NetworkPolicy is obvious. - -In order to utilize the new protocol value the network plugin must support it. - -#### Interworking with applications that use a user space SCTP stack - -##### Problem definition -A userpace SCTP stack usually creates raw sockets with IPPROTO_SCTP. And as it is clearly highlighted in the [documentation of raw sockets][]: ->Raw sockets may tap all IP protocols in Linux, even protocols like ICMP or TCP which have a protocol module in the kernel. In this case, the packets are passed to both the kernel module and the raw socket(s). - -I.e. if both the kernel module (lksctp) and a userspace SCTP stack are active on the same node both receive the incoming SCTP packets according to the current [kernel][] logic. 
- -But in turn the SCTP kernel module will handle those packets that are actually destined to the raw socket as Out of the blue (OOTB) packets according to the rules defined in [RFC4960][]. I.e. the SCTP kernel module sends SCTP ABORT to the sender, and on that way it aborts the connections of the userspace SCTP stack. - -As we can see, a userspace SCTP stack cannot co-exist with the SCTP kernel module (lksctp) on the same node. That is, the loading of the SCTP kernel module must be avoided on nodes where such applications that use userspace SCTP stack are planned to be run. The SCTP kernel module loading is triggered when an application starts managing SCTP sockets via the standard socket API or via syscalls. - -In order to resolve this problem the solution was to dedicate nodes to userspace SCTP applications in the past. Such applications that would trigger the loading of the SCTP kernel module were not deployed on those nodes. - -##### The solution in the Kubernetes SCTP support implementation -Our main task here is to provide the same node level isolation possibility that was used in the past: i.e. to provide the option to dedicate some nodes to userspace SCTP applications, and ensure that the actions performed by Kubernetes (kubelet, kube-proxy) do not load the SCTP kernel modules on those dedicated nodes. - -On the Kubernetes side we solve this problem so, that we do not start listening on the SCTP ports defined for Servies with ClusterIP or NodePort or externalIP, neither in the case when containers with SCTP HostPort are defined. On this way we avoid the loading of the kernel module due to Kubernetes actions. - -On application side it is pretty easy to separate application pods that use a userspace SCTP stack from those application pods that use the kernel space SCTP stack: the usual nodeselector label based mechanism, or taints are there for this very purpose. - -NOTE! The handling of TCP and UDP Services does not change on those dedicated nodes. - -We propose the following solution: - -We describe in the Kubernetes documentation the mutually exclusive nature of userspace and kernel space SCTP stacks, and we would highlight, that the required separation of the userspace SCTP stack applications and the kernel module users shall be achieved with the usual nodeselector or taint based mechanisms. - - -[documentation of raw sockets]: http://man7.org/linux/man-pages/man7/raw.7.html -[kernel]: https://github.com/torvalds/linux/blob/0fbc4aeabc91f2e39e0dffebe8f81a0eb3648d97/net/ipv4/ip_input.c#L191 - -### Risks and Mitigations - -## Graduation Criteria - -## Implementation History - -## Drawbacks [optional] - -## Alternatives [optional] - +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
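As a small companion to the SCTP proposal above: the `_sctp` SRV record it expects kube-dns/CoreDNS to serve can be verified with an ordinary DNS lookup. The record name is taken verbatim from the proposal; running `dig` from a pod that uses the cluster DNS is an assumption.

```shell
# Resolve the SCTP SRV record for the example service from the proposal
dig +short _diameter._sctp.my-service.default.svc.cluster.local SRV
```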
\ No newline at end of file diff --git a/keps/sig-network/0030-nodelocal-dns-cache.md b/keps/sig-network/0030-nodelocal-dns-cache.md index 694a11e9..cfd1f5fa 100644 --- a/keps/sig-network/0030-nodelocal-dns-cache.md +++ b/keps/sig-network/0030-nodelocal-dns-cache.md @@ -1,215 +1,4 @@ ---- -kep-number: 30 -title: NodeLocal DNS Cache -authors: - - "@prameshj" -owning-sig: sig-network -participating-sigs: - - sig-network -reviewers: - - "@thockin" - - "@bowei" - - "@johnbelamaric" - - "@sdodson" -approvers: - - "@thockin" - - "@bowei" -editor: TBD -creation-date: 2018-10-05 -last-updated: 2018-10-30 -status: provisional ---- - -# NodeLocal DNS Cache - -## Table of Contents - -* [Table of Contents](#table-of-contents) -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) -* [Proposal](#proposal) - * [Risks and Mitigations](#risks-and-mitigations) -* [Graduation Criteria](#graduation-criteria) -* [Rollout Plan](#rollout-plan) -* [Implementation History](#implementation-history) -* [Drawbacks [optional]](#drawbacks-optional) -* [Alternatives [optional]](#alternatives-optional) - -[Tools for generating]: https://github.com/ekalinin/github-markdown-toc - -## Summary - -This proposal aims to improve DNS performance by running a dns caching agent on cluster nodes as a Daemonset. In today's architecture, pods in ClusterFirst DNS mode reach out to a kube-dns serviceIP for DNS queries. This is translated to a kube-dns endpoint via iptables rules added by kube-proxy. With this new architecture, pods will reach out to the dns caching agent running on the same node, thereby avoiding iptables DNAT rules and connection tracking. The local caching agent will query kube-dns for cache misses of cluster hostnames(cluster.local suffix by default). - - -## Motivation - -* With the current DNS architecture, it is possible that pods with the highest DNS QPS have to reach out to a different node, if there is no local kube-dns instance. -Having a local cache will help improve the latency in such scenarios. - -* Skipping iptables DNAT and connection tracking will help reduce [conntrack races](https://github.com/kubernetes/kubernetes/issues/56903) and avoid UDP DNS entries filling up conntrack table. - -* Connections from local caching agent to kube-dns can be upgraded to TCP. TCP conntrack entries will be removed on connection close in contrast with UDP entries that have to timeout ([default](https://www.kernel.org/doc/Documentation/networking/nf_conntrack-sysctl.txt) `nf_conntrack_udp_timeout` is 30 seconds) - -* Upgrading DNS queries from UDP to TCP would reduce tail latency attributed to dropped UDP packets and DNS timeouts usually up to 30s (3 retries + 10s timeout). Since the nodelocal cache listens for UDP DNS queries, applications don't need to be changed. - -* Metrics & visibility into dns requests at a node level. - -* Neg caching can be re-enabled, thereby reducing number of queries to kube-dns. 
- -* There are several open github issues proposing a local DNS Cache daemonset and scripts to run it: - * [https://github.com/kubernetes/kubernetes/issues/7470#issuecomment-248912603](https://github.com/kubernetes/kubernetes/issues/7470#issuecomment-248912603) - - * [https://github.com/kubernetes/kubernetes/issues/32749](https://github.com/kubernetes/kubernetes/issues/32749) - - * [https://github.com/kubernetes/kubernetes/issues/45363](https://github.com/kubernetes/kubernetes/issues/45363) - - -This shows that there is interest in the wider Kubernetes community for a solution similar to the proposal here. - - -### Goals - -Being able to run a dns caching agent as a Daemonset and get pods to use the local instance. Having visibility into cache stats and other metrics. - -### Non-Goals - -* Providing a replacement for kube-dns/CoreDNS. -* Changing the underlying protocol for DNS (e.g. to gRPC) - -## Proposal - -A nodeLocal dns cache runs on all cluster nodes. This is managed as an add-on, runs as a Daemonset. All pods using clusterDNS will now talk to the nodeLocal cache, which will query kube-dns in case of cache misses in cluster's configured DNS suffix and for all reverse lookups(in-addr.arpa and ip6.arpa). User-configured stubDomains will be passed on to this local agent. -The node's resolv.conf will be used by this local agent for all other cache misses. One benefit of doing the non-cluster lookups on the nodes from which they are happening, rather than the kube-dns instances, is better use of per-node DNS resources in cloud. For instance, in a 10-node cluster with 3 kube-dns instances, the 3 nodes running kube-dns will end up resolving all external hostnames and can exhaust QPS quota. Spreading the queries across the 10 nodes will help alleviate this. - -#### Daemonset and Listen Interface for caching agent - -The caching agent daemonset runs in hostNetwork mode in kube-system namespace with a Priority Class of “system-node-critical”. It listens for dns requests on a dummy interface created on the host. A separate ip address is assigned to this dummy interface, so that requests to kube-dns or any other custom service are not incorrectly intercepted by the caching agent. This will be a link-local ip address selected by the user. Each cluster node will have this dummy interface. This ip address will be passed on to kubelet via the --cluster-dns flag, if the feature is enabled. - -The selected link-local IP will be handled specially because of the NOTRACK rules described in the section below. - -#### iptables NOTRACK - -NOTRACK rules are added for connections to and from the nodelocal dns ip. Additional rules in FILTER table to whitelist these connections, since the INPUT and OUTPUT chains have a default DROP policy. - -The nodelocal cache process will create the dummy interface and iptables rules . It gets the nodelocal dns ip as a parameter, performs setup and listens for dns requests. The Daemonset runs in privileged securityContext since it needs to create this dummy interface and add iptables rules. - The cache process will also periodically ensure that the dummy interface and iptables rules are present, in the background. Rules need to be checked in the raw table and filter table. Rules in these tables do not grow with number of valid services. Services with no endpoints will have rules added in filter table to drop packets destined to these ip. 
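A minimal sketch of the NOTRACK and filter-table whitelist rules described above, assuming the user-selected link-local address for the dummy interface is 169.254.20.10 and showing only UDP port 53 (both are illustrative; TCP rules would be analogous):

```shell
# Skip connection tracking for queries sent to the node-local cache address...
iptables -t raw -A PREROUTING -d 169.254.20.10/32 -p udp --dport 53 -j NOTRACK
# ...and for replies generated by the local cache process
iptables -t raw -A OUTPUT -s 169.254.20.10/32 -p udp --sport 53 -j NOTRACK

# Whitelist the same traffic in the filter table, since INPUT and OUTPUT default to DROP here
iptables -A INPUT  -d 169.254.20.10/32 -p udp --dport 53 -j ACCEPT
iptables -A OUTPUT -s 169.254.20.10/32 -p udp --sport 53 -j ACCEPT
```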
The resource usage for periodic iptables check was measured by creating 2k services with no endpoints and running the nodelocal caching agent. Peak memory usage was 20Mi for the caching agent when it was responding to queries along with the periodic checks. This was measured using `kubectl top` command. More details on the testing are in the following section. - -[Proposal presentation](https://docs.google.com/presentation/d/1c43cZqbVhGAlw3dSNQIOGuvQmDfKaA2yiAPRoYpa6iY), also shared at the sig-networking meeting on 2018-10-04 - -Slide 5 has a diagram showing how the new dns cache fits in. - -#### Choice of caching agent - -The current plan is to run CoreDNS by default. Benchmark [ tests](https://github.com/kubernetes/perf-tests/tree/master/dns) were run using [Unbound dns server](https://www.nlnetlabs.nl/projects/unbound/about/) and CoreDNS. 2 more tests were added to query for 20 different services and to query several external hostnames. - -Tests were run on a 1.9.7 cluster with 2 nodes on GCE, using Unbound 1.7.3 and CoreDNS 1.2.3. -Resource limits for nodelocaldns daemonset was CPU - 50m, Memory 25Mi - -Resource usage and QPS were measured with a nanny process for Unbound/CoreDNS plugin adding iptables rules and ensuring that the rules exist, every minute. - -Caching was minimized in Unbound by setting: -msg-cache-size: 0 -rrset-cache-size: 0 -msg-cache-slabs:1 -rrset-cache-slabs:1 -Previous tests did not set the last 2 and there were quite a few unexpected cache hits. - -Caching was disabled in CoreDNS by skipping the cache plugin from Corefile. - -These are the results when dnsperf test was run with no QPS limit. In this mode, the tool sends queries until they start timing out. - -| Test Type | Program | Caching | QPS | -|-----------------------|---------|---------|------| -| Multiple services(20) | CoreDNS | Yes | 860 | -| Multiple services(20) | Unbound | Yes | 3030 | -| | | | | -| External queries | CoreDNS | Yes | 213 | -| External queries | Unbound | Yes | 115 | -| | | | | -| Single Service | CoreDNS | Yes | 834 | -| Single Service | Unbound | Yes | 3287 | -| | | | | -| Single NXDomain | CoreDNS | Yes | 816 | -| Single NXDomain | Unbound | Yes | 3136 | -| | | | | -| Multiple services(20) | CoreDNS | No | 859 | -| Multiple services(20) | Unbound | No | 1463 | -| | | | | -| External queries | CoreDNS | No | 180 | -| External queries | Unbound | No | 108 | -| | | | | -| Single Service | CoreDNS | No | 818 | -| Single Service | Unbound | No | 2992 | -| | | | | -| Single NXDomain | CoreDNS | No | 827 | -| Single NXDomain | Unbound | No | 2986 | - - -Peak memory usage was ~20 Mi for both Unbound and CoreDNS. - -For the single service and single NXDomain query, Unbound still had cache hits since caching could not be completely disabled. - -CoreDNS QPS was twice as much as Unbound for external queries. They were mostly unique hostnames from this file - [ftp://ftp.nominum.com/pub/nominum/dnsperf/data/queryfile-example-current.gz](ftp://ftp.nominum.com/pub/nominum/dnsperf/data/queryfile-example-current.gz) - -When multiple cluster services were queried with cache misses, Unbound was better(1463 vs 859), but not by a large factor. - -Unbound performs much better when all requests are cache hits. - -CoreDNS will be the local cache agent in the first release, after considering these reasons: - -* Better QPS numbers for external hostname queries -* Single process, no need for a separate nanny process -* Prometheus metrics already available, also we can get per zone stats. 
Unbound gives consolidated stats. -* Easier to make changes to the source code - - It is possible to run any program as caching agent by modifying the daemonset and configmap spec. Publishing an image with Unbound DNS can be added as a follow up. - -Based on the prototype/test results, these are the recommended defaults: -CPU request: 50m -Memory Limit : 25m - -CPU request can be dropped to a smaller value if QPS needs are lower. - -#### Metrics - -Per-zone metrics will be available via the metrics/prometheus plugin in CoreDNS. - - -### Risks and Mitigations - -Having the pods query the nodelocal cache introduces a single point of failure. - -* This is mitigated by having a livenessProbe to periodically ensure DNS is working. In case of upgrades, the recommendation is to drain the node before starting to upgrade the local instance. The user can also configure [customPodDNS](https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-config) pointing to clusterDNS ip for pods that cannot handle DNS disruption during upgrade. - -* The Daemonset is assigned a PriorityClass of "system-node-critical", to ensure it is not evicted. - -* Populating both the nodelocal cache ip address and kube-dns ip address in resolv.conf is not a reliable option. Depending on underlying implementation, this can result in kube-dns being queried only if cache ip does not repond, or both queried simultaneously. - - -## Graduation Criteria -TODO - -## Rollout Plan -This feature will be launched with Alpha support in the first release. Master versions v1.13 and above will deploy the new add-on. Node versions v1.13 and above will have kubelet code to modify pods' resolv.conf. Nodes running older versions will run the nodelocal daemonset, but it will not be used. The user can specify a custom dnsConfig to use this local cache dns server. - -## Implementation History - -* 2018-10-05 - Creation of the KEP -* 2018-10-30 - Follow up comments and choice of cache agent - -## Drawbacks [optional] - -Additional resource consumption for the Daemonset might not be necessary for clusters with low DNS QPS needs. - - -## Alternatives [optional] - -* The listen ip address for the dns cache could be a service ip. This ip address is obtained by creating a nodelocaldns service, with same endpoints as the clusterDNS service. Using the same endpoints as clusterDNS helps reduce DNS downtime in case of upgrades/restart. When no other special handling is provided, queries to the nodelocaldns ip will be served by kube-dns/CoreDNS pods. Kubelet takes the service name as an argument `--cluster-dns-svc=<namespace>/<svc name>`, looks up the ip address and populates pods' resolv.conf with this value instead of clusterDNS. -This approach works only for iptables mode of kube-proxy. This is because kube-proxy creates a dummy interface bound to all service IPs in ipvs mode and ipvs rules are added to load-balance between endpoints. The packet seems to get dropped if there are no endpoints. If there are endpoints, adding iptables rules does not bypass the ipvs loadbalancing rules. - -* A nodelocaldns service can be created with a hard requirement of same-node endpoint, once we have [this](https://github.com/kubernetes/community/pull/2846) supported. All the pods in the nodelocaldns daemonset will be endpoints, the one running locally will be selected. iptables rules to NOTRACK connections can still be added, in order to skip DNAT in the iptables kube-proxy implementation. 
- -* Instead of just a dns-cache, a full-fledged kube-dns instance can be run on all nodes. This will consume much more resources since each instance will also watch Services and Endpoints. +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-network/0031-20181017-kube-proxy-services-optional.md b/keps/sig-network/0031-20181017-kube-proxy-services-optional.md index 9e77d4a2..cfd1f5fa 100644 --- a/keps/sig-network/0031-20181017-kube-proxy-services-optional.md +++ b/keps/sig-network/0031-20181017-kube-proxy-services-optional.md @@ -1,127 +1,4 @@ ---- -kep-number: 31 -title: Make kube-proxy service abstraction optional -authors: - - "@bradhoekstra" -owning-sig: sig-network -participating-sigs: -reviewers: - - "@freehan" -approvers: - - "@thockin" -editor: "@bradhoekstra" -creation-date: 2018-10-17 -last-updated: 2018-11-12 -status: provisional -see-also: -replaces: -superseded-by: ---- - -# Make kube-proxy service abstraction optional - -## Table of Contents - -* [Table of Contents](#table-of-contents) -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) -* [Proposal](#proposal) - * [User Stories](#user-stories) - * [Story 1](#story-1) - * [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints) - * [Design](#design) - * [Testing](#testing) - * [Risks and Mitigations](#risks-and-mitigations) -* [Graduation Criteria](#graduation-criteria) - -## Summary - -In a cluster that has a service mesh a lot of the work being done by kube-proxy is redundant and wasted. -Specifically, services that are only reached via other services in the mesh will never use the service abstraction implemented by kube-proxy in iptables (or ipvs). -By informing the kube-proxy of this, we can lighten the work it is doing and the burden on its proxy backend. - -## Motivation - -The motivation for the enhancement is to allow higher scalability in large clusters with lots of services that are making use of a service mesh. - -### Goals - -The goal is to reduce the load on: -* The kube-proxy having to deserialize and process all services and endpoints -* The backend system (e.g. iptables) for whichever proxy mode kube-proxy is using - -### Non-Goals - -* Making sure the service is still routable via the service mesh -* Preserving any kube-proxy functionality for any intentionally disabled Service, including but not limited to: externalIPs, external LB routing, nodePorts, externalTrafficPolicy, healthCheckNodePort, UDP, SCTP - -## Proposal - -### User Stories - -#### Story 1 - -As a cluster operator, operating a cluster using a service mesh I want to be able to disable the kube-proxy service implementation for services in that mesh to reduce overall load on the whole cluster - -### Implementation Details/Notes/Constraints - -#### Overview - -In a cluster where a service is only accessed via other applications in the service mesh the work that kube-proxy does to program the proxy (e.g. iptables) for that service is duplicated and unused. The service mesh itself handles load balancing for the service VIP. This case is often true in the standard service mesh setup of utilizing ingress/egress gateways, such that services are not directly exposed outside the cluster. In this setup, application services rarely make use of other Service features such as externalIPs, external LB routing, nodePorts, externalTrafficPolicy, healthCheckNodePort, UDP, SCTP. We can optimize this cluster by giving kube-proxy a way to not have to perform the duplicate work for these services. - -It is important for overall scalability that kube-proxy does not receive data for Service/Endpoints objects that it is not going to affect. 
This can reduce load on the kube-proxy and the network by never receiving the updates in the first place. - -The proposal is to make this feature available by annotating the Service object with this label: `service.kubernetes.io/service-proxy-name`. If this label key is set, with any value, the associated Endpoints object will automatically inherit that label from the Service object as well. - -When this label is set, kube-proxy will behave as if that service does not exist. None of the functionality that kube-proxy provides will be available for that service. - -kube-proxy will properly implement this label both at object creation and on dynamic addition/removal/updates of this label, either providing functionality or not for the service based on the latest version on the object. - -It is optional for other service proxy implementations (besides kube-proxy) to implement this feature. They may ignore this value and still remain conformant with kubernetes services. - -It is expected that this feature will mainly be used on large clusters with lots (>1000) of services. Any use of this feature in a smaller cluster will have negligible impact. - -The envisioned cluster that will make use of this feature looks something like the following: -* Most/all traffic from outside the cluster is handled by gateways, such that each service in the cluster does not need a nodePort -* These small number of entry points into the cluster are a part of the service mesh -* There are many micro-services in the cluster, all a part of the service mesh, that are only accessed from inside the service mesh - -Higher level frameworks built on top of service meshes, such as [Knative](https://github.com/knative/docs), will be able to enable this feature by default due to having a more controlled application/service model and being reliant on the service mesh. - -#### Design - -Currently, when ProxyServer starts up it creates informers for all Service (ServiceConfig) and Endpoints (EndpointsConfig) objects using a single shared informer factory. - -The new design will simply add a LabelSelector filter to the shared informer factory, such that objects with the above label are filtered out by the API server: -```diff -- informerFactory := informers.NewSharedInformerFactory(s.Client, s.ConfigSyncPeriod) -+ informerFactory := informers.NewSharedInformerFactoryWithOptions(s.Client, s.ConfigSyncPeriod, -+ informers.WithTweakListOptions(func(options *v1meta.ListOptions) { -+ options.LabelSelector = "!service.kubernetes.io/service-proxy-name" -+ })) -``` - -This code will also handle the dynamic label update case. When the label selector is matched (service is enabled) an 'add' event will be generated by the informer. When the label selector is not matched (service is disabled) a 'delete' event will be generated by the informer. - -#### Testing - -The following cases should be tested. In each case, make sure that services are added/removed from iptables (or other) as expected: -* Adding/removing services/endpoints with and without the above label -* Adding/removing the above label from existing services/endpoints - -### Risks and Mitigations - -We will keep the existing behaviour enabled by default, and only disable the kube-proxy service proxy when the service contains this new label. - -This will have no effect on alternate service proxy implementations since they will not handle this label. 
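As a concrete illustration of the opt-in described above, a mesh operator could set the label with `kubectl`; the Service name and the label value are placeholders, since any value for the key disables kube-proxy handling for that Service:

```shell
# Tell kube-proxy to ignore this Service (the value is arbitrary; "istio" is just an example)
kubectl label service my-mesh-service service.kubernetes.io/service-proxy-name=istio

# Removing the label restores normal kube-proxy handling for the Service
kubectl label service my-mesh-service service.kubernetes.io/service-proxy-name-
```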
- -## Graduation Criteria - -N/A - -## Implementation History - -- 2018-10-17 - This KEP was created -- 2018-11-12 - KEP updated, including approver/reviewer updates +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-network/README.md b/keps/sig-network/README.md index cdd5348f..cfd1f5fa 100644 --- a/keps/sig-network/README.md +++ b/keps/sig-network/README.md @@ -1,3 +1,4 @@ -# SIG Network KEPs - -This directory contains KEPs related to SIG Network. +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-node/0008-20180430-promote-sysctl-annotations-to-fields.md b/keps/sig-node/0008-20180430-promote-sysctl-annotations-to-fields.md index 4a2090a1..cfd1f5fa 100644 --- a/keps/sig-node/0008-20180430-promote-sysctl-annotations-to-fields.md +++ b/keps/sig-node/0008-20180430-promote-sysctl-annotations-to-fields.md @@ -1,225 +1,4 @@ ---- -kep-number: 8 -title: Protomote sysctl annotations to fields -authors: - - "@ingvagabund" -owning-sig: sig-node -participating-sigs: - - sig-auth -reviewers: - - "@sjenning" - - "@derekwaynecarr" -approvers: - - "@sjenning " - - "@derekwaynecarr" -editor: -creation-date: 2018-04-30 -last-updated: 2018-05-02 -status: provisional -see-also: -replaces: -superseded-by: ---- - -# Promote sysctl annotations to fields - -## Table of Contents - -* [Promote sysctl annotations to fields](#promote-sysctl-annotations-to-fields) - * [Table of Contents](#table-of-contents) - * [Summary](#summary) - * [Motivation](#motivation) - * [Promote annotations to fields](#promote-annotations-to-fields) - * [Promote --experimental-allowed-unsafe-sysctls kubelet flag to kubelet config api option](#promote---experimental-allowed-unsafe-sysctls-kubelet-flag-to-kubelet-config-api-option) - * [Gate the feature](#gate-the-feature) - * [Proposal](#proposal) - * [User Stories](#user-stories) - * [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints) - * [Risks and Mitigations](#risks-and-mitigations) - * [Graduation Criteria](#graduation-criteria) - * [Implementation History](#implementation-history) - -## Summary - -Setting the `sysctl` parameters through annotations provided a successful story -for defining better constraints of running applications. -The `sysctl` feature has been tested by a number of people without any serious -complaints. Promoting the annotations to fields (i.e. to beta) is another step in making the -`sysctl` feature closer towards the stable API. - -Currently, the `sysctl` provides `security.alpha.kubernetes.io/sysctls` and `security.alpha.kubernetes.io/unsafe-sysctls` annotations that can be used -in the following way: - ```yaml - apiVersion: v1 - kind: Pod - metadata: - name: sysctl-example - annotations: - security.alpha.kubernetes.io/sysctls: kernel.shm_rmid_forced=1 - security.alpha.kubernetes.io/unsafe-sysctls: net.ipv4.route.min_pmtu=1000,kernel.msgmax=1 2 3 - spec: - ... - ``` - - The goal is to transition into native fields on pods: - - ```yaml - apiVersion: v1 - kind: Pod - metadata: - name: sysctl-example - spec: - securityContext: - sysctls: - - name: kernel.shm_rmid_forced - value: 1 - - name: net.ipv4.route.min_pmtu - value: 1000 - unsafe: true - - name: kernel.msgmax - value: "1 2 3" - unsafe: true - ... 
- ``` - -The `sysctl` design document with more details and rationals is available at [design-proposals/node/sysctl.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/sysctl.md#pod-api-changes) - -## Motivation - -As mentioned in [contributors/devel/api_changes.md#alpha-field-in-existing-api-version](https://github.com/kubernetes/community/blob/master/contributors/devel/api_changes.md#alpha-field-in-existing-api-version): - -> Previously, annotations were used for experimental alpha features, but are no longer recommended for several reasons: -> -> They expose the cluster to "time-bomb" data added as unstructured annotations against an earlier API server (https://issue.k8s.io/30819) -> They cannot be migrated to first-class fields in the same API version (see the issues with representing a single value in multiple places in backward compatibility gotchas) -> -> The preferred approach adds an alpha field to the existing object, and ensures it is disabled by default: -> -> ... - -The annotations as a means to set `sysctl` are no longer necessary. -The original intent of annotations was to provide additional description of Kubernetes -objects through metadata. -It's time to separate the ability to annotate from the ability to change sysctls settings -so a cluster operator can elevate the distinction between experimental and supported usage -of the feature. - -### Promote annotations to fields - -* Introduce native `sysctl` fields in pods through `spec.securityContext.sysctl` field as: - - ```yaml - sysctl: - - name: SYSCTL_PATH_NAME - value: SYSCTL_PATH_VALUE - unsafe: true # optional field - ``` - -* Introduce native `sysctl` fields in [PSP](https://kubernetes.io/docs/concepts/policy/pod-security-policy/) as: - - ```yaml - apiVersion: v1 - kind: PodSecurityPolicy - metadata: - name: psp-example - spec: - sysctls: - - kernel.shmmax - - kernel.shmall - - net.* - ``` - - More examples at [design-proposals/node/sysctl.md#allowing-only-certain-sysctls](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/sysctl.md#allowing-only-certain-sysctls) - -### Promote `--experimental-allowed-unsafe-sysctls` kubelet flag to kubelet config api option - -As there is no longer a need to consider the `sysctl` feature experimental, -the list of unsafe sysctls can be configured accordingly through: - -```go -// KubeletConfiguration contains the configuration for the Kubelet -type KubeletConfiguration struct { - ... - // Whitelist of unsafe sysctls or unsafe sysctl patterns (ending in *). - // Default: nil - // +optional - AllowedUnsafeSysctls []string `json:"allowedUnsafeSysctls,omitempty"` -} -``` - -Upstream issue: https://github.com/kubernetes/kubernetes/issues/61669 - -### Gate the feature - -As the `sysctl` feature stabilizes, it's time to gate the feature [1] and enable it by default. - -* Expected feature gate key: `Sysctls` -* Expected default value: `true` - -With the `Sysctl` feature enabled, both sysctl fields in `Pod` and `PodSecurityPolicy` -and the whitelist of unsafed sysctls are acknowledged. -If disabled, the fields and the whitelist are just ignored. - -[1] https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/ - -## Proposal - -This is where we get down to the nitty gritty of what the proposal actually is. 
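Before the user stories, a brief operational sketch of the kubelet-side settings discussed above. The gate key and flag name are taken from the sections above; the sysctl patterns are placeholders, and all other kubelet flags are omitted:

```shell
# Enable the Sysctls feature gate and whitelist two unsafe sysctls on a node (patterns are examples)
kubelet \
  --feature-gates=Sysctls=true \
  --experimental-allowed-unsafe-sysctls='kernel.msg*,net.ipv4.route.min_pmtu'
```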
- -### User Stories - -* As a cluster admin, I want to have the `sysctl` feature versioned so I can assure backward compatibility - and proper transformation from the versioned to the internal representation and back. -* As a cluster admin, I want to be confident the `sysctl` feature is stable enough and well supported so that - applications are properly isolated. -* As a cluster admin, I want to be able to apply the `sysctl` constraints on the cluster level so - I can define the default constraints for all pods. - -### Implementation Details/Notes/Constraints - -Extending the `PodSecurityContext` struct with a `Sysctls` field: - -```go -// PodSecurityContext holds pod-level security attributes and common container settings. -// Some fields are also present in container.securityContext. Field values of -// container.securityContext take precedence over field values of PodSecurityContext. -type PodSecurityContext struct { - ... - // Sysctls is the list of sysctls to be set on the pod. - Sysctls []Sysctl `json:"sysctls,omitempty"` -} -``` - -Extending the `PodSecurityPolicySpec` struct with a `Sysctls` field: - -```go -// PodSecurityPolicySpec defines the policy enforced on sysctls. -type PodSecurityPolicySpec struct { - ... - // Sysctls is a whitelist of allowed sysctls in a pod spec. - Sysctls []Sysctl `json:"sysctls,omitempty"` -} -``` - -The implementation will follow the steps in [devel/api_changes.md#alpha-field-in-existing-api-version](https://github.com/kubernetes/community/blob/master/contributors/devel/api_changes.md#alpha-field-in-existing-api-version). - -Validation checks were implemented as part of [#27180](https://github.com/kubernetes/kubernetes/pull/27180). - -### Risks and Mitigations - -We need to ensure backward compatibility, i.e. object specifications with `sysctl` annotations -must still work after the graduation. - -## Graduation Criteria - -* API changes allowing the pod-scoped `sysctl` settings to be configured via the `spec.securityContext` field. -* API changes allowing the cluster-scoped `sysctl` whitelist to be configured via the `PodSecurityPolicy` object. -* Promote the `--experimental-allowed-unsafe-sysctls` kubelet flag to a kubelet config API option. -* Feature gate enabled by default. -* e2e tests. - -## Implementation History - -The `sysctl` feature is tracked as part of [features#34](https://github.com/kubernetes/features/issues/34). -This is one of the goals to promote the annotations to fields. +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-node/0009-node-heartbeat.md b/keps/sig-node/0009-node-heartbeat.md index f80b9609..cfd1f5fa 100644 --- a/keps/sig-node/0009-node-heartbeat.md +++ b/keps/sig-node/0009-node-heartbeat.md @@ -1,392 +1,4 @@ ---- -kep-number: 8 -title: Efficient Node Heartbeat -authors: - - "@wojtek-t" - - "with input from @bgrant0607, @dchen1107, @yujuhong, @lavalamp" -owning-sig: sig-node -participating-sigs: - - sig-scalability - - sig-apimachinery - - sig-scheduling -reviewers: - - "@deads2k" - - "@lavalamp" -approvers: - - "@dchen1107" - - "@derekwaynecarr" -editor: TBD -creation-date: 2018-04-27 -last-updated: 2018-04-27 -status: implementable -see-also: - - https://github.com/kubernetes/kubernetes/issues/14733 - - https://github.com/kubernetes/kubernetes/pull/14735 -replaces: - - n/a -superseded-by: - - n/a ---- - -# Efficient Node Heartbeats - -## Table of Contents - -Table of Contents -================= - -* [Efficient Node Heartbeats](#efficient-node-heartbeats) - * [Table of Contents](#table-of-contents) - * [Summary](#summary) - * [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) - * [Proposal](#proposal) - * [Risks and Mitigations](#risks-and-mitigations) - * [Graduation Criteria](#graduation-criteria) - * [Implementation History](#implementation-history) - * [Alternatives](#alternatives) - * [Dedicated “heartbeat” object instead of “leader election” one](#dedicated-heartbeat-object-instead-of-leader-election-one) - * [Events instead of dedicated heartbeat object](#events-instead-of-dedicated-heartbeat-object) - * [Reuse the Component Registration mechanisms](#reuse-the-component-registration-mechanisms) - * [Split Node object into two parts at etcd level](#split-node-object-into-two-parts-at-etcd-level) - * [Delta compression in etcd](#delta-compression-in-etcd) - * [Replace etcd with other database](#replace-etcd-with-other-database) - -## Summary - -Node heartbeats are necessary for correct functioning of Kubernetes cluster. -This proposal makes them significantly cheaper from both scalability and -performance perspective. - -## Motivation - -While running different scalability tests we observed that in big enough clusters -(more than 2000 nodes) with non-trivial number of images used by pods on all -nodes (10-15), we were hitting etcd limits for its database size. That effectively -means that etcd enters "alert mode" and stops accepting all write requests. - -The underlying root cause is combination of: - -- etcd keeping both current state and transaction log with copy-on-write -- node heartbeats being pontetially very large objects (note that images - are only one potential problem, the second are volumes and customers - want to mount 100+ volumes to a single node) - they may easily exceed 15kB; - even though the patch send over network is small, in etcd we store the - whole Node object -- Kubelet sending heartbeats every 10s - -This proposal presents a proper solution for that problem. - - -Note that currently (by default): - -- Lack of NodeStatus update for `<node-monitor-grace-period>` (default: 40s) - results in NodeController marking node as NotReady (pods are no longer - scheduled on that node) -- Lack of NodeStatus updates for `<pod-eviction-timeout>` (default: 5m) - results in NodeController starting pod evictions from that node - -We would like to preserve that behavior. 
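For reference, the two timeouts above map to kube-controller-manager flags; a sketch with the quoted defaults (all other flags omitted):

```shell
# NodeController timeouts referenced above (values shown are the defaults)
kube-controller-manager \
  --node-monitor-grace-period=40s \
  --pod-eviction-timeout=5m
```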
- - -### Goals - -- Reduce size of etcd by making node heartbeats cheaper - -### Non-Goals - -The following are nice-to-haves, but not primary goals: - -- Reduce resource usage (cpu/memory) of control plane (e.g. due to processing - less and/or smaller objects) -- Reduce watch-related load on Node objects - -## Proposal - -We propose introducing a new `Lease` built-in API in the newly create API group -`coordination.k8s.io`. To make it easily reusable for other purposes it will -be namespaced. Its schema will be as following: - -``` -type Lease struct { - metav1.TypeMeta `json:",inline"` - // Standard object's metadata. - // More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#metadata - // +optional - ObjectMeta metav1.ObjectMeta `json:"metadata,omitempty"` - - // Specification of the Lease. - // More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#spec-and-status - // +optional - Spec LeaseSpec `json:"spec,omitempty"` -} - -type LeaseSpec struct { - HolderIdentity string `json:"holderIdentity"` - LeaseDurationSeconds int32 `json:"leaseDurationSeconds"` - AcquireTime metav1.MicroTime `json:"acquireTime"` - RenewTime metav1.MicroTime `json:"renewTime"` - LeaseTransitions int32 `json:"leaseTransitions"` -} -``` - -The Spec is effectively of already existing (and thus proved) [LeaderElectionRecord][]. -The only difference is using `MicroTime` instead of `Time` for better precision. -That would hopefully allow us go get directly to Beta. - -We will use that object to represent node heartbeat - for each Node there will -be a corresponding `Lease` object with Name equal to Node name in a newly -created dedicated namespace (we considered using `kube-system` namespace but -decided that it's already too overloaded). -That namespace should be created automatically (similarly to "default" and -"kube-system", probably by NodeController) and never be deleted (so that nodes -don't require permission for it). - -We considered using CRD instead of built-in API. However, even though CRDs are -`the new way` for creating new APIs, they don't yet have versioning support -and are significantly less performant (due to lack of protobuf support yet). -We also don't know whether we could seamlessly transition storage from a CRD -to a built-in API if we ran into a performance or any other problems. -As a result, we decided to proceed with built-in API. - - -With this new API in place, we will change Kubelet so that: - -1. Kubelet is periodically computing NodeStatus every 10s (at it is now), but that will - be independent from reporting status -1. Kubelet is reporting NodeStatus if: - - there was a meaningful change in it (initially we can probably assume that every - change is meaningful, including e.g. images on the node) - - or it didn’t report it over last `node-status-update-period` seconds -1. Kubelet creates and periodically updates its own Lease object and frequency - of those updates is independent from NodeStatus update frequency. - -In the meantime, we will change `NodeController` to treat both updates of NodeStatus -object as well as updates of the new `Lease` object corresponding to a given -node as healthiness signal from a given Kubelet. This will make it work for both old -and new Kubelets. - -We should also: - -1. audit all other existing core controllers to verify if they also don’t require - similar changes in their logic ([ttl controller][] being one of the examples) -1. change controller manager to auto-register that `Lease` CRD -1. 
ensure that `Lease` resource is deleted when corresponding node is - deleted (probably via owner references) -1. [out-of-scope] migrate all LeaderElection code to use that CRD - -Once all the code changes are done, we will: - -1. start updating `Lease` object every 10s by default, at the same time - reducing frequency of NodeStatus updates initially to 40s by default. - We will reduce it further later. - Note that it doesn't reduce frequency by which Kubelet sends "meaningful" - changes - it only impacts the frequency of "lastHeartbeatTime" changes. - <br> TODO: That still results in higher average QPS. It should be acceptable but - needs to be verified. -1. announce that we are going to reduce frequency of NodeStatus updates further - and give people 1-2 releases to switch their code to use `Lease` - object (if they relied on frequent NodeStatus changes) -1. further reduce NodeStatus updates frequency to not less often than once per - 1 minute. - We can’t stop periodically updating NodeStatus as it would be API breaking change, - but it’s fine to reduce its frequency (though we should continue writing it at - least once per eviction period). - - -To be considered: - -1. We may consider reducing frequency of NodeStatus updates to once every 5 minutes - (instead of 1 minute). That would help with performance/scalability even more. - Caveats: - - NodeProblemDetector is currently updating (some) node conditions every 1 minute - (unconditionally, because lastHeartbeatTime always changes). To make reduction - of NodeStatus updates frequency really useful, we should also change NPD to - work in a similar mode (check periodically if condition changes, but report only - when something changed or no status was reported for a given time) and decrease - its reporting frequency too. - - In general, we recommend to keep frequencies of NodeStatus reporting in both - Kubelet and NodeProblemDetector in sync (once all changes will be done) and - that should be reflected in [NPD documentation][]. - - Note that reducing frequency to 1 minute already gives us almost 6x improvement. - It seems more than enough for any foreseeable future assuming we won’t - significantly increase the size of object Node. - Note that if we keep adding node conditions owned by other components, the - number of writes of Node object will go up. But that issue is separate from - that proposal. - -Other notes: - -1. Additional advantage of using Lease for that purpose would be the - ability to exclude it from audit profile and thus reduce the audit logs footprint. - -[LeaderElectionRecord]: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/tools/leaderelection/resourcelock/interface.go#L37 -[ttl controller]: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/ttl/ttl_controller.go#L155 -[NPD documentation]: https://kubernetes.io/docs/tasks/debug-application-cluster/monitor-node-health/ -[kubernetes/kubernetes#63667]: https://github.com/kubernetes/kubernetes/issues/63677 - -### Risks and Mitigations - -Increasing default frequency of NodeStatus updates may potentially break clients -relying on frequent Node object updates. However, in non-managed solutions, customers -will still be able to restore previous behavior by setting appropriate flag values. -Thus, changing defaults to what we recommend is the path to go with. - -## Graduation Criteria - -The API can be immediately promoted to Beta, as the API is effectively a copy of -already existing LeaderElectionRecord. 
It will be promoted to GA once it's gone -a sufficient amount of time as Beta with no changes. - -The changes in components logic (Kubelet, NodeController) should be done behind -a feature gate. We suggest making that enabled by default once the feature is -implemented. - -## Implementation History - -- RRRR-MM-DD: KEP Summary, Motivation and Proposal merged - -## Alternatives - -We considered a number of alternatives, most important mentioned below. - -### Dedicated “heartbeat” object instead of “leader election” one - -Instead of introducing and using “lease” object, we considered -introducing a dedicated “heartbeat” object for that purpose. Apart from that, -all the details about the solution remain pretty much the same. - -Pros: - -- Conceptually easier to understand what the object is for - -Cons: - -- Introduces a new, narrow-purpose API. Lease is already used by other - components, implemented using annotations on Endpoints and ConfigMaps. - -### Events instead of dedicated heartbeat object - -Instead of introducing a dedicated object, we considered using “Event” object -for that purpose. At the high-level the solution looks very similar. -The differences from the initial proposal are: - -- we use existing “Event” api instead of introducing a new API -- we create a dedicated namespace; events that should be treated as healthiness - signal by NodeController will be written by Kubelets (unconditionally) to that - namespace -- NodeController will be watching only Events from that namespace to avoid - processing all events in the system (the volume of all events will be huge) -- dedicated namespace also helps with security - we can give access to write to - that namespace only to Kubelets - -Pros: - -- No need to introduce new API - - We can use that approach much earlier due to that. -- We already need to optimize event throughput - separate etcd instance we have - for them may help with tuning -- Low-risk roll-forward/roll-back: no new objects is involved (node controller - starts watching events, kubelet just reduces the frequency of heartbeats) - -Cons: - -- Events are conceptually “best-effort” in the system: - - they may be silently dropped in case of problems in the system (the event recorder - library doesn’t retry on errors, e.g. to not make things worse when control-plane - is starved) - - currently, components reporting events don’t even know if it succeeded or not (the - library is built in a way that you throw the event into it and are not notified if - that was successfully submitted or not). - Kubelet sending any other update has full control on how/if retry errors. - - lack of fairness mechanisms means that even when some events are being successfully - send, there is no guarantee that any event from a given Kubelet will be submitted - over a given time period - So this would require a different mechanism of reporting those “heartbeat” events. -- Once we have “request priority” concept, I think events should have the lowest one. - Even though no particular heartbeat is important, guarantee that some heartbeats will - be successfully send it crucial (not delivering any of them will result in unnecessary - evictions or not-scheduling to a given node). So heartbeats should be of the highest - priority. OTOH, node heartbeats are one of the most important things in the system - (not delivering them may result in unnecessary evictions), so they should have the - highest priority. 
-- No core component in the system is currently watching events - - it would make system’s operation harder to explain -- Users watch Node objects for heartbeats (even though we didn’t recommend it). - Introducing a new object for the purpose of heartbeat will allow those users to - migrate, while using events for that purpose breaks that ability. (Watching events - may put us in tough situation also from performance reasons.) -- Deleting all events (e.g. event etcd failure + playbook response) should continue to - not cause a catastrophic failure and the design will need to account for this. - -### Reuse the Component Registration mechanisms - -Kubelet is one of control-place components (shared controller). Some time ago, Component -Registration proposal converged into three parts: - -- Introducing an API for registering non-pod endpoints, including readiness information: #18610 -- Changing endpoints controller to also watch those endpoints -- Identifying some of those endpoints as “components” - -We could reuse that mechanism to represent Kubelets as non-pod endpoint API. - -Pros: - -- Utilizes desired API - -Cons: - -- Requires introducing that new API -- Stabilizing the API would take some time -- Implementing that API requires multiple changes in different components - -### Split Node object into two parts at etcd level - -We may stick to existing Node API and solve the problem at storage layer. At the -high level, this means splitting the Node object into two parts in etcd (frequently -modified one and the rest). - -Pros: - -- No need to introduce new API -- No need to change any components other than kube-apiserver - -Cons: - -- Very complicated to support watch -- Not very generic (e.g. splitting Spec and Status doesn’t help, it needs to be just - heartbeat part) -- [minor] Doesn’t reduce amount of data that should be processed in the system (writes, - reads, watches, …) - -### Delta compression in etcd - -An alternative for the above can be solving this completely at the etcd layer. To -achieve that, instead of storing full updates in etcd transaction log, we will just -store “deltas” and snapshot the whole object only every X seconds/minutes. - -Pros: - -- Doesn’t require any changes to any Kubernetes components - -Cons: - -- Computing delta is tricky (etcd doesn’t understand Kubernetes data model, and - delta between two protobuf-encoded objects is not necessary small) -- May require a major rewrite of etcd code and not even be accepted by its maintainers -- More expensive computationally to get an object in a given resource version (which - is what e.g. watch is doing) - -### Replace etcd with other database - -Instead of using etcd, we may also consider using some other open-source solution. - -Pros: - -- Doesn’t require new API - -Cons: - -- We don’t even know if there exists solution that solves our problems and can be used. -- Migration will take us years. +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
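For illustration, a minimal sketch of the renewal loop a Kubelet could run against the proposed `Lease` schema from the heartbeat proposal above is shown here. The `leaseClient` interface, the package layout, and the assumption that the lease duration is four renewal intervals (40s for the default 10s period) are placeholders for this sketch, not part of the proposal or of any existing client library.

```go
package heartbeat

import (
	"fmt"
	"time"
)

// LeaseSpec mirrors the schema proposed above; a real implementation would
// use the generated coordination.k8s.io types (with metav1.MicroTime) instead.
type LeaseSpec struct {
	HolderIdentity       string
	LeaseDurationSeconds int32
	AcquireTime          time.Time
	RenewTime            time.Time
	LeaseTransitions     int32
}

// leaseClient is a placeholder for whatever client the Kubelet uses to create
// and update its Lease object in the dedicated node-lease namespace.
type leaseClient interface {
	UpdateLease(name string, spec LeaseSpec) error
}

// renewNodeLease updates RenewTime on the node's Lease every renewInterval,
// independently of how often NodeStatus is computed and reported.
func renewNodeLease(c leaseClient, nodeName string, renewInterval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(renewInterval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			spec := LeaseSpec{
				HolderIdentity: nodeName,
				// Assumption: the lease is considered held for four renewal
				// intervals, e.g. 40s for the default 10s renewal period.
				LeaseDurationSeconds: int32(4 * renewInterval / time.Second),
				RenewTime:            time.Now(),
			}
			if err := c.UpdateLease(nodeName, spec); err != nil {
				// Missing a single renewal is tolerable; the NodeController
				// only needs some heartbeat within its grace period.
				fmt.Printf("failed to renew lease for node %s: %v\n", nodeName, err)
			}
		}
	}
}
```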
\ No newline at end of file diff --git a/keps/sig-node/0014-runtime-class.md b/keps/sig-node/0014-runtime-class.md index 1d1cac28..cfd1f5fa 100644 --- a/keps/sig-node/0014-runtime-class.md +++ b/keps/sig-node/0014-runtime-class.md @@ -1,399 +1,4 @@ ---- -kep-number: 14 -title: Runtime Class -authors: - - "@tallclair" -owning-sig: sig-node -participating-sigs: - - sig-architecture -reviewers: - - dchen1107 - - derekwaynecarr - - yujuhong -approvers: - - dchen1107 - - derekwaynecarr -creation-date: 2018-06-19 -status: implementable ---- - -# Runtime Class - -## Table of Contents - -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non\-Goals](#non-goals) - * [User Stories](#user-stories) -* [Proposal](#proposal) - * [API](#api) - * [Runtime Handler](#runtime-handler) - * [Versioning, Updates, and Rollouts](#versioning-updates-and-rollouts) - * [Implementation Details](#implementation-details) - * [Risks and Mitigations](#risks-and-mitigations) -* [Graduation Criteria](#graduation-criteria) -* [Implementation History](#implementation-history) -* [Appendix](#appendix) - * [Examples of runtime variation](#examples-of-runtime-variation) - -## Summary - -`RuntimeClass` is a new cluster-scoped resource that surfaces container runtime properties to the -control plane. RuntimeClasses are assigned to pods through a `runtimeClass` field on the -`PodSpec`. This provides a new mechanism for supporting multiple runtimes in a cluster and/or node. - -## Motivation - -There is growing interest in using different runtimes within a cluster. [Sandboxes][] are the -primary motivator for this right now, with both Kata containers and gVisor looking to integrate with -Kubernetes. Other runtime models such as Windows containers or even remote runtimes will also -require support in the future. RuntimeClass provides a way to select between different runtimes -configured in the cluster and surface their properties (both to the cluster & the user). - -In addition to selecting the runtime to use, supporting multiple runtimes raises other problems to -the control plane level, including: accounting for runtime overhead, scheduling to nodes that -support the runtime, and surfacing which optional features are supported by different -runtimes. Although these problems are not tackled by this initial proposal, RuntimeClass provides a -cluster-scoped resource tied to the runtime that can help solve these problems in a future update. - -[Sandboxes]: https://docs.google.com/document/d/1QQ5u1RBDLXWvC8K3pscTtTRThsOeBSts_imYEoRyw8A/edit - -### Goals - -- Provide a mechanism for surfacing container runtime properties to the control plane -- Support multiple runtimes per-cluster, and provide a mechanism for users to select the desired - runtime - -### Non-Goals - -- RuntimeClass is NOT RuntimeComponentConfig. -- RuntimeClass is NOT a general policy mechanism. -- RuntimeClass is NOT "NodeClass". Although different nodes may run different runtimes, in general - RuntimeClass should not be a cross product of runtime properties and node properties. - -The following goals are out-of-scope for the initial implementation, but may be explored in a future -iteration: - -- Surfacing support for optional features by runtimes, and surfacing errors caused by - incompatible features & runtimes earlier. -- Automatic runtime or feature discovery - initially RuntimeClasses are manually defined (by the - cluster admin or provider), and are asserted to be an accurate representation of the runtime. 
-- Scheduling in heterogeneous clusters - it is possible to operate a heterogeneous cluster - (different runtime configurations on different nodes) through scheduling primitives like - `NodeAffinity` and `Taints+Tolerations`, but the user is responsible for setting these up and - automatic runtime-aware scheduling is out-of-scope. -- Define standardized or conformant runtime classes - although I would like to declare some - predefined RuntimeClasses with specific properties, doing so is out-of-scope for this initial KEP. -- [Pod Overhead][] - Although RuntimeClass is likely to be the configuration mechanism of choice, - the details of how pod resource overhead will be implemented is out of scope for this KEP. -- Provide a mechanism to dynamically register or provision additional runtimes. -- Requiring specific RuntimeClasses according to policy. This should be addressed by other - cluster-level policy mechanisms, such as PodSecurityPolicy. -- "Fitting" a RuntimeClass to pod requirements - In other words, specifying runtime properties and - letting the system match an appropriate RuntimeClass, rather than explicitly assigning a - RuntimeClass by name. This approach can increase portability, but can be added seamlessly in a - future iteration. - -[Pod Overhead]: https://docs.google.com/document/d/1EJKT4gyl58-kzt2bnwkv08MIUZ6lkDpXcxkHqCvvAp4/edit - -### User Stories - -- As a cluster operator, I want to provide multiple runtime options to support a wide variety of - workloads. Examples include native linux containers, "sandboxed" containers, and windows - containers. -- As a cluster operator, I want to provide stable rolling upgrades of runtimes. For - example, rolling out an update with backwards incompatible changes or previously unsupported - features. -- As an application developer, I want to select the runtime that best fits my workload. -- As an application developer, I don't want to study the nitty-gritty details of different runtime - implementations, but rather choose from pre-configured classes. -- As an application developer, I want my application to be portable across clusters that use similar - but different variants of a "class" of runtimes. - -## Proposal - -The initial design includes: - -- `RuntimeClass` API resource definition -- `RuntimeClass` pod field for specifying the RuntimeClass the pod should be run with -- Kubelet implementation for fetching & interpreting the RuntimeClass -- CRI API & implementation for passing along the [RuntimeHandler](#runtime-handler). - -### API - -`RuntimeClass` is a new cluster-scoped resource in the `node.k8s.io` API group. - -> _The `node.k8s.io` API group would eventually hold the Node resource when `core` is retired. -> Alternatives considered: `runtime.k8s.io`, `cluster.k8s.io`_ - -_(This is a simplified declaration, syntactic details will be covered in the API PR review)_ - -```go -type RuntimeClass struct { - metav1.TypeMeta - // ObjectMeta minimally includes the RuntimeClass name, which is used to reference the class. - // Namespace should be left blank. - metav1.ObjectMeta - - Spec RuntimeClassSpec -} - -type RuntimeClassSpec struct { - // RuntimeHandler specifies the underlying runtime the CRI calls to handle pod and/or container - // creation. The possible values are specific to a given configuration & CRI implementation. - // The empty string is equivalent to the default behavior. - // +optional - RuntimeHandler string -} -``` - -The runtime is selected by the pod by specifying the RuntimeClass in the PodSpec. 
Once the pod is -scheduled, the RuntimeClass cannot be changed. - -```go -type PodSpec struct { - ... - // RuntimeClassName refers to a RuntimeClass object with the same name, - // which should be used to run this pod. - // +optional - RuntimeClassName string - ... -} -``` - -The `legacy` RuntimeClass name is reserved. The legacy RuntimeClass is defined to be fully backwards -compatible with current Kubernetes. This means that the legacy runtime does not specify any -RuntimeHandler or perform any feature validation (all features are "supported"). - -```go -const ( - // RuntimeClassNameLegacy is a reserved RuntimeClass name. The legacy - // RuntimeClass does not specify a runtime handler or perform any - // feature validation. - RuntimeClassNameLegacy = "legacy" -) -``` - -An unspecified RuntimeClassName `""` is equivalent to the `legacy` RuntimeClass, though the field is -not defaulted to `legacy` (to leave room for configurable defaults in a future update). - -#### Examples - -Suppose we operate a cluster that lets users choose between native runc containers, and gvisor and -kata-container sandboxes. We might create the following runtime classes: - -```yaml -kind: RuntimeClass -apiVersion: node.k8s.io/v1alpha1 -metadata: - name: native # equivalent to 'legacy' for now -spec: - runtimeHandler: runc ---- -kind: RuntimeClass -apiVersion: node.k8s.io/v1alpha1 -metadata: - name: gvisor -spec: - runtimeHandler: gvisor ----- -kind: RuntimeClass -apiVersion: node.k8s.io/v1alpha1 -metadata: - name: kata-containers -spec: - runtimeHandler: kata-containers ----- -# provides the default sandbox runtime when users don't care about which they're getting. -kind: RuntimeClass -apiVersion: node.k8s.io/v1alpha1 -metadata: - name: sandboxed -spec: - runtimeHandler: gvisor -``` - -Then when a user creates a workload, they can choose the desired runtime class to use (or not, if -they want the default). - -```yaml -apiVersion: extensions/v1beta1 -kind: Deployment -metadata: - name: sandboxed-nginx -spec: - replicas: 2 - selector: - matchLabels: - app: sandboxed-nginx - template: - metadata: - labels: - app: sandboxed-nginx - spec: - runtimeClassName: sandboxed # <---- Reference the desired RuntimeClass - containers: - - name: nginx - image: nginx - ports: - - containerPort: 80 - protocol: TCP -``` - -#### Runtime Handler - -The `RuntimeHandler` is passed to the CRI as part of the `RunPodSandboxRequest`: - -```proto -message RunPodSandboxRequest { - // Configuration for creating a PodSandbox. - PodSandboxConfig config = 1; - // Named runtime configuration to use for this PodSandbox. - string RuntimeHandler = 2; -} -``` - -The RuntimeHandler is provided as a mechanism for CRI implementations to select between different -predetermined configurations. 
The initial use case is replacing the experimental pod annotations -currently used for selecting a sandboxed runtime by various CRI implementations: - -| CRI Runtime | Pod Annotation | -| ------------|-------------------------------------------------------------| -| CRIO | io.kubernetes.cri-o.TrustedSandbox: "false" | -| containerd | io.kubernetes.cri.untrusted-workload: "true" | -| frakti | runtime.frakti.alpha.kubernetes.io/OSContainer: "true"<br>runtime.frakti.alpha.kubernetes.io/Unikernel: "true" | -| windows | experimental.windows.kubernetes.io/isolation-type: "hyperv" | - -These implementations could stick with scheme ("trusted" and "untrusted"), but the preferred -approach is a non-binary one wherein arbitrary handlers can be configured with a name that can be -matched against the specified RuntimeHandler. For example, containerd might have a configuration -corresponding to a "kata-runtime" handler: - -``` -[plugins.cri.containerd.kata-runtime] - runtime_type = "io.containerd.runtime.v1.linux" - runtime_engine = "/opt/kata/bin/kata-runtime" - runtime_root = "" -``` - -This non-binary approach is more flexible: it can still map to a binary RuntimeClass selection -(e.g. `sandboxed` or `untrusted` RuntimeClasses), but can also support multiple parallel sandbox -types (e.g. `kata-containers` or `gvisor` RuntimeClasses). - -### Versioning, Updates, and Rollouts - -Getting upgrades and rollouts right is a very nuanced and complicated problem. For the initial alpha -implementation, we will kick the can down the road by making the `RuntimeClassSpec` **immutable**, -thereby requiring changes to be pushed as a newly named RuntimeClass instance. This means that pods -must be updated to reference the new RuntimeClass, and comes with the advantage of native support -for rolling updates through the same mechanisms as any other application update. The -`RuntimeClassName` pod field is also immutable post scheduling. - -This conservative approach is preferred since it's much easier to relax constraints in a backwards -compatible way than tighten them. We should revisit this decision prior to graduating RuntimeClass -to beta. - -### Implementation Details - -The Kubelet uses an Informer to keep a local cache of all RuntimeClass objects. When a new pod is -added, the Kubelet resolves the Pod's RuntimeClass against the local RuntimeClass cache. Once -resolved, the RuntimeHandler field is passed to the CRI as part of the -[`RunPodSandboxRequest`][runpodsandbox]. At that point, the interpretation of the RuntimeHandler is -left to the CRI implementation, but it should be cached if needed for subsequent calls. - -If the RuntimeClass cannot be resolved (e.g. doesn't exist) at Pod creation, then the request will -be rejected in admission (controller to be detailed in a following update). If the RuntimeClass -cannot be resolved by the Kubelet when `RunPodSandbox` should be called, then the Kubelet will fail -the Pod. The admission check on a replica recreation will prevent the scheduler from thrashing. If -the `RuntimeHandler` is not recognized by the CRI implementation, then `RunPodSandbox` will return -an error. - -[runpodsandbox]: https://github.com/kubernetes/kubernetes/blob/b05a61e299777c2030fbcf27a396aff21b35f01b/pkg/kubelet/apis/cri/runtime/v1alpha2/api.proto#L344 - -### Risks and Mitigations - -**Scope creep.** RuntimeClass has a fairly broad charter, but it should not become a default -dumping ground for every new feature exposed by the node. 
For each feature, careful consideration -should be made about whether it belongs on the Pod, Node, RuntimeClass, or some other resource. The -[non-goals](#non-goals) should be kept in mind when considering RuntimeClass features. - -**Becoming a general policy mechanism.** RuntimeClass should not be used a replacement for -PodSecurityPolicy. The use cases for defining multiple RuntimeClasses for the same underlying -runtime implementation should be extremely limited (generally only around updates & rollouts). To -enforce this, no authorization or restrictions are placed directly on RuntimeClass use; in order to -restrict a user to a specific RuntimeClass, you must use another policy mechanism such as -PodSecurityPolicy. - -**Pushing complexity to the user.** RuntimeClass is a new resource in order to hide the complexity -of runtime configuration from most users (aside from the cluster admin or provisioner). However, we -are still side-stepping the issue of precisely defining specific types of runtimes like -"Sandboxed". However, it is still up for debate whether precisely defining such runtime categories -is even possible. RuntimeClass allows us to decouple this specification from the implementation, but -it is still something I hope we can address in a future iteration through the concept of pre-defined -or "conformant" RuntimeClasses. - -**Non-portability.** We are already in a world of non-portability for many features (see [examples -of runtime variation](#examples-of-runtime-variation). Future improvements to RuntimeClass can help -address this issue by formally declaring supported features, or matching the runtime that supports a -given workload automitaclly. Another issue is that pods need to refer to a RuntimeClass by name, -which may not be defined in every cluster. This is something that can be addressed through -pre-defined runtime classes (see previous risk), and/or by "fitting" pod requirements to compatible -RuntimeClasses. - -## Graduation Criteria - -Alpha: - -- Everything described in the current proposal: - - Introduce the RuntimeClass API resource - - Add a RuntimeClassName field to the PodSpec - - Add a RuntimeHandler field to the CRI `RunPodSandboxRequest` - - Lookup the RuntimeClass for pods & plumb through the RuntimeHandler in the Kubelet (feature - gated) -- RuntimeClass support in at least one CRI runtime & dockershim - - Runtime Handlers can be statically configured by the runtime, and referenced via RuntimeClass - - An error is reported when the handler or is unknown or unsupported -- Testing - - [CRI validation tests][cri-validation] - - Kubernetes E2E tests (only validating single runtime handler cases) - -[cri-validation]: https://github.com/kubernetes-sigs/cri-tools/blob/master/docs/validation.md - -Beta: - -- Most runtimes support RuntimeClass, and the current [untrusted annotations](#runtime-handler) are - deprecated. -- RuntimeClasses are configured in the E2E environment with test coverage of a non-legacy RuntimeClass -- The update & upgrade story is revisited, and a longer-term approach is implemented as necessary. -- The cluster admin can choose which RuntimeClass is the default in a cluster. -- Additional requirements TBD - -## Implementation History - -- 2018-06-11: SIG-Node decision to move forward with proposal -- 2018-06-19: Initial KEP published. 
- -## Appendix - -### Examples of runtime variation - -- Linux Security Module (LSM) choice - Kubernetes supports both AppArmor & SELinux options on pods, - but those are mutually exclusive, and support of either is not required by the runtime. The - default configuration is also not well defined. -- Seccomp-bpf - Kubernetes has alpha support for specifying a seccomp profile, but the default is - defined by the runtime, and support is not guaranteed. -- Windows containers - isolation features are very OS-specific, and most of the current features are - limited to linux. As we build out Windows container support, we'll need to add windows-specific - features as well. -- Host namespaces (Network,PID,IPC) may not be supported by virtualization-based runtimes - (e.g. Kata-containers & gVisor). -- Per-pod and Per-container resource overhead varies by runtime. -- Device support (e.g. GPUs) varies wildly by runtime & nodes. -- Supported volume types varies by node - it remains TBD whether this information belongs in - RuntimeClass. -- The list of default capabilities is defined in Docker, but not Kubernetes. Future runtimes may - have differing defaults, or support a subset of capabilities. -- `Privileged` mode is not well defined, and thus may have differing implementations. -- Support for resource over-commit and dynamic resource sizing (e.g. Burstable vs Guaranteed - workloads) +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
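To make the Kubelet-side resolution described in the Implementation Details section above more concrete, here is a rough sketch of mapping a pod's `runtimeClassName` to a CRI runtime handler. The `cache` interface and the function name are illustrative stand-ins for the informer-backed lister the Kubelet would actually use.

```go
package runtimeclass

import "fmt"

// RuntimeClass mirrors the simplified API sketched above.
type RuntimeClass struct {
	Name string
	Spec struct {
		RuntimeHandler string
	}
}

// cache is a placeholder for an informer-backed lister of RuntimeClass objects.
type cache interface {
	Get(name string) (*RuntimeClass, error)
}

// HandlerForPod maps a pod's runtimeClassName to the CRI RuntimeHandler.
// An empty name or the reserved "legacy" class resolves to the empty
// (default) handler.
func HandlerForPod(c cache, runtimeClassName string) (string, error) {
	if runtimeClassName == "" || runtimeClassName == "legacy" {
		return "", nil
	}
	rc, err := c.Get(runtimeClassName)
	if err != nil {
		// Per the proposal, an unresolvable RuntimeClass fails the pod rather
		// than silently falling back to the default runtime.
		return "", fmt.Errorf("unresolvable RuntimeClass %q: %v", runtimeClassName, err)
	}
	return rc.Spec.RuntimeHandler, nil
}
```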
\ No newline at end of file diff --git a/keps/sig-node/0030-20180906-quotas-for-ephemeral-storage.md b/keps/sig-node/0030-20180906-quotas-for-ephemeral-storage.md index a6c5aaba..cfd1f5fa 100644 --- a/keps/sig-node/0030-20180906-quotas-for-ephemeral-storage.md +++ b/keps/sig-node/0030-20180906-quotas-for-ephemeral-storage.md @@ -1,807 +1,4 @@ ---- -kep-number: 0 -title: Quotas for Ephemeral Storage -authors: - - "@RobertKrawitz" -owning-sig: sig-xxx -participating-sigs: - - sig-node -reviewers: - - TBD -approvers: - - "@dchen1107" - - "@derekwaynecarr" -editor: TBD -creation-date: yyyy-mm-dd -last-updated: yyyy-mm-dd -status: provisional -see-also: -replaces: -superseded-by: ---- - -# Quotas for Ephemeral Storage - -## Table of Contents -<!-- markdown-toc start - Don't edit this section. Run M-x markdown-toc-generate-toc again --> -**Table of Contents** - -- [Quotas for Ephemeral Storage](#quotas-for-ephemeral-storage) - - [Table of Contents](#table-of-contents) - - [Summary](#summary) - - [Project Quotas](#project-quotas) - - [Motivation](#motivation) - - [Goals](#goals) - - [Non-Goals](#non-goals) - - [Future Work](#future-work) - - [Proposal](#proposal) - - [Control over Use of Quotas](#control-over-use-of-quotas) - - [Operation Flow -- Applying a Quota](#operation-flow----applying-a-quota) - - [Operation Flow -- Retrieving Storage Consumption](#operation-flow----retrieving-storage-consumption) - - [Operation Flow -- Removing a Quota.](#operation-flow----removing-a-quota) - - [Operation Notes](#operation-notes) - - [Selecting a Project ID](#selecting-a-project-id) - - [Determine Whether a Project ID Applies To a Directory](#determine-whether-a-project-id-applies-to-a-directory) - - [Return a Project ID To the System](#return-a-project-id-to-the-system) - - [Implementation Details/Notes/Constraints [optional]](#implementation-detailsnotesconstraints-optional) - - [Notes on Implementation](#notes-on-implementation) - - [Notes on Code Changes](#notes-on-code-changes) - - [Testing Strategy](#testing-strategy) - - [Risks and Mitigations](#risks-and-mitigations) - - [Graduation Criteria](#graduation-criteria) - - [Implementation History](#implementation-history) - - [Drawbacks [optional]](#drawbacks-optional) - - [Alternatives [optional]](#alternatives-optional) - - [Alternative quota-based implementation](#alternative-quota-based-implementation) - - [Alternative loop filesystem-based implementation](#alternative-loop-filesystem-based-implementation) - - [Infrastructure Needed [optional]](#infrastructure-needed-optional) - - [References](#references) - - [Bugs Opened Against Filesystem Quotas](#bugs-opened-against-filesystem-quotas) - - [CVE](#cve) - - [Other Security Issues Without CVE](#other-security-issues-without-cve) - - [Other Linux Quota-Related Bugs Since 2012](#other-linux-quota-related-bugs-since-2012) - -<!-- markdown-toc end --> - -[Tools for generating]: https://github.com/ekalinin/github-markdown-toc - -## Summary - -This proposal applies to the use of quotas for ephemeral-storage -metrics gathering. Use of quotas for ephemeral-storage limit -enforcement is a [non-goal](#non-goals), but as the architecture and -code will be very similar, there are comments interspersed related to -enforcement. _These comments will be italicized_. - -Local storage capacity isolation, aka ephemeral-storage, was -introduced into Kubernetes via -<https://github.com/kubernetes/features/issues/361>. 
It provides -support for capacity isolation of shared storage between pods, such -that a pod can be limited in its consumption of shared resources and -can be evicted if its consumption of shared storage exceeds that -limit. The limits and requests for shared ephemeral-storage are -similar to those for memory and CPU consumption. - -The current mechanism relies on periodically walking each ephemeral -volume (emptydir, logdir, or container writable layer) and summing the -space consumption. This method is slow, can be fooled, and has high -latency (i. e. a pod could consume a lot of storage prior to the -kubelet being aware of its overage and terminating it). - -The mechanism proposed here utilizes filesystem project quotas to -provide monitoring of resource consumption _and optionally enforcement -of limits._ Project quotas, initially in XFS and more recently ported -to ext4fs, offer a kernel-based means of monitoring _and restricting_ -filesystem consumption that can be applied to one or more directories. - -A prototype is in progress; see <https://github.com/kubernetes/kubernetes/pull/66928>. - -### Project Quotas - -Project quotas are a form of filesystem quota that apply to arbitrary -groups of files, as opposed to file user or group ownership. They -were first implemented in XFS, as described here: -<http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/xfs-quotas.html>. - -Project quotas for ext4fs were [proposed in late -2014](https://lwn.net/Articles/623835/) and added to the Linux kernel -in early 2016, with -commit -[391f2a16b74b95da2f05a607f53213fc8ed24b8e](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=391f2a16b74b95da2f05a607f53213fc8ed24b8e). -They were designed to be compatible with XFS project quotas. - -Each inode contains a 32-bit project ID, to which optionally quotas -(hard and soft limits for blocks and inodes) may be applied. The -total blocks and inodes for all files with the given project ID are -maintained by the kernel. Project quotas can be managed from -userspace by means of the `xfs_quota(8)` command in foreign filesystem -(`-f`) mode; the traditional Linux quota tools do not manipulate -project quotas. Programmatically, they are managed by the `quotactl(2)` -system call, using in part the standard quota commands and in part the -XFS quota commands; the man page implies incorrectly that the XFS -quota commands apply only to XFS filesystems. - -The project ID applied to a directory is inherited by files created -under it. Files cannot be (hard) linked across directories with -different project IDs. A file's project ID cannot be changed by a -non-privileged user, but a privileged user may use the `xfs_io(8)` -command to change the project ID of a file. - -Filesystems using project quotas may be mounted with quotas either -enforced or not; the non-enforcing mode tracks usage without enforcing -it. A non-enforcing project quota may be implemented on a filesystem -mounted with enforcing quotas by setting a quota too large to be hit. -The maximum size that can be set varies with the filesystem; on a -64-bit filesystem it is 2^63-1 bytes for XFS and 2^58-1 bytes for -ext4fs. - -Conventionally, project quota mappings are stored in `/etc/projects` and -`/etc/projid`; these files exist for user convenience and do not have -any direct importance to the kernel. 
`/etc/projects` contains a mapping -from project ID to directory/file; this can be a one to many mapping -(the same project ID can apply to multiple directories or files, but -any given directory/file can be assigned only one project ID). -`/etc/projid` contains a mapping from named projects to project IDs. - -This proposal utilizes hard project quotas for both monitoring _and -enforcement_. Soft quotas are of no utility; they allow for temporary -overage that, after a programmable period of time, is converted to the -hard quota limit. - - -## Motivation - -The mechanism presently used to monitor storage consumption involves -use of `du` and `find` to periodically gather information about -storage and inode consumption of volumes. This mechanism suffers from -a number of drawbacks: - -* It is slow. If a volume contains a large number of files, walking - the directory can take a significant amount of time. There has been - at least one known report of nodes becoming not ready due to volume - metrics: <https://github.com/kubernetes/kubernetes/issues/62917> -* It is possible to conceal a file from the walker by creating it and - removing it while holding an open file descriptor on it. POSIX - behavior is to not remove the file until the last open file - descriptor pointing to it is removed. This has legitimate uses; it - ensures that a temporary file is deleted when the processes using it - exit, and it minimizes the attack surface by not having a file that - can be found by an attacker. The following pod does this; it will - never be caught by the present mechanism: - -```yaml -apiVersion: v1 -kind: Pod -max: -metadata: - name: "diskhog" -spec: - containers: - - name: "perl" - resources: - limits: - ephemeral-storage: "2048Ki" - image: "perl" - command: - - perl - - -e - - > - my $file = "/data/a/a"; open OUT, ">$file" or die "Cannot open $file: $!\n"; unlink "$file" or die "cannot unlink $file: $!\n"; my $a="0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789"; foreach my $i (0..200000000) { print OUT $a; }; sleep 999999 - volumeMounts: - - name: a - mountPath: /data/a - volumes: - - name: a - emptyDir: {} -``` -* It is reactive rather than proactive. It does not prevent a pod - from overshooting its limit; at best it catches it after the fact. - On a fast storage medium, such as NVMe, a pod may write 50 GB or - more of data before the housekeeping performed once per minute - catches up to it. If the primary volume is the root partition, this - will completely fill the partition, possibly causing serious - problems elsewhere on the system. This proposal does not address - this issue; _a future enforcing project would_. - -In many environments, these issues may not matter, but shared -multi-tenant environments need these issues addressed. - -### Goals - -These goals apply only to local ephemeral storage, as described in -<https://github.com/kubernetes/features/issues/361>. - -* Primary: improve performance of monitoring by using project quotas - in a non-enforcing way to collect information about storage - utilization of ephemeral volumes. -* Primary: detect storage used by pods that is concealed by deleted - files being held open. -* Primary: this will not interfere with the more common user and group - quotas. - -### Non-Goals - -* Application to storage other than local ephemeral storage. -* Application to container copy on write layers. That will be managed - by the container runtime. 
For a future project, we should work with - the runtimes to use quotas for their monitoring. -* Elimination of eviction as a means of enforcing ephemeral-storage - limits. Pods that hit their ephemeral-storage limit will still be - evicted by the kubelet even if their storage has been capped by - enforcing quotas. -* Enforcing node allocatable (limit over the sum of all pod's disk - usage, including e. g. images). -* Enforcing limits on total pod storage consumption by any means, such - that the pod would be hard restricted to the desired storage limit. - -### Future Work - -* _Enforce limits on per-volume storage consumption by using - enforced project quotas._ - -## Proposal - -This proposal applies project quotas to emptydir volumes on qualifying -filesystems (ext4fs and xfs with project quotas enabled). Project -quotas are applied by selecting an unused project ID (a 32-bit -unsigned integer), setting a limit on space and/or inode consumption, -and attaching the ID to one or more files. By default (and as -utilized herein), if a project ID is attached to a directory, it is -inherited by any files created under that directory. - -_If we elect to use the quota as enforcing, we impose a quota -consistent with the desired limit._ If we elect to use it as -non-enforcing, we impose a large quota that in practice cannot be -exceeded (2^63-1 bytes for XFS, 2^58-1 bytes for ext4fs). - -### Control over Use of Quotas - -At present, two feature gates control operation of quotas: - -* `LocalStorageCapacityIsolation` must be enabled for any use of - quotas. - -* `LocalStorageCapacityIsolationFSMonitoring` must be enabled in addition. If this is - enabled, quotas are used for monitoring, but not enforcement. At - present, this defaults to False, but the intention is that this will - default to True by initial release. - -* _`LocalStorageCapacityIsolationFSEnforcement` must be enabled, in addition to - `LocalStorageCapacityIsolationFSMonitoring`, to use quotas for enforcement._ - -### Operation Flow -- Applying a Quota - -* Caller (emptydir volume manager or container runtime) creates an - emptydir volume, with an empty directory at a location of its - choice. -* Caller requests that a quota be applied to a directory. -* Determine whether a quota can be imposed on the directory, by asking - each quota provider (one per filesystem type) whether it can apply a - quota to the directory. If no provider claims the directory, an - error status is returned to the caller. -* Select an unused project ID ([see below](#selecting-a-project-id)). -* Set the desired limit on the project ID, in a filesystem-dependent - manner ([see below](#notes-on-implementation)). -* Apply the project ID to the directory in question, in a - filesystem-dependent manner. - -An error at any point results in no quota being applied and no change -to the state of the system. The caller in general should not assume a -priori that the attempt will be successful. It could choose to reject -a request if a quota cannot be applied, but at this time it will -simply ignore the error and proceed as today. - -### Operation Flow -- Retrieving Storage Consumption - -* Caller (kubelet metrics code, cadvisor, container runtime) asks the - quota code to compute the amount of storage used under the - directory. -* Determine whether a quota applies to the directory, in a - filesystem-dependent manner ([see below](#notes-on-implementation)). -* If so, determine how much storage or how many inodes are utilized, - in a filesystem dependent manner. 
- -If the quota code is unable to retrieve the consumption, it returns an -error status and it is up to the caller to utilize a fallback -mechanism (such as the directory walk performed today). - -### Operation Flow -- Removing a Quota. - -* Caller requests that the quota be removed from a directory. -* Determine whether a project quota applies to the directory. -* Remove the limit from the project ID associated with the directory. -* Remove the association between the directory and the project ID. -* Return the project ID to the system to allow its use elsewhere ([see - below](#return-a-project-id-to-the-system)). -* Caller may delete the directory and its contents (normally it will). - -### Operation Notes - -#### Selecting a Project ID - -Project IDs are a shared space within a filesystem. If the same -project ID is assigned to multiple directories, the space consumption -reported by the quota will be the sum of that of all of the -directories. Hence, it is important to ensure that each directory is -assigned a unique project ID (unless it is desired to pool the storage -use of multiple directories). - -The canonical mechanism to record persistently that a project ID is -reserved is to store it in the `/etc/projid` (`projid[5]`) and/or -`/etc/projects` (`projects(5)`) files. However, it is possible to utilize -project IDs without recording them in those files; they exist for -administrative convenience but neither the kernel nor the filesystem -is aware of them. Other ways can be used to determine whether a -project ID is in active use on a given filesystem: - -* The quota values (in blocks and/or inodes) assigned to the project - ID are non-zero. -* The storage consumption (in blocks and/or inodes) reported under the - project ID are non-zero. - -The algorithm to be used is as follows: - -* Lock this instance of the quota code against re-entrancy. -* open and `flock()` the `/etc/project` and `/etc/projid` files, so that - other uses of this code are excluded. -* Start from a high number (the prototype uses 1048577). -* Iterate from there, performing the following tests: - * Is the ID reserved by this instance of the quota code? - * Is the ID present in `/etc/projects`? - * Is the ID present in `/etc/projid`? - * Are the quota values and/or consumption reported by the kernel - non-zero? This test is restricted to 128 iterations to ensure - that a bug here or elsewhere does not result in an infinite loop - looking for a quota ID. -* If an ID has been found: - * Add it to an in-memory copy of `/etc/projects` and `/etc/projid` so - that any other uses of project quotas do not reuse it. - * Write temporary copies of `/etc/projects` and `/etc/projid` that are - `flock()`ed - * If successful, rename the temporary files appropriately (if - rename of one succeeds but the other fails, we have a problem - that we cannot recover from, and the files may be inconsistent). -* Unlock `/etc/projid` and `/etc/projects`. -* Unlock this instance of the quota code. - -A minor variation of this is used if we want to reuse an existing -quota ID. - -#### Determine Whether a Project ID Applies To a Directory - -It is possible to determine whether a directory has a project ID -applied to it by requesting (via the `quotactl(2)` system call) the -project ID associated with the directory. Whie the specifics are -filesystem-dependent, the basic method is the same for at least XFS -and ext4fs. - -It is not possible to determine in constant operations the directory -or directories to which a project ID is applied. 
It is possible to -determine whether a given project ID has been applied to an existing -directory or files (although those will not be known); the reported -consumption will be non-zero. - -The code records internally the project ID applied to a directory, but -it cannot always rely on this. In particular, if the kubelet has -exited and has been restarted (and hence the quota applying to the -directory should be removed), the map from directory to project ID is -lost. If it cannot find a map entry, it falls back on the approach -discussed above. - -#### Return a Project ID To the System - -The algorithm used to return a project ID to the system is very -similar to the algorithm used to select a project ID, except of course -for selecting a project ID. It performs the same sequence of locking -`/etc/project` and `/etc/projid`, editing a copy of the file, and -restoring it. - -If the project ID is applied to multiple directories and the code can -determine that, it will not remove the project ID from `/etc/projid` -until the last reference is removed. While it is not anticipated in -this KEP that this mode of operation will be used, at least initially, -this can be detected even on kubelet restart by looking at the -reference count in `/etc/projects`. - - -### Implementation Details/Notes/Constraints [optional] - -#### Notes on Implementation - -The primary new interface defined is the quota interface in -`pkg/volume/util/quota/quota.go`. This defines five operations: - -* Does the specified directory support quotas? - -* Assign a quota to a directory. If a non-empty pod UID is provided, - the quota assigned is that of any other directories under this pod - UID; if an empty pod UID is provided, a unique quota is assigned. - -* Retrieve the consumption of the specified directory. If the quota - code cannot handle it efficiently, it returns an error and the - caller falls back on existing mechanism. - -* Retrieve the inode consumption of the specified directory; same - description as above. - -* Remove quota from a directory. If a non-empty pod UID is passed, it - is checked against that recorded in-memory (if any). The quota is - removed from the specified directory. This can be used even if - AssignQuota has not been used; it inspects the directory and removes - the quota from it. This permits stale quotas from an interrupted - kubelet to be cleaned up. - -Two implementations are provided: `quota_linux.go` (for Linux) and -`quota_unsupported.go` (for other operating systems). The latter -returns an error for all requests. - -As the quota mechanism is intended to support multiple filesystems, -and different filesystems require different low level code for -manipulating quotas, a provider is supplied that finds an appropriate -quota applier implementation for the filesystem in question. The low -level quota applier provides similar operations to the top level quota -code, with two exceptions: - -* No operation exists to determine whether a quota can be applied - (that is handled by the provider). - -* An additional operation is provided to determine whether a given - quota ID is in use within the filesystem (outside of `/etc/projects` - and `/etc/projid`). - -The two quota providers in the initial implementation are in -`pkg/volume/util/quota/extfs` and `pkg/volume/util/quota/xfs`. While -some quota operations do require different system calls, a lot of the -code is common, and factored into -`pkg/volume/util/quota/common/quota_linux_common_impl.go`. 
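A rough sketch of the five operations described above, expressed as a Go interface, is shown here. The method names, signatures, and the byte-limit convention (taken from the Notes on Code Changes below) are illustrative guesses and may not match the prototype PR exactly.

```go
package quota

// Interface is an illustrative shape for the quota support described above;
// the real prototype may use different names and argument types.
type Interface interface {
	// SupportsQuotas reports whether a project quota can be applied to path.
	SupportsQuotas(path string) (bool, error)

	// AssignQuota applies a quota to path. A non-empty podUID reuses the
	// project ID already assigned to other directories of the same pod; an
	// empty podUID gets a fresh project ID. Following the limit convention in
	// the Notes on Code Changes below, 0 means "do not apply a quota", a
	// positive value is an enforcing limit in bytes, and -1 requests a
	// non-enforcing quota (a limit too large to ever be hit).
	AssignQuota(path string, podUID string, bytes int64) error

	// GetConsumption returns bytes used under path, or an error if the caller
	// should fall back to walking the directory as is done today.
	GetConsumption(path string) (int64, error)

	// GetInodes returns inodes used under path, with the same fallback
	// contract as GetConsumption.
	GetInodes(path string) (int64, error)

	// ClearQuota removes the quota from path, including stale quotas left
	// behind by an interrupted kubelet.
	ClearQuota(path string) error
}
```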
- -#### Notes on Code Changes - -The prototype for this project is mostly self-contained within -`pkg/volume/util/quota` and a few changes to -`pkg/volume/empty_dir/empty_dir.go`. However, a few changes were -required elsewhere: - -* The operation executor needs to pass the desired size limit to the - volume plugin where appropriate so that the volume plugin can impose - a quota. The limit is passed as 0 (do not use quotas), _positive - number (impose an enforcing quota if possible, measured in bytes),_ - or -1 (impose a non-enforcing quota, if possible) on the volume. - - This requires changes to - `pkg/volume/util/operationexecutor/operation_executor.go` (to add - `DesiredSizeLimit` to `VolumeToMount`), - `pkg/kubelet/volumemanager/cache/desired_state_of_world.go`, and - `pkg/kubelet/eviction/helpers.go` (the latter in order to determine - whether the volume is a local ephemeral one). - -* The volume manager (in `pkg/volume/volume.go`) changes the - `Mounter.SetUp` and `Mounter.SetUpAt` interfaces to take a new - `MounterArgs` type rather than an `FsGroup` (`*int64`). This is to - allow passing the desired size and pod UID (in the event we choose - to implement quotas shared between multiple volumes; [see - below](#alternative-quota-based-implementation)). This required - small changes to all volume plugins and their tests, but will in the - future allow adding additional data without having to change code - other than that which uses the new information. - -#### Testing Strategy - -The quota code is by an large not very amendable to unit tests. While -there are simple unit tests for parsing the mounts file, and there -could be tests for parsing the projects and projid files, the real -work (and risk) involves interactions with the kernel and with -multiple instances of this code (e. g. in the kubelet and the runtime -manager, particularly under stress). It also requires setup in the -form of a prepared filesystem. It would be better served by -appropriate end to end tests. - -### Risks and Mitigations - -* The SIG raised the possibility of a container being unable to exit - should we enforce quotas, and the quota interferes with writing the - log. This can be mitigated by either not applying a quota to the - log directory and using the du mechanism, or by applying a separate - non-enforcing quota to the log directory. - - As log directories are write-only by the container, and consumption - can be limited by other means (as the log is filtered by the - runtime), I do not consider the ability to write uncapped to the log - to be a serious exposure. - - Note in addition that even without quotas it is possible for writes - to fail due to lack of filesystem space, which is effectively (and - in some cases operationally) indistinguishable from exceeding quota, - so even at present code must be able to handle those situations. - -* Filesystem quotas may impact performance to an unknown degree. - Information on that is hard to come by in general, and one of the - reasons for using quotas is indeed to improve performance. If this - is a problem in the field, merely turning off quotas (or selectively - disabling project quotas) on the filesystem in question will avoid - the problem. Against the possibility that cannot be done - (because project quotas are needed for other purposes), we should - provide a way to disable use of quotas altogether via a feature - gate. 
- - A report <https://blog.pythonanywhere.com/110/> notes that an - unclean shutdown on Linux kernel versions between 3.11 and 3.17 can - result in a prolonged downtime while quota information is restored. - Unfortunately, [the link referenced - here](http://oss.sgi.com/pipermail/xfs/2015-March/040879.html) is no - longer available. - -* Bugs in the quota code could result in a variety of regression - behavior. For example, if a quota is incorrectly applied it could - result in ability to write no data at all to the volume. This could - be mitigated by use of non-enforcing quotas. XFS in particular - offers the `pqnoenforce` mount option that makes all quotas - non-enforcing. - - -## Graduation Criteria - -How will we know that this has succeeded? Gathering user feedback is -crucial for building high quality experiences and SIGs have the -important responsibility of setting milestones for stability and -completeness. Hopefully the content previously contained in [umbrella -issues][] will be tracked in the `Graduation Criteria` section. - -[umbrella issues]: N/A - -## Implementation History - -Major milestones in the life cycle of a KEP should be tracked in -`Implementation History`. Major milestones might include - -- the `Summary` and `Motivation` sections being merged signaling SIG - acceptance -- the `Proposal` section being merged signaling agreement on a - proposed design -- the date implementation started -- the first Kubernetes release where an initial version of the KEP was - available -- the version of Kubernetes where the KEP graduated to general - availability -- when the KEP was retired or superseded - -## Drawbacks [optional] - -* Use of quotas, particularly the less commonly used project quotas, - requires additional action on the part of the administrator. In - particular: - * ext4fs filesystems must be created with additional options that - are not enabled by default: -``` -mkfs.ext4 -O quota,project -Q usrquota,grpquota,prjquota _device_ -``` - * An additional option (`prjquota`) must be applied in `/etc/fstab` - * If the root filesystem is to be quota-enabled, it must be set in - the grub options. -* Use of project quotas for this purpose will preclude future use - within containers. - -## Alternatives [optional] - -I have considered two classes of alternatives: - -* Alternatives based on quotas, with different implementation - -* Alternatives based on loop filesystems without use of quotas - -### Alternative quota-based implementation - -Within the basic framework of using quotas to monitor and potentially -enforce storage utilization, there are a number of possible options: - -* Utilize per-volume non-enforcing quotas to monitor storage (the - first stage of this proposal). - - This mostly preserves the current behavior, but with more efficient - determination of storage utilization and the possibility of building - further on it. The one change from current behavior is the ability - to detect space used by deleted files. - -* Utilize per-volume enforcing quotas to monitor and enforce storage - (the second stage of this proposal). - - This allows partial enforcement of storage limits. As local storage - capacity isolation works at the level of the pod, and we have no - control of user utilization of ephemeral volumes, we would have to - give each volume a quota of the full limit. For example, if a pod - had a limit of 1 MB but had four ephemeral volumes mounted, it would - be possible for storage utilization to reach (at least temporarily) - 4MB before being capped. 
- -* Utilize per-pod enforcing user or group quotas to enforce storage - consumption, and per-volume non-enforcing quotas for monitoring. - - This would offer the best of both worlds: a fully capped storage - limit combined with efficient reporting. However, it would require - each pod to run under a distinct UID or GID. This may prevent pods - from using setuid or setgid or their variants, and would interfere - with any other use of group or user quotas within Kubernetes. - -* Utilize per-pod enforcing quotas to monitor and enforce storage. - - This allows for full enforcement of storage limits, at the expense - of being able to efficiently monitor per-volume storage - consumption. As there have already been reports of monitoring - causing trouble, I do not advise this option. - - A variant of this would report (1/N) storage for each covered - volume, so with a pod with a 4MiB quota and 1MiB total consumption, - spread across 4 ephemeral volumes, each volume would report a - consumption of 256 KiB. Another variant would change the API to - report statistics for all ephemeral volumes combined. I do not - advise this option. - -### Alternative loop filesystem-based implementation - -Another way of isolating storage is to utilize filesystems of -pre-determined size, using the loop filesystem facility within Linux. -It is possible to create a file and run `mkfs(8)` on it, and then to -mount that filesystem on the desired directory. This both limits the -storage available within that directory and enables quick retrieval of -it via `statfs(2)`. - -Cleanup of such a filesystem involves unmounting it and removing the -backing file. - -The backing file can be created as a sparse file, and the `discard` -option can be used to return unused space to the system, allowing for -thin provisioning. - -I conducted preliminary investigations into this. While at first it -appeared promising, it turned out to have multiple critical flaws: - -* If the filesystem is mounted without the `discard` option, it can - grow to the full size of the backing file, negating any possibility - of thin provisioning. If the file is created dense in the first - place, there is never any possibility of thin provisioning without - use of `discard`. - - If the backing file is created densely, it additionally may require - significant time to create if the ephemeral limit is large. - -* If the filesystem is mounted `nosync`, and is sparse, it is possible - for writes to succeed and then fail later with I/O errors when - synced to the backing storage. This will lead to data corruption - that cannot be detected at the time of write. - - This can easily be reproduced by e. g. creating a 64MB filesystem - and within it creating a 128MB sparse file and building a filesystem - on it. When that filesystem is in turn mounted, writes to it will - succeed, but I/O errors will be seen in the log and the file will be - incomplete: - -``` -# mkdir /var/tmp/d1 /var/tmp/d2 -# dd if=/dev/zero of=/var/tmp/fs1 bs=4096 count=1 seek=16383 -# mkfs.ext4 /var/tmp/fs1 -# mount -o nosync -t ext4 /var/tmp/fs1 /var/tmp/d1 -# dd if=/dev/zero of=/var/tmp/d1/fs2 bs=4096 count=1 seek=32767 -# mkfs.ext4 /var/tmp/d1/fs2 -# mount -o nosync -t ext4 /var/tmp/d1/fs2 /var/tmp/d2 -# dd if=/dev/zero of=/var/tmp/d2/test bs=4096 count=24576 - ...will normally succeed... -# sync - ...fails with I/O error!... 
-``` - -* If the filesystem is mounted `sync`, all writes to it are - immediately committed to the backing store, and the `dd` operation - above fails as soon as it fills up `/var/tmp/d1`. However, - performance is drastically slowed, particularly with small writes; - with 1K writes, I observed performance degradation in some cases - exceeding three orders of magnitude. - - I performed a test comparing writing 64 MB to a base (partitioned) - filesystem, to a loop filesystem without `sync`, and a loop - filesystem with `sync`. Total I/O was sufficient to run for at least - 5 seconds in each case. All filesystems involved were XFS. Loop - filesystems were 128 MB and dense. Times are in seconds. The - erratic behavior (e. g. the 65536 case) was involved was observed - repeatedly, although the exact amount of time and which I/O sizes - were affected varied. The underlying device was an HP EX920 1TB - NVMe SSD. - -| I/O Size | Partition | Loop w/sync | Loop w/o sync | -| ---: | ---: | ---: | ---: | -| 1024 | 0.104 | 0.120 | 140.390 | -| 4096 | 0.045 | 0.077 | 21.850 | -| 16384 | 0.045 | 0.067 | 5.550 | -| 65536 | 0.044 | 0.061 | 20.440 | -| 262144 | 0.043 | 0.087 | 0.545 | -| 1048576 | 0.043 | 0.055 | 7.490 | -| 4194304 | 0.043 | 0.053 | 0.587 | - - The only potentially viable combination in my view would be a dense - loop filesystem without sync, but that would render any thin - provisioning impossible. - -## Infrastructure Needed [optional] - -* Decision: who is responsible for quota management of all volume - types (and especially ephemeral volumes of all types). At present, - emptydir volumes are managed by the kubelet and logdirs and writable - layers by either the kubelet or the runtime, depending upon the - choice of runtime. Beyond the specific proposal that the runtime - should manage quotas for volumes it creates, there are broader - issues that I request assistance from the SIG in addressing. - -* Location of the quota code. If the quotas for different volume - types are to be managed by different components, each such component - needs access to the quota code. The code is substantial and should - not be copied; it would more appropriately be vendored. - -## References - -### Bugs Opened Against Filesystem Quotas - -The following is a list of known security issues referencing -filesystem quotas on Linux, and other bugs referencing filesystem -quotas in Linux since 2012. These bugs are not necessarily in the -quota system. - -#### CVE - -* *CVE-2012-2133* Use-after-free vulnerability in the Linux kernel - before 3.3.6, when huge pages are enabled, allows local users to - cause a denial of service (system crash) or possibly gain privileges - by interacting with a hugetlbfs filesystem, as demonstrated by a - umount operation that triggers improper handling of quota data. - - The issue is actually related to huge pages, not quotas - specifically. The demonstration of the vulnerability resulted in - incorrect handling of quota data. - -* *CVE-2012-3417* The good_client function in rquotad (rquota_svc.c) - in Linux DiskQuota (aka quota) before 3.17 invokes the hosts_ctl - function the first time without a host name, which might allow - remote attackers to bypass TCP Wrappers rules in hosts.deny (related - to rpc.rquotad; remote attackers might be able to bypass TCP - Wrappers rules). - - This issue is related to remote quota handling, which is not the use - case for the proposal at hand. 
- -#### Other Security Issues Without CVE - -* [Linux Kernel Quota Flaw Lets Local Users Exceed Quota Limits and - Create Large Files](https://securitytracker.com/id/1002610) - - A setuid root binary inheriting file descriptors from an - unprivileged user process may write to the file without respecting - quota limits. If this issue is still present, it would allow a - setuid process to exceed any enforcing limits, but does not affect - the quota accounting (use of quotas for monitoring). - -### Other Linux Quota-Related Bugs Since 2012 - -* [ext4: report delalloc reserve as non-free in statfs mangled by - project quota](https://lore.kernel.org/patchwork/patch/884530/) - - This bug, fixed in Feb. 2018, properly accounts for reserved but not - committed space in project quotas. At this point I have not - determined the impact of this issue. - -* [XFS quota doesn't work after rebooting because of - crash](https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1461730) - - This bug resulted in XFS quotas not working after a crash or forced - reboot. Under this proposal, Kubernetes would fall back to du for - monitoring should a bug of this nature manifest itself again. - -* [quota can show incorrect filesystem - name](https://bugzilla.redhat.com/show_bug.cgi?id=1326527) - - This issue, which will not be fixed, results in the quota command - possibly printing an incorrect filesystem name when used on remote - filesystems. It is a display issue with the quota command, not a - quota bug at all, and does not result in incorrect quota information - being reported. As this proposal does not utilize the quota command - or rely on filesystem name, or currently use quotas on remote - filesystems, it should not be affected by this bug. - -In addition, the e2fsprogs have had numerous fixes over the years. +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
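The removed text above leans on reading a volume's consumption back cheaply via `statfs(2)` (the appeal of the loop-filesystem alternative) rather than walking the tree with `du`. A minimal Go sketch of such a read follows; the mount point is taken from the demonstration above for illustration only, and nothing here is prescribed by the KEP:

```go
package main

import (
	"fmt"
	"log"

	"golang.org/x/sys/unix"
)

// usedBytes returns the space consumed on the filesystem backing path, as
// reported by statfs(2). This is the cheap alternative to du that the text
// above refers to, and it is only meaningful per-volume when the path is its
// own filesystem (e.g. a loop-mounted backing file).
func usedBytes(path string) (uint64, error) {
	var st unix.Statfs_t
	if err := unix.Statfs(path, &st); err != nil {
		return 0, err
	}
	// Block counts are reported in units of the fragment size.
	return (st.Blocks - st.Bfree) * uint64(st.Frsize), nil
}

func main() {
	// Illustrative mount point from the loop-filesystem demonstration above.
	used, err := usedBytes("/var/tmp/d1")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("used: %d bytes\n", used)
}
```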
\ No newline at end of file diff --git a/keps/sig-node/compute-device-assignment.md b/keps/sig-node/compute-device-assignment.md index 1ce72617..cfd1f5fa 100644 --- a/keps/sig-node/compute-device-assignment.md +++ b/keps/sig-node/compute-device-assignment.md @@ -1,150 +1,4 @@ ---- -kep-number: 18 -title: Kubelet endpoint for device assignment observation details -authors: - - "@dashpole" - - "@vikaschoudhary16" -owning-sig: sig-node -reviewers: - - "@thockin" - - "@derekwaynecarr" - - "@dchen1107" - - "@vishh" -approvers: - - "@sig-node-leads" -editors: - - "@dashpole" - - "@vikaschoudhary16" -creation-date: "2018-07-19" -last-updated: "2018-07-19" -status: provisional ---- -# Kubelet endpoint for device assignment observation details - -Table of Contents -================= -* [Abstract](#abstract) -* [Background](#background) -* [Objectives](#objectives) -* [User Journeys](#user-journeys) - * [Device Monitoring Agents](#device-monitoring-agents) -* [Changes](#changes) -* [Potential Future Improvements](#potential-future-improvements) -* [Alternatives Considered](#alternatives-considered) - -## Abstract -In this document we will discuss the motivation and code changes required for introducing a kubelet endpoint to expose device to container bindings. - -## Background -[Device Monitoring](https://docs.google.com/document/d/1NYnqw-HDQ6Y3L_mk85Q3wkxDtGNWTxpsedsgw4NgWpg/edit?usp=sharing) requires external agents to be able to determine the set of devices in-use by containers and attach pod and container metadata for these devices. - -## Objectives - -* To remove current device-specific knowledge from the kubelet, such as [accellerator metrics](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/stats/v1alpha1/types.go#L229) -* To enable future use-cases requiring device-specific knowledge to be out-of-tree - -## User Journeys - -### Device Monitoring Agents - -* As a _Cluster Administrator_, I provide a set of devices from various vendors in my cluster. Each vendor independently maintains their own agent, so I run monitoring agents only for devices I provide. Each agent adheres to to the [node monitoring guidelines](https://docs.google.com/document/d/1_CdNWIjPBqVDMvu82aJICQsSCbh2BR-y9a8uXjQm4TI/edit?usp=sharing), so I can use a compatible monitoring pipeline to collect and analyze metrics from a variety of agents, even though they are maintained by different vendors. -* As a _Device Vendor_, I manufacture devices and I have deep domain expertise in how to run and monitor them. Because I maintain my own Device Plugin implementation, as well as Device Monitoring Agent, I can provide consumers of my devices an easy way to consume and monitor my devices without requiring open-source contributions. The Device Monitoring Agent doesn't have any dependencies on the Device Plugin, so I can decouple monitoring from device lifecycle management. My Device Monitoring Agent works by periodically querying the `/devices/<ResourceName>` endpoint to discover which devices are being used, and to get the container/pod metadata associated with the metrics: - - - - -## Changes - -Add a v1alpha1 Kubelet GRPC service, at `/var/lib/kubelet/pod-resources/kubelet.sock`, which returns information about the kubelet's assignment of devices to containers. It obtains this information from the internal state of the kubelet's Device Manager. 
The GRPC Service returns a single PodResourcesResponse, which is shown in proto below: -```protobuf -// PodResources is a service provided by the kubelet that provides information about the -// node resources consumed by pods and containers on the node -service PodResources { - rpc List(ListPodResourcesRequest) returns (ListPodResourcesResponse) {} -} - -// ListPodResourcesRequest is the request made to the PodResources service -message ListPodResourcesRequest {} - -// ListPodResourcesResponse is the response returned by List function -message ListPodResourcesResponse { - repeated PodResources pod_resources = 1; -} - -// PodResources contains information about the node resources assigned to a pod -message PodResources { - string name = 1; - string namespace = 2; - repeated ContainerResources containers = 3; -} - -// ContainerResources contains information about the resources assigned to a container -message ContainerResources { - string name = 1; - repeated ContainerDevices devices = 2; -} - -// ContainerDevices contains information about the devices assigned to a container -message ContainerDevices { - string resource_name = 1; - repeated string device_ids = 2; -} -``` - -### Potential Future Improvements - -* Add `ListAndWatch()` function to the GRPC endpoint so monitoring agents don't need to poll. -* Add identifiers for other resources used by pods to the `PodResources` message. - * For example, persistent volume location on disk - -## Alternatives Considered - -### Add v1alpha1 Kubelet GRPC service, at `/var/lib/kubelet/pod-resources/kubelet.sock`, which returns a list of [CreateContainerRequest](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/cri/runtime/v1alpha2/api.proto#L734)s used to create containers. -* Pros: - * Reuse an existing API for describing containers rather than inventing a new one -* Cons: - * It ties the endpoint to the CreateContainerRequest, and may prevent us from adding other information we want in the future - * It does not contain any additional information that will be useful to monitoring agents other than device, and contains lots of irrelevant information for this use-case. -* Notes: - * Does not include any reference to resource names. Monitoring agentes must identify devices by the device or environment variables passed to the pod or container. - -### Add a field to Pod Status. -* Pros: - * Allows for observation of container to device bindings local to the node through the `/pods` endpoint -* Cons: - * Only consumed locally, which doesn't justify an API change - * Device Bindings are immutable after allocation, and are _debatably_ observable (they can be "observed" from the local checkpoint file). Device bindings are generally a poor fit for status. - -### Use the Kubelet Device Manager Checkpoint file -* Allows for observability of device to container bindings through what exists in the checkpoint file - * Requires adding additional metadata to the checkpoint file as required by the monitoring agent -* Requires implementing versioning for the checkpoint file, and handling version skew between readers and the kubelet -* Future modifications to the checkpoint file are more difficult. - -### Add a field to the Pod Spec: -* A new object `ComputeDevice` will be defined and a new variable `ComputeDevices` will be added in the `Container` (Spec) object which will represent a list of `ComputeDevice` objects. 
-```golang -// ComputeDevice describes the devices assigned to this container for a given ResourceName -type ComputeDevice struct { - // DeviceIDs is the list of devices assigned to this container - DeviceIDs []string - // ResourceName is the name of the compute resource - ResourceName string -} - -// Container represents a single container that is expected to be run on the host. -type Container struct { - ... - // ComputeDevices contains the devices assigned to this container - // This field is alpha-level and is only honored by servers that enable the ComputeDevices feature. - // +optional - ComputeDevices []ComputeDevice - ... -} -``` -* During Kubelet pod admission, if `ComputeDevices` is found non-empty, specified devices will be allocated otherwise behaviour will remain same as it is today. -* Before starting the pod, the kubelet writes the assigned `ComputeDevices` back to the pod spec. - * Note: Writing to the Api Server and waiting to observe the updated pod spec in the kubelet's pod watch may add significant latency to pod startup. -* Allows devices to potentially be assigned by a custom scheduler. -* Serves as a permanent record of device assignments for the kubelet, and eliminates the need for the kubelet to maintain this state locally. - +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
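To make the `List` endpoint described in the removed KEP above concrete, here is a rough Go sketch of a monitoring agent querying it. The `podresourcesapi` import path is a placeholder for wherever the proto above gets compiled, and the dial options are just one plausible way to reach a unix socket with gRPC; only the socket path comes from the KEP text.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net"
	"time"

	"google.golang.org/grpc"

	// Placeholder import path for Go bindings generated from the proto above.
	podresourcesapi "example.com/generated/podresources/v1alpha1"
)

const socketPath = "/var/lib/kubelet/pod-resources/kubelet.sock"

func main() {
	// Dial the kubelet's pod-resources unix socket.
	conn, err := grpc.Dial(socketPath,
		grpc.WithInsecure(),
		grpc.WithDialer(func(addr string, timeout time.Duration) (net.Conn, error) {
			return net.DialTimeout("unix", addr, timeout)
		}),
	)
	if err != nil {
		log.Fatalf("dial %s: %v", socketPath, err)
	}
	defer conn.Close()

	client := podresourcesapi.NewPodResourcesClient(conn)
	resp, err := client.List(context.Background(), &podresourcesapi.ListPodResourcesRequest{})
	if err != nil {
		log.Fatalf("List: %v", err)
	}

	// Join pod/container metadata with device IDs, ready to be attached to
	// device-level metrics by the monitoring agent.
	for _, pod := range resp.PodResources {
		for _, ctr := range pod.Containers {
			for _, dev := range ctr.Devices {
				fmt.Printf("%s/%s %s %s %v\n", pod.Namespace, pod.Name, ctr.Name, dev.ResourceName, dev.DeviceIds)
			}
		}
	}
}
```

A `ListAndWatch` variant, floated under potential future improvements, would reuse the same dialing pattern and simply stream responses instead of polling.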
\ No newline at end of file diff --git a/keps/sig-release/k8s-image-promoter.md b/keps/sig-release/k8s-image-promoter.md index ccaba41e..cfd1f5fa 100644 --- a/keps/sig-release/k8s-image-promoter.md +++ b/keps/sig-release/k8s-image-promoter.md @@ -1,103 +1,4 @@ ---- -title: Image Promoter -authors: - - "@javier-b-perez" -owning-sig: sig-release -participating-sigs: - - TBD -reviewers: - - "@AishSundar" - - "@BenTheElder" - - "@dims" - - "@listx" -approvers: - - "@thockin" -creation-date: 2018-09-05 -last-updated: 2018-11-14 -status: implementable ---- - -# Image Promoter - -## Table of Contents - -* [Table of Contents](#table-of-contents) -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) -* [Proposal](#proposal) - * [Staging Container Registry](#staging-container-registry) - * [Production Container Registry](#production-container-registry) - * [Promotion Process](#promotion-process) -* [Graduation Criteria](#graduation-criteria) -* [Infrastructure Needed](#infrastructure-needed) - - -## Summary - -For security reasons, we cannot allow everyone to publish container images into the official kubernetes container registry. This is why we need a process that allows us to review who built an image and who approved it to be shown in the official channels. - - -## Motivation - -There are multiple reasons why we should have a process to publish container images in place: - -* We cannot allow all community members to publish images into the official kubernetes container registry. -* We should restrict who can push images to a small set of members and some systems accounts for automation. -* We can run scans and tests on the images before we publish them into the official kubernetes container registry. -* The process to publish into an official channel shouldn't be hard or long to follow. We don’t want to block developers or releases. - -### Goals - -1. Define a process for publishing container images into an official GCR through a code review process and automated promotion from GCR staging environment. -1. Allow the community to own and manage the project registries. - -## Proposal - -Following the *GitOps* idea, the proposal is to use a code review process to approve publishing container images into official distribution channels. - -This requires two GCR registries: - -* Staging: temporary container registry to share container images for testing and scanning. -* Production or *official*: GCR used to host all the approved container images by the community. - -### Staging Container Registry - -This temporary storage allows to have a public place where to pull images and run qualification tests or vulnerability scans on the images before pushing them to the *official* container registry. - -Each project/subproject in the community, will require at least one member of their community to have push access to the staging area. - -### Production Container Registry - -A restricted set of members can have push access to override any tool or process if necessary. -Ideally we only push images that have been approved by the owners of the production container registry, following the promotion process. - -### Promotion Process - -1. Maintainer create a container image and push it into *staging* GCR. -1. Maintainer creates a PR in GitHub to add the new image into the *official* container registry. -1. 
Once owners of the official container registry approve the change and merge it into the master branch, the promoter tool will automatically copy the container image(s) from *staging* into *official* container registry. - -If the infrastructure support it, the promoter tool could sign container images when pushing to the official container registry. - - - -In the future, we could add more information into the context of the PR like the vulnerability scan and test results of the container image. - -## Graduation Criteria - -We will know we are done when we have: - -* User guide for developers/maintainers: how to build? how to promote? -* User guide for owners: review and approve PR, how to push images? -* A repository to host the manifest file. -* Initial set of repository's owners who can approve changes. -* Criteria to grant or remove access to staging and production container registries. -* A tool that automatically copy approved images into official channels. - -## Infrastructure Needed - -* Two GCP projects with GCR enabled. - * One project should have GCB enabled to run the promotion tool in it. -* Repository to host the manifest for promotions. - +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
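To sketch what the copy step of the promotion process above could look like mechanically, here is a hedged Go example. The manifest structure, registry names, and use of the `crane` package from go-containerregistry are all illustrative assumptions; the KEP does not prescribe a manifest format or tooling.

```go
package main

import (
	"fmt"
	"log"

	"github.com/google/go-containerregistry/pkg/crane"
)

// promotion is a hypothetical manifest entry: publish the reviewed digest
// from staging under a tag in the official registry.
type promotion struct {
	Image  string // e.g. "my-component"
	Digest string // e.g. "sha256:..."
	Tag    string // e.g. "v1.0.0"
}

func main() {
	// Illustrative registry names only.
	const staging = "gcr.io/example-staging"
	const official = "gcr.io/example-official"

	manifest := []promotion{
		{Image: "my-component", Digest: "sha256:0000000000000000000000000000000000000000000000000000000000000000", Tag: "v1.0.0"},
	}

	for _, p := range manifest {
		src := fmt.Sprintf("%s/%s@%s", staging, p.Image, p.Digest)
		dst := fmt.Sprintf("%s/%s:%s", official, p.Image, p.Tag)
		// Copying by digest means the bits that were reviewed in the PR are
		// exactly the bits that get published.
		if err := crane.Copy(src, dst); err != nil {
			log.Fatalf("promoting %s: %v", src, err)
		}
	}
}
```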
\ No newline at end of file diff --git a/keps/sig-scheduling/node-labels-quota.md b/keps/sig-scheduling/node-labels-quota.md index 1d44a58d..cfd1f5fa 100644 --- a/keps/sig-scheduling/node-labels-quota.md +++ b/keps/sig-scheduling/node-labels-quota.md @@ -1,137 +1,4 @@ ---- -kep-number: 27 -title: Resource Quota based on Node Labels -authors: - - "@vishh" - - "@bsalamat" -owning-sig: sig-scheduling -participating-sigs: sig-architecture -reviewers: - - "@derekwaynecarr" - - "@davidopp" -approvers: - - TBD -editor: TBD -creation-date: 2018-08-23 -status: provisional ---- - -# Resource Quota based on Node Labels - -## Summary - -Allowing Resource Quota to be applied on pods based on their node selector configuration opens up a flexible interface for addressing some immediate and potential future use cases. - -## Motivation - -As a kubernetes cluster administrator, I'd like to, - -1. Restrict namespaces to specific HW types they can consume. Nodes are expected to be homogeneous wrt. to specific types of HW and HW type will be exposed as node labels. - * A concrete example - An intern should only use the cheapest GPU available in my cluster, while researchers can consume the latest or most expensive GPUs. -2. Restrict compute resources consumed by namespaces on different zones or dedicated node pools. -3. Restrict compute resources consumed by namespaces based on policy (FIPS, HIPAA, etc) compliance on individual nodes. - -This proposal presents flexible solution(s) for addressing these use cases without introducing much additional complexity to core kubernetes. - -## Potential solutions - -This proposal currently identifies two possible solutions, with the first one being the _preferred_ solution. - -### Solution A - Extend Resource Quota Scopes - -Resource Quota already includes a built in extension mechanism called [Resource Scopes](https://github.com/kubernetes/api/blob/master/core/v1/types.go#L4746). -It is possible to add a new Resource Scope called “NodeAffinityKey” (or something similar) that will allow for Resource Quota limits to apply to node selector and/or affinity fields specified in the pod spec. - -Here’s an illustration of a sample object with these new fields: - -```yaml -apiVersion: v1 -kind: ResourceQuota -metadata: - name: hipaa-nodes - namespace: team-1 -spec: - hard: - cpu: 1000 - memory: 100Gi - scopeSelector: - scopeName: NodeAffinityKey - operator: In - values: [“hipaa-compliant: true”] -``` - -``` yaml -apiVersion: v1 -kind: ResourceQuota -metadata: - name: nvidia-tesla-v100-quota - namespace: team-1 -spec: - hard: - - nvidia.com/gpu: 128 - scopeSelector: - scopeName: NodeAffinityKey - operator: In - values: [“nvidia.com/gpu-type:nvidia-tesla-v100”] -``` - -It is possible for quotas to overlap with this feature as is the case today. -All quotas have to be satisfied for the pod to be admitted. - -[Quota configuration object](https://github.com/kubernetes/kubernetes/blob/7f23a743e8c23ac6489340bbb34fa6f1d392db9d/plugin/pkg/admission/resourcequota/apis/resourcequota/types.go#L32) will also support the new scope to allow for preventing pods from running on nodes that match a label selector unless a corresponding quota object has been created. - -#### Pros - -- Support arbitrary properties to be consumed as part of quota as long as they are exposed as node labels. -- Little added cognitive burden - follows existing API paradigms. -- Implementation is straightforward. -- Doesn’t compromise portability - Quota remains an administrator burden. 
- -#### Cons - -- Requires property labels to become standardized if portability is desired. This is required anyways irrespective of how they are exposed outside of the node for scheduling portability. -- Label keys and values are concatenated. Given that most selector use cases for quota will be deterministic (one -> one), the proposed API schema might be adequate. - -### Solution B - Extend Resource Quota to include an explicit Node Selector field - -This solution is similar to the previous one with changes to the API where instead of re-using scopes we can add an explicit Node Selector field to the Resource Quota object. - -```yaml -apiVersion: v1 -kind: ResourceQuota -metadata: - name: hipaa-nodes - namespace: team-1 -spec: - hard: - cpu: 1000 - memory: 100Gi - podNodeSelector: - matchExpressions: - - key: hipaa-compliant - operator: In - values: ["true"] -``` - -Users should already be familiar with the Node Selector spec illustrated here as it is used in pod and volume topology specifications. -However this solution introduces a field that is only applicable to a few types of resources that Resource Quota can be used to control. - -### Solution C - CRD for expressing Resource Quota for extended resources - -The idea behind this solution is to let individual kubernetes vendors create additional CRDs that will allow for expressing quota per namespace for their resource and have a controller that will use mutating webhooks to quota pods on creation & deletion. -The controller can also keep track of “in use” quota for the resource it owns similar to the built in resource quota object. -The schema for quota is controlled by the resource vendor and the onus of maintaining compatibility and portability is on them. - -#### Pros - -- Maximum flexibility - - Use arbitrary specifications associated with a pod to define quota policies - - The spec for quota itself can be arbitrarily complex -- Develop and maintain outside of upstream - -#### Cons - -- Added administrator burden. An admin needs to identify multiple types of quota objects based on the HW they consume. -- It is not trivial to develop an external CRD given the lack of some critical validation, versioning, and lifecycle primitives. -- Tracking quota is non trivial - perhaps a canonical (example) quota controller might help ease the pain. -- Hard to generate available and in-use quota reports for users - existing quota support in ecosystem components will not support this new quota object (kubectl for example). +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-storage/0031-20181029-VolumeSubpathEnvExpansion-apichange.md b/keps/sig-storage/0031-20181029-VolumeSubpathEnvExpansion-apichange.md index 315f4169..cfd1f5fa 100644 --- a/keps/sig-storage/0031-20181029-VolumeSubpathEnvExpansion-apichange.md +++ b/keps/sig-storage/0031-20181029-VolumeSubpathEnvExpansion-apichange.md @@ -1,288 +1,4 @@ ---- -kep-number: 31 -title: VolumeSubpathEnvExpansion -authors: - - "@kevtaylor" -owning-sig: sig-storage -participating-sigs: - - sig-storage - - sig-architecture -reviewers: - - "@msau42" - - "@thockin" -approvers: - - "@thockin" - - "@msau42" -editor: TBD -creation-date: 2018-10-29 -last-updated: 2018-10-29 -status: implementable -see-also: - - n/a -replaces: - - n/a -superseded-by: - - n/a ---- - -# Title - -VolumeSubpathEnvExpansion API change - -## Table of Contents - - * [Title](#title) - * [Table of Contents](#table-of-contents) - * [Summary](#summary) - * [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) - * [Proposal](#proposal) - * [User Stories](#user-stories) - * [Current workarounds - k8s <=1.9.3](#current-workarounds---k8s-193) - * [Workarounds - k8s >1.9.3](#workarounds---k8s-193) - * [Alternatives - using subPath directly](#alternatives---using-subpath-directly) - * [Risks and Mitigations](#risks-and-mitigations) - * [Graduation Criteria](#graduation-criteria) - * [Implementation History](#implementation-history) - * [Alternatives - Using subPathFrom](#alternatives---using-subpathfrom) - -## Summary - -Legacy systems create all manner of log files and these are not easily streamed into stdout - -Files that are written to a host file path need to be uniquely partitioned - -If 2 or more pods run on the same host writing the same log file names to the same volume, they will clash - -Using the `subPath` is a neat option but because the `subPath` is "hardcoded" ie. `subPath: mySubPath` it does not enforce uniqueness - -To alleviate this, the `subPath` should be able to be configured from an environment variable as `subPath: $(MY_VARIABLE)` - -The workaround to this issue is to use symbolic links or relative symbolic links but these introduce sidecar init containers -and a messy configuration overhead to try to create upfront folders with unique names - examples of this complexity are detailed below - -## Motivation - -The initial alpha feature was implemented to allow unique addressing of subpaths on a host -This cannot currently be achieved with the downwardAPI and requires complex workarounds - -The workarounds became more difficult after 1.9.3 when symbolic links were removed from initContainers - -### Goals - -To reduce excessive boiler-plate workarounds and remove the need for complex initContainers - -### Non-Goals - -Full template implementation for subPaths - -## Proposal - -The api change proposed is to create a Mutually Exclusive Field separate from the `subPath` -called `subPathExpr` - -The subpath code which expands environment variables from the API would (under this proposal) change from - -``` - env: - - name: POD_NAME - valueFrom: - fieldRef: - apiVersion: v1 - fieldPath: metadata.name - - ... 
- - volumeMounts: - - name: workdir1 - mountPath: /logs - subPath: $(POD_NAME) -``` - -to: - -``` - volumeMounts: - - name: workdir1 - mountPath: /logs - subPathExpr: $(POD_NAME) -``` - -This would then introduce the new element to be processed separately from the `subPath` - -### User Stories - -## Current workarounds - k8s <=1.9.3 - -This makes use of symbolical linking to the underlying subpath system -The symbolic link element was removed after 1.9.3 - -``` -apiVersion: extensions/v1beta1 -kind: Deployment -metadata: - labels: - app: podtest - name: podtest -spec: - replicas: 1 - selector: - matchLabels: - app: podtest - template: - metadata: - labels: - app: podtest - spec: - containers: - - env: - - name: POD_NAME - valueFrom: - fieldRef: - apiVersion: v1 - fieldPath: metadata.name - - name: POD_NAMESPACE - valueFrom: - fieldRef: - apiVersion: v1 - fieldPath: metadata.namespace - image: <image> - name: podtest - volumeMounts: - - mountPath: /logs - name: workdir - subPath: logs - initContainers: - - command: - - /bin/sh - - -xc - - | - LOGDIR=/logs/${POD_NAMESPACE}/${POD_NAME}; mkdir -p ${LOGDIR} && ln -sfv ${LOGDIR} /workdir/logs && chmod -R ugo+wr ${LOGDIR} - env: - - name: POD_NAME - valueFrom: - fieldRef: - apiVersion: v1 - fieldPath: metadata.name - - name: POD_NAMESPACE - valueFrom: - fieldRef: - apiVersion: v1 - fieldPath: metadata.namespace - image: alpine:3.5 - name: prep-logs - volumeMounts: - - mountPath: /logs - name: logs - - mountPath: /workdir - name: workdir - volumes: - - emptyDir: {} - name: workdir - - hostPath: - path: /logs - type: "" - name: logs -``` - -## Workarounds - k8s >1.9.3 - -Beyond 1.9.3 some attempts were made to provide a workaround using relative paths, rather than symbolic links - -These have been deemed to be a cumbersome manipulation of the operating system and are flawed and unworkable - -This effectively negates an upgrade path to k8s 1.10 - -The only foreseeable solution is to move directly to 1.11 from 1.9.3 and switch on the alpha feature gate VolumeSubPathEnvExpansion - -## Alternatives - using subPath directly - -An initial attempt has been made for this but there are edge case compatibility issues which are highlighted here -https://github.com/kubernetes/kubernetes/pull/65769 regarding the alpha implementation - -The objection is regarding backward compatibility with existing users' subpaths - -Because of a breaking change in the API, it has been decided to offer an alternative based on the -original discussion here https://github.com/kubernetes/kubernetes/issues/48677 - -The `VolumeSubPathEnvExpansion` alpha feature was delivered in k8s 1.11 allowing -subpaths to be created from downward api variables as -``` -apiVersion: v1 -kind: Pod -metadata: - name: pod1 -spec: - containers: - - name: container1 - env: - - name: POD_NAME - valueFrom: - fieldRef: - apiVersion: v1 - fieldPath: metadata.name - image: busybox - command: [ "sh", "-c", "while [ true ]; do echo 'Hello'; sleep 10; done | tee -a /logs/hello.txt" ] - volumeMounts: - - name: workdir1 - mountPath: /logs - subPath: $(POD_NAME) - restartPolicy: Never - volumes: - - name: workdir1 - hostPath: - path: /var/log/pods -``` - -Because of the mentioned breaking changes, this implementation cannot proceed forward - -### Risks and Mitigations - -The alpha implementation already provided a number of test cases to ensure that validations of `subPath` configurations -were not circumvented or violated - -The API change would ensure that the substitute of the variables takes place immediately 
before the subpath mount validation - -We would also need review existing validation to ensure that any potential security issues are addressed as: -`$ escape and "../../../../../proc are not allowed` - -Due to the vulnerabilities highlighted in https://github.com/kubernetes/kubernetes/issues/60813 the subpath validations in kubelet have been -highly orchestrated. Any implementation of this feature needs to ensure that the security fixes put in place are still effective - - -## Graduation Criteria - -The existing alpha feature introduced many tests to mitigate issues. These would be reused as part of the api implementation. - -[umbrella issues]: https://github.com/kubernetes/kubernetes/pull/49388 - -## Implementation History - -* Initial issue: https://github.com/kubernetes/kubernetes/issues/48677 -* Feature gate proposal: https://github.com/kubernetes/enhancements/issues/559 -* Alpha Implementation: https://github.com/kubernetes/kubernetes/pull/49388 -* Beta Issue: https://github.com/kubernetes/kubernetes/issues/64604 -* Beta PR and Discussion: https://github.com/kubernetes/kubernetes/pull/65769 - -## Alternatives - Using subPathFrom -A possible further implementation could derive directly from the `fieldRef` as - -``` - volumeMounts: - - mountPath: /logs - name: logs - subPathFrom: - fieldRef: - fieldPath: metadata.name - volumes: - - name: logs - hostPath: - path: /logs -``` - -This method would not be favoured as it fixes the `subPath` to a single value and would not allow concatenation -of paths such as `$(NAMESPACE)/$(POD_NAME)` - - - +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-testing/0028-20180625-new-label-for-trusted-pr-identification.md b/keps/sig-testing/0028-20180625-new-label-for-trusted-pr-identification.md index 3a11d615..cfd1f5fa 100644 --- a/keps/sig-testing/0028-20180625-new-label-for-trusted-pr-identification.md +++ b/keps/sig-testing/0028-20180625-new-label-for-trusted-pr-identification.md @@ -1,135 +1,4 @@ ---- -kep-number: 28 -title: New label for trusted PR identification -authors: - - "@matthyx" -owning-sig: sig-testing -participating-sigs: - - sig-contributor-experience -reviewers: - - "@fejta" - - "@cjwagner" - - "@BenTheElder" - - "@cblecker" - - "@stevekuznetsov" -approvers: - - TBD -editor: TBD -creation-date: 2018-06-25 -last-updated: 2018-09-03 -status: provisional ---- - -# New label for trusted PR identification - -## Table of Contents - -* [Table of Contents](#table-of-contents) -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) -* [Proposal](#proposal) - * [Implementation Details/Notes/Constraints [optional]](#implementation-detailsnotesconstraints-optional) - * [Benefits](#benefits) - * [Risks and Mitigations](#risks-and-mitigations) -* [Graduation Criteria](#graduation-criteria) -* [Future evolutions](#future-evolutions) -* [References](#references) -* [Implementation History](#implementation-history) - -## Summary - -This document describes a major change to the way the `trigger` plugin determines if test jobs should be started on a pull request (PR). - -We propose introducing a new label named `ok-to-test` that will be applied on non-member PRs once they have been `/ok-to-test` by a legitimate reviewer. - -## Motivation - -PR test jobs are started by the trigger plugin on *trusted PR* events, or when a *untrusted* PR becomes *trusted*. -> A PR is considered trusted if the author is a member of the *trusted organization* for the repository or if such a member has left an `/ok-to-test` command on the PR. - -It is easy spot an untrusted PR opened by a non-member of the organization by its `needs-ok-to-test` label. However the contrary is difficult and involves scanning every comment for a `/ok-to-test`, which increases code complexity and API token consumption. - -### Goals - -This KEP will only target PRs authored from non-members of the organization: - -* introduce a new `ok-to-test` label -* modify `/ok-to-test` command to apply `ok-to-test` on success -* modify `trigger` plugin and other tools to use `ok-to-test` for PR trust -* support `/ok-to-test` calls inside review comments - -### Non-Goals - -This KEP will not change the current process for members of the organization. - -## Proposal - -We suggest introducing a new label named `ok-to-test` that would be required on any non-member PR before automatic test jobs can be started by the `trigger` plugin. - -This label will be added by members of the *trusted organization* for the repository using the `/ok-to-test` command, detected with a single GenericCommentEvent handler on corresponding events (issue_comment, pull_request_review, and pull_request_review_comment). - -### Implementation Details/Notes/Constraints - -1. PR: declare `ok-to-test` - * add `ok-to-test` to `label_sync` -1. (custom tool needed) batch add `ok-to-test` label to non-members trusted PRs - * for all PR without `ok-to-test` or `needs-ok-to-test` - * if author is not a member of trusted org - * add `ok-to-test` -1. documentation updates - * edit all documentation references to `needs-ok-to-test` -1. 
other cookie crumbs updates - * update infra links (eg: redirect https://go.k8s.io/needs-ok-to-test) - * update locations where edits are expected (eg: https://github.com/search?q=org%3Akubernetes+needs-ok-to-test&type=Code) -1. PR: switch to `ok-to-test` - * remove `needs-ok-to-test` from `missingLabels` in `prow/config.yaml` - * edit `prow/config/jobs_test.go` - * edit `prow/cmd/deck/static/style.css` - * edit `prow/cmd/tide/README.md` - * code changes in `trigger`: - * `/ok-to-test` adds `ok-to-test` - * PR trust relies on `ok-to-test` - * if PR has both labels, drop `needs-ok-to-test` - * edit all references to `needs-ok-to-test` -1. run batch job again, to catch new PRs that arrived between first run and merge/deploy -1. (to be discussed) periodically check for and report PRs with both `ok-to-test` and `needs-ok-to-test` labels - -### Benefits - -* Trusted PRs are easily identified by either being authored by org members, or by having the `ok-to-test` label. -* Race conditions can no longer happen when checking if a PR is trusted. -* API tokens are saved by avoiding listing the comments, reviews, and review comments every time we need to check if a PR is trusted. - -### Risks and Mitigations - -TODO - -## Graduation Criteria - -TODO - -## Future evolutions - -In the future, we might decide to require the new label for all PRs, which means that organization members will also need the `ok-to-test` label applied to their PRs before automatic testing can be triggered. - -Trusted and untrusted PRs will be even easier to tell apart. - -This would require adding automatically the `ok-to-test` label to member authored PRs to keep the current functionality. - -## References - -* https://github.com/kubernetes/test-infra/issues/3827 -* https://github.com/kubernetes/test-infra/issues/7801 -* https://github.com/kubernetes/test-infra/pull/5246 - -## Implementation History - -* 2018-06-25: creation of the KEP -* 2018-07-09: KEP content LGTM during sig-testing presentation -* 2018-07-24: KEP updated to keep `needs-ok-to-test` for better UX -* 2018-09-03: KEP rewritten with template -* 2018-10-04: KEP merged into master -* 2018-10-08: start of implementation -* 2018-10-10: `ok-to-test` label added +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
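To illustrate the shape of the `trigger` change proposed above, a rough Go sketch of the comment handling follows. The regular expression, handler signature, and client interface are simplified stand-ins for the real prow plugin wiring, not the actual implementation.

```go
package trigger

import (
	"regexp"
)

// okToTestRe matches a bare "/ok-to-test" command on its own line.
var okToTestRe = regexp.MustCompile(`(?mi)^/ok-to-test\s*$`)

// githubClient is a simplified stand-in for the plugin's GitHub client.
type githubClient interface {
	AddLabel(org, repo string, number int, label string) error
	RemoveLabel(org, repo string, number int, label string) error
}

// handleOkToTest sketches the proposed behavior: when a trusted member issues
// /ok-to-test on a non-member PR, add the ok-to-test label, and drop the
// needs-ok-to-test label if both are present.
func handleOkToTest(gc githubClient, org, repo string, prNumber int, body string, commenterTrusted, hasNeedsOkToTest bool) error {
	if !commenterTrusted || !okToTestRe.MatchString(body) {
		return nil
	}
	if err := gc.AddLabel(org, repo, prNumber, "ok-to-test"); err != nil {
		return err
	}
	if hasNeedsOkToTest {
		return gc.RemoveLabel(org, repo, prNumber, "needs-ok-to-test")
	}
	return nil
}
```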
\ No newline at end of file
