| author | John Belamaric <jbelamaric@google.com> | 2020-05-18 11:31:37 -0700 |
|---|---|---|
| committer | GitHub <noreply@github.com> | 2020-05-18 11:31:37 -0700 |
| commit | beb2dcbf99bb9857bb4d29eacdeab09f855c37a4 (patch) | |
| tree | 93ceafc4ff8645a7f30e410c4e8dc011066f23a3 /sig-architecture | |
| parent | 7c46f9b449899e0589d4c9335964969d8b9bcb68 (diff) | |
Update PRR process status (#4734)
* Update PRR process status
* Update sig-architecture/production-readiness.md
Co-authored-by: Stephen Augustus <justaugustus@users.noreply.github.com>
* Update sig-architecture/production-readiness.md
Co-authored-by: Stephen Augustus <justaugustus@users.noreply.github.com>
Co-authored-by: Stephen Augustus <justaugustus@users.noreply.github.com>
Diffstat (limited to 'sig-architecture')
| -rw-r--r-- | sig-architecture/production-readiness.md | 177 |
1 file changed, 14 insertions, 163 deletions
diff --git a/sig-architecture/production-readiness.md b/sig-architecture/production-readiness.md
index f277d9ce..0d60a052 100644
--- a/sig-architecture/production-readiness.md
+++ b/sig-architecture/production-readiness.md
@@ -5,170 +5,21 @@
 Kubernetes are observable, scalable and supportable, can be safely operated in
 production environments, and can be disabled or rolled back in the event they
 cause increased failures in production.
 
-## Status
-
-The process and questionnaire are currently under development as part of the
-[PRR KEP][], with a target that reviews will be needed for features going into 1.18.
-
-During the 1.17 cycle, the PRR team will be piloting the questionnaire and other
-aspects of the process.
-
-## Questionnaire
-
-#### Feature enablement and rollback
-
-* **How can this feature be enabled / disabled in a live cluster?**
-  - [ ] Feature gate
-    - Feature gate name:
-    - Components depending on the feature gate:
-  - [ ] Other
-    - Describe the mechanism:
-    - Will enabling / disabling the feature require downtime of the control
-      plane?
-    - Will enabling / disabling the feature require downtime or reprovisioning
-      of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
-
-* **Can the feature be disabled once it has been enabled (i.e. can we rollback
-  the enablement)?**
-  Describe the consequences on existing workloads (e.g. if this is a runtime
-  feature, can it break the existing applications?).
-
-* **What happens if we reenable the feature if it was previously rolled back?**
-
-* **Are there any tests for feature enablement/disablement?**
-  The e2e framework does not currently support enabling and disabling feature
-  gates. However, unit tests in each component dealing with managing data created
-  with and without the feature are necessary. At the very least, think about
-  conversion tests if API types are being modified.
-
-
-#### Rollout, Upgrade and Rollback Planning
-
-* **How can a rollout fail? Can it impact already running workloads?**
-  Try to be as paranoid as possible - e.g. what if some components will restart
-  in the middle of rollout?
-
-* **What specific metrics should inform a rollback?**
-
-* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
-  Describe manual testing that was done and the outcomes.
-  Longer term, we may want to require automated upgrade/rollback tests, but we
-  are missing a bunch of machinery and tooling to do that now.
-
-
-#### Monitoring requirements
-
-* **How can an operator determine if the feature is in use by workloads?**
-  Ideally, this should be a metric. Operations against the Kubernetes API (e.g.
-  checking if there are objects with field X set) may be a last resort. Avoid
-  logs or events for this purpose.
+More details may be found in the [PRR KEP].
 
-* **What are the SLIs (Service Level Indicators) an operator can use to
-  determine the health of the service?**
-  - [ ] Metrics
-    - Metric name:
-    - [Optional] Aggregation method:
-    - Components exposing the metric:
-  - [ ] Other (treat as last resort)
-    - Details:
-
-* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
-  At the high level this usually will be in the form of "high percentile of SLI
-  per day <= X". It's impossible to provide comprehensive guidance, but at the very
-  high level (they need more precise definitions) those may be things like:
-  - per-day percentage of API calls finishing with 5XX errors <= 1%
-  - 99th percentile over day of absolute value of (job creation time minus expected
-    job creation time) for cron job <= 10%
-  - 99.9% of /health requests per day finish with 200 code
-
-* **Are there any missing metrics that would be useful to have to improve
-  observability of this feature?**
-  Describe the metrics themselves and the reason they weren't added (e.g. cost,
-  implementation difficulties, etc.).
-
-#### Dependencies
-
-* **Does this feature depend on any specific services running in the cluster?**
-  Think about both cluster-level services (e.g. metrics-server) as well
-  as node-level agents (e.g. specific version of CRI). Focus on external or
-  optional services that are needed. For example, if this feature depends on
-  a cloud provider API, or upon an external software-defined storage or network
-  control plane.
-  For each of these, fill in the following, thinking both about running user workloads
-  and creating new ones, as well as about cluster-level services (e.g. DNS):
-  - [Dependency name]
-    - Usage description:
-    - Impact of its outage on the feature:
-    - Impact of its degraded performance or high error rates on the feature:
-
-
-#### Scalability
-
-* **Will enabling / using this feature result in any new API calls?**
-  Describe them, providing:
-  - API call type (e.g. PATCH pods)
-  - estimated throughput
-  - originating component(s) (e.g. Kubelet, Feature-X-controller)
-  focusing mostly on:
-  - components listing and/or watching resources they didn't before
-  - API calls that may be triggered by changes of some Kubernetes resources
-    (e.g. update of object X triggers new updates of object Y)
-  - periodic API calls to reconcile state (e.g. periodic fetching state,
-    heartbeats, leader election, etc.)
-
-* **Will enabling / using this feature result in introducing new API types?**
-  Describe them, providing:
-  - API type
-  - Supported number of objects per cluster
-  - Supported number of objects per namespace (for namespace-scoped objects)
-
-* **Will enabling / using this feature result in any new calls to cloud
-  provider?**
-
-* **Will enabling / using this feature result in increasing size or count
-  of the existing API objects?**
-  Describe them, providing:
-  - API type(s):
-  - Estimated increase in size: (e.g. new annotation of size 32B)
-  - Estimated amount of new objects: (e.g. new Object X for every existing Pod)
-
-* **Will enabling / using this feature result in increasing time taken by any
-  operations covered by [existing SLIs/SLOs][]?**
-  Think about adding additional work or introducing new steps in between
-  (e.g. need to do X to start a container), etc. Please describe the details.
-
-* **Will enabling / using this feature result in non-negligible increase of
-  resource usage (CPU, RAM, disk, IO, ...) in any components?**
-  Things to keep in mind include: additional in-memory state, additional
-  non-trivial computations, excessive access to disks (including increased log
-  volume), significant amount of data sent and/or received over network, etc.
-  Think through this both in small and large cases, again with respect to the
-  [supported limits][].
-
-
-#### Troubleshooting
-The Troubleshooting section serves the `Playbook` role as of now. We may consider
-splitting it into a dedicated `Playbook` document (potentially with some monitoring
-details). For now we leave it here though, with some questions not required until
-further stages (e.g. Beta/GA) of the feature lifecycle.
-
-* **How does this feature react if the API server and/or etcd is unavailable?**
-
-* **What are other known failure modes?**
-  For each of them, fill in the following information by copying the below template:
-  - [Failure mode brief description]
-    - Detection: How can it be detected via metrics? Stated another way:
-      how can an operator troubleshoot without logging into a master or worker node?
-    - Mitigations: What can be done to stop the bleeding, especially for already
-      running user workloads?
-    - Diagnostics: What are the useful log messages and their required logging
-      levels that could help debug the issue?
-      Not required until feature graduated to Beta.
-    - Testing: Are there any tests for failure mode? If not, describe why.
+## Status
 
-* **What steps should be taken if SLOs are not being met to determine the problem?**
+As of 1.19, production readiness reviews are required, and are part of the KEP
+process. The PRR questionnaire previously found here has been incorporated into
+the [KEP template]. The template details the specific questions that must be
+answered, depending on the stage of the feature. As of 1.19, PRRs are
+non-blocking; that is, _approval_ is not required for the enhancement to be part of
+the release. This is to provide some time for the community to adapt to the
+process.
+Note that some of the questions should be answered in both the KEP's README.md
+and the `kep.yaml`, in order to support automated checks on the PRR. The
+template points these out as needed.
 
-[PRR KEP]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-architecture/20190731-production-readiness-review-process.md
-[supported limits]: https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md
-[existing SLIs/SLOs]: https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md#kubernetes-slisslos
+[PRR KEP]: https://git.k8s.io/enhancements/keps/sig-architecture/20190731-production-readiness-review-process.md
+[KEP template]: https://git.k8s.io/enhancements/keps/NNNN-kep-template/README.md