author    John Belamaric <jbelamaric@google.com>    2020-05-18 11:31:37 -0700
committer GitHub <noreply@github.com>               2020-05-18 11:31:37 -0700
commit    beb2dcbf99bb9857bb4d29eacdeab09f855c37a4 (patch)
tree      93ceafc4ff8645a7f30e410c4e8dc011066f23a3 /sig-architecture
parent    7c46f9b449899e0589d4c9335964969d8b9bcb68 (diff)
Update PRR process status (#4734)
* Update PRR process status
* Update sig-architecture/production-readiness.md
  Co-authored-by: Stephen Augustus <justaugustus@users.noreply.github.com>
* Update sig-architecture/production-readiness.md
  Co-authored-by: Stephen Augustus <justaugustus@users.noreply.github.com>

Co-authored-by: Stephen Augustus <justaugustus@users.noreply.github.com>
Diffstat (limited to 'sig-architecture')
-rw-r--r--  sig-architecture/production-readiness.md  177
1 file changed, 14 insertions(+), 163 deletions(-)
diff --git a/sig-architecture/production-readiness.md b/sig-architecture/production-readiness.md
index f277d9ce..0d60a052 100644
--- a/sig-architecture/production-readiness.md
+++ b/sig-architecture/production-readiness.md
@@ -5,170 +5,21 @@ Kubernetes are observable, scalable and supportable, can be safely operated in
production environments, and can be disabled or rolled back in the event they
cause increased failures in production.
-## Status
-
-The process and questionnaire are currently under development as part of the
-[PRR KEP][], with a target that reviews will be needed for features going into 1.18.
-
-During the 1.17 cycle, the PRR team will be piloting the questionnaire and other
-aspects of the process.
-
-## Questionnaire
-
-#### Feature enablement and rollback
-
-* **How can this feature be enabled / disabled in a live cluster?** (See the sketch after this list.)
- - [ ] Feature gate
- - Feature gate name:
- - Components depending on the feature gate:
- - [ ] Other
- - Describe the mechanism:
- - Will enabling / disabling the feature require downtime of the control
- plane?
- - Will enabling / disabling the feature require downtime or reprovisioning
- of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
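For illustration, toggling a gate in a live cluster usually means passing the `--feature-gates` flag to the affected components and restarting them. Below is a minimal, hedged sketch of a kubeadm-style kube-apiserver static pod manifest; the gate name `MyFeature`, the image tag, and the file path are placeholders, not part of the questionnaire:

```yaml
# Sketch only: enabling a hypothetical MyFeature gate on the API server.
# Abridged kubeadm-style static pod manifest, e.g. /etc/kubernetes/manifests/kube-apiserver.yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - name: kube-apiserver
    image: k8s.gcr.io/kube-apiserver:v1.19.0
    command:
    - kube-apiserver
    - --feature-gates=MyFeature=true   # flip to false (or remove) to disable
```

The same flag is accepted by the other control plane components and the kubelet, which is why the questionnaire asks which components depend on the gate.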
-
-* **Can the feature be disabled once it has been enabled (i.e. can we roll back
- the enablement)?**
- Describe the consequences for existing workloads (e.g. if this is a runtime
- feature, can it break existing applications?).
-
-* **What happens if we reenable the feature after it was previously rolled back?**
-
-* **Are there any tests for feature enablement / disablement?**
- The e2e framework does not currently support enabling and disabling feature
- gates. However, unit tests in each component that deal with managing data
- created with and without the feature are necessary (see the sketch below). At
- the very least, think about conversion tests if API types are being modified.
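A minimal sketch of such a unit test, assuming a hypothetical gate registered as `features.MyFeature` and using the in-tree feature gate testing helper (API as of the 1.18/1.19 era):

```go
package myfeature

import (
	"fmt"
	"testing"

	utilfeature "k8s.io/apiserver/pkg/util/feature"
	featuregatetesting "k8s.io/component-base/featuregate/testing"
	"k8s.io/kubernetes/pkg/features"
)

func TestFeatureEnablement(t *testing.T) {
	for _, enabled := range []bool{true, false} {
		t.Run(fmt.Sprintf("gate=%t", enabled), func(t *testing.T) {
			// Flip the (hypothetical) gate for the duration of this subtest;
			// the returned func restores the previous value.
			defer featuregatetesting.SetFeatureGateDuringTest(t, utilfeature.DefaultFeatureGate, features.MyFeature, enabled)()

			// Exercise code that must handle objects created both with and
			// without the feature, e.g. field-dropping and API conversion paths.
		})
	}
}
```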
-
-
-#### Rollout, Upgrade and Rollback Planning
-
-* **How can a rollout fail? Can it impact already running workloads?**
- Try to be as paranoid as possible - e.g. what if some components restart
- in the middle of the rollout?
-
-* **What specific metrics should inform a rollback?**
-
-* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
- Describe the manual testing that was done and the outcomes.
- Longer term, we may want to require automated upgrade/rollback tests, but we
- are missing a bunch of machinery and tooling to do that now.
-
-
-#### Monitoring requirements
-
-* **How can an operator determine if the feature is in use by workloads?**
- Ideally, this should be a metric. Operations against the Kubernetes API (e.g.
- checking if there are objects with field X set) may be a last resort. Avoid
- logs or events for this purpose.
+More details may be found in the [PRR KEP].
-* **What are the SLIs (Service Level Indicators) an operator can use to
- determine the health of the service?** (See the sketch after this list.)
- - [ ] Metrics
- - Metric name:
- - [Optional] Aggregation method:
- - Components exposing the metric:
- - [ ] Other (treat as last resort)
- - Details:
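For instance, a component-level counter exposed through the Kubernetes metrics machinery can serve as such an SLI. A rough sketch, with a hypothetical metric name, subsystem, and recording function:

```go
package myfeature

import (
	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

// featureOperationsTotal is a hypothetical SLI: operations handled by the
// feature, partitioned by result, so an operator can alert on error ratios.
var featureOperationsTotal = metrics.NewCounterVec(
	&metrics.CounterOpts{
		Subsystem:      "my_feature",
		Name:           "operations_total",
		Help:           "Number of operations handled by my-feature, by result.",
		StabilityLevel: metrics.ALPHA,
	},
	[]string{"result"},
)

func init() {
	// Register with the shared registry scraped from /metrics.
	legacyregistry.MustRegister(featureOperationsTotal)
}

// recordOperation is called from the feature's code paths.
func recordOperation(result string) {
	featureOperationsTotal.WithLabelValues(result).Inc()
}
```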
-
-* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
- At a high level, this will usually be in the form of "high percentile of SLI
- per day <= X". It's impossible to provide comprehensive guidance, but at a very
- high level (they need more precise definitions) these may be things like:
- - per-day percentage of API calls finishing with 5XX errors <= 1%
- - 99th percentile over a day of the absolute difference between job creation
- time and expected job creation time for cron jobs <= 10%
- - 99.9% of /health requests per day finish with a 200 code
-
-* **Are there any missing metrics that would be useful to have to improve
- observability of this feature?**
- Describe the metrics themselves and the reason they weren't added (e.g. cost,
- implementation difficulties, etc.).
-
-#### Dependencies
-
-* **Does this feature depend on any specific services running in the cluster?**
- Think about both cluster-level services (e.g. metrics-server) as well
- as node-level agents (e.g. specific version of CRI). Focus on external or
- optional services that are needed. For example, if this feature depends on
- a cloud provider API, or upon an external software-defined storage or network
- control plane.
- For each of these, fill in the following, thinking both about running user workloads
- and creating new ones, as well as about cluster-level services (e.g. DNS):
- - [Dependency name]
- - Usage description:
- - Impact of its outage on the feature:
- - Impact of its degraded performance or high error rates on the feature:
-
-
-#### Scalability
-
-* **Will enabling / using this feature result in any new API calls?**
- Describe them, providing:
- - API call type (e.g. PATCH pods)
- - estimated throughput
- - originating component(s) (e.g. Kubelet, Feature-X-controller)
- focusing mostly on:
- - components listing and/or watching resources they didn't before
- - API calls that may be triggered by changes of some Kubernetes resources
- (e.g. update of object X triggers new updates of object Y)
- - periodic API calls to reconcile state (e.g. periodic fetching state,
- heartbeats, leader election, etc.)
-
-* **Will enabling / using this feature result in introducing new API types?**
- Describe them providing:
- - API type
- - Supported number of objects per cluster
- - Supported number of objects per namespace (for namespace-scoped objects)
-
-* **Will enabling / using this feature result in any new calls to the cloud
- provider?**
-
-* **Will enabling / using this feature result in increasing size or count
- of the existing API objects?**
- Describe them providing:
- - API type(s):
- - Estimated increase in size: (e.g. new annotation of size 32B)
- - Estimated amount of new objects: (e.g. new Object X for every existing Pod)
-
-* **Will enabling / using this feature result in increasing time taken by any
- operations covered by [existing SLIs/SLOs][]?**
- Think about adding additional work or introducing new steps in between
- (e.g. need to do X to start a container), etc. Please describe the details.
-
-* **Will enabling / using this feature result in non-negligible increase of
- resource usage (CPU, RAM, disk, IO, ...) in any components?**
- Things to keep in mind include: additional in-memory state, additional
- non-trivial computations, excessive access to disks (including increased log
- volume), a significant amount of data sent and/or received over the network, etc.
- Think through this both in small and large cases, again with respect to the
- [supported limits][].
-
-
-#### Troubleshooting
-The Troubleshooting section serves the `Playbook` role for now. We may consider
-splitting it into a dedicated `Playbook` document (potentially with some monitoring
-details). For now we leave it here, though, with some questions not required until
-later stages (e.g. Beta/GA) of the feature lifecycle.
-
-* **How does this feature react if the API server and/or etcd is unavailable?**
-
-* **What are other known failure modes?**
- For each of them fill in the following information by copying the below template:
- - [Failure mode brief description]
- - Detection: How can it be detected via metrics? Stated another way:
- how can an operator troubleshoot without logging into a master or worker node?
- - Mitigations: What can be done to stop the bleeding, especially for already
- running user workloads?
- - Diagnostics: What are the useful log messages and their required logging
- levels that could help debugging the issue?
- Not required until feature graduated to Beta.
- - Testing: Are there any tests for failure mode? If not describe why.
+## Status
-* **What steps should be taken if SLOs are not being met to determine the problem?**
+As of 1.19, production readiness reviews are required, and are part of the KEP
+process. The PRR questionnaire previously found here has been incorporated into
+the [KEP template]. The template details the specific questions that must be
+answered, depending on the stage of the feature. As of 1.19, PRRs are
+non-blocking; that is, _approval_ is not required for the enhancement to be part of
+the release. This is to provide some time for the community to adapt to the
+process.
+Note that some of the questions should be answered in both the KEP's README.md
+and its `kep.yaml`, in order to support automated checks on the PRR. The
+template points these out as needed.
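For example, the machine-checkable portion of those answers might appear in `kep.yaml` roughly as follows; this is a sketch based on the KEP template of this era, and the gate, component, and metric names are placeholders:

```yaml
# Sketch of the PRR-related fields in a KEP's kep.yaml; values are placeholders.
feature-gates:
  - name: MyFeature
    components:
      - kube-apiserver
      - kube-controller-manager
disable-supported: true

# Required once the feature targets beta.
metrics:
  - my_feature_operations_total
```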
-[PRR KEP]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-architecture/20190731-production-readiness-review-process.md
-[supported limits]: https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md
-[existing SLIs/SLOs]: https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md#kubernetes-slisslos
+[PRR KEP]: https://git.k8s.io/enhancements/keps/sig-architecture/20190731-production-readiness-review-process.md
+[KEP template]: https://git.k8s.io/enhancements/keps/NNNN-kep-template/README.md