| author | John Belamaric <jbelamaric@google.com> | 2020-05-18 11:31:37 -0700 |
|---|---|---|
| committer | GitHub <noreply@github.com> | 2020-05-18 11:31:37 -0700 |
| commit | beb2dcbf99bb9857bb4d29eacdeab09f855c37a4 (patch) | |
| tree | 93ceafc4ff8645a7f30e410c4e8dc011066f23a3 /sig-architecture | |
| parent | 7c46f9b449899e0589d4c9335964969d8b9bcb68 (diff) | |
Update PRR process status (#4734)
* Update PRR process status
* Update sig-architecture/production-readiness.md
Co-authored-by: Stephen Augustus <justaugustus@users.noreply.github.com>
* Update sig-architecture/production-readiness.md
Co-authored-by: Stephen Augustus <justaugustus@users.noreply.github.com>
Co-authored-by: Stephen Augustus <justaugustus@users.noreply.github.com>
Diffstat (limited to 'sig-architecture')
| -rw-r--r-- | sig-architecture/production-readiness.md | 177 |
1 file changed, 14 insertions, 163 deletions
diff --git a/sig-architecture/production-readiness.md b/sig-architecture/production-readiness.md
index f277d9ce..0d60a052 100644
--- a/sig-architecture/production-readiness.md
+++ b/sig-architecture/production-readiness.md
@@ -5,170 +5,21 @@
 Kubernetes are observable, scalable and supportable, can be safely operated in
 production environments, and can be disabled or rolled back in the event they
 cause increased failures in production.
 
-## Status
-
-The process and questionnaire are currently under development as part of the
-[PRR KEP][], with a target that reviews will be needed for features going into 1.18.
-
-During the 1.17 cycle, the PRR team will be piloting the questionnaire and other
-aspects of the process.
-
-## Questionnaire
-
-#### Feature enablement and rollback
-
-* **How can this feature be enabled / disabled in a live cluster?**
-  - [ ] Feature gate
-    - Feature gate name:
-    - Components depending on the feature gate:
-  - [ ] Other
-    - Describe the mechanism:
-    - Will enabling / disabling the feature require downtime of the control
-      plane?
-    - Will enabling / disabling the feature require downtime or reprovisioning
-      of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
-
-* **Can the feature be disabled once it has been enabled (i.e. can we rollback
-  the enablement)?**
-  Describe the consequences on existing workloads (e.g. if this is a runtime
-  feature, can it break the existing applications?).
-
-* **What happens if we reenable the feature if it was previously rolled back?**
-
-* **Are there any tests for feature enablement/disablement?**
-  The e2e framework does not currently support enabling and disabling feature
-  gates. However, unit tests in each component dealing with managing data created
-  with and without the feature are necessary. At the very least, think about
-  conversion tests if API types are being modified.
-
-
-#### Rollout, Upgrade and Rollback Planning
-
-* **How can a rollout fail? Can it impact already running workloads?**
-  Try to be as paranoid as possible - e.g. what if some components will restart
-  in the middle of rollout?
-
-* **What specific metrics should inform a rollback?**
-
-* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
-  Describe manual testing that was done and the outcomes.
-  Longer term, we may want to require automated upgrade/rollback tests, but we
-  are missing a bunch of machinery and tooling to do that now.
-
-
-#### Monitoring requirements
-
-* **How can an operator determine if the feature is in use by workloads?**
-  Ideally, this should be a metric. Operations against the Kubernetes API (e.g.
-  checking if there are objects with field X set) may be a last resort. Avoid
-  logs or events for this purpose.
+More details may be found in the [PRR KEP].
 
-* **What are the SLIs (Service Level Indicators) an operator can use to
-  determine the health of the service?**
-  - [ ] Metrics
-    - Metric name:
-    - [Optional] Aggregation method:
-    - Components exposing the metric:
-  - [ ] Other (treat as last resort)
-    - Details:
-
-* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
-  At the high level this usually will be in the form of "high percentile of SLI
-  per day <= X". It's impossible to provide comprehensive guidance, but at the very
-  high level (they need more precise definitions) those may be things like:
-  - per-day percentage of API calls finishing with 5XX errors <= 1%
-  - 99th percentile over day of absolute value of (job creation time minus expected
-    job creation time) for cron job <= 10%
-  - 99.9% of /health requests per day finish with 200 code
-
-* **Are there any missing metrics that would be useful to have to improve
-  observability of this feature?**
-  Describe the metrics themselves and the reason they weren't added (e.g. cost,
-  implementation difficulties, etc.).
-
-#### Dependencies
-
-* **Does this feature depend on any specific services running in the cluster?**
-  Think about both cluster-level services (e.g. metrics-server) as well
-  as node-level agents (e.g. specific version of CRI). Focus on external or
-  optional services that are needed. For example, if this feature depends on
-  a cloud provider API, or upon an external software-defined storage or network
-  control plane.
-  For each of these, fill in the following, thinking both about running user workloads
-  and creating new ones, as well as about cluster-level services (e.g. DNS):
-  - [Dependency name]
-    - Usage description:
-    - Impact of its outage on the feature:
-    - Impact of its degraded performance or high error rates on the feature:
-
-
-#### Scalability
-
-* **Will enabling / using this feature result in any new API calls?**
-  Describe them, providing:
-  - API call type (e.g. PATCH pods)
-  - estimated throughput
-  - originating component(s) (e.g. Kubelet, Feature-X-controller)
-  focusing mostly on:
-  - components listing and/or watching resources they didn't before
-  - API calls that may be triggered by changes of some Kubernetes resources
-    (e.g. update of object X triggers new updates of object Y)
-  - periodic API calls to reconcile state (e.g. periodic fetching state,
-    heartbeats, leader election, etc.)
-
-* **Will enabling / using this feature result in introducing new API types?**
-  Describe them, providing:
-  - API type
-  - Supported number of objects per cluster
-  - Supported number of objects per namespace (for namespace-scoped objects)
-
-* **Will enabling / using this feature result in any new calls to cloud
-  provider?**
-
-* **Will enabling / using this feature result in increasing size or count
-  of the existing API objects?**
-  Describe them, providing:
-  - API type(s):
-  - Estimated increase in size: (e.g. new annotation of size 32B)
-  - Estimated amount of new objects: (e.g. new Object X for every existing Pod)
-
-* **Will enabling / using this feature result in increasing time taken by any
-  operations covered by [existing SLIs/SLOs][]?**
-  Think about adding additional work or introducing new steps in between
-  (e.g. need to do X to start a container), etc. Please describe the details.
-
-* **Will enabling / using this feature result in non-negligible increase of
-  resource usage (CPU, RAM, disk, IO, ...) in any components?**
-  Things to keep in mind include: additional in-memory state, additional
-  non-trivial computations, excessive access to disks (including increased log
-  volume), significant amount of data sent and/or received over network, etc.
-  Think through this both in small and large cases, again with respect to the
-  [supported limits][].
-
-
-#### Troubleshooting
-The Troubleshooting section serves the `Playbook` role as of now. We may consider
-splitting it into a dedicated `Playbook` document (potentially with some monitoring
-details). For now we leave it here though, with some questions not required until
-further stages (e.g. Beta/GA) of the feature lifecycle.
-
-* **How does this feature react if the API server and/or etcd is unavailable?**
-
-* **What are other known failure modes?**
-  For each of them, fill in the following information by copying the below template:
-  - [Failure mode brief description]
-    - Detection: How can it be detected via metrics? Stated another way:
-      how can an operator troubleshoot without logging into a master or worker node?
-    - Mitigations: What can be done to stop the bleeding, especially for already
-      running user workloads?
-    - Diagnostics: What are the useful log messages and their required logging
-      levels that could help debug the issue?
-      Not required until feature graduated to Beta.
-    - Testing: Are there any tests for failure mode? If not, describe why.
+## Status
 
-* **What steps should be taken if SLOs are not being met to determine the problem?**
+As of 1.19, production readiness reviews are required, and are part of the KEP
+process. The PRR questionnaire previously found here has been incorporated into
+the [KEP template]. The template details the specific questions that must be
+answered, depending on the stage of the feature. As of 1.19, PRRs are
+non-blocking; that is, _approval_ is not required for the enhancement to be part of
+the release. This is to provide some time for the community to adapt to the
+process.
+Note that some of the questions should be answered in both the KEP's README.md
+and the `kep.yaml`, in order to support automated checks on the PRR. The
+template points these out as needed.
 
-[PRR KEP]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-architecture/20190731-production-readiness-review-process.md
-[supported limits]: https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md
-[existing SLIs/SLOs]: https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md#kubernetes-slisslos
+[PRR KEP]: https://git.k8s.io/enhancements/keps/sig-architecture/20190731-production-readiness-review-process.md
+[KEP template]: https://git.k8s.io/enhancements/keps/NNNN-kep-template/README.md