author John Belamaric <jbelamaric@google.com> 2019-10-17 10:28:54 -0700
committer John Belamaric <jbelamaric@google.com> 2019-10-18 09:11:53 -0700
commit f79dd3ec8002c1088515413afa99bd1b7f1a1e5b
tree 91a514c01220ab45d69398cebb737e601b3fd9ee /sig-architecture
parent 520bce7c764f70ad8c7404700b17d3ce9f0cf96f
Initial PRR pilot policy doc
Diffstat (limited to 'sig-architecture')
-rw-r--r-- sig-architecture/production-readiness.md | 51
1 files changed, 51 insertions, 0 deletions
diff --git a/sig-architecture/production-readiness.md b/sig-architecture/production-readiness.md
new file mode 100644
index 00000000..17ed49dc
--- /dev/null
+++ b/sig-architecture/production-readiness.md
@@ -0,0 +1,51 @@
+# Production Readiness Review Process
+
+Production readiness reviews are intended to ensure that features merging into
+Kubernetes are observable, scalable and supportable, can be safely operated in
+production environments, and can be disabled or rolled back in the event they
+cause increased failures in production.
+
+## Status
+
+The process and questionnaire are currently under development as part of the
+[PRR KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-architecture/20190731-production-readiness-review-process.md), with a target that reviews will be needed for features
+going into 1.18.
+
+During the 1.17 cycle, the PRR team will be piloting the questionnaire and other
+aspects of the process.
+
+## Questionnaire
+
+* Feature enablement and rollback
+  - How can this feature be enabled / disabled in a live cluster?
+  - Can the feature be disabled once it has been enabled (i.e., can we roll
+    back the enablement)?
+  - Will enabling / disabling the feature require downtime for the control
+    plane?
+  - Will enabling / disabling the feature require downtime or reprovisioning
+    of a node?
+  - What happens if a cluster with this feature enabled is rolled back? What
+    happens if it is subsequently upgraded again?
+  - Are there tests for this?
+* Scalability
+* Rollout, Upgrade, and Rollback Planning
+* Dependencies
+  - Does this feature depend on any specific services running in the cluster
+    (e.g., a metrics service)?
+  - How does this feature respond to complete failures of the services on
+    which it depends?
+  - How does this feature respond to degraded performance or high error rates
+    from services on which it depends?
+* Monitoring requirements
+  - How can an operator determine if the feature is in use by workloads?
+  - How can an operator determine if the feature is functioning properly?
+  - What are the service level indicators an operator can use to determine the
+    health of the service?
+  - What are reasonable service level objectives for the feature?
+* Troubleshooting
+  - What are the known failure modes?
+  - How can those be detected via metrics or logs?
+  - What are the mitigations for each of those failure modes?
+  - What are the most useful log messages and what logging levels do they require?
+  - What steps should be taken if SLOs are not being met to determine the
+    problem?