| author | John Belamaric <jbelamaric@google.com> | 2019-10-17 10:28:54 -0700 |
|---|---|---|
| committer | John Belamaric <jbelamaric@google.com> | 2019-10-18 09:11:53 -0700 |
| commit | f79dd3ec8002c1088515413afa99bd1b7f1a1e5b (patch) | |
| tree | 91a514c01220ab45d69398cebb737e601b3fd9ee /sig-architecture | |
| parent | 520bce7c764f70ad8c7404700b17d3ce9f0cf96f (diff) | |
Initial PRR pilot policy doc
Diffstat (limited to 'sig-architecture')
| -rw-r--r-- | sig-architecture/production-readiness.md | 51 |
1 files changed, 51 insertions, 0 deletions
diff --git a/sig-architecture/production-readiness.md b/sig-architecture/production-readiness.md
new file mode 100644
index 00000000..17ed49dc
--- /dev/null
+++ b/sig-architecture/production-readiness.md
@@ -0,0 +1,51 @@
+# Production Readiness Review Process
+
+Production readiness reviews are intended to ensure that features merging into
+Kubernetes are observable, scalable and supportable, can be safely operated in
+production environments, and can be disabled or rolled back in the event they
+cause increased failures in production.
+
+## Status
+
+The process and questionnaire are currently under development as part of the
+[PRR KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-architecture/20190731-production-readiness-review-process.md), with a target that reviews will be needed for features
+going into 1.18.
+
+During the 1.17 cycle, the PRR team will be piloting the questionnaire and other
+aspects of the process.
+
+## Questionnaire
+
+* Feature enablement and rollback
+  - How can this feature be enabled / disabled in a live cluster?
+  - Can the feature be disabled once it has been enabled (i.e., can we roll
+    back the enablement)?
+  - Will enabling / disabling the feature require downtime for the control
+    plane?
+  - Will enabling / disabling the feature require downtime or reprovisioning
+    of a node?
+  - What happens if a cluster with this feature enabled is rolled back? What
+    happens if it is subsequently upgraded again?
+  - Are there tests for this?
+* Scalability
+* Rollout, Upgrade, and Rollback Planning
+* Dependencies
+  - Does this feature depend on any specific services running in the cluster
+    (e.g., a metrics service)?
+  - How does this feature respond to complete failures of the services on
+    which it depends?
+  - How does this feature respond to degraded performance or high error rates
+    from services on which it depends?
+* Monitoring requirements
+  - How can an operator determine if the feature is in use by workloads?
+  - How can an operator determine if the feature is functioning properly?
+  - What are the service level indicators an operator can use to determine the
+    health of the service?
+  - What are reasonable service level objectives for the feature?
+* Troubleshooting
+  - What are the known failure modes?
+  - How can those be detected via metrics or logs?
+  - What are the mitigations for each of those failure modes?
+  - What are the most useful log messages and what logging levels do they require?
+  - What steps should be taken if SLOs are not being met to determine the
+    problem?
