| author | wojtekt <wojtekt@google.com> | 2019-12-11 10:48:11 +0100 |
|---|---|---|
| committer | wojtekt <wojtekt@google.com> | 2019-12-13 09:11:53 +0100 |
| commit | 322dac026b24ff0c57407b93ad96a7a7e4d0ca9e (patch) | |
| tree | eb33f064773d05bb8eda8e85e5b230693df9e4a8 /sig-architecture | |
| parent | 31025aabc867830a03c918f4a7e18c9d79ee3a5f (diff) | |
Rework PRR questionnaire
Diffstat (limited to 'sig-architecture')
| file mode | path | lines changed |
|---|---|---|
| -rw-r--r-- | sig-architecture/production-readiness.md | 175 |

1 file changed, 118 insertions, 57 deletions
```diff
diff --git a/sig-architecture/production-readiness.md b/sig-architecture/production-readiness.md
index 9f953678..88c55a39 100644
--- a/sig-architecture/production-readiness.md
+++ b/sig-architecture/production-readiness.md
@@ -15,63 +15,124 @@ aspects of the process.
 
 ## Questionnaire
 
-* Feature enablement and rollback
-  - How can this feature be enabled / disabled in a live cluster?
-  - Can the feature be disabled once it has been enabled (i.e., can we roll
-    back the enablement)?
-  - Will enabling / disabling the feature require downtime for the control
-    plane?
-  - Will enabling / disabling the feature require downtime or reprovisioning
-    of a node?
-  - What happens if a cluster with this feature enabled is rolled back? What
-    happens if it is subsequently upgraded again?
-  - Are there tests for this?
-* Scalability
-  - Will enabling / using the feature result in any new API calls?
-    Describe them with their impact keeping in mind the [supported limits][]
-    (e.g. 5000 nodes per cluster, 100 pods/s churn) focusing mostly on:
-    - components listing and/or watching resources they didn't before
-    - API calls that may be triggered by changes of some Kubernetes
-      resources (e.g. update object X based on changes of object Y)
-    - periodic API calls to reconcile state (e.g. periodic fetching state,
-      heartbeats, leader election, etc.)
-  - Will enabling / using the feature result in supporting new API types?
-    How many objects of that type will be supported (and how that translates
-    to limitations for users)?
-  - Will enabling / using the feature result in increasing size or count
-    of the existing API objects?
-  - Will enabling / using the feature result in increasing time taken
-    by any operations covered by [existing SLIs/SLOs][] (e.g. by adding
-    additional work, introducing new steps in between, etc.)?
-    Please describe the details if so.
-  - Will enabling / using the feature result in non-negligible increase
-    of resource usage (CPU, RAM, disk IO, ...) in any components?
-    Things to keep in mind include: additional in-memory state, additional
-    non-trivial computations, excessive access to disks (including increased
-    log volume), significant amount of data sent and/or received over
-    network, etc. Think through this in both small and large cases, again
-    with respect to the [supported limits][].
-* Rollout, Upgrade, and Rollback Planning
-* Dependencies
-  - Does this feature depend on any specific services running in the cluster
-    (e.g., a metrics service)?
-  - How does this feature respond to complete failures of the services on
-    which it depends?
-  - How does this feature respond to degraded performance or high error rates
-    from services on which it depends?
-* Monitoring requirements
-  - How can an operator determine if the feature is in use by workloads?
-  - How can an operator determine if the feature is functioning properly?
-  - What are the service level indicators an operator can use to determine the
-    health of the service?
-  - What are reasonable service level objectives for the feature?
-* Troubleshooting
-  - What are the known failure modes?
-  - How can those be detected via metrics or logs?
-  - What are the mitigations for each of those failure modes?
-  - What are the most useful log messages and what logging levels do they require?
-  - What steps should be taken if SLOs are not being met to determine the
-    problem?
+#### Feature enablement and rollback
+
+* **How can this feature be enabled / disabled in a live cluster?**
+  - [ ] Feature gate
+    - Feature gate name:
+    - Components depending on the feature gate:
+  - [ ] Other
+    - Describe the mechanism:
+    - Will enabling / disabling the feature require downtime of the control
+      plane?
+    - Will enabling / disabling the feature require downtime or reprovisioning
+      of a node? (Do not assume the `Dynamic Kubelet Config` feature is enabled).
+
+* **Can the feature be disabled once it has been enabled (i.e. can we roll back
+  the enablement)?**
+  Describe the consequences for existing workloads (e.g. if this is a runtime
+  feature, can it break existing applications?).
+
+* **What happens if we reenable the feature after it was previously rolled back?**
+
+* **Are there any tests for feature enablement / disablement?**
+  At the very least, think about conversion tests if API types are being modified.
+
+#### Scalability
+
+* **Will enabling / using this feature result in any new API calls?**
+  Describe them, providing:
+  - API call type (e.g. PATCH pods)
+  - estimated throughput
+  - originating component(s) (e.g. Kubelet, Feature-X-controller)
+  focusing mostly on:
+  - components listing and/or watching resources they didn't before
+  - API calls that may be triggered by changes of some Kubernetes resources
+    (e.g. update of object X triggers new updates of object Y)
+  - periodic API calls to reconcile state (e.g. periodic fetching state,
+    heartbeats, leader election, etc.)
+
+* **Will enabling / using this feature result in introducing new API types?**
+  Describe them, providing:
+  - API type
+  - Supported number of objects per cluster
+  - Supported number of objects per namespace (for namespace-scoped objects)
+
+* **Will enabling / using this feature result in any new calls to the cloud
+  provider?**
+
+* **Will enabling / using this feature result in increasing size or count
+  of the existing API objects?**
+  Describe them, providing:
+  - API type(s):
+  - Estimated increase in size: (e.g. new annotation of size 32B)
+  - Estimated amount of new objects: (e.g. new Object X for every existing Pod)
+
+* **Will enabling / using this feature result in increasing time taken by any
+  operations covered by [existing SLIs/SLOs][]?**
+  Think about adding additional work or introducing new steps in between
+  (e.g. need to do X to start a container), etc. Please describe the details.
+
+* **Will enabling / using this feature result in non-negligible increase of
+  resource usage (CPU, RAM, disk, IO, ...) in any components?**
+  Things to keep in mind include: additional in-memory state, additional
+  non-trivial computations, excessive access to disks (including increased log
+  volume), significant amount of data sent and/or received over network, etc.
+  Think through this in both small and large cases, again with respect to the
+  [supported limits][].
+
+#### Rollout, Upgrade and Rollback Planning
+
+#### Dependencies
+
+* **Does this feature depend on any specific services running in the cluster?**
+  Think about both cluster-level services (e.g. metrics-server) as well
+  as node-level agents (e.g. specific version of CRI).
+
+* **How does this feature respond to complete failures of the services on which
+  it depends?**
+  Think about both running and newly created user workloads as well as
+  cluster-level services (e.g. DNS).
+
+* **How does this feature respond to degraded performance or high error rates
+  from services on which it depends?**
+
+#### Monitoring requirements
+
+* **How can an operator determine if the feature is in use by workloads?**
+
+* **How can an operator determine if the feature is functioning properly?**
+  Focus on metrics that cluster operators may gather from different
+  components and treat other signals as a last resort.
+
+* **What are the SLIs (Service Level Indicators) an operator can use to
+  determine the health of the service?**
+  - [ ] Metrics
+    - Metric name:
+    - [Optional] Aggregation method:
+    - Components exposing the metric:
+  - [ ] Other (treat as last resort)
+    - Details:
+
+* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
+
+#### Troubleshooting
+The Troubleshooting section currently serves the `Playbook` role. We may consider
+splitting it into a dedicated `Playbook` document (potentially with some monitoring
+details). For now we leave it here, with some questions not required until
+further stages (e.g. Beta/GA) of the feature lifecycle.
+
+* **What are the known failure modes?**
+
+* **How can those be detected via metrics or logs?**
+
+* **What are the mitigations for each of those failure modes?**
+
+* **What are the most useful log messages and what logging levels do they require?**
+  Not required until the feature graduates to Beta.
+
+* **What steps should be taken if SLOs are not being met to determine the problem?**
+
 [PRR KEP]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-architecture/20190731-production-readiness-review-process.md
 [supported limits]: https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md
```
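To make the "Feature gate" enablement mechanism in the questionnaire concrete, here is a minimal sketch of how a component could register and consult an alpha gate. It assumes the `k8s.io/component-base/featuregate` package; the gate name `FeatureX` and the wiring shown are hypothetical and not part of this commit.

```go
package main

import (
	"fmt"

	"k8s.io/component-base/featuregate"
)

// FeatureX is a hypothetical alpha feature gate; a real KEP would list the
// actual gate name and every component that consults it.
const FeatureX featuregate.Feature = "FeatureX"

// gates holds this component's feature gates. Real components typically share
// one process-wide mutable feature gate.
var gates featuregate.MutableFeatureGate = featuregate.NewFeatureGate()

func init() {
	// Register the gate as alpha and disabled by default, so operators must
	// opt in explicitly (e.g. via a --feature-gates=FeatureX=true flag).
	if err := gates.Add(map[featuregate.Feature]featuregate.FeatureSpec{
		FeatureX: {Default: false, PreRelease: featuregate.Alpha},
	}); err != nil {
		panic(err)
	}
}

func main() {
	// Simulate an operator enabling the gate on the command line.
	if err := gates.Set("FeatureX=true"); err != nil {
		panic(err)
	}

	// Every new code path is guarded by the gate, which is what makes
	// disabling the feature (rolling back the enablement) a pure
	// configuration change rather than a code change.
	if gates.Enabled(FeatureX) {
		fmt.Println("FeatureX code path active")
	} else {
		fmt.Println("FeatureX disabled, using the old behavior")
	}
}
```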
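For the scalability question about new API calls ("components listing and/or watching resources they didn't before"), this is a sketch of the usual client-go pattern whose footprint an author would enumerate: one LIST plus one long-running WATCH per informer and resource type. It assumes `k8s.io/client-go`; watching Pods here is only an illustration, not something the commit prescribes.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig; an in-cluster component would
	// use rest.InClusterConfig() instead.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// A shared informer issues one LIST plus a long-running WATCH for the
	// resource type (here: Pods) instead of per-object polling. This is the
	// kind of API-call footprint the scalability questions ask you to describe,
	// keeping the supported limits (e.g. 5000-node clusters) in mind.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			fmt.Println("observed pod", pod.Namespace+"/"+pod.Name)
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	cache.WaitForCacheSync(stop, podInformer.HasSynced)
	select {} // keep watching
}
```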
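For the "Metrics" checkbox under SLIs, here is a sketch of the kind of per-operation latency metric a component could expose and that an operator could turn into an SLO. It uses the Prometheus Go client directly for brevity; the metric name `feature_x_sync_duration_seconds` and the reconcile loop are made up for illustration (Kubernetes components typically register metrics through `k8s.io/component-base/metrics`, which wraps this client).

```go
package main

import (
	"log"
	"math/rand"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// syncDuration is a hypothetical SLI: how long one reconcile of Feature X
// takes, labelled by outcome so an error rate can be derived from the same series.
var syncDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "feature_x_sync_duration_seconds",
		Help:    "Duration of Feature X reconcile attempts.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"result"},
)

func init() {
	prometheus.MustRegister(syncDuration)
}

// syncOnce stands in for the feature's reconcile loop body.
func syncOnce() {
	start := time.Now()
	time.Sleep(time.Duration(rand.Intn(50)) * time.Millisecond) // pretend work
	result := "success"
	if rand.Intn(10) == 0 { // pretend ~10% of attempts fail
		result = "error"
	}
	syncDuration.WithLabelValues(result).Observe(time.Since(start).Seconds())
}

func main() {
	go func() {
		for {
			syncOnce()
			time.Sleep(time.Second)
		}
	}()
	// Expose the metric so an operator can scrape it and alert on an SLO such
	// as "99% of reconciles complete within 1s over a 30-minute window".
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```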
