| author | John Belamaric <jbelamaric@google.com> | 2020-02-12 10:27:20 -0800 |
|---|---|---|
| committer | John Belamaric <jbelamaric@google.com> | 2020-02-12 10:27:20 -0800 |
| commit | b1f06dfb9f84b7717bc200568ad9e7a426cf2363 (patch) | |
| tree | 27a089259db6292024550b3c291213ad10cd8186 | |
| parent | 4aa32d4e92a5008753860c594a809af105713c8d (diff) | |
Update PRR questions based on feedback
| -rw-r--r-- | sig-architecture/production-readiness.md | 44 |
1 files changed, 26 insertions, 18 deletions
diff --git a/sig-architecture/production-readiness.md b/sig-architecture/production-readiness.md
index 88c55a39..28332601 100644
--- a/sig-architecture/production-readiness.md
+++ b/sig-architecture/production-readiness.md
@@ -19,7 +19,7 @@ aspects of the process.
 
 * **How can this feature be enabled / disabled in a live cluster?**
   - [ ] Feature gate
-    - Feature gate name:
+    - Feature gate name:
     - Components depending on the feature gate:
   - [ ] Other
     - Describe the mechanism:
@@ -29,22 +29,25 @@ aspects of the process.
       of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
 
 * **Can the feature be disabled once it has been enabled (i.e. can we rollback
-  the enablement)?**
+  the enablement)?**
   Describe the consequences on existing workloads (e.g. if this is runtime
   feature, can it break the existing applications?).
 
 * **What happens if we reenable the feature if it was previously rolled back?**
 
-* **Are there any tests for feature enablement/ disablement?**
-  At the very least, think about conversion tests if API types are being modified.
+* **Are there any tests for feature enablement/ disablement?**
+  The e2e framework does not currently support enabling and disabling feature
+  gates. However, unit tests in each component dealing with managing data created
+  with and without the feature are necessary. At the very least, think about
+  conversion tests if API types are being modified.
 
 #### Scalability
-
+
-* **Will enabling / using this feature result in any new API calls?**
+* **Will enabling / using this feature result in any new API calls?**
   Describe them, providing:
   - API call type (e.g. PATCH pods)
   - estimated throughput
-  - originating component(s) (e.g. Kubelet, Feature-X-controller)
+  - originating component(s) (e.g. Kubelet, Feature-X-controller)
   focusing mostly on:
   - components listing and/or watching resources they didn't before
   - API calls that may be triggered by changes of some Kubernetes resources
@@ -52,7 +55,7 @@ aspects of the process.
   - periodic API calls to reconcile state (e.g. periodic fetching state,
     heartbeats, leader election, etc.)
 
-* **Will enabling / using this feature result in introducing new API types?**
+* **Will enabling / using this feature result in introducing new API types?**
   Describe them providing:
   - API type
   - Supported number of objects per cluster
@@ -62,19 +65,19 @@ aspects of the process.
   provider?**
 
 * **Will enabling / using this feature result in increasing size or count
-  of the existing API objects?**
+  of the existing API objects?*
   Describe them providing:
   - API type(s):
   - Estimated increase in size: (e.g. new annotation of size 32B)
   - Estimated amount of new objects: (e.g. new Object X for every existing Pod)
 
 * **Will enabling / using this feature result in increasing time taken by any
-  operations covered by [existing SLIs/SLOs][]?**
+  operations covered by [existing SLIs/SLOs][]?**
   Think about adding additional work or introducing new steps in between
   (e.g. need to do X to start a container), etc. Please describe the details.
 
 * **Will enabling / using this feature result in non-negligible increase of
-  resource usage (CPU, RAM, disk, IO, ...) in any components?**
+  resource usage (CPU, RAM, disk, IO, ...) in any components?**
   Things to keep in mind include: additional in-memory state, additional
   non-trivial computations, excessive access to disks (including increased log
   volume), significant amount of data send and/or received over network, etc.
@@ -85,12 +88,15 @@ aspects of the process.
 
 #### Dependencies
 
-* **Does this feature depend on any specific services running in the cluster?**
+* **Does this feature depend on any specific services running in the cluster?**
   Think about both cluster-level services (e.g. metrics-server) as well
-  as node-level agents (e.g. specific version of CRI).
+  as node-level agents (e.g. specific version of CRI). Focus on external or
+  optional services that are needed. For example, if this feature depends on
+  a cloud provider API, or upon an external software-defined storage or network
+  control plane.
 
 * **How does this feature respond to complete failures of the services on which
-  it depends?**
+  it depends?**
   Think about both running and newly created user workloads as well as
   cluster-level services (e.g. DNS).
 
@@ -101,14 +107,14 @@ aspects of the process.
 
 * **How can an operator determine if the feature is in use by workloads?**
 
-* **How can an operator determine if the feature is functioning properly?**
+* **How can an operator determine if the feature is functioning properly?**
   Focus on metrics that cluster operators may gather from different
   components and treat other signals as last resort.
 
 * **What are the SLIs (Service Level Indicators) an operator can use to
   determine the health of the service?**
   - [ ] Metrics
-    - Metric name:
+    - Metric name:
     - [Optional] Aggregation method:
     - Components exposing the metric:
   - [ ] Other (treat as last resort)
@@ -122,13 +128,15 @@ splitting it into a dedicated `Playbook` document (potentially with some monitor
 details). For now we leave it here though, with some questions not required until
 further stages (e.g. Beta/Ga) of feature lifecycle.
 
-* **What are the known failure modes?**
+* **How does this feature react if the API server is unavailable?**
+
+* **What are other known failure modes?**
 
 * **How can those be detected via metrics or logs?**
 
 * **What are the mitigations for each of those failure modes?**
 
-* **What are the most useful log messages and what logging levels to they require?**
+* **What are the most useful log messages and what logging levels to they require?**
   Not required until feature graduates to Beta.
 
 * **What steps should be taken if SLOs are not being met to determine the problem?**
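
The enablement/disablement guidance in this change asks for unit tests that cover data created both with and without the feature. A minimal sketch of what such a test might look like follows, written in Python for brevity (Kubernetes components themselves are written in Go). Everything here is hypothetical and not part of the commit: the `WidgetSpec` type, its gated `burst` field, and the `drop_disabled_fields` helper are illustrative names. The logic assumes one common approach, where a gated field is cleared on write while the gate is off unless the stored object already uses it.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class WidgetSpec:
    """Hypothetical API type; `burst` is guarded by a hypothetical feature gate."""
    replicas: int
    burst: Optional[int] = None


def drop_disabled_fields(
    new: WidgetSpec, old: Optional[WidgetSpec], gate_enabled: bool
) -> WidgetSpec:
    """Clear the gated field unless the gate is on or the stored object already uses it."""
    if gate_enabled:
        return new
    if old is not None and old.burst is not None:
        # Data written while the gate was on must survive a rollback.
        return new
    return replace(new, burst=None)


def test_enable_disable_cycle() -> None:
    # Gate off, create: the gated field is dropped before persisting.
    created = drop_disabled_fields(WidgetSpec(1, burst=5), None, gate_enabled=False)
    assert created.burst is None

    # Gate on, create: the gated field is persisted.
    stored = drop_disabled_fields(WidgetSpec(1, burst=5), None, gate_enabled=True)
    assert stored.burst == 5

    # Gate rolled back, update of an object that already uses the field: preserved.
    updated = drop_disabled_fields(WidgetSpec(2, burst=5), stored, gate_enabled=False)
    assert updated == WidgetSpec(2, burst=5)

    # Gate re-enabled, update of an object created while it was off: field accepted.
    reenabled = drop_disabled_fields(WidgetSpec(1, burst=7), created, gate_enabled=True)
    assert reenabled.burst == 7
```

The four cases map onto the questionnaire's rollback questions: enabling, disabling with existing data, and re-enabling after a rollback.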
