Update PRR questions based on feedback

author: John Belamaric <jbelamaric@google.com> 2020-02-12 10:27:20 -0800
committer: John Belamaric <jbelamaric@google.com> 2020-02-12 10:27:20 -0800
commit: b1f06dfb9f84b7717bc200568ad9e7a426cf2363
tree: 27a089259db6292024550b3c291213ad10cd8186
parent: 4aa32d4e92a5008753860c594a809af105713c8d
1 files changed, 26 insertions, 18 deletions
diff --git a/sig-architecture/production-readiness.md b/sig-architecture/production-readiness.md
index 88c55a39..28332601 100644
--- a/sig-architecture/production-readiness.md
+++ b/sig-architecture/production-readiness.md
@@ -19,7 +19,7 @@ aspects of the process.
 
 * **How can this feature be enabled / disabled in a live cluster?**
   - [ ] Feature gate
-	  - Feature gate name:
+    - Feature gate name:
     - Components depending on the feature gate:
   - [ ] Other
     - Describe the mechanism:
@@ -29,22 +29,25 @@ aspects of the process.
       of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
 
 * **Can the feature be disabled once it has been enabled (i.e. can we rollback
-  the enablement)?**  
+  the enablement)?**
   Describe the consequences on existing workloads (e.g. if this is runtime
   feature, can it break the existing applications?).
 
 * **What happens if we reenable the feature if it was previously rolled back?**
 
-* **Are there any tests for feature enablement/ disablement?**  
-  At the very least, think about conversion tests if API types are being modified.
+* **Are there any tests for feature enablement/ disablement?**
+  The e2e framework does not currently support enabling and disabling feature
+  gates. However, unit tests in each component dealing with managing data created
+  with and without the feature are necessary. At the very least, think about
+  conversion tests if API types are being modified.
 
 #### Scalability
-      
-* **Will enabling / using this feature result in any new API calls?**  
+
+* **Will enabling / using this feature result in any new API calls?**
   Describe them, providing:
   - API call type (e.g. PATCH pods)
   - estimated throughput
-  - originating component(s) (e.g. Kubelet, Feature-X-controller)  
+  - originating component(s) (e.g. Kubelet, Feature-X-controller)
   focusing mostly on:
   - components listing and/or watching resources they didn't before
   - API calls that may be triggered by changes of some Kubernetes resources
@@ -52,7 +55,7 @@ aspects of the process.
   - periodic API calls to reconcile state (e.g. periodic fetching state,
     heartbeats, leader election, etc.)
 
-* **Will enabling / using this feature result in introducing new API types?**  
+* **Will enabling / using this feature result in introducing new API types?**
   Describe them providing:
   - API type
   - Supported number of objects per cluster
@@ -62,19 +65,19 @@ aspects of the process.
   provider?**
 
 * **Will enabling / using this feature result in increasing size or count
-  of the existing API objects?**  
+  of the existing API objects?*
   Describe them providing:
   - API type(s):
   - Estimated increase in size: (e.g. new annotation of size 32B)
   - Estimated amount of new objects: (e.g. new Object X for every existing Pod)
 
 * **Will enabling / using this feature result in increasing time taken by any
-  operations covered by [existing SLIs/SLOs][]?**  
+  operations covered by [existing SLIs/SLOs][]?**
   Think about adding additional work or introducing new steps in between
   (e.g. need to do X to start a container), etc. Please describe the details.
 
 * **Will enabling / using this feature result in non-negligible increase of
-  resource usage (CPU, RAM, disk, IO, ...) in any components?**  
+  resource usage (CPU, RAM, disk, IO, ...) in any components?**
   Things to keep in mind include: additional in-memory state, additional
   non-trivial computations, excessive access to disks (including increased log
   volume), significant amount of data send and/or received over network, etc.
@@ -85,12 +88,15 @@ aspects of the process.
 
 #### Dependencies
 
-* **Does this feature depend on any specific services running in the cluster?**  
+* **Does this feature depend on any specific services running in the cluster?**
   Think about both cluster-level services (e.g. metrics-server) as well
-  as node-level agents (e.g. specific version of CRI).
+  as node-level agents (e.g. specific version of CRI). Focus on external or
+  optional services that are needed. For example, if this feature depends on
+  a cloud provider API, or upon an external software-defined storage or network
+  control plane.
 
 * **How does this feature respond to complete failures of the services on which
-  it depends?**  
+  it depends?**
   Think about both running and newly created user workloads as well as
   cluster-level services (e.g. DNS).
 
@@ -101,14 +107,14 @@ aspects of the process.
 
 * **How can an operator determine if the feature is in use by workloads?**
 
-* **How can an operator determine if the feature is functioning properly?**  
+* **How can an operator determine if the feature is functioning properly?**
   Focus on metrics that cluster operators may gather from different
   components and treat other signals as last resort.
 
 * **What are the SLIs (Service Level Indicators) an operator can use to
   determine the health of the service?**
   - [ ] Metrics
-	  - Metric name:
+    - Metric name:
     - [Optional] Aggregation method:
     - Components exposing the metric:
   - [ ] Other (treat as last resort)
@@ -122,13 +128,15 @@ splitting it into a dedicated `Playbook` document (potentially with some monitor
 details). For now we leave it here though, with some questions not required until
 further stages (e.g. Beta/Ga) of feature lifecycle.
 
-* **What are the known failure modes?**
+* **How does this feature react if the API server is unavailable?**
+
+* **What are other known failure modes?**
 
 * **How can those be detected via metrics or logs?**
 
 * **What are the mitigations for each of those failure modes?**
 
-* **What are the most useful log messages and what logging levels to they require?**  
+* **What are the most useful log messages and what logging levels to they require?**
   Not required until feature graduates to Beta.
 
 * **What steps should be taken if SLOs are not being met to determine the problem?**