Deduplicate monitoring section and provide examples

author: wojtekt <wojtekt@google.com> 2020-04-06 17:58:26 +0200
committer: wojtekt <wojtekt@google.com> 2020-04-06 17:58:26 +0200
commit: d4f4801d1d525746f6adf17167f0b47b0a19c19f (patch)
tree: 9b20b39dd7952b430f6ce27450704f0c6f21f0e5
parent: 56d9ea22046d752bac8608562f9eedc4ce0e2bcf (diff)
1 files changed, 7 insertions, 7 deletions
diff --git a/sig-architecture/production-readiness.md b/sig-architecture/production-readiness.md
index 8760b11e..f277d9ce 100644
--- a/sig-architecture/production-readiness.md
+++ b/sig-architecture/production-readiness.md
@@ -63,11 +63,6 @@ aspects of the process.
   checking if there are objects with field X set) may be last resort. Avoid
   logs or events for this purpose.
 
-* **How can an operator determine if the feature is functioning properly?**
-  Focus on metrics that cluster operators may gather from different
-  components and treat other signals as last resort.
-  TODO: Provide examples to make answering this question easier.
-
 * **What are the SLIs (Service Level Indicators) an operator can use to
   determine the health of the service?**
   - [ ] Metrics
@@ -78,8 +73,13 @@ aspects of the process.
     - Details:
 
 * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
-  TODO: Provide examples for different features (e.g. server-side apply, user-space
-  proxy, cronjob controller) to make answering this question easier
+  At a high level, this will usually take the form of "high percentile of SLI
+  per day <= X". It's impossible to provide comprehensive guidance, but at a very
+  high level (these need more precise definitions) examples may include:
+  - per-day percentage of API calls finishing with 5XX errors <= 1%
+  - 99th percentile over a day of the absolute value of (job creation time minus
+    expected job creation time) for cron job <= 10%
+  - 99.9% of /health requests per day finish with a 200 code
 
 * **Are there any missing metrics that would be useful to have to improve
   observability of this feature?**
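
The SLO shapes added in this patch ("percentage of calls with 5XX errors <= 1%", "99th percentile over a day <= X", "99.9% of /health requests finish with 200") could be checked against one day of samples roughly as follows. This is only a sketch: the function names, the nearest-rank percentile definition, and all sample data are assumptions made for illustration, not something defined by the patch or by any Kubernetes component.

```python
import math


def error_ratio(status_codes):
    """Fraction of calls that finished with a 5XX status code."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


def success_ratio(status_codes, ok_code=200):
    """Fraction of calls that finished with exactly ok_code."""
    if not status_codes:
        return 0.0
    return sum(1 for code in status_codes if code == ok_code) / len(status_codes)


def percentile(values, pct):
    """Nearest-rank percentile (an assumed definition; others exist)."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical one-day samples, made up for this example.
api_codes = [200] * 995 + [500] * 3 + [503] * 2
health_codes = [200] * 9995 + [500] * 5
# Signed difference (seconds) between actual and expected job creation time.
creation_skew = [0.5, -1.0, 2.0, 0.2, -0.1, 30.0, 1.5, 0.8, 0.3, 1.1]
delays = [abs(d) for d in creation_skew]

# "per-day percentage of API calls finishing with 5XX errors <= 1%"
api_ok = error_ratio(api_codes) <= 0.01

# "99.9% of /health requests per day finish with a 200 code"
health_ok = success_ratio(health_codes) >= 0.999

# "99th percentile over a day of |actual - expected creation time| <= X";
# the threshold X is feature-specific, so it is left to the caller here.
p99_delay = percentile(delays, 99)

print(api_ok, health_ok, p99_delay)
```

In a real cluster these ratios and percentiles would normally be computed by the monitoring system from exported metrics rather than from raw lists; the point of the sketch is only that each SLO reduces to one aggregate per day compared against a threshold.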