author wojtekt <wojtekt@google.com> 2020-04-06 17:58:26 +0200
committer wojtekt <wojtekt@google.com> 2020-04-06 17:58:26 +0200
commit d4f4801d1d525746f6adf17167f0b47b0a19c19f
tree 9b20b39dd7952b430f6ce27450704f0c6f21f0e5
parent 56d9ea22046d752bac8608562f9eedc4ce0e2bcf
Deduplicate monitoring section and provide examples
-rw-r--r-- sig-architecture/production-readiness.md 14
1 file changed, 7 insertions, 7 deletions
diff --git a/sig-architecture/production-readiness.md b/sig-architecture/production-readiness.md
index 8760b11e..f277d9ce 100644
--- a/sig-architecture/production-readiness.md
+++ b/sig-architecture/production-readiness.md
@@ -63,11 +63,6 @@ aspects of the process.
checking if there are objects with field X set) may be last resort. Avoid
logs or events for this purpose.
-* **How can an operator determine if the feature is functioning properly?**
- Focus on metrics that cluster operators may gather from different
- components and treat other signals as last resort.
- TODO: Provide examples to make answering this question easier.
-
* **What are the SLIs (Service Level Indicators) an operator can use to
determine the health of the service?**
- [ ] Metrics
@@ -78,8 +73,13 @@ aspects of the process.
- Details:
* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
- TODO: Provide examples for different features (e.g. server-side apply, user-space
- proxy, cronjob controller) to make answering this question easier
+ At a high level, this will usually take the form "high percentile of SLI
+ per day <= X". It's impossible to provide comprehensive guidance, but at a very
+ high level (these need more precise definitions) SLOs may look like:
+ - per-day percentage of API calls finishing with 5XX errors <= 1%
+ - 99th percentile over a day of the absolute value of (job creation time minus expected
+ job creation time) for cron jobs <= 10%
+ - 99.9% of /health requests per day finish with a 200 code
* **Are there any missing metrics that would be useful to have to improve
observability of this feature?**
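
As a rough illustration of the SLO shapes added in this commit, the sketch below checks three per-day SLOs against in-memory samples. It is not part of the commit: the function names, the sample data, and the 5-second skew threshold are all hypothetical, and a real cluster would derive these values from its metrics pipeline rather than from Python lists.

```python
import math


def error_ratio(status_codes):
    """Fraction of responses that finished with a 5XX status code."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def success_ratio(status_codes):
    """Fraction of responses that finished with a 200 status code."""
    if not status_codes:
        return 1.0
    return sum(1 for code in status_codes if code == 200) / len(status_codes)


# Hypothetical one-day samples (assumptions for illustration only).
api_codes = [200] * 995 + [500] * 5
# (actual, expected) job creation times, in seconds since some reference point.
job_times = [(61.0, 60.0), (120.5, 120.0), (183.0, 180.0)]
skews = [abs(actual - expected) for actual, expected in job_times]
health_codes = [200] * 9995 + [503] * 5

# SLO checks; the thresholds mirror the shapes above, the 5.0 s bound is assumed.
api_slo_met = error_ratio(api_codes) <= 0.01
cron_slo_met = percentile(skews, 99) <= 5.0
health_slo_met = success_ratio(health_codes) >= 0.999

print(api_slo_met, cron_slo_met, health_slo_met)
```

The nearest-rank percentile is only one of several common definitions; monitoring systems that estimate percentiles from histogram buckets will give slightly different values for the same data.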