author    Kubernetes Prow Robot <k8s-ci-robot@users.noreply.github.com>  2019-02-27 11:52:42 -0800
committer GitHub <noreply@github.com>  2019-02-27 11:52:42 -0800
commit    684c2e28e539718e9724c64090d2ce961074fad2 (patch)
tree      5c023d8b283e369a875485e7b7d34967e97dda2f
parent    ec027981fb8b3c1736537ea28619af118226f81a (diff)
parent    81e438052a54cc892886c62de3014f6a9b57b804 (diff)
Merge pull request #3307 from wojtek-t/revise_networking_slos
Update network SLIs
-rw-r--r--  sig-scalability/slos/dns_programming_latency.md      28
-rw-r--r--  sig-scalability/slos/network_programming_latency.md  20
-rw-r--r--  sig-scalability/slos/slos.md                          8
3 files changed, 28 insertions, 28 deletions
diff --git a/sig-scalability/slos/dns_programming_latency.md b/sig-scalability/slos/dns_programming_latency.md
index bec37dfb..d2844af3 100644
--- a/sig-scalability/slos/dns_programming_latency.md
+++ b/sig-scalability/slos/dns_programming_latency.md
@@ -1,10 +1,14 @@
-## Network programming latency SLIs/SLOs details
+## DNS programming latency SLIs/SLOs details
### Definition
| Status | SLI | SLO |
| --- | --- | --- |
-| __WIP__ | Latency of programming a single in-cluster dns instance, measured from when service spec or list of its `Ready` pods change to when it is reflected in that dns instance, measured as 99th percentile over last 5 minutes | In default Kubernetes installation, 99th percentile of (99th percentiles across all dns instances) per cluster-day <= X |
+| __WIP__ | Latency of programming a dns instance, measured from when service spec or list of its `Ready` pods change to when it is reflected in that dns instance, measured as 99th percentile over last 5 minutes aggregated across all dns instances<sup>[1](#footnote1)</sup> | In default Kubernetes installation, 99th percentile per cluster-day <= X |
+
+<a name="footnote1">[1\]</a>Aggregation across all programmers means that all
+samples from all programmers go into one large pool, and SLI is percentile
+from all of them.
### User stories
- As a user of vanilla Kubernetes, I want some guarantee of how quickly in-cluster
@@ -20,22 +24,18 @@ as external DNS resolution clearly depends on cloud provider or environment
in which the cluster is running (it is hard to set the SLO for it).
### Caveats
-- The SLI is formulated for a single DNS instance, even though that value
-itself is not very interesting for the user.
-If there are multiple DNS instances in the cluster, the aggregation across
-them is done only at the SLO level (and only that gives a value that is
-interesting for the user). The reason for doing it this is feasibility for
-efficiently computing that:
+- The SLI is aggregated across all DNS instances, which is what is interesting
+for the end-user. A small percentage of DNS instances may be completely
+unresponsive while the SLI still looks good (if all others are fast), but that
+is intentional - we need to allow slower/unresponsive ones because at some
+scale that will always be happening.
+The reason for doing it this way is the feasibility of efficiently computing that:
- if we were to do aggregation at the SLI level (i.e. the SLI would be
- formulated like "... reflected in in-cluster DNS and visible from 99%
- of DNS instances"), computing that SLI would be extremely
+ formulated like "... reflected in in-cluster load-balancing mechanism and
+ visible from 99% of programmers"), computing that SLI would be extremely
difficult. It's because in order to decide e.g. whether pod transition to
Ready state is reflected, we would have to know when exactly it was reflected
- in 99% of DNS instances. That requires tracking metrics on
+ in 99% of DNS instances. That requires tracking metrics on
a per-change basis (which we can't do efficiently).
- - we admit that the SLO is a bit weaker in that form (i.e. it doesn't necessary
- force that a given change is reflected in 99% of programmers with a given
- 99th percentile latency), but it's close enough approximation.
### How to measure the SLI.
There [network programming latency](./network_programming_latency.md) is
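
As an illustration of the pooled aggregation described in footnote [1] above (this sketch is not part of the patch; the instance counts, latencies, and the nearest-rank percentile method are all made up for illustration), the following Go program shows how all samples from all DNS instances land in one pool and the SLI is a single 99th percentile over that pool:

```go
// Illustration only: pooled aggregation of DNS programming latency samples.
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the p-th percentile of samples using the nearest-rank method.
func percentile(samples []float64, p float64) float64 {
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	var pool []float64
	// 99 healthy DNS instances, each contributing 10 samples of ~0.3s
	// over the last 5 minutes (numbers are made up).
	for i := 0; i < 99; i++ {
		for j := 0; j < 10; j++ {
			pool = append(pool, 0.3)
		}
	}
	// 1 misbehaving instance whose 10 samples took 60s each.
	for j := 0; j < 10; j++ {
		pool = append(pool, 60.0)
	}
	// The SLI is a single percentile over the pooled samples, not a
	// percentile of per-instance percentiles.
	fmt.Printf("pooled p99: %.2fs\n", percentile(pool, 99)) // prints 0.30s
}
```

The slow instance's samples sit in the top 1% of the pool, so the pooled p99 stays low - which is exactly the "small percentage of unresponsive instances is tolerated" behaviour the caveat above describes.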
diff --git a/sig-scalability/slos/network_programming_latency.md b/sig-scalability/slos/network_programming_latency.md
index dc1dace2..7cf5882c 100644
--- a/sig-scalability/slos/network_programming_latency.md
+++ b/sig-scalability/slos/network_programming_latency.md
@@ -4,7 +4,11 @@
| Status | SLI | SLO |
| --- | --- | --- |
-| __WIP__ | Latency of programming a single (e.g. iptables on a given node) in-cluster load balancing mechanism, measured from when service spec or list of its `Ready` pods change to when it is reflected in load balancing mechanism, measured as 99th percentile over last 5 minutes | In default Kubernetes installation, 99th percentile of (99th percentiles across all programmers (e.g. iptables)) per cluster-day <= X |
+| __WIP__ | Latency of programming the in-cluster load balancing mechanism (e.g. iptables), measured from when service spec or list of its `Ready` pods change to when it is reflected in the load balancing mechanism, measured as 99th percentile over last 5 minutes aggregated across all programmers<sup>[1](#footnote1)</sup> | In default Kubernetes installation, 99th percentile per cluster-day <= X |
+
+<a name="footnote1">[1\]</a>Aggregation across all programmers means that all
+samples from all programmers go into one large pool, and SLI is percentile
+from all of them.
### User stories
- As a user of vanilla Kubernetes, I want some guarantee of how quickly new backends
@@ -27,12 +31,11 @@ but rejected due to being application specific, and thus introducing SLO would
be impossible.
### Caveats
-- The SLI is formulated for a single "programmer" (e.g. iptables on a single
-node), even though that value itself is not very interesting for the user.
-In case there are multiple programmers in the cluster, the aggregation across
-them is done only at the SLO level (and only that gives a value that is somehow
-interesting for the user). The reason for doing it this is feasibility for
-efficiently computing that:
+- The SLI is aggregated across all "programmers", which is what is interesting
+for the end-user. It may happen that small percentage of programmers are
+completely unresponsive (if all others are fast), but that is desired - we need
+to allow slower/unresponsive nodes because at some scale it will be happening.
+The reason for doing it this way is feasibility for efficiently computing that:
- if we were to do aggregation at the SLI level (i.e. the SLI would be
formulated like "... reflected in in-cluster load-balancing mechanism and
visible from 99% of programmers"), computing that SLI would be extremely
@@ -40,9 +43,6 @@ efficiently computing that:
Ready state is reflected, we would have to know when exactly it was reflected
in 99% of programmers (e.g. iptables). That requires tracking metrics on
a per-change basis (which we can't do efficiently).
- - we admit that the SLO is a bit weaker in that form (i.e. it doesn't necessary
- force that a given change is reflected in 99% of programmers with a given
- 99th percentile latency), but it's close enough approximation.
### How to measure the SLI.
The method of measuring this SLI is not obvious, so for completeness we describe
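
A short note on why the per-sample aggregation above is cheap to compute: each programmer only needs to observe its own latencies into a histogram, with no per-change bookkeeping. The sketch below is illustrative only, not actual kube-proxy code; the metric name, buckets, and helper function are assumptions made for this example:

```go
// Sketch of a programmer-side recording path (illustrative; not kube-proxy code).
package main

import (
	"fmt"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// networkProgrammingLatency is a hypothetical histogram; the real metric
// exported by kube-proxy may be named and bucketed differently.
var networkProgrammingLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "network_programming_duration_seconds",
	Help:    "Time from a Service/endpoints change to it being reflected on this node.",
	Buckets: prometheus.ExponentialBuckets(0.25, 2, 12), // 0.25s .. ~8.5min
})

func init() {
	prometheus.MustRegister(networkProgrammingLatency)
}

// observe records one sample: changeTime is when the Service spec or its
// Ready pods changed, programmedTime is when e.g. iptables reflected it.
func observe(changeTime, programmedTime time.Time) {
	networkProgrammingLatency.Observe(programmedTime.Sub(changeTime).Seconds())
}

func main() {
	changed := time.Now().Add(-2 * time.Second)
	observe(changed, time.Now())
	fmt.Println("recorded one programming latency sample")
}
```

On the monitoring side, the per-node histograms can simply be summed and a 99th-percentile estimate taken over the sum - the "one large pool" from footnote [1] - whereas deciding when a given change has reached 99% of programmers would require correlating every individual change across nodes.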
diff --git a/sig-scalability/slos/slos.md b/sig-scalability/slos/slos.md
index 2b13c48b..3ca252ba 100644
--- a/sig-scalability/slos/slos.md
+++ b/sig-scalability/slos/slos.md
@@ -106,14 +106,14 @@ Prerequisite: Kubernetes cluster is available and serving.
| __Official__ | Latency of mutating API calls for single objects for every (resource, verb) pair, measured as 99th percentile over last 5 minutes | In default Kubernetes installation, for every (resource, verb) pair, excluding virtual and aggregated resources and Custom Resource Definitions, 99th percentile per cluster-day<sup>[1](#footnote1)</sup> <= 1s | [Details](./api_call_latency.md) |
| __Official__ | Latency of non-streaming read-only API calls for every (resource, scope) pair, measured as 99th percentile over last 5 minutes | In default Kubernetes installation, for every (resource, scope) pair, excluding virtual and aggregated resources and Custom Resource Definitions, 99th percentile per cluster-day<sup>[1](#footnote1)</sup> (a) <= 1s if `scope=resource` (b) <= 5s if `scope=namespace` (c) <= 30s if `scope=cluster` | [Details](./api_call_latency.md) |
| __Official__ | Startup latency of stateless and schedulable pods, excluding time to pull images and run init containers, measured from pod creation timestamp to when all its containers are reported as started and observed via watch, measured as 99th percentile over last 5 minutes | In default Kubernetes installation, 99th percentile per cluster-day<sup>[1](#footnote1)</sup> <= 5s | [Details](./pod_startup_latency.md) |
-| __WIP__ | Latency of programming a single (e.g. iptables on a given node) in-cluster load balancing mechanism, measured from when service spec or list of its `Ready` pods change to when it is reflected in load balancing mechanism, measured as 99th percentile over last 5 minutes | In default Kubernetes installation, 99th percentile of (99th percentiles across all programmers (e.g. iptables)) per cluster-day<sup>[1](#footnote1)</sup> <= X | [Details](./network_programming_latency.md) |
-| __WIP__ | Latency of programming a single in-cluster dns instance, measured from when service spec or list of its `Ready` pods change to when it is reflected in that dns instance, measured as 99th percentile over last 5 minutes | In default Kubernetes installation, 99th percentile of (99th percentiles across all dns instances) per cluster-day<sup>[1](#footnote1)</sup> <= X | [Details](./dns_programming_latency.md) |
+| __WIP__ | Latency of programming the in-cluster load balancing mechanism (e.g. iptables), measured from when service spec or list of its `Ready` pods change to when it is reflected in the load balancing mechanism, measured as 99th percentile over last 5 minutes aggregated across all programmers | In default Kubernetes installation, 99th percentile per cluster-day<sup>[1](#footnote1)</sup> <= X | [Details](./network_programming_latency.md) |
+| __WIP__ | Latency of programming a dns instance, measured from when service spec or list of its `Ready` pods change to when it is reflected in that dns instance, measured as 99th percentile over last 5 minutes aggregated across all dns instances | In default Kubernetes installation, 99th percentile per cluster-day<sup>[1](#footnote1)</sup> <= X | [Details](./dns_programming_latency.md) |
| __WIP__ | In-cluster network latency from a single prober pod, measured as latency of per second ping from that pod to "null service", measured as 99th percentile over last 5 minutes. | In default Kubernetes installation with RTT between nodes <= Y, 99th percentile of (99th percentile over all prober pods) per cluster-day<sup>[1](#footnote1)</sup> <= X | [Details](./network_latency.md) |
| __WIP__ | In-cluster dns latency from a single prober pod, measured as latency of per second DNS lookup for "null service" from that pod, measured as 99th percentile over last 5 minutes. | In default Kubernetes installation with RTT between nodes <= Y, 99th percentile of (99th percentile over all prober pods) per cluster-day<sup>[1](#footnote1)</sup> <= X | [Details](./dns_latency.md) |
<a name="footnote1">\[1\]</a> For the purpose of visualization it will be a
-sliding window. However, for the purpose of reporting the SLO, it means one
-point per day (whether SLO was satisfied on a given day or not).
+sliding window. However, for the purpose of the SLO itself, it effectively means
+the "fraction of good minutes per day" being within the threshold.
### Burst SLIs/SLOs
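
The reworded footnote in slos.md ("fraction of good minutes per day") maps to a simple per-day check. Below is a sketch under the assumption that the SLI (the 5-minute-window 99th percentile) is sampled once per minute; the 99% target and the threshold X are placeholders, not values from the SLO table:

```go
// Illustration of the "fraction of good minutes per day" reading of the SLO.
package main

import "fmt"

// dayMeetsSLO reports whether a cluster-day satisfies the SLO, given the SLI
// (the 5-minute-window 99th percentile, in seconds) sampled once per minute,
// and the threshold X from the SLO table.
func dayMeetsSLO(perMinuteSLI []float64, thresholdX float64) bool {
	good := 0
	for _, v := range perMinuteSLI {
		if v <= thresholdX {
			good++
		}
	}
	// "99th percentile per cluster-day <= X" == at least 99% of minutes are good.
	return float64(good) >= 0.99*float64(len(perMinuteSLI))
}

func main() {
	sli := make([]float64, 1440) // one value per minute of the day
	for i := range sli {
		sli[i] = 3.0
	}
	for i := 0; i < 10; i++ { // pretend 10 minutes exceeded the threshold
		sli[i] = 120.0
	}
	fmt.Println(dayMeetsSLO(sli, 30.0)) // true: 1430/1440 ≈ 99.3% good minutes
}
```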