| field | value | date |
|---|---|---|
| author | k8s-ci-robot <k8s-ci-robot@users.noreply.github.com> | 2018-05-30 03:33:53 -0700 |
| committer | GitHub <noreply@github.com> | 2018-05-30 03:33:53 -0700 |
| commit | 45dfc3d03868f5bdc71fa0121e487c30521ce915 (patch) | |
| tree | 318260429ed08008ad200ed61c6141825cbfa204 | |
| parent | c68bc2dfaed28d1349f16fae97542a35471bdeed (diff) | |
| parent | 39ddb46fbfc2535777abf57d8b7b7640d7a24666 (diff) | |
Merge pull request #2198 from wojtek-t/organize_slos_2
Better organize performance SLIs/SLOs - part 2
| mode | file | lines changed |
|---|---|---|
| -rw-r--r-- | sig-scalability/slis/apimachinery_slis.md | 196 |
| -rw-r--r-- | sig-scalability/slos/api_call_latency.md | 47 |
| -rw-r--r-- | sig-scalability/slos/api_extensions_latency.md | 6 |
| -rw-r--r-- | sig-scalability/slos/slos.md | 38 |
| -rw-r--r-- | sig-scalability/slos/system_throughput.md | 2 |
| -rw-r--r-- | sig-scalability/slos/watch_latency.md | 17 |
6 files changed, 109 insertions, 197 deletions
diff --git a/sig-scalability/slis/apimachinery_slis.md b/sig-scalability/slis/apimachinery_slis.md
deleted file mode 100644
index b86d3f57..00000000
--- a/sig-scalability/slis/apimachinery_slis.md
+++ /dev/null
@@ -1,196 +0,0 @@
-# API-machinery SLIs and SLOs
-
-The document was converted from [Google Doc]. Please refer to the original for
-extended commentary and discussion.
-
-## Background
-
-Scalability is an important aspect of the Kubernetes. However, Kubernetes is
-such a large system that we need to manage users expectations in this area.
-To achieve it, we are in process of redefining what does it mean that
-Kubernetes supports X-node clusters - this doc describes the high-level
-proposal. In this doc we are describing API-machinery related SLIs we would
-like to introduce and suggest which of those should eventually have a
-corresponding SLO replacing current "99% of API calls return in under 1s" one.
-
-The SLOs we are proposing in this doc are our goal - they may not be currently
-satisfied. As a result, while in the future we would like to block the release
-when we are violating SLOs, we first need to understand where exactly we are
-now, define and implement proper tests and potentially improve the system.
-Only once this is done, we may try to introduce a policy of blocking the
-release on SLO violation. But this is out of scope of this doc.
-
-
-### SLIs and SLOs proposal
-
-Below we introduce all SLIs and SLOs we would like to have in the api-machinery
-area. A bunch of those are not easy to understand for users, as they are
-designed for developers or performance tracking of higher level
-user-understandable SLOs. The user-oriented one (which we want to publicly
-announce) are additionally highlighted with bold.
-
-### Prerequisite
-
-Kubernetes cluster is available and serving.
-
-### Latency<sup>[1](#footnote1)</sup> of API calls for single objects
-
-__***SLI1: Non-streaming API calls for single objects (POST, PUT, PATCH, DELETE,
-GET) latency for every (resource, verb) pair, measured as 99th percentile over
-last 5 minutes***__
-
-__***SLI2: 99th percentile for (resource, verb) pairs \[excluding virtual and
-aggregated resources and Custom Resource Definitions\] combined***__
-
-__***SLO: In default Kubernetes installation, 99th percentile of SLI2
-per cluster-day<sup>[2](#footnote2)</sup> <= 1s***__
-
-User stories:
-- As a user of vanilla Kubernetes, I want some guarantee how quickly I get the
-response from an API call.
-- As an administrator of Kubernetes cluster, if I know characteristics of my
-external dependencies of apiserver (e.g custom admission plugins, webhooks and
-initializers) I want to be able to provide guarantees for API calls latency to
-users of my cluster
-
-Background:
-- We obviously can’t give any guarantee in general, because cluster
-administrators are allowed to register custom admission plugins, webhooks
-and/or initializers, which we don’t have any control about and they obviously
-impact API call latencies.
-- As a result, we define the SLIs to be very generic (no matter how your
-cluster is set up), but we provide SLO only for default installations (where we
-have control over what apiserver is doing). This doesn’t provide a false
-impression, that we provide guarantee no matter how the cluster is setup and
-what is installed on top of it.
-- At the same time, API calls are part of pretty much every non-trivial workflow
-in Kubernetes, so this metric is a building block for less trivial SLIs and
-SLOs.
-
-Other notes:
-- The SLO has to be satisfied independently from the used encoding. This
-makes the mix of client important while testing. However, we assume that all
-`core` components communicate with apiserver with protocol buffers (otherwise
-the SLO doesn’t have to be satisfied).
-- In case of GET requests, user has an option to opt-in for accepting
-potentially stale data (the request is then served from cache and not hitting
-underlying storage). However, the SLO has to be satisfied even if all requests
-ask for up-to-date data, which again makes careful choice of requests in tests
-important while testing.
-
-
-### Latency of API calls for multiple objects
-
-__***SLI1: Non-streaming API calls for multiple objects (LIST) latency for
-every (resource, verb) pair, measure as 99th percentile over last 5 minutes***__
-
-__***SLI2: 99th percentile for (resource, verb) pairs [excluding virtual and
-aggregated resources and Custom Resource Definitions] combined***__
-
-__***SLO1: In default Kubernetes installation, 99th percentile of SLI2 per
-cluster-day***__
-- __***is <= 1s if total number of objects of the same type as resource in the
-system <= X***__
-- __***is <= 5s if total number of objects of the same type as resource in the
-system <= Y***__
-- __***is <= 30s if total number of objects of the same types as resource in the
-system <= Z***__
-
-User stories:
-- As a user of vanilla Kubernetes, I want some guarantee how quickly I get the
-response from an API call.
-- As an administrator of Kubernetes cluster, if I know characteristics of my
-external dependencies of apiserver (e.g custom admission plugins, webhooks and
-initializers) I want to be able to provide guarantees for API calls latency to
-users of my cluster.
-
-Background:
-- On top of arguments from latency of API calls for single objects, LIST
-operations are crucial part of watch-related frameworks, which in turn are
-responsible for overall system performance and responsiveness.
-- The above SLO is user-oriented and may have significant buffer in threshold.
-In fact, the latency of the request should be proportional to the amount of
-work to do (which in our case is number of objects of a given type (potentially
-in a requested namespace if specified)) plus some constant overhead. For better
-tracking of performance, we define the other SLIs which are supposed to be
-purely internal (developer-oriented)
-
-
-_SLI3: Non-streaming API calls for multiple objects (LIST) latency minus 1s
-(maxed with 0) divided by number of objects in the collection
-<sup>[3](#footnote3)</sup> (which may be many more than the number of returned
-objects) for every (resource, verb) pair, measured as 99th percentile over
-last 5 minutes._
-
-_SLI4: 99th percentile for (resource, verb) pairs [excluding virtual and
-aggregated resources and Custom Resource Definitions] combined_
-
-_SLO2: In default Kubernetes installation, 99th percentile of SLI4 per
-cluster-day <= Xms_
-
-
-### Watch latency
-
-_SLI1: API-machinery watch latency (measured from the moment when object is
-stored in database to when it’s ready to be sent to all watchers), measured
-as 99th percentile over last 5 minutes_
-
-_SLO1 (developer-oriented): 99th percentile of SLI1 per cluster-day <= Xms_
-
-User stories:
-- As an administrator, if system is slow, I would like to know if the root
-cause is slow api-machinery or something farther the path (lack of network
-bandwidth, slow or cpu-starved controllers, ...).
-
-Background:
-- Pretty much all control loops in Kubernetes are watch-based, so slow watch
-means slow system in general. As a result, we want to give some guarantees on
-how fast it is.
-- Note that how we measure it, silently assumes no clock-skew in case of HA
-clusters.
-
-
-### Admission plugin latency
-
-_SLI1: Admission latency for each admission plugin type, measured as 99th
-percentile over last 5 minutes_
-
-User stories:
-- As an administrator, if API calls are slow, I would like to know if this is
-because slow admission plugins and if so which ones are responsible.
-
-
-### Webhook latency
-
-_SLI1: Webhook call latency for each webhook type, measured as 99th percentile
-over last 5 minutes_
-
-User stories:
-- As an administrator, if API calls are slow, I would like to know if this is
-because slow webhooks and if so which ones are responsible.
-
-
-### Initializer latency
-
-_SLI1: Initializer latency for each initializer, measured as 99th
-percentile over last 5 minutes_
-
-User stories:
-- As an administrator, if API calls are slow, I would like to know if this is
-because of slow initializers and if so which ones are responsible.
-
----
-<a name="footnote1">\[1\]</a>By latency of API call in this doc we mean time
-from the moment when apiserver gets the request to last byte of response sent
-to the user.
-
-<a name="footnote2">\[2\]</a> For the purpose of visualization it will be a
-sliding window. However, for the purpose of reporting the SLO, it means one
-point per day (whether SLO was satisfied on a given day or not).
-
-<a name="footnote3">\[3\]</a>A collection contains: (a) all objects of that
-type for cluster-scoped resources, (b) all object of that type in a given
-namespace for namespace-scoped resources.
-
-
-[Google Doc]: https://docs.google.com/document/d/1Q5qxdeBPgTTIXZxdsFILg7kgqWhvOwY8uROEf0j5YBw/edit#
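The deleted document's SLI3 above defines a per-object LIST latency: the request latency minus 1s (floored at zero), divided by the number of objects in the collection rather than the number of objects returned. As a minimal sketch of that formula only (the function and parameter names below are illustrative, not taken from Kubernetes code):

```go
package main

import (
	"fmt"
	"time"
)

// perObjectListLatency implements the "latency per object" idea from SLI3:
// LIST latency minus 1s (floored at zero), divided by the number of objects
// in the collection (which may be many more than the objects returned).
func perObjectListLatency(listLatency time.Duration, collectionSize int) time.Duration {
	overhead := listLatency - time.Second
	if overhead < 0 {
		overhead = 0
	}
	if collectionSize == 0 {
		return 0
	}
	return overhead / time.Duration(collectionSize)
}

func main() {
	// Example: a LIST over a collection of 5000 objects taking 3.5s yields
	// (3.5s - 1s) / 5000 = 0.5ms per object.
	fmt.Println(perObjectListLatency(3500*time.Millisecond, 5000))
}
```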
diff --git a/sig-scalability/slos/api_call_latency.md b/sig-scalability/slos/api_call_latency.md
new file mode 100644
index 00000000..65d7dc26
--- /dev/null
+++ b/sig-scalability/slos/api_call_latency.md
@@ -0,0 +1,47 @@
+## API call latency SLIs/SLOs details
+
+### User stories
+- As a user of vanilla Kubernetes, I want some guarantee how quickly I get the
+response from an API call.
+- As an administrator of Kubernetes cluster, if I know characteristics of my
+external dependencies of apiserver (e.g custom admission plugins, webhooks and
+initializers) I want to be able to provide guarantees for API calls latency to
+users of my cluster
+
+### Other notes
+- We obviously can’t give any guarantee in general, because cluster
+administrators are allowed to register custom admission plugins, webhooks
+and/or initializers, which we don’t have any control about and they obviously
+impact API call latencies.
+- As a result, we define the SLIs to be very generic (no matter how your
+cluster is set up), but we provide SLO only for default installations (where we
+have control over what apiserver is doing). This doesn’t provide a false
+impression, that we provide guarantee no matter how the cluster is setup and
+what is installed on top of it.
+- At the same time, API calls are part of pretty much every non-trivial workflow
+in Kubernetes, so this metric is a building block for less trivial SLIs and
+SLOs.
+- The SLO for latency for read-only API calls of a given type may have significant
+buffer in threshold. In fact, the latency of the request should be proportional to
+the amount of work to do (which is number of objects of a given type in a given
+scope) plus some constant overhead. For better tracking of performance, we
+may want to define purely internal SLI of "latency per object". But that
+isn't in near term plans.
+
+### Caveats
+- The SLO has to be satisfied independently from used encoding in user-originated
+requests. This makes mix of client important while testing. However, we assume
+that all `core` components communicate with apiserver using protocol buffers.
+- In case of GET requests, user has an option to opt-in for accepting potentially
+stale data (being served from cache) and the SLO again has to be satisfied
+independently of that. This makes the careful choice of requests in tests
+important.
+
+### TODOs
+- We may consider treating `non-namespaced` resources as a separate bucket in
+the future. However, it may not make sense if the number of those may be
+comparable with `namespaced` ones.
+
+### Test scenario
+
+__TODO: Describe test scenario.__
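The SLIs in the file above are phrased as a "99th percentile over last 5 minutes". In practice the apiserver exposes request latencies as Prometheus metrics; purely to illustrate the arithmetic, here is a hedged Go sketch that computes a nearest-rank 99th percentile from a hypothetical 5-minute window of raw latencies and checks it against the 1s threshold (all names and sample values are made up):

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// percentile returns the p-th percentile (0 < p <= 100) of the latency
// samples, using the nearest-rank method.
func percentile(samples []time.Duration, p float64) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	// Hypothetical latencies observed in the last 5 minutes for one
	// (resource, verb) pair.
	window := []time.Duration{
		40 * time.Millisecond, 80 * time.Millisecond, 120 * time.Millisecond,
		300 * time.Millisecond, 950 * time.Millisecond,
	}
	p99 := percentile(window, 99)
	fmt.Printf("p99 = %v, within the 1s threshold: %v\n", p99, p99 <= time.Second)
}
```

Nearest-rank is chosen here only for brevity; a histogram-based estimate (as Prometheus produces) would behave differently at the tail.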
diff --git a/sig-scalability/slos/api_extensions_latency.md b/sig-scalability/slos/api_extensions_latency.md
new file mode 100644
index 00000000..2681422c
--- /dev/null
+++ b/sig-scalability/slos/api_extensions_latency.md
@@ -0,0 +1,6 @@
+## API call extension points latency SLIs details
+
+### User stories
+- As an administrator, if API calls are slow, I would like to know if this is
+because of slow extension points (admission plugins, webhooks, initializers) and
+if so which ones are responsible for it.
diff --git a/sig-scalability/slos/slos.md b/sig-scalability/slos/slos.md
index 49d26c6a..34aba2f8 100644
--- a/sig-scalability/slos/slos.md
+++ b/sig-scalability/slos/slos.md
@@ -92,13 +92,39 @@ of the cases.
 We are looking into extending the set of SLIs/SLOs to cover more parts of
 Kubernetes.
 
+```
+Prerequisite: Kubernetes cluster is available and serving.
+```
+
 ### Steady state SLIs/SLOs
 
 | Status | SLI | SLO | User stories, test scenarios, ... |
 | --- | --- | --- | --- |
+| __Official__ | Latency<sup>[1](#footnote1)</sup> of mutating<sup>[2](#footnote2)</sup> API calls for single objects for every (resource, verb) pair, measured as 99th percentile over last 5 minutes | In default Kubernetes installation, for every (resource, verb) pair, excluding virtual and aggregated resources and Custom Resource Definitions, 99th percentile per cluster-day<sup>[3](#footnote3)</sup> <= 1s | [Details](./api_call_latency.md) |
+| __Official__ | Latency<sup>[1](#footnote1)</sup> of non-streaming read-only<sup>[4](#footnote4)</sup> API calls for every (resource, scope<sup>[5](#footnote5)</sup>) pair, measured as 99th percentile over last 5 minutes | In default Kubernetes installation, for every (resource, scope) pair, excluding virtual and aggregated resources and Custom Resource Definitions, 99th percentile per cluster-day (a) <= 1s if `scope=resource` (b) <= 5s if `scope=namespace` (c) <= 30s if `scope=cluster` | [Details](./api_call_latency.md) |
+
+<a name="footnote1">\[1\]</a>By latency of API call in this doc we mean time
+from the moment when apiserver gets the request to last byte of response sent
+to the user.
+
+<a name="footnote2">\[2\]</a>By mutating API calls we mean POST, PUT, DELETE
+and PATCH.
+
+<a name="footnote3">\[3\]</a> For the purpose of visualization it will be a
+sliding window. However, for the purpose of reporting the SLO, it means one
+point per day (whether SLO was satisfied on a given day or not).
+
+<a name="footnote4">\[4\]</a>By non-streaming read-only API calls we mean GET
+requests without `watch=true` option set. (Note that in Kubernetes internally
+it translates to both GET and LIST calls).
+
+<a name="footnote5">\[5\]</a>A scope of a request can be either (a) `resource`
+if the request is about a single object, (b) `namespace` if it is about objects
+from a single namespace or (c) `cluster` if it spans objects from multiple
+namespaces.
+
 __TODO: Migrate existing SLIs/SLOs here:__
-- __API-machinery ones__
 - __Pod startup time__
 
 ### Burst SLIs/SLOs
@@ -106,3 +132,13 @@ __TODO: Migrate existing SLIs/SLOs here:__
 | Status | SLI | SLO | User stories, test scenarios, ... |
 | --- | --- | --- | --- |
 | WIP | Time to start 30\*#nodes pods, measured from test scenario start until observing last Pod as ready | Benchmark: when all images present on all Nodes, 99th percentile <= X minutes | [Details](./system_throughput.md) |
+
+### Other SLIs
+
+| Status | SLI | User stories, ... |
+| --- | --- | --- |
+| WIP | Watch latency for every resource, (from the moment when object is stored in database to when it's ready to be sent to all watchers), measured as 99th percentile over last 5 minutes | TODO |
+| WIP | Admission latency for each admission plugin type, measured as 99th percentile over last 5 minutes | [Details](./api_extensions_latency.md) |
+| WIP | Webhook call latency for each webhook type, measured as 99th percentile over last 5 minutes | [Details](./api_extensions_latency.md) |
+| WIP | Initializer latency for each initializer, measured as 99th percentile over last 5 minutes | [Details](./api_extensions_latency.md) |
+
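The second `__Official__` row added to slos.md above ties the read-only SLO threshold to the request scope defined in footnote 5. A small Go sketch of that lookup (the function name is illustrative; the 1s / 5s / 30s values are the ones from the table):

```go
package main

import (
	"fmt"
	"time"
)

// readOnlyThreshold maps the scope of a non-streaming read-only call
// (see footnote 5) to the latency threshold from the SLO table.
func readOnlyThreshold(scope string) (time.Duration, bool) {
	switch scope {
	case "resource": // request about a single object
		return 1 * time.Second, true
	case "namespace": // objects from a single namespace
		return 5 * time.Second, true
	case "cluster": // objects spanning multiple namespaces
		return 30 * time.Second, true
	default: // unknown scope, no threshold defined
		return 0, false
	}
}

func main() {
	for _, scope := range []string{"resource", "namespace", "cluster"} {
		threshold, ok := readOnlyThreshold(scope)
		fmt.Printf("scope=%-9s threshold=%v known=%v\n", scope, threshold, ok)
	}
}
```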
diff --git a/sig-scalability/slos/system_throughput.md b/sig-scalability/slos/system_throughput.md
index 369a6cba..5691b46d 100644
--- a/sig-scalability/slos/system_throughput.md
+++ b/sig-scalability/slos/system_throughput.md
@@ -1,3 +1,5 @@
+## System throughput SLI/SLO details
+
 ### User stories
 - As a user, I want a guarantee that my workload of X pods can be started
 within a given time
diff --git a/sig-scalability/slos/watch_latency.md b/sig-scalability/slos/watch_latency.md
new file mode 100644
index 00000000..2e698b4b
--- /dev/null
+++ b/sig-scalability/slos/watch_latency.md
@@ -0,0 +1,17 @@
+## Watch latency SLI details
+
+### User stories
+- As an administrator, if Kubernetes is slow, I would like to know if the root
+cause of it is slow api-machinery (slow watch) or something farther down the path
+(lack of network bandwidth, slow or cpu-starved controllers, ...)
+
+### Other notes
+- Pretty much all control loops in Kubernetes are watch-based. As a result
+slow watch means slow system in general.
+- Note that how we measure it silently assumes no clock-skew in case of
+cluster with multiple masters.
+
+### TODOs
+- Longer term, we would like to provide some guarantees on watch latency
+(e.g. 99th percentile of SLI per cluster-day <= Xms). However, we are not
+there yet.
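The watch latency SLI added above is defined as the time from the moment an object is stored in the database to the moment the resulting event is ready to be sent to all watchers. A minimal sketch of that measurement follows; the type and field names are hypothetical, and, as the document notes, comparing the two timestamps silently assumes no clock skew when they come from different masters:

```go
package main

import (
	"fmt"
	"time"
)

// watchEvent carries the two timestamps the SLI compares: when the object
// was committed to storage and when the resulting event was ready to be
// sent to all watchers. Field names are illustrative only.
type watchEvent struct {
	storedAt        time.Time
	readyToDispatch time.Time
}

// watchLatency returns the per-event watch latency as defined by the SLI.
func watchLatency(e watchEvent) time.Duration {
	return e.readyToDispatch.Sub(e.storedAt)
}

func main() {
	now := time.Now()
	e := watchEvent{storedAt: now, readyToDispatch: now.Add(35 * time.Millisecond)}
	fmt.Println(watchLatency(e)) // 35ms in this made-up example
}
```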