| field | value | date |
|---|---|---|
| author | k8s-ci-robot <k8s-ci-robot@users.noreply.github.com> | 2018-05-30 03:33:53 -0700 |
| committer | GitHub <noreply@github.com> | 2018-05-30 03:33:53 -0700 |
| commit | 45dfc3d03868f5bdc71fa0121e487c30521ce915 (patch) | |
| tree | 318260429ed08008ad200ed61c6141825cbfa204 | |
| parent | c68bc2dfaed28d1349f16fae97542a35471bdeed (diff) | |
| parent | 39ddb46fbfc2535777abf57d8b7b7640d7a24666 (diff) | |
Merge pull request #2198 from wojtek-t/organize_slos_2
Better organize performance SLIs/SLOs - part 2
| mode | file | lines changed |
|---|---|---|
| -rw-r--r-- | sig-scalability/slis/apimachinery_slis.md | 196 |
| -rw-r--r-- | sig-scalability/slos/api_call_latency.md | 47 |
| -rw-r--r-- | sig-scalability/slos/api_extensions_latency.md | 6 |
| -rw-r--r-- | sig-scalability/slos/slos.md | 38 |
| -rw-r--r-- | sig-scalability/slos/system_throughput.md | 2 |
| -rw-r--r-- | sig-scalability/slos/watch_latency.md | 17 |
6 files changed, 109 insertions, 197 deletions
diff --git a/sig-scalability/slis/apimachinery_slis.md b/sig-scalability/slis/apimachinery_slis.md
deleted file mode 100644
index b86d3f57..00000000
--- a/sig-scalability/slis/apimachinery_slis.md
+++ /dev/null
@@ -1,196 +0,0 @@
-# API-machinery SLIs and SLOs
-
-The document was converted from [Google Doc]. Please refer to the original for
-extended commentary and discussion.
-
-## Background
-
-Scalability is an important aspect of the Kubernetes. However, Kubernetes is
-such a large system that we need to manage users expectations in this area.
-To achieve it, we are in process of redefining what does it mean that
-Kubernetes supports X-node clusters - this doc describes the high-level
-proposal. In this doc we are describing API-machinery related SLIs we would
-like to introduce and suggest which of those should eventually have a
-corresponding SLO replacing current "99% of API calls return in under 1s" one.
-
-The SLOs we are proposing in this doc are our goal - they may not be currently
-satisfied. As a result, while in the future we would like to block the release
-when we are violating SLOs, we first need to understand where exactly we are
-now, define and implement proper tests and potentially improve the system.
-Only once this is done, we may try to introduce a policy of blocking the
-release on SLO violation. But this is out of scope of this doc.
-
-
-### SLIs and SLOs proposal
-
-Below we introduce all SLIs and SLOs we would like to have in the api-machinery
-area. A bunch of those are not easy to understand for users, as they are
-designed for developers or performance tracking of higher level
-user-understandable SLOs. The user-oriented one (which we want to publicly
-announce) are additionally highlighted with bold.
-
-### Prerequisite
-
-Kubernetes cluster is available and serving.
-
-### Latency<sup>[1](#footnote1)</sup> of API calls for single objects
-
-__***SLI1: Non-streaming API calls for single objects (POST, PUT, PATCH, DELETE,
-GET) latency for every (resource, verb) pair, measured as 99th percentile over
-last 5 minutes***__
-
-__***SLI2: 99th percentile for (resource, verb) pairs \[excluding virtual and
-aggregated resources and Custom Resource Definitions\] combined***__
-
-__***SLO: In default Kubernetes installation, 99th percentile of SLI2
-per cluster-day<sup>[2](#footnote2)</sup> <= 1s***__
-
-User stories:
-- As a user of vanilla Kubernetes, I want some guarantee how quickly I get the
-response from an API call.
-- As an administrator of Kubernetes cluster, if I know characteristics of my
-external dependencies of apiserver (e.g custom admission plugins, webhooks and
-initializers) I want to be able to provide guarantees for API calls latency to
-users of my cluster
-
-Background:
-- We obviously can’t give any guarantee in general, because cluster
-administrators are allowed to register custom admission plugins, webhooks
-and/or initializers, which we don’t have any control about and they obviously
-impact API call latencies.
-- As a result, we define the SLIs to be very generic (no matter how your
-cluster is set up), but we provide SLO only for default installations (where we
-have control over what apiserver is doing). This doesn’t provide a false
-impression, that we provide guarantee no matter how the cluster is setup and
-what is installed on top of it.
-- At the same time, API calls are part of pretty much every non-trivial workflow
-in Kubernetes, so this metric is a building block for less trivial SLIs and
-SLOs.
-
-Other notes:
-- The SLO has to be satisfied independently from the used encoding. This
-makes the mix of client important while testing. However, we assume that all
-`core` components communicate with apiserver with protocol buffers (otherwise
-the SLO doesn’t have to be satisfied).
-- In case of GET requests, user has an option to opt-in for accepting
-potentially stale data (the request is then served from cache and not hitting
-underlying storage). However, the SLO has to be satisfied even if all requests
-ask for up-to-date data, which again makes careful choice of requests in tests
-important while testing.
-
-
-### Latency of API calls for multiple objects
-
-__***SLI1: Non-streaming API calls for multiple objects (LIST) latency for
-every (resource, verb) pair, measure as 99th percentile over last 5 minutes***__
-
-__***SLI2: 99th percentile for (resource, verb) pairs [excluding virtual and
-aggregated resources and Custom Resource Definitions] combined***__
-
-__***SLO1: In default Kubernetes installation, 99th percentile of SLI2 per
-cluster-day***__
-- __***is <= 1s if total number of objects of the same type as resource in the
-system <= X***__
-- __***is <= 5s if total number of objects of the same type as resource in the
-system <= Y***__
-- __***is <= 30s if total number of objects of the same types as resource in the
-system <= Z***__
-
-User stories:
-- As a user of vanilla Kubernetes, I want some guarantee how quickly I get the
-response from an API call.
-- As an administrator of Kubernetes cluster, if I know characteristics of my
-external dependencies of apiserver (e.g custom admission plugins, webhooks and
-initializers) I want to be able to provide guarantees for API calls latency to
-users of my cluster.
-
-Background:
-- On top of arguments from latency of API calls for single objects, LIST
-operations are crucial part of watch-related frameworks, which in turn are
-responsible for overall system performance and responsiveness.
-- The above SLO is user-oriented and may have significant buffer in threshold.
-In fact, the latency of the request should be proportional to the amount of
-work to do (which in our case is number of objects of a given type (potentially
-in a requested namespace if specified)) plus some constant overhead. For better
-tracking of performance, we define the other SLIs which are supposed to be
-purely internal (developer-oriented)
-
-
-_SLI3: Non-streaming API calls for multiple objects (LIST) latency minus 1s
-(maxed with 0) divided by number of objects in the collection
-<sup>[3](#footnote3)</sup> (which may be many more than the number of returned
-objects) for every (resource, verb) pair, measured as 99th percentile over
-last 5 minutes._
-
-_SLI4: 99th percentile for (resource, verb) pairs [excluding virtual and
-aggregated resources and Custom Resource Definitions] combined_
-
-_SLO2: In default Kubernetes installation, 99th percentile of SLI4 per
-cluster-day <= Xms_
-
-
-### Watch latency
-
-_SLI1: API-machinery watch latency (measured from the moment when object is
-stored in database to when it’s ready to be sent to all watchers), measured
-as 99th percentile over last 5 minutes_
-
-_SLO1 (developer-oriented): 99th percentile of SLI1 per cluster-day <= Xms_
-
-User stories:
-- As an administrator, if system is slow, I would like to know if the root
-cause is slow api-machinery or something farther the path (lack of network
-bandwidth, slow or cpu-starved controllers, ...).
-
-Background:
-- Pretty much all control loops in Kubernetes are watch-based, so slow watch
-means slow system in general. As a result, we want to give some guarantees on
-how fast it is.
-- Note that how we measure it, silently assumes no clock-skew in case of HA
-clusters.
-
-
-### Admission plugin latency
-
-_SLI1: Admission latency for each admission plugin type, measured as 99th
-percentile over last 5 minutes_
-
-User stories:
-- As an administrator, if API calls are slow, I would like to know if this is
-because slow admission plugins and if so which ones are responsible.
-
-
-### Webhook latency
-
-_SLI1: Webhook call latency for each webhook type, measured as 99th percentile
-over last 5 minutes_
-
-User stories:
-- As an administrator, if API calls are slow, I would like to know if this is
-because slow webhooks and if so which ones are responsible.
-
-
-### Initializer latency
-
-_SLI1: Initializer latency for each initializer, measured as 99th
-percentile over last 5 minutes_
-
-User stories:
-- As an administrator, if API calls are slow, I would like to know if this is
-because of slow initializers and if so which ones are responsible.
-
----
-<a name="footnote1">\[1\]</a>By latency of API call in this doc we mean time
-from the moment when apiserver gets the request to last byte of response sent
-to the user.
-
-<a name="footnote2">\[2\]</a> For the purpose of visualization it will be a
-sliding window. However, for the purpose of reporting the SLO, it means one
-point per day (whether SLO was satisfied on a given day or not).
-
-<a name="footnote3">\[3\]</a>A collection contains: (a) all objects of that
-type for cluster-scoped resources, (b) all object of that type in a given
-namespace for namespace-scoped resources.
-
-
-[Google Doc]: https://docs.google.com/document/d/1Q5qxdeBPgTTIXZxdsFILg7kgqWhvOwY8uROEf0j5YBw/edit#
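The deleted document's SLI3 above defines a per-object LIST latency: the request latency minus 1s (floored at zero), divided by the number of objects in the collection rather than the number of objects returned. As a minimal sketch of that formula only (the function and parameter names below are illustrative, not taken from Kubernetes code):

```go
package main

import (
	"fmt"
	"time"
)

// perObjectListLatency implements the "latency per object" idea from SLI3:
// LIST latency minus 1s (floored at zero), divided by the number of objects
// in the collection (which may be many more than the objects returned).
func perObjectListLatency(listLatency time.Duration, collectionSize int) time.Duration {
	overhead := listLatency - time.Second
	if overhead < 0 {
		overhead = 0
	}
	if collectionSize == 0 {
		return 0
	}
	return overhead / time.Duration(collectionSize)
}

func main() {
	// Example: a LIST over a collection of 5000 objects taking 3.5s yields
	// (3.5s - 1s) / 5000 = 0.5ms per object.
	fmt.Println(perObjectListLatency(3500*time.Millisecond, 5000))
}
```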
diff --git a/sig-scalability/slos/api_call_latency.md b/sig-scalability/slos/api_call_latency.md
new file mode 100644
index 00000000..65d7dc26
--- /dev/null
+++ b/sig-scalability/slos/api_call_latency.md
@@ -0,0 +1,47 @@
+## API call latency SLIs/SLOs details
+
+### User stories
+- As a user of vanilla Kubernetes, I want some guarantee how quickly I get the
+response from an API call.
+- As an administrator of Kubernetes cluster, if I know characteristics of my
+external dependencies of apiserver (e.g custom admission plugins, webhooks and
+initializers) I want to be able to provide guarantees for API calls latency to
+users of my cluster
+
+### Other notes
+- We obviously can’t give any guarantee in general, because cluster
+administrators are allowed to register custom admission plugins, webhooks
+and/or initializers, which we don’t have any control about and they obviously
+impact API call latencies.
+- As a result, we define the SLIs to be very generic (no matter how your
+cluster is set up), but we provide SLO only for default installations (where we
+have control over what apiserver is doing). This doesn’t provide a false
+impression, that we provide guarantee no matter how the cluster is setup and
+what is installed on top of it.
+- At the same time, API calls are part of pretty much every non-trivial workflow
+in Kubernetes, so this metric is a building block for less trivial SLIs and
+SLOs.
+- The SLO for latency for read-only API calls of a given type may have significant
+buffer in threshold. In fact, the latency of the request should be proportional to
+the amount of work to do (which is number of objects of a given type in a given
+scope) plus some constant overhead. For better tracking of performance, we
+may want to define purely internal SLI of "latency per object". But that
+isn't in near term plans.
+
+### Caveats
+- The SLO has to be satisfied independently from used encoding in user-originated
+requests. This makes mix of client important while testing. However, we assume
+that all `core` components communicate with apiserver using protocol buffers.
+- In case of GET requests, user has an option to opt-in for accepting potentially
+stale data (being served from cache) and the SLO again has to be satisfied
+independently of that. This makes the careful choice of requests in tests
+important.
+
+### TODOs
+- We may consider treating `non-namespaced` resources as a separate bucket in
+the future. However, it may not make sense if the number of those may be
+comparable with `namespaced` ones.
+
+### Test scenario
+
+__TODO: Describe test scenario.__
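The SLIs in the file above are phrased as a "99th percentile over last 5 minutes". In practice the apiserver exposes request latencies as Prometheus metrics; purely to illustrate the arithmetic, here is a hedged Go sketch that computes a nearest-rank 99th percentile from a hypothetical 5-minute window of raw latencies and checks it against the 1s threshold (all names and sample values are made up):

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// percentile returns the p-th percentile (0 < p <= 100) of the latency
// samples, using the nearest-rank method.
func percentile(samples []time.Duration, p float64) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	// Hypothetical latencies observed in the last 5 minutes for one
	// (resource, verb) pair.
	window := []time.Duration{
		40 * time.Millisecond, 80 * time.Millisecond, 120 * time.Millisecond,
		300 * time.Millisecond, 950 * time.Millisecond,
	}
	p99 := percentile(window, 99)
	fmt.Printf("p99 = %v, within the 1s threshold: %v\n", p99, p99 <= time.Second)
}
```

Nearest-rank is chosen here only for brevity; a histogram-based estimate (as Prometheus produces) would behave differently at the tail.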
diff --git a/sig-scalability/slos/api_extensions_latency.md b/sig-scalability/slos/api_extensions_latency.md
new file mode 100644
index 00000000..2681422c
--- /dev/null
+++ b/sig-scalability/slos/api_extensions_latency.md
@@ -0,0 +1,6 @@
+## API call extension points latency SLIs details
+
+### User stories
+- As an administrator, if API calls are slow, I would like to know if this is
+because of slow extension points (admission plugins, webhooks, initializers) and
+if so which ones are responsible for it.
diff --git a/sig-scalability/slos/slos.md b/sig-scalability/slos/slos.md
index 49d26c6a..34aba2f8 100644
--- a/sig-scalability/slos/slos.md
+++ b/sig-scalability/slos/slos.md
@@ -92,13 +92,39 @@ of the cases.
 We are looking into extending the set of SLIs/SLOs to cover more parts of
 Kubernetes.
 
+```
+Prerequisite: Kubernetes cluster is available and serving.
+```
+
 ### Steady state SLIs/SLOs
 
 | Status | SLI | SLO | User stories, test scenarios, ... |
 | --- | --- | --- | --- |
+| __Official__ | Latency<sup>[1](#footnote1)</sup> of mutating<sup>[2](#footnote2)</sup> API calls for single objects for every (resource, verb) pair, measured as 99th percentile over last 5 minutes | In default Kubernetes installation, for every (resource, verb) pair, excluding virtual and aggregated resources and Custom Resource Definitions, 99th percentile per cluster-day<sup>[3](#footnote3)</sup> <= 1s | [Details](./api_call_latency.md) |
+| __Official__ | Latency<sup>[1](#footnote1)</sup> of non-streaming read-only<sup>[4](#footnote4)</sup> API calls for every (resource, scope<sup>[5](#footnote5)</sup>) pair, measured as 99th percentile over last 5 minutes | In default Kubernetes installation, for every (resource, scope) pair, excluding virtual and aggregated resources and Custom Resource Definitions, 99th percentile per cluster-day (a) <= 1s if `scope=resource` (b) <= 5s if `scope=namespace` (c) <= 30s if `scope=cluster` | [Details](./api_call_latency.md) |
+
+<a name="footnote1">\[1\]</a>By latency of API call in this doc we mean time
+from the moment when apiserver gets the request to last byte of response sent
+to the user.
+
+<a name="footnote2">\[2\]</a>By mutating API calls we mean POST, PUT, DELETE
+and PATCH.
+
+<a name="footnote3">\[3\]</a> For the purpose of visualization it will be a
+sliding window. However, for the purpose of reporting the SLO, it means one
+point per day (whether SLO was satisfied on a given day or not).
+
+<a name="footnote4">\[4\]</a>By non-streaming read-only API calls we mean GET
+requests without `watch=true` option set. (Note that in Kubernetes internally
+it translates to both GET and LIST calls).
+
+<a name="footnote5">\[5\]</a>A scope of a request can be either (a) `resource`
+if the request is about a single object, (b) `namespace` if it is about objects
+from a single namespace or (c) `cluster` if it spans objects from multiple
+namespaces.
+
 __TODO: Migrate existing SLIs/SLOs here:__
-- __API-machinery ones__
 - __Pod startup time__
 
 ### Burst SLIs/SLOs
@@ -106,3 +132,13 @@ __TODO: Migrate existing SLIs/SLOs here:__
 | Status | SLI | SLO | User stories, test scenarios, ... |
 | --- | --- | --- | --- |
 | WIP | Time to start 30\*#nodes pods, measured from test scenario start until observing last Pod as ready | Benchmark: when all images present on all Nodes, 99th percentile <= X minutes | [Details](./system_throughput.md) |
+
+### Other SLIs
+
+| Status | SLI | User stories, ... |
+| --- | --- | --- |
+| WIP | Watch latency for every resource, (from the moment when object is stored in database to when it's ready to be sent to all watchers), measured as 99th percentile over last 5 minutes | TODO |
+| WIP | Admission latency for each admission plugin type, measured as 99th percentile over last 5 minutes | [Details](./api_extensions_latency.md) |
+| WIP | Webhook call latency for each webhook type, measured as 99th percentile over last 5 minutes | [Details](./api_extensions_latency.md) |
+| WIP | Initializer latency for each initializer, measured as 99th percentile over last 5 minutes | [Details](./api_extensions_latency.md) |
+
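The second `__Official__` row added to slos.md above ties the read-only SLO threshold to the request scope defined in footnote 5. A small Go sketch of that lookup (the function name is illustrative; the 1s / 5s / 30s values are the ones from the table):

```go
package main

import (
	"fmt"
	"time"
)

// readOnlyThreshold maps the scope of a non-streaming read-only call
// (see footnote 5) to the latency threshold from the SLO table.
func readOnlyThreshold(scope string) (time.Duration, bool) {
	switch scope {
	case "resource": // request about a single object
		return 1 * time.Second, true
	case "namespace": // objects from a single namespace
		return 5 * time.Second, true
	case "cluster": // objects spanning multiple namespaces
		return 30 * time.Second, true
	default: // unknown scope, no threshold defined
		return 0, false
	}
}

func main() {
	for _, scope := range []string{"resource", "namespace", "cluster"} {
		threshold, ok := readOnlyThreshold(scope)
		fmt.Printf("scope=%-9s threshold=%v known=%v\n", scope, threshold, ok)
	}
}
```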
diff --git a/sig-scalability/slos/system_throughput.md b/sig-scalability/slos/system_throughput.md
index 369a6cba..5691b46d 100644
--- a/sig-scalability/slos/system_throughput.md
+++ b/sig-scalability/slos/system_throughput.md
@@ -1,3 +1,5 @@
+## System throughput SLI/SLO details
+
 ### User stories
 - As a user, I want a guarantee that my workload of X pods can be started
 within a given time
diff --git a/sig-scalability/slos/watch_latency.md b/sig-scalability/slos/watch_latency.md
new file mode 100644
index 00000000..2e698b4b
--- /dev/null
+++ b/sig-scalability/slos/watch_latency.md
@@ -0,0 +1,17 @@
+## Watch latency SLI details
+
+### User stories
+- As an administrator, if Kubernetes is slow, I would like to know if the root
+cause of it is slow api-machinery (slow watch) or something farther down the path
+(lack of network bandwidth, slow or cpu-starved controllers, ...)
+
+### Other notes
+- Pretty much all control loops in Kubernetes are watch-based. As a result
+slow watch means slow system in general.
+- Note that how we measure it silently assumes no clock-skew in case of
+cluster with multiple masters.
+
+### TODOs
+- Longer term, we would like to provide some guarantees on watch latency
+(e.g. 99th percentile of SLI per cluster-day <= Xms). However, we are not
+there yet.
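The watch latency SLI added above is defined as the time from the moment an object is stored in the database to the moment the resulting event is ready to be sent to all watchers. A minimal sketch of that measurement follows; the type and field names are hypothetical, and, as the document notes, comparing the two timestamps silently assumes no clock skew when they come from different masters:

```go
package main

import (
	"fmt"
	"time"
)

// watchEvent carries the two timestamps the SLI compares: when the object
// was committed to storage and when the resulting event was ready to be
// sent to all watchers. Field names are illustrative only.
type watchEvent struct {
	storedAt        time.Time
	readyToDispatch time.Time
}

// watchLatency returns the per-event watch latency as defined by the SLI.
func watchLatency(e watchEvent) time.Duration {
	return e.readyToDispatch.Sub(e.storedAt)
}

func main() {
	now := time.Now()
	e := watchEvent{storedAt: now, readyToDispatch: now.Add(35 * time.Millisecond)}
	fmt.Println(watchLatency(e)) // 35ms in this made-up example
}
```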