 sig-scalability/slos/api_call_latency.md       | 42 +
 sig-scalability/slos/api_extensions_latency.md |  6 +
 sig-scalability/slos/slos.md                   | 38 +-
 sig-scalability/slos/system_throughput.md      |  2 +
 sig-scalability/slos/watch_latency.md          | 17 +
 5 files changed, 104 insertions(+), 1 deletion(-)
diff --git a/sig-scalability/slos/api_call_latency.md b/sig-scalability/slos/api_call_latency.md
new file mode 100644
index 00000000..f4373703
--- /dev/null
+++ b/sig-scalability/slos/api_call_latency.md
@@ -0,0 +1,42 @@
+## API call latency SLIs/SLOs details
+
+### User stories
+- As a user of vanilla Kubernetes, I want some guarantee of how quickly I get
+the response from an API call.
+- As an administrator of a Kubernetes cluster, if I know the characteristics of
+my apiserver's external dependencies (e.g. custom admission plugins, webhooks
+and initializers), I want to be able to provide guarantees for API call latency
+to users of my cluster.
+
+### Other notes
+- We obviously can't give any guarantee in general, because cluster
+administrators are allowed to register custom admission plugins, webhooks
+and/or initializers, over which we have no control and which obviously
+impact API call latencies.
+- As a result, we define the SLIs to be very generic (no matter how your
+cluster is set up), but we provide the SLO only for default installations
+(where we have control over what the apiserver is doing). This avoids the
+false impression that we provide a guarantee no matter how the cluster is
+set up and what is installed on top of it.
+- At the same time, API calls are part of pretty much every non-trivial
+workflow in Kubernetes, so this metric is a building block for less trivial
+SLIs and SLOs.
+- The SLO for latency of read-only API calls of a given type may have a
+significant buffer in its threshold. In fact, the latency of a request should
+be proportional to the amount of work to do (which is the number of objects of
+a given type in a given scope) plus some constant overhead. For better tracking
+of performance, we may want to define a purely internal SLI of "latency per
+object", but that isn't in near-term plans.
+
+### Caveats
+- The SLO has to be satisfied independently of the encoding used in
+user-originated requests. This makes the mix of clients important while
+testing. However, we assume that all `core` components communicate with the
+apiserver using protocol buffers.
+- In the case of GET requests, the user has an option to opt in to accepting
+potentially stale data (served from cache) and the SLO again has to be
+satisfied independently of that. This makes the careful choice of requests in
+tests important.
+
+### Test scenario
+
+__TODO: Describe test scenario.__
diff --git a/sig-scalability/slos/api_extensions_latency.md b/sig-scalability/slos/api_extensions_latency.md
new file mode 100644
index 00000000..2681422c
--- /dev/null
+++ b/sig-scalability/slos/api_extensions_latency.md
@@ -0,0 +1,6 @@
+## API call extension points latency SLIs details
+
+### User stories
+- As an administrator, if API calls are slow, I would like to know whether this
+is because of slow extension points (admission plugins, webhooks, initializers)
+and, if so, which ones are responsible for it.
diff --git a/sig-scalability/slos/slos.md b/sig-scalability/slos/slos.md
index 49d26c6a..34aba2f8 100644
--- a/sig-scalability/slos/slos.md
+++ b/sig-scalability/slos/slos.md
@@ -92,13 +92,39 @@ of the cases.
 We are looking into extending the set of SLIs/SLOs to cover more parts of
 Kubernetes.
 
+```
+Prerequisite: Kubernetes cluster is available and serving.
+```
+
 ### Steady state SLIs/SLOs
 
 | Status | SLI | SLO | User stories, test scenarios, ... |
 | --- | --- | --- | --- |
+| __Official__ | Latency<sup>[1](#footnote1)</sup> of mutating<sup>[2](#footnote2)</sup> API calls for single objects for every (resource, verb) pair, measured as 99th percentile over last 5 minutes | In a default Kubernetes installation, for every (resource, verb) pair, excluding virtual and aggregated resources and Custom Resource Definitions, 99th percentile per cluster-day<sup>[3](#footnote3)</sup> <= 1s | [Details](./api_call_latency.md) |
+| __Official__ | Latency<sup>[1](#footnote1)</sup> of non-streaming read-only<sup>[4](#footnote4)</sup> API calls for every (resource, scope<sup>[5](#footnote5)</sup>) pair, measured as 99th percentile over last 5 minutes | In a default Kubernetes installation, for every (resource, scope) pair, excluding virtual and aggregated resources and Custom Resource Definitions, 99th percentile per cluster-day (a) <= 1s if `scope=resource` (b) <= 5s if `scope=namespace` (c) <= 30s if `scope=cluster` | [Details](./api_call_latency.md) |
+
+<a name="footnote1">\[1\]</a> By latency of an API call in this doc we mean the
+time from the moment the apiserver gets the request to the last byte of the
+response sent to the user.
+
+<a name="footnote2">\[2\]</a> By mutating API calls we mean POST, PUT, DELETE
+and PATCH.
+
+<a name="footnote3">\[3\]</a> For the purpose of visualization it will be a
+sliding window. However, for the purpose of reporting the SLO, it means one
+point per day (whether the SLO was satisfied on a given day or not).
+
+<a name="footnote4">\[4\]</a> By non-streaming read-only API calls we mean GET
+requests without the `watch=true` option set. (Note that internally Kubernetes
+translates these to both GET and LIST calls.)
+
+<a name="footnote5">\[5\]</a> The scope of a request can be either (a) `resource`
+if the request is about a single object, (b) `namespace` if it is about objects
+from a single namespace, or (c) `cluster` if it spans objects from multiple
+namespaces.
+
 __TODO: Migrate existing SLIs/SLOs here:__
 
-- __API-machinery ones__
 - __Pod startup time__
 
 ### Burst SLIs/SLOs
@@ -106,3 +132,13 @@ __TODO: Migrate existing SLIs/SLOs here:__
 | Status | SLI | SLO | User stories, test scenarios, ... |
 | --- | --- | --- | --- |
 | WIP | Time to start 30\*#nodes pods, measured from test scenario start until observing last Pod as ready | Benchmark: when all images present on all Nodes, 99th percentile <= X minutes | [Details](./system_throughput.md) |
+
+### Other SLIs
+
+| Status | SLI | User stories, ... |
+| --- | --- | --- |
+| WIP | Watch latency for every resource (from the moment an object is stored in the database to when it is ready to be sent to all watchers), measured as 99th percentile over last 5 minutes | TODO |
+| WIP | Admission latency for each admission plugin type, measured as 99th percentile over last 5 minutes | [Details](./api_extensions_latency.md) |
+| WIP | Webhook call latency for each webhook type, measured as 99th percentile over last 5 minutes | [Details](./api_extensions_latency.md) |
+| WIP | Initializer latency for each initializer, measured as 99th percentile over last 5 minutes | [Details](./api_extensions_latency.md) |
+
diff --git a/sig-scalability/slos/system_throughput.md b/sig-scalability/slos/system_throughput.md
index 369a6cba..5691b46d 100644
--- a/sig-scalability/slos/system_throughput.md
+++ b/sig-scalability/slos/system_throughput.md
@@ -1,3 +1,5 @@
+## System throughput SLI/SLO details
+
 ### User stories
 - As a user, I want a guarantee that my workload of X pods can be started
 within a given time
diff --git a/sig-scalability/slos/watch_latency.md b/sig-scalability/slos/watch_latency.md
new file mode 100644
index 00000000..1aa3d488
--- /dev/null
+++ b/sig-scalability/slos/watch_latency.md
@@ -0,0 +1,17 @@
+## Watch latency SLI details
+
+### User stories
+- As an administrator, if Kubernetes is slow, I would like to know whether the
+root cause is slow api-machinery (slow watch) or something further down the
+path (lack of network bandwidth, slow or CPU-starved controllers, ...).
+
+### Other notes
+- Pretty much all control loops in Kubernetes are watch-based. As a result, a
+slow watch means a slow system in general.
+- Note that the way we measure it silently assumes no clock skew in the case
+of a cluster with multiple masters.
+
+### TODOs
+- Longer term, we would like to provide some guarantees on watch latency
+(e.g. 99th percentile of the SLI per cluster-day <= X ms). However, we are not
+there yet.
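
As an illustration of the caveats in `api_call_latency.md` above, here is a minimal client-go sketch (not part of the patch itself; the kubeconfig path, namespace and pod name are assumed placeholders) showing the two knobs that matter for test traffic: the request encoding, and the opt-in to potentially stale, cache-served reads via `resourceVersion="0"`.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path; adjust to your environment.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}

	// Core components talk to the apiserver in protocol buffers; user clients
	// often use JSON. The SLO must hold for both, so test clients should
	// exercise both content types.
	config.ContentType = "application/vnd.kubernetes.protobuf"
	// config.ContentType = "application/json" // the other encoding to cover

	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// ResourceVersion: "0" opts in to potentially stale data served from the
	// apiserver cache instead of a quorum read; the SLO has to be satisfied
	// with and without this option.
	pod, err := client.CoreV1().Pods("default").Get(context.TODO(),
		"example-pod", metav1.GetOptions{ResourceVersion: "0"})
	if err != nil {
		panic(err)
	}
	fmt.Println("got pod:", pod.Name)
}
```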
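
Similarly, footnote [5] on request scope maps directly onto API request paths; the concrete namespace and object below are illustrative examples, not taken from the patch:

```
GET /api/v1/namespaces/default/pods/example-pod   -> scope=resource (single object)
GET /api/v1/namespaces/default/pods               -> scope=namespace
GET /api/v1/pods                                  -> scope=cluster (spans all namespaces)
```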
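
Finally, footnote [3] distinguishes the visualization view (a sliding window) from the reporting view (one pass/fail point per cluster-day). The following Go sketch of that collapsing step is an assumption for illustration only, with made-up sample data and the 1s threshold for mutating single-object calls:

```go
package main

import (
	"fmt"
	"time"
)

// dayMeetsSLO collapses one cluster-day of 99th-percentile latency samples
// (one per 5-minute window) into a single pass/fail point: the day passes
// only if every sample stayed at or below the threshold. The input format is
// an assumed interface, not one prescribed by the SLO document.
func dayMeetsSLO(p99Samples []time.Duration, threshold time.Duration) bool {
	for _, p99 := range p99Samples {
		if p99 > threshold {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical p99 latencies of mutating API calls over one cluster-day.
	samples := []time.Duration{
		320 * time.Millisecond,
		540 * time.Millisecond,
		910 * time.Millisecond,
	}
	// SLO threshold for mutating single-object calls: 1s.
	fmt.Println("SLO satisfied for this cluster-day:", dayMeetsSLO(samples, time.Second))
}
```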
