| author | wojtekt <wojtekt@google.com> | 2018-05-18 15:48:36 +0200 |
|---|---|---|
| committer | wojtekt <wojtekt@google.com> | 2018-05-24 13:11:02 +0200 |
| commit | 078bb85d376ea8890caa73744fd767afff1d793b (patch) | |
| tree | 81bfdf4c63867f5cf8ee6d99bd68646b8ab5b39f | |
| parent | e1f3ea4543d40c93aaf6581a84a9ab83533b3ef3 (diff) | |
Organize SLOs
| Mode | File | Lines changed |
|---|---|---|
| -rw-r--r-- | sig-scalability/slos/extending_slo.md | 72 |
| -rw-r--r-- | sig-scalability/slos/slos.md | 108 |
| -rw-r--r-- | sig-scalability/slos/system_throughput.md | 26 |
| -rw-r--r-- | sig-scalability/slos/throughput_burst_slo.md | 26 |
4 files changed, 134 insertions, 98 deletions
diff --git a/sig-scalability/slos/extending_slo.md b/sig-scalability/slos/extending_slo.md
deleted file mode 100644
index 5cbbb87f..00000000
--- a/sig-scalability/slos/extending_slo.md
+++ /dev/null
@@ -1,72 +0,0 @@
-# Extended Kubernetes scalability SLOs
-
-## Goal
-The goal of this effort is to extend SLOs which Kubernetes cluster has to meet to support given number of Nodes. As of April 2017 we have only two SLOs:
-- API-responsiveness: 99% of all API calls return in less than 1s
-- Pod startup time: 99% of Pods (with pre-pulled images) start within 5s
-which are enough to guarantee that cluster doesn't feel completely dead, but not enough to guarantee that it satisfies user's needs.
-
-We're going to define more SLOs based on most important indicators, and standardize the format in which we speak about our objectives. Our SLOs need to have two properties:
-- They need to be testable, i.e. we need to have a benchmark to measure if it's met,
-- They need to be expressed in a way that's possible to understand by a user not intimately familiar with the system internals, i.e. formulation can't depend on some arcane knowledge.
-
-On the other hand we do not require that:
-- SLOs are possible to monitor in a running cluster, i.e. not all SLOs need to be easily translatable to SLAs. Being able to benchmark is enough for us.
-
-## Split metrics from environment
-Currently what me measure and how we measure it is tightly coupled. This means that we don't have good environmental constraint suggestions for users (e.g. how many Pods per Namespace we support, how many Endpoints per Service, how to setup the cluster etc.). We need to decide on what's reasonable and make the environment explicit.
-
-## Split SLOs by kind
-Current SLOs implicitly assume that the cluster is in a "steady state". By this we mean that we assume that there's only some, limited, number of things going during benchmarking. We need to make this assumption explicit and split SLOs into two categories: steady-state SLOs and burst SLOs.
-
-## Steady state SLOs
-With steady state SLO we want to give users the data about system's behavior during normal operation. We define steady state by limiting the churn on the cluster.
-
-This includes current SLOs:
-- API call latency
-- E2e Pod startup latency
-
-By churn we understand a measure of amount changes happening in the cluster. Its formal(-ish) definition will follow, but informally it can be thought about as number of user-issued requests per second plus number of pods affected by those requests.
-
-More formally churn per second is defined as:
-```
-#Pod creations + #PodSpec updates + #user originated requests in a given second
-```
-The last part is necessary only to get rid of situations when user is spamming API server with various requests. In ordinary circumstances we expect it to be in the order of 1-2.
-
-## Burst SLOs
-With burst SLOs we want to give user idea on how system behaves under the heavy load, i.e. when one want the system to do something as quickly as possible, not caring too much about response time for a single request. Note that this voids all steady-state SLOs.
-
-This includes the new SLO:
-- Pod startup throughput
-
-## Environment
-A Kubernetes cluster in which we benchmark SLOs needs to meet the following criteria:
-- Run a single appropriately sized master machine
-- Main etcd runs as a single instance on the master machine
-- Events are stored in a separate etcd instance running on the master machine
-- Kubernetes version is at least 1.X.Y
-- Components configuration = _?_
-
-_TODO: NEED AN HA CONFIGURATION AS WELL_
-
-## SLO template
-All our performance SLOs should be defined using the following template:
-
----
-
-# SLO: *TL;DR description of the SLO*
-## (Burst|Steady state) foo bar SLO
-
-### Summary
-_One-two sentences describing the SLO, that's possible to understand by the majority of the community_
-
-### User Stories
-_A Few user stories showing in what situations users might be interested in this SLO, and why other ones are not enough_
-
-## Full definition
-### Test description
-_Precise description of test scenario, including maximum number of Pods per Controller, objects per namespace, and anything else that even remotely seems important_
-
-### Formal definition (can be skipped if the same as title/summary)
-_Precise and as formal as possible definition of SLO. This does not necessarily need to be easily understandable by layman_
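Both of the existing SLOs quoted at the top of the removed doc are 99th-percentile latency targets. As an illustrative aside (not part of this commit), a minimal Go sketch of how such a target can be checked against a set of benchmark-measured latencies; the function and the sample values below are hypothetical:

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// percentile returns the p-th percentile (0 < p <= 100) of the given
// latency samples, using the nearest-rank method.
func percentile(samples []time.Duration, p float64) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := int(math.Ceil(p/100.0*float64(len(sorted)))) - 1
	if rank < 0 {
		rank = 0
	}
	return sorted[rank]
}

func main() {
	// Hypothetical API call latencies gathered by a benchmark run.
	apiLatencies := []time.Duration{
		120 * time.Millisecond, 250 * time.Millisecond, 900 * time.Millisecond,
		80 * time.Millisecond, 1100 * time.Millisecond,
	}
	p99 := percentile(apiLatencies, 99)
	// The API-responsiveness SLO: 99% of calls return in less than 1s.
	fmt.Printf("p99 = %v, meets 1s target: %v\n", p99, p99 < time.Second)
}
```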
diff --git a/sig-scalability/slos/slos.md b/sig-scalability/slos/slos.md
new file mode 100644
index 00000000..49d26c6a
--- /dev/null
+++ b/sig-scalability/slos/slos.md
@@ -0,0 +1,108 @@
+# Kubernetes scalability and performance SLIs/SLOs
+
+## What does Kubernetes guarantee?
+
+One of the important aspects of Kubernetes is its scalability and performance
+characteristics. As a Kubernetes user or operator/administrator of a cluster
+you would expect to have some guarantees in those areas.
+
+The goal of this doc is to organize the guarantees that Kubernetes provides
+in these areas.
+
+## What do we require from SLIs/SLOs?
+
+We are going to define more SLIs and SLOs based on the most important indicators
+in the system.
+
+Our SLOs need to have the following properties:
+- <b> They need to be testable </b> <br/>
+  That means that we need to have a benchmark to measure whether each is met.
+- <b> They need to be understandable for users </b> <br/>
+  In particular, they need to be understandable for people not familiar
+  with the system internals, i.e. their formulation can't depend on some
+  arcane knowledge.
+
+However, we may introduce some internal (for developers only) SLIs that
+may be useful for understanding the performance characteristics of the system,
+but for which we don't provide any guarantees and which thus may not
+be fully understandable for users.
+
+On the other hand, we do NOT require that our SLOs:
+- are measurable in a running cluster (though that's desired if possible) <br/>
+  In other words, not all SLOs need to be easily translatable to SLAs.
+  Being able to benchmark is enough for us.
+
+## Types of SLOs
+
+While SLIs are very generic and don't really depend on anything (they just
+define what and how we measure), that is not the case for SLOs.
+SLOs provide guarantees, and satisfying them may depend on meeting some
+specific requirements.
+
+As a result, we build our SLOs in a "you promise, we promise" format.
+That means that we provide you a guarantee only if you satisfy the requirements
+that we put on you.
+
+As a consequence, we introduce two types of SLOs.
+
+### Steady state SLOs
+
+With steady state SLOs, we provide guarantees about the system's behavior during
+normal operation. We are able to provide many more guarantees in that situation.
+
+```Definition
+We define the system to be in steady state when the cluster churn per second is <= 20, where
+
+churn = #(Pod spec creations/updates/deletions) + #(user originated requests) in a given second
+```
+
+### Burst SLOs
+
+With burst SLOs, we provide guarantees on how the system behaves under heavy load
+(when the user wants the system to do something as quickly as possible, not caring
+too much about response time).
+
+## Environment
+
+In order to meet the SLOs, the system must run in an environment satisfying
+the following criteria:
+- Runs one or more appropriately sized master machines
+- Main etcd runs on the master machine(s)
+- Events are stored in a separate etcd running on the master machine(s)
+- Kubernetes version is at least X.Y.Z
+- ...
+
+__TODO: Document other necessary configuration.__
+
+## Thresholds
+
+To make the cluster eligible for SLOs, users also can't have too many objects in
+their clusters. More concretely, the number of different objects in the cluster
+MUST satisfy the thresholds defined in the [thresholds file][].
+
+[thresholds file]: https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md
+
+
+## Kubernetes SLIs/SLOs
+
+The currently existing SLIs/SLOs are enough to guarantee that the cluster isn't
+completely dead. However, they are not enough to satisfy users' needs in most
+cases.
+
+We are looking into extending the set of SLIs/SLOs to cover more parts of
+Kubernetes.
+
+### Steady state SLIs/SLOs
+
+| Status | SLI | SLO | User stories, test scenarios, ... |
+| --- | --- | --- | --- |

+__TODO: Migrate existing SLIs/SLOs here:__
+- __API-machinery ones__
+- __Pod startup time__
+
+### Burst SLIs/SLOs
+
+| Status | SLI | SLO | User stories, test scenarios, ... |
+| --- | --- | --- | --- |
+| WIP | Time to start 30\*#nodes pods, measured from test scenario start until observing last Pod as ready | Benchmark: when all images are present on all Nodes, 99th percentile <= X minutes | [Details](./system_throughput.md) |
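The steady-state definition introduced above bounds cluster churn at 20 per second. A minimal sketch of that check, assuming the per-second counters are gathered elsewhere; the type and field names below are illustrative, not any Kubernetes API:

```go
package main

import "fmt"

// churnCounters holds per-second counts in the spirit of the definition
// above; the struct and its fields are illustrative placeholders.
type churnCounters struct {
	podSpecChanges int // Pod spec creations + updates + deletions this second
	userRequests   int // user-originated API requests this second
}

// steadyStateThreshold mirrors the "churn per second <= 20" bound above.
const steadyStateThreshold = 20

// churn implements the formula: churn = #(Pod spec creations/updates/deletions)
// + #(user originated requests) in a given second.
func churn(c churnCounters) int {
	return c.podSpecChanges + c.userRequests
}

func main() {
	c := churnCounters{podSpecChanges: 12, userRequests: 5}
	fmt.Printf("churn = %d, steady state: %v\n", churn(c), churn(c) <= steadyStateThreshold)
}
```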
diff --git a/sig-scalability/slos/system_throughput.md b/sig-scalability/slos/system_throughput.md
new file mode 100644
index 00000000..369a6cba
--- /dev/null
+++ b/sig-scalability/slos/system_throughput.md
@@ -0,0 +1,26 @@
+### User stories
+- As a user, I want a guarantee that my workload of X pods can be started
+  within a given time.
+- As a user, I want to understand how quickly I can react to a dramatic
+  change in workload profile when my workload exhibits very bursty behavior
+  (e.g. a shop during a Black Friday sale).
+- As a user, I want a guarantee of how quickly I can recreate the whole setup
+  in case of a serious disaster which brings the whole cluster down.
+
+### Test scenario
+- Start with a healthy (all nodes ready, all cluster addons already running)
+  cluster with N (>0) running pause pods per node.
+- Create a number of `Namespaces` and a number of `Deployments` in each of them.
+- All `Namespaces` should be isomorphic, possibly excluding the last one, which
+  should run all pods that didn't fit in the previous ones.
+- A single namespace should run 5000 `Pods` in the following configuration:
+  - one big `Deployment` running ~1/3 of all `Pods` from this `Namespace`
+  - medium `Deployments`, each with 120 `Pods`, in total running ~1/3 of all
+    `Pods` from this `Namespace`
+  - small `Deployments`, each with 10 `Pods`, in total running ~1/3 of all `Pods`
+    from this `Namespace`
+- Each `Deployment` should be covered by a single `Service`.
+- Each `Pod` in any `Deployment` contains two pause containers, one `Secret`
+  other than the default `ServiceAccount` token secret, and one `ConfigMap`.
+  Additionally it has resource requests set and doesn't use any advanced
+  scheduling features or init containers.
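The ~1/3 split in the test scenario above leaves some freedom in how pods are divided into `Deployments`. One possible realization, sketched in Go under the assumption that each size class stays at or below its per-`Deployment` cap (the helper names are ours, not from the scenario):

```go
package main

import "fmt"

// breakdown spreads `pods` pods across Deployments of at most `maxSize`
// pods each, keeping sizes as even as possible. It returns the Deployment
// count, the base Deployment size, and how many Deployments get one extra pod.
func breakdown(pods, maxSize int) (count, baseSize, extra int) {
	count = (pods + maxSize - 1) / maxSize
	baseSize = pods / count
	extra = pods % count
	return
}

func main() {
	total := 5000          // pods per namespace, as in the scenario above
	third := total / 3     // each size class gets roughly a third
	big := total - 2*third // the big Deployment absorbs rounding leftovers

	mc, ms, me := breakdown(third, 120) // medium Deployments, <= 120 pods each
	sc, ss, se := breakdown(third, 10)  // small Deployments, <= 10 pods each

	fmt.Printf("big: 1 Deployment x %d pods\n", big)
	fmt.Printf("medium: %d Deployments x %d pods (%d with one extra)\n", mc, ms, me)
	fmt.Printf("small: %d Deployments x %d pods (%d with one extra)\n", sc, ss, se)
}
```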
diff --git a/sig-scalability/slos/throughput_burst_slo.md b/sig-scalability/slos/throughput_burst_slo.md
deleted file mode 100644
index e579acb1..00000000
--- a/sig-scalability/slos/throughput_burst_slo.md
+++ /dev/null
@@ -1,26 +0,0 @@
-# SLO: Kubernetes cluster of size at least X is able to start Y Pods in Z minutes
-**This is a WIP SLO doc - something that we want to meet, but we may not be there yet**
-
-## Burst Pod Startup Throughput SLO
-### User Stories
-- User is running a workload of X total pods and wants to ensure that it can be started in Y time.
-- User is running a system that exhibits very bursty behavior (e.g. shop during Black Friday Sale) and wants to understand how quickly they can react to a dramatic change in workload profile.
-- User is running a huge serving app on a huge cluster. He wants to know how quickly he can recreate his whole setup in case of a serious disaster which will bring the whole cluster down.
-
-Current steady state SLOs are do not provide enough data to make these assessments about burst behavior.
-## SLO definition (full)
-### Test setup
-Standard performance test kubernetes setup, as describe in [the doc](../extending_slo.md#environment).
-### Test scenario is following:
-- Start with a healthy (all nodes ready, all cluster addons already running) cluster with N (>0) running pause Pods/Node.
-- Create a number of Deployments that run X Pods and Namespaces necessary to create them.
-- All namespaces should be isomorphic, possibly excluding last one which should run all Pods that didn't fit in the previous ones.
-- Single Namespace should run at most 5000 Pods in the following configuration:
-  - one big Deployment running 1/3 of all Pods from this Namespace (1667 for 5000 Pod Namespace)
-  - medium Deployments, each of which is not running more than 120 Pods, running in total 1/3 of all Pods from this Namespace (14 Deployments with 119 Pods each for 5000 Pod Namespace)
-  - small Deployments, each of which is not running more than 10 Pods, running in total 1/3 of all Pods from this Namespace (238 Deployments with 7 Pods each for 5000 Pod Namespace)
-- Each Deployment is covered by a single Service.
-- Each Pod in any Deployment contains two pause containers, one secret other than ServiceAccount and one ConfigMap, has resource request set and doesn't use any advanced scheduling features (Affinities, etc.) or init containers.
-- Measure the time between starting the test and moment when last Pod is started according to it's Kubelet. Note that pause container is ready just after it's started, which may not be true for more complex containers that use nontrivial readiness probes.
-### Definition
-Kubernetes cluster of size at least X adhering to the environment definition, when running the specified test, 99th percentile of time necessary to start Y pods from the time when user created all controllers to the time when Kubelet starts the last Pod from the set is no greater than Z minutes, assuming that all images are already present on all Nodes.
\ No newline at end of file
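The measurement step in the removed burst SLO reduces to the interval from test start until the last Pod is started according to its Kubelet. A minimal sketch of that computation over already-collected timestamps; the names and the threshold value are placeholders:

```go
package main

import (
	"fmt"
	"time"
)

// timeToLastPodStart returns the interval between test start and the
// moment the last Pod was reported started, matching the measurement
// described in the removed definition above.
func timeToLastPodStart(testStart time.Time, podStarts []time.Time) time.Duration {
	last := testStart
	for _, t := range podStarts {
		if t.After(last) {
			last = t
		}
	}
	return last.Sub(testStart)
}

func main() {
	start := time.Now()
	// Hypothetical timestamps at which Kubelets reported Pods as started.
	podStarts := []time.Time{
		start.Add(40 * time.Second),
		start.Add(95 * time.Second),
	}
	elapsed := timeToLastPodStart(start, podStarts)
	z := 3 * time.Minute // stand-in for the SLO threshold Z
	fmt.Printf("elapsed = %v, within Z: %v\n", elapsed, elapsed <= z)
}
```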
