Merge pull request #2149 from wojtek-t/sig_scalability_charter

SIG Scalability Charter
author: k8s-ci-robot <k8s-ci-robot@users.noreply.github.com> 2018-08-31 10:56:25 -0700
committer: GitHub <noreply@github.com> 2018-08-31 10:56:25 -0700
commit: 55d3d7c780ec6172f6db689be041355170a6e357 (patch)
tree: 5934ec51858cbec7ea24037067dccb92eb1e9d4c
parent: 689654f4aac66a094bce03e8d2f2961249a9ff88 (diff)
parent: b6ea6539848054d838b4d9058565e0b34337133a (diff)
2 files changed, 149 insertions, 0 deletions
diff --git a/sig-scalability/block_merges.md b/sig-scalability/block_merges.md
new file mode 100644
index 00000000..d57c68a1
--- /dev/null
+++ b/sig-scalability/block_merges.md
@@ -0,0 +1,51 @@
+# Blocking PR merges in the event of regression.
+
+As mentioned in the charter, SIG Scalability has the right to block all PRs
+from merging into the relevant repos. This document describes the underlying
+"Rules of engagement" of this process and the rationale for why it is needed.
+
+### Rules of engagement.
+The rules of engagement for blocking merges are as follows:
+
+- Observe a scalability regression on one of the release-blocking test suites.
+- Block merges of all PRs.
+- Identify the PR which caused the regression:
+  - this can be done by reading code changes, bisecting, debugging based on
+    metrics and/or logs, etc.
+  - we say a PR is identified as the cause when we are reasonably confident
+    that it indeed caused a regression, even if the mechanism is not 100%
+    understood; this minimizes the time during which merges are blocked
+- Mitigate the regression. This may mean e.g.:
+  - reverting the PR
+  - switching a feature off (preferably by default, as a last resort only in tests)
+  - fixing the problem (if it's easy and quick to fix)
+- Unblock PR merges.
+
+The exact technical mechanisms for it are out of scope for this document.
+
+### Rationale
+The process described above is quite drastic, but we believe it is justified
+if we want Kubernetes to maintain scalability SLOs. The reasoning is:
+- reliably testing for regressions takes a lot of time:
+  - key scalability e2e tests take too long to execute to be a prerequisite
+    for merging all PRs; this is an inherent characteristic of testing at scale,
+  - end-to-end tests are flaky (even when not at scale), requiring retries,
+- we need to prevent regression pile-ups:
+  - once a regression is merged, and no other action is taken, it is only
+    a matter of time until another regression is merged on top of it,
+  - debugging the cause of two simultaneous (piled-up) regressions is
+    exponentially harder; see issue 53255, which links to past experience,
+- we need to keep flakiness of merge-blocking jobs very low:
+- regarding benchmarks, there were several scalability issues in the past
+  caught by (costly) large-scale e2e tests, which could have been caught and
+  fixed earlier and with far less human effort if we had benchmark-like
+  tests. Examples include:
+  - scheduler anti-affinity affecting kube-dns,
+  - kubelet network plugin increasing pod-startup latency,
+  - large responses from apiserver violating gRPC MTU.
+
+As explained in detail in an issue, not being able to maintain passing scalability
+tests adversely affects:
+- release quality
+- release schedule
+- engineer productivity
diff --git a/sig-scalability/charter.md b/sig-scalability/charter.md
new file mode 100644
index 00000000..acb051d3
--- /dev/null
+++ b/sig-scalability/charter.md
@@ -0,0 +1,98 @@
+# SIG Scalability Charter
+
+This charter adheres to the conventions described in the [Kubernetes Charter README]
+and uses the Roles and Organization Management outlined in [sig-governance].
+
+[sig-governance]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/sig-governance.md
+[Kubernetes Charter README]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/README.md
+
+## Scope
+
+SIG Scalability's primary responsibilities are to define and drive scalability
+goals for Kubernetes. This involves defining, testing and measuring performance and
+scalability related Service Level Indicators (SLIs) and ensuring that every
+Kubernetes release meets Service Level Objectives (SLOs) built on top of those
+SLIs.
+
+We also coordinate and contribute to general system-wide scalability and
+performance improvements (that don't fall into the charter of another individual
+SIG) by driving large architectural changes and finding bottlenecks, as well as
+provide consultations about any scalability and performance related aspects of
+Kubernetes.
+
+### In Scope
+
+#### Code, Binaries and Services:
+
+- Scalability and performance testing frameworks. Examples include:
+  - [Cluster loader](https://github.com/kubernetes/perf-tests/tree/master/clusterloader2)
+  - [Kubemark](https://github.com/kubernetes/kubernetes/tree/master/cmd/kubemark)
+- Scalability and performance tests:
+  - [Tests](https://github.com/kubernetes/kubernetes/blob/master/test/e2e/scalability/)
+  - [Jobs running those](https://github.com/kubernetes/test-infra/tree/master/config/jobs/kubernetes/sig-scalability)
+
+#### Cross-cutting and Externally Facing Processes
+
+- Defining what “Kubernetes scales” means by defining (or approving)
+individual performance SLIs/SLOs, ensuring they are all oriented toward user
+experience and consistent with each other:
+  - [SLIs/SLOs](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md)
+- Ensuring that each official Kubernetes release satisfies all scalability and
+performance related requirements, as stated in the "Kubernetes scalability" definition.
+- Establishing and documenting best practices on how to design and/or implement
+Kubernetes features in a scalable and performant way. Educating contributors and
+consulting on individual designs/implementations to ensure that those are widely used.
+Example artifacts:
+  - [Scalability governance](https://github.com/kubernetes/community/blob/master/sig-scalability/governance)
+- Finding system bottlenecks and coordinating improvement on cross-cutting
+architectural changes.
+
+### Out of scope
+
+- Improving performance/scalability of features falling into charters of
+individual SIGs.
+
+
+## What can we do/require from other SIGs
+Scalability and performance are horizontal aspects of the system - changes in a
+single place of Kubernetes may affect the whole system. As a result, to
+effectively ensure Kubernetes scales, we need special cross-SIG privileges.
+
+- We can roll back any merged PR if it has been identified as a cause of any
+  [performance/scalability SLOs] regression (identified by the set of release
+  blocking scalability/performance tests). The offending PR should only be
+  merged again after it has been proven to pass tests at scale.
+- In the event of a performance regression, we can block all PRs from being
+  merged into the relevant repos until the cause of the regression is
+  identified and mitigated.
+  The “Rules of engagement” for pausing the merge queue and the rationale for
+  why it is needed are explained in [a separate doc](./block_merges.md).
+- We require that significant changes (in terms of impact, such as: update of etcd,
+  update of Go version, major architectural changes, etc.) be merged only:
+  - with an explicit approval from a SIG-scalability tech lead and
+  - after having passed performance testing on the biggest supported clusters (unless
+    found unnecessary by the approver)
+- We can block a feature from transitioning:
+  - to Beta status, if (when turned on) it causes violation of already existing
+    performance/scalability SLOs;
+  - to GA status, unless it can be used at scale. That means:
+    - in rare cases, introducing a new SLI and SLO and ensuring it is met at scale
+    - in most cases, extending scalability tests to use it and ensuring that
+      existing SLOs are still met
+- We can require a SIG to introduce a regression-catching benchmark test for
+  scalability-critical functionality.
+
+[performance/scalability SLOs]: https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md
+
+## Roles and Organization Management
+
+This SIG adheres to the Roles and Organization Management outlined in
+[sig-governance] and opts in to updates and modifications to [sig-governance].
+
+[sig-governance]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/sig-governance.md
+
+### Subproject Creation
+
+SIG Scalability delegates subproject approval to Technical Leads. See [Subproject creation - Option 1].
+
+[Subproject creation - Option 1]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/sig-governance.md#subproject-creation