summaryrefslogtreecommitdiff
path: root/sig-scalability/block_merges.md
blob: fdc73c1d4a496f28c6a9d792c2a64e703da31dc8 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
# Blocking PR merges in the event of regression.

As mentioned in the charter, SIG scalability has a right to block all PRs
from merging into the relevant repos. This document describes the underlying
"Rules of engagement" of this process and the rationale why this is needed.

### Rules of engagement.
The rules of engagement for blocking merges are as following:

- Observe as scalability regression on one of release-blocking test suites
  (defined as green to red transition - if tests were already failing, we
  don't have a right to declare a regression).
- Block merges of all PRs to the relevant repos in the affected branch,
  declaring which repos those are and why.
- Identify the PR which caused the regression:
  - this can be done by reading code changes, bisecting, debugging based on
    metrics and/or logs, etc.
  - we say a PR is identified as the cause when we are reasonably confident
    that it indeed caused a regression, even if the mechanism is not 100%
    understood to minimize the time when merges are blocked
- Mitigate the regression. This may mean e.g.:
  - reverting the PR
  - switching a feature off (preferably by default, as last resort only in tests)
  - fixing the problem (if it's easy and quick to fix)
- Unblock merges of all PRs to the relevant repos in the affected branch.

The exact technical mechanisms for it are out of scope for this document.

### Rationale
The process described above is quite drastic, but we believe it is justified
if we want kubernetes to maintain scalability SLOs. The reasoning is:
- reliably testing for regressions takes a lot of time:
  - key scalability e2e tests take too long to execute to be a prerequisite
    for merging all PRs, this is an inherent characteristic of testing at scale,
  - end-to-end tests are flaky (even when not at scale) requiring retries,
- we need to prevent regression pile-ups:
  - once a regression is merged, and no other action is taken, it is only
    a matter of time until another regression is merged on top of it,
  - debugging the cause of two simultaneous (piled-up) regressions is 
    exponentially harder, see issue [53255](http://pr.k8s.io/53255) which
    links to past experience
- we need to keep flakiness of merge-blocking jobs very low:
- regarding benchmarks, there were several scalability issues in the past
  caught by (costly) large-scale e2e tests, which could have been caught and
  fixed earlier and with far less human effort if we had benchmark-like
  tests. Examples include:
  - scheduler anti-affinity affecting kube-dns,
  - kubelet network plugin increasing pod-startup latency,
  - large responses from apiserver violating gRPC MTU.

As explained in detail in an issue, not being able to maintain passing scalability
tests adversely affect:
- release quality
- release schedule
- engineer productivity