summaryrefslogtreecommitdiff
path: root/contributors
diff options
context:
space:
mode:
authorKubernetes Prow Robot <k8s-ci-robot@users.noreply.github.com>2020-01-14 18:55:32 -0800
committerGitHub <noreply@github.com>2020-01-14 18:55:32 -0800
commitefd16a304ee26b959bb225e7ef73f0097ecd96d3 (patch)
tree5dcc3692e13cae82241da96fd31387cb1a1852b8 /contributors
parentb2cb2573693f266cf1245e4ffb8def6409599fe5 (diff)
parentbe3986f7d043048bb639c3152da2373a4001c6b2 (diff)
Merge pull request #4403 from spiffxp/deflake-suggests
Add some suggestions on how to handle flakes
Diffstat (limited to 'contributors')
-rw-r--r--contributors/devel/sig-testing/flaky-tests.md68
1 files changed, 59 insertions, 9 deletions
diff --git a/contributors/devel/sig-testing/flaky-tests.md b/contributors/devel/sig-testing/flaky-tests.md
index f23272b1..0ca4e9a0 100644
--- a/contributors/devel/sig-testing/flaky-tests.md
+++ b/contributors/devel/sig-testing/flaky-tests.md
@@ -5,20 +5,70 @@ all tests are green, and we have a number of different CI systems running the
tests in various combinations, even a small percentage of flakes results in a
lot of pain for people waiting for their PRs to merge.
-Therefore, it's very important that we write tests defensively. Situations that
-"almost never happen" happen with some regularity when run thousands of times in
-resource-constrained environments. Since flakes can often be quite hard to
-reproduce while still being common enough to block merges occasionally, it's
-additionally important that the test logs be useful for narrowing down exactly
-what caused the failure.
-
-Note that flakes can occur in unit tests, integration tests, or end-to-end
-tests, but probably occur most commonly in end-to-end tests.
+Therefore, it's important we take flakes seriously. We should avoid flakes by
+writing our tests defensively. When flakes are identified, we should prioritize
+addressing them, either by fixing them or quarantining them off the critical
+path.
+
+# Avoiding Flakes
+
+Write tests defensively. Remember that "almost never" happens all the time when
+tests are run thousands of times in a CI environment. Tests need to be tolerant
+of other tests running concurrently, resource contention, and things taking
+longer than expected.
+
+There is a balance to be had here. Don't log too much, but don't log too little.
+Don't assume things will succeed after a fixed delay, but don't wait forever.
+
+- Ensure the test functions in parallel with other tests
+ - Be specific enough to ensure a test isn't thrown off by other tests' assets
+ - https://github.com/kubernetes/kubernetes/pull/85849 - eg: ensure resource name and namespace match
+ - https://github.com/kubernetes/kubernetes/pull/85967 - eg: tolerate errors for non k8s.io APIs
+ - https://github.com/kubernetes/kubernetes/pull/85619 - eg: tolerate multiple storage plugins
+- Ensure the test functions in a resource constrained environment
+ - Only ask for the resources you need
+ - https://github.com/kubernetes/kubernetes/pull/84975 - eg: drop memory constraints for test cases that only need cpu
+ - Don't use overly tight deadlines (but not overly broad either, non-[Slow] tests timeout after 5min)
+ - https://github.com/kubernetes/kubernetes/pull/85847 - eg: poll for `wait.ForeverTestTimeout` instead of 10s
+ - https://github.com/kubernetes/kubernetes/pull/84238 - eg: poll for 2min instead of 1min
+ - mark tests as [Slow] if they are unable to pass within 5min
+ - Do not expect actions to happen instantaneously or after a fixed delay
+ - Prefer informers and wait loops
+- Ensure the test provides sufficient context in logs for forensic debugging
+ - Explain what the test is doing, eg:
+ - "creating a foo with invalid configuration"
+ - "patching the foo to have a bar"
+ - Explain what specific check failed, and how, eg:
+ - "failed to create resource foo in namespace bar because of err"
+ - "expected all items to be deleted, but items foo, bar, and baz remain"
+ - Explain why a polling loop is failing, eg:
+ - "expected 3 widgets, found 2, will retry"
+ - "expected pod to be in state foo, currently in state bar, will retry"
+
+# Quarantining Flakes
+
+- When quarantining a presubmit test, ensure an issue exists in the current
+ release milestone assigned to the owning SIG. The issue should be labeled
+ `priority/critical-urgent`, `lifecycle/frozen`, and `kind/flake`. The
+ expectation is for the owning SIG to resolve the flakes and reintroduce the
+ test, or determine the tested functionality is covered via another method
+ and delete the test in question.
+- Quarantine a single test case by adding `[Flaky]` to the test name in question,
+ most CI jobs exclude these tests. This makes the most sense for flakes that
+ are merge-blocking and taking too long to troubleshoot, or occurring across
+ multiple jobs. - eg: https://github.com/kubernetes/kubernetes/pull/83792
+ - eg: https://github.com/kubernetes/kubernetes/pull/86327
+- Quarantine an entire set of tests by adding `[Feature:Foo]` to the test(s) in
+ question. This will require creating jobs that focus specifically on this
+ feature. The majority of release-blocking and merge-blocking suites avoid
+ these jobs unless they're proven to be non-flaky.
# Hunting Flakes
We offer the following tools to aid in finding or troubleshooting flakes
+- [flakes-latest.json](http://storage.googleapis.com/k8s-metrics/flakes-latest.json)
+ - shows the top 10 flakes over the past week for all PR jobs
- [go.k8s.io/triage] - an interactive test failure report providing filtering and drill-down by job name, test name, failure text for failures in the last two weeks
- https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&job=pull-kubernetes-e2e-gce%24 - all failures that happened in the `pull-kubernetes-e2e-gce` job
- https://storage.googleapis.com/k8s-gubernator/triage/index.html?text=timed%20out - all failures containing the text `timed out`