author     Erik L. Arneson <erik@lionswaycontent.com>  2020-10-08 15:23:56 -0700
committer  Erik L. Arneson <erik@lionswaycontent.com>  2020-10-21 09:03:13 -0700
commit     15de14640bc773bca6e793a85ff2f110cfba3e67 (patch)
tree       f370f0203a5e552e7c1aaef3a368b3e0b6995060 /contributors
parent     c2fa487203f41b59fcd57293cb743411e936bf6c (diff)
Add Deflaking tests information to Developer Guide
This updates flaky-tests.md with all of the information on finding and
deflaking tests from the presentation to SIG Testing found here:
https://www.youtube.com/watch?v=Ewp8LNY_qTg

Also, this drops the outdated "Hunting flaky unit tests" section from
flaky-tests.md.

Co-authored-by: Aaron Crickenberger <spiffxp@google.com>
Diffstat (limited to 'contributors')
-rw-r--r--  contributors/devel/sig-testing/flaky-tests.md | 320
1 file changed, 232 insertions(+), 88 deletions(-)
diff --git a/contributors/devel/sig-testing/flaky-tests.md b/contributors/devel/sig-testing/flaky-tests.md
index 1745e23f..e90a6785 100644
--- a/contributors/devel/sig-testing/flaky-tests.md
+++ b/contributors/devel/sig-testing/flaky-tests.md
@@ -1,4 +1,4 @@
-# Flaky tests
+# Flaky Tests
Any test that fails occasionally is "flaky". Since our merges only proceed when
all tests are green, and we have a number of different CI systems running the
@@ -10,7 +10,27 @@ writing our tests defensively. When flakes are identified, we should prioritize
addressing them, either by fixing them or quarantining them off the critical
path.
-# Avoiding Flakes
+For more information about deflaking Kubernetes tests, watch @liggitt's
+[presentation from Kubernetes SIG Testing - 2020-08-25](https://www.youtube.com/watch?v=Ewp8LNY_qTg).
+
+**Table of Contents**
+
+- [Flaky Tests](#flaky-tests)
+  - [Avoiding Flakes](#avoiding-flakes)
+  - [Quarantining Flakes](#quarantining-flakes)
+  - [Hunting Flakes](#hunting-flakes)
+  - [GitHub Issues for Known Flakes](#github-issues-for-known-flakes)
+    - [Expectations when a flaky test is assigned to you](#expectations-when-a-flaky-test-is-assigned-to-you)
+    - [Writing a good flake report](#writing-a-good-flake-report)
+  - [Deflaking unit tests](#deflaking-unit-tests)
+  - [Deflaking integration tests](#deflaking-integration-tests)
+  - [Deflaking e2e tests](#deflaking-e2e-tests)
+    - [Gathering information](#gathering-information)
+    - [Filtering and correlating information](#filtering-and-correlating-information)
+    - [What to look for](#what-to-look-for)
+
+## Avoiding Flakes
Write tests defensively. Remember that "almost never" happens all the time when
tests are run thousands of times in a CI environment. Tests need to be tolerant
@@ -45,7 +65,7 @@ Don't assume things will succeed after a fixed delay, but don't wait forever.
- "expected 3 widgets, found 2, will retry"
- "expected pod to be in state foo, currently in state bar, will retry"
-# Quarantining Flakes
+## Quarantining Flakes
- When quarantining a presubmit test, ensure an issue exists in the current
release milestone assigned to the owning SIG. The issue should be labeled
@@ -63,7 +83,7 @@ Don't assume things will succeed after a fixed delay, but don't wait forever.
feature. The majority of release-blocking and merge-blocking suites avoid
these jobs unless they're proven to be non-flaky.
-# Hunting Flakes
+## Hunting Flakes
We offer the following tools to aid in finding or troubleshooting flakes
@@ -82,7 +102,7 @@ We offer the following tools to aid in finding or troubleshooting flakes
[go.k8s.io/triage]: https://go.k8s.io/triage
[testgrid.k8s.io]: https://testgrid.k8s.io
-# GitHub Issues for Known Flakes
+## GitHub Issues for Known Flakes
Because flakes may be rare, it's very important that all relevant logs be
discoverable from the issue.
@@ -105,7 +125,7 @@ flakes is a quick way to gain expertise and community goodwill.
[flake]: https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+is%3Aissue+label%3Akind%2Fflake
-## Expectations when a flaky test is assigned to you
+### Expectations when a flaky test is assigned to you
Note that we won't randomly assign these issues to you unless you've opted in or
you're part of a group that has opted in. We are more than happy to accept help
@@ -140,113 +160,237 @@ release-blocking flakes. Therefore we have the following guidelines:
name, which will result in the test being quarantined to only those jobs that
explicitly run flakes (eg: https://testgrid.k8s.io/google-gce#gci-gce-flaky)
-# Reproducing unit test flakes
+### Writing a good flake report
-Try the [stress command](https://godoc.org/golang.org/x/tools/cmd/stress).
+If you are reporting a flake, it is important to include enough information for
+others to reproduce the issue. When filing the issue, use the
+[flaking test template](https://github.com/kubernetes/kubernetes/issues/new?labels=kind%2Fflake&template=flaking-test.md). In
+your issue, answer the following questions:
-Just
+- Is this flaking in multiple jobs? You can search for the flaking test or error
+ messages using the
+ [Kubernetes Aggregated Test Results](http://go.k8s.io/triage) tool.
+- Are there multiple tests in the same package or suite failing with the same apparent error?
-```
-$ go install golang.org/x/tools/cmd/stress
-```
+In addition, be sure to include the following information:
-Then build your test binary
+- A link to [testgrid](https://testgrid.k8s.io/) history for the flaking test's
+ jobs, filtered to the relevant tests
+- The failed test output &mdash; this is essential because it makes the issue searchable
+- A link to the triage query
+- A link to specific failures
+- Be sure to tag the relevant SIG, if you know what it is.
-```
-$ go test -c -race
-```
+For a good example of a flaking test issue,
+[check here](https://github.com/kubernetes/kubernetes/issues/93358).
-Then run it under stress
+([TODO](https://github.com/kubernetes/kubernetes/issues/95528): Move these instructions to the issue template.)
+## Deflaking unit tests
+
+To get started with deflaking unit tests, you first need to reproduce the
+flaky behavior. Start by simply running the flaky unit test. For example:
+
+```sh
+go test ./pkg/kubelet/config -run TestInvalidPodFiltered
```
-$ stress ./package.test -test.run=FlakyTest
+
+Also make sure that you bypass the `go test` cache by using an uncacheable
+command-line option:
+
+```sh
+go test ./pkg/kubelet/config -count=1 -run TestInvalidPodFiltered
```
-It runs the command and writes output to `/tmp/gostress-*` files when it fails.
-It periodically reports with run counts. Be careful with tests that use the
-`net/http/httptest` package; they could exhaust the available ports on your
-system!
-
-# Hunting flaky unit tests in Kubernetes
-
-Sometimes unit tests are flaky. This means that due to (usually) race
-conditions, they will occasionally fail, even though most of the time they pass.
-
-We have a goal of 99.9% flake free tests. This means that there is only one
-flake in one thousand runs of a test.
-
-Running a test 1000 times on your own machine can be tedious and time consuming.
-Fortunately, there is a better way to achieve this using Kubernetes.
-
-_Note: these instructions are mildly hacky for now, as we get run once semantics
-and logging they will get better_
-
-There is a testing image `brendanburns/flake` up on the docker hub. We will use
-this image to test our fix.
-
-Create a replication controller with the following config:
-
-```yaml
-apiVersion: v1
-kind: ReplicationController
-metadata:
- name: flakecontroller
-spec:
- replicas: 24
- template:
- metadata:
- labels:
- name: flake
- spec:
- containers:
- - name: flake
- image: brendanburns/flake
- env:
- - name: TEST_PACKAGE
- value: pkg/tools
- - name: REPO_SPEC
- value: https://github.com/kubernetes/kubernetes
+If even this does not reveal issues with the flaky test, try running with
+[race detection](https://golang.org/doc/articles/race_detector.html) enabled:
+
+```sh
+go test ./pkg/kubelet/config -race -count=1 -run TestInvalidPodFiltered
```
-Note that we omit the labels and the selector fields of the replication
-controller, because they will be populated from the labels field of the pod
-template by default.
+Finally, you can stress test the unit test using the
+[stress command](https://godoc.org/golang.org/x/tools/cmd/stress). Install it
+with this command:
```sh
-kubectl create -f ./controller.yaml
+go get golang.org/x/tools/cmd/stress
```
-This will spin up 24 instances of the test. They will run to completion, then
-exit, and the kubelet will restart them, accumulating more and more runs of the
-test.
+Then build your test binary:
-You can examine the recent runs of the test by calling `docker ps -a` and
-looking for tasks that exited with non-zero exit codes. Unfortunately, docker
-ps -a only keeps around the exit status of the last 15-20 containers with the
-same image, so you have to check them frequently.
+```sh
+go test ./pkg/kubelet/config -race -c
+```
-You can use this script to automate checking for failures, assuming your cluster
-is running on GCE and has four nodes:
+Then run it under stress:
```sh
-echo "" > output.txt
-for i in {1..4}; do
- echo "Checking kubernetes-node-${i}"
- echo "kubernetes-node-${i}:" >> output.txt
- gcloud compute ssh "kubernetes-node-${i}" --command="sudo docker ps -a" >> output.txt
-done
-grep "Exited ([^0])" output.txt
+stress ./config.test -test.run TestInvalidPodFiltered
```
-Eventually you will have sufficient runs for your purposes. At that point you
-can delete the replication controller by running:
+The stress command runs the test binary repeatedly, reporting when it fails. It
+will periodically report how many times it has run and how many failures have
+occurred.
+
+You should see output like this:
+
+```
+411 runs so far, 0 failures
+/var/folders/7f/9xt_73f12xlby0w362rgk0s400kjgb/T/go-stress-20200825T115041-341977266
+--- FAIL: TestInvalidPodFiltered (0.00s)
+ config_test.go:126: Expected no update in channel, Got types.PodUpdate{Pods:[]*v1.Pod{(*v1.Pod)(0xc00059e400)}, Op:1, Source:"test"}
+FAIL
+ERROR: exit status 1
+815 runs so far, 1 failures
+```
+
+Be careful with tests that use the `net/http/httptest` package; they could
+exhaust the available ports on your system!
+
+## Deflaking integration tests
+
+Integration tests run similarly to unit tests, but they almost always expect a
+running `etcd` instance. You should already have `etcd` installed if you have
+followed the instructions in the [Development Guide](../development.md). Run
+`etcd` in another shell window or tab.
+
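+For example, you might start a throwaway local instance like this (a sketch;
+any writable data directory will do):
+
+```sh
+etcd --data-dir "$(mktemp -d)"
+```
+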
+Compile your integration test using a command like this:
```sh
-kubectl delete replicationcontroller flakecontroller
+go test -c -race ./test/integration/endpointslice
```
-If you do a final check for flakes with `docker ps -a`, ignore tasks that
-exited -1, since that's what happens when you stop the replication controller.
+And then stress test the flaky test using the `stress` command:
-Happy flake hunting!
+```sh
+stress ./endpointslice.test -test.run TestEndpointSliceMirroring
+```
+For an example of a failing or flaky integration test,
+[read this issue](https://github.com/kubernetes/kubernetes/issues/93496#issuecomment-678375312).
+
+Sometimes, though not often, a test will fail due to a timeout caused by a
+deadlock. You may first notice this when stress testing an entire package;
+tracking it down requires stress testing the individual tests in that package.
+This process can take extra effort. Try following these steps:
+
+1. Run each test in the package individually to figure out the average runtime.
+2. Stress each test individually, bounding the timeout to 100 times the average run time.
+3. Isolate the particular test that is deadlocking.
+4. Add debug output to figure out what is causing the deadlock.
+
+Hopefully this can help narrow down exactly where the deadlock is occurring,
+revealing a simple fix!
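+
+As a rough sketch of steps 1 and 2 (the package, test name, and the 2-minute
+bound are placeholders; the `-timeout` flag of `stress` kills any run that
+exceeds it):
+
+```sh
+# 1. Time a single, cache-bypassing run to estimate the average runtime.
+go test ./test/integration/endpointslice -count=1 -run TestEndpointSliceMirroring
+
+# 2. Stress the test, killing any run that takes far longer than average.
+go test -c -race ./test/integration/endpointslice
+stress -timeout 2m ./endpointslice.test -test.run TestEndpointSliceMirroring
+```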
+
+## Deflaking e2e tests
+
+A flaky [end-to-end (e2e) test](e2e-tests.md) offers its own set of
+challenges. In particular, these tests are difficult because they test the
+entire Kubernetes system. This can be both good and bad. It can be good because
+we want the entire system to work when testing, but an e2e test can also fail
+because of something completely unrelated, such as failing infrastructure or
+misconfigured volumes. Be aware that you can't simply look at the title of an
+e2e test to understand exactly what is being tested. If possible, look for unit
+and integration tests related to the problem you are trying to solve.
+
+### Gathering information
+
+The first step in deflaking an e2e test is to gather information. We capture a
+lot of information from e2e test runs, and you can use these artifacts to gather
+information as to why a test is failing.
+
+Use the [Prow Status](https://prow.k8s.io/) tool to collect information on
+specific test jobs. Drill down into a job and use the **Artifacts** tab to
+collect information. For example, with
+[this particular test job](https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1296558932902285312),
+we can collect the following:
+
+* `build-log.txt`
+* In the control plane directory: `artifacts/e2e-171671cb3f-674b9-master/`
+ * `kube-apiserver-audit.log` (and rotated files)
+ * `kube-apiserver.log`
+ * `kube-controller-manager.log`
+ * `kube-scheduler.log`
+ * And more!
+
+The `artifacts/` directory will contain much more information. From inside the
+directories for each node:
+* `e2e-171671cb3f-674b9-minion-group-drkr`
+* `e2e-171671cb3f-674b9-minion-group-lr2z`
+* `e2e-171671cb3f-674b9-minion-group-qkkz`
+
+Look for these files:
+* `kubelet.log`
+* `docker.log`
+* `kube-proxy.log`
+* And so forth.
+
+### Filtering and correlating information
+
+Once you have gathered your information, the next step is to filter and
+correlate it. This can require some familiarity with the issue you are
+tracking down, but look first at the relevant components: the test log and the
+logs for the API server, controller manager, and `kubelet`.
+
+Filter the logs to find events that happened around the time of the failure and
+events that occurred in related namespaces and objects.
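+
+As a rough sketch, assuming klog-style lines (e.g. `I0825 11:50:41.079085 ...`)
+and a placeholder namespace `e2e-1234`:
+
+```sh
+# Pull entries mentioning the namespace from several components, then sort
+# from the date onward (skipping the leading severity letter) so the
+# combined stream is roughly ordered by time.
+grep -h "e2e-1234" kube-apiserver.log kube-controller-manager.log kubelet.log \
+  | sort -k1.2 > combined.log
+```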
+
+The goal is to collate log entries from all of these different files so you can
+get a picture of what was happening in the distributed system. This will help
+you figure out exactly where the e2e test is failing. One tool that may help you
+with this is [k8s-e2e-log-combiner](https://github.com/brianpursley/k8s-e2e-log-combiner).
+
+Kubernetes has a lot of nested systems, so sometimes log entries can refer to
+events happening three levels deep. This means that line numbers in logs might
+not refer to where problems and messages originate. Do not make any assumptions
+about where messages are initiated!
+
+If you have trouble finding relevant logging information or events, don't be
+afraid to add debugging output to the test. For an example of this approach,
+[see this issue](https://github.com/kubernetes/kubernetes/pull/88297#issuecomment-588607417).
+
+### What to look for
+
+One of the first things to look for is whether the test assumes that something
+runs synchronously when it actually runs asynchronously. For example, if the
+test is kicking off a goroutine, you might need to add delays to simulate slow
+operations and reproduce issues.
+
+Examples of the types of changes you could make to try to force a failure:
+ - `time.Sleep(time.Second)` at the top of a goroutine
+ - `time.Sleep(time.Second)` at the beginning of a watch event handler
+ - `time.Sleep(time.Second)` at the end of a watch event handler
+ - `time.Sleep(time.Second)` at the beginning of a sync loop worker
+ - `time.Sleep(time.Second)` at the end of a sync loop worker
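+
+For instance, a sketch of the first idea, where the worker and queue are
+hypothetical stand-ins for the code under test:
+
+```go
+package example
+
+import "time"
+
+// startWorker launches a worker goroutine. The injected sleep simulates a
+// slow goroutine start-up; a test that assumes the worker has already run
+// will now fail reliably instead of flaking rarely.
+func startWorker(queue <-chan string, process func(string)) {
+	go func() {
+		time.Sleep(time.Second) // injected delay; remove after debugging
+		for item := range queue {
+			process(item)
+		}
+	}()
+}
+```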
+
+Sometimes,
+[such as in this example](https://github.com/kubernetes/kubernetes/issues/93496#issuecomment-675631856),
+a test might be causing a race condition with the system it is trying to
+test. Investigate if the test is conflicting with an asynchronous background
+process. To verify the issue, simulate the test losing the race by putting a
+`time.Sleep(time.Second)` between test steps.
+
+If a test is assuming that an operation will happen quickly, it might not be
+taking into account the configuration of a CI environment. A CI environment will
+generally be more resource-constrained and will run multiple tests in
+parallel. An operation that takes less than a second locally could take a few
+seconds in a CI environment.
+
+Unless your test is specifically testing performance/timing, don't set tight
+timing tolerances. Use `wait.ForeverTestTimeout`, which is a reasonable stand-in
+for operations that should not take very long. This is a better approach than
+polling for 1 to 10 seconds.
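+
+A minimal sketch of that pattern (the `done` channel stands in for whatever
+signals completion in your test):
+
+```go
+package example
+
+import (
+	"testing"
+	"time"
+
+	"k8s.io/apimachinery/pkg/util/wait"
+)
+
+// waitForDone blocks until the operation signals completion, failing the
+// test after wait.ForeverTestTimeout rather than a hand-tuned budget.
+func waitForDone(t *testing.T, done <-chan struct{}) {
+	t.Helper()
+	select {
+	case <-done:
+	case <-time.After(wait.ForeverTestTimeout):
+		t.Fatal("timed out waiting for operation to complete")
+	}
+}
+```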
+
+Is the test incorrectly assuming deterministic output? Remember that map
+iteration in Go is non-deterministic. If a list is compiled or a set of steps
+is performed by iterating over a map, they will not complete in a predictable
+order. Make sure the test tolerates any iteration order.
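+
+One way to tolerate any iteration order is to canonicalize before comparing,
+as in this sketch:
+
+```go
+package example
+
+import (
+	"reflect"
+	"sort"
+	"testing"
+)
+
+// assertSameNames checks that the map's keys match the expected names, no
+// matter which order Go happens to iterate the map in.
+func assertSameNames(t *testing.T, byName map[string]int, want []string) {
+	t.Helper()
+	got := make([]string, 0, len(byName))
+	for name := range byName {
+		got = append(got, name)
+	}
+	sort.Strings(got)
+	want = append([]string(nil), want...) // sort a copy, not the caller's slice
+	sort.Strings(want)
+	if !reflect.DeepEqual(got, want) {
+		t.Fatalf("expected names %v, got %v", want, got)
+	}
+}
+```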
+
+Be aware that if a test mixes random allocation with static allocation, there
+will be intermittent conflicts.
+
+Finally, if you are using a fake client with a watcher, it can relist/rewatch
+at any point. It is better to look for specific actions in the fake client than
+to assert the exact contents of the full action set.
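+
+For example, with a client-go fake clientset, you might scan the recorded
+actions for the one you care about (a sketch; `actions` would come from the
+fake's `Actions()` method):
+
+```go
+package example
+
+import (
+	"testing"
+
+	clienttesting "k8s.io/client-go/testing"
+)
+
+// sawPodCreate passes if any recorded action is a create of pods. It does
+// not assert the full action sequence, which can change if the fake client
+// relists or rewatches.
+func sawPodCreate(t *testing.T, actions []clienttesting.Action) {
+	t.Helper()
+	for _, a := range actions {
+		if a.Matches("create", "pods") {
+			return
+		}
+	}
+	t.Fatal("expected a create action for pods")
+}
+```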