| field | value | date |
|---|---|---|
| author | Erik L. Arneson <erik@lionswaycontent.com> | 2020-10-08 15:23:56 -0700 |
| committer | Erik L. Arneson <erik@lionswaycontent.com> | 2020-10-21 09:03:13 -0700 |
| commit | 15de14640bc773bca6e793a85ff2f110cfba3e67 | |
| tree | f370f0203a5e552e7c1aaef3a368b3e0b6995060 | |
| parent | c2fa487203f41b59fcd57293cb743411e936bf6c | |
Add Deflaking tests information to Developer Guide
This updates flaky-tests.md with all of the information on finding and deflaking tests from the presentation to SIG Testing found here: https://www.youtube.com/watch?v=Ewp8LNY_qTg
Also, this drops the outdated "Hunting flaky unit tests" section from flaky-tests.md.
Co-authored-by: Aaron Crickenberger <spiffxp@google.com>
Diffstat (limited to 'contributors')
| -rw-r--r-- | contributors/devel/sig-testing/flaky-tests.md | 320 |
1 file changed, 232 insertions, 88 deletions
diff --git a/contributors/devel/sig-testing/flaky-tests.md b/contributors/devel/sig-testing/flaky-tests.md
index 1745e23f..e90a6785 100644
--- a/contributors/devel/sig-testing/flaky-tests.md
+++ b/contributors/devel/sig-testing/flaky-tests.md
@@ -1,4 +1,4 @@
-# Flaky tests
+# Flaky Tests

Any test that fails occasionally is "flaky". Since our merges only proceed when
all tests are green, and we have a number of different CI systems running the
@@ -10,7 +10,27 @@ writing our tests defensively. When flakes are identified, we should prioritize
addressing them, either by fixing them or quarantining them off the critical
path.

-# Avoiding Flakes
+For more information about deflaking Kubernetes tests, watch @liggitt's
+[presentation from Kubernetes SIG Testing - 2020-08-25](https://www.youtube.com/watch?v=Ewp8LNY_qTg).
+
+**Table of Contents**
+
+- [Flaky Tests](#flaky-tests)
+  - [Avoiding Flakes](#avoiding-flakes)
+  - [Quarantining Flakes](#quarantining-flakes)
+  - [Hunting Flakes](#hunting-flakes)
+  - [GitHub Issues for Known Flakes](#github-issues-for-known-flakes)
+    - [Expectations when a flaky test is assigned to you](#expectations-when-a-flaky-test-is-assigned-to-you)
+    - [Writing a good flake report](#writing-a-good-flake-report)
+  - [Deflaking unit tests](#deflaking-unit-tests)
+  - [Deflaking integration tests](#deflaking-integration-tests)
+  - [Deflaking e2e tests](#deflaking-e2e-tests)
+    - [Gathering information](#gathering-information)
+    - [Filtering and correlating information](#filtering-and-correlating-information)
+    - [What to look for](#what-to-look-for)
+  - [Hunting flaky unit tests in Kubernetes](#hunting-flaky-unit-tests-in-kubernetes)
+
+## Avoiding Flakes

Write tests defensively. Remember that "almost never" happens all the time when
tests are run thousands of times in a CI environment. Tests need to be tolerant
@@ -45,7 +65,7 @@ Don't assume things will succeed after a fixed delay, but don't wait forever.

- "expected 3 widgets, found 2, will retry"
- "expected pod to be in state foo, currently in state bar, will retry"

-# Quarantining Flakes
+## Quarantining Flakes

- When quarantining a presubmit test, ensure an issue exists in the current
  release milestone assigned to the owning SIG. The issue should be labeled
@@ -63,7 +83,7 @@ Don't assume things will succeed after a fixed delay, but don't wait forever.
  feature. The majority of release-blocking and merge-blocking suites avoid
  these jobs unless they're proven to be non-flaky.

-# Hunting Flakes
+## Hunting Flakes

We offer the following tools to aid in finding or troubleshooting flakes

@@ -82,7 +102,7 @@ We offer the following tools to aid in finding or troubleshooting flakes
[go.k8s.io/triage]: https://go.k8s.io/triage
[testgrid.k8s.io]: https://testgrid.k8s.io

-# GitHub Issues for Known Flakes
+## GitHub Issues for Known Flakes

Because flakes may be rare, it's very important that all relevant logs be
discoverable from the issue.
@@ -105,7 +125,7 @@ flakes is a quick way to gain expertise and community goodwill.

[flake]: https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+is%3Aissue+label%3Akind%2Fflake

-## Expectations when a flaky test is assigned to you
+### Expectations when a flaky test is assigned to you

Note that we won't randomly assign these issues to you unless you've opted in
or you're part of a group that has opted in. We are more than happy to accept help
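The "Avoiding Flakes" guidance above boils down to: poll for the condition you care about, bound the wait, and log each retry so a failure is debuggable. A minimal sketch of that pattern, assuming the `k8s.io/apimachinery/pkg/util/wait` helpers commonly used in Kubernetes tests; the widget-listing condition is hypothetical and only for illustration:

```go
// Sketch only: wait for an expected state by polling with a bounded timeout,
// logging each retry instead of sleeping for a fixed delay.
package example

import (
	"testing"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func waitForWidgets(t *testing.T, list func() ([]string, error), want int) {
	t.Helper()
	err := wait.PollImmediate(250*time.Millisecond, wait.ForeverTestTimeout, func() (bool, error) {
		widgets, err := list()
		if err != nil {
			return false, err // hard error: stop waiting immediately
		}
		if len(widgets) != want {
			t.Logf("expected %d widgets, found %d, will retry", want, len(widgets))
			return false, nil // not ready yet: poll again
		}
		return true, nil
	})
	if err != nil {
		t.Fatalf("expected %d widgets: %v", want, err)
	}
}
```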
@@ -140,113 +160,237 @@ release-blocking flakes.
Therefore we have the following guidelines:
name, which will result in the test being quarantined to only those jobs that
explicitly run flakes (eg: https://testgrid.k8s.io/google-gce#gci-gce-flaky)

-# Reproducing unit test flakes
+### Writing a good flake report

-Try the [stress command](https://godoc.org/golang.org/x/tools/cmd/stress).
+If you are reporting a flake, it is important to include enough information for
+others to reproduce the issue. When filing the issue, use the
+[flaking test template](https://github.com/kubernetes/kubernetes/issues/new?labels=kind%2Fflake&template=flaking-test.md). In
+your issue, answer the following questions:

-Just
+- Is this flaking in multiple jobs? You can search for the flaking test or error
+  messages using the
+  [Kubernetes Aggregated Test Results](http://go.k8s.io/triage) tool.
+- Are there multiple tests in the same package or suite failing with the same apparent error?

-```
-$ go install golang.org/x/tools/cmd/stress
-```
+In addition, be sure to include the following information:

-Then build your test binary
+- A link to [testgrid](https://testgrid.k8s.io/) history for the flaking test's
+  jobs, filtered to the relevant tests
+- The failed test output; this is essential because it makes the issue searchable
+- A link to the triage query
+- A link to specific failures
+- Be sure to tag the relevant SIG, if you know what it is.

-```
-$ go test -c -race
-```
+For a good example of a flaking test issue,
+[check here](https://github.com/kubernetes/kubernetes/issues/93358).

-Then run it under stress
+([TODO](https://github.com/kubernetes/kubernetes/issues/95528): Move these instructions to the issue template.)
+
+## Deflaking unit tests
+
+To get started with deflaking unit tests, you will need to first
+reproduce the flaky behavior. Start with a simple attempt to just run
+the flaky unit test. For example:
+
+```sh
+go test ./pkg/kubelet/config -run TestInvalidPodFiltered
 ```
-$ stress ./package.test -test.run=FlakyTest
+
+Also make sure that you bypass the `go test` cache by using an uncachable
+command line option:
+
+```sh
+go test ./pkg/kubelet/config -count=1 -run TestInvalidPodFiltered
 ```
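As a hedged illustration of what the steps above, together with the race-detection and `stress` steps that follow, are meant to expose, here is a hypothetical flaky unit test that is not taken from the Kubernetes tree. It usually passes locally because the goroutine finishes within the sleep, but not on a loaded CI machine, and the unsynchronized access is also a data race that `go test -race` will flag:

```go
// Hypothetical example: usually passes locally, flaky under load, and the
// unsynchronized counter access is a data race.
package example

import (
	"sync"
	"testing"
	"time"
)

func TestCounterFlaky(t *testing.T) {
	var count int
	go func() { count++ }()
	time.Sleep(10 * time.Millisecond) // usually long enough, but not guaranteed
	if count != 1 {
		t.Fatalf("expected count to be 1, got %d", count)
	}
}

func TestCounterFixed(t *testing.T) {
	var count int
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		count++
	}()
	wg.Wait() // wait for the work to finish before asserting
	if count != 1 {
		t.Fatalf("expected count to be 1, got %d", count)
	}
}
```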
-It runs the command and writes output to `/tmp/gostress-*` files when it fails.
-It periodically reports with run counts. Be careful with tests that use the
-`net/http/httptest` package; they could exhaust the available ports on your
-system!
-
-# Hunting flaky unit tests in Kubernetes
-
-Sometimes unit tests are flaky. This means that due to (usually) race
-conditions, they will occasionally fail, even though most of the time they pass.
-
-We have a goal of 99.9% flake free tests. This means that there is only one
-flake in one thousand runs of a test.
-
-Running a test 1000 times on your own machine can be tedious and time consuming.
-Fortunately, there is a better way to achieve this using Kubernetes.
-
-_Note: these instructions are mildly hacky for now, as we get run once semantics
-and logging they will get better_
-
-There is a testing image `brendanburns/flake` up on the docker hub. We will use
-this image to test our fix.
-
-Create a replication controller with the following config:
-
-```yaml
-apiVersion: v1
-kind: ReplicationController
-metadata:
-  name: flakecontroller
-spec:
-  replicas: 24
-  template:
-    metadata:
-      labels:
-        name: flake
-    spec:
-      containers:
-      - name: flake
-        image: brendanburns/flake
-        env:
-        - name: TEST_PACKAGE
-          value: pkg/tools
-        - name: REPO_SPEC
-          value: https://github.com/kubernetes/kubernetes
+If even this is not revealing issues with the flaky test, try running with
+[race detection](https://golang.org/doc/articles/race_detector.html) enabled:
+
+```sh
+go test ./pkg/kubelet/config -race -count=1 -run TestInvalidPodFiltered
 ```
-Note that we omit the labels and the selector fields of the replication
-controller, because they will be populated from the labels field of the pod
-template by default.
+Finally, you can stress test the unit test using the
+[stress command](https://godoc.org/golang.org/x/tools/cmd/stress). Install it
+with this command:

 ```sh
-kubectl create -f ./controller.yaml
+go get golang.org/x/tools/cmd/stress
 ```

-This will spin up 24 instances of the test. They will run to completion, then
-exit, and the kubelet will restart them, accumulating more and more runs of the
-test.
+Then build your test binary:

-You can examine the recent runs of the test by calling `docker ps -a` and
-looking for tasks that exited with non-zero exit codes. Unfortunately, docker
-ps -a only keeps around the exit status of the last 15-20 containers with the
-same image, so you have to check them frequently.
+```sh
+go test ./pkg/kubelet/config -race -c
+```

-You can use this script to automate checking for failures, assuming your cluster
-is running on GCE and has four nodes:
+Then run it under stress:

 ```sh
-echo "" > output.txt
-for i in {1..4}; do
-  echo "Checking kubernetes-node-${i}"
-  echo "kubernetes-node-${i}:" >> output.txt
-  gcloud compute ssh "kubernetes-node-${i}" --command="sudo docker ps -a" >> output.txt
-done
-grep "Exited ([^0])" output.txt
+stress ./config.test -test.run TestInvalidPodFiltered
 ```

-Eventually you will have sufficient runs for your purposes. At that point you
-can delete the replication controller by running:
+The stress command runs the test binary repeatedly, reporting when it fails. It
+will periodically report how many times it has run and how many failures have
+occurred.
+
+You should see output like this:
+
+```
+411 runs so far, 0 failures
+/var/folders/7f/9xt_73f12xlby0w362rgk0s400kjgb/T/go-stress-20200825T115041-341977266
+--- FAIL: TestInvalidPodFiltered (0.00s)
+    config_test.go:126: Expected no update in channel, Got types.PodUpdate{Pods:[]*v1.Pod{(*v1.Pod)(0xc00059e400)}, Op:1, Source:"test"}
+FAIL
+ERROR: exit status 1
+815 runs so far, 1 failures
+```
+
+Be careful with tests that use the `net/http/httptest` package; they could
+exhaust the available ports on your system!
+
+## Deflaking integration tests
+
+Integration tests run similarly to unit tests, but they almost always expect a
+running `etcd` instance. You should already have `etcd` installed if you have
+followed the instructions in the [Development Guide](../development.md). Run
+`etcd` in another shell window or tab.
+
+Compile your integration test using a command like this:

 ```sh
-kubectl delete replicationcontroller flakecontroller
+go test -c -race ./test/integration/endpointslice
 ```

-If you do a final check for flakes with `docker ps -a`, ignore tasks that
-exited -1, since that's what happens when you stop the replication controller.
+And then stress test the flaky test using the `stress` command: -Happy flake hunting! +```sh +stress ./endpointslice.test -test.run TestEndpointSliceMirroring +``` +For an example of a failing or flaky integration test, +[read this issue](https://github.com/kubernetes/kubernetes/issues/93496#issuecomment-678375312). + +Sometimes, but not often, a test will fail due to timeouts caused by +deadlocks. This can be tracked down by stress testing an entire package. The way +to track this down is to stress test individual tests in a package. This process +can take extra effort. Try following these steps: + +1. Run each test in the package individually to figure out the average runtime. +2. Stress each test individually, bounding the timeout to 100 times the average run time. +3. Isolate the particular test that is deadlocking. +4. Add debug output to figure out what is causing the deadlock. + +Hopefully this can help narrow down exactly where the deadlock is occurring, +revealing a simple fix! + +## Deflaking e2e tests + +A flaky [end-to-end (e2e) test](e2e-tests.md) offers its own set of +challenges. In particular, these tests are difficult because they test the +entire Kubernetes system. This can be both good and bad. It can be good because +we want the entire system to work when testing, but an e2e test can also fail +because of something completely unrelated, such as failing infrastructure or +misconfigured volumes. Be aware that you can't simply look at the title of an +e2e test to understand exactly what is being tested. If possible, look for unit +and integration tests related to the problem you are trying to solve. + +### Gathering information + +The first step in deflaking an e2e test is to gather information. We capture a +lot of information from e2e test runs, and you can use these artifacts to gather +information as to why a test is failing. + +Use the [Prow Status](https://prow.k8s.io/) tool to collect information on +specific test jobs. Drill down into a job and use the **Artifacts** tab to +collect information. For example, with +[this particular test job](https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1296558932902285312), +we can collect the following: + +* `build-log.txt` +* In the control plane directory: `artifacts/e2e-171671cb3f-674b9-master/` + * `kube-apiserver-audit.log` (and rotated files) + * `kube-apiserver.log` + * `kube-controller-manager.log` + * `kube-scheduler.log` + * And more! + +The `artifacts/` directory will contain much more information. From inside the +directories for each node: +- `e2e-171671cb3f-674b9-minion-group-drkr` +- `e2e-171671cb3f-674b9-minion-group-lr2z` +- `e2e-171671cb3f-674b9-minion-group-qkkz` + +Look for these files: +* `kubelet.log` +* `docker.log` +* `kube-proxy.log` +* And so forth. + +### Filtering and correlating information + +Once you have gathered your information, the next step is to filter and +correlate the information. This can require some familiarity with the issue you are tracking +down, but look first at the relevant components, such as the test log, logs for the API +server, controller manager, and `kubelet`. + +Filter the logs to find events that happened around the time of the failure and +events that occurred in related namespaces and objects. + +The goal is to collate log entries from all of these different files so you can +get a picture of what was happening in the distributed system. This will help +you figure out exactly where the e2e test is failing. 
+One tool that may help you with this is
+[k8s-e2e-log-combiner](https://github.com/brianpursley/k8s-e2e-log-combiner).
+
+Kubernetes has a lot of nested systems, so sometimes log entries can refer to
+events happening three levels deep. This means that line numbers in logs might
+not refer to where problems and messages originate. Do not make any assumptions
+about where messages are initiated!
+
+If you have trouble finding relevant logging information or events, don't be
+afraid to add debugging output to the test. For an example of this approach,
+[see this issue](https://github.com/kubernetes/kubernetes/pull/88297#issuecomment-588607417).
+
+### What to look for
+
+One of the first things to look for is if the test is assuming that something is
+running synchronously when it actually runs asynchronously. For example, if the
+test is kicking off a goroutine, you might need to add delays to simulate slow
+operations and reproduce issues.
+
+Examples of the types of changes you could make to try to force a failure:
+ - `time.Sleep(time.Second)` at the top of a goroutine
+ - `time.Sleep(time.Second)` at the beginning of a watch event handler
+ - `time.Sleep(time.Second)` at the end of a watch event handler
+ - `time.Sleep(time.Second)` at the beginning of a sync loop worker
+ - `time.Sleep(time.Second)` at the end of a sync loop worker
+
+Sometimes,
+[such as in this example](https://github.com/kubernetes/kubernetes/issues/93496#issuecomment-675631856),
+a test might be causing a race condition with the system it is trying to
+test. Investigate if the test is conflicting with an asynchronous background
+process. To verify the issue, simulate the test losing the race by putting a
+`time.Sleep(time.Second)` between test steps.
+
+If a test is assuming that an operation will happen quickly, it might not be
+taking into account the configuration of a CI environment. A CI environment will
+generally be more resource-constrained and will run multiple tests in
+parallel. If it runs in less than a second locally, it could take a few seconds
+in a CI environment.
+
+Unless your test is specifically testing performance/timing, don't set tight
+timing tolerances. Use `wait.ForeverTestTimeout`, which is a reasonable stand-in
+for operations that should not take very long. This is a better approach than
+polling for 1 to 10 seconds.
+
+Is the test incorrectly assuming deterministic output? Remember that map
+iteration in Go is non-deterministic. If there is a list being compiled, or a
+set of steps is being performed by iterating over a map, they will not be
+completed in a predictable order. Make sure the test is able to tolerate any
+order in a map.
+
+Be aware that if a test mixes random allocation with static allocation, there
+will be intermittent conflicts.
+
+Finally, if you are using a fake client with a watcher, it can relist/rewatch at any point.
+It is better to look for specific actions in the fake client rather than
+asserting the exact content of the full set.
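A hedged sketch of the last few points, using hypothetical helper names rather than anything from the Kubernetes tree: wait on asynchronous results with `wait.ForeverTestTimeout` instead of a tight hand-picked deadline, and sort map keys before comparing so the assertion tolerates Go's randomized map iteration order.

```go
// Sketch only: generous timeouts for async assertions, and order-independent
// comparison of map-derived results.
package example

import (
	"sort"
	"testing"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForEvent blocks until an event arrives or a generous test timeout expires.
func waitForEvent(t *testing.T, events <-chan string) string {
	t.Helper()
	select {
	case e := <-events:
		return e
	case <-time.After(wait.ForeverTestTimeout): // 30s, comfortable even on a slow CI node
		t.Fatal("timed out waiting for event")
		return ""
	}
}

// sortedKeys returns map keys in a deterministic order for comparison.
func sortedKeys(m map[string]string) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map iteration order is randomized; never rely on it
	return keys
}
```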
