| field | value | date |
|---|---|---|
| author | eduartua <eduartua@gmail.com> | 2019-01-30 13:05:42 -0600 |
| committer | eduartua <eduartua@gmail.com> | 2019-01-30 13:05:42 -0600 |
| commit | bbc4d0b877a7749a9344e4b9ab44424228d7631f | |
| tree | 3f37594505519f6de10fdd6e265bf58c2fac8134 | |
| parent | af988769d2eeb2dedb3373670aa3a9643c611064 | |
flaky-tests.md moves to the new folder /devel/sig-testing; URLs updated; tombstone file created
| mode | file | lines changed |
|---|---|---|
| -rw-r--r-- | contributors/devel/README.md | 2 |
| -rw-r--r-- | contributors/devel/flaky-tests.md | 202 |
| -rw-r--r-- | contributors/devel/sig-testing/flaky-tests.md | 201 |

3 files changed, 204 insertions, 201 deletions
diff --git a/contributors/devel/README.md b/contributors/devel/README.md
index a0685b5e..b083f03d 100644
--- a/contributors/devel/README.md
+++ b/contributors/devel/README.md
@@ -29,7 +29,7 @@ Guide](http://kubernetes.io/docs/admin/).
 * **Conformance Testing** ([conformance-tests.md](conformance-tests.md))
   What is conformance testing and how to create/manage them.
 
-* **Hunting flaky tests** ([flaky-tests.md](flaky-tests.md)): We have a goal of 99.9% flake free tests.
+* **Hunting flaky tests** ([flaky-tests.md](sig-testing/flaky-tests.md)): We have a goal of 99.9% flake free tests.
   Here's how to run your tests many times.
 
 * **Logging Conventions** ([logging.md](sig-instrumentation/logging.md)): Glog levels.
diff --git a/contributors/devel/flaky-tests.md b/contributors/devel/flaky-tests.md
index 14302592..7f238095 100644
--- a/contributors/devel/flaky-tests.md
+++ b/contributors/devel/flaky-tests.md
@@ -1,201 +1,3 @@
[201 lines removed; identical to the file content added below under contributors/devel/sig-testing/flaky-tests.md]
+This file has moved to https://git.k8s.io/community/contributors/devel/sig-testing/flaky-tests.md.
+This file is a placeholder to preserve links. Please remove by April 30, 2019 or the release of kubernetes 1.13, whichever comes first.
\ No newline at end of file
diff --git a/contributors/devel/sig-testing/flaky-tests.md b/contributors/devel/sig-testing/flaky-tests.md
new file mode 100644
index 00000000..14302592
--- /dev/null
+++ b/contributors/devel/sig-testing/flaky-tests.md
@@ -0,0 +1,201 @@
+# Flaky tests
+
+Any test that fails occasionally is "flaky". Since our merges only proceed when
+all tests are green, and we have a number of different CI systems running the
+tests in various combinations, even a small percentage of flakes results in a
+lot of pain for people waiting for their PRs to merge.
+
+Therefore, it's very important that we write tests defensively. Situations that
+"almost never happen" happen with some regularity when run thousands of times in
+resource-constrained environments. Since flakes can often be quite hard to
+reproduce while still being common enough to block merges occasionally, it's
+additionally important that the test logs be useful for narrowing down exactly
+what caused the failure.
+
+Note that flakes can occur in unit tests, integration tests, or end-to-end
+tests, but probably occur most commonly in end-to-end tests.
+
+## Hunting Flakes
+
+You may notice that many of your PRs, or ones you watch, share a common
+presubmit failure, but less frequent issues that are still of concern take
+more analysis over time. There are metrics recorded and viewable in:
+- [TestGrid](https://k8s-testgrid.appspot.com/presubmits-kubernetes-blocking#Summary)
+- [Velodrome](http://velodrome.k8s.io/dashboard/db/bigquery-metrics?orgId=1)
+
+It is worth noting that tests are going to fail in presubmit a lot due to
+unbuildable code, but that won't happen as much on the same commit unless
+there's a true issue in the code or a broader problem, such as a dependency
+failing to pull in.
+
+## Filing issues for flaky tests
+
+Because flakes may be rare, it's very important that all relevant logs be
+discoverable from the issue.
+
+1. Search for the test name. If you find an open issue and you're 90% sure the
+   flake is exactly the same, add a comment instead of making a new issue.
+2. If you make a new issue, you should title it with the test name, prefixed by
+   "e2e/unit/integration flake:" (whichever is appropriate).
+3. Reference any old issues you found in step one. Also, make a comment in the
+   old issue referencing your new issue, because people monitoring only their
+   email do not see the backlinks GitHub adds. Alternatively, tag the person or
+   people who most recently worked on it.
+4. Paste, in block quotes, the entire log of the individual failing test, not
+   just the failure line.
+5. Link to durable storage with the rest of the logs. This means (for all the
+   tests that Google runs) the GCS link is mandatory! The Jenkins test result
+   link is nice but strictly optional: not only does it expire more quickly,
+   it's not accessible to non-Googlers.
+
+## Finding failed flaky test cases
+
+Find flaky test issues on GitHub under the [kind/flake issue label][flake].
+There are significant numbers of flaky tests reported on a regular basis, and P2
+flakes are under-investigated. Fixing flakes is a quick way to gain expertise
+and community goodwill.
+
+[flake]: https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+is%3Aissue+label%3Akind%2Fflake
+
+## Expectations when a flaky test is assigned to you
+
+Note that we won't randomly assign these issues to you unless you've opted in or
+you're part of a group that has opted in. We are more than happy to accept help
+from anyone in fixing these, but due to the severity of the problem when merges
+are blocked, we need reasonably quick turnaround time on test flakes. Therefore
+we have the following guidelines:
+
+1. If a flaky test is assigned to you, it's more important than anything else
+   you're doing unless you can get a special dispensation (in which case it will
+   be reassigned). If you have too many flaky tests assigned to you, or you
+   have such a dispensation, then it's *still* your responsibility to find new
+   owners (this may just mean giving stuff back to the relevant team or SIG lead).
+2. You should make a reasonable effort to reproduce it. Somewhere between an
+   hour and half a day of concentrated effort is "reasonable". It is perfectly
+   reasonable to ask for help!
+3. If you can reproduce it (or it's obvious from the logs what happened), you
+   should then be able to fix it, or in the case where someone is clearly more
+   qualified to fix it, reassign it with very clear instructions.
+4. Once you have made a change that you believe fixes a flake, it is conservative
+   to keep the issue for the flake open and see if it manifests again after the
+   change is merged.
+5. If you can't reproduce a flake: __don't just close it!__ Every time a flake comes
+   back, at least 2 hours of merge time is wasted. So we need to make monotonic
+   progress towards narrowing it down every time a flake occurs. If you can't
+   figure it out from the logs, add log messages that would have helped you figure
+   it out. If you make changes to make a flake more reproducible, please link
+   your pull request to the flake you're working on.
+6. If a flake has been open, could not be reproduced, and has not manifested in
+   3 months, it is reasonable to close the flake issue with a note saying why.
+
+# Reproducing unit test flakes
+
+Try the [stress command](https://godoc.org/golang.org/x/tools/cmd/stress).
+
+Install it:
+
+```
+$ go install golang.org/x/tools/cmd/stress
+```
+
+Then build your test binary:
+
+```
+$ go test -c -race
+```
+
+Then run it under stress:
+
+```
+$ stress ./package.test -test.run=FlakyTest
+```
+
+It runs the command and writes output to `/tmp/gostress-*` files when it fails.
+It periodically reports with run counts. Be careful with tests that use the
+`net/http/httptest` package; they could exhaust the available ports on your
+system!
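For a quick first pass, `go test` alone can also repeat a single test; the sketch below is not part of the moved file, and the package path and `TestFlaky` name are placeholders:

```sh
# Rerun one test 1000 times with the race detector enabled.
# -count repeats the run (and disables test result caching), -run selects the
# test by regex, and -failfast stops after the first failure.
go test ./pkg/tools/ -race -run 'TestFlaky$' -count=1000 -failfast
```

Because everything runs inside one test process, state carries over between iterations; `stress` remains the better tool when a flake needs fresh processes or heavy parallelism to show up.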
+
+# Hunting flaky unit tests in Kubernetes
+
+Sometimes unit tests are flaky. This means that due to (usually) race
+conditions, they will occasionally fail, even though most of the time they pass.
+
+We have a goal of 99.9% flake-free tests. This means that there is only one
+flake in one thousand runs of a test.
+
+Running a test 1000 times on your own machine can be tedious and time-consuming.
+Fortunately, there is a better way to achieve this using Kubernetes.
+
+_Note: these instructions are mildly hacky for now; as we get run-once semantics
+and logging, they will get better._
+
+There is a testing image `brendanburns/flake` on Docker Hub. We will use
+this image to test our fix.
+
+Create a replication controller with the following config:
+
+```yaml
+apiVersion: v1
+kind: ReplicationController
+metadata:
+  name: flakecontroller
+spec:
+  replicas: 24
+  template:
+    metadata:
+      labels:
+        name: flake
+    spec:
+      containers:
+      - name: flake
+        image: brendanburns/flake
+        env:
+        - name: TEST_PACKAGE
+          value: pkg/tools
+        - name: REPO_SPEC
+          value: https://github.com/kubernetes/kubernetes
+```
+
+Note that we omit the labels and the selector fields of the replication
+controller, because they will be populated from the labels field of the pod
+template by default.
+
+```sh
+kubectl create -f ./controller.yaml
+```
+
+This will spin up 24 instances of the test. They will run to completion, then
+exit, and the kubelet will restart them, accumulating more and more runs of the
+test.
+
+You can examine the recent runs of the test by calling `docker ps -a` and
+looking for tasks that exited with non-zero exit codes. Unfortunately,
+`docker ps -a` only keeps around the exit status of the last 15-20 containers
+with the same image, so you have to check them frequently.
+
+You can use this script to automate checking for failures, assuming your cluster
+is running on GCE and has four nodes:
+
+```sh
+echo "" > output.txt
+for i in {1..4}; do
+  echo "Checking kubernetes-node-${i}"
+  echo "kubernetes-node-${i}:" >> output.txt
+  gcloud compute ssh "kubernetes-node-${i}" --command="sudo docker ps -a" >> output.txt
+done
+grep "Exited ([^0])" output.txt
+```
+
+Eventually you will have sufficient runs for your purposes. At that point you
+can delete the replication controller by running:
+
+```sh
+kubectl delete replicationcontroller flakecontroller
+```
+
+If you do a final check for flakes with `docker ps -a`, ignore tasks that
+exited -1, since that's what happens when you stop the replication controller.
+
+Happy flake hunting!
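The checks in the moved file go through `docker ps` on each node. As a supplement, the Kubernetes API can report the most recent exit code per pod; this is a sketch that is not part of the commit, and it only sees each container's last recorded termination, so it complements rather than replaces the per-node check:

```sh
# Print each flake pod's name and the exit code of its most recent terminated run.
# A non-zero code points at a flaky run; an empty value usually means the
# container has not finished a run yet.
kubectl get pods -l name=flake \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}{end}'
```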