author    eduartua <eduartua@gmail.com>    2019-01-30 13:05:42 -0600
committer eduartua <eduartua@gmail.com>    2019-01-30 13:05:42 -0600
commit    bbc4d0b877a7749a9344e4b9ab44424228d7631f (patch)
tree      3f37594505519f6de10fdd6e265bf58c2fac8134
parent    af988769d2eeb2dedb3373670aa3a9643c611064 (diff)

file flaky-tests.md moves to the new folder /devel/sig-testing - URLs updated - tombstone file created

-rw-r--r--  contributors/devel/README.md                    2
-rw-r--r--  contributors/devel/flaky-tests.md             202
-rw-r--r--  contributors/devel/sig-testing/flaky-tests.md 201
3 files changed, 204 insertions, 201 deletions
diff --git a/contributors/devel/README.md b/contributors/devel/README.md
index a0685b5e..b083f03d 100644
--- a/contributors/devel/README.md
+++ b/contributors/devel/README.md
@@ -29,7 +29,7 @@ Guide](http://kubernetes.io/docs/admin/).
* **Conformance Testing** ([conformance-tests.md](conformance-tests.md))
What is conformance testing and how to create/manage them.
-* **Hunting flaky tests** ([flaky-tests.md](flaky-tests.md)): We have a goal of 99.9% flake free tests.
+* **Hunting flaky tests** ([flaky-tests.md](sig-testing/flaky-tests.md)): We have a goal of 99.9% flake free tests.
Here's how to run your tests many times.
* **Logging Conventions** ([logging.md](sig-instrumentation/logging.md)): Glog levels.
diff --git a/contributors/devel/flaky-tests.md b/contributors/devel/flaky-tests.md
index 14302592..7f238095 100644
--- a/contributors/devel/flaky-tests.md
+++ b/contributors/devel/flaky-tests.md
@@ -1,201 +1,3 @@
-# Flaky tests
-
-Any test that fails occasionally is "flaky". Since our merges only proceed when
-all tests are green, and we have a number of different CI systems running the
-tests in various combinations, even a small percentage of flakes results in a
-lot of pain for people waiting for their PRs to merge.
-
-Therefore, it's very important that we write tests defensively. Situations that
-"almost never happen" happen with some regularity when run thousands of times in
-resource-constrained environments. Since flakes can often be quite hard to
-reproduce while still being common enough to block merges occasionally, it's
-additionally important that the test logs be useful for narrowing down exactly
-what caused the failure.
-
-Note that flakes can occur in unit tests, integration tests, or end-to-end
-tests, but probably occur most commonly in end-to-end tests.
-
-## Hunting Flakes
-
-You may notice that many of your PRs, or PRs you watch, share a common
-pre-submit failure, while less frequent issues that are still of concern take
-more analysis over time to spot. There are metrics recorded and viewable in:
-- [TestGrid](https://k8s-testgrid.appspot.com/presubmits-kubernetes-blocking#Summary)
-- [Velodrome](http://velodrome.k8s.io/dashboard/db/bigquery-metrics?orgId=1)
-
-It is worth noting that tests fail in presubmit quite often simply because the
-code does not build, but that won't keep happening on the same commit unless
-there's a true issue in the code or a broader problem, such as a dependency
-that failed to pull in.
-
-## Filing issues for flaky tests
-
-Because flakes may be rare, it's very important that all relevant logs be
-discoverable from the issue.
-
-1. Search for the test name. If you find an open issue and you're 90% sure the
- flake is exactly the same, add a comment instead of making a new issue.
-2. If you make a new issue, you should title it with the test name, prefixed by
- "e2e/unit/integration flake:" (whichever is appropriate)
-3. Reference any old issues you found in step one. Also, make a comment in the
- old issue referencing your new issue, because people monitoring only their
-   email do not see the backlinks GitHub adds. Alternatively, tag the person or
- people who most recently worked on it.
-4. Paste, in block quotes, the entire log of the individual failing test, not
- just the failure line.
-5. Link to durable storage with the rest of the logs. This means (for all the
- tests that Google runs) the GCS link is mandatory! The Jenkins test result
- link is nice but strictly optional: not only does it expire more quickly,
- it's not accessible to non-Googlers.
-
-## Finding failed flaky test cases
-
-Find flaky tests issues on GitHub under the [kind/flake issue label][flake].
-There are significant numbers of flaky tests reported on a regular basis and P2
-flakes are under-investigated. Fixing flakes is a quick way to gain expertise
-and community goodwill.
-
-[flake]: https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+is%3Aissue+label%3Akind%2Fflake
-
-## Expectations when a flaky test is assigned to you
-
-Note that we won't randomly assign these issues to you unless you've opted in or
-you're part of a group that has opted in. We are more than happy to accept help
-from anyone in fixing these, but due to the severity of the problem when merges
-are blocked, we need reasonably quick turn-around time on test flakes. Therefore
-we have the following guidelines:
-
-1. If a flaky test is assigned to you, it's more important than anything else
- you're doing unless you can get a special dispensation (in which case it will
- be reassigned). If you have too many flaky tests assigned to you, or you
- have such a dispensation, then it's *still* your responsibility to find new
- owners (this may just mean giving stuff back to the relevant Team or SIG Lead).
-2. You should make a reasonable effort to reproduce it. Somewhere between an
- hour and half a day of concentrated effort is "reasonable". It is perfectly
- reasonable to ask for help!
-3. If you can reproduce it (or it's obvious from the logs what happened), you
- should then be able to fix it, or in the case where someone is clearly more
- qualified to fix it, reassign it with very clear instructions.
-4. Once you have made a change that you believe fixes a flake, it is conservative
- to keep the issue for the flake open and see if it manifests again after the
- change is merged.
-5. If you can't reproduce a flake: __don't just close it!__ Every time a flake comes
- back, at least 2 hours of merge time is wasted. So we need to make monotonic
- progress towards narrowing it down every time a flake occurs. If you can't
-   figure it out from the logs, add log messages that would have helped you figure
- it out. If you make changes to make a flake more reproducible, please link
- your pull request to the flake you're working on.
-6. If a flake has been open, could not be reproduced, and has not manifested in
- 3 months, it is reasonable to close the flake issue with a note saying
- why.
-
-# Reproducing unit test flakes
-
-Try the [stress command](https://godoc.org/golang.org/x/tools/cmd/stress).
-
-First, install it:
-
-```
-$ go install golang.org/x/tools/cmd/stress
-```
-
-Then build your test binary:
-
-```
-$ go test -c -race
-```
-
-Then run it under stress:
-
-```
-$ stress ./package.test -test.run=FlakyTest
-```
-
-It runs the command repeatedly, periodically reporting run counts, and writes
-the output of each failing run to a `/tmp/gostress-*` file. Be careful with
-tests that use the `net/http/httptest` package; they could exhaust the
-available ports on your system!
-
-# Hunting flaky unit tests in Kubernetes
-
-Sometimes unit tests are flaky. This means that due to (usually) race
-conditions, they will occasionally fail, even though most of the time they pass.
-
-We have a goal of 99.9% flake-free tests. This means that a test should flake
-at most once in one thousand runs.
-
-Running a test 1000 times on your own machine can be tedious and time-consuming.
-Fortunately, there is a better way to achieve this using Kubernetes.
-
-_Note: these instructions are mildly hacky for now; as we get run-once
-semantics and better logging, they will improve._
-
-There is a testing image `brendanburns/flake` on Docker Hub. We will use
-this image to test our fix.
-
-Create a replication controller with the following config:
-
-```yaml
-apiVersion: v1
-kind: ReplicationController
-metadata:
- name: flakecontroller
-spec:
- replicas: 24
- template:
- metadata:
- labels:
- name: flake
- spec:
- containers:
- - name: flake
- image: brendanburns/flake
- env:
- - name: TEST_PACKAGE
- value: pkg/tools
- - name: REPO_SPEC
- value: https://github.com/kubernetes/kubernetes
-```
-
-Note that we omit the labels and the selector fields of the replication
-controller, because they will be populated from the labels field of the pod
-template by default.
-
-```sh
-kubectl create -f ./controller.yaml
-```
-
-This will spin up 24 instances of the test. They will run to completion, then
-exit, and the kubelet will restart them, accumulating more and more runs of the
-test.
-
-You can examine the recent runs of the test by calling `docker ps -a` and
-looking for tasks that exited with non-zero exit codes. Unfortunately,
-`docker ps -a` only keeps the exit status of the last 15-20 containers with
-the same image, so you have to check them frequently.
-
-You can use this script to automate checking for failures, assuming your cluster
-is running on GCE and has four nodes:
-
-```sh
-echo "" > output.txt
-for i in {1..4}; do
- echo "Checking kubernetes-node-${i}"
- echo "kubernetes-node-${i}:" >> output.txt
- gcloud compute ssh "kubernetes-node-${i}" --command="sudo docker ps -a" >> output.txt
-done
-grep "Exited ([^0])" output.txt
-```
-
-Eventually you will have sufficient runs for your purposes. At that point you
-can delete the replication controller by running:
-
-```sh
-kubectl delete replicationcontroller flakecontroller
-```
-
-If you do a final check for flakes with `docker ps -a`, ignore tasks that
-exited -1, since that's what happens when you stop the replication controller.
-
-Happy flake hunting!
+This file has moved to https://git.k8s.io/community/contributors/devel/sig-testing/flaky-tests.md.
+This file is a placeholder to preserve links. Please remove by April 30, 2019 or the release of kubernetes 1.13, whichever comes first.
\ No newline at end of file
diff --git a/contributors/devel/sig-testing/flaky-tests.md b/contributors/devel/sig-testing/flaky-tests.md
new file mode 100644
index 00000000..14302592
--- /dev/null
+++ b/contributors/devel/sig-testing/flaky-tests.md
@@ -0,0 +1,201 @@
+# Flaky tests
+
+Any test that fails occasionally is "flaky". Since our merges only proceed when
+all tests are green, and we have a number of different CI systems running the
+tests in various combinations, even a small percentage of flakes results in a
+lot of pain for people waiting for their PRs to merge.
+
+Therefore, it's very important that we write tests defensively. Situations that
+"almost never happen" happen with some regularity when run thousands of times in
+resource-constrained environments. Since flakes can often be quite hard to
+reproduce while still being common enough to block merges occasionally, it's
+additionally important that the test logs be useful for narrowing down exactly
+what caused the failure.
+
+Note that flakes can occur in unit tests, integration tests, or end-to-end
+tests, but probably occur most commonly in end-to-end tests.
+
+## Hunting Flakes
+
+You may notice that many of your PRs, or PRs you watch, share a common
+pre-submit failure, while less frequent issues that are still of concern take
+more analysis over time to spot. There are metrics recorded and viewable in:
+- [TestGrid](https://k8s-testgrid.appspot.com/presubmits-kubernetes-blocking#Summary)
+- [Velodrome](http://velodrome.k8s.io/dashboard/db/bigquery-metrics?orgId=1)
+
+It is worth noting that tests fail in presubmit quite often simply because the
+code does not build, but that won't keep happening on the same commit unless
+there's a true issue in the code or a broader problem, such as a dependency
+that failed to pull in.
+
+## Filing issues for flaky tests
+
+Because flakes may be rare, it's very important that all relevant logs be
+discoverable from the issue.
+
+1. Search for the test name. If you find an open issue and you're 90% sure the
+ flake is exactly the same, add a comment instead of making a new issue.
+2. If you make a new issue, you should title it with the test name, prefixed by
+ "e2e/unit/integration flake:" (whichever is appropriate)
+3. Reference any old issues you found in step one. Also, make a comment in the
+ old issue referencing your new issue, because people monitoring only their
+   email do not see the backlinks GitHub adds. Alternatively, tag the person or
+ people who most recently worked on it.
+4. Paste, in block quotes, the entire log of the individual failing test, not
+ just the failure line.
+5. Link to durable storage with the rest of the logs. This means (for all the
+ tests that Google runs) the GCS link is mandatory! The Jenkins test result
+ link is nice but strictly optional: not only does it expire more quickly,
+ it's not accessible to non-Googlers.
+
+## Finding failed flaky test cases
+
+Find flaky tests issues on GitHub under the [kind/flake issue label][flake].
+There are significant numbers of flaky tests reported on a regular basis and P2
+flakes are under-investigated. Fixing flakes is a quick way to gain expertise
+and community goodwill.
+
+[flake]: https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+is%3Aissue+label%3Akind%2Fflake
+
+## Expectations when a flaky test is assigned to you
+
+Note that we won't randomly assign these issues to you unless you've opted in or
+you're part of a group that has opted in. We are more than happy to accept help
+from anyone in fixing these, but due to the severity of the problem when merges
+are blocked, we need reasonably quick turn-around time on test flakes. Therefore
+we have the following guidelines:
+
+1. If a flaky test is assigned to you, it's more important than anything else
+ you're doing unless you can get a special dispensation (in which case it will
+ be reassigned). If you have too many flaky tests assigned to you, or you
+ have such a dispensation, then it's *still* your responsibility to find new
+ owners (this may just mean giving stuff back to the relevant Team or SIG Lead).
+2. You should make a reasonable effort to reproduce it. Somewhere between an
+ hour and half a day of concentrated effort is "reasonable". It is perfectly
+ reasonable to ask for help!
+3. If you can reproduce it (or it's obvious from the logs what happened), you
+ should then be able to fix it, or in the case where someone is clearly more
+ qualified to fix it, reassign it with very clear instructions.
+4. Once you have made a change that you believe fixes a flake, it is conservative
+ to keep the issue for the flake open and see if it manifests again after the
+ change is merged.
+5. If you can't reproduce a flake: __don't just close it!__ Every time a flake comes
+ back, at least 2 hours of merge time is wasted. So we need to make monotonic
+ progress towards narrowing it down every time a flake occurs. If you can't
+   figure it out from the logs, add log messages that would have helped you figure
+ it out. If you make changes to make a flake more reproducible, please link
+ your pull request to the flake you're working on.
+6. If a flake has been open, could not be reproduced, and has not manifested in
+ 3 months, it is reasonable to close the flake issue with a note saying
+ why.
+
+# Reproducing unit test flakes
+
+Try the [stress command](https://godoc.org/golang.org/x/tools/cmd/stress).
+
+First, install it:
+
+```
+$ go install golang.org/x/tools/cmd/stress
+```
+
+Then build your test binary:
+
+```
+$ go test -c -race
+```
+
+Then run it under stress:
+
+```
+$ stress ./package.test -test.run=FlakyTest
+```
+
+It runs the command repeatedly, periodically reporting run counts, and writes
+the output of each failing run to a `/tmp/gostress-*` file. Be careful with
+tests that use the `net/http/httptest` package; they could exhaust the
+available ports on your system!
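+
+If you would rather not install an extra tool, rerunning the test with
+`go test -count` gets a similar result. This is a minimal sketch, not part of
+the original instructions; the test name `TestFlaky` and the `pkg/tools`
+package path are placeholders:
+
+```sh
+# Rerun the test 1000 times with the race detector enabled; passing -count
+# also bypasses the test result cache. -failfast stops at the first failure
+# so the flaky output is easy to find.
+go test -race -run TestFlaky -count=1000 -failfast ./pkg/tools/...
+```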
+
+# Hunting flaky unit tests in Kubernetes
+
+Sometimes unit tests are flaky. This means that due to (usually) race
+conditions, they will occasionally fail, even though most of the time they pass.
+
+We have a goal of 99.9% flake-free tests. This means that a test should flake
+at most once in one thousand runs.
+
+Running a test 1000 times on your own machine can be tedious and time-consuming.
+Fortunately, there is a better way to achieve this using Kubernetes.
+
+_Note: these instructions are mildly hacky for now; as we get run-once
+semantics and better logging, they will improve._
+
+There is a testing image `brendanburns/flake` on Docker Hub. We will use
+this image to test our fix.
+
+Create a replication controller with the following config:
+
+```yaml
+apiVersion: v1
+kind: ReplicationController
+metadata:
+ name: flakecontroller
+spec:
+ replicas: 24
+ template:
+ metadata:
+ labels:
+ name: flake
+ spec:
+ containers:
+ - name: flake
+ image: brendanburns/flake
+ env:
+ - name: TEST_PACKAGE
+ value: pkg/tools
+ - name: REPO_SPEC
+ value: https://github.com/kubernetes/kubernetes
+```
+
+Note that we omit the labels and the selector fields of the replication
+controller, because they will be populated from the labels field of the pod
+template by default.
+
+```sh
+kubectl create -f ./controller.yaml
+```
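+
+If you want to confirm the defaulted selector and watch the test pods come up,
+commands along these lines should work (a sketch, not part of the original
+instructions):
+
+```sh
+# Show the replication controller, including the selector and labels that
+# were defaulted from the pod template.
+kubectl get rc flakecontroller -o yaml
+
+# List the pods it manages via the name=flake label.
+kubectl get pods -l name=flake
+```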
+
+This will spin up 24 instances of the test. They will run to completion, then
+exit, and the kubelet will restart them, accumulating more and more runs of the
+test.
+
+You can examine the recent runs of the test by calling `docker ps -a` and
+looking for tasks that exited with non-zero exit codes. Unfortunately,
+`docker ps -a` only keeps the exit status of the last 15-20 containers with
+the same image, so you have to check them frequently.
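+
+One way to make the failed runs easier to spot is to filter and format the
+`docker ps` output directly. This is only a suggestion, assuming a Docker
+version that supports `--filter` and `--format`:
+
+```sh
+# List exited containers from the flake image, with their exit status.
+sudo docker ps -a --filter "status=exited" \
+  --format "{{.ID}}\t{{.Image}}\t{{.Status}}" | grep brendanburns/flake
+```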
+
+You can use this script to automate checking for failures, assuming your cluster
+is running on GCE and has four nodes:
+
+```sh
+echo "" > output.txt
+for i in {1..4}; do
+ echo "Checking kubernetes-node-${i}"
+ echo "kubernetes-node-${i}:" >> output.txt
+ gcloud compute ssh "kubernetes-node-${i}" --command="sudo docker ps -a" >> output.txt
+done
+grep "Exited ([^0])" output.txt
+```
+
+Eventually you will have sufficient runs for your purposes. At that point you
+can delete the replication controller by running:
+
+```sh
+kubectl delete replicationcontroller flakecontroller
+```
+
+If you do a final check for flakes with `docker ps -a`, ignore tasks that
+exited -1, since that's what happens when you stop the replication controller.
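+
+If you rerun the collection script above for that final pass, a filter along
+these lines will drop both the clean exits and the -1 statuses (a sketch, not
+from the original text):
+
+```sh
+# Keep only real failures: drop clean exits (0) and the -1 status left
+# behind when the replication controller is stopped.
+grep "Exited (" output.txt | grep -v "Exited (0)" | grep -v "Exited (-1)"
+```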
+
+Happy flake hunting!
+