author: Kubernetes Prow Robot <k8s-ci-robot@users.noreply.github.com>, 2019-12-09 18:07:29 -0800
committer: GitHub <noreply@github.com>, 2019-12-09 18:07:29 -0800
commit: fd7043c73bb981e4ef078ed65be73888c5e35bd2
tree: 9e875239550ccd9108c4e3b1f64b008d7ae1ad1d
parent: bfdd288a81fbfdc3403170f1554cf493823176af
parent: 460f52fda305d5db80a63a82ee46ce8876652639
Merge pull request #4299 from spiffxp/flake-content
Try adding some more useful links for hunting flakes
-rw-r--r-- contributors/devel/sig-testing/flaky-tests.md | 51
1 file changed, 28 insertions, 23 deletions
diff --git a/contributors/devel/sig-testing/flaky-tests.md b/contributors/devel/sig-testing/flaky-tests.md
index 14302592..2184949a 100644
--- a/contributors/devel/sig-testing/flaky-tests.md
+++ b/contributors/devel/sig-testing/flaky-tests.md
@@ -15,20 +15,28 @@ what caused the failure.
Note that flakes can occur in unit tests, integration tests, or end-to-end
tests, but probably occur most commonly in end-to-end tests.
-## Hunting Flakes
+# Hunting Flakes
-You may notice lots of your PRs or ones you watch are having a common
-pre-submit failure, but less frequent issues that are still of concern take
-more analysis over time. There are metrics recorded and viewable in:
-- [TestGrid](https://k8s-testgrid.appspot.com/presubmits-kubernetes-blocking#Summary)
-- [Velodrome](http://velodrome.k8s.io/dashboard/db/bigquery-metrics?orgId=1)
+We offer the following tools to aid in finding or troubleshooting flakes
-It is worth noting tests are going to fail in presubmit a lot due
-to unbuildable code, but that won't happen as much on the same commit unless
-there's a true issue in the code or a broader problem like a dep failed to
-pull in.
+- [go.k8s.io/triage] - an interactive test failure report providing filtering and drill-down by job name, test name, and failure text for failures in the last two weeks
+ - https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&job=pull-kubernetes-e2e-gce%24 - all failures that happened in the `pull-kubernetes-e2e-gce` job
+ - https://storage.googleapis.com/k8s-gubernator/triage/index.html?text=timed%20out - all failures containing the text `timed out`
+ - https://storage.googleapis.com/k8s-gubernator/triage/index.html?test=%5C%5Bsig-apps%5C%5D - all failures that happened in tests with `[sig-apps]` in their name
+- [testgrid.k8s.io] - displays test results in a grid for visual identification of flakes
+ - https://testgrid.k8s.io/presubmits-kubernetes-blocking - all merge-blocking jobs
+ - https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-e2e-gce&exclude-filter-by-regex=BeforeSuite&sort-by-flakiness= - results for the pull-kubernetes-e2e-gce job sorted by flakiness
+ - https://testgrid.k8s.io/sig-release-master-informing#gce-cos-master-default&sort-by-flakiness=&width=10 - results for the equivalent CI job
+- [velodrome.k8s.io] - dashboards driven by the results of queries run against test results using bigquery
+ - http://velodrome.k8s.io/dashboard/db/job-health-merge-blocking?orgId=1 - includes flake rate and top flakes for merge-blocking jobs for kubernetes/kubernetes
+ - http://velodrome.k8s.io/dashboard/db/job-health-release-blocking?orgId=1 - includes flake rate and top flakes for release-blocking jobs for kubernetes/kubernetes
+- [`kind/flake` github query][flake] - open issues or PRs related to flaky jobs or tests for kubernetes/kubernetes
-## Filing issues for flaky tests
+[go.k8s.io/triage]: https://go.k8s.io/triage
+[testgrid.k8s.io]: https://testgrid.k8s.io
+[velodrome.k8s.io]: https://velodrome.k8s.io
+
+# GitHub Issues for Known Flakes
Because flakes may be rare, it's very important that all relevant logs be
discoverable from the issue.
@@ -36,24 +44,18 @@ discoverable from the issue.
1. Search for the test name. If you find an open issue and you're 90% sure the
flake is exactly the same, add a comment instead of making a new issue.
2. If you make a new issue, you should title it with the test name, prefixed by
- "e2e/unit/integration flake:" (whichever is appropriate)
+ "[Flaky test]"
3. Reference any old issues you found in step one. Also, make a comment in the
old issue referencing your new issue, because people monitoring only their
email do not see the backlinks github adds. Alternatively, tag the person or
people who most recently worked on it.
4. Paste, in block quotes, the entire log of the individual failing test, not
just the failure line.
-5. Link to durable storage with the rest of the logs. This means (for all the
- tests that Google runs) the GCS link is mandatory! The Jenkins test result
- link is nice but strictly optional: not only does it expire more quickly,
- it's not accessible to non-Googlers.
-
-## Finding failed flaky test cases
+5. Link to spyglass to provide access to all durable artifacts and logs (eg: https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-flaky/1204178407886163970)
Find flaky tests issues on GitHub under the [kind/flake issue label][flake].
-There are significant numbers of flaky tests reported on a regular basis and P2
-flakes are under-investigated. Fixing flakes is a quick way to gain expertise
-and community goodwill.
+There are significant numbers of flaky tests reported on a regular basis. Fixing
+flakes is a quick way to gain expertise and community goodwill.
[flake]: https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+is%3Aissue+label%3Akind%2Fflake
@@ -62,8 +64,8 @@ and community goodwill.
Note that we won't randomly assign these issues to you unless you've opted in or
you're part of a group that has opted in. We are more than happy to accept help
from anyone in fixing these, but due to the severity of the problem when merges
-are blocked, we need reasonably quick turn-around time on test flakes. Therefore
-we have the following guidelines:
+are blocked, we need reasonably quick turn-around time on merge-blocking or
+release-blocking flakes. Therefore we have the following guidelines:
1. If a flaky test is assigned to you, it's more important than anything else
you're doing unless you can get a special dispensation (in which case it will
@@ -88,6 +90,9 @@ we have the following guidelines:
6. If a flake has been open, could not be reproduced, and has not manifested in
3 months, it is reasonable to close the flake issue with a note saying
why.
+7. If you are unable to deflake the test, consider adding `[Flaky]` to the test
+ name, which will result in the test being quarantined to only those jobs that
+ explicitly run flakes (eg: https://testgrid.k8s.io/google-gce#gci-gce-flaky)
# Reproducing unit test flakes