Merge pull request #4299 from spiffxp/flake-content

Try adding some more useful links for hunting flakes
author: Kubernetes Prow Robot <k8s-ci-robot@users.noreply.github.com> 2019-12-09 18:07:29 -0800
committer: GitHub <noreply@github.com> 2019-12-09 18:07:29 -0800
commit: fd7043c73bb981e4ef078ed65be73888c5e35bd2 (patch)
tree: 9e875239550ccd9108c4e3b1f64b008d7ae1ad1d
parent: bfdd288a81fbfdc3403170f1554cf493823176af (diff)
parent: 460f52fda305d5db80a63a82ee46ce8876652639 (diff)
1 files changed, 28 insertions, 23 deletions
diff --git a/contributors/devel/sig-testing/flaky-tests.md b/contributors/devel/sig-testing/flaky-tests.md
index 14302592..2184949a 100644
--- a/contributors/devel/sig-testing/flaky-tests.md
+++ b/contributors/devel/sig-testing/flaky-tests.md
@@ -15,20 +15,28 @@ what caused the failure.
 Note that flakes can occur in unit tests, integration tests, or end-to-end
 tests, but probably occur most commonly in end-to-end tests.
 
-## Hunting Flakes
+# Hunting Flakes
 
-You may notice lots of your PRs or ones you watch are having a common
-pre-submit failure, but less frequent issues that are still of concern take
-more analysis over time.  There are metrics recorded and viewable in:
-- [TestGrid](https://k8s-testgrid.appspot.com/presubmits-kubernetes-blocking#Summary)
-- [Velodrome](http://velodrome.k8s.io/dashboard/db/bigquery-metrics?orgId=1)
+We offer the following tools to aid in finding or troubleshooting flakes:
 
-It is worth noting tests are going to fail in presubmit a lot due
-to unbuildable code, but that wont happen as much on the same commit unless
-there's a true issue in the code or a broader problem like a dep failed to
-pull in.
+- [go.k8s.io/triage] - an interactive test failure report providing filtering and drill-down by job name, test name, or failure text for failures in the last two weeks
+  - https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&job=pull-kubernetes-e2e-gce%24 - all failures that happened in the `pull-kubernetes-e2e-gce` job
+  - https://storage.googleapis.com/k8s-gubernator/triage/index.html?text=timed%20out - all failures containing the text `timed out`
+  - https://storage.googleapis.com/k8s-gubernator/triage/index.html?test=%5C%5Bsig-apps%5C%5D - all failures that happened in tests with `[sig-apps]` in their name
+- [testgrid.k8s.io] - displays test results in a grid for visual identification of flakes
+  - https://testgrid.k8s.io/presubmits-kubernetes-blocking - all merge-blocking jobs
+  - https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-e2e-gce&exclude-filter-by-regex=BeforeSuite&sort-by-flakiness= - results for the pull-kubernetes-e2e-gce job sorted by flakiness
+  - https://testgrid.k8s.io/sig-release-master-informing#gce-cos-master-default&sort-by-flakiness=&width=10 - results for the equivalent CI job
+- [velodrome.k8s.io] - dashboards driven by the results of queries run against test results using bigquery
+  - http://velodrome.k8s.io/dashboard/db/job-health-merge-blocking?orgId=1 - includes flake rate and top flakes for merge-blocking jobs for kubernetes/kubernetes
+  - http://velodrome.k8s.io/dashboard/db/job-health-release-blocking?orgId=1 - includes flake rate and top flakes for release-blocking jobs for kubernetes/kubernetes
+- [`kind/flake` github query][flake] - open issues or PRs related to flaky jobs or tests for kubernetes/kubernetes
 
-## Filing issues for flaky tests
+[go.k8s.io/triage]: https://go.k8s.io/triage
+[testgrid.k8s.io]: https://testgrid.k8s.io
+[velodrome.k8s.io]: https://velodrome.k8s.io
+
+# GitHub Issues for Known Flakes
 
 Because flakes may be rare, it's very important that all relevant logs be
 discoverable from the issue.
@@ -36,24 +44,18 @@ discoverable from the issue.
 1. Search for the test name. If you find an open issue and you're 90% sure the
    flake is exactly the same, add a comment instead of making a new issue.
 2. If you make a new issue, you should title it with the test name, prefixed by
-   "e2e/unit/integration flake:" (whichever is appropriate)
+   "[Flaky test]"
 3. Reference any old issues you found in step one. Also, make a comment in the
    old issue referencing your new issue, because people monitoring only their
    email do not see the backlinks github adds. Alternatively, tag the person or
    people who most recently worked on it.
 4. Paste, in block quotes, the entire log of the individual failing test, not
    just the failure line.
-5. Link to durable storage with the rest of the logs. This means (for all the
-   tests that Google runs) the GCS link is mandatory! The Jenkins test result
-   link is nice but strictly optional: not only does it expire more quickly,
-   it's not accessible to non-Googlers.
-
-## Finding failed flaky test cases
+5. Link to Spyglass to provide access to all durable artifacts and logs (e.g. https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-flaky/1204178407886163970)
 
 Find flaky tests issues on GitHub under the [kind/flake issue label][flake].
-There are significant numbers of flaky tests reported on a regular basis and P2
-flakes are under-investigated. Fixing flakes is a quick way to gain expertise
-and community goodwill.
+There are significant numbers of flaky tests reported on a regular basis. Fixing
+flakes is a quick way to gain expertise and community goodwill.
 
 [flake]: https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+is%3Aissue+label%3Akind%2Fflake
 
@@ -62,8 +64,8 @@ and community goodwill.
 Note that we won't randomly assign these issues to you unless you've opted in or
 you're part of a group that has opted in. We are more than happy to accept help
 from anyone in fixing these, but due to the severity of the problem when merges
-are blocked, we need reasonably quick turn-around time on test flakes. Therefore
-we have the following guidelines:
+are blocked, we need reasonably quick turn-around time on merge-blocking or
+release-blocking flakes. Therefore we have the following guidelines:
 
 1. If a flaky test is assigned to you, it's more important than anything else
    you're doing unless you can get a special dispensation (in which case it will
@@ -88,6 +90,9 @@ we have the following guidelines:
 6. If a flake has been open, could not be reproduced, and has not manifested in
    3 months, it is reasonable to close the flake issue with a note saying
    why.
+7. If you are unable to deflake the test, consider adding `[Flaky]` to the test
+   name, which will result in the test being quarantined to only those jobs that
+   explicitly run flakes (eg: https://testgrid.k8s.io/google-gce#gci-gce-flaky)
 
 # Reproducing unit test flakes