| field | value | date |
|---|---|---|
| author | Kubernetes Prow Robot <k8s-ci-robot@users.noreply.github.com> | 2019-12-09 18:07:29 -0800 |
| committer | GitHub <noreply@github.com> | 2019-12-09 18:07:29 -0800 |
| commit | fd7043c73bb981e4ef078ed65be73888c5e35bd2 | |
| tree | 9e875239550ccd9108c4e3b1f64b008d7ae1ad1d | |
| parent | bfdd288a81fbfdc3403170f1554cf493823176af | |
| parent | 460f52fda305d5db80a63a82ee46ce8876652639 | |
Merge pull request #4299 from spiffxp/flake-content
Try adding some more useful links for hunting flakes
| -rw-r--r-- | contributors/devel/sig-testing/flaky-tests.md | 51 |
1 files changed, 28 insertions, 23 deletions
diff --git a/contributors/devel/sig-testing/flaky-tests.md b/contributors/devel/sig-testing/flaky-tests.md
index 14302592..2184949a 100644
--- a/contributors/devel/sig-testing/flaky-tests.md
+++ b/contributors/devel/sig-testing/flaky-tests.md
@@ -15,20 +15,28 @@ what caused the failure.
 Note that flakes can occur in unit tests, integration tests, or end-to-end
 tests, but probably occur most commonly in end-to-end tests.
 
-## Hunting Flakes
+# Hunting Flakes
 
-You may notice lots of your PRs or ones you watch are having a common
-pre-submit failure, but less frequent issues that are still of concern take
-more analysis over time. There are metrics recorded and viewable in:
-- [TestGrid](https://k8s-testgrid.appspot.com/presubmits-kubernetes-blocking#Summary)
-- [Velodrome](http://velodrome.k8s.io/dashboard/db/bigquery-metrics?orgId=1)
+We offer the following tools to aid in finding or troubleshooting flakes
 
-It is worth noting tests are going to fail in presubmit a lot due
-to unbuildable code, but that wont happen as much on the same commit unless
-there's a true issue in the code or a broader problem like a dep failed to
-pull in.
+- [go.k8s.io/triage] - an interactive test failure report providing filtering and drill-down by job name, test name, failure text for failures in the last two weeks
+  - https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&job=pull-kubernetes-e2e-gce%24 - all failures that happened in the `pull-kubernetes-e2e-gce` job
+  - https://storage.googleapis.com/k8s-gubernator/triage/index.html?text=timed%20out - all failures containing the text `timed out`
+  - https://storage.googleapis.com/k8s-gubernator/triage/index.html?test=%5C%5Bsig-apps%5C%5D - all failures that happened in tests with `[sig-apps]` in their name
+- [testgrid.k8s.io] - display test results in a grid for visual identififcation of flakes
+  - https://testgrid.k8s.io/presubmits-kubernetes-blocking - all merge-blocking jobs
+  - https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-e2e-gce&exclude-filter-by-regex=BeforeSuite&sort-by-flakiness= - results for the pull-kubernetes-e2e-gce job sorted by flakiness
+  - https://testgrid.k8s.io/sig-release-master-informing#gce-cos-master-default&sort-by-flakiness=&width=10 - results for the equivalent CI job
+- [velodrome.k8s.io] - dashboards driven by the results of queries run against test results using bigquery
+  - http://velodrome.k8s.io/dashboard/db/job-health-merge-blocking?orgId=1 - includes flake rate and top flakes for merge-blocking jobs for kubernetes/kubernetes
+  - http://velodrome.k8s.io/dashboard/db/job-health-release-blocking?orgId=1 - includes flake rate and top flakes for release-blocking jobs for kubernetes/kubernetes
+- [`kind/flake` github query][flake] - open issues or PRs related to flaky jobs or tests for kubernetes/kubernetes
 
-## Filing issues for flaky tests
+[go.k8s.io/triage]: https//go.k8s.io/triage
+[testgrid.k8s.io]: https://testgrid.k8s.io
+[velodrome.k8s.io]: https://velodrome.k8s.io
+
+# GitHub Issues for Known Flakes
 
 Because flakes may be rare, it's very important that all relevant logs be
 discoverable from the issue.
@@ -36,24 +44,18 @@ discoverable from the issue.
 1. Search for the test name. If you find an open issue and you're 90% sure the
    flake is exactly the same, add a comment instead of making a new issue.
 2. If you make a new issue, you should title it with the test name, prefixed by
-   "e2e/unit/integration flake:" (whichever is appropriate)
+   "[Flaky test]"
 3. Reference any old issues you found in step one. Also, make a comment in the
    old issue referencing your new issue, because people monitoring only their
    email do not see the backlinks github adds. Alternatively, tag the person or
    people who most recently worked on it.
 4. Paste, in block quotes, the entire log of the individual failing test, not
    just the failure line.
-5. Link to durable storage with the rest of the logs. This means (for all the
-   tests that Google runs) the GCS link is mandatory! The Jenkins test result
-   link is nice but strictly optional: not only does it expire more quickly,
-   it's not accessible to non-Googlers.
-
-## Finding failed flaky test cases
+5. Link to spyglass to provide access to all durable artifacts and logs (eg: https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-flaky/1204178407886163970)
 
 Find flaky tests issues on GitHub under the [kind/flake issue label][flake].
-There are significant numbers of flaky tests reported on a regular basis and P2
-flakes are under-investigated. Fixing flakes is a quick way to gain expertise
-and community goodwill.
+There are significant numbers of flaky tests reported on a regular basis. Fixing
+flakes is a quick way to gain expertise and community goodwill.
 
 [flake]: https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+is%3Aissue+label%3Akind%2Fflake
 
@@ -62,8 +64,8 @@ and community goodwill.
 Note that we won't randomly assign these issues to you unless you've opted in or
 you're part of a group that has opted in. We are more than happy to accept help
 from anyone in fixing these, but due to the severity of the problem when merges
-are blocked, we need reasonably quick turn-around time on test flakes. Therefore
-we have the following guidelines:
+are blocked, we need reasonably quick turn-around time on merge-blocking or
+release-blocking flakes. Therefore we have the following guidelines:
 
 1. If a flaky test is assigned to you, it's more important than anything else
    you're doing unless you can get a special dispensation (in which case it will
@@ -88,6 +90,9 @@ we have the following guidelines:
 
 6. If a flake has been open, could not be reproduced, and has not manifested in
    3 months, it is reasonable to close the flake issue with a note saying why.
+7. If you are unable to deflake the test, consider adding `[Flaky]` to the test
+   name, which will result in the test being quarantined to only those jobs that
+   explicitly run flakes (eg: https://testgrid.k8s.io/google-gce#gci-gce-flaky)
 
 # Reproducing unit test flakes
 
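
Reviewer note: the triage links added in this change differ only in their query string (`pr`, `job`, `text`, `test`). The sketch below shows one way such links could be assembled. It assumes the base URL and parameter names seen in the example links; the helper `triage_url` is hypothetical and not part of any Kubernetes tooling. The `job` and `test` values in the examples look like regular expressions (a trailing `$`, escaped brackets), but that is inferred from the links rather than stated in the change.

```python
from urllib.parse import quote, urlencode

# Base URL copied from the example links in the diff; an assumption that it stays stable.
TRIAGE_BASE = "https://storage.googleapis.com/k8s-gubernator/triage/index.html"


def triage_url(**filters: str) -> str:
    """Hypothetical helper: build a triage link from filters such as pr, job, text, test."""
    # quote_via=quote encodes a space as %20 rather than '+', matching the example links.
    return f"{TRIAGE_BASE}?{urlencode(filters, quote_via=quote)}"


# The three example links from the diff, rebuilt from their unencoded filter values.
print(triage_url(pr="1", job="pull-kubernetes-e2e-gce$"))
print(triage_url(text="timed out"))
print(triage_url(test=r"\[sig-apps\]"))
```

Passing the unencoded filter and letting `urlencode` do the percent-encoding avoids hand-writing sequences such as `%5C%5B`.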
