| field | value | date |
|---|---|---|
| author | Kubernetes Submit Queue <k8s-merge-robot@users.noreply.github.com> | 2017-11-19 23:33:47 -0800 |
| committer | GitHub <noreply@github.com> | 2017-11-19 23:33:47 -0800 |
| commit | 3a991bd962d86628034834bf998489df53eedaac | |
| tree | c08324e0af10f7ed9511f4365b2b58c9f5fb9d97 | |
| parent | 164e6b51cc8311acd11ff242ac412fbf7a5c31ae | |
| parent | 3b1d9698799cdbdcf91877569a7ed34463e92c62 | |
Merge pull request #1411 from shyamjvs/update-scale-jobs-automation-schedule
Automatic merge from submit-queue.
Update scalability-validation doc to reflect current job schedule
Following from https://github.com/kubernetes/test-infra/pull/5532
/cc @porridge
| -rw-r--r-- | contributors/devel/release/scalability-validation.md | 59 |
1 files changed, 35 insertions, 24 deletions
diff --git a/contributors/devel/release/scalability-validation.md b/contributors/devel/release/scalability-validation.md
index 5aae1cbd..08ecf09b 100644
--- a/contributors/devel/release/scalability-validation.md
+++ b/contributors/devel/release/scalability-validation.md
@@ -41,37 +41,47 @@ We need to run them on 5k-node clusters, but they’re:
 - Blocking other large tests (quota limitations + only one large test project available viz. 'kubernetes-scale')
 
 So we don’t want to run them too frequently. On the other hand, running them too infrequently means late identification and piling up of regressions. So we choose the following middleground:
-(B = release-blocking, NB = not release-blocking)
-
-- Performance tests on 2k-node cluster in GCE/GKE alternatingly each week (NB)
-  - would catch most scale-related regressions
-  - would catch bugs in both GCE and GKE
-- Performance tests on 5k-node cluster in GCE each week (B)
-  - would catch scale-related bugs missed by 2k-node test (should be rare)
-- Correctness tests on 2k-node cluster in GCE/GKE alternatingly each week (NB)
-  - correctness tests failing on just large clusters is rare, so weekly is enough
-  - would catch bugs in both GCE and GKE
-- Correctness tests on 5k-node cluster in GCE/GKE alternatingly each week (B for GCE)
-  - would catch bugs left out by 2k-node (should be rare)
-  - would verify 5k-node clusters on GKE but without blocking the release
+
+- Performance tests on 2k-node/5k-node GCE clusters alternatingly from Mon-Sat
+  - would give us one performance run from each day to help catch regressions fast
+  - running 2k-node on alternating days gives time for 5k-node correctness tests to run on those days
+  - many of the performance regressions on 5k-node should also be seen on 2k-node (albeit a smaller version probably)
+- Correctness tests on 2k-node/5k-node GCE clusters alternatingly from Mon-Sat
+  - would give us one correctness run from each day to help catch regressions fast
+  - running 2k-node on alternating days gives time for 5k-node performance tests to run on those days
+  - many of the correctness regressions on 5k-node should also be seen on 2k-node
+- Performance tests on 2k-node GKE cluster on Sun
+  - would give us a performance run for sunday too
+  - would also additionally help verify performance of GKE
+- Correctness tests on 2k-node GKE cluster on Sun
+  - would give us a correctness run for sunday too
+  - would also additionally help verify correctness of GKE
 
 Here's the proposed schedule (may be fine-tuned later based on test health / release schedule):
+(B = release-blocking job)
+
+| Day | | |
+| ------------- |:-------------:| -----:|
+| Mon | 5k-node performance @ 00:01 PT (B) | 2k-node correctness @ 22:01 PT |
+| Tue | 2k-node performance @ 05:01 PT | 5k-node correctness @ 14:01 PT (B) |
+| Wed | 5k-node performance @ 00:01 PT (B) | 2k-node correctness @ 22:01 PT |
+| Thu | 2k-node performance @ 05:01 PT | 5k-node correctness @ 14:01 PT (B) |
+| Fri | 5k-node performance @ 00:01 PT (B) | 2k-node correctness @ 22:01 PT |
+| Sat | 2k-node performance @ 05:01 PT | 5k-node correctness @ 14:01 PT (B) |
+| Sun | 'GKE' 2k-node performance @ 05:01 PT | 'GKE' 2k-node correctness @ 15:01 PT |
 
-| | Mon | Tue | Wed | Thu | Fri | Sat | Sun
-| ---- | :----: | :----: | :----:| :----: | :----: | :----: | :----:
-| Week 2k | GCE 5k perf (B) | | | GCE 5k corr (B) | | GKE 2k perf+corr (NB) |
-| Week 2k+1 | GCE 5k perf (B) | | | GKE 5k corr (NB) | | GCE 2k perf+corr (NB) |
+Note: The above schedule is subject to change based on job health, release requirements, etc. You should find it up-to-date in this [calendar].
 
 Why this schedule?
 
-- 5k tests might need special attention in case of failures so they should run on weekdays
-- Running 5k perf test on Mon gives the whole week for debugging in case of failures
-- Running 5k corr test on Thu gives Tue-Wed to debug any setup issues found on Mon
-- Running 5k corr tests alternatingly on GKE and GCE validates both setups
-- 2k perf and corr tests can be run in parallel to save time
-- Running 2k tests on weekend will free up a weekday for the project to run other tests
+- 5k tests might need special attention in case of failures so they should mostly run on weekdays (EDIT: Given that they're quite stable now, we're trying running them on weekend too)
+- Running a large-scale performance job and a large-scale correctness job each day would:
+  - help catch regressions on a daily basis
+  - help verify fixes with low latency
+  - ensure a good release signal
+- Running large scale tests on GKE once a week would help verify GKE setup also, at no real loss of signal ideally
 
-Why GKE tests?
+Why run GKE tests at all?
 
 Google is currently using a single project for scalability testing, on both GCE and GKE. As a result we need to schedule them together. There's a plan for CNCF becoming responsible for funding k8s testing, and GCE/GKE tests would be separated to different projects when that happens, with only GCE being funded by them. This ensures fairness across all cloud providers.
 
@@ -123,3 +133,4 @@ Responsibilities lying with other SIGs/teams as applicable (could be sig-scalabi
 [here]: https://docs.google.com/document/d/15rD6XBtKyvXXifkRAsAVFBqEGApQxDRWM3H1bZSBsKQ
 [#47865]: https://github.com/kubernetes/kubernetes/issues/47865
 [test_owners.csv]: https://github.com/kubernetes/kubernetes/blob/master/test/test_owners.csv
+[calendar]: https://calendar.google.com/calendar?cid=Z29vZ2xlLmNvbV9tNHA3bG1jODVubGlmazFxYzRnNTRqZjg4a0Bncm91cC5jYWxlbmRhci5nb29nbGUuY29t
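
Reviewer note (not part of the commit): the new weekly schedule follows a simple rule: alternate the 5k-node and 2k-node GCE jobs from Mon to Sat, and run both 2k-node jobs on GKE on Sun. The Python sketch below encodes that rule so the table in the diff can be checked by hand. The `Job` type, the `build_schedule` and `jobs_on` helpers, and the field names are hypothetical and exist only for illustration; the real jobs are configured in kubernetes/test-infra, which is not modelled here. Start times and the release-blocking (B) flags are copied from the table.

```python
from datetime import date
from typing import Dict, List, NamedTuple

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


class Job(NamedTuple):
    """One scheduled scalability job (hypothetical type, for illustration only)."""
    provider: str   # "GCE" or "GKE"
    nodes: int      # cluster size in nodes
    kind: str       # "performance" or "correctness"
    start_pt: str   # start time in PT, as listed in the table
    blocking: bool  # True for release-blocking (B) jobs


def build_schedule() -> Dict[str, List[Job]]:
    """Rebuild the weekly table from the alternation rule described in the diff."""
    schedule: Dict[str, List[Job]] = {}
    for i, day in enumerate(DAYS[:6]):  # Mon-Sat run on GCE
        if i % 2 == 0:  # Mon, Wed, Fri: 5k performance, 2k correctness
            schedule[day] = [
                Job("GCE", 5000, "performance", "00:01", True),
                Job("GCE", 2000, "correctness", "22:01", False),
            ]
        else:  # Tue, Thu, Sat: 2k performance, 5k correctness
            schedule[day] = [
                Job("GCE", 2000, "performance", "05:01", False),
                Job("GCE", 5000, "correctness", "14:01", True),
            ]
    # Sun: both 2k-node jobs run on GKE and are not release-blocking.
    schedule["Sun"] = [
        Job("GKE", 2000, "performance", "05:01", False),
        Job("GKE", 2000, "correctness", "15:01", False),
    ]
    return schedule


SCHEDULE = build_schedule()


def jobs_on(day: date) -> List[Job]:
    """Return the jobs scheduled for a calendar date (date.weekday(): Mon == 0)."""
    return SCHEDULE[DAYS[day.weekday()]]
```

For example, `jobs_on(date(2017, 11, 20))` (a Monday) returns the 5k-node performance job followed by the 2k-node correctness job. Under this rule every day gets exactly one performance run and one correctness run, which is the property the "Why this schedule?" bullets argue for.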
