Update buildcop instructions

Based on https://docs.google.com/document/d/11nf_tg3_0OTHfKpJdaaPBJrCMIHHWyr5cBZ1EXDnI0o/edit?ts=59477035#
author: Anirudh Ramanathan <ramanathana@google.com> 2017-06-19 11:01:33 -0700
committer: GitHub <noreply@github.com> 2017-06-19 11:01:33 -0700
commit: 10ff89c5fad124a7f31edf9abb12e6bf59c82762 (patch)
tree: 60c5d1a30a167e7a1cd5166fb9d695e8fd2cc186
parent: 45988676be1714cc275b8e736b42e0820aa57362 (diff)
1 files changed, 44 insertions, 117 deletions
diff --git a/contributors/devel/on-call-build-cop.md b/contributors/devel/on-call-build-cop.md
index ab80faea..af680374 100644
--- a/contributors/devel/on-call-build-cop.md
+++ b/contributors/devel/on-call-build-cop.md
@@ -1,117 +1,44 @@
-## Kubernetes "Github and Build-cop" Rotation
-
-### Preqrequisites
-
-* Ensure you have [write access to http://github.com/kubernetes/kubernetes](https://github.com/orgs/kubernetes/teams/kubernetes-maintainers)
-  * Test your admin access by e.g. adding a label to an issue.
-
-### Traffic sources and responsibilities
-
-* GitHub Kubernetes [issues](https://github.com/kubernetes/kubernetes/issues):
-Your job is to be
-the first responder to all new issues. If you are not equipped to do
-this (which is fine!), it is your job to seek guidance!
-
-  * Support issues should be closed and redirected to Stack Overflow (see example
-response [here](on-call-user-support.md#user-support-response-example)).
-
-  * All incoming issues should be tagged with a team label
-(team/{api,ux,control-plane,node,cluster,csi,redhat,mesosphere,gke,release-infra,test-infra,none});
-for issues that overlap teams, you can use multiple team labels
-
-    * There is a related concept of "Github teams" which allow you to @ mention
-a set of people; feel free to @ mention a Github team if you wish, but this is
-not a substitute for adding a team/* label, which is required
-
-      * [Google teams](https://github.com/orgs/kubernetes/teams?utf8=%E2%9C%93&query=goog-)
-      * [Redhat teams](https://github.com/orgs/kubernetes/teams?utf8=%E2%9C%93&query=rh-)
-      * [SIGs](https://github.com/orgs/kubernetes/teams?utf8=%E2%9C%93&query=sig-)
-
-    * If the issue is reporting broken builds, broken e2e tests, or other
-obvious P0 issues, label the issue with priority/P0 and assign it to someone.
-This is the only situation in which you should add a priority/* label
-      * non-P0 issues do not need a reviewer assigned initially
-
-    * Assign any issues related to Vagrant to @derekwaynecarr (and @mention him
-in the issue)
-
-  * Keep in mind that you can @ mention people in an issue to bring it to
-their attention without assigning it to them. You can also @ mention github
-teams, such as @kubernetes/goog-ux or @kubernetes/kubectl
-
-  * If you need help triaging an issue, consult with (or assign it to)
-@brendandburns, @thockin, @bgrant0607, @davidopp, @dchen1107,
-@lavalamp (all U.S. Pacific Time) or @fgrzadkowski (Central European Time).
-
-  * At the beginning of your shift, please add team/* labels to any issues that
-have fallen through the cracks and don't have one. Likewise, be fair to the next
-person in rotation: try to ensure that every issue that gets filed while you are
-on duty is handled. The Github query to find issues with no team/* label is:
-[here](https://github.com/kubernetes/kubernetes/issues?utf8=%E2%9C%93&q=is%3Aopen+is%3Aissue+-label%3Ateam%2Fcontrol-plane+-label%3Ateam%2Fmesosphere+-label%3Ateam%2Fredhat+-label%3Ateam%2Frelease-infra+-label%3Ateam%2Fnone+-label%3Ateam%2Fnode+-label%3Ateam%2Fcluster+-label%3Ateam%2Fux+-label%3Ateam%2Fapi+-label%3Ateam%2Ftest-infra+-label%3Ateam%2Fgke+-label%3A"team%2FCSI-API+Machinery+SIG"+-label%3Ateam%2Fhuawei+-label%3Ateam%2Fsig-aws).
-
-### Build-copping
-
-* The [merge-bot submit queue](http://submit-queue.k8s.io/)
-([source](https://github.com/kubernetes/contrib/tree/master/mungegithub/mungers/submit-queue.go))
-should auto-merge all eligible PRs for you once they've passed all the relevant
-checks mentioned below and all [critical e2e tests]
-(https://goto.google.com/k8s-test/view/Critical%20Builds/) are passing. If the
-merge-bot been disabled for some reason, or tests are failing, you might need to
-do some manual merging to get things back on track.
-
-* Once a day or so, look at the [flaky test builds]
-(https://goto.google.com/k8s-test/view/Flaky/); if they are timing out, clusters
-are failing to start, or tests are consistently failing (instead of just
-flaking), file an issue to get things back on track.
-
-* Jobs that are not in [critical e2e tests](https://goto.google.com/k8s-test/view/Critical%20Builds/)
-or [flaky test builds](https://goto.google.com/k8s-test/view/Flaky/) are not
-your responsibility to monitor. The `Test owner:` in the job description will be
-automatically emailed if the job is failing.
-
-* If you are oncall, ensure that PRs confirming to the following
-pre-requisites are being merged at a reasonable rate:
-
-  * [Have been LGTMd](https://github.com/kubernetes/kubernetes/labels/lgtm)
-  * Pass Travis and Jenkins per-PR tests.
-  * Author has signed CLA if applicable.
-
-
-* Although the shift schedule shows you as being scheduled Monday to Monday,
-  working on the weekend is neither expected nor encouraged.  Enjoy your time
-  off.
-
-* When the build is broken, roll back the PRs responsible ASAP
-
-* If the build job itself fails, Jenkins will not try again automatically and everything will halt.  You can trigger one at http://kubekins.mtv.corp.google.com/job/ci-kubernetes-build/#.  Click `log in`, then click `Build Now` in the left margin.
-
-* When E2E tests are unstable, a "merge freeze" may be instituted. During a
-merge freeze:
-
-  * Oncall should slowly merge LGTMd changes throughout the day while monitoring
-E2E to ensure stability.
-
-  * Ideally the E2E run should be green, but some tests are flaky and can fail
-randomly (not as a result of a particular change).
-      * If a large number of tests fail, or tests that normally pass fail, that
-is an indication that one or more of the PR(s) in that build might be
-problematic (and should be reverted).
-      * Use the Test Results Analyzer to see individual test history over time.
-
-
-* Flake mitigation
-
-  * Tests that flake (fail a small percentage of the time) need an issue filed
-against them. Please read [this](flaky-tests.md#filing-issues-for-flaky-tests);
-the build cop is expected to file issues for any flaky tests they encounter.
-
-  * It's reasonable to manually merge PRs that fix a flake or otherwise mitigate it.
-
-### Contact information
-
-[@k8s-oncall](https://github.com/k8s-oncall) will reach the current person on
-call.
-
-<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
-[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/on-call-build-cop.md?pixel)]()
-<!-- END MUNGE: GENERATED_ANALYTICS -->
+# Kubernetes BuildCop Workflow
+
+June 2017
+
+## Objective
+
+This document describes the responsibilities and the workflow of a person assuming the buildcop role. 
+The current buildcop can be found [here](https://storage.googleapis.com/kubernetes-jenkins/oncall.html).
+
+## Prerequisites for build-copping
+
+- Ensure you have write access to [http://github.com/kubernetes/kubernetes](http://github.com/kubernetes/kubernetes)
+  - Test your admin access by e.g. adding a label to an issue.
+- You must communicate any concerns or actions via the **#sig-release** Slack channel to ensure that
+the release team has context on the current state of the submit queue.
+- You must attend the release burndown meeting to provide an update on the current state of the submit queue.
+
+## Responsibilities
+
+The build-cop's primary responsibility is to ensure that automatic merges are happening at a
+**reasonable** rate. This may include manually merging PRs that fix test flakes when the pre-submits
+are failing repeatedly. The build-cop must be familiar with the
+[queue labels](https://submit-queue.k8s.io/#/info) and apply them as necessary to critical fixes.
+The priority labels are defunct and no longer respected by the submit queue. As of June 2017,
+the merge rate is about 30 PRs per day when the queue holds that many PRs. This role
+previously included classifying incoming issues, but that is no
+longer a part of the mandate.
+
+## Workflow
+
+1. Check the Prow batch dashboard: [https://prow.k8s.io/?type=batch](https://prow.k8s.io/?type=batch) 
+to ensure that merges are occurring regularly.
+2. If there are post-submit blocking jobs (see [link](https://submit-queue.k8s.io/#/e2e)), ensure 
+that those builds are green and allowing merges to occur.
+3. If several batch merges are failing, file an issue for that job and describe the possible 
+causes for the failure. Debug if possible, else triage and assign to a particular SIG, and 
+@-mention the maintainers. For example, see: 
+[https://github.com/kubernetes/kubernetes/issues/47135](https://github.com/kubernetes/kubernetes/issues/47135)
+4. Communicate the actions to **#sig-release** via Slack and ensure that the issue is being worked on.
+   1. If the issue is not worked on for several hours, please escalate to the release team.
+5. When the SIG member sends a fix, manually merge if necessary, after verifying that pre-submits pass, 
+or use the 'retest-not-required' label with the appropriate 'queue/*' label to ensure merge of the 
+flake fix.
+6. Issue an update to the **#sig-release** channel on the merge rate and the PR that was used to fix the queue.
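
The "reasonable rate" check from the Responsibilities and Workflow sections of the new document could be sketched roughly as below. This is only an illustration and not part of the commit: the function names, the `queue_length` parameter, and the idea of feeding in a list of merge timestamps are all hypothetical, and the threshold of 30 is simply the June 2017 figure quoted in the text. How the timestamps would be collected (for example, by reading the Prow batch dashboard) is left out.

```python
from datetime import datetime, timedelta

# Assumption: about 30 merges per day, the June 2017 figure from the document.
EXPECTED_MERGES_PER_DAY = 30


def merges_in_last_day(merge_times, now):
    """Count merges whose timestamp falls within the 24 hours before `now`."""
    cutoff = now - timedelta(days=1)
    return sum(1 for t in merge_times if cutoff < t <= now)


def merge_rate_is_reasonable(merge_times, queue_length, now,
                             expected=EXPECTED_MERGES_PER_DAY):
    """Return True if the last day's merges meet the expected rate.

    The target is capped by the (hypothetical) queue length, because the
    quoted rate only applies when the queue holds that many PRs.
    """
    target = min(expected, queue_length)
    return merges_in_last_day(merge_times, now) >= target


if __name__ == "__main__":
    now = datetime(2017, 6, 19, 12, 0)
    # Twelve merges, one per hour over the previous twelve hours.
    merge_times = [now - timedelta(hours=h) for h in range(1, 13)]
    print(merge_rate_is_reasonable(merge_times, queue_length=40, now=now))  # False
    print(merge_rate_is_reasonable(merge_times, queue_length=10, now=now))  # True
```

A `False` result here would correspond to the point in the workflow where the build-cop starts looking for failing batch merges and files an issue.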