author hasheddan <georgedanielmangum@gmail.com> 2021-02-06 16:58:57 -0600
committer hasheddan <georgedanielmangum@gmail.com> 2021-02-06 16:58:57 -0600
commit 9321dd58ff71bb50ecf628eecc6f826849a15c1c
tree 350a91c93cc96875ded30003e67a7bf33faa2a9a
parent cc0ddcf3ebd530f6eeb72c35711a4cde6d1372e1
Add flake finders episode 0 hosts

Updates flake finders episode 0 README.md with host links.

Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
-rw-r--r-- contributors/devel/sig-release/flake-finders/episodes/000/README.md | 77
1 files changed, 39 insertions, 38 deletions
diff --git a/contributors/devel/sig-release/flake-finders/episodes/000/README.md b/contributors/devel/sig-release/flake-finders/episodes/000/README.md
index 869c1afb..64c408dc 100644
--- a/contributors/devel/sig-release/flake-finders/episodes/000/README.md
+++ b/contributors/devel/sig-release/flake-finders/episodes/000/README.md
@@ -2,6 +2,9 @@
February 5th 2021 ([Recording](https://youtu.be/Hqlm2h2AEvA))
+Hosts: [Dan Mangum](https://github.com/hasheddan), [Rob
+Kielty](https://github.com/RobertKielty)
+
## Introduction
This is the first episode of Flake Finder Fridays with Dan Mangum and Rob
@@ -16,13 +19,12 @@ test related issue logged in the past four weeks.
We intend to demo how CI works on the Kubernetes project and also how we
collaborate across teams to resolve test maintenance issues.
-## Issue
-This is the issue that we are going to look at today ...
+## Issue This is the issue that we are going to look at today ...
[[Failing Test] ci-kubernetes-build-canary does not understand
"--platform"](https://github.com/kubernetes/kubernetes/issues/98646)
-### Testgrid Dashboard
+### Testgrid Dashboard
[build-master-canary](https://testgrid.k8s.io/sig-release-master-informing#build-master-canary)
### Breaking PRs
@@ -34,66 +36,65 @@ This is the issue that we are going to look at today ...
## Investigation
1. Desire to move from Google-owned infrastructure to Kubernetes community
- infrastructure. Thus the introduction of a **canary** build job to test
- pushing building and pushing artifacts with new infrastructure.
+infrastructure. Thus the introduction of a **canary** build job to test pushing
+building and pushing artifacts with new infrastructure.
1. Desire to move off of `bootstrap.py` job (currently being used for canary
- job) to `krel` tooling.
+job) to `krel` tooling.
1. Separate job existed (`ci-kubernetes-build-no-bootstrap`) that was doing the
- same thing as the canary job, but with `krel` tooling.
+same thing as the canary job, but with `krel` tooling.
1. The `no-bootstrap` job was running smoothly, so [updated to use it for the
- canary job](https://github.com/kubernetes/test-infra/pull/20663).
+canary job](https://github.com/kubernetes/test-infra/pull/20663).
1. Right before the update, we [switched to using buildx for multi-arch
- images](https://github.com/kubernetes/kubernetes/pull/98529).
+images](https://github.com/kubernetes/kubernetes/pull/98529).
1. Job started failing, which showed up in [some interesting
- ways](https://kubernetes.slack.com/archives/C09QZ4DQB/p1612269558032700).
+ways](https://kubernetes.slack.com/archives/C09QZ4DQB/p1612269558032700).
1. Triage begins! Issue
- [opened](https://github.com/kubernetes/kubernetes/issues/98646) and release
- management team is pinged in Slack.
+[opened](https://github.com/kubernetes/kubernetes/issues/98646) and release
+management team is pinged in Slack.
1. The `build-master`
- [job](https://testgrid.k8s.io/sig-release-master-blocking#build-master) was
- still passing though... interesting.
+[job](https://testgrid.k8s.io/sig-release-master-blocking#build-master) was
+still passing though... interesting.
1. Both are eventually calling `make release`, so environment must be different.
1. Let's look inside!
- ```
- docker run -it --entrypoint /bin/bash gcr.io/k8s-testimages/bootstrap:v20210130-12516b2
- ```
+ ``` docker run -it --entrypoint /bin/bash
+gcr.io/k8s-testimages/bootstrap:v20210130-12516b2 ```
- ```
- docker run -it gcr.io/k8s-staging-releng/k8s-ci-builder:v20201128-v0.6.0-6-g6313f696-default /bin/bash
- ```
+ ``` docker run -it
+gcr.io/k8s-staging-releng/k8s-ci-builder:v20201128-v0.6.0-6-g6313f696-default
+/bin/bash ```
1. A few directions we could go here:
1. Update the `k8s-ci-builder` image to you use newer version of Docker
1. Update the `k8s-ci-builder` image to ensure that
- `DOCKER_CLI_EXPERIMENTAL=enabled` is set
+`DOCKER_CLI_EXPERIMENTAL=enabled` is set
1. Update the `release.sh` script to set `DOCKER_CLI_EXPERIMENTAL=enabled`
1. Making the `release.sh` script more flexible serves the community better
- because it allows for building with more environments. Would also be good to
- update the `k8s-ci-builder` image for this specific case as well.
+because it allows for building with more environments. Would also be good to
+update the `k8s-ci-builder` image for this specific case as well.
1. And we get a new
- [failure](https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-build-canary/1356704759045689344/build-log.txt)!
+[failure](https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-build-canary/1356704759045689344/build-log.txt)!
1. Let's see what is going on in those images again...
1. Why would this cause an error in one but not the other if we have
- `DOCKER_CLI_EXPERIMENTAL=enabled`?
- ([this](https://github.com/docker/buildx/pull/403) is why)
+`DOCKER_CLI_EXPERIMENTAL=enabled`?
+([this](https://github.com/docker/buildx/pull/403) is why)
1. In the mean time we went ahead and [re-enabled the bootstrap
- job](https://github.com/kubernetes/test-infra/pull/20712) (consumers of those
- images need them!)
+job](https://github.com/kubernetes/test-infra/pull/20712) (consumers of those
+images need them!)
1. Decided to [increase logging
- verbosity](https://github.com/kubernetes/kubernetes/pull/98568) on failures
- to see if that would give us a clue into what was going wrong (and to remove
- those annoying `quiet currently not implemented` warnings).
+verbosity](https://github.com/kubernetes/kubernetes/pull/98568) on failures to
+see if that would give us a clue into what was going wrong (and to remove those
+annoying `quiet currently not implemented` warnings).
1. Job turns green! But how?
1. [Buildx](https://github.com/docker/buildx) is versioned separately than
- Docker itself. Turns out that the `--quiet` flag warning was [actually an
- error](https://github.com/docker/buildx/pull/403) until `v0.5.1` of Buildx.
+Docker itself. Turns out that the `--quiet` flag warning was [actually an
+error](https://github.com/docker/buildx/pull/403) until `v0.5.1` of Buildx.
1. The `build-master` job was running with buildx `v0.5.1` while the `krel` job
- was running with `v0.4.2`. This meant the quiet flag was causing an error in
- the `krel` job, and removing it alleviated the error.
+was running with `v0.4.2`. This meant the quiet flag was causing an error in the
+`krel` job, and removing it alleviated the error.
1. Finished up by once again [removing the `bootstrap`
- job](https://github.com/kubernetes/test-infra/pull/20731).
+job](https://github.com/kubernetes/test-infra/pull/20731).
### Fixes
@@ -132,8 +133,8 @@ Brand new to the project?
Setup already and interested in maintaining tests?
- Check out [this video](https://www.youtube.com/watch?v=Ewp8LNY_qTg) from
Jordan Liggit who describes strategies and tactics to deflake flaking tests
- ([Jordan's show notes for that
- talk](https://gist.github.com/liggitt/6a3a2217fa5f846b52519acfc0ffece0))
+([Jordan's show notes for that
+talk](https://gist.github.com/liggitt/6a3a2217fa5f846b52519acfc0ffece0))
Here's how the CI Signal Team actively monitors CI during a release cycle:
- [A Tour of CI on the Kubernetes