| field | value | |
|---|---|---|
| author | Mike Brown <brownwm@us.ibm.com> | 2016-05-03 14:31:42 -0500 |
| committer | Mike Brown <brownwm@us.ibm.com> | 2016-07-14 15:07:05 -0500 |
| commit | c7c8656f2fee1cc86baa73d4a65ae4ea6611e3f2 (patch) | |
| tree | f6dbd099690d4b645dfd326a2ddbc7be0bc66f47 /flaky-tests.md | |
| parent | 74c9bab39c48cee79c364187322e4b944e6b4667 (diff) | |
devel/ tree 80col updates; and other minor edits
Signed-off-by: Mike Brown <brownwm@us.ibm.com>
Diffstat (limited to 'flaky-tests.md')

| -rw-r--r-- | flaky-tests.md | 41 |
|---|---|---|

1 file changed, 29 insertions(+), 12 deletions(-)
diff --git a/flaky-tests.md b/flaky-tests.md
index 68fe8a23..c2db9ae2 100644
--- a/flaky-tests.md
+++ b/flaky-tests.md
@@ -67,7 +67,7 @@ discoverable from the issue.
 5. Link to durable storage with the rest of the logs. This means (for all the
    tests that Google runs) the GCS link is mandatory! The Jenkins test result
    link is nice but strictly optional: not only does it expire more quickly,
-   it's not accesible to non-Googlers.
+   it's not accessible to non-Googlers.
 
 ## Expectations when a flaky test is assigned to you
 
@@ -132,15 +132,20 @@ system!
 
 # Hunting flaky unit tests in Kubernetes
 
-Sometimes unit tests are flaky. This means that due to (usually) race conditions, they will occasionally fail, even though most of the time they pass.
+Sometimes unit tests are flaky. This means that due to (usually) race
+conditions, they will occasionally fail, even though most of the time they pass.
 
-We have a goal of 99.9% flake free tests. This means that there is only one flake in one thousand runs of a test.
+We have a goal of 99.9% flake free tests. This means that there is only one
+flake in one thousand runs of a test.
 
-Running a test 1000 times on your own machine can be tedious and time consuming. Fortunately, there is a better way to achieve this using Kubernetes.
+Running a test 1000 times on your own machine can be tedious and time consuming.
+Fortunately, there is a better way to achieve this using Kubernetes.
 
-_Note: these instructions are mildly hacky for now, as we get run once semantics and logging they will get better_
+_Note: these instructions are mildly hacky for now, as we get run once semantics
+and logging they will get better_
 
-There is a testing image `brendanburns/flake` up on the docker hub. We will use this image to test our fix.
+There is a testing image `brendanburns/flake` up on the docker hub. We will use
+this image to test our fix.
 
 Create a replication controller with the following config:
@@ -166,15 +171,25 @@ spec:
         value: https://github.com/kubernetes/kubernetes
 ```
 
-Note that we omit the labels and the selector fields of the replication controller, because they will be populated from the labels field of the pod template by default.
+Note that we omit the labels and the selector fields of the replication
+controller, because they will be populated from the labels field of the pod
+template by default.
 
 ```sh
 kubectl create -f ./controller.yaml
 ```
 
-This will spin up 24 instances of the test. They will run to completion, then exit, and the kubelet will restart them, accumulating more and more runs of the test.
-You can examine the recent runs of the test by calling `docker ps -a` and looking for tasks that exited with non-zero exit codes. Unfortunately, docker ps -a only keeps around the exit status of the last 15-20 containers with the same image, so you have to check them frequently.
-You can use this script to automate checking for failures, assuming your cluster is running on GCE and has four nodes:
+This will spin up 24 instances of the test. They will run to completion, then
+exit, and the kubelet will restart them, accumulating more and more runs of the
+test.
+
+You can examine the recent runs of the test by calling `docker ps -a` and
+looking for tasks that exited with non-zero exit codes. Unfortunately, docker
+ps -a only keeps around the exit status of the last 15-20 containers with the
+same image, so you have to check them frequently.
+
+You can use this script to automate checking for failures, assuming your cluster
+is running on GCE and has four nodes:
 
 ```sh
 echo "" > output.txt
@@ -186,13 +201,15 @@ done
 grep "Exited ([^0])" output.txt
 ```
 
-Eventually you will have sufficient runs for your purposes. At that point you can delete the replication controller by running:
+Eventually you will have sufficient runs for your purposes. At that point you
+can delete the replication controller by running:
 
 ```sh
 kubectl delete replicationcontroller flakecontroller
 ```
 
-If you do a final check for flakes with `docker ps -a`, ignore tasks that exited -1, since that's what happens when you stop the replication controller.
+If you do a final check for flakes with `docker ps -a`, ignore tasks that
+exited -1, since that's what happens when you stop the replication controller.
 
 Happy flake hunting!
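The `grep "Exited ([^0])"` pattern in the patched script treats any container whose status is not `Exited (0)` as a failed run. A self-contained sketch of what that match does — the `output.txt` contents below are fabricated sample lines written in the style of `docker ps -a` output, not real output from a cluster:

```shell
# Fabricated sample in the style of the script's collected output (assumption:
# each container line ends with a docker-style "Exited (N) ..." status).
cat > output.txt <<'EOF'
kubernetes-node-1:
abc123  brendanburns/flake  Exited (0) 2 minutes ago
def456  brendanburns/flake  Exited (1) 3 minutes ago
ghi789  brendanburns/flake  Exited (0) 5 minutes ago
EOF

# Same pattern as the script: match any status whose exit code is not 0.
grep "Exited ([^0])" output.txt

# Count failures against total runs to estimate the flake rate.
failures=$(grep -c "Exited ([^0])" output.txt)
runs=$(grep -c "Exited (" output.txt)
echo "flaky runs: ${failures} of ${runs}"   # prints "flaky runs: 1 of 3"
```

Note the pattern only inspects the first character after the parenthesis, which is good enough here: it also matches the `-` of `Exited (-1)`, which is why the final paragraph above says to ignore `-1` exits after stopping the replication controller.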
