From 3f99d83ca6a809139b36e0da71ee2339a2336ae4 Mon Sep 17 00:00:00 2001 From: Wojciech Tyczynski Date: Fri, 31 Jul 2015 09:54:05 +0200 Subject: Fixes to watch in apiserver proposal --- apiserver-watch.md | 178 +++++++++++++++++++++++++++++++++++++++++++++++++++ apiserver_watch.md | 184 ----------------------------------------------------- 2 files changed, 178 insertions(+), 184 deletions(-) create mode 100644 apiserver-watch.md delete mode 100644 apiserver_watch.md diff --git a/apiserver-watch.md b/apiserver-watch.md new file mode 100644 index 00000000..02a6e6c8 --- /dev/null +++ b/apiserver-watch.md @@ -0,0 +1,178 @@ + + + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest 1.0.x release of this document can be found +[here](http://releases.k8s.io/release-1.0/docs/proposals/apiserver-watch.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + + + + +## Abstract + +In the current system, all watch requests sent to apiserver are in general +redirected to etcd. This means that for every watch request to apiserver, +apiserver opens a watch on etcd. + +The purpose of this proposal is to improve the overall performance of the system +by solving the following problems: + +- having too many open watches on etcd +- deserializing/converting the same objects multiple times in different +watch results + +In the future, we would also like to add an indexing mechanism to the watch. +Although Indexer is not part of this proposal, it is supposed to be compatible +with it - in the future Indexer should be incorporated into the proposed new +watch solution in apiserver without requiring any redesign. + + +## High level design + +We are going to solve those problems by allowing many clients to watch the same +storage in the apiserver, without being redirected to etcd. + +At a high level, apiserver will have a single watch open to etcd, watching all +the objects (of a given type) without any filtering. The changes delivered from +etcd will then be stored in a cache in apiserver. This cache is in fact a +"rolling history window" that will support clients having some amount of latency +between their list and watch calls. Thus it will have a limited capacity, and +whenever a new change arrives from etcd while the cache is full, the oldest change +will be removed to make room for the new one. 
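The rolling history window described above can be sketched as a fixed-capacity cyclic buffer keyed by resourceVersion. This is an illustrative Go sketch only - the type and function names are hypothetical, not part of the proposal:

```go
package main

import "fmt"

// watchEvent pairs a resourceVersion with the (already deserialized)
// object received from etcd. Names here are illustrative only.
type watchEvent struct {
	resourceVersion uint64
	object          string
}

// historyWindow is a fixed-capacity cyclic buffer: when full, adding
// a new event overwrites the oldest one.
type historyWindow struct {
	events []watchEvent
	start  int // index of the oldest event
	count  int
}

func newHistoryWindow(capacity int) *historyWindow {
	return &historyWindow{events: make([]watchEvent, capacity)}
}

func (h *historyWindow) add(e watchEvent) {
	if h.count < len(h.events) {
		h.events[(h.start+h.count)%len(h.events)] = e
		h.count++
		return
	}
	// Buffer full: overwrite the oldest event and advance the window.
	h.events[h.start] = e
	h.start = (h.start + 1) % len(h.events)
}

// since returns all cached events newer than the given resourceVersion,
// oldest first - what a newly registered watcher would replay.
func (h *historyWindow) since(rv uint64) []watchEvent {
	var out []watchEvent
	for i := 0; i < h.count; i++ {
		e := h.events[(h.start+i)%len(h.events)]
		if e.resourceVersion > rv {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	w := newHistoryWindow(3)
	for rv := uint64(1); rv <= 5; rv++ {
		w.add(watchEvent{resourceVersion: rv, object: fmt.Sprintf("pod-%d", rv)})
	}
	// Capacity is 3, so the changes at versions 1 and 2 have been evicted.
	fmt.Println(w.since(3))
}
```

A client whose requested resourceVersion has already rolled out of the window would get the "too old" error and have to relist, exactly as with etcd today.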
+ +When a client sends a watch request to apiserver, instead of redirecting it to +etcd, apiserver will: + + - register a handler to receive all new changes coming from etcd + - iterate through the watch window, from the requested resourceVersion + to the head, sending filtered changes directly to the client, and blocking + the handler until this iteration has caught up + +This will be done by creating a go-routine per watcher that will be responsible +for performing the above. + +The following section describes the proposal in more detail, analyzes some +corner cases and divides the whole design into more fine-grained steps. + + +## Proposal details + +We would like the cache to be __per-resource-type__ and __optional__. Thanks to +this we will be able to: + - have different cache sizes for different resources (e.g. a bigger cache + [= longer history] for pods, which can significantly affect performance) + - avoid any overhead for objects that are watched very rarely (e.g. events + are almost never watched, but there are a lot of them) + - filter the cache for each watcher more effectively + +If we decide to support watches spanning different resources in the future and +we have an efficient indexing mechanism, it should be relatively simple to unify +the cache to be common for all the resources. + +The rest of this section describes the concrete steps that need to be done +to implement the proposal. + +1. Since we want the watch in apiserver to be optional for different resource +types, it needs to be self-contained and hidden behind a well-defined API. +This should be a layer very close to etcd - in particular all registries in +"pkg/registry/generic/etcd" should be built on top of it. +We will solve this by extracting the interface of tools.EtcdHelper and +treating this interface as the API - the whole watch mechanism in +apiserver will be hidden behind that interface. 
+Thanks to this we will get an initial implementation for free and will just +need to reimplement a few relevant functions (probably just Watch and List). +Moreover, this will not require any changes in other parts of the code. +This step is about extracting the interface of tools.EtcdHelper. + +2. Create a FIFO cache with a given capacity. In its "rolling history window" +we will store two things: + + - the resourceVersion of the object (being an etcdIndex) + - the object watched from etcd itself (in a deserialized form) + + This should be as simple as having an array and treating it as a cyclic buffer. + Obviously the resourceVersions of objects watched from etcd will be increasing, + and they are necessary for registering a new watcher that is interested in all + the changes since a given etcdIndex. + + Additionally, we should support the LIST operation, otherwise clients can never + start watching from now. We may consider passing lists through etcd, however + this will not work once we have Indexer, so we will need that information + in memory anyway. + Thus, we should support the LIST operation from the "end of the history" - i.e. + from the moment just after the newest cached watch event. It should be + pretty simple to do, because we can incrementally update this list whenever + a new watch event arrives from etcd. + We may consider reusing the existing structures cache.Store or cache.Indexer + ("pkg/client/cache") but this is not a hard requirement. + +3. Create the new implementation of the API, which will internally have a +single watch open to etcd and will store the data received from etcd in +the FIFO cache - this includes implementing registration of a new watcher, +which will start a new go-routine responsible for iterating over the cache +and sending all the objects the watcher is interested in (by applying a +filtering function) to the watcher. + +4. 
Add support for processing "error too old" from etcd, which will require: + - disconnecting all the watchers + - clearing the internal cache and relisting all objects from etcd + - starting to accept watchers again + +5. Enable watch in apiserver for some of the existing resource types - this +should require only changes at the initialization level. + +6. The next step will be to incorporate some indexing mechanism, but the +details of it are TBD. + + + +### Future optimizations: + +1. The implementation of watch in apiserver will internally open a single +watch to etcd, responsible for watching all the changes of objects of a given +resource type. However, this watch can potentially expire at any time and +reconnecting can return "too old resource version". In that case relisting is +necessary. To avoid LIST requests coming from all watchers at +the same time, we can introduce an additional etcd event type: +[EtcdResync](../../pkg/storage/etcd/etcd_watcher.go#L36) + + Whenever relisting is done to refresh the internal watch to etcd, an + EtcdResync event will be sent to all the watchers. It will contain the + full list of all the objects the watcher is interested in (appropriately + filtered) as the parameter of this watch event. + Thus, we need to create the EtcdResync event, extend watch.Interface and + its implementations to support it, and handle those events appropriately + in places like + [Reflector](../../pkg/client/cache/reflector.go) + + However, this might turn out to be an unnecessary optimization if apiserver + can always keep up (which is possible in the new design). We will work + out all necessary details at that point. 
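The fan-out described in steps 3-4 - one etcd watch feeding many registered watchers, each caught up from the cache first - can be sketched in Go as below. All names are hypothetical; for brevity the sketch delivers events synchronously through buffered channels, whereas the proposal uses a go-routine per watcher:

```go
package main

import "fmt"

// event is a simplified stand-in for a watch event from etcd.
type event struct {
	resourceVersion uint64
	object          string
}

// watcher receives events matching its filter on a private channel.
type watcher struct {
	filter func(event) bool
	result chan event
}

// cacher fans a single etcd watch out to many registered watchers.
// These names are illustrative, not the real apiserver API.
type cacher struct {
	history  []event // the rolling history window
	watchers []*watcher
}

// Watch replays cached history newer than rv, then registers the
// watcher for future events - "catch up, then stream".
func (c *cacher) Watch(rv uint64, filter func(event) bool) *watcher {
	w := &watcher{filter: filter, result: make(chan event, 16)}
	for _, e := range c.history {
		if e.resourceVersion > rv && filter(e) {
			w.result <- e
		}
	}
	c.watchers = append(c.watchers, w)
	return w
}

// dispatch is called for every new event read from the single etcd watch.
func (c *cacher) dispatch(e event) {
	c.history = append(c.history, e)
	for _, w := range c.watchers {
		if w.filter(e) {
			w.result <- e
		}
	}
}

func main() {
	c := &cacher{history: []event{{1, "pod-a"}, {2, "pod-b"}}}
	w := c.Watch(1, func(e event) bool { return e.object != "pod-b" })
	c.dispatch(event{3, "pod-c"})
	close(w.result)
	for e := range w.result {
		fmt.Println(e.resourceVersion, e.object)
	}
}
```

On "error too old" from etcd, a structure like this would simply drop all registered watchers, clear `history`, and relist - matching step 4 above.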
+ + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/apiserver-watch.md?pixel)]() + diff --git a/apiserver_watch.md b/apiserver_watch.md deleted file mode 100644 index a731c7f4..00000000 --- a/apiserver_watch.md +++ /dev/null @@ -1,184 +0,0 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - -The latest 1.0.x release of this document can be found -[here](http://releases.k8s.io/release-1.0/docs/proposals/apiserver_watch.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - -## Abstract - -In the current system, all watch requests send to apiserver are in general -redirected to etcd. This means that for every watch request to apiserver, -apiserver opens a watch on etcd. - -The purpose of the proposal is to improve the overall performance of the system -by solving the following problems: - -- having too many open watches on etcd -- avoiding deserializing/converting the same objects multiple times in different -watch results - -In the future, we would also like to add an indexing mechanism to the watch. -Although Indexer is not part of this proposal, it is supposed to be compatible -with it - in the future Indexer should be incorporated into the proposed new -watch solution in apiserver without requiring any redesign. - - -## High level design - -We are going to solve those problems by allowing many clients to watch the same -storage in the apiserver, without being redirected to etcd. - -At the high level, apiserver will have a single watch open to etcd, watching all -the objects (of a given type) without any filtering. The changes delivered from -etcd will then be stored in a cache in apiserver. This cache is in fact a -"rolling history window" that will support clients having some amount of latency -between their list and watch calls. Thus it will have a limited capacity and -whenever a new change comes from etcd when a cache is full, othe oldest change -will be remove to make place for the new one. 
- -When a client sends a watch request to apiserver, instead of redirecting it to -etcd, it will cause: - - - registering a handler to receive all new changes coming from etcd - - iterating though a watch window, starting at the requested resourceVersion - to the head and sending filtered changes directory to the client, blocking - the above until this iteration has caught up - -This will be done be creating a go-routine per watcher that will be responsible -for performing the above. - -The following section describes the proposal in more details, analyzes some -corner cases and divides the whole design in more fine-grained steps. - - -## Proposal details - -We would like the cache to be __per-resource-type__ and __optional__. Thanks to -it we will be able to: - - have different cache sizes for different resources (e.g. bigger cache - [= longer history] for pods, which can significantly affect performance) - - avoid any overhead for objects that are watched very rarely (e.g. events - are almost not watched at all, but there are a lot of them) - - filter the cache for each watcher more effectively - -If we decide to support watches spanning different resources in the future and -we have an efficient indexing mechanisms, it should be relatively simple to unify -the cache to be common for all the resources. - -The rest of this section describes the concrete steps that need to be done -to implement the proposal. - -1. Since we want the watch in apiserver to be optional for different resource -types, this needs to be self-contained and hidden behind a well defined API. -This should be a layer very close to etcd - in particular all registries: -"pkg/registry/generic/etcd" should be build on top of it. -We will solve it by turning tools.EtcdHelper by extracting its interface -and treating this interface as this API - the whole watch mechanisms in -apiserver will be hidden behind that interface. 
-Thanks to it we will get an initial implementation for free and we will just -need to reimplement few relevant functions (probably just Watch and List). -Mover, this will not require any changes in other parts of the code. -This step is about extracting the interface of tools.EtcdHelper. - -2. Create a FIFO cache with a given capacity. In its "rolling history windown" -we will store two things: - - - the resourceVersion of the object (being an etcdIndex) - - the object watched from etcd itself (in a deserialized form) - - This should be as simple as having an array an treating it as a cyclic buffer. - Obviously resourceVersion of objects watched from etcd will be increasing, but - they are necessary for registering a new watcher that is interested in all the - changes since a given etcdIndec. - - Additionally, we should support LIST operation, otherwise clients can never - start watching at now. We may consider passing lists through etcd, however - this will not work once we have Indexer, so we will need that information - in memory anyway. - Thus, we should support LIST operation from the "end of the history" - i.e. - from the moment just after the newest cached watched event. It should be - pretty simple to do, because we can incrementally update this list whenever - the new watch event is watched from etcd. - We may consider reusing existing structures cache.Store or cache.Indexer - ("pkg/client/cache") but this is not a hard requirement. - -3. Create a new implementation of the EtcdHelper interface, that will internally -have a single watch open to etcd and will store data received from etcd in the -FIFO cache. This includes implementing registration of a new watcher that will -start a new go-routine responsible for iterating over the cache and sending -appropriately filtered objects to the watcher. - -4. 
Create the new implementation of the API, that will internally have a -single watch open to etcd and will store the data received from etcd in -the FIFO cache - this includes implementing registration of a new watcher -which will start a new go-routine responsible for iterating over the cache -and sending all the objects watcher is interested in (by applying filtering -function) to the watcher. - -5. Add a support for processing "error too old" from etcd, which will require: - - disconnect all the watchers - - clear the internal cache and relist all objects from etcd - - start accepting watchers again - -6. Enable watch in apiserver for some of the existing resource types - this -should require only changes at the initialization level. - -7. The next step will be to incorporate some indexing mechanism, but details -of it are TBD. - - - -### Future optimizations: - -1. The implementation of watch in apiserver internally will open a single -watch to etcd, responsible for watching all the changes of objects of a given -resource type. However, this watch can potentially expire at any time and -reconnecting can return "too old resource version". In that case relisting is -necessary. In such case, to avoid LIST requests coming from all watchers at -the same time, we can introduce an additional etcd event type: -[EtcdResync](../../pkg/storage/etcd/etcd_watcher.go#L36) - - Whenever reslisting will be done to refresh the internal watch to etcd, - EtcdResync event will be send to all the watchers. It will contain the - full list of all the objects the watcher is interested in (appropriately - filtered) as the parameter of this watch event. 
- Thus, we need to create the EtcdResync event, extend watch.Interface and - its implementations to support it and handle those events appropriately - in places like - [Reflector](../../pkg/client/cache/reflector.go) - - However, this might turn out to be unnecessary optimization if apiserver - will always keep up (which is possible in the new design). We will work - out all necessary details at that point. - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/apiserver_watch.md?pixel)]() - -- cgit v1.2.3 From 171fb6ecc2d2ba72d78b8c1440ec68ebb1aa5bcb Mon Sep 17 00:00:00 2001 From: Wojciech Tyczynski Date: Fri, 31 Jul 2015 09:52:36 +0200 Subject: Kubmark proposal --- scalability-testing.md | 105 +++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 105 insertions(+) create mode 100644 scalability-testing.md diff --git a/scalability-testing.md b/scalability-testing.md new file mode 100644 index 00000000..cf87d84d --- /dev/null +++ b/scalability-testing.md @@ -0,0 +1,105 @@ + + + + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest 1.0.x release of this document can be found +[here](http://releases.k8s.io/release-1.0/docs/proposals/scalability-testing.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + + + + +## Background + +We have a goal of being able to scale to 1000-node clusters by the end of 2015. +As a result, we need to be able to run some kind of regression tests and deliver +a mechanism so that developers can test their changes with respect to performance. + +Ideally, we would like to also run performance tests on PRs - although it might +be impossible to run them on every single PR, we may introduce the possibility for +a reviewer to trigger them if the change has a non-obvious impact on performance +(something like "k8s-bot run scalability tests please" should be feasible). + +However, running performance tests on 1000-node clusters (or even bigger ones +in the future) is a non-starter. Thus, we need some more sophisticated infrastructure +to simulate big clusters on a relatively small number of machines and/or cores. + +This document describes two approaches to tackling this problem. +Once we have a better understanding of their consequences, we may want to +decide to drop one of them, but we are not yet in that position. + + +## Proposal 1 - Kubmark + +In this proposal we are focusing on scalability testing of the master components. +We do NOT focus on node scalability - that issue should be handled separately. + +Since we do not focus on node performance, we don't need a real Kubelet or +KubeProxy - in fact we don't even need to start real containers. +All we actually need is some Kubelet-like and KubeProxy-like components +that will simulate the load on apiserver that their real equivalents +generate (e.g. 
sending NodeStatus updates, watching for pods, watching for +endpoints (KubeProxy), etc.). + +What needs to be done: + +1. Determine what requests both KubeProxy and Kubelet send to apiserver. +2. Create a KubeletSim that generates the same load on apiserver as the + real Kubelet, but does not start any containers. In the initial version we + can assume that pods never die, so it is enough to just react to the state + changes read from apiserver. + TBD: Maybe we can reuse a real Kubelet for it by just injecting some "fake" + interfaces into it? +3. Similarly, create a KubeProxySim that generates the same load on apiserver + as a real KubeProxy. Again, since we are not planning to talk to the + containers, it basically doesn't need to do anything apart from that. + TBD: Maybe we can reuse a real KubeProxy for it by just injecting some "fake" + interfaces into it? +4. Refactor the kube-up/kube-down scripts (or create new ones) to allow starting + a cluster with KubeletSim and KubeProxySim instead of the real ones, and put + a bunch of them on a single machine. +5. Create a load generator for it (initially it would probably be enough to + reuse the tests that we use in the gce-scalability suite). + + +## Proposal 2 - Oversubscribing + +The other method we are proposing is to oversubscribe resources, +or in essence enable a single node to look like many separate nodes even though +they reside on a single host. This is a well-established pattern in many different +cluster managers (for more details see +http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS/doc.prd/index.html ). +There are a couple of different ways to accomplish this, but the most viable method +is to run privileged kubelet pods under a host's kubelet process. These pods then +register back with the master via the introspective service using modified names +so as not to collide. + +Complications may currently exist around container tracking and ownership in docker. 
+ + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/scalability-testing.md?pixel)]() + -- cgit v1.2.3