From 3f99d83ca6a809139b36e0da71ee2339a2336ae4 Mon Sep 17 00:00:00 2001 From: Wojciech Tyczynski Date: Fri, 31 Jul 2015 09:54:05 +0200 Subject: Fixes to watch in apiserver proposal --- apiserver-watch.md | 178 +++++++++++++++++++++++++++++++++++++++++++++++++++ apiserver_watch.md | 184 ----------------------------------------------------- 2 files changed, 178 insertions(+), 184 deletions(-) create mode 100644 apiserver-watch.md delete mode 100644 apiserver_watch.md diff --git a/apiserver-watch.md b/apiserver-watch.md new file mode 100644 index 00000000..02a6e6c8 --- /dev/null +++ b/apiserver-watch.md @@ -0,0 +1,178 @@ + + + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest 1.0.x release of this document can be found +[here](http://releases.k8s.io/release-1.0/docs/proposals/apiserver-watch.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + + + + +## Abstract + +In the current system, all watch requests sent to apiserver are in general +redirected to etcd. This means that for every watch request to apiserver, +apiserver opens a watch on etcd. + +The purpose of this proposal is to improve the overall performance of the system +by solving the following problems: + +- having too many open watches on etcd +- deserializing/converting the same objects multiple times in different +watch results + +In the future, we would also like to add an indexing mechanism to the watch. +Although Indexer is not part of this proposal, it is supposed to be compatible +with it - in the future Indexer should be incorporated into the proposed new +watch solution in apiserver without requiring any redesign. + + +## High level design + +We are going to solve those problems by allowing many clients to watch the same +storage in the apiserver, without being redirected to etcd. + +At a high level, apiserver will have a single watch open to etcd, watching all +the objects (of a given type) without any filtering. The changes delivered from +etcd will then be stored in a cache in apiserver. This cache is in fact a +"rolling history window" that will support clients having some amount of latency +between their list and watch calls. Thus it will have a limited capacity, and +whenever a new change arrives from etcd while the cache is full, the oldest change +will be removed to make room for the new one. 
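The rolling history window described above can be sketched as a fixed-capacity cyclic buffer keyed by resourceVersion. This is an illustrative Go sketch only - the type and function names are hypothetical, not part of the proposal:

```go
package main

import "fmt"

// watchEvent pairs a resourceVersion with the (already deserialized)
// object received from etcd. Names here are illustrative only.
type watchEvent struct {
	resourceVersion uint64
	object          string
}

// historyWindow is a fixed-capacity cyclic buffer: when full, adding
// a new event overwrites the oldest one.
type historyWindow struct {
	events []watchEvent
	start  int // index of the oldest event
	count  int
}

func newHistoryWindow(capacity int) *historyWindow {
	return &historyWindow{events: make([]watchEvent, capacity)}
}

func (h *historyWindow) add(e watchEvent) {
	if h.count < len(h.events) {
		h.events[(h.start+h.count)%len(h.events)] = e
		h.count++
		return
	}
	// Buffer full: overwrite the oldest event and advance the window.
	h.events[h.start] = e
	h.start = (h.start + 1) % len(h.events)
}

// since returns all cached events newer than the given resourceVersion,
// oldest first - what a newly registered watcher would replay.
func (h *historyWindow) since(rv uint64) []watchEvent {
	var out []watchEvent
	for i := 0; i < h.count; i++ {
		e := h.events[(h.start+i)%len(h.events)]
		if e.resourceVersion > rv {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	w := newHistoryWindow(3)
	for rv := uint64(1); rv <= 5; rv++ {
		w.add(watchEvent{resourceVersion: rv, object: fmt.Sprintf("pod-%d", rv)})
	}
	// Capacity is 3, so the changes at versions 1 and 2 have been evicted.
	fmt.Println(w.since(3))
}
```

A client whose requested resourceVersion has already rolled out of the window would get the "too old" error and have to relist, exactly as with etcd today.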
+ +When a client sends a watch request to apiserver, instead of redirecting it to +etcd, apiserver will: + + - register a handler to receive all new changes coming from etcd + - iterate through the watch window, from the requested resourceVersion + to the head, sending filtered changes directly to the client, and blocking + the handler until this iteration has caught up + +This will be done by creating a go-routine per watcher that will be responsible +for performing the above. + +The following section describes the proposal in more detail, analyzes some +corner cases and divides the whole design into more fine-grained steps. + + +## Proposal details + +We would like the cache to be __per-resource-type__ and __optional__. Thanks to +this we will be able to: + - have different cache sizes for different resources (e.g. a bigger cache + [= longer history] for pods, which can significantly affect performance) + - avoid any overhead for objects that are watched very rarely (e.g. events + are almost never watched, but there are a lot of them) + - filter the cache for each watcher more effectively + +If we decide to support watches spanning different resources in the future and +we have an efficient indexing mechanism, it should be relatively simple to unify +the cache to be common for all the resources. + +The rest of this section describes the concrete steps that need to be done +to implement the proposal. + +1. Since we want the watch in apiserver to be optional for different resource +types, it needs to be self-contained and hidden behind a well-defined API. +This should be a layer very close to etcd - in particular all registries in +"pkg/registry/generic/etcd" should be built on top of it. +We will solve this by extracting the interface of tools.EtcdHelper and +treating this interface as the API - the whole watch mechanism in +apiserver will be hidden behind that interface. 
+Thanks to this we will get an initial implementation for free and will just +need to reimplement a few relevant functions (probably just Watch and List). +Moreover, this will not require any changes in other parts of the code. +This step is about extracting the interface of tools.EtcdHelper. + +2. Create a FIFO cache with a given capacity. In its "rolling history window" +we will store two things: + + - the resourceVersion of the object (being an etcdIndex) + - the object watched from etcd itself (in a deserialized form) + + This should be as simple as having an array and treating it as a cyclic buffer. + Obviously the resourceVersions of objects watched from etcd will be increasing, + and they are necessary for registering a new watcher that is interested in all + the changes since a given etcdIndex. + + Additionally, we should support the LIST operation, otherwise clients can never + start watching from now. We may consider passing lists through etcd, however + this will not work once we have Indexer, so we will need that information + in memory anyway. + Thus, we should support the LIST operation from the "end of the history" - i.e. + from the moment just after the newest cached watch event. It should be + pretty simple to do, because we can incrementally update this list whenever + a new watch event arrives from etcd. + We may consider reusing the existing structures cache.Store or cache.Indexer + ("pkg/client/cache") but this is not a hard requirement. + +3. Create the new implementation of the API, which will internally have a +single watch open to etcd and will store the data received from etcd in +the FIFO cache - this includes implementing registration of a new watcher, +which will start a new go-routine responsible for iterating over the cache +and sending all the objects the watcher is interested in (by applying a +filtering function) to the watcher. + +4. 
Add support for processing "error too old" from etcd, which will require: + - disconnecting all the watchers + - clearing the internal cache and relisting all objects from etcd + - starting to accept watchers again + +5. Enable watch in apiserver for some of the existing resource types - this +should require only changes at the initialization level. + +6. The next step will be to incorporate some indexing mechanism, but the +details of it are TBD. + + + +### Future optimizations: + +1. The implementation of watch in apiserver will internally open a single +watch to etcd, responsible for watching all the changes of objects of a given +resource type. However, this watch can potentially expire at any time and +reconnecting can return "too old resource version". In that case relisting is +necessary. To avoid LIST requests coming from all watchers at +the same time, we can introduce an additional etcd event type: +[EtcdResync](../../pkg/storage/etcd/etcd_watcher.go#L36) + + Whenever relisting is done to refresh the internal watch to etcd, an + EtcdResync event will be sent to all the watchers. It will contain the + full list of all the objects the watcher is interested in (appropriately + filtered) as the parameter of this watch event. + Thus, we need to create the EtcdResync event, extend watch.Interface and + its implementations to support it, and handle those events appropriately + in places like + [Reflector](../../pkg/client/cache/reflector.go) + + However, this might turn out to be an unnecessary optimization if apiserver + can always keep up (which is possible in the new design). We will work + out all necessary details at that point. 
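The fan-out described in steps 3-4 - one etcd watch feeding many registered watchers, each caught up from the cache first - can be sketched in Go as below. All names are hypothetical; for brevity the sketch delivers events synchronously through buffered channels, whereas the proposal uses a go-routine per watcher:

```go
package main

import "fmt"

// event is a simplified stand-in for a watch event from etcd.
type event struct {
	resourceVersion uint64
	object          string
}

// watcher receives events matching its filter on a private channel.
type watcher struct {
	filter func(event) bool
	result chan event
}

// cacher fans a single etcd watch out to many registered watchers.
// These names are illustrative, not the real apiserver API.
type cacher struct {
	history  []event // the rolling history window
	watchers []*watcher
}

// Watch replays cached history newer than rv, then registers the
// watcher for future events - "catch up, then stream".
func (c *cacher) Watch(rv uint64, filter func(event) bool) *watcher {
	w := &watcher{filter: filter, result: make(chan event, 16)}
	for _, e := range c.history {
		if e.resourceVersion > rv && filter(e) {
			w.result <- e
		}
	}
	c.watchers = append(c.watchers, w)
	return w
}

// dispatch is called for every new event read from the single etcd watch.
func (c *cacher) dispatch(e event) {
	c.history = append(c.history, e)
	for _, w := range c.watchers {
		if w.filter(e) {
			w.result <- e
		}
	}
}

func main() {
	c := &cacher{history: []event{{1, "pod-a"}, {2, "pod-b"}}}
	w := c.Watch(1, func(e event) bool { return e.object != "pod-b" })
	c.dispatch(event{3, "pod-c"})
	close(w.result)
	for e := range w.result {
		fmt.Println(e.resourceVersion, e.object)
	}
}
```

On "error too old" from etcd, a structure like this would simply drop all registered watchers, clear `history`, and relist - matching step 4 above.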
+ + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/apiserver-watch.md?pixel)]() + diff --git a/apiserver_watch.md b/apiserver_watch.md deleted file mode 100644 index a731c7f4..00000000 --- a/apiserver_watch.md +++ /dev/null @@ -1,184 +0,0 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - -The latest 1.0.x release of this document can be found -[here](http://releases.k8s.io/release-1.0/docs/proposals/apiserver_watch.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - -## Abstract - -In the current system, all watch requests send to apiserver are in general -redirected to etcd. This means that for every watch request to apiserver, -apiserver opens a watch on etcd. - -The purpose of the proposal is to improve the overall performance of the system -by solving the following problems: - -- having too many open watches on etcd -- avoiding deserializing/converting the same objects multiple times in different -watch results - -In the future, we would also like to add an indexing mechanism to the watch. -Although Indexer is not part of this proposal, it is supposed to be compatible -with it - in the future Indexer should be incorporated into the proposed new -watch solution in apiserver without requiring any redesign. - - -## High level design - -We are going to solve those problems by allowing many clients to watch the same -storage in the apiserver, without being redirected to etcd. - -At the high level, apiserver will have a single watch open to etcd, watching all -the objects (of a given type) without any filtering. The changes delivered from -etcd will then be stored in a cache in apiserver. This cache is in fact a -"rolling history window" that will support clients having some amount of latency -between their list and watch calls. Thus it will have a limited capacity and -whenever a new change comes from etcd when a cache is full, othe oldest change -will be remove to make place for the new one. 
- -When a client sends a watch request to apiserver, instead of redirecting it to -etcd, it will cause: - - - registering a handler to receive all new changes coming from etcd - - iterating though a watch window, starting at the requested resourceVersion - to the head and sending filtered changes directory to the client, blocking - the above until this iteration has caught up - -This will be done be creating a go-routine per watcher that will be responsible -for performing the above. - -The following section describes the proposal in more details, analyzes some -corner cases and divides the whole design in more fine-grained steps. - - -## Proposal details - -We would like the cache to be __per-resource-type__ and __optional__. Thanks to -it we will be able to: - - have different cache sizes for different resources (e.g. bigger cache - [= longer history] for pods, which can significantly affect performance) - - avoid any overhead for objects that are watched very rarely (e.g. events - are almost not watched at all, but there are a lot of them) - - filter the cache for each watcher more effectively - -If we decide to support watches spanning different resources in the future and -we have an efficient indexing mechanisms, it should be relatively simple to unify -the cache to be common for all the resources. - -The rest of this section describes the concrete steps that need to be done -to implement the proposal. - -1. Since we want the watch in apiserver to be optional for different resource -types, this needs to be self-contained and hidden behind a well defined API. -This should be a layer very close to etcd - in particular all registries: -"pkg/registry/generic/etcd" should be build on top of it. -We will solve it by turning tools.EtcdHelper by extracting its interface -and treating this interface as this API - the whole watch mechanisms in -apiserver will be hidden behind that interface. 
-Thanks to it we will get an initial implementation for free and we will just -need to reimplement few relevant functions (probably just Watch and List). -Mover, this will not require any changes in other parts of the code. -This step is about extracting the interface of tools.EtcdHelper. - -2. Create a FIFO cache with a given capacity. In its "rolling history windown" -we will store two things: - - - the resourceVersion of the object (being an etcdIndex) - - the object watched from etcd itself (in a deserialized form) - - This should be as simple as having an array an treating it as a cyclic buffer. - Obviously resourceVersion of objects watched from etcd will be increasing, but - they are necessary for registering a new watcher that is interested in all the - changes since a given etcdIndec. - - Additionally, we should support LIST operation, otherwise clients can never - start watching at now. We may consider passing lists through etcd, however - this will not work once we have Indexer, so we will need that information - in memory anyway. - Thus, we should support LIST operation from the "end of the history" - i.e. - from the moment just after the newest cached watched event. It should be - pretty simple to do, because we can incrementally update this list whenever - the new watch event is watched from etcd. - We may consider reusing existing structures cache.Store or cache.Indexer - ("pkg/client/cache") but this is not a hard requirement. - -3. Create a new implementation of the EtcdHelper interface, that will internally -have a single watch open to etcd and will store data received from etcd in the -FIFO cache. This includes implementing registration of a new watcher that will -start a new go-routine responsible for iterating over the cache and sending -appropriately filtered objects to the watcher. - -4. 
Create the new implementation of the API, that will internally have a -single watch open to etcd and will store the data received from etcd in -the FIFO cache - this includes implementing registration of a new watcher -which will start a new go-routine responsible for iterating over the cache -and sending all the objects watcher is interested in (by applying filtering -function) to the watcher. - -5. Add a support for processing "error too old" from etcd, which will require: - - disconnect all the watchers - - clear the internal cache and relist all objects from etcd - - start accepting watchers again - -6. Enable watch in apiserver for some of the existing resource types - this -should require only changes at the initialization level. - -7. The next step will be to incorporate some indexing mechanism, but details -of it are TBD. - - - -### Future optimizations: - -1. The implementation of watch in apiserver internally will open a single -watch to etcd, responsible for watching all the changes of objects of a given -resource type. However, this watch can potentially expire at any time and -reconnecting can return "too old resource version". In that case relisting is -necessary. In such case, to avoid LIST requests coming from all watchers at -the same time, we can introduce an additional etcd event type: -[EtcdResync](../../pkg/storage/etcd/etcd_watcher.go#L36) - - Whenever reslisting will be done to refresh the internal watch to etcd, - EtcdResync event will be send to all the watchers. It will contain the - full list of all the objects the watcher is interested in (appropriately - filtered) as the parameter of this watch event. 
- Thus, we need to create the EtcdResync event, extend watch.Interface and - its implementations to support it and handle those events appropriately - in places like - [Reflector](../../pkg/client/cache/reflector.go) - - However, this might turn out to be unnecessary optimization if apiserver - will always keep up (which is possible in the new design). We will work - out all necessary details at that point. - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/apiserver_watch.md?pixel)]() - -- cgit v1.2.3 From 171fb6ecc2d2ba72d78b8c1440ec68ebb1aa5bcb Mon Sep 17 00:00:00 2001 From: Wojciech Tyczynski Date: Fri, 31 Jul 2015 09:52:36 +0200 Subject: Kubmark proposal --- scalability-testing.md | 105 +++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 105 insertions(+) create mode 100644 scalability-testing.md diff --git a/scalability-testing.md b/scalability-testing.md new file mode 100644 index 00000000..cf87d84d --- /dev/null +++ b/scalability-testing.md @@ -0,0 +1,105 @@ + + + + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest 1.0.x release of this document can be found +[here](http://releases.k8s.io/release-1.0/docs/proposals/scalability-testing.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + + + + +## Background + +We have a goal of being able to scale to 1000-node clusters by the end of 2015. +As a result, we need to be able to run some kind of regression tests and deliver +a mechanism so that developers can test their changes with respect to performance. + +Ideally, we would like to also run performance tests on PRs - although it might +be impossible to run them on every single PR, we may introduce the possibility for +a reviewer to trigger them if the change has a non-obvious impact on performance +(something like "k8s-bot run scalability tests please" should be feasible). + +However, running performance tests on 1000-node clusters (or even bigger ones +in the future) is a non-starter. Thus, we need some more sophisticated infrastructure +to simulate big clusters on a relatively small number of machines and/or cores. + +This document describes two approaches to tackling this problem. +Once we have a better understanding of their consequences, we may want to +decide to drop one of them, but we are not yet in that position. + + +## Proposal 1 - Kubmark + +In this proposal we are focusing on scalability testing of the master components. +We do NOT focus on node scalability - that issue should be handled separately. + +Since we do not focus on node performance, we don't need a real Kubelet or +KubeProxy - in fact we don't even need to start real containers. +All we actually need is some Kubelet-like and KubeProxy-like components +that will simulate the load on apiserver that their real equivalents +generate (e.g. 
sending NodeStatus updates, watching for pods, watching for +endpoints (KubeProxy), etc.). + +What needs to be done: + +1. Determine what requests both KubeProxy and Kubelet send to apiserver. +2. Create a KubeletSim that generates the same load on apiserver as the + real Kubelet, but does not start any containers. In the initial version we + can assume that pods never die, so it is enough to just react to the state + changes read from apiserver. + TBD: Maybe we can reuse a real Kubelet for it by just injecting some "fake" + interfaces into it? +3. Similarly, create a KubeProxySim that generates the same load on apiserver + as a real KubeProxy. Again, since we are not planning to talk to the + containers, it basically doesn't need to do anything apart from that. + TBD: Maybe we can reuse a real KubeProxy for it by just injecting some "fake" + interfaces into it? +4. Refactor the kube-up/kube-down scripts (or create new ones) to allow starting + a cluster with KubeletSim and KubeProxySim instead of the real ones, and put + a bunch of them on a single machine. +5. Create a load generator for it (initially it would probably be enough to + reuse the tests that we use in the gce-scalability suite). + + +## Proposal 2 - Oversubscribing + +The other method we are proposing is to oversubscribe resources, +or in essence enable a single node to look like many separate nodes even though +they reside on a single host. This is a well-established pattern in many different +cluster managers (for more details see +http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS/doc.prd/index.html ). +There are a couple of different ways to accomplish this, but the most viable method +is to run privileged kubelet pods under a host's kubelet process. These pods then +register back with the master via the introspective service using modified names +so as not to collide. + +Complications may currently exist around container tracking and ownership in docker. 
+ + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/scalability-testing.md?pixel)]() + -- cgit v1.2.3