| author | k8s-ci-robot <k8s-ci-robot@users.noreply.github.com> | 2018-03-13 09:54:09 -0700 |
|---|---|---|
| committer | GitHub <noreply@github.com> | 2018-03-13 09:54:09 -0700 |
| commit | 93e9ec68bc1fcbe642541556f9266147e6f48091 (patch) | |
| tree | 60aea78823d31f194998c9532fbe64c8ea2256e6 | |
| parent | f9bc8548924f18ebffedfd07998d914c7779fe94 (diff) | |
| parent | 16503b7bbd22f1cc19ce21bbf4713fb1e01a364a (diff) | |
Merge pull request #593 from irfanurrehman/federated-hpa-design
[Federation] Federated hpa design
| -rw-r--r-- | contributors/design-proposals/multicluster/federated-hpa.md | 271 |
1 files changed, 271 insertions, 0 deletions
diff --git a/contributors/design-proposals/multicluster/federated-hpa.md b/contributors/design-proposals/multicluster/federated-hpa.md
new file mode 100644
index 00000000..7639425b
--- /dev/null
+++ b/contributors/design-proposals/multicluster/federated-hpa.md
@@ -0,0 +1,271 @@
+# Federated Pod Autoscaler
+
+# Requirements & Design Document
+
+irfan.rehman@huawei.com, quinton.hoole@huawei.com
+
+# Use cases
+
+1 - Users can schedule replicas of the same application, across the
+federated clusters, using replicaset (or deployment).
+Users however further might need to let the replicas be scaled
+independently in each cluster, depending on the current usage metrics
+of the replicas; including the CPU, memory and application defined
+custom metrics.
+
+2 - As stated in the previous use case, a federation user schedules
+replicas of the same application, into federated clusters and subsequently
+creates a horizontal pod autoscaler targeting the object responsible for
+the replicas. The user would want the auto-scaling to continue based on
+the in-cluster metrics, even if for some reason, there is an outage at
+federation level. The user (or other users) should still be able to access
+the deployed application in all federated clusters. Further, if the
+load on the deployed app varies, the autoscaler should continue taking
+care of scaling the replicas for a smooth user experience.
+
+3 - A federation that consists of an on-premise cluster and a cluster
+running in a public cloud has a user workload (e.g. deployment or rs)
+preferentially running in the on-premise cluster. However if there are
+spikes in the app usage, such that the capacity in the on-premise cluster
+is not sufficient, the workload should be able to get scaled beyond the
+on-premise cluster boundary and into the other clusters which are part
+of this federation.
+
+Please refer to some additional use cases, which partly led to the derivation
+of the above use cases, and are listed in the **glossary** section of this document.
+
+# User workflow
+
+The user wants to schedule a set of common workloads across federated clusters.
+He creates a replicaset or a deployment to schedule the workload (with or
+without preferences). The federation then distributes the replicas of the
+given workload into the federated clusters. As the user at this point is
+unaware of the exact usage metrics of the individual pods created in the
+federated clusters, he creates an HPA in the federation, providing metric
+parameters to be used in the scale request for a resource. It is now the
+responsibility of this HPA to monitor the relevant resource metrics, and the
+scaling of the pods per cluster is then controlled by the associated HPA.
+
+# Alternative approaches
+
+## Design Alternative 1
+
+Make the autoscaling resource available and implement support for
+horizontalpodautoscalers objects at federation. The HPA API resource
+will need to be exposed at the federation level, which can follow the
+version similar to one implemented in the latest k8s cluster release.
+
+Once the HPA object is created at federation, the federation controller
+creates and monitors a similar HPA object (partitioning the min and max values)
+in each of the federated clusters. Based on the metadata in spec of the HPA
+describing the scaleTargetRef, the HPA will be applied on the already existing
+target objects. If the target object is not present in the cluster (either
+because it is not created until now, or deleted for some reason), the HPA will
+still exist but no action will be taken. The HPA's action will become
+applicable when the target object is created in the given cluster anytime in
+future. Also as stated already the federation controller will need to partition
+the min and max values appropriately into the federated clusters among the HPA
+objects such that the total of min and that of max replicas satisfies the
+constraints specified by the user at federation. The point of control over the
+scaling of replicas will lie locally with the federated hpa controller. The
+federated controller will however watch the cluster local HPAs wrt current
+replicas of the target objects and will do intelligent dynamic adjustments of
+min and max values of the HPA replicas across the clusters based on the run time
+conditions.
+
+The federation controller by default will distribute the min and max replicas of the
+HPA equally among all clusters. The min values will first be distributed such that
+any cluster into which the replicas are distributed does not get a min replicas
+less than 1. This means that the HPA can actually be created in fewer
+ready clusters than are available in federation. Once this distribution happens, the
+max replicas of the hpa will be distributed across all those clusters into which
+the HPA needs to be created. The default distribution can be overridden using the
+annotations on the HPA object, very similar to the annotations on federated
+replicaset object as described
+[here](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federated-replicasets.md#federatereplicaset-preferences).
+
+One of the points to note here is that doing this brings a two point control on
+the number of replicas of the target object, one by the federated target object (rs or
+deployment) and the other by the hpa local to the federated cluster. The solution to this
+is discussed in the following section. An additional note here is that the
+preferences would consider use of only minreplicas and maxreplicas in this phase
+of implementation and weights will be discarded for this alternative design.
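
Editorial illustration (not part of the commit): the default equal distribution described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the controller's implementation. The function name `distribute_min_max`, picking clusters in sorted-name order, and giving the remainder to the first clusters are all hypothetical choices that the proposal does not specify.

```python
def distribute_min_max(clusters, fed_min, fed_max):
    """Split federated min/max replicas evenly across ready clusters.

    Hypothetical helper: cluster ordering and remainder handling are
    assumptions made for illustration only.
    """
    if fed_min < 1 or fed_max < fed_min:
        raise ValueError("need 1 <= fed_min <= fed_max")
    # No cluster may get min < 1, so at most fed_min clusters receive an HPA.
    chosen = sorted(clusters)[:fed_min]
    if not chosen:
        return {}
    n = len(chosen)
    result = {}
    for i, name in enumerate(chosen):
        # Equal share, with the remainder spread over the first clusters.
        cluster_min = fed_min // n + (1 if i < fed_min % n else 0)
        cluster_max = fed_max // n + (1 if i < fed_max % n else 0)
        result[name] = (cluster_min, cluster_max)
    return result
```

With three ready clusters and a federated min of 2, only two clusters receive a local HPA, which matches the note that the HPA can be created in fewer clusters than are available. The per-cluster values always sum to the federated min and max.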
+
+### Rebalancing of workload replicas and control over the same.
+
+The current implementation of federated replicasets (and deployments) first
+distributes the replicas into underlying clusters and then monitors the status
+of the pods in each cluster. In case there are clusters which have fewer active pods
+than the federation reconciler desires, the federation control plane will
+trigger creation of the missing pods (which federation considers missing), or
+in the other case would trigger removal of pods, if the control plane considers that
+the given cluster has more pods than needed. This is something which counters
+the role of HPA in an individual cluster. To handle this, the knowledge that an HPA
+is active separately targeting this object has to be percolated to the federation
+control plane monitoring the individual replicas such that the federation control
+plane stops reconciling the replicas in the individual clusters. In other words
+the link between the HPA and the corresponding objects will need to be
+maintained and if an HPA is active, the reconcile process of other federation
+controllers (aka replicaset and deployment controllers) would stop updating and/or
+rebalancing the replicas in and across the underlying clusters. The reconcile
+of the objects (rs or deployment) would still continue, to handle the scenario
+of the object missing from any given federated cluster.
+The mechanism to achieve this behaviour shall be as below:
+ - User creates a workload object (for example rs) in federation.
+ - User then creates an HPA object in federation (this step and the previous
+   step can follow either order of creation).
+ - The rs as an object will exist in federation control plane with or without
+   the user preferences and/or cluster selection annotations.
+ - The HPA controller will first evaluate which cluster(s) get the replicas
+   and which don't (if any). This list of clusters will be a subset of the
+   cluster selector already applied on the hpa object.
+ - The HPA controller will apply this list on the federated rs object as the
+   cluster selection annotation overriding the user provided preferences (if any).
+   The control over the placement of workload replicas and the add on preferences
+   will thus lie completely with the HPA objects. This is an important assumption
+   that the user of these federated objects interacting with each other should be
+   aware of; and if the user needs to place replicas in specific clusters, together
+   with workload autoscaling he/she should apply these preferences on the HPA
+   object. Any preferences applied on the workload object (rs or deployment) will
+   be overridden.
+ - The target workload object (for example rs) replicas will be kept unchanged
+   in the cluster which already has the replicas, will be created with one replica
+   if the particular cluster does not have the same and HPA calculation resulted
+   in some replicas for that cluster and deleted from the clusters which have the
+   replicas and the federated HPA calculations result in no replicas for that
+   particular cluster.
+ - The desired replicas per cluster as per the federated HPA dynamic rebalance
+   mechanism, elaborated in the next section, will be set on individual clusters
+   local HPA, which in turn will set the same on the target local object.
+
+### Dynamic HPA min/max rebalance
+
+The proposal in this section can be used to improve the distribution of replicas
+across the clusters such that there are more replicas in those clusters, where
+they are needed more. The federation hpa controller will monitor the status of
+the local HPAs in the federated clusters and update the min and/or max values
+set on the local HPAs as below (assuming that all previous steps are done and
+local HPAs in federated clusters are active):
+
+1. At some point, one or more of the cluster HPAs hit the upper limit of their
+allowed scaling such that _DesiredReplicas == MaxReplicas_; or more appropriately
+_CurrentReplicas == DesiredReplicas == MaxReplicas_.
+
+2. If the above is observed the Federation HPA tries to transfer allocation
+of _MaxReplicas_ from clusters where it is not needed (_DesiredReplicas < MaxReplicas_)
+or where it cannot be used, e.g. due to capacity constraints
+(_CurrentReplicas < DesiredReplicas <= MaxReplicas_) to the clusters which have
+reached their upper limit (1 above).
+
+3. It will be taken care that _MaxReplicas_ does not become less than _MinReplicas_
+in any of the clusters in this redistribution. Additionally if the usage of the same
+could be established, _MinReplicas_ can also be distributed as in 4 below.
+
+4. An exactly similar approach can also be applied to _MinReplicas_ of the local HPAs,
+so as to reduce the min from those clusters, where
+_CurrentReplicas == DesiredReplicas == MinReplicas_ and the observed average resource
+metric usage (on the HPA) is less than a given threshold, to those clusters,
+where the _DesiredReplicas > MinReplicas_.
+
+However, as stated in 3 above, the approach of distribution will first be implemented
+only for _MaxReplicas_ to establish its utility, before implementing the same for _MinReplicas_.
+
+## Design Alternative 2
+
+Same as the previous one, the API will need to be exposed at federation.
+
+However, when the request to create HPA is sent to federation, the federation controller
+will not create the HPA in the federated clusters. The HPA object will reside in the
+federation API server only. The federation controller will need to get a metrics
+client to each of the federated clusters and collect all the relevant metrics
+periodically from all those clusters. The federation controller will further calculate
+the current average metrics utilisation across all clusters (using the collected metrics)
+of the given target object and calculate the replicas globally to attain the target
+utilisation as specified in the federation HPA. After arriving at the target replicas,
+the target replica number is set directly on the target object (replicaset, deployment, ..)
+using its scale sub-resource at federation. It will be left to the actual target object
+controller (for example RS controller) to distribute the replicas accordingly into the
+federated clusters. The point of control over the scaling of replicas will lie completely
+with the federation controllers.
+
+### Algorithm (for alternative 2)
+
+Federated HPA (FHPA), from every cluster gets:
+
+- ```avg_i``` average metric value (like CPU utilization) for all pods matching the
+deployment/rs selector.
+- ```count_i``` number of replicas that were used to calculate the average.
+
+To calculate the target number of replicas HPA calculates the sum of all metrics from
+all clusters:
+
+```sum(avg_i * count_i)``` and divides it by target metric value. The target replica
+count (validated against HPA min/max and thresholds) is set on Federated
+Deployment/replica set. So the deployment has the correct number of replicas
+(that should match the desired metric value) and provides all of the rebalancing/failover
+mechanisms.
+
+Further, this can be expanded such that FHPA places replicas where they are needed the
+most (in clusters that have the most traffic). For that FHPA would adjust the weights in
+Federated Deployment. Each cluster will get the weight of ```100 * avg_i/sum(avg_i)```.
+Weights hint Federated Deployment where to put replicas. But they are only hints so
+if placing a replica in the desired cluster is not possible then it will be placed elsewhere,
+which is probably better than not having the replica at all.
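
Editorial illustration (not part of the commit): a minimal Python sketch of the alternative 2 calculation, using the `avg_i` and `count_i` values defined above. The function names, the `stats` dictionary layout, and rounding up with `math.ceil` are assumptions for illustration. The proposal only says the sum is divided by the target metric value and validated against the HPA min/max.

```python
import math


def fhpa_target_replicas(stats, target, min_replicas, max_replicas):
    """Global replica count from per-cluster (avg_i, count_i) pairs.

    stats maps a cluster name to (avg_i, count_i). Rounding up is an
    assumption; the proposal does not state a rounding rule.
    """
    total = sum(avg * count for avg, count in stats.values())
    desired = math.ceil(total / target)
    # Validate against the federated HPA min/max.
    return max(min_replicas, min(max_replicas, desired))


def fhpa_weights(stats):
    """Per-cluster weight hints: 100 * avg_i / sum(avg_i)."""
    total_avg = sum(avg for avg, _ in stats.values())
    if total_avg == 0:
        # No load observed anywhere: express no preference.
        return {name: 0.0 for name in stats}
    return {name: 100 * avg / total_avg for name, (avg, _) in stats.items()}
```

For example, with cluster `a` at 80% average utilisation over 4 pods, cluster `b` at 40% over 2 pods, and a target of 50%, the sum is 400, which yields 8 replicas globally, and cluster `a` receives roughly two thirds of the weight.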
+
+# Other Scenario
+
+Other scenarios, for example rolling updates (when user updates the deployment or RS),
+recreation of the object (when user specifies the strategy as recreate while updating
+the object), will continue to be handled the way they are handled in an individual k8s
+cluster. Additionally there is a shortcoming in the current implementation of the
+federated deployments rolling update. There is an existing proposal as part of the
+[federated deployment design doc](https://github.com/kubernetes/community/pull/325).
+Given it is implemented, the rolling updates for a federated deployment while a
+federated HPA is active on the same object will also work fine.
+
+# Conclusion
+
+The design alternative 2 has the following major drawbacks, which are sufficient to
+discard it as a probable implementation option:
+- This option needs the federation control plane controller to collect metrics
+data from each cluster, which is an overhead that grows
+with the number of federated clusters in a given federation.
+- The monitoring and update of objects which are targeted by the federated HPA object
+(when needed) for a particular federated cluster would stop if for whatever reason
+the network link between the federated cluster and federation control plane is severed.
+A bigger problem can happen in case of an outage of the federation control plane
+altogether.
+
+In Design Alternative 1 the autoscaling of replicas will continue, even if a given
+cluster gets disconnected from federation or in case of a federation control plane
+outage. This would happen because the local HPAs with the last known maxreplicas and
+minreplicas would exist in the local clusters. Additionally in this alternative there
+is no need for collection and processing of the pod metrics for the target object from
+each individual cluster.
+This document proposes to use ***design alternative 1*** as the preferred implementation.
+
+# Glossary
+
+These use cases are specified using the terminology partly specific to telecom products/platforms:
+
+1 - A telecom service provider has a large number of base stations, for a particular region,
+each with some set of virtualized resources each running some specific network functions.
+In a specific scenario the resources need to be treated as logically separate (thus making a large
+number of smaller clusters), but still a very similar workload needs to be deployed on each
+cluster (network function stacks, for example).
+
+2 - In one of the architectures, the IOT matrix has IOT gateways, which aggregate a large
+number of IOT sensors in a small area (for example a shopping mall). The IOT gateway is
+envisioned as a virtualized resource, and in some cases multiple such resources need
+aggregation, each forming a small cluster. Each of these clusters might run very similar
+functions, but will independently scale based on the demand of that area.
+
+3 - A telecom service provider has a large number of base stations, each with some set of
+virtualized resources, and each running specific network functions and each specifically
+catering to different network abilities (2g, 3g, 4g, etc). Each of these virtualized base
+stations makes a small cluster and can cater to specific network abilities, such that one
+can cater to one or more network abilities. At a given point of time there would be some
+number of end user agents (cell phones) associated with each, and these UEs can come and
+go within the range of each. While the UEs move, a more centralized entity (read federation)
+needs to make a decision as to which exact base station cluster is suitable and has the needed
+resources to handle the incoming UEs.
