| author | irfanurrehman <irfan.rehman@huawei.com> | 2017-04-26 17:01:41 +0530 |
|---|---|---|
| committer | Irfan Ur Rehman <irfan.rehman@huawei.com> | 2018-03-13 21:41:03 +0530 |
| commit | ba53fa0805f4116f98742e2115820cc7c6435a37 | |
| tree | 4b3b6f38301b7fd6a75224d31416a3bba8f4e219 | |
| parent | f9bc8548924f18ebffedfd07998d914c7779fe94 | |
[Federation] Federated hpa design
| -rw-r--r-- | contributors/design-proposals/federated-hpa.md | 230 |
1 file changed, 230 insertions(+), 0 deletions(-)
diff --git a/contributors/design-proposals/federated-hpa.md b/contributors/design-proposals/federated-hpa.md
new file mode 100644
index 00000000..189b6d7a
--- /dev/null
+++ b/contributors/design-proposals/federated-hpa.md
@@ -0,0 +1,230 @@

# Federated Pod Autoscaler

# Requirements & Design Document

irfan.rehman@huawei.com, quinton.hoole@huawei.com

# Use cases

1 - Users can schedule replicas of the same application across the
federated clusters using a replicaset (or deployment). Users may
further need the replicas to be scaled independently in each cluster,
depending on the current usage metrics of the replicas, including CPU,
memory and application-defined custom metrics.

2 - As stated in the previous use case, a federation user schedules
replicas of the same application into federated clusters and subsequently
creates a horizontal pod autoscaler targeting the object responsible for
the replicas. The user wants the auto-scaling to continue based on the
in-cluster metrics even if, for some reason, there is an outage at the
federation level. The user (or other users) should still be able to access
the application deployed into all federated clusters. Further, if the
load on the deployed app varies, the autoscaler should continue taking
care of scaling the replicas for a smooth user experience.

3 - A federation that consists of an on-premise cluster and a cluster
running in a public cloud has a user workload (e.g. a deployment or
replicaset) preferentially running in the on-premise cluster. However, if
there are spikes in app usage such that the capacity of the on-premise
cluster is not sufficient, the workload should be able to scale beyond the
on-premise cluster boundary and into the other clusters which are part
of this federation.

Please refer to the additional use cases, which partly led to the derivation
of the above use cases, listed in the **glossary** section of this document.

# User workflow

A user wants to schedule a common set of workloads across federated clusters.
The user creates a replicaset or a deployment to schedule the workload (with or
without preferences). The federation then distributes the replicas of the
given workload into the federated clusters. As the user at this point is
unaware of the exact usage metrics of the individual pods created in the
federated clusters, the user creates an HPA in the federation, providing the
metric parameters to be used in the scale requests for a resource. It is then
the responsibility of this HPA to monitor the relevant resource metrics, and
the scaling of the pods in each cluster is controlled by the associated HPA.

# Alternative approaches

## Alternative 1

Make the autoscaling resource available and implement support for
horizontalpodautoscaler objects at the federation. The HPA API resource
will need to be exposed at the federation level, and can follow a version
similar to the one implemented in the latest k8s cluster release.

Once the object is created at the federation, the federation controller
creates and monitors a similar object (partitioning the min and max values)
in each of the federated clusters. Based on the scaleTargetRef described in
the spec of the HPA, the HPA will be applied to the already existing target
objects. If the target object is not present in a cluster (either because it
has not been created yet, or it was deleted for some reason), the HPA will
still exist but no action will be taken. The HPA's action will become
applicable when the target object is created in the given cluster at any time
in the future. Also, as stated already, the federation controller will need to
partition the min and max values appropriately among the HPA objects in the
federated clusters such that the totals of the min and max replicas satisfy the
constraints specified by the user at the federation. The point of control over
the scaling of replicas will lie locally with the HPA controllers in the
federated clusters. The federation controller will, however, watch the
cluster-local HPAs with respect to the current replicas of the target objects
and will make intelligent dynamic adjustments of the min and max replica values
of the HPAs across the clusters based on runtime conditions.

The federation controller by default will distribute the min and max replicas
of the HPA equally among all clusters. The min values will first be distributed
such that no cluster into which the replicas are distributed gets fewer than 1
min replica. This means that the HPA may actually be created in fewer ready
clusters than are available in the federation. Once this distribution happens,
the max replicas of the HPA will be distributed across all those clusters into
which the HPA needs to be created. The default distribution can be overridden
using annotations on the HPA object, very similar to the annotations on the
federated replicaset object as described
[here](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federated-replicasets.md#federatereplicaset-preferences).
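The following is a minimal, self-contained Go sketch of this default
partitioning, assuming an equal split and a per-cluster min of at least 1 as
described above; the type and function names (`clusterHPA`, `distribute`) are
illustrative and are not part of any actual federation controller code.

```go
package main

import "fmt"

// clusterHPA holds the min/max partition assigned to one federated
// cluster's local HPA. Illustrative type, not the real federation API.
type clusterHPA struct {
	Cluster string
	Min     int32
	Max     int32
}

// extra returns 1 for the first rem indices, 0 otherwise, so that the
// remainder of an integer division is spread over the leading clusters.
func extra(i, rem int32) int32 {
	if i < rem {
		return 1
	}
	return 0
}

// distribute spreads the federation-level min/max equally over the ready
// clusters. If min is smaller than the number of clusters, local HPAs are
// created in only the first min clusters, so every created HPA keeps Min >= 1.
func distribute(clusters []string, min, max int32) []clusterHPA {
	n := int32(len(clusters))
	if n == 0 || min < 1 || max < min {
		return nil
	}
	if min < n {
		n = min // create local HPAs in only a subset of the ready clusters
	}
	out := make([]clusterHPA, n)
	for i := int32(0); i < n; i++ {
		out[i] = clusterHPA{
			Cluster: clusters[i],
			Min:     min/n + extra(i, min%n),
			Max:     max/n + extra(i, max%n),
		}
	}
	return out
}

func main() {
	// min=2, max=10 over three ready clusters: only two local HPAs are
	// created, each with min=1 and max=5.
	for _, h := range distribute([]string{"c1", "c2", "c3"}, 2, 10) {
		fmt.Printf("%s: min=%d max=%d\n", h.Cluster, h.Min, h.Max)
	}
}
```

For example, with min=2 and max=10 over three ready clusters, local HPAs are
created in only two of them, each with min=1 and max=5; the annotations
mentioned above would override this default split.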
One of the points to note here is that doing this brings a two-point control
over the number of replicas of the target object: one by the federated target
object (rs or deployment) and the other by the HPA local to the federated
cluster. The solution to this is discussed in the following section.

### Rebalancing (function of target federated objects)

The current implementation of federated replicasets (and deployments) first
distributes the replicas into the underlying clusters and then monitors the
status of the pods in each cluster. If there are clusters which have fewer
active pods than the federation reconciler desires, the federation control
plane will trigger creation of the pods it considers missing; in the other
case it will trigger removal of pods, if it considers that the given cluster
has more pods than needed. This counters the role of the HPA in the individual
cluster. To handle this, the knowledge that an HPA is active and separately
targeting this object has to be percolated to the federation control plane
monitoring the individual replicas, so that the federation control plane stops
reconciling the replicas in the individual clusters. In other words, the link
between the HPA and the corresponding objects will need to be maintained, and
if an HPA is active, the reconcile processes of the other federation
controllers (namely the replicaset and deployment controllers) will stop
updating and/or rebalancing the replicas in and across the underlying clusters.
The reconcile of the objects (rs or deployment) will still continue, to handle
the scenario of the object missing from any given federated cluster.

### Dynamic HPA min/max rebalance

The proposal in this section can be used to improve the distribution of replicas
across the clusters such that there are more replicas in those clusters where
they are needed more. The federation HPA controller will monitor the status of
the local HPAs in the federated clusters and update the min and/or max values
set on the local HPAs as below (assuming that all previous steps are done and
the local HPAs in the federated clusters are active); a simplified sketch of
the max transfer follows this list:

1. At some point, one or more of the cluster HPAs hit the upper limit of their
allowed scaling such that _DesiredReplicas == MaxReplicas_; or more precisely
_CurrentReplicas == DesiredReplicas == MaxReplicas_.

2. If the above is observed, the federation HPA tries to transfer allocation
of _MaxReplicas_ from clusters where it is not needed (_DesiredReplicas < MaxReplicas_)
or where it cannot be used, e.g. due to capacity constraints
(_CurrentReplicas < DesiredReplicas <= MaxReplicas_), to the clusters which have
reached their upper limit (1 above).

3. Care will be taken that _MaxReplicas_ does not become less than _MinReplicas_
in any of the clusters during this redistribution. Additionally, if its
usefulness can be established, _MinReplicas_ can also be redistributed as in 4
below.

4. An exactly similar approach can be applied to the _MinReplicas_ of the local
HPAs, so as to move min replicas from those clusters where
_CurrentReplicas == DesiredReplicas == MinReplicas_ and the observed average
resource metric usage (on the HPA) is less than a given threshold, to those
clusters where _DesiredReplicas > MinReplicas_.

However, as stated in 3 above, this redistribution will first be implemented
only for _MaxReplicas_, to establish its utility, before implementing the same
for _MinReplicas_.
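Below is a minimal Go sketch of the _MaxReplicas_ transfer in steps 1-3 above,
assuming a fixed step of one replica per reconcile pass; the types, names and
donor-selection order are illustrative assumptions, not the actual controller
implementation.

```go
package main

import "fmt"

// localHPA mirrors the fields of a cluster-local HPA status that the
// federation HPA controller would watch. Illustrative type only.
type localHPA struct {
	Cluster  string
	Min, Max int32
	Current  int32 // CurrentReplicas
	Desired  int32 // DesiredReplicas
}

// saturated reports whether the local HPA has hit its upper limit
// (CurrentReplicas == DesiredReplicas == MaxReplicas).
func saturated(h localHPA) bool {
	return h.Current == h.Desired && h.Desired == h.Max
}

// spare reports whether the cluster has max headroom it is not using, either
// because fewer replicas are desired or because it cannot reach the desired
// count (e.g. capacity constraints), while keeping Max above Min.
func spare(h localHPA) bool {
	return h.Max > h.Min && (h.Desired < h.Max || h.Current < h.Desired)
}

// rebalanceMax moves one unit of MaxReplicas per pass from a cluster with
// spare headroom to each saturated cluster, never letting Max drop below Min.
func rebalanceMax(hpas []localHPA) []localHPA {
	for i := range hpas {
		if !saturated(hpas[i]) {
			continue
		}
		for j := range hpas {
			if j != i && spare(hpas[j]) {
				hpas[j].Max--
				hpas[i].Max++
				break
			}
		}
	}
	return hpas
}

func main() {
	hpas := []localHPA{
		{Cluster: "c1", Min: 1, Max: 5, Current: 5, Desired: 5}, // saturated
		{Cluster: "c2", Min: 1, Max: 5, Current: 2, Desired: 2}, // spare headroom
	}
	for _, h := range rebalanceMax(hpas) {
		fmt.Printf("%s: min=%d max=%d\n", h.Cluster, h.Min, h.Max)
	}
}
```

In a real controller the transfer amount and the choice of donor cluster would
likely be weighted by observed usage rather than fixed at one replica per pass.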
## Alternative 2

As in the previous alternative, the API will need to be exposed at the
federation.

However, when the request to create an HPA is sent to the federation, the
federation controller will not create the HPA in the federated clusters. The
HPA object will reside in the federation API server only. The federation
controller will need to get a metrics client to each of the federated clusters
and collect all the relevant metrics periodically from all those clusters. The
federation controller will then calculate the current average metric
utilisation of the given target object across all clusters (using the collected
metrics) and calculate the replicas needed globally to attain the target
utilisation specified in the federation HPA. After arriving at the target
replicas, the target replica number is set directly on the target object
(replicaset, deployment, ...) using its scale sub-resource at the federation.
It is left to the actual target object controller (for example the RS
controller) to distribute the replicas accordingly into the federated clusters.
The point of control over the scaling of replicas will lie completely with the
federation controllers.

### Algorithm (for alternative 2)

The federated HPA (FHPA) gets, from every cluster:

- ```avg_i```, the average metric value (like CPU utilization) for all pods
matching the deployment/rs selector.
- ```count_i```, the number of replicas that were used to calculate the average.

To calculate the target number of replicas, the FHPA sums the metrics from all
clusters, ```sum(avg_i * count_i)```, and divides the sum by the target metric
value. The resulting target replica count (validated against the HPA min/max
and thresholds) is set on the federated deployment/replicaset. The deployment
therefore has the correct number of replicas (matching the desired metric
value) and provides all of the rebalancing/failover mechanisms.

Further, this can be expanded so that the FHPA places replicas where they are
needed the most (in the clusters that have the most traffic). For that, the
FHPA would adjust the weights in the federated deployment. Each cluster gets a
weight of ```100 * avg_i / sum(avg_i)```. The weights hint to the federated
deployment where to put replicas, but they are only hints, so if placing a
replica in the desired cluster is not possible it will be placed elsewhere,
which is probably better than not having the replica at all.
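A minimal Go sketch of this calculation, assuming the per-cluster samples have
already been collected; the type and function names are illustrative, and the
rounding and clamping details are assumptions rather than the exact controller
behaviour.

```go
package main

import (
	"fmt"
	"math"
)

// clusterMetrics is the per-cluster sample the FHPA collects: the average
// metric value (e.g. CPU utilization) over the pods matching the selector
// and the number of replicas the average was computed from.
type clusterMetrics struct {
	Cluster string
	Avg     float64 // avg_i
	Count   int32   // count_i
}

// desiredReplicas returns sum(avg_i * count_i) / target, clamped to
// [min, max]; this value would be written to the federated object's
// scale sub-resource.
func desiredReplicas(samples []clusterMetrics, target float64, min, max int32) int32 {
	var sum float64
	for _, s := range samples {
		sum += s.Avg * float64(s.Count)
	}
	replicas := int32(math.Ceil(sum / target))
	if replicas < min {
		replicas = min
	}
	if replicas > max {
		replicas = max
	}
	return replicas
}

// weights returns the per-cluster placement hints 100 * avg_i / sum(avg_i)
// that the FHPA could set as federated deployment preferences.
func weights(samples []clusterMetrics) map[string]int32 {
	var sum float64
	for _, s := range samples {
		sum += s.Avg
	}
	w := make(map[string]int32, len(samples))
	if sum == 0 {
		return w // no load anywhere: leave the hints empty
	}
	for _, s := range samples {
		w[s.Cluster] = int32(math.Round(100 * s.Avg / sum))
	}
	return w
}

func main() {
	samples := []clusterMetrics{
		{Cluster: "c1", Avg: 80, Count: 4}, // busy cluster
		{Cluster: "c2", Avg: 20, Count: 4}, // mostly idle cluster
	}
	// Target 50% utilization, federation HPA min=2, max=20:
	// sum(avg_i*count_i) = 400, so 400/50 = 8 replicas; weights 80 and 20.
	fmt.Println("replicas:", desiredReplicas(samples, 50, 2, 20))
	fmt.Println("weights:", weights(samples))
}
```

Note that this keeps all scaling decisions in the federation control plane,
which is exactly the dependency the conclusion below calls out.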
# Other scenarios

All other scenarios, for example rolling updates (when the user updates the
deployment or RS) and recreation of the object (when the user specifies the
strategy as recreate while updating the object), will continue to be handled
the way they are handled in an individual k8s cluster.

At the federation, the respective controllers will monitor/watch the individual
objects and update/reconcile them as per the implementation, except in the case
of rebalance specified in the section *Rebalancing*.

# Conclusion

Design alternative 2 has a drawback: the monitoring and update of the HPA
objects (when needed) for a particular federated cluster would stop if, for
whatever reason, the network link between the federated cluster and the
federation control plane is severed. A bigger problem can occur in the case of
an outage of the federation control plane altogether.

Alternative 1 would be a better choice.


# Glossary

These use cases are specified using terminology partly specific to telecom
products/platforms:

1 - A telecom service provider has a large number of base stations for a
particular region, each with some set of virtualized resources, each running
some specific network functions. In a specific scenario the resources need to
be treated as logically separate (thus making a large number of smaller
clusters), but a very similar workload still needs to be deployed on each
cluster (network function stacks, for example).

2 - In one of the architectures, the IOT matrix has IOT gateways, which
aggregate a large number of IOT sensors in a small area (for example a shopping
mall). The IOT gateway is envisioned as a virtualized resource, and in some
cases multiple such resources need aggregation, each forming a small cluster.
Each of these clusters might run very similar functions, but will independently
scale based on the demand of that area.

3 - A telecom service provider has a large number of base stations, each with
some set of virtualized resources, each running specific network functions and
each specifically catering to different network abilities (2G, 3G, 4G, etc.).
Each of these virtualized base stations makes a small cluster and can cater to
one or more specific network abilities. At a given point in time there will be
some number of end user agents (cell phones, i.e. UEs) associated with each,
and these UEs can come and go within the range of each. While the UEs move, a
more centralized entity (read: the federation) needs to decide which exact base
station cluster is suitable, and has the resources needed, to handle the
incoming UEs.