# Ubernetes Design Spec (phase one)

**Huawei PaaS Team**

## INTRODUCTION

In this document we propose a design for the “Control Plane” of
Kubernetes (K8S) federation (a.k.a. “Ubernetes”). For background on
this work please refer to
[this proposal](../../docs/proposals/federation.md).
The document is arranged as follows. First we briefly list the
scenarios and use cases that motivate the K8S federation work. These
use cases drive the design, and they also serve to validate it. We
summarize the functionality requirements derived from these use cases,
and define the “in scope” functionality that is covered by this design
(phase one). After that we give an overview of the proposed
architecture, API and building blocks, and we walk through several
activity flows to show how these building blocks work together to
support the use cases.

## REQUIREMENTS

There are many reasons why customers may want to build a K8S
federation:

+ **High Availability:** Customers want to be immune to the outage of
  a single availability zone, region or even a cloud provider.
+ **Sensitive workloads:** Some workloads can only run on a particular
  cluster. They cannot be scheduled to or migrated to other clusters.
+ **Capacity overflow:** Customers prefer to run workloads on a
  primary cluster. But if the capacity of that cluster is not
  sufficient, workloads should be automatically distributed to other
  clusters.
+ **Vendor lock-in avoidance:** Customers want to spread their
  workloads across different cloud providers, and to be able to easily
  increase or decrease the workload proportion of a specific provider.
+ **Cluster Size Enhancement:** Currently a K8S cluster can only
  support a limited size. While the community is actively improving
  this, it can be expected that cluster size will be a problem if K8S
  is used for large workloads or public PaaS infrastructure. While we
  can separate different tenants onto different clusters, it would be
  good to have a unified view.

Here are the functionality requirements derived from the above use
cases:

+ Clients of the federation control plane API server can register and
  deregister clusters.
+ Workloads should be spread across different clusters according to
  the workload distribution policy.
+ Pods are able to discover and connect to services hosted in other
  clusters (in cases where inter-cluster networking is necessary,
  desirable and implemented).
+ Traffic to these pods should be spread across clusters (in a manner
  similar to load balancing, although it might not be strictly
  speaking balanced).
+ The control plane needs to know when a cluster is down, and migrate
  the workloads to other clusters.
+ Clients have a unified view of, and a central control point for, the
  above activities.

## SCOPE

It’s difficult to produce in one pass a perfect design that implements
all of the above requirements. Therefore we will take an iterative
approach to designing and building the system. This document describes
phase one of the whole work. In phase one we will cover only the
following objectives:

+ Define the basic building blocks and API objects of the control plane
+ Implement a basic end-to-end workflow
  + Clients register federated clusters
  + Clients submit a workload
  + The workload is distributed to different clusters
  + Service discovery
  + Load balancing

The following parts are NOT covered in phase one:

+ Authentication and authorization (other than basic client
  authentication against the Ubernetes API, and from the Ubernetes
  control plane to the underlying Kubernetes clusters)
+ Deployment units other than replication controller and service
+ Complex workload distribution policies
+ Service affinity and migration

## ARCHITECTURE

The overall architecture of the control plane is shown below:

Some design principles we are following in this architecture:

1. Keep the underlying K8S clusters independent. They should have no
   knowledge of the control plane or of each other.
1. Keep the Ubernetes API interface compatible with the K8S API as
   much as possible.
1. Re-use concepts from K8S as much as possible. This reduces
   customers’ learning curve and is good for adoption.

Below is a brief description of each module contained in the above
diagram.

## Ubernetes API Server

The API Server in the Ubernetes control plane works just like the API
Server in K8S. It talks to a distributed key-value store to persist,
retrieve and watch API objects. This store is completely distinct
from the Kubernetes key-value stores (etcd) in the underlying
Kubernetes clusters. We still use `etcd` as the distributed
storage so customers don’t need to learn and manage a different
storage system, although it is envisaged that other storage systems
(Consul, ZooKeeper) will probably be developed and supported over
time.

## Ubernetes Scheduler

The Ubernetes Scheduler schedules resources onto the underlying
Kubernetes clusters. For example, it watches for unscheduled Ubernetes
replication controllers (those that have not yet been scheduled onto
underlying Kubernetes clusters) and performs the global scheduling
work. For each unscheduled replication controller, it calls the policy
engine to decide how to split the workload among clusters. It creates
a Kubernetes replication controller for one or more underlying
clusters, and posts them back to `etcd` storage.

One subtlety worth noting here is that the scheduling decision is
arrived at by combining the application-specific request from the user
(which might include, for example, placement constraints) with the
global policy specified by the federation administrator (for example,
“prefer on-premise clusters over AWS clusters” or “spread load equally
across clusters”).
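To make this splitting step concrete, here is a minimal Go sketch,
assuming the phase-one policy of even distribution across acceptable
clusters. The `SubRC` type and the function names are illustrative
assumptions, not actual Ubernetes code.

```
package main

import "fmt"

// SubRC describes the per-cluster slice of a federated replication
// controller. The type is illustrative, not the real API object.
type SubRC struct {
	Cluster  string
	Replicas int
}

// splitEvenly distributes replicas across the acceptable clusters as
// evenly as possible, giving the first replicas%n clusters one extra
// replica. This mirrors the phase-one policy of even distribution.
func splitEvenly(replicas int, clusters []string) []SubRC {
	n := len(clusters)
	if n == 0 {
		return nil
	}
	base, extra := replicas/n, replicas%n
	subs := make([]SubRC, 0, n)
	for i, cluster := range clusters {
		r := base
		if i < extra {
			r++ // spread the remainder over the first few clusters
		}
		subs = append(subs, SubRC{Cluster: cluster, Replicas: r})
	}
	return subs
}

func main() {
	// 5 replicas over clusters Foo and Bar -> Foo gets 3, Bar gets 2.
	for _, s := range splitEvenly(5, []string{"Foo", "Bar"}) {
		fmt.Printf("%s: %d replicas\n", s.Cluster, s.Replicas)
	}
}
```

In the real scheduler, the split would additionally consult the policy
engine and the per-cluster capacity reported by the cluster controller.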
## Ubernetes Cluster Controller

The cluster controller performs the following two kinds of work:

1. It watches all the sub-resources that are created by Ubernetes
   components, such as a sub-RC or a sub-service, and creates the
   corresponding API objects on the underlying K8S clusters.
1. It periodically retrieves the available resource metrics from the
   underlying K8S cluster, and updates them in the status of the
   `cluster` API object. An alternative design might be to run a pod
   in each underlying cluster that reports metrics for that cluster to
   the Ubernetes control plane. Which approach is better remains an
   open topic of discussion.
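The following Go sketch shows one possible shape for this control loop
for a single registered cluster. The `fedStore` and `k8sClient`
interfaces, their method names, and the 30-second polling period are
all assumptions made for illustration; they are not the actual
Ubernetes interfaces.

```
package federation

import (
	"log"
	"time"
)

// subRC is a sub-replication-controller bound to one cluster (illustrative).
type subRC struct {
	Cluster string
	Name    string
}

// fedStore stands in for the control plane's etcd-backed storage.
type fedStore interface {
	// WatchSubRCs streams sub-RCs that the scheduler has bound to a cluster.
	WatchSubRCs(cluster string) <-chan subRC
	// UpdateClusterCapacity writes metrics into the cluster object's status.
	UpdateClusterCapacity(cluster string, free map[string]int64) error
}

// k8sClient is a regular Kubernetes client for one underlying cluster.
type k8sClient interface {
	CreateReplicationController(name string) error
	// TotalFreeResources aggregates free resources over all nodes (phase one).
	TotalFreeResources() (map[string]int64, error)
}

// runClusterController loops over the controller's two duties: mirror
// sub-RCs into the underlying cluster, and periodically report the
// cluster's free capacity back to the control plane.
func runClusterController(cluster string, store fedStore, client k8sClient) {
	ticker := time.NewTicker(30 * time.Second) // polling period is an assumption
	subs := store.WatchSubRCs(cluster)
	for {
		select {
		case rc := <-subs:
			if err := client.CreateReplicationController(rc.Name); err != nil {
				log.Printf("create %s on %s failed, will retry: %v", rc.Name, cluster, err)
			}
		case <-ticker.C:
			if free, err := client.TotalFreeResources(); err == nil {
				store.UpdateClusterCapacity(cluster, free)
			}
		}
	}
}
```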
## Ubernetes Service Controller

The Ubernetes service controller is a federation-level implementation
of the K8S service controller. It watches service resources created on
the control plane, and creates the corresponding K8S services on each
involved K8S cluster. Besides interacting with the service resources
on each individual K8S cluster, the Ubernetes service controller also
performs some global DNS registration work.

## API OBJECTS

## Cluster

Cluster is a new first-class API object introduced in this design. For
each registered K8S cluster there will be such an API resource in the
control plane. Clients register or deregister a cluster by sending the
corresponding REST requests to the following URL:
`/api/{$version}/clusters`. Because the control plane behaves like a
regular K8S client of the underlying clusters, the spec of a cluster
object contains the necessary properties, such as the K8S cluster
address and credentials. The status of a cluster API object contains
the following information:

1. The current phase of its lifecycle
1. Cluster resource metrics for scheduling decisions
1. Other metadata, such as the version of the cluster

$version.clusterSpec

| Name | Description | Required | Schema | Default |
| ---- | ----------- | -------- | ------ | ------- |
| Address | address of the cluster | yes | address | |
| Credential | the type (e.g. bearer token, client certificate etc.) and data of the credential used to access the cluster. It’s used for system routines (not on behalf of users) | yes | string | |

$version.clusterStatus

| Name | Description | Required | Schema | Default |
| ---- | ----------- | -------- | ------ | ------- |
| Phase | the recently observed lifecycle phase of the cluster | yes | enum | |
| Capacity | represents the available resources of the cluster | yes | any | |
| ClusterMeta | other cluster metadata, such as the version | yes | ClusterMeta | |

**For simplicity we didn’t introduce a separate “cluster metrics” API
object here.** The cluster resource metrics are stored in the cluster
status section, just as we do for nodes in K8S. In phase one it only
contains available CPU resources and memory resources. The cluster
controller will periodically poll the underlying cluster API Server to
get the cluster capacity. In phase one it gets the metrics by simply
aggregating the metrics of all nodes. In the future we will improve
this with more efficient approaches, such as leveraging Heapster, and
more metrics will be supported. Similar to node phases in K8S, the
“phase” field can take the following values:

+ pending: newly registered clusters, or clusters suspended by the
  admin for various reasons. They are not eligible to accept workloads.
+ running: clusters in normal status that can accept workloads.
+ offline: clusters that are temporarily down or not reachable.
+ terminated: clusters removed from the federation.
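For illustration, the tables above might map to Go type definitions
along the following lines. The field names and types are assumptions
derived from the tables, not the final API definition.

```
package federation

// ClusterPhase is the most recently observed lifecycle phase of a cluster.
type ClusterPhase string

const (
	ClusterPending    ClusterPhase = "pending"
	ClusterRunning    ClusterPhase = "running"
	ClusterOffline    ClusterPhase = "offline"
	ClusterTerminated ClusterPhase = "terminated"
)

// ClusterSpec holds what the control plane needs in order to reach and
// authenticate against the underlying K8S cluster.
type ClusterSpec struct {
	Address    string // address of the cluster's API server
	Credential string // type and data of the credential used for system routines
}

// ClusterMeta carries other cluster metadata, such as the version.
type ClusterMeta struct {
	Version string
}

// ClusterStatus mirrors the clusterStatus table: lifecycle phase,
// available resources, and other metadata.
type ClusterStatus struct {
	Phase       ClusterPhase
	Capacity    map[string]int64 // phase one: available "cpu" and "memory" only
	ClusterMeta ClusterMeta
}

// Cluster is the first-class API object registered at /api/{$version}/clusters.
type Cluster struct {
	Name   string
	Spec   ClusterSpec
	Status ClusterStatus
}
```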
Below is the state transition diagram.

## Replication Controller

A global workload submitted to the control plane is represented as an
Ubernetes replication controller. When a replication controller is
submitted to the control plane, clients need a way to express its
requirements or preferences regarding clusters. Depending on the use
case this may be complex. For example:

+ This workload can only be scheduled to cluster Foo. It cannot be
  scheduled to any other cluster (use case: sensitive workloads).
+ This workload prefers cluster Foo. But if there is no available
  capacity on cluster Foo, it’s OK for it to be scheduled to cluster
  Bar (use case: capacity overflow).
+ Seventy percent of this workload should be scheduled to cluster Foo,
  and thirty percent should be scheduled to cluster Bar (use case:
  vendor lock-in avoidance).

In phase one, we only introduce a _clusterSelector_ field to filter
acceptable clusters. By default there is no such selector, which means
any cluster is acceptable.

Below is a sample of the YAML to create such a replication controller.

```
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx-controller
spec:
  replicas: 5
  selector:
    app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
      clusterSelector:
        name in (Foo, Bar)
```

Currently clusterSelector (implemented as a
[LabelSelector](../../pkg/apis/extensions/v1beta1/types.go#L704))
only supports a simple list of acceptable clusters. Workloads will be
evenly distributed across these acceptable clusters in phase one.
After phase one we will define syntax to represent more advanced
constraints, such as cluster preference ordering, the desired number
of workload splits, and the desired ratio of workload spread across
different clusters.

Besides this explicit “clusterSelector” filter, a workload may have
some implicit scheduling restrictions. For example, it may define a
“nodeSelector” that can only be satisfied on some particular clusters.
How to handle this will be addressed after phase one.

## Ubernetes Services

The Service API object exposed by Ubernetes is similar to the service
object in Kubernetes. It defines the access to a group of pods. The
Ubernetes service controller will create corresponding Kubernetes
service objects on the underlying clusters. These are detailed in a
separate design document: [Federated Services](federated-services.md).

## Pod

In phase one we only support scheduling replication controllers. Pod
scheduling will be supported in a later phase. This is primarily in
order to keep the Ubernetes API compatible with the Kubernetes API.

## ACTIVITY FLOWS

## Scheduling

The diagram below shows how workloads are scheduled on the Ubernetes
control plane:

1. A replication controller is created by the client.
1. The API Server persists it into the storage.
1. The cluster controller periodically polls the latest available
   resource metrics from the underlying clusters.
1. The scheduler watches all pending RCs. It picks up an RC, makes
   policy-driven decisions and splits it into different sub-RCs.
1. Each cluster controller watches the sub-RCs bound to its
   corresponding cluster, and picks up a newly created sub-RC.
1. The cluster controller issues requests to the underlying cluster
   API Server to create the RC. In phase one we don’t support complex
   distribution policies. The scheduling rule is basically:
   1. If an RC does not specify any nodeSelector, it will be scheduled
      to the least loaded K8S cluster(s) that has enough available
      resources (see the sketch below).
   1. If an RC specifies _N_ acceptable clusters in its
      clusterSelector, all replicas will be evenly distributed among
      these clusters.
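A small Go sketch of the first rule follows. Interpreting “least
loaded” as “most free CPU”, as well as the names and units below, are
assumptions for illustration; the actual metric is a policy decision.

```
package main

import "fmt"

// clusterLoad summarizes the free resources of one cluster, as polled
// by the cluster controller (illustrative names and units).
type clusterLoad struct {
	Name      string
	FreeCPU   int64 // millicores
	FreeMemMB int64
}

// pickLeastLoaded chooses the least loaded cluster that still has
// enough available resources for the request, per the phase-one rule
// for RCs without a clusterSelector.
func pickLeastLoaded(clusters []clusterLoad, cpu, memMB int64) (string, error) {
	best, bestCPU := "", int64(-1)
	for _, c := range clusters {
		if c.FreeCPU >= cpu && c.FreeMemMB >= memMB && c.FreeCPU > bestCPU {
			best, bestCPU = c.Name, c.FreeCPU
		}
	}
	if best == "" {
		return "", fmt.Errorf("no cluster has %dm CPU and %d MB memory free", cpu, memMB)
	}
	return best, nil
}

func main() {
	clusters := []clusterLoad{
		{Name: "Foo", FreeCPU: 4000, FreeMemMB: 8192},
		{Name: "Bar", FreeCPU: 6000, FreeMemMB: 2048},
	}
	// Request 1 core and 4 GB: Bar has more free CPU but not enough
	// memory, so Foo is chosen.
	if name, err := pickLeastLoaded(clusters, 1000, 4096); err == nil {
		fmt.Println("scheduled to", name)
	}
}
```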
There is a potential race condition here. Say at time _T1_ the control
plane learns that there are _m_ available resources in a K8S cluster.
As the cluster works independently, it still accepts workload requests
from other K8S clients, or even from another Ubernetes control plane.
The Ubernetes scheduling decision is based on this snapshot of
available resources. However, when the actual RC creation happens on
the cluster at time _T2_, the cluster may no longer have enough
resources. We will address this problem in later phases with proposed
solutions such as resource reservation mechanisms.

## Service Discovery

This part is covered in the section “Federated Service” of the
document
“[Ubernetes Cross-cluster Load Balancing and Service Discovery Requirements and System Design](federated-services.md)”.
Please refer to that document for details.
