From f8e1cbd0920f4a181759664095d80775e4e672c0 Mon Sep 17 00:00:00 2001 From: Jonathan MacMillan Date: Wed, 11 Oct 2017 16:34:50 -0700 Subject: Update files in the community repo to point to multicluster rather than federation. --- .../design-proposals/architecture/architecture.md | 2 +- contributors/design-proposals/dir_struct.txt | 2 +- .../federation/control-plane-resilience.md | 241 -------- .../federation/federated-api-servers.md | 8 - .../federation/federated-ingress.md | 194 ------ .../federation/federated-placement-policy.md | 371 ------------ .../federation/federated-replicasets.md | 513 ---------------- .../federation/federated-services.md | 519 ----------------- .../federation/federation-clusterselector.md | 81 --- .../federation/federation-high-level-arch.png | Bin 31793 -> 0 bytes .../design-proposals/federation/federation-lite.md | 201 ------- .../federation/federation-phase-1.md | 407 ------------- .../design-proposals/federation/federation.md | 648 --------------------- .../federation/ubernetes-cluster-state.png | Bin 13824 -> 0 bytes .../federation/ubernetes-design.png | Bin 20358 -> 0 bytes .../federation/ubernetes-scheduling.png | Bin 39094 -> 0 bytes .../multicluster/control-plane-resilience.md | 241 ++++++++ .../multicluster/federated-api-servers.md | 8 + .../multicluster/federated-ingress.md | 194 ++++++ .../multicluster/federated-placement-policy.md | 371 ++++++++++++ .../multicluster/federated-replicasets.md | 513 ++++++++++++++++ .../multicluster/federated-services.md | 519 +++++++++++++++++ .../multicluster/federation-clusterselector.md | 81 +++ .../multicluster/federation-high-level-arch.png | Bin 0 -> 31793 bytes .../multicluster/federation-lite.md | 201 +++++++ .../multicluster/federation-phase-1.md | 407 +++++++++++++ .../design-proposals/multicluster/federation.md | 648 +++++++++++++++++++++ .../multicluster/ubernetes-cluster-state.png | Bin 0 -> 13824 bytes .../multicluster/ubernetes-design.png | Bin 0 -> 20358 bytes 
.../multicluster/ubernetes-scheduling.png | Bin 0 -> 39094 bytes .../design-proposals/scheduling/podaffinity.md | 2 +- contributors/devel/release/issues.md | 2 +- sig-federation/ONCALL.md | 63 -- sig-federation/OWNERS | 6 - sig-federation/README.md | 29 - sig-list.md | 2 +- sig-multicluster/ONCALL.md | 76 +++ sig-multicluster/OWNERS | 6 + sig-multicluster/README.md | 29 + sigs.yaml | 25 +- 40 files changed, 3312 insertions(+), 3298 deletions(-) delete mode 100644 contributors/design-proposals/federation/control-plane-resilience.md delete mode 100644 contributors/design-proposals/federation/federated-api-servers.md delete mode 100644 contributors/design-proposals/federation/federated-ingress.md delete mode 100644 contributors/design-proposals/federation/federated-placement-policy.md delete mode 100644 contributors/design-proposals/federation/federated-replicasets.md delete mode 100644 contributors/design-proposals/federation/federated-services.md delete mode 100644 contributors/design-proposals/federation/federation-clusterselector.md delete mode 100644 contributors/design-proposals/federation/federation-high-level-arch.png delete mode 100644 contributors/design-proposals/federation/federation-lite.md delete mode 100644 contributors/design-proposals/federation/federation-phase-1.md delete mode 100644 contributors/design-proposals/federation/federation.md delete mode 100644 contributors/design-proposals/federation/ubernetes-cluster-state.png delete mode 100644 contributors/design-proposals/federation/ubernetes-design.png delete mode 100644 contributors/design-proposals/federation/ubernetes-scheduling.png create mode 100644 contributors/design-proposals/multicluster/control-plane-resilience.md create mode 100644 contributors/design-proposals/multicluster/federated-api-servers.md create mode 100644 contributors/design-proposals/multicluster/federated-ingress.md create mode 100644 contributors/design-proposals/multicluster/federated-placement-policy.md create mode 
100644 contributors/design-proposals/multicluster/federated-replicasets.md create mode 100644 contributors/design-proposals/multicluster/federated-services.md create mode 100644 contributors/design-proposals/multicluster/federation-clusterselector.md create mode 100644 contributors/design-proposals/multicluster/federation-high-level-arch.png create mode 100644 contributors/design-proposals/multicluster/federation-lite.md create mode 100644 contributors/design-proposals/multicluster/federation-phase-1.md create mode 100644 contributors/design-proposals/multicluster/federation.md create mode 100644 contributors/design-proposals/multicluster/ubernetes-cluster-state.png create mode 100644 contributors/design-proposals/multicluster/ubernetes-design.png create mode 100644 contributors/design-proposals/multicluster/ubernetes-scheduling.png delete mode 100644 sig-federation/ONCALL.md delete mode 100644 sig-federation/OWNERS delete mode 100644 sig-federation/README.md create mode 100644 sig-multicluster/ONCALL.md create mode 100644 sig-multicluster/OWNERS create mode 100644 sig-multicluster/README.md diff --git a/contributors/design-proposals/architecture/architecture.md b/contributors/design-proposals/architecture/architecture.md index 06644621..b0d1f99b 100644 --- a/contributors/design-proposals/architecture/architecture.md +++ b/contributors/design-proposals/architecture/architecture.md @@ -245,7 +245,7 @@ itself: A single Kubernetes cluster may span multiple availability zones. -However, for the highest availability, we recommend using [cluster federation](../federation/federation.md). +However, for the highest availability, we recommend using [cluster federation](../multicluster/federation.md). 
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/architecture.md?pixel)]()

diff --git a/contributors/design-proposals/dir_struct.txt b/contributors/design-proposals/dir_struct.txt
index e5e8ea71..ef61ae75 100644
--- a/contributors/design-proposals/dir_struct.txt
+++ b/contributors/design-proposals/dir_struct.txt
@@ -134,7 +134,7 @@ Uncategorized
 security.md
 security_context.md
 service_accounts.md
-./federation
+./multicluster
 control-plane-resilience.md
 federated-api-servers.md
 federated-ingress.md
diff --git a/contributors/design-proposals/federation/control-plane-resilience.md b/contributors/design-proposals/federation/control-plane-resilience.md
deleted file mode 100644
index 1e0a3baf..00000000
--- a/contributors/design-proposals/federation/control-plane-resilience.md
+++ /dev/null
@@ -1,241 +0,0 @@
-# Kubernetes and Cluster Federation Control Plane Resilience
-
-## Long Term Design and Current Status
-
-### by Quinton Hoole, Mike Danese and Justin Santa-Barbara
-
-### December 14, 2015
-
-## Summary
-
-Some amount of confusion exists around how we currently, and in the future,
-want to ensure resilience of the Kubernetes (and by implication
-Kubernetes Cluster Federation) control plane. This document is an attempt to capture that
-definitively. It covers areas including self-healing, high
-availability, bootstrapping and recovery. Most of the information in
-this document already exists in the form of GitHub comments,
-PRs/proposals, scattered documents, and corridor conversations, so this
-document is primarily a consolidation and clarification of existing
-ideas.
-
-## Terms
-
-* **Self-healing:** automatically restarting or replacing failed
-  processes and machines without human intervention
-* **High availability:** continuing to be available and work correctly
-  even if some components are down or uncontactable. This typically
-  involves multiple replicas of critical services, and a reliable way
-  to find available replicas.
Note that it's possible (but not - desirable) to have high - availability properties (e.g. multiple replicas) in the absence of - self-healing properties (e.g. if a replica fails, nothing replaces - it). Fairly obviously, given enough time, such systems typically - become unavailable (after enough replicas have failed). -* **Bootstrapping**: creating an empty cluster from nothing -* **Recovery**: recreating a non-empty cluster after perhaps - catastrophic failure/unavailability/data corruption - -## Overall Goals - -1. **Resilience to single failures:** Kubernetes clusters constrained - to single availability zones should be resilient to individual - machine and process failures by being both self-healing and highly - available (within the context of such individual failures). -1. **Ubiquitous resilience by default:** The default cluster creation - scripts for (at least) GCE, AWS and basic bare metal should adhere - to the above (self-healing and high availability) by default (with - options available to disable these features to reduce control plane - resource requirements if so required). It is hoped that other - cloud providers will also follow the above guidelines, but the - above 3 are the primary canonical use cases. -1. **Resilience to some correlated failures:** Kubernetes clusters - which span multiple availability zones in a region should by - default be resilient to complete failure of one entire availability - zone (by similarly providing self-healing and high availability in - the default cluster creation scripts as above). -1. **Default implementation shared across cloud providers:** The - differences between the default implementations of the above for - GCE, AWS and basic bare metal should be minimized. This implies - using shared libraries across these providers in the default - scripts in preference to highly customized implementations per - cloud provider. 
This is not to say that highly differentiated,
-   customized per-cloud cluster creation processes (e.g. for GKE on
-   GCE, or some hosted Kubernetes provider on AWS) are discouraged.
-   But those fall squarely outside the basic cross-platform OSS
-   Kubernetes distro.
-1. **Self-hosting:** Where possible, Kubernetes's existing mechanisms
-   for achieving system resilience (replication controllers, health
-   checking, service load balancing, etc.) should be used in preference
-   to building a separate set of mechanisms to achieve the same thing.
-   This implies that self-hosting (the Kubernetes control plane on
-   Kubernetes) is strongly preferred, with the caveat below.
-1. **Recovery from catastrophic failure:** The ability to quickly and
-   reliably recover a cluster from catastrophic failure is critical,
-   and should not be compromised by the above goal to self-host
-   (i.e. it goes without saying that the cluster should be quickly and
-   reliably recoverable, even if the cluster control plane is
-   broken). This implies that such catastrophic failure scenarios
-   should be carefully thought out, and be the subject of regular
-   continuous integration testing and disaster recovery exercises.
-
-## Relative Priorities
-
-1. **(Possibly manual) recovery from catastrophic failures:** having a
-Kubernetes cluster, and all applications running inside it, disappear forever
-is perhaps the worst possible failure mode. So it is critical that we be able to
-recover the applications running inside a cluster from such failures in some
-well-bounded time period.
-    1. In theory a cluster can be recovered by replaying all API calls
-       that have ever been executed against it, in order, but most
-       often that state has been lost, and/or is scattered across
-       multiple client applications or groups. So in general it is
-       probably infeasible.
-    1. In theory a cluster can also be recovered to some relatively
-       recent non-corrupt backup/snapshot of the disk(s) backing the
-       etcd cluster state.
But we have no default consistent - backup/snapshot, verification or restoration process. And we - don't routinely test restoration, so even if we did routinely - perform and verify backups, we have no hard evidence that we - can in practise effectively recover from catastrophic cluster - failure or data corruption by restoring from these backups. So - there's more work to be done here. -1. **Self-healing:** Most major cloud providers provide the ability to - easily and automatically replace failed virtual machines within a - small number of minutes (e.g. GCE - [Auto-restart](https://cloud.google.com/compute/docs/instances/setting-instance-scheduling-options#autorestart) - and Managed Instance Groups, - AWS[ Auto-recovery](https://aws.amazon.com/blogs/aws/new-auto-recovery-for-amazon-ec2/) - and [Auto scaling](https://aws.amazon.com/autoscaling/) etc). This - can fairly trivially be used to reduce control-plane down-time due - to machine failure to a small number of minutes per failure - (i.e. typically around "3 nines" availability), provided that: - 1. cluster persistent state (i.e. etcd disks) is either: - 1. truly persistent (i.e. remote persistent disks), or - 1. reconstructible (e.g. using etcd [dynamic member - addition](https://github.com/coreos/etcd/blob/master/Documentation/v2/runtime-configuration.md#add-a-new-member) - or [backup and - recovery](https://github.com/coreos/etcd/blob/master/Documentation/v2/admin_guide.md#disaster-recovery)). - 1. and boot disks are either: - 1. truly persistent (i.e. remote persistent disks), or - 1. reconstructible (e.g. using boot-from-snapshot, - boot-from-pre-configured-image or - boot-from-auto-initializing image). -1. **High Availability:** This has the potential to increase - availability above the approximately "3 nines" level provided by - automated self-healing, but it's somewhat more complex, and - requires additional resources (e.g. redundant API servers and etcd - quorum members). 
In environments where cloud-assisted automatic
-   self-healing might be infeasible (e.g. on-premise bare-metal
-   deployments), it also gives cluster administrators more time to
-   respond (e.g. replace/repair failed machines) without incurring
-   system downtime.
-
-## Design and Status (as of December 2015)
-
**API Server**
-
-*Resilience plan:* Multiple stateless, self-hosted, self-healing API
-servers behind an HA load balancer, built out by the default "kube-up"
-automation on GCE, AWS and basic bare metal (BBM). Note that the
-single-host approach of having etcd listen only on localhost to ensure
-that only the API server can connect to it will no longer work, so
-alternative security will be needed in this regard (either using
-firewall rules, SSL certs, or something else). All necessary flags are
-currently supported to enable SSL between the API server and etcd
-(OpenShift runs like this out of the box), but this needs to be woven
-into the "kube-up" and related scripts. Detailed design of self-hosting
-and related bootstrapping and catastrophic failure recovery will be
-covered in a separate design doc.
-
-*Current status:* No scripted self-healing or HA on GCE, AWS or basic
-bare metal currently exists in the OSS distro. To be clear, "no
-self-healing" means that even if multiple e.g. API servers are
-provisioned for HA purposes, if they fail, nothing replaces them, so
-eventually the system will fail. Self-healing and HA can be set up
-manually by following documented instructions, but this is not
-currently an automated process, and it is not tested as part of
-continuous integration. So it's probably safest to assume that it
-doesn't actually work in practice.
-
-**Controller manager and scheduler**
-
-*Resilience plan:* Multiple self-hosted, self-healing warm standby
-stateless controller managers and schedulers with leader election and
-automatic failover of API server clients, automatically installed by
-the default "kube-up" automation.
-
-*Current status:* As above.
-
**etcd**
-
-*Resilience plan:* Multiple (3-5) etcd quorum members behind a load
-balancer with session affinity (to prevent clients from being bounced
-from one to another).
-
-Regarding self-healing, if a node running etcd goes down, it is always
-necessary to do three things:
-
-1. allocate a new node (not necessary if running etcd as a pod, in
-which case specific measures are required to prevent user pods from
-interfering with system pods, for example using node selectors),
-1. start a new etcd replica on that node, and
-1. introduce the new replica to the existing quorum (e.g. via etcd
-dynamic member addition).
-
-In the case of remote persistent disk, the etcd state can be recovered
-by attaching the remote persistent disk to the replacement node, thus
-the state is recoverable even if all other replicas are down.
-
-There are also significant performance differences between local disks
-and remote persistent disks. For example, the sustained throughput of
-local disks in GCE is approximately 20x that of remote disks.
-
-Hence we suggest that self-healing be provided by remotely mounted
-persistent disks in non-performance-critical, single-zone cloud
-deployments. For performance-critical installations, faster local SSDs
-should be used, in which case remounting on node failure is not an
-option, so etcd runtime configuration should be used to replace the
-failed machine. Similarly, for cross-zone self-healing, cloud
-persistent disks are zonal, so automatic runtime configuration is
-required. Similarly, basic bare metal deployments cannot generally rely
-on remote persistent disks, so the same approach applies there.
-
-*Current status:* Somewhat vague instructions exist on how to set some
-of this up manually in a self-hosted configuration. But automatic
-bootstrapping and self-healing is not described (and is not implemented
-for the non-PD cases). This all still needs to be automated and
-continuously tested.
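The runtime-reconfiguration path mentioned above (replacing a failed member rather than remounting its disk) can be sketched roughly as follows. This is only an illustrative outline based on the etcd v2 runtime-configuration docs linked earlier; the member ID and peer URL are placeholders, and this is not a tested procedure:

```sh
# On a surviving member, deregister the failed member and register a
# replacement (member ID and peer URL below are placeholders):
etcdctl member remove 6e3bd23ae5f1eae0
etcdctl member add etcd-replacement http://10.0.0.4:2380

# Then start etcd on the replacement node with:
#   --initial-cluster-state existing
```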
- - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/control-plane-resilience.md?pixel)]() - diff --git a/contributors/design-proposals/federation/federated-api-servers.md b/contributors/design-proposals/federation/federated-api-servers.md deleted file mode 100644 index ff214c23..00000000 --- a/contributors/design-proposals/federation/federated-api-servers.md +++ /dev/null @@ -1,8 +0,0 @@ -# Federated API Servers - -Moved to [aggregated-api-servers.md](../api-machinery/aggregated-api-servers.md) since cluster -federation stole the word "federation" from this effort and it was very confusing. - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/federated-api-servers.md?pixel)]() - diff --git a/contributors/design-proposals/federation/federated-ingress.md b/contributors/design-proposals/federation/federated-ingress.md deleted file mode 100644 index 07e75b0c..00000000 --- a/contributors/design-proposals/federation/federated-ingress.md +++ /dev/null @@ -1,194 +0,0 @@ -# Kubernetes Federated Ingress - - Requirements and High Level Design - - Quinton Hoole - - July 17, 2016 - -## Overview/Summary - -[Kubernetes Ingress](https://kubernetes.io/docs/concepts/services-networking/ingress/) -provides an abstraction for sophisticated L7 load balancing through a -single IP address (and DNS name) across multiple pods in a single -Kubernetes cluster. Multiple alternative underlying implementations -are provided, including one based on GCE L7 load balancing and another -using an in-cluster nginx/HAProxy deployment (for non-GCE -environments). An AWS implementation, based on Elastic Load Balancers -and Route53 is under way by the community. - -To extend the above to cover multiple clusters, Kubernetes Federated -Ingress aims to provide a similar/identical API abstraction and, -again, multiple implementations to cover various -cloud-provider-specific as well as multi-cloud scenarios. 
The general -model is to allow the user to instantiate a single Ingress object via -the Federation API, and have it automatically provision all of the -necessary underlying resources (L7 cloud load balancers, in-cluster -proxies etc) to provide L7 load balancing across a service spanning -multiple clusters. - -Four options are outlined: - -1. GCP only -1. AWS only -1. Cross-cloud via GCP in-cluster proxies (i.e. clients get to AWS and on-prem via GCP). -1. Cross-cloud via AWS in-cluster proxies (i.e. clients get to GCP and on-prem via AWS). - -Option 1 is the: - -1. easiest/quickest, -1. most featureful - -Recommendations: - -+ Suggest tackling option 1 (GCP only) first (target beta in v1.4) -+ Thereafter option 3 (cross-cloud via GCP) -+ We should encourage/facilitate the community to tackle option 2 (AWS-only) - -## Options - -## Google Cloud Platform only - backed by GCE L7 Load Balancers - -This is an option for federations across clusters which all run on Google Cloud Platform (i.e. GCE and/or GKE) - -### Features - -In summary, all of [GCE L7 Load Balancer](https://cloud.google.com/compute/docs/load-balancing/http/) features: - -1. Single global virtual (a.k.a. "anycast") IP address ("VIP" - no dependence on dynamic DNS) -1. Geo-locality for both external and GCP-internal clients -1. Load-based overflow to next-closest geo-locality (i.e. cluster). Based on either queries per second, or CPU load (unfortunately on the first-hop target VM, not the final destination K8s Service). -1. URL-based request direction (different backend services can fulfill each different URL). -1. HTTPS request termination (at the GCE load balancer, with server SSL certs) - -### Implementation - -1. Federation user creates (federated) Ingress object (the services - backing the ingress object must share the same nodePort, as they - share a single GCP health check). -1. 
Federated Ingress Controller creates Ingress object in each cluster
-   in the federation (after [configuring each cluster ingress
-   controller to share the same ingress UID](https://gist.github.com/bprashanth/52648b2a0b6a5b637f843e7efb2abc97)).
-1. Each cluster-level Ingress Controller ("GLBC") creates Google L7
-   Load Balancer machinery (forwarding rules, target proxy, URL map,
-   backend service, health check) which ensures that traffic to the
-   Ingress (backed by a Service) is directed to the nodes in the cluster.
-1. KubeProxy redirects to one of the backend Pods (currently round-robin, per KubeProxy instance)
-
-An alternative implementation approach involves lifting the current
-Federated Ingress Controller functionality up into the Federation
-control plane. This alternative is not considered in any further
-detail in this document.
-
-### Outstanding Work Items
-
-1. This should in theory all work out of the box. Need to confirm
-with a manual setup. ([#29341](https://github.com/kubernetes/kubernetes/issues/29341))
-1. Implement Federated Ingress:
-    1. API machinery (~1 day)
-    1. Controller (~3 weeks)
-1. Add DNS field to Ingress object (currently missing, but needs to be added, independent of federation)
-    1. API machinery (~1 day)
-    1. KubeDNS support (~1 week?)
-
-### Pros
-
-1. Global VIP is awesome - geo-locality, load-based overflow (but see caveats below)
-1. Leverages existing K8s Ingress machinery - not too much to add.
-1. Leverages existing Federated Service machinery - controller looks
-   almost identical, DNS provider also re-used.
-
-### Cons
-
-1. Only works across GCP clusters (but see below for a light at the end of the tunnel, for future versions).
-
-## Amazon Web Services only - backed by Route53
-
-This is an option for AWS-only federations. Parts of this are
-apparently work in progress, see e.g.
-[AWS Ingress controller](https://github.com/kubernetes/contrib/issues/346) and
-[[WIP/RFC] Simple ingress -> DNS controller, using AWS
-Route53](https://github.com/kubernetes/contrib/pull/841).
-
-### Features
-
-In summary, most of the features of [AWS Elastic Load Balancing](https://aws.amazon.com/elasticloadbalancing/) and [Route53 DNS](https://aws.amazon.com/route53/):
-
-1. Geo-aware DNS direction to closest regional elastic load balancer
-1. DNS health checks to route traffic to only healthy elastic load
-balancers
-1. A variety of possible DNS routing types, including Latency Based Routing, Geo DNS, and Weighted Round Robin
-1. Elastic Load Balancing automatically routes traffic across multiple
-   instances and multiple Availability Zones within the same region.
-1. Health checks ensure that only healthy Amazon EC2 instances receive traffic.
-
-### Implementation
-
-1. Federation user creates (federated) Ingress object
-1. Federated Ingress Controller creates Ingress object in each cluster in the federation
-1. Each cluster-level AWS Ingress Controller creates/updates
-    1. (regional) AWS Elastic Load Balancer machinery which ensures that traffic to the Ingress (backed by a Service) is directed to one of the nodes in one of the clusters in the region.
-    1. (global) AWS Route53 DNS machinery which ensures that clients are directed to the closest non-overloaded (regional) elastic load balancer.
-1. KubeProxy redirects to one of the backend Pods (currently round-robin, per KubeProxy instance) in the destination K8s cluster.
-
-### Outstanding Work Items
-
-Most of this is currently unimplemented (see [AWS Ingress controller](https://github.com/kubernetes/contrib/issues/346) and
-[[WIP/RFC] Simple ingress -> DNS controller, using AWS
-Route53](https://github.com/kubernetes/contrib/pull/841)):
-
-1. K8s AWS Ingress Controller
-1. Re-uses all of the non-GCE-specific Federation machinery discussed above under "GCP-only...".
-
-### Pros
-
-1. 
Geo-locality (via geo-DNS, not VIP) -1. Load-based overflow -1. Real load balancing (same caveats as for GCP above). -1. L7 SSL connection termination. -1. Seems it can be made to work for hybrid with on-premise (using VPC). More research required. - -### Cons - -1. K8s Ingress Controller still needs to be developed. Lots of work. -1. geo-DNS based locality/failover is not as nice as VIP-based (but very useful, nonetheless) -1. Only works on AWS (initial version, at least). - -## Cross-cloud via GCP - -### Summary - -Use GCP Federated Ingress machinery described above, augmented with additional HA-proxy backends in all GCP clusters to proxy to non-GCP clusters (via either Service External IP's, or VPN directly to KubeProxy or Pods). - -### Features - -As per GCP-only above, except that geo-locality would be to the closest GCP cluster (and possibly onwards to the closest AWS/on-prem cluster). - -### Implementation - -TBD - see Summary above in the mean time. - -### Outstanding Work - -Assuming that GCP-only (see above) is complete: - -1. Wire-up the HA-proxy load balancers to redirect to non-GCP clusters -1. Probably some more - additional detailed research and design necessary. - -### Pros - -1. Works for cross-cloud. - -### Cons - -1. Traffic to non-GCP clusters proxies through GCP clusters. Additional bandwidth costs (3x?) in those cases. - -## Cross-cloud via AWS - -In theory the same approach as "Cross-cloud via GCP" above could be used, except that AWS infrastructure would be used to get traffic first to an AWS cluster, and then proxied onwards to non-AWS and/or on-prem clusters. -Detail docs TBD. 
- - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/federated-ingress.md?pixel)]() - diff --git a/contributors/design-proposals/federation/federated-placement-policy.md b/contributors/design-proposals/federation/federated-placement-policy.md deleted file mode 100644 index e1292bd9..00000000 --- a/contributors/design-proposals/federation/federated-placement-policy.md +++ /dev/null @@ -1,371 +0,0 @@ -# Policy-based Federated Resource Placement - -This document proposes a design for policy-based control over placement of -Federated resources. - -Tickets: - -- https://github.com/kubernetes/kubernetes/issues/39982 - -Authors: - -- Torin Sandall (torin@styra.com, tsandall@github) and Tim Hinrichs - (tim@styra.com). -- Based on discussions with Quinton Hoole (quinton.hoole@huawei.com, - quinton-hoole@github), Nikhil Jindal (nikhiljindal@github). - -## Background - -Resource placement is a policy-rich problem affecting many deployments. -Placement may be based on company conventions, external regulation, pricing and -performance requirements, etc. Furthermore, placement policies evolve over time -and vary across organizations. As a result, it is difficult to anticipate the -policy requirements of all users. - -A simple example of a placement policy is - -> Certain apps must be deployed on clusters in EU zones with sufficient PCI -> compliance. - -The [Kubernetes Cluster -Federation](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federation/federation.md#policy-engine-and-migrationreplication-controllers) -design proposal includes a pluggable policy engine component that decides how -applications/resources are placed across federated clusters. - -Currently, the placement decision can be controlled for Federated ReplicaSets -using the `federation.kubernetes.io/replica-set-preferences` annotation. 
In the -future, the [Cluster -Selector](https://github.com/kubernetes/kubernetes/issues/29887) annotation will -provide control over placement of other resources. The proposed design supports -policy-based control over both of these annotations (as well as others). - -This proposal is based on a POC built using the Open Policy Agent project. [This -short video (7m)](https://www.youtube.com/watch?v=hRz13baBhfg) provides an -overview and demo of the POC. - -## Design - -The proposed design uses the [Open Policy Agent](http://www.openpolicyagent.org) -project (OPA) to realize the policy engine component from the Federation design -proposal. OPA is an open-source, general purpose policy engine that includes a -declarative policy language and APIs to answer policy queries. - -The proposed design allows administrators to author placement policies and have -them automatically enforced when resources are created or updated. The design -also covers support for automatic remediation of resource placement when policy -(or the relevant state of the world) changes. - -In the proposed design, the policy engine (OPA) is deployed on top of Kubernetes -in the same cluster as the Federation Control Plane: - -![Architecture](https://docs.google.com/drawings/d/1kL6cgyZyJ4eYNsqvic8r0kqPJxP9LzWVOykkXnTKafU/pub?w=807&h=407) - -The proposed design is divided into following sections: - -1. Control over the initial placement decision (admission controller) -1. Remediation of resource placement (opa-kube-sync/remediator) -1. Replication of Kubernetes resources (opa-kube-sync/replicator) -1. Management and storage of policies (ConfigMap) - -### 1. Initial Placement Decision - -To provide policy-based control over the initial placement decision, we propose -a new admission controller that integrates with OPA: - -When admitting requests, the admission controller executes an HTTP API call -against OPA. The API call passes the JSON representation of the resource in the -message body. 
- -The response from OPA contains the desired value for the resource’s annotations -(defined in policy by the administrator). The admission controller updates the -annotations on the resource and admits the request: - -![InitialPlacement](https://docs.google.com/drawings/d/1c9PBDwjJmdv_qVvPq0sQ8RVeZad91vAN1XT6K9Gz9k8/pub?w=812&h=288) - -The admission controller updates the resource by **merging** the annotations in -the response with existing annotations on the resource. If there are overlapping -annotation keys the admission controller replaces the existing value with the -value from the response. - -#### Example Policy Engine Query: - -```http -POST /v1/data/io/k8s/federation/admission HTTP/1.1 -Content-Type: application/json -``` - -```json -{ - "input": { - "apiVersion": "extensions/v1beta1", - "kind": "ReplicaSet", - "metadata": { - "annotations": { - "policy.federation.alpha.kubernetes.io/eu-jurisdiction-required": "true", - "policy.federation.alpha.kubernetes.io/pci-compliance-level": "2" - }, - "creationTimestamp": "2017-01-23T16:25:14Z", - "generation": 1, - "labels": { - "app": "nginx-eu" - }, - "name": "nginx-eu", - "namespace": "default", - "resourceVersion": "364993", - "selfLink": "/apis/extensions/v1beta1/namespaces/default/replicasets/nginx-eu", - "uid": "84fab96d-e188-11e6-ac83-0a580a54020e" - }, - "spec": { - "replicas": 4, - "selector": {...}, - "template": {...}, - } - } -} -``` - -#### Example Policy Engine Response: - -```http -HTTP/1.1 200 OK -Content-Type: application/json -``` - -```json -{ - "result": { - "annotations": { - "federation.kubernetes.io/replica-set-preferences": { - "clusters": { - "gce-europe-west1": { - "weight": 1 - }, - "gce-europe-west2": { - "weight": 1 - } - }, - "rebalance": true - } - } - } -} -``` - -> This example shows the policy engine returning the replica-set-preferences. -> The policy engine could similarly return a desired value for other annotations -> such as the Cluster Selector annotation. 
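The merge semantics described above (keys returned by the policy engine overwrite overlapping annotation keys; everything else is preserved) can be sketched in a few lines of Python. This is an illustrative sketch, not the actual admission controller code, and the annotation values are made up:

```python
def merge_annotations(existing, from_policy):
    """Merge policy-engine annotations into a resource's annotations.

    Keys returned by the policy engine overwrite any existing values;
    all other existing annotations are preserved.
    """
    merged = dict(existing)   # copy, so the caller's dict is not mutated
    merged.update(from_policy)
    return merged


existing = {
    "app.example/owner": "team-a",  # hypothetical developer-set annotation
    "federation.kubernetes.io/replica-set-preferences": "{}",
}
from_policy = {
    "federation.kubernetes.io/replica-set-preferences":
        '{"clusters": {"gce-europe-west1": {"weight": 1}}, "rebalance": true}',
}
merged = merge_annotations(existing, from_policy)
# The policy engine's value wins for the overlapping key, while the
# unrelated "app.example/owner" annotation is left untouched.
```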
- -#### Conflicts - -A conflict arises if the developer and the policy define different values for an -annotation. In this case, the developer's intent is provided as a policy query -input and the policy author's intent is encoded in the policy itself. Since the -policy is the only place where both the developer and policy author intents are -known, the policy (or policy engine) should be responsible for resolving the -conflict. - -There are a few options for handling conflicts. As a concrete example, this is -how a policy author could handle invalid clusters/conflicts: - -``` -package io.k8s.federation.admission - -errors["requested replica-set-preferences includes invalid clusters"] { - invalid_clusters = developer_clusters - policy_defined_clusters - invalid_clusters != set() -} - -annotations["replica-set-preferences"] = value { - value = developer_clusters & policy_defined_clusters -} - -# Not shown here: -# -# policy_defined_clusters[...] { ... } -# developer_clusters[...] { ... } -``` - -The admission controller will execute a query against -/io/k8s/federation/admission and if the policy detects an invalid cluster, the -"errors" key in the response will contain a non-empty array. In this case, the -admission controller will deny the request. - -```http -HTTP/1.1 200 OK -Content-Type: application/json -``` - -```json -{ - "result": { - "errors": [ - "requested replica-set-preferences includes invalid clusters" - ], - "annotations": { - "federation.kubernetes.io/replica-set-preferences": { - ... - } - } - } -} -``` - -This example shows how the policy could handle conflicts when the author's -intent is to define clusters that MAY be used. If the author's intent is to -define what clusters MUST be used, then the logic would not use intersection. 
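The intersection-based ("MAY") conflict handling shown in the Rego policy above can be restated in Go for illustration. All names here are hypothetical; the real resolution happens inside the policy engine, not in controller code:

```go
package main

import (
	"fmt"
	"sort"
)

// resolveClusters mirrors the policy logic above: the developer's requested
// clusters are intersected with the policy-defined clusters ("MAY" semantics),
// and any requested cluster outside the policy set is reported as invalid so
// the admission controller can deny the request.
func resolveClusters(developer, policyDefined []string) (allowed, invalid []string) {
	policySet := make(map[string]bool, len(policyDefined))
	for _, c := range policyDefined {
		policySet[c] = true
	}
	for _, c := range developer {
		if policySet[c] {
			allowed = append(allowed, c)
		} else {
			invalid = append(invalid, c)
		}
	}
	sort.Strings(allowed)
	sort.Strings(invalid)
	return allowed, invalid
}

func main() {
	allowed, invalid := resolveClusters(
		[]string{"gce-europe-west1", "us-central1"},
		[]string{"gce-europe-west1", "gce-europe-west2"},
	)
	fmt.Println(allowed) // [gce-europe-west1]
	fmt.Println(invalid) // [us-central1]
}
```

Under "MUST" semantics, the policy would instead return its own cluster set unconditionally rather than intersecting it with the developer's request.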
- -#### Configuration - -The admission controller requires configuration for the OPA endpoint: - -``` -{ - "EnforceSchedulingPolicy": { - "url": "https://opa.federation.svc.cluster.local:8181/v1/data/io/k8s/federation/annotations", - "token": "super-secret-token-value" - } -} -``` - -- `url` specifies the URL of the policy engine API to query. The query response - contains the annotations to apply to the resource. -- `token` specifies a static token to use for authentication when contacting the - policy engine. In the future, other authentication schemes may be supported. - -The configuration file is provided to the federation-apiserver with the -`--admission-control-config-file` command line argument. - -The admission controller is enabled in the federation-apiserver by providing the -`--admission-control` command line argument. E.g., -`--admission-control=AlwaysAdmit,EnforceSchedulingPolicy`. - -The admission controller will be enabled by default. - -#### Error Handling - -The admission controller is designed to **fail closed** if policies have been -created. - -Request handling may fail because of: - -- Serialization errors -- Request timeouts or other network errors -- Authentication or authorization errors from the policy engine -- Other unexpected errors from the policy engine - -In the event of request timeouts (or other network errors) or back-pressure -hints from the policy engine, the admission controller should retry after -applying a backoff. The admission controller should also create an event so that -developers can identify why their resources are not being scheduled. - -Policies are stored as ConfigMap resources in a well-known namespace. This -allows the admission controller to check if one or more policies exist. If one -or more policies exist, the admission controller will fail closed. Otherwise -the admission controller will **fail open**. - -### 2.
Remediation of Resource Placement - -When policy changes or the environment in which resources are deployed changes -(e.g. a cluster’s PCI compliance rating gets up/down-graded), resources might -need to be moved for them to obey the placement policy. Sometimes administrators -may decide to remediate manually; other times they may want Kubernetes to -remediate automatically. - -To automatically reschedule resources onto desired clusters, we introduce a -remediator component (**opa-kube-sync**) that is deployed as a sidecar with OPA. - -![Remediation](https://docs.google.com/drawings/d/1ehuzwUXSpkOXzOUGyBW0_7jS8pKB4yRk_0YRb1X4zsY/pub?w=812&h=288) - -The notifications sent to the remediator by OPA specify the new value for -annotations such as replica-set-preferences. - -When the remediator component (in the sidecar) receives the notification it -sends a PATCH request to the federation-apiserver to update the affected -resource. This way, the actual rebalancing of ReplicaSets is still handled by -the [Rescheduling -Algorithm](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federation/federated-replicasets.md) -in the Federated ReplicaSet controller. - -The remediator component must be deployed with a kubeconfig for the -federation-apiserver so that it can identify itself when sending the PATCH -requests. We can use the same mechanism that is used for the -federation-controller-manager (which also needs to identify itself when sending -requests to the federation-apiserver). - -### 3. Replication of Kubernetes Resources - -Administrators must be able to author policies that refer to properties of -Kubernetes resources. For example, assuming the following sample policy (in -English): - -> Certain apps must be deployed on Clusters in EU zones with sufficient PCI -> compliance. - -The policy definition must refer to the geographic region and PCI compliance -rating of federated clusters.
Today, the geographic region is stored as an -attribute on the cluster resource and the PCI compliance rating is an example of -data that may be included in a label or annotation. - -When the policy engine is queried for a placement decision (e.g., by the -admission controller), it must have access to the data representing the -federated clusters. - -To provide OPA with the data representing federated clusters as well as other -Kubernetes resource types (such as federated ReplicaSets), we use a sidecar -container that is deployed alongside OPA. The sidecar (“opa-kube-sync”) is -responsible for replicating Kubernetes resources into OPA: - -![Replication](https://docs.google.com/drawings/d/1XjdgszYMDHD3hP_2ynEh_R51p7gZRoa1DBTi4yq1rc0/pub?w=812&h=288) - -The sidecar/replicator component will implement the (somewhat common) list/watch -pattern against the federation-apiserver: - -- Initially, it will GET all resources of a particular type. -- Subsequently, it will GET with the **watch** and **resourceVersion** - parameters set and process add, remove, and update events accordingly. - -Each resource received by the sidecar/replicator component will be pushed into -OPA. The sidecar will likely rely on one of the existing Kubernetes Go client -libraries to handle the low-level list/watch behavior. - -As new resource types are introduced in the federation-apiserver, the -sidecar/replicator component will need to be updated to support them. As a -result, the sidecar/replicator component must be designed so that it is easy to -add support for new resource types. - -Eventually, the sidecar/replicator component may allow admins to configure which -resource types are replicated. As an optimization, the sidecar may eventually -analyze policies to determine which resource properties are required for policy -evaluation. This would allow it to replicate the minimum amount of data into -OPA. - -### 4.
Policy Management - -Policies are written in a text-based, declarative language supported by OPA. The -policies can be loaded into the policy engine either on startup or via HTTP -APIs. - -To avoid introducing additional persistent state, we propose storing policies -in ConfigMap resources in the Federation Control Plane inside a well-known -namespace (e.g., `kube-federation-scheduling-policy`). The ConfigMap resources -will be replicated into the policy engine by the sidecar. - -The sidecar can establish a watch on the ConfigMap resources in the Federation -Control Plane. This will enable hot-reloading of policies whenever they change. - -## Applicability to Other Policy Engines - -This proposal was designed based on a POC with OPA, but it can be applied to -other policy engines as well. The admission and remediation components -comprise two main pieces of functionality: (i) applying annotation values to -federated resources and (ii) asking the policy engine for annotation values. The -code for applying annotation values is completely independent of the policy -engine. The code that asks the policy engine for annotation values appears in -both the admission and remediation components. In the POC, asking OPA for -annotation values amounts to a simple RESTful API call that any other policy -engine could implement. - -## Future Work - -- This proposal uses ConfigMaps to store and manage policies. In the future, we - want to introduce a first-class **Policy** API resource.
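The replication behavior underlying sections 3 and 4 boils down to maintaining an in-memory mirror of resources keyed by namespace/name and folding watch events into it before pushing the data into OPA. A minimal, client-library-free sketch follows; all type and function names are hypothetical:

```go
package main

import "fmt"

// EventType mimics the add/update/delete event kinds delivered by a watch.
type EventType int

const (
	Added EventType = iota
	Modified
	Deleted
)

// Event carries the resource key (namespace/name) and its raw data.
type Event struct {
	Type EventType
	Key  string
	Data string
}

// Mirror is the in-memory copy of resources that would be pushed into OPA.
type Mirror map[string]string

// Apply folds a single watch event into the mirror, as the sidecar would
// after completing its initial LIST.
func (m Mirror) Apply(e Event) {
	switch e.Type {
	case Added, Modified:
		m[e.Key] = e.Data
	case Deleted:
		delete(m, e.Key)
	}
}

func main() {
	m := Mirror{}
	m.Apply(Event{Added, "federation/cluster-a", `{"region":"eu"}`})
	m.Apply(Event{Modified, "federation/cluster-a", `{"region":"eu","pci":"2"}`})
	m.Apply(Event{Added, "federation/cluster-b", `{"region":"us"}`})
	m.Apply(Event{Deleted, "federation/cluster-b", ""})
	fmt.Println(len(m)) // 1
}
```

A real implementation would also track the **resourceVersion** from each event so the watch can be resumed without re-listing.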
\ No newline at end of file diff --git a/contributors/design-proposals/federation/federated-replicasets.md b/contributors/design-proposals/federation/federated-replicasets.md deleted file mode 100644 index 8b48731c..00000000 --- a/contributors/design-proposals/federation/federated-replicasets.md +++ /dev/null @@ -1,513 +0,0 @@ -# Federated ReplicaSets - -# Requirements & Design Document - -This document is a markdown version converted from a working [Google Doc](https://docs.google.com/a/google.com/document/d/1C1HEHQ1fwWtEhyl9JYu6wOiIUJffSmFmZgkGta4720I/edit?usp=sharing). Please refer to the original for extended commentary and discussion. - -Author: Marcin Wielgus [mwielgus@google.com](mailto:mwielgus@google.com) -Based on discussions with -Quinton Hoole [quinton@google.com](mailto:quinton@google.com), Wojtek Tyczyński [wojtekt@google.com](mailto:wojtekt@google.com) - -## Overview - -### Summary & Vision - -When running a global application on a federation of Kubernetes -clusters the owner currently has to start it in multiple clusters and -control whether they have enough application replicas running both -locally in each of the clusters (so that, for example, users are -handled by a nearby cluster, with low latency) and globally (so that -there is always enough capacity to handle all traffic). If one of the -clusters has issues or doesn't have enough capacity to run the given set of -replicas, the replicas should be automatically moved to some other -cluster to keep the application responsive. - -In single-cluster Kubernetes there is a concept of a ReplicaSet that -manages the replicas locally. We want to expand this concept to the -federation level. - -### Goals - -+ Win large enterprise customers who want to easily run applications - across multiple clusters -+ Create a reference controller implementation to facilitate bringing - other Kubernetes concepts to Federated Kubernetes. - -## Glossary - -Federation Cluster - a cluster that is a member of the federation.
- -Local ReplicaSet (LRS) - ReplicaSet defined and running on a cluster -that is a member of the federation. - -Federated ReplicaSet (FRS) - ReplicaSet defined and running inside of Federated K8S server. - -Federated ReplicaSet Controller (FRSC) - A controller running inside -of Federated K8S server that controls FRS. - -## User Experience - -### Critical User Journeys - -+ [CUJ1] User wants to create a ReplicaSet in each of the federation - clusters. They create a definition of a Federated ReplicaSet on the - federated master and (local) ReplicaSets are automatically created - in each of the federation clusters. The number of replicas in each - of the Local ReplicaSets is (perhaps indirectly) configurable by - the user. -+ [CUJ2] When the current number of replicas in a cluster drops below - the desired number and new replicas cannot be scheduled then they - should be started in some other cluster. - -### Features Enabling Critical User Journeys - -Feature #1 -> CUJ1: -A component which looks for newly created Federated ReplicaSets and -creates the appropriate Local ReplicaSet definitions in the federated -clusters. - -Feature #2 -> CUJ2: -A component that checks how many replicas are actually running in each -of the subclusters and whether the number matches the -FederatedReplicaSet preferences (by default spread replicas evenly -across the clusters but custom preferences are allowed - see -below). If it doesn't and the situation is unlikely to improve soon -then the replicas should be moved to other subclusters. - -### API and CLI - -All interaction with FederatedReplicaSet will be done by issuing -kubectl commands pointing at the Federated Master API Server. All the -commands would behave in a similar way as on the regular master, -however in the next versions (1.5+) some of the commands may give -slightly different output. For example, kubectl describe on a federated -replica set should also give some information about the subclusters.
- -Moreover, for safety, some defaults will be different. For example, for -kubectl delete federatedreplicaset, cascade will be set to false. - -FederatedReplicaSet would have the same object as local ReplicaSet -(although it will be accessible in a different part of the -API). Scheduling preferences (how many replicas in which cluster) will -be passed as annotations. - -### FederatedReplicaSet preferences - -The preferences are expressed by the following structure, passed as -serialized JSON inside annotations. - -```go -type FederatedReplicaSetPreferences struct { - // If set to true then already scheduled and running replicas may be moved to other clusters - // in order to bring cluster replicasets towards a desired state. Otherwise, if set to false, - // up and running replicas will not be moved. - Rebalance bool `json:"rebalance,omitempty"` - - // Map from cluster name to preferences for that cluster. It is assumed that if a cluster - // doesn't have a matching entry then it should not have local replica. The cluster matches - // to "*" if there is no entry with the real cluster name. - Clusters map[string]ClusterReplicaSetPreferences -} - -// Preferences regarding number of replicas assigned to a cluster replicaset within a federated replicaset. -type ClusterReplicaSetPreferences struct { - // Minimum number of replicas that should be assigned to this Local ReplicaSet. 0 by default. - MinReplicas int64 `json:"minReplicas,omitempty"` - - // Maximum number of replicas that should be assigned to this Local ReplicaSet. Unbounded if no value provided (default). - MaxReplicas *int64 `json:"maxReplicas,omitempty"` - - // A number expressing the preference to put an additional replica to this LocalReplicaSet. 0 by default. - Weight int64 -} -``` - -How this works in practice: - -**Scenario 1**. I want to spread my 50 replicas evenly across all available clusters.
Config: - -```go -FederatedReplicaSetPreferences { - Rebalance : true - Clusters : map[string]LocalReplicaSet { - "*" : LocalReplicaSet{ Weight: 1} - } -} -``` - -Example: - -+ Clusters A,B,C, all have capacity. - Replica layout: A=16 B=17 C=17. -+ Clusters A,B,C and C has capacity for 6 replicas. - Replica layout: A=22 B=22 C=6 -+ Clusters A,B,C. B and C are offline: - Replica layout: A=50 - -**Scenario 2**. I want to have only 2 replicas in each of the clusters. - -```go -FederatedReplicaSetPreferences { - Rebalance : true - Clusters : map[string]LocalReplicaSet { - "*" : LocalReplicaSet{ MaxReplicas: 2; Weight: 1} - } -} -``` - -Or - -```go -FederatedReplicaSetPreferences { - Rebalance : true - Clusters : map[string]LocalReplicaSet { - "*" : LocalReplicaSet{ MinReplicas: 2; Weight: 0 } - } - } - -``` - -Or - -```go -FederatedReplicaSetPreferences { - Rebalance : true - Clusters : map[string]LocalReplicaSet { - "*" : LocalReplicaSet{ MinReplicas: 2; MaxReplicas: 2} - } -} -``` - -There is a global target of 50; however, if there are 3 clusters there will be only 6 replicas running. - -**Scenario 3**. I want to have 20 replicas in each of 3 clusters. - -```go -FederatedReplicaSetPreferences { - Rebalance : true - Clusters : map[string]LocalReplicaSet { - "*" : LocalReplicaSet{ MinReplicas: 20; Weight: 0} - } -} -``` - -There is a global target of 50; however, the clusters require 60, so some clusters will have fewer replicas. - Replica layout: A=20 B=20 C=10. - -**Scenario 4**. I want to have equal number of replicas in clusters A,B,C, however don't put more than 20 replicas to cluster C. - -```go -FederatedReplicaSetPreferences { - Rebalance : true - Clusters : map[string]LocalReplicaSet { - "*" : LocalReplicaSet{ Weight: 1} - "C" : LocalReplicaSet{ MaxReplicas: 20, Weight: 1} - } -} -``` - -Example: - -+ All have capacity. - Replica layout: A=16 B=17 C=17.
-+ B is offline/has no capacity - Replica layout: A=30 B=0 C=20 -+ A and B are offline: - Replica layout: C=20 - -**Scenario 5**. I want to run my application in cluster A, however if there are troubles FRS can also use clusters B and C, equally. - -```go -FederatedReplicaSetPreferences { - Clusters : map[string]LocalReplicaSet { - "A" : LocalReplicaSet{ Weight: 1000000} - "B" : LocalReplicaSet{ Weight: 1} - "C" : LocalReplicaSet{ Weight: 1} - } -} -``` - -Example: - -+ All have capacity. - Replica layout: A=50 B=0 C=0. -+ A has capacity for only 40 replicas - Replica layout: A=40 B=5 C=5 - -**Scenario 6**. I want to run my application in clusters A, B and C. Cluster A gets twice the QPS of the other clusters. - -```go -FederatedReplicaSetPreferences { - Clusters : map[string]LocalReplicaSet { - "A" : LocalReplicaSet{ Weight: 2} - "B" : LocalReplicaSet{ Weight: 1} - "C" : LocalReplicaSet{ Weight: 1} - } -} -``` - -**Scenario 7**. I want to spread my 50 replicas evenly across all available clusters, but if there -are already some replicas, please do not move them. Config: - -```go -FederatedReplicaSetPreferences { - Rebalance : false - Clusters : map[string]LocalReplicaSet { - "*" : LocalReplicaSet{ Weight: 1} - } -} -``` - -Example: - -+ Clusters A,B,C, all have capacity, but A already has 20 replicas - Replica layout: A=20 B=15 C=15. -+ Clusters A,B,C and C has capacity for 6 replicas, A already has 20 replicas. - Replica layout: A=22 B=22 C=6 -+ Clusters A,B,C and C has capacity for 6 replicas, A already has 30 replicas. - Replica layout: A=30 B=14 C=6 - -## The Idea - -A new federated controller - Federated Replica Set Controller (FRSC) -will be created inside federated controller manager. The key elements -of the idea are enumerated below: - -+ [I0] It is considered OK to have a slightly higher number of replicas - globally for some time. - -+ [I1] FRSC starts an informer on the FederatedReplicaSet that listens - for FRS being created, updated or deleted.
On each create/update the - scheduling code will be started to calculate where to put the - replicas. The default behavior is to start the same number of - replicas in each of the clusters. While creating LocalReplicaSets - (LRS) the following errors/issues can occur: - - + [E1] Master rejects LRS creation (for known or unknown - reason). In this case another attempt to create the LRS should be - made in 1m or so. This action can be tied with - [[I5]](#heading=h.ififs95k9rng). Until the LRS is created - the situation is the same as [E5]. If this happens multiple - times all due replicas should be moved elsewhere and later moved - back once the LRS is created. - - + [E2] LRS with the same name but different configuration already - exists. The LRS is then overwritten and an appropriate event - created to explain what happened. Pods under the control of the - old LRS are left intact and the new LRS may adopt them if they - match the selector. - - + [E3] LRS is new but the pods that match the selector exist. The - pods are adopted by the RS (if not owned by some other - RS). However they may have a different image, configuration - etc. Just like with a regular LRS. - -+ [I2] For each of the clusters FRSC starts a store and an informer on - LRS that will listen for status updates. These status changes are - only interesting in case of troubles. Otherwise it is assumed that - LRS runs trouble free and there is always the right number of pods - created but possibly not scheduled. - - - + [E4] LRS is manually deleted from the local cluster. In this case - a new LRS should be created. It is the same case as - [[E1]](#heading=h.wn3dfsyc4yuh). Any pods that were left behind - won't be killed and will be adopted after the LRS is recreated. - - + [E5] LRS fails to create (not necessarily schedule) the desired - number of pods due to master troubles, admission control - etc.
This should be considered the same situation as replicas - being unable to schedule (see [[I4]](#heading=h.dqalbelvn1pv)). - - + [E6] It is impossible to tell that an informer lost connection - with a remote cluster or has other synchronization problems, so it - should be handled by the cluster liveness probe and deletion - [[I6]](#heading=h.z90979gc2216). - -+ [I3] For each of the clusters start a store and an informer to monitor - whether the created pods are eventually scheduled and what is the - current number of correctly running ready pods. Errors: - - + [E7] It is impossible to tell that an informer lost connection - with a remote cluster or has other synchronization problems, so it - should be handled by the cluster liveness probe and deletion - [[I6]](#heading=h.z90979gc2216) - -+ [I4] It is assumed that an unscheduled pod is a normal situation -that can last up to X min if there is heavy traffic on the -cluster. However if the replicas are not scheduled in that time then -FRSC should consider moving most of the unscheduled replicas -elsewhere. For that purpose FRSC will maintain a data structure -where for each FRS-controlled LRS we store a list of pods belonging -to that LRS along with their current status and status change timestamp. - -+ [I5] If a new cluster is added to the federation then it doesn't - have a LRS and the situation is equal to - [[E1]](#heading=h.wn3dfsyc4yuh)/[[E4]](#heading=h.vlyovyh7eef). - -+ [I6] If a cluster is removed from the federation then the situation - is equal to multiple [E4]. It is assumed that if a connection with - a cluster is lost completely then the cluster is removed from the - cluster list (or marked accordingly) so - [[E6]](#heading=h.in6ove1c1s8f) and [[E7]](#heading=h.37bnbvwjxeda) - don't need to be handled. - -+ [I7] All ToBeChecked FRS are browsed every 1 min (configurable), - checked against the current list of clusters, and all missing LRS - are created. This will be executed in combination with [I8].
- -+ [I8] All pods from ToBeChecked FRS/LRS are browsed every 1 min - (configurable) to check whether some replica move between clusters - is needed or not. - -+ FRSC never moves replicas to an LRS that has pods that are not -scheduled/running or that failed to be created. - - + When FRSC notices that a number of pods are not scheduled/running - or not even created in one LRS for more than Y minutes, it takes - most of them from the LRS, leaving a couple still waiting so that once - they are scheduled FRSC will know that it is OK to put some more - replicas in that cluster. - -+ [I9] FRS becomes ToBeChecked if: - + It is newly created - + Some replica set inside changed its status - + Some pods inside cluster changed their status - + Some cluster is added or deleted. -> FRS stops being ToBeChecked if it is in the desired configuration (or is stable enough). - -## (RE)Scheduling algorithm - -To calculate the (re)scheduling moves for a given FRS: - -1. For each cluster FRSC calculates the number of replicas that are placed -(not necessarily up and running) in the cluster and the number of replicas that -failed to be scheduled. Cluster capacity is the difference between the -placed replicas and those that failed to be scheduled. - -2. Order all clusters by their weight and hash of the name so that every time -we process the same replica-set we process the clusters in the same order. -Include the federated replica set name in the cluster name hash so that we get -slightly different orderings for different RS, so that not all RS of size 1 -end up on the same cluster. - -3. Assign the minimum preferred number of replicas to each of the clusters, if -there are enough replicas and capacity. - -4. If rebalance = false, assign the previously present replicas to the clusters, -remembering the number of extra replicas added (ER) - again, only if there -are enough replicas and capacity. - -5. Distribute the remaining replicas with regard to weights and cluster capacity.
-In multiple iterations calculate how many of the replicas should end up in each cluster. -For each of the clusters cap the number of assigned replicas by the max number of replicas and -the cluster capacity. If there were some extra replicas added to the cluster in step -4, don't actually add the replicas but balance them against the ER count from step 4. - -## Goroutines layout - -+ [GR1] Involved in FRS informer (see - [[I1]]). Whenever a FRS is created or - updated it puts the new/updated FRS on FRS_TO_CHECK_QUEUE with - delay 0. - -+ [GR2_1...GR2_N] Involved in informers/store on LRS (see - [[I2]]). On all changes the FRS is put on - FRS_TO_CHECK_QUEUE with delay 1min. - -+ [GR3_1...GR3_N] Involved in informers/store on Pods - (see [[I3]] and [[I4]]). They maintain the status store - so that for each of the LRS we know the number of pods that are - actually running and ready in O(1) time. They also put the - corresponding FRS on FRS_TO_CHECK_QUEUE with delay 1min. - -+ [GR4] Involved in cluster informer (see - [[I5]] and [[I6]] ). It puts all FRS on FRS_TO_CHECK_QUEUE - with delay 0. - -+ [GR5_*] Goroutines handling FRS_TO_CHECK_QUEUE that put FRS on - FRS_CHANNEL after the given delay (and remove from - FRS_TO_CHECK_QUEUE). Every time an already present FRS is added to - FRS_TO_CHECK_QUEUE the delays are compared and updated so that the - shorter delay is used. - -+ [GR6] Contains a selector that listens on FRS_CHANNEL. Whenever - a FRS is received it is put into a work queue. The work queue has no delay - and makes sure that a single replica set is processed by - only one goroutine. - -+ [GR7_*] Goroutines related to the workqueue. They fire DoFrsCheck on the FRS. - Multiple replica sets can be processed in parallel. Two goroutines cannot - process the same FRS at the same time. - - -## Func DoFrsCheck - -The function does [[I7]] and [[I8]].
It is assumed that it is run on a -single thread/goroutine, so we don't check and evaluate the same FRS on many -goroutines (however, if needed, the function can be parallelized for -different FRS). It takes data only from stores maintained by GR2_* and -GR3_*. The external communication is only required to: - -+ Create LRS. If a LRS doesn't exist it is created after the - rescheduling, when we know how many replicas it should have. - -+ Update LRS replica targets. - -If FRS is not in the desired state then it is put on -FRS_TO_CHECK_QUEUE with delay 1min (possibly increasing). - -## Monitoring and status reporting - -FRSC should expose a number of metrics from the run, like - -+ FRSC -> LRS communication latency -+ Total times spent in various elements of DoFrsCheck - -FRSC should also expose the status of FRS as an annotation on FRS and -as events. - -## Workflow - -Here is the sequence of tasks that need to be done in order for a -typical FRS to be split into a number of LRS's and to be created in -the underlying federated clusters. - -Note a: the reason the workflow would be helpful at this phase is that -for every one or two steps we can create PRs accordingly to start -the development. - -Note b: we assume that the federation is already in place and the -federated clusters are added to the federation. - -Step 1. The client sends an RS create request to the -federation-apiserver - -Step 2. federation-apiserver persists an FRS into the federation etcd - -Note c: federation-apiserver populates the clusterid field in the FRS -before persisting it into the federation etcd - -Step 3: the federation-level “informer” in FRSC watches federation -etcd for new/modified FRS's, with empty clusterid or clusterid equal -to federation ID, and if detected, it calls the scheduling code - -Step 4.
- -Note d: scheduler populates the clusterid field in the LRS with the -IDs of target clusters - -Note e: at this point let us assume that it only does the even -distribution, i.e., equal weights for all of the underlying clusters - -Step 5. As soon as the scheduler function returns control to FRSC, -the FRSC starts a number of cluster-level “informer”s, one for every -target cluster, to watch changes in every target cluster etcd -regarding the posted LRS's, and if any violation from the scheduled -number of replicas is detected the scheduling code is re-called for -re-scheduling purposes. - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/federated-replicasets.md?pixel)]() - diff --git a/contributors/design-proposals/federation/federated-services.md b/contributors/design-proposals/federation/federated-services.md deleted file mode 100644 index 8ec9ca29..00000000 --- a/contributors/design-proposals/federation/federated-services.md +++ /dev/null @@ -1,519 +0,0 @@ -# Kubernetes Cluster Federation (previously nicknamed "Ubernetes") - -## Cross-cluster Load Balancing and Service Discovery - -### Requirements and System Design - -### by Quinton Hoole, Dec 3 2015 - -## Requirements - -### Discovery, Load-balancing and Failover - -1. **Internal discovery and connection**: Pods/containers (running in - a Kubernetes cluster) must be able to easily discover and connect - to endpoints for Kubernetes services on which they depend in a - consistent way, irrespective of whether those services exist in a - different Kubernetes cluster within the same cluster federation. - Henceforth referred to as "cluster-internal clients", or simply - "internal clients". -1. **External discovery and connection**: External clients (running - outside a Kubernetes cluster) must be able to discover and connect - to endpoints for Kubernetes services on which they depend. - 1.
**External clients predominantly speak HTTP(S)**: External - clients are most often, but not always, web browsers, or at - least speak HTTP(S) - notable exceptions include Enterprise - Message Busses (Java, TLS), DNS servers (UDP), - SIP servers, and databases -1. **Find the "best" endpoint:** Upon initial discovery and - connection, both internal and external clients should ideally find - "the best" endpoint if multiple eligible endpoints exist. "Best" - in this context implies the closest (by network topology) endpoint - that is both operational (as defined by some positive health check) - and not overloaded (by some published load metric). For example: - 1. An internal client should find an endpoint which is local to its - own cluster if one exists, in preference to one in a remote - cluster (if both are operational and non-overloaded). - Similarly, one in a nearby cluster (e.g. in the same zone or - region) is preferable to one further afield. - 1. An external client (e.g. in New York City) should find an - endpoint in a nearby cluster (e.g. U.S. East Coast) in - preference to one further away (e.g. Japan). -1. **Easy fail-over:** If the endpoint to which a client is connected - becomes unavailable (no network response/disconnected) or - overloaded, the client should reconnect to a better endpoint, - somehow. - 1. In the case where there exist one or more connection-terminating - load balancers between the client and the serving Pod, failover - might be completely automatic (i.e. the client's end of the - connection remains intact, and the client is completely - oblivious of the fail-over). This approach incurs network speed - and cost penalties (by traversing possibly multiple load - balancers), but requires zero smarts in clients, DNS libraries, - recursing DNS servers etc., as the IP address of the endpoint - remains constant over time. - 1. In a scenario where clients need to choose between multiple load - balancer endpoints (e.g.
one per cluster), multiple DNS A - records associated with a single DNS name enable even relatively - dumb clients to try the next IP address in the list of returned - A records (without even necessarily re-issuing a DNS resolution - request). For example, all major web browsers will try all A - records in sequence until a working one is found (TBD: justify - this claim with details for Chrome, IE, Safari, Firefox). - 1. In a slightly more sophisticated scenario, upon disconnection, a - smarter client might re-issue a DNS resolution query, and - (modulo DNS record TTL's which can typically be set as low as 3 - minutes, and buggy DNS resolvers, caches and libraries which - have been known to completely ignore TTL's), receive updated A - records specifying a new set of IP addresses to which to - connect. - -### Portability - -A Kubernetes application configuration (e.g. for a Pod, Replication -Controller, Service etc) should be able to be successfully deployed -into any Kubernetes Cluster or Federation of Clusters, -without modification. More specifically, a typical configuration -should work correctly (although possibly not optimally) across any of -the following environments: - -1. A single Kubernetes Cluster on one cloud provider (e.g. Google - Compute Engine, GCE). -1. A single Kubernetes Cluster on a different cloud provider - (e.g. Amazon Web Services, AWS). -1. A single Kubernetes Cluster on a non-cloud, on-premise data center -1. A Federation of Kubernetes Clusters all on the same cloud provider - (e.g. GCE). -1. A Federation of Kubernetes Clusters across multiple different cloud - providers and/or on-premise data centers (e.g. one cluster on - GCE/GKE, one on AWS, and one on-premise). - -### Trading Portability for Optimization - -It should be possible to explicitly opt out of portability across some -subset of the above environments in order to take advantage of -non-portable load balancing and DNS features of one or more -environments. 
More specifically, for example: - -1. For HTTP(S) applications running on GCE-only Federations, - [GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules) - should be usable. These provide single, static global IP addresses - which load balance and fail over globally (i.e. across both regions - and zones). These allow for really dumb clients, but they only - work on GCE, and only for HTTP(S) traffic. -1. For non-HTTP(S) applications running on GCE-only Federations within - a single region, - [GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/) - should be usable. These provide TCP (i.e. both HTTP/S and - non-HTTP/S) load balancing and failover, but only on GCE, and only - within a single region. - [Google Cloud DNS](https://cloud.google.com/dns) can be used to - route traffic between regions (and between different cloud - providers and on-premise clusters, as it's plain DNS, IP only). -1. For applications running on AWS-only Federations, - [AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/) - should be usable. These provide both L7 (HTTP(S)) and L4 load - balancing, but only within a single region, and only on AWS - ([AWS Route 53 DNS service](https://aws.amazon.com/route53/) can be - used to load balance and fail over across multiple regions, and is - also capable of resolving to non-AWS endpoints). - -## Component Cloud Services - -Cross-cluster Federated load balancing is built on top of the following: - -1. [GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules) - provide single, static global IP addresses which load balance and - fail over globally (i.e. across both regions and zones). These - allow for really dumb clients, but they only work on GCE, and only - for HTTP(S) traffic. -1. 
[GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/) - provide both HTTP(S) and non-HTTP(S) load balancing and failover, - but only on GCE, and only within a single region. -1. [AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/) - provide both L7 (HTTP(S)) and L4 load balancing, but only within a - single region, and only on AWS. -1. [Google Cloud DNS](https://cloud.google.com/dns) (or any other - programmable DNS service, like - [CloudFlare](http://www.cloudflare.com)) can be used to route - traffic between regions (and between different cloud providers and - on-premise clusters, as it's plain DNS, IP only). Google Cloud DNS - doesn't provide any built-in geo-DNS, latency-based routing, health - checking, weighted round robin or other advanced capabilities. - It's plain old DNS. We would need to build all the aforementioned - on top of it. It can provide internal DNS services (i.e. serve RFC - 1918 addresses). - 1. [AWS Route 53 DNS service](https://aws.amazon.com/route53/) can - be used to load balance and fail over across regions, and is also - capable of routing to non-AWS endpoints. It provides built-in - geo-DNS, latency-based routing, health checking, weighted - round robin and optional tight integration with some other - AWS services (e.g. Elastic Load Balancers). -1. Kubernetes L4 Service Load Balancing: This provides both a - [virtual cluster-local](http://kubernetes.io/v1.1/docs/user-guide/services.html#virtual-ips-and-service-proxies) - and a - [real externally routable](http://kubernetes.io/v1.1/docs/user-guide/services.html#type-loadbalancer) - service IP which is load-balanced (currently simple round-robin) - across the healthy pods comprising a service within a single - Kubernetes cluster. -1. 
[Kubernetes Ingress](http://kubernetes.io/v1.1/docs/user-guide/ingress.html): -A generic wrapper around cloud-provided L4 and L7 load balancing services, and -roll-your-own load balancers running in pods, e.g. HA Proxy. - -## Cluster Federation API - -The Cluster Federation API for load balancing should be compatible with the equivalent -Kubernetes API, to ease porting of clients between Kubernetes and -federations of Kubernetes clusters. -Further details below. - -## Common Client Behavior - -To be useful, our load balancing solution needs to work properly with real -client applications. There are a few different classes of those... - -### Browsers - -These are the most common external clients. They generally behave as -well-written clients (see below). - -### Well-written clients - -1. Do a DNS resolution every time they connect. -1. Don't cache beyond the TTL (although a small percentage of the DNS - servers on which they rely might). -1. Do try multiple A records (in order) to connect. -1. (in an ideal world) Do use SRV records rather than hard-coded port numbers. - -Examples: - -+ all common browsers (except for SRV records) -+ ... - -### Dumb clients - -1. Don't do a DNS resolution every time they connect (or do cache beyond the -TTL). -1. Do try multiple A records - -Examples: - -+ ... - -### Dumber clients - -1. Only do a DNS lookup once on startup. -1. Only try the first returned DNS A record. - -Examples: - -+ ... - -### Dumbest clients - -1. Never do a DNS lookup - are pre-configured with a single (or possibly -multiple) fixed server IP(s). Nothing else matters. - -## Architecture and Implementation - -### General Control Plane Architecture - -Each cluster hosts one or more Cluster Federation master components (Federation API -servers, controller managers with leader election, and etcd quorum members). 
This -is documented in more detail in a separate design doc: -[Kubernetes and Cluster Federation Control Plane Resilience](https://docs.google.com/document/d/1jGcUVg9HDqQZdcgcFYlWMXXdZsplDdY6w3ZGJbU7lAw/edit#). - -In the description below, assume that 'n' clusters, named 'cluster-1'... -'cluster-n' have been registered against a Cluster Federation "federation-1", -each with their own set of Kubernetes API endpoints, e.g. -"[http://endpoint-1.cluster-1](http://endpoint-1.cluster-1), -[http://endpoint-2.cluster-1](http://endpoint-2.cluster-1) -... [http://endpoint-m.cluster-n](http://endpoint-m.cluster-n)". - -### Federated Services - -Federated Services are pretty straight-forward. They comprise multiple -equivalent underlying Kubernetes Services, each with their own external -endpoint, and a load balancing mechanism across them. Let's work through how -exactly that works in practice. - -Our user creates the following Federated Service (against a Federation -API endpoint): - - $ kubectl create -f my-service.yaml --context="federation-1" - -where my-service.yaml contains the following: - - kind: Service - metadata: - labels: - run: my-service - name: my-service - namespace: my-namespace - spec: - ports: - - port: 2379 - protocol: TCP - targetPort: 2379 - name: client - - port: 2380 - protocol: TCP - targetPort: 2380 - name: peer - selector: - run: my-service - type: LoadBalancer - -The Cluster Federation control system in turn creates one equivalent service (identical config to the above) -in each of the underlying Kubernetes clusters, each of which results in -something like this: - - $ kubectl get -o yaml --context="cluster-1" service my-service - - apiVersion: v1 - kind: Service - metadata: - creationTimestamp: 2015-11-25T23:35:25Z - labels: - run: my-service - name: my-service - namespace: my-namespace - resourceVersion: "147365" - selfLink: /api/v1/namespaces/my-namespace/services/my-service - uid: 33bfc927-93cd-11e5-a38c-42010af00002 - spec: - clusterIP: 
10.0.153.185 - ports: - - name: client - nodePort: 31333 - port: 2379 - protocol: TCP - targetPort: 2379 - - name: peer - nodePort: 31086 - port: 2380 - protocol: TCP - targetPort: 2380 - selector: - run: my-service - sessionAffinity: None - type: LoadBalancer - status: - loadBalancer: - ingress: - - ip: 104.197.117.10 - -Similar services are created in `cluster-2` and `cluster-3`, each of which are -allocated their own `spec.clusterIP`, and `status.loadBalancer.ingress.ip`. - -In the Cluster Federation `federation-1`, the resulting federated service looks as follows: - - $ kubectl get -o yaml --context="federation-1" service my-service - - apiVersion: v1 - kind: Service - metadata: - creationTimestamp: 2015-11-25T23:35:23Z - labels: - run: my-service - name: my-service - namespace: my-namespace - resourceVersion: "157333" - selfLink: /api/v1/namespaces/my-namespace/services/my-service - uid: 33bfc927-93cd-11e5-a38c-42010af00007 - spec: - clusterIP: - ports: - - name: client - nodePort: 31333 - port: 2379 - protocol: TCP - targetPort: 2379 - - name: peer - nodePort: 31086 - port: 2380 - protocol: TCP - targetPort: 2380 - selector: - run: my-service - sessionAffinity: None - type: LoadBalancer - status: - loadBalancer: - ingress: - - hostname: my-service.my-namespace.my-federation.my-domain.com - -Note that the federated service: - -1. Is API-compatible with a vanilla Kubernetes service. -1. has no clusterIP (as it is cluster-independent) -1. has a federation-wide load balancer hostname - -In addition to the set of underlying Kubernetes services (one per cluster) -described above, the Cluster Federation control system has also created a DNS name (e.g. on -[Google Cloud DNS](https://cloud.google.com/dns) or -[AWS Route 53](https://aws.amazon.com/route53/), depending on configuration) -which provides load balancing across all of those services. 
For example, in a -very basic configuration: - - $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com - my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.117.10 - my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77 - my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157 - -Each of the above IP addresses (which are just the external load balancer -ingress IP's of each cluster service) is of course load balanced across the pods -comprising the service in each cluster. - -In a more sophisticated configuration (e.g. on GCE or GKE), the Cluster -Federation control system -automatically creates a -[GCE Global L7 Load Balancer](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules) -which exposes a single, globally load-balanced IP: - - $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com - my-service.my-namespace.my-federation.my-domain.com 180 IN A 107.194.17.44 - -Optionally, the Cluster Federation control system also configures the local DNS servers (SkyDNS) -in each Kubernetes cluster to preferentially return the local -clusterIP for the service in that cluster, with other clusters' -external service IP's (or a global load-balanced IP) also configured -for failover purposes: - - $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com - my-service.my-namespace.my-federation.my-domain.com 180 IN A 10.0.153.185 - my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77 - my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157 - -If Cluster Federation Global Service Health Checking is enabled, multiple service health -checkers running across the federated clusters collaborate to monitor the health -of the service endpoints, and automatically remove unhealthy endpoints from the -DNS record (e.g. 
a majority quorum is required to vote a service endpoint -unhealthy, to avoid false positives due to individual health checker network -isolation). - -### Federated Replication Controllers - -So far we have a federated service defined, with a resolvable load balancer -hostname by which clients can reach it, but no pods serving traffic directed -there. So now we need a Federated Replication Controller. These are also fairly -straight-forward, being comprised of multiple underlying Kubernetes Replication -Controllers which do the hard work of keeping the desired number of Pod replicas -alive in each Kubernetes cluster. - - $ kubectl create -f my-service-rc.yaml --context="federation-1" - -where `my-service-rc.yaml` contains the following: - - kind: ReplicationController - metadata: - labels: - run: my-service - name: my-service - namespace: my-namespace - spec: - replicas: 6 - selector: - run: my-service - template: - metadata: - labels: - run: my-service - spec: - containers: - image: gcr.io/google_samples/my-service:v1 - name: my-service - ports: - - containerPort: 2379 - protocol: TCP - - containerPort: 2380 - protocol: TCP - -The Cluster Federation control system in turn creates one equivalent replication controller -(identical config to the above, except for the replica count) in each -of the underlying Kubernetes clusters, each of which results in -something like this: - - $ ./kubectl get -o yaml rc my-service --context="cluster-1" - kind: ReplicationController - metadata: - creationTimestamp: 2015-12-02T23:00:47Z - labels: - run: my-service - name: my-service - namespace: my-namespace - selfLink: /api/v1/namespaces/my-namespace/replicationcontrollers/my-service - uid: 86542109-9948-11e5-a38c-42010af00002 - spec: - replicas: 2 - selector: - run: my-service - template: - metadata: - labels: - run: my-service - spec: - containers: - image: gcr.io/google_samples/my-service:v1 - name: my-service - ports: - - containerPort: 2379 - protocol: TCP - - containerPort: 
2380 - protocol: TCP - resources: {} - dnsPolicy: ClusterFirst - restartPolicy: Always - status: - replicas: 2 - -The exact number of replicas created in each underlying cluster will of course -depend on what scheduling policy is in force. In the above example, the -scheduler created an equal number of replicas (2) in each of the three -underlying clusters, to make up the total of 6 replicas required. To handle -entire cluster failures, various approaches are possible, including: -1. **simple overprovisioning**, such that sufficient replicas remain even if a - cluster fails. This wastes some resources, but is simple and reliable. - -2. **pod autoscaling**, where the replication controller in each - cluster automatically and autonomously increases the number of - replicas in its cluster in response to the additional traffic - diverted from the failed cluster. This saves resources and is relatively - simple, but there is some delay in the autoscaling. - -3. **federated replica migration**, where the Cluster Federation - control system detects the cluster failure and automatically - increases the replica count in the remaining clusters to make up - for the lost replicas in the failed cluster. This does not seem to - offer any benefits relative to pod autoscaling above, and is - arguably more complex to implement, but we note it here as a - possibility. - -### Implementation Details - -The implementation approach and architecture are very similar to Kubernetes, so -if you're familiar with how Kubernetes works, none of what follows will be -surprising. One additional design driver not present in Kubernetes is that -the Cluster Federation control system aims to be resilient to individual cluster and availability zone -failures. So the control plane spans multiple clusters. More specifically: - -+ Cluster Federation runs its own distinct set of API servers (typically one - or more per underlying Kubernetes cluster). 
These are completely - distinct from the Kubernetes API servers for each of the underlying - clusters. -+ Cluster Federation runs its own distinct quorum-based metadata store (etcd, - by default). Approximately 1 quorum member runs in each underlying - cluster ("approximately" because we aim for an odd number of quorum - members, and typically don't want more than 5 quorum members, even - if we have a larger number of federated clusters, so 2 clusters->3 - quorum members, 3->3, 4->3, 5->5, 6->5, 7->5 etc). - -Cluster Controllers in the Federation control system watch the -Federation API server/etcd -state, and apply changes to the underlying kubernetes clusters accordingly. They -also provide an anti-entropy mechanism for reconciling Cluster Federation "desired desired" -state against kubernetes "actual desired" state. - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/federated-services.md?pixel)]() - diff --git a/contributors/design-proposals/federation/federation-clusterselector.md b/contributors/design-proposals/federation/federation-clusterselector.md deleted file mode 100644 index 154412c7..00000000 --- a/contributors/design-proposals/federation/federation-clusterselector.md +++ /dev/null @@ -1,81 +0,0 @@ -# ClusterSelector Federated Resource Placement - -This document proposes a design for label-based control over placement of -Federated resources. - -Tickets: - -- https://github.com/kubernetes/kubernetes/issues/29887 - -Authors: - -- Dan Wilson (emaildanwilson@github.com). -- Nikhil Jindal (nikhiljindal@github). - -## Background - -End users will often need a simple way to target a subset of clusters for deployment of resources. In some cases this will be for a specific cluster; in other cases it will be for groups of clusters. -A few examples... - -1. Deploy the foo service to all clusters in Europe -1. Deploy the bar service to cluster test15 -1. 
Deploy the baz service to all prod clusters globally - -Currently, it's possible to control placement decisions of Federated ReplicaSets -using the `federation.kubernetes.io/replica-set-preferences` annotation. This provides functionality to change the number of replicas created in each Federated Cluster, by setting the quantity for each Cluster by Cluster Name. Since cluster names are required, whenever clusters are added to or removed from the Federation, the object definitions must change in order to maintain the same configuration. From the example above, if a new cluster is created in Europe and added to the federation, then the replica-set-preferences would need to be updated to include the new cluster name. - -This proposal is to provide placement decision support for all object types using Labels on the Federated Clusters as opposed to cluster names. The matching language currently used for nodeAffinity placement decisions onto nodes can be leveraged. - -Carrying forward the examples from above... - -1. "location=europe" -1. "someLabel exists" -1. "environment notin ["qa", "dev"]" - -## Design - -The proposed design uses a ClusterSelector annotation whose value is parsed into a struct definition that follows the same design as the [NodeSelector type used w/ nodeAffinity](https://github.com/kubernetes/kubernetes/blob/master/pkg/api/types.go#L1972) and will also use the [Matches function](https://github.com/kubernetes/apimachinery/blob/master/pkg/labels/selector.go#L172) of the apimachinery project to determine if an object should be sent on to federated clusters or not. - -In situations where objects are not to be forwarded to federated clusters, a delete api call will instead be made using the object definition. If the object does not exist it will be ignored. - -The federation-controller will be used to implement this, with shared logic stored as utility functions to reduce duplicated code where appropriate. 
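The ALL-requirements-must-match evaluation described above can be sketched as follows. This is an illustrative stand-in, not the real implementation: the proposal reuses the NodeSelector types and the apimachinery `Matches` function, whereas the types and helpers below (`Requirement`, `matches`, `selected`) are invented for the example, and the `Gt`/`Lt` operators are omitted for brevity:

```go
package main

import "fmt"

// Requirement loosely mirrors one ClusterSelectorRequirement from the
// annotation schema described above (illustrative, not the real API type).
type Requirement struct {
	Key      string
	Operator string // e.g. "in", "notin", "exists", "!"
	Values   []string
}

// matches reports whether a cluster's labels satisfy one requirement.
func matches(r Requirement, labels map[string]string) bool {
	v, ok := labels[r.Key]
	switch r.Operator {
	case "In", "in", "=", "==":
		if !ok {
			return false
		}
		for _, want := range r.Values {
			if v == want {
				return true
			}
		}
		return false
	case "NotIn", "notin", "!=":
		if !ok {
			return true
		}
		for _, want := range r.Values {
			if v == want {
				return false
			}
		}
		return true
	case "Exists", "exists":
		return ok
	case "DoesNotExist", "!":
		return !ok
	}
	return false // Gt/Lt omitted for brevity
}

// selected applies the ALL-must-match rule: every requirement must hold
// for the object to be forwarded to that Federated Cluster.
func selected(reqs []Requirement, labels map[string]string) bool {
	for _, r := range reqs {
		if !matches(r, labels) {
			return false
		}
	}
	return true
}

func main() {
	reqs := []Requirement{
		{Key: "location", Operator: "in", Values: []string{"europe"}},
		{Key: "environment", Operator: "==", Values: []string{"prod"}},
	}
	euProd := map[string]string{"location": "europe", "environment": "prod"}
	usProd := map[string]string{"location": "us", "environment": "prod"}
	fmt.Println(selected(reqs, euProd)) // true
	fmt.Println(selected(reqs, usProd)) // false
}
```

Clusters that fail the match would receive a delete call for the object instead, per the design above.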
- -### End User Functionality -The annotation `federation.alpha.kubernetes.io/cluster-selector` is used on kubernetes objects to specify additional placement decisions that should be made. The value of the annotation will be a json object of type ClusterSelector, which is an array of type ClusterSelectorRequirement. - -Each ClusterSelectorRequirement is defined in three possible parts: -1. Key - Matches against label keys on the Federated Clusters. -1. Operator - Represents how the Key and/or Values will be matched against the label keys and values on the Federated Clusters; one of ("In", "in", "=", "==", "NotIn", "notin", "Exists", "exists", "!=", "DoesNotExist", "!", "Gt", "gt", "Lt", "lt"). -1. Values - Matches against the label values on the Federated Clusters using the Key specified. When the operator is "Exists", "exists", "DoesNotExist" or "!" then Values should not be specified. - -Example ConfigMap that uses the ClusterSelector annotation. The yaml format is used here to show that the value of the annotation will still be json. -```yaml -apiVersion: v1 -data: - myconfigkey: myconfigdata -kind: ConfigMap -metadata: - annotations: - federation.alpha.kubernetes.io/cluster-selector: '[{"key": "location", "operator": - "in", "values": ["europe"]}, {"key": "environment", "operator": "==", "values": - ["prod"]}]' - creationTimestamp: 2017-02-07T19:43:40Z - name: myconfig -``` - -In order for the configmap in the example above to be forwarded to any Federated Cluster, that cluster MUST have two Labels: "location" with a value of "europe" and "environment" with a value of "prod". - -### Matching Logic - -The logic to determine if an object is sent to a Federated Cluster follows these rules: - -1. An object with no `federation.alpha.kubernetes.io/cluster-selector` annotation will always be forwarded on to all Federated Clusters, even if they have labels configured. (This ensures no regression from existing functionality.) - -1. 
If an object contains the `federation.alpha.kubernetes.io/cluster-selector` annotation, then ALL ClusterSelectorRequirements must match in order for the object to be forwarded to the Federated Cluster. - -1. If `federation.kubernetes.io/replica-set-preferences` are also defined, they will be applied AFTER the ClusterSelectorRequirements. - -## Open Questions - -1. Should there be any special considerations for when dependent resources would not be forwarded together to a Federated Cluster? -1. How to improve usability of this feature long term. It will certainly help to give first-class API support, but easier ways to map labels or requirements to objects may be required. diff --git a/contributors/design-proposals/federation/federation-high-level-arch.png b/contributors/design-proposals/federation/federation-high-level-arch.png deleted file mode 100644 index 8a416cc1..00000000 Binary files a/contributors/design-proposals/federation/federation-high-level-arch.png and /dev/null differ diff --git a/contributors/design-proposals/federation/federation-lite.md b/contributors/design-proposals/federation/federation-lite.md deleted file mode 100644 index 549f98df..00000000 --- a/contributors/design-proposals/federation/federation-lite.md +++ /dev/null @@ -1,201 +0,0 @@ -# Kubernetes Multi-AZ Clusters - -## (previously nicknamed "Ubernetes-Lite") - -## Introduction - -Full Cluster Federation will offer sophisticated federation between multiple kubernetes -clusters, offering true high-availability, multiple provider support & -cloud-bursting, multiple region support etc. However, many users have -expressed a desire for a "reasonably" highly-available cluster that runs in -multiple zones on GCE or availability zones in AWS, and can tolerate the failure -of a single zone without the complexity of running multiple clusters. - -Multi-AZ Clusters aim to deliver exactly that functionality: to run a single -Kubernetes cluster in multiple zones. 
It will attempt to make reasonable -scheduling decisions, in particular so that a replication controller's pods are -spread across zones, and it will try to be aware of constraints - for example -that a volume cannot be mounted on a node in a different zone. - -Multi-AZ Clusters are deliberately limited in scope; for many advanced functions -the answer will be "use full Cluster Federation". For example, multiple-region -support is not in scope. Routing affinity (e.g. so that a webserver will -prefer to talk to a backend service in the same zone) is similarly not in -scope. - -## Design - -These are the main requirements: - -1. kube-up must allow bringing up a cluster that spans multiple zones. -1. pods in a replication controller should attempt to spread across zones. -1. pods which require volumes should not be scheduled onto nodes in a different zone. -1. load-balanced services should work reasonably - -### kube-up support - -kube-up support for multiple zones will initially be considered -advanced/experimental functionality, so the interface is not initially going to -be particularly user-friendly. As we design the evolution of kube-up, we will -make multiple zones better supported. - -For the initial implementation, kube-up must be run multiple times, once for -each zone. The first kube-up will take place as normal, but then for each -additional zone the user must run kube-up again, specifying -`KUBE_USE_EXISTING_MASTER=true` and `KUBE_SUBNET_CIDR=172.20.x.0/24`. This will then -create additional nodes in a different zone, but will register them with the -existing master. - -### Zone spreading - -This will be implemented by modifying the existing scheduler priority function -`SelectorSpread`. Currently this priority function aims to put pods in an RC -on different hosts, but it will be extended first to spread across zones, and -then to spread across hosts. 
- -So that the scheduler does not need to call out to the cloud provider on every -scheduling decision, we must somehow record the zone information for each node. -The implementation of this will be described in the implementation section. - -Note that zone spreading is 'best effort'; zones are just one of the factors -in making scheduling decisions, and thus it is not guaranteed that pods will -spread evenly across zones. However, this is likely desirable: if a zone is -overloaded or failing, we still want to schedule the requested number of pods. - -### Volume affinity - -Most cloud providers (at least GCE and AWS) cannot attach their persistent -volumes across zones. Thus when a pod is being scheduled, if there is a volume -attached, that will dictate the zone. This will be implemented using a new -scheduler predicate (a hard constraint): `VolumeZonePredicate`. - -When `VolumeZonePredicate` observes a pod scheduling request that includes a -volume, if that volume is zone-specific, `VolumeZonePredicate` will exclude any -nodes not in that zone. - -Again, to avoid the scheduler calling out to the cloud provider, this will rely -on information attached to the volumes. This means that this will only support -PersistentVolumeClaims, because direct mounts do not have a place to attach -zone information. PersistentVolumes will then include zone information where -volumes are zone-specific. - -### Load-balanced services should operate reasonably - -For both AWS & GCE, Kubernetes creates a native cloud load-balancer for each -service of type LoadBalancer. The native cloud load-balancers on both AWS & -GCE are region-level, and support load-balancing across instances in multiple -zones (in the same region). For both clouds, the behaviour of the native cloud -load-balancer is reasonable in the face of failures (indeed, this is why clouds -provide load-balancing as a primitive). 
- -For multi-AZ clusters we will therefore simply rely on the native cloud provider -load balancer behaviour, and we do not anticipate substantial code changes. - -One notable shortcoming here is that load-balanced traffic still goes through -kube-proxy controlled routing, and kube-proxy does not (currently) favor -targeting a pod running on the same instance or even the same zone. This will -likely produce a lot of unnecessary cross-zone traffic (which is likely slower -and more expensive). This might be sufficiently low-hanging fruit that we -choose to address it in kube-proxy / multi-AZ clusters, but this can be addressed -after the initial implementation. - - -## Implementation - -The main implementation points are: - -1. how to attach zone information to Nodes and PersistentVolumes -1. how nodes get zone information -1. how volumes get zone information - -### Attaching zone information - -We must attach zone information to Nodes and PersistentVolumes, and possibly to -other resources in future. There are two obvious alternatives: we can use -labels/annotations, or we can extend the schema to include the information. - -For the initial implementation, we propose to use labels. The reasoning is: - -1. It is considerably easier to implement. -1. We will reserve the two labels `failure-domain.alpha.kubernetes.io/zone` and -`failure-domain.alpha.kubernetes.io/region` for the two pieces of information -we need. By putting this under the `kubernetes.io` namespace there is no risk -of collision, and by putting it under `alpha.kubernetes.io` we clearly mark -this as an experimental feature. -1. We do not yet know whether these labels will be sufficient for all -environments, nor which entities will require zone information. Labels give us -more flexibility here. -1. Because the labels are reserved, we can move to schema-defined fields in -future using our cross-version mapping techniques. 
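Under the label approach above, the zone and region information ends up as two ordinary entries in the resource's label map. A minimal sketch, using the reserved label keys proposed in this document; the helper name is invented, and the zone/region values are assumed to have already been fetched from the cloud provider:

```go
package main

import "fmt"

// failureDomainLabels builds the two reserved labels proposed above.
// The keys are the ones this document reserves; zone and region stand in
// for values obtained from the cloud provider.
func failureDomainLabels(zone, region string) map[string]string {
	return map[string]string{
		"failure-domain.alpha.kubernetes.io/zone":   zone,
		"failure-domain.alpha.kubernetes.io/region": region,
	}
}

func main() {
	labels := failureDomainLabels("us-central1-a", "us-central1")
	fmt.Println(labels["failure-domain.alpha.kubernetes.io/zone"]) // us-central1-a
}
```

The next two sections describe how these labels actually get attached to Nodes and PersistentVolumes without administrator intervention.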
- -### Node labeling - -We do not want to require an administrator to manually label nodes. We instead -modify the kubelet to include the appropriate labels when it registers itself. -The information is easily obtained by the kubelet from the cloud provider. - -### Volume labeling - -As with nodes, we do not want to require an administrator to manually label -volumes. We will create an admission controller `PersistentVolumeLabel`. -`PersistentVolumeLabel` will intercept requests to create PersistentVolumes, -and will label them appropriately by calling in to the cloud provider. - -## AWS Specific Considerations - -The AWS implementation here is fairly straightforward. The AWS API is -region-wide, meaning that a single call will find instances and volumes in all -zones. In addition, instance ids and volume ids are unique per-region (and -hence also per-zone). I believe they are actually globally unique, but I do -not know if this is guaranteed; in any case we only need global uniqueness if -we are to span regions, which will not be supported by multi-AZ clusters (to do -that correctly requires a full Cluster Federation type approach). - -## GCE Specific Considerations - -The GCE implementation is more complicated than the AWS implementation because -GCE APIs are zone-scoped. To perform an operation, we must perform one REST -call per zone and combine the results, unless we can determine in advance that -an operation references a particular zone. For many operations, we can make -that determination, but in some cases - such as listing all instances, we must -combine results from calls in all relevant zones. - -A further complexity is that GCE volume names are scoped per-zone, not -per-region. Thus it is permitted to have two volumes both named `myvolume` in -two different GCE zones. (Instance names are currently unique per-region, and -thus are not a problem for multi-AZ clusters). 
-
-The volume scoping leads to a (small) behavioural change for multi-AZ clusters on
-GCE. If you had two volumes both named `myvolume` in two different GCE zones,
-this would not be ambiguous when Kubernetes is operating only in a single zone.
-But, when operating a cluster across multiple zones, `myvolume` is no longer
-sufficient to specify a volume uniquely. Worse, the fact that a volume happens
-to be unambiguous at a particular time is no guarantee that it will continue to
-be unambiguous in future, because a volume with the same name could
-subsequently be created in a second zone. While perhaps unlikely in practice,
-we cannot automatically enable multi-AZ clusters for GCE users if this then causes
-volume mounts to stop working.
-
-This suggests that (at least on GCE), multi-AZ clusters must be optional (i.e.
-there must be a feature-flag). It may be that we can make this feature
-semi-automatic in future, by detecting whether nodes are running in multiple
-zones, but it seems likely that kube-up could instead simply set this flag.
-
-For the initial implementation, creating volumes with identical names will
-yield undefined results. Later, we may add some way to specify the zone for a
-volume (and possibly require that volumes have their zone specified when
-running in multi-AZ cluster mode). We could add a new `zone` field to the
-PersistentVolume type for GCE PD volumes, or we could use a DNS-style dotted
-name for the volume name (.)
-
-Initially therefore, the GCE changes will be to:
-
-1. change kube-up to support creation of a cluster in multiple zones
-1. pass a flag enabling multi-AZ clusters with kube-up
-1. change the kubernetes cloud provider to iterate through relevant zones when resolving items
-1. tag GCE PD volumes with the appropriate zone information
-
-
-
-[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/federation-lite.md?pixel)]()
-
diff --git a/contributors/design-proposals/federation/federation-phase-1.md b/contributors/design-proposals/federation/federation-phase-1.md
deleted file mode 100644
index 157b5668..00000000
--- a/contributors/design-proposals/federation/federation-phase-1.md
+++ /dev/null
@@ -1,407 +0,0 @@
-# Ubernetes Design Spec (phase one)
-
-**Huawei PaaS Team**
-
-## INTRODUCTION
-
-In this document we propose a design for the “Control Plane” of
-Kubernetes (K8S) federation (a.k.a. “Ubernetes”). For background of
-this work please refer to
-[this proposal](federation.md).
-The document is arranged as follows. First we briefly list scenarios
-and use cases that motivate K8S federation work. These use cases drive
-the design, and they also serve to validate it. We summarize the
-functionality requirements from these use cases, and define the “in
-scope” functionalities that will be covered by this design (phase
-one). After that we give an overview of the proposed architecture, API
-and building blocks, and walk through several activity flows to
-see how these building blocks work together to support the use cases.
-
-## REQUIREMENTS
-
-There are many reasons why customers may want to build a K8S
-federation:
-
-+ **High Availability:** Customers want to be immune to the outage of
-  a single availability zone, region or even a cloud provider.
-+ **Sensitive workloads:** Some workloads can only run on a particular
-  cluster. They cannot be scheduled to or migrated to other clusters.
-+ **Capacity overflow:** Customers prefer to run workloads on a
-  primary cluster. But if the capacity of the cluster is not
-  sufficient, workloads should be automatically distributed to other
-  clusters.
-+ **Vendor lock-in avoidance:** Customers want to spread their
-  workloads on different cloud providers, and can easily increase or
-  decrease the workload proportion of a specific provider.
-+ **Cluster Size Enhancement:** Currently a K8S cluster can only support
-a limited size. While the community is actively improving it, it can
-be expected that cluster size will be a problem if K8S is used for
-large workloads or public PaaS infrastructure. While we can separate
-different tenants into different clusters, it would be good to have a
-unified view.
-
-Here are the functionality requirements derived from the above use cases:
-
-+ Clients of the federation control plane API server can register and deregister
-clusters.
-+ Workloads should be spread to different clusters according to the
-  workload distribution policy.
-+ Pods are able to discover and connect to services hosted in other
-  clusters (in cases where inter-cluster networking is necessary,
-  desirable and implemented).
-+ Traffic to these pods should be spread across clusters (in a manner
-  similar to load balancing, although it might not be strictly
-  speaking balanced).
-+ The control plane needs to know when a cluster is down, and migrate
-  the workloads to other clusters.
-+ Clients have a unified view and a central control point for the above
-  activities.
-
-## SCOPE
-
-It's difficult to produce, in one pass, a perfect design that implements
-all of the above requirements. Therefore we will take an iterative
-approach to designing and building the system. This document describes
-phase one of the whole work. 
In phase one we will cover only the
-following objectives:
-
-+ Define the basic building blocks and API objects of the control plane
-+ Implement a basic end-to-end workflow
-  + Clients register federated clusters
-  + Clients submit a workload
-  + The workload is distributed to different clusters
-  + Service discovery
-  + Load balancing
-
-The following parts are NOT covered in phase one:
-
-+ Authentication and authorization (other than basic client
-  authentication against the ubernetes API, and from the ubernetes control
-  plane to the underlying kubernetes clusters).
-+ Deployment units other than replication controller and service
-+ Complex distribution policies for workloads
-+ Service affinity and migration
-
-## ARCHITECTURE
-
-The overall architecture of the control plane is shown below:
-
-![Ubernetes Architecture](ubernetes-design.png)
-
-Some design principles we are following in this architecture:
-
-1. Keep the underlying K8S clusters independent. They should have no
-   knowledge of the control plane or of each other.
-1. Keep the Ubernetes API interface compatible with the K8S API as much as
-   possible.
-1. Re-use concepts from K8S as much as possible. This reduces
-   customers' learning curve and is good for adoption.
-
-Below is a brief description of each module contained in the above diagram.
-
-## Ubernetes API Server
-
-The API Server in the Ubernetes control plane works just like the API
-Server in K8S. It talks to a distributed key-value store to persist,
-retrieve and watch API objects. This store is completely distinct
-from the kubernetes key-value stores (etcd) in the underlying
-kubernetes clusters. We still use `etcd` as the distributed
-storage so customers don't need to learn and manage a different
-storage system, although it is envisaged that other storage systems
-(Consul, ZooKeeper) will probably be developed and supported over
-time.
-
-## Ubernetes Scheduler
-
-The Ubernetes Scheduler schedules resources onto the underlying
-Kubernetes clusters. 
For example it watches for unscheduled Ubernetes
-replication controllers (those that have not yet been scheduled onto
-underlying Kubernetes clusters) and performs the global scheduling
-work. For each unscheduled replication controller, it calls the policy
-engine to decide how to split workloads among clusters. It creates a
-Kubernetes Replication Controller on one or more underlying clusters,
-and posts them back to `etcd` storage.
-
-One subtlety worth noting here is that the scheduling decision is arrived at by
-combining the application-specific request from the user (which might
-include, for example, placement constraints), and the global policy specified
-by the federation administrator (for example, "prefer on-premise
-clusters over AWS clusters" or "spread load equally across clusters").
-
-## Ubernetes Cluster Controller
-
-The cluster controller
-performs the following two kinds of work:
-
-1. It watches all the sub-resources that are created by Ubernetes
-   components, like a sub-RC or a sub-service. It then creates the
-   corresponding API objects on the underlying K8S clusters.
-1. It periodically retrieves the available resource metrics from the
-   underlying K8S cluster, and updates them as the object status of the
-   `cluster` API object. An alternative design might be to run a pod
-   in each underlying cluster that reports metrics for that cluster to
-   the Ubernetes control plane. Which approach is better remains an
-   open topic of discussion.
-
-## Ubernetes Service Controller
-
-The Ubernetes service controller is a federation-level implementation
-of the K8S service controller. It watches service resources created on
-the control plane, and creates corresponding K8S services on each
-involved K8S cluster. Besides interacting with service resources on
-each individual K8S cluster, the Ubernetes service controller also
-performs some global DNS registration work.
-
-## API OBJECTS
-
-## Cluster
-
-Cluster is a new first-class API object introduced in this design. For
-each registered K8S cluster there will be such an API resource in the
-control plane. The way clients register or deregister a cluster is to
-send corresponding REST requests to the following URL:
-`/api/{$version}/clusters`. Because the control plane behaves like a
-regular K8S client to the underlying clusters, the spec of a cluster
-object contains necessary properties like the K8S cluster address and
-credentials. The status of a cluster API object will contain the
-following information:
-
-1. The current phase of its lifecycle
-1. Cluster resource metrics for scheduling decisions
-1. Other metadata, like the version of the cluster
-
-$version.clusterSpec
-
-| Name | Description | Required | Schema | Default |
-| ---- | ----------- | -------- | ------ | ------- |
-| Address | address of the cluster | yes | address | |
-| Credential | the type (e.g. bearer token, client certificate etc.) and data of the credential used to access the cluster. It's used for system routines (not on behalf of users) | yes | string | |
-
-$version.clusterStatus
-
-| Name | Description | Required | Schema | Default |
-| ---- | ----------- | -------- | ------ | ------- |
-| Phase | the recently observed lifecycle phase of the cluster | yes | enum | |
-| Capacity | represents the available resources of a cluster | yes | any | |
-| ClusterMeta | other cluster metadata, like the version | yes | ClusterMeta | |
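-As an illustrative sketch (the field names follow the spec/status tables
-above, but the exact schema and all values here are hypothetical), a
-registered cluster object, as created by a `POST` to
-`/api/{$version}/clusters`, might look like:
-
-```yaml
-# Hypothetical rendering of a cluster API object.
-apiVersion: v1
-kind: Cluster
-metadata:
-  name: cluster-foo
-spec:
-  address: https://cluster-foo.example.com   # illustrative endpoint
-  credential: bearer-token-or-client-cert-data
-status:
-  phase: running
-  capacity:
-    cpu: "400"       # aggregated across the cluster's nodes
-    memory: 1600Gi
-  clusterMeta:
-    version: v1.1
-```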
-
-**For simplicity we didn't introduce a separate “cluster metrics” API
-object here**. The cluster resource metrics are stored in the cluster
-status section, just as we do for nodes in K8S. In phase one the status
-only contains available CPU and memory resources. The cluster
-controller will periodically poll the underlying cluster API Server to
-get the cluster capacity. In phase one it gets the metrics by simply
-aggregating the metrics from all nodes. In the future we will improve
-this with more efficient mechanisms, such as leveraging Heapster, and
-more metrics will be supported. Similar to node phases in K8S, the
-“phase” field takes one of the following values:
-
-+ pending: newly registered clusters or clusters suspended by the admin
-  for various reasons. They are not eligible for accepting workloads
-+ running: clusters in normal status that can accept workloads
-+ offline: clusters temporarily down or not reachable
-+ terminated: clusters removed from the federation
-
-Below is the state transition diagram.
-
-![Cluster State Transition Diagram](ubernetes-cluster-state.png)
-
-## Replication Controller
-
-A global workload submitted to the control plane is represented as a
-replication controller in the Cluster Federation control plane. When a
-replication controller is submitted to the control plane, clients need
-a way to express the workload's requirements or preferences regarding
-clusters. Depending on the use case this may be complex. For example:
-
-+ This workload can only be scheduled to cluster Foo. It cannot be
-  scheduled to any other clusters (use case: sensitive workloads).
-+ This workload prefers cluster Foo. But if there is no available
-  capacity on cluster Foo, it's OK for it to be scheduled to cluster Bar
-  (use case: workload )
-+ Seventy percent of this workload should be scheduled to cluster Foo,
-  and thirty percent should be scheduled to cluster Bar (use case:
-  vendor lock-in avoidance). 
In phase one, we only introduce a
-  _clusterSelector_ field to filter acceptable clusters. By default
-  there is no such selector, which means any cluster is acceptable.
-
-Below is a sample of the YAML to create such a replication controller.
-
-```yaml
-apiVersion: v1
-kind: ReplicationController
-metadata:
-  name: nginx-controller
-spec:
-  replicas: 5
-  selector:
-    app: nginx
-  template:
-    metadata:
-      labels:
-        app: nginx
-    spec:
-      containers:
-      - name: nginx
-        image: nginx
-        ports:
-        - containerPort: 80
-  clusterSelector:
-      name in (Foo, Bar)
-```
-
-Currently clusterSelector (implemented as a
-[LabelSelector](../../pkg/apis/extensions/v1beta1/types.go#L704))
-only supports a simple list of acceptable clusters. Workloads will be
-evenly distributed across these acceptable clusters in phase one. After
-phase one we will define syntax to represent more advanced
-constraints, like cluster preference ordering, the desired number of
-split workloads, the desired ratio of workloads spread across different
-clusters, etc.
-
-Besides this explicit “clusterSelector” filter, a workload may have
-some implicit scheduling restrictions. For example, it may define a
-“nodeSelector” that can only be satisfied on some particular
-clusters. How to handle this will be addressed after phase one.
-
-## Federated Services
-
-The Service API object exposed by the Cluster Federation is similar to service
-objects on Kubernetes. It defines access to a group of pods. The
-federation service controller will create corresponding Kubernetes
-service objects on the underlying clusters. These are detailed in a
-separate design document: [Federated Services](federated-services.md).
-
-## Pod
-
-In phase one we only support scheduling replication controllers. Pod
-scheduling will be supported in a later phase. This is primarily in
-order to keep the Cluster Federation API compatible with the Kubernetes API.
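-To make the even distribution described earlier concrete (a sketch only;
-the actual split is up to the scheduler and policy engine), the sample
-replication controller above, with 5 replicas and clusterSelector
-`name in (Foo, Bar)`, might be split into two sub-RCs that differ only in
-replica count:
-
-```yaml
-# Hypothetical sub-RC created on cluster Foo (3 of the 5 global replicas).
-# An analogous sub-RC with replicas: 2 would be created on cluster Bar.
-apiVersion: v1
-kind: ReplicationController
-metadata:
-  name: nginx-controller
-spec:
-  replicas: 3
-  selector:
-    app: nginx
-  template:
-    metadata:
-      labels:
-        app: nginx
-    spec:
-      containers:
-      - name: nginx
-        image: nginx
-        ports:
-        - containerPort: 80
-```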
-
-## ACTIVITY FLOWS
-
-## Scheduling
-
-The diagram below shows how workloads are scheduled on the Cluster Federation
-control plane:
-
-1. A replication controller is created by the client.
-1. The API Server persists it into the storage.
-1. The cluster controller periodically polls the latest available resource
-   metrics from the underlying clusters.
-1. The scheduler watches all pending RCs. It picks up an RC, makes
-   policy-driven decisions and splits it into different sub-RCs.
-1. Each cluster controller watches the sub-RCs bound to its
-   corresponding cluster. It picks up the newly created sub-RC.
-1. The cluster controller issues requests to the underlying cluster
-API Server to create the RC. In phase one we don't support complex
-distribution policies. The scheduling rule is basically:
-    1. If an RC does not specify any nodeSelector, it will be scheduled
-       to the least loaded K8S cluster(s) that has enough available
-       resources.
-    1. If an RC specifies _N_ acceptable clusters in the
-       clusterSelector, all replicas will be evenly distributed among
-       these clusters.
-
-There is a potential race condition here. Say at time _T1_ the control
-plane learns there are _m_ available resources in a K8S cluster. As
-the cluster works independently, it still accepts workload
-requests from other K8S clients or even another Cluster Federation control
-plane. The Cluster Federation scheduling decision is based on this
-snapshot of available resources. However when the actual RC creation
-happens in the cluster at time _T2_, the cluster may not have enough
-resources at that time. We will address this problem in later phases
-with proposed solutions such as resource reservation mechanisms.
-
-![Federated Scheduling](ubernetes-scheduling.png)
-
-## Service Discovery
-
-This part has been included in the section “Federated Service” of the
-document
-“[Federated Cross-cluster Load Balancing and Service Discovery Requirements and System Design](federated-services.md)”.
-Please refer to that document for details. - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/federation-phase-1.md?pixel)]() - diff --git a/contributors/design-proposals/federation/federation.md b/contributors/design-proposals/federation/federation.md deleted file mode 100644 index fc595123..00000000 --- a/contributors/design-proposals/federation/federation.md +++ /dev/null @@ -1,648 +0,0 @@ -# Kubernetes Cluster Federation - -## (previously nicknamed "Ubernetes") - -## Requirements Analysis and Product Proposal - -## _by Quinton Hoole ([quinton@google.com](mailto:quinton@google.com))_ - -_Initial revision: 2015-03-05_ -_Last updated: 2015-08-20_ -This doc: [tinyurl.com/ubernetesv2](http://tinyurl.com/ubernetesv2) -Original slides: [tinyurl.com/ubernetes-slides](http://tinyurl.com/ubernetes-slides) -Updated slides: [tinyurl.com/ubernetes-whereto](http://tinyurl.com/ubernetes-whereto) - -## Introduction - -Today, each Kubernetes cluster is a relatively self-contained unit, -which typically runs in a single "on-premise" data centre or single -availability zone of a cloud provider (Google's GCE, Amazon's AWS, -etc). - -Several current and potential Kubernetes users and customers have -expressed a keen interest in tying together ("federating") multiple -clusters in some sensible way in order to enable the following kinds -of use cases (intentionally vague): - -1. _"Preferentially run my workloads in my on-premise cluster(s), but - automatically overflow to my cloud-hosted cluster(s) if I run out - of on-premise capacity"_. -1. _"Most of my workloads should run in my preferred cloud-hosted - cluster(s), but some are privacy-sensitive, and should be - automatically diverted to run in my secure, on-premise - cluster(s)"_. -1. _"I want to avoid vendor lock-in, so I want my workloads to run - across multiple cloud providers all the time. I change my set of - such cloud providers, and my pricing contracts with them, - periodically"_. 
-1. _"I want to be immune to any single data centre or cloud - availability zone outage, so I want to spread my service across - multiple such zones (and ideally even across multiple cloud - providers)."_ - -The above use cases are by necessity left imprecisely defined. The -rest of this document explores these use cases and their implications -in further detail, and compares a few alternative high level -approaches to addressing them. The idea of cluster federation has -informally become known as _"Ubernetes"_. - -## Summary/TL;DR - -Four primary customer-driven use cases are explored in more detail. -The two highest priority ones relate to High Availability and -Application Portability (between cloud providers, and between -on-premise and cloud providers). - -Four primary federation primitives are identified (location affinity, -cross-cluster scheduling, service discovery and application -migration). Fortunately not all four of these primitives are required -for each primary use case, so incremental development is feasible. - -## What exactly is a Kubernetes Cluster? - -A central design concept in Kubernetes is that of a _cluster_. While -loosely speaking, a cluster can be thought of as running in a single -data center, or cloud provider availability zone, a more precise -definition is that each cluster provides: - -1. a single Kubernetes API entry point, -1. a consistent, cluster-wide resource naming scheme -1. a scheduling/container placement domain -1. a service network routing domain -1. an authentication and authorization model. - -The above in turn imply the need for a relatively performant, reliable -and cheap network within each cluster. - -There is also assumed to be some degree of failure correlation across -a cluster, i.e. whole clusters are expected to fail, at least -occasionally (due to cluster-wide power and network failures, natural -disasters etc). 
Clusters are often relatively homogeneous in that all
-compute nodes are typically provided by a single cloud provider or
-hardware vendor, and connected by a common, unified network fabric.
-But these are not hard requirements of Kubernetes.
-
-Other classes of Kubernetes deployments than the one sketched above
-are technically feasible, but come with some challenges of their own,
-and are not yet common or explicitly supported.
-
-More specifically, having a Kubernetes cluster span multiple
-well-connected availability zones within a single geographical region
-(e.g. US North East, UK, Japan etc) is worthy of further
-consideration, in particular because it potentially addresses
-some of these requirements.
-
-## What use cases require Cluster Federation?
-
-Let's name a few concrete use cases to aid the discussion:
-
-## 1. Capacity Overflow
-
-_"I want to preferentially run my workloads in my on-premise cluster(s), but automatically "overflow" to my cloud-hosted cluster(s) when I run out of on-premise capacity."_
-
-This idea is known in some circles as "[cloudbursting](http://searchcloudcomputing.techtarget.com/definition/cloud-bursting)".
-
-**Clarifying questions:** What is the unit of overflow? Individual
- pods? Probably not always. Replication controllers and their
- associated sets of pods? Groups of replication controllers
- (a.k.a. distributed applications)? How are persistent disks
- overflowed? Can the "overflowed" pods communicate with their
- brethren and sistren pods and services in the other cluster(s)?
- Presumably yes, at higher cost and latency, provided that they use
- external service discovery. Is "overflow" enabled only when creating
- new workloads/replication controllers, or are existing workloads
- dynamically migrated between clusters based on fluctuating available
- capacity? If so, what is the desired behaviour, and how is it
- achieved? How, if at all, does this relate to quota enforcement
- (e.g. 
if we run out of on-premise capacity, can all or only some
- quotas transfer to other, potentially more expensive off-premise
- capacity?)
-
-It seems that most of this boils down to:
-
-1. **location affinity** (pods relative to each other, and to other
-   stateful services like persistent storage - how is this expressed
-   and enforced?)
-1. **cross-cluster scheduling** (given location affinity constraints
-   and other scheduling policy, which resources are assigned to which
-   clusters, and by what?)
-1. **cross-cluster service discovery** (how do pods in one cluster
-   discover and communicate with pods in another cluster?)
-1. **cross-cluster migration** (how do compute and storage resources,
-   and the distributed applications to which they belong, move from
-   one cluster to another?)
-1. **cross-cluster load-balancing** (how is user traffic directed
-   to an appropriate cluster?)
-1. **cross-cluster monitoring and auditing** (a.k.a. Unified Visibility)
-
-## 2. Sensitive Workloads
-
-_"I want most of my workloads to run in my preferred cloud-hosted
-cluster(s), but some are privacy-sensitive, and should be
-automatically diverted to run in my secure, on-premise cluster(s). The
-list of privacy-sensitive workloads changes over time, and they're
-subject to external auditing."_
-
-**Clarifying questions:**
-
-1. What kinds of rules determine which workloads go where?
-    1. Is there in fact a requirement to have these rules be
-       declaratively expressed and automatically enforced, or is it
-       acceptable/better to have users manually select where to run
-       their workloads when starting them?
-    1. Is a static mapping from container (or more typically,
-       replication controller) to cluster maintained and enforced?
-    1. If so, is it only enforced on startup, or are things migrated
-       between clusters when the mappings change?
-
-This starts to look quite similar to "1. Capacity Overflow", and again
-seems to boil down to:
-
-1. location affinity
-1. 
cross-cluster scheduling -1. cross-cluster service discovery -1. cross-cluster migration -1. cross-cluster monitoring and auditing -1. cross-cluster load balancing - -## 3. Vendor lock-in avoidance - -_"My CTO wants us to avoid vendor lock-in, so she wants our workloads -to run across multiple cloud providers at all times. She changes our -set of preferred cloud providers and pricing contracts with them -periodically, and doesn't want to have to communicate and manually -enforce these policy changes across the organization every time this -happens. She wants it centrally and automatically enforced, monitored -and audited."_ - -**Clarifying questions:** - -1. How does this relate to other use cases (high availability, -capacity overflow etc), as they may all be across multiple vendors. -It's probably not strictly speaking a separate -use case, but it's brought up so often as a requirement, that it's -worth calling out explicitly. -1. Is a useful intermediate step to make it as simple as possible to - migrate an application from one vendor to another in a one-off fashion? - -Again, I think that this can probably be - reformulated as a Capacity Overflow problem - the fundamental - principles seem to be the same or substantially similar to those - above. - -## 4. "High Availability" - -_"I want to be immune to any single data centre or cloud availability -zone outage, so I want to spread my service across multiple such zones -(and ideally even across multiple cloud providers), and have my -service remain available even if one of the availability zones or -cloud providers "goes down"_. - -It seems useful to split this into multiple sets of sub use cases: - -1. Multiple availability zones within a single cloud provider (across - which feature sets like private networks, load balancing, - persistent disks, data snapshots etc are typically consistent and - explicitly designed to inter-operate). - 1. within the same geographical region (e.g. 
metro) within which network - is fast and cheap enough to be almost analogous to a single data - center. - 1. across multiple geographical regions, where high network cost and - poor network performance may be prohibitive. -1. Multiple cloud providers (typically with inconsistent feature sets, - more limited interoperability, and typically no cheap inter-cluster - networking described above). - -The single cloud provider case might be easier to implement (although -the multi-cloud provider implementation should just work for a single -cloud provider). Propose high-level design catering for both, with -initial implementation targeting single cloud provider only. - -**Clarifying questions:** -**How does global external service discovery work?** In the steady - state, which external clients connect to which clusters? GeoDNS or - similar? What is the tolerable failover latency if a cluster goes - down? Maybe something like (make up some numbers, notwithstanding - some buggy DNS resolvers, TTL's, caches etc) ~3 minutes for ~90% of - clients to re-issue DNS lookups and reconnect to a new cluster when - their home cluster fails is good enough for most Kubernetes users - (or at least way better than the status quo), given that these sorts - of failure only happen a small number of times a year? - -**How does dynamic load balancing across clusters work, if at all?** - One simple starting point might be "it doesn't". i.e. if a service - in a cluster is deemed to be "up", it receives as much traffic as is - generated "nearby" (even if it overloads). If the service is deemed - to "be down" in a given cluster, "all" nearby traffic is redirected - to some other cluster within some number of seconds (failover could - be automatic or manual). Failover is essentially binary. An - improvement would be to detect when a service in a cluster reaches - maximum serving capacity, and dynamically divert additional traffic - to other clusters. 
But how exactly does all of this work, and how - much of it is provided by Kubernetes, as opposed to something else - bolted on top (e.g. external monitoring and manipulation of GeoDNS)? - -**How does this tie in with auto-scaling of services?** More - specifically, if I run my service across _n_ clusters globally, and - one (or more) of them fail, how do I ensure that the remaining _n-1_ - clusters have enough capacity to serve the additional, failed-over - traffic? Either: - -1. I constantly over-provision all clusters by 1/n (potentially expensive), or -1. I "manually" (or automatically) update my replica count configurations in the - remaining clusters by 1/n when the failure occurs, and Kubernetes - takes care of the rest for me, or -1. Auto-scaling in the remaining clusters takes - care of it for me automagically as the additional failed-over - traffic arrives (with some latency). Note that this implies that - the cloud provider keeps the necessary resources on hand to - accommodate such auto-scaling (e.g. via something similar to AWS reserved - and spot instances) - -Up to this point, this use case ("Unavailability Zones") seems materially different from all the others above. It does not require dynamic cross-cluster service migration (we assume that the service is already running in more than one cluster when the failure occurs). Nor does it necessarily involve cross-cluster service discovery or location affinity. As a result, I propose that we address this use case somewhat independently of the others (although I strongly suspect that it will become substantially easier once we've solved the others). - -All of the above (regarding "Unavailability Zones") refers primarily -to already-running user-facing services, and minimizing the impact on -end users of those services becoming unavailable in a given cluster. -What about the people and systems that deploy Kubernetes services -(devops etc)? 
Should they be automatically shielded
-from the impact of the cluster outage? i.e. have their new resource creation requests
-automatically diverted to another cluster during the outage? While
-this specific requirement seems non-critical (manual fail-over seems
-relatively non-arduous, ignoring the user-facing issues above), it
-smells a lot like the first three use cases listed above ("Capacity
-Overflow, Sensitive Services, Vendor lock-in..."), so if we address
-those, we probably get this one free of charge.
-
-## Core Challenges of Cluster Federation
-
-As we saw above, a few common challenges fall out of most of the use
-cases considered above, namely:
-
-## Location Affinity
-
-Can the pods comprising a single distributed application be
-partitioned across more than one cluster? More generally, how far
-apart, in network terms, can a given client and server within a
-distributed application reasonably be? A server need not necessarily
-be a pod, but could instead be a persistent disk housing data, or some
-other stateful network service. What is tolerable is typically
-application-dependent, primarily influenced by network bandwidth
-consumption, latency requirements and cost sensitivity.
-
-For simplicity, let's assume that all Kubernetes distributed
-applications fall into one of three categories with respect to relative
-location affinity:
-
-1. **"Strictly Coupled"**: Those applications that strictly cannot be
-   partitioned between clusters. They simply fail if they are
-   partitioned. When scheduled, all pods _must_ be scheduled to the
-   same cluster. To move them, we need to shut the whole distributed
-   application down (all pods) in one cluster, possibly move some
-   data, and then bring up all of the pods in another cluster. To
-   avoid downtime, we might bring up the replacement cluster and
-   divert traffic there before turning down the original, but the
-   principle is much the same. 
In some cases moving the data might be
- prohibitively expensive or time-consuming, in which case these
- applications may be effectively _immovable_.
-1. **"Strictly Decoupled"**: Those applications that can be
- indefinitely partitioned across more than one cluster, to no
- disadvantage. An embarrassingly parallel YouTube porn detector,
- where each pod repeatedly dequeues a video URL from a remote work
- queue, downloads and chews on the video for a few hours, and
- arrives at a binary verdict, might be one such example. The pods
- derive no benefit from being close to each other, or anything else
- (other than the source of YouTube videos, which is assumed to be
- equally remote from all clusters in this example). Each pod can be
- scheduled independently, in any cluster, and moved at any time.
-1. **"Preferentially Coupled"**: Somewhere between Coupled and
- Decoupled. These applications prefer to have all of their pods
- located in the same cluster (e.g. for failure correlation, network
- latency or bandwidth cost reasons), but can tolerate being
- partitioned for "short" periods of time (for example while
- migrating the application from one cluster to another). Most small
- to medium sized LAMP stacks with not-very-strict latency goals
- probably fall into this category (provided that they use sane
- service discovery and reconnect-on-fail, which they need to do
- anyway to run effectively, even in a single Kubernetes cluster).
-
-From a fault isolation point of view, there are also opposites of the
-above. For example, a master database and its slave replica might
-need to be in different availability zones. We'll refer to this as
-anti-affinity, although it is largely outside the scope of this
-document.
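The three relative-affinity classes above amount to a simple partitioning predicate. A minimal Python sketch, assuming hypothetical names (`Affinity`, `can_partition` are illustrative only, not a proposed API):

```python
from enum import Enum

class Affinity(Enum):
    """The three relative location-affinity classes described above."""
    STRICTLY_COUPLED = "strictly-coupled"
    PREFERENTIALLY_COUPLED = "preferentially-coupled"
    STRICTLY_DECOUPLED = "strictly-decoupled"

def can_partition(affinity: Affinity, migrating: bool = False) -> bool:
    """May this application's pods span more than one cluster right now?"""
    if affinity is Affinity.STRICTLY_DECOUPLED:
        return True  # each pod schedules independently, in any cluster
    if affinity is Affinity.PREFERENTIALLY_COUPLED:
        # Tolerates partitioning only for "short" periods, e.g. mid-migration.
        return migrating
    return False  # strictly coupled: never partitioned

print(can_partition(Affinity.STRICTLY_DECOUPLED))                # True
print(can_partition(Affinity.PREFERENTIALLY_COUPLED))            # False
print(can_partition(Affinity.STRICTLY_COUPLED, migrating=True))  # False
```

A real scheduler would of course also weigh capacity, policy and anti-affinity constraints; this only captures the partitioning rule.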
- -Note that there is somewhat of a continuum with respect to network -cost and quality between any two nodes, ranging from two nodes on the -same L2 network segment (lowest latency and cost, highest bandwidth) -to two nodes on different continents (highest latency and cost, lowest -bandwidth). One interesting point on that continuum relates to -multiple availability zones within a well-connected metro or region -and single cloud provider. Despite being in different data centers, -or areas within a mega data center, network in this case is often very fast -and effectively free or very cheap. For the purposes of this network location -affinity discussion, this case is considered analogous to a single -availability zone. Furthermore, if a given application doesn't fit -cleanly into one of the above, shoe-horn it into the best fit, -defaulting to the "Strictly Coupled and Immovable" bucket if you're -not sure. - -And then there's what I'll call _absolute_ location affinity. Some -applications are required to run in bounded geographical or network -topology locations. The reasons for this are typically -political/legislative (data privacy laws etc), or driven by network -proximity to consumers (or data providers) of the application ("most -of our users are in Western Europe, U.S. West Coast" etc). - -**Proposal:** First tackle Strictly Decoupled applications (which can - be trivially scheduled, partitioned or moved, one pod at a time). - Then tackle Preferentially Coupled applications (which must be - scheduled in totality in a single cluster, and can be moved, but - ultimately in total, and necessarily within some bounded time). - Leave strictly coupled applications to be manually moved between - clusters as required for the foreseeable future. - -## Cross-cluster service discovery - -I propose having pods use standard discovery methods used by external -clients of Kubernetes applications (i.e. DNS). 
DNS might resolve to a -public endpoint in the local or a remote cluster. Other than Strictly -Coupled applications, software should be largely oblivious of which of -the two occurs. - -_Aside:_ How do we avoid "tromboning" through an external VIP when DNS -resolves to a public IP on the local cluster? Strictly speaking this -would be an optimization for some cases, and probably only matters to -high-bandwidth, low-latency communications. We could potentially -eliminate the trombone with some kube-proxy magic if necessary. More -detail to be added here, but feel free to shoot down the basic DNS -idea in the mean time. In addition, some applications rely on private -networking between clusters for security (e.g. AWS VPC or more -generally VPN). It should not be necessary to forsake this in -order to use Cluster Federation, for example by being forced to use public -connectivity between clusters. - -## Cross-cluster Scheduling - -This is closely related to location affinity above, and also discussed -there. The basic idea is that some controller, logically outside of -the basic Kubernetes control plane of the clusters in question, needs -to be able to: - -1. Receive "global" resource creation requests. -1. Make policy-based decisions as to which cluster(s) should be used - to fulfill each given resource request. In a simple case, the - request is just redirected to one cluster. In a more complex case, - the request is "demultiplexed" into multiple sub-requests, each to - a different cluster. Knowledge of the (albeit approximate) - available capacity in each cluster will be required by the - controller to sanely split the request. Similarly, knowledge of - the properties of the application (Location Affinity class -- - Strictly Coupled, Strictly Decoupled etc, privacy class etc) will - be required. It is also conceivable that knowledge of service - SLAs and monitoring thereof might provide an input into - scheduling/placement algorithms. -1. 
Multiplex the responses from the individual clusters into an
- aggregate response.
-
-There is of course a lot of detail still missing from this section,
-including discussion of:
-
-1. admission control
-1. initial placement of instances of a new
-service vs. scheduling new instances of an existing service in response
-to auto-scaling
-1. rescheduling pods due to failure (response might be
-different depending on whether it's failure of a node, rack, or whole AZ)
-1. data placement relative to compute capacity,
-etc.
-
-## Cross-cluster Migration
-
-Again this is closely related to location affinity discussed above,
-and is in some sense an extension of Cross-cluster Scheduling. When
-certain events occur, it becomes necessary or desirable for the
-cluster federation system to proactively move distributed applications
-(either in part or in whole) from one cluster to another. Examples of
-such events include:
-
-1. A low capacity event in a cluster (or a cluster failure).
-1. A change of scheduling policy ("we no longer use cloud provider X").
-1. A change of resource pricing ("cloud provider Y dropped their
- prices - let's migrate there").
-
-Strictly Decoupled applications can be trivially moved, in part or in
-whole, one pod at a time, to one or more clusters (within applicable
-policy constraints, for example "PrivateCloudOnly").
-
-For Preferentially Coupled applications, the federation system must
-first locate a single cluster with sufficient capacity to accommodate
-the entire application, then reserve that capacity, and incrementally
-move the application, one (or more) resources at a time, over to the
-new cluster, within some bounded time period (and possibly within a
-predefined "maintenance" window). Strictly Coupled applications (with
-the exception of those deemed completely immovable) require the
-federation system to:
-
-1. start up an entire replica application in the destination cluster
-1. 
copy persistent data to the new application instance (possibly - before starting pods) -1. switch user traffic across -1. tear down the original application instance - -It is proposed that support for automated migration of Strictly -Coupled applications be deferred to a later date. - -## Other Requirements - -These are often left implicit by customers, but are worth calling out explicitly: - -1. Software failure isolation between Kubernetes clusters should be - retained as far as is practically possible. The federation system - should not materially increase the failure correlation across - clusters. For this reason the federation control plane software - should ideally be completely independent of the Kubernetes cluster - control software, and look just like any other Kubernetes API - client, with no special treatment. If the federation control plane - software fails catastrophically, the underlying Kubernetes clusters - should remain independently usable. -1. Unified monitoring, alerting and auditing across federated Kubernetes clusters. -1. Unified authentication, authorization and quota management across - clusters (this is in direct conflict with failure isolation above, - so there are some tough trade-offs to be made here). - -## Proposed High-Level Architectures - -Two distinct potential architectural approaches have emerged from discussions -thus far: - -1. An explicitly decoupled and hierarchical architecture, where the - Federation Control Plane sits logically above a set of independent - Kubernetes clusters, each of which is (potentially) unaware of the - other clusters, and of the Federation Control Plane itself (other - than to the extent that it is an API client much like any other). - One possible example of this general architecture is illustrated - below, and will be referred to as the "Decoupled, Hierarchical" - approach. -1. 
A more monolithic architecture, where a single instance of the
- Kubernetes control plane itself manages a single logical cluster
- composed of nodes in multiple availability zones and cloud
- providers.
-
-A very brief, non-exhaustive list of pros and cons of the two
-approaches follows. (In the interest of full disclosure, the author
-prefers the Decoupled Hierarchical model for the reasons stated below).
-
-1. **Failure isolation:** The Decoupled Hierarchical approach provides
- better failure isolation than the Monolithic approach, as each
- underlying Kubernetes cluster, and the Federation Control Plane,
- can operate and fail completely independently of each other. In
- particular, their software and configurations can be updated
- independently. Such updates are, in our experience, the primary
- cause of control-plane failures, in general.
-1. **Failure probability:** The Decoupled Hierarchical model incorporates
- numerically more independent pieces of software and configuration
- than the Monolithic one. But the complexity of each of these
- decoupled pieces is arguably better contained in the Decoupled
- model (per standard arguments for modular rather than monolithic
- software design). Which of the two models presents higher
- aggregate complexity and consequent failure probability remains
- somewhat of an open question.
-1. **Scalability:** Conceptually the Decoupled Hierarchical model wins
- here, as each underlying Kubernetes cluster can be scaled
- completely independently w.r.t. scheduling, node state management,
- monitoring, network connectivity etc. It is even potentially
- feasible to stack federations of clusters (i.e. create
- federations of federations) should scalability of the independent
- Federation Control Plane become an issue (although the author does
- not envision this being a problem worth solving in the short
- term).
-1. **Code complexity:** I think that an argument can be made both ways
- here.
It depends on whether you prefer to weave the logic for
- handling nodes in multiple availability zones and cloud providers
- within a single logical cluster into the existing Kubernetes
- control plane code base (which was explicitly not designed for
- this), or separate it into a decoupled Federation system (with
- possible code sharing between the two via shared libraries). The
- author prefers the latter because it:
- 1. Promotes better code modularity and interface design.
- 1. Allows the code
- bases of Kubernetes and the Federation system to progress
- largely independently (different sets of developers, different
- release schedules etc).
-1. **Administration complexity:** Again, I think that this could be argued
- both ways. Superficially it would seem that administration of a
- single Monolithic multi-zone cluster might be simpler by virtue of
- being only "one thing to manage", however in practice each of the
- underlying availability zones (and possibly cloud providers) has
- its own capacity, pricing, hardware platforms, and possibly
- bureaucratic boundaries (e.g. "our EMEA IT department manages those
- European clusters"). So explicitly allowing for (but not
- mandating) completely independent administration of each
- underlying Kubernetes cluster, and the Federation system itself,
- in the Decoupled Hierarchical model seems to have real practical
- benefits that outweigh the superficial simplicity of the
- Monolithic model.
-1. **Application development and deployment complexity:** It's not clear
- to me that there is any significant difference between the two
- models in this regard. Presumably the API exposed by the two
- different architectures would look very similar, as would the
- behavior of the deployed applications. It has even been suggested
- to write the code in such a way that it could be run in either
- configuration. It's not clear that this makes sense in practice
- though.
-1. 
**Control plane cost overhead:** There is a minimum per-cluster - overhead -- two possibly virtual machines, or more for redundant HA - deployments. For deployments of very small Kubernetes - clusters with the Decoupled Hierarchical approach, this cost can - become significant. - -### The Decoupled, Hierarchical Approach - Illustrated - -![image](federation-high-level-arch.png) - -## Cluster Federation API - -It is proposed that this look a lot like the existing Kubernetes API -but be explicitly multi-cluster. - -+ Clusters become first class objects, which can be registered, - listed, described, deregistered etc via the API. -+ Compute resources can be explicitly requested in specific clusters, - or automatically scheduled to the "best" cluster by the Cluster - Federation control system (by a - pluggable Policy Engine). -+ There is a federated equivalent of a replication controller type (or - perhaps a [deployment](deployment.md)), - which is multicluster-aware, and delegates to cluster-specific - replication controllers/deployments as required (e.g. a federated RC for n - replicas might simply spawn multiple replication controllers in - different clusters to do the hard work). - -## Policy Engine and Migration/Replication Controllers - -The Policy Engine decides which parts of each application go into each -cluster at any point in time, and stores this desired state in the -Desired Federation State store (an etcd or -similar). Migration/Replication Controllers reconcile this against the -desired states stored in the underlying Kubernetes clusters (by -watching both, and creating or updating the underlying Replication -Controllers and related Services accordingly). - -## Authentication and Authorization - -This should ideally be delegated to some external auth system, shared -by the underlying clusters, to avoid duplication and inconsistency. -Either that, or we end up with multilevel auth. 
Local readonly
-eventually consistent auth slaves in each cluster and in the Cluster
-Federation control system
-could potentially cache auth, to mitigate an SPOF auth system.
-
-## Data consistency, failure and availability characteristics
-
-The services comprising the Cluster Federation control plane have to run
- somewhere. Several options exist here:
-* For high availability Cluster Federation deployments, these
- services may run in either:
- * a dedicated Kubernetes cluster, not co-located in the same
- availability zone with any of the federated clusters (for fault
- isolation reasons). If that cluster/availability zone, and hence the Federation
- system, fails catastrophically, the underlying pods and
- applications continue to run correctly, albeit temporarily
- without the Federation system.
- * across multiple Kubernetes availability zones, probably with
- some sort of cross-AZ quorum-based store. This provides
- theoretically higher availability, at the cost of some
- complexity related to data consistency across multiple
- availability zones.
- * For simpler, less highly available deployments, just co-locate the
- Federation control plane in/on/with one of the underlying
- Kubernetes clusters. The downside of this approach is that if
- that specific cluster fails, all automated failover and scaling
- logic which relies on the federation system will also be
- unavailable at the same time (i.e. precisely when it is needed).
- But if one of the other federated clusters fails, everything
- should work just fine.
-
-There is some further thinking to be done around the data consistency
- model upon which the Federation system is based, and its impact
- on the detailed semantics, failure and availability
- characteristics of the system.
-
-## Proposed Next Steps
-
-Identify concrete applications of each use case and configure a proof
-of concept service that exercises the use case.
For example, cluster -failure tolerance seems popular, so set up an apache frontend with -replicas in each of three availability zones with either an Amazon Elastic -Load Balancer or Google Cloud Load Balancer pointing at them? What -does the zookeeper config look like for N=3 across 3 AZs -- and how -does each replica find the other replicas and how do clients find -their primary zookeeper replica? And now how do I do a shared, highly -available redis database? Use a few common specific use cases like -this to flesh out the detailed API and semantics of Cluster Federation. - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/federation.md?pixel)]() - diff --git a/contributors/design-proposals/federation/ubernetes-cluster-state.png b/contributors/design-proposals/federation/ubernetes-cluster-state.png deleted file mode 100644 index 56ec2df8..00000000 Binary files a/contributors/design-proposals/federation/ubernetes-cluster-state.png and /dev/null differ diff --git a/contributors/design-proposals/federation/ubernetes-design.png b/contributors/design-proposals/federation/ubernetes-design.png deleted file mode 100644 index 44924846..00000000 Binary files a/contributors/design-proposals/federation/ubernetes-design.png and /dev/null differ diff --git a/contributors/design-proposals/federation/ubernetes-scheduling.png b/contributors/design-proposals/federation/ubernetes-scheduling.png deleted file mode 100644 index 01774882..00000000 Binary files a/contributors/design-proposals/federation/ubernetes-scheduling.png and /dev/null differ diff --git a/contributors/design-proposals/multicluster/control-plane-resilience.md b/contributors/design-proposals/multicluster/control-plane-resilience.md new file mode 100644 index 00000000..1e0a3baf --- /dev/null +++ b/contributors/design-proposals/multicluster/control-plane-resilience.md @@ -0,0 +1,241 @@ +# Kubernetes and Cluster Federation Control Plane Resilience + +## Long Term Design and 
Current Status
+
+### by Quinton Hoole, Mike Danese and Justin Santa-Barbara
+
+### December 14, 2015
+
+## Summary
+
+Some amount of confusion exists around how we currently, and in future,
+want to ensure resilience of the Kubernetes (and by implication
+Kubernetes Cluster Federation) control plane. This document is an attempt to capture that
+definitively. It covers areas including self-healing, high
+availability, bootstrapping and recovery. Most of the information in
+this document already exists in the form of GitHub comments,
+PRs/proposals, scattered documents, and corridor conversations, so this
+document is primarily a consolidation and clarification of existing
+ideas.
+
+## Terms
+
+* **Self-healing:** automatically restarting or replacing failed
+ processes and machines without human intervention
+* **High availability:** continuing to be available and work correctly
+ even if some components are down or uncontactable. This typically
+ involves multiple replicas of critical services, and a reliable way
+ to find available replicas. Note that it's possible (but not
+ desirable) to have high
+ availability properties (e.g. multiple replicas) in the absence of
+ self-healing properties (e.g. if a replica fails, nothing replaces
+ it). Fairly obviously, given enough time, such systems typically
+ become unavailable (after enough replicas have failed).
+* **Bootstrapping**: creating an empty cluster from nothing
+* **Recovery**: recreating a non-empty cluster after perhaps
+ catastrophic failure/unavailability/data corruption
+
+## Overall Goals
+
+1. **Resilience to single failures:** Kubernetes clusters constrained
+ to single availability zones should be resilient to individual
+ machine and process failures by being both self-healing and highly
+ available (within the context of such individual failures).
+1. 
**Ubiquitous resilience by default:** The default cluster creation + scripts for (at least) GCE, AWS and basic bare metal should adhere + to the above (self-healing and high availability) by default (with + options available to disable these features to reduce control plane + resource requirements if so required). It is hoped that other + cloud providers will also follow the above guidelines, but the + above 3 are the primary canonical use cases. +1. **Resilience to some correlated failures:** Kubernetes clusters + which span multiple availability zones in a region should by + default be resilient to complete failure of one entire availability + zone (by similarly providing self-healing and high availability in + the default cluster creation scripts as above). +1. **Default implementation shared across cloud providers:** The + differences between the default implementations of the above for + GCE, AWS and basic bare metal should be minimized. This implies + using shared libraries across these providers in the default + scripts in preference to highly customized implementations per + cloud provider. This is not to say that highly differentiated, + customized per-cloud cluster creation processes (e.g. for GKE on + GCE, or some hosted Kubernetes provider on AWS) are discouraged. + But those fall squarely outside the basic cross-platform OSS + Kubernetes distro. +1. **Self-hosting:** Where possible, Kubernetes's existing mechanisms + for achieving system resilience (replication controllers, health + checking, service load balancing etc) should be used in preference + to building a separate set of mechanisms to achieve the same thing. + This implies that self hosting (the kubernetes control plane on + kubernetes) is strongly preferred, with the caveat below. +1. **Recovery from catastrophic failure:** The ability to quickly and + reliably recover a cluster from catastrophic failure is critical, + and should not be compromised by the above goal to self-host + (i.e. 
it goes without saying that the cluster should be quickly and
+ reliably recoverable, even if the cluster control plane is
+ broken). This implies that such catastrophic failure scenarios
+ should be carefully thought out, and the subject of regular
+ continuous integration testing, and disaster recovery exercises.
+
+## Relative Priorities
+
+1. **(Possibly manual) recovery from catastrophic failures:** having a
+Kubernetes cluster, and all applications running inside it, disappear forever
+is perhaps the worst possible failure mode. So it is critical that we be able to
+recover the applications running inside a cluster from such failures in some
+well-bounded time period.
+ 1. In theory a cluster can be recovered by replaying all API calls
+ that have ever been executed against it, in order, but most
+ often that state has been lost, and/or is scattered across
+ multiple client applications or groups. So in general it is
+ probably infeasible.
+ 1. In theory a cluster can also be recovered to some relatively
+ recent non-corrupt backup/snapshot of the disk(s) backing the
+ etcd cluster state. But we have no default consistent
+ backup/snapshot, verification or restoration process. And we
+ don't routinely test restoration, so even if we did routinely
+ perform and verify backups, we have no hard evidence that we
+ can in practice effectively recover from catastrophic cluster
+ failure or data corruption by restoring from these backups. So
+ there's more work to be done here.
+1. **Self-healing:** Most major cloud providers provide the ability to
+ easily and automatically replace failed virtual machines within a
+ small number of minutes (e.g. GCE
+ [Auto-restart](https://cloud.google.com/compute/docs/instances/setting-instance-scheduling-options#autorestart)
+ and Managed Instance Groups,
+ AWS [Auto-recovery](https://aws.amazon.com/blogs/aws/new-auto-recovery-for-amazon-ec2/)
+ and [Auto scaling](https://aws.amazon.com/autoscaling/) etc).
This + can fairly trivially be used to reduce control-plane down-time due + to machine failure to a small number of minutes per failure + (i.e. typically around "3 nines" availability), provided that: + 1. cluster persistent state (i.e. etcd disks) is either: + 1. truly persistent (i.e. remote persistent disks), or + 1. reconstructible (e.g. using etcd [dynamic member + addition](https://github.com/coreos/etcd/blob/master/Documentation/v2/runtime-configuration.md#add-a-new-member) + or [backup and + recovery](https://github.com/coreos/etcd/blob/master/Documentation/v2/admin_guide.md#disaster-recovery)). + 1. and boot disks are either: + 1. truly persistent (i.e. remote persistent disks), or + 1. reconstructible (e.g. using boot-from-snapshot, + boot-from-pre-configured-image or + boot-from-auto-initializing image). +1. **High Availability:** This has the potential to increase + availability above the approximately "3 nines" level provided by + automated self-healing, but it's somewhat more complex, and + requires additional resources (e.g. redundant API servers and etcd + quorum members). In environments where cloud-assisted automatic + self-healing might be infeasible (e.g. on-premise bare-metal + deployments), it also gives cluster administrators more time to + respond (e.g. replace/repair failed machines) without incurring + system downtime. + +## Design and Status (as of December 2015) + + + + + + + + + + + + + + + + + + + + + + +
Control Plane ComponentResilience PlanCurrent Status
API Server

+
+Multiple stateless, self-hosted, self-healing API servers behind an HA
+load balancer, built out by the default "kube-up" automation on GCE,
+AWS and basic bare metal (BBM). Note that the single-host approach of
+having etcd listen only on localhost to ensure that only the API server can
+connect to it will no longer work, so alternative security will be
+needed in this regard (either using firewall rules, SSL certs, or
+something else). All necessary flags are currently supported to enable
+SSL between API server and etcd (OpenShift runs like this out of the
+box), but this needs to be woven into the "kube-up" and related
+scripts. Detailed design of self-hosting and related bootstrapping
+and catastrophic failure recovery will be detailed in a separate
+design doc.
+
+
+
+No scripted self-healing or HA on GCE, AWS or basic bare metal
+currently exists in the OSS distro. To be clear, "no self healing"
+means that even if multiple e.g. API servers are provisioned for HA
+purposes, if they fail, nothing replaces them, so eventually the
+system will fail. Self-healing and HA can be set up
+manually by following documented instructions, but this is not
+currently an automated process, and it is not tested as part of
+continuous integration. So it's probably safest to assume that it
+doesn't actually work in practice.
+
Controller manager and scheduler

+
+Multiple self-hosted, self-healing warm standby stateless controller
+managers and schedulers with leader election and automatic failover of API
+server clients, automatically installed by the default "kube-up" automation.
+
+As above.
etcd + +Multiple (3-5) etcd quorum members behind a load balancer with session +affinity (to prevent clients from being bounced from one to another). + +Regarding self-healing, if a node running etcd goes down, it is always necessary +to do three things: +
    +
1. allocate a new node (not necessary if running etcd as a pod, in
+which case specific measures are required to prevent user pods from
+interfering with system pods, for example using node selectors)
+1. start a new etcd replica on that node
+1. recover the etcd state on the new replica, either from a persistent
+disk or via etcd dynamic member addition.
+
+In the case of remote persistent disk, the etcd state can be recovered by
+attaching the remote persistent disk to the replacement node, thus the state is
+recoverable even if all other replicas are down.
+
+There are also significant performance differences between local disks and remote
+persistent disks. For example, the sustained throughput of local disks in GCE is
+approximately 20x that of remote disks.
+
+Hence we suggest that self-healing be provided by remotely mounted persistent
+disks in non-performance critical, single-zone cloud deployments. For
+performance critical installations, faster local SSDs should be used, in which
+case remounting on node failure is not an option, so etcd runtime configuration
+(dynamic member addition) should be used to replace the failed machine.
+Similarly, for cross-zone self-healing, cloud persistent disks are zonal, so
+automatic runtime configuration is required. Basic bare metal deployments
+cannot generally rely on remote persistent disks, so the same approach applies
+there.
+
+Somewhat vague instructions exist on how to set some of this up manually in
+a self-hosted configuration. But automatic bootstrapping and self-healing are not
+described (and are not implemented for the non-PD cases). This all still needs to
+be automated and continuously tested.
+
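The disk-placement trade-offs in the etcd cell above reduce to a small decision rule. A hypothetical Python sketch (the function name and return labels are illustrative only, not part of any kube-up tooling):

```python
def etcd_recovery_strategy(performance_critical: bool,
                           cross_zone: bool,
                           bare_metal: bool) -> str:
    """Choose how a failed etcd member's state is restored, per the text above."""
    if performance_critical or cross_zone or bare_metal:
        # Local SSDs cannot be remounted elsewhere, cloud persistent disks
        # are zonal, and bare metal has no remote disks, so rebuild the
        # member via etcd runtime reconfiguration (dynamic member addition).
        return "dynamic-member-addition"
    # Non-performance-critical, single-zone cloud deployment: reattach the
    # remote persistent disk to the replacement node.
    return "remount-remote-persistent-disk"

print(etcd_recovery_strategy(False, False, False))  # remount-remote-persistent-disk
print(etcd_recovery_strategy(True, False, False))   # dynamic-member-addition
```

Either branch still requires allocating a replacement node first; only the state-recovery step differs.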
+ + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/control-plane-resilience.md?pixel)]() + diff --git a/contributors/design-proposals/multicluster/federated-api-servers.md b/contributors/design-proposals/multicluster/federated-api-servers.md new file mode 100644 index 00000000..ff214c23 --- /dev/null +++ b/contributors/design-proposals/multicluster/federated-api-servers.md @@ -0,0 +1,8 @@ +# Federated API Servers + +Moved to [aggregated-api-servers.md](../api-machinery/aggregated-api-servers.md) since cluster +federation stole the word "federation" from this effort and it was very confusing. + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/federated-api-servers.md?pixel)]() + diff --git a/contributors/design-proposals/multicluster/federated-ingress.md b/contributors/design-proposals/multicluster/federated-ingress.md new file mode 100644 index 00000000..07e75b0c --- /dev/null +++ b/contributors/design-proposals/multicluster/federated-ingress.md @@ -0,0 +1,194 @@ +# Kubernetes Federated Ingress + + Requirements and High Level Design + + Quinton Hoole + + July 17, 2016 + +## Overview/Summary + +[Kubernetes Ingress](https://kubernetes.io/docs/concepts/services-networking/ingress/) +provides an abstraction for sophisticated L7 load balancing through a +single IP address (and DNS name) across multiple pods in a single +Kubernetes cluster. Multiple alternative underlying implementations +are provided, including one based on GCE L7 load balancing and another +using an in-cluster nginx/HAProxy deployment (for non-GCE +environments). An AWS implementation, based on Elastic Load Balancers +and Route53 is under way by the community. + +To extend the above to cover multiple clusters, Kubernetes Federated +Ingress aims to provide a similar/identical API abstraction and, +again, multiple implementations to cover various +cloud-provider-specific as well as multi-cloud scenarios. 
The general +model is to allow the user to instantiate a single Ingress object via +the Federation API, and have it automatically provision all of the +necessary underlying resources (L7 cloud load balancers, in-cluster +proxies etc) to provide L7 load balancing across a service spanning +multiple clusters. + +Four options are outlined: + +1. GCP only +1. AWS only +1. Cross-cloud via GCP in-cluster proxies (i.e. clients get to AWS and on-prem via GCP). +1. Cross-cloud via AWS in-cluster proxies (i.e. clients get to GCP and on-prem via AWS). + +Option 1 is the: + +1. easiest/quickest, +1. most featureful + +Recommendations: + ++ Suggest tackling option 1 (GCP only) first (target beta in v1.4) ++ Thereafter option 3 (cross-cloud via GCP) ++ We should encourage/facilitate the community to tackle option 2 (AWS-only) + +## Options + +## Google Cloud Platform only - backed by GCE L7 Load Balancers + +This is an option for federations across clusters which all run on Google Cloud Platform (i.e. GCE and/or GKE) + +### Features + +In summary, all of [GCE L7 Load Balancer](https://cloud.google.com/compute/docs/load-balancing/http/) features: + +1. Single global virtual (a.k.a. "anycast") IP address ("VIP" - no dependence on dynamic DNS) +1. Geo-locality for both external and GCP-internal clients +1. Load-based overflow to next-closest geo-locality (i.e. cluster). Based on either queries per second, or CPU load (unfortunately on the first-hop target VM, not the final destination K8s Service). +1. URL-based request direction (different backend services can fulfill each different URL). +1. HTTPS request termination (at the GCE load balancer, with server SSL certs) + +### Implementation + +1. Federation user creates (federated) Ingress object (the services + backing the ingress object must share the same nodePort, as they + share a single GCP health check). +1. 
Federated Ingress Controller creates Ingress object in each cluster + in the federation (after [configuring each cluster ingress + controller to share the same ingress UID](https://gist.github.com/bprashanth/52648b2a0b6a5b637f843e7efb2abc97)). +1. Each cluster-level Ingress Controller ("GLBC") creates Google L7 + Load Balancer machinery (forwarding rules, target proxy, URL map, + backend service, health check) which ensures that traffic to the + Ingress (backed by a Service) is directed to the nodes in the cluster. +1. KubeProxy redirects to one of the backend Pods (currently round-robin, per KubeProxy instance) + +An alternative implementation approach involves lifting the current +Federated Ingress Controller functionality up into the Federation +control plane. This alternative is not considered in any further +detail in this document. + +### Outstanding Work Items + +1. This should in theory all work out of the box. Need to confirm +with a manual setup. ([#29341](https://github.com/kubernetes/kubernetes/issues/29341)) +1. Implement Federated Ingress: + 1. API machinery (~1 day) + 1. Controller (~3 weeks) +1. Add DNS field to Ingress object (currently missing, but needs to be added, independent of federation) + 1. API machinery (~1 day) + 1. KubeDNS support (~ 1 week?) + +### Pros + +1. Global VIP is awesome - geo-locality, load-based overflow (but see caveats below) +1. Leverages existing K8s Ingress machinery - not too much to add. +1. Leverages existing Federated Service machinery - controller looks + almost identical, DNS provider also re-used. + +### Cons + +1. Only works across GCP clusters (but see below for a light at the end of the tunnel, for future versions). + +## Amazon Web Services only - backed by Route53 + +This is an option for AWS-only federations. Parts of this are +apparently work in progress, see e.g.
+[AWS Ingress controller](https://github.com/kubernetes/contrib/issues/346) and +[[WIP/RFC] Simple ingress -> DNS controller, using AWS +Route53](https://github.com/kubernetes/contrib/pull/841). + +### Features + +In summary, most of the features of [AWS Elastic Load Balancing](https://aws.amazon.com/elasticloadbalancing/) and [Route53 DNS](https://aws.amazon.com/route53/). + +1. Geo-aware DNS direction to closest regional elastic load balancer +1. DNS health checks to route traffic to only healthy elastic load +balancers +1. A variety of possible DNS routing types, including Latency Based Routing, Geo DNS, and Weighted Round Robin +1. Elastic Load Balancing automatically routes traffic across multiple + instances and multiple Availability Zones within the same region. +1. Health checks ensure that only healthy Amazon EC2 instances receive traffic. + +### Implementation + +1. Federation user creates (federated) Ingress object +1. Federated Ingress Controller creates Ingress object in each cluster in the federation +1. Each cluster-level AWS Ingress Controller creates/updates + 1. (regional) AWS Elastic Load Balancer machinery which ensures that traffic to the Ingress (backed by a Service) is directed to one of the nodes in one of the clusters in the region. + 1. (global) AWS Route53 DNS machinery which ensures that clients are directed to the closest non-overloaded (regional) elastic load balancer. +1. KubeProxy redirects to one of the backend Pods (currently round-robin, per KubeProxy instance) in the destination K8s cluster. + +### Outstanding Work Items + +Most of this is currently unimplemented (see [AWS Ingress controller](https://github.com/kubernetes/contrib/issues/346) and +[[WIP/RFC] Simple ingress -> DNS controller, using AWS +Route53](https://github.com/kubernetes/contrib/pull/841)). + +1. K8s AWS Ingress Controller +1. Re-uses all of the non-GCE specific Federation machinery discussed above under "GCP-only...". + +### Pros + +1.
Geo-locality (via geo-DNS, not VIP) +1. Load-based overflow +1. Real load balancing (same caveats as for GCP above). +1. L7 SSL connection termination. +1. Seems it can be made to work for hybrid with on-premise (using VPC). More research required. + +### Cons + +1. K8s Ingress Controller still needs to be developed. Lots of work. +1. geo-DNS based locality/failover is not as nice as VIP-based (but very useful, nonetheless) +1. Only works on AWS (initial version, at least). + +## Cross-cloud via GCP + +### Summary + +Use GCP Federated Ingress machinery described above, augmented with additional HA-proxy backends in all GCP clusters to proxy to non-GCP clusters (via either Service External IP's, or VPN directly to KubeProxy or Pods). + +### Features + +As per GCP-only above, except that geo-locality would be to the closest GCP cluster (and possibly onwards to the closest AWS/on-prem cluster). + +### Implementation + +TBD - see Summary above in the mean time. + +### Outstanding Work + +Assuming that GCP-only (see above) is complete: + +1. Wire-up the HA-proxy load balancers to redirect to non-GCP clusters +1. Probably some more - additional detailed research and design necessary. + +### Pros + +1. Works for cross-cloud. + +### Cons + +1. Traffic to non-GCP clusters proxies through GCP clusters. Additional bandwidth costs (3x?) in those cases. + +## Cross-cloud via AWS + +In theory the same approach as "Cross-cloud via GCP" above could be used, except that AWS infrastructure would be used to get traffic first to an AWS cluster, and then proxied onwards to non-AWS and/or on-prem clusters. +Detail docs TBD. 
+ + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/federated-ingress.md?pixel)]() + diff --git a/contributors/design-proposals/multicluster/federated-placement-policy.md b/contributors/design-proposals/multicluster/federated-placement-policy.md new file mode 100644 index 00000000..d613422d --- /dev/null +++ b/contributors/design-proposals/multicluster/federated-placement-policy.md @@ -0,0 +1,371 @@ +# Policy-based Federated Resource Placement + +This document proposes a design for policy-based control over placement of +Federated resources. + +Tickets: + +- https://github.com/kubernetes/kubernetes/issues/39982 + +Authors: + +- Torin Sandall (torin@styra.com, tsandall@github) and Tim Hinrichs + (tim@styra.com). +- Based on discussions with Quinton Hoole (quinton.hoole@huawei.com, + quinton-hoole@github), Nikhil Jindal (nikhiljindal@github). + +## Background + +Resource placement is a policy-rich problem affecting many deployments. +Placement may be based on company conventions, external regulation, pricing and +performance requirements, etc. Furthermore, placement policies evolve over time +and vary across organizations. As a result, it is difficult to anticipate the +policy requirements of all users. + +A simple example of a placement policy is + +> Certain apps must be deployed on clusters in EU zones with sufficient PCI +> compliance. + +The [Kubernetes Cluster +Federation](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/multicluster/federation.md#policy-engine-and-migrationreplication-controllers) +design proposal includes a pluggable policy engine component that decides how +applications/resources are placed across federated clusters. + +Currently, the placement decision can be controlled for Federated ReplicaSets +using the `federation.kubernetes.io/replica-set-preferences` annotation. 
In the +future, the [Cluster +Selector](https://github.com/kubernetes/kubernetes/issues/29887) annotation will +provide control over placement of other resources. The proposed design supports +policy-based control over both of these annotations (as well as others). + +This proposal is based on a POC built using the Open Policy Agent project. [This +short video (7m)](https://www.youtube.com/watch?v=hRz13baBhfg) provides an +overview and demo of the POC. + +## Design + +The proposed design uses the [Open Policy Agent](http://www.openpolicyagent.org) +project (OPA) to realize the policy engine component from the Federation design +proposal. OPA is an open-source, general-purpose policy engine that includes a +declarative policy language and APIs to answer policy queries. + +The proposed design allows administrators to author placement policies and have +them automatically enforced when resources are created or updated. The design +also covers support for automatic remediation of resource placement when policy +(or the relevant state of the world) changes. + +In the proposed design, the policy engine (OPA) is deployed on top of Kubernetes +in the same cluster as the Federation Control Plane: + +![Architecture](https://docs.google.com/drawings/d/1kL6cgyZyJ4eYNsqvic8r0kqPJxP9LzWVOykkXnTKafU/pub?w=807&h=407) + +The proposed design is divided into the following sections: + +1. Control over the initial placement decision (admission controller) +1. Remediation of resource placement (opa-kube-sync/remediator) +1. Replication of Kubernetes resources (opa-kube-sync/replicator) +1. Management and storage of policies (ConfigMap) + +### 1. Initial Placement Decision + +To provide policy-based control over the initial placement decision, we propose +a new admission controller that integrates with OPA: + +When admitting requests, the admission controller executes an HTTP API call +against OPA. The API call passes the JSON representation of the resource in the +message body.
+ +The response from OPA contains the desired value for the resource’s annotations +(defined in policy by the administrator). The admission controller updates the +annotations on the resource and admits the request: + +![InitialPlacement](https://docs.google.com/drawings/d/1c9PBDwjJmdv_qVvPq0sQ8RVeZad91vAN1XT6K9Gz9k8/pub?w=812&h=288) + +The admission controller updates the resource by **merging** the annotations in +the response with existing annotations on the resource. If there are overlapping +annotation keys the admission controller replaces the existing value with the +value from the response. + +#### Example Policy Engine Query: + +```http +POST /v1/data/io/k8s/federation/admission HTTP/1.1 +Content-Type: application/json +``` + +```json +{ + "input": { + "apiVersion": "extensions/v1beta1", + "kind": "ReplicaSet", + "metadata": { + "annotations": { + "policy.federation.alpha.kubernetes.io/eu-jurisdiction-required": "true", + "policy.federation.alpha.kubernetes.io/pci-compliance-level": "2" + }, + "creationTimestamp": "2017-01-23T16:25:14Z", + "generation": 1, + "labels": { + "app": "nginx-eu" + }, + "name": "nginx-eu", + "namespace": "default", + "resourceVersion": "364993", + "selfLink": "/apis/extensions/v1beta1/namespaces/default/replicasets/nginx-eu", + "uid": "84fab96d-e188-11e6-ac83-0a580a54020e" + }, + "spec": { + "replicas": 4, + "selector": {...}, + "template": {...}, + } + } +} +``` + +#### Example Policy Engine Response: + +```http +HTTP/1.1 200 OK +Content-Type: application/json +``` + +```json +{ + "result": { + "annotations": { + "federation.kubernetes.io/replica-set-preferences": { + "clusters": { + "gce-europe-west1": { + "weight": 1 + }, + "gce-europe-west2": { + "weight": 1 + } + }, + "rebalance": true + } + } + } +} +``` + +> This example shows the policy engine returning the replica-set-preferences. +> The policy engine could similarly return a desired value for other annotations +> such as the Cluster Selector annotation. 
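The merge semantics described above are simple enough to pin down in code. Below is a minimal Go sketch of the annotation-merge behavior; the function name `mergeAnnotations` is illustrative, not actual federation-apiserver code:

```go
package main

import "fmt"

// mergeAnnotations merges the annotations returned by the policy engine
// into the resource's existing annotations. For overlapping keys, the
// value from the policy engine response replaces the existing value.
func mergeAnnotations(existing, fromPolicy map[string]string) map[string]string {
	merged := make(map[string]string, len(existing)+len(fromPolicy))
	for k, v := range existing {
		merged[k] = v
	}
	for k, v := range fromPolicy {
		merged[k] = v // policy engine response wins on conflict
	}
	return merged
}

func main() {
	existing := map[string]string{
		"policy.federation.alpha.kubernetes.io/pci-compliance-level": "2",
	}
	fromPolicy := map[string]string{
		"federation.kubernetes.io/replica-set-preferences": `{"clusters":{"gce-europe-west1":{"weight":1}}}`,
	}
	fmt.Println(mergeAnnotations(existing, fromPolicy))
}
```

Note that the merge is shallow: the policy engine's value for an annotation key replaces the developer's value wholesale rather than being combined with it.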
+ +#### Conflicts + +A conflict arises if the developer and the policy define different values for an +annotation. In this case, the developer's intent is provided as a policy query +input and the policy author's intent is encoded in the policy itself. Since the +policy is the only place where both the developer and policy author intents are +known, the policy (or policy engine) should be responsible for resolving the +conflict. + +There are a few options for handling conflicts. As a concrete example, this is +how a policy author could handle invalid clusters/conflicts: + +``` +package io.k8s.federation.admission + +errors["requested replica-set-preferences includes invalid clusters"] { + invalid_clusters = developer_clusters - policy_defined_clusters + invalid_clusters != set() +} + +annotations["replica-set-preferences"] = value { + value = developer_clusters & policy_defined_clusters +} + +# Not shown here: +# +# policy_defined_clusters[...] { ... } +# developer_clusters[...] { ... } +``` + +The admission controller will execute a query against +/io/k8s/federation/admission and if the policy detects an invalid cluster, the +"errors" key in the response will contain a non-empty array. In this case, the +admission controller will deny the request. + +```http +HTTP/1.1 200 OK +Content-Type: application/json +``` + +```json +{ + "result": { + "errors": [ + "requested replica-set-preferences includes invalid clusters" + ], + "annotations": { + "federation.kubernetes.io/replica-set-preferences": { + ... + } + } + } +} +``` + +This example shows how the policy could handle conflicts when the author's +intent is to define clusters that MAY be used. If the author's intent is to +define what clusters MUST be used, then the logic would not use intersection. 
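To make the intersection-based conflict handling concrete, here is an illustrative Go translation of the Rego logic above; the name `resolvePlacement` and its types are assumptions for the sketch, not real controller code:

```go
package main

import "fmt"

// resolvePlacement mirrors the MAY-style policy: clusters requested by
// the developer but not defined by policy produce an error (the request
// is denied); otherwise the placement is the intersection of the
// developer-requested and policy-defined cluster sets.
func resolvePlacement(developer, policyDefined map[string]bool) ([]string, error) {
	var allowed, invalid []string
	for c := range developer {
		if policyDefined[c] {
			allowed = append(allowed, c)
		} else {
			invalid = append(invalid, c)
		}
	}
	if len(invalid) > 0 {
		return nil, fmt.Errorf("requested replica-set-preferences include invalid clusters: %v", invalid)
	}
	return allowed, nil
}

func main() {
	policy := map[string]bool{"gce-europe-west1": true, "gce-europe-west2": true}
	allowed, err := resolvePlacement(map[string]bool{"gce-europe-west1": true}, policy)
	fmt.Println(allowed, err)
}
```

A MUST-style policy would instead ignore the developer's set and return the policy-defined clusters directly.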
+ +#### Configuration + +The admission controller requires configuration for the OPA endpoint: + +``` +{ + "EnforceSchedulingPolicy": { + "url": "https://opa.federation.svc.cluster.local:8181/v1/data/io/k8s/federation/annotations", + "token": "super-secret-token-value" + } +} +``` + +- `url` specifies the URL of the policy engine API to query. The query response + contains the annotations to apply to the resource. +- `token` specifies a static token to use for authentication when contacting the + policy engine. In the future, other authentication schemes may be supported. + +The configuration file is provided to the federation-apiserver with the +`--admission-control-config-file` command line argument. + +The admission controller is enabled in the federation-apiserver by providing the +`--admission-control` command line argument. E.g., +`--admission-control=AlwaysAdmit,EnforceSchedulingPolicy`. + +The admission controller will be enabled by default. + +#### Error Handling + +The admission controller is designed to **fail closed** if policies have been +created. + +Request handling may fail because of: + +- Serialization errors +- Request timeouts or other network errors +- Authentication or authorization errors from the policy engine +- Other unexpected errors from the policy engine + +In the event of request timeouts (or other network errors) or back-pressure +hints from the policy engine, the admission controller should retry after +applying a backoff. The admission controller should also create an event so that +developers can identify why their resources are not being scheduled. + +Policies are stored as ConfigMap resources in a well-known namespace. This +allows the admission controller to check if one or more policies exist. If one +or more policies exist, the admission controller will fail closed. Otherwise +the admission controller will **fail open**. + +### 2.
Remediation of Resource Placement + +When policy changes or the environment in which resources are deployed changes +(e.g. a cluster’s PCI compliance rating gets up/down-graded), resources might +need to be moved for them to obey the placement policy. Sometimes administrators +may decide to remediate manually; other times they may want Kubernetes to +remediate automatically. + +To automatically reschedule resources onto desired clusters, we introduce a +remediator component (**opa-kube-sync**) that is deployed as a sidecar with OPA. + +![Remediation](https://docs.google.com/drawings/d/1ehuzwUXSpkOXzOUGyBW0_7jS8pKB4yRk_0YRb1X4zsY/pub?w=812&h=288) + +The notifications sent to the remediator by OPA specify the new value for +annotations such as replica-set-preferences. + +When the remediator component (in the sidecar) receives the notification it +sends a PATCH request to the federation-apiserver to update the affected +resource. This way, the actual rebalancing of ReplicaSets is still handled by +the [Rescheduling +Algorithm](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/multicluster/federated-replicasets.md) +in the Federated ReplicaSet controller. + +The remediator component must be deployed with a kubeconfig for the +federation-apiserver so that it can identify itself when sending the PATCH +requests. We can use the same mechanism that is used for the +federation-controller-manager (which also needs to identify itself when sending +requests to the federation-apiserver). + +### 3. Replication of Kubernetes Resources + +Administrators must be able to author policies that refer to properties of +Kubernetes resources. For example, assuming the following sample policy (in +English): + +> Certain apps must be deployed on Clusters in EU zones with sufficient PCI +> compliance. + +The policy definition must refer to the geographic region and PCI compliance +rating of federated clusters.
Today, the geographic region is stored as an +attribute on the cluster resource and the PCI compliance rating is an example of +data that may be included in a label or annotation. + +When the policy engine is queried for a placement decision (e.g., by the +admission controller), it must have access to the data representing the +federated clusters. + +To provide OPA with the data representing federated clusters as well as other +Kubernetes resource types (such as federated ReplicaSets), we use a sidecar +container that is deployed alongside OPA. The sidecar (“opa-kube-sync”) is +responsible for replicating Kubernetes resources into OPA: + +![Replication](https://docs.google.com/drawings/d/1XjdgszYMDHD3hP_2ynEh_R51p7gZRoa1DBTi4yq1rc0/pub?w=812&h=288) + +The sidecar/replicator component will implement the (somewhat common) list/watch +pattern against the federation-apiserver: + +- Initially, it will GET all resources of a particular type. +- Subsequently, it will GET with the **watch** and **resourceVersion** + parameters set and process add, remove, update events accordingly. + +Each resource received by the sidecar/replicator component will be pushed into +OPA. The sidecar will likely rely on one of the existing Kubernetes Go client +libraries to handle the low-level list/watch behavior. + +As new resource types are introduced in the federation-apiserver, the +sidecar/replicator component will need to be updated to support them. As a +result, the sidecar/replicator component must be designed so that it is easy to +add support for new resource types. + +Eventually, the sidecar/replicator component may allow admins to configure which +resource types are replicated. As an optimization, the sidecar may eventually +analyze policies to determine which resource properties are required for policy +evaluation. This would allow it to replicate the minimum amount of data into +OPA. + +### 4.
Policy Management + +Policies are written in a text-based, declarative language supported by OPA. The +policies can be loaded into the policy engine either on startup or via HTTP +APIs. + +To avoid introducing additional persistent state, we propose storing policies +in ConfigMap resources in the Federation Control Plane inside a well-known +namespace (e.g., `kube-federationscheduling-policy`). The ConfigMap resources +will be replicated into the policy engine by the sidecar. + +The sidecar can establish a watch on the ConfigMap resources in the Federation +Control Plane. This will enable hot-reloading of policies whenever they change. + +## Applicability to Other Policy Engines + +This proposal was designed based on a POC with OPA, but it can be applied to +other policy engines as well. The admission and remediation components are +comprised of two main pieces of functionality: (i) applying annotation values to +federated resources and (ii) asking the policy engine for annotation values. The +code for applying annotation values is completely independent of the policy +engine. The code that asks the policy engine for annotation values happens both +within the admission and remediation components. In the POC, asking OPA for +annotation values amounts to a simple RESTful API call that any other policy +engine could implement. + +## Future Work + +- This proposal uses ConfigMaps to store and manage policies. In the future, we + want to introduce a first-class **Policy** API resource. 
diff --git a/contributors/design-proposals/multicluster/federated-replicasets.md b/contributors/design-proposals/multicluster/federated-replicasets.md new file mode 100644 index 00000000..8b48731c --- /dev/null +++ b/contributors/design-proposals/multicluster/federated-replicasets.md @@ -0,0 +1,513 @@ +# Federated ReplicaSets + +# Requirements & Design Document + +This document is a markdown version converted from a working [Google Doc](https://docs.google.com/a/google.com/document/d/1C1HEHQ1fwWtEhyl9JYu6wOiIUJffSmFmZgkGta4720I/edit?usp=sharing). Please refer to the original for extended commentary and discussion. + +Author: Marcin Wielgus [mwielgus@google.com](mailto:mwielgus@google.com) +Based on discussions with +Quinton Hoole [quinton@google.com](mailto:quinton@google.com), Wojtek Tyczyński [wojtekt@google.com](mailto:wojtekt@google.com) + +## Overview + +### Summary & Vision + +When running a global application on a federation of Kubernetes +clusters, the owner currently has to start it in multiple clusters and +check that there are both enough application replicas running +locally in each of the clusters (so that, for example, users are +handled by a nearby cluster, with low latency) and globally (so that +there is always enough capacity to handle all traffic). If one of the +clusters has issues or doesn't have enough capacity to run the given set of +replicas, the replicas should be automatically moved to some other +cluster to keep the application responsive. + +In single-cluster Kubernetes there is a concept of a ReplicaSet that +manages the replicas locally. We want to expand this concept to the +federation level. + +### Goals + ++ Win large enterprise customers who want to easily run applications + across multiple clusters ++ Create a reference controller implementation to facilitate bringing + other Kubernetes concepts to Federated Kubernetes. + +## Glossary + +Federation Cluster - a cluster that is a member of the federation.
+ +Local ReplicaSet (LRS) - ReplicaSet defined and running on a cluster +that is a member of the federation. + +Federated ReplicaSet (FRS) - ReplicaSet defined and running inside of the Federated K8S server. + +Federated ReplicaSet Controller (FRSC) - A controller running inside +of the Federated K8S server that controls FRS. + +## User Experience + +### Critical User Journeys + ++ [CUJ1] User wants to create a ReplicaSet in each of the federation + clusters. They create a definition of a federated ReplicaSet on the + federated master and (local) ReplicaSets are automatically created + in each of the federation clusters. The number of replicas in each + of the Local ReplicaSets is (perhaps indirectly) configurable by + the user. ++ [CUJ2] When the current number of replicas in a cluster drops below + the desired number and new replicas cannot be scheduled then they + should be started in some other cluster. + +### Features Enabling Critical User Journeys + +Feature #1 -> CUJ1: +A component which looks for newly created Federated ReplicaSets and +creates the appropriate Local ReplicaSet definitions in the federated +clusters. + +Feature #2 -> CUJ2: +A component that checks how many replicas are actually running in each +of the subclusters and whether the number matches the +FederatedReplicaSet preferences (by default spread replicas evenly +across the clusters but custom preferences are allowed - see +below). If it doesn't, and the situation is unlikely to improve soon, +then the replicas should be moved to other subclusters. + +### API and CLI + +All interaction with FederatedReplicaSet will be done by issuing +kubectl commands pointing at the Federated Master API Server. All the +commands would behave in a similar way as on the regular master, +however in the next versions (1.5+) some of the commands may give +slightly different output. For example kubectl describe on a federated +replica set should also give some information about the subclusters.
+ +Moreover, for safety, some defaults will be different. For example, for +kubectl delete federatedreplicaset, cascade will be set to false. + +FederatedReplicaSet would have the same object as local ReplicaSet +(although it will be accessible in a different part of the +API). Scheduling preferences (how many replicas in which cluster) will +be passed as annotations. + +### FederatedReplicaSet preferences + +The preferences are expressed by the following structure, passed as +serialized JSON inside annotations. + +```go +type FederatedReplicaSetPreferences struct { + // If set to true then already scheduled and running replicas may be moved to other clusters + // in order to bring cluster replicasets towards a desired state. Otherwise, if set to false, + // up and running replicas will not be moved. + Rebalance bool `json:"rebalance,omitempty"` + + // Map from cluster name to preferences for that cluster. It is assumed that if a cluster + // doesn't have a matching entry then it should not have a local replica. The cluster matches + // to "*" if there is no entry with the real cluster name. + Clusters map[string]ClusterReplicaSetPreferences +} + +// Preferences regarding number of replicas assigned to a cluster replicaset within a federated replicaset. +type ClusterReplicaSetPreferences struct { + // Minimum number of replicas that should be assigned to this Local ReplicaSet. 0 by default. + MinReplicas int64 `json:"minReplicas,omitempty"` + + // Maximum number of replicas that should be assigned to this Local ReplicaSet. Unbounded if no value provided (default). + MaxReplicas *int64 `json:"maxReplicas,omitempty"` + + // A number expressing the preference to put an additional replica to this LocalReplicaSet. 0 by default. + Weight int64 +} +``` + +How this works in practice: + +**Scenario 1**. I want to spread my 50 replicas evenly across all available clusters.
Config: + +```go +FederatedReplicaSetPreferences { + Rebalance : true + Clusters : map[string]LocalReplicaSet { + "*" : LocalReplicaSet{ Weight: 1} + } +} +``` + +Example: + ++ Clusters A,B,C, all have capacity. + Replica layout: A=16 B=17 C=17. ++ Clusters A,B,C and C has capacity for 6 replicas. + Replica layout: A=22 B=22 C=6 ++ Clusters A,B,C. B and C are offline: + Replica layout: A=50 + +**Scenario 2**. I want to have only 2 replicas in each of the clusters. + +```go +FederatedReplicaSetPreferences { + Rebalance : true + Clusters : map[string]LocalReplicaSet { + "*" : LocalReplicaSet{ MaxReplicas: 2; Weight: 1} + } +} +``` + +Or + +```go +FederatedReplicaSetPreferences { + Rebalance : true + Clusters : map[string]LocalReplicaSet { + "*" : LocalReplicaSet{ MinReplicas: 2; Weight: 0 } + } +} +``` + +Or + +```go +FederatedReplicaSetPreferences { + Rebalance : true + Clusters : map[string]LocalReplicaSet { + "*" : LocalReplicaSet{ MinReplicas: 2; MaxReplicas: 2} + } +} +``` + +There is a global target of 50; however, if there are 3 clusters, there will be only 6 replicas running. + +**Scenario 3**. I want to have 20 replicas in each of 3 clusters. + +```go +FederatedReplicaSetPreferences { + Rebalance : true + Clusters : map[string]LocalReplicaSet { + "*" : LocalReplicaSet{ MinReplicas: 20; Weight: 0} + } +} +``` + +There is a global target of 50; however, the clusters require 60, so some clusters will have fewer replicas. + Replica layout: A=20 B=20 C=10. + +**Scenario 4**. I want to have an equal number of replicas in clusters A,B,C, but don't put more than 20 replicas in cluster C. + +```go +FederatedReplicaSetPreferences { + Rebalance : true + Clusters : map[string]LocalReplicaSet { + "*" : LocalReplicaSet{ Weight: 1} + “C” : LocalReplicaSet{ MaxReplicas: 20, Weight: 1} + } +} +``` + +Example: + ++ All have capacity. + Replica layout: A=16 B=17 C=17.
++ B is offline/has no capacity + Replica layout: A=30 B=0 C=20 ++ A and B are offline: + Replica layout: C=20 + +**Scenario 5**. I want to run my application in cluster A; however, if there are troubles, FRS can also use clusters B and C, equally. + +```go +FederatedReplicaSetPreferences { + Clusters : map[string]LocalReplicaSet { + “A” : LocalReplicaSet{ Weight: 1000000} + “B” : LocalReplicaSet{ Weight: 1} + “C” : LocalReplicaSet{ Weight: 1} + } +} +``` + +Example: + ++ All have capacity. + Replica layout: A=50 B=0 C=0. ++ A has capacity for only 40 replicas + Replica layout: A=40 B=5 C=5 + +**Scenario 6**. I want to run my application in clusters A, B and C. Cluster A gets twice the QPS of the other clusters. + +```go +FederatedReplicaSetPreferences { + Clusters : map[string]LocalReplicaSet { + “A” : LocalReplicaSet{ Weight: 2} + “B” : LocalReplicaSet{ Weight: 1} + “C” : LocalReplicaSet{ Weight: 1} + } +} +``` + +**Scenario 7**. I want to spread my 50 replicas evenly across all available clusters, but if there +are already some replicas, please do not move them. Config: + +```go +FederatedReplicaSetPreferences { + Rebalance : false + Clusters : map[string]LocalReplicaSet { + "*" : LocalReplicaSet{ Weight: 1} + } +} +``` + +Example: + ++ Clusters A,B,C, all have capacity, but A already has 20 replicas + Replica layout: A=20 B=15 C=15. ++ Clusters A,B,C and C has capacity for 6 replicas, A has already 20 replicas. + Replica layout: A=22 B=22 C=6 ++ Clusters A,B,C and C has capacity for 6 replicas, A has already 30 replicas. + Replica layout: A=30 B=14 C=6 + +## The Idea + +A new federated controller - Federated Replica Set Controller (FRSC) - +will be created inside the federated controller manager. Below are +enumerated the key idea elements: + ++ [I0] It is considered OK to have a slightly higher number of replicas + globally for some time. + ++ [I1] FRSC starts an informer on the FederatedReplicaSet that listens + on FRS being created, updated or deleted.
On each create/update the + scheduling code will be started to calculate where to put the + replicas. The default behavior is to start the same number of + replicas in each of the clusters. While creating LocalReplicaSets + (LRS) the following errors/issues can occur: + + + [E1] Master rejects LRS creation (for known or unknown + reason). In this case another attempt to create the LRS should be + made in 1m or so. This action can be tied with + [[I5]](#heading=h.ififs95k9rng). Until the LRS is created + the situation is the same as [E5]. If this happens multiple + times, all due replicas should be moved elsewhere and later moved + back once the LRS is created. + + + [E2] LRS with the same name but different configuration already + exists. The LRS is then overwritten and an appropriate event + created to explain what happened. Pods under the control of the + old LRS are left intact and the new LRS may adopt them if they + match the selector. + + + [E3] LRS is new but the pods that match the selector exist. The + pods are adopted by the RS (if not owned by some other + RS). However, they may have a different image, configuration + etc. Just like with a regular LRS. + ++ [I2] For each of the clusters FRSC starts a store and an informer on + LRS that will listen for status updates. These status changes are + only interesting in case of troubles. Otherwise it is assumed that + LRS runs trouble-free and there is always the right number of pods + created, but possibly not scheduled. + + + + [E4] LRS is manually deleted from the local cluster. In this case + a new LRS should be created. It is the same case as + [[E1]](#heading=h.wn3dfsyc4yuh). Any pods that were left behind + won't be killed and will be adopted after the LRS is recreated. + + + [E5] LRS fails to create (not necessarily schedule) the desired + number of pods due to master troubles, admission control + etc.
This should be considered the same situation as replicas + unable to schedule (see [[I4]](#heading=h.dqalbelvn1pv)). + + + [E6] It is impossible to tell that an informer lost connection + with a remote cluster or has other synchronization problems, so it + should be handled by cluster liveness probe and deletion + [[I6]](#heading=h.z90979gc2216). + ++ [I3] For each of the clusters, start a store and an informer to monitor + whether the created pods are eventually scheduled and what is the + current number of correctly running ready pods. Errors: + + + [E7] It is impossible to tell that an informer lost connection + with a remote cluster or has other synchronization problems, so it + should be handled by cluster liveness probe and deletion + [[I6]](#heading=h.z90979gc2216) + ++ [I4] It is assumed that an unscheduled pod is a normal situation +and can last up to X min if there is heavy traffic on the +cluster. However, if the replicas are not scheduled in that time, then +FRSC should consider moving most of the unscheduled replicas +elsewhere. For that purpose FRSC will maintain a data structure +where for each FRS-controlled LRS we store a list of pods belonging +to that LRS along with their current status and status change timestamp. + ++ [I5] If a new cluster is added to the federation then it doesn't + have a LRS and the situation is equal to + [[E1]](#heading=h.wn3dfsyc4yuh)/[[E4]](#heading=h.vlyovyh7eef). + ++ [I6] If a cluster is removed from the federation then the situation + is equal to multiple [E4]. It is assumed that if a connection with + a cluster is lost completely then the cluster is removed from + the cluster list (or marked accordingly) so + [[E6]](#heading=h.in6ove1c1s8f) and [[E7]](#heading=h.37bnbvwjxeda) + don't need to be handled. + ++ [I7] All ToBeChecked FRS are browsed every 1 min (configurable), + checked against the current list of clusters, and all missing LRS + are created. This will be executed in combination with [I8].
+ ++ [I8] All pods from ToBeChecked FRS/LRS are browsed every 1 min + (configurable) to check whether some replica move between clusters + is needed or not. + ++ FRSC never moves replicas to an LRS that has pods that are not +scheduled/running or that failed to be created. + + + When FRSC notices that a number of pods are not scheduled/running + or not even created in one LRS for more than Y minutes, it takes + most of them from that LRS, leaving a couple still waiting so that once + they are scheduled FRSC will know that it is OK to put some more + replicas in that cluster. + ++ [I9] An FRS becomes ToBeChecked if: + + It is newly created + + Some replica set inside changed its status + + Some pods inside the cluster changed their status + + Some cluster is added or deleted. +> An FRS stops being ToBeChecked if it is in the desired configuration (or is stable enough). + +## (RE)Scheduling algorithm + +To calculate the (re)scheduling moves for a given FRS: + +1. For each cluster, FRSC calculates the number of replicas that are placed +(not necessarily up and running) in the cluster and the number of replicas that +failed to be scheduled. Cluster capacity is the difference between +the placed replicas and those that failed to be scheduled. + +2. Order all clusters by their weight and a hash of their name so that every time +we process the same replica-set we process the clusters in the same order. +Include the federated replica set name in the cluster name hash so that we get +slightly different ordering for different RS, so that not all RS of size 1 +end up on the same cluster. + +3. Assign the minimum preferred number of replicas to each of the clusters, if +there are enough replicas and capacity. + +4. If rebalance = false, assign the previously present replicas to the clusters, +and remember the number of extra replicas added (ER), again provided there +are enough replicas and capacity. + +5. Distribute the remaining replicas with regard to weights and cluster capacity.
+In multiple iterations, calculate how many of the replicas should end up in each cluster. +For each cluster, cap the number of assigned replicas by the maximum number of replicas and +the cluster capacity. If some extra replicas were added to the cluster in step +4, don't actually add the replicas but balance them against the ER from step 4. + +## Goroutines layout + ++ [GR1] Involved in the FRS informer (see + [[I1]]). Whenever an FRS is created or + updated, it puts the new/updated FRS on FRS_TO_CHECK_QUEUE with + delay 0. + ++ [GR2_1...GR2_N] Involved in informers/store on LRS (see + [[I2]]). On all changes the FRS is put on + FRS_TO_CHECK_QUEUE with delay 1min. + ++ [GR3_1...GR3_N] Involved in informers/store on Pods + (see [[I3]] and [[I4]]). They maintain the status store + so that for each of the LRS we know the number of pods that are + actually running and ready in O(1) time. They also put the + corresponding FRS on FRS_TO_CHECK_QUEUE with delay 1min. + ++ [GR4] Involved in the cluster informer (see + [[I5]] and [[I6]]). It puts all FRS on FRS_TO_CHECK_QUEUE + with delay 0. + ++ [GR5_*] Goroutines handling FRS_TO_CHECK_QUEUE that put FRS on + FRS_CHANNEL after the given delay (and remove them from + FRS_TO_CHECK_QUEUE). Every time an already present FRS is added to + FRS_TO_CHECK_QUEUE the delays are compared and updated so that the + shorter delay is used. + ++ [GR6] Contains a selector that listens on FRS_CHANNEL. Whenever + an FRS is received it is put on a work queue. The work queue has no delay + and makes sure that a single replica set is processed by + only one goroutine. + ++ [GR7_*] Goroutines related to the work queue. They fire DoFrsCheck on the FRS. + Multiple replica sets can be processed in parallel. Two goroutines cannot + process the same FRS at the same time. + + +## Func DoFrsCheck + +The function performs [[I7]] and [[I8]].
It is assumed that it is run on a +single thread/goroutine so we don't check and evaluate the same FRS on many +goroutines (however, if needed the function can be parallelized for +different FRS). It takes data only from the stores maintained by GR2_* and +GR3_*. External communication is only required to: + ++ Create LRS. If an LRS doesn't exist, it is created after the + rescheduling, when we know how many replicas it should have. + ++ Update LRS replica targets. + +If the FRS is not in the desired state then it is put on +FRS_TO_CHECK_QUEUE with delay 1min (possibly increasing). + +## Monitoring and status reporting + +FRSC should expose a number of metrics from its run, such as: + ++ FRSC -> LRS communication latency ++ Total time spent in various elements of DoFrsCheck + +FRSC should also expose the status of an FRS as an annotation on the FRS and +as events. + +## Workflow + +Here is the sequence of tasks that need to be done in order for a +typical FRS to be split into a number of LRS's and to be created in +the underlying federated clusters. + +Note a: the workflow is helpful at this phase because for every one or +two steps we can create corresponding PRs to start +the development. + +Note b: we assume that the federation is already in place and the +federated clusters are added to the federation. + +Step 1. The client sends an RS create request to the +federation-apiserver. + +Step 2. The federation-apiserver persists an FRS into the federation etcd. + +Note c: the federation-apiserver populates the clusterid field in the FRS +before persisting it into the federation etcd. + +Step 3: the federation-level “informer” in FRSC watches federation +etcd for new/modified FRS's, with empty clusterid or clusterid equal +to the federation ID, and if detected, it calls the scheduling code. + +Step 4.
+ +Note d: the scheduler populates the clusterid field in the LRS with the +IDs of target clusters. + +Note e: at this point let us assume that it only does the even +distribution, i.e., equal weights for all of the underlying clusters. + +Step 5. As soon as the scheduler function returns control to FRSC, +the FRSC starts a number of cluster-level “informer”s, one for every +target cluster, to watch changes in every target cluster's etcd +regarding the posted LRS's; if any deviation from the scheduled +number of replicas is detected, the scheduling code is re-called for +re-scheduling purposes. + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/federated-replicasets.md?pixel)]() + diff --git a/contributors/design-proposals/multicluster/federated-services.md b/contributors/design-proposals/multicluster/federated-services.md new file mode 100644 index 00000000..8ec9ca29 --- /dev/null +++ b/contributors/design-proposals/multicluster/federated-services.md @@ -0,0 +1,519 @@ +# Kubernetes Cluster Federation (previously nicknamed "Ubernetes") + +## Cross-cluster Load Balancing and Service Discovery + +### Requirements and System Design + +### by Quinton Hoole, Dec 3 2015 + +## Requirements + +### Discovery, Load-balancing and Failover + +1. **Internal discovery and connection**: Pods/containers (running in + a Kubernetes cluster) must be able to easily discover and connect + to endpoints for Kubernetes services on which they depend in a + consistent way, irrespective of whether those services exist in a + different Kubernetes cluster within the same cluster federation. + Henceforth referred to as "cluster-internal clients", or simply + "internal clients". +1. **External discovery and connection**: External clients (running + outside a Kubernetes cluster) must be able to discover and connect + to endpoints for Kubernetes services on which they depend. + 1.
**External clients predominantly speak HTTP(S)**: External + clients are most often, but not always, web browsers, or at + least speak HTTP(S) - notable exceptions include Enterprise + Message Buses (Java, TLS), DNS servers (UDP), + SIP servers, and databases. +1. **Find the "best" endpoint:** Upon initial discovery and + connection, both internal and external clients should ideally find + "the best" endpoint if multiple eligible endpoints exist. "Best" + in this context implies the closest (by network topology) endpoint + that is both operational (as defined by some positive health check) + and not overloaded (by some published load metric). For example: + 1. An internal client should find an endpoint which is local to its + own cluster if one exists, in preference to one in a remote + cluster (if both are operational and non-overloaded). + Similarly, one in a nearby cluster (e.g. in the same zone or + region) is preferable to one further afield. + 1. An external client (e.g. in New York City) should find an + endpoint in a nearby cluster (e.g. U.S. East Coast) in + preference to one further away (e.g. Japan). +1. **Easy fail-over:** If the endpoint to which a client is connected + becomes unavailable (no network response/disconnected) or + overloaded, the client should reconnect to a better endpoint, + somehow. + 1. In the case where there exist one or more connection-terminating + load balancers between the client and the serving Pod, failover + might be completely automatic (i.e. the client's end of the + connection remains intact, and the client is completely + oblivious of the fail-over). This approach incurs network speed + and cost penalties (by traversing possibly multiple load + balancers), but requires zero smarts in clients, DNS libraries, + recursing DNS servers etc., as the IP address of the endpoint + remains constant over time. + 1. In a scenario where clients need to choose between multiple load + balancer endpoints (e.g.
one per cluster), multiple DNS A + records associated with a single DNS name enable even relatively + dumb clients to try the next IP address in the list of returned + A records (without even necessarily re-issuing a DNS resolution + request). For example, all major web browsers will try all A + records in sequence until a working one is found (TBD: justify + this claim with details for Chrome, IE, Safari, Firefox). + 1. In a slightly more sophisticated scenario, upon disconnection, a + smarter client might re-issue a DNS resolution query, and + (modulo DNS record TTL's which can typically be set as low as 3 + minutes, and buggy DNS resolvers, caches and libraries which + have been known to completely ignore TTL's), receive updated A + records specifying a new set of IP addresses to which to + connect. + +### Portability + +A Kubernetes application configuration (e.g. for a Pod, Replication +Controller, Service etc) should be able to be successfully deployed +into any Kubernetes Cluster or Federation of Clusters, +without modification. More specifically, a typical configuration +should work correctly (although possibly not optimally) across any of +the following environments: + +1. A single Kubernetes Cluster on one cloud provider (e.g. Google + Compute Engine, GCE). +1. A single Kubernetes Cluster on a different cloud provider + (e.g. Amazon Web Services, AWS). +1. A single Kubernetes Cluster on a non-cloud, on-premise data center +1. A Federation of Kubernetes Clusters all on the same cloud provider + (e.g. GCE). +1. A Federation of Kubernetes Clusters across multiple different cloud + providers and/or on-premise data centers (e.g. one cluster on + GCE/GKE, one on AWS, and one on-premise). + +### Trading Portability for Optimization + +It should be possible to explicitly opt out of portability across some +subset of the above environments in order to take advantage of +non-portable load balancing and DNS features of one or more +environments. 
More specifically, for example: + +1. For HTTP(S) applications running on GCE-only Federations, + [GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules) + should be usable. These provide single, static global IP addresses + which load balance and fail over globally (i.e. across both regions + and zones). These allow for really dumb clients, but they only + work on GCE, and only for HTTP(S) traffic. +1. For non-HTTP(S) applications running on GCE-only Federations within + a single region, + [GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/) + should be usable. These provide TCP (i.e. both HTTP/S and + non-HTTP/S) load balancing and failover, but only on GCE, and only + within a single region. + [Google Cloud DNS](https://cloud.google.com/dns) can be used to + route traffic between regions (and between different cloud + providers and on-premise clusters, as it's plain DNS, IP only). +1. For applications running on AWS-only Federations, + [AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/) + should be usable. These provide both L7 (HTTP(S)) and L4 load + balancing, but only within a single region, and only on AWS + ([AWS Route 53 DNS service](https://aws.amazon.com/route53/) can be + used to load balance and fail over across multiple regions, and is + also capable of resolving to non-AWS endpoints). + +## Component Cloud Services + +Cross-cluster Federated load balancing is built on top of the following: + +1. [GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules) + provide single, static global IP addresses which load balance and + fail over globally (i.e. across both regions and zones). These + allow for really dumb clients, but they only work on GCE, and only + for HTTP(S) traffic. +1. 
[GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/) + provide both HTTP(S) and non-HTTP(S) load balancing and failover, + but only on GCE, and only within a single region. +1. [AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/) + provide both L7 (HTTP(S)) and L4 load balancing, but only within a + single region, and only on AWS. +1. [Google Cloud DNS](https://cloud.google.com/dns) (or any other + programmable DNS service, like + [CloudFlare](http://www.cloudflare.com)) can be used to route + traffic between regions (and between different cloud providers and + on-premise clusters, as it's plain DNS, IP only). Google Cloud DNS + doesn't provide any built-in geo-DNS, latency-based routing, health + checking, weighted round robin or other advanced capabilities. + It's plain old DNS. We would need to build all the aforementioned + on top of it. It can provide internal DNS services (i.e. serve RFC + 1918 addresses). + 1. [AWS Route 53 DNS service](https://aws.amazon.com/route53/) can + be used to load balance and fail over across regions, and is also + capable of routing to non-AWS endpoints. It provides built-in + geo-DNS, latency-based routing, health checking, weighted + round robin and optional tight integration with some other + AWS services (e.g. Elastic Load Balancers). +1. Kubernetes L4 Service Load Balancing: This provides both a + [virtual cluster-local](http://kubernetes.io/v1.1/docs/user-guide/services.html#virtual-ips-and-service-proxies) + and a + [real externally routable](http://kubernetes.io/v1.1/docs/user-guide/services.html#type-loadbalancer) + service IP which is load-balanced (currently simple round-robin) + across the healthy pods comprising a service within a single + Kubernetes cluster. +1.
[Kubernetes Ingress](http://kubernetes.io/v1.1/docs/user-guide/ingress.html): +A generic wrapper around cloud-provided L4 and L7 load balancing services, and +roll-your-own load balancers run in pods, e.g. HA Proxy. + +## Cluster Federation API + +The Cluster Federation API for load balancing should be compatible with the equivalent +Kubernetes API, to ease porting of clients between Kubernetes and +federations of Kubernetes clusters. +Further details below. + +## Common Client Behavior + +To be useful, our load balancing solution needs to work properly with real +client applications. There are a few different classes of those... + +### Browsers + +These are the most common external clients, and are generally well-written; see below. + +### Well-written clients + +1. Do a DNS resolution every time they connect. +1. Don't cache beyond TTL (although a small percentage of the DNS + servers on which they rely might). +1. Do try multiple A records (in order) to connect. +1. (in an ideal world) Do use SRV records rather than hard-coded port numbers. + +Examples: + ++ all common browsers (except for SRV records) ++ ... + +### Dumb clients + +1. Don't do a DNS resolution every time they connect (or do cache beyond the +TTL). +1. Do try multiple A records. + +Examples: + ++ ... + +### Dumber clients + +1. Only do a DNS lookup once on startup. +1. Only try the first returned DNS A record. + +Examples: + ++ ... + +### Dumbest clients + +1. Never do a DNS lookup - are pre-configured with a single (or possibly +multiple) fixed server IP(s). Nothing else matters. + +## Architecture and Implementation + +### General Control Plane Architecture + +Each cluster hosts one or more Cluster Federation master components (Federation API +servers, controller managers with leader election, and etcd quorum members).
This +is documented in more detail in a separate design doc: +[Kubernetes and Cluster Federation Control Plane Resilience](https://docs.google.com/document/d/1jGcUVg9HDqQZdcgcFYlWMXXdZsplDdY6w3ZGJbU7lAw/edit#). + +In the description below, assume that 'n' clusters, named 'cluster-1'... +'cluster-n' have been registered against a Cluster Federation "federation-1", +each with their own set of Kubernetes API endpoints, e.g. +[http://endpoint-1.cluster-1](http://endpoint-1.cluster-1), +[http://endpoint-2.cluster-1](http://endpoint-2.cluster-1) +... [http://endpoint-m.cluster-n](http://endpoint-m.cluster-n). + +### Federated Services + +Federated Services are pretty straightforward. They're composed of multiple +equivalent underlying Kubernetes Services, each with their own external +endpoint, and a load balancing mechanism across them. Let's work through how +exactly that works in practice. + +Our user creates the following Federated Service (against a Federation +API endpoint): + + $ kubectl create -f my-service.yaml --context="federation-1" + +where `my-service.yaml` contains the following: + + kind: Service + metadata: + labels: + run: my-service + name: my-service + namespace: my-namespace + spec: + ports: + - port: 2379 + protocol: TCP + targetPort: 2379 + name: client + - port: 2380 + protocol: TCP + targetPort: 2380 + name: peer + selector: + run: my-service + type: LoadBalancer + +The Cluster Federation control system in turn creates one equivalent service (identical config to the above) +in each of the underlying Kubernetes clusters, each of which results in +something like this: + + $ kubectl get -o yaml --context="cluster-1" service my-service + + apiVersion: v1 + kind: Service + metadata: + creationTimestamp: 2015-11-25T23:35:25Z + labels: + run: my-service + name: my-service + namespace: my-namespace + resourceVersion: "147365" + selfLink: /api/v1/namespaces/my-namespace/services/my-service + uid: 33bfc927-93cd-11e5-a38c-42010af00002 + spec: + clusterIP:
10.0.153.185 + ports: + - name: client + nodePort: 31333 + port: 2379 + protocol: TCP + targetPort: 2379 + - name: peer + nodePort: 31086 + port: 2380 + protocol: TCP + targetPort: 2380 + selector: + run: my-service + sessionAffinity: None + type: LoadBalancer + status: + loadBalancer: + ingress: + - ip: 104.197.117.10 + +Similar services are created in `cluster-2` and `cluster-3`, each of which is +allocated its own `spec.clusterIP` and `status.loadBalancer.ingress.ip`. + +In the Cluster Federation `federation-1`, the resulting federated service looks as follows: + + $ kubectl get -o yaml --context="federation-1" service my-service + + apiVersion: v1 + kind: Service + metadata: + creationTimestamp: 2015-11-25T23:35:23Z + labels: + run: my-service + name: my-service + namespace: my-namespace + resourceVersion: "157333" + selfLink: /api/v1/namespaces/my-namespace/services/my-service + uid: 33bfc927-93cd-11e5-a38c-42010af00007 + spec: + clusterIP: + ports: + - name: client + nodePort: 31333 + port: 2379 + protocol: TCP + targetPort: 2379 + - name: peer + nodePort: 31086 + port: 2380 + protocol: TCP + targetPort: 2380 + selector: + run: my-service + sessionAffinity: None + type: LoadBalancer + status: + loadBalancer: + ingress: + - hostname: my-service.my-namespace.my-federation.my-domain.com + +Note that the federated service: + +1. Is API-compatible with a vanilla Kubernetes service. +1. Has no clusterIP (as it is cluster-independent). +1. Has a federation-wide load balancer hostname. + +In addition to the set of underlying Kubernetes services (one per cluster) +described above, the Cluster Federation control system has also created a DNS name (e.g. on +[Google Cloud DNS](https://cloud.google.com/dns) or +[AWS Route 53](https://aws.amazon.com/route53/), depending on configuration) +which provides load balancing across all of those services.
For example, in a +very basic configuration: + + $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com + my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.117.10 + my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77 + my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157 + +Each of the above IP addresses (which are just the external load balancer +ingress IP's of each cluster service) is of course load balanced across the pods +comprising the service in each cluster. + +In a more sophisticated configuration (e.g. on GCE or GKE), the Cluster +Federation control system +automatically creates a +[GCE Global L7 Load Balancer](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules) +which exposes a single, globally load-balanced IP: + + $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com + my-service.my-namespace.my-federation.my-domain.com 180 IN A 107.194.17.44 + +Optionally, the Cluster Federation control system also configures the local DNS servers (SkyDNS) +in each Kubernetes cluster to preferentially return the local +clusterIP for the service in that cluster, with other clusters' +external service IP's (or a global load-balanced IP) also configured +for failover purposes: + + $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com + my-service.my-namespace.my-federation.my-domain.com 180 IN A 10.0.153.185 + my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77 + my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157 + +If Cluster Federation Global Service Health Checking is enabled, multiple service health +checkers running across the federated clusters collaborate to monitor the health +of the service endpoints, and automatically remove unhealthy endpoints from the +DNS record (e.g. 
a majority quorum is required to vote a service endpoint +unhealthy, to avoid false positives due to individual health checker network +isolation). + +### Federated Replication Controllers + +So far we have a federated service defined, with a resolvable load balancer +hostname by which clients can reach it, but no pods serving traffic directed +there. So now we need a Federated Replication Controller. These are also fairly +straightforward, comprising multiple underlying Kubernetes Replication +Controllers which do the hard work of keeping the desired number of Pod replicas +alive in each Kubernetes cluster. + + $ kubectl create -f my-service-rc.yaml --context="federation-1" + +where `my-service-rc.yaml` contains the following: + + kind: ReplicationController + metadata: + labels: + run: my-service + name: my-service + namespace: my-namespace + spec: + replicas: 6 + selector: + run: my-service + template: + metadata: + labels: + run: my-service + spec: + containers: + - image: gcr.io/google_samples/my-service:v1 + name: my-service + ports: + - containerPort: 2379 + protocol: TCP + - containerPort: 2380 + protocol: TCP + +The Cluster Federation control system in turn creates one equivalent replication controller +(identical config to the above, except for the replica count) in each +of the underlying Kubernetes clusters, each of which results in +something like this: + + $ ./kubectl get -o yaml rc my-service --context="cluster-1" + kind: ReplicationController + metadata: + creationTimestamp: 2015-12-02T23:00:47Z + labels: + run: my-service + name: my-service + namespace: my-namespace + selfLink: /api/v1/namespaces/my-namespace/replicationcontrollers/my-service + uid: 86542109-9948-11e5-a38c-42010af00002 + spec: + replicas: 2 + selector: + run: my-service + template: + metadata: + labels: + run: my-service + spec: + containers: + - image: gcr.io/google_samples/my-service:v1 + name: my-service + ports: + - containerPort: 2379 + protocol: TCP + - containerPort:
2380 + protocol: TCP + resources: {} + dnsPolicy: ClusterFirst + restartPolicy: Always + status: + replicas: 2 + +The exact number of replicas created in each underlying cluster will of course +depend on what scheduling policy is in force. In the above example, the +scheduler created an equal number of replicas (2) in each of the three +underlying clusters, to make up the total of 6 replicas required. To handle +entire cluster failures, various approaches are possible, including: +1. **simple overprovisioning**, such that sufficient replicas remain even if a + cluster fails. This wastes some resources, but is simple and reliable. + +2. **pod autoscaling**, where the replication controller in each + cluster automatically and autonomously increases the number of + replicas in its cluster in response to the additional traffic + diverted from the failed cluster. This saves resources and is relatively + simple, but there is some delay in the autoscaling. + +3. **federated replica migration**, where the Cluster Federation + control system detects the cluster failure and automatically + increases the replica count in the remaining clusters to make up + for the lost replicas in the failed cluster. This does not seem to + offer any benefits relative to pod autoscaling above, and is + arguably more complex to implement, but we note it here as a + possibility. + +### Implementation Details + +The implementation approach and architecture are very similar to Kubernetes, so +if you're familiar with how Kubernetes works, none of what follows will be +surprising. One additional design driver not present in Kubernetes is that +the Cluster Federation control system aims to be resilient to individual cluster and availability zone +failures. So the control plane spans multiple clusters. More specifically: + ++ Cluster Federation runs its own distinct set of API servers (typically one + or more per underlying Kubernetes cluster).
These are completely + distinct from the Kubernetes API servers for each of the underlying + clusters. ++ Cluster Federation runs its own distinct quorum-based metadata store (etcd, + by default). Approximately 1 quorum member runs in each underlying + cluster ("approximately" because we aim for an odd number of quorum + members, and typically don't want more than 5 quorum members, even + if we have a larger number of federated clusters, so 2 clusters->3 + quorum members, 3->3, 4->3, 5->5, 6->5, 7->5 etc). + +Cluster Controllers in the Federation control system watch against the +Federation API server/etcd +state, and apply changes to the underlying Kubernetes clusters accordingly. They +also implement an anti-entropy mechanism for reconciling Cluster Federation "desired desired" +state against Kubernetes "actual desired" state. + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/federated-services.md?pixel)]() + diff --git a/contributors/design-proposals/multicluster/federation-clusterselector.md b/contributors/design-proposals/multicluster/federation-clusterselector.md new file mode 100644 index 00000000..154412c7 --- /dev/null +++ b/contributors/design-proposals/multicluster/federation-clusterselector.md @@ -0,0 +1,81 @@ +# ClusterSelector Federated Resource Placement + +This document proposes a design for label-based control over placement of +Federated resources. + +Tickets: + +- https://github.com/kubernetes/kubernetes/issues/29887 + +Authors: + +- Dan Wilson (emaildanwilson@github.com). +- Nikhil Jindal (nikhiljindal@github). + +## Background + +End users will often need a simple way to target a subset of clusters for deployment of resources. In some cases this will be for a specific cluster; in other cases it will be for groups of clusters. +A few examples... + +1. Deploy the foo service to all clusters in Europe +1. Deploy the bar service to cluster test15 +1.
Deploy the baz service to all prod clusters globally + +Currently, it's possible to control placement decisions of Federated ReplicaSets +using the `federation.kubernetes.io/replica-set-preferences` annotation. This provides functionality to change the number of replicas created on each Federated Cluster, by setting the quantity for each cluster by cluster name. Since cluster names are required, in situations where clusters are added/removed from Federation, the object definitions would have to change in order to maintain the same configuration. From the example above, if a new cluster is created in Europe and added to federation, then the replica-set-preferences would need to be updated to include the new cluster name. + +This proposal is to provide placement decision support for all object types using labels on the Federated Clusters as opposed to cluster names. The matching language currently used for nodeAffinity placement decisions onto nodes can be leveraged. + +Carrying forward the examples from above... + +1. "location=europe" +1. "someLabel exists" +1. "environment notin [qa, dev]" + +## Design + +The proposed design uses a ClusterSelector annotation whose value is parsed into a struct definition that follows the same design as the [NodeSelector type used w/ nodeAffinity](https://github.com/kubernetes/kubernetes/blob/master/pkg/api/types.go#L1972) and will also use the [Matches function](https://github.com/kubernetes/apimachinery/blob/master/pkg/labels/selector.go#L172) of the apimachinery project to determine if an object should be sent on to federated clusters or not. + +In situations where objects are not to be forwarded to federated clusters, a delete API call will be made instead, using the object definition. If the object does not exist it will be ignored. + +The federation-controller will be used to implement this, with shared logic stored as utility functions to reduce duplicated code where appropriate.
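To make the matching rules concrete, here is a minimal Go sketch of the forwarding decision. This is an illustrative simplification, not the actual federation-controller code: the `ClusterSelectorRequirement` struct and `shouldForward` helper are hypothetical stand-ins for the apimachinery `NodeSelector`/`Matches` machinery, and only a few representative operators are implemented.

```go
package main

import "fmt"

// ClusterSelectorRequirement mirrors the Key/Operator/Values shape described
// in this proposal. The names are hypothetical sketch types, not real API.
type ClusterSelectorRequirement struct {
	Key      string
	Operator string
	Values   []string
}

// matches reports whether one requirement is satisfied by a cluster's labels.
// Only a few representative operators are sketched here.
func (r ClusterSelectorRequirement) matches(clusterLabels map[string]string) bool {
	v, ok := clusterLabels[r.Key]
	switch r.Operator {
	case "Exists", "exists":
		return ok
	case "DoesNotExist", "!":
		return !ok
	case "In", "in", "=", "==":
		for _, want := range r.Values {
			if ok && v == want {
				return true
			}
		}
		return false
	case "NotIn", "notin", "!=":
		for _, want := range r.Values {
			if ok && v == want {
				return false
			}
		}
		return true
	}
	return false
}

// shouldForward applies the ALL-requirements-must-match rule; an empty
// requirement list (no annotation) forwards to every cluster.
func shouldForward(reqs []ClusterSelectorRequirement, clusterLabels map[string]string) bool {
	for _, r := range reqs {
		if !r.matches(clusterLabels) {
			return false
		}
	}
	return true
}

func main() {
	// The annotation from the ConfigMap example, parsed into requirements.
	reqs := []ClusterSelectorRequirement{
		{Key: "location", Operator: "in", Values: []string{"europe"}},
		{Key: "environment", Operator: "==", Values: []string{"prod"}},
	}
	fmt.Println(shouldForward(reqs, map[string]string{"location": "europe", "environment": "prod"})) // true
	fmt.Println(shouldForward(reqs, map[string]string{"location": "us", "environment": "prod"}))     // false
}
```

Note that in this sketch `notin` treats an absent key as a match, mirroring Kubernetes label-selector semantics, while `in` requires the key to be present.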
+ +### End User Functionality +The annotation `federation.alpha.kubernetes.io/cluster-selector` is used on Kubernetes objects to specify additional placement decisions that should be made. The value of the annotation will be a JSON object of type ClusterSelector, which is an array of type ClusterSelectorRequirement. + +Each ClusterSelectorRequirement is defined in three possible parts: +1. Key - Matches against label keys on the Federated Clusters. +1. Operator - Represents how the Key and/or Values will be matched against the label keys and values on the Federated Clusters; one of ("In", "in", "=", "==", "NotIn", "notin", "Exists", "exists", "!=", "DoesNotExist", "!", "Gt", "gt", "Lt", "lt"). +1. Values - Matches against the label values on the Federated Clusters using the Key specified. When the operator is "Exists", "exists", "DoesNotExist" or "!", Values should not be specified. + +An example ConfigMap that uses the ClusterSelector annotation. The YAML format is used here to show that the value of the annotation will still be JSON. +```yaml +apiVersion: v1 +data: + myconfigkey: myconfigdata +kind: ConfigMap +metadata: + annotations: + federation.alpha.kubernetes.io/cluster-selector: '[{"key": "location", "operator": + "in", "values": ["europe"]}, {"key": "environment", "operator": "==", "values": + ["prod"]}]' + creationTimestamp: 2017-02-07T19:43:40Z + name: myconfig +``` + +In order for the ConfigMap in the example above to be forwarded to a Federated Cluster, that cluster MUST have two labels: "location" with the value "europe" and "environment" with the value "prod". + +### Matching Logic + +The logic to determine if an object is sent to a Federated Cluster follows these rules. + +1. An object with no `federation.alpha.kubernetes.io/cluster-selector` annotation will always be forwarded on to all Federated Clusters even if they have labels configured. (This ensures no regression from existing functionality.) + +1.
If an object contains the `federation.alpha.kubernetes.io/cluster-selector`
+annotation, then ALL ClusterSelectorRequirements must match in order for the
+object to be forwarded to the Federated Cluster.
+
+1. If `federation.kubernetes.io/replica-set-preferences` are also defined, they
+will be applied AFTER the ClusterSelectorRequirements.
+
+## Open Questions
+
+1. Should there be any special considerations for when dependent resources
+would not be forwarded together to a Federated Cluster?
+1. How to improve the usability of this feature long term? It will certainly
+help to give first-class API support, but easier ways to map labels or
+requirements to objects may be required.
diff --git a/contributors/design-proposals/multicluster/federation-high-level-arch.png b/contributors/design-proposals/multicluster/federation-high-level-arch.png
new file mode 100644
index 00000000..8a416cc1
Binary files /dev/null and b/contributors/design-proposals/multicluster/federation-high-level-arch.png differ
diff --git a/contributors/design-proposals/multicluster/federation-lite.md b/contributors/design-proposals/multicluster/federation-lite.md
new file mode 100644
index 00000000..549f98df
--- /dev/null
+++ b/contributors/design-proposals/multicluster/federation-lite.md
@@ -0,0 +1,201 @@
+# Kubernetes Multi-AZ Clusters
+
+## (previously nicknamed "Ubernetes-Lite")
+
+## Introduction
+
+Full Cluster Federation will offer sophisticated federation between multiple kubernetes
+clusters, offering true high-availability, multiple provider support &
+cloud-bursting, multiple region support etc. However, many users have
+expressed a desire for a "reasonably" highly-available cluster that runs in
+multiple zones on GCE or availability zones in AWS, and can tolerate the failure
+of a single zone without the complexity of running multiple clusters.
+
+Multi-AZ Clusters aim to deliver exactly that functionality: to run a single
+Kubernetes cluster in multiple zones.
It will attempt to make reasonable +scheduling decisions, in particular so that a replication controller's pods are +spread across zones, and it will try to be aware of constraints - for example +that a volume cannot be mounted on a node in a different zone. + +Multi-AZ Clusters are deliberately limited in scope; for many advanced functions +the answer will be "use full Cluster Federation". For example, multiple-region +support is not in scope. Routing affinity (e.g. so that a webserver will +prefer to talk to a backend service in the same zone) is similarly not in +scope. + +## Design + +These are the main requirements: + +1. kube-up must allow bringing up a cluster that spans multiple zones. +1. pods in a replication controller should attempt to spread across zones. +1. pods which require volumes should not be scheduled onto nodes in a different zone. +1. load-balanced services should work reasonably + +### kube-up support + +kube-up support for multiple zones will initially be considered +advanced/experimental functionality, so the interface is not initially going to +be particularly user-friendly. As we design the evolution of kube-up, we will +make multiple zones better supported. + +For the initial implementation, kube-up must be run multiple times, once for +each zone. The first kube-up will take place as normal, but then for each +additional zone the user must run kube-up again, specifying +`KUBE_USE_EXISTING_MASTER=true` and `KUBE_SUBNET_CIDR=172.20.x.0/24`. This will then +create additional nodes in a different zone, but will register them with the +existing master. + +### Zone spreading + +This will be implemented by modifying the existing scheduler priority function +`SelectorSpread`. Currently this priority function aims to put pods in an RC +on different hosts, but it will be extended first to spread across zones, and +then to spread across hosts. 
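As a toy illustration of this two-level spreading (all types, names, and weights below are invented, not the actual `SelectorSpread` code), a priority function might penalize candidate nodes by how many of the RC's pods already run in their zone, and then on their host:

```go
package main

import "fmt"

// node is a simplified stand-in for a Kubernetes node: a name plus the
// zone recorded for it (see the implementation section on zone labels).
type node struct {
	Name string
	Zone string
}

// spreadScore returns a score per candidate node; lower is better.
// Pods already running for the same RC are counted per zone and per host,
// with the zone count weighted more heavily so that zone spreading
// dominates, and host spreading breaks ties within a zone.
func spreadScore(candidates []node, existingPods map[string]node) map[string]int {
	zoneCount := map[string]int{}
	hostCount := map[string]int{}
	for _, n := range existingPods {
		zoneCount[n.Zone]++
		hostCount[n.Name]++
	}
	scores := map[string]int{}
	for _, c := range candidates {
		scores[c.Name] = 10*zoneCount[c.Zone] + hostCount[c.Name]
	}
	return scores
}

func main() {
	candidates := []node{{"node-a", "us-central1-a"}, {"node-b", "us-central1-b"}}
	// One pod of this RC already runs on node-a in zone us-central1-a.
	existing := map[string]node{"pod-1": {"node-a", "us-central1-a"}}
	fmt.Println(spreadScore(candidates, existing))
	// node-b, in the empty zone, scores lower and is preferred.
}
```

Because this is only a priority (not a predicate), an overloaded or failing zone lowers a node's ranking without excluding it, matching the 'best effort' behaviour described below.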
+
+So that the scheduler does not need to call out to the cloud provider on every
+scheduling decision, we must somehow record the zone information for each node.
+The implementation of this will be described in the implementation section.
+
+Note that zone spreading is 'best effort'; zones are just one of the factors
+in making scheduling decisions, and thus it is not guaranteed that pods will
+spread evenly across zones. However, this is likely desirable: if a zone is
+overloaded or failing, we still want to schedule the requested number of pods.
+
+### Volume affinity
+
+Most cloud providers (at least GCE and AWS) cannot attach their persistent
+volumes across zones. Thus when a pod is being scheduled, if there is a volume
+attached, that will dictate the zone. This will be implemented using a new
+scheduler predicate (a hard constraint): `VolumeZonePredicate`.
+
+When `VolumeZonePredicate` observes a pod scheduling request that includes a
+volume, if that volume is zone-specific, `VolumeZonePredicate` will exclude any
+nodes not in that zone.
+
+Again, to avoid the scheduler calling out to the cloud provider, this will rely
+on information attached to the volumes. This means that this will only support
+PersistentVolumeClaims, because direct mounts do not have a place to attach
+zone information. PersistentVolumes will then include zone information where
+volumes are zone-specific.
+
+### Load-balanced services should operate reasonably
+
+For both AWS & GCE, Kubernetes creates a native cloud load-balancer for each
+service of type LoadBalancer. The native cloud load-balancers on both AWS &
+GCE are region-level, and support load-balancing across instances in multiple
+zones (in the same region). For both clouds, the behaviour of the native cloud
+load-balancer is reasonable in the face of failures (indeed, this is why clouds
+provide load-balancing as a primitive).
+ +For multi-AZ clusters we will therefore simply rely on the native cloud provider +load balancer behaviour, and we do not anticipate substantial code changes. + +One notable shortcoming here is that load-balanced traffic still goes through +kube-proxy controlled routing, and kube-proxy does not (currently) favor +targeting a pod running on the same instance or even the same zone. This will +likely produce a lot of unnecessary cross-zone traffic (which is likely slower +and more expensive). This might be sufficiently low-hanging fruit that we +choose to address it in kube-proxy / multi-AZ clusters, but this can be addressed +after the initial implementation. + + +## Implementation + +The main implementation points are: + +1. how to attach zone information to Nodes and PersistentVolumes +1. how nodes get zone information +1. how volumes get zone information + +### Attaching zone information + +We must attach zone information to Nodes and PersistentVolumes, and possibly to +other resources in future. There are two obvious alternatives: we can use +labels/annotations, or we can extend the schema to include the information. + +For the initial implementation, we propose to use labels. The reasoning is: + +1. It is considerably easier to implement. +1. We will reserve the two labels `failure-domain.alpha.kubernetes.io/zone` and +`failure-domain.alpha.kubernetes.io/region` for the two pieces of information +we need. By putting this under the `kubernetes.io` namespace there is no risk +of collision, and by putting it under `alpha.kubernetes.io` we clearly mark +this as an experimental feature. +1. We do not yet know whether these labels will be sufficient for all +environments, nor which entities will require zone information. Labels give us +more flexibility here. +1. Because the labels are reserved, we can move to schema-defined fields in +future using our cross-version mapping techniques. 
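For illustration, a node in a multi-zone cluster would then carry the two reserved labels; the label keys are as reserved above, while the node name and zone values here are hypothetical GCE-style examples:

```yaml
apiVersion: v1
kind: Node
metadata:
  name: node-1
  labels:
    failure-domain.alpha.kubernetes.io/region: us-central1
    failure-domain.alpha.kubernetes.io/zone: us-central1-a
```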
+
+### Node labeling
+
+We do not want to require an administrator to manually label nodes. We instead
+modify the kubelet to include the appropriate labels when it registers itself.
+The information is easily obtained by the kubelet from the cloud provider.
+
+### Volume labeling
+
+As with nodes, we do not want to require an administrator to manually label
+volumes. We will create an admission controller `PersistentVolumeLabel`.
+`PersistentVolumeLabel` will intercept requests to create PersistentVolumes,
+and will label them appropriately by calling in to the cloud provider.
+
+## AWS Specific Considerations
+
+The AWS implementation here is fairly straightforward. The AWS API is
+region-wide, meaning that a single call will find instances and volumes in all
+zones. In addition, instance ids and volume ids are unique per-region (and
+hence also per-zone). I believe they are actually globally unique, but I do
+not know if this is guaranteed; in any case we only need global uniqueness if
+we are to span regions, which will not be supported by multi-AZ clusters (to do
+that correctly requires a full Cluster Federation type approach).
+
+## GCE Specific Considerations
+
+The GCE implementation is more complicated than the AWS implementation because
+GCE APIs are zone-scoped. To perform an operation, we must perform one REST
+call per zone and combine the results, unless we can determine in advance that
+an operation references a particular zone. For many operations we can make
+that determination, but in some cases - such as listing all instances - we must
+combine results from calls in all relevant zones.
+
+A further complexity is that GCE volume names are scoped per-zone, not
+per-region. Thus it is permitted to have two volumes both named `myvolume` in
+two different GCE zones. (Instance names are currently unique per-region, and
+thus are not a problem for multi-AZ clusters.)
+
+The volume scoping leads to a (small) behavioural change for multi-AZ clusters on
+GCE. If you had two volumes both named `myvolume` in two different GCE zones,
+this would not be ambiguous when Kubernetes is operating only in a single zone.
+But, when operating a cluster across multiple zones, `myvolume` is no longer
+sufficient to specify a volume uniquely. Worse, the fact that a volume happens
+to be unambiguous at a particular time is no guarantee that it will continue to
+be unambiguous in future, because a volume with the same name could
+subsequently be created in a second zone. While perhaps unlikely in practice,
+we cannot automatically enable multi-AZ clusters for GCE users if this then causes
+volume mounts to stop working.
+
+This suggests that (at least on GCE), multi-AZ clusters must be optional (i.e.
+there must be a feature-flag). It may be that we can make this feature
+semi-automatic in future, by detecting whether nodes are running in multiple
+zones, but it seems likely that kube-up could instead simply set this flag.
+
+For the initial implementation, creating volumes with identical names will
+yield undefined results. Later, we may add some way to specify the zone for a
+volume (and possibly require that volumes have their zone specified when
+running in multi-AZ cluster mode). We could add a new `zone` field to the
+PersistentVolume type for GCE PD volumes, or we could use a DNS-style dotted
+name for the volume name (`<zone>.<volume-name>`)
+
+Initially therefore, the GCE changes will be to:
+
+1. change kube-up to support creation of a cluster in multiple zones
+1. pass a flag enabling multi-AZ clusters with kube-up
+1. change the kubernetes cloud provider to iterate through relevant zones when resolving items
+1.
tag GCE PD volumes with the appropriate zone information
+
+
+
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/federation-lite.md?pixel)]()
+
diff --git a/contributors/design-proposals/multicluster/federation-phase-1.md b/contributors/design-proposals/multicluster/federation-phase-1.md
new file mode 100644
index 00000000..157b5668
--- /dev/null
+++ b/contributors/design-proposals/multicluster/federation-phase-1.md
@@ -0,0 +1,407 @@
+# Ubernetes Design Spec (phase one)
+
+**Huawei PaaS Team**
+
+## INTRODUCTION
+
+In this document we propose a design for the “Control Plane” of
+Kubernetes (K8S) federation (a.k.a. “Ubernetes”). For background on
+this work please refer to
+[this proposal](federation.md).
+The document is arranged as follows. First we briefly list scenarios
+and use cases that motivate K8S federation work. These use cases drive
+the design, and they also verify it. We summarize the
+functionality requirements from these use cases, and define the “in
+scope” functionalities that will be covered by this design (phase
+one). After that we give an overview of the proposed architecture, API
+and building blocks. We also go through several activity flows to
+see how these building blocks work together to support the use cases.
+
+## REQUIREMENTS
+
+There are many reasons why customers may want to build a K8S
+federation:
+
++ **High Availability:** Customers want to be immune to the outage of
+  a single availability zone, region or even a cloud provider.
++ **Sensitive workloads:** Some workloads can only run on a particular
+  cluster. They cannot be scheduled to or migrated to other clusters.
++ **Capacity overflow:** Customers prefer to run workloads on a
+  primary cluster. But if the capacity of the cluster is not
+  sufficient, workloads should be automatically distributed to other
+  clusters.
++ **Vendor lock-in avoidance:** Customers want to spread their
+  workloads across different cloud providers, and be able to easily
+  increase or decrease the workload proportion of a specific provider.
++ **Cluster Size Enhancement:** Currently a K8S cluster can only support
+a limited size. While the community is actively improving it, it can
+be expected that cluster size will be a problem if K8S is used for
+large workloads or public PaaS infrastructure. While we can separate
+different tenants into different clusters, it would be good to have a
+unified view.
+
+Here are the functionality requirements derived from the above use cases:
+
++ Clients of the federation control plane API server can register and deregister
+clusters.
++ Workloads should be spread to different clusters according to the
+  workload distribution policy.
++ Pods are able to discover and connect to services hosted in other
+  clusters (in cases where inter-cluster networking is necessary,
+  desirable and implemented).
++ Traffic to these pods should be spread across clusters (in a manner
+  similar to load balancing, although it might not be strictly
+  speaking balanced).
++ The control plane needs to know when a cluster is down, and migrate
+  the workloads to other clusters.
++ Clients have a unified view and a central control point for the above
+  activities.
+
+## SCOPE
+
+It's difficult to produce in one step a perfect design that implements
+all the above requirements. Therefore we will take an iterative
+approach to designing and building the system. This document describes
+phase one of the whole work.
In phase one we will cover only the
+following objectives:
+
++ Define the basic building blocks and API objects of the control plane
++ Implement a basic end-to-end workflow
+  + Clients register federated clusters
+  + Clients submit a workload
+  + The workload is distributed to different clusters
+  + Service discovery
+  + Load balancing
+
+The following parts are NOT covered in phase one:
+
++ Authentication and authorization (other than basic client
+  authentication against the ubernetes API, and from the ubernetes control
+  plane to the underlying kubernetes clusters).
++ Deployment units other than replication controller and service
++ Complex distribution policies for workloads
++ Service affinity and migration
+
+## ARCHITECTURE
+
+The overall architecture of a control plane is shown below:
+
+![Ubernetes Architecture](ubernetes-design.png)
+
+Some design principles we are following in this architecture:
+
+1. Keep the underlying K8S clusters independent. They should have no
+   knowledge of the control plane or of each other.
+1. Keep the Ubernetes API interface compatible with the K8S API as much as
+   possible.
+1. Re-use concepts from K8S as much as possible. This reduces
+   customers' learning curve and is good for adoption.
+
+Below is a brief description of each module contained in the above diagram.
+
+## Ubernetes API Server
+
+The API Server in the Ubernetes control plane works just like the API
+Server in K8S. It talks to a distributed key-value store to persist,
+retrieve and watch API objects. This store is completely distinct
+from the kubernetes key-value stores (etcd) in the underlying
+kubernetes clusters. We still use `etcd` as the distributed
+storage so customers don't need to learn and manage a different
+storage system, although it is envisaged that other storage systems
+(e.g. Consul, ZooKeeper) will probably be developed and supported over
+time.
+
+## Ubernetes Scheduler
+
+The Ubernetes Scheduler schedules resources onto the underlying
+Kubernetes clusters.
For example it watches for unscheduled Ubernetes
+replication controllers (those that have not yet been scheduled onto
+underlying Kubernetes clusters) and performs the global scheduling
+work. For each unscheduled replication controller, it calls the policy
+engine to decide how to split workloads among clusters. It creates a
+Kubernetes Replication Controller on one or more underlying clusters,
+and posts them back to `etcd` storage.
+
+One subtlety worth noting here is that the scheduling decision is arrived at by
+combining the application-specific request from the user (which might
+include, for example, placement constraints), and the global policy specified
+by the federation administrator (for example, "prefer on-premise
+clusters over AWS clusters" or "spread load equally across clusters").
+
+## Ubernetes Cluster Controller
+
+The cluster controller
+performs the following two kinds of work:
+
+1. It watches all the sub-resources that are created by Ubernetes
+   components, like a sub-RC or a sub-service. It then creates the
+   corresponding API objects on the underlying K8S clusters.
+1. It periodically retrieves the available resource metrics from the
+   underlying K8S cluster, and updates them as object status of the
+   `cluster` API object. An alternative design might be to run a pod
+   in each underlying cluster that reports metrics for that cluster to
+   the Ubernetes control plane. Which approach is better remains an
+   open topic of discussion.
+
+## Ubernetes Service Controller
+
+The Ubernetes service controller is a federation-level implementation
+of the K8S service controller. It watches service resources created on
+the control plane, and creates corresponding K8S services on each involved K8S
+cluster. Besides interacting with service resources on each
+individual K8S cluster, the Ubernetes service controller also
+performs some global DNS registration work.
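The first kind of work above amounts to a reconcile step: diff what the control plane wants in a cluster against what that cluster actually has. A minimal schematic, with invented names and deliberately simplified types (not the actual controller code):

```go
package main

import "fmt"

// reconcile compares the sub-resources the control plane desires in one
// underlying cluster with the resources that actually exist there, and
// returns the names to create and to delete so that actual converges on
// desired. Real controllers also diff object specs for updates.
func reconcile(desired, actual map[string]bool) (create, remove []string) {
	for name := range desired {
		if !actual[name] {
			create = append(create, name)
		}
	}
	for name := range actual {
		if !desired[name] {
			remove = append(remove, name)
		}
	}
	return create, remove
}

func main() {
	desired := map[string]bool{"nginx-sub-rc": true}
	actual := map[string]bool{"stale-sub-rc": true}
	create, remove := reconcile(desired, actual)
	fmt.Println(create, remove) // [nginx-sub-rc] [stale-sub-rc]
}
```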
+
+## API OBJECTS
+
+## Cluster
+
+Cluster is a new first-class API object introduced in this design. For
+each registered K8S cluster there will be such an API resource in the
+control plane. The way clients register or deregister a cluster is to
+send corresponding REST requests to the following URL:
+`/api/{$version}/clusters`. Because the control plane behaves like a
+regular K8S client to the underlying clusters, the spec of a cluster
+object contains necessary properties like the K8S cluster address and
+credentials. The status of a cluster API object will contain the
+following information:
+
+1. Which phase of its lifecycle it is in.
+1. Cluster resource metrics for scheduling decisions.
+1. Other metadata, like the version of the cluster.
+
+$version.clusterSpec
+
+| Name | Description | Required | Schema | Default |
+| ---- | ----------- | -------- | ------ | ------- |
+| Address | address of the cluster | yes | address | |
+| Credential | the type (e.g. bearer token, client certificate etc) and data of the credential used to access the cluster. It's used for system routines (not on behalf of users) | yes | string | |
+
+$version.clusterStatus
+
+| Name | Description | Required | Schema | Default |
+| ---- | ----------- | -------- | ------ | ------- |
+| Phase | the recently observed lifecycle phase of the cluster | yes | enum | |
+| Capacity | represents the available resources of a cluster | yes | any | |
+| ClusterMeta | other cluster metadata, like the version | yes | ClusterMeta | |

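Putting the two schemas above together, a registered cluster object might serialize roughly as follows. The field casing and all values here are illustrative only; the proposal does not fix a concrete serialization:

```yaml
apiVersion: v1
kind: Cluster
metadata:
  name: cluster-foo
spec:
  address: https://cluster-foo.example.com   # K8S API endpoint of the cluster
  credential: <bearer token or client cert>  # used by control-plane routines
status:
  phase: running
  capacity:
    cpu: "400"
    memory: 1600Gi
  clusterMeta:
    version: v1.1
```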
+
+**For simplicity we didn't introduce a separate “cluster metrics” API
+object here**. The cluster resource metrics are stored in the cluster
+status section, just as we did for nodes in K8S. In phase one it
+only contains available CPU and memory resources. The
+cluster controller will periodically poll the underlying cluster API
+Server to get cluster capability. In phase one it gets the metrics by
+simply aggregating metrics from all nodes. In future we will improve
+this with more efficient ways like leveraging heapster, and also more
+metrics will be supported. Similar to node phases in K8S, the “phase”
+field includes the following values:
+
++ pending: newly registered clusters or clusters suspended by the admin
+  for various reasons. They are not eligible for accepting workloads.
++ running: clusters in normal status that can accept workloads.
++ offline: clusters temporarily down or not reachable.
++ terminated: clusters removed from the federation.
+
+Below is the state transition diagram.
+
+![Cluster State Transition Diagram](ubernetes-cluster-state.png)
+
+## Replication Controller
+
+A global workload submitted to the control plane is represented as a
+replication controller in the Cluster Federation control plane. When a replication controller
+is submitted to the control plane, clients need a way to express its
+requirements or preferences on clusters. Depending on the use case this
+may be complex. For example:
+
++ This workload can only be scheduled to cluster Foo. It cannot be
+  scheduled to any other clusters (use case: sensitive workloads).
++ This workload prefers cluster Foo. But if there is no available
+  capacity on cluster Foo, it's OK to be scheduled to cluster Bar
+  (use case: capacity overflow).
++ Seventy percent of this workload should be scheduled to cluster Foo,
+  and thirty percent should be scheduled to cluster Bar (use case:
+  vendor lock-in avoidance).
In phase one, we only introduce a
+  _clusterSelector_ field to filter acceptable clusters. In the default
+  case there is no such selector, which means any cluster is
+  acceptable.
+
+Below is a sample of the YAML to create such a replication controller.
+
+```yaml
+apiVersion: v1
+kind: ReplicationController
+metadata:
+  name: nginx-controller
+spec:
+  replicas: 5
+  selector:
+    app: nginx
+  template:
+    metadata:
+      labels:
+        app: nginx
+    spec:
+      containers:
+      - name: nginx
+        image: nginx
+        ports:
+        - containerPort: 80
+  clusterSelector:
+    name in (Foo, Bar)
+```
+
+Currently clusterSelector (implemented as a
+[LabelSelector](../../pkg/apis/extensions/v1beta1/types.go#L704))
+only supports a simple list of acceptable clusters. Workloads will be
+evenly distributed on these acceptable clusters in phase one. After
+phase one we will define syntax to represent more advanced
+constraints, like cluster preference ordering, the desired number of
+split workloads, the desired ratio of workloads spread across different
+clusters, etc.
+
+Besides this explicit “clusterSelector” filter, a workload may have
+some implicit scheduling restrictions. For example it may define a
+“nodeSelector” which can only be satisfied on some particular
+clusters. How to handle this will be addressed after phase one.
+
+## Federated Services
+
+The Service API object exposed by the Cluster Federation is similar to service
+objects on Kubernetes. It defines the access to a group of pods. The
+federation service controller will create corresponding Kubernetes
+service objects on the underlying clusters. These are detailed in a
+separate design document: [Federated Services](federated-services.md).
+
+## Pod
+
+In phase one we only support scheduling replication controllers. Pod
+scheduling will be supported in a later phase. This is primarily in
+order to keep the Cluster Federation API compatible with the Kubernetes API.
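The phase-one rule that workloads are evenly distributed among the acceptable clusters can be stated precisely; the sketch below is an illustration only (the proposal does not specify remainder handling, so leftover replicas here simply go to the clusters listed first):

```go
package main

import "fmt"

// splitReplicas divides the desired replica count of a federated
// replication controller evenly across the acceptable clusters;
// any remainder goes to the clusters listed first.
func splitReplicas(replicas int, clusters []string) map[string]int {
	out := map[string]int{}
	n := len(clusters)
	if n == 0 {
		return out
	}
	for i, c := range clusters {
		out[c] = replicas / n
		if i < replicas%n {
			out[c]++ // distribute the remainder one by one
		}
	}
	return out
}

func main() {
	// The sample RC above asks for 5 replicas with
	// clusterSelector "name in (Foo, Bar)".
	fmt.Println(splitReplicas(5, []string{"Foo", "Bar"})) // map[Bar:2 Foo:3]
}
```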
+
+## ACTIVITY FLOWS
+
+## Scheduling
+
+The diagram below shows how workloads are scheduled on the Cluster Federation control
+plane:
+
+1. A replication controller is created by the client.
+1. The APIServer persists it into the storage.
+1. The cluster controller periodically polls the latest available resource
+   metrics from the underlying clusters.
+1. The scheduler is watching all pending RCs. It picks up the RC, makes
+   policy-driven decisions and splits it into different sub-RCs.
+1. Each cluster controller is watching the sub-RCs bound to its
+   corresponding cluster. It picks up the newly created sub-RC.
+1. The cluster controller issues requests to the underlying cluster
+API Server to create the RC. In phase one we don't support complex
+distribution policies. The scheduling rule is basically:
+    1. If an RC does not specify any nodeSelector, it will be scheduled
+       to the least loaded K8S cluster(s) that has enough available
+       resources.
+    1. If an RC specifies _N_ acceptable clusters in the
+       clusterSelector, all replicas will be evenly distributed among
+       these clusters.
+
+There is a potential race condition here. Say at time _T1_ the control
+plane learns there are _m_ available resources in a K8S cluster. As
+the cluster is working independently it still accepts workload
+requests from other K8S clients or even another Cluster Federation control
+plane. The Cluster Federation scheduling decision is based on this data of
+available resources. However, when the actual RC creation happens in
+the cluster at time _T2_, the cluster may not have enough resources
+at that time. We will address this problem in later phases with some
+proposed solutions like resource reservation mechanisms.
+
+![Federated Scheduling](ubernetes-scheduling.png)
+
+## Service Discovery
+
+This part has been included in the section “Federated Service” of the
+document
+“[Federated Cross-cluster Load Balancing and Service Discovery Requirements and System Design](federated-services.md)”.
+Please refer to that document for details. + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/federation-phase-1.md?pixel)]() + diff --git a/contributors/design-proposals/multicluster/federation.md b/contributors/design-proposals/multicluster/federation.md new file mode 100644 index 00000000..fc595123 --- /dev/null +++ b/contributors/design-proposals/multicluster/federation.md @@ -0,0 +1,648 @@ +# Kubernetes Cluster Federation + +## (previously nicknamed "Ubernetes") + +## Requirements Analysis and Product Proposal + +## _by Quinton Hoole ([quinton@google.com](mailto:quinton@google.com))_ + +_Initial revision: 2015-03-05_ +_Last updated: 2015-08-20_ +This doc: [tinyurl.com/ubernetesv2](http://tinyurl.com/ubernetesv2) +Original slides: [tinyurl.com/ubernetes-slides](http://tinyurl.com/ubernetes-slides) +Updated slides: [tinyurl.com/ubernetes-whereto](http://tinyurl.com/ubernetes-whereto) + +## Introduction + +Today, each Kubernetes cluster is a relatively self-contained unit, +which typically runs in a single "on-premise" data centre or single +availability zone of a cloud provider (Google's GCE, Amazon's AWS, +etc). + +Several current and potential Kubernetes users and customers have +expressed a keen interest in tying together ("federating") multiple +clusters in some sensible way in order to enable the following kinds +of use cases (intentionally vague): + +1. _"Preferentially run my workloads in my on-premise cluster(s), but + automatically overflow to my cloud-hosted cluster(s) if I run out + of on-premise capacity"_. +1. _"Most of my workloads should run in my preferred cloud-hosted + cluster(s), but some are privacy-sensitive, and should be + automatically diverted to run in my secure, on-premise + cluster(s)"_. +1. _"I want to avoid vendor lock-in, so I want my workloads to run + across multiple cloud providers all the time. 
I change my set of + such cloud providers, and my pricing contracts with them, + periodically"_. +1. _"I want to be immune to any single data centre or cloud + availability zone outage, so I want to spread my service across + multiple such zones (and ideally even across multiple cloud + providers)."_ + +The above use cases are by necessity left imprecisely defined. The +rest of this document explores these use cases and their implications +in further detail, and compares a few alternative high level +approaches to addressing them. The idea of cluster federation has +informally become known as _"Ubernetes"_. + +## Summary/TL;DR + +Four primary customer-driven use cases are explored in more detail. +The two highest priority ones relate to High Availability and +Application Portability (between cloud providers, and between +on-premise and cloud providers). + +Four primary federation primitives are identified (location affinity, +cross-cluster scheduling, service discovery and application +migration). Fortunately not all four of these primitives are required +for each primary use case, so incremental development is feasible. + +## What exactly is a Kubernetes Cluster? + +A central design concept in Kubernetes is that of a _cluster_. While +loosely speaking, a cluster can be thought of as running in a single +data center, or cloud provider availability zone, a more precise +definition is that each cluster provides: + +1. a single Kubernetes API entry point, +1. a consistent, cluster-wide resource naming scheme +1. a scheduling/container placement domain +1. a service network routing domain +1. an authentication and authorization model. + +The above in turn imply the need for a relatively performant, reliable +and cheap network within each cluster. + +There is also assumed to be some degree of failure correlation across +a cluster, i.e. whole clusters are expected to fail, at least +occasionally (due to cluster-wide power and network failures, natural +disasters etc). 
Clusters are often relatively homogeneous in that all
+compute nodes are typically provided by a single cloud provider or
+hardware vendor, and connected by a common, unified network fabric.
+But these are not hard requirements of Kubernetes.
+
+Other classes of Kubernetes deployments than the one sketched above
+are technically feasible, but come with some challenges of their own,
+and are not yet common or explicitly supported.
+
+More specifically, having a Kubernetes cluster span multiple
+well-connected availability zones within a single geographical region
+(e.g. US North East, UK, Japan etc) is worthy of further
+consideration, in particular because it potentially addresses
+some of these requirements.
+
+## What use cases require Cluster Federation?
+
+Let's name a few concrete use cases to aid the discussion:
+
+## 1. Capacity Overflow
+
+_"I want to preferentially run my workloads in my on-premise cluster(s), but automatically "overflow" to my cloud-hosted cluster(s) when I run out of on-premise capacity."_
+
+This idea is known in some circles as "[cloudbursting](http://searchcloudcomputing.techtarget.com/definition/cloud-bursting)".
+
+**Clarifying questions:** What is the unit of overflow? Individual
+ pods? Probably not always. Replication controllers and their
+ associated sets of pods? Groups of replication controllers
+ (a.k.a. distributed applications)? How are persistent disks
+ overflowed? Can the "overflowed" pods communicate with their
+ brethren and sistren pods and services in the other cluster(s)?
+ Presumably yes, at higher cost and latency, provided that they use
+ external service discovery. Is "overflow" enabled only when creating
+ new workloads/replication controllers, or are existing workloads
+ dynamically migrated between clusters based on fluctuating available
+ capacity? If so, what is the desired behaviour, and how is it
+ achieved? How, if at all, does this relate to quota enforcement
+ (e.g.
if we run out of on-premise capacity, can all or only some
+ quotas transfer to other, potentially more expensive off-premise
+ capacity?)
+
+It seems that most of this boils down to:
+
+1. **location affinity** (pods relative to each other, and to other
+   stateful services like persistent storage - how is this expressed
+   and enforced?)
+1. **cross-cluster scheduling** (given location affinity constraints
+   and other scheduling policy, which resources are assigned to which
+   clusters, and by what?)
+1. **cross-cluster service discovery** (how do pods in one cluster
+   discover and communicate with pods in another cluster?)
+1. **cross-cluster migration** (how do compute and storage resources,
+   and the distributed applications to which they belong, move from
+   one cluster to another?)
+1. **cross-cluster load-balancing** (how is user traffic directed
+   to an appropriate cluster?)
+1. **cross-cluster monitoring and auditing** (a.k.a. Unified Visibility)
+
+## 2. Sensitive Workloads
+
+_"I want most of my workloads to run in my preferred cloud-hosted
+cluster(s), but some are privacy-sensitive, and should be
+automatically diverted to run in my secure, on-premise cluster(s). The
+list of privacy-sensitive workloads changes over time, and they're
+subject to external auditing."_
+
+**Clarifying questions:**
+1. What kinds of rules determine which
+workloads go where?
+    1. Is there in fact a requirement to have these rules be
+       declaratively expressed and automatically enforced, or is it
+       acceptable/better to have users manually select where to run
+       their workloads when starting them?
+    1. Is a static mapping from container (or more typically,
+       replication controller) to cluster maintained and enforced?
+    1. If so, is it only enforced on startup, or are things migrated
+       between clusters when the mappings change?
+
+This starts to look quite similar to "1. Capacity Overflow", and again
+seems to boil down to:
+
+1. location affinity
+1.
cross-cluster scheduling +1. cross-cluster service discovery +1. cross-cluster migration +1. cross-cluster monitoring and auditing +1. cross-cluster load balancing + +## 3. Vendor lock-in avoidance + +_"My CTO wants us to avoid vendor lock-in, so she wants our workloads +to run across multiple cloud providers at all times. She changes our +set of preferred cloud providers and pricing contracts with them +periodically, and doesn't want to have to communicate and manually +enforce these policy changes across the organization every time this +happens. She wants it centrally and automatically enforced, monitored +and audited."_ + +**Clarifying questions:** + +1. How does this relate to other use cases (high availability, +capacity overflow etc), given that they may all span multiple vendors? +It's probably not, strictly speaking, a separate +use case, but it's brought up so often as a requirement that it's +worth calling out explicitly. +1. Is a useful intermediate step to make it as simple as possible to + migrate an application from one vendor to another in a one-off fashion? + +Again, I think that this can probably be + reformulated as a Capacity Overflow problem - the fundamental + principles seem to be the same or substantially similar to those + above. + +## 4. "High Availability" + +_"I want to be immune to any single data centre or cloud availability +zone outage, so I want to spread my service across multiple such zones +(and ideally even across multiple cloud providers), and have my +service remain available even if one of the availability zones or +cloud providers "goes down"."_ + +It seems useful to split this into multiple sets of sub-use cases: + +1. Multiple availability zones within a single cloud provider (across + which feature sets like private networks, load balancing, + persistent disks, data snapshots etc are typically consistent and + explicitly designed to inter-operate). + 1. within the same geographical region (e.g.
metro) within which the network + is fast and cheap enough to be almost analogous to a single data + center. + 1. across multiple geographical regions, where high network cost and + poor network performance may be prohibitive. +1. Multiple cloud providers (typically with inconsistent feature sets, + more limited interoperability, and typically none of the cheap inter-cluster + networking described above). + +The single cloud provider case might be easier to implement (although +the multi-cloud provider implementation should just work for a single +cloud provider). We propose a high-level design catering for both, with +an initial implementation targeting a single cloud provider only. + +**Clarifying questions:** +**How does global external service discovery work?** In the steady + state, which external clients connect to which clusters? GeoDNS or + similar? What is the tolerable failover latency if a cluster goes + down? Maybe something like (make up some numbers, notwithstanding + some buggy DNS resolvers, TTLs, caches etc) ~3 minutes for ~90% of + clients to re-issue DNS lookups and reconnect to a new cluster when + their home cluster fails is good enough for most Kubernetes users + (or at least way better than the status quo), given that these sorts + of failures only happen a small number of times a year? + +**How does dynamic load balancing across clusters work, if at all?** + One simple starting point might be "it doesn't". i.e. if a service + in a cluster is deemed to be "up", it receives as much traffic as is + generated "nearby" (even if it overloads). If the service is deemed + to "be down" in a given cluster, "all" nearby traffic is redirected + to some other cluster within some number of seconds (failover could + be automatic or manual). Failover is essentially binary. An + improvement would be to detect when a service in a cluster reaches + maximum serving capacity, and dynamically divert additional traffic + to other clusters.
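To make the failover-then-divert policy described above concrete, here is a minimal sketch. It is purely illustrative and not part of any proposal: the cluster names, capacities, health flags and proximity ordering are all assumptions invented for the example.

```python
# Illustrative sketch only: capacity-aware cross-cluster traffic
# splitting (the "improvement" over purely binary failover described
# above). All cluster data here is a made-up assumption.

def split_traffic(clusters, demand):
    """Send demand to healthy clusters, nearest first, diverting the
    overflow once a cluster reaches its maximum serving capacity."""
    assignment = {}
    remaining = demand
    for name, info in clusters:  # assumed sorted by proximity to users
        if not info["healthy"]:
            continue  # binary failover: a "down" cluster gets no traffic
        share = min(remaining, info["capacity"])
        if share:
            assignment[name] = share
            remaining -= share
        if remaining == 0:
            break
    return assignment, remaining  # remaining > 0 means global overload

clusters = [
    ("us-east", {"healthy": True, "capacity": 100}),
    ("eu-west", {"healthy": False, "capacity": 100}),  # zone outage
    ("asia-ne", {"healthy": True, "capacity": 100}),
]
assignment, unserved = split_traffic(clusters, 150)
# us-east absorbs its full capacity of 100; the overflow of 50 is
# diverted past the failed eu-west cluster to asia-ne.
```

In a real deployment, the equivalent decision would live in whatever sits in front of the clusters (e.g. external monitoring driving GeoDNS), not in Kubernetes itself.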
But how exactly does all of this work, and how + much of it is provided by Kubernetes, as opposed to something else + bolted on top (e.g. external monitoring and manipulation of GeoDNS)? + +**How does this tie in with auto-scaling of services?** More + specifically, if I run my service across _n_ clusters globally, and + one (or more) of them fails, how do I ensure that the remaining _n-1_ + clusters have enough capacity to serve the additional, failed-over + traffic? Either: + +1. I constantly over-provision all clusters by 1/(n-1), the extra + fraction of its normal load that each surviving cluster must absorb + when one of the _n_ fails (potentially expensive), or +1. I "manually" (or automatically) update my replica count configurations in the + remaining clusters by 1/(n-1) when the failure occurs, and Kubernetes + takes care of the rest for me, or +1. Auto-scaling in the remaining clusters takes + care of it for me automagically as the additional failed-over + traffic arrives (with some latency). Note that this implies that + the cloud provider keeps the necessary resources on hand to + accommodate such auto-scaling (e.g. via something similar to AWS reserved + and spot instances). + +Up to this point, this use case ("Unavailability Zones") seems materially different from all the others above. It does not require dynamic cross-cluster service migration (we assume that the service is already running in more than one cluster when the failure occurs). Nor does it necessarily involve cross-cluster service discovery or location affinity. As a result, I propose that we address this use case somewhat independently of the others (although I strongly suspect that it will become substantially easier once we've solved the others). + +All of the above (regarding "Unavailability Zones") refers primarily +to already-running user-facing services, and minimizing the impact on +end users of those services becoming unavailable in a given cluster. +What about the people and systems that deploy Kubernetes services +(devops etc)?
Should they be automatically shielded from the impact +of the cluster outage? i.e. have their new resource creation requests +automatically diverted to another cluster during the outage? While +this specific requirement seems non-critical (manual fail-over seems +relatively non-arduous, ignoring the user-facing issues above), it +smells a lot like the first three use cases listed above ("Capacity +Overflow, Sensitive Workloads, Vendor lock-in..."), so if we address +those, we probably get this one free of charge. + +## Core Challenges of Cluster Federation + +As we saw above, a few common challenges fall out of most of these +use cases, namely: + +## Location Affinity + +Can the pods comprising a single distributed application be +partitioned across more than one cluster? More generally, how far +apart, in network terms, can a given client and server within a +distributed application reasonably be? A server need not necessarily +be a pod, but could instead be a persistent disk housing data, or some +other stateful network service. What is tolerable is typically +application-dependent, primarily influenced by network bandwidth +consumption, latency requirements and cost sensitivity. + +For simplicity, let's assume that all Kubernetes distributed +applications fall into one of three categories with respect to relative +location affinity: + +1. **"Strictly Coupled"**: Those applications that strictly cannot be + partitioned between clusters. They simply fail if they are + partitioned. When scheduled, all pods _must_ be scheduled to the + same cluster. To move them, we need to shut the whole distributed + application down (all pods) in one cluster, possibly move some + data, and then bring up all of the pods in another cluster. To + avoid downtime, we might bring up the replacement cluster and + divert traffic there before turning down the original, but the + principle is much the same.
In some cases moving the data might be + prohibitively expensive or time-consuming, in which case these + applications may be effectively _immovable_. +1. **"Strictly Decoupled"**: Those applications that can be + indefinitely partitioned across more than one cluster, to no + disadvantage. An embarrassingly parallel YouTube porn detector, + where each pod repeatedly dequeues a video URL from a remote work + queue, downloads and chews on the video for a few hours, and + arrives at a binary verdict, might be one such example. The pods + derive no benefit from being close to each other, or anything else + (other than the source of YouTube videos, which is assumed to be + equally remote from all clusters in this example). Each pod can be + scheduled independently, in any cluster, and moved at any time. +1. **"Preferentially Coupled"**: Somewhere between Coupled and + Decoupled. These applications prefer to have all of their pods + located in the same cluster (e.g. for failure correlation, network + latency or bandwidth cost reasons), but can tolerate being + partitioned for "short" periods of time (for example while + migrating the application from one cluster to another). Most small + to medium sized LAMP stacks with not-very-strict latency goals + probably fall into this category (provided that they use sane + service discovery and reconnect-on-fail, which they need to do + anyway to run effectively, even in a single Kubernetes cluster). + +From a fault isolation point of view, there are also opposites of the +above. For example, a master database and its slave replica might +need to be in different availability zones. We'll refer to this as +anti-affinity, although it is largely outside the scope of this +document.
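The three classes above can be made concrete with a toy placement function showing how a federation scheduler might treat them differently. This is a sketch under invented assumptions (cluster capacities measured in pods, affinity classes passed as plain strings), not a proposed API:

```python
# Illustrative sketch only: placement by location-affinity class.
# Class names mirror the taxonomy in this document; the cluster
# capacity data and the app structure are made-up assumptions.

def place(app, clusters):
    """Return {cluster: replica_count} honoring the app's class."""
    if app["affinity"] == "strictly-decoupled":
        # Any pod can go anywhere: spread replicas round-robin over
        # clusters with free capacity, one pod at a time.
        placement = {}
        replicas = app["replicas"]
        free = dict(clusters)
        while replicas > 0 and any(v > 0 for v in free.values()):
            for c in free:
                if replicas == 0:
                    break
                if free[c] > 0:
                    placement[c] = placement.get(c, 0) + 1
                    free[c] -= 1
                    replicas -= 1
        return placement
    # strictly-coupled and preferentially-coupled: all pods must land
    # together in a single cluster with room for the whole application.
    for c, cap in clusters.items():
        if cap >= app["replicas"]:
            return {c: app["replicas"]}
    raise RuntimeError("no single cluster can host the whole application")

clusters = {"on-prem": 2, "cloud-a": 5}
coupled = place({"affinity": "strictly-coupled", "replicas": 4}, clusters)
# on-prem (capacity 2) cannot host all 4 replicas, so the whole
# application lands in cloud-a; a decoupled app would be spread.
decoupled = place({"affinity": "strictly-decoupled", "replicas": 4}, clusters)
```

The real decision would of course also weigh policy, pricing and anti-affinity, per the rest of this section.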
+ +Note that there is somewhat of a continuum with respect to network +cost and quality between any two nodes, ranging from two nodes on the +same L2 network segment (lowest latency and cost, highest bandwidth) +to two nodes on different continents (highest latency and cost, lowest +bandwidth). One interesting point on that continuum relates to +multiple availability zones within a well-connected metro or region +and a single cloud provider. Despite being in different data centers, +or areas within a mega data center, the network in this case is often very fast +and effectively free or very cheap. For the purposes of this network location +affinity discussion, this case is considered analogous to a single +availability zone. Furthermore, if a given application doesn't fit +cleanly into one of the above, shoe-horn it into the best fit, +defaulting to the "Strictly Coupled and Immovable" bucket if you're +not sure. + +And then there's what I'll call _absolute_ location affinity. Some +applications are required to run in bounded geographical or network +topology locations. The reasons for this are typically +political/legislative (data privacy laws etc), or driven by network +proximity to consumers (or data providers) of the application ("most +of our users are in Western Europe, U.S. West Coast" etc). + +**Proposal:** First tackle Strictly Decoupled applications (which can + be trivially scheduled, partitioned or moved, one pod at a time). + Then tackle Preferentially Coupled applications (which must be + scheduled in totality in a single cluster, and can be moved, but + only in their entirety, and necessarily within some bounded time). + Leave Strictly Coupled applications to be manually moved between + clusters as required for the foreseeable future. + +## Cross-cluster service discovery + +I propose having pods use standard discovery methods used by external +clients of Kubernetes applications (i.e. DNS).
DNS might resolve to a +public endpoint in the local or a remote cluster. Except for Strictly +Coupled applications, software should be largely oblivious to which of +the two occurs. + +_Aside:_ How do we avoid "tromboning" through an external VIP when DNS +resolves to a public IP on the local cluster? Strictly speaking this +would be an optimization for some cases, and probably only matters to +high-bandwidth, low-latency communications. We could potentially +eliminate the trombone with some kube-proxy magic if necessary. More +detail to be added here, but feel free to shoot down the basic DNS +idea in the meantime. In addition, some applications rely on private +networking between clusters for security (e.g. AWS VPC or more +generally VPN). It should not be necessary to forsake this in +order to use Cluster Federation, for example by being forced to use public +connectivity between clusters. + +## Cross-cluster Scheduling + +This is closely related to location affinity above, and also discussed +there. The basic idea is that some controller, logically outside of +the basic Kubernetes control plane of the clusters in question, needs +to be able to: + +1. Receive "global" resource creation requests. +1. Make policy-based decisions as to which cluster(s) should be used + to fulfill each given resource request. In a simple case, the + request is just redirected to one cluster. In a more complex case, + the request is "demultiplexed" into multiple sub-requests, each to + a different cluster. Knowledge of the (albeit approximate) + available capacity in each cluster will be required by the + controller to sanely split the request. Similarly, knowledge of + the properties of the application (Location Affinity class -- + Strictly Coupled, Strictly Decoupled etc, privacy class etc) will + be required. It is also conceivable that knowledge of service + SLAs and monitoring thereof might provide an input into + scheduling/placement algorithms. +1.
Multiplex the responses from the individual clusters into an + aggregate response. + +There is of course a lot of detail still missing from this section, +including discussion of: + +1. admission control +1. initial placement of instances of a new +service vs. scheduling new instances of an existing service in response +to auto-scaling +1. rescheduling pods due to failure (the response might be +different depending on whether it's a failure of a node, rack, or whole AZ) +1. data placement relative to compute capacity, +etc. + +## Cross-cluster Migration + +Again this is closely related to location affinity discussed above, +and is in some sense an extension of Cross-cluster Scheduling. When +certain events occur, it becomes necessary or desirable for the +cluster federation system to proactively move distributed applications +(either in part or in whole) from one cluster to another. Examples of +such events include: + +1. A low capacity event in a cluster (or a cluster failure). +1. A change of scheduling policy ("we no longer use cloud provider X"). +1. A change of resource pricing ("cloud provider Y dropped their + prices - let's migrate there"). + +Strictly Decoupled applications can be trivially moved, in part or in +whole, one pod at a time, to one or more clusters (within applicable +policy constraints, for example "PrivateCloudOnly"). + +For Preferentially Coupled applications, the federation system must +first locate a single cluster with sufficient capacity to accommodate +the entire application, then reserve that capacity, and incrementally +move the application, one or more resources at a time, over to the +new cluster, within some bounded time period (and possibly within a +predefined "maintenance" window). Strictly Coupled applications (with +the exception of those deemed completely immovable) require the +federation system to: + +1. start up an entire replica application in the destination cluster +1.
copy persistent data to the new application instance (possibly + before starting pods) +1. switch user traffic across +1. tear down the original application instance + +It is proposed that support for automated migration of Strictly +Coupled applications be deferred to a later date. + +## Other Requirements + +These are often left implicit by customers, but are worth calling out explicitly: + +1. Software failure isolation between Kubernetes clusters should be + retained as far as is practically possible. The federation system + should not materially increase the failure correlation across + clusters. For this reason the federation control plane software + should ideally be completely independent of the Kubernetes cluster + control software, and look just like any other Kubernetes API + client, with no special treatment. If the federation control plane + software fails catastrophically, the underlying Kubernetes clusters + should remain independently usable. +1. Unified monitoring, alerting and auditing across federated Kubernetes clusters. +1. Unified authentication, authorization and quota management across + clusters (this is in direct conflict with failure isolation above, + so there are some tough trade-offs to be made here). + +## Proposed High-Level Architectures + +Two distinct potential architectural approaches have emerged from discussions +thus far: + +1. An explicitly decoupled and hierarchical architecture, where the + Federation Control Plane sits logically above a set of independent + Kubernetes clusters, each of which is (potentially) unaware of the + other clusters, and of the Federation Control Plane itself (other + than to the extent that it is an API client much like any other). + One possible example of this general architecture is illustrated + below, and will be referred to as the "Decoupled, Hierarchical" + approach. +1. 
A more monolithic architecture, where a single instance of the + Kubernetes control plane itself manages a single logical cluster + composed of nodes in multiple availability zones and cloud + providers. + +A very brief, non-exhaustive list of pros and cons of the two +approaches follows. (In the interest of full disclosure, the author +prefers the Decoupled Hierarchical model for the reasons stated below). + +1. **Failure isolation:** The Decoupled Hierarchical approach provides + better failure isolation than the Monolithic approach, as each + underlying Kubernetes cluster, and the Federation Control Plane, + can operate and fail completely independently of each other. In + particular, their software and configurations can be updated + independently. Such updates are, in our experience, the primary + cause of control-plane failures. +1. **Failure probability:** The Decoupled Hierarchical model incorporates + numerically more independent pieces of software and configuration + than the Monolithic one. But the complexity of each of these + decoupled pieces is arguably better contained in the Decoupled + model (per standard arguments for modular rather than monolithic + software design). Which of the two models presents higher + aggregate complexity and consequent failure probability remains + somewhat of an open question. +1. **Scalability:** Conceptually the Decoupled Hierarchical model wins + here, as each underlying Kubernetes cluster can be scaled + completely independently w.r.t. scheduling, node state management, + monitoring, network connectivity etc. It is even potentially + feasible to stack federations of clusters (i.e. create + federations of federations) should scalability of the independent + Federation Control Plane become an issue (although the author does + not envision this being a problem worth solving in the short + term). +1. **Code complexity:** I think that an argument can be made both ways + here.
It depends on whether you prefer to weave the logic for + handling nodes in multiple availability zones and cloud providers + within a single logical cluster into the existing Kubernetes + control plane code base (which was explicitly not designed for + this), or separate it into a decoupled Federation system (with + possible code sharing between the two via shared libraries). The + author prefers the latter because it: + 1. Promotes better code modularity and interface design. + 1. Allows the code + bases of Kubernetes and the Federation system to progress + largely independently (different sets of developers, different + release schedules etc). +1. **Administration complexity:** Again, I think that this could be argued + both ways. Superficially it would seem that administration of a + single Monolithic multi-zone cluster might be simpler by virtue of + being only "one thing to manage"; however, in practice each of the + underlying availability zones (and possibly cloud providers) has + its own capacity, pricing, hardware platforms, and possibly + bureaucratic boundaries (e.g. "our EMEA IT department manages those + European clusters"). So explicitly allowing for (but not + mandating) completely independent administration of each + underlying Kubernetes cluster, and the Federation system itself, + in the Decoupled Hierarchical model seems to have real practical + benefits that outweigh the superficial simplicity of the + Monolithic model. +1. **Application development and deployment complexity:** It's not clear + to me that there is any significant difference between the two + models in this regard. Presumably the API exposed by the two + different architectures would look very similar, as would the + behavior of the deployed applications. It has even been suggested + to write the code in such a way that it could be run in either + configuration. It's not clear that this makes sense in practice + though. +1.
**Control plane cost overhead:** There is a minimum per-cluster + overhead -- two (possibly virtual) machines, or more for redundant HA + deployments. For deployments of very small Kubernetes + clusters with the Decoupled Hierarchical approach, this cost can + become significant. + +### The Decoupled, Hierarchical Approach - Illustrated + +![image](federation-high-level-arch.png) + +## Cluster Federation API + +It is proposed that this look a lot like the existing Kubernetes API +but be explicitly multi-cluster. + ++ Clusters become first-class objects, which can be registered, + listed, described, deregistered etc via the API. ++ Compute resources can be explicitly requested in specific clusters, + or automatically scheduled to the "best" cluster by the Cluster + Federation control system (by a + pluggable Policy Engine). ++ There is a federated equivalent of a replication controller type (or + perhaps a [deployment](deployment.md)), + which is multicluster-aware, and delegates to cluster-specific + replication controllers/deployments as required (e.g. a federated RC for n + replicas might simply spawn multiple replication controllers in + different clusters to do the hard work). + +## Policy Engine and Migration/Replication Controllers + +The Policy Engine decides which parts of each application go into each +cluster at any point in time, and stores this desired state in the +Desired Federation State store (an etcd or +similar). Migration/Replication Controllers reconcile this against the +desired states stored in the underlying Kubernetes clusters (by +watching both, and creating or updating the underlying Replication +Controllers and related Services accordingly). + +## Authentication and Authorization + +This should ideally be delegated to some external auth system, shared +by the underlying clusters, to avoid duplication and inconsistency. +Either that, or we end up with multilevel auth.
Local readonly +eventually consistent auth slaves in each cluster and in the Cluster +Federation control system +could potentially cache auth, to mitigate an SPOF auth system. + +## Data consistency, failure and availability characteristics + +The services comprising the Cluster Federation control plane have to run + somewhere. Several options exist here: +* For high availability Cluster Federation deployments, these + services may run in either: + * a dedicated Kubernetes cluster, not co-located in the same + availability zone with any of the federated clusters (for fault + isolation reasons). If that cluster/availability zone, and hence the Federation + system, fails catastrophically, the underlying pods and + applications continue to run correctly, albeit temporarily + without the Federation system. + * across multiple Kubernetes availability zones, probably with + some sort of cross-AZ quorum-based store. This provides + theoretically higher availability, at the cost of some + complexity related to data consistency across multiple + availability zones. +* For simpler, less highly available deployments, just co-locate the + Federation control plane in/on/with one of the underlying + Kubernetes clusters. The downside of this approach is that if + that specific cluster fails, all automated failover and scaling + logic which relies on the federation system will also be + unavailable at the same time (i.e. precisely when it is needed). + But if one of the other federated clusters fails, everything + should work just fine. + +There is some further thinking to be done around the data consistency + model upon which the Federation system is based, and its impact + on the detailed semantics, failure and availability + characteristics of the system. + +## Proposed Next Steps + +Identify concrete applications of each use case and configure a proof +of concept service that exercises the use case.
For example, cluster +failure tolerance seems popular, so set up an apache frontend with +replicas in each of three availability zones with either an Amazon Elastic +Load Balancer or Google Cloud Load Balancer pointing at them? What +does the zookeeper config look like for N=3 across 3 AZs -- and how +does each replica find the other replicas and how do clients find +their primary zookeeper replica? And now how do I do a shared, highly +available redis database? Use a few common specific use cases like +this to flesh out the detailed API and semantics of Cluster Federation. + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/federation.md?pixel)]() + diff --git a/contributors/design-proposals/multicluster/ubernetes-cluster-state.png b/contributors/design-proposals/multicluster/ubernetes-cluster-state.png new file mode 100644 index 00000000..56ec2df8 Binary files /dev/null and b/contributors/design-proposals/multicluster/ubernetes-cluster-state.png differ diff --git a/contributors/design-proposals/multicluster/ubernetes-design.png b/contributors/design-proposals/multicluster/ubernetes-design.png new file mode 100644 index 00000000..44924846 Binary files /dev/null and b/contributors/design-proposals/multicluster/ubernetes-design.png differ diff --git a/contributors/design-proposals/multicluster/ubernetes-scheduling.png b/contributors/design-proposals/multicluster/ubernetes-scheduling.png new file mode 100644 index 00000000..01774882 Binary files /dev/null and b/contributors/design-proposals/multicluster/ubernetes-scheduling.png differ diff --git a/contributors/design-proposals/scheduling/podaffinity.md b/contributors/design-proposals/scheduling/podaffinity.md index 24930f96..30bdb256 100644 --- a/contributors/design-proposals/scheduling/podaffinity.md +++ b/contributors/design-proposals/scheduling/podaffinity.md @@ -313,7 +313,7 @@ scheduler to not put more than one pod from S in the same zone, and thus by definition it will 
not put more than one pod from S on the same node, assuming each node is in one zone. This rule is more useful as PreferredDuringScheduling anti-affinity, e.g. one might expect it to be common in -[Cluster Federation](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federation/federation.md) clusters.) +[Cluster Federation](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/multicluster/federation.md) clusters.) * **Don't co-locate pods of this service with pods from service "evilService"**: `{LabelSelector: selector that matches evilService's pods, TopologyKey: "node"}` diff --git a/contributors/devel/release/issues.md b/contributors/devel/release/issues.md index 7aa86297..819b6e41 100644 --- a/contributors/devel/release/issues.md +++ b/contributors/devel/release/issues.md @@ -32,7 +32,7 @@ The SIG owner label defines the SIG to which the bot will escalate if the issue or updated by the deadline. If there are no updates after escalation, the issue may automatically removed from the milestone. -e.g. `sig/node`, `sig/federation`, `sig/apps`, `sig/network` +e.g. `sig/node`, `sig/multicluster`, `sig/apps`, `sig/network` **Note:** - For test-infrastructure issues use `sig/testing`. diff --git a/sig-federation/ONCALL.md b/sig-federation/ONCALL.md deleted file mode 100644 index 41b506aa..00000000 --- a/sig-federation/ONCALL.md +++ /dev/null @@ -1,63 +0,0 @@ -# Overview - -We have an oncall rotation in the SIG. The role description is as follows: - -* Ensure that the testgrid (https://k8s-testgrid.appspot.com/sig-federation) is green. This person will be the point of contact if testgrid turns red. Will identify the problem and fix it (most common scenarios: find culprit PR and revert it or free quota by deleting leaked resources). -Will also report most common failure scenarios and suggest improvements. Its up to the sig or individuals to prioritize and take up those tasks. 
- -Oncall playbook: https://github.com/kubernetes/community/blob/master/contributors/devel/on-call-federation-build-cop.md - -# Joining the rotation - -Add your name at the end of the current rotation schedule if you want to join the rotation. -Anyone is free to join as long as they can perform the expected work described above. No special permissions are required but familiarity with existing codebase is recommended. - -# Swapping the rotation - -If anyone is away on their oncall week (vacation, illness, etc), they are responsible for finding someone to swap with (by sending a PR, approved by that person). Swapping one week for another is usually relatively uncontentious. - -# Extending the rotation schedule - -Anyone can extend the existing schedule by assigning upcoming weeks to people in the same order as the existing schedule. cc the rotation members on the PR so that they know. -Please extend the schedule unless there are atlease 2 people assigned after you. - -# Current Oncall schedule - -``` -25 September - 1 October: Madhu (https://github.com/madhusudancs) -2 October - 8 October: Shashidhara (https://github.com/shashidharatd) -9 October - 15 October: Christian (https://github.com/csbell) -16 October - 22 October: Nikhil Jindal (https://github.com/nikhiljindal) -23 October - 29 October: Irfan (https://github.com/irfanurrehman) -30 October - 5 November: Maru (https://github.com/marun) -6 November - 12 November: Jonathan (https://github.com/perotinus) -(Madhu to be removed from next rotation) -``` - -# Past 5 rotation cycles -``` -(Adding Irfan) -7 August - 13 August: Nikhil Jindal (https://github.com/nikhiljindal) -14 August - 20 August: Shashidhara (https://github.com/shashidharatd) -21 August - 27 August: Christian (https://github.com/csbell) -28 August - 3 September: Madhu (https://github.com/madhusudancs) -4 September - 10 September: Irfan (https://github.com/irfanurrehman) -11 September - 17 September: Maru (https://github.com/marun) -18 September - 24 
September: Jonathan (https://github.com/perotinus) - - -(Adding Jonathan) -26 June - 2 July: Nikhil Jindal (https://github.com/nikhiljindal) -3 July - 9 July: Shashidhara (https://github.com/shashidharatd) -10 July - 16 July: Christian (https://github.com/csbell) -17 July - 23 July: Madhu (https://github.com/madhusudancs) -24 July - 30 July: Maru (https://github.com/marun) -31 July - 6 August: Jonathan (https://github.com/perotinus) - - -22-28 May: Nikhil Jindal (https://github.com/nikhiljindal) -29 May - 4 June: Shashidhara (https://github.com/shashidharatd) -5 June - 11 June: Madhusudan (https://github.com/madhusudancs) -12 June - 18 June: Maru (https://github.com/marun) -19 June - 25 June: Christian (https://github.com/csbell) -``` diff --git a/sig-federation/OWNERS b/sig-federation/OWNERS deleted file mode 100644 index da6b0996..00000000 --- a/sig-federation/OWNERS +++ /dev/null @@ -1,6 +0,0 @@ -reviewers: - - csbell - - quinton-hoole -approvers: - - csbell - - quinton-hoole diff --git a/sig-federation/README.md b/sig-federation/README.md deleted file mode 100644 index dc3bff8e..00000000 --- a/sig-federation/README.md +++ /dev/null @@ -1,29 +0,0 @@ - -# Federation SIG - -Covers the Federation of Kubernetes Clusters and related topics. This includes: application resiliency against availability zone outages; hybrid clouds; spanning of multiple could providers; application migration from private to public clouds (and vice versa); and other similar subjects. - -## Meetings -* [Tuesdays at 16:30 UTC](https://plus.google.com/hangouts/_/google.com/k8s-federation) (biweekly). [Convert to your timezone](http://www.thetimezoneconverter.com/?t=16:30&tz=UTC). - -Meeting notes and Agenda can be found [here](https://docs.google.com/document/d/18mk62nOXE_MCSSnb4yJD_8UadtzJrYyJxFwbrgabHe8/edit). -Meeting recordings can be found [here](https://www.youtube.com/watch?v=iWKC3FsNHWg&list=PL69nYSiGNLP0HqgyqTby6HlDEz7i1mb0-). 
- -## Leads -* Christian Bell (**[@csbell](https://github.com/csbell)**), Google -* Quinton Hoole (**[@quinton-hoole](https://github.com/quinton-hoole)**), Huawei - -## Contact -* [Slack](https://kubernetes.slack.com/messages/sig-federation) -* [Mailing list](https://groups.google.com/forum/#!forum/kubernetes-sig-federation) - - - - diff --git a/sig-list.md b/sig-list.md index dfaf3650..2f411c2e 100644 --- a/sig-list.md +++ b/sig-list.md @@ -35,8 +35,8 @@ When the need arises, a [new SIG can be created](sig-creation-procedure.md) |[Cluster Ops](sig-cluster-ops/README.md)|cluster-ops|* [Rob Hirschfeld](https://github.com/zehicle), RackN
* [Jaice Singer DuMars](https://github.com/jdumars), Microsoft
|* [Slack](https://kubernetes.slack.com/messages/sig-cluster-ops)
* [Mailing List](https://groups.google.com/forum/#!forum/kubernetes-sig-cluster-ops)|* [Thursdays at 20:00 UTC (biweekly)](https://zoom.us/j/297937771)
|[Contributor Experience](sig-contributor-experience/README.md)|contributor-experience|* [Garrett Rodrigues](https://github.com/grodrigues3), Google
* [Elsie Phillips](https://github.com/Phillels), CoreOS
|* [Slack](https://kubernetes.slack.com/messages/sig-contribex)
* [Mailing List](https://groups.google.com/forum/#!forum/kubernetes-wg-contribex)|* [Wednesdays at 16:30 UTC (biweekly)](https://zoom.us/j/7658488911)
|[Docs](sig-docs/README.md)|docs|* [Devin Donnelly](https://github.com/devin-donnelly), Google
* [Jared Bhatti](https://github.com/jaredbhatti), Google
|* [Slack](https://kubernetes.slack.com/messages/sig-docs)
* [Mailing List](https://groups.google.com/forum/#!forum/kubernetes-sig-docs)|* [Tuesdays at 17:30 UTC (weekly)](https://zoom.us/j/678394311)
-|[Federation](sig-federation/README.md)|multicluster|* [Christian Bell](https://github.com/csbell), Google
* [Quinton Hoole](https://github.com/quinton-hoole), Huawei
|* [Slack](https://kubernetes.slack.com/messages/sig-federation)
* [Mailing List](https://groups.google.com/forum/#!forum/kubernetes-sig-federation)|* [Tuesdays at 16:30 UTC (biweekly)](https://plus.google.com/hangouts/_/google.com/k8s-federation)
|[Instrumentation](sig-instrumentation/README.md)|instrumentation|* [Piotr Szczesniak](https://github.com/piosz), Google
* [Fabian Reinartz](https://github.com/fabxc), CoreOS
|* [Slack](https://kubernetes.slack.com/messages/sig-instrumentation)
* [Mailing List](https://groups.google.com/forum/#!forum/kubernetes-sig-instrumentation)|* [Thursdays at 16:30 UTC (weekly)](https://zoom.us/j/5342565819)
+|[Multicluster](sig-multicluster/README.md)|multicluster|* [Christian Bell](https://github.com/csbell), Google
* [Quinton Hoole](https://github.com/quinton-hoole), Huawei
|* [Slack](https://kubernetes.slack.com/messages/sig-multicluster)
* [Mailing List](https://groups.google.com/forum/#!forum/kubernetes-sig-multicluster)|* [Tuesdays at 16:30 UTC (biweekly)](https://plus.google.com/hangouts/_/google.com/k8s-mc)
|[Network](sig-network/README.md)|network|* [Tim Hockin](https://github.com/thockin), Google
* [Dan Williams](https://github.com/dcbw), Red Hat
* [Casey Davenport](https://github.com/caseydavenport), Tigera
|* [Slack](https://kubernetes.slack.com/messages/sig-network)
* [Mailing List](https://groups.google.com/forum/#!forum/kubernetes-sig-network)|* [Thursdays at 21:00 UTC (biweekly)](https://zoom.us/j/5806599998)
|[Node](sig-node/README.md)|node|* [Dawn Chen](https://github.com/dchen1107), Google
* [Derek Carr](https://github.com/derekwaynecarr), Red Hat
|* [Slack](https://kubernetes.slack.com/messages/sig-node)
* [Mailing List](https://groups.google.com/forum/#!forum/kubernetes-sig-node)|* [Tuesdays at 17:00 UTC (weekly)](https://plus.google.com/hangouts/_/google.com/sig-node-meetup?authuser=0)
|[On Premise](sig-on-premise/README.md)|onprem|* [Marco Ceppi](https://github.com/marcoceppi), Canonical
* [Dalton Hubble](https://github.com/dghubble), CoreOS
|* [Slack](https://kubernetes.slack.com/messages/sig-onprem)
* [Mailing List](https://groups.google.com/forum/#!forum/kubernetes-sig-on-prem)|* [Wednesdays at 16:00 UTC (weekly)](https://zoom.us/my/k8s.sig.onprem)
diff --git a/sig-multicluster/ONCALL.md b/sig-multicluster/ONCALL.md new file mode 100644 index 00000000..72aca202 --- /dev/null +++ b/sig-multicluster/ONCALL.md @@ -0,0 +1,76 @@ +# Overview + +We have an oncall rotation for Federation in the SIG. The role description is as +follows: + +* Ensure that the testgrid (https://k8s-testgrid.appspot.com/sig-multicluster) + is green. This person will be the point of contact if testgrid turns red. + Will identify the problem and fix it (most common scenarios: find culprit PR + and revert it or free quota by deleting leaked resources). Will also report + most common failure scenarios and suggest improvements. It's up to the SIG or + individuals to prioritize and take up those tasks. + +Oncall playbook: +https://github.com/kubernetes/community/blob/master/contributors/devel/on-call-federation-build-cop.md + +# Joining the rotation + +Add your name at the end of the current rotation schedule if you want to join +the rotation. Anyone is free to join as long as they can perform the expected +work described above. No special permissions are required, but familiarity with +the existing codebase is recommended. + +# Swapping the rotation + +If anyone is away on their oncall week (vacation, illness, etc.), they are +responsible for finding someone to swap with (by sending a PR, approved by that +person). Swapping one week for another is usually relatively uncontentious. + +# Extending the rotation schedule + +Anyone can extend the existing schedule by assigning upcoming weeks to people in +the same order as the existing schedule. Cc the rotation members on the PR so +that they know. Please extend the schedule unless there are at least 2 people +assigned after you. 
+ +# Current Oncall schedule + +``` +25 September - 1 October: Madhu (https://github.com/madhusudancs) +2 October - 8 October: Shashidhara (https://github.com/shashidharatd) +9 October - 15 October: Christian (https://github.com/csbell) +16 October - 22 October: Nikhil Jindal (https://github.com/nikhiljindal) +23 October - 29 October: Irfan (https://github.com/irfanurrehman) +30 October - 5 November: Maru (https://github.com/marun) +6 November - 12 November: Jonathan (https://github.com/perotinus) +(Madhu to be removed from next rotation) +``` + +# Past 5 rotation cycles + +``` +(Adding Irfan) +7 August - 13 August: Nikhil Jindal (https://github.com/nikhiljindal) +14 August - 20 August: Shashidhara (https://github.com/shashidharatd) +21 August - 27 August: Christian (https://github.com/csbell) +28 August - 3 September: Madhu (https://github.com/madhusudancs) +4 September - 10 September: Irfan (https://github.com/irfanurrehman) +11 September - 17 September: Maru (https://github.com/marun) +18 September - 24 September: Jonathan (https://github.com/perotinus) + + +(Adding Jonathan) +26 June - 2 July: Nikhil Jindal (https://github.com/nikhiljindal) +3 July - 9 July: Shashidhara (https://github.com/shashidharatd) +10 July - 16 July: Christian (https://github.com/csbell) +17 July - 23 July: Madhu (https://github.com/madhusudancs) +24 July - 30 July: Maru (https://github.com/marun) +31 July - 6 August: Jonathan (https://github.com/perotinus) + + +22-28 May: Nikhil Jindal (https://github.com/nikhiljindal) +29 May - 4 June: Shashidhara (https://github.com/shashidharatd) +5 June - 11 June: Madhusudan (https://github.com/madhusudancs) +12 June - 18 June: Maru (https://github.com/marun) +19 June - 25 June: Christian (https://github.com/csbell) +``` diff --git a/sig-multicluster/OWNERS b/sig-multicluster/OWNERS new file mode 100644 index 00000000..da6b0996 --- /dev/null +++ b/sig-multicluster/OWNERS @@ -0,0 +1,6 @@ +reviewers: + - csbell + - quinton-hoole +approvers: + - csbell 
+ - quinton-hoole diff --git a/sig-multicluster/README.md b/sig-multicluster/README.md new file mode 100644 index 00000000..926d8822 --- /dev/null +++ b/sig-multicluster/README.md @@ -0,0 +1,29 @@ + +# Multicluster SIG + +Covers multi-cluster Kubernetes use cases and tooling. This includes: application resiliency against availability zone outages; hybrid clouds; spanning of multiple cloud providers; application migration from private to public clouds (and vice versa); and other similar subjects. This SIG was formerly called sig-federation and focused on the Federation project, but expanded its charter to all multi-cluster concerns in August 2017. + +## Meetings +* [Tuesdays at 16:30 UTC](https://plus.google.com/hangouts/_/google.com/k8s-mc) (biweekly). [Convert to your timezone](http://www.thetimezoneconverter.com/?t=16:30&tz=UTC). + +Meeting notes and agenda can be found [here](https://docs.google.com/document/d/18mk62nOXE_MCSSnb4yJD_8UadtzJrYyJxFwbrgabHe8/edit). +Meeting recordings can be found [here](https://www.youtube.com/watch?v=iWKC3FsNHWg&list=PL69nYSiGNLP0HqgyqTby6HlDEz7i1mb0-). + +## Leads +* Christian Bell (**[@csbell](https://github.com/csbell)**), Google +* Quinton Hoole (**[@quinton-hoole](https://github.com/quinton-hoole)**), Huawei + +## Contact +* [Slack](https://kubernetes.slack.com/messages/sig-multicluster) +* [Mailing list](https://groups.google.com/forum/#!forum/kubernetes-sig-multicluster) + + + + diff --git a/sigs.yaml b/sigs.yaml index e5a6add1..b49d1463 --- a/sigs.yaml +++ b/sigs.yaml @@ -334,14 +334,15 @@ sigs: contact: slack: sig-docs mailing_list: https://groups.google.com/forum/#!forum/kubernetes-sig-docs - - name: Federation - dir: sig-federation - mission_statement: > - Covers the Federation of Kubernetes Clusters and related - topics. 
This includes: application resiliency against availability zone - outages; hybrid clouds; spanning of multiple could providers; application - migration from private to public clouds (and vice versa); and other - similar subjects. + - name: Multicluster + dir: sig-multicluster + mission_statement: > + Covers multi-cluster Kubernetes use cases and tooling. This includes: + application resiliency against availability zone outages; hybrid clouds; + spanning of multiple cloud providers; application migration from private + to public clouds (and vice versa); and other similar subjects. This SIG + was formerly called sig-federation and focused on the Federation project, + but expanded its charter to all multi-cluster concerns in August 2017. label: multicluster leads: - name: Christian Bell @@ -354,12 +355,12 @@ sigs: - day: Tuesday utc: "16:30" frequency: biweekly - meeting_url: https://plus.google.com/hangouts/_/google.com/k8s-federation + meeting_url: https://plus.google.com/hangouts/_/google.com/k8s-mc meeting_archive_url: https://docs.google.com/document/d/18mk62nOXE_MCSSnb4yJD_8UadtzJrYyJxFwbrgabHe8/edit meeting_recordings_url: https://www.youtube.com/watch?v=iWKC3FsNHWg&list=PL69nYSiGNLP0HqgyqTby6HlDEz7i1mb0- contact: - slack: sig-federation - mailing_list: https://groups.google.com/forum/#!forum/kubernetes-sig-federation + slack: sig-multicluster + mailing_list: https://groups.google.com/forum/#!forum/kubernetes-sig-multicluster - name: Instrumentation dir: sig-instrumentation mission_statement: > @@ -830,4 +831,4 @@ workinggroups: meeting_archive_url: hhttps://docs.google.com/document/d/1Pxc-qwAt4FvuISZ_Ib5KdUwlynFkGueuzPx5Je_lbGM/edit contact: slack: wg-app-def - mailing_list: https://groups.google.com/forum/#!forum/kubernetes-wg-app-def \ No newline at end of file + mailing_list: https://groups.google.com/forum/#!forum/kubernetes-wg-app-def -- cgit v1.2.3