author     Quinton Hoole <quinton@google.com>  2016-01-05 17:53:54 -0800
committer  Quinton Hoole <quinton@google.com>  2016-03-03 15:49:47 -0800
commit     bbbede5eb69076167d635489d2cf48657be00dc1 (patch)
tree       1e41867bcdb636e0ef815f0debe75b4fae6c1030
parent     b231e7c92753e3be3b05e16de430e19afe659c4c (diff)
RFC design docs for Cluster Federation/Ubernetes.
-rw-r--r--  control-plane-resilience.md   269
-rw-r--r--  federated-services.md         550
-rw-r--r--  federation-phase-1.md         434
-rw-r--r--  ubernetes-cluster-state.png   bin 0 -> 13824 bytes
-rw-r--r--  ubernetes-design.png          bin 0 -> 20358 bytes
-rw-r--r--  ubernetes-scheduling.png      bin 0 -> 39094 bytes
6 files changed, 1253 insertions, 0 deletions
diff --git a/control-plane-resilience.md b/control-plane-resilience.md
new file mode 100644
index 00000000..8becccec
--- /dev/null
+++ b/control-plane-resilience.md
@@ -0,0 +1,269 @@
+<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
+
+<!-- BEGIN STRIP_FOR_RELEASE -->
+
+<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
+ width="25" height="25">
+<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
+ width="25" height="25">
+<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
+ width="25" height="25">
+<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
+ width="25" height="25">
+<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
+ width="25" height="25">
+
+<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+<strong>
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+</strong>
+--
+
+<!-- END STRIP_FOR_RELEASE -->
+
+<!-- END MUNGE: UNVERSIONED_WARNING -->
+
+# Kubernetes/Ubernetes Control Plane Resilience
+
+## Long Term Design and Current Status
+
+### by Quinton Hoole, Mike Danese and Justin Santa-Barbara
+
+### December 14, 2015
+
+## Summary
+
+Some confusion exists around how we currently ensure, and in future
+intend to ensure, resilience of the Kubernetes (and by implication
+Ubernetes) control plane. This document is an attempt to capture that
+definitively. It covers areas including self-healing, high
+availability, bootstrapping and recovery. Most of the information in
+this document already exists in the form of github comments,
+PR's/proposals, scattered documents, and corridor conversations, so
+this document is primarily a consolidation and clarification of
+existing ideas.
+
+## Terms
+
+* **Self-healing:** automatically restarting or replacing failed
+ processes and machines without human intervention
+* **High availability:** continuing to be available and work correctly
+ even if some components are down or uncontactable. This typically
+ involves multiple replicas of critical services, and a reliable way
+ to find available replicas. Note that it's possible (but not
+ desirable) to have high
+ availability properties (e.g. multiple replicas) in the absence of
+ self-healing properties (e.g. if a replica fails, nothing replaces
+ it). Fairly obviously, given enough time, such systems typically
+ become unavailable (after enough replicas have failed).
+* **Bootstrapping**: creating an empty cluster from nothing
+* **Recovery**: recreating a non-empty cluster after perhaps
+ catastrophic failure/unavailability/data corruption
+
+## Overall Goals
+
+1. **Resilience to single failures:** Kubernetes clusters constrained
+ to single availability zones should be resilient to individual
+ machine and process failures by being both self-healing and highly
+ available (within the context of such individual failures).
+1. **Ubiquitous resilience by default:** The default cluster creation
+ scripts for (at least) GCE, AWS and basic bare metal should adhere
+ to the above (self-healing and high availability) by default (with
+ options available to disable these features to reduce control plane
+ resource requirements if so required). It is hoped that other
+ cloud providers will also follow the above guidelines, but the
+ above 3 are the primary canonical use cases.
+1. **Resilience to some correlated failures:** Kubernetes clusters
+ which span multiple availability zones in a region should by
+ default be resilient to complete failure of one entire availability
+ zone (by similarly providing self-healing and high availability in
+ the default cluster creation scripts as above).
+1. **Default implementation shared across cloud providers:** The
+ differences between the default implementations of the above for
+ GCE, AWS and basic bare metal should be minimized. This implies
+ using shared libraries across these providers in the default
+ scripts in preference to highly customized implementations per
+ cloud provider. This is not to say that highly differentiated,
+ customized per-cloud cluster creation processes (e.g. for GKE on
+ GCE, or some hosted Kubernetes provider on AWS) are discouraged.
+ But those fall squarely outside the basic cross-platform OSS
+ Kubernetes distro.
+1. **Self-hosting:** Where possible, Kubernetes's existing mechanisms
+ for achieving system resilience (replication controllers, health
+ checking, service load balancing etc) should be used in preference
+ to building a separate set of mechanisms to achieve the same thing.
+   This implies that self-hosting (running the Kubernetes control
+   plane on Kubernetes) is strongly preferred, with the caveat below.
+1. **Recovery from catastrophic failure:** The ability to quickly and
+ reliably recover a cluster from catastrophic failure is critical,
+ and should not be compromised by the above goal to self-host
+ (i.e. it goes without saying that the cluster should be quickly and
+ reliably recoverable, even if the cluster control plane is
+ broken). This implies that such catastrophic failure scenarios
+ should be carefully thought out, and the subject of regular
+ continuous integration testing, and disaster recovery exercises.
+
+## Relative Priorities
+
+1. **(Possibly manual) recovery from catastrophic failures:** having a Kubernetes cluster, and all
+   applications running inside it, disappear forever is perhaps the worst
+   possible failure mode. So it is critical that we be able to
+   recover the applications running inside a cluster from such
+   failures in some well-bounded time period.
+   1. In theory a cluster can be recovered by replaying all API calls
+      that have ever been executed against it, in order, but most
+      often that state has been lost, and/or is scattered across
+      multiple client applications or groups. So in general it is
+      probably infeasible.
+   1. In theory a cluster can also be recovered to some relatively
+      recent non-corrupt backup/snapshot of the disk(s) backing the
+      etcd cluster state. But we have no default consistent
+      backup/snapshot, verification or restoration process. And we
+      don't routinely test restoration, so even if we did routinely
+      perform and verify backups, we have no hard evidence that we
+      can in practice effectively recover from catastrophic cluster
+      failure or data corruption by restoring from these backups (a
+      sketch of such a backup/restore flow follows this list). So
+      there's more work to be done here.
+1. **Self-healing:** Most major cloud providers provide the ability to
+ easily and automatically replace failed virtual machines within a
+ small number of minutes (e.g. GCE
+ [Auto-restart](https://cloud.google.com/compute/docs/instances/setting-instance-scheduling-options#autorestart)
+ and Managed Instance Groups,
+ AWS[ Auto-recovery](https://aws.amazon.com/blogs/aws/new-auto-recovery-for-amazon-ec2/)
+ and [Auto scaling](https://aws.amazon.com/autoscaling/) etc). This
+ can fairly trivially be used to reduce control-plane down-time due
+ to machine failure to a small number of minutes per failure
+ (i.e. typically around "3 nines" availability), provided that:
+ 1. cluster persistent state (i.e. etcd disks) is either:
+        1. truly persistent (i.e. remote persistent disks), or
+ 1. reconstructible (e.g. using etcd [dynamic member
+ addition](https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member)
+ or [backup and
+ recovery](https://github.com/coreos/etcd/blob/master/Documentation/admin_guide.md#disaster-recovery)).
+
+ 1. and boot disks are either:
+        1. truly persistent (i.e. remote persistent disks), or
+ 1. reconstructible (e.g. using boot-from-snapshot,
+ boot-from-pre-configured-image or
+ boot-from-auto-initializing image).
+1. **High Availability:** This has the potential to increase
+ availability above the approximately "3 nines" level provided by
+ automated self-healing, but it's somewhat more complex, and
+ requires additional resources (e.g. redundant API servers and etcd
+ quorum members). In environments where cloud-assisted automatic
+ self-healing might be infeasible (e.g. on-premise bare-metal
+ deployments), it also gives cluster administrators more time to
+ respond (e.g. replace/repair failed machines) without incurring
+ system downtime.
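+
+As a concrete illustration of the backup/restore gap called out in
+priority 1 above, a minimal etcd (v2) backup and restore flow might look
+roughly as follows; the paths and scheduling are illustrative
+assumptions, not an agreed process:
+
+    # On one etcd member: take a consistent backup of the keyspace.
+    $ etcdctl backup \
+        --data-dir /var/lib/etcd \
+        --backup-dir /var/backups/etcd-$(date +%Y%m%d)
+
+    # To restore after catastrophic failure: copy the backup onto a fresh
+    # node, start a new single-member cluster from it, then grow it back
+    # to full quorum size using runtime reconfiguration.
+    $ etcd --data-dir /var/backups/etcd-20151214 --force-new-cluster
+
+Any such process would need to be automated, verified and regularly
+exercised before we could rely on it, per the priority above.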
+
+## Design and Status (as of December 2015)
+
+<table>
+<tr>
+<td><b>Control Plane Component</b></td>
+<td><b>Resilience Plan</b></td>
+<td><b>Current Status</b></td>
+</tr>
+<tr>
+<td><b>API Server</b></td>
+<td>
+
+Multiple stateless, self-hosted, self-healing API servers behind an HA
+load balancer, built out by the default "kube-up" automation on GCE,
+AWS and basic bare metal (BBM). Note that the single-host approach of
+having etcd listen only on localhost to ensure that only the API server
+can connect to it will no longer work, so alternative security will be
+needed in this regard (either using firewall rules, SSL certs, or
+something else). All necessary flags are currently supported to enable
+SSL between API server and etcd (OpenShift runs like this out of the
+box), but this needs to be woven into the "kube-up" and related
+scripts (see the illustrative flag sketch below this table). Detailed
+design of self-hosting and related bootstrapping and catastrophic
+failure recovery will be covered in a separate design doc.
+
+</td>
+<td>
+
+No scripted self-healing or HA on GCE, AWS or basic bare metal
+currently exists in the OSS distro. To be clear, "no self healing"
+means that even if multiple e.g. API servers are provisioned for HA
+purposes, if they fail, nothing replaces them, so eventually the
+system will fail. Self-healing and HA can be set up
+manually by following documented instructions, but this is not
+currently an automated process, and it is not tested as part of
+continuous integration. So it's probably safest to assume that it
+doesn't actually work in practice.
+
+</td>
+</tr>
+<tr>
+<td><b>Controller manager and scheduler</b></td>
+<td>
+
+Multiple self-hosted, self healing warm standby stateless controller
+managers and schedulers with leader election and automatic failover of API server
+clients, automatically installed by default "kube-up" automation.
+
+</td>
+<td>As above.</td>
+</tr>
+<tr>
+<td><b>etcd</b></td>
+<td>
+
+Multiple (3-5) etcd quorum members behind a load balancer with session
+affinity (to prevent clients from being bounced from one to another).
+
+Regarding self-healing, if a node running etcd goes down, it is always necessary to do three
+things:
+<ol>
+<li>allocate a new node (not necessary if running etcd as a pod, in
+which case specific measures are required to prevent user pods from
+interfering with system pods, for example using node selectors),
+<li>start an etcd replica on that new node,
+<li>have the new replica recover the etcd state.
+</ol>
+In the case of local disk (which fails in concert with the machine), the etcd
+state must be recovered from the other replicas. This is called <A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member">dynamic member
+ addition</A>.
+In the case of remote persistent disk, the etcd state can be recovered
+by attaching the remote persistent disk to the replacement node, thus
+the state is recoverable even if all other replicas are down.
+
+There are also significant performance differences between local disks and remote
+persistent disks. For example, the <A HREF="https://cloud.google.com/compute/docs/disks/#comparison_of_disk_types">sustained throughput
+of local disks in GCE is approximately 20x that of remote disks</A>.
+
+Hence we suggest that self-healing be provided by remotely mounted persistent disks in
+non-performance critical, single-zone cloud deployments. For
+performance critical installations, faster local SSD's should be used,
+in which case remounting on node failure is not an option, so
+<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md">etcd runtime configuration</A>
+should be used to replace the failed machine (see the illustrative
+sketch below this table). Similarly, for
+cross-zone self-healing, cloud persistent disks are zonal, so
+automatic
+<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md">runtime configuration</A>
+is required. Basic bare metal deployments cannot generally rely on
+remote persistent disks, so the same approach applies there.
+</td>
+<td>
+<A HREF="http://kubernetes.io/v1.1/docs/admin/high-availability.html">
+Somewhat vague instructions exist</A>
+on how to set some of this up manually in a self-hosted
+configuration. But automatic bootstrapping and self-healing is not
+described (and is not implemented for the non-PD cases). This all
+still needs to be automated and continuously tested.
+</td>
+</tr>
+</table>
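+
+For reference (per the API Server row above), a minimal sketch of the
+existing kube-apiserver flags for securing API server to etcd traffic;
+the certificate paths below are illustrative assumptions, and wiring
+this into "kube-up" remains to be done:
+
+    kube-apiserver \
+      --etcd-servers=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379 \
+      --etcd-cafile=/srv/kubernetes/etcd-ca.crt \
+      --etcd-certfile=/srv/kubernetes/apiserver-etcd-client.crt \
+      --etcd-keyfile=/srv/kubernetes/apiserver-etcd-client.key \
+      ...   # plus the usual API server flags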
+
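+Similarly, a sketch of replacing a failed etcd member using runtime
+reconfiguration (dynamic member addition), as referenced in the etcd
+row above; the member names, IDs and peer URLs are illustrative:
+
+    # Remove the failed member, then register its replacement.
+    $ etcdctl member remove 272e204152
+    $ etcdctl member add etcd-3 http://10.0.0.13:2380
+
+    # On the replacement node, join the existing cluster (note
+    # --initial-cluster-state existing rather than bootstrapping anew).
+    $ etcd --name etcd-3 \
+        --initial-cluster "etcd-0=http://10.0.0.10:2380,etcd-1=http://10.0.0.11:2380,etcd-2=http://10.0.0.12:2380,etcd-3=http://10.0.0.13:2380" \
+        --initial-cluster-state existing
+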
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/control-plane-resilience.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/federated-services.md b/federated-services.md
new file mode 100644
index 00000000..6febfb21
--- /dev/null
+++ b/federated-services.md
@@ -0,0 +1,550 @@
+<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
+
+<!-- BEGIN STRIP_FOR_RELEASE -->
+
+<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
+ width="25" height="25">
+<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
+ width="25" height="25">
+<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
+ width="25" height="25">
+<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
+ width="25" height="25">
+<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
+ width="25" height="25">
+
+<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+<strong>
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+</strong>
+--
+
+<!-- END STRIP_FOR_RELEASE -->
+
+<!-- END MUNGE: UNVERSIONED_WARNING -->
+
+# Kubernetes Cluster Federation (a.k.a. "Ubernetes")
+
+## Cross-cluster Load Balancing and Service Discovery
+
+### Requirements and System Design
+
+### by Quinton Hoole, Dec 3 2015
+
+## Requirements
+
+### Discovery, Load-balancing and Failover
+
+1. **Internal discovery and connection**: Pods/containers (running in
+   a Kubernetes cluster) must be able to easily discover and connect
+   to endpoints for Kubernetes services on which they depend, in a
+   consistent way, irrespective of whether those services exist in a
+   different Kubernetes cluster within the same cluster federation.
+   These are henceforth referred to as "cluster-internal clients", or
+   simply "internal clients".
+1. **External discovery and connection**: External clients (running
+ outside a Kubernetes cluster) must be able to discover and connect
+ to endpoints for Kubernetes services on which they depend.
+   1. **External clients predominantly speak HTTP(S)**: External
+      clients are most often, but not always, web browsers, or at
+      least speak HTTP(S) - notable exceptions include Enterprise
+      Message Busses (Java, TLS), DNS servers (UDP),
+      SIP servers and databases.
+1. **Find the "best" endpoint:** Upon initial discovery and
+ connection, both internal and external clients should ideally find
+ "the best" endpoint if multiple eligible endpoints exist. "Best"
+ in this context implies the closest (by network topology) endpoint
+ that is both operational (as defined by some positive health check)
+ and not overloaded (by some published load metric). For example:
+ 1. An internal client should find an endpoint which is local to its
+ own cluster if one exists, in preference to one in a remote
+ cluster (if both are operational and non-overloaded).
+ Similarly, one in a nearby cluster (e.g. in the same zone or
+ region) is preferable to one further afield.
+ 1. An external client (e.g. in New York City) should find an
+ endpoint in a nearby cluster (e.g. U.S. East Coast) in
+ preference to one further away (e.g. Japan).
+1. **Easy fail-over:** If the endpoint to which a client is connected
+ becomes unavailable (no network response/disconnected) or
+ overloaded, the client should reconnect to a better endpoint,
+ somehow.
+ 1. In the case where there exist one or more connection-terminating
+ load balancers between the client and the serving Pod, failover
+ might be completely automatic (i.e. the client's end of the
+ connection remains intact, and the client is completely
+ oblivious of the fail-over). This approach incurs network speed
+ and cost penalties (by traversing possibly multiple load
+ balancers), but requires zero smarts in clients, DNS libraries,
+ recursing DNS servers etc, as the IP address of the endpoint
+ remains constant over time.
+ 1. In a scenario where clients need to choose between multiple load
+ balancer endpoints (e.g. one per cluster), multiple DNS A
+ records associated with a single DNS name enable even relatively
+ dumb clients to try the next IP address in the list of returned
+ A records (without even necessarily re-issuing a DNS resolution
+ request). For example, all major web browsers will try all A
+ records in sequence until a working one is found (TBD: justify
+ this claim with details for Chrome, IE, Safari, Firefox).
+ 1. In a slightly more sophisticated scenario, upon disconnection, a
+ smarter client might re-issue a DNS resolution query, and
+ (modulo DNS record TTL's which can typically be set as low as 3
+ minutes, and buggy DNS resolvers, caches and libraries which
+ have been known to completely ignore TTL's), receive updated A
+ records specifying a new set of IP addresses to which to
+ connect.
+
+### Portability
+
+A Kubernetes application configuration (e.g. for a Pod, Replication
+Controller, Service etc) should be able to be successfully deployed
+into any Kubernetes Cluster or Ubernetes Federation of Clusters,
+without modification. More specifically, a typical configuration
+should work correctly (although possibly not optimally) across any of
+the following environments:
+
+1. A single Kubernetes Cluster on one cloud provider (e.g. Google
+ Compute Engine, GCE)
+1. A single Kubernetes Cluster on a different cloud provider
+ (e.g. Amazon Web Services, AWS)
+1. A single Kubernetes Cluster on a non-cloud, on-premise data center
+1. A Federation of Kubernetes Clusters all on the same cloud provider
+ (e.g. GCE)
+1. A Federation of Kubernetes Clusters across multiple different cloud
+ providers and/or on-premise data centers (e.g. one cluster on
+ GCE/GKE, one on AWS, and one on-premise).
+
+### Trading Portability for Optimization
+
+It should be possible to explicitly opt out of portability across some
+subset of the above environments in order to take advantage of
+non-portable load balancing and DNS features of one or more
+environments. More specifically, for example:
+
+1. For HTTP(S) applications running on GCE-only Federations,
+ [GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules)
+ should be usable. These provide single, static global IP addresses
+ which load balance and fail over globally (i.e. across both regions
+ and zones). These allow for really dumb clients, but they only
+ work on GCE, and only for HTTP(S) traffic.
+1. For non-HTTP(S) applications running on GCE-only Federations within
+ a single region,
+ [GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/)
+ should be usable. These provide TCP (i.e. both HTTP/S and
+ non-HTTP/S) load balancing and failover, but only on GCE, and only
+ within a single region.
+ [Google Cloud DNS](https://cloud.google.com/dns) can be used to
+ route traffic between regions (and between different cloud
+ providers and on-premise clusters, as it's plain DNS, IP only).
+1. For applications running on AWS-only Federations,
+ [AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/)
+ should be usable. These provide both L7 (HTTP(S)) and L4 load
+ balancing, but only within a single region, and only on AWS
+ ([AWS Route 53 DNS service](https://aws.amazon.com/route53/) can be
+ used to load balance and fail over across multiple regions, and is
+ also capable of resolving to non-AWS endpoints).
+
+## Component Cloud Services
+
+Ubernetes cross-cluster load balancing is built on top of the following:
+
+1. [GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules)
+ provide single, static global IP addresses which load balance and
+ fail over globally (i.e. across both regions and zones). These
+ allow for really dumb clients, but they only work on GCE, and only
+ for HTTP(S) traffic.
+1. [GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/)
+ provide both HTTP(S) and non-HTTP(S) load balancing and failover,
+ but only on GCE, and only within a single region.
+1. [AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/)
+ provide both L7 (HTTP(S)) and L4 load balancing, but only within a
+ single region, and only on AWS.
+1. [Google Cloud DNS](https://cloud.google.com/dns) (or any other
+   programmable DNS service, like
+   [CloudFlare](http://www.cloudflare.com)) can be used to route
+ traffic between regions (and between different cloud providers and
+ on-premise clusters, as it's plain DNS, IP only). Google Cloud DNS
+ doesn't provide any built-in geo-DNS, latency-based routing, health
+ checking, weighted round robin or other advanced capabilities.
+ It's plain old DNS. We would need to build all the aforementioned
+ on top of it. It can provide internal DNS services (i.e. serve RFC
+ 1918 addresses).
+ 1. [AWS Route 53 DNS service](https://aws.amazon.com/route53/) can
+ be used to load balance and fail over across regions, and is also
+      capable of routing to non-AWS endpoints. It provides built-in
+ geo-DNS, latency-based routing, health checking, weighted
+ round robin and optional tight integration with some other
+ AWS services (e.g. Elastic Load Balancers).
+1. Kubernetes L4 Service Load Balancing: This provides both a
+ [virtual cluster-local](http://kubernetes.io/v1.1/docs/user-guide/services.html#virtual-ips-and-service-proxies)
+ and a
+ [real externally routable](http://kubernetes.io/v1.1/docs/user-guide/services.html#type-loadbalancer)
+ service IP which is load-balanced (currently simple round-robin)
+ across the healthy pods comprising a service within a single
+ Kubernetes cluster.
+1. [Kubernetes Ingress](http://kubernetes.io/v1.1/docs/user-guide/ingress.html): A generic wrapper around cloud-provided L4 and L7 load balancing services, and roll-your-own load balancers run in pods, e.g. HA Proxy.
+
+## Ubernetes API
+
+The Ubernetes API for load balancing should be compatible with the
+equivalent Kubernetes API, to ease porting of clients between
+Ubernetes and Kubernetes. Further details below.
+
+## Common Client Behavior
+
+To be useful, our load balancing solution needs to work properly with
+real client applications. There are a few different classes of
+those...
+
+### Browsers
+
+These are the most common external clients. For our purposes they behave as well-written clients; see below.
+
+### Well-written clients
+
+1. Do a DNS resolution every time they connect.
+1. Don't cache beyond TTL (although a small percentage of the DNS
+ servers on which they rely might).
+1. Do try multiple A records (in order) to connect.
+1. (in an ideal world) Do use SRV records rather than hard-coded port numbers.
+
+Examples:
+
++ all common browsers (except for SRV records)
++ ...
+
+### Dumb clients
+
+1. Don't do a DNS resolution every time they connect (or do cache
+ beyond the TTL).
+1. Do try multiple A records (in order) to connect.
+
+Examples:
+
++ ...
+
+### Dumber clients
+
+1. Only do a DNS lookup once on startup.
+1. Only try the first returned DNS A record.
+
+Examples:
+
++ ...
+
+### Dumbest clients
+
+1. Never do a DNS lookup - are pre-configured with a single (or
+ possibly multiple) fixed server IP(s). Nothing else matters.
+
+## Architecture and Implementation
+
+### General control plane architecture
+
+Each cluster hosts one or more Ubernetes master components (Ubernetes
+API servers, controller managers with leader election, and etcd quorum
+members). This is documented in more detail in a
+[separate design doc: Kubernetes/Ubernetes Control Plane Resilience](https://docs.google.com/document/d/1jGcUVg9HDqQZdcgcFYlWMXXdZsplDdY6w3ZGJbU7lAw/edit#).
+
+In the description below, assume that 'n' clusters, named
+'cluster-1'... 'cluster-n' have been registered against an Ubernetes
+Federation "federation-1", each with their own set of Kubernetes API
+endpoints, i.e.
+"[http://endpoint-1.cluster-1](http://endpoint-1.cluster-1),
+[http://endpoint-2.cluster-1](http://endpoint-2.cluster-1)
+... [http://endpoint-m.cluster-n](http://endpoint-m.cluster-n)".
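+
+The kubectl examples in the rest of this document assume that the
+client's kubeconfig defines one context for the federation itself and
+one per underlying cluster; a hypothetical setup might look like this
+(names purely for illustration):
+
+    $ kubectl config get-contexts
+    CURRENT   NAME           CLUSTER        AUTHINFO   NAMESPACE
+    *         federation-1   federation-1   admin      my-namespace
+              cluster-1      cluster-1      admin      my-namespace
+              cluster-2      cluster-2      admin      my-namespace
+              cluster-3      cluster-3      admin      my-namespace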
+
+### Federated Services
+
+Ubernetes Services are pretty straightforward. They're composed of
+multiple equivalent underlying Kubernetes Services, each with their
+own external endpoint, and a load balancing mechanism across them.
+Let's work through how exactly that works in practice.
+
+Our user creates the following Ubernetes Service (against an Ubernetes
+API endpoint):
+
+ $ kubectl create -f my-service.yaml --context="federation-1"
+
+where `my-service.yaml` contains the following:
+
+    apiVersion: v1
+    kind: Service
+ metadata:
+ labels:
+ run: my-service
+ name: my-service
+ namespace: my-namespace
+ spec:
+ ports:
+ - port: 2379
+ protocol: TCP
+ targetPort: 2379
+ name: client
+ - port: 2380
+ protocol: TCP
+ targetPort: 2380
+ name: peer
+ selector:
+ run: my-service
+ type: LoadBalancer
+
+Ubernetes in turn creates one equivalent service (identical config to
+the above) in each of the underlying Kubernetes clusters, each of
+which results in something like this:
+
+ $ kubectl get -o yaml --context="cluster-1" service my-service
+
+ apiVersion: v1
+ kind: Service
+ metadata:
+ creationTimestamp: 2015-11-25T23:35:25Z
+ labels:
+ run: my-service
+ name: my-service
+ namespace: my-namespace
+ resourceVersion: "147365"
+ selfLink: /api/v1/namespaces/my-namespace/services/my-service
+ uid: 33bfc927-93cd-11e5-a38c-42010af00002
+ spec:
+ clusterIP: 10.0.153.185
+ ports:
+ - name: client
+ nodePort: 31333
+ port: 2379
+ protocol: TCP
+ targetPort: 2379
+ - name: peer
+ nodePort: 31086
+ port: 2380
+ protocol: TCP
+ targetPort: 2380
+ selector:
+ run: my-service
+ sessionAffinity: None
+ type: LoadBalancer
+ status:
+ loadBalancer:
+ ingress:
+ - ip: 104.197.117.10
+
+Similar services are created in `cluster-2` and `cluster-3`, each of
+which is allocated its own `spec.clusterIP` and
+`status.loadBalancer.ingress.ip`.
+
+In Ubernetes `federation-1`, the resulting federated service looks as follows:
+
+ $ kubectl get -o yaml --context="federation-1" service my-service
+
+ apiVersion: v1
+ kind: Service
+ metadata:
+ creationTimestamp: 2015-11-25T23:35:23Z
+ labels:
+ run: my-service
+ name: my-service
+ namespace: my-namespace
+ resourceVersion: "157333"
+ selfLink: /api/v1/namespaces/my-namespace/services/my-service
+ uid: 33bfc927-93cd-11e5-a38c-42010af00007
+ spec:
+ clusterIP:
+ ports:
+ - name: client
+ nodePort: 31333
+ port: 2379
+ protocol: TCP
+ targetPort: 2379
+ - name: peer
+ nodePort: 31086
+ port: 2380
+ protocol: TCP
+ targetPort: 2380
+ selector:
+ run: my-service
+ sessionAffinity: None
+ type: LoadBalancer
+ status:
+ loadBalancer:
+ ingress:
+ - hostname: my-service.my-namespace.my-federation.my-domain.com
+
+Note that the federated service:
+
+1. is API-compatible with a vanilla Kubernetes service.
+1. has no clusterIP (as it is cluster-independent).
+1. has a federation-wide load balancer hostname.
+
+In addition to the set of underlying Kubernetes services (one per
+cluster) described above, Ubernetes has also created a DNS name
+(e.g. on [Google Cloud DNS](https://cloud.google.com/dns) or
+[AWS Route 53](https://aws.amazon.com/route53/), depending on
+configuration) which provides load balancing across all of those
+services. For example, in a very basic configuration:
+
+ $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com
+ my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.117.10
+ my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77
+ my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157
+
+Each of the above IP addresses (which are just the external load
+balancer ingress IP's of each cluster service) is of course load
+balanced across the pods comprising the service in each cluster.
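+
+As an illustrative sketch (not a commitment to a specific
+implementation), the Ubernetes service controller could program such
+records on Google Cloud DNS along these lines, assuming a managed zone
+named "my-federation-zone" already exists:
+
+    $ gcloud dns record-sets transaction start --zone="my-federation-zone"
+    $ gcloud dns record-sets transaction add --zone="my-federation-zone" \
+        --name="my-service.my-namespace.my-federation.my-domain.com." \
+        --type="A" --ttl="180" \
+        "104.197.117.10" "104.197.74.77" "104.197.38.157"
+    $ gcloud dns record-sets transaction execute --zone="my-federation-zone"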
+
+In a more sophisticated configuration (e.g. on GCE or GKE), Ubernetes
+automatically creates a
+[GCE Global L7 Load Balancer](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules)
+which exposes a single, globally load-balanced IP:
+
+ $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com
+ my-service.my-namespace.my-federation.my-domain.com 180 IN A 107.194.17.44
+
+Optionally, Ubernetes also configures the local DNS servers (SkyDNS)
+in each Kubernetes cluster to preferentially return the local
+clusterIP for the service in that cluster, with other clusters'
+external service IP's (or a global load-balanced IP) also configured
+for failover purposes:
+
+ $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com
+ my-service.my-namespace.my-federation.my-domain.com 180 IN A 10.0.153.185
+ my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77
+ my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157
+
+If Ubernetes Global Service Health Checking is enabled, multiple
+service health checkers running across the federated clusters
+collaborate to monitor the health of the service endpoints, and
+automatically remove unhealthy endpoints from the DNS record (e.g. a
+majority quorum is required to vote a service endpoint unhealthy, to
+avoid false positives due to individual health checker network
+isolation).
+
+### Federated Replication Controllers
+
+So far we have a federated service defined, with a resolvable load
+balancer hostname by which clients can reach it, but no pods serving
+traffic directed there. So now we need a Federated Replication
+Controller. These are also fairly straightforward, being composed
+of multiple underlying Kubernetes Replication Controllers which do the
+hard work of keeping the desired number of Pod replicas alive in each
+Kubernetes cluster.
+
+ $ kubectl create -f my-service-rc.yaml --context="federation-1"
+
+where `my-service-rc.yaml` contains the following:
+
+    apiVersion: v1
+    kind: ReplicationController
+    metadata:
+      labels:
+        run: my-service
+      name: my-service
+      namespace: my-namespace
+    spec:
+      replicas: 6
+      selector:
+        run: my-service
+      template:
+        metadata:
+          labels:
+            run: my-service
+        spec:
+          containers:
+          - image: gcr.io/google_samples/my-service:v1
+            name: my-service
+            ports:
+            - containerPort: 2379
+              protocol: TCP
+            - containerPort: 2380
+              protocol: TCP
+
+Ubernetes in turn creates one equivalent replication controller
+(identical config to the above, except for the replica count) in each
+of the underlying Kubernetes clusters, each of which results in
+something like this:
+
+ $ ./kubectl get -o yaml rc my-service --context="cluster-1"
+    apiVersion: v1
+    kind: ReplicationController
+    metadata:
+      creationTimestamp: 2015-12-02T23:00:47Z
+      labels:
+        run: my-service
+      name: my-service
+      namespace: my-namespace
+      selfLink: /api/v1/namespaces/my-namespace/replicationcontrollers/my-service
+      uid: 86542109-9948-11e5-a38c-42010af00002
+    spec:
+      replicas: 2
+      selector:
+        run: my-service
+      template:
+        metadata:
+          labels:
+            run: my-service
+        spec:
+          containers:
+          - image: gcr.io/google_samples/my-service:v1
+            name: my-service
+            ports:
+            - containerPort: 2379
+              protocol: TCP
+            - containerPort: 2380
+              protocol: TCP
+            resources: {}
+          dnsPolicy: ClusterFirst
+          restartPolicy: Always
+    status:
+      replicas: 2
+
+The exact number of replicas created in each underlying cluster will
+of course depend on what scheduling policy is in force. In the above
+example, the scheduler created an equal number of replicas (2) in each
+of the three underlying clusters, to make up the total of 6 replicas
+required. To handle entire cluster failures, various approaches are possible,
+including:
+
+1. **simple overprovisioning**, such that sufficient replicas remain even if a
+   cluster fails (see the worked example below this list). This wastes
+   some resources, but is simple and reliable.
+2. **pod autoscaling**, where the replication controller in each
+   cluster automatically and autonomously increases the number of
+   replicas in its cluster in response to the additional traffic
+   diverted from the
+   failed cluster. This saves resources and is relatively simple,
+   but there is some delay in the autoscaling.
+3. **federated replica migration**, where the Ubernetes Federation
+   Control Plane detects the cluster failure and automatically
+   increases the replica count in the remaining clusters to make up
+   for the lost replicas in the failed cluster. This does not seem to
+   offer any benefits relative to pod autoscaling above, and is
+   arguably more complex to implement, but we note it here as a
+   possibility.
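+
+As a worked example of the overprovisioning approach: with the
+three-cluster federation above and 6 required replicas, provisioning 3
+replicas per cluster (9 in total) ensures that the remaining two
+clusters still provide the required 6 replicas if any single cluster
+fails.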
+
+### Implementation Details
+
+The implementation approach and architecture are very similar to those
+of Kubernetes, so if you're familiar with how Kubernetes works, none of
+what follows will be surprising. One additional design driver not
+present in Kubernetes is that Ubernetes aims to be resilient to
+individual cluster and availability zone failures. So the control
+plane spans multiple clusters. More specifically:
+
++ Ubernetes runs its own distinct set of API servers (typically one
+  or more per underlying Kubernetes cluster). These are completely
+  distinct from the Kubernetes API servers for each of the underlying
+  clusters.
++ Ubernetes runs its own distinct quorum-based metadata store (etcd,
+  by default). Approximately 1 quorum member runs in each underlying
+  cluster ("approximately" because we aim for an odd number of quorum
+  members, and typically don't want more than 5 quorum members, even
+  if we have a larger number of federated clusters, so 2 clusters->3
+  quorum members, 3->3, 4->3, 5->5, 6->5, 7->5 etc; see the sketch below).
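+
+A sketch of that intended mapping from number of federated clusters to
+number of etcd quorum members (illustrative only):
+
+    # Largest odd number <= the number of clusters, clamped to [3, 5].
+    quorum_members() {
+      local n=$1 q
+      q=$(( n % 2 == 0 ? n - 1 : n ))
+      (( q < 3 )) && q=3
+      (( q > 5 )) && q=5
+      echo "$q"
+    }
+    quorum_members 4   # -> 3
+    quorum_members 7   # -> 5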
+
+Cluster Controllers in Ubernetes watch the Ubernetes API
+server/etcd state, and apply changes to the underlying Kubernetes
+clusters accordingly. They also implement an anti-entropy mechanism for
+reconciling Ubernetes "desired desired" state against Kubernetes
+"actual desired" state (i.e. what the federation wants, versus what has
+actually been requested of each underlying cluster).
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/federated-services.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/federation-phase-1.md b/federation-phase-1.md
new file mode 100644
index 00000000..baf1e472
--- /dev/null
+++ b/federation-phase-1.md
@@ -0,0 +1,434 @@
+<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
+
+<!-- BEGIN STRIP_FOR_RELEASE -->
+
+<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
+ width="25" height="25">
+<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
+ width="25" height="25">
+<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
+ width="25" height="25">
+<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
+ width="25" height="25">
+<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
+ width="25" height="25">
+
+<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+<strong>
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+</strong>
+--
+
+<!-- END STRIP_FOR_RELEASE -->
+
+<!-- END MUNGE: UNVERSIONED_WARNING -->
+
+# Ubernetes Design Spec (phase one)
+
+**Huawei PaaS Team**
+
+## INTRODUCTION
+
+In this document we propose a design for the “Control Plane” of
+Kubernetes (K8S) federation (a.k.a. “Ubernetes”). For background of
+this work please refer to
+[this proposal](../../docs/proposals/federation.md).
+The document is arranged as follows. First we briefly list the scenarios
+and use cases that motivate K8S federation work. These use cases drive
+the design, and they also serve to verify it. We summarize the
+functionality requirements derived from these use cases, and define the “in
+scope” functionalities that will be covered by this design (phase
+one). After that we give an overview of the proposed architecture, API
+and building blocks, and we walk through several activity flows to
+see how these building blocks work together to support the use cases.
+
+## REQUIREMENTS
+
+There are many reasons why customers may want to build a K8S
+federation:
+
++ **High Availability:** Customers want to be immune to the outage of
+ a single availability zone, region or even a cloud provider.
++ **Sensitive workloads:** Some workloads can only run on a particular
+ cluster. They cannot be scheduled to or migrated to other clusters.
++ **Capacity overflow:** Customers prefer to run workloads on a
+ primary cluster. But if the capacity of the cluster is not
+ sufficient, workloads should be automatically distributed to other
+ clusters.
++ **Vendor lock-in avoidance:** Customers want to spread their
+ workloads on different cloud providers, and can easily increase or
+ decrease the workload proportion of a specific provider.
++ **Cluster Size Enhancement:** Currently a K8S cluster can only
+support a limited number of nodes. While the community is actively
+improving this, it can be expected that cluster size will be a problem
+if K8S is used for large workloads or public PaaS infrastructure.
+While we can separate different tenants onto different clusters, it
+would be good to have a unified view.
+
+Here are the functionality requirements derived from above use cases:
+
++ Clients of the federation control plane API server can register and deregister clusters.
++ Workloads should be spread to different clusters according to the
+ workload distribution policy.
++ Pods are able to discover and connect to services hosted in other
+ clusters (in cases where inter-cluster networking is necessary,
+ desirable and implemented).
++ Traffic to these pods should be spread across clusters (in a manner
+ similar to load balancing, although it might not be strictly
+ speaking balanced).
++ The control plane needs to know when a cluster is down, and migrate
+ the workloads to other clusters.
++ Clients have a unified view and a central control point for the above
+  activities.
+
+## SCOPE
+
+It’s difficult to produce, in one pass, a perfect design that implements
+all the above requirements. Therefore we will take an iterative
+approach to designing and building the system. This document describes
+phase one of the whole work. In phase one we will cover only the
+following objectives:
+
++ Define the basic building blocks and API objects of control plane
++ Implement a basic end-to-end workflow
+ + Clients register federated clusters
+ + Clients submit a workload
+ + The workload is distributed to different clusters
+ + Service discovery
+ + Load balancing
+
+The following parts are NOT covered in phase one:
+
++ Authentication and authorization (other than basic client
+ authentication against the ubernetes API, and from ubernetes control
+ plane to the underlying kubernetes clusters).
++ Deployment units other than replication controller and service
++ Complex distribution policy of workloads
++ Service affinity and migration
+
+## ARCHITECTURE
+
+The overall architecture of the control plane is shown below:
+
+![Ubernetes Architecture](ubernetes-design.png)
+
+Some design principles we are following in this architecture:
+
+1. Keep the underlying K8S clusters independent. They should have no
+   knowledge of the control plane or of each other.
+1. Keep the Ubernetes API interface compatible with K8S API as much as
+ possible.
+1. Re-use concepts from K8S as much as possible. This reduces
+   customers’ learning curve and is good for adoption.
+
+Below is a brief description of each module contained in the above diagram.
+
+## Ubernetes API Server
+
+The API Server in the Ubernetes control plane works just like the API
+Server in K8S. It talks to a distributed key-value store to persist,
+retrieve and watch API objects. This store is completely distinct
+from the Kubernetes key-value stores (etcd) in the underlying
+Kubernetes clusters. We still use `etcd` as the distributed
+storage so customers don’t need to learn and manage a different
+storage system, although it is envisaged that other storage systems
+(e.g. Consul, ZooKeeper) will probably be developed and supported over
+time.
+
+## Ubernetes Scheduler
+
+The Ubernetes Scheduler schedules resources onto the underlying
+Kubernetes clusters. For example it watches for unscheduled Ubernetes
+replication controllers (those that have not yet been scheduled onto
+underlying Kubernetes clusters) and performs the global scheduling
+work. For each unscheduled replication controller, it calls the policy
+engine to decide how to split workloads among clusters. It creates a
+Kubernetes Replication Controller on one or more underlying clusters,
+and posts them back to `etcd` storage.
+
+One subtlety worth noting here is that the scheduling decision is
+arrived at by combining the application-specific request from the user (which might
+include, for example, placement constraints), and the global policy specified
+by the federation administrator (for example, "prefer on-premise
+clusters over AWS clusters" or "spread load equally across clusters").
+
+## Ubernetes Cluster Controller
+
+The cluster controller
+performs the following two kinds of work:
+
+1. It watches all the sub-resources that are created by Ubernetes
+   components, like a sub-RC or a sub-service, and creates the
+   corresponding API objects on the underlying K8S clusters.
+1. It periodically retrieves the available resources metrics from the
+ underlying K8S cluster, and updates them as object status of the
+ `cluster` API object. An alternative design might be to run a pod
+ in each underlying cluster that reports metrics for that cluster to
+ the Ubernetes control plane. Which approach is better remains an
+ open topic of discussion.
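+
+For illustration only (not a settled implementation), polling a
+cluster's capacity could be as simple as summing what each node reports
+in its status; for example, for whole-core CPU capacities, and assuming
+a kubeconfig context per underlying cluster:
+
+    $ kubectl --context="cluster-1" get nodes -o json | \
+        jq '[.items[].status.capacity.cpu | tonumber] | add'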
+
+## Ubernetes Service Controller
+
+The Ubernetes service controller is a federation-level implementation
+of the K8S service controller. It watches service resources created on
+the control plane, and creates corresponding K8S services on each
+involved K8S cluster. Besides interacting with service resources on
+each individual K8S cluster, the Ubernetes service controller also
+performs some global DNS registration work.
+
+## API OBJECTS
+
+## Cluster
+
+Cluster is a new first-class API object introduced in this design. For
+each registered K8S cluster there will be such an API resource in the
+control plane. Clients register or deregister a cluster by sending
+corresponding REST requests to the following URL:
+`/api/{$version}/clusters`. Because the control plane behaves like a
+regular K8S client to the underlying clusters, the spec of a cluster
+object contains necessary properties like the K8S cluster address and
+credentials. The status of a cluster API object will contain the
+following information:
+
+1. The current lifecycle phase of the cluster
+1. Cluster resource metrics for scheduling decisions
+1. Other metadata, like the version of the cluster
+
+$version.clusterSpec
+
+<table style="border:1px solid #000000;border-collapse:collapse;">
+<tbody>
+<tr>
+<td style="padding:5px;"><b>Name</b><br>
+</td>
+<td style="padding:5px;"><b>Description</b><br>
+</td>
+<td style="padding:5px;"><b>Required</b><br>
+</td>
+<td style="padding:5px;"><b>Schema</b><br>
+</td>
+<td style="padding:5px;"><b>Default</b><br>
+</td>
+</tr>
+<tr>
+<td style="padding:5px;">Address<br>
+</td>
+<td style="padding:5px;">address of the cluster<br>
+</td>
+<td style="padding:5px;">yes<br>
+</td>
+<td style="padding:5px;">address<br>
+</td>
+<td style="padding:5px;"><p></p></td>
+</tr>
+<tr>
+<td style="padding:5px;">Credential<br>
+</td>
+<td style="padding:5px;">the type (e.g. bearer token, client
+certificate etc) and data of the credential used to access the cluster. It’s used for system routines (not on behalf of users)<br>
+</td>
+<td style="padding:5px;">yes<br>
+</td>
+<td style="padding:5px;">string <br>
+</td>
+<td style="padding:5px;"><p></p></td>
+</tr>
+</tbody>
+</table>
+
+$version.clusterStatus
+
+<table style="border:1px solid #000000;border-collapse:collapse;">
+<tbody>
+<tr>
+<td style="padding:5px;"><b>Name</b><br>
+</td>
+<td style="padding:5px;"><b>Description</b><br>
+</td>
+<td style="padding:5px;"><b>Required</b><br>
+</td>
+<td style="padding:5px;"><b>Schema</b><br>
+</td>
+<td style="padding:5px;"><b>Default</b><br>
+</td>
+</tr>
+<tr>
+<td style="padding:5px;">Phase<br>
+</td>
+<td style="padding:5px;">the recently observed lifecycle phase of the cluster<br>
+</td>
+<td style="padding:5px;">yes<br>
+</td>
+<td style="padding:5px;">enum<br>
+</td>
+<td style="padding:5px;"><p></p></td>
+</tr>
+<tr>
+<td style="padding:5px;">Capacity<br>
+</td>
+<td style="padding:5px;">represents the available resources of a cluster<br>
+</td>
+<td style="padding:5px;">yes<br>
+</td>
+<td style="padding:5px;">any<br>
+</td>
+<td style="padding:5px;"><p></p></td>
+</tr>
+<tr>
+<td style="padding:5px;">ClusterMeta<br>
+</td>
+<td style="padding:5px;">Other cluster metadata like the version<br>
+</td>
+<td style="padding:5px;">yes<br>
+</td>
+<td style="padding:5px;">ClusterMeta<br>
+</td>
+<td style="padding:5px;"><p></p></td>
+</tr>
+</tbody>
+</table>
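+
+For illustration, registering a cluster could then be a simple REST
+POST of such an object to the control plane. The field values, host
+name and credential format below are hypothetical; the exact schema is
+defined by the tables above:
+
+    $ cat cluster-1.json
+    {
+      "kind": "Cluster",
+      "apiVersion": "v1",
+      "metadata": { "name": "cluster-1" },
+      "spec": {
+        "address": "https://104.197.10.20:6443",
+        "credential": "bearer-token-or-kubeconfig-reference-for-cluster-1"
+      }
+    }
+
+    $ curl -X POST -H "Content-Type: application/json" \
+        --data @cluster-1.json \
+        https://ubernetes.example.com/api/v1/clusters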
+
+**For simplicity we didn’t introduce a separate “cluster metrics” API
+object here**. The cluster resource metrics are stored in the cluster
+status section, just like what we did for nodes in K8S. In phase one it
+only contains available CPU resources and memory resources. The
+cluster controller will periodically poll the underlying cluster API
+Server to get the cluster capacity. In phase one it gets the metrics by
+simply aggregating metrics from all nodes. In future we will improve
+this with more efficient mechanisms, such as leveraging Heapster, and
+more metrics will be supported. Similar to node phases in K8S, the
+“phase” field includes the following values:
+
++ pending: newly registered clusters or clusters suspended by admin
+ for various reasons. They are not eligible for accepting workloads
++ running: clusters in normal status that can accept workloads
++ offline: clusters temporarily down or not reachable
++ terminated: clusters removed from federation
+
+Below is the state transition diagram.
+
+![Cluster State Transition Diagram](ubernetes-cluster-state.png)
+
+## Replication Controller
+
+A global workload submitted to the control plane is represented as an
+Ubernetes replication controller. When a replication controller
+is submitted to the control plane, clients need a way to express its
+requirements or preferences on clusters. Depending on the use case
+this may be complex. For example:
+
++ This workload can only be scheduled to cluster Foo. It cannot be
+  scheduled to any other clusters (use case: sensitive workloads).
++ This workload prefers cluster Foo. But if there is no available
+  capacity on cluster Foo, it’s OK for it to be scheduled to cluster Bar
+  (use case: capacity overflow).
++ Seventy percent of this workload should be scheduled to cluster Foo,
+  and thirty percent should be scheduled to cluster Bar (use case:
+  vendor lock-in avoidance).
+
+In phase one, we only introduce a _clusterSelector_ field to filter
+acceptable clusters. In the default case there is no such selector,
+which means any cluster is acceptable.
+
+Below is a sample of the YAML to create such a replication controller.
+
+```
+apiVersion: v1
+kind: ReplicationController
+metadata:
+ name: nginx-controller
+spec:
+ replicas: 5
+ selector:
+ app: nginx
+ template:
+ metadata:
+ labels:
+ app: nginx
+ spec:
+ containers:
+ - name: nginx
+ image: nginx
+ ports:
+ - containerPort: 80
+ clusterSelector:
+ name in (Foo, Bar)
+```
+
+Currently clusterSelector (implemented as a
+[LabelSelector](../../pkg/apis/extensions/v1beta1/types.go#L704))
+only supports a simple list of acceptable clusters. Workloads will be
+evenly distributed on these acceptable clusters in phase one. After
+phase one we will define syntax to represent more advanced
+constraints, like cluster preference ordering, desired number of
+split workloads, desired ratio of workloads spread across different
+clusters, etc.
+
+Besides this explicit “clusterSelector” filter, a workload may have
+some implicit scheduling restrictions. For example it defines
+“nodeSelector” which can only be satisfied on some particular
+clusters. How to handle this will be addressed after phase one.
+
+## Ubernetes Services
+
+The Service API object exposed by Ubernetes is similar to service
+objects on Kubernetes. It defines the access to a group of pods. The
+Ubernetes service controller will create corresponding Kubernetes
+service objects on underlying clusters. These are detailed in a
+separate design document: [Federated Services](federated-services.md).
+
+## Pod
+
+In phase one we only support scheduling replication controllers. Pod
+scheduling will be supported in a later phase. This is primarily in
+order to keep the Ubernetes API compatible with the Kubernetes API.
+
+## ACTIVITY FLOWS
+
+## Scheduling
+
+The below diagram shows how workloads are scheduled on the Ubernetes control plane:
+
+1. A replication controller is created by the client.
+1. The API Server persists it into the storage.
+1. The cluster controller periodically polls the latest available
+   resource metrics from the underlying clusters.
+1. The scheduler watches all pending RCs. It picks up an RC, makes
+   policy-driven decisions and splits it into different sub-RCs.
+1. Each cluster controller watches the sub-RCs bound to its
+   corresponding cluster. It picks up the newly created sub-RC.
+1. The cluster controller issues requests to the underlying cluster
+   API Server to create the RC. In phase one we don’t support complex
+   distribution policies. The scheduling rule is basically:
+   1. If an RC does not specify any nodeSelector, it will be scheduled
+      to the least loaded K8S cluster(s) that has enough available
+      resources.
+   1. If an RC specifies _N_ acceptable clusters in the
+      clusterSelector, all replicas will be evenly distributed among
+      these clusters.
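+
+For example, for the `nginx-controller` above (`replicas: 5`,
+clusterSelector `name in (Foo, Bar)`), the scheduler would create two
+sub-RCs, one bound to Foo and one to Bar, splitting the 5 replicas as
+evenly as possible (e.g. 3 and 2).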
+
+There is a potential race condition here. Say at time _T1_ the control
+plane learns that there are _m_ available resources in a K8S cluster. As
+the cluster is working independently, it still accepts workload
+requests from other K8S clients or even another Ubernetes control
+plane. The Ubernetes scheduling decision is based on this snapshot of
+available resources. However, when the actual RC creation happens on
+the cluster at time _T2_, the cluster may not have enough resources
+at that time. We will address this problem in later phases with
+proposed solutions like resource reservation mechanisms.
+
+![Ubernetes Scheduling](ubernetes-scheduling.png)
+
+## Service Discovery
+
+This part has been included in the section “Federated Service” of the
+document
+“[Ubernetes Cross-cluster Load Balancing and Service Discovery Requirements and System Design](federated-services.md)”. Please
+refer to that document for details.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/federation-phase-1.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/ubernetes-cluster-state.png b/ubernetes-cluster-state.png
new file mode 100644
index 00000000..56ec2df8
--- /dev/null
+++ b/ubernetes-cluster-state.png
Binary files differ
diff --git a/ubernetes-design.png b/ubernetes-design.png
new file mode 100644
index 00000000..44924846
--- /dev/null
+++ b/ubernetes-design.png
Binary files differ
diff --git a/ubernetes-scheduling.png b/ubernetes-scheduling.png
new file mode 100644
index 00000000..01774882
--- /dev/null
+++ b/ubernetes-scheduling.png
Binary files differ