author Bobby (Babak) Salamat <bsalamat@google.com> 2017-07-13 11:03:57 -0700
committer GitHub <noreply@github.com> 2017-07-13 11:03:57 -0700
commit 9b7c8fafa9047833091052651ee2e8100a649ae7
tree c7a2f4329b17247da329fa4b78d9391c86703df4
parent 54309f17cbf9fd1da3fc316fcef08e3779d975fc
Design proposal for adding priority to Kubernetes API (#604)
* Add a design proposal for adding priority to Kubernetes API
-rw-r--r-- contributors/design-proposals/pod-priority-api.md 243
1 files changed, 243 insertions, 0 deletions
diff --git a/contributors/design-proposals/pod-priority-api.md b/contributors/design-proposals/pod-priority-api.md
new file mode 100644
index 00000000..914d229c
--- /dev/null
+++ b/contributors/design-proposals/pod-priority-api.md
@@ -0,0 +1,243 @@
+# Priority in Kubernetes API
+
+@bsalamat
+
+May 2017
+ * [Objective](#objective)
+ * [Non-Goals](#non-goals)
+ * [Background](#background)
+ * [Overview](#overview)
+ * [Detailed Design](#detailed-design)
+ * [Effect of priority on scheduling](#effect-of-priority-on-scheduling)
+ * [Effect of priority on preemption](#effect-of-priority-on-preemption)
+ * [Priority in PodSpec](#priority-in-podspec)
+ * [Priority Classes](#priority-classes)
+ * [Resolving priority class names](#resolving-priority-class-names)
+ * [Ordering of priorities](#ordering-of-priorities)
+ * [System Priority Class Names](#system-priority-class-names)
+ * [Modifying Priority Classes](#modifying-priority-classes)
+ * [Drawbacks of changing priority classes](#drawbacks-of-changing-priority-classes)
+ * [Priority and QoS classes](#priority-and-qos-classes)
+
+
+## Objective
+
+
+
+* Define how to specify priority for workloads in the Kubernetes API.
+* Define how the order of these priorities is specified.
+* Define how new priority levels are added.
+* Describe the effect of priority on scheduling and preemption.
+
+### Non-Goals
+
+
+
+* How preemption works in Kubernetes.
+* How quota allocation and accounting work for each priority.
+
+## Background
+
+It is fairly common for a cluster to have more tasks than its resources can
+handle. Often the workload is a mix of high priority critical tasks and
+non-urgent tasks that can wait. Cluster management should be able to
+distinguish these workloads in order to decide which ones should acquire
+resources sooner and which ones can wait. The priority of a workload is one of
+the key signals that provides this information to the cluster. This document is a
+more detailed design proposal for part of the high-level architecture described
+in [Resource sharing architecture for batch and serving workloads in Kubernetes](https://docs.google.com/document/d/1-H2hnZap7gQivcSU-9j4ZrJ8wE_WwcfOkTeAGjzUyLA).
+
+## Overview
+
+This design doc introduces the concept of priorities for pods in Kubernetes and
+how the priority impacts scheduling and preemption of pods when the cluster
+runs out of resources. A pod can specify a priority at creation time. The
+priority must be one of the valid values and there is a total order on the
+values. The priority of a pod is independent of its workload type. The priority
+is global and not specific to a particular namespace.
+
+## Detailed Design
+
+### Effect of priority on scheduling
+
+One can generally expect a pod with higher priority to have a higher chance of
+getting scheduled than the same pod with lower priority. However, many other
+parameters affect scheduling decisions, so a high priority pod may or may not
+be scheduled before lower priority pods. The details of what determines the
+order in which pods are scheduled are beyond the scope of this document.
+
+### Effect of priority on preemption
+
+Generally, lower priority pods are more likely to be preempted by higher
+priority pods when the cluster runs out of resources. In such a case, the
+scheduler may decide to preempt lower priority pods to release enough resources
+for higher priority pending pods. As mentioned before, many other parameters
+affect scheduling decisions, such as affinity and anti-affinity. If the
+scheduler determines that a high priority pod cannot be scheduled even if lower
+priority pods are preempted, it will not preempt lower priority pods. The
+scheduler may have other restrictions on preempting pods. For example, it may
+refuse to preempt a pod if doing so would violate a PodDisruptionBudget. The
+details of scheduling and preemption decisions are beyond the scope of this
+document.
+
+### Priority in PodSpec
+
+Pods may have a priority in their pod spec. PodSpec will have two new fields:
+"PriorityClassName", which is specified by the user, and "Priority", which will
+be populated by Kubernetes. The user-specified priority (PriorityClassName) is
+a string, and all valid priority classes are defined by a system-wide mapping
+from each string to an integer. The PriorityClassName specified in a pod spec
+must be found in this map or the pod creation request will be rejected. If
+PriorityClassName is empty, it will resolve to the default priority (see
+[Resolving priority class names](#resolving-priority-class-names) for more
+info on name resolution). Once the PriorityClassName is resolved to an integer,
+it is placed in the "Priority" field of PodSpec.
+
+
+```
+type PodSpec struct {
+    ...
+    // Name of the pod's priority class, specified by the user.
+    PriorityClassName string
+    // Populated by the admission controller. Users are not allowed to set it directly.
+    Priority *int32
+}
+```
+
+### Priority Classes
+
+The cluster may have many user-defined priority classes for various use cases.
+Kubernetes will also have special priority class names reserved for critical
+system pods. Please see [System Priority Class Names](#system-priority-class-names)
+for more information. Any priority value above 1 billion is reserved for system
+use. Aside from those system priority classes, Kubernetes does not ship with
+predefined priority classes usable by user pods. The main goal of having no
+built-in priority classes for user pods is to avoid creating de facto standard
+names which may be hard to change in the future. The following list is an
+example of what the priorities and their values may look like.
+
+```
+system  2147483647 (int_max)
+tier1   4000
+tier2   2000
+tier3   1000
+```
+
+The following shows a list of example workloads in a Kubernetes cluster in decreasing order of priority:
+
+* Kubernetes system daemons (per-node like fluentd, and cluster-level like
+ Heapster)
+* Critical user infrastructure (e.g. storage servers, monitoring systems like
+ Prometheus, etc.)
+* Components that are in the user-facing request serving path and must be able
+ to scale up arbitrarily in response to load spikes (web servers, middleware,
+ etc.)
+* Important interruptible workloads that need a strong guarantee of
+ schedulability and of not being interrupted
+* Less important interruptible workloads that need a less strong guarantee of
+ schedulability and of not being interrupted
+* Best effort / opportunistic
+
+### Resolving priority class names
+
+User requests sent to Kubernetes may have `PriorityClassName` in their PodSpec.
+The admission controller resolves a PriorityClassName to its corresponding
+number and populates the "Priority" field of the pod spec. The rest of the
+Kubernetes components look at the "Priority" field of the pod spec and work
+with the integer value. In other words, `PriorityClassName` will be ignored by
+the rest of the system.
+
+We are going to add a new API object called PriorityClass. A priority class
+defines the mapping between a priority name and its value. It can have an
+optional description, which is an arbitrary string provided only as a
+guideline for users.
+
+A priority class can be marked as "Global Default" by setting its
+`GlobalDefault` field to true. If a pod does not specify any `PriorityClassName`,
+the system resolves it to the value of the global default priority class if
+one exists. If there is no global default, the pod's priority will be resolved
+to zero. The priority admission controller ensures that there is at most one
+global default priority class.
+
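+```
+type PriorityClass struct {
+    metav1.TypeMeta
+    // +optional
+    metav1.ObjectMeta
+
+    // The value of this priority class. This is the actual priority that pods
+    // receive when they have the name of this class in their pod spec.
+    Value int32
+    // Whether this class is the global default, used for pods that do not
+    // specify a PriorityClassName.
+    GlobalDefault bool
+    // An arbitrary string provided only as a guideline for users.
+    Description string
+}
+```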
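The resolution rules described in this section can be sketched in Go as follows. This is a minimal illustration and not the actual admission controller: `resolvePriority` is a hypothetical helper, the trimmed-down `PriorityClass` type uses a plain `Name` field in place of `metav1.ObjectMeta`, and the in-memory slice stands in for however the classes are really stored.

```go
package main

import (
	"errors"
	"fmt"
)

// PriorityClass keeps only the fields needed for resolution. Name stands in
// for the name carried by metav1.ObjectMeta in the proposal.
type PriorityClass struct {
	Name          string
	Value         int32
	GlobalDefault bool
}

var errUnknownClass = errors.New("unknown priority class name")

// resolvePriority is a hypothetical helper that maps a PriorityClassName to
// its integer value, following the rules in the text:
//   - an empty name resolves to the global default class, or to zero if
//     no global default exists;
//   - a name that is not found causes the request to be rejected.
func resolvePriority(className string, classes []PriorityClass) (int32, error) {
	if className == "" {
		for _, pc := range classes {
			if pc.GlobalDefault {
				return pc.Value, nil
			}
		}
		return 0, nil
	}
	for _, pc := range classes {
		if pc.Name == className {
			return pc.Value, nil
		}
	}
	return 0, fmt.Errorf("%w: %q", errUnknownClass, className)
}

func main() {
	classes := []PriorityClass{
		{Name: "tier1", Value: 4000},
		{Name: "tier3", Value: 1000, GlobalDefault: true},
	}
	for _, name := range []string{"tier1", "", "missing"} {
		priority, err := resolvePriority(name, classes)
		fmt.Printf("name=%q priority=%d err=%v\n", name, priority, err)
	}
}
```

The resolved integer is what would be written to the "Priority" field, so components downstream of admission never need the class name.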
+
+### Ordering of priorities
+
+As mentioned earlier, a PriorityClassName is resolved by the admission controller to
+its integral value and Kubernetes components use the integral value. The higher
+the value, the higher the priority.
+
+### System Priority Class Names
+There will be special priority class names reserved for system use only. These
+classes have a value larger than one billion. The priority admission controller
+ensures that new priority classes will not be created with those names. They
+are used for critical system pods that must not be preempted. We set default
+policies that deny creation of pods with PriorityClassNames corresponding to
+these priorities. Cluster admins can authorize users or service accounts to
+create pods with these priorities. When non-authorized users set
+PriorityClassName to one of these priority classes in their pod spec, their
+pod creation request will be rejected. For pods created by controllers, the
+service account must be authorized by cluster admins.
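The default-deny policy for system priority class names might look like the following Go sketch. Everything here is an assumption made for illustration: the proposal does not list the reserved names (the single name "system" is borrowed from the example table), `admitPod` is a hypothetical function, and the `authorized` flag stands in for whatever authorization mechanism cluster admins actually configure.

```go
package main

import "fmt"

// systemClassNames is a hypothetical set of reserved names. The proposal
// does not enumerate them; "system" is taken from its example list.
var systemClassNames = map[string]bool{"system": true}

// admitPod rejects a pod that names a system priority class unless the
// requesting user or service account has been authorized by a cluster admin.
// How authorization is decided is outside the scope of this sketch.
func admitPod(priorityClassName string, authorized bool) error {
	if systemClassNames[priorityClassName] && !authorized {
		return fmt.Errorf("priority class %q is reserved for system use", priorityClassName)
	}
	return nil
}

func main() {
	fmt.Println(admitPod("system", false)) // denied by default
	fmt.Println(admitPod("system", true))  // allowed once authorized
	fmt.Println(admitPod("tier1", false))  // user classes are unaffected
}
```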
+
+### Modifying priority classes
+
+Priority classes can be added or removed, but their name and value cannot be
+updated. We allow updating `GlobalDefault` and `Description` as long as there is
+a maximum of one global default. While Kubernetes can work fine if priority
+classes are changed at run-time, the change can be confusing to users, as pods
+created with a priority class before the change will have a different priority
+value than those created after the change. Deletion of priority classes is
+allowed, even though there may be existing pods that have specified such
+priority class names in their pod spec. In other words, there will be no
+referential integrity for priority classes. This is another reason that all
+system components should only work with the integer value of the priority and
+not with the `PriorityClassName`.
+
+One could delete an existing priority class and create another one with the same
+name and a different value. By doing so, they can achieve the same effect as
+updating a priority class, but we still do not allow updating priority classes
+to prevent accidental changes.
+
+Newly added priority classes cannot have a value in the range reserved for
+"system" priorities. The reason for this restriction is that Kubernetes
+critical system pods will have one of the "system" priorities and no pod
+should be able to preempt them.
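The create and update rules from this section can be collected into two small validation functions. This is a sketch under stated assumptions rather than the proposed implementation: `validateCreate`, `validateUpdate`, and `systemPriorityThreshold` are hypothetical names, the one billion bound comes from the statement that values above 1 billion are reserved for system use, and `Name` again replaces `metav1.ObjectMeta`.

```go
package main

import (
	"errors"
	"fmt"
)

// systemPriorityThreshold is a hypothetical constant: the proposal says any
// value above one billion is reserved for system use.
const systemPriorityThreshold int32 = 1000000000

// PriorityClass is a trimmed-down version of the proposed API object.
type PriorityClass struct {
	Name          string
	Value         int32
	GlobalDefault bool
	Description   string
}

// hasGlobalDefault reports the name of an existing global default, if any.
func hasGlobalDefault(classes []PriorityClass) (string, bool) {
	for _, pc := range classes {
		if pc.GlobalDefault {
			return pc.Name, true
		}
	}
	return "", false
}

// validateCreate checks a user-created class against the existing ones.
func validateCreate(newPC PriorityClass, existing []PriorityClass) error {
	if newPC.Value > systemPriorityThreshold {
		return errors.New("value is in the range reserved for system use")
	}
	if name, ok := hasGlobalDefault(existing); ok && newPC.GlobalDefault {
		return fmt.Errorf("class %q is already the global default", name)
	}
	return nil
}

// validateUpdate allows changes to GlobalDefault and Description only.
// others holds every existing class except the one being updated.
func validateUpdate(oldPC, newPC PriorityClass, others []PriorityClass) error {
	if oldPC.Name != newPC.Name || oldPC.Value != newPC.Value {
		return errors.New("name and value cannot be updated")
	}
	if name, ok := hasGlobalDefault(others); ok && newPC.GlobalDefault {
		return fmt.Errorf("class %q is already the global default", name)
	}
	return nil
}

func main() {
	existing := []PriorityClass{{Name: "tier3", Value: 1000, GlobalDefault: true}}
	fmt.Println(validateCreate(PriorityClass{Name: "tier1", Value: 4000}, existing))
	fmt.Println(validateCreate(PriorityClass{Name: "huge", Value: 2000000000}, existing))

	old := PriorityClass{Name: "tier1", Value: 4000}
	fmt.Println(validateUpdate(old, PriorityClass{Name: "tier1", Value: 5000}, existing))
}
```

Because deletion followed by re-creation is still permitted, these checks only guard against accidental changes, as the text notes.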
+
+#### Drawbacks of changing priority classes
+
+While Kubernetes effectively allows changing priority classes (by deleting and
+adding them with a different value), it should be done only when
+absolutely needed. Changing priority classes has the following disadvantages:
+
+
+* May remove config portability: pod specs written for one cluster are no
+ longer guaranteed to work on a different cluster if the same priority classes
+ do not exist in the second cluster.
+* If quota is specified for existing priority classes (at the time of this writing,
+ we don't have this feature in Kubernetes), adding or deleting priority classes
+ will require reconfiguration of quota allocations.
+* Existing pods may have an integer priority value that does not reflect
+ the current value of their PriorityClass.
+
+### Priority and QoS classes
+
+Kubernetes has [three QoS
+classes](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-qos.md#qos-classes)
+which are derived from the requests and limits of pods. Priority is introduced
+as an independent concept, meaning that any QoS class may have any valid
+priority. When a node is out of resources and pods need to be preempted, we
+give priority a higher weight than QoS classes. In other words, we preempt the
+lowest priority pod and break ties with other metrics, such as QoS class and
+usage above request. This is not finalized yet. We will discuss and finalize
+preemption in a separate doc.
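Since the preemption policy is explicitly not finalized, the following Go sketch shows only one possible reading of "lowest priority first, ties broken by QoS class". The `Pod` type, the `preemptionOrder` function, and the choice to preempt BestEffort before Burstable before Guaranteed are all assumptions made for illustration; usage above request is left out entirely.

```go
package main

import (
	"fmt"
	"sort"
)

// QoSClass is ordered so that a smaller value is preempted first. This
// tie-break direction is an assumption, not something the proposal fixes.
type QoSClass int

const (
	BestEffort QoSClass = iota
	Burstable
	Guaranteed
)

// Pod is a hypothetical, minimal view of a pod for this sketch.
type Pod struct {
	Name     string
	Priority int32
	QoS      QoSClass
}

// preemptionOrder returns a copy of pods sorted from most to least likely
// to be preempted: ascending priority, then ascending QoS class.
func preemptionOrder(pods []Pod) []Pod {
	ordered := append([]Pod(nil), pods...)
	sort.SliceStable(ordered, func(i, j int) bool {
		if ordered[i].Priority != ordered[j].Priority {
			return ordered[i].Priority < ordered[j].Priority
		}
		return ordered[i].QoS < ordered[j].QoS
	})
	return ordered
}

func main() {
	pods := []Pod{
		{Name: "web", Priority: 2000, QoS: BestEffort},
		{Name: "batch-a", Priority: 1000, QoS: Guaranteed},
		{Name: "batch-b", Priority: 1000, QoS: BestEffort},
	}
	for _, p := range preemptionOrder(pods) {
		fmt.Println(p.Name, p.Priority, p.QoS)
	}
}
```

Priority dominates in this ordering: a Guaranteed pod at priority 1000 is still considered before a BestEffort pod at priority 2000.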