## Introduction

This document proposes a new label selector representation, called
`NodeSelector`, that is similar in many ways to `LabelSelector` but is a bit
more flexible and is intended to be used only for selecting nodes.

In addition, we propose to replace the `map[string]string` in `PodSpec` that the
scheduler currently uses as part of restricting the set of nodes onto which a
pod is eligible to schedule, with a field of type `Affinity` that contains one
or more affinity specifications. In this document we discuss `NodeAffinity`,
which contains one or more of the following:
* a field called `RequiredDuringSchedulingRequiredDuringExecution` that will be
represented by a `NodeSelector`, and thus generalizes the scheduling behavior of
the current `map[string]string` but still serves the purpose of restricting
the set of nodes onto which the pod can schedule. In addition, unlike the
behavior of the current `map[string]string`, when it becomes violated the system
will try to eventually evict the pod from its node.
* a field called `RequiredDuringSchedulingIgnoredDuringExecution`, which is
identical to `RequiredDuringSchedulingRequiredDuringExecution` except that the
system may or may not try to eventually evict the pod from its node.
* a field called `PreferredDuringSchedulingIgnoredDuringExecution` that
specifies which nodes are preferred for scheduling among those that meet all
scheduling requirements.

(In practice, as discussed later, we will actually *add* the `Affinity` field
rather than replacing `map[string]string`, due to backward compatibility
requirements.)

The affinity specifications described above allow a pod to request various
properties that are inherent to nodes, for example "run this pod on a node with
an Intel CPU" or, in a multi-zone cluster, "run this pod on a node in zone Z."
([This issue](https://github.com/kubernetes/kubernetes/issues/9044) describes
some of the properties that a node might publish as labels, which affinity
expressions can match against.) They do *not* allow a pod to request to schedule
(or not schedule) on a node based on what other pods are running on the node.
That feature is called "inter-pod topological affinity/anti-affinity" and is
described [here](https://github.com/kubernetes/kubernetes/pull/18265).
## API

```go
type PreferredSchedulingTerm struct {
	// Weight associated with matching the corresponding NodeSelectorTerm.
	Weight int `json:"weight"`
	// A node selector term, associated with the corresponding weight.
	Preference NodeSelectorTerm `json:"preference"`
}
```

Unfortunately, the name of the existing `map[string]string` field in `PodSpec`
is `NodeSelector` and we can't change it, since this name is part of the API.
Hopefully this won't cause too much confusion.
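
To keep the two straight, here is a sketch (illustrative only; surrounding
fields are elided, and the exact field and comment details are assumptions, not
quoted from the API) of how they would coexist in `PodSpec`:

```go
type PodSpec struct {
	// ...
	// Existing field: a simple map of labels the node must have. Its name
	// is frozen as part of the API.
	NodeSelector map[string]string `json:"nodeSelector,omitempty"`
	// Proposed field: its NodeAffinity member uses the new NodeSelector
	// *type* (a more expressive selector) rather than the map above.
	Affinity *Affinity `json:"affinity,omitempty"`
	// ...
}
```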
## Examples
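
As one concrete illustration, here is a sketch of a `NodeAffinity` for a pod
that must run in zone `Z` and, among feasible nodes, prefers one with an Intel
CPU. It uses the types proposed in the API section; the label keys (`zone`,
`cpu-vendor`) are hypothetical examples of labels a node might publish (see
issue #9044), and `NodeSelectorOpIn` is assumed to be the constant for the `In`
operator:

```go
affinity := Affinity{
	NodeAffinity: &NodeAffinity{
		// Hard requirement: schedule (and keep running) only in zone Z.
		RequiredDuringSchedulingRequiredDuringExecution: &NodeSelector{
			// Terms are ORed; requirements within a term are ANDed.
			NodeSelectorTerms: []NodeSelectorTerm{
				{MatchExpressions: []NodeSelectorRequirement{
					{Key: "zone", Operator: NodeSelectorOpIn, Values: []string{"Z"}},
				}},
			},
		},
		// Soft preference: among nodes meeting all requirements, prefer
		// nodes with an Intel CPU.
		PreferredDuringSchedulingIgnoredDuringExecution: []PreferredSchedulingTerm{
			{
				Weight: 1,
				Preference: NodeSelectorTerm{
					MatchExpressions: []NodeSelectorRequirement{
						{Key: "cpu-vendor", Operator: NodeSelectorOpIn, Values: []string{"intel"}},
					},
				},
			},
		},
	},
}
```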
## Backward compatibility

When we add `Affinity` to `PodSpec`, we will deprecate, but not remove, the
current field in `PodSpec`:
```go
NodeSelector map[string]string `json:"nodeSelector,omitempty"`
```

Old versions of the scheduler will ignore the `Affinity` field. New versions of
the scheduler will apply their scheduling predicates to both `Affinity` and
`nodeSelector`, i.e. the pod can only schedule onto nodes that satisfy both sets
of requirements. We will not attempt to convert between `Affinity` and
`nodeSelector`.
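
A sketch of the combined check follows (not the actual scheduler code;
`nodeSelectorMatches` is an assumed helper implementing the `NodeSelector`
semantics from the API section):

```go
// podFitsNode reports whether a node is feasible for a pod: the node's labels
// must satisfy the legacy nodeSelector map AND the new affinity requirements.
func podFitsNode(nodeLabels, legacySelector map[string]string, required *NodeSelector) bool {
	for key, value := range legacySelector {
		if nodeLabels[key] != value {
			return false // a legacy nodeSelector entry is violated
		}
	}
	// An absent NodeSelector imposes no additional restriction.
	return required == nil || nodeSelectorMatches(required, nodeLabels)
}
```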

Old versions of non-scheduling clients will not know how to do anything
semantically meaningful with `Affinity`, but we don't expect that this will
cause a problem.
See [this comment](https://github.com/kubernetes/kubernetes/issues/341#issuecomment-140809259)
for more discussion.

Users should not start using `NodeAffinity` until the full implementation has
been in Kubelet and the master for enough binary versions that we feel
comfortable that we will not need to roll back either Kubelet or master to a
version that does not support them. Longer-term we will use a programmatic
approach to enforcing this (#4855).
## Implementation plan

1. Add the `Affinity` field to `PodSpec` and the `NodeAffinity`,
`PreferredDuringSchedulingIgnoredDuringExecution`, and
`RequiredDuringSchedulingIgnoredDuringExecution` types to the API.
2. Implement a scheduler predicate that takes
`RequiredDuringSchedulingIgnoredDuringExecution` into account.
3. Implement a scheduler priority function that takes
`PreferredDuringSchedulingIgnoredDuringExecution` into account.
4. At this point, the feature can be deployed and `PodSpec.NodeSelector` can be
marked as deprecated.
5. Add the `RequiredDuringSchedulingRequiredDuringExecution` field to the API.
6. Modify the scheduler predicate from step 2 to also take
`RequiredDuringSchedulingRequiredDuringExecution` into account.
7. Add `RequiredDuringSchedulingRequiredDuringExecution` to Kubelet's admission
decision.
8. Implement code in Kubelet *or* the controllers that evicts a pod that no
longer satisfies `RequiredDuringSchedulingRequiredDuringExecution` (see
[this comment](https://github.com/kubernetes/kubernetes/issues/12744#issuecomment-164372008)
and the sketch after this list).
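
As referenced in step 8, here is a sketch of the eviction check that Kubelet or
a controller might run periodically (`Pod`, `nodeSelectorMatches`, and
`evictPod` are assumed stand-ins, not existing Kubernetes functions):

```go
// maybeEvict evicts a pod whose RequiredDuringSchedulingRequiredDuringExecution
// rule is no longer satisfied by its node's current labels.
func maybeEvict(pod *Pod, nodeLabels map[string]string) {
	aff := pod.Spec.Affinity
	if aff == nil || aff.NodeAffinity == nil {
		return
	}
	required := aff.NodeAffinity.RequiredDuringSchedulingRequiredDuringExecution
	if required != nil && !nodeSelectorMatches(required, nodeLabels) {
		// The requirement has become violated; the system tries to
		// eventually evict the pod from its node.
		evictPod(pod)
	}
}
```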

We assume Kubelet publishes labels describing the node's membership in all of
the relevant scheduling domains (e.g. node name, rack name, availability zone
name, etc.). See #9044.
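
For illustration, such labels might look like the following (the keys, other
than perhaps `kubernetes.io/hostname`, are hypothetical):

```go
// Hypothetical labels a Kubelet might publish for its node; affinity
// expressions match against keys and values like these.
nodeLabels := map[string]string{
	"kubernetes.io/hostname": "node-17",       // node name
	"rack":                   "rack-3",        // rack name
	"zone":                   "us-central1-a", // availability zone name
}
```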
## Extensibility

The design described here is the result of careful analysis of use cases, a
decade of experience with Borg at Google, and a review of similar features in
other open-source container orchestration systems. We believe that it properly
balances the goal of expressiveness against the goals of simplicity and
efficiency of implementation. However, we recognize that use cases may arise in
the future that cannot be expressed using the syntax described here. Although we
are not implementing an affinity-specific extensibility mechanism for a variety
of reasons (simplicity of the codebase, simplicity of cluster deployment, desire
for Kubernetes users to get a consistent experience, etc.), the regular
Kubernetes annotation mechanism can be used to add or replace affinity rules.
The way this would work is:
1. Define one or more annotations to describe the new affinity rule(s).
1. User (or an admission controller) attaches the annotation(s) to pods to
request the desired scheduling behavior. If the new rule(s) *replace* one or
more fields of `Affinity` then the user would omit those fields from `Affinity`;
if they are *additional rules*, then the user would fill in `Affinity` as well
as the annotation(s).
1. Scheduler takes the annotation(s) into account when scheduling (see the
sketch below).
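
For example, step 2 might look like the following sketch, in which everything
(the annotation key, the rule format, and the helper) is hypothetical:

```go
// A hypothetical annotation key under which a custom affinity rule is
// carried; a modified scheduler would parse and honor it in step 3.
const customAffinityKey = "example.com/scheduler-affinity"

// attachCustomRule records a JSON-encoded custom rule on a pod's annotations,
// e.g. attachCustomRule(pod.Annotations, `{"spreadBy": "rack"}`).
func attachCustomRule(annotations map[string]string, ruleJSON string) {
	annotations[customAffinityKey] = ruleJSON
}
```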

If some particular new syntax becomes popular, we would consider upstreaming it
by integrating it into the standard `Affinity`.
## Future work

Are there any other fields we should convert from `map[string]string` to
`NodeSelector`?
## Related issues
The review for this proposal is in #18261.

The main related issue is #341. Issue #367 is also related. Those issues
reference other related issues.