author Kubernetes Prow Robot <k8s-ci-robot@users.noreply.github.com> 2019-03-01 16:49:26 -0800
committer GitHub <noreply@github.com> 2019-03-01 16:49:26 -0800
commit b9fd7ff50a3f767133943bc8e3993447ecbdba07
tree d4067de6ada74d08b1eaf2518e3c1e482de314f6
parent 453928f49cc7430eac177199096103474128f5dd
parent 662c66103c0369d576f76b2858215902fe24c1f1
Merge pull request #3338 from misterikkit/framework2
Remove old copy of scheduling framework design.
-rw-r--r-- contributors/design-proposals/scheduling/images/scheduling-framework-extensions.png bin 50818 -> 0 bytes
-rw-r--r-- contributors/design-proposals/scheduling/images/scheduling-framework-threads.png bin 44305 -> 0 bytes
-rw-r--r-- contributors/design-proposals/scheduling/scheduling-framework.md 431
3 files changed, 2 insertions, 429 deletions
diff --git a/contributors/design-proposals/scheduling/images/scheduling-framework-extensions.png b/contributors/design-proposals/scheduling/images/scheduling-framework-extensions.png
deleted file mode 100644
index 25f50471..00000000
--- a/contributors/design-proposals/scheduling/images/scheduling-framework-extensions.png
+++ /dev/null
Binary files differ
diff --git a/contributors/design-proposals/scheduling/images/scheduling-framework-threads.png b/contributors/design-proposals/scheduling/images/scheduling-framework-threads.png
deleted file mode 100644
index ae9e1965..00000000
--- a/contributors/design-proposals/scheduling/images/scheduling-framework-threads.png
+++ /dev/null
Binary files differ
diff --git a/contributors/design-proposals/scheduling/scheduling-framework.md b/contributors/design-proposals/scheduling/scheduling-framework.md
index 39ff7db2..1de43aab 100644
--- a/contributors/design-proposals/scheduling/scheduling-framework.md
+++ b/contributors/design-proposals/scheduling/scheduling-framework.md
@@ -1,436 +1,9 @@
Status: Draft
-Created: 2018-04-09 / Last updated: 2018-08-15
+Created: 2018-04-09 / Last updated: 2019-03-01
Author: bsalamat
Contributors: misterikkit
---
-#
-- [SUMMARY ](#summary-)
-- [OBJECTIVE](#objective)
- - [Terminology](#terminology)
-- [BACKGROUND](#background)
-- [OVERVIEW](#overview)
- - [Non-goals](#non-goals)
-- [DETAILED DESIGN](#detailed-design)
- - [Bare bones of scheduling](#bare-bones-of-scheduling)
- - [Communication and statefulness of plugins](#communication-and-statefulness-of-plugins)
- - [Plugin registration](#plugin-registration)
- - [Extension points](#extension-points)
- - [Scheduling queue sort](#scheduling-queue-sort)
- - [Pre-filter](#pre-filter)
- - [Filter](#filter)
- - [Post-filter](#post-filter)
- - [Scoring](#scoring)
- - [Post-scoring/pre-reservation](#post-scoringpre-reservation)
- - [Reserve](#reserve)
- - [Permit](#permit)
- - [Approving a Pod binding](#approving-a-pod-binding)
- - [Reject](#reject)
- - [Pre-Bind](#pre-bind)
- - [Bind](#bind)
- - [Post Bind](#post-bind)
-- [USE-CASES](#use-cases)
- - [Dynamic binding of cluster-level resources](#dynamic-binding-of-cluster-level-resources)
- - [Gang Scheduling](#gang-scheduling)
-- [OUT OF PROCESS PLUGINS](#out-of-process-plugins)
-- [CONFIGURING THE SCHEDULING FRAMEWORK](#configuring-the-scheduling-framework)
-- [BACKWARD COMPATIBILITY WITH SCHEDULER v1](#backward-compatibility-with-scheduler-v1)
-- [DEVELOPMENT PLAN](#development-plan)
-- [TESTING PLAN](#testing-plan)
-- [WORK ESTIMATES ](#work-estimates)
-
-# SUMMARY
-
-This document describes the Kubernetes Scheduling Framework. The scheduling
-framework implements only basic functionality, but exposes many extension points
-for plugins to expand its functionality. The plan is that this framework (with
-its plugins) will eventually replace the current Kubernetes scheduler.
-
-# OBJECTIVE
-
-- make scheduler more extendable.
-- Make scheduler core simpler by moving some of its features to plugins.
-- Propose extension points in the framework.
-- Propose a mechanism to receive plugin results and continue or abort based
- on the received results.
-- Propose a mechanism to handle errors and communicate it with plugins.
-
-## Terminology
-
-Scheduler v1, current scheduler: refer to existing scheduler of Kubernetes.
-Scheduler v2, scheduling framework: refer to the new scheduler proposed in this
-doc.
-
-# BACKGROUND
-
-Many features are being added to the Kubernetes default scheduler. They keep
-making the code larger and logic more complex. A more complex scheduler is
-harder to maintain, its bugs are harder to find and fix, and those users running
-a custom scheduler have a hard time catching up and integrating new changes.
-The current Kubernetes scheduler provides
-[webhooks to extend](./scheduler_extender.md)
-its functionality. However, these are limited in a few ways:
-
-1. The number of extension points are limited: "Filter" extenders are called
- after default predicate functions. "Prioritize" extenders are called after
- default priority functions. "Preempt" extenders are called after running
- default preemption mechanism. "Bind" verb of the extenders are used to bind
- a Pod. Only one of the extenders can be a binding extender, and that
- extender performs binding instead of the scheduler. Extenders cannot be
- invoked at other points, for example, they cannot be called before running
- predicate functions.
-1. Every call to the extenders involves marshaling and unmarshalling JSON.
- Calling a webhook (HTTP request) is also slower than calling native functions.
-1. It is hard to inform an extender that scheduler has aborted scheduling of
- a Pod. For example, if an extender provisions a cluster resource and
- scheduler contacts the extender and asks it to provision an instance of the
- resource for the Pod being scheduled and then scheduler faces errors
- scheduling the Pod and decides to abort the scheduling, it will be hard to
- communicate the error with the extender and ask it to undo the provisioning
- of the resource.
-1. Since current extenders run as a separate process, they cannot use
- scheduler's cache. They must either build their own cache from the API
- server or process only the information they receive from the default scheduler.
-
-The above limitations hinder building high performance and versatile scheduler
-extensions. We would ideally like to have an extension mechanism that is fast
-enough to allow keeping a bare minimum logic in the scheduler core and convert
-many of the existing features of default scheduler, such as predicate and
-priority functions and preemption into plugins. Such plugins will be compiled
-with the scheduler. We would also like to provide an extension mechanism that do
-not need recompilation of scheduler. The expected performance of such plugins is
-lower than in-process plugins. Such out-of-process plugins should be used in
-cases where quick invocation of the plugin is not a constraint.
-
-# OVERVIEW
-
-Scheduler v2 allows both built-in and out-of-process extenders. This new
-architecture is a scheduling framework that exposes several extension points
-during a scheduling cycle. Scheduler plugins can register to run at one or more
-extension points.
-
-#### Non-goals
-
-- We will keep Kubernetes API backward compatibility, but keeping scheduler
- v1 backward compatibility is a non-goal. Particularly, scheduling policy
- config and v1 extenders won't work in this new framework.
-- Solve all the scheduler v1 limitations, although we would like to ensure
- that the new framework allows us to address known limitations in the future.
-- Provide implementation details of plugins and call-back functions, such as
- all of their arguments and return values.
-
-# DETAILED DESIGN
-
-## Bare bones of scheduling
-
-Pods that are not assigned to any node go to a scheduling queue and sorted by
-order specified by plugins (described [here](#scheduling-queue-sort)). The
-scheduling framework picks the head of the queue and starts a **scheduling
-cycle** to schedule the pod. At the end of the cycle scheduler determines
-whether the pod is schedulable or not. If the pod is not schedulable, its status
-is updated and goes back to the scheduling queue. If the pod is schedulable (one
-or more nodes are found that can run the Pod), the scoring process is started.
-The scoring process finds the best node to run the Pod. Once the best node is
-picked, the scheduler updates its cache and then a bind go routine is started to
-bind the pod.
-The above process is the same as what Kubernetes scheduler v1 does. Some of the
-essential features of scheduler v1, such as leader election, will also be
-transferred to the scheduling framework.
-In the rest of this section we describe how various plugins are used to enrich
-this basic workflow. This document focuses on in-process plugins.
-Out-of-process plugins are discussed later in a separate doc.
-
-## Communication and statefulness of plugins
-
-The scheduling framework provides a library that plugins can use to pass
-information to other plugins. This library keeps a map from keys of type string
-to opaque pointers of type interface{}. A write operation takes a key and a
-pointer and stores the opaque pointer in the map with the given key. Other
-plugins can provide the key and receive the opaque pointer. Multiple plugins can
-share the state or communicate via this mechanism.
-The saved state is preserved only during a single scheduling cycle. At the end
-of a scheduling cycle, this map is destructed. So, plugins cannot keep shared
-state across multiple scheduling cycle. They can, however, update the scheduler
-cache via the provided interface of the cache. The cache interface allows
-limited state preservation across multiple scheduling cycle.
-It is worth noting that plugins are assumed to be **trusted**. Scheduler does
-not prevent one plugin from accessing or modifying another plugin's state.
-
-## Plugin registration
-
-Plugin registration is done by providing an extension point and a function that
-should be called at that extension point. This step will be something like:
-
-```go
-register("pre-filter", plugin.foo)
-```
-
-The details of the function signature will be provided later.
-
-## Extension points
-
-The following picture shows the scheduling cycle of a Pod and the extension
-points that the scheduling framework exposes. In this picture "Filter" is
-equivalent to "Predicate" in scheduler v1 and "Scoring" is equivalent to
-"Priority function". Plugins are go functions. They are registered to be called
-at one of these extension points. They are called by the framework in the same
-order they are registered for each extension point.
-In the following sections we describe each extension point in the same order
-they are called in a schedule cycle.
-
-![image](images/scheduling-framework-extensions.png)
-
-### Scheduling queue sort
-
-These plugins indicate how Pods should be sorted in the scheduling queue. A
-plugin registered at this point only returns greater, smaller, or equal to
-indicate an ordering between two Pods. In other words, a plugin at this
-extension point returns the answer to "less(pod1, pod2)". Multiple plugins may
-be registered at this point. Plugins registered at this point are called in
-order and the invocation continues as long as plugins return "equal". Once a
-plugin returns "greater" or "smaller" the invocation of these plugins are
-stopped.
-
-### Pre-filter
-
-These plugins are generally useful to check certain conditions that the cluster
-or the Pod must meet. These are also useful to perform pre-processing on the pod
-and store some information about the pod that can be used by other plugins.
-The pod pointer is passed as an argument to these plugins. If any of these
-plugins return an error, the scheduling cycle is aborted.
-These plugins are called serially in the same order registered.
-
-### Filter
-
-Filter plugins filter out nodes that cannot run the Pod. Scheduler runs these
-plugins per node in the same order that they are registered, but scheduler may
-run these filter function for multiple nodes in parallel. So, these plugins must
-use synchronization when they modify state.
-Scheduler stops running the remaining filter functions for a node once one of
-these filters fails for the node.
-
-### Post-filter
-
-The Pod and the set of nodes that can run the Pod are passed to these plugins.
-They are called whether Pod is schedulable or not (whether the set of nodes is
-empty or non-empty).
-If any of these plugins return an error or if the Pod is determined
-unschedulable, the scheduling cycle is aborted.
-These plugins are called serially.
-
-### Scoring
-
-These plugins are similar to priority function in scheduler v1. They are
-utilized to rank nodes that have passed the filtering stage. Similar to Filter
-plugins, these are called per node serially in the same order registered, but
-scheduler may run them for multiple nodes in parallel.
-Each one of these functions return a score for the given node. The score is
-multiplied by the weight of the function and aggregated with the result of other
-scoring functions to yield a total score for the node.
-These functions can never block scheduling. In case of an error they should
-return zero for the Node being ranked.
-
-### Post-scoring/pre-reservation
-
-After all scoring plugins are invoked and the score of nodes are determined, the
-framework picks the best node with the highest score and then it calls
-post-scoring plugins. The Pod and the chosen Node are passed to these plugins.
-These plugins have one more chance to check any conditions about the assignment
-of the Pod to this Node and reject the node if needed.
-
-![image](images/scheduling-framework-threads.png)
-
-### Reserve
-
-At this point scheduler updates its cache by "reserving" a Node (partially or
-fully) for the Pod. In scheduler v1 this stage is called "assume".
-At this point, only the scheduler cache is updated to
-reflect that the Node is (partially) reserved for the Pod. The scheduling
-framework calls plugins registered at this extension points so that they get a
-chance to perform cache updates or other accounting activities. These plugins
-do not return any value (except errors).
-
-The actual assignment of the Node to the Pod happens during the "Bind" phase.
-That is when the API server updates the Pod object with the Node information.
-
-### Permit
-
-Permit plugins run in a separate go routine (in parallel). Each plugin can return
-one of the three possible values: 1) "permit", 2) "deny", or 3) "wait". If all
-plugins registered at this extension point return "permit", the pod is sent to
-the next step for binding. If any of the plugins returns "deny", the pod is
-rejected and sent back to the scheduling queue. If any of the plugins returns
-"wait", the Pod is kept in reserved state until it is explicitly approved for
-binding. A plugin that returns "wait" must return a "timeout" as well. If the
-timeout expires, the pod is rejected and goes back to the scheduling queue.
-
-#### Approving a Pod binding
-
-While any plugin can receive the list of reserved Pod from the cache and approve
-them, we expect only the "Permit" plugins to approve binding of reserved Pods
-that are in "waiting" state. Once a Pod is approved, it is sent to the Bind
-stage.
-
-### Reject
-
-Plugins called at "Permit" may perform some operations that should be undone if
-the Pod reservation fails. The "Reject" extension point allows such clean-up
-operations to happen. Plugins registered at this point are called if the
-reservation of the Pod is cancelled. The reservation is cancelled if any of the
-"Permit" plugins returns "reject" or if a Pod reservation, which is in "wait"
-state, times out.
-
-### Pre-Bind
-
-When a Pod is approved for binding it reaches to this stage. These plugins run
-before the actual binding of the Pod to a Node happens. The binding starts only
-if all of these plugins return true. If any returns false, the Pod is rejected
-and sent back to the scheduling queue. These plugins run in a separate go
-routine. The same go routine runs "Bind" after these plugins when all of them
-return true.
-
-### Bind
-
-Once all pre-bind plugins return true, the Bind plugins are executed. Multiple
-plugins may be registered at this extension point. Each plugin may return true
-or false (or an error). If a plugin returns false, the next plugin will be
-called until a plugin returns true. Once a true is returned **the remaining
-plugins are skipped**. If any of the plugins returns an error or all of them
-return false, the Pod is rejected and sent back to the scheduling queue.
-
-### Post Bind
-
-The Post Bind plugins can be useful for housekeeping after a pod is scheduled.
-These plugins do not return any value and are not expected to influence the
-scheduling decision made in the scheduling cycle.
-
-### Informer Events
-
-The scheduling framework, similar to Scheduler v1, will have informers that let
-the framework keep its copy of the state of the cluster up-to-date. The
-informers generate events, such as "PodAdd", "PodUpdate", "PodDelete", etc. The
-framework allows plugins to register their own handlers for any of these events.
-The handlers allow plugins with internal state or caches to keep their state
-updated.
-
-# USE-CASES
-
-In this section we provide a couple of examples on how the scheduling framework
-can be used to solve common scheduling scenarios.
-
-### Dynamic binding of cluster-level resources
-
-Cluster level resources are resources which are not immediately available on
-nodes at the time of scheduling Pods. Scheduler needs to ensure that such
-cluster level resources are bound to a chosen Node before it can schedule a Pod
-that requires such resources to the Node. We refer to this type of binding of
-resources to Nodes at the time of scheduling Pods as dynamic resource binding.
-Dynamic resource binding has proven to be a challenge in Scheduler v1, because
-Scheduler v1 is not flexible enough to support various types of plugins at
-different phases of scheduling. As a result, binding of storage volumes is
-integrated in the scheduler code and some non-trivial changes are done to the
-scheduler extender to support dynamic binding of network GPUs.
-The scheduling framework allows such dynamic bindings in a cleaner way. The main
-thread of scheduling framework process a pending Pod that requests a network
-resource and finds a node for the Pod and reserves the Pod. A dynamic resource
-binder plugin installed at "Pre-Bind" stage is invoked (in a separate thread).
-It analyzes the Pod and when detects that the Pod needs dynamic binding of the
-resource, the plugin tries to attach the cluster resource to the chosen node and
-then returns true so that the Pod can be bound. If the resource attachment
-fails, it returns false and the Pod will be retried.
-When there are multiple of such network resources, each one of them installs one
-"pre-bind" plugin. Each plugin looks at the Pod and if the Pod is not requesting
-the resource that they are interested in, they simply return "true" for the
-pod.
-
-### Gang Scheduling
-
-Gang scheduling allows a certain number of Pods to be scheduled simultaneously.
-If all the members of the gang cannot be scheduled at the same time, none of
-them should be scheduled. Gang scheduling may have various other features as
-well, but in this context we are interested in simultaneous scheduling of Pods.
-Gang scheduling in the scheduling framework can be done with an "Permit" plugin.
-The main scheduling thread processes pods one by one and reserves nodes for
-them. The gang scheduling plugin at the Permit stage is invoked for each pod.
-When it finds that the pod belongs to a gang, it checks the properties of the
-gang. If there are not enough members of the gang which are scheduled or in
-"wait" state, the plugin returns "wait". When the number reaches the desired
-value, all the Pods in wait state are approved and sent for binding.
-
-# OUT OF PROCESS PLUGINS
-
-Out of process plugins (OOPP) are called via JSON over an HTTP interface. In
-other words, the scheduler will support webhooks at most (maybe all) of the
-extension points. Data sent to an OOPP must be marshalled to JSON and data
-received must be unmarshalled. So, calling an OOPP is significantly slower than
-in-process plugins.
-We do not plan to build OOPPs in the first version of the scheduling framework.
-So, more details on them is to be determined.
-
-
-# DEVELOPMENT PLAN
-
-Earlier, we wanted to develop the scheduling framework as an independent project
-from scheduler V1. However, that would need much engineering resources.
-It would also be more difficult to roll out a new and not fully-backward
-compatible scheduler in Kubernetes where tens of thousands of users depend on
-the behavior of the scheduler.
-After revisiting the ideas and challenges, we changed our plan and have decided
-to build some of the ideas of the scheduling framework into Scheduler V1 to make
-it more extendable.
-
-As the first step, we would like to build:
- 1. [Pre-bind](#pre-bind) and [Reserve](#reserve) plugin points. These will
- help us move our existing cluster resource binding code, such as persistent
- volume binding, to plugins.
- 1. We will also build
- [the plugin communication mechanism](#communication-and-statefulness-of-plugins).
- This will allow us to build more sophisticated plugins that would require
- communication and also help us clean up existing scheduler's code by removing
- existing transient cache data.
-
-More features of the framework can be added to the Scheduler in the future based
-on the requirements.
-
-<s>
-# CONFIGURING THE SCHEDULING FRAMEWORK
-
-TBD
-
-# BACKWARD COMPATIBILITY WITH SCHEDULER v1
-
-We will build a new set of plugins for scheduler v2 to ensure that the existing
-behavior of scheduler v1 in placing Pods on nodes is preserved. This includes
-building plugins that replicate default predicate and priority functions of
-scheduler v1 and its binding mechanism, but scheduler extenders built for
-scheduler v1 won't be compatible with scheduler v2. Also, predicate and priority
-functions which are not enabled by default (such as service affinity) are not
-guaranteed to exist in scheduler v2.
-
-# DEVELOPMENT PLAN
-
-We will develop the scheduling framework as an incubator project in SIG
-scheduling. It will be built in a separate code-base independently from
-scheduler v1, but we will probably use a lot of code from scheduler v1.
-
-# TESTING PLAN
-
-We will add unit-tests as we build functionalities of the scheduling framework.
-The scheduling framework should eventually be able to pass integration and e2e
-tests of scheduler v1, excluding those tests that involve scheduler extensions.
-The e2e and integration tests may need to be modified slightly as the
-initialization and configuration of the scheduling framework will be different
-than scheduler v1.
-
-# WORK ESTIMATES
-
-We expect to see an early version of the scheduling framework in two release
-cycles (end of 2018). If things go well, we will start offering it as an
-alternative to the scheduler v1 by the end of Q1 2019 and start the deprecation
-of scheduler v1. We will make it the default scheduler of Kubernetes in Q2 2019,
-but we will keep the option of using scheduler v1 for at least two more release
-cycles.
-</s>
-
+The scheduling framework design has moved to https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/20180409-scheduling-framework.md