diff options
| author | Kubernetes Prow Robot <k8s-ci-robot@users.noreply.github.com> | 2019-03-01 16:49:26 -0800 |
|---|---|---|
| committer | GitHub <noreply@github.com> | 2019-03-01 16:49:26 -0800 |
| commit | b9fd7ff50a3f767133943bc8e3993447ecbdba07 (patch) | |
| tree | d4067de6ada74d08b1eaf2518e3c1e482de314f6 | |
| parent | 453928f49cc7430eac177199096103474128f5dd (diff) | |
| parent | 662c66103c0369d576f76b2858215902fe24c1f1 (diff) | |
Merge pull request #3338 from misterikkit/framework2
Remove old copy of scheduling framework design.
| -rw-r--r-- | contributors/design-proposals/scheduling/images/scheduling-framework-extensions.png | bin | 50818 -> 0 bytes | |||
| -rw-r--r-- | contributors/design-proposals/scheduling/images/scheduling-framework-threads.png | bin | 44305 -> 0 bytes | |||
| -rw-r--r-- | contributors/design-proposals/scheduling/scheduling-framework.md | 431 |
3 files changed, 2 insertions, 429 deletions
diff --git a/contributors/design-proposals/scheduling/images/scheduling-framework-extensions.png b/contributors/design-proposals/scheduling/images/scheduling-framework-extensions.png Binary files differdeleted file mode 100644 index 25f50471..00000000 --- a/contributors/design-proposals/scheduling/images/scheduling-framework-extensions.png +++ /dev/null diff --git a/contributors/design-proposals/scheduling/images/scheduling-framework-threads.png b/contributors/design-proposals/scheduling/images/scheduling-framework-threads.png Binary files differdeleted file mode 100644 index ae9e1965..00000000 --- a/contributors/design-proposals/scheduling/images/scheduling-framework-threads.png +++ /dev/null diff --git a/contributors/design-proposals/scheduling/scheduling-framework.md b/contributors/design-proposals/scheduling/scheduling-framework.md index 39ff7db2..1de43aab 100644 --- a/contributors/design-proposals/scheduling/scheduling-framework.md +++ b/contributors/design-proposals/scheduling/scheduling-framework.md @@ -1,436 +1,9 @@ Status: Draft -Created: 2018-04-09 / Last updated: 2018-08-15 +Created: 2018-04-09 / Last updated: 2019-03-01 Author: bsalamat Contributors: misterikkit --- -# -- [SUMMARY ](#summary-) -- [OBJECTIVE](#objective) - - [Terminology](#terminology) -- [BACKGROUND](#background) -- [OVERVIEW](#overview) - - [Non-goals](#non-goals) -- [DETAILED DESIGN](#detailed-design) - - [Bare bones of scheduling](#bare-bones-of-scheduling) - - [Communication and statefulness of plugins](#communication-and-statefulness-of-plugins) - - [Plugin registration](#plugin-registration) - - [Extension points](#extension-points) - - [Scheduling queue sort](#scheduling-queue-sort) - - [Pre-filter](#pre-filter) - - [Filter](#filter) - - [Post-filter](#post-filter) - - [Scoring](#scoring) - - [Post-scoring/pre-reservation](#post-scoringpre-reservation) - - [Reserve](#reserve) - - [Permit](#permit) - - [Approving a Pod binding](#approving-a-pod-binding) - - [Reject](#reject) - - [Pre-Bind](#pre-bind) - - [Bind](#bind) - - [Post Bind](#post-bind) -- [USE-CASES](#use-cases) - - [Dynamic binding of cluster-level resources](#dynamic-binding-of-cluster-level-resources) - - [Gang Scheduling](#gang-scheduling) -- [OUT OF PROCESS PLUGINS](#out-of-process-plugins) -- [CONFIGURING THE SCHEDULING FRAMEWORK](#configuring-the-scheduling-framework) -- [BACKWARD COMPATIBILITY WITH SCHEDULER v1](#backward-compatibility-with-scheduler-v1) -- [DEVELOPMENT PLAN](#development-plan) -- [TESTING PLAN](#testing-plan) -- [WORK ESTIMATES ](#work-estimates) - -# SUMMARY - -This document describes the Kubernetes Scheduling Framework. The scheduling -framework implements only basic functionality, but exposes many extension points -for plugins to expand its functionality. The plan is that this framework (with -its plugins) will eventually replace the current Kubernetes scheduler. - -# OBJECTIVE - -- make scheduler more extendable. -- Make scheduler core simpler by moving some of its features to plugins. -- Propose extension points in the framework. -- Propose a mechanism to receive plugin results and continue or abort based - on the received results. -- Propose a mechanism to handle errors and communicate it with plugins. - -## Terminology - -Scheduler v1, current scheduler: refer to existing scheduler of Kubernetes. -Scheduler v2, scheduling framework: refer to the new scheduler proposed in this -doc. - -# BACKGROUND - -Many features are being added to the Kubernetes default scheduler. They keep -making the code larger and logic more complex. A more complex scheduler is -harder to maintain, its bugs are harder to find and fix, and those users running -a custom scheduler have a hard time catching up and integrating new changes. -The current Kubernetes scheduler provides -[webhooks to extend](./scheduler_extender.md) -its functionality. However, these are limited in a few ways: - -1. The number of extension points are limited: "Filter" extenders are called - after default predicate functions. "Prioritize" extenders are called after - default priority functions. "Preempt" extenders are called after running - default preemption mechanism. "Bind" verb of the extenders are used to bind - a Pod. Only one of the extenders can be a binding extender, and that - extender performs binding instead of the scheduler. Extenders cannot be - invoked at other points, for example, they cannot be called before running - predicate functions. -1. Every call to the extenders involves marshaling and unmarshalling JSON. - Calling a webhook (HTTP request) is also slower than calling native functions. -1. It is hard to inform an extender that scheduler has aborted scheduling of - a Pod. For example, if an extender provisions a cluster resource and - scheduler contacts the extender and asks it to provision an instance of the - resource for the Pod being scheduled and then scheduler faces errors - scheduling the Pod and decides to abort the scheduling, it will be hard to - communicate the error with the extender and ask it to undo the provisioning - of the resource. -1. Since current extenders run as a separate process, they cannot use - scheduler's cache. They must either build their own cache from the API - server or process only the information they receive from the default scheduler. - -The above limitations hinder building high performance and versatile scheduler -extensions. We would ideally like to have an extension mechanism that is fast -enough to allow keeping a bare minimum logic in the scheduler core and convert -many of the existing features of default scheduler, such as predicate and -priority functions and preemption into plugins. Such plugins will be compiled -with the scheduler. We would also like to provide an extension mechanism that do -not need recompilation of scheduler. The expected performance of such plugins is -lower than in-process plugins. Such out-of-process plugins should be used in -cases where quick invocation of the plugin is not a constraint. - -# OVERVIEW - -Scheduler v2 allows both built-in and out-of-process extenders. This new -architecture is a scheduling framework that exposes several extension points -during a scheduling cycle. Scheduler plugins can register to run at one or more -extension points. - -#### Non-goals - -- We will keep Kubernetes API backward compatibility, but keeping scheduler - v1 backward compatibility is a non-goal. Particularly, scheduling policy - config and v1 extenders won't work in this new framework. -- Solve all the scheduler v1 limitations, although we would like to ensure - that the new framework allows us to address known limitations in the future. -- Provide implementation details of plugins and call-back functions, such as - all of their arguments and return values. - -# DETAILED DESIGN - -## Bare bones of scheduling - -Pods that are not assigned to any node go to a scheduling queue and sorted by -order specified by plugins (described [here](#scheduling-queue-sort)). The -scheduling framework picks the head of the queue and starts a **scheduling -cycle** to schedule the pod. At the end of the cycle scheduler determines -whether the pod is schedulable or not. If the pod is not schedulable, its status -is updated and goes back to the scheduling queue. If the pod is schedulable (one -or more nodes are found that can run the Pod), the scoring process is started. -The scoring process finds the best node to run the Pod. Once the best node is -picked, the scheduler updates its cache and then a bind go routine is started to -bind the pod. -The above process is the same as what Kubernetes scheduler v1 does. Some of the -essential features of scheduler v1, such as leader election, will also be -transferred to the scheduling framework. -In the rest of this section we describe how various plugins are used to enrich -this basic workflow. This document focuses on in-process plugins. -Out-of-process plugins are discussed later in a separate doc. - -## Communication and statefulness of plugins - -The scheduling framework provides a library that plugins can use to pass -information to other plugins. This library keeps a map from keys of type string -to opaque pointers of type interface{}. A write operation takes a key and a -pointer and stores the opaque pointer in the map with the given key. Other -plugins can provide the key and receive the opaque pointer. Multiple plugins can -share the state or communicate via this mechanism. -The saved state is preserved only during a single scheduling cycle. At the end -of a scheduling cycle, this map is destructed. So, plugins cannot keep shared -state across multiple scheduling cycle. They can, however, update the scheduler -cache via the provided interface of the cache. The cache interface allows -limited state preservation across multiple scheduling cycle. -It is worth noting that plugins are assumed to be **trusted**. Scheduler does -not prevent one plugin from accessing or modifying another plugin's state. - -## Plugin registration - -Plugin registration is done by providing an extension point and a function that -should be called at that extension point. This step will be something like: - -```go -register("pre-filter", plugin.foo) -``` - -The details of the function signature will be provided later. - -## Extension points - -The following picture shows the scheduling cycle of a Pod and the extension -points that the scheduling framework exposes. In this picture "Filter" is -equivalent to "Predicate" in scheduler v1 and "Scoring" is equivalent to -"Priority function". Plugins are go functions. They are registered to be called -at one of these extension points. They are called by the framework in the same -order they are registered for each extension point. -In the following sections we describe each extension point in the same order -they are called in a schedule cycle. - - - -### Scheduling queue sort - -These plugins indicate how Pods should be sorted in the scheduling queue. A -plugin registered at this point only returns greater, smaller, or equal to -indicate an ordering between two Pods. In other words, a plugin at this -extension point returns the answer to "less(pod1, pod2)". Multiple plugins may -be registered at this point. Plugins registered at this point are called in -order and the invocation continues as long as plugins return "equal". Once a -plugin returns "greater" or "smaller" the invocation of these plugins are -stopped. - -### Pre-filter - -These plugins are generally useful to check certain conditions that the cluster -or the Pod must meet. These are also useful to perform pre-processing on the pod -and store some information about the pod that can be used by other plugins. -The pod pointer is passed as an argument to these plugins. If any of these -plugins return an error, the scheduling cycle is aborted. -These plugins are called serially in the same order registered. - -### Filter - -Filter plugins filter out nodes that cannot run the Pod. Scheduler runs these -plugins per node in the same order that they are registered, but scheduler may -run these filter function for multiple nodes in parallel. So, these plugins must -use synchronization when they modify state. -Scheduler stops running the remaining filter functions for a node once one of -these filters fails for the node. - -### Post-filter - -The Pod and the set of nodes that can run the Pod are passed to these plugins. -They are called whether Pod is schedulable or not (whether the set of nodes is -empty or non-empty). -If any of these plugins return an error or if the Pod is determined -unschedulable, the scheduling cycle is aborted. -These plugins are called serially. - -### Scoring - -These plugins are similar to priority function in scheduler v1. They are -utilized to rank nodes that have passed the filtering stage. Similar to Filter -plugins, these are called per node serially in the same order registered, but -scheduler may run them for multiple nodes in parallel. -Each one of these functions return a score for the given node. The score is -multiplied by the weight of the function and aggregated with the result of other -scoring functions to yield a total score for the node. -These functions can never block scheduling. In case of an error they should -return zero for the Node being ranked. - -### Post-scoring/pre-reservation - -After all scoring plugins are invoked and the score of nodes are determined, the -framework picks the best node with the highest score and then it calls -post-scoring plugins. The Pod and the chosen Node are passed to these plugins. -These plugins have one more chance to check any conditions about the assignment -of the Pod to this Node and reject the node if needed. - - - -### Reserve - -At this point scheduler updates its cache by "reserving" a Node (partially or -fully) for the Pod. In scheduler v1 this stage is called "assume". -At this point, only the scheduler cache is updated to -reflect that the Node is (partially) reserved for the Pod. The scheduling -framework calls plugins registered at this extension points so that they get a -chance to perform cache updates or other accounting activities. These plugins -do not return any value (except errors). - -The actual assignment of the Node to the Pod happens during the "Bind" phase. -That is when the API server updates the Pod object with the Node information. - -### Permit - -Permit plugins run in a separate go routine (in parallel). Each plugin can return -one of the three possible values: 1) "permit", 2) "deny", or 3) "wait". If all -plugins registered at this extension point return "permit", the pod is sent to -the next step for binding. If any of the plugins returns "deny", the pod is -rejected and sent back to the scheduling queue. If any of the plugins returns -"wait", the Pod is kept in reserved state until it is explicitly approved for -binding. A plugin that returns "wait" must return a "timeout" as well. If the -timeout expires, the pod is rejected and goes back to the scheduling queue. - -#### Approving a Pod binding - -While any plugin can receive the list of reserved Pod from the cache and approve -them, we expect only the "Permit" plugins to approve binding of reserved Pods -that are in "waiting" state. Once a Pod is approved, it is sent to the Bind -stage. - -### Reject - -Plugins called at "Permit" may perform some operations that should be undone if -the Pod reservation fails. The "Reject" extension point allows such clean-up -operations to happen. Plugins registered at this point are called if the -reservation of the Pod is cancelled. The reservation is cancelled if any of the -"Permit" plugins returns "reject" or if a Pod reservation, which is in "wait" -state, times out. - -### Pre-Bind - -When a Pod is approved for binding it reaches to this stage. These plugins run -before the actual binding of the Pod to a Node happens. The binding starts only -if all of these plugins return true. If any returns false, the Pod is rejected -and sent back to the scheduling queue. These plugins run in a separate go -routine. The same go routine runs "Bind" after these plugins when all of them -return true. - -### Bind - -Once all pre-bind plugins return true, the Bind plugins are executed. Multiple -plugins may be registered at this extension point. Each plugin may return true -or false (or an error). If a plugin returns false, the next plugin will be -called until a plugin returns true. Once a true is returned **the remaining -plugins are skipped**. If any of the plugins returns an error or all of them -return false, the Pod is rejected and sent back to the scheduling queue. - -### Post Bind - -The Post Bind plugins can be useful for housekeeping after a pod is scheduled. -These plugins do not return any value and are not expected to influence the -scheduling decision made in the scheduling cycle. - -### Informer Events - -The scheduling framework, similar to Scheduler v1, will have informers that let -the framework keep its copy of the state of the cluster up-to-date. The -informers generate events, such as "PodAdd", "PodUpdate", "PodDelete", etc. The -framework allows plugins to register their own handlers for any of these events. -The handlers allow plugins with internal state or caches to keep their state -updated. - -# USE-CASES - -In this section we provide a couple of examples on how the scheduling framework -can be used to solve common scheduling scenarios. - -### Dynamic binding of cluster-level resources - -Cluster level resources are resources which are not immediately available on -nodes at the time of scheduling Pods. Scheduler needs to ensure that such -cluster level resources are bound to a chosen Node before it can schedule a Pod -that requires such resources to the Node. We refer to this type of binding of -resources to Nodes at the time of scheduling Pods as dynamic resource binding. -Dynamic resource binding has proven to be a challenge in Scheduler v1, because -Scheduler v1 is not flexible enough to support various types of plugins at -different phases of scheduling. As a result, binding of storage volumes is -integrated in the scheduler code and some non-trivial changes are done to the -scheduler extender to support dynamic binding of network GPUs. -The scheduling framework allows such dynamic bindings in a cleaner way. The main -thread of scheduling framework process a pending Pod that requests a network -resource and finds a node for the Pod and reserves the Pod. A dynamic resource -binder plugin installed at "Pre-Bind" stage is invoked (in a separate thread). -It analyzes the Pod and when detects that the Pod needs dynamic binding of the -resource, the plugin tries to attach the cluster resource to the chosen node and -then returns true so that the Pod can be bound. If the resource attachment -fails, it returns false and the Pod will be retried. -When there are multiple of such network resources, each one of them installs one -"pre-bind" plugin. Each plugin looks at the Pod and if the Pod is not requesting -the resource that they are interested in, they simply return "true" for the -pod. - -### Gang Scheduling - -Gang scheduling allows a certain number of Pods to be scheduled simultaneously. -If all the members of the gang cannot be scheduled at the same time, none of -them should be scheduled. Gang scheduling may have various other features as -well, but in this context we are interested in simultaneous scheduling of Pods. -Gang scheduling in the scheduling framework can be done with an "Permit" plugin. -The main scheduling thread processes pods one by one and reserves nodes for -them. The gang scheduling plugin at the Permit stage is invoked for each pod. -When it finds that the pod belongs to a gang, it checks the properties of the -gang. If there are not enough members of the gang which are scheduled or in -"wait" state, the plugin returns "wait". When the number reaches the desired -value, all the Pods in wait state are approved and sent for binding. - -# OUT OF PROCESS PLUGINS - -Out of process plugins (OOPP) are called via JSON over an HTTP interface. In -other words, the scheduler will support webhooks at most (maybe all) of the -extension points. Data sent to an OOPP must be marshalled to JSON and data -received must be unmarshalled. So, calling an OOPP is significantly slower than -in-process plugins. -We do not plan to build OOPPs in the first version of the scheduling framework. -So, more details on them is to be determined. - - -# DEVELOPMENT PLAN - -Earlier, we wanted to develop the scheduling framework as an independent project -from scheduler V1. However, that would need much engineering resources. -It would also be more difficult to roll out a new and not fully-backward -compatible scheduler in Kubernetes where tens of thousands of users depend on -the behavior of the scheduler. -After revisiting the ideas and challenges, we changed our plan and have decided -to build some of the ideas of the scheduling framework into Scheduler V1 to make -it more extendable. - -As the first step, we would like to build: - 1. [Pre-bind](#pre-bind) and [Reserve](#reserve) plugin points. These will - help us move our existing cluster resource binding code, such as persistent - volume binding, to plugins. - 1. We will also build - [the plugin communication mechanism](#communication-and-statefulness-of-plugins). - This will allow us to build more sophisticated plugins that would require - communication and also help us clean up existing scheduler's code by removing - existing transient cache data. - -More features of the framework can be added to the Scheduler in the future based -on the requirements. - -<s> -# CONFIGURING THE SCHEDULING FRAMEWORK - -TBD - -# BACKWARD COMPATIBILITY WITH SCHEDULER v1 - -We will build a new set of plugins for scheduler v2 to ensure that the existing -behavior of scheduler v1 in placing Pods on nodes is preserved. This includes -building plugins that replicate default predicate and priority functions of -scheduler v1 and its binding mechanism, but scheduler extenders built for -scheduler v1 won't be compatible with scheduler v2. Also, predicate and priority -functions which are not enabled by default (such as service affinity) are not -guaranteed to exist in scheduler v2. - -# DEVELOPMENT PLAN - -We will develop the scheduling framework as an incubator project in SIG -scheduling. It will be built in a separate code-base independently from -scheduler v1, but we will probably use a lot of code from scheduler v1. - -# TESTING PLAN - -We will add unit-tests as we build functionalities of the scheduling framework. -The scheduling framework should eventually be able to pass integration and e2e -tests of scheduler v1, excluding those tests that involve scheduler extensions. -The e2e and integration tests may need to be modified slightly as the -initialization and configuration of the scheduling framework will be different -than scheduler v1. - -# WORK ESTIMATES - -We expect to see an early version of the scheduling framework in two release -cycles (end of 2018). If things go well, we will start offering it as an -alternative to the scheduler v1 by the end of Q1 2019 and start the deprecation -of scheduler v1. We will make it the default scheduler of Kubernetes in Q2 2019, -but we will keep the option of using scheduler v1 for at least two more release -cycles. -</s> - +The scheduling framework design has moved to https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/20180409-scheduling-framework.md |
