summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorKubernetes Prow Robot <k8s-ci-robot@users.noreply.github.com>2021-04-13 10:13:11 -0700
committerGitHub <noreply@github.com>2021-04-13 10:13:11 -0700
commitfce5c5dd577e636c3ad1c0b215ca78c0f9c6feae (patch)
tree636567f81da4525d33601a32a627c3251981e5ca
parentbb7e37432bc6b2d40b8aa9d260eed05e06c0e371 (diff)
parente729a944398a288004ddcd7c664812bc94da3225 (diff)
Merge pull request #5674 from ingvagabund/sig-scheduling-code-architecture
sig-scheduling: Scheduler code hierarchy overview
-rw-r--r--contributors/devel/sig-scheduling/default_scheduler_architecture.pngbin0 -> 28576 bytes
-rw-r--r--contributors/devel/sig-scheduling/scheduling_code_hierarchy_overview.md286
2 files changed, 286 insertions, 0 deletions
diff --git a/contributors/devel/sig-scheduling/default_scheduler_architecture.png b/contributors/devel/sig-scheduling/default_scheduler_architecture.png
new file mode 100644
index 00000000..f742b8b2
--- /dev/null
+++ b/contributors/devel/sig-scheduling/default_scheduler_architecture.png
Binary files differ
diff --git a/contributors/devel/sig-scheduling/scheduling_code_hierarchy_overview.md b/contributors/devel/sig-scheduling/scheduling_code_hierarchy_overview.md
new file mode 100644
index 00000000..006fea70
--- /dev/null
+++ b/contributors/devel/sig-scheduling/scheduling_code_hierarchy_overview.md
@@ -0,0 +1,286 @@
+# Scheduler code hierarchy overview
+
+## Introduction
+
+The scheduler watches for newly created Pods that have no Node assigned.
+For every Pod that the scheduler discovers, the scheduler becomes responsible
+for finding the best Node for that Pod to run on.
+Scheduling in general is quite an extensive field in computer science which takes
+into account various range of constraints and limitations.
+Each workload may require a different approach to achieve optimal scheduling results.
+The kube-scheduler provided by Kubernetes project was constructed with a goal
+to provide high throughput at the cost of being simple.
+To help in building a scheduler (the default or a custom one) and to share
+elements of the scheduling logic,
+[the scheduling framework](https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/)
+was implemented.
+The framework does not provide all pieces to build a new scheduler from scratch.
+Queues, caches, scheduling algorithms and other building elements are still needed to assemble
+a fully functional unit. This document aims at describing how all the individual
+pieces are put together and what’s their role in the overall architecture
+so a developer can quickly orient in the code.
+
+## Scheduling a pod
+
+The default scheduler instance has a loop running indefinitely
+which (everytime there’s a pod) is responsible for invoking the scheduling logic
+and making sure a pod gets either a node assigned or requeued for future processing.
+Each loop consists of a blocking scheduling and a non-blocking binding cycle.
+The scheduling cycle is responsible for running the scheduling algorithm selecting
+the most suitable node for placing the pod.
+The binding cycle makes sure the kube-apiserver is made aware of the selected
+node at the right time. A pod may be bound immediately, or in the case of gang scheduling,
+wait until all its sibling pods have their node assigned.
+
+### Scheduling cycle
+
+Each cycle honors the following steps:
+1. Get the next pod for scheduling
+1. Schedule a pod with provided algorithm
+1. If a pod fails to be scheduled due to `FitError`, run preemption plugin in
+ `PostFilterPlugin` (if the plugin is registered) to nominate a node where
+ the pods can run. If preemption was successful,
+ let the current pod be aware of the nominated node.
+ Handle the error, get the next pod and start over.
+1. If the scheduling algorithm finds a suitable node, store the pod into
+ the scheduler cache (`AssumePod` operation) and run plugins from the `Reserve`
+ and `Permit` extension point in that order. In case any of the plugins fails,
+ end the current scheduling cycle, increase relevant metrics and handle
+ the scheduling error through the `Error` handler.
+1. Upon successfully running all extension points, proceed to the binding cycle.
+ At the same time start processing another pod (if there’s any).
+
+### Binding cycle
+
+Consists of the following four steps ran in the same order:
+- Invoking [WaitOnPermit](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/scheduler.go#L560)
+ (internal API) of plugins from `Permit` extension point. Some plugins from the extension point
+ may send a request for an operation requiring to wait for a condition
+ (e.g. wait for additional resources to be available or wait for all pods
+ in a gang to be assumed).
+ Under the hood, `WaitOnPermit` waits for such a condition to be met within a timeout threshold.
+- Invoking plugins from [PreBind](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/scheduler.go#L580) extension point
+- Invoking plugins from [Bind](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/scheduler.go#L592) extension point
+- Invoking plugins from [PostBind](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/scheduler.go#L611) extension point
+
+In case of processing of any of the extension points fails, `Unreserve` operation
+of all `Reserve` plugins is invoked (e.g. free resources allocated for a gang of pods).
+
+## Configuring and assembling the scheduler
+
+The scheduler codebase spans across various locations. Last but not least to mention:
+- [cmd/kube-scheduler/app](https://github.com/kubernetes/kubernetes/tree/a651804427dd9a15bb91e1c4fb7a79994e4817a2/cmd/kube-scheduler/app):
+ location of the controller code alongside definition of CLI arguments (honors the standard setup for all Kubernetes controllers)
+- [pkg/scheduler](https://github.com/kubernetes/kubernetes/tree/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler):
+ the default scheduler codebase root directory
+- [pkg/scheduler/core](https://github.com/kubernetes/kubernetes/tree/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/core):
+ location of the default scheduling algorithm
+- [pkg/scheduler/framework](https://github.com/kubernetes/kubernetes/tree/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/framework):
+ scheduling framework alongside plugins
+- [pkg/scheduler/internal](https://github.com/kubernetes/kubernetes/tree/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/internal):
+ implementation of the cache, queues and other internal elements
+- [staging/src/k8s.io/kube-scheduler](https://github.com/kubernetes/kubernetes/tree/a651804427dd9a15bb91e1c4fb7a79994e4817a2/staging/src/k8s.io/kube-scheduler):
+ location of ComponentConfig API types
+- [test/e2e/scheduling](https://github.com/kubernetes/kubernetes/tree/a651804427dd9a15bb91e1c4fb7a79994e4817a2/test/e2e/scheduling):
+ scheduling e2e
+- [test/integration/scheduler](https://github.com/kubernetes/kubernetes/tree/a651804427dd9a15bb91e1c4fb7a79994e4817a2/test/integration/scheduler)
+ scheduling integration tests
+- [test/integration/scheduler_perf](https://github.com/kubernetes/kubernetes/tree/a651804427dd9a15bb91e1c4fb7a79994e4817a2/test/integration/scheduler_perf)
+ scheduling performance benchmarks
+
+### Initial startup configuration
+
+Code under `cmd/kube-scheduler/app` is responsible for collecting scheduler
+configuration and initializing logic allowing the kube-scheduler to run
+as part of the Kubernetes control plane. The code includes:
+- Initializing [command line options](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/cmd/kube-scheduler/app/server.go#L96)
+ (along with a default `ComponentConfig`) and [validation](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/cmd/kube-scheduler/app/server.go#L300)
+- Initializing [metrics](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/cmd/kube-scheduler/app/server.go#L238)
+ (`/metrics`), [health check](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/cmd/kube-scheduler/app/server.go#L268)
+ (`/healthz`) and [other handlers](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/cmd/kube-scheduler/app/server.go#L225-L236)
+ (authorization, authentication, panic recovery, etc.)
+- Reading and defaulting configuration of [KubeSchedulerConfiguration](https://github.com/kubernetes/kubernetes/blob/4740173f3378ef9d0dc59b0aa9299444a97d0818/pkg/scheduler/apis/config/types.go#L49-L106)
+- Building a registry with plugins (in-tree, [out-of-tree](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/cmd/kube-scheduler/app/server.go#L312-L317))
+- Initializing the scheduler with various options such as [profiles, algorithm source, pod back off, etc.](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/cmd/kube-scheduler/app/server.go#L326-L337)
+- Invocation of [LogOrWriteConfig](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/cmd/kube-scheduler/app/server.go#L342) which logs the final scheduler configuration for debugging purposes
+- Right before running, `/configz` [is registered](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/cmd/kube-scheduler/app/server.go#L141),
+ [events broadcaster started](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/cmd/kube-scheduler/app/server.go#L148),
+ [leader election initiated](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/cmd/kube-scheduler/app/server.go#L198-L216),
+ and [the server](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/cmd/kube-scheduler/app/server.go#L185)
+ with all the configured handlers and [informers](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/cmd/kube-scheduler/app/server.go#L192)
+ is started.
+
+Once initialized, the scheduler can run.
+
+In more detail, there’s a [Setup](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/cmd/kube-scheduler/app/server.go#L299)
+function accomplishing what is essentially
+the initialization of the scheduler’s core process.
+First, it validates the options that have been passed through (the flags added
+in `NewSchedulerCommand()` are set directly on this options struct’s fields).
+If the options passed so far don’t raise any errors, it then calls `opts.Config()`
+which sets up the final internal settings including secure serving, leader election,
+clients, and begins parsing options related to the algorithm source
+(like loading config files and initializing empty profiles as well as handling
+deprecated options like policy config). The next lines call `c.Complete()` to complete
+the config by filling in any empty values. At this point any out-of-tree plugins
+are registered by creating a blank registry and adding entries in that registry
+for each plugin’s New function. It should be noted that the Registry is simply
+a map of plugin names to their factory functions. For the default scheduler,
+this step does nothing (because our main function in `cmd/kube-scheduler/scheduler.go`
+passes nothing to `NewSchedulerCommand()`).
+This means the default set of plugins is initialized in `scheduler.New()`.
+
+Given the initialization is performed outside the scheduling framework,
+different consumers of the framework can initialize the environment differently
+to cover their needs. For example, a simulator can inject its own object
+through informers. Or custom plugins may be provided instead of the default ones.
+Known consumers of the scheduling framework:
+- [cluster-autoscaler](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/simulator/scheduler_based_predicates_checker.go#L48-L79)
+- [cluster-capacity](https://github.com/kubernetes-sigs/cluster-capacity/blob/8e9c2dcf3644cb5f73fca3d35d4e22899c265ad5/pkg/framework/simulator.go#L370-L383)
+
+### Assembling the scheduler
+
+The code is located under `pkg/scheduler`.
+This is where implementation of the default scheduler lives.
+Various elements of the scheduler are initialized and put together here:
+- Default scheduling options such as node percentage, initial and maximum backoff, profiles
+- Scheduler cache and queues
+- Scheduling profiles instantiated to tailor a framework for each profile
+ to better suit pod placement (each profile defines a set of plugins to use)
+- Handler functions for getting the next pod for scheduling (`NextPod`) and error handling (`Error`)
+
+The following steps are taken during the process of creating a scheduler instance:
+- Scheduler [cache is initialized](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/scheduler.go#L206)
+- Both in-tree and out-of-tree registries with plugins are [merged together](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/scheduler.go#L208-L211)
+- Metrics are [registered](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/scheduler.go#L232)
+- [Configurator](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/scheduler.go#L215-L230)
+ building a scheduler instance (wiring the cache, plugin registry,
+ scheduling algorithm and other elements together)
+- Event handlers [are registered](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/scheduler.go#L273)
+ to allow the scheduler to react on changes in PVs,
+ PVCs, services and other objects relevant for scheduling (eventually,
+ each plugin will define a set of events on which it reacts,
+ see [kubernetes/kubernetes#100347](https://github.com/kubernetes/kubernetes/issues/100347)
+ for more details).
+
+The following diagram shows how individual elements are connected together
+once initialized. Event handlers make sure pods are properly enqueued
+in the [scheduling queues](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-scheduling/scheduler_queues.md),
+the cache is updated with pods and nodes
+as they go (to provide up-to-date snapshot). Scheduling algorithm and the binding cycle
+have the right instances of the framework available (one instance of the framework per a profile).
+
+![Scheduler architecture](default_scheduler_architecture.png "Scheduler architecture")
+
+#### Scheduling framework
+
+Its code is currently located under `pkg/scheduler/framework`.
+It contains [various plugins](https://github.com/kubernetes/kubernetes/tree/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/framework/plugins)
+responsible for filtering and scoring nodes (among others).
+Used as building blocks for any scheduling algorithm.
+
+When a [plugin is initialized](https://github.com/kubernetes/kubernetes/blob/4740173f3378ef9d0dc59b0aa9299444a97d0818/pkg/scheduler/framework/runtime/framework.go#L310),
+it’s passed a [framework handler](https://github.com/kubernetes/kubernetes/blob/4740173f3378ef9d0dc59b0aa9299444a97d0818/pkg/scheduler/framework/runtime/framework.go#L251-L264)
+which provides interfaces to access and/or manipulate pods, nodes, clientset,
+event recorder and other handlers every plugin needs to implement its functionality.
+
+#### Scheduler cache
+
+Cache is responsible for capturing the current state of a cluster.
+Keeping a list of nodes and assumed pods alongside states of pods and images.
+The cache provides methods for reconciling pod and node objects
+(invoked through event handlers) keeping the state of the cluster up to date.
+Allowing to update the snapshot of a cluster (to pin the cluster state while a scheduling
+algorithm is run) with the latest state at the beginning of each scheduling cycle.
+
+The cache also allows to run assume operation which temporarily stores a pod
+in the cache and makes it look as the pod is actually already
+running on a designated node for all consumers of the snapshot.
+Assume operation exists to remove the time the pod actually gets updated
+on the kube-apiserver side and thus increasing the scheduler’s throughput.
+The following operations manipulate with the assumed pods:
+- `AssumePod`: to signal the scheduling algorithm found a feasible node so the next
+ pod can be attempted while the current pod enters the binding cycle
+- `FinishBinding`: used to signal Bind finished so the pod can be removed
+ from the list of assumed pods
+- `ForgetPod`: removes pod from the list of assumed pods, used in case the pod
+ fails to get processed in the binding cycle successfully
+ (e.g. during `Reserve`, `Permit`, `PreBind` or `Bind` evaluation)
+
+The cache keeps track of the following three metrics:
+- `scheduler_cache_size_assumed_pods`: number of pods in the assume pods list
+- `scheduler_cache_size_pods`: number of pods in the cache
+- `scheduler_cache_size_nodes`: number of nodes in the cache
+
+#### Snapshot
+
+The [snapshot](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/internal/cache/snapshot.go)
+captures the state of a cluster carrying information about all nodes
+in a cluster and objects located on each node.
+Namely node objects, pods assigned on each node, requested resources of all pods
+on each node, node’s allocatable, images pulled and other information needed
+to make a scheduling decision. Every time a pod is scheduled,
+a snapshot of the current state of the cluster is captured.
+To avoid a case where a pod or node gets changed while plugins are processed
+which might lead to data inconsistency as some plugins might get a different
+view of the cluster.
+
+#### Configurator
+
+A [configurator](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/factory.go#L90)
+builds the scheduler instance by wiring plugins, cache, queues,
+handlers and other elements together. Each profile [is initialized](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/factory.go#L138-L147)
+with its own framework (with all frameworks sharing informers, event recorders, etc.).
+
+At this point it’s still possible to have the configurator create the instance
+[from a policy file](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/factory.go#L213).
+Though, this approach is deprecated and will be removed
+from the configuration eventually. Keeping only the kube scheduler configuration
+as the only way to provide the configuration.
+
+#### Default scheduling algorithm
+
+The codebase defines a [ScheduleAlgorithm](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/core/generic_scheduler.go#L61-L66)
+interface.
+Any implementation of the interface can be used as a scheduling algorithm.
+There are two methods:
+- `Schedule`: responsible for scheduling a pod using plugins from `PreFilter`
+ up to `NormalizeScore` extension points, provides [ScheduleResult](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/core/generic_scheduler.go#L70-L77)
+ containing a scheduling decision (the most suitable nodes) with additional
+ accompanying information such as how many nodes were evaluated
+ and how many nodes were found feasible for scheduling.
+- `Extenders`: currently exposed only for testing
+
+Each cycle of the default algorithm implementation consists of:
+1. Taking the [current snapshot](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/core/generic_scheduler.go#L101)
+ from the scheduling cache
+1. [Filter out all nodes](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/core/generic_scheduler.go#L110)
+ not feasible for scheduling a pod
+ 1. Run [PreFilter plugins](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/core/generic_scheduler.go#L230)
+ first (preprocessing phase, e.g. computing pod [anti-]affinity relations)
+ 1. Run [Filter plugins](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/core/generic_scheduler.go#L261) in parallel:
+ filter out all nodes which does not satisfy pod’s constraints
+ (e.g. sufficient resources, node affinity, etc.), including running filter extenders
+ 1. Run [PostFilter plugins](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/scheduler.go#L479)
+ if no node can fit the incoming pod
+1. In case there are at least two feasible nodes for scheduling, run [scoring plugins](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/core/generic_scheduler.go#L133):
+ 1. Run [PreScore plugins](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/core/generic_scheduler.go#L427)
+ first (preprocessing phase)
+ 1. Run [Score plugins](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/core/generic_scheduler.go#L433) in parallel:
+ each node is given a score vector (each coordinate corresponding to one plugin)
+ 1. Run [NormalizeScore plugins](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/framework/runtime/framework.go#L798):
+ to have all plugins given a score in <0; 100> interval
+ 1. Compute [weighted score](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/framework/runtime/framework.go#L810-L828)
+ for each node (each score plugin can have
+ a weight assigned indicating how much its score is preferred over others)
+ 1. Run [score extenders](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/core/generic_scheduler.go#L456)
+ and add it to the total score of each node
+1. [Select](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/core/generic_scheduler.go#L138)
+ and [give back a node](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/core/generic_scheduler.go#L141-L145)
+ with the highest score. If there’s only a [single feasible node](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/core/generic_scheduler.go#L125-L131)
+ skip `PreScore`, `Score` and `NormalizeScore` extension points
+ and give back the node right away. If there’s no feasible node, report it.
+
+Be aware of:
+- If a plugin provides score normalization, it needs to return non-nil
+when [ScoreExtensions()](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/framework/plugins/podtopologyspread/scoring.go#L254-L256) gets invoked