# Scheduler code hierarchy overview

## Introduction

The scheduler watches for newly created Pods that have no Node assigned.
For every Pod that the scheduler discovers, the scheduler becomes responsible
for finding the best Node for that Pod to run on.
Scheduling in general is an extensive field of computer science that takes
into account a wide range of constraints and limitations.
Each workload may require a different approach to achieve optimal scheduling results.
The kube-scheduler provided by the Kubernetes project was built
to provide high throughput at the cost of being simple.
To help in building a scheduler (the default or a custom one) and to share
elements of the scheduling logic,
[the scheduling framework](https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/)
was implemented.
The framework does not provide all the pieces needed to build a new scheduler from scratch.
Queues, caches, scheduling algorithms and other building elements are still needed to assemble
a fully functional unit. This document describes how all the individual
pieces are put together and what their role is in the overall architecture,
so that a developer can quickly get oriented in the code.

## Scheduling a pod

The default scheduler instance has a loop running indefinitely
which (every time there’s a pod) is responsible for invoking the scheduling logic
and making sure a pod either gets a node assigned or is requeued for future processing.
Each iteration of the loop consists of a blocking scheduling cycle and a non-blocking binding cycle.
The scheduling cycle is responsible for running the scheduling algorithm, which selects
the most suitable node for placing the pod.
The binding cycle makes sure the kube-apiserver is made aware of the selected
node at the right time. A pod may be bound immediately, or in the case of gang scheduling,
wait until all its sibling pods have their node assigned.
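
The split between the blocking and the non-blocking part can be illustrated with a small Go sketch. It is not the kube-scheduler code: `pod`, `schedulingCycle` and `runLoop` are hypothetical names, the queue is a finite slice instead of an endless loop, and node selection is a placeholder.

```go
package main

import (
	"fmt"
	"sync"
)

// pod is a hypothetical stand-in for a Pod object.
type pod struct{ name string }

// schedulingCycle stands in for the blocking part: pick a node for the pod.
// Placeholder logic: a real algorithm filters and scores the nodes.
func schedulingCycle(p pod, nodes []string) (string, bool) {
	if len(nodes) == 0 {
		return "", false
	}
	return nodes[0], true
}

// runLoop handles pods one at a time. The scheduling cycle blocks the loop,
// while each binding cycle runs in its own goroutine.
func runLoop(queue []pod, nodes []string) (bound map[string]string, requeued []string) {
	bound = map[string]string{}
	var mu sync.Mutex
	var wg sync.WaitGroup
	for _, p := range queue {
		node, ok := schedulingCycle(p, nodes) // blocking
		if !ok {
			requeued = append(requeued, p.name) // retry later
			continue
		}
		wg.Add(1)
		go func(p pod, node string) { // non-blocking binding cycle
			defer wg.Done()
			mu.Lock()
			defer mu.Unlock()
			bound[p.name] = node
		}(p, node)
	}
	wg.Wait() // only needed because this sketch terminates
	return bound, requeued
}

func main() {
	bound, requeued := runLoop([]pod{{"a"}, {"b"}}, []string{"node-1"})
	fmt.Println(len(bound), len(requeued))
}
```

Because binding happens in a separate goroutine, the loop can start the scheduling cycle of the next pod before the previous pod is bound.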

### Scheduling cycle

Each cycle goes through the following steps:
1. Get the next pod for scheduling
1. Schedule the pod with the provided algorithm
1. If the pod fails to be scheduled due to a `FitError`, run the preemption plugin in
   `PostFilterPlugin` (if the plugin is registered) to nominate a node where
   the pod can run. If preemption was successful,
   let the current pod be aware of the nominated node.
   Handle the error, get the next pod and start over.
1. If the scheduling algorithm finds a suitable node, store the pod into
   the scheduler cache (`AssumePod` operation) and run plugins from the `Reserve`
   and `Permit` extension points, in that order. In case any of the plugins fails,
   end the current scheduling cycle, increase relevant metrics and handle
   the scheduling error through the `Error` handler.
1. Upon successfully running all extension points, proceed to the binding cycle.
   At the same time start processing another pod (if there is one).

### Binding cycle

The binding cycle consists of the following four steps, run in the order listed:
- Invoking [WaitOnPermit](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/scheduler.go#L560)
  (internal API) of plugins from the `Permit` extension point. Some plugins from the extension point
  may request that the pod wait for a condition
  (e.g. wait for additional resources to be available or wait for all pods
  in a gang to be assumed).
  Under the hood, `WaitOnPermit` waits for such a condition to be met within a timeout threshold.
- Invoking plugins from [PreBind](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/scheduler.go#L580) extension point
- Invoking plugins from [Bind](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/scheduler.go#L592) extension point
- Invoking plugins from [PostBind](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/scheduler.go#L611) extension point

If processing of any of the extension points fails, the `Unreserve` operation
of all `Reserve` plugins is invoked (e.g. to free resources allocated for a gang of pods).
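
The ordering and the rollback on failure can be sketched as follows. This is a simplified illustration, not the scheduler's implementation: `step`, `bindingCycle` and `errNotReady` are hypothetical names, and each extension point is reduced to a single function.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotReady is a made-up failure used by the example.
var errNotReady = errors.New("volume not ready")

// step is a hypothetical stand-in for one binding-cycle extension point.
type step struct {
	name string
	run  func() error
}

// bindingCycle runs the steps in order. On the first failure it calls
// unreserve and returns the error; the remaining steps are skipped.
func bindingCycle(steps []step, unreserve func()) error {
	for _, s := range steps {
		if err := s.run(); err != nil {
			unreserve()
			return fmt.Errorf("%s failed: %w", s.name, err)
		}
	}
	return nil
}

func main() {
	var trace []string
	ok := func(name string) step {
		return step{name, func() error { trace = append(trace, name); return nil }}
	}
	steps := []step{
		ok("WaitOnPermit"),
		{"PreBind", func() error { return errNotReady }},
		ok("Bind"),
		ok("PostBind"),
	}
	err := bindingCycle(steps, func() { trace = append(trace, "Unreserve") })
	fmt.Println(trace, err)
}
```

In this run the failing `PreBind` step triggers the rollback, so `Bind` and `PostBind` never execute.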

## Configuring and assembling the scheduler

The scheduler codebase spans several locations. The most notable ones are:
- [cmd/kube-scheduler/app](https://github.com/kubernetes/kubernetes/tree/a651804427dd9a15bb91e1c4fb7a79994e4817a2/cmd/kube-scheduler/app):
  location of the controller code alongside definition of CLI arguments (honors the standard setup for all Kubernetes controllers)
- [pkg/scheduler](https://github.com/kubernetes/kubernetes/tree/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler):
  the default scheduler codebase root directory
- [pkg/scheduler/core](https://github.com/kubernetes/kubernetes/tree/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/core):
  location of the default scheduling algorithm
- [pkg/scheduler/framework](https://github.com/kubernetes/kubernetes/tree/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/framework):
  scheduling framework alongside plugins
- [pkg/scheduler/internal](https://github.com/kubernetes/kubernetes/tree/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/internal):
  implementation of the cache, queues and other internal elements
- [staging/src/k8s.io/kube-scheduler](https://github.com/kubernetes/kubernetes/tree/a651804427dd9a15bb91e1c4fb7a79994e4817a2/staging/src/k8s.io/kube-scheduler):
  location of ComponentConfig API types
- [test/e2e/scheduling](https://github.com/kubernetes/kubernetes/tree/a651804427dd9a15bb91e1c4fb7a79994e4817a2/test/e2e/scheduling):
  scheduling e2e tests
- [test/integration/scheduler](https://github.com/kubernetes/kubernetes/tree/a651804427dd9a15bb91e1c4fb7a79994e4817a2/test/integration/scheduler):
  scheduling integration tests
- [test/integration/scheduler_perf](https://github.com/kubernetes/kubernetes/tree/a651804427dd9a15bb91e1c4fb7a79994e4817a2/test/integration/scheduler_perf):
  scheduling performance benchmarks

### Initial startup configuration

Code under `cmd/kube-scheduler/app` is responsible for collecting scheduler
configuration and initializing logic allowing the kube-scheduler to run
as part of the Kubernetes control plane. The code includes:
- Initializing [command line options](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/cmd/kube-scheduler/app/server.go#L96)
  (along with a default `ComponentConfig`) and [validation](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/cmd/kube-scheduler/app/server.go#L300)
- Initializing [metrics](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/cmd/kube-scheduler/app/server.go#L238)
  (`/metrics`), [health check](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/cmd/kube-scheduler/app/server.go#L268)
  (`/healthz`) and [other handlers](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/cmd/kube-scheduler/app/server.go#L225-L236)
  (authorization, authentication, panic recovery, etc.)
- Reading and defaulting configuration of [KubeSchedulerConfiguration](https://github.com/kubernetes/kubernetes/blob/4740173f3378ef9d0dc59b0aa9299444a97d0818/pkg/scheduler/apis/config/types.go#L49-L106)
- Building a registry with plugins (in-tree, [out-of-tree](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/cmd/kube-scheduler/app/server.go#L312-L317))
- Initializing the scheduler with various options such as [profiles, algorithm source, pod back off, etc.](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/cmd/kube-scheduler/app/server.go#L326-L337)
- Invocation of [LogOrWriteConfig](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/cmd/kube-scheduler/app/server.go#L342) which logs the final scheduler configuration for debugging purposes
- Right before running, `/configz` [is registered](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/cmd/kube-scheduler/app/server.go#L141),
  [events broadcaster started](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/cmd/kube-scheduler/app/server.go#L148),
  [leader election initiated](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/cmd/kube-scheduler/app/server.go#L198-L216),
   and [the server](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/cmd/kube-scheduler/app/server.go#L185)
   with all the configured handlers and [informers](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/cmd/kube-scheduler/app/server.go#L192)
   is started.

Once initialized, the scheduler can run.

In more detail, there’s a [Setup](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/cmd/kube-scheduler/app/server.go#L299)
function accomplishing what is essentially
the initialization of the scheduler’s core process.
First, it validates the options that have been passed through (the flags added
in `NewSchedulerCommand()` are set directly on this options struct’s fields).
If the options passed so far don’t raise any errors, it then calls `opts.Config()`
which sets up the final internal settings including secure serving, leader election,
clients, and begins parsing options related to the algorithm source
(like loading config files and initializing empty profiles as well as handling
deprecated options like policy config). The next lines call `c.Complete()` to complete
the config by filling in any empty values. At this point any out-of-tree plugins
are registered by creating a blank registry and adding entries in that registry
for each plugin’s New function. It should be noted that the Registry is simply
a map of plugin names to their factory functions. For the default scheduler,
this step does nothing (because our main function in `cmd/kube-scheduler/scheduler.go`
passes nothing to `NewSchedulerCommand()`).
This means the default set of plugins is initialized in `scheduler.New()`.
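
To make the "map of plugin names to factory functions" idea concrete, here is a minimal sketch. The lower-case types and methods (`registry`, `register`, `merge`, `factory`) are hypothetical simplifications and do not reproduce the framework's actual signatures; `MyCustomPlugin` is an invented plugin name.

```go
package main

import "fmt"

// plugin and factory are simplified stand-ins for the framework's plugin
// interface and a plugin's New function.
type plugin interface{ Name() string }
type factory func() (plugin, error)

// registry maps a plugin name to the function that builds the plugin.
type registry map[string]factory

// register adds a factory and rejects duplicate names.
func (r registry) register(name string, f factory) error {
	if _, exists := r[name]; exists {
		return fmt.Errorf("plugin %q already registered", name)
	}
	r[name] = f
	return nil
}

// merge copies every entry of other into r, failing on a name clash.
func (r registry) merge(other registry) error {
	for name, f := range other {
		if err := r.register(name, f); err != nil {
			return err
		}
	}
	return nil
}

// namedPlugin is a trivial plugin that only knows its name.
type namedPlugin string

func (n namedPlugin) Name() string { return string(n) }

func newFactory(name string) factory {
	return func() (plugin, error) { return namedPlugin(name), nil }
}

func main() {
	inTree := registry{"NodeResourcesFit": newFactory("NodeResourcesFit")}
	outOfTree := registry{"MyCustomPlugin": newFactory("MyCustomPlugin")}
	if err := inTree.merge(outOfTree); err != nil {
		fmt.Println("merge failed:", err)
		return
	}
	p, _ := inTree["MyCustomPlugin"]()
	fmt.Println(len(inTree), p.Name())
}
```

An empty out-of-tree registry, as in the default scheduler, makes the merge a no-op.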

Because the initialization is performed outside the scheduling framework,
different consumers of the framework can initialize the environment differently
to cover their needs. For example, a simulator can inject its own objects
through informers, or custom plugins may be provided instead of the default ones.
Known consumers of the scheduling framework:
- [cluster-autoscaler](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/simulator/scheduler_based_predicates_checker.go#L48-L79)
- [cluster-capacity](https://github.com/kubernetes-sigs/cluster-capacity/blob/8e9c2dcf3644cb5f73fca3d35d4e22899c265ad5/pkg/framework/simulator.go#L370-L383)

### Assembling the scheduler

The code is located under `pkg/scheduler`.
This is where implementation of the default scheduler lives.
Various elements of the scheduler are initialized and put together here:
- Default scheduling options such as node percentage, initial and maximum backoff, profiles
- Scheduler cache and queues
- Scheduling profiles instantiated to tailor a framework for each profile
  to better suit pod placement (each profile defines a set of plugins to use)
- Handler functions for getting the next pod for scheduling (`NextPod`) and error handling (`Error`)

The following steps are taken during the process of creating a scheduler instance:
- Scheduler [cache is initialized](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/scheduler.go#L206)
- Both in-tree and out-of-tree registries with plugins are [merged together](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/scheduler.go#L208-L211)
- Metrics are [registered](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/scheduler.go#L232)
- A [Configurator](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/scheduler.go#L215-L230)
  builds a scheduler instance (wiring the cache, plugin registry,
  scheduling algorithm and other elements together)
- Event handlers [are registered](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/scheduler.go#L273)
  to allow the scheduler to react to changes in PVs,
  PVCs, services and other objects relevant for scheduling (eventually,
  each plugin will define a set of events on which it reacts,
  see [kubernetes/kubernetes#100347](https://github.com/kubernetes/kubernetes/issues/100347)
  for more details).

The following diagram shows how individual elements are connected together
once initialized. Event handlers make sure pods are properly enqueued
in the [scheduling queues](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-scheduling/scheduler_queues.md),
and the cache is updated with pods and nodes
as they change (to provide an up-to-date snapshot). The scheduling algorithm and the binding cycle
have the right instances of the framework available (one instance of the framework per profile).

![Scheduler architecture](default_scheduler_architecture.png "Scheduler architecture")

#### Scheduling framework

Its code is currently located under `pkg/scheduler/framework`.
It contains [various plugins](https://github.com/kubernetes/kubernetes/tree/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/framework/plugins)
responsible for filtering and scoring nodes (among others).
These plugins are used as building blocks for any scheduling algorithm.

When a [plugin is initialized](https://github.com/kubernetes/kubernetes/blob/4740173f3378ef9d0dc59b0aa9299444a97d0818/pkg/scheduler/framework/runtime/framework.go#L310),
it’s passed a [framework handler](https://github.com/kubernetes/kubernetes/blob/4740173f3378ef9d0dc59b0aa9299444a97d0818/pkg/scheduler/framework/runtime/framework.go#L251-L264)
which provides interfaces to access and/or manipulate pods, nodes, clientset,
event recorder and other handlers every plugin needs to implement its functionality.

#### Scheduler cache

The cache is responsible for capturing the current state of a cluster.
It keeps a list of nodes and assumed pods alongside the states of pods and images.
The cache provides methods for reconciling pod and node objects
(invoked through event handlers), keeping the state of the cluster up to date.
This allows the snapshot of a cluster (which pins the cluster state while a scheduling
algorithm is run) to be updated with the latest state at the beginning of each scheduling cycle.

The cache also provides an assume operation, which temporarily stores a pod
in the cache and makes it look as if the pod is already
running on a designated node for all consumers of the snapshot.
The assume operation exists so the scheduler does not have to wait until the pod actually gets updated
on the kube-apiserver side, thus increasing the scheduler’s throughput.
The following operations manipulate the assumed pods:
- `AssumePod`: to signal the scheduling algorithm found a feasible node so the next
  pod can be attempted while the current pod enters the binding cycle
- `FinishBinding`: used to signal Bind finished so the pod can be removed
  from the list of assumed pods
- `ForgetPod`: removes the pod from the list of assumed pods, used in case the pod
  fails to get processed successfully after it was assumed
  (e.g. during `Reserve`, `Permit`, `PreBind` or `Bind` evaluation)
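
The three operations can be modelled with a toy cache. This is a single-goroutine sketch with hypothetical names (`assumeCache` and its methods), not the scheduler cache itself. One deliberate simplification: here `finishBinding` drops the pod from the assumed set immediately, whereas the text above only says the pod *can* be removed after that signal.

```go
package main

import "fmt"

// assumeCache is a hypothetical, lock-free sketch of the assume bookkeeping.
type assumeCache struct {
	podToNode map[string]string // every pod the cache believes is placed
	assumed   map[string]bool   // pods placed optimistically, not yet confirmed
}

func newAssumeCache() *assumeCache {
	return &assumeCache{podToNode: map[string]string{}, assumed: map[string]bool{}}
}

// assumePod records the pod on the node before the API server knows about it.
func (c *assumeCache) assumePod(pod, node string) error {
	if _, ok := c.podToNode[pod]; ok {
		return fmt.Errorf("pod %s is already in the cache", pod)
	}
	c.podToNode[pod] = node
	c.assumed[pod] = true
	return nil
}

// finishBinding marks the bind as done: the pod stays placed but is no
// longer tracked as assumed (simplified, see the note above).
func (c *assumeCache) finishBinding(pod string) {
	delete(c.assumed, pod)
}

// forgetPod rolls the assumption back, freeing the node for other pods.
func (c *assumeCache) forgetPod(pod string) error {
	if !c.assumed[pod] {
		return fmt.Errorf("pod %s is not assumed", pod)
	}
	delete(c.assumed, pod)
	delete(c.podToNode, pod)
	return nil
}

func main() {
	c := newAssumeCache()
	_ = c.assumePod("web-1", "node-a")
	_ = c.assumePod("web-2", "node-b")
	c.finishBinding("web-1") // bind succeeded
	_ = c.forgetPod("web-2") // bind failed, roll back
	fmt.Println(len(c.podToNode), len(c.assumed))
}
```

After the run only `web-1` remains in the cache and nothing is left in the assumed set.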

The cache keeps track of the following three metrics:
- `scheduler_cache_size_assumed_pods`: number of pods in the assumed pods list
- `scheduler_cache_size_pods`: number of pods in the cache
- `scheduler_cache_size_nodes`: number of nodes in the cache

#### Snapshot

The [snapshot](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/internal/cache/snapshot.go)
captures the state of a cluster, carrying information about all nodes
in the cluster and the objects located on each node:
node objects, pods assigned to each node, requested resources of all pods
on each node, each node’s allocatable resources, images pulled and other information needed
to make a scheduling decision. Every time a pod is scheduled,
a snapshot of the current state of the cluster is captured.
This avoids a case where a pod or node gets changed while plugins are processed,
which might lead to data inconsistency as some plugins might get a different
view of the cluster.
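
The reason for pinning the state can be shown with a small sketch. It assumes a naive deep copy and invented types (`nodeInfo`, `takeSnapshot`); how the real snapshot is kept up to date is not covered here.

```go
package main

import "fmt"

// nodeInfo is a hypothetical, trimmed-down view of one node.
type nodeInfo struct {
	allocatableCPU int64    // millicores the node can hand out
	requestedCPU   int64    // millicores requested by pods on the node
	pods           []string // names of pods assigned to the node
}

// takeSnapshot returns a deep copy so that later changes to the live state
// do not leak into the view used by plugins.
func takeSnapshot(live map[string]*nodeInfo) map[string]nodeInfo {
	snap := make(map[string]nodeInfo, len(live))
	for name, n := range live {
		c := *n
		c.pods = append([]string(nil), n.pods...) // copy the slice as well
		snap[name] = c
	}
	return snap
}

func main() {
	live := map[string]*nodeInfo{
		"node-a": {allocatableCPU: 4000, requestedCPU: 1000, pods: []string{"web-1"}},
	}
	snap := takeSnapshot(live)

	// The cluster changes while plugins are still working with the snapshot.
	live["node-a"].requestedCPU += 500
	live["node-a"].pods = append(live["node-a"].pods, "web-2")

	// Every plugin reading snap still sees the state from the cycle start.
	fmt.Println(snap["node-a"].requestedCPU, len(snap["node-a"].pods))
}
```

All plugins in one cycle read the same `snap`, so they cannot disagree about the cluster state even though `live` keeps changing.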

#### Configurator

A [configurator](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/factory.go#L90)
builds the scheduler instance by wiring plugins, cache, queues,
handlers and other elements together. Each profile [is initialized](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/factory.go#L138-L147)
with its own framework (with all frameworks sharing informers, event recorders, etc.).

At this point it’s still possible to have the configurator create the instance
[from a policy file](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/factory.go#L213).
However, this approach is deprecated and will eventually be removed
from the configuration, leaving the kube-scheduler configuration
as the only way to provide the configuration.

#### Default scheduling algorithm

The codebase defines a [ScheduleAlgorithm](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/core/generic_scheduler.go#L61-L66)
interface.
Any implementation of the interface can be used as a scheduling algorithm.
The interface has two methods:
- `Schedule`: responsible for scheduling a pod using plugins from `PreFilter`
  up to `NormalizeScore` extension points, provides [ScheduleResult](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/core/generic_scheduler.go#L70-L77)
  containing a scheduling decision (the most suitable node) with additional
  accompanying information such as how many nodes were evaluated
  and how many nodes were found feasible for scheduling.
- `Extenders`: currently exposed only for testing

Each cycle of the default algorithm implementation consists of:
1. Taking the [current snapshot](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/core/generic_scheduler.go#L101)
   from the scheduling cache
1. [Filter out all nodes](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/core/generic_scheduler.go#L110)
   not feasible for scheduling a pod
   1. Run [PreFilter plugins](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/core/generic_scheduler.go#L230)
      first (preprocessing phase, e.g. computing pod [anti-]affinity relations)
   1. Run [Filter plugins](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/core/generic_scheduler.go#L261) in parallel:
      filter out all nodes which do not satisfy the pod’s constraints
      (e.g. sufficient resources, node affinity, etc.), including running filter extenders
   1. Run [PostFilter plugins](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/scheduler.go#L479)
      if no node can fit the incoming pod
1. In case there are at least two feasible nodes for scheduling, run [scoring plugins](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/core/generic_scheduler.go#L133):
   1. Run [PreScore plugins](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/core/generic_scheduler.go#L427)
      first (preprocessing phase)
   1. Run [Score plugins](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/core/generic_scheduler.go#L433) in parallel:
      each node is given a score vector (each coordinate corresponding to one plugin)
   1. Run [NormalizeScore plugins](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/framework/runtime/framework.go#L798):
      so that the scores of every plugin fall within the [0, 100] interval
   1. Compute [weighted score](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/framework/runtime/framework.go#L810-L828)
      for each node (each score plugin can have
      a weight assigned indicating how much its score is preferred over others)
   1. Run [score extenders](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/core/generic_scheduler.go#L456)
      and add their scores to the total score of each node
1. [Select](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/core/generic_scheduler.go#L138)
   and [give back a node](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/core/generic_scheduler.go#L141-L145)
   with the highest score. If there’s only a [single feasible node](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/core/generic_scheduler.go#L125-L131),
   skip the `PreScore`, `Score` and `NormalizeScore` extension points
   and give back the node right away. If there’s no feasible node, report it.
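
The filter, score, normalize, weight and select steps above can be condensed into a sketch. It is an illustration under simplifying assumptions, not the default algorithm: all types and functions (`node`, `filterFn`, `scorer`, `normalize`, `schedule`) are hypothetical, plugins run sequentially rather than in parallel, extenders and `PreFilter`/`PreScore` are left out, normalization is a plain rescale to the maximum, and ties go to the first node.

```go
package main

import "fmt"

// node is a hypothetical, minimal description of a node.
type node struct {
	name    string
	freeCPU int64
	zone    string
}

// filterFn reports whether a node is feasible for the pod.
type filterFn func(n node) bool

// scorer is a score function with a weight.
type scorer struct {
	weight int64
	score  func(n node) int64
}

const maxScore = 100

// normalize rescales raw scores so that the largest becomes maxScore.
func normalize(raw []int64) []int64 {
	var highest int64
	for _, s := range raw {
		if s > highest {
			highest = s
		}
	}
	out := make([]int64, len(raw))
	if highest == 0 {
		return out
	}
	for i, s := range raw {
		out[i] = s * maxScore / highest
	}
	return out
}

// schedule returns the name of the best node, or an error if none fits.
func schedule(nodes []node, filters []filterFn, scorers []scorer) (string, error) {
	var feasible []node
nextNode:
	for _, n := range nodes {
		for _, f := range filters {
			if !f(n) {
				continue nextNode
			}
		}
		feasible = append(feasible, n)
	}
	switch len(feasible) {
	case 0:
		return "", fmt.Errorf("0/%d nodes are feasible", len(nodes))
	case 1:
		return feasible[0].name, nil // scoring is skipped
	}
	total := make([]int64, len(feasible))
	for _, sc := range scorers {
		raw := make([]int64, len(feasible))
		for i, n := range feasible {
			raw[i] = sc.score(n)
		}
		for i, s := range normalize(raw) {
			total[i] += s * sc.weight // weighted sum per node
		}
	}
	best := 0
	for i := range total {
		if total[i] > total[best] {
			best = i
		}
	}
	return feasible[best].name, nil
}

func main() {
	nodes := []node{{"node-a", 500, "z1"}, {"node-b", 4000, "z1"}, {"node-c", 2000, "z2"}}
	filters := []filterFn{func(n node) bool { return n.freeCPU >= 1000 }}
	preferZone2 := func(n node) int64 {
		if n.zone == "z2" {
			return 1
		}
		return 0
	}
	scorers := []scorer{
		{weight: 1, score: func(n node) int64 { return n.freeCPU }},
		{weight: 3, score: preferZone2},
	}
	fmt.Println(schedule(nodes, filters, scorers))
}
```

With these inputs `node-a` is filtered out, and the higher weight of the zone preference makes `node-c` win over `node-b` despite having less free CPU.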

Be aware that:
- If a plugin provides score normalization, it needs to return a non-nil value
  when [ScoreExtensions()](https://github.com/kubernetes/kubernetes/blob/a651804427dd9a15bb91e1c4fb7a79994e4817a2/pkg/scheduler/framework/plugins/podtopologyspread/scoring.go#L254-L256) gets invoked
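
A minimal sketch of why the return value matters, assuming the runtime skips normalization when it receives nil. The interfaces are simplified, hypothetical versions of the framework's (the real methods take more arguments), and `runNormalize` only mimics the nil check.

```go
package main

import "fmt"

// scoreExtensions and scorePlugin are simplified stand-ins for the
// framework interfaces of similar names.
type scoreExtensions interface {
	normalizeScore(scores []int64)
}

type scorePlugin interface {
	scoreExtensions() scoreExtensions
}

// withNormalization returns itself, so its normalizeScore gets called.
type withNormalization struct{}

func (p withNormalization) scoreExtensions() scoreExtensions { return p }

// normalizeScore caps every score at 100 (an arbitrary example rule).
func (withNormalization) normalizeScore(scores []int64) {
	for i := range scores {
		if scores[i] > 100 {
			scores[i] = 100
		}
	}
}

// withoutNormalization returns nil, so its scores are used unchanged.
type withoutNormalization struct{}

func (withoutNormalization) scoreExtensions() scoreExtensions { return nil }

// runNormalize mimics a runtime that checks for nil before normalizing.
func runNormalize(p scorePlugin, scores []int64) {
	ext := p.scoreExtensions()
	if ext == nil {
		return
	}
	ext.normalizeScore(scores)
}

func main() {
	a := []int64{40, 250}
	b := []int64{40, 250}
	runNormalize(withNormalization{}, a)
	runNormalize(withoutNormalization{}, b)
	fmt.Println(a, b)
}
```

A plugin that implements normalization but returns nil here would silently have its normalization skipped.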