Nomad Scheduler
This package holds the logic behind Nomad schedulers. The Scheduler interface
is implemented by two objects: GenericScheduler and SystemScheduler.
The CoreScheduler object also implements this interface, but its use is
purely internal: the core scheduler does not schedule any user jobs.
The Nomad scheduler's task is, given an evaluation, to produce a plan that places the desired allocations on feasible nodes. Consult the Nomad documentation for more details.
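For orientation, here is a minimal sketch of that relationship. The types below are simplified stand-ins; the real interface, its GenericScheduler and SystemScheduler implementations, and the richer evaluation and plan types live in this package and in nomad/structs.

```go
// Minimal sketch only: stand-in types, not the package's real definitions.
package sketch

// Evaluation and Plan stand in for the richer structs.Evaluation and
// structs.Plan types used by the real schedulers.
type Evaluation struct{ JobID string }
type Plan struct{ NodeAllocation map[string][]string } // node ID -> allocation IDs

// Scheduler captures the idea described above: given an evaluation,
// produce (and submit) a plan placing allocations on feasible nodes.
type Scheduler interface {
	Process(eval *Evaluation) error
}
```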
The diagram below illustrates this process for the service and system schedulers in more detail:
                           +--------------+     +-----------+     +-----+     +------------+     +----------+
                           |   cluster    |     |feasibility|     |     |     |   score    |     |   plan   |
Service and batch jobs:    |reconciliation|---->|   check   |---->| fit |---->| assignment |---->|submission|
                           +--------------+     +-----------+     +-----+     +------------+     +----------+
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
                           +--------------+     +-----------+     +-----+                        +----------+
                           |     node     |     |feasibility|     |     |                        |   plan   |
System and sysbatch jobs:  |reconciliation|---->|   check   |---->| fit |----------------------->|submission|
                           +--------------+     +-----------+     +-----+                        +----------+
Reconciliation
The first step for the service and batch job scheduler is called
"reconciliation," and its logic lives in the scheduler/reconciler package.
There are two reconcilers: the AllocReconciler object for service and batch
jobs, and the Node reconciler used by system and sysbatch jobs.
The task of both reconcilers is to tell the scheduler which desired allocations or deployments should be created, which should be updated destructively or in-place, which should be stopped, and which are disconnected or reconnecting. The reconcilers work in terms of "buckets": they process allocations by sorting them into different sets, and that is how their whole logic is implemented.
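As a rough illustration of the "bucket" idea (the names below are simplified stand-ins, not the exact types in scheduler/reconciler):

```go
// Illustrative sketch only: the real reconciler types are richer.
package sketch

// Allocation is a stand-in for structs.Allocation.
type Allocation struct {
	ID     string
	NodeID string
}

// allocSet is how a "bucket" can be pictured: a set of allocations
// keyed by allocation ID.
type allocSet map[string]*Allocation

// buckets groups the allocation sets described in the vocabulary below;
// the reconciler's decisions (place, stop, migrate, ignore, ...) are
// derived from which set an allocation lands in.
type buckets struct {
	untainted     allocSet
	migrate       allocSet
	lost          allocSet
	disconnecting allocSet
	reconnecting  allocSet
	ignore        allocSet
	expiring      allocSet
}
```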
The following vocabulary is used by the reconcilers:
- "tainted node:" a node is considered "tainted" if allocations must be migrated off of it. These are nodes that are draining or have been drained, but also nodes that are disconnected and should be used to calculate the reconnect timeout. The cluster reconciler commonly also refers to "untainted" allocations, i.e., those that do not require migration and are not on disconnected or reconnecting nodes.
- "paused deployment:" a deployment is paused when it has an explicit paused status, but also when it is pending or initializing.
- the reconciler uses the following 6 "buckets" to categorize allocation sets:
  - "migrating allocations:" allocations that are on nodes that are draining.
  - "lost allocations:" allocations that have expired or exist on lost nodes.
  - "disconnecting allocations:" allocations that are on disconnected nodes which haven't been considered "lost" yet, that is, they are still within their reconnect timeout.
  - "reconnecting allocations:" allocations on nodes that have reconnected.
  - "ignored allocations:" allocations which are in a no-op state; the reconciler will not touch these. They are also not updated in-place: for updates, the reconciler uses additional "buckets" (in the computeUpdates method): "inplace" and "destructive."
  - "expiring allocations:" allocations that cannot be rescheduled, due to the lost configuration of their disconnected clients.
Cluster Reconciler
The following diagram illustrates the logic flow of the cluster reconciler:
+---------+
|Compute()|
+---------+
     |
     |
     |
     v                            deployments are unneeded in 3 cases:
+---------------------------+     1. when they are already successful
| cancelUnneededDeployments |     2. when they are active but reference an older job
+---------------------------+     3. when the job is marked as stopped, but the
     |                               deployment is non-terminal
     v
+-----------+                     if the job is stopped, we stop
|handle stop|                     all allocations and handle the
+-----------+                     lost allocations.
     |
     |
     |                            for every task group, this method
     |                            calls computeGroup, which returns
+----+----------------------+     "true" if the deployment is complete
| computeDeploymentComplete |     for the task group.
+----+----------------------+     computeDeploymentComplete itself
     |                            returns a boolean and a
     |                            ReconcileResults object.
     |
     |                            +--------------+
     +--------------------------->| computeGroup |
     |                            +--------------+
     |
     |                            contains the main, and most complex, part
     |                            of the reconciler. it calls many helper
     |                            methods:
     |                            - filterOldTerminalAllocs: allocs that
     |                              are terminal or from an older job version
     |                              are put into the "ignore" bucket
     |                            - cancelUnneededCanaries
     |                            - filterByTainted: results in the buckets
     |                              mentioned in the paragraphs above:
     |                              untainted, migrate, lost, disconnecting,
     |                              reconnecting, ignore and expiring.
     |                            - filterByRescheduleable: updates the
     |                              untainted bucket and creates 2 new ones:
     |                              rescheduleNow and rescheduleLater
     |                            - reconcileReconnecting: returns which
     |                              allocs should be marked for reconnecting
     |                              and which should be stopped
     |                            - computeStop
     |                            - computeCanaries
     |                            - computePlacements: allocs are placed if
     |                              the deployment is not paused or failed,
     |                              they are not canaries (unless promoted),
     |                              or the previous alloc was lost
     |                            - placeAllocs
     |                            - computeDestructiveUpdates
     |                            - computeMigrations
     |                            - createDeployment
     |
     v
+-----------------------------+
|setDeploymentStatusAndUpdates|
+-----------------------------+
     |
     |                            for complete deployments, it
     |                            handles the multi-region case and
     v                            sets the deploymentUpdates
+------------------------+
|return *ReconcileResults|
+------------------------+
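A heavily condensed sketch of that flow in code form. The method names follow the diagram, but the signatures and types below are simplified stand-ins, not the real implementation:

```go
// Illustrative sketch only: the real reconciler carries much more state.
package sketch

// ReconcileResults is a stand-in for the reconciler's result type.
type ReconcileResults struct {
	DeploymentComplete bool
}

// clusterReconcilerSketch loosely mirrors the cluster reconciler.
type clusterReconcilerSketch struct {
	jobStopped bool
	taskGroups []string
}

func (r *clusterReconcilerSketch) cancelUnneededDeployments()           {}
func (r *clusterReconcilerSketch) stopAllAllocsAndHandleLost()          {}
func (r *clusterReconcilerSketch) computeGroup(tg string) bool          { return true }
func (r *clusterReconcilerSketch) setDeploymentStatusAndUpdates(c bool) {}

// Compute mirrors the top-to-bottom flow of the diagram above.
func (r *clusterReconcilerSketch) Compute() *ReconcileResults {
	// Drop deployments that are already successful, reference an older
	// job, or belong to a stopped job.
	r.cancelUnneededDeployments()

	// If the job is stopped, stop all allocations and handle lost ones.
	if r.jobStopped {
		r.stopAllAllocsAndHandleLost()
	}

	// computeDeploymentComplete: run computeGroup for every task group;
	// the deployment is complete only if every group reports completion.
	complete := true
	for _, tg := range r.taskGroups {
		if !r.computeGroup(tg) {
			complete = false
		}
	}

	// For complete deployments, set the final status and updates
	// (including the multi-region case).
	r.setDeploymentStatusAndUpdates(complete)

	return &ReconcileResults{DeploymentComplete: complete}
}
```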
Node Reconciler
The system scheduler also does a "reconciliation" step, but only on a per-node basis (system jobs run on all feasible nodes). This makes it simpler than the service reconciler, which takes the whole cluster into account and handles jobs that can run on an arbitrary subset of clients.
Node reconciliation removes tainted nodes, updates terminal allocations to lost, deals with disconnected nodes and computes placements.
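A simplified sketch of the placement side of this per-node pass (the names below are illustrative, not the real API):

```go
// Illustrative sketch only.
package sketch

// nodeAlloc is a stand-in for an allocation as seen by the node
// reconciler for a single system job task group on one node.
type nodeAlloc struct {
	JobVersion uint64
	Terminal   bool
}

// needsPlacement decides, for one feasible node, whether a new
// placement is required: the node has no allocation for the job, or
// the existing one is terminal or belongs to an older job version.
func needsPlacement(existing *nodeAlloc, currentJobVersion uint64) bool {
	switch {
	case existing == nil:
		return true // nothing running on this node yet
	case existing.Terminal:
		return true // previous alloc is done or lost; replace it
	case existing.JobVersion != currentJobVersion:
		return true // update to the newer job version
	default:
		return false // up to date; nothing to do for this node
	}
}
```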
Finding the right node
The scheduler/feasible package contains all the logic used to find the
right nodes to place workloads on.
Feasibility checking
Nomad uses a set of iterators to iterate over nodes and check how feasible they
are for any given allocation. The scheduler uses a Stack interface, which lives
in the scheduler/feasible/stack.go file, to make placement decisions,
and feasibility iterators that live in scheduler/feasible/feasible.go to
filter by:
- node eligibility,
- data center,
- and node pool.
Once nodes are filtered, the Stack implementations (GenericStack and
SystemStack) check for:
- drivers,
- job constraints,
- devices,
- volumes,
- networking,
- affinities,
- and quotas.
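A rough sketch of the iterator pattern those checks follow. The type and field names here are illustrative, not the actual iterators in scheduler/feasible:

```go
// Illustrative sketch only.
package sketch

// Node is a stand-in for a Nomad node with the attributes a checker
// might look at.
type Node struct {
	ID      string
	Drivers map[string]bool
}

// nodeIterator yields candidate nodes one at a time; nil means the
// source is exhausted.
type nodeIterator interface {
	Next() *Node
}

// driverChecker wraps another iterator and only yields nodes that
// have the required driver, mirroring how feasibility checks are
// chained before ranking happens.
type driverChecker struct {
	source nodeIterator
	driver string
}

func (c *driverChecker) Next() *Node {
	for {
		n := c.source.Next()
		if n == nil {
			return nil // no more candidates
		}
		if n.Drivers[c.driver] {
			return n // feasible for this check; pass it downstream
		}
		// infeasible for this check: skip it and keep scanning
	}
}
```

Checks for constraints, devices, volumes, networking, and so on can be chained the same way, which is roughly what the Stack implementations assemble.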
Finding the best fit and scoring
This step applies only to service and batch jobs, since system and sysbatch jobs are placed on all feasible nodes.
This part of scheduling sits in the scheduler/feasible/rank.go file. The
RankIterator interface, implemented by, for example, SpreadIterator and
BinPackIterator, captures the ranking logic in the Next() method of each
implementation.
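As an illustration of what a ranking stage computes, here is a toy bin-packing score that a RankIterator-style Next() could attach to each feasible node. The formula is illustrative only, not Nomad's actual scoring:

```go
// Illustrative sketch only.
package sketch

// binPackScore returns a higher score the fuller a node would be after
// placing the requested CPU, so that tightly packed nodes are preferred;
// a negative score means the request does not fit at all.
func binPackScore(usedMHz, requestedMHz, capacityMHz float64) float64 {
	if capacityMHz <= 0 || usedMHz+requestedMHz > capacityMHz {
		return -1
	}
	return (usedMHz + requestedMHz) / capacityMHz
}
```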