Nomad Scheduler
This package holds the logic behind Nomad schedulers. The Scheduler interface
is implemented by two objects: GenericScheduler and SystemScheduler.
The CoreScheduler object also implements this interface, but its use is
purely internal: the core scheduler does not schedule any user jobs.
The Nomad scheduler's task is, given an evaluation, to produce a plan that places the desired allocations on feasible nodes. Consult the Nomad documentation for more details.
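For orientation, here is a minimal sketch of that relationship. The types below are simplified stand-ins; the real interface, its GenericScheduler and SystemScheduler implementations, and the richer evaluation and plan types live in this package and in nomad/structs.

```go
// Minimal sketch only: stand-in types, not the package's real definitions.
package sketch

// Evaluation and Plan stand in for the richer structs.Evaluation and
// structs.Plan types used by the real schedulers.
type Evaluation struct{ JobID string }
type Plan struct{ NodeAllocation map[string][]string } // node ID -> allocation IDs

// Scheduler captures the idea described above: given an evaluation,
// produce (and submit) a plan placing allocations on feasible nodes.
type Scheduler interface {
	Process(eval *Evaluation) error
}
```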
The diagram below illustrates this process for the service and system schedulers in more detail:
                           +--------------+     +-----------+     +-----+     +------------+     +----------+
                           |   cluster    |     |feasibility|     |     |     |   score    |     |   plan   |
Service and batch jobs:    |reconciliation|---->|   check   |---->| fit |---->| assignment |---->|submission|
                           +--------------+     +-----------+     +-----+     +------------+     +----------+
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
                           +--------------+     +-----------+     +-----+                        +----------+
                           |     node     |     |feasibility|     |     |                        |   plan   |
System and sysbatch jobs:  |reconciliation|---->|   check   |---->| fit |----------------------->|submission|
                           +--------------+     +-----------+     +-----+                        +----------+
Reconciliation
The first step for the service and batch job scheduler is called
"reconciliation," and its logic lives in the scheduler/reconciler package.
There are two reconcilers: the AllocReconciler object for service and batch
jobs, and the Node reconciler used by system and sysbatch jobs.
The task of both reconcilers is to tell the scheduler which desired allocations or deployments should be created, which should be updated destructively or in-place, which should be stopped, and which are disconnected or reconnecting. The reconcilers work in terms of "buckets": they process allocations by sorting them into different sets, and that is how their whole logic is implemented.
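As a rough illustration of the "bucket" idea (the names below are simplified stand-ins, not the exact types in scheduler/reconciler):

```go
// Illustrative sketch only: the real reconciler types are richer.
package sketch

// Allocation is a stand-in for structs.Allocation.
type Allocation struct {
	ID     string
	NodeID string
}

// allocSet is how a "bucket" can be pictured: a set of allocations
// keyed by allocation ID.
type allocSet map[string]*Allocation

// buckets groups the allocation sets described in the vocabulary below;
// the reconciler's decisions (place, stop, migrate, ignore, ...) are
// derived from which set an allocation lands in.
type buckets struct {
	untainted     allocSet
	migrate       allocSet
	lost          allocSet
	disconnecting allocSet
	reconnecting  allocSet
	ignore        allocSet
	expiring      allocSet
}
```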
The following vocabulary is used by the reconcilers:
- "tainted node:" a node is considered "tainted" if allocations must be migrated off of it. These are nodes that are draining or have been drained, but also nodes that are disconnected and should be used to calculate the reconnect timeout. The cluster reconciler commonly also refers to "untainted" allocations, i.e., those that do not require migration and are not on disconnected or reconnecting nodes.
- "paused deployment:" a deployment is paused when it has an explicit paused status, but also when it is pending or initializing.
- the reconciler uses the following 6 "buckets" to categorize allocation sets:
  - "migrating allocations:" allocations that are on nodes that are draining.
  - "lost allocations:" allocations that have expired or exist on lost nodes.
  - "disconnecting allocations:" allocations that are on disconnected nodes which haven't been considered "lost" yet, that is, they are still within their reconnect timeout.
  - "reconnecting allocations:" allocations on nodes that have reconnected.
  - "ignored allocations:" allocations which are in a no-op state; the reconciler will not touch these. They are also not updated in-place: for updates, the reconciler uses additional "buckets" (in the computeUpdates method): "inplace" and "destructive."
  - "expiring allocations:" allocations that cannot be rescheduled, due to the lost configuration of their disconnected clients.
Cluster Reconciler
The following diagram illustrates the logic flow of the cluster reconciler:
+---------+
|Compute()|
+---------+
     |
     |
     |
     v                            deployments are unneeded in 3 cases:
+---------------------------+     1. when they are already successful
| cancelUnneededDeployments |     2. when they are active but reference an older job
+---------------------------+     3. when the job is marked as stopped, but the
     |                               deployment is non-terminal
     v
+-----------+                     if the job is stopped, we stop
|handle stop|                     all allocations and handle the
+-----------+                     lost allocations.
     |
     |
     |                            for every task group, this method
     |                            calls computeGroup, which returns
+----+----------------------+     "true" if the deployment is complete
| computeDeploymentComplete |     for the task group.
+----+----------------------+     computeDeploymentComplete itself
     |                            returns a boolean and a
     |                            ReconcileResults object.
     |
     |                            +--------------+
     +--------------------------->| computeGroup |
     |                            +--------------+
     |
     |                            contains the main, and most complex, part
     |                            of the reconciler. it calls many helper
     |                            methods:
     |                            - filterOldTerminalAllocs: allocs that
     |                              are terminal or from an older job version
     |                              are put into the "ignore" bucket
     |                            - cancelUnneededCanaries
     |                            - filterByTainted: results in the buckets
     |                              mentioned in the paragraphs above:
     |                              untainted, migrate, lost, disconnecting,
     |                              reconnecting, ignore and expiring.
     |                            - filterByRescheduleable: updates the
     |                              untainted bucket and creates 2 new ones:
     |                              rescheduleNow and rescheduleLater
     |                            - reconcileReconnecting: returns which
     |                              allocs should be marked for reconnecting
     |                              and which should be stopped
     |                            - computeStop
     |                            - computeCanaries
     |                            - computePlacements: allocs are placed if
     |                              the deployment is not paused or failed,
     |                              they are not canaries (unless promoted),
     |                              or the previous alloc was lost
     |                            - placeAllocs
     |                            - computeDestructiveUpdates
     |                            - computeMigrations
     |                            - createDeployment
     |
     v
+-----------------------------+
|setDeploymentStatusAndUpdates|
+-----------------------------+
     |
     |                            for complete deployments, it
     |                            handles the multi-region case and
     v                            sets the deploymentUpdates
+------------------------+
|return *ReconcileResults|
+------------------------+
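A heavily condensed sketch of that flow in code form. The method names follow the diagram, but the signatures and types below are simplified stand-ins, not the real implementation:

```go
// Illustrative sketch only: the real reconciler carries much more state.
package sketch

// ReconcileResults is a stand-in for the reconciler's result type.
type ReconcileResults struct {
	DeploymentComplete bool
}

// clusterReconcilerSketch loosely mirrors the cluster reconciler.
type clusterReconcilerSketch struct {
	jobStopped bool
	taskGroups []string
}

func (r *clusterReconcilerSketch) cancelUnneededDeployments()           {}
func (r *clusterReconcilerSketch) stopAllAllocsAndHandleLost()          {}
func (r *clusterReconcilerSketch) computeGroup(tg string) bool          { return true }
func (r *clusterReconcilerSketch) setDeploymentStatusAndUpdates(c bool) {}

// Compute mirrors the top-to-bottom flow of the diagram above.
func (r *clusterReconcilerSketch) Compute() *ReconcileResults {
	// Drop deployments that are already successful, reference an older
	// job, or belong to a stopped job.
	r.cancelUnneededDeployments()

	// If the job is stopped, stop all allocations and handle lost ones.
	if r.jobStopped {
		r.stopAllAllocsAndHandleLost()
	}

	// computeDeploymentComplete: run computeGroup for every task group;
	// the deployment is complete only if every group reports completion.
	complete := true
	for _, tg := range r.taskGroups {
		if !r.computeGroup(tg) {
			complete = false
		}
	}

	// For complete deployments, set the final status and updates
	// (including the multi-region case).
	r.setDeploymentStatusAndUpdates(complete)

	return &ReconcileResults{DeploymentComplete: complete}
}
```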
Node Reconciler
The system scheduler also does a "reconciliation" step, but only on a per-node basis (system jobs run on all feasible nodes). This makes it simpler than the service reconciler, which takes the whole cluster into account and handles jobs that can run on an arbitrary subset of clients.
Node reconciliation removes tainted nodes, updates terminal allocations to lost, deals with disconnected nodes and computes placements.
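A simplified sketch of the placement side of this per-node pass (the names below are illustrative, not the real API):

```go
// Illustrative sketch only.
package sketch

// nodeAlloc is a stand-in for an allocation as seen by the node
// reconciler for a single system job task group on one node.
type nodeAlloc struct {
	JobVersion uint64
	Terminal   bool
}

// needsPlacement decides, for one feasible node, whether a new
// placement is required: the node has no allocation for the job, or
// the existing one is terminal or belongs to an older job version.
func needsPlacement(existing *nodeAlloc, currentJobVersion uint64) bool {
	switch {
	case existing == nil:
		return true // nothing running on this node yet
	case existing.Terminal:
		return true // previous alloc is done or lost; replace it
	case existing.JobVersion != currentJobVersion:
		return true // update to the newer job version
	default:
		return false // up to date; nothing to do for this node
	}
}
```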
Finding the right node
The scheduler/feasible package contains all the logic used to find the
right nodes to place workloads on.
Feasibility checking
Nomad uses a set of iterators to iterate over nodes and check how feasible they
are for any given allocation. The scheduler uses a Stack interface, which lives
in the scheduler/feasible/stack.go file, to make placement decisions,
and feasibility iterators that live in scheduler/feasible/feasible.go to
filter by:
- node eligibility,
- data center,
- and node pool.
Once nodes are filtered, the Stack implementations (GenericStack and
SystemStack) check for:
- drivers,
- job constraints,
- devices,
- volumes,
- networking,
- affinities,
- and quotas.
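A rough sketch of the iterator pattern those checks follow. The type and field names here are illustrative, not the actual iterators in scheduler/feasible:

```go
// Illustrative sketch only.
package sketch

// Node is a stand-in for a Nomad node with the attributes a checker
// might look at.
type Node struct {
	ID      string
	Drivers map[string]bool
}

// nodeIterator yields candidate nodes one at a time; nil means the
// source is exhausted.
type nodeIterator interface {
	Next() *Node
}

// driverChecker wraps another iterator and only yields nodes that
// have the required driver, mirroring how feasibility checks are
// chained before ranking happens.
type driverChecker struct {
	source nodeIterator
	driver string
}

func (c *driverChecker) Next() *Node {
	for {
		n := c.source.Next()
		if n == nil {
			return nil // no more candidates
		}
		if n.Drivers[c.driver] {
			return n // feasible for this check; pass it downstream
		}
		// infeasible for this check: skip it and keep scanning
	}
}
```

Checks for constraints, devices, volumes, networking, and so on can be chained the same way, which is roughly what the Stack implementations assemble.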
Finding the best fit and scoring
This step applies only to service and batch jobs, since system and sysbatch jobs are placed on all feasible nodes.
This part of scheduling sits in the scheduler/feasible/rank.go file. The
RankIterator interface, implemented by, for example, SpreadIterator and
BinPackIterator, captures the ranking logic in the Next() method of each
implementation.
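As an illustration of what a ranking stage computes, here is a toy bin-packing score that a RankIterator-style Next() could attach to each feasible node. The formula is illustrative only, not Nomad's actual scoring:

```go
// Illustrative sketch only.
package sketch

// binPackScore returns a higher score the fuller a node would be after
// placing the requested CPU, so that tightly packed nodes are preferred;
// a negative score means the request does not fit at all.
func binPackScore(usedMHz, requestedMHz, capacityMHz float64) float64 {
	if capacityMHz <= 0 || usedMHz+requestedMHz > capacityMHz {
		return -1
	}
	return (usedMHz + requestedMHz) / capacityMHz
}
```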