From ce054aae96f28131ee9f5199189ae9a86701f801 Mon Sep 17 00:00:00 2001
From: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>
Date: Thu, 5 Jun 2025 15:36:17 +0200
Subject: [PATCH] scheduler: add a readme and start documenting low level
 implementation details (#25986)

In an effort to improve the readability and maintainability of the
nomad/scheduler package, we begin with a README file that describes its
operation in more detail than the official documentation does. This PR will
be followed by a few small ones that move code around within that package,
improve variable naming, and keep the readme up to date.

---------

Co-authored-by: Tim Gross
---
 scheduler/README.md | 206 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 206 insertions(+)
 create mode 100644 scheduler/README.md

diff --git a/scheduler/README.md b/scheduler/README.md
new file mode 100644
index 000000000..662b25dc5
--- /dev/null
+++ b/scheduler/README.md
@@ -0,0 +1,206 @@

# Nomad Scheduler

This package holds the logic behind Nomad schedulers. The `Scheduler`
interface is implemented by two objects:

- `GenericScheduler` and
- `SystemScheduler`.

The `CoreScheduler` object also implements this interface, but its use is
purely internal: the core scheduler does not schedule any user jobs.

Given an evaluation, the Nomad scheduler's task is to produce a plan that
places the desired allocations on feasible nodes. Consult the [Nomad
documentation][0] for more details.

The diagram below illustrates this process for the service and system
schedulers in more detail:

```
                            +--------------+     +-----------+     +-------+     +------------+     +----------+
                            |   cluster    |     |feasibility|     |       |     |   score    |     |   plan   |
  Service and batch jobs:   |reconciliation|---->|   check   |---->|  fit  |---->| assignment |---->|submission|
                            +--------------+     +-----------+     +-------+     +------------+     +----------+

 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

                            +--------------+     +-----------+     +-------+                        +----------+
                            |     node     |     |feasibility|     |       |                        |   plan   |
 System and sysbatch jobs:  |reconciliation|---->|   check   |---->|  fit  |----------------------->|submission|
                            +--------------+     +-----------+     +-------+                        +----------+
```

## Cluster reconciliation

The first step for the service and batch job scheduler is called
"reconciliation," and its logic lives in the `scheduler/reconcile.go` file.
The `allocReconciler` object has one public method, `Compute`, which takes no
arguments and returns a `reconcileResults` object. The results object tells
the scheduler which deployment to create or update, which allocations to
place, which to update destructively or in place, which to stop, and which
are disconnected or reconnecting. The reconciler works in terms of "buckets:"
it sorts allocations into different sets, and those sets drive the rest of
its logic.
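To make the shape of that contract concrete, here is a minimal sketch. The
real `allocReconciler` and `reconcileResults` in `scheduler/reconcile.go`
carry many more fields and richer types; everything below is an illustrative
placeholder, not the actual API:

```go
package scheduler

// allocation is an illustrative stand-in for Nomad's allocation structs.
type allocation struct {
	ID   string
	Name string
}

// reconcileResults mirrors the idea described above: every existing
// allocation ends up in exactly one bucket, and each bucket drives a
// different plan action. Field names here are illustrative only.
type reconcileResults struct {
	place             []allocation // allocations the scheduler must create
	inplaceUpdate     []allocation // update in place, no replacement needed
	destructiveUpdate []allocation // stop the old allocation, place a new one
	stop              []allocation // stop and do not replace
}

// allocReconciler is constructed with everything it needs (job, existing
// allocations, node state), which is why Compute takes no arguments.
type allocReconciler struct {
	existing []allocation
}

// Compute buckets every existing allocation and returns the result.
func (r *allocReconciler) Compute() *reconcileResults {
	res := &reconcileResults{}
	// The real implementation walks r.existing and appends each allocation
	// to exactly one of the buckets above.
	return res
}
```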
The following vocabulary is used by the reconciler:

- "tainted node:" a node is considered "tainted" if allocations must be
  migrated off of it. These are nodes that are draining or have been drained,
  but also nodes that are disconnected and should be used to calculate the
  reconnect timeout. The cluster reconciler also commonly refers to
  "untainted" allocations, i.e., those that do not require migration and are
  not on disconnected or reconnecting nodes.

- "paused deployment:" a deployment is paused when it has an explicit
  `paused` status, but also when it is pending or initializing.

- the reconciler uses the following six "buckets" to categorize allocation
  sets (see the sketch after the diagram below):

  - "migrating allocations:" allocations that are on nodes that are draining.

  - "lost allocations:" allocations that have expired or exist on lost nodes.

  - "disconnecting allocations:" allocations that are on disconnected nodes
    but have not been considered "lost" yet, that is, they are still within
    their reconnect timeout.

  - "reconnecting allocations:" allocations on nodes that have reconnected.

  - "ignored allocations:" allocations that are in a no-op state; the
    reconciler will not touch these. Note that these are not the allocations
    updated in place: for updates, the reconciler uses two additional
    "buckets" (in the `computeUpdates` method), "inplace" and "destructive."

  - "expiring allocations:" allocations that cannot be rescheduled because
    the configuration of their disconnected clients has been lost.

The following diagram illustrates the logic flow of the cluster reconciler:

```
                             +---------+
                             |Compute()|
                             +---------+
                                  |
                                  v
                           +------------+
                           |create a new|    allocMatrix is created from existing
                           | allocation |    allocations for a job, and is a map
                           |  matrix m  |    of task groups to allocation sets.
                           +------------+
                                  |
                                  |               deployments are unneeded in 3 cases:
                                  v               1. when they are already successful
                    +---------------------------+ 2. when they are active but reference
                    |cancelUnneededDeployments()|    an older job
                    +---------------------------+ 3. when the job is marked as stopped,
                                  |                  but the deployment is non-terminal
                                  v
                            +-----------+       if the job is stopped, we stop
                            |handle stop|       all allocations and handle the
                            +-----------+       lost allocations.
                                  |
                                  v
                     +-------------------------+  sets the deploymentPaused and
                     |computeDeploymentPaused()|  deploymentFailed fields on the
                     +-------------------------+  reconciler.
                                  |
                                  |               for every task group, this method
                                  v               calls computeGroup, which returns
                   +----------------------------+ "true" if the deployment is complete
                   |computeDeploymentComplete(m)| for the task group.
                   +----------------------------+ computeDeploymentComplete itself
                                  |               returns a boolean.
                                  |
                                  |    +------------------------------------+
                                  +--->|    computeGroup(groupName, all     |
                                  |    |            allocations)            |
                                  |    +------------------------------------+
                                  |
                                  |    computeGroup contains the main, and most
                                  |    complex, part of the reconciler. it calls
                                  |    many helper methods:
                                  |    - filterOldTerminalAllocs: allocs that are
                                  |      terminal or from older job versions are
                                  |      put into the "ignore" bucket
                                  |    - cancelUnneededCanaries
                                  |    - filterByTainted: sorts allocs into the
                                  |      buckets described above: untainted,
                                  |      migrate, lost, disconnecting,
                                  |      reconnecting, ignore, and expiring
                                  |    - filterByRescheduleable: updates the
                                  |      untainted bucket and creates 2 new ones:
                                  |      rescheduleNow and rescheduleLater
                                  |    - reconcileReconnecting: returns which
                                  |      allocs should be marked for reconnecting
                                  |      and which should be stopped
                                  |    - computeStop
                                  |    - computeCanaries
                                  |    - computePlacements: allocs are placed if
                                  |      the deployment is not paused or failed,
                                  |      they are not canaries (unless promoted),
                                  |      or the previous alloc was lost
                                  |    - computeReplacements
                                  |    - createDeployment
                                  |
                                  v
            +--------------------------------------------+
            |computeDeploymentUpdates(deploymentComplete)|
            +--------------------------------------------+
                                  |
                                  |    for complete deployments, it handles the
                                  |    multi-region case and sets the
                                  v    deploymentUpdates
                          +---------------+
                          |return a.result|
                          +---------------+
```
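The bucketing pattern itself is simple to sketch. The snippet below is not
the real `filterByTainted` (which works on much richer structs and also
produces the reconnecting, ignore, and expiring buckets); it is a minimal,
assumed illustration of how allocations are partitioned by the state of
their node:

```go
package scheduler

// nodeStatus is an illustrative stand-in for Nomad's node state.
type nodeStatus int

const (
	nodeReady nodeStatus = iota
	nodeDraining
	nodeDisconnected
	nodeLost
)

// alloc pairs an allocation with the node it runs on.
type alloc struct {
	ID     string
	NodeID string
}

// bucketAllocs places every allocation into exactly one bucket based on its
// node's status, mirroring the reconciler's set-based style.
func bucketAllocs(allocs []alloc, nodes map[string]nodeStatus) (untainted, migrate, lost, disconnecting []alloc) {
	for _, a := range allocs {
		switch nodes[a.NodeID] {
		case nodeDraining:
			migrate = append(migrate, a) // must move off a draining node
		case nodeLost:
			lost = append(lost, a) // node is gone; the alloc must be replaced
		case nodeDisconnected:
			disconnecting = append(disconnecting, a) // within reconnect timeout
		default:
			untainted = append(untainted, a) // nothing to do
		}
	}
	return untainted, migrate, lost, disconnecting
}
```

Downstream steps then operate on whole buckets at a time (stop everything in
one set, place replacements for another), which is what keeps the
reconciler's logic tractable.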
## Feasibility checking

Nomad uses a set of iterators to walk over nodes and check whether they are
feasible for a given allocation. To make placement decisions, the scheduler
uses the `Stack` interface that lives in the `scheduler/stack.go` file,
together with feasibility iterators that live in `scheduler/feasible.go` and
filter by:

- node eligibility,
- datacenter,
- and node pool.

Once nodes are filtered, the `Stack` implementations (`GenericStack` and
`SystemStack`) check for:

- drivers,
- job constraints,
- devices,
- volumes,
- networking,
- affinities,
- and quotas.

## Node reconciliation

The system scheduler also performs a "reconciliation" step, but only on a
per-node basis (system jobs run on all feasible nodes). This makes it simpler
than the service reconciler, which takes the whole cluster into account and
handles jobs that can run on an arbitrary subset of clients. The code is in
the `scheduler/scheduler_system.go` file.

Node reconciliation removes tainted nodes, updates terminal allocations to
lost, deals with disconnected nodes, and computes placements.

## Finding the best fit and scoring

This step applies only to service and batch jobs, since system and sysbatch
jobs are placed on all feasible nodes.

This part of scheduling sits in the `scheduler/rank.go` file. The
`RankIterator` interface, implemented by e.g. `SpreadIterator` and
`BinPackIterator`, captures the ranking logic in its `Next()` method. A
sketch of the pattern follows below.

[0]: https://developer.hashicorp.com/nomad/docs/concepts/scheduling/scheduling
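As a minimal sketch of that iterator pattern (the names below are simplified
stand-ins, not the actual types in `scheduler/rank.go`, and the real
`RankIterator` yields richer values than a bare score):

```go
package scheduler

// rankedNode is an illustrative stand-in for the scheduler's ranked options.
type rankedNode struct {
	NodeID string
	Score  float64
}

// rankIterator mirrors the shape described above: each call to Next yields
// the next scored candidate, or nil once the options are exhausted.
type rankIterator interface {
	Next() *rankedNode
}

// maxScore drains an iterator and keeps the best-scoring node, which is
// roughly what happens before plan submission.
func maxScore(it rankIterator) *rankedNode {
	var best *rankedNode
	for option := it.Next(); option != nil; option = it.Next() {
		if best == nil || option.Score > best.Score {
			best = option
		}
	}
	return best
}
```

Expressing ranking as chained iterators keeps scoring lazy: a node is only
scored if it has already survived every upstream feasibility filter.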