37 Commits

Author SHA1 Message Date
Piotr Kazmierczak
f9b95ae896 scheduler: account for infeasible nodes when reconciling system jobs (#26868)
The node reconciler never took node feasibility into account. When nodes were
excluded from allocation placement (for example, because constraints were not
met), the desired total and desired canary numbers were never updated in the
reconciler to account for that. Thus, deployments would never become
successful.
2025-10-02 16:17:46 +02:00
Piotr Kazmierczak
eaa0fe0e27 scheduler: always set the right deployment status for system jobs that require promotion (#26851)
In cases where system jobs had the same number of canary allocations
deployed as there were eligible nodes, the scheduler would incorrectly
mark the deployment as complete, as if auto promotion was set. This edge
case uncovered a bug in the setDeploymentStatusAndUpdates method, and
since we round up canary nodes, it may not be such an edge case
after all.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-09-30 09:18:59 +02:00
Piotr Kazmierczak
46dfd9d992 scheduler: do not create deployments for system job reschedules (#26789)
System jobs that get rescheduled should not get new deployments.
2025-09-18 14:54:54 +02:00
Tim Gross
ce614e6b7a scheduler: upgrade block testing for system deployments (#26579)
This changeset adds system scheduler tests of various permutations of the `update`
block. It also fixes a number of bugs discovered in the process.

* Don't create deployment for in-flight rollout. If a system job is in the
  middle of a rollout prior to upgrading to a version of Nomad with system
  deployments, we'll end up creating a system deployment which might never
  complete because previously placed allocs will not be tracked. Check to see if
  we have existing allocs that should belong to the new deployment and prevent a
  deployment from being created in that case.
* Ensure we call `Copy` on `Deployment` to avoid state store corruption.
* Don't limit canary counts by `max_parallel`.
* Never create deployments for `sysbatch` jobs.

Ref: https://hashicorp.atlassian.net/browse/NMD-761
2025-09-05 10:22:42 -04:00
Piotr Kazmierczak
a083495240 system scheduler: correction to Test_computeCanaryNodes (#26707) 2025-09-05 16:20:34 +02:00
Piotr Kazmierczak
276ab8a4c6 system scheduler: keep track of previously used canary nodes (#26697)
In the system scheduler, we need to keep track of which nodes were previously
used as "canary nodes" and not pick them at random, in case of previously failed
canaries or changes to the number of canaries in the jobspec.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-09-05 15:32:08 +02:00
Piotr Kazmierczak
14e98a2420 scheduler: fix promotions of system job canaries (#26652)
This changeset adjusts the handling of allocation placement when we're
promoting a deployment, and it corrects the behavior of isDeploymentComplete,
which previously would never mark a promoted deployment as complete.
2025-09-03 16:09:36 +02:00
Piotr Kazmierczak
8b8e21dc0e scheduler: check if system job deploy is complete before other guards (#26651) 2025-08-28 17:29:13 +02:00
Piotr Kazmierczak
de342ee48b scheduler: correct dstate total/canary counts for system deployments (#26641) 2025-08-28 16:24:52 +02:00
Piotr Kazmierczak
ca96de15d0 scheduler: correct handling of MaxParallel and obsoleting Stagger in the system scheduler (#26631) 2025-08-27 09:38:35 +02:00
Piotr Kazmierczak
3d373c9a6a scheduler: support canary deployments for system jobs (#26499)
This changeset introduces canary deployments for system jobs.

Canaries work a little differently for system jobs than for service jobs. The
integer in the update block of a task group is interpreted as a percentage of
eligible nodes that this task group update should be deployed to (rounded up
to the nearest integer, so, e.g., for 5 eligible nodes and canary value set to
50, we will deploy to 3 nodes).

In contrast to service jobs, system job canaries are not tracked, i.e., the
scheduler doesn't need to know which allocations are canaries and which are not,
since any node can only run one system job. Canary deployments are marked for
promotion and if promoted, the scheduler simply performs an update as usual,
replacing allocations belonging to a previous job version, and leaving new ones
intact.
2025-08-22 15:02:40 +02:00
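The rounding rule described in 3d373c9a6a can be sketched as follows. This is a minimal illustration, not the scheduler's code: the helper name `canaryNodeCount` and the clamping of out-of-range inputs are assumptions made for the example.

```go
package main

import "fmt"

// canaryNodeCount is a hypothetical helper (not a Nomad function). It
// interprets canaryPercent as a percentage of eligible nodes and rounds the
// result up to the nearest integer, as the commit message describes.
// Clamping to the 0..100 range is an assumption of this sketch.
func canaryNodeCount(eligibleNodes, canaryPercent int) int {
	if eligibleNodes <= 0 || canaryPercent <= 0 {
		return 0
	}
	if canaryPercent > 100 {
		canaryPercent = 100
	}
	// Integer ceiling of eligibleNodes * canaryPercent / 100.
	return (eligibleNodes*canaryPercent + 99) / 100
}

func main() {
	// 5 eligible nodes at 50% is 2.5, which rounds up to 3.
	fmt.Println(canaryNodeCount(5, 50))
}
```

Because of the round-up, any non-zero percentage selects at least one node when at least one node is eligible.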
Piotr Kazmierczak
0e6e5ef8d1 scheduler: handle deployment completeness in the node reconciler (#26445)
This PR introduces marking deployments as complete if there are no remaining
placements to be made for a given task group.
2025-08-21 18:34:59 +02:00
Piotr Kazmierczak
c33e30596c scheduler: support deployments in the NodeReconciler (#26318)
This is the initial implementation of deployments for the system and sysbatch
reconciler. It does not support updates or canaries at this point; it simply
provides the necessary plumbing for deployments.
2025-08-21 18:34:59 +02:00
Tim Gross
80ddb7392a scheduler: fix debug-level logging for node reconciler (#26583)
In #26169 we started emitting structured logs from the reconciler. But the node
reconciler results are `AllocTuple` structs and not counts, so the information
we put in the logs ends up being pointer addresses in hex. Fix this so that
we're recording the number of allocs in each bucket instead.

Fix another misleading log-line while we're here.

Ref: https://github.com/hashicorp/nomad/pull/26169
2025-08-19 15:17:17 -04:00
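The logging problem fixed in 80ddb7392a can be illustrated with a small sketch. The `allocTuple` and `diffResult` types and the `logFields` method below are simplified stand-ins invented for this example, not the real Nomad types: formatting a slice of struct pointers prints addresses, whereas logging the length of each bucket gives a usable count.

```go
package main

import "fmt"

// allocTuple is a simplified, hypothetical stand-in for the reconciler's
// AllocTuple struct.
type allocTuple struct {
	Name string
}

// diffResult is a hypothetical result with one slice ("bucket") per action.
type diffResult struct {
	Place, Update, Stop []*allocTuple
}

// logFields returns key-value pairs suitable for a structured logger,
// recording the number of allocs in each bucket rather than the slices.
func (d diffResult) logFields() []any {
	return []any{
		"place", len(d.Place),
		"update", len(d.Update),
		"stop", len(d.Stop),
	}
}

func main() {
	d := diffResult{
		Place: []*allocTuple{{Name: "a"}, {Name: "b"}},
		Stop:  []*allocTuple{{Name: "c"}},
	}
	// Formatting the slice directly prints the pointer values in hex.
	fmt.Printf("place=%v\n", d.Place)
	// Logging counts instead gives readable output.
	fmt.Println(d.logFields()...)
}
```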
Tim Gross
4ce937884d scheduler: move result mutation into computeStop (#26351)
The `computeStop` method returns two values that only get used to mutate the
result and the untainted set. Move the mutation into the method to match the
work done in #26325.

Ref: https://github.com/hashicorp/nomad/pull/26325
Ref: https://hashicorp.atlassian.net/browse/NMD-819
2025-07-29 08:23:06 -04:00
Tim Gross
26554e544e scheduler: move result mutation into computeUpdates (#26336)
The `computeUpdates` method returns 4 different values, some of which are just
different shapes of the same data and only ever get used to be applied to the
result in the caller. Move the mutation of the result into `computeUpdates` to
match the work done in #26325. Clean up the return signature so that only slices
we need downstream are returned, and fix the incorrect docstring.

Also fix a silent bug where the `inplace` set includes the original alloc and
not the updated version. This has no functional change because all existing
callers only ever look at the length of this slice, but it will prevent future
bugs if that ever changes.

Ref: https://github.com/hashicorp/nomad/pull/26325
Ref: https://hashicorp.atlassian.net/browse/NMD-819
2025-07-25 08:21:37 -04:00
James Rasell
5989d5862a ci: Update golangci-lint to v2 and fix highlighted issues. (#26334) 2025-07-25 10:44:08 +01:00
Tim Gross
2c4be7fc2e Reconciler mutation improvements (#26325)
Refactors of the `computeGroup` code in the reconciler to make understanding its
mutations more manageable. Some of this work makes mutation more consistent but
more importantly it's intended to make it readily _detectable_ while still being
readable. Includes:

* In the `computeCanaries` function, we mutate the dstate and the result and
  then the return values are used to further mutate the result in the
  caller. Move all this mutation into the function.

* In the `computeMigrations` function, we mutate the result and then the return
  values are used to further mutate the result in the caller. Move all this
  mutation into the function.

* In the `cancelUnneededCanaries` function, we mutate the result and then the
  return values are used to further mutate the result in the caller. Move all
  this mutation into the function, and annotate which `allocSet`s are mutated by
  taking a pointer to the set.

* The `createRescheduleLaterEvals` function currently mutates the results and
  returns updates to mutate the results in the caller. Move all this mutation
  into the function to help cleanup `computeGroup`.

* Extract `computeReconnecting` method from `computeGroup`. There's some tangled
  logic in `computeGroup` for determining changes to make for reconnecting
  allocations. Pull this out into its own function. Annotate mutability in the
  function by passing pointers to `allocSet` where needed, and mutate the result
  to update counts. Rename the old `computeReconnecting` method to
  `appendReconnectingUpdates` to mirror the naming of the similar logic for
  disconnects.

* Extract `computeDisconnecting` method from `computeGroup`. There's some
  tangled logic in `computeGroup` for determining changes to make for
  disconnected allocations. Pull this out into its own function. Annotate
  mutability in the function by passing pointers to `allocSet` where needed, and
  mutate the result to update counts.

* The `appendUnknownDisconnectingUpdates` method used to create updates for
  disconnected allocations mutates one of its `allocSet` arguments to change the
  allocations that the reschedule now set points to. Pull this update out into
  the caller.

* A handful of small docstring and helper function fixes


Ref: https://hashicorp.atlassian.net/browse/NMD-819
2025-07-24 08:33:49 -04:00
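The pattern used throughout 2c4be7fc2e (mutate inside the function, and signal which sets are mutated by taking a pointer) might look roughly like this. The names `cancelCanaries` and `result`, and the simplified `allocSet` type, are hypothetical and chosen only for illustration. Go maps are already reference types, so the pointer here serves as an annotation for the reader rather than a technical requirement.

```go
package main

import "fmt"

// allocSet is a simplified stand-in: allocation ID -> allocation name.
type allocSet map[string]string

// result is a hypothetical reconciler result holding IDs to stop.
type result struct {
	stop []string
}

// cancelCanaries removes the given canaries from *all and records them in
// res. Taking *allocSet and *result makes both mutations visible in the
// signature, so the caller does not have to apply any return values.
func cancelCanaries(all *allocSet, canaryIDs []string, res *result) {
	for _, id := range canaryIDs {
		if _, ok := (*all)[id]; ok {
			delete(*all, id)
			res.stop = append(res.stop, id)
		}
	}
}

func main() {
	all := allocSet{"c1": "canary", "a1": "web"}
	res := &result{}
	cancelCanaries(&all, []string{"c1"}, res)
	fmt.Println(len(all), res.stop)
}
```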
Tim Gross
e675491eb6 refactor uses of allocSet in reconciler (#26324)
The reconciler contains a large set of methods and functions that operate on
`allocSet` (a map of allocation IDs to their allocs). Update these so that they
are consistently methods that are documented to not consume the `allocSet`. This
sets the stage for further improvements around mutability in the reconciler.

This changeset also includes a few related refactors:
* Use the `allocSet` alias in every location it's relevant in the reconciler,
  for consistency and clarity.
* Move the filter functions and related helpers in the `allocs.go` file into the
  `filters.go` file.
* Update the method receiver on `allocSet` to match everywhere and generally
  improve the docstrings on the filter functions.

Ref: https://hashicorp.atlassian.net/browse/NMD-819
2025-07-23 08:57:41 -04:00
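As a rough sketch of what "methods that do not consume the `allocSet`" means in e675491eb6: a filter method builds and returns new sets and leaves its receiver untouched. The `alloc` struct and the `filterByTerminal` method below are simplified assumptions for this example and are not claimed to match the real reconciler code.

```go
package main

import "fmt"

// alloc is a simplified, hypothetical allocation.
type alloc struct {
	ID       string
	Terminal bool
}

// allocSet maps allocation IDs to their allocs.
type allocSet map[string]*alloc

// filterByTerminal splits the set into live and terminal allocs. It returns
// two new sets and does not modify (consume) the receiver.
func (set allocSet) filterByTerminal() (live, terminal allocSet) {
	live = make(allocSet)
	terminal = make(allocSet)
	for id, a := range set {
		if a.Terminal {
			terminal[id] = a
		} else {
			live[id] = a
		}
	}
	return live, terminal
}

func main() {
	all := allocSet{
		"a1": {ID: "a1"},
		"a2": {ID: "a2", Terminal: true},
	}
	live, terminal := all.filterByTerminal()
	// The original set still holds both allocs after filtering.
	fmt.Println(len(all), len(live), len(terminal))
}
```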
Piotr Kazmierczak
973a554808 scheduler: remove unnecessary reconnecting and ignore allocset assignment (#26298)
These values aren't used anywhere, and the code is confusing as is.
2025-07-21 09:06:52 +02:00
Tim Gross
333dd94362 scheduler: exit early on count=0 and filter out server-terminal (#26292)
When a task group is removed from a jobspec, the reconciler stops all
allocations and immediately returns from `computeGroup`. We can do the same for
when the group has been scaled-to-zero, but doing so runs into an inconsistency
in the way that server-terminal allocations are handled.

Prior to this change server-terminal allocations fall through `computeGroup`
without being marked as `ignore`, unless they are terminal canaries, in which
case they are marked `stop` (but this is a no-op). This inconsistency causes a
_tiny_ amount of extra `Plan.Submit`/Raft traffic, but more importantly makes it
more difficult to make test assertions for `stop` vs `ignore` vs
fallthrough. Remove this inconsistency by filtering out server-terminal
allocations early in `computeGroup`.

This brings the cluster reconciler's behavior closer to the node reconciler's
behavior, except that the node reconciler discards _all_ terminal allocations
because it doesn't support rescheduling.

This changeset required adjustments to two tests, but the tests themselves were
a bit of a mess:
* In https://github.com/hashicorp/nomad/pull/25726 we added a test of how
  canaries were treated when on draining nodes. But the test didn't correctly
  configure the job with an update block, leading to misleading test
  behavior. Fix the test to exercise the intended behavior and refactor for
  clarity.
* While working on reconciler behaviors around stopped allocations, I found it
  extremely hard to follow the intent of the disconnected client tests because
  many of the fields in the table-driven test are switches for more complex
  behavior or just tersely named. Attempt to make this a little more legible by
  moving some branches directly into fields, renaming some fields, and
  flattening out some branching.

Ref: https://hashicorp.atlassian.net/browse/NMD-819
2025-07-18 08:51:52 -04:00
Tim Gross
35f3f6ce41 scheduler: add disconnect and reschedule info to reconciler output (#26255)
The `DesiredUpdates` struct that we send to the Read Eval API doesn't include
information about disconnect/reconnect and rescheduling. Annotate the
`DesiredUpdates` with this data, and adjust the `eval status` command to display
only those fields that have non-zero values in order to make the output width
manageable.

Ref: https://hashicorp.atlassian.net/browse/NMD-815
2025-07-16 08:46:38 -04:00
Tim Gross
26302ab25d reconciler: share assertions in property tests (#26259)
Refactor the reconciler property tests to extract functions for safety property
assertions we'll share between different job types for the same reconciler.
2025-07-11 09:27:22 -04:00
Tim Gross
74f7a8f037 scheduler: basic node reconciler safety properties for system jobs (#26216)
Property test assertions for the core safety properties of the node reconciler,
for system jobs.

Ref: https://hashicorp.atlassian.net/browse/NMD-814
Ref: https://github.com/hashicorp/nomad/pull/26167
2025-07-09 14:44:05 -04:00
Tim Gross
94e03f894a scheduler: basic cluster reconciler safety properties for batch jobs (#26172)
Property test assertions for the core safety properties of the cluster
reconciler, for batch jobs. The changeset includes fixes for any bugs found
during work-in-progress, which will get pulled out to their own PRs.

Ref: https://hashicorp.atlassian.net/browse/NMD-814
Ref: https://github.com/hashicorp/nomad/pull/26167
2025-07-09 14:43:55 -04:00
Piotr Kazmierczak
e50db4d1b8 scheduler: property testing of cancelUnneededCanaries (#26204)
In the spirit of #26180

Internal ref: https://hashicorp.atlassian.net/browse/NMD-814
2025-07-09 13:46:13 -04:00
Tim Gross
7c6c1ed0d3 scheduler: reconciler should constrain placements to count (#26239)
While working on property testing in #26172 we discovered there are scenarios
where the reconciler will produce more than the expected number of
placements. Testing of those scenarios at the whole-scheduler level shows that
this gets handled correctly downstream of the reconciler, but this makes it
harder to reason about reconciler behavior. Cap the number of placements in the
reconciler.

Ref: https://github.com/hashicorp/nomad/pull/26172
2025-07-09 11:51:01 -04:00
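The cap described in 7c6c1ed0d3 amounts to truncating the list of placements at an upper bound. The sketch below is an illustration only: the function name `capPlacements` is hypothetical, placements are reduced to plain strings, and how the real reconciler derives the bound from the group's count is not shown in the commit message.

```go
package main

import "fmt"

// capPlacements is a hypothetical helper that returns at most limit
// placements. A negative limit is treated as zero (an assumption of this
// sketch).
func capPlacements(placements []string, limit int) []string {
	if limit < 0 {
		limit = 0
	}
	if len(placements) > limit {
		return placements[:limit]
	}
	return placements
}

func main() {
	proposed := []string{"web[0]", "web[1]", "web[2]", "web[3]"}
	fmt.Println(capPlacements(proposed, 3))
}
```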
Tim Gross
eb47d1ca11 scheduler: eliminate dead code in node reconciler (#26236)
While working on property testing in #26216, I discovered we had unreachable
code in the node reconciler. The `diffSystemAllocsForNode` function receives a
set of non-terminal allocations, but then has branches where it assumes the
allocations might be terminal. It's trivially provable that these allocs are
always live, as the system scheduler splits the set of known allocs into live
and terminal sets before passing them into the node reconciler.

Eliminate the unreachable code and improve the variable names to make the known
state of the allocs more clear in the reconciler code.

Ref: https://github.com/hashicorp/nomad/pull/26216
2025-07-09 11:31:04 -04:00
Piotr Kazmierczak
8bc6abcd2e scheduler: basic cluster reconciler safety properties for service jobs (#26167) 2025-07-09 17:30:37 +02:00
Tim Gross
c043d1c850 scheduler: property testing of reconcile reconnecting (#26180)
To help break down the larger property tests we're doing in #26167 and #26172
into more manageable chunks, pull out a property test for just the
`reconcileReconnecting` method. This method helpfully already defines its
important properties, so we can implement those as test assertions.

Ref: https://hashicorp.atlassian.net/browse/NMD-814
Ref: https://github.com/hashicorp/nomad/pull/26167
Ref: https://github.com/hashicorp/nomad/pull/26172
2025-07-07 09:40:49 -04:00
Tim Gross
9a29df2292 scheduler: emit structured logs from reconciliation (#26169)
Both the cluster reconciler and node reconciler emit a debug-level log line with
their results, but these are unstructured multi-line logs that are annoying for
operators to parse. Change these to emit structured key-value pairs like we do
everywhere else.

Ref: https://hashicorp.atlassian.net/browse/NMD-818
Ref: https://go.hashi.co/rfc/nmd-212
2025-07-01 10:37:44 -04:00
Piotr Kazmierczak
36e7148247 scheduler: doc.go files for new packages (#26177) 2025-07-01 16:28:33 +02:00
Tim Gross
ec8250ed30 property test generation for reconciler (#26142)
As part of ongoing work to make the scheduler more legible and more robustly
tested, we're implementing property testing of at least the reconciler. This
changeset provides some infrastructure we'll need for generating the test cases
using `pgregory.net/rapid`, without building out any of the property assertions
yet (that'll be in upcoming PRs over the next couple weeks).

The alloc reconciler generator produces a job, a previous version of the job, a
set of tainted nodes, and a set of existing allocations. The node reconciler
generator produces a job, a set of nodes, and allocations on those
nodes. Reconnecting allocs are not yet well-covered by these generators, and
with ~40 dimensions covered so far we may need to pull those out to their own
tests in order to get good coverage.

Note the scenarios only randomize fields of interest; fields like the job name
that don't impact the reconciler would use up available shrink cycles on failed
tests without actually reducing the scope of the scenario.

Ref: https://hashicorp.atlassian.net/browse/NMD-814
Ref: https://github.com/flyingmutant/rapid
2025-06-26 11:09:53 -04:00
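To give a flavor of property testing in general, here is a minimal sketch. Note the swap: it uses the standard library's `testing/quick` rather than `pgregory.net/rapid`, so that it runs without extra dependencies, and the toy `reconcile` function and the property it checks are invented for this example rather than taken from the Nomad test suite.

```go
package main

import (
	"fmt"
	"testing/quick"
)

// reconcile is a toy stand-in for a reconciler: given the desired count and
// the number of existing allocs, it returns how many to place and to stop.
func reconcile(count, existing uint8) (place, stop int) {
	c, e := int(count), int(existing)
	if e < c {
		return c - e, 0
	}
	return 0, e - c
}

// checkReconcileProperty generates random inputs and asserts a safety
// property: after applying the result, the alloc total equals the count,
// and the reconciler never both places and stops in the same pass.
func checkReconcileProperty() error {
	property := func(count, existing uint8) bool {
		place, stop := reconcile(count, existing)
		converges := int(existing)+place-stop == int(count)
		exclusive := place == 0 || stop == 0
		return converges && exclusive
	}
	return quick.Check(property, nil)
}

func main() {
	if err := checkReconcileProperty(); err != nil {
		fmt.Println("property violated:", err)
		return
	}
	fmt.Println("property held for all generated inputs")
}
```

As with the generators described in the commit, only the fields that influence the outcome (here, two small integers) are randomized.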
Piotr Kazmierczak
12ddb6db94 scheduler: capture reconciler state in ReconcilerState object (#26088)
This changeset separates reconciler fields into their own sub-struct to make
testing easier and the code more explicit about what fields relate to which
state.
2025-06-23 15:36:39 +02:00
Piotr Kazmierczak
1030760d3f scheduler: adjust method comments and names to reflect recent refactoring (#26085)
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-06-20 17:23:31 +02:00
Piotr Kazmierczak
b82fd2e159 scheduler: refactor cluster reconciler to avoid hidden state mutation (#26042)
Cluster reconciler code is notoriously hard to follow because most of its
methods continuously mutate the fields of the allocReconciler object. Even
for top-level methods it makes the code hard to follow, but gets really gnarly
with lower-level methods (of which there are many). This changeset proposes a
refactoring that makes the vast majority of said methods return explicit values,
and avoid mutating object fields.
2025-06-20 07:37:16 +02:00
Piotr Kazmierczak
0ddbc548a3 scheduler: rename reconciliation package to reconciler (#26038)
nouns are better than verbs for package names
2025-06-12 14:36:09 +02:00