nomad/.changelog/26292.txt
Tim Gross 333dd94362 scheduler: exit early on count=0 and filter out server-terminal (#26292)
When a task group is removed from a jobspec, the reconciler stops all
allocations and immediately returns from `computeGroup`. We can do the same
when the group has been scaled to zero, but doing so runs into an inconsistency
in the way that server-terminal allocations are handled.

Prior to this change, server-terminal allocations fall through `computeGroup`
without being marked as `ignore`, unless they are terminal canaries, in which
case they are marked `stop` (but this is a no-op). This inconsistency causes a
_tiny_ amount of extra `Plan.Submit`/Raft traffic, but more importantly makes it
more difficult to make test assertions for `stop` vs `ignore` vs
fallthrough. Remove this inconsistency by filtering out server-terminal
allocations early in `computeGroup`.
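
The filter-then-exit-early flow described above can be sketched roughly as
follows. This is a minimal illustration, not the actual reconciler code:
`Alloc`, `Result`, and `reconcileGroup` are hypothetical stand-ins for the
real scheduler types and for `computeGroup`, and the sketch assumes that
"server-terminal" means a desired status of stop or evict.

```go
package main

// Alloc is a hypothetical, pared-down stand-in for an allocation.
type Alloc struct {
	ID             string
	DesiredStatus  string // assumed values: "run", "stop", "evict"
	ClientTerminal bool   // true once the client reports the alloc finished
}

// serverTerminal reports whether the server has already decided that this
// allocation should not run (assumption: desired status stop or evict).
func (a Alloc) serverTerminal() bool {
	return a.DesiredStatus == "stop" || a.DesiredStatus == "evict"
}

// Result is a hypothetical summary of what the sketch decides for a group.
type Result struct {
	Stop      []string // IDs of allocations to stop
	Remaining []Alloc  // allocations left for the rest of reconciliation
}

// reconcileGroup sketches the order of operations only: drop server-terminal
// allocations first, then exit early if the group was removed or has count=0.
func reconcileGroup(allocs []Alloc, count int, groupRemoved bool) Result {
	// Filter out server-terminal allocations up front so that later steps
	// never have to choose between stop, ignore, or fallthrough for them.
	live := make([]Alloc, 0, len(allocs))
	for _, a := range allocs {
		if !a.serverTerminal() {
			live = append(live, a)
		}
	}

	var res Result
	if groupRemoved || count == 0 {
		// Stop every non-terminal allocation and return immediately.
		for _, a := range live {
			if !a.ClientTerminal {
				res.Stop = append(res.Stop, a.ID)
			}
		}
		return res
	}

	// Otherwise the remaining allocations continue through the normal
	// reconciliation steps, which this sketch does not model.
	res.Remaining = live
	return res
}
```

With a group scaled to zero, an already-stopped allocation never appears in
`Stop`, which is the property that makes `stop` vs `ignore` assertions simpler
to write.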

This brings the cluster reconciler's behavior closer to the node reconciler's
behavior, except that the node reconciler discards _all_ terminal allocations
because it doesn't support rescheduling.

This changeset required adjustments to two tests, but the tests themselves were
a bit of a mess:
* In https://github.com/hashicorp/nomad/pull/25726 we added a test of how
  canaries on draining nodes are treated. But the test didn't correctly
  configure the job with an update block, leading to misleading test
  behavior. Fix the test to exercise the intended behavior and refactor for
  clarity.
* While working on reconciler behaviors around stopped allocations, I found it
  extremely hard to follow the intent of the disconnected client tests because
  many of the fields in the table-driven test are switches for more complex
  behavior or just tersely named. Attempt to make this a little more legible by
  moving some branches directly into fields, renaming some fields, and
  flattening out some branching.

Ref: https://hashicorp.atlassian.net/browse/NMD-819
2025-07-18 08:51:52 -04:00


```release-note:improvement
scheduler: For service and batch jobs, the scheduler treats a group.count=0 identically to removing the task group from the job, and will stop all non-terminal allocations.
```
```release-note:improvement
scheduler: For service and batch jobs, the scheduler no longer includes stops for already-stopped canaries in plans it submits.
```