The node reconciler never took node feasibility into account. When nodes were
excluded from allocation placement because their constraints were not met, the
desired total and desired canary counts were never updated in the reconciler
to account for the exclusion, so deployments could never become successful.
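A minimal sketch of the fix, assuming constraint filtering leaves us with a set of feasible nodes; `adjustForFeasibility` and `feasibleNodes` are illustrative names, while the fields match Nomad's `structs.DeploymentState`:

```go
package sketch

import "github.com/hashicorp/nomad/nomad/structs"

// adjustForFeasibility derives the desired counts from the nodes that
// can actually receive a placement; without this, the deployment waits
// forever on placements that can never happen.
func adjustForFeasibility(dstate *structs.DeploymentState, feasibleNodes map[string]struct{}) {
	dstate.DesiredTotal = len(feasibleNodes)
	if dstate.DesiredCanaries > dstate.DesiredTotal {
		dstate.DesiredCanaries = dstate.DesiredTotal
	}
}
```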
In cases where a system job had the same number of canary allocations deployed
as there were eligible nodes, the scheduler would incorrectly mark the
deployment as complete, as if auto-promotion were set. This edge case
uncovered a bug in the `setDeploymentStatusAndUpdates` method, and since we
round canary node counts up, it may not be such an edge case after all: with a
single eligible node, for example, any nonzero canary percentage rounds up to
one canary, which already covers every eligible node.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
This changeset adds system scheduler tests covering various permutations of the `update`
block. It also fixes a number of bugs discovered in the process.
* Don't create a deployment for an in-flight rollout. If a system job is in the
middle of a rollout prior to upgrading to a version of Nomad with system
deployments, we'll end up creating a system deployment that might never
complete, because previously placed allocs will not be tracked. Check whether
we have existing allocs that should belong to the new deployment and prevent a
deployment from being created in that case.
* Ensure we call `Copy` on `Deployment` to avoid state store corruption (see the sketch after this list).
* Don't limit canary counts by `max_parallel`.
* Never create deployments for `sysbatch` jobs.
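To illustrate the `Copy` fix, a minimal sketch assuming a deployment read from the state store (`updateStatus` is an illustrative name; `Copy` and the status constants are Nomad's):

```go
package sketch

import "github.com/hashicorp/nomad/nomad/structs"

// updateStatus shows the copy-on-write rule: mutate a copy, never the
// shared object handed back by the state store.
func updateStatus(existing *structs.Deployment) *structs.Deployment {
	d := existing.Copy() // mutating `existing` in place would corrupt state store memory
	d.Status = structs.DeploymentStatusRunning
	d.StatusDescription = structs.DeploymentStatusDescriptionRunning
	return d
}
```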
Ref: https://hashicorp.atlassian.net/browse/NMD-761
In the system scheduler, we need to keep track of which nodes were previously
used as "canary nodes" rather than picking canary nodes at random, to handle
previously failed canaries and changes to the number of canaries in the jobspec.
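A hedged sketch of the idea, assuming previously placed canary allocations can be recognized via their `DeploymentStatus.Canary` flag (`previousCanaryNodes` is an illustrative helper, not the actual implementation):

```go
package sketch

import "github.com/hashicorp/nomad/nomad/structs"

// previousCanaryNodes collects the IDs of nodes that already ran a
// canary, so the scheduler can reuse them instead of picking fresh
// canary nodes at random on every evaluation.
func previousCanaryNodes(allocs []*structs.Allocation) map[string]struct{} {
	nodes := make(map[string]struct{})
	for _, alloc := range allocs {
		if alloc.DeploymentStatus != nil && alloc.DeploymentStatus.Canary {
			nodes[alloc.NodeID] = struct{}{}
		}
	}
	return nodes
}
```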
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
This changeset adjusts the handling of allocation placement when we're
promoting a deployment, and it corrects the behavior of `isDeploymentComplete`,
which previously would never mark a promoted deployment as complete.
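A hedged sketch of the corrected check, in the spirit of `isDeploymentComplete`; the fields follow Nomad's `structs.DeploymentState`, but the logic is illustrative:

```go
package sketch

import "github.com/hashicorp/nomad/nomad/structs"

// deploymentComplete treats a promoted deployment like any other:
// complete once every task group has reached its desired healthy total.
func deploymentComplete(d *structs.Deployment) bool {
	for _, dstate := range d.TaskGroups {
		if dstate.DesiredCanaries > 0 && !dstate.Promoted {
			return false // canaries placed, but promotion still pending
		}
		if dstate.HealthyAllocs < dstate.DesiredTotal {
			return false
		}
	}
	return true
}
```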
This changeset introduces canary deployments for system jobs.
Canaries work a little differently for system jobs than for service jobs. The
integer in the `update` block of a task group is interpreted as a percentage of
eligible nodes that this task group update should be deployed to, rounded up
to the nearest integer (so, e.g., for 5 eligible nodes and a canary value of
50, we will deploy to 3 nodes).
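A minimal sketch of that calculation (`canaryNodeCount` is an illustrative name):

```go
package sketch

import "math"

// canaryNodeCount interprets the update block's canary value as a
// percentage of eligible nodes, rounded up to the nearest integer.
func canaryNodeCount(canaryPercent, eligibleNodes int) int {
	return int(math.Ceil(float64(canaryPercent) * float64(eligibleNodes) / 100.0))
}
```

For example, `canaryNodeCount(50, 5)` returns 3, and `canaryNodeCount(50, 1)` returns 1, so on a single eligible node the canary already covers every node (the edge case noted above).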
In contrast to service jobs, system job canaries are not tracked, i.e., the
scheduler doesn't need to know which allocations are canaries and which are not,
since a node can run at most one allocation of a given system job. Canary
deployments are marked for promotion, and if promoted, the scheduler simply
performs an update as usual, replacing allocations belonging to the previous
job version and leaving the new ones intact.
This is the initial implementation of deployments for the system and sysbatch
reconciler. It does not support updates or canaries at this point; it simply
provides the necessary plumbing for deployments.
In #26169 we started emitting structured logs from the reconciler. But the node
reconciler results are `AllocTuple` structs rather than counts, so the information
we put in the logs ends up being pointer addresses in hex. Fix this so that
we record the number of allocs in each bucket instead.
Fix another misleading log line while we're here.
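A sketch of the fix, assuming `hashicorp/go-hclog`; the result type and its buckets here are stand-ins for the node reconciler's actual structs:

```go
package sketch

import "github.com/hashicorp/go-hclog"

type allocTuple struct{} // stand-in for the reconciler's AllocTuple

type nodeReconcileResult struct {
	Place, Update, Stop, Ignore []allocTuple
}

// logResults records bucket sizes; logging the slices of structs
// directly is what produced hex pointer addresses in the output.
func logResults(logger hclog.Logger, r *nodeReconcileResult) {
	logger.Debug("node reconciler results",
		"place", len(r.Place),
		"update", len(r.Update),
		"stop", len(r.Stop),
		"ignore", len(r.Ignore),
	)
}
```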
Ref: https://github.com/hashicorp/nomad/pull/26169
While working on property testing in #26216, I discovered we had unreachable
code in the node reconciler. The `diffSystemAllocsForNode` function receives a
set of non-terminal allocations, but then has branches where it assumes the
allocations might be terminal. It's trivially provable that these allocs are
always live, as the system scheduler splits the set of known allocs into live
and terminal sets before passing them into the node reconciler.
Eliminate the unreachable code and improve the variable names so the known
state of the allocs is clearer in the reconciler code.
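For reference, a sketch of that split, using `Allocation.TerminalStatus` from `nomad/structs`:

```go
package sketch

import "github.com/hashicorp/nomad/nomad/structs"

// splitAllocs separates known allocs into live and terminal sets; the
// node reconciler only ever receives the live set, so its branches for
// terminal allocs could never be taken.
func splitAllocs(allocs []*structs.Allocation) (live, terminal []*structs.Allocation) {
	for _, alloc := range allocs {
		if alloc.TerminalStatus() {
			terminal = append(terminal, alloc)
			continue
		}
		live = append(live, alloc)
	}
	return live, terminal
}
```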
Ref: https://github.com/hashicorp/nomad/pull/26216
Both the cluster reconciler and node reconciler emit a debug-level log line with
their results, but these are unstructured multi-line logs that are annoying for
operators to parse. Change these to emit structured key-value pairs like we do
everywhere else.
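An illustrative before/after with `go-hclog` (the message and keys are made up for the example):

```go
package sketch

import (
	"fmt"

	"github.com/hashicorp/go-hclog"
)

func logReconcileResults(logger hclog.Logger, numPlace, numStop int) {
	// Before: one multi-line blob that log pipelines can't parse.
	logger.Debug(fmt.Sprintf("reconcile results:\nplace: %d\nstop: %d", numPlace, numStop))

	// After: structured key-value pairs, like the rest of the codebase.
	logger.Debug("reconcile results", "place", numPlace, "stop", numStop)
}
```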
Ref: https://hashicorp.atlassian.net/browse/NMD-818
Ref: https://go.hashi.co/rfc/nmd-212