Jobs were being incorrectly marked as having paused allocations
when terminal allocations were marked with the paused boolean. The
UI should only mark a job as including paused allocations when
those paused allocations are in the correct client state, which is
pending.
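The actual change is in the web UI, but the condition is simple to state.
Sketched in Go purely for illustration, with hypothetical type and field
names (not the UI or API code):

```go
package sketch

// Alloc is a hypothetical view of an allocation as the UI sees it.
type Alloc struct {
	Paused       bool
	ClientStatus string // e.g. "pending", "running", "complete"
}

// hasPausedAllocs reports whether a job should be marked as having paused
// allocations: only paused allocations whose client status is still
// pending count.
func hasPausedAllocs(allocs []*Alloc) bool {
	for _, a := range allocs {
		if a.Paused && a.ClientStatus == "pending" {
			return true
		}
	}
	return false
}
```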
---------
Co-authored-by: Phil Renaud <phil@riotindustries.com>
When the context closes, the stats emitter closes its channel. It's possible
for the channel to be closed in the stats emitter goroutine before the `select`
in the test sees that the context has closed, which can result in a panic in the
test when we try to read the zero value off the closed channel.
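A minimal sketch of that race and the defensive check, using hypothetical
names rather than the actual emitter or test code:

```go
package main

import (
	"context"
	"fmt"
)

type Stats struct{ CPU float64 }

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	statsCh := make(chan *Stats)

	// Hypothetical emitter: closes its channel once the context is done.
	go func() {
		defer close(statsCh)
		for {
			select {
			case <-ctx.Done():
				return
			case statsCh <- &Stats{CPU: 1.0}:
			}
		}
	}()

	cancel() // the emitter may close statsCh before the select below runs

	select {
	case <-ctx.Done():
		// noticed the cancellation first; nothing to do
	case stats, ok := <-statsCh:
		if !ok {
			return // channel already closed; without this check stats is nil
		}
		fmt.Println(stats.CPU)
	}
}
```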
While working on #25726, I found a method in the drainer code that
creates a map of job IDs to allocations.
At first glance this looks like a bug because it effectively de-duplicates the
allocations per job. But the consumer of the map is only concerned with jobs,
not allocations, and simply reads the job off the allocation. Refactor this to
make it obvious we're looking at the job.
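A hedged before/after sketch of the shape of the refactor; names are
illustrative, not the actual drainer code:

```go
package sketch

type Job struct{ ID string }

type Allocation struct {
	JobID string
	Job   *Job
}

// Before: keyed by job ID, but the value is whichever allocation was seen
// last, which reads like accidental de-duplication of allocations per job.
func jobsFromAllocsOld(allocs []*Allocation) map[string]*Allocation {
	out := make(map[string]*Allocation)
	for _, alloc := range allocs {
		out[alloc.JobID] = alloc
	}
	return out
}

// After: the consumer only ever needs the job, so map straight to it.
func jobsFromAllocs(allocs []*Allocation) map[string]*Job {
	out := make(map[string]*Job)
	for _, alloc := range allocs {
		out[alloc.JobID] = alloc.Job
	}
	return out
}
```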
Ref: https://github.com/hashicorp/nomad/pull/25726
When a node is drained that has canaries that are not yet healthy, the canaries
may not be properly migrated and the deployment will halt. This happens only if
there are more than `migrate.max_parallel` canaries on the node and the canaries
are not yet healthy (ex. they have a long `update.min_healthy_time`). In this
circumstance, the first batch of canaries is correctly marked for migration
by the drainer. But then the reconciler counts these migrated canaries
against the total number of expected canaries and no longer progresses the
deployment. Because an insufficient number of allocations have reported they're
healthy, the deployment cannot be promoted.
When the reconciler looks for canaries to cancel, it leaves in the list any
canaries that are already terminal (because there shouldn't be any work to
do). But this ends up skipping the creation of a new canary to replace terminal
canaries that have been marked for migration. Add a conditional for this case to
cause the canary to be removed from the list of active canaries so we can
replace it.
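A hedged sketch of that conditional; the types and fields below are
illustrative, not the reconciler's actual structures:

```go
package sketch

// Canary is a hypothetical stand-in for a canary allocation.
type Canary struct {
	ClientTerminal bool // already terminal on the client
	DesiredMigrate bool // marked for migration by the drainer
}

// activeCanaries drops canaries that are terminal and were marked for
// migration, so the reconciler creates replacements for them instead of
// counting them against the expected number of canaries.
func activeCanaries(canaries []*Canary) []*Canary {
	active := make([]*Canary, 0, len(canaries))
	for _, c := range canaries {
		if c.ClientTerminal && c.DesiredMigrate {
			continue // replace this canary rather than keep it in the count
		}
		active = append(active, c)
	}
	return active
}
```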
Ref: https://hashicorp.atlassian.net/browse/NMD-560
Fixes: https://github.com/hashicorp/nomad/issues/17842
The fresh deployment of the Redis job took around 20s, which is
also the default context timeout on the e2e util that monitors and
waits for a deployment to complete.
The tight timing meant the test often timed out but sometimes
would complete successfully. Increasing the timeout for this
deployment will remove the flakiness.
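Illustrative only: the helper name, signature, and 60s value below are
assumptions, not the actual e2e util; the point is giving the watcher more
headroom than the deployment itself needs.

```go
package e2e

import (
	"context"
	"time"
)

// waitForDeployment stands in for the e2e helper that monitors a deployment
// until it completes or the context expires (hypothetical signature).
func waitForDeployment(ctx context.Context, deploymentID string) error {
	// ... poll deployment status until healthy or ctx is done ...
	return ctx.Err()
}

// monitorDeployment gives the watcher well more than the ~20s the Redis
// deployment takes, instead of a default timeout that matches it exactly.
func monitorDeployment(deploymentID string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()
	return waitForDeployment(ctx, deploymentID)
}
```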
While working on #25726, I explored a hypothesis that the problem could be
in the state store, but this proved to be a dead end. While I was in this area
of the code I migrated the tests to `shoenig/test`.
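An illustrative before/after of that kind of migration; the assertions shown
are typical of the change, not the exact tests touched:

```go
package state_test

import (
	"testing"

	"github.com/shoenig/test/must"
)

// lookupValue is a hypothetical helper standing in for a state store read.
func lookupValue() (string, error) { return "expected", nil }

func TestLookup(t *testing.T) {
	got, err := lookupValue()

	// testify style this replaces:
	//   require.NoError(t, err)
	//   require.Equal(t, "expected", got)

	// shoenig/test style:
	must.NoError(t, err)
	must.Eq(t, "expected", got)
}
```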
Ref: https://github.com/hashicorp/nomad/pull/25726
* Preserve core resources during inplace service alloc updates
When an alloc is running with core resources specified, and the
alloc can be updated in place, the cores it is running on should
be preserved.
This fixes a bug where the allocation's task's core resources
(CPU.ReservedCores) would be recomputed each time the reconciler checked
that the allocation could continue to run on the given node. Under
circumstances where a different core on the node became available before
this check was made, the selection process could compute this new core
as the core to run on, regardless of the core the allocation was
already running on. The check takes into account other allocations
running on the node with reserved cores, but cannot account for the
allocation's own existing reservation.
When this happened for multiple allocations being evaluated in a
single plan, the selection process would see the cores previously
reserved by other allocations but be unaware of the one it ran on,
resulting in the same core being chosen over and over for each
allocation that was being checked, and updated in the state store (but
not on the node). Once those cores were chosen and committed for
multiple allocs, the node appeared to be exhausted on the cores
dimension, preventing any additional allocations from being started on
the node.
The reconciler check/computation for allocations that are being updated
in place and have resources.cores defined is effectively a check that
the node has the available cores to run on, not a computation whose
result should change. The fix still performs the check, but once it
succeeds any existing ReservedCores are preserved. Because any change
to this resource is considered a "destructive change", the existing
value can be confidently preserved during the in-place update (a sketch
of this step follows the commit list below).
* Adjust reservedCores scheduler test
* Add changelog entry
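A hedged sketch of the preservation step described above; the types, field
names, and helper are illustrative, not the scheduler's actual code:

```go
package sketch

// Resources and Task are illustrative stand-ins, not the scheduler's types.
type Resources struct {
	ReservedCores []uint16
}

type Task struct {
	Resources *Resources
}

// preserveReservedCores copies the previously reserved cores from each task
// of the existing allocation onto the updated one after the node check
// succeeds, so the allocation keeps running on the cores it already holds.
func preserveReservedCores(existing, updated map[string]*Task) {
	for name, task := range updated {
		prev, ok := existing[name]
		if !ok || prev.Resources == nil || len(prev.Resources.ReservedCores) == 0 {
			continue
		}
		if task.Resources == nil {
			task.Resources = &Resources{}
		}
		// Changing cores would be a destructive change, so an in-place
		// update can safely keep the old reservation.
		task.Resources.ReservedCores = append([]uint16(nil), prev.Resources.ReservedCores...)
	}
}
```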
The ResolveToken RPC endpoint was only used by the /acl/token/self API. We should migrate to the WI-aware WhoAmI endpoint instead.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>