* fix: wait for all allocs to be running before checking for their IDs after client upgrade
* style: linter fix
* fix: filter running allocs per client ID when checking for allocs after upgrade
The test for the `nomad setup vault` command expects a specific `CreateIndex` for the
job it creates. Any Raft write when a server comes up or establishes leadership
can cause this test to break. Interpolate the expected index as we've done for
other indexes on the job to make this test less brittle.
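A minimal sketch of the pattern in Go; `runSetupVault` and the output format are hypothetical stand-ins for the real test's helpers:

```go
package command

import (
	"fmt"
	"testing"
)

// runSetupVault stands in for running the command and capturing its
// output along with the job's actual create index; it is a
// hypothetical helper for this sketch only.
func runSetupVault(t *testing.T) (string, uint64) {
	t.Helper()
	return "job registered at index 12\n", 12
}

func TestSetupVault_output(t *testing.T) {
	got, createIndex := runSetupVault(t)
	// Render the expectation with the index the server actually
	// assigned instead of hard-coding it, so earlier Raft writes
	// can't break the assertion.
	expected := fmt.Sprintf("job registered at index %d\n", createIndex)
	if got != expected {
		t.Fatalf("expected %q, got %q", expected, got)
	}
}
```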
Ref: https://github.com/hashicorp/nomad-enterprise/pull/2673#issuecomment-2847619747
When a Nomad server restores its state from a snapshot and Raft logs, it
is possible that a legacy wrapped key object/log is found. This key
will not contain any wrapped keys and should therefore be ignored
within the encrypter.
Without this change, it is theoretically possible for a key that
generates zero decrypt tasks to supersede a running task and register
itself in the decrypt task tracker. That entry has no running work
associated with it, so nothing ever removes it.
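A minimal sketch of the guard, assuming a wrapped key type with a `WrappedKeys` slice; the type and function names are illustrative, not Nomad's exact encrypter internals:

```go
package keyring

// wrappedRootKey is an illustrative stand-in for the restored key object.
type wrappedRootKey struct {
	KeyID       string
	WrappedKeys [][]byte
}

// trackDecryptTask registers decrypt work for a restored key, skipping
// legacy entries that carry no wrapped key material: decrypting them
// would enqueue a task that has no work and is never removed.
func trackDecryptTask(k *wrappedRootKey, tasks map[string]struct{}) {
	if len(k.WrappedKeys) == 0 {
		return // legacy object from an old snapshot/log: nothing to decrypt
	}
	tasks[k.KeyID] = struct{}{}
	// ... kick off the actual decrypt work here ...
}
```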
The `CreateIndexAndIDTokenizer` creates a composite token by joining
the object's create index value and ID with a `.`. Tokens are then
compared lexicographically. That comparison is appropriate for the ID
segment of the token, but not for the create index segment: create
index values are ordered numerically, so comparing them
lexicographically can cause unexpected results.
For example, comparing the token `12.object-id` to `102.object-id`
shows `12.object-id` as greater, because `2` sorts after `0`. That is
the correct string comparison, but it defeats the intention of the
token: index 12 is numerically less than index 102, so given the
composition of the token, `12.object-id` should compare as less.
The unexpected behavior can be seen when performing lists (like listing
allocations). It is encountered inconsistently because two conditions
must both hold:
1. Create index values with a large enough span (ex: 12 and 102)
2. A per-page value that happens to produce a "bad" next token (ex: one prefixed with 102)
To prevent the unexpected behavior, the target token is split and its
components are compared against the object individually.
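A minimal sketch of that split-and-compare in Go; the function name and the fallback for a malformed token are illustrative, not Nomad's exact paginator code:

```go
package paginator

import (
	"strconv"
	"strings"
)

// cmpToken compares a "<createIndex>.<id>" token against an object's
// create index and ID, treating the index segment numerically and the
// ID segment lexicographically.
func cmpToken(token string, objIndex uint64, objID string) int {
	idxStr, id, _ := strings.Cut(token, ".")
	tokenIndex, err := strconv.ParseUint(idxStr, 10, 64)
	if err != nil {
		// malformed index segment; fall back to comparing raw strings
		return strings.Compare(token, objID)
	}
	switch {
	case tokenIndex < objIndex:
		return -1
	case tokenIndex > objIndex:
		return 1
	default:
		return strings.Compare(id, objID)
	}
}
```

With this comparison, `12.object-id` correctly sorts before `102.object-id` because the index segments are compared as numbers before the IDs are compared as strings.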
Fixes #25435
Jobs were being incorrectly marked as having paused allocations
when terminal allocations were marked with the paused boolean. The
UI should only mark a job as including paused allocations when
those paused allocations are in the correct client state, which is
`pending`.
---------
Co-authored-by: Phil Renaud <phil@riotindustries.com>
When the context closes, the stats emitter closes its channel. It's possible
for the channel to be closed in the stats emitter goroutine before the `select`
in the test sees that the context has closed, which can cause the test to panic
when it reads the zero value off the closed channel.
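A minimal sketch of the fix, where `ch` and `taskStats` stand in for the test's stats channel and payload type: the comma-ok receive detects the close instead of treating the zero value as a real update.

```go
package main

import "context"

type taskStats struct{ cpuPercent float64 }

// recvStats returns the next stats update, or nil if the context is
// done or the emitter goroutine has already closed the channel. The
// comma-ok form prevents a nil pointer from being returned as if it
// were real data when the close races the ctx.Done() case.
func recvStats(ctx context.Context, ch <-chan *taskStats) *taskStats {
	select {
	case <-ctx.Done():
		return nil
	case s, ok := <-ch:
		if !ok {
			return nil // channel closed by the emitter goroutine
		}
		return s
	}
}
```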
While working on #25726, I found a method in the drainer code that
creates a map of job IDs to allocations.
At first glance this looks like a bug because it effectively de-duplicates the
allocations per job. But the consumer of the map is only concerned with jobs,
not allocations, and simply reads the job off the allocation. Refactor this to
make it obvious we're looking at the job.
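A before/after sketch of the refactor; treat the exact names as illustrative rather than the real diff:

```go
package drainer

import "github.com/hashicorp/nomad/nomad/structs"

// Before: values are allocations, which silently keeps only one
// allocation per job and implies the allocation itself matters.
func jobsByIDBefore(allocs []*structs.Allocation) map[structs.NamespacedID]*structs.Allocation {
	out := make(map[structs.NamespacedID]*structs.Allocation)
	for _, alloc := range allocs {
		out[structs.NewNamespacedID(alloc.JobID, alloc.Namespace)] = alloc
	}
	return out
}

// After: map straight to the job, since the consumer only reads the
// job off the allocation anyway.
func jobsByIDAfter(allocs []*structs.Allocation) map[structs.NamespacedID]*structs.Job {
	out := make(map[structs.NamespacedID]*structs.Job)
	for _, alloc := range allocs {
		out[structs.NewNamespacedID(alloc.JobID, alloc.Namespace)] = alloc.Job
	}
	return out
}
```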
Ref: https://github.com/hashicorp/nomad/pull/25726
When a node with canaries that are not yet healthy is drained, the canaries
may not be properly migrated and the deployment will halt. This happens only if
there are more than `migrate.max_parallel` canaries on the node and the canaries
are not yet healthy (ex. they have a long `update.min_healthy_time`). In this
circumstance, the first batch of canaries are marked for migration by the
drainer correctly. But then the reconciler counts these migrated canaries
against the total number of expected canaries and no longer progresses the
deployment. Because an insufficient number of allocations have reported they're
healthy, the deployment cannot be promoted.
When the reconciler looks for canaries to cancel, it leaves in the list any
canaries that are already terminal (because there shouldn't be any work to
do). But this ends up skipping the creation of a new canary to replace terminal
canaries that have been marked for migration. Add a conditional for this case
that removes such a canary from the list of active canaries so we can
replace it.
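A rough sketch of the added conditional with hypothetical fields; the real reconciler logic is more involved:

```go
package reconciler

// canary is an illustrative stand-in for an allocation's relevant state.
type canary struct {
	terminal       bool
	desiredMigrate bool
}

// activeCanaries drops canaries that are terminal because they were
// stopped for migration, so the reconciler creates replacements for
// them instead of counting them against the expected canary total.
func activeCanaries(canaries []*canary) []*canary {
	out := make([]*canary, 0, len(canaries))
	for _, c := range canaries {
		if c.terminal && c.desiredMigrate {
			continue // migrated-off canary: replace it, don't count it
		}
		out = append(out, c)
	}
	return out
}
```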
Ref: https://hashicorp.atlassian.net/browse/NMD-560
Fixes: https://github.com/hashicorp/nomad/issues/17842
The fresh deployment of the Redis job took around 20s, which is
also the default context timeout of the e2e util that monitors and
waits for a deployment to complete.
The tight timing meant the test often timed out but would sometimes
complete successfully. Increasing the timeout for this deployment
removes the flakiness.
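A sketch of the change, assuming a `waitForDeployment`-style helper that accepts a context; the helper name and signature are illustrative:

```go
package redis

import (
	"context"
	"time"
)

// waitForDeployment stands in for the e2e util that monitors a
// deployment until it completes.
func waitForDeployment(ctx context.Context, deploymentID string) error {
	return nil
}

// The Redis deployment needs roughly the same 20s as the util's
// default timeout, so give it explicit headroom instead of racing
// the default.
func waitForRedisDeployment(deploymentID string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()
	return waitForDeployment(ctx, deploymentID)
}
```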
While working on #25726, I explored a hypothesis that the problem could be
in the state store, but this proved to be a dead end. While I was in this area
of the code I migrated the tests to `shoenig/test`.
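The migration is largely mechanical; for example, testify's `require.NoError` and `require.Equal` become `must.NoError` and `must.Eq` (a toy test for illustration, not one of the migrated state store tests):

```go
package state_test

import (
	"testing"

	"github.com/shoenig/test/must"
)

func TestMigratedAssertions(t *testing.T) {
	var err error
	got := 42
	must.NoError(t, err)  // was: require.NoError(t, err)
	must.Eq(t, 42, got)   // was: require.Equal(t, 42, got)
}
```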
Ref: https://github.com/hashicorp/nomad/pull/25726