* Only error on constraints if no allocs are running
When running `nomad job run <JOB>` multiple times with constraints
defined, there should be no error as a result of filtering out nodes
that do not, and never have, satisfied the constraints.
When running a system job with a constraint, any run after the initial
startup returns exit code 2 and a warning about unplaced allocations due
to constraints. This error is not encountered on the initial run, even
though the constraint is unchanged.
This is because the node that satisfies the condition is already running
the allocation, and the placement is ignored. Another placement is
attempted, but the only node(s) left are the ones that do not satisfy
the constraint. Nomad views this case (no allocations that were
attempted to be placed could be placed successfully) as an error, and
reports it as such. In reality, no allocations should be placed or
updated in this case, but it should not be treated as an error.
This change uses the `ignored` placements from diffSystemAlloc to
determine whether the case encountered is an error (no ignored
placements means nothing is already running, which is an error) or not
(an ignored placement means the task is already running somewhere on a
node). It does this at the point where `failedTGAlloc` is populated, so
placement functionality isn't changed, just the field that populates
the error.
There is functionality that should be preserved which (correctly)
notifies a user if a job is attempted that cannot be run on any node due
to the constraints filtering out all available nodes. This should still
behave as expected.
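The decision described above can be sketched in Go; the type and function names here are hypothetical, not Nomad's actual scheduler types:

```go
package main

import "fmt"

// placementResult tallies placements for one task group after the
// system-scheduler diff; the names are illustrative, not Nomad's.
type placementResult struct {
	failed  int // placements filtered out by constraints
	ignored int // allocs already running and left untouched
}

// isPlacementError treats constraint failures as a real error only
// when nothing for the task group is already running: at least one
// ignored placement means some node already satisfies the job.
func isPlacementError(r placementResult) bool {
	return r.failed > 0 && r.ignored == 0
}

func main() {
	// Initial run: nothing running yet, all placements fail => error.
	fmt.Println(isPlacementError(placementResult{failed: 2, ignored: 0})) // true
	// Later run: one node already runs the alloc => not an error.
	fmt.Println(isPlacementError(placementResult{failed: 2, ignored: 1})) // false
}
```

The job-cannot-run-anywhere case is still reported, because with no allocations running there are no ignored placements.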
* Add changelog entry
* Handle in-place updates for constrained system jobs
* Update .changelog/25850.txt
Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>
* Remove conditionals
---------
Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>
During the upgrade test we can trigger a re-render of the Vault secret due to
client restart before the allocrunner has marked the task as running, which
triggers the change mode on the template and restarts the task. This results in
a race where the alloc is still "pending" when we go to check it. We never
change the value of this secret in upgrade testing, so paper over this race
condition by setting a "noop" change mode.
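The fix lives in the jobspec's `template` block; the secret path, template data, and destination below are placeholders, and only the `change_mode` setting is the point:

```hcl
template {
  # Illustrative secret path and destination; the real upgrade-test
  # jobspec differs.
  data        = "{{ with secret \"secret/data/upgrade\" }}{{ .Data.data.value }}{{ end }}"
  destination = "local/secret.txt"

  # The secret's value never changes during upgrade testing, so a
  # spurious re-render must not restart the task.
  change_mode = "noop"
}
```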
We're required to pin Docker images for Actions to a specific SHA now and this
is tripping scans in the Enterprise repo. Update the actionlint image.
Ref: https://go.hashi.co/memo/sec-032
Nomad Enterprise users operating in air-gapped or otherwise secured environments
don't want to send license reporting metrics directly from their
servers. Implement manual/offline reporting by periodically recording usage
metrics snapshots in the state store, and providing an API and CLI by which
cluster administrators can download the snapshot for review and out-of-band
transmission to HashiCorp.
This is the CE portion of the work required for the implementation in the
Enterprise product. Nomad CE does not perform utilization reporting.
Ref: https://github.com/hashicorp/nomad-enterprise/pull/2673
Ref: https://hashicorp.atlassian.net/browse/NMD-68
Ref: https://go.hashi.co/rfc/nmd-210
This changeset includes several adjustments to the upgrade testing scripts to
reduce flakes and make problems more understandable:
* When a node is drained prior to the 3rd client upgrade, it's entirely
possible that the 3rd client to be upgraded is the drained node. This results in
miscounting the expected number of allocations because many of them will be
"complete" (service/batch) or "pending" (system). Leave the system jobs running
during drains and only count the running allocations at that point as the
expected set. Move the inline script that gets this count into a script file for
legibility.
* When the last initial workload is deployed, it's possible for it to be
briefly still in "pending" when we move to the next step. Poll for a short
window for the expected count of jobs.
* Make sure that any scripts that are being run right after a server or client
is coming back up can handle temporary unavailability gracefully.
* Change the debugging output of several scripts to avoid having the debug
output run into the error message (e.g. "some allocs are not running" made
the first running allocation look like it was the missing allocation).
* Add some notes to the README about running locally with `-dev` builds and
tagging a cluster with your own name.
Ref: https://hashicorp.atlassian.net/browse/NMD-162
The server startup could appear to hang, from an operator's point of
view, if a key loaded from the FSM at startup could not be decrypted
or replicated.
In order to prevent this happening, the server startup function
will now use a timeout to wait for the encrypter to be ready. If
the timeout is reached, the error is sent back to the caller which
fails the CLI command. The bubbled-up error is also flushed to the
logs, providing additional feedback to the operator.
Only keys loaded from the FSM snapshot and trailing logs matter before
the encrypter can be classed as ready. So that the encrypter ready
function does not get blocked by keys added outside of the initial
Raft load, we take a snapshot of the in-flight decryption tasks as we
enter the blocking call and use that set as our barrier.
New wrapped keys were added to the encrypter and tracked using
their keyID with the context cancelation function. This tracking
was performed primarily so the FSM could load its known key
objects and logs with entries for the same ID superseding existing
decryption tasks. This approach is hard to reason about and can, in
theory, cause timing problems in conjunction with the locking.
The new approach still tracks decryption tasks but does not store
the cancelation context. This context is now controlled within a
single function in an attempt to provide a clearer workflow. If two
calls for the same key are made in close succession, before there is an
entry in the keyring for the key, both tasks are launched. The first
past the post writes the cipher to the encrypter state; the second task
completes but does not write the cipher.
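The first-past-the-post write can be sketched as a check-then-set under a lock; the types and names below are hypothetical stand-ins for the encrypter state:

```go
package main

import (
	"fmt"
	"sync"
)

// cipherEntry stands in for a decrypted data-encryption key; the type
// and field names here are illustrative, not Nomad's.
type cipherEntry struct{ key []byte }

type encrypterState struct {
	mu      sync.Mutex
	ciphers map[string]cipherEntry
}

// storeCipher records the cipher for keyID only if no other decryption
// task got there first. When two tasks race for the same key, the first
// past the post writes; the loser completes without writing.
func (e *encrypterState) storeCipher(keyID string, c cipherEntry) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	if _, ok := e.ciphers[keyID]; ok {
		return false // another task already wrote this key's cipher
	}
	e.ciphers[keyID] = c
	return true
}

func main() {
	e := &encrypterState{ciphers: map[string]cipherEntry{}}
	fmt.Println(e.storeCipher("k1", cipherEntry{key: []byte("a")})) // true: first task wins
	fmt.Println(e.storeCipher("k1", cipherEntry{key: []byte("b")})) // false: second is a no-op
}
```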
* fix: wait for all allocs to be running before checking for their IDs after client upgrade
* style: linter fix
* fix: filter running allocs per client ID when checking for allocs after upgrade
The test for `nomad setup vault` command expects a specific `CreateIndex` for the
job it creates. Any Raft write when a server comes up or establishes leadership
can cause this test to break. Interpolate the expected index as we've done for
other indexes on the job to make this test less brittle.
Ref: https://github.com/hashicorp/nomad-enterprise/pull/2673#issuecomment-2847619747
When a Nomad server restores its state via a snapshot and logs, it is
possible a legacy wrapped key object/log is found. This key will not
contain any wrapped keys and should therefore be ignored within the
encrypter.
Without this change, it is theoretically possible that a key which
generates zero decrypt tasks supersedes a running task and registers
itself in the decrypt task tracker. Such a decrypt task has no running
work that would ever remove its entry.
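The guard amounts to skipping ingestion when a restored object carries no wrapped keys; the type and field names below are illustrative stand-ins, not Nomad's structs:

```go
package main

import "fmt"

// wrappedRootKeys mimics the shape of a key object restored from the
// FSM snapshot or trailing logs; names here are hypothetical.
type wrappedRootKeys struct {
	KeyID       string
	WrappedKeys [][]byte
}

// shouldIngest reports whether a restored key object should be handed
// to the encrypter. Legacy objects carry no wrapped keys; ingesting one
// would register a decrypt task with no work behind it and no way for
// its tracker entry to be removed.
func shouldIngest(k wrappedRootKeys) bool {
	return len(k.WrappedKeys) > 0
}

func main() {
	fmt.Println(shouldIngest(wrappedRootKeys{KeyID: "legacy"}))                           // false
	fmt.Println(shouldIngest(wrappedRootKeys{KeyID: "k2", WrappedKeys: [][]byte{{0x1}}})) // true
}
```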
The `CreateIndexAndIDTokenizer` creates a composite token by
combining the create index value and ID from the object with
a `.`. Tokens are then compared lexicographically. That comparison is
appropriate for the ID segment of the token, but not for the create
index segment: since create index values are ordered numerically,
comparing them lexicographically can cause unexpected results.
For example, comparing the token `12.object-id` to `102.object-id`
shows `12.object-id` as greater. That is lexicographically correct,
but wrong for the intention of the token: given how the token is
composed, the result should be that `12.object-id` is less.
The unexpected behavior can be seen when performing lists (such as
listing allocations). It is encountered inconsistently because two
requirements must be met:
1. Create index values with a large enough span (ex: 12 and 102)
2. A per-page value that yields a "bad" next token (ex: one prefixed
with 102)
To prevent the unexpected behavior, the target token is split
and the components are used individually to compare against the
object.
Fixes #25435
Jobs were being marked incorrectly as having paused allocations when
terminal allocations were marked with the paused boolean. The UI
should only mark a job as including paused allocations when those
paused allocations are in the correct client state, which is pending.
---------
Co-authored-by: Phil Renaud <phil@riotindustries.com>