Fix the check of the staging path so that it compares against the mountRoot on
the host rather than against the containerMountPoint, which (probably)
never exists on the host and so causes a fallback to the legacy
behaviour.
Some of our allocrunner hooks require a task environment for interpolating values based on the node or allocation. But several of the hooks accept an already-built environment or builder and then keep that in memory. Both of these retain a copy of all the node attributes and allocation metadata, which balloons memory usage until the allocation is GC'd.
While we'd like to look into ways to avoid keeping the allocrunner around entirely (see #25372), for now we can significantly reduce memory usage by creating the task environment on-demand when calling allocrunner methods, rather than persisting it in the allocrunner hooks.
In doing so, we uncover two other bugs:
* The WID manager, the group service hook, and the checks hook have to interpolate services for specific tasks. They mutated a taskenv builder to do so, but each time they mutate the builder, they write to the same environment map. When a group has multiple tasks, it's possible for one task to set an environment variable that would then be interpolated in the service definition for another task if that task did not have that environment variable. Only the service definition interpolation is impacted. This does not leak env vars across running tasks, as each taskrunner has its own builder.
To fix this, we move the `UpdateTask` method off the builder and onto the taskenv as the `WithTask` method. This makes a shallow copy of the taskenv with a deep clone of the environment map used for interpolation, and then overwrites the environment from the task.
* The checks hook interpolates Nomad native service checks only on `Prerun` and not on `Update`. This could cause unexpected deregistration and registration of checks during in-place updates. To fix this, we make sure we interpolate in the `Update` method.
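The `WithTask` approach described above can be sketched roughly as follows. This is a minimal illustration, not the actual Nomad implementation: the `TaskEnv` and `Task` types, their fields, and the merge order are simplified assumptions.

```go
package main

import "fmt"

// TaskEnv is a hypothetical, simplified stand-in for a task environment.
type TaskEnv struct {
	NodeAttrs map[string]string // shared between copies, treated as read-only
	EnvMap    map[string]string // environment used for interpolation
}

// Task is a hypothetical, simplified task definition.
type Task struct {
	Name string
	Env  map[string]string
}

// WithTask returns a shallow copy of the TaskEnv with its own EnvMap, so
// that variables written for one task are never visible to another.
func (t *TaskEnv) WithTask(task *Task) *TaskEnv {
	out := *t // shallow copy: NodeAttrs still points at the shared map
	out.EnvMap = make(map[string]string, len(t.EnvMap)+len(task.Env))
	for k, v := range t.EnvMap {
		out.EnvMap[k] = v
	}
	// Overwrite with the task's own environment.
	for k, v := range task.Env {
		out.EnvMap[k] = v
	}
	return &out
}

func main() {
	base := &TaskEnv{EnvMap: map[string]string{"NOMAD_REGION": "global"}}
	web := base.WithTask(&Task{Name: "web", Env: map[string]string{"PORT": "8080"}})
	db := base.WithTask(&Task{Name: "db", Env: map[string]string{}})

	_, leaked := db.EnvMap["PORT"]
	fmt.Println("web PORT:", web.EnvMap["PORT"])
	fmt.Println("PORT visible to db:", leaked)
	fmt.Println("base entries:", len(base.EnvMap))
}
```

Because each call builds a fresh map, the base environment is never written to, which is what removes the cross-task leak in service interpolation.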
I also bumped into an incorrectly implemented interface in the CSI hook. I've pulled that and some better guardrails out to https://github.com/hashicorp/nomad/pull/25472.
Fixes: https://github.com/hashicorp/nomad/issues/25269
Fixes: https://hashicorp.atlassian.net/browse/NET-12310
Ref: https://github.com/hashicorp/nomad/issues/25372
If a CSI volume has terminal allocations, the volumewatcher will submit an
`Unpublish` RPC. But the "past claim" we create is missing the "external" node
identifier (ex. the AWS EC2 instance ID). The unpublish RPC can tolerate this if
the node still exists in the state store, but if the node has been GC'd the
controller unpublish step will return an error. But at this point we've already
checkpointed the unpublish workflow, which triggers a notification on the
volumewatcher. This results in the volumewatcher getting into a tight loop of
retries. Unfortunately even if we somehow break the loop (perhaps because we hit
a different code path), we'll kick off this loop again after a leader election
when we spin up the volumewatchers again.
This changeset includes the following:
* Fix the primary bug by including the external node ID when creating a "past
claim" for a terminal allocation.
* If we can't look up the external ID because there's no external node ID and the
node no longer exists, abandon the controller unpublish step in the same way
that we do the node unpublish step.
* Rate limit the volumewatcher loop so that any future bugs of this type don't
cause a tight loop.
* Remove some dead code found while working on this.
Fixes: https://github.com/hashicorp/nomad/issues/25349
Ref: https://hashicorp.atlassian.net/browse/NET-12298
While working on #25373, I noticed that the CSI hook's `Destroy` method doesn't
match the interface, which means it never gets called. Because this method only
cancels any in-flight CSI requests, the only impact of this bug is that any CSI
RPCs that are in-flight when an alloc is GC'd on the client or a dev agent is
shut down won't be interrupted gracefully.
Fix the interface, but also make static assertions for all the allocrunner hooks
in the production code, so that you can make changes to interfaces and have
compile-time assistance in avoiding mistakes.
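For illustration, a static assertion in Go is a blank-identifier variable declaration that only compiles when the type satisfies the interface. The interface and hook definitions below are simplified, hypothetical versions and not the real allocrunner types.

```go
package main

import "fmt"

// RunnerHook and RunnerDestroyHook are hypothetical, simplified hook
// interfaces used only for this example.
type RunnerHook interface {
	Name() string
}

type RunnerDestroyHook interface {
	RunnerHook
	Destroy() error
}

// csiHook is a hypothetical hook implementation.
type csiHook struct {
	destroyed bool
}

func (c *csiHook) Name() string { return "csi_hook" }

func (c *csiHook) Destroy() error {
	c.destroyed = true
	return nil
}

// Compile-time assertions: if Destroy had the wrong signature, the build
// would fail here instead of the method silently never being called.
var (
	_ RunnerHook        = (*csiHook)(nil)
	_ RunnerDestroyHook = (*csiHook)(nil)
)

// destroyAll calls Destroy on every hook that implements the optional
// interface and returns how many hooks were destroyed.
func destroyAll(hooks []RunnerHook) int {
	called := 0
	for _, h := range hooks {
		if d, ok := h.(RunnerDestroyHook); ok {
			_ = d.Destroy()
			called++
		}
	}
	return called
}

func main() {
	h := &csiHook{}
	fmt.Println("destroy hooks called:", destroyAll([]RunnerHook{h}))
	fmt.Println("csi hook destroyed:", h.destroyed)
}
```

The runtime type assertion in `destroyAll` is what makes a signature mismatch silent. The `var _ Iface = (*T)(nil)` declarations turn that mismatch into a compile error.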
Ref: https://github.com/hashicorp/nomad/pull/25373
* test: use statedb factory
Swapping fields on Client after it has been created is a race.
* test: lock before checking heartbeat state
Fixes races
* test: fix races by copying fsm objects
A common source of data races in tests is when they insert a fixture
directly into memdb and then later mutate the object. Since objects in
the state store are readonly, any later mutation is a data race.
* test: lock when peeking at eval stats
* test: lock when peeking at serf state
* test: lock when looking at stats
* test: fix default eval broker state test
The test was not applying the config callback. In addition the test
raced against the configuration being applied. Waiting for the keyring
to be initialized resolved the race in my testing, but given the high
concurrency of the various leadership subsystems it's possible it may
still flake.
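The fixture-copying pattern mentioned above can be sketched as follows. The `Store` and `Node` types here are hypothetical, simplified stand-ins for the state store and its objects. The point is only that a test copies an object before mutating it and writes the copy back.

```go
package main

import (
	"fmt"
	"sync"
)

// Node is a hypothetical fixture type.
type Node struct {
	ID     string
	Status string
	Attrs  map[string]string
}

// Copy returns a deep copy of the node, including its attribute map.
func (n *Node) Copy() *Node {
	c := *n
	c.Attrs = make(map[string]string, len(n.Attrs))
	for k, v := range n.Attrs {
		c.Attrs[k] = v
	}
	return &c
}

// Store is a hypothetical in-memory store. Objects handed to Upsert are
// treated as read-only from then on.
type Store struct {
	mu    sync.RWMutex
	nodes map[string]*Node
}

func NewStore() *Store {
	return &Store{nodes: map[string]*Node{}}
}

func (s *Store) Upsert(n *Node) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.nodes[n.ID] = n
}

func (s *Store) Get(id string) *Node {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.nodes[id]
}

func main() {
	s := NewStore()
	fixture := &Node{ID: "n1", Status: "ready", Attrs: map[string]string{"os": "linux"}}
	s.Upsert(fixture)

	// Racy: writing fixture.Status here would mutate an object that
	// concurrent readers of the store may already hold.
	// Safe: copy, mutate the copy, and write the copy back.
	updated := s.Get("n1").Copy()
	updated.Status = "down"
	s.Upsert(updated)

	fmt.Println("original fixture:", fixture.Status)
	fmt.Println("stored node:", s.Get("n1").Status)
}
```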
Nomad driver handles incorrectly set the exit code to 0 when the executor failed.
This corrects that behavior.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
The `server.num_schedulers` configuration value should be a value
between 0 and the number of CPUs on the machine. The Nomad agent
was not validating the configuration parameter which meant you
could use a negative value or a value much larger than the
available machine CPUs. This change enforces validation of the
configuration value both on server startup and when the agent is
reloaded.
The Nomad API was only performing negative value validation when
updating the scheduler number via this method. This change adds
to the validation to ensure the number is not greater than the
CPUs on the machine.
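A minimal sketch of the bounds check described above, assuming a helper named `validateNumSchedulers` (a hypothetical name, not the agent's actual function) that takes the CPU count as a parameter so it can be exercised on any machine:

```go
package main

import (
	"fmt"
	"runtime"
)

// validateNumSchedulers is a hypothetical helper that checks the configured
// scheduler count is between 0 and the number of CPUs, inclusive.
func validateNumSchedulers(n, numCPU int) error {
	if n < 0 {
		return fmt.Errorf("num_schedulers must be zero or greater, got %d", n)
	}
	if n > numCPU {
		return fmt.Errorf("num_schedulers (%d) exceeds the number of CPUs (%d)", n, numCPU)
	}
	return nil
}

func main() {
	cpus := runtime.NumCPU()
	for _, n := range []int{-1, 0, cpus, cpus + 1} {
		fmt.Printf("num_schedulers=%d: %v\n", n, validateNumSchedulers(n, cpus))
	}
}
```

Sharing one check like this between startup, reload, and the API path is one way to keep all three consistent.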
This change removes any blocking calls to destroyAllocRunner, which
caused nomad clients to block when running allocations in certain
scenarios. In addition, this change consolidates client GC by removing
the MakeRoomFor method, which is redundant to keepUsageBelowThreshold.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Docker driver's TestDockerDriver_OOMKilled should run on cgroups v2 now, since
we're running the docker v27 client library and our runners run docker v26,
which contains the containerd fix containerd/containerd#6323.
* Custom watchQuery equivalent on the storage index
* Tests for live updates to the storage page
* Deconditionalizing the pagination on storage, and fixing a bug where I was looking at filtered but not paginated DHV
* Test for pagination with live-updates
We can't delete a CSI plugin when it has volumes in use. When periodic GC runs,
we send the RPC unconditionally and then let the state store return an error. We
accidentally fixed the excess logging this causes (#17025) in #20555, but we can
also check if the plugin is empty first before sending the RPC to save a
request and subsequent Raft write.
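The check-before-send idea can be sketched as below. The `Plugin` type, the definition of "empty" as having no volumes in use, and the `deleteRPC` callback are all assumptions made for illustration. They are not the real state store or RPC types.

```go
package main

import "fmt"

// Plugin is a hypothetical, simplified view of a CSI plugin.
type Plugin struct {
	ID           string
	VolumesInUse int
}

// IsEmpty reports whether the plugin can be deleted. Treating "empty" as
// "no volumes in use" is an assumption for this sketch.
func (p *Plugin) IsEmpty() bool {
	return p.VolumesInUse == 0
}

// gcPlugins sends a delete request only for empty plugins and returns the
// number of requests sent.
func gcPlugins(plugins []*Plugin, deleteRPC func(id string) error) int {
	sent := 0
	for _, p := range plugins {
		if !p.IsEmpty() {
			continue // skip: the delete would be rejected anyway
		}
		sent++
		if err := deleteRPC(p.ID); err != nil {
			fmt.Println("delete failed:", p.ID, err)
		}
	}
	return sent
}

func main() {
	plugins := []*Plugin{
		{ID: "ebs", VolumesInUse: 2},
		{ID: "nfs", VolumesInUse: 0},
	}
	sent := gcPlugins(plugins, func(id string) error {
		fmt.Println("deleting plugin:", id)
		return nil
	})
	fmt.Println("requests sent:", sent)
}
```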
Fixes: https://github.com/hashicorp/nomad/issues/17025
Ref: https://github.com/hashicorp/nomad/pull/20555
When configuring Consul to use Nomad workload identities, you create the Consul
auth method in the default namespace. If you're using Consul Enterprise
namespaces, there are two available approaches: one is to create the tokens in
the default namespace and give them policies that define cross-namespace access,
and the other is to use binding rules that map the login to a particular
namespace. The latter is what we show in our docs, but this was missing a note
that any roles (and their associated policies) targeted by `-bind-type role`
need to exist in the Consul namespace we're logging into.
Also, in Nomad CE, the `consul.namespace` flag is always treated as having been set to
`"default"`. That is, we ignore it and don't return an error even though it's a
Nomad ENT-only feature. Clarify this in the documentation for the field the same
way we've done for the `cluster` field.
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
The agent retry joiner implementation had different parameters
to control its execution for agents running in server and client
mode. The agent would set up individual joiners depending on the
agent mode, making the extra object parameters unnecessary.
This change removes the excess configuration options for the
joiner, reducing code complexity slightly and hopefully making
future modifications in this area easier to make.