When Nomad generates an identity for a node, the root key used to
sign the JWT will be stored as a field on the node object and
written to state. To provide fast lookup of nodes by their
signing key, the node table schema has been modified to include
the keyID as an index.
In order to ensure a root key is not deleted while identities are
still actively signed by it, the Nomad state has an in-use check.
This check has been extended to cover node identities.
Nomad node identities will have an expiration. The expiration will
be defined by a TTL configured within the node pool specification
as a time duration. When not supplied by the operator, a default
value of 24hr is applied.
On cluster upgrades, a Nomad server will restore from snapshot
and/or replay logs. The FSM has therefore been modified to ensure
restored node pool objects include the default value. The builtin
"all" and "default" pools have also been updated to include this
default value.
Nomad node identities will be a new identity concept in Nomad and
will exist alongside workload identities. This change introduces a
new envelope identity claim which contains generic public claims
as well as either a node or workload identity claims. This allows
us to use a single encryption and decryption path, no matter what
the underlying identity. Where possible node and workload
identities will use common functions for identity claim
generation.
The new node identity has the following claims:
* "nomad_node_id" - the node ID which is typically generated on
the first boot of the Nomad client as a UUID within the
"ensureNodeID" function.
* "nomad_node_pool" - the node pool is a client configuration
parameter which provides logical grouping of Nomad clients.
* "nomad_node_class" - the node class is a client configuration
parameter which provides scheduling constraints for Nomad clients.
* "nomad_node_datacenter" - the node datacenter is a client
configuration parameter which provides scheduling constraints
for Nomad clients and a logical grouping method.
When we renew Vault tokens, we use the lease duration to determine how often to
renew. But we also set an `increment` value which is never updated from the
initial 30s. For periodic tokens this is not a problem because the `increment`
field is ignored on renewal. But for non-periodic tokens this prevents the token
TTL from being properly incremented. This behavior has been in place since the
initial Vault client implementation in #1606 but before the switch to workload
identity most (all?) tokens being created were periodic tokens so this was never
detected.
Fix this bug by updating the request's `increment` field to the lease duration
on each renewal.
Also switch out a `time.After` call in backoff of the derive token caller with a
safe timer so that we don't have to spawn a new goroutine per loop, and have
tighter control over when that's GC'd.
Ref: https://github.com/hashicorp/nomad/pull/1606
Ref: https://github.com/hashicorp/nomad/issues/25812
Batch job allocations that are drained from a node will be moved
to an eligible node. However, when no eligible nodes are available
to place the draining allocations, the tasks will end up being
complete and will not be placed when an eligible node becomes
available. This occurs because the drained allocations are
simultaneously stopped on the draining node while attempting to
be placed on an eligible node. The stopping of the allocations on
the draining node result in tasks being killed, but importantly this
kill does not fail the task. The result is tasks reporting as complete
due to their state being dead and not being failed. As such, when an
eligible node becomes available, all tasks will show as complete and
no allocations will need to be placed.
To prevent the behavior described above a check is performed when
the alloc runner kills its tasks. If the allocation's job type is
batch, and the allocation has a desired transition of migrate, the
task will be failed when it is killed. This ensures the task does
not report as complete, and when an eligible node becomes available
the allocations are placed as expected.
* sec:add sprig template functions in denylists
* remove explicit set which is no longer needed
* go mod tidy
* add changelog
* better changelog and filtered denylist
* go mod tidy with 1.24.4
* edit changelog and remove htpasswd and derive
* fix tests
* Update client/allocrunner/taskrunner/template/template_test.go
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* edit changelog
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
The `disconnect.stop_on_client_after` feature is implemented as a loop on the
client that's intended to wait on the shortest timeout of all the allocations on
the node and then check whether the interval since the last heartbeat has been
longer than the timeout. It uses a buffered channel of allocations written and
read from the same goroutine to push "stops" from the timeout expiring to the
next pass through the loop. Unfortunately if there are multiple allocations that
need to be stopped in the same timeout event, or even if a previous event has
not yet been dequeued, then sending on the channel will block and the entire
goroutine deadlocks itself.
While fixing this, I also discovered that the `stop_on_client_after` and
heartbeat loops can synchronize in a pathological way that extends the
`stop_on_client_after` window. If a heartbeat fails close to the beginning of
the shortest `stop_on_client_after` window, the loop will end up waiting until
almost 2x the intended wait period.
While fixing both of those issues, I discovered that the existing tests had a
bug such that we were asserting that an allocrunner was being destroyed when it
had already exited.
This commit includes the following:
* Rework the watch loop so that we handle the stops in the same case as the
timer expiration, rather than using a channel in the method scope.
* Remove the alloc intervals map field from the struct and keep it in the
method scope, in order to discourage writing racy tests that read its value.
* Reset the timer whenever we receive a heartbeat, which forces the two
intervals to synchronize correctly.
* Minor refactoring of the disconnect timeout lookup to improve brevity.
Fixes: https://github.com/hashicorp/nomad/issues/24679
Ref: https://hashicorp.atlassian.net/browse/NMD-407
When a node is garbage collected, any dynamic host volumes on the node are
orphaned in the state store. We generally don't want to automatically collect
these volumes and risk data loss, and have provided a CLI flag to `-force`
remove them in #25902. But for clusters running on ephemeral cloud
instances (ex. AWS EC2 in an autoscaling group), deleting host volumes may add
excessive friction. Add a configuration knob to the client configuration to
remove host volumes from the state store on node GC.
Ref: https://github.com/hashicorp/nomad/pull/25902
Ref: https://github.com/hashicorp/nomad/issues/25762
Ref: https://hashicorp.atlassian.net/browse/NMD-705
* Set MaxAllocations in client config
Add NodeAllocationTracker struct to Node struct
Evaluate MaxAllocations in AllocsFit function
Set up cli config parsing
Integrate maxAllocs into AllocatedResources view
Co-authored-by: Tim Gross <tgross@hashicorp.com>
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Collecting metrics from processes is expensive, especially on platforms like
Windows. The executor code has a 5s cache of stats to ensure that we don't
thrash syscalls on nodes running many allocations. But the timestamp used to
calculate TTL of this cache was never being set, so we were always treating it
as expired. This causes excess CPU utilization on client nodes.
Ensure that when we fill the cache, we set the timestamp. In testing on Windows,
this reduces exector CPU overhead by roughly 75%.
This changeset includes two other related items:
* The `telemetry.publish_allocation_metrics` field correctly prevents a node
from publishing metrics, but the stats hook on the taskrunner still collects
the metrics, which can be expensive. Thread the configuration value into the
stats hook so that we don't collect if `telemetry.publish_allocation_metrics =
false`.
* The `linuxProcStats` type in the executor's `procstats` package is misnamed as
a result of a couple rounds of refactoring. It's used by all task executors,
not just Linux. Rename this and move a comment about how Windows processes are
listed so that the comment is closer to where the logic is implemented.
Fixes: https://github.com/hashicorp/nomad/issues/23323
Fixes: https://hashicorp.atlassian.net/browse/NMD-455
In #24658 we fixed a bug around client restarts where we would not assert
network namespaces existed and were properly configured when restoring
allocations. We introduced a call to the CNI `Check` method so that the plugins
could report correct config. But when we get an error from this call, we don't
log it unless the error is fatal. This makes it challenging to debug the case
where the initial check fails but we tear down the network and try again (as
described in #25510). Add a noisy log line here.
Ref: https://github.com/hashicorp/nomad/pull/24658
Ref: https://github.com/hashicorp/nomad/issues/25510
Fix the checking of the staging path against the mountRoot on the host
rather then checking against the containerMountPoint which (probably)
never exists on the host causing it to default back the the legacy
behaviour.
Some of our allocrunner hooks require a task environment for interpolating values based on the node or allocation. But several of the hooks accept an already-built environment or builder and then keep that in memory. Both of these retain a copy of all the node attributes and allocation metadata, which balloons memory usage until the allocation is GC'd.
While we'd like to look into ways to avoid keeping the allocrunner around entirely (see #25372), for now we can significantly reduce memory usage by creating the task environment on-demand when calling allocrunner methods, rather than persisting it in the allocrunner hooks.
In doing so, we uncover two other bugs:
* The WID manager, the group service hook, and the checks hook have to interpolate services for specific tasks. They mutated a taskenv builder to do so, but each time they mutate the builder, they write to the same environment map. When a group has multiple tasks, it's possible for one task to set an environment variable that would then be interpolated in the service definition for another task if that task did not have that environment variable. Only the service definition interpolation is impacted. This does not leak env vars across running tasks, as each taskrunner has its own builder.
To fix this, we move the `UpdateTask` method off the builder and onto the taskenv as the `WithTask` method. This makes a shallow copy of the taskenv with a deep clone of the environment map used for interpolation, and then overwrites the environment from the task.
* The checks hook interpolates Nomad native service checks only on `Prerun` and not on `Update`. This could cause unexpected deregistration and registration of checks during in-place updates. To fix this, we make sure we interpolate in the `Update` method.
I also bumped into an incorrectly implemented interface in the CSI hook. I've pulled that and some better guardrails out to https://github.com/hashicorp/nomad/pull/25472.
Fixes: https://github.com/hashicorp/nomad/issues/25269
Fixes: https://hashicorp.atlassian.net/browse/NET-12310
Ref: https://github.com/hashicorp/nomad/issues/25372
While working on #25373, I noticed that the CSI hook's `Destroy` method doesn't
match the interface, which means it never gets called. Because this method only
cancels any in-flight CSI requests, the only impact of this bug is that any CSI
RPCs that are in-flight when an alloc is GC'd on the client or a dev agent is
shut down won't be interrupted gracefully.
Fix the interface, but also make static assertions for all the allocrunner hooks
in the production code, so that you can make changes to interfaces and have
compile-time assistance in avoiding mistakes.
Ref: https://github.com/hashicorp/nomad/pull/25373
* test: use statedb factory
Swapping fields on Client after it has been created is a race.
* test: lock before checking heartbeat state
Fixes races
* test: fix races by copying fsm objects
A common source of data races in tests is when they insert a fixture
directly into memdb and then later mutate the object. Since objects in
the state store are readonly, any later mutation is a data race.
* test: lock when peeking at eval stats
* test: lock when peeking at serf state
* test: lock when looking at stats
* test: fix default eval broker state test
The test was not applying the config callback. In addition the test
raced against the configuration being applied. Waiting for the keyring
to be initialized resolved the race in my testing, but given the high
concurrency of the various leadership subsystems it's possible it may
still flake.
This change removes any blocking calls to destroyAllocRunner, which
caused nomad clients to block when running allocations in certain
scenarios. In addition, this change consolidates client GC by removing
the MakeRoomFor method, which is redundant to keepUsageBelowThreshold.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Errors from `volume create` or `volume delete` only get logged by the client
agent, which may make it harder for volume authors to debug these tasks if they
are not also the cluster administrator with access to host logs.
Allow plugins to include an optional error message in their response. Because we
can't count on receiving this response (the error could come before the plugin
executes), we parse this message optimistically and include it only if
available.
Ref: https://hashicorp.atlassian.net/browse/NET-12087
When a CSI plugin is launched, we probe it until the csi_plugin.health_timeout
expires (by default 30s). But if the plugin never becomes healthy, we're not
restarting the task as documented.
Update the plugin supervisor to trigger a restart instead. We still exit the
supervisor loop at that point to avoid having the supervisor send probes to a
task that isn't running yet. This requires reworking the poststart hook to allow
the supervisor loop to be restarted when the task restarts.
In doing so, I identified that we weren't respecting the task kill context from
the post start hook, which would leave the supervisor running in the window
between when a task is killed because it failed and its stop hooks were
triggered. Combine the two contexts to make sure we stop the supervisor
whichever context gets closed first.
Fixes: https://github.com/hashicorp/nomad/issues/25293
Ref: https://hashicorp.atlassian.net/browse/NET-12264
The group level fields stop_after_client_disconnect,
max_client_disconnect, and prevent_reschedule_on_lost were deprecated in
Nomad 1.8 and replaced by field in the disconnect block. This change
removes any logic related to those deprecated fields.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Before the fixes in #20165, the wait feature was disabled by
default. After these changes, it's always enabled, which - at
least on some platforms - leads to a significant increase in
load (5-7x).
This patch allows disabling the wait feature in the client
stanza of the configuration file by setting min and max to 0:
wait {
min = "0"
max = "0"
}
Per-template wait blocks in the task description still work like
one would expect.
When a node is fingerprinted, we calculate a "computed class" from a hash over a
subset of its fields and attributes. In the scheduler, when a given node fails
feasibility checking (before fit checking) we know that no other node of that
same class will be feasible, and we add the hash to a map so we can reject them
early. This hash cannot include any values that are unique to a given node,
otherwise no other node will have the same hash and we'll never save ourselves
the work of feasibility checking those nodes.
In #4390 we introduce the `nomad.advertise.address` attribute and in #19969 we
introduced `consul.dns.addr` attribute. Both of these are unique per node and
break the hash.
Additionally, we were not correctly filtering attributes out when checking if a
node escaped the class by not filtering for attributes that start with
`unique.`. The test for this introduced in #708 had an inverted assertion, which
allowed this to pass unnoticed since the early days of Nomad.
Ref: https://github.com/hashicorp/nomad/pull/708
Ref: https://github.com/hashicorp/nomad/pull/4390
Ref: https://github.com/hashicorp/nomad/pull/19969
The legacy workflow for Vault whereby servers were configured
using a token to provide authentication to the Vault API has now
been removed. This change also removes the workflow where servers
were responsible for deriving Vault tokens for Nomad clients.
The deprecated Vault config options used byi the Nomad agent have
all been removed except for "token" which is still in use by the
Vault Transit keyring implementation.
Job specification authors can no longer use the "vault.policies"
parameter and should instead use "vault.role" when not using the
default workload identity.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
In #24526 we updated Consul and Vault fingerprinting so that we no longer
periodically fingerprint. In #25102 we made it so that we fingerprint
periodically on start until the first fingerprint, in order to tolerate Consul
or Vault not being available on start. For clusters not running Consul, this
leads to a warn-level log every 15s. This same log exists for Vault, but Vault
support is opt-in via `vault.enable = true` whereas you have to manually disable
the fingerprinter for Consul.
Make it so that we only log a failed Consul fingerprint once per Consul
cluster. Reset the gate on this once we have a successful fingerprint, so that
we get the logs after a reload if Consul is unavailable.
Ref: https://github.com/hashicorp/nomad/pull/24526
Ref: https://github.com/hashicorp/nomad/pull/25102
Fixes: https://github.com/hashicorp/nomad/issues/25181
In #20165 we fixed a bug where a partially configured `client.template` retry
block would set any unset fields to nil instead of their default values. But
this patch introduced a regression in the default values, so we were now
defaulting to unlimited retries if the retry block was unset. Restore the
correct behavior and add better test coverage at both the config parsing and
template configuration code.
Ref: https://github.com/hashicorp/nomad/pull/20165
Ref: https://github.com/hashicorp/nomad/issues/23305#issuecomment-2643731565
In #24526 we updated the Consul and Vault fingerprints so that they are no
longer periodic. This fixed a problem that cluster admins reported where rolling
updates of Vault or Consul would cause a thundering herd of fingerprint updates
across the whole cluster.
But if Consul/Vault is not available during the initial fingerprint, it will
never get fingerprinted again. This is challenging for cluster updates and black
starts because the implicit service startup ordering may require
reloads. Instead, have the fingerprinter run periodically but mark that it has
made its first successful fingerprint of all Consul/Vault clusters. At that
point, we can skip further periodic updates. The `Reload` method will reset the
mark and allow the subsequent fingerprint to run normally.
Fixes: https://github.com/hashicorp/nomad/issues/25097
Ref: https://github.com/hashicorp/nomad/pull/24526
Ref: https://github.com/hashicorp/nomad/issues/24049
In #24650 we switched to using ephemeral state for CNI plugins, so that when a
host reboots and we lose all the allocations we don't end up trying to use IPs
we created in network namespaces we just destroyed. Unfortunately upgrade
testing missed that in a non-reboot scenario, the existing CNI state was being
used by plugins like the ipam plugin to hand out the "next available" IP
address. So with no state carried over, we might allocate new addresses that
conflict with existing allocations. (This can be avoided by draining the node
first.)
As a compatibility shim, copy the old CNI state directory to the new CNI state
directory during agent startup, if the new CNI state directory doesn't already
exist.
Ref: https://github.com/hashicorp/nomad/pull/24650
When a blocking query on the client hits a retryable error, we change the max
query time so that it falls within the `RPCHoldTimeout` timeout. But when the
retry succeeds we don't reset it to the original value.
Because the calls to `Node.GetClientAllocs` reuse the same request struct
instead of reallocating it, any retry will cause the agent to poll at a faster
frequency until the agent restarts. No other current RPC on the client has this
behavior, but we'll fix this in the `rpc` method rather than in the caller so
that any future users of the `rpc` method don't have to remember this detail.
Fixes: https://github.com/hashicorp/nomad/issues/25033
* Upgrade to using hashicorp/go-metrics@v0.5.4
This also requires bumping the dependencies for:
* memberlist
* serf
* raft
* raft-boltdb
* (and indirectly hashicorp/mdns due to the memberlist or serf update)
Unlike some other HashiCorp products, Nomads root module is currently expected to be consumed by others. This means that it needs to be treated more like our libraries and upgrade to hashicorp/go-metrics by utilizing its compat packages. This allows those importing the root module to control the metrics module used via build tags.