Commit Graph

5121 Commits

Author SHA1 Message Date
James Rasell
a206ff3858 test: Fix test flake in client get registration token (#26796)
The test was incorrectly writing to state that registration had
been finished before writing the node identity token. This is the
opposite of what happens in the client code and caused a timing
issue which meant we read registration as completed before we had
the identity available and therefore returned the secret ID.
2025-09-18 13:56:17 +01:00
Tim Gross
3432b0a2d6 consul: only add fingerprint link if unique.consul.name is set (#26787)
In Nomad Enterprise we can fingerprint multiple Consul datacenters. If neither
is `"default"` then we end up with warning logs about adding a "link".

The `Link` field on the `Node` struct is a map of attributes that only
contributes to the node's computed hash. The `"consul"` key's value is derived
from the `unique.consul.name` attribute, which only exists if there's a default
Consul cluster.

Update the fingerprint to skip setting the link field if there's no
`unique.consul.name`, and lower the warning log for malformed fields to debug;
this is a minor scheduling optimization largely captured by existing Consul
fields in the node computed class. The only reason not to remove it entirely is
to avoid changing computed classes on existing large clusters.

Fixes: https://github.com/hashicorp/nomad/issues/26781
Ref: https://hashicorp.atlassian.net/browse/NMD-998
2025-09-17 13:23:01 -04:00
Olli Janatuinen
6398ef9475 secrets: Support custom plugins in Windows (#26751)
Signed-off-by: Olli Janatuinen <olli.janatuinen@gmail.com>
2025-09-16 09:14:50 -04:00
Michael Smithhisler
c20f854d16 client: set network status on tasks when restoring allocations (#26699)
The allocation network hook was not properly restoring network status from state when the network had previously been set up. This led to missing environment variables and misconfigured hosts and resolv.conf files when a task was restarted after the Nomad agent had restarted.
---------

Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>
2025-09-11 13:10:21 -04:00
Chris Roberts
8b51acf259 [artifact] fix path within check on trimmed target (#26748)
When checking if the target path is within the root path, the
target path is trimmed and then file information is fetched. If
the trimmed path does not exist, then the full target path is
not within the root. In the case of receiving a not exist error,
simply return false.
2025-09-11 08:59:18 -07:00
Tim Gross
75774711f0 eliminate dead Vault-related code from nomad/structs (#26736)
When we removed the legacy Vault token workflow, we left behind a few bits of
code that only served that workflow. Remove the dead code.
2025-09-09 12:12:57 -04:00
Michael Smithhisler
37da98be1c Merge pull request #26681 from hashicorp/NMD-760-nomad-secrets-block
Secrets Block: merge feature branch to main
2025-09-09 10:46:18 -04:00
Michael Smithhisler
10ed46cbd4 secrets: pass key/value config data to plugins as env (#26455)
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-09-05 16:08:24 -04:00
Michael Smithhisler
e9e1631b8c test: add task validation when using vault secret provider (#26517) 2025-09-05 16:08:23 -04:00
Michael Smithhisler
1089b8893e secrets: refactor template providers to hold secrets in memory (#26506) 2025-09-05 16:08:23 -04:00
Michael Smithhisler
9950ef515c secrets: validate name and update client config (#26447) 2025-09-05 16:08:23 -04:00
Michael Smithhisler
00ef9cacab secrets: add common secrets plugins impl (#26335)
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2025-09-05 16:08:23 -04:00
Michael Smithhisler
ac32b0864d scheduler: adds implicit constraint for secrets plugin node attributes (#26303) 2025-09-05 16:08:23 -04:00
Michael Smithhisler
6dcd155bf8 add input validation and path traversal protections (#26241)
---------

Co-authored-by: Deniz Onur Duzgun <59659739+dduzgun-security@users.noreply.github.com>
2025-09-05 16:08:23 -04:00
Piotr Kazmierczak
964cc8b8ca Merge pull request #26708 from hashicorp/f-system-deployments
scheduler: system deployments
2025-09-05 18:23:41 +02:00
Michael Smithhisler
85a2875183 task: adds ability to interpret values from secrets hook (#26261) 2025-09-04 15:58:03 -04:00
Michael Smithhisler
2d0ce43c47 secrets: add vault secrets provider (#26198) 2025-09-04 15:58:03 -04:00
Michael Smithhisler
20a855ea13 secrets: add secrets hook with nomad provider (#26143) 2025-09-04 15:58:03 -04:00
Daniel Bennett
9682aa2724 consul connect: allow "cni/*" network mode (#26449)
don't require "bridge" network mode when using connect{}

we document this as "at your own risk" because CNI configuration
is so flexible that we can't guarantee a user's network will work,
but Nomad's "bridge" CNI config may be used as a reference.
2025-09-04 12:29:50 -04:00
Juana De La Cuesta
2944a34b58 Reuse token if it exists on client reconnect (#26604)
Currently, every time a client starts, it creates a new Consul token per service or task. This PR changes that behaviour: it persists the Consul ACL token to the client state and looks up an existing token before creating a new one.

Fixes: #20184
Fixes: #20185
2025-09-04 15:27:57 +02:00
Chris Roberts
fd1e40537c [artifact] add artifact inspection after download (#26608)
This adds artifact inspection after download to detect any issues
with the content fetched. Currently this means checking for any
symlinks within the artifact that resolve outside the task or
allocation directories. On platforms where lockdown is available
(some Linux) this inspection is not performed.

The inspection can be disabled with the DisableArtifactInspection
option. A dedicated option for disabling this behavior allows
the DisableFilesystemIsolation option to be enabled but still
have artifacts inspected after download.
2025-08-27 10:37:34 -07:00
Piotr Kazmierczak
7c4faf9227 scheduler: monitor deployments correctly (#26605)
Corrects two minor bugs that prevented proper deployment monitoring for system
jobs: populating the new deployment field of the system scheduler object, and
correcting allocrunner health checks that were guarded not to run on system
jobs.
2025-08-25 15:29:13 +02:00
Chris Roberts
33a72c2d01 [landlock] Allow read access for random content (#26510)
When attempting to clone a git repository within a sandbox that is
configured with landlock, the clone will fail with error messages
related to inability to get random bytes for a temporary file.
Including a read rule for `/dev/urandom` resolves the error
and the git clone works as expected.
2025-08-22 14:04:55 -07:00
James Rasell
3b0b7db1a1 client: Add client identity API, CLI, and RPC workflow. (#26543)
The Nomad clients store their Nomad identity in memory and within
their state store. While active, it is not possible to dump the
state to view the stored identity token, so having a way to view
the current claims while running aids debugging and operations.

This change adds a client identity workflow, allowing operators
to view the current claims of the node's identity. It does not
return any of the signing key material.
2025-08-19 08:25:51 +01:00
Wim
f712d5db90 Add AllocIPv6 option to allow IPv6 address being used for service registration (#25632)
Fixes #25627 by adding an extra `alloc_advertise_ipv6` option, similar to `AdvertiseIPv6Addr` in the Docker driver config.

Fixes: https://github.com/hashicorp/nomad/issues/25627
2025-08-08 15:01:46 -04:00
James Rasell
1c63ad50d9 Merge pull request #26430 from hashicorp/f-NMD-763-introduction
introduction: The initial implementation code for node introduction.
2025-08-06 14:41:16 +02:00
James Rasell
622def8bcf test: Ensure client rpclogger is set on RPC only client. (#26443)
If a test encounters an RPC error using the test client, it will
panic as the rpc logger is not set when it attempts to log the
error.
2025-08-06 10:20:28 +01:00
James Rasell
ad508616dc Merge branch 'main' into f-NMD-763-introduction 2025-08-05 08:56:51 +01:00
James Rasell
350662c88e Merge pull request #26291 from hashicorp/f-NMD-763-identity
identity: The initial implementation code for node identity.
2025-08-05 09:52:28 +02:00
James Rasell
80a26306bf intro: Add node introduction flow for Nomad client registration. (#26405)
This change implements the client -> server workflow for Nomad
node introduction. A Nomad node can optionally be started with an
introduction token, which is a signed JWT containing claims for
the node registration. The server handles this according to the
enforcement configuration.

The introduction token can be provided by env var, cli flag, or
by placing it within a default filesystem location. The latter
option does not override the CLI or env var.

The region claim has been removed from the initial claims set of
the intro identity. This boundary is guarded by mTLS and aligns
with the node identity.
2025-08-05 08:23:44 +01:00
tehut
21841d3067 Add historical journald and log export flags to operator debug command (#26410)
* Add -log-file-export and -log-lookback flags to add historical logs to the
debug capture
* use monitor.PrepFile() helper for other historical log tests
2025-08-04 13:55:25 -07:00
tehut
d709accaf5 Add nomad monitor export command (#26178)
* Add MonitorExport command and handlers
* Implement autocomplete
* Require nomad in serviceName
* Fix race in StreamReader.Read
* Add and use framer.Flush() to coordinate function exit
* Add LogFile to client/Server config and read NomadLogPath in rpcHandler instead of HTTPServer
* Parameterize StreamFixed stream size
2025-08-01 10:26:59 -07:00
James Rasell
f2417ffb89 ci: Update hclogvet and correctly run across codebase. (#26362) 2025-07-28 14:15:33 +01:00
James Rasell
5989d5862a ci: Update golangci-lint to v2 and fix highlighted issues. (#26334) 2025-07-25 10:44:08 +01:00
James Rasell
dce4284361 Merge branch 'main' into f-NMD-763-identity 2025-07-17 07:35:16 +01:00
James Rasell
953a149180 client: Allow operators to force a client to renew its identity. (#26277)
The Nomad client will have its identity renewed according to the
TTL which defaults to 24h. In certain situations such as root
keyring rotation, operators may want to force clients to renew
their identities before the TTL threshold is met. This change
introduces a client HTTP and RPC endpoint which will instruct the
node to request a new identity at its next heartbeat. This can be
used via the API or a new command.
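
One way to picture "renew at the next heartbeat" is a flag that the endpoint sets and the heartbeat path consumes. The sketch below is an assumption about the shape of such a mechanism, with hypothetical type and method names, and is not taken from the Nomad code:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// identityRenewer is a hypothetical type (not Nomad's implementation).
type identityRenewer struct {
	forceRenew atomic.Bool
}

// Force is what an operator-facing endpoint would call.
func (r *identityRenewer) Force() { r.forceRenew.Store(true) }

// ShouldRequestIdentity is consulted when building the next heartbeat. It
// clears the forced flag so the request is only made once.
func (r *identityRenewer) ShouldRequestIdentity(ttlNearExpiry bool) bool {
	forced := r.forceRenew.Swap(false)
	return forced || ttlNearExpiry
}

func main() {
	var r identityRenewer
	r.Force()
	fmt.Println(r.ShouldRequestIdentity(false)) // true: operator forced a renewal
	fmt.Println(r.ShouldRequestIdentity(false)) // false: flag was consumed
}
```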

While this is a manual intervention step on top of any keyring
rotation, it dramatically reduces the initial feature complexity
as it provides an asynchronous and efficient method of renewal that
utilises existing functionality.
2025-07-16 14:56:00 +01:00
Daniel Bennett
089c148236 allocrunner: run all postrun hooks, even on error (#26271)
e.g. if the consul postrun hook fails, continue running
the subsequent postrun hooks, which among other things
includes network/CNI/iptables cleanup.
2025-07-14 13:55:33 -04:00
James Rasell
8096ea4129 client: Handle identities from servers and use for RPC auth. (#26218)
Nomad servers, if upgraded, can return node identities as part of
the register and update/heartbeat response objects. The Nomad
client will now handle this and store it as appropriate within its
memory and statedb.

The client will now use any stored identity for RPC authentication
with a fallback to the secretID. This supports upgrade paths where
the Nomad clients are updated before the Nomad servers.
2025-07-14 14:24:43 +01:00
James Rasell
7c5a5782bc client: Use single time variable when handling heartbeat response. (#26238)
When the client handles an update status response from the server,
it modifies its heartbeat stop tracker with a time set once the
RPC call returns. It optionally also emits a log message, if the
client suspects it has missed a heartbeat.

These times were originally tracked by two different calls to the
time function which were executed 2 microseconds apart. There is
no reason we cannot use a single time variable for both uses which
saves us one whole call to time.Now.
2025-07-10 08:07:32 +01:00
Juana De La Cuesta
3b44090156 Avoid panic during startup with 1.10.2 (#26219)
* fix: initialize the topology of the processors to avoid nil pointers

* func: initialize topology to avoid nil pointers

* fix: update the new public method for NodeProcessorResources
2025-07-08 16:07:14 +02:00
James Rasell
2f30205102 client: Add state functionality for set and get client identities. (#26184)
The Nomad client will persist its own identity within its state
store for restart persistence. The added benefit of using it over
the filesystem is that it supports transactions. This is useful
when considering the identity will be renewed periodically.
2025-07-07 15:28:27 +01:00
James Rasell
e158356dd2 client: Remove created directory when mkdir plugin fails to chown. (#26194)
The mkdir plugin creates the directory and then chowns it. In the
event the chown command fails, we should attempt to remove the
directory. Without this, we leave directories on the client in
partial failure situations.
2025-07-04 08:36:36 +01:00
Chris Roberts
362690ddd1 client: suppress kill task event on completed tasks (#26075)
The `killTasks` function will kill all of the alloc runner's
task runners. If the task of a task runner has already
completed, the killing of the task runner can cause
confusion due to the task event showing that the task
was signaled even though it is already complete.

To prevent this, a check is done when creating the
task event to determine if the task has completed. If
it has, no task event is created, so when the task
runner is killed no extra task event is added.
2025-07-01 13:30:52 -07:00
James Rasell
d5b2d5078b rpc: Generate node identities with node RPC handlers when needed. (#26165)
When a Nomad client registers or re-registers, the RPC handler will
generate and return a node identity if required. When an identity
is generated, the signing key ID will be stored within the node
object, to ensure a root key is not deleted until it is not used.

During normal client operation it will periodically heartbeat to
the Nomad servers to indicate aliveness. The RPC handler that
is used for this action has also been updated to conditionally
perform identity generation. Performing it here means no extra RPC
handlers are required and we inherit the jitter in identity
generation from the heartbeat mechanism.

The identity generation check methods are performed from the RPC
request arguments, so they are scoped to the required behaviour and
can handle the nuance of each RPC. Failure to generate an identity
is considered terminal to the RPC call. The client will include
behaviour to retry this error which is always caused by the
encrypter not being ready unless the servers' keyring has been
corrupted.
2025-07-01 16:07:21 +01:00
James Rasell
325048c898 Merge branch 'main' into f-NMD-763-identity 2025-06-24 08:42:33 +01:00
James Rasell
26c3f19129 identity: Base server objects and mild rework of identity implementation to support node identities. (#26052)
When Nomad generates an identity for a node, the root key used to
sign the JWT will be stored as a field on the node object and
written to state. To provide fast lookup of nodes by their
signing key, the node table schema has been modified to include
the keyID as an index.

In order to ensure a root key is not deleted while identities are
still actively signed by it, the Nomad state has an in-use check.
This check has been extended to cover node identities.

Nomad node identities will have an expiration. The expiration will
be defined by a TTL configured within the node pool specification
as a time duration. When not supplied by the operator, a default
value of 24hr is applied.

On cluster upgrades, a Nomad server will restore from snapshot
and/or replay logs. The FSM has therefore been modified to ensure
restored node pool objects include the default value. The builtin
"all" and "default" pools have also been updated to include this
default value.

Nomad node identities will be a new identity concept in Nomad and
will exist alongside workload identities. This change introduces a
new envelope identity claim which contains generic public claims
as well as either a node or workload identity claims. This allows
us to use a single encryption and decryption path, no matter what
the underlying identity. Where possible node and workload
identities will use common functions for identity claim
generation.

The new node identity has the following claims:

* "nomad_node_id" - the node ID which is typically generated on
  the first boot of the Nomad client as a UUID within the
  "ensureNodeID" function.

* "nomad_node_pool" - the node pool is a client configuration
  parameter which provides logical grouping of Nomad clients.

* "nomad_node_class" - the node class is a client configuration
  parameter which provides scheduling constraints for Nomad clients.

* "nomad_node_datacenter" - the node datacenter is a client
  configuration parameter which provides scheduling constraints
  for Nomad clients and a logical grouping method.
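
For reference, the four claims listed above could be modelled like this. The struct name and the flat JSON layout are assumptions made for illustration; only the claim names come from the message, and the envelope and generic public claims are left out:

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// nodeIdentityClaims is a hypothetical struct using the claim names listed
// in the commit message. The real envelope layout may differ.
type nodeIdentityClaims struct {
	NodeID         string `json:"nomad_node_id"`
	NodePool       string `json:"nomad_node_pool"`
	NodeClass      string `json:"nomad_node_class"`
	NodeDatacenter string `json:"nomad_node_datacenter"`
}

// claimNames returns the sorted JSON keys produced when encoding c.
func claimNames(c nodeIdentityClaims) ([]string, error) {
	raw, err := json.Marshal(c)
	if err != nil {
		return nil, err
	}
	var decoded map[string]any
	if err := json.Unmarshal(raw, &decoded); err != nil {
		return nil, err
	}
	names := make([]string, 0, len(decoded))
	for name := range decoded {
		names = append(names, name)
	}
	sort.Strings(names)
	return names, nil
}

func main() {
	names, err := claimNames(nodeIdentityClaims{
		NodeID:         "example-node-id", // placeholder value
		NodePool:       "default",
		NodeClass:      "example-class",
		NodeDatacenter: "dc1",
	})
	fmt.Println(names, err)
}
```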
2025-06-18 07:43:27 +01:00
Tim Gross
26004c5407 vault: set renew increment to lease duration (#26041)
When we renew Vault tokens, we use the lease duration to determine how often to
renew. But we also set an `increment` value which is never updated from the
initial 30s. For periodic tokens this is not a problem because the `increment`
field is ignored on renewal. But for non-periodic tokens this prevents the token
TTL from being properly incremented. This behavior has been in place since the
initial Vault client implementation in #1606 but before the switch to workload
identity most (all?) tokens being created were periodic tokens so this was never
detected.

Fix this bug by updating the request's `increment` field to the lease duration
on each renewal.

Also switch out a `time.After` call in backoff of the derive token caller with a
safe timer so that we don't have to spawn a new goroutine per loop, and have
tighter control over when that's GC'd.

Ref: https://github.com/hashicorp/nomad/pull/1606
Ref: https://github.com/hashicorp/nomad/issues/25812
2025-06-13 13:50:54 -04:00
Chris Roberts
dfa07e10ed client: fix batch job drain behavior (#26025)
Batch job allocations that are drained from a node will be moved
to an eligible node. However, when no eligible nodes are available
to place the draining allocations, the tasks will end up being
complete and will not be placed when an eligible node becomes
available. This occurs because the drained allocations are
simultaneously stopped on the draining node while attempting to
be placed on an eligible node. The stopping of the allocations on
the draining node results in tasks being killed, but importantly this
kill does not fail the task. The result is tasks reporting as complete
due to their state being dead and not being failed. As such, when an
eligible node becomes available, all tasks will show as complete and
no allocations will need to be placed.

To prevent the behavior described above, a check is performed when
the alloc runner kills its tasks. If the allocation's job type is
batch, and the allocation has a desired transition of migrate, the
task will be failed when it is killed. This ensures the task does
not report as complete, and when an eligible node becomes available
the allocations are placed as expected.
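
The check itself reduces to a small predicate. The sketch below uses a hypothetical, flattened `allocation` type; Nomad's real allocation and desired-transition structs are richer than this:

```go
package main

import "fmt"

// allocation is a hypothetical, simplified view of an allocation.
type allocation struct {
	JobType        string // e.g. "batch", "service", "system"
	DesiredMigrate bool   // true when the desired transition is migrate
}

// shouldFailOnKill reports whether killed tasks should be marked failed so
// that the allocation is placed again once an eligible node is available.
func shouldFailOnKill(a allocation) bool {
	return a.JobType == "batch" && a.DesiredMigrate
}

func main() {
	fmt.Println(shouldFailOnKill(allocation{JobType: "batch", DesiredMigrate: true}))   // true
	fmt.Println(shouldFailOnKill(allocation{JobType: "batch", DesiredMigrate: false}))  // false
	fmt.Println(shouldFailOnKill(allocation{JobType: "service", DesiredMigrate: true})) // false
}
```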
2025-06-13 08:28:31 -07:00
Daniel Bennett
7519df8d06 task env: add NOMAD_UNIX_ADDR var (#25598)
for easier setup when using workload identity + task api
2025-06-11 15:56:51 -04:00
Deniz Onur Duzgun
abd0efdd76 sec: remove non-hermetic sprig template functions (#25998)
* sec: add sprig template functions in denylists

* remove explicit set which is no longer needed

* go mod tidy

* add changelog

* better changelog and filtered denylist

* go mod tidy with 1.24.4

* edit changelog and remove htpasswd and derive

* fix tests

* Update client/allocrunner/taskrunner/template/template_test.go

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* edit changelog

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-06-09 13:00:47 -04:00