Commit Graph

2427 Commits

Author SHA1 Message Date
James Rasell
62f1dbebfb server: Add RPC and HTTP functionality for node intro token gen. (#26320)
The node introduction workflow will utilise JWT's that can be used
as authentication tokens on initial client registration. This
change implements the basic builder for this JWT claim type and
the RPC and HTTP handler functionality that will expose this to
the operator.
2025-07-23 14:32:26 +01:00
James Rasell
7466dd71b2 server: Add new server.client_introduction config block. (#26315)
The new configuration block exposes some key options which allow
cluster administrators to control certain client introduction
behaviours.

This change introduces the new block and plumbing, so that it is
exposed in the Nomad server for consumption via internal processes.
2025-07-22 08:50:19 +01:00
James Rasell
dce4284361 Merge branch 'main' into f-NMD-763-identity 2025-07-17 07:35:16 +01:00
James Rasell
953a149180 client: Allow operators to force a client to renew its identity. (#26277)
The Nomad client will have its identity renewed according to the
TTL which defaults to 24h. In certain situations such as root
keyring rotation, operators may want to force clients to renew
their identities before the TTL threshold is met. This change
introduces a client HTTP and RPC endpoint which will instruct the
node to request a new identity at its next heartbeat. This can be
used via the API or a new command.

While this is a manual intervention step on top of the any keyring
rotation, it dramatically reduces the initial feature complexity
as it provides an asynchronous and efficient method of renewal that
utilises existing functionality.
2025-07-16 14:56:00 +01:00
hc-github-team-nomad-core
ccba3ae6a2 Generate files for 1.10.3 release 2025-07-08 16:47:39 -07:00
Chris Roberts
493e7b2faa command: prevent server panic on graceful shutdown (#26171)
When performing a graceful shutdown the client drain configuration
is checked for a deadline which is appended to the timeout. When
running as a server the client will not be set. Attempting to get
the drain deadline will result in a panic. This checks for the
client being available prior to fetching the deadline value.
2025-07-01 15:54:03 -07:00
James Rasell
d5b2d5078b rpc: Generate node identities with node RPC handlers when needed. (#26165)
When a Nomad client register or re-registers, the RPC handler will
generate and return a node identity if required. When an identity
is generated, the signing key ID will be stored within the node
object, to ensure a root key is not deleted until it is not used.

During normal client operation it will periodically heartbeat to
the Nomad servers to indicate aliveness. The RPC handler that
is used for this action has also been updated to conditionally
perform identity generation. Performing it here means no extra RPC
handlers are required and we inherit the jitter in identity
generation from the heartbeat mechanism.

The identity generation check methods are performed from the RPC
request arguments, so they a scoped to the required behaviour and
can handle the nuance of each RPC. Failure to generate an identity
is considered terminal to the RPC call. The client will include
behaviour to retry this error which is always caused by the
encrypter not being ready unless the servers keyring has been
corrupted.
2025-07-01 16:07:21 +01:00
Allison Larson
63f0788747 Expose Kind field for Consul Service Registrations (#26170)
* consul: Add service kind to jobspec

* consul: Add kind to service docs

* Add changelog
2025-06-30 14:32:23 -07:00
James Rasell
7a5f5750b0 test: Wait for client when enabled in test agent if possible. (#26129)
When a test starts an agent and the client is enabled, we can
wait until this reaches the ready state within the set up method.
This mimics what we already do with leadership and the root
keyring and should reduce flakey tests where it assume the client
is ready as soon as the set up function returns, which is not
guaranteed.

The change exposed a couple of TLS reload tests which were not
using the test agent correctly. They were setting up a client even
though it would never be able to join the cluster due to TLS
configuration issues. These have been fixed.
2025-06-25 10:00:28 +01:00
Chris Roberts
4dbf645bf7 command: prevent panic on graceful shutdown (#26018)
When performing a graceful shutdown a channel is used to wait for
the agent to leave. The channel is closed when the agent leaves
successfully, but it also is closed within a deferral. If the
agent successfully leaves and closes the channel, a panic will
occur when the channel is closed the second time within the
deferral. To prevent this from occurring, the channel closing
is wrapped within a `OnceFunc` so the channel is only closed
once.
2025-06-12 09:35:57 -07:00
Chris Roberts
eeec603975 command: prevent early exit from graceful shutdown (#26023)
While waiting for the agent to leave during a graceful shutdown
the wait can be interrupted immediately if another signal is
received. It is common that while waiting a `SIGPIPE` is received
from journald causing the wait to end early. This results in the
agent not finishing the leave process and reporting an error when
the process has stopped. Instead of allowing any signal to interrupt
the wait, the signal is checked for a `SIGPIPE` and if matched will
continue waiting.
2025-06-12 08:56:55 -07:00
hc-github-team-nomad-core
1e49d9eb44 Generate files for 1.10.2 release 2025-06-10 14:35:25 -07:00
James Rasell
e95148c10d consul: Fix data race within test by using mutex to read map. (#25977) 2025-06-04 15:09:37 +01:00
Michael Smithhisler
4c8257d0c7 client: add once mode to template block (#25922) 2025-05-28 11:45:11 -04:00
Tim Gross
3f59860254 host volumes: add configuration to GC on node GC (#25903)
When a node is garbage collected, any dynamic host volumes on the node are
orphaned in the state store. We generally don't want to automatically collect
these volumes and risk data loss, and have provided a CLI flag to `-force`
remove them in #25902. But for clusters running on ephemeral cloud
instances (ex. AWS EC2 in an autoscaling group), deleting host volumes may add
excessive friction. Add a configuration knob to the client configuration to
remove host volumes from the state store on node GC.

Ref: https://github.com/hashicorp/nomad/pull/25902
Ref: https://github.com/hashicorp/nomad/issues/25762
Ref: https://hashicorp.atlassian.net/browse/NMD-705
2025-05-27 10:22:08 -04:00
tehut
55523ecf8e Add NodeMaxAllocations to client configuration (#25785)
* Set MaxAllocations in client config
Add NodeAllocationTracker struct to Node struct
Evaluate MaxAllocations in AllocsFit function
Set up cli config parsing
Integrate maxAllocs into AllocatedResources view
Co-authored-by: Tim Gross <tgross@hashicorp.com>

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-05-22 12:49:27 -07:00
Daniel Bennett
15c01e5a49 ipv6: normalize addrs per RFC-5942 §4 (#25921)
https://datatracker.ietf.org/doc/html/rfc5952#section-4

* copy NormalizeAddr func from vault
  * PRs hashicorp/vault#29228 & hashicorp/vault#29517
* normalize bind/advertise addrs
* normalize consul/vault addrs
2025-05-22 14:21:30 -04:00
Piotr Kazmierczak
cdc308a0eb wi: new endpoint for listing workload attached ACL policies (#25588)
This introduces a new HTTP endpoint (and an associated CLI command) for querying
ACL policies associated with a workload identity. It allows users that want
to learn about the ACL capabilities from within WI-tasks to know what sort of
policies are enabled.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
2025-05-19 19:54:12 +02:00
Tim Gross
8a5a057d88 offline license utilization reporting (#25844)
Nomad Enterprise users operating in air-gapped or otherwise secured environments
don't want to send license reporting metrics directly from their
servers. Implement manual/offline reporting by periodically recording usage
metrics snapshots in the state store, and providing an API and CLI by which
cluster administrators can download the snapshot for review and out-of-band
transmission to HashiCorp.

This is the CE portion of the work required for implemention in the Enterprise
product. Nomad CE does not perform utilization reporting.

Ref: https://github.com/hashicorp/nomad-enterprise/pull/2673
Ref: https://hashicorp.atlassian.net/browse/NMD-68
Ref: https://go.hashi.co/rfc/nmd-210
2025-05-14 09:51:13 -04:00
hc-github-team-nomad-core
9ef42e9807 Generate files for 1.10.1 release 2025-05-13 14:26:48 +02:00
James Rasell
0b265d2417 encrypter: Track initial tasks for is ready calculation. (#25803)
The server startup could "hang" to the view of an operator if it
had a key that could not be decrypted or replicated loaded from
the FSM at startup.

In order to prevent this happening, the server startup function
will now use a timeout to wait for the encrypter to be ready. If
the timeout is reached, the error is sent back to the caller which
fails the CLI command. This bubbling of error message will also
flush to logs which will provide addition operator feedback.

The server only cares about keys loaded from the FSM snapshot and
trailing logs before the encrypter should be classed as ready. So
that the encrypter ready function does not get blocked by keys
added outside of the initial Raft load, we take a snapshot of the
decryption tasks as we enter the blocking call, and class these as
our barrier.
2025-05-07 15:38:16 +01:00
Juanadelacuesta
9288a3141a func and docs: Use the config from the client and not from the agent that is already parsed. Add the breaking change to the release notes 2025-04-30 10:53:02 +02:00
Juanadelacuesta
949571e313 func: read the config from the agent, dont reparse 2025-04-24 05:01:53 +02:00
Juanadelacuesta
46343ee56e func: use the client's configured drain deadline to calculate the graceful timeout when terminating an agent 2025-04-23 23:59:50 +02:00
Juanadelacuesta
c91f24681d style: add changelog 2025-04-23 23:28:54 +02:00
Juana De La Cuesta
9778a31e29 Update command/agent/command.go
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-04-23 23:18:09 +02:00
Juana De La Cuesta
39b3d63172 Update command/agent/command.go
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-04-23 23:18:02 +02:00
Juana De La Cuesta
313f430fdd Update command/agent/command.go
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2025-04-23 23:17:36 +02:00
Juanadelacuesta
b47b962439 fix: use a timer instead of a time.After to avoid memory leaks 2025-04-23 16:26:46 +02:00
Juanadelacuesta
38c27b7e7f fix: make teh disctintion when for os.Interrupt 2025-04-23 16:20:04 +02:00
Juanadelacuesta
adf038b495 fix: correct the logic for LeaveOnTerm or LeaveOnInt depending on the incoming signal 2025-04-23 16:03:12 +02:00
Juanadelacuesta
b375974bc3 style: add comments 2025-04-23 15:47:37 +02:00
Juanadelacuesta
c5c4272aee func: force agent return if there is an error on reload 2025-04-23 15:14:48 +02:00
Piotr Kazmierczak
df3b00bce0 acl: use WhoAmI RPC endpoint in /acl/token/self (#25547)
ResolveToken RPC endpoint was only used by the /acl/token/self API. We should migrate to the WI-aware WhoAmI instead.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-04-22 17:53:39 +02:00
tehut
b11619010e Add priority flag to Dispatch CLI and API (#25622)
* Add priority flag to Dispatch CLI and DispatchOpts() helper to HTTP API
2025-04-18 13:24:52 -07:00
Arian van Putten
d28af58cbb agent: implement sd-notify reload correctly (#25636)
First of all, we should not send the unix time, but the monotonic time.
Second of all, RELOADING= and MONOTONIC_USEC fields should be sent in
*single* message not two separate messages.

From the man page of [systemd.service](https://www.freedesktop.org/software/systemd/man/latest/systemd.service.html#Type=)

> notification message via sd_notify(3) that contains the "RELOADING=1" field in
> combination with "MONOTONIC_USEC=" set to the current monotonic time (i.e.
> CLOCK_MONOTONIC in clock_gettime(2)) in μs, formatted as decimal string.

[sd_notify](https://www.freedesktop.org/software/systemd/man/latest/sd_notify.html)
now has code samples of the protocol to clarify.

Without these changes, if you'd set
Type=notify-reload on the agen'ts systemd unit, systemd
would kill the service due to the service not responding to reload
correctly.
2025-04-14 11:38:56 -04:00
Michael Schurter
c5451cf300 Merge pull request #25635 from hashicorp/post-1.10.0-release
Post 1.10.0 release
2025-04-10 10:32:24 -07:00
Tim Gross
27caae2b2a api: make attempting to remove peer by address a no-op (#25599)
In Nomad 1.4.0 we removed support for Raft Protocol v2 entirely. But the
`Operator.RemoveRaftPeerByAddress` RPC handler was left in place, along with its
supporting HTTP API and command line flags. Using this API will always result in
the Raft library error "operation not supported with current protocol version".

Unfortunately it's still possible in unit tests to exercise this code path, and
these tests are quite flaky. This changeset turns the RPC handler and HTTP API
into a no-op, removes the associated command line flags, and removes the flaky
tests. I've also cleaned up the test for `RemoveRaftPeerByID` to consolidate
test servers and use `shoenig/test`.

Fixes: https://hashicorp.atlassian.net/browse/NET-12413
Ref: https://github.com/hashicorp/nomad/pull/13467
Ref: https://developer.hashicorp.com/nomad/docs/upgrade/upgrade-specific#raft-protocol-version-2-unsupported
Ref: https://github.com/hashicorp/nomad-enterprise/actions/runs/13201513025/job/36855234398?pr=2302
2025-04-10 09:19:25 -04:00
hc-github-team-nomad-core
71af41b4b1 Generate files for 1.10.0 release 2025-04-09 16:03:21 -07:00
hc-github-team-nomad-core
239c5f11ee Generate files for 1.10.0 release 2025-04-09 16:03:21 -07:00
hc-github-team-nomad-core
a18faebda1 Generate files for 1.10.0-rc.1 release 2025-04-03 18:21:58 +00:00
Nikita Eliseev
76fb3eb9a1 rpc: added configuration for yamux session (#25466)
Fixes: https://github.com/hashicorp/nomad/issues/25380
2025-04-02 10:58:23 -04:00
James Rasell
27ad88ac17 test: Calculate agent endpoint scheduler count, not static. (#25473) 2025-03-21 13:47:53 +00:00
James Rasell
b3f28f9387 test: Use runtime CPUs for test not static number. (#25458) 2025-03-20 09:05:36 +00:00
James Rasell
5a157eb123 server: Validate config num schedulers is between 0 and num CPUs. (#25441)
The `server.num_scheduler` configuration value should be a value
between 0 and the number of CPUs on the machine. The Nomad agent
was not validating the configuration parameter which meant you
could use a negative value or a value much larger than the
available machine CPUs. This change enforces validation of the
configuration value both on server startup and when the agent is
reloaded.

The Nomad API was only performing negative value validation when
updating the scheduler number via this method. This change adds
to the validation to ensure the number is not greater than the
CPUs on the machine.
2025-03-20 07:29:57 +00:00
James Rasell
61b2b9d3d0 agent: Improve retry joiner code with small refactor. (#25422)
The agent retry joiner implementation had different parameters
to control its execution for agents running in server and client
mode. The agent would set up individual joiners depending on the
agent mode, making the object parameter overhead unrequired.

This change removes the excess configuration options for the
joiner, reducing code complexity slighly and hopefully making
future modifications in this area easier to make.
2025-03-18 15:55:52 +00:00
hc-github-team-nomad-core
e1b9bd8ab0 Generate files for 1.10.0-beta.1 release 2025-03-12 10:37:46 +00:00
Daniel Bennett
04db81951f test: fix go 1.24 test complaints (#25346)
e.g. Error: nomad/leader_test.go:382:12: non-constant format string in call to (*testing.common).Fatalf
2025-03-11 11:01:39 -05:00
Tim Gross
1ffb7ab3fb dynamic host volumes: allow plugins to return an error message (#25341)
Errors from `volume create` or `volume delete` only get logged by the client
agent, which may make it harder for volume authors to debug these tasks if they
are not also the cluster administrator with access to host logs.

Allow plugins to include an optional error message in their response. Because we
can't count on receiving this response (the error could come before the plugin
executes), we parse this message optimistically and include it only if
available.

Ref: https://hashicorp.atlassian.net/browse/NET-12087
2025-03-11 11:06:57 -04:00
hc-github-team-nomad-core
1da56b8e07 Generate files for 1.9.7 release 2025-03-11 14:09:02 +00:00