Commit Graph

1339 Commits

Author SHA1 Message Date
Allison Larson
e40164abce Add preserve-resources flag (#26841)
* Add preserve-resources flag when registering a job

* Add preserve-resources flag to website docs

* Add changelog

* Update tests, docs

* Preserve counts & resources in fsm

* Update doc

* Update preservation of resources/count to happen in StateStore
2025-10-02 13:56:59 -07:00
James Rasell
c80c60965f node pool: Allow specifying node identity ttl in HCL or JSON spec. (#26825)
The node identity TTL defaults to 24hr but can be altered by
setting the node identity TTL parameter. In order to allow setting
and viewing the value, the field is now plumbed through the CLI
and HTTP API.

In order to parse the HCL, a new helper package has been created
which contains generic parsing and decoding functionality for
dealing with HCL that contains time durations. hclsimple can be
used when this functionality is not needed. In order to parse the
JSON, custom marshal and unmarshal functions have been created as
used in many other places.

The node pool init command has been updated to include this new
parameter, although commented out, so reference. The info command
now includes the TTL in its output too.
2025-09-24 14:20:34 +01:00
Piotr Kazmierczak
f42239bf6c api: add DefaultUpdateStrategy to system jobs if missing (#26777)
From 1.11, Nomad system jobs will feature deployments, and thus jobspecs missing
an update block should be canonicalized to have one.
2025-09-18 15:21:23 +02:00
Tim Gross
4e75e99f1a windows: use/accept platform-specific signal for stopping agent (#26780)
On Windows, the `os.Process.Signal` method returns an error when sending
`os.Interrupt` (SIGINT) because it isn't implemented. This causes test servers
in the `testutil` packages to break on Windows. Use the platform specific
syscalls to generate the SIGINT instead.

The agent's signal handler also did not correctly handle the Ctrl-C because we
were masking os.Interrupt instead of SIGINT.

Fixes: https://github.com/hashicorp/nomad/issues/26775

Co-authored-by: Chris Roberts <croberts@hashicorp.com>
2025-09-17 11:32:20 -04:00
dependabot[bot]
ababacc9ab chore(deps): bump github.com/shoenig/test from 1.12.1 to 1.12.2 in /api (#26757)
* chore(deps): bump github.com/shoenig/test from 1.12.1 to 1.12.2 in /api

Bumps [github.com/shoenig/test](https://github.com/shoenig/test) from 1.12.1 to 1.12.2.
- [Release notes](https://github.com/shoenig/test/releases)
- [Commits](https://github.com/shoenig/test/compare/v1.12.1...v1.12.2)

---
updated-dependencies:
- dependency-name: github.com/shoenig/test
  dependency-version: 1.12.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* root dep needs to be updated too

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-09-15 09:06:41 -04:00
Michael Smithhisler
37da98be1c Merge pull request #26681 from hashicorp/NMD-760-nomad-secrets-block
Secrets Block: merge feature branch to main
2025-09-09 10:46:18 -04:00
dependabot[bot]
00fd92a1d4 chore(deps): bump github.com/hashicorp/cronexpr in /api (#26715)
Bumps [github.com/hashicorp/cronexpr](https://github.com/hashicorp/cronexpr) from 1.1.2 to 1.1.3.
- [Release notes](https://github.com/hashicorp/cronexpr/releases)
- [Commits](https://github.com/hashicorp/cronexpr/compare/v1.1.2...v1.1.3)

---
updated-dependencies:
- dependency-name: github.com/hashicorp/cronexpr
  dependency-version: 1.1.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-09-08 17:06:16 +02:00
Michael Smithhisler
10ed46cbd4 secrets: pass key/value config data to plugins as env (#26455)
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-09-05 16:08:24 -04:00
Michael Smithhisler
65c7f34f2d secrets: Add secrets block to job spec (#26076) 2025-09-04 15:58:03 -04:00
James Rasell
270ab1011e lint: Enable and fix SA9004 constant type lint errors. (#26678)
When creating constants with a custom type, each definition should
include the type definition. If only the first constant defines
this, it will have a different type to the other constants.

This change fixes occurances of this and enables SA9004 within CI
linting to catch future problems while the change is in review.
2025-09-03 07:45:29 +01:00
Michael Smithhisler
485356c3d3 csi: fix volume registration error (#26642) 2025-08-27 15:00:16 -04:00
James Rasell
3b0b7db1a1 client: Add client identity API, CLI, and RPC workflow. (#26543)
The Nomad clients store their Nomad identity in memory and within
their state store. While active, it is not possible to dump the
state to view the stored identity token, so having a way to view
the current claims while running aids debugging and operations.

This change adds a client identity workflow, allowing operators
to view the current claims of the nodes identity. It does not
return any of the signing key material.
2025-08-19 08:25:51 +01:00
Daniel Bennett
2c699b9794 sysbatch: fix panic from reschedule block (#26534)
* fix panic from nil ReschedulePolicy

commit 279775082c (pr #26279)
intended to return an error for sysbatch jobs with a reschedule block,
but in bypassing populating the `ReschedulePolicy`'s pointer fields,
a nil pointer panic occurred before the job could get rejected
with the intended error.

in particular, in `command/agent/job_endpoint.go`, `func ApiTgToStructsTG`,

```
if taskGroup.ReschedulePolicy != nil {
	tg.ReschedulePolicy = &structs.ReschedulePolicy{
		Attempts:      *taskGroup.ReschedulePolicy.Attempts,
		Interval:      *taskGroup.ReschedulePolicy.Interval,
```

`*taskGroup.ReschedulePolicy.Interval` was a nil pointer.

* fix e2e test jobs
2025-08-18 10:19:14 -04:00
Allison Larson
e16a3339ad Add CSI Volume Sentinel Policy scaffolding (#26438)
* Add ent policy enforcement stubs to CSI Volume create/register

* Wire policy override/warnings through CSI volume register/create

* Add new scope to sentinel apply

* Sanitize CSISecrets & CSIMountOptions

* Add sentinel policy scope to ui

* Update docs for new sentinel scope/policy

* Create new api funcs for CSI endpoints

* fix sentinel csi ui test

* Update sentinel-policy docs

* Add changelog

* Update docs from feedback
2025-08-07 12:03:18 -07:00
James Rasell
ad508616dc Merge branch 'main' into f-NMD-763-introduction 2025-08-05 08:56:51 +01:00
James Rasell
350662c88e Merge pull request #26291 from hashicorp/f-NMD-763-identity
identity: The initial implementation code for node identity.
2025-08-05 09:52:28 +02:00
tehut
d709accaf5 Add nomad monitor export command (#26178)
* Add MonitorExport command and handlers
* Implement autocomplete
* Require nomad in serviceName
* Fix race in StreamReader.Read
* Add and use framer.Flush() to coordinate function exit
* Add LogFile to client/Server config and read NomadLogPath in rpcHandler instead of HTTPServer
* Parameterize StreamFixed stream size
2025-08-01 10:26:59 -07:00
James Rasell
20251b675d Add CLI and API components for creating node introduction tokens via ACL endpoint. (#26332) 2025-07-25 13:28:45 +01:00
James Rasell
5989d5862a ci: Update golangci-lint to v2 and fix highlighted issues. (#26334) 2025-07-25 10:44:08 +01:00
James Rasell
dce4284361 Merge branch 'main' into f-NMD-763-identity 2025-07-17 07:35:16 +01:00
Allison Larson
918e1eb123 Correctly canonicalize lifecycle block when missing hook value (#26285) 2025-07-16 11:40:16 -07:00
James Rasell
953a149180 client: Allow operators to force a client to renew its identity. (#26277)
The Nomad client will have its identity renewed according to the
TTL which defaults to 24h. In certain situations such as root
keyring rotation, operators may want to force clients to renew
their identities before the TTL threshold is met. This change
introduces a client HTTP and RPC endpoint which will instruct the
node to request a new identity at its next heartbeat. This can be
used via the API or a new command.

While this is a manual intervention step on top of the any keyring
rotation, it dramatically reduces the initial feature complexity
as it provides an asynchronous and efficient method of renewal that
utilises existing functionality.
2025-07-16 14:56:00 +01:00
Tim Gross
35f3f6ce41 scheduler: add disconnect and reschedule info to reconciler output (#26255)
The `DesiredUpdates` struct that we send to the Read Eval API doesn't include
information about disconnect/reconnect and rescheduling. Annotate the
`DesiredUpdates` with this data, and adjust the `eval status` command to display
only those fields that have non-zero values in order to make the output width
manageable.

Ref: https://hashicorp.atlassian.net/browse/NMD-815
2025-07-16 08:46:38 -04:00
Allison Larson
3ca518e89c Add node_pool to blockedEval metric (#26215)
Adds the node_pool to the blockedEval metrics that get emitted for
resource/cpu, along with the dc and node class.
2025-07-15 09:48:04 -07:00
Tim Gross
279775082c sysbatch: correctly validate that reschedule policy is not allowed (#26279)
System and sysbatch jobs don't support the reschedule block, because we'd always
replace allocations back onto the same node. The job validation for system jobs
asserts that the user hasn't set a `reschedule` block so that users aren't
submitting jobs expecting it to be supported. But this validation was missing
for sysbatch jobs.

Validate that sysbatch jobs don't have a reschedule block.
2025-07-15 10:47:02 -04:00
Tim Gross
5c909213ce scheduler: add reconciler annotations to completed evals (#26188)
The output of the reconciler stage of scheduling is only visible via debug-level
logs, typically accessible only to the cluster admin. We can give job authors
better ability to understand what's happening to their jobs if we expose this
information to them in the `eval status` command.

Add the reconciler's desired updates to the evaluation struct so it can be
exposed in the API. This increases the size of evals by roughly 15% in the state
store, or a bit more when there are preemptions (but we expect this will be a
small minority of evals).

Ref: https://hashicorp.atlassian.net/browse/NMD-818
Fixes: https://github.com/hashicorp/nomad/issues/15564
2025-07-07 09:40:21 -04:00
Allison Larson
63f0788747 Expose Kind field for Consul Service Registrations (#26170)
* consul: Add service kind to jobspec

* consul: Add kind to service docs

* Add changelog
2025-06-30 14:32:23 -07:00
Tim Gross
aa3c08d069 eval status: enrich with related evals and placed allocs tables (#26156)
When debugging an evaluation, you almost always want to know about all the
related evaluations and what allocations were placed by that evaluation (and
where), not just failed placements. We can enrich the command by adding the
`related` query parameter to the API, and having the command query for the
evaluations allocations automatically. Emit this data as a pair of new tables
and expose fields like quota limits, and previous/next/blocked eval without the
`-verbose` flag.

Update the docs to include the full output and remove references to long-removed
behavior of the `-json` flag.

Ref: https://hashicorp.atlassian.net/browse/NMD-818
Ref: https://go.hashi.co/rfc/nmd-212
2025-06-30 09:23:36 -04:00
Juana De La Cuesta
bdfd573fc4 Update the scaling policies when deregistering a job (#25911)
* func: Update the scaling policies when deregistering a job

* func: Add tests for updating the policy

* docs: add changelog

* func: set back the old order

* style: rearrange for clarity and to reuse the watchset

* func: set the policies to teh last submitted when starting a job

* func: expand tests  of teh start job command to include job submission

* func: Expand the tests to verify the correct state of the scaling policy after job start

* Update command/job_start.go

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* Update nomad/fsm_test.go

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* func: add warning when there is no previous job submission

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-06-02 16:11:38 +02:00
Michael Smithhisler
4c8257d0c7 client: add once mode to template block (#25922) 2025-05-28 11:45:11 -04:00
Tim Gross
3f59860254 host volumes: add configuration to GC on node GC (#25903)
When a node is garbage collected, any dynamic host volumes on the node are
orphaned in the state store. We generally don't want to automatically collect
these volumes and risk data loss, and have provided a CLI flag to `-force`
remove them in #25902. But for clusters running on ephemeral cloud
instances (ex. AWS EC2 in an autoscaling group), deleting host volumes may add
excessive friction. Add a configuration knob to the client configuration to
remove host volumes from the state store on node GC.

Ref: https://github.com/hashicorp/nomad/pull/25902
Ref: https://github.com/hashicorp/nomad/issues/25762
Ref: https://hashicorp.atlassian.net/browse/NMD-705
2025-05-27 10:22:08 -04:00
tehut
55523ecf8e Add NodeMaxAllocations to client configuration (#25785)
* Set MaxAllocations in client config
Add NodeAllocationTracker struct to Node struct
Evaluate MaxAllocations in AllocsFit function
Set up cli config parsing
Integrate maxAllocs into AllocatedResources view
Co-authored-by: Tim Gross <tgross@hashicorp.com>

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-05-22 12:49:27 -07:00
Tim Gross
41cf1b03b4 host volumes: -force flag for delete (#25902)
When a node is garbage collected, we leave behind the dynamic host volume in the
state store. We don't want to automatically garbage collect the volumes and risk
data loss, but we should allow these to be removed via the API.

Fixes: https://github.com/hashicorp/nomad/issues/25762
Fixes: https://hashicorp.atlassian.net/browse/NMD-705
2025-05-21 08:55:52 -04:00
Piotr Kazmierczak
cdc308a0eb wi: new endpoint for listing workload attached ACL policies (#25588)
This introduces a new HTTP endpoint (and an associated CLI command) for querying
ACL policies associated with a workload identity. It allows users that want
to learn about the ACL capabilities from within WI-tasks to know what sort of
policies are enabled.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
2025-05-19 19:54:12 +02:00
Tim Gross
8a5a057d88 offline license utilization reporting (#25844)
Nomad Enterprise users operating in air-gapped or otherwise secured environments
don't want to send license reporting metrics directly from their
servers. Implement manual/offline reporting by periodically recording usage
metrics snapshots in the state store, and providing an API and CLI by which
cluster administrators can download the snapshot for review and out-of-band
transmission to HashiCorp.

This is the CE portion of the work required for implemention in the Enterprise
product. Nomad CE does not perform utilization reporting.

Ref: https://github.com/hashicorp/nomad-enterprise/pull/2673
Ref: https://hashicorp.atlassian.net/browse/NMD-68
Ref: https://go.hashi.co/rfc/nmd-210
2025-05-14 09:51:13 -04:00
Piotr Kazmierczak
df3b00bce0 acl: use WhoAmI RPC endpoint in /acl/token/self (#25547)
ResolveToken RPC endpoint was only used by the /acl/token/self API. We should migrate to the WI-aware WhoAmI instead.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-04-22 17:53:39 +02:00
tehut
b11619010e Add priority flag to Dispatch CLI and API (#25622)
* Add priority flag to Dispatch CLI and DispatchOpts() helper to HTTP API
2025-04-18 13:24:52 -07:00
James Rasell
85c30dfd1e test: Remove use of "mitchellh/go-testing-interface" for stdlib. (#25640)
The stdlib testing package now includes this interface, so we can
remove our dependency on the external library.
2025-04-14 07:43:49 +01:00
Tim Gross
27caae2b2a api: make attempting to remove peer by address a no-op (#25599)
In Nomad 1.4.0 we removed support for Raft Protocol v2 entirely. But the
`Operator.RemoveRaftPeerByAddress` RPC handler was left in place, along with its
supporting HTTP API and command line flags. Using this API will always result in
the Raft library error "operation not supported with current protocol version".

Unfortunately it's still possible in unit tests to exercise this code path, and
these tests are quite flaky. This changeset turns the RPC handler and HTTP API
into a no-op, removes the associated command line flags, and removes the flaky
tests. I've also cleaned up the test for `RemoveRaftPeerByID` to consolidate
test servers and use `shoenig/test`.

Fixes: https://hashicorp.atlassian.net/browse/NET-12413
Ref: https://github.com/hashicorp/nomad/pull/13467
Ref: https://developer.hashicorp.com/nomad/docs/upgrade/upgrade-specific#raft-protocol-version-2-unsupported
Ref: https://github.com/hashicorp/nomad-enterprise/actions/runs/13201513025/job/36855234398?pr=2302
2025-04-10 09:19:25 -04:00
Daniel Bennett
5c8e436de9 auth: oidc: disable pkce by default (#25600)
our goal of "enable by default, only for new auth methods"
proved to be unwieldy, so instead make it a simple bool,
disabled by default.
2025-04-07 12:36:09 -05:00
Daniel Bennett
6a0c4f5a3d auth: oidc: enable pkce only on new auth methods (#25593)
trying not to violate the principle of least astonishment.

we want to only auto-enable PKCE on *new* auth methods,
rather than *new or updated* auth methods, to avoid a
scenario where a Nomad admin updates an auth method
sometime in the future -- something innocent like a new
client secret -- and their OIDC provider doesn't like PKCE.

the main concern is that the provider won't like PKCE
in a totally confusing way. error messages rarely
say PKCE directly, so why the user's auth method
suddenly broke would be a big mystery.

this means that to enable it on existing auth methods,
you would set `OIDCDisablePKCE = false`, and the double-
negative doesn't feel right, so instead, swap the language,
so enabling it on *existing* methods reads sensibly, and to
disable it on *new* methods reads ok-enough:
`OIDCEnablePKCE = false`
2025-04-03 10:56:17 -05:00
dependabot[bot]
d67a74d0f4 chore(deps): bump github.com/gorilla/websocket in /api (#25502)
Bumps [github.com/gorilla/websocket](https://github.com/gorilla/websocket) from 1.5.0 to 1.5.3.
- [Release notes](https://github.com/gorilla/websocket/releases)
- [Commits](https://github.com/gorilla/websocket/compare/v1.5.0...v1.5.3)

---
updated-dependencies:
- dependency-name: github.com/gorilla/websocket
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-03-25 10:53:27 -04:00
dependabot[bot]
f16104ab84 chore(deps): bump github.com/shoenig/test from 1.7.1 to 1.12.1 in /api (#25501)
Bumps [github.com/shoenig/test](https://github.com/shoenig/test) from 1.7.1 to 1.12.1.
- [Release notes](https://github.com/shoenig/test/releases)
- [Commits](https://github.com/shoenig/test/compare/v1.7.1...v1.12.1)

---
updated-dependencies:
- dependency-name: github.com/shoenig/test
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-03-25 10:52:56 -04:00
dependabot[bot]
de723690f7 chore(deps): bump github.com/felixge/httpsnoop in /api (#25499)
Bumps [github.com/felixge/httpsnoop](https://github.com/felixge/httpsnoop) from 1.0.3 to 1.0.4.
- [Release notes](https://github.com/felixge/httpsnoop/releases)
- [Commits](https://github.com/felixge/httpsnoop/compare/v1.0.3...v1.0.4)

---
updated-dependencies:
- dependency-name: github.com/felixge/httpsnoop
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-03-25 10:02:00 -04:00
Daniel Bennett
8c609ad762 docs: oidc client assertions and pkce (#25375) 2025-03-20 09:14:17 -05:00
Daniel Bennett
d98d414c7f oidc: more tests for client assertions (#25352)
Co-authored-by: dduzgun-security <deniz.duzgun@hashicorp.com>
2025-03-11 15:56:26 -05:00
Tim Gross
1ffb7ab3fb dynamic host volumes: allow plugins to return an error message (#25341)
Errors from `volume create` or `volume delete` only get logged by the client
agent, which may make it harder for volume authors to debug these tasks if they
are not also the cluster administrator with access to host logs.

Allow plugins to include an optional error message in their response. Because we
can't count on receiving this response (the error could come before the plugin
executes), we parse this message optimistically and include it only if
available.

Ref: https://hashicorp.atlassian.net/browse/NET-12087
2025-03-11 11:06:57 -04:00
Daniel Bennett
8e56805fea oidc: support PKCE and client assertion / private key JWT (#25231)
PKCE is enabled by default for new/updated auth methods.
 * ref: https://oauth.net/2/pkce/

Client assertions are an optional, more secure replacement for client secrets
 * ref: https://oauth.net/private-key-jwt/

a change to the existing flow, even without these new options,
is that the oidc.Req is retained on the Nomad server (leader)
in between auth-url and complete-auth calls.

and some fields in auth method config are now more strictly required.
2025-03-10 13:32:53 -05:00
James Rasell
c53ba3e7d1 consul: Remove implicit workload identity when task has a template. (#25298)
When a task included a template block, Nomad was adding a Consul
identity by default which allowed the template to use Consul API
template functions even when they were not needed or desired.

This change removes the implict addition of Consul identities to
tasks when they include a template block. Job specification
authors will now need to add a Consul identity or Consul block to
their task if they have a template which uses Consul API functions.

This change also removes the default addition of a Consul block to
all task groups registered and processed by the API package.
2025-03-10 13:49:50 +00:00
Michael Smithhisler
5c4d0e923d consul: Remove legacy token based authentication workflow (#25217) 2025-03-05 15:38:11 -05:00