Commit Graph

1329 Commits

Author SHA1 Message Date
Michael Smithhisler
485356c3d3 csi: fix volume registration error (#26642) 2025-08-27 15:00:16 -04:00
James Rasell
3b0b7db1a1 client: Add client identity API, CLI, and RPC workflow. (#26543)
The Nomad clients store their Nomad identity in memory and within
their state store. While active, it is not possible to dump the
state to view the stored identity token, so having a way to view
the current claims while running aids debugging and operations.

This change adds a client identity workflow, allowing operators
to view the current claims of the nodes identity. It does not
return any of the signing key material.
2025-08-19 08:25:51 +01:00
Daniel Bennett
2c699b9794 sysbatch: fix panic from reschedule block (#26534)
* fix panic from nil ReschedulePolicy

commit 279775082c (pr #26279)
intended to return an error for sysbatch jobs with a reschedule block,
but in bypassing populating the `ReschedulePolicy`'s pointer fields,
a nil pointer panic occurred before the job could get rejected
with the intended error.

in particular, in `command/agent/job_endpoint.go`, `func ApiTgToStructsTG`,

```
if taskGroup.ReschedulePolicy != nil {
	tg.ReschedulePolicy = &structs.ReschedulePolicy{
		Attempts:      *taskGroup.ReschedulePolicy.Attempts,
		Interval:      *taskGroup.ReschedulePolicy.Interval,
```

`*taskGroup.ReschedulePolicy.Interval` was a nil pointer.

* fix e2e test jobs
2025-08-18 10:19:14 -04:00
Allison Larson
e16a3339ad Add CSI Volume Sentinel Policy scaffolding (#26438)
* Add ent policy enforcement stubs to CSI Volume create/register

* Wire policy override/warnings through CSI volume register/create

* Add new scope to sentinel apply

* Sanitize CSISecrets & CSIMountOptions

* Add sentinel policy scope to ui

* Update docs for new sentinel scope/policy

* Create new api funcs for CSI endpoints

* fix sentinel csi ui test

* Update sentinel-policy docs

* Add changelog

* Update docs from feedback
2025-08-07 12:03:18 -07:00
James Rasell
ad508616dc Merge branch 'main' into f-NMD-763-introduction 2025-08-05 08:56:51 +01:00
James Rasell
350662c88e Merge pull request #26291 from hashicorp/f-NMD-763-identity
identity: The initial implementation code for node identity.
2025-08-05 09:52:28 +02:00
tehut
d709accaf5 Add nomad monitor export command (#26178)
* Add MonitorExport command and handlers
* Implement autocomplete
* Require nomad in serviceName
* Fix race in StreamReader.Read
* Add and use framer.Flush() to coordinate function exit
* Add LogFile to client/Server config and read NomadLogPath in rpcHandler instead of HTTPServer
* Parameterize StreamFixed stream size
2025-08-01 10:26:59 -07:00
James Rasell
20251b675d Add CLI and API components for creating node introduction tokens via ACL endpoint. (#26332) 2025-07-25 13:28:45 +01:00
James Rasell
5989d5862a ci: Update golangci-lint to v2 and fix highlighted issues. (#26334) 2025-07-25 10:44:08 +01:00
James Rasell
dce4284361 Merge branch 'main' into f-NMD-763-identity 2025-07-17 07:35:16 +01:00
Allison Larson
918e1eb123 Correctly canonicalize lifecycle block when missing hook value (#26285) 2025-07-16 11:40:16 -07:00
James Rasell
953a149180 client: Allow operators to force a client to renew its identity. (#26277)
The Nomad client will have its identity renewed according to the
TTL which defaults to 24h. In certain situations such as root
keyring rotation, operators may want to force clients to renew
their identities before the TTL threshold is met. This change
introduces a client HTTP and RPC endpoint which will instruct the
node to request a new identity at its next heartbeat. This can be
used via the API or a new command.

While this is a manual intervention step on top of the any keyring
rotation, it dramatically reduces the initial feature complexity
as it provides an asynchronous and efficient method of renewal that
utilises existing functionality.
2025-07-16 14:56:00 +01:00
Tim Gross
35f3f6ce41 scheduler: add disconnect and reschedule info to reconciler output (#26255)
The `DesiredUpdates` struct that we send to the Read Eval API doesn't include
information about disconnect/reconnect and rescheduling. Annotate the
`DesiredUpdates` with this data, and adjust the `eval status` command to display
only those fields that have non-zero values in order to make the output width
manageable.

Ref: https://hashicorp.atlassian.net/browse/NMD-815
2025-07-16 08:46:38 -04:00
Allison Larson
3ca518e89c Add node_pool to blockedEval metric (#26215)
Adds the node_pool to the blockedEval metrics that get emitted for
resource/cpu, along with the dc and node class.
2025-07-15 09:48:04 -07:00
Tim Gross
279775082c sysbatch: correctly validate that reschedule policy is not allowed (#26279)
System and sysbatch jobs don't support the reschedule block, because we'd always
replace allocations back onto the same node. The job validation for system jobs
asserts that the user hasn't set a `reschedule` block so that users aren't
submitting jobs expecting it to be supported. But this validation was missing
for sysbatch jobs.

Validate that sysbatch jobs don't have a reschedule block.
2025-07-15 10:47:02 -04:00
Tim Gross
5c909213ce scheduler: add reconciler annotations to completed evals (#26188)
The output of the reconciler stage of scheduling is only visible via debug-level
logs, typically accessible only to the cluster admin. We can give job authors
better ability to understand what's happening to their jobs if we expose this
information to them in the `eval status` command.

Add the reconciler's desired updates to the evaluation struct so it can be
exposed in the API. This increases the size of evals by roughly 15% in the state
store, or a bit more when there are preemptions (but we expect this will be a
small minority of evals).

Ref: https://hashicorp.atlassian.net/browse/NMD-818
Fixes: https://github.com/hashicorp/nomad/issues/15564
2025-07-07 09:40:21 -04:00
Allison Larson
63f0788747 Expose Kind field for Consul Service Registrations (#26170)
* consul: Add service kind to jobspec

* consul: Add kind to service docs

* Add changelog
2025-06-30 14:32:23 -07:00
Tim Gross
aa3c08d069 eval status: enrich with related evals and placed allocs tables (#26156)
When debugging an evaluation, you almost always want to know about all the
related evaluations and what allocations were placed by that evaluation (and
where), not just failed placements. We can enrich the command by adding the
`related` query parameter to the API, and having the command query for the
evaluations allocations automatically. Emit this data as a pair of new tables
and expose fields like quota limits, and previous/next/blocked eval without the
`-verbose` flag.

Update the docs to include the full output and remove references to long-removed
behavior of the `-json` flag.

Ref: https://hashicorp.atlassian.net/browse/NMD-818
Ref: https://go.hashi.co/rfc/nmd-212
2025-06-30 09:23:36 -04:00
Juana De La Cuesta
bdfd573fc4 Update the scaling policies when deregistering a job (#25911)
* func: Update the scaling policies when deregistering a job

* func: Add tests for updating the policy

* docs: add changelog

* func: set back the old order

* style: rearrange for clarity and to reuse the watchset

* func: set the policies to teh last submitted when starting a job

* func: expand tests  of teh start job command to include job submission

* func: Expand the tests to verify the correct state of the scaling policy after job start

* Update command/job_start.go

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* Update nomad/fsm_test.go

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* func: add warning when there is no previous job submission

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-06-02 16:11:38 +02:00
Michael Smithhisler
4c8257d0c7 client: add once mode to template block (#25922) 2025-05-28 11:45:11 -04:00
Tim Gross
3f59860254 host volumes: add configuration to GC on node GC (#25903)
When a node is garbage collected, any dynamic host volumes on the node are
orphaned in the state store. We generally don't want to automatically collect
these volumes and risk data loss, and have provided a CLI flag to `-force`
remove them in #25902. But for clusters running on ephemeral cloud
instances (ex. AWS EC2 in an autoscaling group), deleting host volumes may add
excessive friction. Add a configuration knob to the client configuration to
remove host volumes from the state store on node GC.

Ref: https://github.com/hashicorp/nomad/pull/25902
Ref: https://github.com/hashicorp/nomad/issues/25762
Ref: https://hashicorp.atlassian.net/browse/NMD-705
2025-05-27 10:22:08 -04:00
tehut
55523ecf8e Add NodeMaxAllocations to client configuration (#25785)
* Set MaxAllocations in client config
Add NodeAllocationTracker struct to Node struct
Evaluate MaxAllocations in AllocsFit function
Set up cli config parsing
Integrate maxAllocs into AllocatedResources view
Co-authored-by: Tim Gross <tgross@hashicorp.com>

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-05-22 12:49:27 -07:00
Tim Gross
41cf1b03b4 host volumes: -force flag for delete (#25902)
When a node is garbage collected, we leave behind the dynamic host volume in the
state store. We don't want to automatically garbage collect the volumes and risk
data loss, but we should allow these to be removed via the API.

Fixes: https://github.com/hashicorp/nomad/issues/25762
Fixes: https://hashicorp.atlassian.net/browse/NMD-705
2025-05-21 08:55:52 -04:00
Piotr Kazmierczak
cdc308a0eb wi: new endpoint for listing workload attached ACL policies (#25588)
This introduces a new HTTP endpoint (and an associated CLI command) for querying
ACL policies associated with a workload identity. It allows users that want
to learn about the ACL capabilities from within WI-tasks to know what sort of
policies are enabled.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
2025-05-19 19:54:12 +02:00
Tim Gross
8a5a057d88 offline license utilization reporting (#25844)
Nomad Enterprise users operating in air-gapped or otherwise secured environments
don't want to send license reporting metrics directly from their
servers. Implement manual/offline reporting by periodically recording usage
metrics snapshots in the state store, and providing an API and CLI by which
cluster administrators can download the snapshot for review and out-of-band
transmission to HashiCorp.

This is the CE portion of the work required for implemention in the Enterprise
product. Nomad CE does not perform utilization reporting.

Ref: https://github.com/hashicorp/nomad-enterprise/pull/2673
Ref: https://hashicorp.atlassian.net/browse/NMD-68
Ref: https://go.hashi.co/rfc/nmd-210
2025-05-14 09:51:13 -04:00
Piotr Kazmierczak
df3b00bce0 acl: use WhoAmI RPC endpoint in /acl/token/self (#25547)
ResolveToken RPC endpoint was only used by the /acl/token/self API. We should migrate to the WI-aware WhoAmI instead.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-04-22 17:53:39 +02:00
tehut
b11619010e Add priority flag to Dispatch CLI and API (#25622)
* Add priority flag to Dispatch CLI and DispatchOpts() helper to HTTP API
2025-04-18 13:24:52 -07:00
James Rasell
85c30dfd1e test: Remove use of "mitchellh/go-testing-interface" for stdlib. (#25640)
The stdlib testing package now includes this interface, so we can
remove our dependency on the external library.
2025-04-14 07:43:49 +01:00
Tim Gross
27caae2b2a api: make attempting to remove peer by address a no-op (#25599)
In Nomad 1.4.0 we removed support for Raft Protocol v2 entirely. But the
`Operator.RemoveRaftPeerByAddress` RPC handler was left in place, along with its
supporting HTTP API and command line flags. Using this API will always result in
the Raft library error "operation not supported with current protocol version".

Unfortunately it's still possible in unit tests to exercise this code path, and
these tests are quite flaky. This changeset turns the RPC handler and HTTP API
into a no-op, removes the associated command line flags, and removes the flaky
tests. I've also cleaned up the test for `RemoveRaftPeerByID` to consolidate
test servers and use `shoenig/test`.

Fixes: https://hashicorp.atlassian.net/browse/NET-12413
Ref: https://github.com/hashicorp/nomad/pull/13467
Ref: https://developer.hashicorp.com/nomad/docs/upgrade/upgrade-specific#raft-protocol-version-2-unsupported
Ref: https://github.com/hashicorp/nomad-enterprise/actions/runs/13201513025/job/36855234398?pr=2302
2025-04-10 09:19:25 -04:00
Daniel Bennett
5c8e436de9 auth: oidc: disable pkce by default (#25600)
our goal of "enable by default, only for new auth methods"
proved to be unwieldy, so instead make it a simple bool,
disabled by default.
2025-04-07 12:36:09 -05:00
Daniel Bennett
6a0c4f5a3d auth: oidc: enable pkce only on new auth methods (#25593)
trying not to violate the principle of least astonishment.

we want to only auto-enable PKCE on *new* auth methods,
rather than *new or updated* auth methods, to avoid a
scenario where a Nomad admin updates an auth method
sometime in the future -- something innocent like a new
client secret -- and their OIDC provider doesn't like PKCE.

the main concern is that the provider won't like PKCE
in a totally confusing way. error messages rarely
say PKCE directly, so why the user's auth method
suddenly broke would be a big mystery.

this means that to enable it on existing auth methods,
you would set `OIDCDisablePKCE = false`, and the double-
negative doesn't feel right, so instead, swap the language,
so enabling it on *existing* methods reads sensibly, and to
disable it on *new* methods reads ok-enough:
`OIDCEnablePKCE = false`
2025-04-03 10:56:17 -05:00
dependabot[bot]
d67a74d0f4 chore(deps): bump github.com/gorilla/websocket in /api (#25502)
Bumps [github.com/gorilla/websocket](https://github.com/gorilla/websocket) from 1.5.0 to 1.5.3.
- [Release notes](https://github.com/gorilla/websocket/releases)
- [Commits](https://github.com/gorilla/websocket/compare/v1.5.0...v1.5.3)

---
updated-dependencies:
- dependency-name: github.com/gorilla/websocket
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-03-25 10:53:27 -04:00
dependabot[bot]
f16104ab84 chore(deps): bump github.com/shoenig/test from 1.7.1 to 1.12.1 in /api (#25501)
Bumps [github.com/shoenig/test](https://github.com/shoenig/test) from 1.7.1 to 1.12.1.
- [Release notes](https://github.com/shoenig/test/releases)
- [Commits](https://github.com/shoenig/test/compare/v1.7.1...v1.12.1)

---
updated-dependencies:
- dependency-name: github.com/shoenig/test
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-03-25 10:52:56 -04:00
dependabot[bot]
de723690f7 chore(deps): bump github.com/felixge/httpsnoop in /api (#25499)
Bumps [github.com/felixge/httpsnoop](https://github.com/felixge/httpsnoop) from 1.0.3 to 1.0.4.
- [Release notes](https://github.com/felixge/httpsnoop/releases)
- [Commits](https://github.com/felixge/httpsnoop/compare/v1.0.3...v1.0.4)

---
updated-dependencies:
- dependency-name: github.com/felixge/httpsnoop
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-03-25 10:02:00 -04:00
Daniel Bennett
8c609ad762 docs: oidc client assertions and pkce (#25375) 2025-03-20 09:14:17 -05:00
Daniel Bennett
d98d414c7f oidc: more tests for client assertions (#25352)
Co-authored-by: dduzgun-security <deniz.duzgun@hashicorp.com>
2025-03-11 15:56:26 -05:00
Tim Gross
1ffb7ab3fb dynamic host volumes: allow plugins to return an error message (#25341)
Errors from `volume create` or `volume delete` only get logged by the client
agent, which may make it harder for volume authors to debug these tasks if they
are not also the cluster administrator with access to host logs.

Allow plugins to include an optional error message in their response. Because we
can't count on receiving this response (the error could come before the plugin
executes), we parse this message optimistically and include it only if
available.

Ref: https://hashicorp.atlassian.net/browse/NET-12087
2025-03-11 11:06:57 -04:00
Daniel Bennett
8e56805fea oidc: support PKCE and client assertion / private key JWT (#25231)
PKCE is enabled by default for new/updated auth methods.
 * ref: https://oauth.net/2/pkce/

Client assertions are an optional, more secure replacement for client secrets
 * ref: https://oauth.net/private-key-jwt/

a change to the existing flow, even without these new options,
is that the oidc.Req is retained on the Nomad server (leader)
in between auth-url and complete-auth calls.

and some fields in auth method config are now more strictly required.
2025-03-10 13:32:53 -05:00
James Rasell
c53ba3e7d1 consul: Remove implicit workload identity when task has a template. (#25298)
When a task included a template block, Nomad was adding a Consul
identity by default which allowed the template to use Consul API
template functions even when they were not needed or desired.

This change removes the implict addition of Consul identities to
tasks when they include a template block. Job specification
authors will now need to add a Consul identity or Consul block to
their task if they have a template which uses Consul API functions.

This change also removes the default addition of a Consul block to
all task groups registered and processed by the API package.
2025-03-10 13:49:50 +00:00
Michael Smithhisler
5c4d0e923d consul: Remove legacy token based authentication workflow (#25217) 2025-03-05 15:38:11 -05:00
Michael Smithhisler
f2b761f17c disconnected: removes deprecated disconnect fields (#25284)
The group level fields stop_after_client_disconnect,
max_client_disconnect, and prevent_reschedule_on_lost were deprecated in
Nomad 1.8 and replaced by field in the disconnect block. This change
removes any logic related to those deprecated fields.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-03-05 14:46:02 -05:00
James Rasell
7268053174 vault: Remove legacy token based authentication workflow. (#25155)
The legacy workflow for Vault whereby servers were configured
using a token to provide authentication to the Vault API has now
been removed. This change also removes the workflow where servers
were responsible for deriving Vault tokens for Nomad clients.

The deprecated Vault config options used byi the Nomad agent have
all been removed except for "token" which is still in use by the
Vault Transit keyring implementation.

Job specification authors can no longer use the "vault.policies"
parameter and should instead use "vault.role" when not using the
default workload identity.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
2025-02-28 07:40:02 +00:00
Piotr Kazmierczak
58c6387323 stateful deployments: task group host volume claims API (#25114)
This PR introduces API endpoints /v1/volumes/claims/ and /v1/volumes/claim/:id
for listing and deleting task group host volume claims, respectively.
2025-02-25 15:51:59 +01:00
Piotr Kazmierczak
611452e1af stateful deployments: use TaskGroupVolumeClaim table to associate volume requests with volume IDs (#24993)
We introduce an alternative solution to the one presented in #24960 which is
based on the state store and not previous-next allocation tracking in the
reconciler. This new solution reduces cognitive complexity of the scheduler
code at the cost of slightly more boilerplate code, but also opens up new
possibilities in the future, e.g., allowing users to explicitly "un-stick"
volumes with workloads still running.

The diagram below illustrates the new logic:

     SetVolumes()                                               upsertAllocsImpl()          
     sets ns, job                             +-----------------checks if alloc requests    
     tg in the scheduler                      v                 sticky vols and consults    
            |                  +-----------------------+        state. If there is no claim,
            |                  | TaskGroupVolumeClaim: |        it creates one.             
            |                  | - namespace           |                                    
            |                  | - jobID               |                                    
            |                  | - tg name             |                                    
            |                  | - vol ID              |                                    
            v                  | uniquely identify vol |                                    
     hasVolumes()              +----+------------------+                                    
     consults the state             |           ^                                           
     and returns true               |           |               DeleteJobTxn()              
     if there's a match <-----------+           +---------------removes the claim from      
     or if there is no                                          the state                   
     previous claim                                                                         
|                             | |                                                      |    
+-----------------------------+ +------------------------------------------------------+    
                                                                                            
           scheduler                                  state store
2025-02-07 17:41:01 +01:00
Michael Smithhisler
47c14ddf28 remove remote task execution code (#24909) 2025-01-29 08:08:34 -05:00
Tim Gross
09eb473189 dynamic host volumes: set status unavailable on failed restore (#24962)
When a client restarts but can't restore a volume (ex. the plugin is now
missing), it's removed from the node fingerprint. So we won't allow future
scheduling of the volume, but we were not updating the volume state field to
report this reasoning to operators. Make debugging easier and the state field
more meaningful by setting the value to "unavailable".

Also, remove the unused "deleted" field. We did not implement soft deletes and
aren't planning on it for Nomad 1.10.0.

Ref: https://hashicorp.atlassian.net/browse/NET-11551
2025-01-27 16:35:53 -05:00
Michael Smithhisler
d621211108 auth: adds option to enable verbose logging during sso (#24892)
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
2025-01-23 11:40:01 -05:00
Piotr Kazmierczak
ebffcce378 stateful deployments: remove CSIVolumeIDs (#24908) 2025-01-21 17:00:55 +01:00
Tim Gross
96e539ee87 dynamic host volumes quotas (#24871)
Allow users to configure a host volumes quota in MB. This will be enforced at
the time of provisioning via create/register RPCs. This changeset is the CE
version of ENT/2114.

Ref: https://github.com/hashicorp/nomad-enterprise/pull/2114
Ref: https://hashicorp.atlassian.net/browse/NET-11549
2025-01-17 11:41:56 -05:00
Tim Gross
203a6533bb API: host volume access modes should match list in structs package (#24838)
We changed the list of access modes available for dynamic host volumes in #24705
but neglected to change them in the API package. Update the API package to
match.

Ref: https://github.com/hashicorp/nomad/pull/24705
2025-01-13 15:00:22 -05:00