Nomad clients store their Nomad identity in memory and within
their state store. While the client is running, it is not possible
to dump the state to view the stored identity token, so having a way
to view the current claims at runtime aids debugging and operations.
This change adds a client identity workflow, allowing operators
to view the current claims of the node's identity. It does not
return any of the signing key material.
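To illustrate "claims only, no key material": the claims segment of a JWT can
be decoded and returned on its own, leaving the signature and keys out
entirely (a minimal sketch, not Nomad's actual handler):
```go
// Minimal sketch: decode only the claims segment of a JWT so a caller sees
// the claims but never the signature or any signing key material. This is
// illustrative and is not Nomad's actual endpoint implementation.
package identity

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"strings"
)

func claimsOnly(token string) (map[string]any, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return nil, fmt.Errorf("malformed JWT")
	}
	payload, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return nil, err
	}
	claims := map[string]any{}
	if err := json.Unmarshal(payload, &claims); err != nil {
		return nil, err
	}
	return claims, nil // the signature segment (parts[2]) is never returned
}
```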
* fix panic from nil ReschedulePolicy
commit 279775082c (pr #26279)
intended to return an error for sysbatch jobs with a reschedule block,
but it skipped populating the `ReschedulePolicy`'s pointer fields,
so a nil pointer panic occurred before the job could be rejected
with the intended error.
in particular, in `command/agent/job_endpoint.go`, `func ApiTgToStructsTG`:
```go
if taskGroup.ReschedulePolicy != nil {
tg.ReschedulePolicy = &structs.ReschedulePolicy{
Attempts: *taskGroup.ReschedulePolicy.Attempts,
Interval: *taskGroup.ReschedulePolicy.Interval,
```
`*taskGroup.ReschedulePolicy.Interval` was a nil pointer.
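a guarded version looks roughly like this (a sketch of the general pattern,
continuing the snippet above; not necessarily the exact shipped fix):
```go
// sketch: only dereference the pointer fields that are actually set, so the
// job survives conversion and is rejected later by validation with a clear error.
if taskGroup.ReschedulePolicy != nil {
	tg.ReschedulePolicy = &structs.ReschedulePolicy{}
	if taskGroup.ReschedulePolicy.Attempts != nil {
		tg.ReschedulePolicy.Attempts = *taskGroup.ReschedulePolicy.Attempts
	}
	if taskGroup.ReschedulePolicy.Interval != nil {
		tg.ReschedulePolicy.Interval = *taskGroup.ReschedulePolicy.Interval
	}
}
```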
* fix e2e test jobs
* Add MonitorExport command and handlers
* Implement autocomplete
* Require nomad in serviceName
* Fix race in StreamReader.Read
* Add and use framer.Flush() to coordinate function exit
* Add LogFile to client/Server config and read NomadLogPath in rpcHandler instead of HTTPServer
* Parameterize StreamFixed stream size
The Nomad client will have its identity renewed according to the
TTL, which defaults to 24h. In certain situations such as root
keyring rotation, operators may want to force clients to renew
their identities before the TTL threshold is met. This change
introduces a client HTTP and RPC endpoint which will instruct the
node to request a new identity at its next heartbeat. This can be
used via the API or a new command.
While this is a manual intervention step on top of any keyring
rotation, it dramatically reduces the initial feature complexity
as it provides an asynchronous and efficient method of renewal that
utilises existing functionality.
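The "renew at the next heartbeat" behaviour boils down to a flag that the new
endpoint sets and the heartbeat loop consumes; a minimal illustrative sketch
with made-up names, not the actual client code:
```go
// Illustrative only: the RPC handler flips a flag, and the heartbeat loop
// requests a fresh identity the next time it runs. Names are hypothetical.
package client

import "sync/atomic"

type identityRenewer struct {
	renewRequested atomic.Bool
}

// called by the (hypothetical) node identity renew RPC endpoint
func (r *identityRenewer) RequestRenewal() {
	r.renewRequested.Store(true)
}

// called from the existing heartbeat loop
func (r *identityRenewer) onHeartbeat(requestNewIdentity func() error) error {
	if r.renewRequested.CompareAndSwap(true, false) {
		return requestNewIdentity()
	}
	return nil
}
```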
The `DesiredUpdates` struct that we send to the Read Eval API doesn't include
information about disconnect/reconnect and rescheduling. Annotate the
`DesiredUpdates` with this data, and adjust the `eval status` command to display
only those fields that have non-zero values in order to make the output width
manageable.
Ref: https://hashicorp.atlassian.net/browse/NMD-815
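Showing only the non-zero fields can be done generically; a hedged sketch (not
the actual `eval status` code) that walks a counters struct with reflection:
```go
// Hedged sketch: walk a struct of uint64 counters and keep only the fields
// with non-zero values, so wide annotation structs stay readable in a table.
// The struct shape here is illustrative, not Nomad's exact DesiredUpdates.
package main

import (
	"fmt"
	"reflect"
)

type desiredUpdates struct {
	Place             uint64
	Stop              uint64
	InPlaceUpdate     uint64
	DestructiveUpdate uint64
	Reschedule        uint64
	Disconnect        uint64
	Reconnect         uint64
}

func nonZeroFields(v any) map[string]uint64 {
	out := map[string]uint64{}
	rv := reflect.ValueOf(v)
	rt := rv.Type()
	for i := 0; i < rv.NumField(); i++ {
		if n := rv.Field(i).Uint(); n != 0 {
			out[rt.Field(i).Name] = n
		}
	}
	return out
}

func main() {
	fmt.Println(nonZeroFields(desiredUpdates{Place: 3, Reschedule: 1}))
	// map[Place:3 Reschedule:1]
}
```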
System and sysbatch jobs don't support the reschedule block, because we'd always
replace allocations back onto the same node. The job validation for system jobs
asserts that the user hasn't set a `reschedule` block so that users aren't
submitting jobs expecting it to be supported. But this validation was missing
for sysbatch jobs.
Validate that sysbatch jobs don't have a reschedule block.
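The added check follows the same shape as the existing system-job validation;
roughly (an illustrative sketch, not the exact code):
```go
package validation

import (
	"fmt"

	"github.com/hashicorp/nomad/nomad/structs"
)

// Rough sketch of the validation described above; the helper name and error
// wording are illustrative rather than copied from Nomad.
func validateNoReschedule(jobType string, tg *structs.TaskGroup) error {
	switch jobType {
	case structs.JobTypeSystem, structs.JobTypeSysBatch:
		if tg.ReschedulePolicy != nil {
			return fmt.Errorf("task group %q does not allow a reschedule block for %s jobs",
				tg.Name, jobType)
		}
	}
	return nil
}
```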
The output of the reconciler stage of scheduling is only visible via debug-level
logs, typically accessible only to the cluster admin. We can help job authors
understand what's happening to their jobs by exposing this
information in the `eval status` command.
Add the reconciler's desired updates to the evaluation struct so it can be
exposed in the API. This increases the size of evals by roughly 15% in the state
store, or a bit more when there are preemptions (but we expect this will be a
small minority of evals).
Ref: https://hashicorp.atlassian.net/browse/NMD-818
Fixes: https://github.com/hashicorp/nomad/issues/15564
When debugging an evaluation, you almost always want to know about all the
related evaluations and what allocations were placed by that evaluation (and
where), not just failed placements. We can enrich the command by adding the
`related` query parameter to the API, and having the command query for the
evaluation's allocations automatically. Emit this data as a pair of new tables
and expose fields like quota limits, and previous/next/blocked eval without the
`-verbose` flag.
Update the docs to include the full output and remove references to long-removed
behavior of the `-json` flag.
Ref: https://hashicorp.atlassian.net/browse/NMD-818
Ref: https://go.hashi.co/rfc/nmd-212
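For reference, the extra data is also reachable from the Go `api` client; a
hedged example that passes the `related` query parameter through the standard
`api.QueryOptions.Params` mechanism (everything else is ordinary `api` usage):
```go
// Hedged example: fetch an evaluation with its related evaluations by passing
// the "related" query parameter through the standard api.QueryOptions.Params.
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	evalID := "9ecf1ee7-..." // placeholder evaluation ID
	eval, _, err := client.Evaluations().Info(evalID, &api.QueryOptions{
		Params: map[string]string{"related": "true"},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(eval.Status, len(eval.RelatedEvals))
}
```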
* func: Update the scaling policies when deregistering a job
* func: Add tests for updating the policy
* docs: add changelog
* func: set back the old order
* style: rearrange for clarity and to reuse the watchset
* func: set the policies to the last submitted when starting a job
* func: expand tests of the start job command to include job submission
* func: Expand the tests to verify the correct state of the scaling policy after job start
* Update command/job_start.go
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* Update nomad/fsm_test.go
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* func: add warning when there is no previous job submission
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
When a node is garbage collected, any dynamic host volumes on the node are
orphaned in the state store. We generally don't want to automatically collect
these volumes and risk data loss, and have provided a CLI flag to `-force`
remove them in #25902. But for clusters running on ephemeral cloud
instances (ex. AWS EC2 in an autoscaling group), deleting host volumes may add
excessive friction. Add a configuration knob to the client configuration to
remove host volumes from the state store on node GC.
Ref: https://github.com/hashicorp/nomad/pull/25902
Ref: https://github.com/hashicorp/nomad/issues/25762
Ref: https://hashicorp.atlassian.net/browse/NMD-705
* Set MaxAllocations in client config
Add NodeAllocationTracker struct to Node struct
Evaluate MaxAllocations in AllocsFit function (see the sketch after this list)
Set up cli config parsing
Integrate maxAllocs into AllocatedResources view
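The capacity check added to the fit logic can be sketched as follows; names
here are illustrative, and the real check lives in `AllocsFit` with the limit
carried on the node's allocation tracker:
```go
// Illustrative sketch of a per-node allocation cap in a fit check. The field
// and function names here are made up; Nomad's real logic lives in AllocsFit.
package fit

type node struct {
	MaxAllocs int // 0 means unlimited
}

type alloc struct {
	ID string
}

// allocsFit returns false with a reason when the proposed allocations would
// exceed the node's configured maximum.
func allocsFit(n *node, proposed []*alloc) (bool, string) {
	if n.MaxAllocs > 0 && len(proposed) > n.MaxAllocs {
		return false, "max allocations exceeded"
	}
	return true, ""
}
```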
Co-authored-by: Tim Gross <tgross@hashicorp.com>
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
This introduces a new HTTP endpoint (and an associated CLI command) for querying
ACL policies associated with a workload identity. It allows users who want
to learn about the ACL capabilities available from within WI tasks to see what
sort of policies are enabled.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
Nomad Enterprise users operating in air-gapped or otherwise secured environments
don't want to send license reporting metrics directly from their
servers. Implement manual/offline reporting by periodically recording usage
metrics snapshots in the state store, and providing an API and CLI by which
cluster administrators can download the snapshot for review and out-of-band
transmission to HashiCorp.
This is the CE portion of the work required for implementation in the Enterprise
product. Nomad CE does not perform utilization reporting.
Ref: https://github.com/hashicorp/nomad-enterprise/pull/2673
Ref: https://hashicorp.atlassian.net/browse/NMD-68
Ref: https://go.hashi.co/rfc/nmd-210
The ResolveToken RPC endpoint was only used by the /acl/token/self API. We should migrate to the WI-aware WhoAmI instead.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
trying not to violate the principle of least astonishment.
we want to only auto-enable PKCE on *new* auth methods,
rather than *new or updated* auth methods, to avoid a
scenario where a Nomad admin updates an auth method
sometime in the future -- something innocent like a new
client secret -- and their OIDC provider doesn't like PKCE.
the main concern is that the provider may reject PKCE
in a totally confusing way: error messages rarely
mention PKCE directly, so why the user's auth method
suddenly broke would be a big mystery.
this means that to enable it on existing auth methods,
you would set `OIDCDisablePKCE = false`, and that double
negative doesn't feel right. so instead, swap the language
so that enabling it on *existing* methods reads sensibly,
and disabling it on *new* methods reads ok-enough:
`OIDCEnablePKCE = false`
Errors from `volume create` or `volume delete` only get logged by the client
agent, which may make it harder for volume authors to debug these tasks if they
are not also the cluster administrator with access to host logs.
Allow plugins to include an optional error message in their response. Because we
can't count on receiving this response (the error could come before the plugin
executes), we parse this message optimistically and include it only if
available.
Ref: https://hashicorp.atlassian.net/browse/NET-12087
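The optimistic parsing can be sketched like this (the response schema shown is
illustrative, not Nomad's actual plugin contract):
```go
// Illustrative sketch: a host volume plugin may (or may not) emit a JSON
// response containing an error message. Parse it opportunistically and only
// surface the message when one is actually present.
package plugin

import "encoding/json"

type pluginResponse struct {
	Error string `json:"error,omitempty"`
}

// errorFromOutput returns the plugin-provided error message, if any.
// Garbage or empty output simply yields "".
func errorFromOutput(stdout []byte) string {
	var resp pluginResponse
	if err := json.Unmarshal(stdout, &resp); err != nil {
		return "" // not a parsable response; nothing to surface
	}
	return resp.Error
}
```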
PKCE is enabled by default for new/updated auth methods.
* ref: https://oauth.net/2/pkce/
Client assertions are an optional, more secure replacement for client secrets
* ref: https://oauth.net/private-key-jwt/
a change to the existing flow, even without these new options,
is that the oidc.Req is retained on the Nomad server (leader)
between the auth-url and complete-auth calls.
also, some fields in the auth method config are now more strictly required.
When a task included a template block, Nomad was adding a Consul
identity by default which allowed the template to use Consul API
template functions even when they were not needed or desired.
This change removes the implicit addition of Consul identities to
tasks when they include a template block. Job specification
authors will now need to add a Consul identity or Consul block to
their task if they have a template which uses Consul API functions.
This change also removes the default addition of a Consul block to
all task groups registered and processed by the API package.
The group level fields stop_after_client_disconnect,
max_client_disconnect, and prevent_reschedule_on_lost were deprecated in
Nomad 1.8 and replaced by fields in the disconnect block. This change
removes any logic related to those deprecated fields.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
The legacy workflow for Vault whereby servers were configured
using a token to provide authentication to the Vault API has now
been removed. This change also removes the workflow where servers
were responsible for deriving Vault tokens for Nomad clients.
The deprecated Vault config options used by the Nomad agent have
all been removed except for "token" which is still in use by the
Vault Transit keyring implementation.
Job specification authors can no longer use the "vault.policies"
parameter and should instead use "vault.role" when not using the
default workload identity.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
We introduce an alternative solution to the one presented in #24960 which is
based on the state store and not previous-next allocation tracking in the
reconciler. This new solution reduces cognitive complexity of the scheduler
code at the cost of slightly more boilerplate code, but also opens up new
possibilities in the future, e.g., allowing users to explicitly "un-stick"
volumes with workloads still running.
The new logic works as follows:
In the scheduler:
* `SetVolumes()` sets the namespace, job, and task group used for claim lookups.
* `hasVolumes()` consults the state and returns true if there is a match, or if
there is no previous claim.
In the state store:
* `TaskGroupVolumeClaim` records the namespace, job ID, task group name, and
volume ID, which together uniquely identify a volume claim.
* `upsertAllocsImpl()` checks whether an allocation requests sticky volumes and
consults the state; if there is no claim, it creates one.
* `DeleteJobTxn()` removes the claim from the state.
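A rough sketch of the claim record described above (field names are
illustrative):
```go
// Sketch of the claim record described above. Field names are illustrative;
// the point is that the four values together uniquely identify a sticky
// volume claim for a task group.
package state

type TaskGroupVolumeClaim struct {
	Namespace     string
	JobID         string
	TaskGroupName string
	VolumeID      string
}
```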
When a client restarts but can't restore a volume (ex. the plugin is now
missing), the volume is removed from the node fingerprint. So we won't allow
future scheduling of the volume, but we were not updating the volume state
field to report this reasoning to operators. Make debugging easier and the
state field more meaningful by setting the value to "unavailable".
Also, remove the unused "deleted" field. We did not implement soft deletes and
aren't planning on it for Nomad 1.10.0.
Ref: https://hashicorp.atlassian.net/browse/NET-11551
We changed the list of access modes available for dynamic host volumes in #24705
but neglected to change them in the API package. Update the API package to
match.
Ref: https://github.com/hashicorp/nomad/pull/24705