Commit Graph

27253 Commits

Author SHA1 Message Date
James Rasell
62f1dbebfb server: Add RPC and HTTP functionality for node intro token gen. (#26320)
The node introduction workflow will utilise JWTs that can be used
as authentication tokens on initial client registration. This
change implements the basic builder for this JWT claim type and
the RPC and HTTP handler functionality that will expose this to
the operator.
2025-07-23 14:32:26 +01:00
James Rasell
7466dd71b2 server: Add new server.client_introduction config block. (#26315)
The new configuration block exposes some key options which allow
cluster administrators to control certain client introduction
behaviours.

This change introduces the new block and plumbing, so that it is
exposed in the Nomad server for consumption via internal processes.
2025-07-22 08:50:19 +01:00
James Rasell
dce4284361 Merge branch 'main' into f-NMD-763-identity 2025-07-17 07:35:16 +01:00
Allison Larson
918e1eb123 Correctly canonicalize lifecycle block when missing hook value (#26285) 2025-07-16 11:40:16 -07:00
Aimee Ukasick
0d620607fe add blog links and video to nomad vs k8s (#26286) 2025-07-16 12:56:42 -05:00
James Rasell
953a149180 client: Allow operators to force a client to renew its identity. (#26277)
The Nomad client will have its identity renewed according to the
TTL which defaults to 24h. In certain situations such as root
keyring rotation, operators may want to force clients to renew
their identities before the TTL threshold is met. This change
introduces a client HTTP and RPC endpoint which will instruct the
node to request a new identity at its next heartbeat. This can be
used via the API or a new command.

While this is a manual intervention step on top of any keyring
rotation, it dramatically reduces the initial feature complexity
as it provides an asynchronous and efficient method of renewal that
utilises existing functionality.
2025-07-16 14:56:00 +01:00
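
A minimal sketch of the "renew at next heartbeat" idea described above, assuming the request is recorded as a flag that the heartbeat path reads and clears. The `client` type and both method names are hypothetical and are not Nomad's actual API.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// client is a hypothetical stand-in for the Nomad client; only the
// renewal request flag is modeled here.
type client struct {
	forceRenew atomic.Bool
}

// ForceIdentityRenewal is what an operator-facing endpoint might call.
// It only records the request; no RPC is made at this point.
func (c *client) ForceIdentityRenewal() { c.forceRenew.Store(true) }

// wantsNewIdentity is consulted when the next heartbeat is built. It
// reports whether a renewal was requested and clears the request.
func (c *client) wantsNewIdentity() bool { return c.forceRenew.Swap(false) }

func main() {
	c := &client{}
	fmt.Println(c.wantsNewIdentity()) // no request recorded yet
	c.ForceIdentityRenewal()
	fmt.Println(c.wantsNewIdentity()) // request is picked up once
	fmt.Println(c.wantsNewIdentity()) // and then cleared
}
```

Because the flag is only read on the existing heartbeat path, the operator call returns immediately and the renewal happens asynchronously.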
Tim Gross
35f3f6ce41 scheduler: add disconnect and reschedule info to reconciler output (#26255)
The `DesiredUpdates` struct that we send to the Read Eval API doesn't include
information about disconnect/reconnect and rescheduling. Annotate the
`DesiredUpdates` with this data, and adjust the `eval status` command to display
only those fields that have non-zero values in order to make the output width
manageable.

Ref: https://hashicorp.atlassian.net/browse/NMD-815
2025-07-16 08:46:38 -04:00
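
As an illustration of displaying only non-zero fields, here is a small sketch. The `field` type, the column labels, and the output format are assumptions made for the example and do not reflect the real `eval status` output.

```go
package main

import (
	"fmt"
	"strings"
)

// field is a hypothetical name/value pair for one column of output.
type field struct {
	name  string
	value uint64
}

// nonZeroColumns keeps only the fields with a non-zero value,
// preserving their original order.
func nonZeroColumns(fields []field) string {
	var cols []string
	for _, f := range fields {
		if f.value != 0 {
			cols = append(cols, fmt.Sprintf("%s=%d", f.name, f.value))
		}
	}
	return strings.Join(cols, "  ")
}

func main() {
	fmt.Println(nonZeroColumns([]field{
		{"Place", 2}, {"Stop", 0}, {"Reschedule Now", 1}, {"Disconnect", 0},
	}))
}
```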
Tim Gross
9a288ef493 deployment watcher: refactoring testing (#26284)
While investigating whether the deploymentwatcher would need updates to
implement system deployments, I discovered that some of the tests are racy and
make assertions about called functions without waiting.

Update these tests to wait where needed, and generally clean them up while we're
in here. In particular I've removed the heavyweight mocking in lieu of checking
the call counts and then asserting the expected state store changes.

Ref: https://hashicorp.atlassian.net/browse/NMD-892
2025-07-16 08:46:24 -04:00
Allison Larson
3ca518e89c Add node_pool to blockedEval metric (#26215)
Adds the node_pool to the blockedEval metrics that get emitted for
resource/cpu, along with the dc and node class.
2025-07-15 09:48:04 -07:00
Tim Gross
279775082c sysbatch: correctly validate that reschedule policy is not allowed (#26279)
System and sysbatch jobs don't support the reschedule block, because we'd always
replace allocations back onto the same node. The job validation for system jobs
asserts that the user hasn't set a `reschedule` block so that users aren't
submitting jobs expecting it to be supported. But this validation was missing
for sysbatch jobs.

Validate that sysbatch jobs don't have a reschedule block.
2025-07-15 10:47:02 -04:00
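
The validation rule above could be sketched roughly as follows. The `Job` type is a pared-down, hypothetical stand-in and the error text is invented for the example.

```go
package main

import "fmt"

// Job is a hypothetical, simplified job specification.
type Job struct {
	Type          string // "service", "batch", "system", or "sysbatch"
	HasReschedule bool   // true when the job sets a reschedule block
}

// validateReschedule rejects a reschedule block on job types that
// cannot honor it, covering sysbatch as well as system jobs.
func validateReschedule(j Job) error {
	switch j.Type {
	case "system", "sysbatch":
		if j.HasReschedule {
			return fmt.Errorf("%s jobs do not support the reschedule block", j.Type)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateReschedule(Job{Type: "sysbatch", HasReschedule: true}))
	fmt.Println(validateReschedule(Job{Type: "service", HasReschedule: true}))
}
```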
Daniel Bennett
089c148236 allocrunner: run all postrun hooks, even on error (#26271)
e.g. if the consul postrun hook fails, continue running
the subsequent postrun hooks, which among other things
includes network/CNI/iptables cleanup.
2025-07-14 13:55:33 -04:00
James Rasell
8096ea4129 client: Handle identities from servers and use for RPC auth. (#26218)
Nomad servers, if upgraded, can return node identities as part of
the register and update/heartbeat response objects. The Nomad
client will now handle this and store it as appropriate within its
memory and statedb.

The client will now use any stored identity for RPC authentication
with a fallback to the secretID. This supports upgrade paths where
the Nomad clients are updated before the Nomad servers.
2025-07-14 14:24:43 +01:00
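
The fallback described above amounts to a simple preference order. This sketch assumes both credentials are available as strings; the function name is hypothetical.

```go
package main

import "fmt"

// rpcAuthToken prefers a stored node identity and falls back to the
// node secret ID when no identity has been received yet, for example
// when the servers have not been upgraded.
func rpcAuthToken(identity, secretID string) string {
	if identity != "" {
		return identity
	}
	return secretID
}

func main() {
	fmt.Println(rpcAuthToken("", "secret-id"))
	fmt.Println(rpcAuthToken("signed-identity", "secret-id"))
}
```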
Tim Gross
b23ab5ac15 docs: clarify requirements for deleting volumes (#26240)
If you delete a CSI volume, the volume cannot be currently claimed by an
allocation or in the process of being unpublished. This is documented in the CLI
but not the API. Also, the documentation says that the `volume
delete` command silently returns without error if the volume doesn't exist, but
that's incorrect.

Fixes: https://github.com/hashicorp/nomad/issues/24756
2025-07-11 15:01:06 -04:00
Tim Gross
bf44eddd9f docs: note that CSI volume name must be unique (#26249)
When we originally implemented CSI, Nomad did not support the `CreateVolume`
workflow, so the volume name field was just a display name. The `CreateVolume`
CSI RPC requires that the volume name be unique. In retrospect, Nomad should
probably have mapped the namespace + ID to the volume name field, but because we
didn't, the name field must be unique per storage provider. In future work we
should try to figure out a way to unwind that decision but in the meantime let's
make that requirement clear in the documentation.

Ref: https://gitlab.com/rocketduck/csi-plugin-nfs/-/issues/21
2025-07-11 14:57:53 -04:00
Piotr Kazmierczak
08b3db104d docs: update reconciler diagram to reflect recent refactors (#26260)
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-07-11 15:34:07 +02:00
Tim Gross
26302ab25d reconciler: share assertions in property tests (#26259)
Refactor the reconciler property tests to extract functions for safety property
assertions we'll share between different job types for the same reconciler.
2025-07-11 09:27:22 -04:00
Aimee Ukasick
9af1642a1f add redirect to handle new 1.9, 1.8 /commands path (#26254) 2025-07-10 15:08:44 -05:00
Frédéric Praca
7e47aa3a1f fix(doc): fix links for task driver plugins (#26250)
host URL was wrong, changed from develoepr to developer
2025-07-10 14:33:45 -05:00
Tim Gross
3bb1c9aeaf docs: more details for alloc status (#26243)
The `alloc status` documentation is missing information about placement metrics.

Ref: https://hashicorp.atlassian.net/browse/NMD-818
2025-07-10 08:57:37 -04:00
Tim Gross
29bfda6c51 docs: more details for eval status (#26242)
The `eval status` documentation is missing the recently-added reconciler annotations.

Ref: https://hashicorp.atlassian.net/browse/NMD-818
2025-07-10 08:57:27 -04:00
James Rasell
7c5a5782bc client: Use single time variable when handling heartbeat response. (#26238)
When the client handles an update status response from the server,
it modifies its heartbeat stop tracker with a time set once the
RPC call returns. It optionally also emits a log message, if the
client suspects it has missed a heartbeat.

These times were originally tracked by two different calls to the
time function which were executed 2 microseconds apart. There is
no reason we cannot use a single time variable for both uses which
saves us one whole call to time.Now.
2025-07-10 08:07:32 +01:00
Tim Gross
74f7a8f037 scheduler: basic node reconciler safety properties for system jobs (#26216)
Property test assertions for the core safety properties of the node reconciler,
for system jobs.

Ref: https://hashicorp.atlassian.net/browse/NMD-814
Ref: https://github.com/hashicorp/nomad/pull/26167
2025-07-09 14:44:05 -04:00
Tim Gross
94e03f894a scheduler: basic cluster reconciler safety properties for batch jobs (#26172)
Property test assertions for the core safety properties of the cluster
reconciler, for batch jobs. The changeset includes fixes for any bugs found
during work-in-progress, which will get pulled out to their own PRs.

Ref: https://hashicorp.atlassian.net/browse/NMD-814
Ref: https://github.com/hashicorp/nomad/pull/26167
2025-07-09 14:43:55 -04:00
Piotr Kazmierczak
e50db4d1b8 scheduler: property testing of cancelUnneededCanaries (#26204)
In the spirit of #26180

Internal ref: https://hashicorp.atlassian.net/browse/NMD-814
2025-07-09 13:46:13 -04:00
Tim Gross
7c6c1ed0d3 scheduler: reconciler should constrain placements to count (#26239)
While working on property testing in #26172 we discovered there are scenarios
where the reconciler will produce more than the expected number of
placements. Testing of those scenarios at the whole-scheduler level shows that
this gets handled correctly downstream of the reconciler, but this makes it
harder to reason about reconciler behavior. Cap the number of placements in the
reconciler.

Ref: https://github.com/hashicorp/nomad/pull/26172
2025-07-09 11:51:01 -04:00
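
Capping placements to the group count could look roughly like this. The function signature and the way existing allocations are counted are assumptions for the sketch, not the reconciler's actual logic.

```go
package main

import "fmt"

// capPlacements trims the proposed placements so that the allocations
// already running plus the new placements never exceed the desired count.
func capPlacements(placements []string, desired, running int) []string {
	room := desired - running
	if room < 0 {
		room = 0
	}
	if len(placements) > room {
		return placements[:room]
	}
	return placements
}

func main() {
	proposed := []string{"web[0]", "web[1]", "web[2]"}
	fmt.Println(capPlacements(proposed, 3, 1))
}
```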
Tim Gross
eb47d1ca11 scheduler: eliminate dead code in node reconciler (#26236)
While working on property testing in #26216, I discovered we had unreachable
code in the node reconciler. The `diffSystemAllocsForNode` function receives a
set of non-terminal allocations, but then has branches where it assumes the
allocations might be terminal. It's trivially provable that these allocs are
always live, as the system scheduler splits the set of known allocs into live
and terminal sets before passing them into the node reconciler.

Eliminate the unreachable code and improve the variable names to make the known
state of the allocs more clear in the reconciler code.

Ref: https://github.com/hashicorp/nomad/pull/26216
2025-07-09 11:31:04 -04:00
Piotr Kazmierczak
8bc6abcd2e scheduler: basic cluster reconciler safety properties for service jobs (#26167) 2025-07-09 17:30:37 +02:00
Tim Gross
009927d4e8 changelog: note that 1.9.11 and 1.8.15 are ENT-only (#26237) 2025-07-09 10:13:58 -04:00
Aimee Ukasick
53b083b8c5 Docs: Nomad IA (#26063)
* Move commands from docs to its own root-level directory

* temporarily use modified dev-portal branch with nomad ia changes

* explicitly clone nomad ia exp branch

* retrigger build, fixed dev-portal broken build

* architecture, concepts and get started individual pages

* fix get started section destinations

* reference section

* update repo comment in website-build.sh to show branch

* docs nav file update capitalization

* update capitalization to force deploy

* remove nomad-vs-kubernetes dir; move content to what is nomad pg

* job section

* Nomad operations category, deploy section

* operations category, govern section

* operations - manage

* operations/scale; concepts scheduling fix

* networking

* monitor

* secure section

* remove auth-methods folder and move up pages to sso; linkcheck

* Fix install2deploy redirects

* fix architecture redirects

* Job section: Add missing section index pages

* Add section index pages so breadcrumbs build correctly

* concepts/index fix front matter indentation

* move task driver plugin config to new deploy section

* Finish adding full URL to tutorials links in nav

* change SSO to Authentication in nav and file system

* Docs NomadIA: Move tutorials into NomadIA branch (#26132)

* Move governance and policy from tutorials to docs

* Move tutorials content to job-declare section

* run jobs section

* stateful workloads

* advanced job scheduling

* deploy section

* manage section

* monitor section

* secure/acl and secure/authorization

* fix example that contains an unseal key in real format

* remove images from sso-vault

* secure/traffic

* secure/workload-identities

* vault-acl change unseal key and root token in command output sample

* remove lines from sample output

* fix front matter

* move nomad pack tutorials to tools

* search/replace /nomad/tutorials links

* update acl overview with content from deleted architecture/acl

* fix spelling mistake

* linkcheck - fix broken links

* fix link to Nomad variables tutorial

* fix link to Prometheus tutorial

* move who uses Nomad to use cases page; move spec/config shortcuts

add dividers

* Move Consul out of Integrations; move namespaces to govern

* move integrations/vault to secure/vault; delete integrations

* move ref arch to docs; rename Deploy Nomad back to Install Nomad

* address feedback

* linkcheck fixes

* Fixed raw_exec redirect

* add info from /nomad/tutorials/manage-jobs/jobs

* update page content with newer tutorial

* link updates for architecture sub-folders

* Add redirects for removed section index pages. Fix links.

* fix broken links from linkcheck

* Revert to use dev-portal main branch instead of nomadIA branch

* build workaround: add intro-nav-data.json with single entry

* fix content-check error

* add intro directory to get around Vercel build error

* workaround for empty directory

* remove mdx from /intro/ to fix content-check and git snafu

* Add intro index.mdx so Vercel build should work

---------

Co-authored-by: Tu Nguyen <im2nguyen@gmail.com>
2025-07-08 19:24:52 -05:00
Chris Roberts
b8e86cccdc Merge pull request #26227 from hashicorp/post-1.10.3-release
Post 1.10.3 release updates
2025-07-08 17:20:30 -07:00
Chris Roberts
eb7eec1770 Merge release 1.10.3 files 2025-07-08 16:50:13 -07:00
hc-github-team-nomad-core
26e16febad Prepare for next release 2025-07-08 16:47:39 -07:00
hc-github-team-nomad-core
ccba3ae6a2 Generate files for 1.10.3 release 2025-07-08 16:47:39 -07:00
Hazmei Abdul Rahman
c2d8424e3f fix: website task driver virt link (#26222) 2025-07-08 11:36:55 -05:00
Juana De La Cuesta
3b44090156 Avoid panic during startup with 1.10.2 (#26219)
* fix: initialize the topology of the processors to avoid nil pointers

* func: initialize topology to avoid nil pointers

* fix: update the new public method for NodeProcessorResources
2025-07-08 16:07:14 +02:00
Tim Gross
e13ceab855 host volumes: require allocs to be client terminal to delete vols (#26213)
The RPC handler for deleting dynamic host volumes has a check that any
allocations associated with a volume are client-terminal before deleting the
volume. But the state store delete that happens after we send client RPCs to the
plugin checks that the allocs are non-terminal on both server and client.

This can improperly allow deleting a volume from a client but then not being
able to delete it from the state store because of a time-of-check / time-of-use
bug. If the allocation fails/completes on the client before the server marks its
desired status as terminal, or if the allocation is marked server-terminal
during the client RPC, we can get a volume that passes the first check but not
the second check that happens in the state store and cannot be deleted.

Update the state store delete method to require that any allocation for a volume
is client terminal in order to delete the volume, not just server terminal.

Fixes: https://github.com/hashicorp/nomad/issues/26140
Ref: https://hashicorp.atlassian.net/browse/NMD-883
2025-07-07 14:48:06 -04:00
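
A simplified sketch of the aligned check, assuming both the RPC handler and the state store use the same client-terminal predicate. The `alloc` type and helper names are hypothetical; the status strings are a simplified assumption about which values count as terminal.

```go
package main

import "fmt"

// alloc is a hypothetical, pared-down allocation record.
type alloc struct {
	DesiredStatus string // what the server wants
	ClientStatus  string // what the client reports
}

// serverTerminal reports whether the server has asked for the alloc to stop.
func (a alloc) serverTerminal() bool {
	return a.DesiredStatus == "stop" || a.DesiredStatus == "evict"
}

// clientTerminal reports whether the client says the alloc is finished.
func (a alloc) clientTerminal() bool {
	switch a.ClientStatus {
	case "complete", "failed", "lost":
		return true
	}
	return false
}

// canDeleteVolume requires every alloc to be client-terminal; being
// server-terminal alone is not enough.
func canDeleteVolume(allocs []alloc) bool {
	for _, a := range allocs {
		if !a.clientTerminal() {
			return false
		}
	}
	return true
}

func main() {
	stopping := alloc{DesiredStatus: "stop", ClientStatus: "running"}
	fmt.Println(stopping.serverTerminal(), canDeleteVolume([]alloc{stopping}))
}
```

Using one predicate in both places removes the window in which the first check passes and the second fails.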
James Rasell
2f30205102 client: Add state functionality for set and get client identities. (#26184)
The Nomad client will persist its own identity within its state
store for restart persistence. The added benefit of using it over
the filesystem is that it supports transactions. This is useful
when considering the identity will be renewed periodically.
2025-07-07 15:28:27 +01:00
Tim Gross
c043d1c850 scheduler: property testing of reconcile reconnecting (#26180)
To help break down the larger property tests we're doing in #26167 and #26172
into more manageable chunks, pull out a property test for just the
`reconcileReconnecting` method. This method helpfully already defines its
important properties, so we can implement those as test assertions.

Ref: https://hashicorp.atlassian.net/browse/NMD-814
Ref: https://github.com/hashicorp/nomad/pull/26167
Ref: https://github.com/hashicorp/nomad/pull/26172
2025-07-07 09:40:49 -04:00
Tim Gross
d4ab277154 docs: add missing metrics for Consul service client (#26186)
Nomad agents emit metrics for Consul service and check operations, but these
were not documented. Update the metrics reference table to include these
metrics. Note that the metrics are prefixed `nomad.client` but are present on
all agents, because the server registers itself in Consul as well.
2025-07-07 09:40:32 -04:00
Tim Gross
5c909213ce scheduler: add reconciler annotations to completed evals (#26188)
The output of the reconciler stage of scheduling is only visible via debug-level
logs, typically accessible only to the cluster admin. We can give job authors
better ability to understand what's happening to their jobs if we expose this
information to them in the `eval status` command.

Add the reconciler's desired updates to the evaluation struct so it can be
exposed in the API. This increases the size of evals by roughly 15% in the state
store, or a bit more when there are preemptions (but we expect this will be a
small minority of evals).

Ref: https://hashicorp.atlassian.net/browse/NMD-818
Fixes: https://github.com/hashicorp/nomad/issues/15564
2025-07-07 09:40:21 -04:00
Tim Gross
60a953ca00 docs: add upgrade guide note for publish_allocation_metrics (#26187)
In #25870 we fixed a longstanding bug where allocation metrics were being
collected and published even if `telemetry.publish_allocation_metrics` was
disabled (the default). This change is unexpected enough that we should surface
it in the upgrade guide.

Ref: https://github.com/hashicorp/nomad/pull/25870
Ref: https://github.com/hashicorp/nomad/issues/26166
2025-07-07 09:40:00 -04:00
dependabot[bot]
53e2855f47 chore(deps): bump github.com/docker/docker (#26205) 2025-07-07 08:29:23 +00:00
dependabot[bot]
605daee759 chore(deps): bump github.com/docker/cli (#26158) 2025-07-04 11:21:48 +01:00
dependabot[bot]
8e407c7070 chore(deps): bump github.com/docker/docker (#26160) 2025-07-04 10:49:07 +01:00
James Rasell
e158356dd2 client: Remove created directory when mkdir plugin fails to chown. (#26194)
The mkdir plugin creates the directory and then chowns it. In the
event the chown command fails, we should attempt to remove the
directory. Without this, we leave directories on the client in
partial failure situations.
2025-07-04 08:36:36 +01:00
Allison Larson
004fa6132b docs: Fix link in service page documentation (#26174)
* docs: fix link in service page

* docs: correct indentation
2025-07-03 09:42:52 -07:00
dependabot[bot]
6cfef21cce chore(deps): bump go.etcd.io/bbolt from 1.4.1 to 1.4.2 (#26159) 2025-07-03 14:51:13 +01:00
James Rasell
d6757609dc cli: Fix a bug where self token lookups via token CLI flag failed. (#26183)
The meta client looks for both an environment variable and a CLI
flag when generating a client. The CLI UUID checker needs to do
this also, so we account for users using both env vars and CLI
flag tokens.
2025-07-03 13:50:42 +01:00
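
A rough sketch of checking both token sources. The function name is hypothetical, and giving the CLI flag precedence over the `NOMAD_TOKEN` environment variable is an assumption made for the example.

```go
package main

import (
	"fmt"
	"os"
)

// resolveToken returns the token from the CLI flag when it is set and
// otherwise falls back to the NOMAD_TOKEN environment variable.
// getenv is injected so the lookup can be tested; pass os.Getenv in
// normal use.
func resolveToken(flagToken string, getenv func(string) string) string {
	if flagToken != "" {
		return flagToken
	}
	return getenv("NOMAD_TOKEN")
}

func main() {
	fmt.Println(resolveToken("token-from-flag", os.Getenv))
}
```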
dependabot[bot]
ae47231304 chore(deps): bump github.com/klauspost/cpuid/v2 from 2.2.10 to 2.2.11 (#26161) 2025-07-03 13:18:36 +01:00
dependabot[bot]
d73d3a1542 chore(deps): bump github.com/prometheus/common from 0.64.0 to 0.65.0 (#26157) 2025-07-03 11:48:49 +01:00