Commit Graph

26759 Commits

Author SHA1 Message Date
James Rasell
f94016816d cli: Add node_prefix read policy to Consul setup task policy. (#25310)
When Nomad registers a service within Consul it is regarded as a
node service. In order for Nomad workloads to read these services,
they must have an ACL policy which includes node_prefix read. If they
do not, the service is filtered out from the result.

This change adds the required permission to the Consul setup
command.
2025-03-10 08:06:09 +00:00
Robert Main
57cd92274c Merge pull request #25192 from hashicorp/dependabot/npm_and_yarn/website/prettier-3.5.2
chore(deps-dev): bump prettier from 3.5.1 to 3.5.2 in /website
2025-03-07 16:14:30 -05:00
Tim Gross
5cc1b4e606 upgrade tests: add transparent proxy workload (#25176)
Add an upgrade test workload for Consul service mesh with transparent
proxy. Note this breaks from the "countdash" demo. The dashboard application
can only verify the backend is up by making a websocket connection, which we
can't do as a health check, and the health check it exposes for that purpose
only passes once the websocket connection has been made. So replace the
dashboard with a minimal nginx reverse proxy to the count-api instead.

Ref: https://hashicorp.atlassian.net/browse/NET-12217
2025-03-07 15:25:26 -05:00
Tim Gross
c3e2d4a652 E2E: remove outdated legacy token workflow tests (#25315)
In https://github.com/hashicorp/nomad/pull/25217 we removed the legacy Consul token workflow, and in https://github.com/hashicorp/nomad/pull/25174 we removed the related E2E tests. But we missed the tests in the `e2e/connect` package.

After removing these tests, Consul-related E2E tests in this repo pass.
2025-03-07 15:09:36 -05:00
Phil Renaud
35e1ea4328 [cli] UI URL hints for common CLI commands (#24454)
* Basic implementation for server members and node status

* Commands for alloc status and job status

* -ui flag for most commands

* url hints for variables

* url hints for job dispatch, evals, and deployments

* agent config ui.cli_url_links to disable

* Fix an issue where path prefix was presumed for variables

* driver uncomment and general cleanup

* -ui flag on the generic status endpoint

* Job run command gets namespaces, and no longer gets ui hints for --output flag

* Dispatch command hints get a namespace, and a bunch of tests

* Lots of tests depend on specific output, so let's not mess with them

* figured out what flagAddress is all about for testServer, oof

* Parallel outside of test instances

* Browser-opening test, sorta

* Env var for disabling/enabling CLI hints

* Addressing a few PR comments

* CLI docs available flags now all have -ui

* PR comments addressed; switched the env var to be consistent and scrunched monitor-adjacent hints a bit more

* ui.Output -> ui.Warn; moves hints from stdout to stderr

* isTerminal check and parseBool on command option

* terminal.IsTerminal check removed for test-runner-not-being-terminal reasons
2025-03-07 13:23:35 -05:00
Tim Gross
f3d53e3e2b CSI: restart task on failing initial probe, instead of killing it (#25307)
When a CSI plugin is launched, we probe it until the csi_plugin.health_timeout
expires (by default 30s). But if the plugin never becomes healthy, we're not
restarting the task as documented.

Update the plugin supervisor to trigger a restart instead. We still exit the
supervisor loop at that point to avoid having the supervisor send probes to a
task that isn't running yet. This requires reworking the poststart hook to allow
the supervisor loop to be restarted when the task restarts.

In doing so, I identified that we weren't respecting the task kill context from
the poststart hook, which would leave the supervisor running in the window
between when a task is killed because it failed and its stop hooks were
triggered. Combine the two contexts to make sure we stop the supervisor
whichever context gets closed first.

Fixes: https://github.com/hashicorp/nomad/issues/25293
Ref: https://hashicorp.atlassian.net/browse/NET-12264
2025-03-07 10:04:59 -05:00
James Rasell
768ba78e2d deps: Consolidated update of dependabot PRs (#25311)
* chore(deps): bump github.com/hashicorp/go-kms-wrapping/v2
* chore(deps): bump github.com/hashicorp/go-connlimit from 0.3.0 to 0.3.1
* chore(deps): bump github.com/aws/aws-sdk-go-v2/config
* chore(deps): bump github.com/hashicorp/cap from 0.7.0 to 0.9.0
* chore(deps): bump go.uber.org/goleak from 1.2.1 to 1.3.0
2025-03-07 14:38:40 +00:00
James Rasell
c0eccda4f7 template: Set any Consul token generated by workload identity. (#25309) 2025-03-07 14:32:02 +00:00
Tim Gross
f528022e3a upgrade testing: add missing dependency during client upgrades (#25306)
The check to read back node metadata depends on a resource that waits for the
Nomad API, but that resource doesn't wait for the metadata to be written in the
first place (and the client subsequently upgraded). Add this dependency so that
we're reading back the node metadata as the last step.

Ref: https://github.com/hashicorp/nomad-e2e/actions/runs/13690355150/job/38282457406
2025-03-07 09:06:04 -05:00
James Rasell
7b156e928a github: Update Vault and Consul versions used in core workflow. (#25287) 2025-03-07 07:20:24 +00:00
Tim Gross
694b10d71c upgrade testing: commit missing volume specification (#25305)
In #25285 we converted the CSI workload for upgrade testing to use a self-hosted
NFS. But the volume spec name got changed to `volume.hcl` in the process, which
is in our `.gitignore` file for the repo. We missed this during testing because
the file existed locally, but it fails in nightly runs.

Ref: https://github.com/hashicorp/nomad/pull/25285
Ref: https://github.com/hashicorp/nomad-e2e/actions/runs/13703979647/job/38324786351
2025-03-06 14:36:34 -05:00
Simon Zou
73ceacd236 ListProcesses through PID when cgroup is not found in Linux (#25198)
* ListProcesses through PID when cgroup is not found

* add changelog entry

* update the ListByPid for windows
2025-03-06 17:41:51 +01:00
Piotr Kazmierczak
149141e831 stateful deployments: task group host volume claims docs (#25290) 2025-03-06 17:23:08 +01:00
Piotr Kazmierczak
ed4a5decba stateful deployments: concept and jobspec documentation (#25288) 2025-03-06 17:06:29 +01:00
James Rasell
17fcee5614 deps: Update tool dependencies. (#25275) 2025-03-06 11:51:07 +00:00
James Rasell
2eb35a4678 build: Update Go to v1.24.1 (#25249) 2025-03-06 10:33:14 +00:00
Piotr Kazmierczak
29c7b7ca44 stateful deployments: fix missing prefix search in claim list CLI (#25297) 2025-03-06 11:23:00 +01:00
Juana De La Cuesta
69c2ed55d5 Check for nil values when parsing HCL strings (#25294)
* fix: when parsing hcl durations, check for nil values and fail validation if present

* docs: add changelog

* style: remove unnecessary function
2025-03-06 10:38:33 +01:00
Juana De La Cuesta
6ffe441983 [gh-24931] Return dummy function for moving processes when running rootless (#24944)
* fix: stop executor launch if Nomad doesn't have permissions

* func: return move function if cgroup is not enabled
2025-03-06 10:34:21 +01:00
Michael Smithhisler
5c4d0e923d consul: Remove legacy token based authentication workflow (#25217) 2025-03-05 15:38:11 -05:00
Tim Gross
3fd8a1ed8d scheduler: prevent nil pointer ref when reschedule policy is missing (#24893)
When upgrading from older versions of Nomad, the reschedule policy block may be
nil. There is logic to handle this safely in the `NextRescheduleTimeByTime` used
for allocs on disconnected clients, but it's missing from the
`NextRescheduleTime` method used by more typical allocations. Return an empty
time object in this case.

Fixes: https://github.com/hashicorp/nomad/issues/24846
2025-03-05 15:34:10 -05:00
Michael Smithhisler
f2b761f17c disconnected: removes deprecated disconnect fields (#25284)
The group level fields stop_after_client_disconnect,
max_client_disconnect, and prevent_reschedule_on_lost were deprecated in
Nomad 1.8 and replaced by fields in the disconnect block. This change
removes any logic related to those deprecated fields.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-03-05 14:46:02 -05:00
Tim Gross
916fe2c7fa upgrade testing: rework CSI test to use self-contained workload (#25285)
Getting the CSI test to work with AWS EFS or EBS has proven to be awkward
because we're having to deal with external APIs with their own consistency
guarantees, as well as challenges around teardown. Make the CSI test entirely
self-contained by using a userland NFS server and the rocketduck CSI plugin.

Ref: https://hashicorp.atlassian.net/browse/NET-12217
Ref: https://gitlab.com/rocketduck/csi-plugin-nfs
2025-03-05 11:48:19 -05:00
Tim Gross
7a051991bd upgrade testing: temporarily disable CSI test (#25283)
The CSI workload is failing and creating complications for teardown, so I'm
reworking it. But this work is taking a while to finish, so while that's in
progress let's disable the CSI workload so that we're running the upgrade tests
all the way through to the end. I expect to be able to revert this in the next
couple days.
2025-03-04 11:21:45 -05:00
Tim Gross
9cc0e2eae0 upgrade testing: make cluster name prefix a variable (#25281)
During initial development of upgrade testing, we had a hard-coded prefix to
distinguish between clusters created for this vs those created by GHA
runners. Update the prefix to be a variable so that developers can add their own
prefix during test workload development.
2025-03-04 11:11:02 -05:00
Juana De La Cuesta
5605f9630d Fix the docker image parser to account for private repos (#24926)
* fix: fix the docker image parser to account for private repos

* style: change the local regex for docker image identifiers and use docker package instead

* func: return early when no repo found on the image name

* func: return error if no path found in image

* Update drivers/docker/utils.go

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* Update coordinator.go

* Update driver.go

* Update network.go

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-03-04 16:53:20 +01:00
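Commit 5605f9630d fixes the Docker image parser for private repositories by switching to Docker's own reference-parsing package. As an illustration only, the standard-library Go sketch below applies the convention Docker uses to decide whether the first path component of an image name is a registry host: it contains a `.` or `:`, or it is `localhost`. `splitRegistry` is hypothetical. It is not a full reference parser; for example, it does no name validation.

```go
package main

import (
	"fmt"
	"strings"
)

// splitRegistry is a hypothetical helper that separates an image reference
// into a registry host and a repository path, dropping any tag or digest.
func splitRegistry(image string) (registry, repo string) {
	name := image
	// Drop a digest ("@sha256:...") first so its ":" isn't mistaken for a tag.
	if i := strings.Index(name, "@"); i >= 0 {
		name = name[:i]
	}

	// The first component is a registry only if it looks like a host.
	first, rest, found := strings.Cut(name, "/")
	if found && (strings.ContainsAny(first, ".:") || first == "localhost") {
		registry, repo = first, rest
	} else {
		registry, repo = "docker.io", name
	}

	// Any ":" left in repo starts the tag, since a port is now in registry.
	if i := strings.LastIndex(repo, ":"); i >= 0 {
		repo = repo[:i]
	}

	// Official Docker Hub images live under the "library/" namespace.
	if registry == "docker.io" && !strings.Contains(repo, "/") {
		repo = "library/" + repo
	}
	return registry, repo
}

func main() {
	for _, img := range []string{
		"redis:7",
		"myorg/app:1.0",
		"registry.example.com:5000/team/app:2",
		"localhost/app",
	} {
		reg, repo := splitRegistry(img)
		fmt.Printf("%s -> registry=%s repo=%s\n", img, reg, repo)
	}
}
```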
Aimee Ukasick
b33c801039 Docs: Add workload identity and Consul Enterprise info to partition parameter (#25251)
* Docs: Add info to partition parameter. CE-820

* fix link format for Nomad Workload Identities

* fix typo
2025-03-04 09:33:57 -06:00
Juana De La Cuesta
2dadf9fe6c Improve stability (#25244)
* func: add dependencies to avoid race conditions and move the update to each client to the main upgrade scenario

* Update enos/enos-scenario-upgrade.hcl

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* Update enos/enos-scenario-upgrade.hcl

Co-authored-by: Tim Gross <tgross@hashicorp.com>

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-03-04 16:23:07 +01:00
Michael Smithhisler
25cea5c16b e2e: allow consul access to nomad cluster (#25277) 2025-03-04 09:06:50 -05:00
grembo
b6d925987c Allow disabling wait in client configuration (#25255)
Before the fixes in #20165, the wait feature was disabled by
default. After these changes, it's always enabled, which - at
least on some platforms - leads to a significant increase in
load (5-7x).

This patch allows disabling the wait feature in the client
stanza of the configuration file by setting min and max to 0:

    wait {
      min     = "0"
      max     = "0"
    }

Per-template wait blocks in the task description still work as
expected.
2025-03-03 16:38:46 -05:00
Juana De La Cuesta
d50a9a474c Add note to stop allocs for system allocs (#25263)
* docs: Add note to stop allocs to make sure system allocs are not rescheduled

* Update stop.mdx

* Update website/content/docs/commands/alloc/stop.mdx

Co-authored-by: Tim Gross <tgross@hashicorp.com>

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-03-03 19:37:24 +01:00
Tim Gross
7e8c7d2896 generic paginator (#25252)
The paginator was developed before generics were available, so we've had to work
around a lack of compile-time safety by creating configuration objects at
runtime that require a lot of branching and type casts. This results in a lot of
added boilerplate in the RPC handlers.

Refactor the paginator to take advantage of generics.
* Move all decision making around tokenization to compile-time by providing
  pre-built generic functions that close over target tokens.
* Remove the `appendFunc` parameter in favor of a `Stub` function parameter that
  will accept existing `Stub` functions in most cases (with the addition of an
  extra `error` return value).
* Generally remove boilerplate in the RPC handlers as a result, except where a
  given handler wants more complex filtering.

This doesn't reduce the boilerplate we need at the top of many blocking queries
to define the iterator we want based on arguments, which we're typically doing
to decide upon which memdb index we want. That's a query optimization problem
and way beyond the scope of this PR.
2025-03-03 10:08:50 -05:00
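Commit 7e8c7d2896 refactors the paginator around generics, with a `Stub` function parameter that returns an extra `error`. Below is a minimal Go sketch of a generic paginator in that spirit. It assumes items are already sorted by a string token. `Paginate`, `tokenFn`, and `stubFn` are hypothetical names, not Nomad's actual API.

```go
package main

import "fmt"

// Paginate returns up to perPage stubs starting at the first item whose token
// is >= nextToken ("" means start at the beginning), plus the token of the
// first item on the following page, or "" when there is none.
// Assumes items are sorted by ascending token and perPage > 0.
func Paginate[T, S any](
	items []T,
	nextToken string,
	perPage int,
	tokenFn func(T) string,
	stubFn func(T) (S, error),
) ([]S, string, error) {
	out := make([]S, 0, perPage)
	for _, item := range items {
		tok := tokenFn(item)
		if nextToken != "" && tok < nextToken {
			continue // belongs to an earlier page
		}
		if len(out) == perPage {
			return out, tok, nil // tok is where the next page starts
		}
		stub, err := stubFn(item)
		if err != nil {
			return nil, "", err
		}
		out = append(out, stub)
	}
	return out, "", nil
}

// Job and JobStub are hypothetical types for the usage example.
type Job struct{ ID string }
type JobStub struct{ ID string }

func main() {
	jobs := []Job{{ID: "a"}, {ID: "b"}, {ID: "c"}, {ID: "d"}, {ID: "e"}}
	byID := func(j Job) string { return j.ID }
	toStub := func(j Job) (JobStub, error) { return JobStub{ID: j.ID}, nil }

	token := ""
	for {
		page, next, err := Paginate(jobs, token, 2, byID, toStub)
		if err != nil {
			panic(err)
		}
		fmt.Printf("page=%v next=%q\n", page, next)
		if next == "" {
			break
		}
		token = next
	}
}
```

The token and stub functions are ordinary typed closures. A mismatch between item and stub types is therefore a compile-time error rather than a failed runtime type cast, which is the motivation the commit message gives.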
Tim Gross
60132ab0cf docs: update renamed attributes (#25265)
A couple of attributes were renamed in #24942. Update example outputs in the API
docs to match.

Ref: https://github.com/hashicorp/nomad/pull/24942#pullrequestreview-2653776939
2025-03-03 09:44:26 -05:00
Tim Gross
1788bfb42e remove addresses from node class hash (#24942)
When a node is fingerprinted, we calculate a "computed class" from a hash over a
subset of its fields and attributes. In the scheduler, when a given node fails
feasibility checking (before fit checking) we know that no other node of that
same class will be feasible, and we add the hash to a map so we can reject them
early. This hash cannot include any values that are unique to a given node,
otherwise no other node will have the same hash and we'll never save ourselves
the work of feasibility checking those nodes.

In #4390 we introduced the `nomad.advertise.address` attribute and in #19969 we
introduced `consul.dns.addr` attribute. Both of these are unique per node and
break the hash.

Additionally, we were not correctly filtering attributes out when checking if a
node escaped the class by not filtering for attributes that start with
`unique.`. The test for this introduced in #708 had an inverted assertion, which
allowed this to pass unnoticed since the early days of Nomad.

Ref: https://github.com/hashicorp/nomad/pull/708
Ref: https://github.com/hashicorp/nomad/pull/4390
Ref: https://github.com/hashicorp/nomad/pull/19969
2025-03-03 09:28:32 -05:00
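Commit 1788bfb42e explains why the computed node class hash must not include per-node values. The Go sketch below illustrates the idea. It skips `unique.`-prefixed attributes plus an explicit exclusion set, sorts the remaining keys, and hashes them. It uses FNV-1a from the standard library as a stand-in for Nomad's real hashing. `computedClass` and the exclusion set are assumptions of this sketch.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strings"
)

// computedClass hashes only attributes that interchangeable nodes share.
// Keys prefixed with "unique." and keys in the (hypothetical) excluded set
// are skipped so that per-node values don't make every class distinct.
func computedClass(attrs map[string]string, excluded map[string]bool) string {
	keys := make([]string, 0, len(attrs))
	for k := range attrs {
		if strings.HasPrefix(k, "unique.") || excluded[k] {
			continue
		}
		keys = append(keys, k)
	}
	sort.Strings(keys) // map order is random; sort for a stable hash

	h := fnv.New64a()
	for _, k := range keys {
		fmt.Fprintf(h, "%s=%s\n", k, attrs[k])
	}
	return fmt.Sprintf("%016x", h.Sum64())
}

func main() {
	excluded := map[string]bool{"nomad.advertise.address": true, "consul.dns.addr": true}
	n1 := map[string]string{"kernel.name": "linux", "unique.hostname": "node-1", "nomad.advertise.address": "10.0.0.1:4646"}
	n2 := map[string]string{"kernel.name": "linux", "unique.hostname": "node-2", "nomad.advertise.address": "10.0.0.2:4646"}
	fmt.Println(computedClass(n1, excluded) == computedClass(n2, excluded)) // true
}
```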
Michael Smithhisler
7867957811 e2e: remove legacy consul token tests (#25174) 2025-02-28 11:31:33 -05:00
David
c52623d7d4 client: remove unused nodeID parameter from host stats metric functions (#25247) 2025-02-28 10:29:38 +01:00
James Rasell
7268053174 vault: Remove legacy token based authentication workflow. (#25155)
The legacy workflow for Vault whereby servers were configured
using a token to provide authentication to the Vault API has now
been removed. This change also removes the workflow where servers
were responsible for deriving Vault tokens for Nomad clients.

The deprecated Vault config options used by the Nomad agent have
all been removed except for "token" which is still in use by the
Vault Transit keyring implementation.

Job specification authors can no longer use the "vault.policies"
parameter and should instead use "vault.role" when not using the
default workload identity.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
2025-02-28 07:40:02 +00:00
Tim Gross
4a62d1b75c upgrade tests: add CSI workload (#25223)
Add an upgrade test workload for CSI with the AWS EFS plugin. In order to
validate this workload, we'll need to deploy the plugin job and then register a
volume with it. So this extends the `run_workloads` module to allow for "pre
scripts" and "post scripts" to be run before and after a given job has been
deployed. We can use that as a model for other test workloads.

Ref: https://hashicorp.atlassian.net/browse/NET-12217
2025-02-27 15:16:04 -05:00
James Rasell
c34f17c377 deps: Consolidated update of dependabot PRs (#25240)
* chore(deps): bump github.com/moby/term from 0.5.0 to 0.5.2

Bumps [github.com/moby/term](https://github.com/moby/term) from 0.5.0 to 0.5.2.
- [Commits](https://github.com/moby/term/compare/v0.5.0...v0.5.2)

---
updated-dependencies:
- dependency-name: github.com/moby/term
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore(deps): bump golang.org/x/mod from 0.22.0 to 0.23.0

Bumps [golang.org/x/mod](https://github.com/golang/mod) from 0.22.0 to 0.23.0.
- [Commits](https://github.com/golang/mod/compare/v0.22.0...v0.23.0)

---
updated-dependencies:
- dependency-name: golang.org/x/mod
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore(deps): bump github.com/zclconf/go-cty from 1.16.0 to 1.16.2

Bumps [github.com/zclconf/go-cty](https://github.com/zclconf/go-cty) from 1.16.0 to 1.16.2.
- [Release notes](https://github.com/zclconf/go-cty/releases)
- [Changelog](https://github.com/zclconf/go-cty/blob/main/CHANGELOG.md)
- [Commits](https://github.com/zclconf/go-cty/compare/v1.16.0...v1.16.2)

---
updated-dependencies:
- dependency-name: github.com/zclconf/go-cty
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore(deps): bump github.com/docker/go-connections from 0.4.0 to 0.5.0

Bumps [github.com/docker/go-connections](https://github.com/docker/go-connections) from 0.4.0 to 0.5.0.
- [Commits](https://github.com/docker/go-connections/compare/v0.4.0...v0.5.0)

---
updated-dependencies:
- dependency-name: github.com/docker/go-connections
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-02-27 16:54:44 +00:00
Piotr Kazmierczak
73a193f6d9 stateful deployments: task group host volume claims CLI (#25116)
CLI for interacting with task group host volume claims.
2025-02-27 17:04:48 +01:00
Tim Gross
6ae1444cf4 upgrade testing: debugging assistance (#25232)
Enos buries the Terraform output from provisioning. Add a shell script to load
the environment from provisioning for debugging Nomad during development of
upgrade tests.
2025-02-27 08:35:45 -05:00
James Rasell
bc52ca3142 sec: Update x/crypto and x/oauth2 to resolve scan failures. (#25236)
Fixes:
- vulnerability GO-2025-3487 in golang.org/x/crypto@v0.32.0
- vulnerability GO-2025-3488 in golang.org/x/oauth2@v0.25.0

The go.mod declaration is also needed by the update.
2025-02-27 10:37:54 +00:00
dependabot[bot]
2ea0846b0d chore(deps): bump github.com/go-jose/go-jose/v3 from 3.0.3 to 3.0.4 (#25234)
Bumps [github.com/go-jose/go-jose/v3](https://github.com/go-jose/go-jose) from 3.0.3 to 3.0.4.
- [Release notes](https://github.com/go-jose/go-jose/releases)
- [Changelog](https://github.com/go-jose/go-jose/blob/main/CHANGELOG.md)
- [Commits](https://github.com/go-jose/go-jose/compare/v3.0.3...v3.0.4)

---
updated-dependencies:
- dependency-name: github.com/go-jose/go-jose/v3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-02-27 10:35:11 +01:00
Juana De La Cuesta
461d4268e2 func: add python servers to raw exec workloads (#25230) 2025-02-26 18:05:46 +01:00
Juana De La Cuesta
b13132043b Add new workloads (#25106)
* func: Add more workloads

* Update jobs.sh

* Update versions.sh

* style: format

* Update enos/modules/test_cluster_health/scripts/allocs.sh

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* docs: improve outputs descriptions

* func: change docker workloads to be redis boxes and add healthchecks

* func: register the services on consul

* style: format

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-02-26 17:02:27 +01:00
Tim Gross
3b9290a11e E2E: fix column parsing for dynamic host volumes test (#25228)
In #25185 we changed the output of `volume status` to include both DHV and CSI
volumes by default. When the E2E test parses the output, it's not expecting the
new section header.

Ref: https://github.com/hashicorp/nomad/pull/25185
2025-02-26 09:52:47 -05:00
Aimee Ukasick
4693f0be2e docs: update Virt beta callout to use Note component (#25184) 2025-02-25 11:26:41 -06:00
Phil Renaud
7d08e79da3 [ui] System, Batch, and Sysbatch jobs' Start Job buttons let you revert to previous versions (#25104)
* Changes the behaviour of system/batch/sysbatch jobs so they don't look for a latest stable version, as their versions are never marked stable

* Don't show job stability on versions page for system/sysbatch/batch jobs

* Tests that depend on jobs to revert specify that they are Service jobs

* Batch jobs added to detail-restart test loop

* Right, they're not stable, they're just versions
2025-02-25 11:10:14 -05:00
Tim Gross
db5022b965 deps: remove actions updates from dependabot (#25211)
Dependabot can update actions to versions that are not in the TSCCR
allowlist. The TSCCR check doesn't happen in CE, which means we don't learn we
have a problem until after we've spent the effort to backport them. Remove the
automation that updates actions automatically until this issue is resolved on
the security team's side.
2025-02-25 10:18:50 -05:00
dependabot[bot]
d9d5e7351a chore(deps): bump github.com/go-jose/go-jose/v4 from 4.0.4 to 4.0.5 (#25204)
Bumps [github.com/go-jose/go-jose/v4](https://github.com/go-jose/go-jose) from 4.0.4 to 4.0.5.
- [Release notes](https://github.com/go-jose/go-jose/releases)
- [Changelog](https://github.com/go-jose/go-jose/blob/main/CHANGELOG.md)
- [Commits](https://github.com/go-jose/go-jose/compare/v4.0.4...v4.0.5)

---
updated-dependencies:
- dependency-name: github.com/go-jose/go-jose/v4
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-02-25 15:55:36 +01:00