Commit Graph

27156 Commits

Author SHA1 Message Date
James Rasell
d3e077a78e enos: Modify Windows TF variable to match new 2022 value. (#26067) 2025-06-17 08:13:36 +01:00
Allison Larson
5e7ec1b32c test: waitForKeyring in SignIdentities test (#26051) 2025-06-16 10:17:28 -07:00
Tim Gross
d6800c41c1 E2E: include Windows 2022 host in test targets (#26003)
Some time ago the Windows host we were using as a Nomad client agent test target
started failing to allow ssh connections. The underlying problem appears to be
with sysprep but I wasn't able to debug the exact cause as it's not an area I
have a lot of expertise in.

Swap out the deprecated Windows 2016 host for a Windows 2022 host. This will use
a base image provided by Amazon and then we'll use a userdata script to
bootstrap ssh and some target directories for Terraform to upload files to. The
more modern Windows will let us drop some of the extra PowerShell scripts we were
using as well.

Fixes: https://hashicorp.atlassian.net/browse/NMD-151
Fixes: https://github.com/hashicorp/nomad-e2e/issues/125
2025-06-16 12:12:15 -04:00
Tim Gross
26004c5407 vault: set renew increment to lease duration (#26041)
When we renew Vault tokens, we use the lease duration to determine how often to
renew. But we also set an `increment` value which is never updated from the
initial 30s. For periodic tokens this is not a problem because the `increment`
field is ignored on renewal. But for non-periodic tokens this prevents the token
TTL from being properly incremented. This behavior has been in place since the
initial Vault client implementation in #1606 but before the switch to workload
identity most (all?) tokens being created were periodic tokens so this was never
detected.

Fix this bug by updating the request's `increment` field to the lease duration
on each renewal.

Also switch out a `time.After` call in backoff of the derive token caller with a
safe timer so that we don't have to spawn a new goroutine per loop, and have
tighter control over when that's GC'd.

Ref: https://github.com/hashicorp/nomad/pull/1606
Ref: https://github.com/hashicorp/nomad/issues/25812
2025-06-13 13:50:54 -04:00
Chris Roberts
fedd042e69 test: update test timeout from 20m to 25m (#26056)
Tests running in CI are starting to bump up to this timeout forcing
re-runs. Adding an additional five minutes to the timeout to help
prevent this from occurring.
2025-06-13 09:23:27 -07:00
Chris Roberts
dfa07e10ed client: fix batch job drain behavior (#26025)
Batch job allocations that are drained from a node will be moved
to an eligible node. However, when no eligible nodes are available
to place the draining allocations, the tasks will end up being
complete and will not be placed when an eligible node becomes
available. This occurs because the drained allocations are
simultaneously stopped on the draining node while attempting to
be placed on an eligible node. The stopping of the allocations on
the draining node results in tasks being killed, but importantly this
kill does not fail the task. The result is tasks reporting as complete
due to their state being dead and not being failed. As such, when an
eligible node becomes available, all tasks will show as complete and
no allocations will need to be placed.

To prevent the behavior described above a check is performed when
the alloc runner kills its tasks. If the allocation's job type is
batch, and the allocation has a desired transition of migrate, the
task will be failed when it is killed. This ensures the task does
not report as complete, and when an eligible node becomes available
the allocations are placed as expected.
2025-06-13 08:28:31 -07:00
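The check described in dfa07e10ed might look roughly like the following. The `alloc` type and its field names are hypothetical simplifications for illustration and do not mirror Nomad's actual structs.

```go
package main

import "fmt"

// alloc is a hypothetical, simplified view of an allocation.
type alloc struct {
	JobType        string // e.g. "batch" or "service"
	DesiredMigrate bool   // true when the desired transition is migrate
}

// shouldFailOnKill reports whether tasks killed by the alloc runner should
// be marked as failed rather than left to report as complete.
func shouldFailOnKill(a alloc) bool {
	return a.JobType == "batch" && a.DesiredMigrate
}

func main() {
	fmt.Println(shouldFailOnKill(alloc{JobType: "batch", DesiredMigrate: true}))
	fmt.Println(shouldFailOnKill(alloc{JobType: "batch", DesiredMigrate: false}))
	fmt.Println(shouldFailOnKill(alloc{JobType: "service", DesiredMigrate: true}))
}
```

Only the first case, a draining batch allocation, is failed on kill; the other two keep the previous behavior.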
James Rasell
42b024db4d net: Remove overcommitted network conditional. (#26053)
The check simply returns false and has done for a number of years,
therefore there is no need to keep it around or the test that
exercises it.
2025-06-13 15:48:34 +01:00
Tim Gross
4eb78f1348 docs: describe shutdown order on lifecycle page (#26035)
We have a description of the order of shutdown in the `task.leader` docs, but
the `lifecycle` block is an intuitive place to look for this same information,
and the behavior is largely governed by that feature anyways.
2025-06-12 15:45:40 -04:00
Aimee Ukasick
23fd87d9c9 Docs: Commands section move "General options" to page bottom (#26001)
* sectionless files plus acl section

* alloc section

* config, deployment sections

* job section

* license, namespace

* node, node-pool

* operator

* plugin, quota, recommendation

* scaling, sentinel, server, service, system, var, volume

* Add "ENT" label to left nav for enterprise commands

* job tag break into separate folder and files; update options header
2025-06-12 14:31:38 -05:00
Chris Roberts
4dbf645bf7 command: prevent panic on graceful shutdown (#26018)
When performing a graceful shutdown a channel is used to wait for
the agent to leave. The channel is closed when the agent leaves
successfully, but it also is closed within a deferral. If the
agent successfully leaves and closes the channel, a panic will
occur when the channel is closed the second time within the
deferral. To prevent this from occurring, the channel closing
is wrapped within a `OnceFunc` so the channel is only closed
once.
2025-06-12 09:35:57 -07:00
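The double-close guard from 4dbf645bf7 can be sketched with `sync.OnceFunc` (available since Go 1.21). The helper name `newLeaveSignal` is an assumption for illustration, not the name used in the agent command.

```go
package main

import (
	"fmt"
	"sync"
)

// newLeaveSignal returns a channel and a function that closes it. The close
// function is wrapped in sync.OnceFunc, so calling it more than once is safe.
func newLeaveSignal() (<-chan struct{}, func()) {
	ch := make(chan struct{})
	return ch, sync.OnceFunc(func() { close(ch) })
}

func main() {
	leaveCh, closeLeave := newLeaveSignal()
	defer closeLeave() // runs after the explicit call below and does nothing

	closeLeave() // the agent left successfully
	<-leaveCh
	fmt.Println("agent left")
}
```

Without the wrapper, the deferred `close` would run on an already closed channel and panic.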
Chris Roberts
eeec603975 command: prevent early exit from graceful shutdown (#26023)
While waiting for the agent to leave during a graceful shutdown
the wait can be interrupted immediately if another signal is
received. It is common that while waiting a `SIGPIPE` is received
from journald causing the wait to end early. This results in the
agent not finishing the leave process and reporting an error when
the process has stopped. Instead of allowing any signal to interrupt
the wait, each received signal is checked and, if it is a `SIGPIPE`, the
wait continues.
2025-06-12 08:56:55 -07:00
Piotr Kazmierczak
0ddbc548a3 scheduler: rename reconciliation package to reconciler (#26038)
nouns are better than verbs for package names
2025-06-12 14:36:09 +02:00
James Rasell
c49062c663 test: Fix workload ID claims tests, so cases are not skipped. (#26039) 2025-06-12 13:35:53 +01:00
Piotr Kazmierczak
3dbd9f3f87 ci: add new feasible package to test-core (#26036) 2025-06-12 09:48:01 +02:00
Daniel Bennett
7519df8d06 task env: add NOMAD_UNIX_ADDR var (#25598)
for easier setup when using workload identity + task api
2025-06-11 15:56:51 -04:00
Piotr Kazmierczak
199d12865f scheduler: isolate feasibility (#26031)
This change isolates all the code that deals with node selection in the
scheduler into its own package called feasible.
---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-06-11 20:11:04 +02:00
Conor Mongey
f7096fb9d6 docker: add cgroupns task config (#25927) 2025-06-11 13:50:44 -04:00
Allison Larson
0a3ffe077c Merge pull request #26028 from hashicorp/post-1.10.2-release
Post 1.10.2 release
2025-06-11 07:38:03 -07:00
dependabot[bot]
4d9504b19a chore(deps): bump tar-fs from 2.1.2 to 2.1.3 in /scripts/screenshots/src (#25965)
Bumps [tar-fs](https://github.com/mafintosh/tar-fs) from 2.1.2 to 2.1.3.
- [Commits](https://github.com/mafintosh/tar-fs/commits)

---
updated-dependencies:
- dependency-name: tar-fs
  dependency-version: 2.1.3
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-11 09:00:43 -04:00
Allison Larson
5435bf7c34 Merge release 1.10.2 files 2025-06-10 14:38:50 -07:00
hc-github-team-nomad-core
5f33ccf42f Prepare for next release 2025-06-10 14:35:25 -07:00
hc-github-team-nomad-core
1e49d9eb44 Generate files for 1.10.2 release 2025-06-10 14:35:25 -07:00
Piotr Kazmierczak
76e3c2961a scheduler: isolate reconciliation code (#26002)
This moves all the code of service/batch and system/sysbatch reconciliation into a new reconcile package.
2025-06-10 15:46:39 +02:00
Daniel Bennett
8164d9e1d4 csi: send secrets with snapshot delete command (#26022)
so that -secret arguments make it to the CSI plugin
to carry out the snapshot deletion
2025-06-09 17:02:52 -04:00
Chris Roberts
2cc598ef00 Get ACL policy by job using exact job ID (#26019)
In the original state, when getting ACL policies by job, the
search was performing a prefix-based lookup on the index. This
can result in policies being applied incorrectly when used for
workload identities. For example, if a `custom-test` policy is
created like so:

```
nomad acl policy apply -namespace=default -job=test-job custom-test ./policy.hcl
```

A job named `test-job` will properly get this ACL policy. However,
due to the lookup being prefix-based on the index, a job named
`test-job-1` will also get this ACL policy.

To prevent this behavior, the lookup behavior on the index is
modified so it is a direct match.
2025-06-09 13:08:29 -07:00
Daniel Bennett
b93479e353 release: add changelog for pr 25921 (ipv6 addr normalization) (#26016) 2025-06-09 15:04:34 -04:00
Deniz Onur Duzgun
abd0efdd76 sec: remove non-hermetic sprig template functions (#25998)
* sec:add sprig template functions in denylists

* remove explicit set which is no longer needed

* go mod tidy

* add changelog

* better changelog and filtered denylist

* go mod tidy with 1.24.4

* edit changelog and remove htpasswd and derive

* fix tests

* Update client/allocrunner/taskrunner/template/template_test.go

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* edit changelog

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-06-09 13:00:47 -04:00
dependabot[bot]
4bd51942e6 chore(deps): bump golang.org/x/mod from 0.24.0 to 0.25.0 (#26005)
Bumps [golang.org/x/mod](https://github.com/golang/mod) from 0.24.0 to 0.25.0.
- [Commits](https://github.com/golang/mod/compare/v0.24.0...v0.25.0)

---
updated-dependencies:
- dependency-name: golang.org/x/mod
  dependency-version: 0.25.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-09 11:47:31 -04:00
dependabot[bot]
30ad9c9e41 chore(deps): bump github.com/aws/aws-sdk-go-v2/config (#26004)
Bumps [github.com/aws/aws-sdk-go-v2/config](https://github.com/aws/aws-sdk-go-v2) from 1.29.14 to 1.29.15.
- [Release notes](https://github.com/aws/aws-sdk-go-v2/releases)
- [Changelog](https://github.com/aws/aws-sdk-go-v2/blob/main/changelog-template.json)
- [Commits](https://github.com/aws/aws-sdk-go-v2/compare/config/v1.29.14...config/v1.29.15)

---
updated-dependencies:
- dependency-name: github.com/aws/aws-sdk-go-v2/config
  dependency-version: 1.29.15
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-09 11:46:29 -04:00
dependabot[bot]
f7828b2e7d chore(deps): bump golang.org/x/time from 0.11.0 to 0.12.0 (#26008)
Bumps [golang.org/x/time](https://github.com/golang/time) from 0.11.0 to 0.12.0.
- [Commits](https://github.com/golang/time/compare/v0.11.0...v0.12.0)

---
updated-dependencies:
- dependency-name: golang.org/x/time
  dependency-version: 0.12.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-09 11:45:17 -04:00
dependabot[bot]
1e6f43d543 chore(deps): bump golang.org/x/sync from 0.14.0 to 0.15.0 (#26007)
Bumps [golang.org/x/sync](https://github.com/golang/sync) from 0.14.0 to 0.15.0.
- [Commits](https://github.com/golang/sync/compare/v0.14.0...v0.15.0)

---
updated-dependencies:
- dependency-name: golang.org/x/sync
  dependency-version: 0.15.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-09 11:45:06 -04:00
Bram Vogelaar
68b5d64ed7 docs: update broken link in stateful-workloads.mdx (#26009)
point to correct url
2025-06-09 08:36:37 -04:00
Tim Gross
94c3d23271 build: update toolchain to go 1.24.4 (#25999) 2025-06-05 16:26:20 -04:00
Daniel Bennett
c9da06eac8 chore(deps): bump github.com/docker/cli (#25995)
Bumps [github.com/docker/cli](https://github.com/docker/cli) from 28.1.1+incompatible to 28.2.2+incompatible.
- [Commits](https://github.com/docker/cli/compare/v28.1.1...v28.2.2)

---
updated-dependencies:
- dependency-name: github.com/docker/cli
  dependency-version: 28.2.2+incompatible
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-05 11:52:32 -04:00
dependabot[bot]
6a35c1b8ea chore(deps): bump github.com/docker/docker from 28.1.1+incompatible to 28.2.2+incompatible (#25954)
* chore(deps): bump github.com/docker/docker

Bumps [github.com/docker/docker](https://github.com/docker/docker) from 28.1.1+incompatible to 28.2.2+incompatible.
- [Release notes](https://github.com/docker/docker/releases)
- [Commits](https://github.com/docker/docker/compare/v28.1.1...v28.2.2)

---
updated-dependencies:
- dependency-name: github.com/docker/docker
  dependency-version: 28.2.2+incompatible
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* deps: containerd/errdefs instead of docker/errdefs

moby's errdefs are deprecated as of
f1bb44aeee
and now merely point to containerd's

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>
2025-06-05 10:26:18 -04:00
Piotr Kazmierczak
ce054aae96 scheduler: add a readme and start documenting low level implementation details (#25986)
In an effort to improve the readability and maintainability of nomad/scheduler
package, we begin with a README file that describes its operation in more detail
than the official documentation does. This PR will be followed by a few small
ones that move the code around that package, improve variable naming and also
keep that readme up to date.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-06-05 15:36:17 +02:00
Tobi Lehman
cf9f269ccf docs: Fix typo for GPUs (#25987) 2025-06-05 08:43:30 +01:00
James Rasell
428f329cab rpc: Fix data race in yamux config modification for conn handling. (#25978)
The server RPC handler and RPC connection pool both use a shared
configuration object for custom yamux configuration. Both
sub-systems were modifying the shared object which could cause a
data race. The passed object is now cloned before being modified.

This change also moves where the yamux configuration is cloned
and modified to the relevant constructor function. This avoids
performing a clone per connection handle or per new connection
generated in the RPC pool.
2025-06-05 08:05:46 +01:00
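The clone-in-constructor approach from 428f329cab can be sketched as follows. `muxConfig`, `rpcServer`, and `connPool` are hypothetical stand-ins, not the yamux or Nomad types; the point is only that each subsystem clones the shared configuration once, in its constructor, before modifying it.

```go
package main

import "fmt"

// muxConfig is a hypothetical stand-in for a shared multiplexer config.
type muxConfig struct {
	AcceptBacklog int
	LogPrefix     string
}

// clone returns a shallow copy, which is enough for this value-only struct.
func (c *muxConfig) clone() *muxConfig {
	cp := *c
	return &cp
}

type rpcServer struct{ cfg *muxConfig }
type connPool struct{ cfg *muxConfig }

// newRPCServer clones the shared config once and customizes its own copy.
func newRPCServer(shared *muxConfig) *rpcServer {
	cfg := shared.clone()
	cfg.LogPrefix = "server"
	return &rpcServer{cfg: cfg}
}

// newConnPool does the same, so neither subsystem writes to the shared object.
func newConnPool(shared *muxConfig) *connPool {
	cfg := shared.clone()
	cfg.LogPrefix = "pool"
	return &connPool{cfg: cfg}
}

func main() {
	shared := &muxConfig{AcceptBacklog: 256}
	srv := newRPCServer(shared)
	pool := newConnPool(shared)
	fmt.Printf("%q %q %q\n", shared.LogPrefix, srv.cfg.LogPrefix, pool.cfg.LogPrefix)
}
```

Because the clone happens in the constructor, connection handling reuses the already customized copy instead of cloning per connection.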
Daniel Bennett
3ed91193ec ci: windows 2022 runners (upcoming 2019 eol) (#25984)
fix for:
> This is a scheduled Windows Server 2019 brownout.
> The Windows Server 2019 image will be removed on 2025-06-30.
> For more details, see actions/runner-images#12045
2025-06-04 16:55:41 -04:00
James Rasell
e95148c10d consul: Fix data race within test by using mutex to read map. (#25977) 2025-06-04 15:09:37 +01:00
James Rasell
6cf535a86f drainer: Fix data race within test by correctly copying alloc. (#25975)
Some test cases were writing the same allocation object (memory
pointer) to Nomad state in subsequent upsert calls. This causes a
race condition with the drainer job watcher which reads the same
object from Nomad state to perform conditional checks.

The data race is fixed by ensuring the allocation is copied
between writes.
2025-06-04 14:11:17 +01:00
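The copy-between-writes pattern from 6cf535a86f can be illustrated with a toy in-memory store. The `allocation` and `stateStore` types are hypothetical and far simpler than Nomad's state store; the sketch assumes only that the store keeps the pointer it is handed.

```go
package main

import "fmt"

// allocation is a hypothetical, simplified allocation record.
type allocation struct {
	ID           string
	ClientStatus string
}

// copy returns a new allocation with the same field values.
func (a *allocation) copy() *allocation {
	c := *a
	return &c
}

// stateStore is a toy store that retains the pointer passed to upsert.
type stateStore struct {
	allocs map[string]*allocation
}

func (s *stateStore) upsert(a *allocation) { s.allocs[a.ID] = a }

func main() {
	store := &stateStore{allocs: map[string]*allocation{}}

	a := &allocation{ID: "a1", ClientStatus: "running"}
	store.upsert(a)

	// Copy before changing fields, so the object already held by the store
	// (and possibly being read by a watcher) is never written to.
	updated := a.copy()
	updated.ClientStatus = "complete"
	fmt.Println(store.allocs["a1"].ClientStatus)

	store.upsert(updated)
	fmt.Println(store.allocs["a1"].ClientStatus)
}
```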
Piotr Kazmierczak
648bacda77 testing: migrate nomad/scheduler off of testify (#25968)
In the spirit of #25909, this PR removes testify dependencies from the scheduler
package, along with reflect.DeepEqual removal. This is again a combination of
semgrep and hx editing magic.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-06-04 09:29:28 +02:00
Tim Gross
34e96932a1 drivers: normalize CPU shares/weights to fit large hosts (#25963)
The `resources.cpu` field is scheduled in MHz. On most Linux task drivers, this
value is then mapped to a `cpu.share` (cgroups v1) or `cpu.weight` (cgroups
v2). But this means on very large hosts where the total compute is greater than
the Linux kernel defined maximum CPU shares, you can't set a `resources.cpu`
value large enough to consume the entire host.

The `cpu.share`/`cpu.weight` value is relative within the parent cgroup's slice,
which is owned by Nomad. So we can fix this by re-normalizing the weight on very
large hosts such that the maximum `resources.cpu` matches up with largest
possible CPU share. This happens in the task driver so that the rest of Nomad
doesn't need to be aware of this implementation detail. Note that these functions
will result in a bad share config if the request exceeds the available compute, but
that's supposed to be caught in the scheduler, so by not catching it here we
intentionally hit the runc error.

Fixes: https://hashicorp.atlassian.net/browse/NMD-297
Fixes: https://github.com/hashicorp/nomad/issues/7731
Ref: https://go.hashi.co/rfc/nmd-211
2025-06-03 15:57:40 -04:00
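One way to picture the re-normalization in 34e96932a1 is a linear rescale against the cgroups v1 maximum of 262144 CPU shares. The formula and the function name `normalizeShares` are assumptions for illustration; the commit message does not spell out the exact arithmetic the drivers use.

```go
package main

import "fmt"

// maxCPUShares is the Linux kernel maximum for cgroups v1 cpu.shares.
const maxCPUShares int64 = 262144

// normalizeShares maps a resources.cpu request (MHz) to CPU shares.
// On hosts whose total compute fits under the kernel maximum the request is
// used as-is. On larger hosts it is scaled so that a request for the whole
// host maps to maxCPUShares. Like the change described above, this sketch
// does not reject requests larger than the host's total compute.
func normalizeShares(requestedMHz, totalMHz int64) int64 {
	if totalMHz <= maxCPUShares {
		return requestedMHz
	}
	return requestedMHz * maxCPUShares / totalMHz
}

func main() {
	fmt.Println(normalizeShares(500, 10000))       // small host
	fmt.Println(normalizeShares(524288, 524288))   // whole large host
	fmt.Println(normalizeShares(262144, 524288))   // half of a large host
}
```

Since the share value is relative within the parent cgroup, scaling every task on the host by the same factor keeps the proportions between tasks unchanged.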
Tim Gross
6c630c4bfa docs: expand on recommendations for CPU resource reservation (#25964)
Add some prescriptive guidance to the CPU concepts document around when to use
`resources.cores` vs `resources.cpu`. Extend some of the text to cover cgroups
v2.

Ref: https://hashicorp.atlassian.net/browse/NMD-297
Ref: https://go.hashi.co/rfc/nmd-211
Ref: https://github.com/hashicorp/nomad/pull/25963
2025-06-03 15:57:04 -04:00
dependabot[bot]
ac31a3c629 chore(deps): bump google.golang.org/grpc from 1.72.1 to 1.72.2 (#25953)
Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.72.1 to 1.72.2.
- [Release notes](https://github.com/grpc/grpc-go/releases)
- [Commits](https://github.com/grpc/grpc-go/compare/v1.72.1...v1.72.2)

---
updated-dependencies:
- dependency-name: google.golang.org/grpc
  dependency-version: 1.72.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-03 15:45:26 -04:00
3nprob
e79f8e3e98 fix: consider volume_mounts in sidecarTaskDiff (#25878)
* fix: consider volume_mounts in sidecarTaskDiff

* chore: add changelog entry

* test: add test for sidecar task diff

* fix diff test

* make cl match #25528

---------

Co-authored-by: 3np <3np@example.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2025-06-02 09:52:20 -07:00
Juana De La Cuesta
bdfd573fc4 Update the scaling policies when deregistering a job (#25911)
* func: Update the scaling policies when deregistering a job

* func: Add tests for updating the policy

* docs: add changelog

* func: set back the old order

* style: rearrange for clarity and to reuse the watchset

* func: set the policies to the last submitted when starting a job

* func: expand tests of the start job command to include job submission

* func: Expand the tests to verify the correct state of the scaling policy after job start

* Update command/job_start.go

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* Update nomad/fsm_test.go

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* func: add warning when there is no previous job submission

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-06-02 16:11:38 +02:00
James Rasell
ae3eaf80d1 docs: Fix node pool concept missing backtick for style. (#25956) 2025-06-02 09:09:35 +01:00
Piotr Kazmierczak
348177d118 e2e: correct TestSingleAffinities behavior (#25943)
TestSingleAffinities never expected a node with affinity score set to 0 in
the set of returned nodes. However, since #25800, this can happen. What the
test should be checking for instead is that the node with the highest normalized
score has the right affinity.
2025-05-30 19:46:08 +02:00
Tim Gross
beae92cd0b cancel waiting evals when allocs reconnect (#25923)
When a disconnected alloc reconnects, the follow-up evaluation is left pending
and the followup eval ID field isn't cleared. If the allocation later fails, the
followup eval ID prevents the server from creating a new eval for that event.

Update the state store so that updates from the client clear the followup eval
ID if the allocation is reconnecting, and mark the eval as canceled. Update the
FSM to remove those evals from the eval broker's delay heap.

Fixes: https://github.com/hashicorp/nomad/issues/12809
Fixes: https://hashicorp.atlassian.net/browse/NMD-302
2025-05-30 08:57:51 -04:00
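The state store update in beae92cd0b can be sketched roughly as below. The `alloc` and `eval` types, the `Reconnecting` flag, and `applyClientUpdate` are hypothetical simplifications; removal from the eval broker's delay heap is represented only by returning the canceled eval ID.

```go
package main

import "fmt"

// eval is a hypothetical, simplified evaluation record.
type eval struct {
	ID     string
	Status string
}

// alloc is a hypothetical, simplified allocation record.
type alloc struct {
	ID             string
	FollowupEvalID string
	Reconnecting   bool
}

// applyClientUpdate clears the follow-up eval ID of a reconnecting alloc and
// marks that eval as canceled. It returns the canceled eval ID (or "") so a
// caller could also drop it from a delay queue.
func applyClientUpdate(a *alloc, evals map[string]*eval) string {
	if !a.Reconnecting || a.FollowupEvalID == "" {
		return ""
	}
	id := a.FollowupEvalID
	if e, ok := evals[id]; ok {
		e.Status = "canceled"
	}
	a.FollowupEvalID = ""
	return id
}

func main() {
	evals := map[string]*eval{"e1": {ID: "e1", Status: "pending"}}
	a := &alloc{ID: "a1", FollowupEvalID: "e1", Reconnecting: true}

	canceled := applyClientUpdate(a, evals)
	fmt.Println(canceled, evals["e1"].Status, a.FollowupEvalID == "")
}
```

With the field cleared, a later failure of the allocation is no longer blocked by a stale follow-up eval ID.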