Some time ago the Windows host we were using as a Nomad client agent test target
started failing to accept SSH connections. The underlying problem appears to be
with sysprep, but I wasn't able to debug the exact cause as it's not an area I
have a lot of expertise in.
Swap out the deprecated Windows 2016 host for a Windows 2022 host. This will use
a base image provided by Amazon, and then we'll use a userdata script to
bootstrap SSH and some target directories for Terraform to upload files to. The
more modern Windows will also let us drop some of the extra PowerShell scripts
we were using.
Fixes: https://hashicorp.atlassian.net/browse/NMD-151
Fixes: https://github.com/hashicorp/nomad-e2e/issues/125
When we renew Vault tokens, we use the lease duration to determine how often to
renew. But we also set an `increment` value which is never updated from the
initial 30s. For periodic tokens this is not a problem because the `increment`
field is ignored on renewal. But for non-periodic tokens this prevents the token
TTL from being properly extended. This behavior has been in place since the
initial Vault client implementation in #1606, but before the switch to workload
identity most (all?) tokens being created were periodic, so this was never
detected.
Fix this bug by updating the request's `increment` field to the lease duration
on each renewal.
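As a rough sketch of the intended behavior (assuming the
`github.com/hashicorp/vault/api` client; the function below is illustrative,
not Nomad's actual renewal code):

```go
// Illustrative only: one renewal pass that carries the granted lease
// duration forward as the next request's increment, instead of leaving
// it pinned at the initial 30s.
package sketch

import vaultapi "github.com/hashicorp/vault/api"

func renewOnce(client *vaultapi.Client, increment int) (nextIncrement int, err error) {
	secret, err := client.Auth().Token().RenewSelf(increment)
	if err != nil {
		return increment, err
	}
	// The fix: the next increment tracks the lease duration Vault granted.
	return secret.Auth.LeaseDuration, nil
}
```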
Also switch out a `time.After` call in the derive-token caller's backoff loop
with a safe timer, so that we don't have to spawn a new goroutine per loop and
have tighter control over when the timer is GC'd.
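The timer swap follows the usual reusable-timer shape; this standard-library
sketch shows the pattern in general terms rather than Nomad's actual helper:

```go
// General shape of the pattern: a single timer is created once, reset on
// each backoff iteration, and stopped on exit, rather than allocating a
// fresh timer via time.After on every pass.
package sketch

import (
	"context"
	"time"
)

func retryWithBackoff(ctx context.Context, attempt func() bool) {
	backoff := time.Second
	timer := time.NewTimer(backoff)
	defer timer.Stop()

	for {
		if attempt() {
			return
		}
		select {
		case <-timer.C: // timer drained, safe to reset below
		case <-ctx.Done():
			return
		}
		backoff *= 2
		timer.Reset(backoff)
	}
}
```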
Ref: https://github.com/hashicorp/nomad/pull/1606
Ref: https://github.com/hashicorp/nomad/issues/25812
Tests running in CI are starting to bump up against this timeout, forcing
re-runs. Add an additional five minutes to the timeout to help prevent this
from occurring.
Batch job allocations that are drained from a node will be moved
to an eligible node. However, when no eligible nodes are available
to place the draining allocations, the tasks end up complete and
are not placed when an eligible node becomes available. This occurs
because the drained allocations are simultaneously stopped on the
draining node while attempting to be placed on an eligible node.
Stopping the allocations on the draining node results in tasks being
killed, but importantly this kill does not fail the task. The result
is tasks reporting as complete because their state is dead but not
failed. As such, when an eligible node becomes available, all tasks
show as complete and no allocations need to be placed.
To prevent the behavior described above, a check is performed when
the alloc runner kills its tasks. If the allocation's job type is
batch, and the allocation has a desired transition of migrate, the
task is failed when it is killed. This ensures the task does not
report as complete, and when an eligible node becomes available
the allocations are placed as expected.
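The shape of that check looks roughly like the following; the field and
constant names follow `nomad/structs`, but the helper itself is illustrative
rather than the literal alloc runner code:

```go
// Sketch only: batch allocations being migrated off a draining node are
// marked failed when killed, so the scheduler replaces them once an
// eligible node appears instead of treating them as complete.
package sketch

import "github.com/hashicorp/nomad/nomad/structs"

func failOnKill(alloc *structs.Allocation) bool {
	return alloc.Job.Type == structs.JobTypeBatch &&
		alloc.DesiredTransition.ShouldMigrate()
}
```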
We have a description of the order of shutdown in the `task.leader` docs, but
the `lifecycle` block is an intuitive place to look for this same information,
and the behavior is largely governed by that feature anyway.
When performing a graceful shutdown, a channel is used to wait for
the agent to leave. The channel is closed when the agent leaves
successfully, but it is also closed within a deferral. If the
agent successfully leaves and closes the channel, a panic occurs
when the channel is closed a second time within the deferral.
To prevent this, the channel close is wrapped in an `OnceFunc`
so the channel is only closed once.
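A minimal sketch of the pattern, using the standard library's `sync.OnceFunc`
(Go 1.21+); the surrounding function is illustrative, not the agent's actual
shutdown code:

```go
package sketch

import (
	"log"
	"sync"
	"time"
)

func gracefulLeave(leave func() error) {
	gracefulCh := make(chan struct{})
	closeCh := sync.OnceFunc(func() { close(gracefulCh) })
	defer closeCh() // a second call is a no-op rather than a double-close panic

	go func() {
		if err := leave(); err != nil {
			log.Printf("error during leave: %v", err)
			return
		}
		closeCh() // agent left successfully
	}()

	select {
	case <-gracefulCh:
	case <-time.After(10 * time.Second):
		log.Print("timed out waiting for agent to leave")
	}
}
```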
While waiting for the agent to leave during a graceful shutdown,
the wait can be interrupted immediately if another signal is
received. It is common that a `SIGPIPE` is received from journald
while waiting, causing the wait to end early. This results in the
agent not finishing the leave process and reporting an error when
the process has stopped. Instead of allowing any signal to interrupt
the wait, the received signal is checked, and if it is a `SIGPIPE`
the wait continues.
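A sketch of the resulting wait loop (standard library only; the channel names
and timeout are illustrative):

```go
package sketch

import (
	"os"
	"syscall"
	"time"
)

// waitForLeave returns true if the graceful leave completed. SIGPIPE (e.g.
// from journald) is swallowed and the wait continues; any other signal
// still interrupts it.
func waitForLeave(signalCh <-chan os.Signal, gracefulCh <-chan struct{}) bool {
	timeout := time.After(30 * time.Second)
	for {
		select {
		case sig := <-signalCh:
			if sig == syscall.SIGPIPE {
				continue // keep waiting
			}
			return false // another interrupt aborts the graceful wait
		case <-timeout:
			return false
		case <-gracefulCh:
			return true
		}
	}
}
```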
This change isolates all the code that deals with node selection in the
scheduler into its own package called feasible.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
In the original state, when getting ACL policies by job, the
search was performing a prefix-based lookup on the index. This
can result in policies being applied incorrectly when used for
workload identities. For example, if a `custom-test` policy is
created like so:
```
nomad acl policy apply -namespace=default -job=test-job custom-test ./policy.hcl
```
A job named `test-job` will properly get this ACL policy. However,
due to the lookup being prefix-based on the index, a job named
`test-job-1` will also get this ACL policy.
To prevent this behavior, the lookup behavior on the index is
modified so it is a direct match.
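In go-memdb terms the difference is roughly the following; the table and index
names here are placeholders, not Nomad's actual schema:

```go
package sketch

import memdb "github.com/hashicorp/go-memdb"

// policiesForJob looks up ACL policies associated with a job. The exact
// "job" index match returns only test-job; the commented-out "_prefix"
// variant would also have matched test-job-1.
func policiesForJob(txn *memdb.Txn, ns, jobID string) (memdb.ResultIterator, error) {
	// Before: txn.Get("acl_policy_by_job", "job_prefix", ns, jobID)
	return txn.Get("acl_policy_by_job", "job", ns, jobID)
}
```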
* sec: add sprig template functions in denylists
* remove explicit set which is no longer needed
* go mod tidy
* add changelog
* better changelog and filtered denylist
* go mod tidy with 1.24.4
* edit changelog and remove htpasswd and derive
* fix tests
* Update client/allocrunner/taskrunner/template/template_test.go
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* edit changelog
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
In an effort to improve the readability and maintainability of the nomad/scheduler
package, we begin with a README file that describes its operation in more detail
than the official documentation does. This PR will be followed by a few small
ones that move code around that package, improve variable naming, and keep that
README up to date.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
The server RPC handler and RPC connection pool both use a shared
configuration object for custom yamux configuration. Both
sub-systems were modifying the shared object, which could cause a
data race. The passed object is now cloned before being modified.
This change also moves where the yamux configuration is cloned
and modified into the relevant constructor function. This avoids
performing a clone per connection handled or per new connection
generated in the RPC pool.
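The clone is a plain struct copy of the shared `*yamux.Config`; a sketch of
the shape, with the log-output customization as an illustrative example:

```go
package sketch

import (
	"io"

	"github.com/hashicorp/yamux"
)

// clonedYamuxConfig copies the shared config once, in the constructor, so
// each sub-system customizes its own copy instead of mutating the shared
// object on every connection.
func clonedYamuxConfig(shared *yamux.Config, logOutput io.Writer) *yamux.Config {
	if shared == nil {
		return yamux.DefaultConfig()
	}
	cfg := *shared // shallow copy of the shared defaults
	cfg.LogOutput = logOutput
	return &cfg
}
```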
fix for:
> This is a scheduled Windows Server 2019 brownout.
> The Windows Server 2019 image will be removed on 2025-06-30.
> For more details, see actions/runner-images#12045
Some test cases were writing the same allocation object (memory
pointer) to Nomad state in subsequent upsert calls. This causes a
race condition with the drainer job watcher, which reads the same
object from Nomad state to perform conditional checks.
The data race is fixed by ensuring the allocation is copied
between writes.
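The test-side fix amounts to something like the following helper, which is
illustrative rather than the literal test code and assumes Nomad's
`Allocation.Copy` and the state store's `UpsertAllocs`:

```go
package sketch

import (
	"github.com/hashicorp/nomad/nomad/state"
	"github.com/hashicorp/nomad/nomad/structs"
)

// upsertAllocCopy writes a fresh copy of the allocation on each upsert so
// the state store and the drainer's watcher never share a mutable pointer
// with the test.
func upsertAllocCopy(store *state.StateStore, index uint64, alloc *structs.Allocation) error {
	cp := alloc.Copy()
	return store.UpsertAllocs(structs.MsgTypeTestSetup, index, []*structs.Allocation{cp})
}
```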
In the spirit of #25909, this PR removes testify dependencies from the scheduler
package, along with reflect.DeepEqual removal. This is again a combination of
semgrep and hx editing magic.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
The `resources.cpu` field is scheduled in MHz. On most Linux task drivers, this
value is then mapped to `cpu.shares` (cgroups v1) or `cpu.weight` (cgroups
v2). But this means that on very large hosts, where the total compute is greater
than the Linux kernel's maximum CPU shares, you can't set a `resources.cpu`
value large enough to consume the entire host.
The `cpu.shares`/`cpu.weight` value is relative within the parent cgroup's slice,
which is owned by Nomad. So we can fix this by re-normalizing the weight on very
large hosts such that the maximum `resources.cpu` matches up with the largest
possible CPU share. This happens in the task driver so that the rest of Nomad
doesn't need to be aware of this implementation detail. Note that these functions
will produce a bad share configuration if the request is more than the available
compute, but that's supposed to be caught in the scheduler, so by not catching it
here we intentionally hit the runc error.
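The arithmetic is roughly the following sketch; the cgroups v1 cap of 262144
for `cpu.shares` is the kernel-defined maximum, while the function name and
rounding are illustrative:

```go
package sketch

// maxCPUShares is the kernel-defined ceiling for cgroups v1 cpu.shares.
const maxCPUShares = 262_144

// cpuShares maps a job's resources.cpu (MHz) to a cpu.shares value. On
// hosts whose total compute exceeds the cap, the request is rescaled so
// that asking for the whole host yields the maximum share; weights are
// relative within Nomad's parent slice, so proportions are preserved.
func cpuShares(requestMHz, totalHostMHz uint64) uint64 {
	if totalHostMHz <= maxCPUShares {
		return requestMHz
	}
	return requestMHz * maxCPUShares / totalHostMHz
}
```

For example, on a 400,000 MHz host a `resources.cpu` of 400,000 maps to
262,144 shares and 200,000 maps to 131,072, preserving the 2:1 ratio.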
Fixes: https://hashicorp.atlassian.net/browse/NMD-297
Fixes: https://github.com/hashicorp/nomad/issues/7731
Ref: https://go.hashi.co/rfc/nmd-211
* func: Update the scaling policies when deregistering a job
* func: Add tests for updating the policy
* docs: add changelog
* func: set back the old order
* style: rearrange for clarity and to reuse the watchset
* func: set the policies to the last submitted when starting a job
* func: expand tests of the start job command to include job submission
* func: Expand the tests to verify the correct state of the scaling policy after job start
* Update command/job_start.go
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* Update nomad/fsm_test.go
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* func: add warning when there is no previous job submission
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
TestSingleAffinities never expected a node with affinity score set to 0 in
the set of returned nodes. However, since #25800, this can happen. What the
test should be checking for instead is that the node with the highest normalized
score has the right affinity.
When a disconnected alloc reconnects, the follow-up evaluation is left pending
and the followup eval ID field isn't cleared. If the allocation later fails, the
followup eval ID prevents the server from creating a new eval for that event.
Update the state store so that updates from the client clear the followup eval
ID if the allocation is reconnecting, and mark the eval as canceled. Update the
FSM to remove those evals from the eval broker's delay heap.
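A sketch of the state store side of that change; the `FollowupEvalID` field
follows `nomad/structs`, but the reconnect detection and the helper's shape
are assumptions rather than the actual update path:

```go
package sketch

import "github.com/hashicorp/nomad/nomad/structs"

// clearFollowupEval clears the pending followup eval from a reconnecting
// allocation and returns its ID so the caller can mark that eval canceled
// (and the FSM can drop it from the eval broker's delay heap).
func clearFollowupEval(alloc *structs.Allocation, reconnecting bool) (cancelEvalID string) {
	if reconnecting && alloc.FollowupEvalID != "" {
		cancelEvalID = alloc.FollowupEvalID
		alloc.FollowupEvalID = "" // a later failure can now create a new eval
	}
	return cancelEvalID
}
```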
Fixes: https://github.com/hashicorp/nomad/issues/12809
Fixes: https://hashicorp.atlassian.net/browse/NMD-302