The cluster reconciler code is notoriously hard to follow because most of its
methods continuously mutate the fields of the allocReconciler object. This makes
even the top-level methods hard to follow, and it gets really gnarly with the
lower-level methods (of which there are many). This changeset proposes a
refactoring that makes the vast majority of these methods return explicit
values and avoid mutating object fields.
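As a rough illustration of the direction (the names below are simplified
stand-ins rather than the reconciler's real API), lower-level helpers return
their results and only the top-level method assembles them:

```
// Simplified stand-in types; not the real reconciler definitions.
type allocStopResult struct{ allocID string }

type allocReconciler struct {
	stop []allocStopResult // assembled only by the top-level method
}

// Before this change, helpers like this appended to a.stop as a side
// effect; now they return an explicit value instead.
func (a *allocReconciler) computeStop(group string) []allocStopResult {
	// ...selection logic elided...
	return []allocStopResult{{allocID: group}}
}

// The top-level method is the one place that mutates reconciler state.
func (a *allocReconciler) computeGroup(group string) {
	a.stop = append(a.stop, a.computeStop(group)...)
}
```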
In #25963 we added normalization of CPU shares for large hosts where the total
compute exceeds the maximum CPU shares. But if the result after normalization
is less than 2, runc hits an integer overflow. We already prevent this in the
shared executor for the `exec`/`raw_exec` drivers by clamping to the safe
minimum value. Do the same for the `docker` driver, and add test coverage for
the shared executor as well.
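A minimal sketch of the clamping, with a helper and constant named here only
for illustration (the real driver code differs):

```
// minCPUShares is the smallest CPU shares value runc accepts without an
// integer overflow; the constant name here is illustrative.
const minCPUShares = 2

// clampCPUShares guards against normalization rounding the shares below
// runc's safe minimum on very large hosts.
func clampCPUShares(shares int64) int64 {
	if shares < minCPUShares {
		return minCPUShares
	}
	return shares
}
```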
Fixes: https://github.com/hashicorp/nomad/issues/26080
Ref: https://github.com/hashicorp/nomad/pull/25963
In our E2E environment we've seen some flakiness with the Consul-related
tests. As it turns out, the Consul agents are getting restarted every 90s or so
because they're timing out their systemd notification.
> consul.service: start operation timed out. Terminating.
This appears to be a known issue in Consul, and we'll try to help hunt down the
cause upstream, but in the meantime let's remove the systemd notification from
the unit files for the Consul agents.
Ref: https://github.com/hashicorp/consul/issues/16844#issuecomment-1913282248
* E2E: fix scaling test assertion for extra Windows host
The scaling test assumes that all nodes will receive the system job. But the job
can only run on Linux hosts, so the count will be wrong if a Windows host is
part of the cluster. Filter the expected count by node OS, roughly as sketched
below.
While we're touching this test, let's also migrate it off the legacy framework.
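A hedged sketch of the assertion change; the attribute key and test helpers
here are assumptions rather than the exact test code:

```
// Count only Linux nodes when computing the expected number of system
// job allocations, since the job is constrained to Linux.
expected := 0
for _, node := range nodes {
	if node.Attributes["kernel.name"] == "linux" {
		expected++
	}
}
must.Eq(t, expected, len(allocs))
```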
* address comments from code review
Some time ago the Windows host we were using as a Nomad client agent test target
started failing to allow ssh connections. The underlying problem appears to be
with sysprep but I wasn't able to debug the exact cause as it's not an area I
have a lot of expertise in.
Swap out the deprecated Windows 2016 host for a Windows 2022 host. This will use
a base image provided by Amazon and then we'll use a userdata script to
bootstrap ssh and some target directories for Terraform to upload files to. The
more modern Windows version will let us drop some of the extra PowerShell
scripts we were using as well.
Fixes: https://hashicorp.atlassian.net/browse/NMD-151
Fixes: https://github.com/hashicorp/nomad-e2e/issues/125
When we renew Vault tokens, we use the lease duration to determine how often to
renew. But we also set an `increment` value which is never updated from the
initial 30s. For periodic tokens this is not a problem because the `increment`
field is ignored on renewal. But for non-periodic tokens this prevents the token
TTL from being properly incremented. This behavior has been in place since the
initial Vault client implementation in #1606, but before the switch to workload
identity most (all?) tokens being created were periodic, so this was never
detected.
Fix this bug by updating the request's `increment` field to the lease duration
on each renewal.
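A hedged sketch of the change, using the Vault API client's token renewal
call; the surrounding request type and field names are illustrative:

```
// Renew with the current increment, then carry the returned lease
// duration forward so the next renewal asks for the full TTL again.
secret, err := client.Auth().Token().RenewSelf(req.increment)
if err != nil {
	return err
}
// Previously req.increment stayed at its initial 30s, which kept the TTL
// of non-periodic tokens from being extended properly.
req.increment = secret.Auth.LeaseDuration
```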
Also swap out a `time.After` call in the derive token caller's backoff loop for
a safe timer, so that we don't leave a new timer behind on every iteration and
have tighter control over when it's cleaned up.
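The backoff loop roughly follows this shape after the change, using Nomad's
`helper.NewSafeTimer`; the surrounding context is illustrative:

```
// NewSafeTimer returns a timer plus a stop function we can defer,
// unlike time.After whose timer lingers until it fires.
timer, stopTimer := helper.NewSafeTimer(backoff)
defer stopTimer()

select {
case <-ctx.Done():
	return nil, ctx.Err()
case <-timer.C:
	// fall through and retry the token derivation
}
```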
Ref: https://github.com/hashicorp/nomad/pull/1606
Ref: https://github.com/hashicorp/nomad/issues/25812
Tests running in CI are starting to bump up against this timeout, forcing
re-runs. Add an additional five minutes to the timeout to help prevent this
from occurring.
Batch job allocations that are drained from a node will be moved
to an eligible node. However, when no eligible nodes are available
to place the draining allocations, the tasks end up marked
complete and will not be placed once an eligible node becomes
available. This occurs because the drained allocations are
simultaneously stopped on the draining node while attempting to
be placed on an eligible node. Stopping the allocations on the
draining node results in their tasks being killed, but
importantly this kill does not fail the tasks. The result is
tasks reporting as complete because their state is dead but not
failed. As such, when an eligible node becomes available, all
tasks show as complete and no allocations need to be placed.
To prevent the behavior described above, a check is performed when
the alloc runner kills its tasks. If the allocation's job type is
batch, and the allocation has a desired transition of migrate, the
task will be failed when it is killed. This ensures the task does
not report as complete, and when an eligible node becomes available
the allocations are placed as expected.
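In rough terms, the check looks like the following; the call site and type
accessors are approximations of the alloc runner internals, not the exact
code:

```
// When killing tasks, fail them if this is a batch allocation being
// migrated off a draining node, so a replacement is placed later
// instead of the allocation being treated as complete.
alloc := ar.Alloc()
if alloc.Job.Type == structs.JobTypeBatch && alloc.DesiredTransition.ShouldMigrate() {
	taskEvent.SetFailsTask()
}
```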
We have a description of the order of shutdown in the `task.leader` docs, but
the `lifecycle` block is an intuitive place to look for this same information,
and the behavior is largely governed by that feature anyway.
When performing a graceful shutdown, a channel is used to wait for
the agent to leave. The channel is closed when the agent leaves
successfully, but it is also closed within a deferral. If the
agent successfully leaves and closes the channel, a panic occurs
when the channel is closed a second time by the deferral. To
prevent this, the channel close is wrapped in a `sync.OnceFunc` so
the channel is only closed once.
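A minimal sketch of the guard using `sync.OnceFunc`; the agent and channel
names here are illustrative:

```
leaveCh := make(chan struct{})
closeLeaveCh := sync.OnceFunc(func() { close(leaveCh) })
// Safe even if the success path below has already closed the channel.
defer closeLeaveCh()

go func() {
	if err := agent.Leave(); err == nil {
		closeLeaveCh()
	}
}()
```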
While waiting for the agent to leave during a graceful shutdown,
the wait can be interrupted immediately if another signal is
received. It is common for a `SIGPIPE` from journald to arrive
during this wait, causing it to end early. This results in the
agent not finishing the leave process and reporting an error when
the process has stopped. Instead of allowing any signal to
interrupt the wait, the received signal is checked and, if it is a
`SIGPIPE`, we continue waiting.
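Roughly, with illustrative channel names:

```
for {
	select {
	case sig := <-signalCh:
		// journald commonly sends SIGPIPE during shutdown; keep
		// waiting for the leave to finish rather than bailing out.
		if sig == syscall.SIGPIPE {
			continue
		}
		return
	case <-leaveCh:
		return
	}
}
```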
This change isolates all the code that deals with node selection in the
scheduler into its own package called feasible.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Previously, when getting ACL policies by job, the search
performed a prefix-based lookup on the index. This can result in
policies being applied incorrectly when used for workload
identities. For example, if a `custom-test` policy is created
like so:
```
nomad acl policy apply -namespace=default -job=test-job custom-test ./policy.hcl
```
A job named `test-job` will properly get this ACL policy. However,
due to the lookup being prefix-based on the index, a job named
`test-job-1` will also get this ACL policy.
To prevent this, the lookup on the index is modified to be an
exact match.
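A hedged sketch of the go-memdb lookup change; the table and index names
below are illustrative rather than Nomad's exact schema identifiers:

```
// Before: a prefix scan on the job index also matched "test-job-1":
//   iter, err := txn.Get("acl_policy", "job_prefix", ns, jobID)
//
// After: an exact match on the index returns only policies attached
// to this job ID.
iter, err := txn.Get("acl_policy", "job", ns, jobID)
if err != nil {
	return nil, err
}
```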
* sec: add sprig template functions to denylists
* remove explicit set which is no longer needed
* go mod tidy
* add changelog
* better changelog and filtered denylist
* go mod tidy with 1.24.4
* edit changelog and remove htpasswd and derive
* fix tests
* Update client/allocrunner/taskrunner/template/template_test.go
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* edit changelog
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
In an effort to improve the readability and maintainability of the
nomad/scheduler package, we begin with a README file that describes its
operation in more detail than the official documentation does. This PR will be
followed by a few small ones that move code around within that package, improve
variable naming, and keep that README up to date.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
The server RPC handler and RPC connection pool both use a shared
configuration object for custom yamux configuration. Both
sub-systems were modifying the shared object, which could cause a
data race. The passed object is now cloned before being modified.
This change also moves the cloning and modification of the yamux
configuration into the relevant constructor function. This avoids
performing a clone per handled connection or per new connection
generated in the RPC pool.
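The clone now happens once, in the constructor, along these lines; the
function shape and field choices are illustrative:

```
// newYamuxConfig copies the shared configuration before modifying it,
// so neither the RPC handler nor the connection pool ever mutates the
// object they share.
func newYamuxConfig(shared *yamux.Config, logOutput io.Writer) *yamux.Config {
	cfg := yamux.DefaultConfig()
	if shared != nil {
		cloned := *shared // copy, so the shared object is never mutated
		cfg = &cloned
	}
	cfg.LogOutput = logOutput
	return cfg
}
```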
Fix for:
> This is a scheduled Windows Server 2019 brownout.
> The Windows Server 2019 image will be removed on 2025-06-30.
> For more details, see actions/runner-images#12045
Some test cases were writing the same allocation object (memory
pointer) to Nomad state in subsequent upsert calls. This causes a
race condition with the drainer's job watcher, which reads the
same object from Nomad state to perform conditional checks.
The data race is fixed by ensuring the allocation is copied
between writes.
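The test fix is roughly the following; the field chosen and helper names are
illustrative:

```
// Upsert a copy rather than reusing the same *structs.Allocation
// pointer, so the drainer's watcher never reads an object the test is
// still mutating.
updated := alloc.Copy()
updated.ClientStatus = structs.AllocClientStatusRunning
must.NoError(t, store.UpsertAllocs(structs.MsgTypeTestSetup, index,
	[]*structs.Allocation{updated}))
```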