When performing a graceful shutdown, the client drain configuration is
checked for a deadline, which is added to the shutdown timeout. When the
agent runs as a server, the client is not set, and attempting to read the
drain deadline results in a panic. This change checks that the client is
available before fetching the deadline value.
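A minimal sketch of the nil guard, using stand-in types rather than Nomad's real agent and client structs:

```go
package main

import (
	"fmt"
	"time"
)

// client is a stand-in for the Nomad client; the real accessor for the drain
// deadline differs, this is illustration only.
type client struct {
	drainDeadline time.Duration
}

// agent holds a client only when client mode is enabled; on a server-only
// agent this field is nil.
type agent struct {
	client *client
}

// shutdownTimeout adds the drain deadline to the base timeout, but only when
// a client is actually configured, avoiding the nil dereference.
func (a *agent) shutdownTimeout(base time.Duration) time.Duration {
	if a.client == nil {
		return base
	}
	return base + a.client.drainDeadline
}

func main() {
	server := &agent{} // server-only agent, no client configured
	node := &agent{client: &client{drainDeadline: 5 * time.Minute}}
	fmt.Println(server.shutdownTimeout(30*time.Second), node.shutdownTimeout(30*time.Second))
}
```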
The `killTasks` function kills all of the alloc runner's task runners. If a
task runner's task has already completed, killing the task runner can cause
confusion because the resulting task event shows the task was signaled even
though it had already finished. To prevent this, a check is performed when
creating the task event to determine whether the task has completed. If it
has, no task event is created, so killing the task runner adds no extra
task event.
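A rough sketch of the check, with simplified stand-ins for the task state and event types (the real Nomad structs differ):

```go
package main

import "fmt"

// taskState is a simplified stand-in for a task runner's state.
type taskState struct {
	dead   bool // task has finished running
	failed bool // task finished with a failure
}

type taskEvent struct{ kind string }

// killEventFor returns the event to record when killing a task runner, or nil
// when the task already completed so no "signaled" event should be added.
func killEventFor(state taskState) *taskEvent {
	if state.dead && !state.failed {
		// The task finished on its own; emitting a kill event here would make
		// it look like it was signaled after completing.
		return nil
	}
	return &taskEvent{kind: "Killing"}
}

func main() {
	fmt.Println(killEventFor(taskState{dead: true}))  // <nil>: no extra event
	fmt.Println(killEventFor(taskState{dead: false})) // &{Killing}
}
```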
Both the cluster reconciler and node reconciler emit a debug-level log line with
their results, but these are unstructured multi-line logs that are annoying for
operators to parse. Change these to emit structured key-value pairs like we do
everywhere else.
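Roughly the shape of the change, using `go-hclog` (which Nomad logs with); the field names below are illustrative, not the exact keys the reconcilers emit:

```go
package main

import (
	"os"

	"github.com/hashicorp/go-hclog"
)

func main() {
	logger := hclog.New(&hclog.LoggerOptions{
		Name:   "reconciler",
		Level:  hclog.Debug,
		Output: os.Stderr,
	})

	// Instead of formatting a multi-line summary string, emit the result
	// counts as key/value pairs that operators can grep and filter on.
	logger.Debug("reconciliation complete",
		"place", 3,
		"stop", 1,
		"inplace_update", 0,
		"ignore", 12,
	)
}
```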
Ref: https://hashicorp.atlassian.net/browse/NMD-818
Ref: https://go.hashi.co/rfc/nmd-212
When debugging an evaluation, you almost always want to know about all the
related evaluations and what allocations were placed by that evaluation (and
where), not just failed placements. We can enrich the command by adding the
`related` query parameter to the API call and having the command query for the
evaluation's allocations automatically. Emit this data as a pair of new tables,
and expose fields like quota limits and the previous/next/blocked evaluations
without requiring the `-verbose` flag.
Update the docs to include the full output and remove references to long-removed
behavior of the `-json` flag.
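For reference, a hedged sketch of what the command does against the HTTP API via the Go client: pass `related=true` as a query parameter and fetch the evaluation's allocations separately (the exact field names on the API structs may differ slightly):

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	evalID := "example-eval-id" // hypothetical evaluation ID

	// Ask the API to include related evaluations via the "related" query param.
	eval, _, err := client.Evaluations().Info(evalID, &api.QueryOptions{
		Params: map[string]string{"related": "true"},
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, related := range eval.RelatedEvals {
		fmt.Println("related eval:", related.ID, related.Status)
	}

	// Fetch the allocations placed by this evaluation for the second table.
	allocs, _, err := client.Evaluations().Allocations(evalID, nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, alloc := range allocs {
		fmt.Println("alloc:", alloc.ID, "on node", alloc.NodeID)
	}
}
```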
Ref: https://hashicorp.atlassian.net/browse/NMD-818
Ref: https://go.hashi.co/rfc/nmd-212
As part of ongoing work to make the scheduler more legible and more robustly
tested, we're implementing property testing of at least the reconciler. This
changeset provides some infrastructure we'll need for generating the test cases
using `pgregory.net/rapid`, without building out any of the property assertions
yet (that'll be in upcoming PRs over the next couple weeks).
The alloc reconciler generator produces a job, a previous version of the job, a
set of tainted nodes, and a set of existing allocations. The node reconciler
generator produces a job, a set of nodes, and allocations on those
nodes. Reconnecting allocs are not yet well-covered by these generators, and
with ~40 dimensions covered so far we may need to pull those out to their own
tests in order to get good coverage.
Note the scenarios only randomize fields of interest; fields like the job name
that don't impact the reconciler would use up available shrink cycles on failed
tests without actually reducing the scope of the scenario.
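A small sketch of the kind of generator involved, using `pgregory.net/rapid`; the scenario fields here are placeholders, not the real ~40 dimensions:

```go
package reconciler

import (
	"testing"

	"pgregory.net/rapid"
)

// scenario is a simplified stand-in for the reconciler test inputs; the real
// generators build full job, node, and allocation structs.
type scenario struct {
	groupCount   int
	desiredTotal int
	taintedNodes int
}

// genScenario randomizes only fields of interest so that rapid's shrinker
// spends its cycles on dimensions that actually matter to the reconciler.
func genScenario() *rapid.Generator[scenario] {
	return rapid.Custom(func(t *rapid.T) scenario {
		return scenario{
			groupCount:   rapid.IntRange(1, 5).Draw(t, "groupCount"),
			desiredTotal: rapid.IntRange(0, 20).Draw(t, "desiredTotal"),
			taintedNodes: rapid.IntRange(0, 3).Draw(t, "taintedNodes"),
		}
	})
}

func TestReconcilerScenarios(t *testing.T) {
	rapid.Check(t, func(t *rapid.T) {
		sc := genScenario().Draw(t, "scenario")
		// Placeholder property: a real test would run the reconciler on the
		// generated scenario and assert invariants over its results.
		if sc.desiredTotal < 0 || sc.groupCount < 1 {
			t.Fatalf("impossible scenario: %+v", sc)
		}
	})
}
```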
Ref: https://hashicorp.atlassian.net/browse/NMD-814
Ref: https://github.com/flyingmutant/rapid
Restoring scaling policies when starting a stopped job did not account for
jobs without any scaling policies, which led to a panic when users tried to
restart such jobs.
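The fix amounts to a guard along these lines (stand-in types; the real code operates on the job struct's scaling policies):

```go
package main

import "fmt"

// scalingPolicy and job are stand-ins for the real Nomad structs.
type scalingPolicy struct{ Target string }

type job struct {
	ScalingPolicies []*scalingPolicy
}

// restoreScalingPolicies skips jobs that carry no scaling policies; the fix
// adds a guard of this shape before touching any per-policy state.
func restoreScalingPolicies(j *job) {
	if j == nil || len(j.ScalingPolicies) == 0 {
		return
	}
	for _, p := range j.ScalingPolicies {
		fmt.Println("restoring policy for", p.Target)
	}
}

func main() {
	restoreScalingPolicies(&job{}) // safe no-op for a job without policies
}
```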
When a test starts an agent with the client enabled, we can wait within
the setup method until the client reaches the ready state. This mimics
what we already do for leadership and the root keyring, and should reduce
flaky tests that assume the client is ready as soon as the setup function
returns, which is not guaranteed.
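The wait itself is just a bounded poll, roughly like the sketch below (the real setup uses Nomad's test utilities and checks the client node's status with the server):

```go
package main

import (
	"fmt"
	"time"
)

// waitForClientReady polls until the ready check passes or the deadline hits;
// it stands in for the testutil-style helpers the test agent setup uses.
func waitForClientReady(ready func() bool, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if ready() {
			return nil
		}
		time.Sleep(100 * time.Millisecond)
	}
	return fmt.Errorf("client did not become ready within %s", timeout)
}

func main() {
	start := time.Now()
	// In the real setup, the ready func would ask the server whether the
	// client node has reached status "ready".
	err := waitForClientReady(func() bool {
		return time.Since(start) > 300*time.Millisecond
	}, 2*time.Second)
	fmt.Println("client ready:", err == nil)
}
```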
The change exposed a couple of TLS reload tests which were not
using the test agent correctly. They were setting up a client even
though it would never be able to join the cluster due to TLS
configuration issues. These have been fixed.
Regardless of the region identifier passed, the CLI always added
"<role>.global.nomad" to the certificate DNS names. This is not what we
expect, and the behavior has been removed.
While here, the long-deprecated cluster-region flag has been removed.
This removal only impacts CLI functionality, so it is safe to do.
The Nomad server uses an authenticator backend for RPC handling which
includes TLS verification. This verification setting is configured based
on the server's TLS configuration object and is built when a new server
is constructed.
The bug occurs when a server's TLS configuration is reloaded, which can
change the desired TLS verification handling. In this case the
authenticator is not updated, meaning the RPC mTLS verification is not
modified even when the configuration indicates it should be.
This change adds a new function on the authenticator to allow updating
its TLS verification rule. This new function is called when a server's
TLS configuration is reloaded.
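A simplified sketch of the shape of the new hook; the real authenticator type and method names in Nomad differ:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// authenticator stands in for the server's RPC authenticator backend.
type authenticator struct {
	verifyTLS atomic.Bool
}

// SetVerifyTLS lets the server push a new verification setting into the
// authenticator when its TLS configuration is reloaded, instead of keeping
// the value captured at construction time.
func (a *authenticator) SetVerifyTLS(verify bool) { a.verifyTLS.Store(verify) }

func (a *authenticator) VerifyTLS() bool { return a.verifyTLS.Load() }

func main() {
	auth := &authenticator{}
	auth.SetVerifyTLS(true) // server constructed with mTLS verification enabled

	// Later, a TLS config reload disables verification; the reload path now
	// updates the authenticator rather than leaving it stale.
	auth.SetVerifyTLS(false)
	fmt.Println("verify mTLS:", auth.VerifyTLS())
}
```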
In hashicorp/nomad-enterprise#2592 we introduced a divergence in how Nomad CE
and ENT build their binaries. Nomad CE used a more sophisticated approach,
setting the uid, gid, and home environment variables in the docker run
command. Despite my (and others') best efforts, we were not able to do the
same in the ENT repo, which relies on special git settings that allow it to
pull dependencies from private repositories, so we left a different docker run
command there that simply inherited the GHA runner user and copied the
resulting tarball instead of moving it. #26090 then attempted to remedy
#25910, which resulted from the docker run command ignoring ${{ env.GO_TAGS }}
when run with a custom --env, but the resulting backport broke ENT builds.
This PR restores the ENT behavior of building Nomad with the GHA runner user,
thus inheriting the runner's environment on ENT.
For reasons of backwards compatibility, Nomad uses an older branch of
HCL1 (`v1.0.1-nomad`) and HCL2 (`v2.20.2-nomad-1`) and backports a limited set
of changes to those branches.
But the Vault API also has its own HCL1 branch, currently tagged as
`v1.0.1-vault-7`. Normally this isn't a problem because Nomad pins to our own
branch and we don't call any of the Vault API package's HCL code anyway. But in
Vault's branch some functions were changed that break our build unless we
backport them.
We've backported enough of Vault's changes to make our HCL1 branch build, and
now have tags on the HCL repo so that we can pin to specific tags instead of
random commits.
Fixes: https://hashicorp.atlassian.net/browse/NMD-850
Fixes: https://github.com/hashicorp/nomad/pull/26006
Ref: https://github.com/hashicorp/hcl/pull/760
This changeset separates reconciler fields into their own sub-struct to make
testing easier and the code more explicit about what fields relate to which
state.
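Illustratively, the change moves in this direction (field names are made up for the sketch):

```go
package main

import "fmt"

// reconcilerState groups the fields that describe a single reconciliation
// run, so tests can construct and inspect them directly.
type reconcilerState struct {
	deploymentPaused bool
	deploymentFailed bool
	existingAllocs   int
}

type allocReconciler struct {
	jobID string          // configuration that does not change during a run
	state reconcilerState // per-run state, now grouped in its own sub-struct
}

func main() {
	r := allocReconciler{jobID: "example", state: reconcilerState{existingAllocs: 3}}
	fmt.Printf("%+v\n", r)
}
```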
The RPC is only ever called from a Nomad client, which means we can move
it away from the generic Authenticate function to the tighter
AuthenticateClientOnly one. An additional check to ensure the ACL object
allows client operations is performed, mimicking other endpoints of this
nature.
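In outline, the handler ends up looking something like the sketch below; the ACL type and `AllowClientOp` check are simplified stand-ins for Nomad's real types:

```go
package main

import (
	"errors"
	"fmt"
)

// acl is a stand-in for Nomad's ACL object.
type acl struct{ clientOps bool }

func (a *acl) AllowClientOp() bool { return a.clientOps }

var errPermissionDenied = errors.New("Permission denied")

// handleClientRPC authenticates with the client-only path and then verifies
// the resulting ACL permits client operations before doing any work.
func handleClientRPC(authenticateClientOnly func() (*acl, error)) error {
	aclObj, err := authenticateClientOnly()
	if err != nil {
		return err
	}
	if !aclObj.AllowClientOp() {
		return errPermissionDenied
	}
	return nil // proceed with the RPC body
}

func main() {
	err := handleClientRPC(func() (*acl, error) { return &acl{clientOps: true}, nil })
	fmt.Println("allowed:", err == nil)
}
```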
Cluster reconciler code is notoriously hard to follow because most of its
methods continuously mutate the fields of the allocReconciler object. Even
for top-level methods this makes the code hard to follow, and it gets really
gnarly with lower-level methods (of which there are many). This changeset
proposes a refactoring that makes the vast majority of these methods return
explicit values and avoid mutating object fields.
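A toy before/after to show the direction of the refactoring; these are not the real reconciler methods:

```go
package main

import "fmt"

type results struct{ place, stop int }

// Mutating style: the outcome is spread across receiver fields, so the reader
// has to track which method changed what.
type mutatingReconciler struct{ result results }

func (r *mutatingReconciler) computeStops(excess int) { r.result.stop += excess }

// Explicit style: the method returns its contribution and the caller combines
// results, which is easier to follow and to unit test in isolation.
func computeStops(excess int) results { return results{stop: excess} }

func main() {
	m := &mutatingReconciler{}
	m.computeStops(2)
	fmt.Println(m.result, computeStops(2))
}
```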
In #25963 we added normalization of CPU shares for large hosts where the total
compute was larger than the maximum CPU shares. But if the result after
normalization is less than 2, runc will have an integer overflow. We prevent
this in the shared executor for the `exec`/`rawexec` driver by clamping to the
safe minimum value. Do this for the `docker` driver as well and add test
coverage of it for the shared executor too.
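The clamp is conceptually simple; a sketch under the assumption that 2 is the smallest share value runc handles safely (the normalization formula and limits here are illustrative):

```go
package main

import "fmt"

// minCPUShares is the smallest value we allow after normalization, since
// values below 2 trigger the runc integer overflow described above.
const minCPUShares = 2

// normalizeCPUShares scales shares down on hosts whose total compute exceeds
// the maximum share value, then clamps to the safe minimum.
func normalizeCPUShares(shares, totalCompute, maxShares int64) int64 {
	if totalCompute > maxShares {
		shares = shares * maxShares / totalCompute
	}
	if shares < minCPUShares {
		return minCPUShares
	}
	return shares
}

func main() {
	// A tiny reservation on a very large host would normalize below 2
	// without the clamp.
	fmt.Println(normalizeCPUShares(5, 1_000_000, 262_144)) // prints 2
}
```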
Fixes: https://github.com/hashicorp/nomad/issues/26080
Ref: https://github.com/hashicorp/nomad/pull/25963
In our E2E environment we've seen some flakiness with the Consul-related
tests. As it turns out, the Consul agents are getting restarted every 90s or so
because they're timing out their systemd notification.
> consul.service: start operation timed out. Terminating.
This appears to be a known issue in Consul, and we'll try to help hunt down
the cause if they want it, but in the meantime let's remove it from our
systemd unit files for the Consul agents.
Ref: https://github.com/hashicorp/consul/issues/16844#issuecomment-1913282248
* E2E: fix scaling test assertion for extra Windows host
The scaling test assumes that all nodes will receive the system job. But the job
can only run on Linux hosts, so the count will be wrong if we're running a
Windows host as part of the cluster. Filter the expected count by the OS.
While we're touching this test, let's also migrate it off the legacy framework.
Some time ago the Windows host we were using as a Nomad client agent test target
started failing to allow ssh connections. The underlying problem appears to be
with sysprep but I wasn't able to debug the exact cause as it's not an area I
have a lot of expertise in.
Swap out the deprecated Windows 2016 host for a Windows 2022 host. This will use
a base image provided by Amazon and then we'll use a userdata script to
bootstrap ssh and some target directories for Terraform to upload files to. The
more modern Windows will let us drop some of the extra PowerShell scripts we
were using as well.
Fixes: https://hashicorp.atlassian.net/browse/NMD-151
Fixes: https://github.com/hashicorp/nomad-e2e/issues/125
When we renew Vault tokens, we use the lease duration to determine how often to
renew. But we also set an `increment` value which is never updated from the
initial 30s. For periodic tokens this is not a problem because the `increment`
field is ignored on renewal. But for non-periodic tokens this prevents the token
TTL from being properly incremented. This behavior has been in place since
the initial Vault client implementation in #1606, but before the switch to
workload identity most (all?) tokens being created were periodic, so this was
never detected.
Fix this bug by updating the request's `increment` field to the lease duration
on each renewal.
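The renewal loop now looks roughly like the sketch below (using the Vault API client's `RenewSelf`; backoff, stop channels, and error handling are elided):

```go
package main

import (
	"log"
	"time"

	vaultapi "github.com/hashicorp/vault/api"
)

// renewLoop keeps the increment passed to each renewal in step with the lease
// duration returned by the previous one, instead of leaving it at the
// initial value.
func renewLoop(client *vaultapi.Client, initialLease int) {
	increment := initialLease
	for {
		secret, err := client.Auth().Token().RenewSelf(increment)
		if err != nil {
			log.Printf("renewal failed: %v", err)
			return
		}
		lease := secret.Auth.LeaseDuration
		increment = lease // the fix: request a TTL matching the current lease

		// Renew again roughly halfway through the lease.
		time.Sleep(time.Duration(lease/2) * time.Second)
	}
}

func main() {
	client, err := vaultapi.NewClient(vaultapi.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	renewLoop(client, 30)
}
```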
Also switch out a `time.After` call in the derive-token caller's backoff with
a safe timer, so that we don't spawn a new goroutine per loop iteration and
have tighter control over when the timer is GC'd.
Ref: https://github.com/hashicorp/nomad/pull/1606
Ref: https://github.com/hashicorp/nomad/issues/25812
Tests running in CI are starting to bump up against this timeout, forcing
re-runs. Add an additional five minutes to the timeout to help prevent this
from occurring.
Batch job allocations that are drained from a node will be moved to an
eligible node. However, when no eligible nodes are available to place the
draining allocations, the tasks end up marked complete and will not be
placed when an eligible node becomes available. This occurs because the
drained allocations are simultaneously stopped on the draining node while
attempting to be placed on an eligible node. Stopping the allocations on
the draining node results in tasks being killed, but importantly this kill
does not fail the task. The result is tasks reporting as complete because
their state is dead but not failed. As such, when an eligible node becomes
available, all tasks show as complete and no allocations need to be placed.
To prevent the behavior described above, a check is performed when
the alloc runner kills its tasks. If the allocation's job type is
batch, and the allocation has a desired transition of migrate, the
task will be failed when it is killed. This ensures the task does
not report as complete, and when an eligible node becomes available
the allocations are placed as expected.
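The check itself is small; a sketch with simplified stand-ins for the allocation fields involved (in Nomad these come from the job type and the allocation's desired transition):

```go
package main

import "fmt"

// alloc is a simplified stand-in for the allocation as seen by the alloc
// runner when it kills its tasks.
type alloc struct {
	jobType       string
	shouldMigrate bool // desired transition indicates a drain migration
}

// failOnKill reports whether killing this allocation's tasks should also mark
// them failed, so a drained batch alloc is replaced once a node is available.
func failOnKill(a alloc) bool {
	return a.jobType == "batch" && a.shouldMigrate
}

func main() {
	fmt.Println(failOnKill(alloc{jobType: "batch", shouldMigrate: true}))   // true
	fmt.Println(failOnKill(alloc{jobType: "service", shouldMigrate: true})) // false
}
```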
We have a description of the order of shutdown in the `task.leader` docs, but
the `lifecycle` block is an intuitive place to look for this same information,
and the behavior is largely governed by that feature anyway.
When performing a graceful shutdown, a channel is used to wait for the agent
to leave. The channel is closed when the agent leaves successfully, but it is
also closed within a deferred call. If the agent successfully leaves and
closes the channel, a panic occurs when the channel is closed a second time
in the deferral. To prevent this, the channel close is wrapped in a
`sync.OnceFunc` so the channel is only closed once.
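The pattern in the standard library looks like this (a minimal, self-contained example of closing a channel exactly once with `sync.OnceFunc`):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	left := make(chan struct{})

	// Wrapping the close in sync.OnceFunc means both the success path and the
	// deferred cleanup can call it without risking a double-close panic.
	closeOnce := sync.OnceFunc(func() { close(left) })
	defer closeOnce()

	// Success path: the agent left, so signal any waiters.
	closeOnce()

	<-left
	fmt.Println("channel closed exactly once")
}
```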