* Move commands from docs to their own root-level directory
* temporarily use modified dev-portal branch with nomad ia changes
* explicitly clone nomad ia exp branch
* retrigger build, fix broken dev-portal build
* architecture, concepts and get started individual pages
* fix get started section destinations
* reference section
* update repo comment in website-build.sh to show branch
* docs nav file update capitalization
* update capitalization to force deploy
* remove nomad-vs-kubernetes dir; move content to what is nomad pg
* job section
* Nomad operations category, deploy section
* operations category, govern section
* operations - manage
* operations/scale; concepts scheduling fix
* networking
* monitor
* secure section
* remove auth-methods folder and move pages up to sso; linkcheck
* Fix install2deploy redirects
* fix architecture redirects
* Job section: Add missing section index pages
* Add section index pages so breadcrumbs build correctly
* concepts/index fix front matter indentation
* move task driver plugin config to new deploy section
* Finish adding full URL to tutorials links in nav
* change SSO to Authentication in nav and file system
* Docs NomadIA: Move tutorials into NomadIA branch (#26132)
* Move governance and policy from tutorials to docs
* Move tutorials content to job-declare section
* run jobs section
* stateful workloads
* advanced job scheduling
* deploy section
* manage section
* monitor section
* secure/acl and secure/authorization
* fix example that contains an unseal key in real format
* remove images from sso-vault
* secure/traffic
* secure/workload-identities
* vault-acl change unseal key and root token in command output sample
* remove lines from sample output
* fix front matter
* move nomad pack tutorials to tools
* search/replace /nomad/tutorials links
* update acl overview with content from deleted architecture/acl
* fix spelling mistake
* linkcheck - fix broken links
* fix link to Nomad variables tutorial
* fix link to Prometheus tutorial
* move who uses Nomad to use cases page; move spec/config shortcuts, add dividers
* Move Consul out of Integrations; move namespaces to govern
* move integrations/vault to secure/vault; delete integrations
* move ref arch to docs; rename Deploy Nomad back to Install Nomad
* address feedback
* linkcheck fixes
* Fixed raw_exec redirect
* add info from /nomad/tutorials/manage-jobs/jobs
* update page content with newer tutorial
* link updates for architecture sub-folders
* Add redirects for removed section index pages. Fix links.
* fix broken links from linkcheck
* Revert to use dev-portal main branch instead of nomadIA branch
* build workaround: add intro-nav-data.json with single entry
* fix content-check error
* add intro directory to get around Vercel build error
* workaround for empty directory
* remove mdx from /intro/ to fix content-check and git snafu
* Add intro index.mdx so Vercel build should work
---------
Co-authored-by: Tu Nguyen <im2nguyen@gmail.com>
* fix: initialize the topology of the processors to avoid nil pointers
* func: initialize topology to avoid nil pointers
* fix: update the new public method for NodeProcessorResources
The RPC handler for deleting dynamic host volumes has a check that any
allocations associated with a volume are client-terminal before deleting the
volume. But the state store delete that happens after we send client RPCs to the
plugin checks that the allocs are non-terminal on both server and client.
Because of this time-of-check/time-of-use bug, a volume can be deleted from a
client but then fail to be deleted from the state store. If the allocation
fails or completes on the client before the server marks its desired status as
terminal, or if the allocation is marked server-terminal during the client RPC,
the volume passes the first check but fails the second check in the state store
and cannot be deleted.
Update the state store delete method to require that any allocation for a volume
is client terminal in order to delete the volume, not just server terminal.
Fixes: https://github.com/hashicorp/nomad/issues/26140
Ref: https://hashicorp.atlassian.net/browse/NMD-883
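A minimal sketch of the stricter rule (names and placement are illustrative, not the exact state store code):
```go
package state // illustrative placement; the real check lives in the state store delete path

import (
	"fmt"

	"github.com/hashicorp/nomad/nomad/structs"
)

// volumeDeletable sketches the tightened rule: a dynamic host volume may only
// be deleted once every allocation claiming it is client-terminal, not merely
// server-terminal, closing the time-of-check/time-of-use gap described above.
func volumeDeletable(allocs []*structs.Allocation) error {
	for _, alloc := range allocs {
		if !alloc.ClientTerminalStatus() {
			return fmt.Errorf("volume in use by alloc %s (client status %q)",
				alloc.ID, alloc.ClientStatus)
		}
	}
	return nil
}
```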
Nomad agents emit metrics for Consul service and check operations, but these
were not documented. Update the metrics reference table to include these
metrics. Note that the metrics are prefixed `nomad.client` but are present on
all agents, because the server registers itself in Consul as well.
The output of the reconciler stage of scheduling is only visible via debug-level
logs, typically accessible only to the cluster admin. We can give job authors
better ability to understand what's happening to their jobs if we expose this
information to them in the `eval status` command.
Add the reconciler's desired updates to the evaluation struct so it can be
exposed in the API. This increases the size of evals by roughly 15% in the state
store, or a bit more when there are preemptions (but we expect this will be a
small minority of evals).
Ref: https://hashicorp.atlassian.net/browse/NMD-818
Fixes: https://github.com/hashicorp/nomad/issues/15564
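A rough sketch of the shape of that change; the type layout is plausible but the field names here are assumptions, not necessarily what was merged:
```go
package structs // sketch only; actual names and placement may differ

// DesiredUpdates summarizes what the reconciler decided for one task group:
// how many allocations to place, stop, migrate, update, and so on.
type DesiredUpdates struct {
	Ignore            uint64
	Place             uint64
	Migrate           uint64
	Stop              uint64
	InPlaceUpdate     uint64
	DestructiveUpdate uint64
	Canary            uint64
	Preemptions       uint64
}

// Evaluation (heavily elided) gains a per-task-group copy of those counts so
// the API and the `eval status` command can surface them. The field name
// below is hypothetical.
type Evaluation struct {
	ID string
	// ... existing fields elided ...

	// ReconcileSummary holds the reconciler's desired updates keyed by
	// task group name.
	ReconcileSummary map[string]*DesiredUpdates
}
```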
The mkdir plugin creates the directory and then chowns it. In the
event the chown command fails, we should attempt to remove the
directory. Without this, we leave directories on the client in
partial failure situations.
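Roughly, the behavior looks like this (a sketch of the intent, not the plugin's actual code; the helper name is hypothetical):
```go
package mkdir // illustrative sketch of the mkdir host volume plugin behavior

import (
	"fmt"
	"os"
)

// createOwnedDir creates the volume directory and then chowns it to the
// requested uid/gid. If the chown fails, the freshly created directory is
// removed so partial failures do not leave stray directories on the client.
func createOwnedDir(path string, uid, gid int) error {
	if err := os.MkdirAll(path, 0o700); err != nil {
		return fmt.Errorf("mkdir failed: %w", err)
	}
	if err := os.Chown(path, uid, gid); err != nil {
		// best-effort cleanup; surface the original chown error
		_ = os.RemoveAll(path)
		return fmt.Errorf("chown failed: %w", err)
	}
	return nil
}
```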
The meta client looks for both an environment variable and a CLI
flag when generating a client. The CLI UUID checker needs to do
this as well, so that users who pass tokens via either environment
variables or CLI flags are handled.
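In other words, the UUID check resolves the token the same way the meta client does; a minimal sketch, where the helper name is hypothetical and NOMAD_TOKEN is the standard environment variable:
```go
package cli // illustrative sketch; the helper name is an assumption

import "os"

// resolveToken prefers a token passed via CLI flag and falls back to the
// NOMAD_TOKEN environment variable, matching the meta client's behavior so
// the UUID check sees the same token the client will use.
func resolveToken(flagToken string) string {
	if flagToken != "" {
		return flagToken
	}
	return os.Getenv("NOMAD_TOKEN")
}
```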
When draining nodes, allocs are checked for a healthy state and
marked to be drained, with the value of the max parallel setting
determining how many allocs will be migrated. Depending on the
circumstances, however, the max parallel setting may not be
properly respected.
Given a job with max parallel set to one, a group count greater
than one, and allocs on multiple nodes: draining a single node
will result in one alloc being marked to drain. If another
node is immediately drained, the alloc running on the first
node will be seen as "healthy" and another alloc will be
marked to be drained, resulting in two allocs being marked
for migration at the same time. This can lead to issues with
service availability.
To prevent this, allocs can only be marked as healthy when they
have not been marked for migration. This stops migrating allocs
from being counted as healthy, so the max parallel setting is
properly respected.
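Conceptually the health check now looks something like this (a sketch of the intent, not the drainer's exact code):
```go
package drainer // sketch of the intent; the real check lives in the drainer

import "github.com/hashicorp/nomad/nomad/structs"

// countsAsHealthy reports whether an allocation should count as healthy for
// max_parallel accounting. An alloc already marked for migration is never
// treated as healthy, so draining a second node cannot push the number of
// simultaneous migrations past max_parallel.
func countsAsHealthy(alloc *structs.Allocation) bool {
	if alloc.DesiredTransition.ShouldMigrate() {
		return false
	}
	return alloc.ClientStatus == structs.AllocClientStatusRunning &&
		alloc.DeploymentStatus.IsHealthy()
}
```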
When performing a graceful shutdown, the client drain configuration
is checked for a deadline, which is appended to the timeout. When
running as a server, the client will not be set, and attempting to
get the drain deadline results in a panic. This change checks that
the client is available before fetching the deadline value.
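A minimal sketch of the guard, using stand-in types rather than the real agent (names are assumptions):
```go
package agent // illustrative sketch with stand-in types; real names differ

import "time"

type drainConfig struct{ deadline time.Duration }

type clientStub struct{ drain *drainConfig }

type agentStub struct {
	// client is nil when the agent runs only as a server.
	client *clientStub
}

// shutdownTimeout extends the base timeout by the client drain deadline, but
// only when a client is actually configured. Previously a server-only agent
// dereferenced the nil client here and panicked during graceful shutdown.
func (a *agentStub) shutdownTimeout(base time.Duration) time.Duration {
	if a.client == nil || a.client.drain == nil {
		return base
	}
	return base + a.client.drain.deadline
}
```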
The `killTasks` function kills all of the alloc runner's
task runners. If a task runner's task has already
completed, killing the task runner can cause confusion
because the task event shows the task was signaled even
though it had already completed.
To prevent this, a check is done when creating the
task event to determine whether the task has completed. If
it has, no task event is created, and when the task
runner is killed no extra task event is added.
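Sketched out, the event creation now short-circuits for terminal tasks (an illustrative helper; the real check sits inside `killTasks`):
```go
package allocrunner // illustrative helper; the real check sits inside killTasks

import "github.com/hashicorp/nomad/nomad/structs"

// killEventFor returns a "Task Killing" event for a task runner, or nil when
// the task is already dead. Emitting no event avoids a confusing "signaled"
// message for tasks that finished on their own before the kill.
func killEventFor(ts *structs.TaskState) *structs.TaskEvent {
	if ts != nil && ts.State == structs.TaskStateDead {
		return nil
	}
	return structs.NewTaskEvent(structs.TaskKilling)
}
```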
Both the cluster reconciler and node reconciler emit a debug-level log line with
their results, but these are unstructured multi-line logs that are annoying for
operators to parse. Change these to emit structured key-value pairs like we do
everywhere else.
Ref: https://hashicorp.atlassian.net/browse/NMD-818
Ref: https://go.hashi.co/rfc/nmd-212
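For example, with go-hclog the results can be emitted as key-value pairs (a sketch; the actual keys mirror the reconciler's result fields):
```go
package scheduler // illustrative sketch; the real keys mirror the result struct

import hclog "github.com/hashicorp/go-hclog"

// logReconcileResults emits the reconciler's results as structured key-value
// pairs rather than one unstructured multi-line string, so operators and log
// pipelines can parse them like any other Nomad log line.
func logReconcileResults(logger hclog.Logger, place, inplace, migrate, stop int) {
	logger.Debug("reconciled allocations",
		"place", place,
		"inplace_update", inplace,
		"migrate", migrate,
		"stop", stop,
	)
}
```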
When debugging an evaluation, you almost always want to know about all the
related evaluations and which allocations were placed by that evaluation (and
where), not just the failed placements. We can enrich the command by adding the
`related` query parameter to the API call and having the command query for the
evaluation's allocations automatically. Emit this data as a pair of new tables,
and expose fields like quota limits and the previous/next/blocked evals without
the `-verbose` flag.
Update the docs to include the full output and remove references to long-removed
behavior of the `-json` flag.
Ref: https://hashicorp.atlassian.net/browse/NMD-818
Ref: https://go.hashi.co/rfc/nmd-212
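Under the hood this roughly maps to the following API calls (a sketch using the Go API client; the eval ID is a placeholder and the exact command flow may differ):
```go
package main // sketch of the API calls the command makes; exact flow may differ

import (
	"fmt"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		panic(err)
	}

	evalID := "3e55e4b1" // placeholder eval ID

	// related=true asks the API to include related evaluations
	// (previous/next/blocked) alongside the evaluation itself.
	eval, _, err := client.Evaluations().Info(evalID, &api.QueryOptions{
		Params: map[string]string{"related": "true"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("status:", eval.Status, "related:", len(eval.RelatedEvals))

	// The command also lists the allocations placed by this evaluation.
	allocs, _, err := client.Evaluations().Allocations(evalID, nil)
	if err != nil {
		panic(err)
	}
	fmt.Println("allocations placed:", len(allocs))
}
```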
As part of ongoing work to make the scheduler more legible and more robustly
tested, we're implementing property testing of at least the reconciler. This
changeset provides some infrastructure we'll need for generating the test cases
using `pgregory.net/rapid`, without building out any of the property assertions
yet (that'll be in upcoming PRs over the next couple weeks).
The alloc reconciler generator produces a job, a previous version of the job, a
set of tainted nodes, and a set of existing allocations. The node reconciler
generator produces a job, a set of nodes, and allocations on those
nodes. Reconnecting allocs are not yet well-covered by these generators, and
with ~40 dimensions covered so far we may need to pull those out to their own
tests in order to get good coverage.
Note the scenarios only randomize fields of interest; fields like the job name
that don't impact the reconciler would use up available shrink cycles on failed
tests without actually reducing the scope of the scenario.
Ref: https://hashicorp.atlassian.net/browse/NMD-814
Ref: https://github.com/flyingmutant/rapid
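As a toy illustration of what a rapid-based scenario generator looks like (the real generators randomize ~40 dimensions of jobs, nodes, and allocations; the names here are made up):
```go
package scheduler_test // toy illustration; the real generators cover far more dimensions

import (
	"testing"

	"pgregory.net/rapid"
)

// scenario randomizes only fields that influence the reconciler; incidental
// fields like job names would waste shrink cycles on failing tests.
type scenario struct {
	groupCount   int
	canaries     int
	taintedNodes int
}

func genScenario() *rapid.Generator[scenario] {
	return rapid.Custom(func(t *rapid.T) scenario {
		return scenario{
			groupCount:   rapid.IntRange(1, 10).Draw(t, "group_count"),
			canaries:     rapid.IntRange(0, 3).Draw(t, "canaries"),
			taintedNodes: rapid.IntRange(0, 5).Draw(t, "tainted_nodes"),
		}
	})
}

func TestReconciler_Properties(t *testing.T) {
	rapid.Check(t, func(t *rapid.T) {
		s := genScenario().Draw(t, "scenario")
		// Property assertions over reconciler results land in follow-up PRs;
		// this changeset only builds the generator infrastructure.
		_ = s
	})
}
```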
Restoring scaling policies during the start of a stopped job did not account for
jobs that didn't have any scaling policies, and led to a panic when users tried
to restart such jobs.
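The guard amounts to something like this (a hypothetical sketch of the fix, not the exact code):
```go
package nomad // hypothetical sketch; the real fix guards the restore path

import "github.com/hashicorp/nomad/nomad/structs"

// scalingPoliciesFor collects the scaling policies of a job's task groups.
// Jobs without any policies yield an empty slice, so the restore path has
// nothing to dereference and no longer panics on restart.
func scalingPoliciesFor(job *structs.Job) []*structs.ScalingPolicy {
	var policies []*structs.ScalingPolicy
	for _, tg := range job.TaskGroups {
		if tg.Scaling != nil {
			policies = append(policies, tg.Scaling)
		}
	}
	return policies
}
```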
When a test starts an agent with the client enabled, we can wait
within the setup method until the client reaches the ready state.
This mimics what we already do with leadership and the root
keyring and should reduce flaky tests that assume the client is
ready as soon as the setup function returns, which is not
guaranteed.
The change exposed a couple of TLS reload tests which were not
using the test agent correctly. They were setting up a client even
though it would never be able to join the cluster due to TLS
configuration issues. These have been fixed.
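The wait is along these lines (a sketch using the HTTP API client and testutil helpers; the helper name and placement are hypothetical):
```go
package agenttest // illustrative sketch; the helper name and placement are assumptions

import (
	"fmt"
	"testing"

	"github.com/hashicorp/nomad/api"
	"github.com/hashicorp/nomad/testutil"
)

// waitForClientReady blocks until the test agent's node registers and reports
// a ready status, mirroring how setup already waits for leadership and the
// root keyring.
func waitForClientReady(t *testing.T, client *api.Client) {
	t.Helper()
	testutil.WaitForResult(func() (bool, error) {
		nodes, _, err := client.Nodes().List(nil)
		if err != nil {
			return false, err
		}
		if len(nodes) == 0 || nodes[0].Status != "ready" {
			return false, fmt.Errorf("client node is not ready yet")
		}
		return true, nil
	}, func(err error) {
		t.Fatalf("client never became ready: %v", err)
	})
}
```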
Regardless of the region identifier passed, the CLI always added
"<role>.global.nomad" to the certificate DNS names. This is not
what we expect, so it has been removed.
While here, the long-deprecated cluster-region flag has been
removed. This removal only affects CLI functionality, so it is
safe to do.
The Nomad server uses an authenticator backend for RPC handling
which includes TLS verification. This verification setting is
configured based on the server's TLS configuration object and is
built when a new server is constructed.
The bug occurs when a server's TLS configuration is reloaded,
which can change the desired TLS verification handling. In this
case the authenticator is not updated, meaning the RPC mTLS
verification is not modified even when the configuration indicates
it should be.
This change adds a new function on the authenticator that allows
updating its TLS verification rule. This new function is called
when a server's TLS configuration is reloaded.
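The shape of the change, as a sketch (the real authenticator carries far more state than this and the method name is an assumption):
```go
package auth // sketch only; the real authenticator carries much more state

import "sync"

// Authenticator holds the mTLS verification setting behind a lock so it can
// be swapped at runtime instead of being fixed at server construction.
type Authenticator struct {
	mu        sync.RWMutex
	verifyTLS bool
}

// SetVerifyTLS is the new hook the server calls after a TLS reload so RPC
// verification follows the freshly loaded configuration.
func (a *Authenticator) SetVerifyTLS(verify bool) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.verifyTLS = verify
}

func (a *Authenticator) verificationEnabled() bool {
	a.mu.RLock()
	defer a.mu.RUnlock()
	return a.verifyTLS
}
```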
In hashicorp/nomad-enterprise#2592 we introduced a
divergence in how Nomad CE and ENT build their binaries. Nomad CE used a more
sophisticated approach, setting uid, gid, and home environment variables in the
docker run command. Despite my (and others') best efforts, we were not able
to do the same in the ENT repo, which relies on special git settings that allow
it to pull dependencies from private repositories; it kept a different docker
run command that simply inherited the GHA runner user and copied the resulting
tarball instead of moving it. #26090 then attempted to remedy #25910, which
resulted from the docker run command ignoring ${{ env.GO_TAGS }} when run with
a custom --env, but the resulting backport broke ENT builds.
This PR restores the ENT behavior of building Nomad with the GHA runner user,
thus inheriting the runner's environment on ENT.
For reasons of backwards compatibility, Nomad uses older branches of
HCL1 (`v1.0.1-nomad`) and HCL2 (`v2.20.2-nomad-1`) and backports a limited set
of changes to those branches.
But the Vault API also has its own HCL1 branch, currently tagged as
`v1.0.1-vault-7`. Normally this isn't a problem because Nomad pins to our own
branch and we don't call any of the Vault API package's HCL code anyway. But in
Vault's branch some functions were changed that break our build unless we
backport them.
We've backported enough of Vault's changes to make our HCL1 branch build, and
now have tags on the HCL repo so that we can pin to specific tags instead of
random commits.
Fixes: https://hashicorp.atlassian.net/browse/NMD-850
Fixes: https://github.com/hashicorp/nomad/pull/26006
Ref: https://github.com/hashicorp/hcl/pull/760
This changeset separates reconciler fields into their own sub-struct to make
testing easier and the code more explicit about what fields relate to which
state.
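In outline, the refactor looks something like this (field and type names are illustrative, not the actual ones in the scheduler package):
```go
package scheduler // outline only; actual field and type names differ

// reconcilerState groups the fields the reconciler mutates while computing
// results, so tests can assert on them as a unit and readers can tell output
// state apart from input state.
type reconcilerState struct {
	place             []allocPlaceResult
	stop              []allocStopResult
	inplaceUpdate     []*allocUpdate
	destructiveUpdate []*allocUpdate
}

type allocReconciler struct {
	// inputs (job, existing allocs, tainted nodes, ...) elided

	// state is the sub-struct split out of the reconciler's previously flat
	// field list.
	state reconcilerState
}

// stand-in types so the sketch is self-contained
type (
	allocPlaceResult struct{}
	allocStopResult  struct{}
	allocUpdate      struct{}
)
```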