* func: Update the scaling policies when deregistering a job
* func: Add tests for updating the policy
* docs: add changelog
* func: set back the old order
* style: rearrange for clarity and to reuse the watchset
* func: set the policies to the last submitted when starting a job
* func: expand tests of the start job command to include job submission
* func: Expand the tests to verify the correct state of the scaling policy after job start
* Update command/job_start.go
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* Update nomad/fsm_test.go
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* func: add warning when there is no previous job submission
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
TestSingleAffinities never expected a node with affinity score set to 0 in
the set of returned nodes. However, since #25800, this can happen. What the
test should be checking for instead is that the node with the highest normalized
score has the right affinity.
When a disconnected alloc reconnects, the follow-up evaluation is left pending
and the followup eval ID field isn't cleared. If the allocation later fails, the
followup eval ID prevents the server from creating a new eval for that event.
Update the state store so that updates from the client clear the followup eval
ID if the allocation is reconnecting, and mark the eval as canceled. Update the
FSM to remove those evals from the eval broker's delay heap.
Fixes: https://github.com/hashicorp/nomad/issues/12809
Fixes: https://hashicorp.atlassian.net/browse/NMD-302
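A condensed, hypothetical sketch of the update path described above (the types and function names are illustrative, not the actual state store code, and it assumes the client update carries enough information to tell that the allocation has reconnected):

```go
package main

import "fmt"

type alloc struct {
	FollowupEvalID string
	Reconnected    bool
}

type eval struct{ Status string }

// applyClientUpdate mirrors the described behavior: a reconnected allocation
// has its pending follow-up eval cancelled and the link cleared, so a later
// failure can create a fresh evaluation.
func applyClientUpdate(a *alloc, evals map[string]*eval) {
	if !a.Reconnected || a.FollowupEvalID == "" {
		return
	}
	if e, ok := evals[a.FollowupEvalID]; ok {
		e.Status = "canceled" // the FSM also removes it from the eval broker's delay heap
	}
	a.FollowupEvalID = ""
}

func main() {
	evals := map[string]*eval{"e1": {Status: "pending"}}
	a := &alloc{FollowupEvalID: "e1", Reconnected: true}
	applyClientUpdate(a, evals)
	fmt.Println(a.FollowupEvalID, evals["e1"].Status) // "" canceled
}
```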
The `disconnect.stop_on_client_after` feature is implemented as a loop on the
client that's intended to wait on the shortest timeout of all the allocations on
the node and then check whether the interval since the last heartbeat has been
longer than the timeout. It uses a buffered channel of allocations written and
read from the same goroutine to push "stops" from the timeout expiring to the
next pass through the loop. Unfortunately if there are multiple allocations that
need to be stopped in the same timeout event, or even if a previous event has
not yet been dequeued, then sending on the channel will block and the entire
goroutine deadlocks itself.
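A minimal reproduction of that failure mode, using made-up names rather than the actual client code: the buffered channel is written and read only from this goroutine, so the second send in a single pass blocks forever.

```go
package main

func main() {
	stopCh := make(chan string, 1) // pretend each value is an alloc ID to stop

	for {
		// the timeout expires and two allocations need to be stopped in the
		// same pass: the second send blocks because the only reader is this
		// same goroutine, which never reaches the receive below
		stopCh <- "alloc-1"
		stopCh <- "alloc-2" // deadlock

		select {
		case id := <-stopCh:
			_ = id // stop the alloc on the next pass through the loop
		}
	}
}
```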
While fixing this, I also discovered that the `stop_on_client_after` and
heartbeat loops can synchronize in a pathological way that extends the
`stop_on_client_after` window. If a heartbeat fails close to the beginning of
the shortest `stop_on_client_after` window, the loop will end up waiting until
almost 2x the intended wait period.
While fixing both of those issues, I discovered that the existing tests had a
bug such that we were asserting that an allocrunner was being destroyed when it
had already exited.
This commit includes the following:
* Rework the watch loop so that we handle the stops in the same case as the
timer expiration, rather than using a channel in the method scope.
* Remove the alloc intervals map field from the struct and keep it in the
method scope, in order to discourage writing racy tests that read its value.
* Reset the timer whenever we receive a heartbeat, which forces the two
intervals to synchronize correctly.
* Minor refactoring of the disconnect timeout lookup to improve brevity.
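As a rough sketch of what the reworked loop looks like under those changes (illustrative names and structure, not the actual Nomad implementation): stops happen inline in the timer case, and a heartbeat resets the timer so the two intervals can't drift apart.

```go
package main

import (
	"context"
	"time"
)

// illustrative types only; not the client's actual structures
type allocRunner interface{ Shutdown() }

type watcher struct {
	heartbeatCh         chan struct{}
	expiredAllocRunners func() []allocRunner
}

func (w *watcher) watch(ctx context.Context, shortestTimeout time.Duration) {
	timer := time.NewTimer(shortestTimeout)
	defer timer.Stop()

	for {
		select {
		case <-ctx.Done():
			return

		case <-w.heartbeatCh:
			// a successful heartbeat restarts the window, keeping the
			// heartbeat and stop_on_client_after intervals in sync
			timer.Reset(shortestTimeout)

		case <-timer.C:
			// the window expired without a heartbeat: stop every allocation
			// whose timeout has elapsed, inline, rather than pushing the work
			// onto a channel read by this same goroutine
			for _, ar := range w.expiredAllocRunners() {
				ar.Shutdown()
			}
			timer.Reset(shortestTimeout)
		}
	}
}

func main() {
	w := &watcher{
		heartbeatCh:         make(chan struct{}),
		expiredAllocRunners: func() []allocRunner { return nil },
	}
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	w.watch(ctx, 10*time.Millisecond)
}
```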
Fixes: https://github.com/hashicorp/nomad/issues/24679
Ref: https://hashicorp.atlassian.net/browse/NMD-407
During #25547 and #25588 work, incorrect response codes from
/v1/acl/token/self were changed, but we did not make a note about this in the
upgrade guide.
When a node is garbage collected, any dynamic host volumes on the node are
orphaned in the state store. We generally don't want to automatically collect
these volumes and risk data loss, and have provided a CLI flag to `-force`
remove them in #25902. But for clusters running on ephemeral cloud
instances (ex. AWS EC2 in an autoscaling group), manually deleting host volumes may add
excessive friction. Add a configuration knob to the client configuration to
remove host volumes from the state store on node GC.
Ref: https://github.com/hashicorp/nomad/pull/25902
Ref: https://github.com/hashicorp/nomad/issues/25762
Ref: https://hashicorp.atlassian.net/browse/NMD-705
* ui: Handle new token self response object when ACLs are disabled.
The ACL self lookup now returns a spoof token when ACLs are
disabled, rather than an error. The UI needs to be updated to
handle this change so that permissions checks which incorrectly
grey out buttons such as client drain are not performed.
* changelog: add entry for #25881
* Set MaxAllocations in client config
Add NodeAllocationTracker struct to Node struct
Evaluate MaxAllocations in AllocsFit function
Set up cli config parsing
Integrate maxAllocs into AllocatedResources view
Co-authored-by: Tim Gross <tgross@hashicorp.com>
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
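A rough sketch of the kind of check described above for MaxAllocations, with hypothetical names and a made-up signature rather than the real AllocsFit plumbing; zero is assumed to mean "no limit":

```go
package main

import "fmt"

// allocCountFits is illustrative only: when the node opts in to a maximum
// allocation count, proposed allocations beyond that count simply don't fit.
func allocCountFits(maxAllocs, proposed int) (bool, string) {
	if maxAllocs > 0 && proposed > maxAllocs { // 0 means no limit in this sketch
		return false, fmt.Sprintf("max allocations exceeded (%d > %d)", proposed, maxAllocs)
	}
	return true, ""
}

func main() {
	fmt.Println(allocCountFits(0, 500)) // true: no limit configured
	fmt.Println(allocCountFits(50, 51)) // false: max allocations exceeded (51 > 50)
}
```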
We've been gradually migrating from `testify` to `shoenig/test` on a
test-by-test basis. While working on a large refactoring in the state store, I
found this to create a lot of diffs incidental to the refactoring.
In this changeset, I've used a prototype collection of semgrep fix rules to
autofix most of the uses of testify in the `nomad/state` package. Then I went in
manually and fixed any resulting problems, as well as a few minor test bugs that
`shoenig/test` catches and `testify` does not because of its API. I've also
added a semgrep rule for marking a package as "testify clean", so that we don't
accidentally add it back to any package we manage to remove it from going
forward.
While I'm here, I've removed most of the uses of `reflect.DeepEqual` in the
tests as well as cleaned up some older idioms that Go has nicer syntax for now.
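For readers unfamiliar with the two libraries, this is the flavor of mechanical rewrite the fix rules perform (an illustrative example, not one of the actual semgrep rules); shoenig/test's generic API also rejects mismatched argument types at compile time, which testify's interface{}-based API cannot.

```go
package state

import (
	"testing"

	"github.com/shoenig/test/must"
	"github.com/stretchr/testify/require"
)

func TestExample(t *testing.T) {
	got, want := 42, 42

	require.Equal(t, want, got) // before: testify
	must.Eq(t, want, got)       // after: shoenig/test
}
```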
Workflow identities currently support ACL policies being applied
to a job ID within a namespace. With this update an ACL policy
can be applied to a namespace. This results in the ACL policy
being applied to all jobs within the namespace.
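A hypothetical sketch of what this resolution looks like (illustrative types, not the actual ACL code): a policy scoped only to a namespace now matches every job in that namespace, while job-scoped policies behave as before.

```go
package main

import "fmt"

// illustrative only: a policy's scope, where an empty JobID means the policy
// applies to every job in the namespace
type policyScope struct {
	Name      string
	Namespace string
	JobID     string
}

func policiesForWorkload(all []policyScope, ns, jobID string) []string {
	var names []string
	for _, p := range all {
		if p.Namespace != ns {
			continue
		}
		if p.JobID == "" || p.JobID == jobID {
			names = append(names, p.Name)
		}
	}
	return names
}

func main() {
	policies := []policyScope{
		{Name: "job-scoped", Namespace: "prod", JobID: "web"},
		{Name: "namespace-scoped", Namespace: "prod"},
	}
	fmt.Println(policiesForWorkload(policies, "prod", "web"))   // [job-scoped namespace-scoped]
	fmt.Println(policiesForWorkload(policies, "prod", "batch")) // [namespace-scoped]
}
```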
We have several semgrep rules forbidding imports of packages we don't
want. While testing out a new rule I discovered that the rule we have is
completely ineffective. Update the rule to detect imports using the Go language
plugin, including regex matching on some packages where it's forbidden to import
the root but fine to import a subpackage or different version.
The go-set import rule is an example of one where our `go-set/v3` imports fail
the re-written check unless we use the regex syntax. If you replace the pattern
rule with `import "=~/github.com\/hashicorp\/go-set/v3$/"`, it would fail.
The DNS configuration for our E2E cluster uses dnsmasq to pass all DNS through
Consul. But there's a circular reference in systemd configurations that
sometimes causes the Docker service to fail; this causes test flakes during
upgrade testing because we count the number of nodes and expect `system` jobs
using Docker to run on all nodes.
We no longer have any tests that require Consul DNS, so remove the complication
of dnsmasq to break the reference cycle. Also, while I was looking at this I
noticed we still had setup that would configure the ECS remote task driver
plugin, which is archived. Remove this as well.
Ref: https://hashicorp.atlassian.net/browse/NMD-162
If there are no affinities on a job, we don't want to count an affinity score of
zero in the number of scores we divide the normalized score by. This is how we
handle other scoring components like node reschedule penalties on nodes that
weren't running the previous allocation.
But we also exclude counting the affinity in the case where we have an affinity but
the value is zero. In pathological cases, this can result in a node with a low
affinity being picked over a node with no affinity, because the denominator is 1
larger. Include zero-value affinities in the count of scores if the job has
affinities but the value just happens to be zero.
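A worked example with made-up numbers (not taken from the scheduler) showing how the larger denominator flips the ordering: assume both nodes share another negative scoring component of -0.5, node X has a slightly negative affinity score of -0.1, and node Y's affinity score is 0.

```go
package main

import "fmt"

func main() {
	normalize := func(sum float64, n int) float64 { return sum / float64(n) }

	// Old behavior: node Y's zero-value affinity is dropped from the count.
	nodeX := normalize(-0.5+(-0.1), 2) // -0.30, low affinity counted
	nodeY := normalize(-0.5, 1)        // -0.50, zero-value affinity not counted
	fmt.Println(nodeX > nodeY)         // true: the node with the lower affinity wins

	// Fixed behavior: the zero-value affinity still counts toward n.
	nodeYFixed := normalize(-0.5+0, 2) // -0.25
	fmt.Println(nodeYFixed > nodeX)    // true: node Y correctly wins
}
```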
Fixes: https://github.com/hashicorp/nomad/issues/25621
This introduces a new HTTP endpoint (and an associated CLI command) for querying
ACL policies associated with a workload identity. It allows users who want
to learn about the ACL capabilities available from within WI tasks to know
what sort of policies are enabled.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
* action page
* change all page_title fields
* update title
* constraint through migrate pages
* update page title and heading to use sentence case
* fix front matter description
* Apply suggestions from code review
Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>
---------
Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>
Collecting metrics from processes is expensive, especially on platforms like
Windows. The executor code has a 5s cache of stats to ensure that we don't
thrash syscalls on nodes running many allocations. But the timestamp used to
calculate TTL of this cache was never being set, so we were always treating it
as expired. This causes excess CPU utilization on client nodes.
Ensure that when we fill the cache, we set the timestamp. In testing on Windows,
this reduces executor CPU overhead by roughly 75%.
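A minimal sketch of the caching pattern at issue, with hypothetical names rather than the real executor types; if the `latestFetch` assignment is missing, `time.Since` of the zero time always exceeds the TTL and every caller pays the collection cost again.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// procUsages stands in for the per-process stats snapshot; the names here are
// hypothetical and not the actual executor code.
type procUsages map[int]float64

type statsCache struct {
	mu          sync.Mutex
	latest      procUsages
	latestFetch time.Time // the timestamp that was never being set
	ttl         time.Duration
}

func (c *statsCache) stats(collect func() procUsages) procUsages {
	c.mu.Lock()
	defer c.mu.Unlock()

	if time.Since(c.latestFetch) < c.ttl {
		return c.latest // fresh enough: reuse the cached snapshot
	}

	c.latest = collect()
	c.latestFetch = time.Now() // without this line the cache always looks expired
	return c.latest
}

func main() {
	calls := 0
	collect := func() procUsages { calls++; return procUsages{1: 0.5} }

	c := &statsCache{ttl: 5 * time.Second}
	c.stats(collect)
	c.stats(collect)
	fmt.Println(calls) // 1: the second call hits the cache
}
```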
This changeset includes two other related items:
* The `telemetry.publish_allocation_metrics` field correctly prevents a node
from publishing metrics, but the stats hook on the taskrunner still collects
the metrics, which can be expensive. Thread the configuration value into the
stats hook so that we don't collect if `telemetry.publish_allocation_metrics =
false`.
* The `linuxProcStats` type in the executor's `procstats` package is misnamed as
a result of a couple rounds of refactoring. It's used by all task executors,
not just Linux. Rename this and move a comment about how Windows processes are
listed so that the comment is closer to where the logic is implemented.
Fixes: https://github.com/hashicorp/nomad/issues/23323
Fixes: https://hashicorp.atlassian.net/browse/NMD-455
* Only error on constraints if no allocs are running
When running `nomad job run <JOB>` multiple times with constraints defined,
there should be no error as a result of filtering out nodes that do not (or
have never) satisfied the constraints.
When running a system job with a constraint, any run after the initial startup
returns exit(2) and a warning about unplaced allocations due to constraints,
an error that is not encountered on the initial run even though the constraint
stays the same.
This is because the node that satisfies the constraint is already running
the allocation, so that placement is ignored. Another placement is
attempted, but the only node(s) left are the ones that do not satisfy
the constraint. Nomad views this case (none of the attempted placements
could be placed successfully) as an error, and reports it as such. In
reality, no allocations should be placed or updated in this case, but it
should not be treated as an error.
This change uses the `ignored` placements from diffSystemAlloc to determine
whether the case encountered is an error (no ignored placements means that
nothing is already running, which is an error) or not (an ignored placement
means that the task is already running somewhere on a node). It does this at
the point where `failedTGAlloc` is populated, so placement functionality isn't
changed, just the field that populates the error.
The existing functionality that (correctly) notifies a user when a submitted
job cannot run on any node, because the constraints filter out all available
nodes, is preserved and should still behave as expected.
* Add changelog entry
* Handle in-place updates for constrained system jobs
* Update .changelog/25850.txt
Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>
* Remove conditionals
---------
Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>
During the upgrade test we can trigger a re-render of the Vault secret due to
client restart before the allocrunner has marked the task as running, which
triggers the change mode on the template and restarts the task. This results in
a race where the alloc is still "pending" when we go to check it. We never
change the value of this secret in upgrade testing, so paper over this race
condition by setting a "noop" change mode.
We're required to pin Docker images for Actions to a specific SHA now and this
is tripping scans in the Enterprise repo. Update the actionlint image.
Ref: https://go.hashi.co/memo/sec-032
Nomad Enterprise users operating in air-gapped or otherwise secured environments
don't want to send license reporting metrics directly from their
servers. Implement manual/offline reporting by periodically recording usage
metrics snapshots in the state store, and providing an API and CLI by which
cluster administrators can download the snapshot for review and out-of-band
transmission to HashiCorp.
This is the CE portion of the work required for implementation in the Enterprise
product. Nomad CE does not perform utilization reporting.
Ref: https://github.com/hashicorp/nomad-enterprise/pull/2673
Ref: https://hashicorp.atlassian.net/browse/NMD-68
Ref: https://go.hashi.co/rfc/nmd-210
This changeset includes several adjustments to the upgrade testing scripts to
reduce flakes and make problems more understandable:
* When a node is drained prior to the 3rd client upgrade, it's entirely
possible the 3rd client to be upgraded is the drained node. This results in
miscounting the expected number of allocations because many of them will be
"complete" (service/batch) or "pending" (system). Leave the system jobs running
during drains and only count the running allocations at that point as the
expected set. Move the inline script that gets this count into a script file for
legibility.
* When the last initial workload is deployed, it's possible for it to be
briefly still in "pending" when we move to the next step. Poll for a short
window for the expected count of jobs.
* Make sure that any scripts that are being run right after a server or client
is coming back up can handle temporary unavailability gracefully.
* Change the debugging output of several scripts to avoid having the debug
output run into the error message (Ex. "some allocs are not running" looked like
the first allocation running was the missing allocation).
* Add some notes to the README about running locally with `-dev` builds and
tagging a cluster with your own name.
Ref: https://hashicorp.atlassian.net/browse/NMD-162