Commit Graph

971 Commits

Author SHA1 Message Date
Piotr Kazmierczak
f9b95ae896 scheduler: account for infeasible nodes when reconciling system jobs (#26868)
The node reconciler never took node feasibility into account. When nodes were
excluded from allocation placement, for example because constraints were not
met, the desired total and desired canary numbers were never updated in the
reconciler to account for that, so deployments would never become successful.
2025-10-02 16:17:46 +02:00
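
A minimal sketch of the accounting described in the commit above, using hypothetical names rather than Nomad's actual reconciler fields: nodes ruled out by constraints are subtracted from the deployment's desired totals so the deployment can actually converge.

```go
// feasibleDesiredTotal is illustrative only: the desired total for a system
// deployment should exclude nodes that can never take a placement.
func feasibleDesiredTotal(eligibleNodes, infeasibleNodes int) int {
	total := eligibleNodes - infeasibleNodes
	if total < 0 {
		total = 0
	}
	return total
}
```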
Piotr Kazmierczak
eaa0fe0e27 scheduler: always set the right deployment status for system jobs that require promotion (#26851)
In cases where system jobs had the same number of canary allocations
deployed as there were eligible nodes, the scheduler would incorrectly
mark the deployment as complete, as if auto promotion were set. This edge
case uncovered a bug in the setDeploymentStatusAndUpdates method, and
since we round up canary nodes, it may not be such an edge case
after all.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-09-30 09:18:59 +02:00
Piotr Kazmierczak
46dfd9d992 scheduler: do not create deployments for system job reschedules (#26789)
System jobs that get rescheduled should not get new deployments.
2025-09-18 14:54:54 +02:00
Michael Smithhisler
f58e915bd3 scheduler: allow device count to use different vendors/models (#26649)
A small optimization in the scheduler required users to specify specific device
models if the required count was higher than the count available for any
individual vendor/model on the node. This change removes that optimization to
allow for more intuitive device scheduling when devices of different vendors and
models exist on a node.
2025-09-10 07:12:38 -04:00
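
A rough sketch of the scheduling idea described above, with simplified illustrative types rather than Nomad's actual device structures: the count a task group asks for can now be satisfied by summing matching device instances across every vendor/model pair on the node.

```go
// deviceOffer is an illustrative stand-in for a node's device inventory entry.
type deviceOffer struct {
	Vendor    string
	Model     string
	Name      string // e.g. "gpu"
	Instances int
}

// availableCount sums instances of a named device across all vendor/model
// pairs, instead of requiring a single vendor/model to cover the whole count.
func availableCount(offers []deviceOffer, wantName string) int {
	total := 0
	for _, o := range offers {
		if o.Name == wantName {
			total += o.Instances
		}
	}
	return total
}
```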
Michael Smithhisler
37da98be1c Merge pull request #26681 from hashicorp/NMD-760-nomad-secrets-block
Secrets Block: merge feature branch to main
2025-09-09 10:46:18 -04:00
Michael Smithhisler
ac32b0864d scheduler: adds implicit constraint for secrets plugin node attributes (#26303) 2025-09-05 16:08:23 -04:00
Tim Gross
ce614e6b7a scheduler: upgrade block testing for system deployments (#26579)
This changeset adds system scheduler tests of various permutations of the `update`
block. It also fixes a number of bugs discovered in the process.

* Don't create a deployment for an in-flight rollout. If a system job is in the
  middle of a rollout prior to upgrading to a version of Nomad with system
  deployments, we'll end up creating a system deployment which might never
  complete because previously placed allocs will not be tracked. Check to see if
  we have existing allocs that should belong to the new deployment and prevent a
  deployment from being created in that case.
* Ensure we call `Copy` on `Deployment` to avoid state store corruption.
* Don't limit canary counts by `max_parallel`.
* Never create deployments for `sysbatch` jobs.

Ref: https://hashicorp.atlassian.net/browse/NMD-761
2025-09-05 10:22:42 -04:00
Piotr Kazmierczak
a083495240 system scheduler: correction to Test_computeCanaryNodes (#26707) 2025-09-05 16:20:34 +02:00
Piotr Kazmierczak
276ab8a4c6 system scheduler: keep track of previously used canary nodes (#26697)
In the system scheduler, we need to keep track of which nodes were previously used
as "canary nodes" and not pick them at random, in case of previously failed
canaries or changes to the number of canaries in the jobspec.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-09-05 15:32:08 +02:00
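
A minimal sketch of the selection logic described above, assuming a hypothetical set of previously used canary node IDs (the real tracking lives in the system scheduler's deployment state): already-used canary nodes are reused before any new nodes are drawn from the eligible set.

```go
// pickCanaryNodes is illustrative only: prefer nodes that already ran
// canaries, then top up from the remaining eligible nodes.
func pickCanaryNodes(eligible []string, previouslyUsed map[string]bool, want int) []string {
	picked := make([]string, 0, want)
	for _, nodeID := range eligible {
		if previouslyUsed[nodeID] && len(picked) < want {
			picked = append(picked, nodeID)
		}
	}
	for _, nodeID := range eligible {
		if !previouslyUsed[nodeID] && len(picked) < want {
			picked = append(picked, nodeID)
		}
	}
	return picked
}
```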
Michael Smithhisler
65c7f34f2d secrets: Add secrets block to job spec (#26076) 2025-09-04 15:58:03 -04:00
Piotr Kazmierczak
14e98a2420 scheduler: fix promotions of system job canaries (#26652)
This changeset adjusts the handling of allocation placement when we're
promoting a deployment, and it corrects the behavior of isDeploymentComplete,
which previously would never mark a promoted deployment as complete.
2025-09-03 16:09:36 +02:00
Piotr Kazmierczak
8b8e21dc0e scheduler: check if system job deploy is complete before other guards (#26651) 2025-08-28 17:29:13 +02:00
Piotr Kazmierczak
de342ee48b scheduler: correct dstate total/canary counts for system deployments (#26641) 2025-08-28 16:24:52 +02:00
Piotr Kazmierczak
ca96de15d0 scheduler: correct handling of MaxParallel and obsoleting Stagger in the system scheduler (#26631) 2025-08-27 09:38:35 +02:00
Tim Gross
5c444b8922 system scheduler: account for per task group max_parallel (#26635)
The system scheduler's `evictAndPlace` function does not account for per-task-group
`max_parallel`, which is needed to support system deployments. Push the rolling
upgrade strategy check into this function and return that the deployment was
limited if any one of the task groups is limited.
2025-08-27 09:38:18 +02:00
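
A hedged sketch of the behavior described above, not the actual `evictAndPlace` signature: each task group's planned evictions are capped at that group's `max_parallel`, and the deployment is reported as limited if any group hits its cap.

```go
// capEvictionsPerGroup is illustrative only: trim each group's planned
// evictions to its max_parallel and report whether any group was limited.
func capEvictionsPerGroup(planned map[string]int, maxParallel map[string]int) bool {
	limited := false
	for group, n := range planned {
		if mp, ok := maxParallel[group]; ok && mp > 0 && n > mp {
			planned[group] = mp
			limited = true
		}
	}
	return limited
}
```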
Piotr Kazmierczak
7c4faf9227 scheduler: monitor deployments correctly (#26605)
Corrects two minor bugs that prevented proper deployment monitoring for system
jobs: populating the new deployment field of the system scheduler object, and
correcting allocrunner health checks that were guarded not to run on system
jobs.
2025-08-25 15:29:13 +02:00
Piotr Kazmierczak
3d373c9a6a scheduler: support canary deployments for system jobs (#26499)
This changeset introduces canary deployments for system jobs.

Canaries work a little differently for system jobs than for service jobs. The
integer in the update block of a task group is interpreted as a percentage of
eligible nodes that this task group update should be deployed to, rounded up
to the nearest integer; for example, with 5 eligible nodes and a canary value of
50, we will deploy to 3 nodes.

In contrast to service jobs, system job canaries are not tracked, i.e., the
scheduler doesn't need to know which allocations are canaries and which are not,
since any node can only run one allocation of a given system job. Canary
deployments are marked for promotion, and if promoted, the scheduler simply
performs an update as usual, replacing allocations that belong to a previous job
version and leaving new ones intact.
2025-08-22 15:02:40 +02:00
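
A worked version of the rounding described in the commit above, as an assumed helper rather than Nomad's actual code: the canary value is a percentage of eligible nodes, rounded up to a whole node count.

```go
// canaryNodeCount is illustrative only: interpret canary as a percentage of
// eligible nodes, rounded up. For 5 eligible nodes and canary=50:
// (5*50 + 99) / 100 = 3 canary nodes, matching the example above.
func canaryNodeCount(eligibleNodes, canaryPercent int) int {
	if eligibleNodes <= 0 || canaryPercent <= 0 {
		return 0
	}
	return (eligibleNodes*canaryPercent + 99) / 100 // integer ceiling
}
```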
Piotr Kazmierczak
0e6e5ef8d1 scheduler: handle deployment completeness in the node reconciler (#26445)
This PR introduces marking deployments as complete if there are no remaining
placements to be made for a given task group.
2025-08-21 18:34:59 +02:00
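
A minimal sketch of the completeness check described above, under the assumption (not stated in the commit) that healthy allocations are also tracked per task group; the names are hypothetical.

```go
// groupDeploymentComplete is illustrative only: a task group's part of the
// deployment is complete once nothing is left to place and the placed allocs
// have become healthy.
func groupDeploymentComplete(remainingPlacements, desiredTotal, healthyAllocs int) bool {
	return remainingPlacements == 0 && healthyAllocs >= desiredTotal
}
```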
Piotr Kazmierczak
c33e30596c scheduler: support deployments in the NodeReconciler (#26318)
This is the initial implementation of deployments for the system and sysbatch
reconciler. It does not support updates or canaries at this point; it simply
provides the necessary plumbing for deployments.
2025-08-21 18:34:59 +02:00
Tim Gross
80ddb7392a scheduler: fix debug-level logging for node reconciler (#26583)
In #26169 we started emitting structured logs from the reconciler. But the node
reconciler results are `AllocTuple` structs and not counts, so the information
we put in the logs ends up being pointer addresses in hex. Fix this so that
we're recording the number of allocs in each bucket instead.

Fix another misleading log-line while we're here.

Ref: https://github.com/hashicorp/nomad/pull/26169
2025-08-19 15:17:17 -04:00
Piotr Kazmierczak
e86d815472 scheduler: avoid importing the Planner test harness in scheduler calls (#26544)
For a while now, we've had only two implementations of the Planner interface in
Nomad: one was the Worker, and the other was the scheduler test harness, which
was then used as an argument to the scheduler constructors in the FSM and the job
endpoint RPC. That's not great, and one of the recent refactors made it apparent
that we're importing testing code in places we really shouldn't. We finally got
called out for it, and this PR attempts to remedy the situation by splitting the
Harness into Plan (which contains the actual plan submission logic) and
separating it from testing code.
2025-08-18 19:35:34 +02:00
Tim Gross
d1186ae53e scheduler: don't suppress blocked evals on delay if previous expires (#26523)
In #8099 we fixed a bug where garbage collecting a job with
`disconnect.stop_on_client_after` would spawn recursive delayed evals. But when
applied to disconnected allocs with `replace=true`, the fix prevents us from
emitting a blocked eval if there's no room for the replacement.

Update the guard on creating blocked evals so that rather than checking for
`IsZero`, we check for being later than the `WaitUntil`. This separates the
guard from the logic guarding the creation of delayed evals, so that we can
potentially create both when needed.

Ref: https://github.com/hashicorp/nomad/pull/8099/files#r435198418
2025-08-15 10:53:52 -04:00
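
A hedged sketch of the guard change, with hypothetical names and simplified inputs: a blocked eval is suppressed only while `WaitUntil` still lies in the future, instead of whenever `WaitUntil` is set at all, so a delayed eval and a blocked eval can both be created when needed.

```go
import "time"

// shouldCreateBlockedEval is illustrative only and not Nomad's actual logic.
func shouldCreateBlockedEval(failedPlacements int, waitUntil, now time.Time) bool {
	stillDelayed := !waitUntil.IsZero() && waitUntil.After(now)
	return failedPlacements > 0 && !stillDelayed
}
```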
Aimee Ukasick
a30cb2f137 Update UI, code comment, and README links to docs, tutorials (#26429)
* Update UI, code comment, and README links to docs, tutorials

* fix typo in ephemeral disks learn more link url

* feedback on typo

Co-authored-by: Tim Gross <tgross@hashicorp.com>

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-08-06 09:40:23 -05:00
Tim Gross
4ce937884d scheduler: move result mutation into computeStop (#26351)
The `computeStop` method returns two values that only get used to mutate the
result and the untainted set. Move the mutation into the method to match the
work done in #26325.

Ref: https://github.com/hashicorp/nomad/pull/26325
Ref: https://hashicorp.atlassian.net/browse/NMD-819
2025-07-29 08:23:06 -04:00
Tim Gross
26554e544e scheduler: move result mutation into computeUpdates (#26336)
The `computeUpdates` method returns four different values, some of which are just
different shapes of the same data and only ever get used to be applied to the
result in the caller. Move the mutation of the result into `computeUpdates` to
match the work done in #26325. Clean up the return signature so that only the
slices we need downstream are returned, and fix the incorrect docstring.

Also fix a silent bug where the `inplace` set includes the original alloc and
not the updated version. This has no functional change because all existing
callers only ever look at the length of this slice, but it will prevent future
bugs if that ever changes.

Ref: https://github.com/hashicorp/nomad/pull/26325
Ref: https://hashicorp.atlassian.net/browse/NMD-819
2025-07-25 08:21:37 -04:00
James Rasell
5989d5862a ci: Update golangci-lint to v2 and fix highlighted issues. (#26334) 2025-07-25 10:44:08 +01:00
Tim Gross
2c4be7fc2e Reconciler mutation improvements (#26325)
Refactors of the `computeGroup` code in the reconciler to make understanding its
mutations more manageable. Some of this work makes mutation more consistent, but
more importantly, it's intended to make mutation readily _detectable_ while still
keeping the code readable. Includes:

* In the `computeCanaries` function, we mutate the dstate and the result and
  then the return values are used to further mutate the result in the
  caller. Move all this mutation into the function.

* In the `computeMigrations` function, we mutate the result and then the return
  values are used to further mutate the result in the caller. Move all this
  mutation into the function.

* In the `cancelUnneededCanaries` function, we mutate the result and then the
  return values are used to further mutate the result in the caller. Move all
  this mutation into the function, and annotate which `allocSet`s are mutated by
  taking a pointer to the set.

* The `createRescheduleLaterEvals` function currently mutates the results and
  returns updates to mutate the results in the caller. Move all this mutation
  into the function to help cleanup `computeGroup`.

* Extract `computeReconnecting` method from `computeGroup`. There's some tangled
  logic in `computeGroup` for determining changes to make for reconnecting
  allocations. Pull this out into its own function. Annotate mutability in the
  function by passing pointers to `allocSet` where needed, and mutate the result
  to update counts. Rename the old `computeReconnecting` method to
  `appendReconnectingUpdates` to mirror the naming of the similar logic for
  disconnects.

* Extract `computeDisconnecting` method from `computeGroup`. There's some
  tangled logic in `computeGroup` for determining changes to make for
  disconnected allocations. Pull this out into its own function. Annotate
  mutability in the function by passing pointers to `allocSet` where needed, and
  mutate the result to update counts.

* The `appendUnknownDisconnectingUpdates` method, which creates updates for
  disconnected allocations, mutates one of its `allocSet` arguments to change the
  allocations that the "reschedule now" set points to. Pull this update out into
  the caller.

* A handful of small docstring and helper function fixes


Ref: https://hashicorp.atlassian.net/browse/NMD-819
2025-07-24 08:33:49 -04:00
Tim Gross
e675491eb6 refactor uses of allocSet in reconciler (#26324)
The reconciler contains a large set of methods and functions that operate on
`allocSet` (a map of allocation IDs to their allocs). Update these so that they
are consistently methods that are documented to not consume the `allocSet`. This
sets the stage for further improvements around mutability in the reconciler.

This changeset also includes a few related refactors:
* Use the `allocSet` alias in every location it's relevant in the reconciler,
  for consistency and clarity.
* Move the filter functions and related helpers in the `allocs.go` file into the
  `filters.go` file.
* Update the method receiver on `allocSet` to match everywhere and generally
  improve the docstrings on the filter functions.

Ref: https://hashicorp.atlassian.net/browse/NMD-819
2025-07-23 08:57:41 -04:00
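
A simplified sketch of the shape described above, with an illustrative allocation type standing in for Nomad's real struct: `allocSet` maps allocation IDs to allocations, and filter helpers return new sets without consuming or mutating the receiver.

```go
// alloc is a stand-in for the real allocation struct.
type alloc struct {
	ID       string
	Terminal bool
}

// allocSet maps allocation IDs to allocations.
type allocSet map[string]*alloc

// filterByTerminal returns the terminal and live subsets; the receiver is
// left untouched, which is the convention the refactor documents.
func (s allocSet) filterByTerminal() (terminal, live allocSet) {
	terminal, live = allocSet{}, allocSet{}
	for id, a := range s {
		if a.Terminal {
			terminal[id] = a
		} else {
			live[id] = a
		}
	}
	return terminal, live
}
```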
Piotr Kazmierczak
973a554808 scheduler: remove unnecessary reconnecting and ignore allocset assignment (#26298)
These values aren't used anywhere, and the code is confusing as is.
2025-07-21 09:06:52 +02:00
Tim Gross
333dd94362 scheduler: exit early on count=0 and filter out server-terminal (#26292)
When a task group is removed from a jobspec, the reconciler stops all
allocations and immediately returns from `computeGroup`. We can do the same
when the group has been scaled to zero, but doing so runs into an inconsistency
in the way that server-terminal allocations are handled.

Prior to this change server-terminal allocations fall through `computeGroup`
without being marked as `ignore`, unless they are terminal canaries, in which
case they are marked `stop` (but this is a no-op). This inconsistency causes a
_tiny_ amount of extra `Plan.Submit`/Raft traffic, but more importantly makes it
more difficult to make test assertions for `stop` vs `ignore` vs
fallthrough. Remove this inconsistency by filtering out server-terminal
allocations early in `computeGroup`.

This brings the cluster reconciler's behavior closer to the node reconciler's
behavior, except that the node reconciler discards _all_ terminal allocations
because it doesn't support rescheduling.

This changeset required adjustments to two tests, but the tests themselves were
a bit of a mess:
* In https://github.com/hashicorp/nomad/pull/25726 we added a test of how
  canaries were treated when on draining nodes. But the test didn't correctly
  configure the job with an update block, leading to misleading test
  behavior. Fix the test to exercise the intended behavior and refactor for
  clarity.
* While working on reconciler behaviors around stopped allocations, I found it
  extremely hard to follow the intent of the disconnected client tests because
  many of the fields in the table-driven test are switches for more complex
  behavior or just tersely named. Attempt to make this a little more legible by
  moving some branches directly into fields, renaming some fields, and
  flattening out some branching.

Ref: https://hashicorp.atlassian.net/browse/NMD-819
2025-07-18 08:51:52 -04:00
Tim Gross
35f3f6ce41 scheduler: add disconnect and reschedule info to reconciler output (#26255)
The `DesiredUpdates` struct that we send to the Read Eval API doesn't include
information about disconnect/reconnect and rescheduling. Annotate the
`DesiredUpdates` with this data, and adjust the `eval status` command to display
only those fields that have non-zero values in order to make the output width
manageable.

Ref: https://hashicorp.atlassian.net/browse/NMD-815
2025-07-16 08:46:38 -04:00
Allison Larson
3ca518e89c Add node_pool to blockedEval metric (#26215)
Adds the node_pool to the blockedEval metrics that get emitted for
resource/cpu, along with the dc and node class.
2025-07-15 09:48:04 -07:00
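
A sketch of what the labeled emission might look like with hashicorp/go-metrics; the metric key and label names here are illustrative, not the exact ones the change uses.

```go
import "github.com/hashicorp/go-metrics"

// emitBlockedEvalMetric is illustrative only: a blocked-evals gauge labeled
// with datacenter, node class, and now node pool.
func emitBlockedEvalMetric(count float32, dc, nodeClass, nodePool string) {
	metrics.SetGaugeWithLabels(
		[]string{"nomad", "blocked_evals", "cpu"}, // hypothetical key
		count,
		[]metrics.Label{
			{Name: "datacenter", Value: dc},
			{Name: "node_class", Value: nodeClass},
			{Name: "node_pool", Value: nodePool},
		},
	)
}
```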
Piotr Kazmierczak
08b3db104d docs: update reconciler diagram to reflect recent refactors (#26260)
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-07-11 15:34:07 +02:00
Tim Gross
26302ab25d reconciler: share assertions in property tests (#26259)
Refactor the reconciler property tests to extract functions for safety property
assertions we'll share between different job types for the same reconciler.
2025-07-11 09:27:22 -04:00
Tim Gross
74f7a8f037 scheduler: basic node reconciler safety properties for system jobs (#26216)
Property test assertions for the core safety properties of the node reconciler,
for system jobs.

Ref: https://hashicorp.atlassian.net/browse/NMD-814
Ref: https://github.com/hashicorp/nomad/pull/26167
2025-07-09 14:44:05 -04:00
Tim Gross
94e03f894a scheduler: basic cluster reconciler safety properties for batch jobs (#26172)
Property test assertions for the core safety properties of the cluster
reconciler, for batch jobs. The changeset includes fixes for any bugs found
during work-in-progress, which will get pulled out to their own PRs.

Ref: https://hashicorp.atlassian.net/browse/NMD-814
Ref: https://github.com/hashicorp/nomad/pull/26167
2025-07-09 14:43:55 -04:00
Piotr Kazmierczak
e50db4d1b8 scheduler: property testing of cancelUnneededCanaries (#26204)
In the spirit of #26180

Internal ref: https://hashicorp.atlassian.net/browse/NMD-814
2025-07-09 13:46:13 -04:00
Tim Gross
7c6c1ed0d3 scheduler: reconciler should constrain placements to count (#26239)
While working on property testing in #26172 we discovered there are scenarios
where the reconciler will produce more than the expected number of
placements. Testing of those scenarios at the whole-scheduler level shows that
this gets handled correctly downstream of the reconciler, but this makes it
harder to reason about reconciler behavior. Cap the number of placements in the
reconciler.

Ref: https://github.com/hashicorp/nomad/pull/26172
2025-07-09 11:51:01 -04:00
Tim Gross
eb47d1ca11 scheduler: eliminate dead code in node reconciler (#26236)
While working on property testing in #26216, I discovered we had unreachable
code in the node reconciler. The `diffSystemAllocsForNode` function receives a
set of non-terminal allocations, but then has branches where it assumes the
allocations might be terminal. It's trivially provable that these allocs are
always live, as the system scheduler splits the set of known allocs into live
and terminal sets before passing them into the node reconciler.

Eliminate the unreachable code and improve the variable names to make the known
state of the allocs more clear in the reconciler code.

Ref: https://github.com/hashicorp/nomad/pull/26216
2025-07-09 11:31:04 -04:00
Piotr Kazmierczak
8bc6abcd2e scheduler: basic cluster reconciler safety properties for service jobs (#26167) 2025-07-09 17:30:37 +02:00
Tim Gross
c043d1c850 scheduler: property testing of reconcile reconnecting (#26180)
To help break down the larger property tests we're doing in #26167 and #26172
into more manageable chunks, pull out a property test for just the
`reconcileReconnecting` method. This method helpfully already defines its
important properties, so we can implement those as test assertions.

Ref: https://hashicorp.atlassian.net/browse/NMD-814
Ref: https://github.com/hashicorp/nomad/pull/26167
Ref: https://github.com/hashicorp/nomad/pull/26172
2025-07-07 09:40:49 -04:00
Tim Gross
5c909213ce scheduler: add reconciler annotations to completed evals (#26188)
The output of the reconciler stage of scheduling is only visible via debug-level
logs, typically accessible only to the cluster admin. We can give job authors
a better ability to understand what's happening to their jobs if we expose this
information to them in the `eval status` command.

Add the reconciler's desired updates to the evaluation struct so it can be
exposed in the API. This increases the size of evals by roughly 15% in the state
store, or a bit more when there are preemptions (but we expect this will be a
small minority of evals).

Ref: https://hashicorp.atlassian.net/browse/NMD-818
Fixes: https://github.com/hashicorp/nomad/issues/15564
2025-07-07 09:40:21 -04:00
Tim Gross
9a29df2292 scheduler: emit structured logs from reconciliation (#26169)
Both the cluster reconciler and node reconciler emit a debug-level log line with
their results, but these are unstructured multi-line logs that are annoying for
operators to parse. Change these to emit structured key-value pairs like we do
everywhere else.

Ref: https://hashicorp.atlassian.net/browse/NMD-818
Ref: https://go.hashi.co/rfc/nmd-212
2025-07-01 10:37:44 -04:00
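
A sketch of the structured form described above, using hashicorp/go-hclog's key-value style; the message and field names are illustrative, not the exact keys the reconcilers emit.

```go
import "github.com/hashicorp/go-hclog"

// logReconcileResults is illustrative only: one structured line with counts
// instead of a multi-line free-form dump.
func logReconcileResults(logger hclog.Logger, place, stop, ignore int) {
	logger.Debug("reconciliation complete",
		"place", place,
		"stop", stop,
		"ignore", ignore,
	)
}
```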
Piotr Kazmierczak
36e7148247 scheduler: doc.go files for new packages (#26177) 2025-07-01 16:28:33 +02:00
Tim Gross
ec8250ed30 property test generation for reconciler (#26142)
As part of ongoing work to make the scheduler more legible and more robustly
tested, we're implementing property testing of at least the reconciler. This
changeset provides some infrastructure we'll need for generating the test cases
using `pgregory.net/rapid`, without building out any of the property assertions
yet (that'll be in upcoming PRs over the next couple weeks).

The alloc reconciler generator produces a job, a previous version of the job, a
set of tainted nodes, and a set of existing allocations. The node reconciler
generator produces a job, a set of nodes, and allocations on those
nodes. Reconnecting allocs are not yet well-covered by these generators, and
with ~40 dimensions covered so far we may need to pull those out to their own
tests in order to get good coverage.

Note the scenarios only randomize fields of interest; fields like the job name
that don't impact the reconciler would use up available shrink cycles on failed
tests without actually reducing the scope of the scenario.

Ref: https://hashicorp.atlassian.net/browse/NMD-814
Ref: https://github.com/flyingmutant/rapid
2025-06-26 11:09:53 -04:00
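
A minimal sketch of what generator-driven property testing with `pgregory.net/rapid` looks like; the scenario fields and the reconciler stand-in below are invented for illustration and are not the generators this changeset adds.

```go
import (
	"testing"

	"pgregory.net/rapid"
)

// placementsNeeded stands in for the reconciler logic under test.
func placementsNeeded(count, existing int) int {
	if n := count - existing; n > 0 {
		return n
	}
	return 0
}

func TestPlacementProperty(t *testing.T) {
	rapid.Check(t, func(rt *rapid.T) {
		count := rapid.IntRange(0, 10).Draw(rt, "count")
		existing := rapid.IntRange(0, 20).Draw(rt, "existing")

		placed := placementsNeeded(count, existing)

		// Safety property: never plan more placements than the group's count.
		if placed > count {
			rt.Fatalf("placed %d for count %d", placed, count)
		}
	})
}
```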
Piotr Kazmierczak
27da75044e scheduler: move tests that depend on calling schedulers into integration package (#26037) 2025-06-24 09:31:10 +02:00
Piotr Kazmierczak
12ddb6db94 scheduler: capture reconciler state in ReconcilerState object (#26088)
This changeset separates reconciler fields into their own sub-struct to make
testing easier and the code more explicit about what fields relate to which
state.
2025-06-23 15:36:39 +02:00
Piotr Kazmierczak
1030760d3f scheduler: adjust method comments and names to reflect recent refactoring (#26085)
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-06-20 17:23:31 +02:00
Piotr Kazmierczak
b82fd2e159 scheduler: refactor cluster reconciler to avoid hidden state mutation (#26042)
Cluster reconciler code is notoriously hard to follow because most of its
methods continuously mutate the fields of the allocReconciler object. Even
for top-level methods this makes the code hard to follow, and it gets really
gnarly with lower-level methods (of which there are many). This changeset
proposes a refactoring that makes the vast majority of said methods return
explicit values and avoid mutating object fields.
2025-06-20 07:37:16 +02:00
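
A simplified contrast of the pattern the refactoring moves toward, with hypothetical types: helpers that silently mutate reconciler fields are replaced by helpers that return explicit values the caller applies.

```go
type reconcileResults struct {
	stop []string
}

type allocReconcilerSketch struct {
	results reconcileResults
}

// Before: the helper mutates hidden state, so the caller can't see the effect.
func (r *allocReconcilerSketch) computeStopMutating(candidates []string) {
	r.results.stop = append(r.results.stop, candidates...)
}

// After: the helper returns explicit values; the caller decides what to apply.
func computeStopExplicit(candidates []string) []string {
	return append([]string(nil), candidates...)
}
```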
Piotr Kazmierczak
0ddbc548a3 scheduler: rename reconciliation package to reconciler (#26038)
nouns are better than verbs for package names
2025-06-12 14:36:09 +02:00