Commit Graph

949 Commits

Aimee Ukasick
a30cb2f137 Update UI, code comment, and README links to docs, tutorials (#26429)
* Update UI, code comment, and README links to docs, tutorials

* fix typo in ephemeral disks learn more link url

* feedback on typo

Co-authored-by: Tim Gross <tgross@hashicorp.com>

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-08-06 09:40:23 -05:00
Tim Gross
4ce937884d scheduler: move result mutation into computeStop (#26351)
The `computeStop` method returns two values that only get used to mutate the
result and the untainted set. Move the mutation into the method to match the
work done in #26325.
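
A minimal sketch of the pattern, with hypothetical names standing in for the
real reconciler types (this is not the actual Nomad code): the method takes
the result and the untainted set and mutates them directly, so the caller
can't forget or duplicate the bookkeeping.

```go
package reconciler

// reconcileResults and allocSet are stand-ins for the reconciler's types.
type reconcileResults struct{ stop []string }
type allocSet map[string]string

// computeStop records each stopped alloc on the result and drops it from
// the untainted set, instead of returning both for the caller to apply.
func computeStop(result *reconcileResults, untainted, candidates allocSet) {
	for id := range candidates {
		result.stop = append(result.stop, id)
		delete(untainted, id) // a stopped alloc is no longer untainted
	}
}
```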

Ref: https://github.com/hashicorp/nomad/pull/26325
Ref: https://hashicorp.atlassian.net/browse/NMD-819
2025-07-29 08:23:06 -04:00
Tim Gross
26554e544e scheduler: move result mutation into computeUpdates (#26336)
The `computeUpdates` method returns four values, some of which are just
different shapes of the same data and only ever get applied to the result in
the caller. Move the mutation of the result into `computeUpdates` to match the
work done in #26325. Clean up the return signature so that only the slices we
need downstream are returned, and fix the incorrect docstring.

Also fix a silent bug where the `inplace` set includes the original alloc and
not the updated version. This has no functional change because all existing
callers only ever look at the length of this slice, but it will prevent future
bugs if that ever changes.

Ref: https://github.com/hashicorp/nomad/pull/26325
Ref: https://hashicorp.atlassian.net/browse/NMD-819
2025-07-25 08:21:37 -04:00
James Rasell
5989d5862a ci: Update golangci-lint to v2 and fix highlighted issues. (#26334) 2025-07-25 10:44:08 +01:00
Tim Gross
2c4be7fc2e Reconciler mutation improvements (#26325)
Refactors the `computeGroup` code in the reconciler to make its mutations
easier to understand. Some of this work makes mutation more consistent, but
more importantly it's intended to make mutation readily _detectable_ while
still being readable. Includes:

* In the `computeCanaries` function, we mutate the dstate and the result and
  then the return values are used to further mutate the result in the
  caller. Move all this mutation into the function.

* In the `computeMigrations` function, we mutate the result and then the return
  values are used to further mutate the result in the caller. Move all this
  mutation into the function.

* In the `cancelUnneededCanaries` function, we mutate the result and then the
  return values are used to further mutate the result in the caller. Move all
  this mutation into the function, and annotate which `allocSet`s are mutated by
  taking a pointer to the set (sketched below, after this list).

* The `createRescheduleLaterEvals` function currently mutates the result and
  returns updates used to further mutate the result in the caller. Move all
  this mutation into the function to help clean up `computeGroup`.

* Extract `computeReconnecting` method from `computeGroup`. There's some tangled
  logic in `computeGroup` for determining changes to make for reconnecting
  allocations. Pull this out into its own function. Annotate mutability in the
  function by passing pointers to `allocSet` where needed, and mutate the result
  to update counts. Rename the old `computeReconnecting` method to
  `appendReconnectingUpdates` to mirror the naming of the similar logic for
  disconnects.

* Extract `computeDisconnecting` method from `computeGroup`. There's some
  tangled logic in `computeGroup` for determining changes to make for
  disconnected allocations. Pull this out into its own function. Annotate
  mutability in the function by passing pointers to `allocSet` where needed, and
  mutate the result to update counts.

* The `appendUnknownDisconnectingUpdates` method used to create updates for
  disconnected allocations mutates one of its `allocSet` arguments to change the
  allocations that the reschedule-now set points to. Pull this update out into
  the caller.

* A handful of small docstring and helper function fixes
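
A sketch of the pointer-annotation convention from the list above, with
hypothetical names (not the actual Nomad code): sets a function may mutate are
passed as `*allocSet`, read-only sets by value, so mutability is visible in the
signature. Since Go maps are reference types, either form could technically be
mutated; the pointer is a deliberate signal to readers rather than a mechanical
necessity.

```go
package reconciler

type alloc struct{ id string }
type allocSet map[string]*alloc

// cancelUnneededCanaries advertises its mutations in its signature: the
// canaries set is passed by pointer because entries are deleted from it,
// while all is read-only and passed by value.
func cancelUnneededCanaries(all allocSet, canaries *allocSet, stop func(*alloc)) {
	for id, a := range *canaries {
		if _, stillWanted := all[id]; !stillWanted {
			stop(a)               // record the stop on the result
			delete(*canaries, id) // mutation, announced by the pointer
		}
	}
}
```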


Ref: https://hashicorp.atlassian.net/browse/NMD-819
2025-07-24 08:33:49 -04:00
Tim Gross
e675491eb6 refactor uses of allocSet in reconciler (#26324)
The reconciler contains a large set of methods and functions that operate on
`allocSet` (a map of allocation IDs to their allocs). Update these so that they
are consistently methods that are documented to not consume the `allocSet`. This
sets the stage for further improvements around mutability in the reconciler.
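
A sketch of the convention (hypothetical shapes, not the actual Nomad code):
`allocSet` is a map of allocation IDs to allocations, and filter methods
return new sets while documenting that the receiver is not consumed.

```go
package reconciler

type alloc struct {
	id       string
	terminal bool
}

// allocSet maps allocation IDs to allocations.
type allocSet map[string]*alloc

// filterTerminal splits the set into terminal and live subsets. It does
// not consume or modify the receiver.
func (s allocSet) filterTerminal() (terminal, live allocSet) {
	terminal, live = make(allocSet), make(allocSet)
	for id, a := range s {
		if a.terminal {
			terminal[id] = a
		} else {
			live[id] = a
		}
	}
	return terminal, live
}
```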

This changeset also includes a few related refactors:
* Use the `allocSet` alias in every location it's relevant in the reconciler,
  for consistency and clarity.
* Move the filter functions and related helpers in the `allocs.go` file into the
  `filters.go` file.
* Update the method receiver on `allocSet` to match everywhere and generally
  improve the docstrings on the filter functions.

Ref: https://hashicorp.atlassian.net/browse/NMD-819
2025-07-23 08:57:41 -04:00
Piotr Kazmierczak
973a554808 scheduler: remove unnecessary reconnecting and ignore allocset assignment (#26298)
These values aren't used anywhere, and the code is confusing as is.
2025-07-21 09:06:52 +02:00
Tim Gross
333dd94362 scheduler: exit early on count=0 and filter out server-terminal (#26292)
When a task group is removed from a jobspec, the reconciler stops all
allocations and immediately returns from `computeGroup`. We can do the same
when the group has been scaled to zero, but doing so runs into an inconsistency
in the way that server-terminal allocations are handled.

Prior to this change server-terminal allocations fall through `computeGroup`
without being marked as `ignore`, unless they are terminal canaries, in which
case they are marked `stop` (but this is a no-op). This inconsistency causes a
_tiny_ amount of extra `Plan.Submit`/Raft traffic, but more importantly makes it
more difficult to make test assertions for `stop` vs `ignore` vs
fallthrough. Remove this inconsistency by filtering out server-terminal
allocations early in `computeGroup`.
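
A sketch of the control flow described above, with assumed names (not the
actual Nomad code): server-terminal allocations are filtered out first, and a
count of zero stops everything that remains and returns early.

```go
package reconciler

type alloc struct {
	id             string
	serverTerminal bool
}
type allocSet map[string]*alloc
type reconcileResults struct{ stop []*alloc }

func (s allocSet) filterServerTerminal() (terminal, live allocSet) {
	terminal, live = make(allocSet), make(allocSet)
	for id, a := range s {
		if a.serverTerminal {
			terminal[id] = a
		} else {
			live[id] = a
		}
	}
	return terminal, live
}

// computeGroup drops server-terminal allocs up front so stop/ignore
// decisions below stay consistent, then exits early for count == 0.
func computeGroup(count int, all allocSet) *reconcileResults {
	res := &reconcileResults{}
	_, live := all.filterServerTerminal()
	if count == 0 {
		for _, a := range live {
			res.stop = append(res.stop, a)
		}
		return res
	}
	// ... normal reconciliation continues with the live set ...
	return res
}
```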

This brings the cluster reconciler's behavior closer to the node reconciler's
behavior, except that the node reconciler discards _all_ terminal allocations
because it doesn't support rescheduling.

This changeset required adjustments to two tests, but the tests themselves were
a bit of a mess:
* In https://github.com/hashicorp/nomad/pull/25726 we added a test of how
  canaries were treated when on draining nodes. But the test didn't correctly
  configure the job with an update block, leading to misleading test
  behavior. Fix the test to exercise the intended behavior and refactor for
  clarity.
* While working on reconciler behaviors around stopped allocations, I found it
  extremely hard to follow the intent of the disconnected client tests because
  many of the fields in the table-driven test are switches for more complex
  behavior or just tersely named. Attempt to make this a little more legible by
  moving some branches directly into fields, renaming some fields, and
  flattening out some branching.

Ref: https://hashicorp.atlassian.net/browse/NMD-819
2025-07-18 08:51:52 -04:00
Tim Gross
35f3f6ce41 scheduler: add disconnect and reschedule info to reconciler output (#26255)
The `DesiredUpdates` struct that we send to the Read Eval API doesn't include
information about disconnect/reconnect and rescheduling. Annotate the
`DesiredUpdates` with this data, and adjust the `eval status` command to display
only those fields that have non-zero values in order to make the output width
manageable.

Ref: https://hashicorp.atlassian.net/browse/NMD-815
2025-07-16 08:46:38 -04:00
Allison Larson
3ca518e89c Add node_pool to blockedEval metric (#26215)
Adds the node_pool to the blockedEval metrics that get emitted for
resource/cpu, along with the dc and node class.
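
A sketch of emitting such a labeled metric with go-metrics (the metric key and
label names here are assumptions for illustration, not the exact ones Nomad
uses):

```go
package main

import metrics "github.com/hashicorp/go-metrics"

// emitBlockedEvalMetric shows how a dimension like node_pool is added
// alongside the existing datacenter and node-class labels.
func emitBlockedEvalMetric(dc, nodeClass, nodePool string) {
	metrics.IncrCounterWithLabels(
		[]string{"nomad", "blocked_evals", "cpu"},
		1,
		[]metrics.Label{
			{Name: "datacenter", Value: dc},
			{Name: "node_class", Value: nodeClass},
			{Name: "node_pool", Value: nodePool}, // the new dimension
		},
	)
}

func main() {
	emitBlockedEvalMetric("dc1", "compute", "default")
}
```
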
2025-07-15 09:48:04 -07:00
Piotr Kazmierczak
08b3db104d docs: update reconciler diagram to reflect recent refactors (#26260)
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-07-11 15:34:07 +02:00
Tim Gross
26302ab25d reconciler: share assertions in property tests (#26259)
Refactor the reconciler property tests to extract functions for safety property
assertions we'll share between different job types for the same reconciler.
2025-07-11 09:27:22 -04:00
Tim Gross
74f7a8f037 scheduler: basic node reconciler safety properties for system jobs (#26216)
Property test assertions for the core safety properties of the node reconciler,
for system jobs.

Ref: https://hashicorp.atlassian.net/browse/NMD-814
Ref: https://github.com/hashicorp/nomad/pull/26167
2025-07-09 14:44:05 -04:00
Tim Gross
94e03f894a scheduler: basic cluster reconciler safety properties for batch jobs (#26172)
Property test assertions for the core safety properties of the cluster
reconciler, for batch jobs. The changeset includes fixes for bugs found during
the work in progress, which will get pulled out into their own PRs.

Ref: https://hashicorp.atlassian.net/browse/NMD-814
Ref: https://github.com/hashicorp/nomad/pull/26167
2025-07-09 14:43:55 -04:00
Piotr Kazmierczak
e50db4d1b8 scheduler: property testing of cancelUnneededCanaries (#26204)
In the spirit of #26180

Internal ref: https://hashicorp.atlassian.net/browse/NMD-814
2025-07-09 13:46:13 -04:00
Tim Gross
7c6c1ed0d3 scheduler: reconciler should constrain placements to count (#26239)
While working on property testing in #26172 we discovered there are scenarios
where the reconciler will produce more than the expected number of
placements. Testing of those scenarios at the whole-scheduler level shows that
this gets handled correctly downstream of the reconciler, but this makes it
harder to reason about reconciler behavior. Cap the number of placements in the
reconciler.

Ref: https://github.com/hashicorp/nomad/pull/26172
2025-07-09 11:51:01 -04:00
Tim Gross
eb47d1ca11 scheduler: eliminate dead code in node reconciler (#26236)
While working on property testing in #26216, I discovered we had unreachable
code in the node reconciler. The `diffSystemAllocsForNode` function receives a
set of non-terminal allocations, but then has branches where it assumes the
allocations might be terminal. It's trivially provable that these allocs are
always live, as the system scheduler splits the set of known allocs into live
and terminal sets before passing them into the node reconciler.

Eliminate the unreachable code and improve the variable names to make the known
state of the allocs more clear in the reconciler code.

Ref: https://github.com/hashicorp/nomad/pull/26216
2025-07-09 11:31:04 -04:00
Piotr Kazmierczak
8bc6abcd2e scheduler: basic cluster reconciler safety properties for service jobs (#26167) 2025-07-09 17:30:37 +02:00
Tim Gross
c043d1c850 scheduler: property testing of reconcile reconnecting (#26180)
To help break down the larger property tests we're doing in #26167 and #26172
into more manageable chunks, pull out a property test for just the
`reconcileReconnecting` method. This method helpfully already defines its
important properties, so we can implement those as test assertions.
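
A sketch of the shape of such a property test using `pgregory.net/rapid`, with
a stand-in function and an illustrative property (not the actual assertions in
the Nomad tests):

```go
package reconciler

import (
	"testing"

	"pgregory.net/rapid"
)

// reconcileReconnecting is a stand-in for the method under test: every
// reconnecting alloc must end up either reconnected or stopped.
func reconcileReconnecting(n int) (reconnect, stop int) { return n, 0 }

func TestReconcileReconnecting_Props(t *testing.T) {
	rapid.Check(t, func(t *rapid.T) {
		n := rapid.IntRange(0, 20).Draw(t, "reconnecting")
		reconnect, stop := reconcileReconnecting(n)
		// Safety property: no alloc is lost and none is double-counted.
		if reconnect+stop != n {
			t.Fatalf("%d reconnected + %d stopped != %d", reconnect, stop, n)
		}
	})
}
```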

Ref: https://hashicorp.atlassian.net/browse/NMD-814
Ref: https://github.com/hashicorp/nomad/pull/26167
Ref: https://github.com/hashicorp/nomad/pull/26172
2025-07-07 09:40:49 -04:00
Tim Gross
5c909213ce scheduler: add reconciler annotations to completed evals (#26188)
The output of the reconciler stage of scheduling is only visible via debug-level
logs, typically accessible only to the cluster admin. We can give job authors
better ability to understand what's happening to their jobs if we expose this
information to them in the `eval status` command.

Add the reconciler's desired updates to the evaluation struct so it can be
exposed in the API. This increases the size of evals by roughly 15% in the state
store, or a bit more when there are preemptions (but we expect this will be a
small minority of evals).

Ref: https://hashicorp.atlassian.net/browse/NMD-818
Fixes: https://github.com/hashicorp/nomad/issues/15564
2025-07-07 09:40:21 -04:00
Tim Gross
9a29df2292 scheduler: emit structured logs from reconciliation (#26169)
Both the cluster reconciler and node reconciler emit a debug-level log line with
their results, but these are unstructured multi-line logs that are annoying for
operators to parse. Change these to emit structured key-value pairs like we do
everywhere else.
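
A sketch of the difference using `go-hclog` (key names are illustrative, not
Nomad's exact fields):

```go
package main

import hclog "github.com/hashicorp/go-hclog"

func main() {
	logger := hclog.New(&hclog.LoggerOptions{
		Name:  "scheduler",
		Level: hclog.Debug,
	})

	// Before: a multi-line free-form dump that log pipelines can't parse.
	//   logger.Debug("reconciler results:\n  place: 3\n  stop: 1")

	// After: structured key/value pairs, one line, machine-parseable.
	logger.Debug("reconciliation results", "place", 3, "stop", 1, "ignore", 7)
}
```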

Ref: https://hashicorp.atlassian.net/browse/NMD-818
Ref: https://go.hashi.co/rfc/nmd-212
2025-07-01 10:37:44 -04:00
Piotr Kazmierczak
36e7148247 scheduler: doc.go files for new packages (#26177) 2025-07-01 16:28:33 +02:00
Tim Gross
ec8250ed30 property test generation for reconciler (#26142)
As part of ongoing work to make the scheduler more legible and more robustly
tested, we're implementing property testing of at least the reconciler. This
changeset provides some infrastructure we'll need for generating the test cases
using `pgregory.net/rapid`, without building out any of the property assertions
yet (that'll be in upcoming PRs over the next couple weeks).

The alloc reconciler generator produces a job, a previous version of the job, a
set of tainted nodes, and a set of existing allocations. The node reconciler
generator produces a job, a set of nodes, and allocations on those
nodes. Reconnecting allocs are not yet well-covered by these generators, and
with ~40 dimensions covered so far we may need to pull those out to their own
tests in order to get good coverage.

Note the scenarios only randomize fields of interest; fields like the job name
that don't impact the reconciler would use up available shrink cycles on failed
tests without actually reducing the scope of the scenario.
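
A sketch of a generator built on `rapid.Custom` that randomizes only fields of
interest (the field names and ranges are assumptions for illustration):

```go
package reconciler

import "pgregory.net/rapid"

type jobScenario struct {
	Count       int
	MaxParallel int
	Name        string
}

// scenarioGen draws scheduling-relevant fields from generators while
// fields the reconciler ignores (like the job name) stay constant, so
// shrink cycles are spent only on dimensions that matter.
func scenarioGen() *rapid.Generator[jobScenario] {
	return rapid.Custom(func(t *rapid.T) jobScenario {
		return jobScenario{
			Count:       rapid.IntRange(0, 10).Draw(t, "count"),
			MaxParallel: rapid.IntRange(1, 4).Draw(t, "max_parallel"),
			Name:        "example", // constant: irrelevant to the reconciler
		}
	})
}
```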

Ref: https://hashicorp.atlassian.net/browse/NMD-814
Ref: https://github.com/flyingmutant/rapid
2025-06-26 11:09:53 -04:00
Piotr Kazmierczak
27da75044e scheduler: move tests that depend on calling schedulers into integration package (#26037) 2025-06-24 09:31:10 +02:00
Piotr Kazmierczak
12ddb6db94 scheduler: capture reconciler state in ReconcilerState object (#26088)
This changeset separates reconciler fields into their own sub-struct to make
testing easier and the code more explicit about what fields relate to which
state.
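
A sketch of the shape of that refactor, with assumed field names (not the
exact fields in Nomad):

```go
package reconciler

// ReconcilerState groups the mutable working state that used to live as
// loose fields on the reconciler itself, so tests can construct and
// inspect it directly.
type ReconcilerState struct {
	DeploymentPaused   bool
	DeploymentFailed   bool
	DeploymentComplete bool
}

type allocReconciler struct {
	// immutable inputs to the reconciliation
	jobID string

	// working state, now explicit and testable on its own
	state ReconcilerState
}
```
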
2025-06-23 15:36:39 +02:00
Piotr Kazmierczak
1030760d3f scheduler: adjust method comments and names to reflect recent refactoring (#26085)
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-06-20 17:23:31 +02:00
Piotr Kazmierczak
b82fd2e159 scheduler: refactor cluster reconciler to avoid hidden state mutation (#26042)
Cluster reconciler code is notoriously hard to follow because most of its
methods continuously mutate the fields of the allocReconciler object. Even
for top-level methods this makes the code hard to follow, and it gets really
gnarly with lower-level methods (of which there are many). This changeset
proposes a refactoring that makes the vast majority of these methods return
explicit values and avoid mutating object fields.
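
A sketch of the before/after, with made-up names (not the actual Nomad code):
instead of silently accumulating into object fields, each step returns
explicit values that the caller combines.

```go
package reconciler

type results struct{ place, stop int }

// Before, methods accumulated into a shared field and returned nothing:
//
//	func (a *allocReconciler) computeStop() { a.result.stop++ }
//
// After: each step returns explicit values, so every mutation is visible
// at the call site instead of hidden inside the object.
func computeStop(candidates int) results { return results{stop: candidates} }
func computePlace(missing int) results   { return results{place: missing} }

func reconcile() results {
	var out results
	s, p := computeStop(2), computePlace(3)
	out.stop += s.stop
	out.place += p.place
	return out
}
```
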
2025-06-20 07:37:16 +02:00
Piotr Kazmierczak
0ddbc548a3 scheduler: rename reconciliation package to reconciler (#26038)
nouns are better than verbs for package names
2025-06-12 14:36:09 +02:00
Piotr Kazmierczak
199d12865f scheduler: isolate feasibility (#26031)
This change isolates all the code that deals with node selection in the
scheduler into its own package called feasible.
---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-06-11 20:11:04 +02:00
Piotr Kazmierczak
76e3c2961a scheduler: isolate reconciliation code (#26002)
This moves all the code of service/batch and system/sysbatch reconciliation into a new reconcile package.
2025-06-10 15:46:39 +02:00
Piotr Kazmierczak
ce054aae96 scheduler: add a readme and start documenting low level implementation details (#25986)
In an effort to improve the readability and maintainability of the
nomad/scheduler package, we begin with a README file that describes its
operation in more detail than the official documentation does. This PR will be
followed by a few small ones that move code around within that package,
improve variable naming, and keep that README up to date.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-06-05 15:36:17 +02:00
Piotr Kazmierczak
648bacda77 testing: migrate nomad/scheduler off of testify (#25968)
In the spirit of #25909, this PR removes testify dependencies from the
scheduler package and also removes uses of reflect.DeepEqual. This is again a
combination of semgrep and hx editing magic.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-06-04 09:29:28 +02:00
tehut
55523ecf8e Add NodeMaxAllocations to client configuration (#25785)
* Set MaxAllocations in client config
* Add NodeAllocationTracker struct to Node struct
* Evaluate MaxAllocations in AllocsFit function
* Set up CLI config parsing
* Integrate maxAllocs into AllocatedResources view
Co-authored-by: Tim Gross <tgross@hashicorp.com>

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-05-22 12:49:27 -07:00
Tim Gross
456d95a19e scheduler: account for affinity value of zero in score normalization (#25800)
If there are no affinities on a job, we don't want to count an affinity score of
zero in the number of scores we divide the normalized score by. This is how we
handle other scoring components like node reschedule penalties on nodes that
weren't running the previous allocation.

But we were also excluding the affinity from that count in the case where the
job has an affinity but its value is zero. In pathological cases, this can
result in a node with a low affinity being picked over a node with no affinity,
because the denominator is 1 larger. Include zero-value affinities in the count
of scores when the job has affinities but the value just happens to be zero.
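
A worked example of the normalization in Go (the scoring values are
illustrative; this is not Nomad's actual scoring code):

```go
package main

import "fmt"

// score sums the component scores and normalizes by the number of
// components counted.
func score(components []float64) float64 {
	sum := 0.0
	for _, c := range components {
		sum += c
	}
	return sum / float64(len(components))
}

func main() {
	// Both nodes carry a -0.5 penalty component. Node A has a negligible
	// affinity score (0.01); node B's affinity evaluates to exactly 0.
	// Before the fix, B's zero affinity was dropped from the count:
	fmt.Println(score([]float64{-0.5, 0.01})) // A: ~ -0.245
	fmt.Println(score([]float64{-0.5}))       // B:   -0.5, so A wins on the denominator alone
	// After the fix, B's zero-value affinity is counted too:
	fmt.Println(score([]float64{-0.5, 0})) // B: -0.25, now comparable to A
}
```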

Fixes: https://github.com/hashicorp/nomad/issues/25621
2025-05-19 14:10:00 -04:00
Allison Larson
fd16f80b5a Only error on constraints if no allocs are running (#25850)
* Only error on constraints if no allocs are running

When running `nomad job run <JOB>` multiple times with constraints defined,
there should be no error as a result of filtering out nodes that do not (and
never did) satisfy the constraints.

When running a system job with a constraint, any run after the initial
startup returns exit code 2 and a warning about unplaced allocations due to
constraints, an error that is not encountered on the initial run even though
the constraint stays the same.

This is because the node that satisfies the constraint is already running
the allocation, so that placement is ignored. Another placement is
attempted, but the only node(s) left are the ones that do not satisfy the
constraint. Nomad views this case (none of the attempted placements could
be placed successfully) as an error, and reports it as such. In reality, no
allocations should be placed or updated in this case, but it should not be
treated as an error.

This change uses the `ignored` placements from `diffSystemAllocs` to
determine whether the case encountered is an error (no ignored placements
means that nothing is already running, which is an error) or not (an ignored
placement means the task is already running somewhere on a node). It does
this at the point where `failedTGAlloc` is populated, so placement
functionality isn't changed, just the field that populates the error.

There is functionality that should be preserved which (correctly)
notifies a user if a job is attempted that cannot be run on any node due
to the constraints filtering out all available nodes. This should still
behave as expected.
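
A sketch of the resulting error gate, with assumed names (not the actual
Nomad code):

```go
package scheduler

// diffResult is a stand-in for the system-diff output described above.
type diffResult struct {
	failed  int // placements filtered out by constraints
	ignored int // allocs already running and left untouched
}

// placementError reports whether failed placements should surface as an
// error: an ignored placement means the task group is already running on
// a feasible node, so constraint-filtered placements are expected.
func placementError(d diffResult) bool {
	return d.failed > 0 && d.ignored == 0
}
```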

* Add changelog entry

* Handle in-place updates for constrained system jobs

* Update .changelog/25850.txt

Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>

* Remove conditionals

---------

Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>
2025-05-15 15:14:03 -07:00
Tim Gross
5208ad4c2c scheduler: allow canaries to be migrated on node drain (#25726)
When a node is drained that has canaries that are not yet healthy, the canaries
may not be properly migrated and the deployment will halt. This happens only if
there are more than `migrate.max_parallel` canaries on the node and the canaries
are not yet healthy (e.g., they have a long `update.min_healthy_time`). In this
circumstance, the first batch of canaries is correctly marked for migration by
the drainer. But then the reconciler counts these migrated canaries against the
total number of expected canaries and no longer progresses the
deployment. Because an insufficient number of allocations have reported they're
healthy, the deployment cannot be promoted.

When the reconciler looks for canaries to cancel, it leaves in the list any
canaries that are already terminal (because there shouldn't be any work to
do). But this ends up skipping the creation of a new canary to replace terminal
canaries that have been marked for migration. Add a conditional for this case
that removes the canary from the list of active canaries so we can replace it.

Ref: https://hashicorp.atlassian.net/browse/NMD-560
Fixes: https://github.com/hashicorp/nomad/issues/17842
2025-04-24 09:24:28 -04:00
Allison Larson
50513a87b7 Preserve core resources during inplace service alloc updates (#25705)
* Preserve core resources during inplace service alloc updates

When an alloc is running with the core resources specified, and the
alloc is able to be updated in place, the cores it is running on should
be preserved.

This fixes a bug where the allocation's task's core resources
(CPU.ReservedCores) would be recomputed each time the reconciler checked
that the allocation could continue to run on the given node. Under
circumstances where a different core on the node became available before
this check was made, the selection process could compute this new core as
the core to run on, regardless of the core the allocation was already
running on. The check takes into account other allocations running on the
node with reserved cores, but it cannot account for the allocation being
checked itself.

When this happened for multiple allocations being evaluated in a single
plan, the selection process would see the other cores previously reserved
but be unaware of the one each allocation actually ran on, resulting in the
same core being chosen over and over for each allocation being checked, and
updated in the state store (but not on the node). Once those cores were
chosen and committed for multiple allocs, the node would appear to be
exhausted on the cores dimension, preventing any additional allocations
from being started on the node.

The reconciler's check for allocations that are being updated in place and
have resources.cores defined is effectively a check that the node has the
available cores to run on, not a computation whose result should change. The
fix still performs the check, but once it succeeds any existing
ReservedCores are preserved. Because any change to this resource is
considered a "destructive change", the reservation can be confidently
preserved during the in-place update.
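
A sketch of the fix, using assumed struct shapes rather than Nomad's actual
types:

```go
package scheduler

type allocatedCPU struct {
	ReservedCores []uint16
}

// inplaceCPU keeps the cores an allocation already holds: the fit check
// still runs, but on success the in-place update carries the existing
// reservation forward instead of the freshly computed selection.
func inplaceCPU(existing, proposed allocatedCPU) allocatedCPU {
	if len(existing.ReservedCores) > 0 {
		// Changing cores would be a destructive change, so an in-place
		// update can safely preserve the existing reservation.
		proposed.ReservedCores = existing.ReservedCores
	}
	return proposed
}
```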

* Adjust reservedCores scheduler test

* Add changelog entry
2025-04-23 10:38:47 -07:00
Tim Gross
c205688857 scheduler: fix state corruption from rescheduler tracker updates (#25698)
In #12319 we fixed a bug where updates to the reschedule tracker would be
dropped if the follow-up allocation failed to be placed by the scheduler in the
later evaluation. We did this by mutating the previous allocation's reschedule
tracker. But we did this without copying the previous allocation first and then
making sure the updated copy was in the plan. This is unfortunately unsafe and
corrupts the state store on the server where the scheduler ran; it may cause a
race condition in RPC handlers and it causes the server to be out of sync with
the other servers. This was discovered while trying to make all our tests
race-free, but likely impacts production users.

Copy the previous allocation before updating the reschedule tracker, and swap
out the updated allocation in the plan. This also requires that we include the
reschedule tracker in the "normalized" (stripped-down) allocations we send to
the leader as part of a plan.
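
A sketch of the copy-on-write rule the fix enforces, with assumed names (not
the actual Nomad code):

```go
package scheduler

type allocation struct {
	ID                 string
	RescheduleAttempts int
}

// Copy returns a copy deep enough for this sketch; the real struct needs
// a true deep copy.
func (a *allocation) Copy() *allocation {
	c := *a
	return &c
}

// updateRescheduleTracker never mutates an alloc fetched from the state
// store: it copies it, mutates the copy, and swaps the copy into the plan.
func updateRescheduleTracker(prev *allocation, plan map[string]*allocation) {
	updated := prev.Copy()       // the state store object stays untouched
	updated.RescheduleAttempts++ // mutate only the copy
	plan[updated.ID] = updated   // the plan carries the updated copy
}
```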

Ref: https://github.com/hashicorp/nomad/pull/12319
Fixes: https://hashicorp.atlassian.net/browse/NET-12357
2025-04-18 08:42:54 -04:00
Tim Gross
5c89b07f11 CI: run copywrite on PRs, not just after merges (#25658)
* CI: run copywrite on PRs, not just after merges
* fix a missing copyright header
2025-04-10 17:01:34 -04:00
Carlos Galdino
048c5bcba9 Use core ID when selecting cores (#25340)
* Use core ID when selecting cores

If the available cores are not a contiguous set, the core selector might
panic when trying to select cores.

For example, consider a scenario where the available cores for the selector are the following:

    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]

This list contains 46 cores, because cores with IDs 0 and 24 are not
included in the list.

Before this patch, if we requested 46 cores, the selector would panic
trying to access the item with index 46 in `cs.topology.Cores`.

This patch changes the selector to use the core ID instead when looking
for a core inside `cs.topology.Cores`. This prevents the out-of-bounds
access that was causing the panic.

Note: the patch is a straightforward change. Perhaps a better long-term
solution would be to restructure the `numalib.Topology.Cores` field to be a
`map[ID]Core`, but that is a much larger change that is more difficult to
land. Also, the number of cores in our case is small (at most 192), so a
search won't have any noticeable impact.
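
A sketch of the fix, using assumed types: look cores up by their ID field
instead of indexing by position.

```go
package main

import "fmt"

type core struct{ ID int }

// pick returns the cores matching the requested IDs by searching on the
// ID field rather than indexing by position. With non-contiguous IDs
// (0 and 24 offline), indexing cores[46] would panic even though core
// ID 46 exists.
func pick(cores []core, wantIDs []int) []core {
	byID := make(map[int]core, len(cores))
	for _, c := range cores {
		byID[c.ID] = c
	}
	out := make([]core, 0, len(wantIDs))
	for _, id := range wantIDs {
		if c, ok := byID[id]; ok {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	var avail []core
	for id := 1; id <= 47; id++ {
		if id == 24 {
			continue // IDs 0 and 24 are not available
		}
		avail = append(avail, core{ID: id})
	}
	fmt.Println(len(pick(avail, []int{46, 47}))) // 2, no out-of-bounds access
}
```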

* Add changelog entry

* Build list of IDs inline
2025-04-10 13:04:15 -07:00
James Rasell
4c4cb2c6ad agent: Fix misaligned contextual k/v logging arguments. (#25629)
Arguments passed to hclog log lines should always come in an even number
so they pair up into the expected k/v output.
2025-04-10 14:40:21 +01:00
Michael Smithhisler
f2b761f17c disconnected: removes deprecated disconnect fields (#25284)
The group-level fields stop_after_client_disconnect,
max_client_disconnect, and prevent_reschedule_on_lost were deprecated in
Nomad 1.8 and replaced by fields in the disconnect block. This change
removes any logic related to those deprecated fields.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-03-05 14:46:02 -05:00
Tim Gross
1788bfb42e remove addresses from node class hash (#24942)
When a node is fingerprinted, we calculate a "computed class" from a hash over a
subset of its fields and attributes. In the scheduler, when a given node fails
feasibility checking (before fit checking) we know that no other node of that
same class will be feasible, and we add the hash to a map so we can reject them
early. This hash cannot include any values that are unique to a given node,
otherwise no other node will have the same hash and we'll never save ourselves
the work of feasibility checking those nodes.

In #4390 we introduced the `nomad.advertise.address` attribute, and in #19969
we introduced the `consul.dns.addr` attribute. Both of these are unique per
node and break the hash.

Additionally, when checking whether a node escaped its class, we were not
correctly filtering out attributes that start with `unique.`. The test for
this, introduced in #708, had an inverted assertion, which allowed the bug to
pass unnoticed since the early days of Nomad.
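
A sketch of the idea, with assumed attribute names (Nomad's actual hashing
differs): hash only attributes shared across nodes, skipping the `unique.`
prefix, so nodes of the same class produce the same hash.

```go
package main

import (
	"crypto/sha1"
	"fmt"
	"sort"
	"strings"
)

// computedClass hashes only attributes shared across nodes; per-node
// values under unique. would make every hash distinct and defeat the
// class-based feasibility cache.
func computedClass(attrs map[string]string) string {
	keys := make([]string, 0, len(attrs))
	for k := range attrs {
		if strings.HasPrefix(k, "unique.") {
			continue // skip per-node values
		}
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order for a stable hash

	h := sha1.New()
	for _, k := range keys {
		fmt.Fprintf(h, "%s=%s;", k, attrs[k])
	}
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	a := map[string]string{"os.name": "linux", "unique.network.ip-address": "10.0.0.1"}
	b := map[string]string{"os.name": "linux", "unique.network.ip-address": "10.0.0.2"}
	fmt.Println(computedClass(a) == computedClass(b)) // true: same class
}
```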

Ref: https://github.com/hashicorp/nomad/pull/708
Ref: https://github.com/hashicorp/nomad/pull/4390
Ref: https://github.com/hashicorp/nomad/pull/19969
2025-03-03 09:28:32 -05:00
James Rasell
7268053174 vault: Remove legacy token based authentication workflow. (#25155)
The legacy workflow for Vault whereby servers were configured
using a token to provide authentication to the Vault API has now
been removed. This change also removes the workflow where servers
were responsible for deriving Vault tokens for Nomad clients.

The deprecated Vault config options used by the Nomad agent have
all been removed except for "token", which is still in use by the
Vault Transit keyring implementation.

Job specification authors can no longer use the "vault.policies"
parameter and should instead use "vault.role" when not using the
default workload identity.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
2025-02-28 07:40:02 +00:00
Piotr Kazmierczak
58c6387323 stateful deployments: task group host volume claims API (#25114)
This PR introduces API endpoints /v1/volumes/claims/ and /v1/volumes/claim/:id
for listing and deleting task group host volume claims, respectively.
2025-02-25 15:51:59 +01:00
Tim Gross
dc58f247ed docs: clarify reschedule, migrate, and replacement terminology (#24929)
Our vocabulary around scheduler behaviors outside of the `reschedule` and
`migrate` blocks leaves room for confusion around whether the reschedule tracker
should be propagated between allocations. There are effectively five different
behaviors we need to cover:

* restart: when the tasks of an allocation fail and we try to restart the tasks
  in place.

* reschedule: when the `restart` block runs out of attempts (or the allocation
  fails before tasks even start), and we need to move
  the allocation to another node to try again.

* migrate: when the user has asked to drain a node and we need to move the
  allocations. These are not failures, so we don't want to propagate the
  reschedule tracker.

* replacement: when a node is lost, we don't count that against the `reschedule`
  tracker for the allocations on the node (it's not the allocation's "fault",
  after all). We don't want to run the `migrate` machinery here either, as we
  can't contact the down node. To the scheduler, this is effectively the same as
  if we bumped the `group.count`.

* replacement for `disconnect.replace = true`: this is a replacement, but the
  replacement is intended to be temporary, so we propagate the reschedule tracker.

Add a section to the `reschedule`, `migrate`, and `disconnect` blocks explaining
when each item applies. Update the use of the word "reschedule" in several
places where "replacement" is correct, and vice-versa.

Fixes: https://github.com/hashicorp/nomad/issues/24918
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
2025-02-18 09:31:03 -05:00
Paweł Bęza
43885f6854 Allow for in-place update when affinity or spread was changed (#25109)
Similarly to #6732, this removes the affinity and spread checks for in-place
updates. Both affinity and spread should be soft preferences for the Nomad
scheduler rather than strict constraints, so modifying them should not trigger
job reallocation.

Fixes #25070
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-02-14 14:33:18 -05:00
Piotr Kazmierczak
5468829260 stateful deployments: fix return in the hasVolumes feasibility check (#25084)
A return statement was missing in the sticky volume check: when we weren't able
to find a suitable volume, we did not return false. This was caught by an e2e
test.

This PR fixes the issue, and corrects and expands the unit test.
2025-02-11 18:57:48 +01:00
Piotr Kazmierczak
611452e1af stateful deployments: use TaskGroupVolumeClaim table to associate volume requests with volume IDs (#24993)
We introduce an alternative to the solution presented in #24960, one based on
the state store rather than previous/next allocation tracking in the
reconciler. This new solution reduces the cognitive complexity of the scheduler
code at the cost of slightly more boilerplate, but it also opens up new
possibilities in the future, e.g., allowing users to explicitly "un-stick"
volumes while workloads are still running.

The diagram below illustrates the new logic:

     SetVolumes()                                               upsertAllocsImpl()          
     sets ns, job                             +-----------------checks if alloc requests    
     tg in the scheduler                      v                 sticky vols and consults    
            |                  +-----------------------+        state. If there is no claim,
            |                  | TaskGroupVolumeClaim: |        it creates one.             
            |                  | - namespace           |                                    
            |                  | - jobID               |                                    
            |                  | - tg name             |                                    
            |                  | - vol ID              |                                    
            v                  | uniquely identify vol |                                    
     hasVolumes()              +----+------------------+                                    
     consults the state             |           ^                                           
     and returns true               |           |               DeleteJobTxn()              
     if there's a match <-----------+           +---------------removes the claim from      
     or if there is no                                          the state                   
     previous claim                                                                         
|                             | |                                                      |    
+-----------------------------+ +------------------------------------------------------+    
                                                                                            
           scheduler                                  state store
2025-02-07 17:41:01 +01:00
Matt Keeler
833e240597 Upgrade to using hashicorp/go-metrics@v0.5.4 (#24856)
* Upgrade to using hashicorp/go-metrics@v0.5.4

This also requires bumping the dependencies for:

* memberlist
* serf
* raft
* raft-boltdb
* (and indirectly hashicorp/mdns due to the memberlist or serf update)

Unlike some other HashiCorp products, Nomad's root module is currently expected to be consumed by others. This means that it needs to be treated more like our libraries and upgraded to hashicorp/go-metrics by utilizing its compat packages. This allows those importing the root module to control the metrics module used via build tags.
2025-01-31 15:22:00 -05:00