* Update UI, code comment, and README links to docs, tutorials
* fix typo in ephemeral disks learn more link url
* feedback on typo
Co-authored-by: Tim Gross <tgross@hashicorp.com>
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
The `computeUpdate` method returns 4 different values, some of which are just
different shapes of the same data and only ever get used to be applied to the
result in the caller. Move the mutation of the result into `computeUpdates` to
match the work done in #26325. Clean up the return signature so that only slices
we need downstream are returned, and fix the incorrect docstring.
Also fix a silent bug where the `inplace` set includes the original alloc and
not the updated version. This has no functional change because all existing
callers only ever look at the length of this slice, but it will prevent future
bugs if that ever changes.
Ref: https://github.com/hashicorp/nomad/pull/26325
Ref: https://hashicorp.atlassian.net/browse/NMD-819
Refactors of the `computeGroup` code in the reconciler to make understanding its
mutations more manageable. Some of this work makes mutation more consistent but
more importantly it's intended to make it readily _detectable_ while still being
readable. Includes:
* In the `computeCanaries` function, we mutate the dstate and the result and
then the return values are used to further mutate the result in the
caller. Move all this mutation into the function.
* In the `computeMigrations` function, we mutate the result and then the return
values are used to further mutate the result in the caller. Move all this
mutation into the function.
* In the `cancelUnneededCanaries` function, we mutate the result and then the
return values are used to further mutate the result in the caller. Move all
this mutation into the function, and annotate which `allocSet`s are mutated by
taking a pointer to the set.
* The `createRescheduleLaterEvals` function currently mutates the results and
returns updates to mutate the results in the caller. Move all this mutation
into the function to help cleanup `computeGroup`.
* Extract `computeReconnecting` method from `computeGroup`. There's some tangled
logic in `computeGroup` for determining changes to make for reconnecting
allocations. Pull this out into its own function. Annotate mutability in the
function by passing pointers to `allocSet` where needed, and mutate the result
to update counts. Rename the old `computeReconnecting` method to
`appendReconnectingUpdates` to mirror the naming of the similar logic for
disconnects.
* Extract `computeDisconnecting` method from `computeGroup`. There's some
tangled logic in `computeGroup` for determining changes to make for
disconnected allocations. Pull this out into its own function. Annotate
mutability in the function by passing pointers to `allocSet` where needed, and
mutate the result to update counts.
* The `appendUnknownDisconnectingUpdates` method used to create updates for
disconnected allocations mutates one of its `allocSet` arguments to change the
allocations that the reschedule now set points to. Pull this update out into
the caller.
* A handful of small docstring and helper function fixes
Ref: https://hashicorp.atlassian.net/browse/NMD-819
The reconciler contains a large set of methods and functions that operate on
`allocSet` (a map of allocation IDs to their allocs). Update these so that they
are consistently methods that are documented to not consume the `allocSet`. This
sets the stage for further improvements around mutability in the reconciler.
This changeset also includes a few related refactors:
* Use the `allocSet` alias in every location it's relevant in the reconciler,
for consistency and clarity.
* Move the filter functions and related helpers in the `allocs.go` file into the
`filters.go` file.
* Update the method receiver on `allocSet` to match everywhere and generally
improve the docstrings on the filter functions.
Ref: https://hashicorp.atlassian.net/browse/NMD-819
When a task group is removed from a jobspec, the reconciler stops all
allocations and immediately returns from `computeGroup`. We can do the same for
when the group has been scaled-to-zero, but doing so runs into an inconsistency
in the way that server-terminal allocations are handled.
Prior to this change server-terminal allocations fall through `computeGroup`
without being marked as `ignore`, unless they are terminal canaries, in which
case they are marked `stop` (but this is a no-op). This inconsistency causes a
_tiny_ amount of extra `Plan.Submit`/Raft traffic, but more importantly makes it
more difficult to make test assertions for `stop` vs `ignore` vs
fallthrough. Remove this inconsistency by filtering out server-terminal
allocations early in `computeGroup`.
This brings the cluster reconciler's behavior closer to the node reconciler's
behavior, except that the node reconciler discards _all_ terminal allocations
because it doesn't support rescheduling.
This changeset required adjustments to two tests, but the tests themselves were
a bit of a mess:
* In https://github.com/hashicorp/nomad/pull/25726 we added a test of how
canaries were treated when on draining nodes. But the test didn't correctly
configure the job with an update block, leading to misleading test
behavior. Fix the test to exercise the intended behavior and refactor for
clarity.
* While working on reconciler behaviors around stopped allocations, I found it
extremely hard to follow the intent of the disconnected client tests because
many of the fields in the table-driven test are switches for more complex
behavior or just tersely named. Attempt to make this a little more legible by
moving some branches directly into fields, renaming some fields, and
flattening out some branching.
Ref: https://hashicorp.atlassian.net/browse/NMD-819
The `DesiredUpdates` struct that we send to the Read Eval API doesn't include
information about disconnect/reconnect and rescheduling. Annotate the
`DesiredUpdates` with this data, and adjust the `eval status` command to display
only those fields that have non-zero values in order to make the output width
manageable.
Ref: https://hashicorp.atlassian.net/browse/NMD-815
Refactor the reconciler property tests to extract functions for safety property
assertions we'll share between different job types for the same reconciler.
While working on property testing in #26172 we discovered there are scenarios
where the reconciler will produce more than the expected number of
placements. Testing of those scenarios at the whole-scheduler level shows that
this gets handled correctly downstream of the reconciler, but this makes it
harder to reason about reconciler behavior. Cap the number of placements in the
reconciler.
Ref: https://github.com/hashicorp/nomad/pull/26172
While working on property testing in #26216, I discovered we had unreachable
code in the node reconciler. The `diffSystemAllocsForNode` function receives a
set of non-terminal allocations, but then has branches where it assumes the
allocations might be terminal. It's trivially provable that these allocs are
always live, as the system scheduler splits the set of known allocs into live
and terminal sets before passing them into the node reconciler.
Eliminate the unreachable code and improve the variable names to make the known
state of the allocs more clear in the reconciler code.
Ref: https://github.com/hashicorp/nomad/pull/26216
The output of the reconciler stage of scheduling is only visible via debug-level
logs, typically accessible only to the cluster admin. We can give job authors
better ability to understand what's happening to their jobs if we expose this
information to them in the `eval status` command.
Add the reconciler's desired updates to the evaluation struct so it can be
exposed in the API. This increases the size of evals by roughly 15% in the state
store, or a bit more when there are preemptions (but we expect this will be a
small minority of evals).
Ref: https://hashicorp.atlassian.net/browse/NMD-818
Fixes: https://github.com/hashicorp/nomad/issues/15564
Both the cluster reconciler and node reconciler emit a debug-level log line with
their results, but these are unstructured multi-line logs that are annoying for
operators to parse. Change these to emit structured key-value pairs like we do
everywhere else.
Ref: https://hashicorp.atlassian.net/browse/NMD-818
Ref: https://go.hashi.co/rfc/nmd-212
As part of ongoing work to make the scheduler more legible and more robustly
tested, we're implementing property testing of at least the reconciler. This
changeset provides some infrastructure we'll need for generating the test cases
using `pgregory.net/rapid`, without building out any of the property assertions
yet (that'll be in upcoming PRs over the next couple weeks).
The alloc reconciler generator produces a job, a previous version of the job, a
set of tainted nodes, and a set of existing allocations. The node reconciler
generator produces a job, a set of nodes, and allocations on those
nodes. Reconnecting allocs are not yet well-covered by these generators, and
with ~40 dimensions covered so far we may need to pull those out to their own
tests in order to get good coverage.
Note the scenarios only randomize fields of interest; fields like the job name
that don't impact the reconciler would use up available shrink cycles on failed
tests without actually reducing the scope of the scenario.
Ref: https://hashicorp.atlassian.net/browse/NMD-814
Ref: https://github.com/flyingmutant/rapid
This changeset separates reconciler fields into their own sub-struct to make
testing easier and the code more explicit about what fields relate to which
state.
Cluster reconciler code is notoriously hard to follow because most of its
method continuously mutate the fields of the allocReconciler object. Even
for top-level methods it makes the code hard to follow, but gets really gnarly
with lower-level methods (of which there are many). This changeset proposes a
refactoring that makes the vast majority of said methods return explicit values,
and avoid mutating object fields.
This change isolates all the code that deals with node selection in the
scheduler into its own package called feasible.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
In an effort to improve the readability and maintainability of nomad/scheduler
package, we begin with a README file that describes its operation in more detail
than the official documentation does. This PR will be followed by a few small
ones that move the code around that package, improve variable naming and also
keep that readme up to date.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
In the spirit of #25909, this PR removes testify dependencies from the scheduler
package, along with reflect.DeepEqual removal. This is again a combination of
semgrep and hx editing magic.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* Set MaxAllocations in client config
Add NodeAllocationTracker struct to Node struct
Evaluate MaxAllocations in AllocsFit function
Set up cli config parsing
Integrate maxAllocs into AllocatedResources view
Co-authored-by: Tim Gross <tgross@hashicorp.com>
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
If there are no affinities on a job, we don't want to count an affinity score of
zero in the number of scores we divide the normalized score by. This is how we
handle other scoring components like node reschedule penalties on nodes that
weren't running the previous allocation.
But we also exclude counting the affinity in the case where we have affinity but
the value is zero. In pathological cases, this can result in a node with a low
affinity being picked over a node with no affinity, because the denominator is 1
larger. Include zero-value affinities in the count of scores if the job has
affinities but the value just happens to be zero.
Fixes: https://github.com/hashicorp/nomad/issues/25621
* Only error on constraints if no allocs are running
When running `nomad job run <JOB>` multiple times with constraints
defined, there should be no error as a result of filtering out nodes
that do not/have not ever satsified the constraints.
When running a systems job with constraint, any run after an initial
startup returns an exit(2) and a warning about unplaced allocations due
to constraints. An error that is not encountered on the initial run,
though the constraint stays the same.
This is because the node that satisfies the condition is already running
the allocation, and the placement is ignored. Another placement is
attempted, but the only node(s) left are the ones that do not satisfy
the constraint. Nomad views this case (no allocations that were
attempted to placed could be placed successfully) as an error, and
reports it as such. In reality, no allocations should be placed or
updated in this case, but it should not be treated as an error.
This change uses the `ignored` placements from diffSystemAlloc to attempt to
determine if the case encountered is an error (no ignored placements
means that nothing is already running, and is an error), or is not one
(an ignored placement means that the task is already running somewhere
on a node). It does this at the point where `failedTGAlloc` is
populated, so placement functionality isn't changed, just the field that
populates error.
There is functionality that should be preserved which (correctly)
notifies a user if a job is attempted that cannot be run on any node due
to the constraints filtering out all available nodes. This should still
behave as expected.
* Add changelog entry
* Handle in-place updates for constrained system jobs
* Update .changelog/25850.txt
Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>
* Remove conditionals
---------
Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>
When a node is drained that has canaries that are not yet healthy, the canaries
may not be properly migrated and the deployment will halt. This happens only if
there are more than `migrate.max_parallel` canaries on the node and the canaries
are not yet healthy (ex. they have a long `update.min_healthy_time`). In this
circumstance, the first batch of canaries are marked for migration by the
drainer correctly. But then the reconciler counts these migrated canaries
against the total number of expected canaries and no longer progresses the
deployment. Because an insufficient number of allocations have reported they're
healthy, the deployment cannot be promoted.
When the reconciler looks for canaries to cancel, it leaves in the list any
canaries that are already terminal (because there shouldn't be any work to
do). But this ends up skipping the creation of a new canary to replace terminal
canaries that have been marked for migration. Add a conditional for this case to
cause the canary to be removed from the list of active canaries so we can
replace it.
Ref: https://hashicorp.atlassian.net/browse/NMD-560
Fixes: https://github.com/hashicorp/nomad/issues/17842
* Preserve core resources during inplace service alloc updates
When an alloc is running with the core resources specified, and the
alloc is able to be updated in place, the cores it is running on should
be preserved.
This fixes a bug where the allocation's task's core resources
(CPU.ReservedCores) would be recomputed each time the reconciler checked
that the allocation could continue to run on the given node. Under
circumstances where a different core on the node became available before
this check was made, the selection process could compute this new core
as the core to run on, regardless of core the allocation was already
running on. The check takes into account other allocations running on
the node with reserved cores, but cannot check itself.
When this would happen for multiple allocations being evaluated in a
single plan, the selection process would see the other cores being
previously reserved but be unaware of the one it ran on, resulting in
the same core being chosen over and over for each allocation that was
being checked, and updated in the state store (but not on the node).
Once those cores were chosen and committed for multiple allocs, the node
appears to be exhausted on the cores dimension, and it would prevent any
additional allocations from being started on the node.
The reconciler check/computation for allocations that are being updated
in place and have resources.cores defined is effectively a check that
the node has the available cores to run on, not a computation that
should be changed. The fix still performs the check, but once it is
successful any existing ReservedCores are preserved. Because any changes
to this resource is considered a "destructive change", this can be
confidently preserved during the inplace update.
* Adjust reservedCores scheduler test
* Add changelog entry
In #12319 we fixed a bug where updates to the reschedule tracker would be
dropped if the follow-up allocation failed to be placed by the scheduler in the
later evaluation. We did this by mutating the previous allocation's reschedule
tracker. But we did this without copying the previous allocation first and then
making sure the updated copy was in the plan. This is unfortunately unsafe and
corrupts the state store on the server where the scheduler ran; it may cause a
race condition in RPC handlers and it causes the server to be out of sync with
the other servers. This was discovered while trying to make all our tests
race-free, but likely impacts production users.
Copy the previous allocation before updating the reschedule tracker, and swap
out the updated allocation in the plan. This also requires that we include the
reschedule tracker in the "normalized" (stripped-down) allocations we send to
the leader as part of a plan.
Ref: https://github.com/hashicorp/nomad/pull/12319
Fixes: https://hashicorp.atlassian.net/browse/NET-12357
* Use core ID when selecting cores
If the available cores are not a continuous set, the core selector might
panic when trying to select cores.
For example, consider a scenario where the available cores for the selector are the following:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]
This list contains 46 cores, because cores with IDs 0 and 24 are not
included in the list
Before this patch, if we requested 46 cores, the selector would panic
trying to access the item with index 46 in `cs.topology.Cores`.
This patch changes the selector to use the core ID instead when looking
for a core inside `cs.topology.Cores`. This prevents an out of bounds
access that was causing the panic.
Note: The patch is straightforward with the change. Perhaps a better
long-term solution would be to restructure the `numalib.Topology.Cores`
field to be a `map[ID]Core`, but that is a much larger change that is
more difficult to land. Also, the amount of cores in our case is
small—at most 192—so a search won't have any noticeable impact.
* Add changelog entry
* Build list of IDs inline
The group level fields stop_after_client_disconnect,
max_client_disconnect, and prevent_reschedule_on_lost were deprecated in
Nomad 1.8 and replaced by field in the disconnect block. This change
removes any logic related to those deprecated fields.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
When a node is fingerprinted, we calculate a "computed class" from a hash over a
subset of its fields and attributes. In the scheduler, when a given node fails
feasibility checking (before fit checking) we know that no other node of that
same class will be feasible, and we add the hash to a map so we can reject them
early. This hash cannot include any values that are unique to a given node,
otherwise no other node will have the same hash and we'll never save ourselves
the work of feasibility checking those nodes.
In #4390 we introduce the `nomad.advertise.address` attribute and in #19969 we
introduced `consul.dns.addr` attribute. Both of these are unique per node and
break the hash.
Additionally, we were not correctly filtering attributes out when checking if a
node escaped the class by not filtering for attributes that start with
`unique.`. The test for this introduced in #708 had an inverted assertion, which
allowed this to pass unnoticed since the early days of Nomad.
Ref: https://github.com/hashicorp/nomad/pull/708
Ref: https://github.com/hashicorp/nomad/pull/4390
Ref: https://github.com/hashicorp/nomad/pull/19969
The legacy workflow for Vault whereby servers were configured
using a token to provide authentication to the Vault API has now
been removed. This change also removes the workflow where servers
were responsible for deriving Vault tokens for Nomad clients.
The deprecated Vault config options used byi the Nomad agent have
all been removed except for "token" which is still in use by the
Vault Transit keyring implementation.
Job specification authors can no longer use the "vault.policies"
parameter and should instead use "vault.role" when not using the
default workload identity.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
Our vocabulary around scheduler behaviors outside of the `reschedule` and
`migrate` blocks leaves room for confusion around whether the reschedule tracker
should be propagated between allocations. There are effectively five different
behaviors we need to cover:
* restart: when the tasks of an allocation fail and we try to restart the tasks
in place.
* reschedule: when the `restart` block runs out of attempts (or the allocation
fails before tasks even start), and we need to move
the allocation to another node to try again.
* migrate: when the user has asked to drain a node and we need to move the
allocations. These are not failures, so we don't want to propagate the
reschedule tracker.
* replacement: when a node is lost, we don't count that against the `reschedule`
tracker for the allocations on the node (it's not the allocation's "fault",
after all). We don't want to run the `migrate` machinery here here either, as we
can't contact the down node. To the scheduler, this is effectively the same as
if we bumped the `group.count`
* replacement for `disconnect.replace = true`: this is a replacement, but the
replacement is intended to be temporary, so we propagate the reschedule tracker.
Add a section to the `reschedule`, `migrate`, and `disconnect` blocks explaining
when each item applies. Update the use of the word "reschedule" in several
places where "replacement" is correct, and vice-versa.
Fixes: https://github.com/hashicorp/nomad/issues/24918
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
Similarly to #6732 it removes checking affinity and spread for inplace update.
Both affinity and spread should be as soft preference for Nomad scheduler rather than strict constraint. Therefore modifying them should not trigger job reallocation.
Fixes#25070
Co-authored-by: Tim Gross <tgross@hashicorp.com>
A return statement was missing in the sticky volume check—when we weren't able
to find a suitable volume, we did not return false. This was caught by e2e
test.
This PR fixes the issue, and corrects and expands the unit test.
We introduce an alternative solution to the one presented in #24960 which is
based on the state store and not previous-next allocation tracking in the
reconciler. This new solution reduces cognitive complexity of the scheduler
code at the cost of slightly more boilerplate code, but also opens up new
possibilities in the future, e.g., allowing users to explicitly "un-stick"
volumes with workloads still running.
The diagram below illustrates the new logic:
SetVolumes() upsertAllocsImpl()
sets ns, job +-----------------checks if alloc requests
tg in the scheduler v sticky vols and consults
| +-----------------------+ state. If there is no claim,
| | TaskGroupVolumeClaim: | it creates one.
| | - namespace |
| | - jobID |
| | - tg name |
| | - vol ID |
v | uniquely identify vol |
hasVolumes() +----+------------------+
consults the state | ^
and returns true | | DeleteJobTxn()
if there's a match <-----------+ +---------------removes the claim from
or if there is no the state
previous claim
| | | |
+-----------------------------+ +------------------------------------------------------+
scheduler state store
* Upgrade to using hashicorp/go-metrics@v0.5.4
This also requires bumping the dependencies for:
* memberlist
* serf
* raft
* raft-boltdb
* (and indirectly hashicorp/mdns due to the memberlist or serf update)
Unlike some other HashiCorp products, Nomads root module is currently expected to be consumed by others. This means that it needs to be treated more like our libraries and upgrade to hashicorp/go-metrics by utilizing its compat packages. This allows those importing the root module to control the metrics module used via build tags.