The output of the reconciler stage of scheduling is only visible via debug-level
logs, typically accessible only to the cluster admin. We can give job authors
better ability to understand what's happening to their jobs if we expose this
information to them in the `eval status` command.
Add the reconciler's desired updates to the evaluation struct so it can be
exposed in the API. This increases the size of evals by roughly 15% in the state
store, or a bit more when there are preemptions (but we expect this will be a
small minority of evals).
Ref: https://hashicorp.atlassian.net/browse/NMD-818
Fixes: https://github.com/hashicorp/nomad/issues/15564
Both the cluster reconciler and node reconciler emit a debug-level log line with
their results, but these are unstructured multi-line logs that are annoying for
operators to parse. Change these to emit structured key-value pairs like we do
everywhere else.
Ref: https://hashicorp.atlassian.net/browse/NMD-818
Ref: https://go.hashi.co/rfc/nmd-212
As part of ongoing work to make the scheduler more legible and more robustly
tested, we're implementing property testing of at least the reconciler. This
changeset provides some infrastructure we'll need for generating the test cases
using `pgregory.net/rapid`, without building out any of the property assertions
yet (that'll be in upcoming PRs over the next couple weeks).
The alloc reconciler generator produces a job, a previous version of the job, a
set of tainted nodes, and a set of existing allocations. The node reconciler
generator produces a job, a set of nodes, and allocations on those
nodes. Reconnecting allocs are not yet well-covered by these generators, and
with ~40 dimensions covered so far we may need to pull those out to their own
tests in order to get good coverage.
Note the scenarios only randomize fields of interest; fields like the job name
that don't impact the reconciler would use up available shrink cycles on failed
tests without actually reducing the scope of the scenario.
Ref: https://hashicorp.atlassian.net/browse/NMD-814
Ref: https://github.com/flyingmutant/rapid
This changeset separates reconciler fields into their own sub-struct to make
testing easier and the code more explicit about what fields relate to which
state.
Cluster reconciler code is notoriously hard to follow because most of its
method continuously mutate the fields of the allocReconciler object. Even
for top-level methods it makes the code hard to follow, but gets really gnarly
with lower-level methods (of which there are many). This changeset proposes a
refactoring that makes the vast majority of said methods return explicit values,
and avoid mutating object fields.
This change isolates all the code that deals with node selection in the
scheduler into its own package called feasible.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
In an effort to improve the readability and maintainability of nomad/scheduler
package, we begin with a README file that describes its operation in more detail
than the official documentation does. This PR will be followed by a few small
ones that move the code around that package, improve variable naming and also
keep that readme up to date.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
In the spirit of #25909, this PR removes testify dependencies from the scheduler
package, along with reflect.DeepEqual removal. This is again a combination of
semgrep and hx editing magic.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* Set MaxAllocations in client config
Add NodeAllocationTracker struct to Node struct
Evaluate MaxAllocations in AllocsFit function
Set up cli config parsing
Integrate maxAllocs into AllocatedResources view
Co-authored-by: Tim Gross <tgross@hashicorp.com>
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
If there are no affinities on a job, we don't want to count an affinity score of
zero in the number of scores we divide the normalized score by. This is how we
handle other scoring components like node reschedule penalties on nodes that
weren't running the previous allocation.
But we also exclude counting the affinity in the case where we have affinity but
the value is zero. In pathological cases, this can result in a node with a low
affinity being picked over a node with no affinity, because the denominator is 1
larger. Include zero-value affinities in the count of scores if the job has
affinities but the value just happens to be zero.
Fixes: https://github.com/hashicorp/nomad/issues/25621
* Only error on constraints if no allocs are running
When running `nomad job run <JOB>` multiple times with constraints
defined, there should be no error as a result of filtering out nodes
that do not/have not ever satsified the constraints.
When running a systems job with constraint, any run after an initial
startup returns an exit(2) and a warning about unplaced allocations due
to constraints. An error that is not encountered on the initial run,
though the constraint stays the same.
This is because the node that satisfies the condition is already running
the allocation, and the placement is ignored. Another placement is
attempted, but the only node(s) left are the ones that do not satisfy
the constraint. Nomad views this case (no allocations that were
attempted to placed could be placed successfully) as an error, and
reports it as such. In reality, no allocations should be placed or
updated in this case, but it should not be treated as an error.
This change uses the `ignored` placements from diffSystemAlloc to attempt to
determine if the case encountered is an error (no ignored placements
means that nothing is already running, and is an error), or is not one
(an ignored placement means that the task is already running somewhere
on a node). It does this at the point where `failedTGAlloc` is
populated, so placement functionality isn't changed, just the field that
populates error.
There is functionality that should be preserved which (correctly)
notifies a user if a job is attempted that cannot be run on any node due
to the constraints filtering out all available nodes. This should still
behave as expected.
* Add changelog entry
* Handle in-place updates for constrained system jobs
* Update .changelog/25850.txt
Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>
* Remove conditionals
---------
Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>
When a node is drained that has canaries that are not yet healthy, the canaries
may not be properly migrated and the deployment will halt. This happens only if
there are more than `migrate.max_parallel` canaries on the node and the canaries
are not yet healthy (ex. they have a long `update.min_healthy_time`). In this
circumstance, the first batch of canaries are marked for migration by the
drainer correctly. But then the reconciler counts these migrated canaries
against the total number of expected canaries and no longer progresses the
deployment. Because an insufficient number of allocations have reported they're
healthy, the deployment cannot be promoted.
When the reconciler looks for canaries to cancel, it leaves in the list any
canaries that are already terminal (because there shouldn't be any work to
do). But this ends up skipping the creation of a new canary to replace terminal
canaries that have been marked for migration. Add a conditional for this case to
cause the canary to be removed from the list of active canaries so we can
replace it.
Ref: https://hashicorp.atlassian.net/browse/NMD-560
Fixes: https://github.com/hashicorp/nomad/issues/17842
* Preserve core resources during inplace service alloc updates
When an alloc is running with the core resources specified, and the
alloc is able to be updated in place, the cores it is running on should
be preserved.
This fixes a bug where the allocation's task's core resources
(CPU.ReservedCores) would be recomputed each time the reconciler checked
that the allocation could continue to run on the given node. Under
circumstances where a different core on the node became available before
this check was made, the selection process could compute this new core
as the core to run on, regardless of core the allocation was already
running on. The check takes into account other allocations running on
the node with reserved cores, but cannot check itself.
When this would happen for multiple allocations being evaluated in a
single plan, the selection process would see the other cores being
previously reserved but be unaware of the one it ran on, resulting in
the same core being chosen over and over for each allocation that was
being checked, and updated in the state store (but not on the node).
Once those cores were chosen and committed for multiple allocs, the node
appears to be exhausted on the cores dimension, and it would prevent any
additional allocations from being started on the node.
The reconciler check/computation for allocations that are being updated
in place and have resources.cores defined is effectively a check that
the node has the available cores to run on, not a computation that
should be changed. The fix still performs the check, but once it is
successful any existing ReservedCores are preserved. Because any changes
to this resource is considered a "destructive change", this can be
confidently preserved during the inplace update.
* Adjust reservedCores scheduler test
* Add changelog entry
In #12319 we fixed a bug where updates to the reschedule tracker would be
dropped if the follow-up allocation failed to be placed by the scheduler in the
later evaluation. We did this by mutating the previous allocation's reschedule
tracker. But we did this without copying the previous allocation first and then
making sure the updated copy was in the plan. This is unfortunately unsafe and
corrupts the state store on the server where the scheduler ran; it may cause a
race condition in RPC handlers and it causes the server to be out of sync with
the other servers. This was discovered while trying to make all our tests
race-free, but likely impacts production users.
Copy the previous allocation before updating the reschedule tracker, and swap
out the updated allocation in the plan. This also requires that we include the
reschedule tracker in the "normalized" (stripped-down) allocations we send to
the leader as part of a plan.
Ref: https://github.com/hashicorp/nomad/pull/12319
Fixes: https://hashicorp.atlassian.net/browse/NET-12357
* Use core ID when selecting cores
If the available cores are not a continuous set, the core selector might
panic when trying to select cores.
For example, consider a scenario where the available cores for the selector are the following:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]
This list contains 46 cores, because cores with IDs 0 and 24 are not
included in the list
Before this patch, if we requested 46 cores, the selector would panic
trying to access the item with index 46 in `cs.topology.Cores`.
This patch changes the selector to use the core ID instead when looking
for a core inside `cs.topology.Cores`. This prevents an out of bounds
access that was causing the panic.
Note: The patch is straightforward with the change. Perhaps a better
long-term solution would be to restructure the `numalib.Topology.Cores`
field to be a `map[ID]Core`, but that is a much larger change that is
more difficult to land. Also, the amount of cores in our case is
small—at most 192—so a search won't have any noticeable impact.
* Add changelog entry
* Build list of IDs inline
The group level fields stop_after_client_disconnect,
max_client_disconnect, and prevent_reschedule_on_lost were deprecated in
Nomad 1.8 and replaced by field in the disconnect block. This change
removes any logic related to those deprecated fields.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
When a node is fingerprinted, we calculate a "computed class" from a hash over a
subset of its fields and attributes. In the scheduler, when a given node fails
feasibility checking (before fit checking) we know that no other node of that
same class will be feasible, and we add the hash to a map so we can reject them
early. This hash cannot include any values that are unique to a given node,
otherwise no other node will have the same hash and we'll never save ourselves
the work of feasibility checking those nodes.
In #4390 we introduce the `nomad.advertise.address` attribute and in #19969 we
introduced `consul.dns.addr` attribute. Both of these are unique per node and
break the hash.
Additionally, we were not correctly filtering attributes out when checking if a
node escaped the class by not filtering for attributes that start with
`unique.`. The test for this introduced in #708 had an inverted assertion, which
allowed this to pass unnoticed since the early days of Nomad.
Ref: https://github.com/hashicorp/nomad/pull/708
Ref: https://github.com/hashicorp/nomad/pull/4390
Ref: https://github.com/hashicorp/nomad/pull/19969
The legacy workflow for Vault whereby servers were configured
using a token to provide authentication to the Vault API has now
been removed. This change also removes the workflow where servers
were responsible for deriving Vault tokens for Nomad clients.
The deprecated Vault config options used byi the Nomad agent have
all been removed except for "token" which is still in use by the
Vault Transit keyring implementation.
Job specification authors can no longer use the "vault.policies"
parameter and should instead use "vault.role" when not using the
default workload identity.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
Our vocabulary around scheduler behaviors outside of the `reschedule` and
`migrate` blocks leaves room for confusion around whether the reschedule tracker
should be propagated between allocations. There are effectively five different
behaviors we need to cover:
* restart: when the tasks of an allocation fail and we try to restart the tasks
in place.
* reschedule: when the `restart` block runs out of attempts (or the allocation
fails before tasks even start), and we need to move
the allocation to another node to try again.
* migrate: when the user has asked to drain a node and we need to move the
allocations. These are not failures, so we don't want to propagate the
reschedule tracker.
* replacement: when a node is lost, we don't count that against the `reschedule`
tracker for the allocations on the node (it's not the allocation's "fault",
after all). We don't want to run the `migrate` machinery here here either, as we
can't contact the down node. To the scheduler, this is effectively the same as
if we bumped the `group.count`
* replacement for `disconnect.replace = true`: this is a replacement, but the
replacement is intended to be temporary, so we propagate the reschedule tracker.
Add a section to the `reschedule`, `migrate`, and `disconnect` blocks explaining
when each item applies. Update the use of the word "reschedule" in several
places where "replacement" is correct, and vice-versa.
Fixes: https://github.com/hashicorp/nomad/issues/24918
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
Similarly to #6732 it removes checking affinity and spread for inplace update.
Both affinity and spread should be as soft preference for Nomad scheduler rather than strict constraint. Therefore modifying them should not trigger job reallocation.
Fixes#25070
Co-authored-by: Tim Gross <tgross@hashicorp.com>
A return statement was missing in the sticky volume check—when we weren't able
to find a suitable volume, we did not return false. This was caught by e2e
test.
This PR fixes the issue, and corrects and expands the unit test.
We introduce an alternative solution to the one presented in #24960 which is
based on the state store and not previous-next allocation tracking in the
reconciler. This new solution reduces cognitive complexity of the scheduler
code at the cost of slightly more boilerplate code, but also opens up new
possibilities in the future, e.g., allowing users to explicitly "un-stick"
volumes with workloads still running.
The diagram below illustrates the new logic:
SetVolumes() upsertAllocsImpl()
sets ns, job +-----------------checks if alloc requests
tg in the scheduler v sticky vols and consults
| +-----------------------+ state. If there is no claim,
| | TaskGroupVolumeClaim: | it creates one.
| | - namespace |
| | - jobID |
| | - tg name |
| | - vol ID |
v | uniquely identify vol |
hasVolumes() +----+------------------+
consults the state | ^
and returns true | | DeleteJobTxn()
if there's a match <-----------+ +---------------removes the claim from
or if there is no the state
previous claim
| | | |
+-----------------------------+ +------------------------------------------------------+
scheduler state store
* Upgrade to using hashicorp/go-metrics@v0.5.4
This also requires bumping the dependencies for:
* memberlist
* serf
* raft
* raft-boltdb
* (and indirectly hashicorp/mdns due to the memberlist or serf update)
Unlike some other HashiCorp products, Nomads root module is currently expected to be consumed by others. This means that it needs to be treated more like our libraries and upgrade to hashicorp/go-metrics by utilizing its compat packages. This allows those importing the root module to control the metrics module used via build tags.
When a client restarts but can't restore a volume (ex. the plugin is now
missing), it's removed from the node fingerprint. So we won't allow future
scheduling of the volume, but we were not updating the volume state field to
report this reasoning to operators. Make debugging easier and the state field
more meaningful by setting the value to "unavailable".
Also, remove the unused "deleted" field. We did not implement soft deletes and
aren't planning on it for Nomad 1.10.0.
Ref: https://hashicorp.atlassian.net/browse/NET-11551
When we implemented CSI, the types of the fields for access mode and attachment
mode on volume requests were defined with a prefix "CSI". This gets confusing
now that we have dynamic host volumes using the same fields. Fortunately the
original was a typedef on string, and the Go API in the `api` package just uses
strings directly, so we can change the name of the type without breaking
backwards compatibility for the msgpack wire format.
Update the names to `VolumeAccessMode` and `VolumeAttachmentMode`. Keep the CSI
and DHV specific value constant names for these fields (they aren't currently
1:1), so that we can easily differentiate in a given bit of code which values
are valid.
Ref: https://github.com/hashicorp/nomad/pull/24881#discussion_r1920702890
Some comment cleanups as we're wrapping up dynamic host volumes work:
* We're not going to implement mount_options for host volumes, as the dynamic
host volumes don't have the equivalent of the stage/publish phase that CSI
volumes do. Users who want that sort of thing will pass them as `parameter`
field during volume create/register.
* The scheduler feasibility check prevents a dynamic host volume being claimed
by a job in the wrong namespace, but the comment incorrectly identifies that
code path as only being about the race between fingerprint and delete. Update
the comment to make the intent clear so that we don't accidentally remove this
behavior in the future.
When we feasibility check a dynamic host volume against a volume request, we
check the attachment mode and access mode. This only ensures that the
capabilities match, but doesn't enforce the semantics of the capabilities
against other claims that may be made on the allocation.
Add support for checking the requested capability against other allocations that
the volume claimed.
Ref: https://github.com/hashicorp/nomad/pull/24479
Static host volumes have a simple readonly toggle, but dynamic host volumes have
a more complex set of capabilities similar to CSI volumes. Update the
feasibility checker to account for these capabilities and volume readiness.
Also fixes a minor bug in the state store where a soft-delete (not yet
implemented) could cause a volume to be marked ready again. This is needed to
support testing the readiness checking in the scheduler.
Ref: https://github.com/hashicorp/nomad/pull/24479
When making a request to create a dynamic host volumes, users can pass a node
pool and constraints instead of a specific node ID.
This changeset implements a node scheduling logic by instantiating a filter by
node pool and constraint checker borrowed from the scheduler package. Because
host volumes with the same name can't land on the same host, we don't need to
support `distinct_hosts`/`distinct_property`; this would be challenging anyways
without building out a much larger node iteration mechanism to keep track of
usage across multiple hosts.
Ref: https://github.com/hashicorp/nomad/pull/24479
Core scheduler relies on a special table in the state store—the TimeTable—to
figure out which objects can be GC'd. The TimeTable correlates Raft indices
with objects insertion time, a solution we used before most of the objects we
store in the state contained timestamps. This introduced a bit of a memory
overhead and complexity, but most importantly meant that any GC threshold users
set greater than timeTableLimit = 72 * time.Hour was ignored. This PR removes
the TimeTable and relies on object timestamps to determine whether they could
be GCd or not.
In #23977 we moved the keyring into Raft, which can expose key material in Raft
snapshots when using the less-secure AEAD keyring instead of KMS. This changeset
adds tools for redacting this material from snapshots:
* The `operator snapshot state` command gains the ability to display key
metadata (only), which respects the `-filter` option.
* The `operator snapshot save` command gains a `-redact` option that removes key
material from the snapshot after it's downloaded.
* A new `operator snapshot redact` command allows removing key material from an
existing snapshot.
(CE backport of ENT 59433d56c7215c0b8bf33764f41b57d9bd30160f (without ent files))
* scheduler: enhance numa aware scheduling with support for devices
* cr: add comments
a change to the network{cni{}} block means that
the user wants the network config to change,
and that only happens during initial alloc setup,
so we need to replace the alloc(s) to get
fresh network(s) to reconfigure from scratch.
e.g. a job plan diff like this
```
+/- Task Group: "g" (1 in-place update)
+ Network {
+ CNIConfig {
+ a: "ayy"
}
```
should instead be
```
+/- Task Group: "g" (1 create/destroy update)
+ Network {
+ CNIConfig {
+ a: "ayy"
}
```
On supported platforms, the secrets directory is a 1MiB tmpfs. But some tasks
need larger space for downloading large secrets. This is especially the case for
tasks using `templates`, which need extra room to write a temporary file to the
secrets directory that gets renamed to the old file atomically.
This changeset allows increasing the size of the tmpfs in the `resources`
block. Because this is a memory resource, we need to include it in the memory we
allocate for scheduling purposes. The task is already prevented from using more
memory in the tmpfs than the `resources.memory` field allows, but can bypass
that limit by writing to the tmpfs via `template` or `artifact` blocks.
Therefore, we need to account for the size of the tmpfs in the allocation
resources. Simply adding it to the memory needed when we create the allocation
allows it to be accounted for in all downstream consumers, and then we'll
subtract that amount from the memory resources just before configuring the task
driver.
For backwards compatibility, the default value of 1MiB is "free" and ignored by
the scheduler. Otherwise we'd be increasing the allocated resources for every
existing alloc, which could cause problems across upgrades. If a user explicitly
sets `resources.secrets = 1` it will no longer be free.
Fixes: https://github.com/hashicorp/nomad/issues/2481
Ref: https://hashicorp.atlassian.net/browse/NET-10070