When we feasibility check a dynamic host volume against a volume request, we
check the attachment mode and access mode. This only ensures that the
capabilities match, but doesn't enforce the semantics of those capabilities
against other claims that may already be held on the volume.
Add support for checking the requested capability against the other allocations
that have claimed the volume.
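As a rough sketch of the semantics being enforced (the types and helper below
are illustrative stand-ins, not Nomad's real structs), a single-writer volume
should reject a second writer claim while still allowing read-only claims:
```go
// Illustrative sketch only; AccessMode, Claim, and canClaim are hypothetical
// stand-ins, not Nomad's real types.
package main

import "fmt"

type AccessMode string

const (
	SingleNodeReader AccessMode = "single-node-reader-only"
	SingleNodeWriter AccessMode = "single-node-writer"
)

type Claim struct {
	AllocID  string
	ReadOnly bool
}

// canClaim reports whether a new claim with the requested access mode is
// compatible with the claims other allocations already hold on the volume.
func canClaim(mode AccessMode, readOnly bool, existing []Claim) error {
	if mode == SingleNodeReader && !readOnly {
		return fmt.Errorf("volume only supports read-only access")
	}
	if mode == SingleNodeWriter && !readOnly {
		for _, c := range existing {
			if !c.ReadOnly {
				return fmt.Errorf("volume already has a writer: %s", c.AllocID)
			}
		}
	}
	return nil
}

func main() {
	existing := []Claim{{AllocID: "alloc-1", ReadOnly: false}}
	fmt.Println(canClaim(SingleNodeWriter, false, existing)) // rejected: second writer
	fmt.Println(canClaim(SingleNodeWriter, true, existing))  // allowed: read-only claim
}
```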
Ref: https://github.com/hashicorp/nomad/pull/24479
Static host volumes have a simple readonly toggle, but dynamic host volumes have
a more complex set of capabilities similar to CSI volumes. Update the
feasibility checker to account for these capabilities and volume readiness.
Also fixes a minor bug in the state store where a soft-delete (not yet
implemented) could cause a volume to be marked ready again. This is needed to
support testing the readiness checking in the scheduler.
Ref: https://github.com/hashicorp/nomad/pull/24479
When making a request to create a dynamic host volume, users can pass a node
pool and constraints instead of a specific node ID.
This changeset implements the node scheduling logic by instantiating a node
pool filter and a constraint checker borrowed from the scheduler package. Because
host volumes with the same name can't land on the same host, we don't need to
support `distinct_hosts`/`distinct_property`; this would be challenging anyways
without building out a much larger node iteration mechanism to keep track of
usage across multiple hosts.
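A loose sketch of the shape of that selection (the helper and types below are
hypothetical, not the server's real code): candidate nodes are filtered by node
pool and then by the request's constraints before one is picked.
```go
// Hypothetical sketch of node selection for a volume create request; the
// Node/Constraint types and pickNodeForVolume are illustrative only.
package placement

import (
	"errors"
	"math/rand"
)

type Node struct {
	ID       string
	NodePool string
	Attrs    map[string]string
}

// Constraint is a predicate derived from the request's constraint block.
type Constraint func(*Node) bool

func pickNodeForVolume(nodes []*Node, pool string, constraints []Constraint) (*Node, error) {
	var feasible []*Node
OUTER:
	for _, n := range nodes {
		if pool != "" && n.NodePool != pool {
			continue // wrong node pool
		}
		for _, c := range constraints {
			if !c(n) {
				continue OUTER // failed a constraint check
			}
		}
		feasible = append(feasible, n)
	}
	if len(feasible) == 0 {
		return nil, errors.New("no node satisfies the volume's node pool and constraints")
	}
	// Any feasible node will do; pick one at random.
	return feasible[rand.Intn(len(feasible))], nil
}
```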
Ref: https://github.com/hashicorp/nomad/pull/24479
The core scheduler relies on a special table in the state store—the TimeTable—to
figure out which objects can be GC'd. The TimeTable correlates Raft indices
with object insertion times, a solution we used before most of the objects we
store in state contained timestamps. This introduced a bit of memory overhead
and complexity, but most importantly it meant that any GC threshold users set
greater than timeTableLimit = 72 * time.Hour was ignored. This PR removes the
TimeTable and relies on object timestamps to determine whether they can be GC'd
or not.
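For illustration, the eligibility decision reduces to a timestamp comparison
rather than a TimeTable lookup (a minimal sketch, not the actual GC code):
```go
// Minimal sketch: GC eligibility from the object's own timestamp, with no
// TimeTable involved. Not the actual core scheduler code.
package main

import (
	"fmt"
	"time"
)

// eligibleForGC reports whether an object created at objTime has aged past
// the configured GC threshold.
func eligibleForGC(objTime time.Time, threshold time.Duration) bool {
	return time.Since(objTime) > threshold
}

func main() {
	created := time.Now().Add(-96 * time.Hour)
	// A 90h threshold now behaves as configured, even though it exceeds the
	// old 72h timeTableLimit.
	fmt.Println(eligibleForGC(created, 90*time.Hour)) // true
}
```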
In #23977 we moved the keyring into Raft, which can expose key material in Raft
snapshots when using the less-secure AEAD keyring instead of KMS. This changeset
adds tools for redacting this material from snapshots:
* The `operator snapshot state` command gains the ability to display key
metadata (only), which respects the `-filter` option.
* The `operator snapshot save` command gains a `-redact` option that removes key
material from the snapshot after it's downloaded.
* A new `operator snapshot redact` command allows removing key material from an
existing snapshot.
(CE backport of ENT 59433d56c7215c0b8bf33764f41b57d9bd30160f (without ent files))
* scheduler: enhance numa aware scheduling with support for devices
* cr: add comments
A change to the network{cni{}} block means that the user wants the network
config to change, and that only happens during initial alloc setup, so we need
to replace the alloc(s) to get fresh network(s) to reconfigure from scratch.
E.g. a job plan diff like this:
```
+/- Task Group: "g" (1 in-place update)
+ Network {
+ CNIConfig {
+ a: "ayy"
}
```
should instead be
```
+/- Task Group: "g" (1 create/destroy update)
+ Network {
+ CNIConfig {
+ a: "ayy"
}
```
On supported platforms, the secrets directory is a 1MiB tmpfs. But some tasks
need larger space for downloading large secrets. This is especially the case for
tasks using `template` blocks, which need extra room to write a temporary file
to the secrets directory that is then atomically renamed over the old file.
This changeset allows increasing the size of the tmpfs in the `resources`
block. Because this is a memory resource, we need to include it in the memory we
allocate for scheduling purposes. The task is already prevented from using more
memory in the tmpfs than the `resources.memory` field allows, but can bypass
that limit by writing to the tmpfs via `template` or `artifact` blocks.
Therefore, we need to account for the size of the tmpfs in the allocation
resources. Simply adding it to the memory needed when we create the allocation
allows it to be accounted for in all downstream consumers, and then we'll
subtract that amount from the memory resources just before configuring the task
driver.
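A minimal sketch of that accounting, assuming illustrative field names rather
than Nomad's real resource structs:
```go
// Illustrative sketch only; the Resources struct and helpers below are not
// Nomad's actual types.
package main

import "fmt"

const defaultSecretsMB = 1 // the 1MiB default stays "free" for compatibility

type Resources struct {
	MemoryMB  int
	SecretsMB int
}

// schedulerMemoryMB is the memory the scheduler reserves for the allocation:
// task memory plus any explicitly requested secrets tmpfs.
func schedulerMemoryMB(r Resources) int {
	if r.SecretsMB <= defaultSecretsMB {
		return r.MemoryMB
	}
	return r.MemoryMB + r.SecretsMB
}

// driverMemoryMB is what gets handed to the task driver, with the tmpfs size
// subtracted back out just before the task is configured.
func driverMemoryMB(r Resources) int {
	if r.SecretsMB <= defaultSecretsMB {
		return r.MemoryMB
	}
	return schedulerMemoryMB(r) - r.SecretsMB
}

func main() {
	r := Resources{MemoryMB: 256, SecretsMB: 16}
	fmt.Println(schedulerMemoryMB(r), driverMemoryMB(r)) // 272 256
}
```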
For backwards compatibility, the default value of 1MiB is "free" and ignored by
the scheduler. Otherwise we'd be increasing the allocated resources for every
existing alloc, which could cause problems across upgrades. If a user explicitly
sets `resources.secrets = 1` it will no longer be free.
Fixes: https://github.com/hashicorp/nomad/issues/2481
Ref: https://hashicorp.atlassian.net/browse/NET-10070
The NUMA topology struct field `NodeIDs` is an `idset.Set`, which has no public
members. As a result, this field is never serialized via msgpack or persisted
in state. When `numa.affinity = "prefer"`, the scheduler dereferences this nil
field and panics the scheduler worker.
Ideally we would fix this by adding a msgpack serialization extension, but
because the field already exists and has always been serialized as empty, doing
so would break RPC wire compatibility across upgrades. Instead, create a new
field that's populated at the same time we populate the more useful `idset.Set`,
and repopulate the set on demand.
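Roughly, the approach looks like the following (a simplified stand-in; the real
field lives on the NUMA topology struct and uses `idset.Set`):
```go
// Simplified stand-in for the real topology struct; illustrates keeping an
// exported, serializable slice alongside the unexported set representation.
package topology

type Topology struct {
	// nodeIDs stands in for the idset.Set field, which msgpack drops because
	// the set has no exported members.
	nodeIDs map[uint8]struct{}

	// NodeIDsSlice is populated at the same time as the set and survives
	// msgpack serialization.
	NodeIDsSlice []uint8
}

// SetNodes populates both representations together.
func (t *Topology) SetNodes(ids ...uint8) {
	t.nodeIDs = make(map[uint8]struct{}, len(ids))
	for _, id := range ids {
		t.nodeIDs[id] = struct{}{}
	}
	t.NodeIDsSlice = append([]uint8(nil), ids...)
}

// Nodes rebuilds the set on demand if it was lost across the wire, so the
// scheduler never dereferences a nil set.
func (t *Topology) Nodes() map[uint8]struct{} {
	if t.nodeIDs == nil {
		t.SetNodes(t.NodeIDsSlice...)
	}
	return t.nodeIDs
}
```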
Fixes: https://hashicorp.atlassian.net/browse/NET-9924
When an allocation fails it triggers an evaluation. The evaluation is processed
and the scheduler sees it needs to reschedule, which triggers a follow-up
eval. The follow-up eval creates a plan to `(stop 1) (place 1)`. The replacement
alloc has a `RescheduleTracker` (or gets its `RescheduleTracker` updated).
But in the case where the follow-up eval can't place all allocs (there aren't
enough resources), it can create a partial plan to `(stop 1) (place 0)`. It then
creates a blocked eval. The plan applier stops the failed alloc. Then when the
blocked eval is processed, the job is missing an allocation, so the scheduler
creates a new allocation. This allocation is _not_ a replacement from the
perspective of the scheduler, so it isn't handed the previous allocation's
`RescheduleTracker`.
This changeset fixes this by annotating the reschedule tracker whenever the
scheduler can't place a replacement allocation. We check this annotation for
allocations that have the `stop` desired status when filtering out allocations
to pass to the reschedule tracker. I've also included tests that cover this case
and expand coverage of the relevant area of the code.
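A rough sketch of that filtering decision (the field names here are
illustrative, not the actual annotation):
```go
// Illustrative only: the Alloc shape and FollowupPlacementFailed annotation
// are stand-ins, not Nomad's real structs.
package reschedule

type RescheduleEvent struct {
	RescheduleTime int64
	PrevAllocID    string
}

type Alloc struct {
	DesiredStatus     string
	RescheduleTracker []RescheduleEvent

	// FollowupPlacementFailed marks an alloc that was stopped while the
	// scheduler could not place its replacement.
	FollowupPlacementFailed bool
}

// keepsRescheduleHistory reports whether a previous allocation's tracker
// should still be carried over to a new placement.
func keepsRescheduleHistory(a *Alloc) bool {
	if a.DesiredStatus == "stop" {
		// Normally stopped allocs are filtered out; the annotation keeps the
		// history alive when the replacement never got placed.
		return a.FollowupPlacementFailed
	}
	return true
}
```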
Fixes: https://github.com/hashicorp/nomad/issues/12147
Fixes: https://github.com/hashicorp/nomad/issues/17072
Some of our scheduler tests use the `AllocName` function from the structs
package incorrectly. This function should always receive the `Job.ID` and not
the `Job.Name`. Fix these usages so the mistake doesn't get copy-pasted into
future tests.
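For reference, a sketch of the distinction (the local `allocName` helper below
just mimics the naming format and is not the real `structs.AllocName`):
```go
// allocName is a local stand-in that mimics the "<job>.<group>[<index>]"
// naming format; it is not the real structs.AllocName.
package main

import "fmt"

func allocName(jobID, group string, idx uint) string {
	return fmt.Sprintf("%s.%s[%d]", jobID, group, idx)
}

func main() {
	jobID, jobName := "example/dispatch-1731", "example"
	fmt.Println(allocName(jobID, "web", 0))   // correct: built from Job.ID
	fmt.Println(allocName(jobName, "web", 0)) // the misuse the tests had
}
```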
While working on #20462, I found that some of our scheduler tests around
down nodes or disconnected clients were enforcing invariants that were
unclear. This changeset pulls out some minor refactorings so that the bug fix PR
is easier to review. This includes:
* Migrate a few tests from `testify` to `shoenig/test` that I'm going to touch
in #12319 anyways.
* Add test names to the node down test.
* Update the disconnected client test so that we always re-process the
pending/blocked eval it creates; this eliminates 2 redundant sub-tests.
* Update the disconnected client test assertions so that they're explicit in the
test setup rather than implied by whether we re-process the pending/blocked
eval.
Ref: https://github.com/hashicorp/nomad/issues/20462
Ref: https://github.com/hashicorp/nomad/pull/12319
While working on #20462, I discovered that some of the scheduler tests for
disconnected clients were making long blocking queries. The tests used
`testutil.WaitForResult` to wait for an evaluation to be written to the state
store. The evaluation was never written, but the tests were not correctly
returning an error for an empty query. This resulted in the tests blocking for
5s and then continuing anyways.
In practice, the evaluation is never written to the state store as part of the
test harness `Process` method, so this test assertion was meaningless. Remove
the broken assertion from the two top-level tests that used it, and upgrade
these tests to use `shoenig/test` in the process. This will save us ~50s per
test run.
This commit introduces new options for reconciling a reconnecting allocation and its replacement:
* Best score (the current implementation)
* Keep original
* Keep replacement
* Keep the one that has run the longest
This is achieved by adding a new dependency to the allocReconciler that calls the corresponding function depending on the task group's disconnect strategy. For more detailed information, refer to the RFC for the new disconnected clients stanza.
Resolves #15144.
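A minimal sketch of the strategy dispatch (names are illustrative, not the real
allocReconciler API):
```go
// Illustrative sketch of choosing between a reconnecting alloc and its
// replacement; not the real allocReconciler code.
package reconcile

type Strategy string

const (
	BestScore       Strategy = "best_score" // current behavior
	KeepOriginal    Strategy = "keep_original"
	KeepReplacement Strategy = "keep_replacement"
	LongestRunning  Strategy = "longest_running"
)

type Alloc struct {
	ID        string
	StartedAt int64
	Score     float64
}

// PickReconnecting returns the allocation to keep when a disconnected alloc
// reconnects and finds a replacement already running.
func PickReconnecting(s Strategy, original, replacement Alloc) Alloc {
	switch s {
	case KeepOriginal:
		return original
	case KeepReplacement:
		return replacement
	case LongestRunning:
		if original.StartedAt <= replacement.StartedAt {
			return original // started earlier, has run longer
		}
		return replacement
	default: // BestScore preserves the previous behavior
		if original.Score >= replacement.Score {
			return original
		}
		return replacement
	}
}
```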
* tests: swap testify for test in plugins/csi/client_test.go
* tests: swap testify for test in testutil/
* tests: swap testify for test in host_test.go
* tests: swap testify for test in plugin_test.go
* tests: swap testify for test in utils_test.go
* tests: swap testify for test in scheduler/
* tests: swap testify for test in parse_test.go
* tests: swap testify for test in attribute_test.go
* tests: swap testify for test in plugins/drivers/
* tests: swap testify for test in command/
* tests: fixup some test usages
* go: run go mod tidy
* windows: cpuset test only on linux
This PR is the first of two that will implement the new Disconnect block. In this PR the new block is introduced in a way that is backwards compatible with the fields it will replace. For more information refer to this RFC and this ticket.
When an allocation can't be placed because of a port collision, the
resulting blocked eval is expected to have a metric reporting the port
that caused the conflict, but this metric was not being emitted when
preemption was enabled.
Add support for Consul Enterprise admin partitions. We added fingerprinting in
https://github.com/hashicorp/nomad/pull/19485. This PR adds a `consul.partition`
field. The expectation is that most users will create a mapping of Nomad node
pool to Consul admin partition. But we'll also create an implicit constraint for
the fingerprinted value.
Fixes: https://github.com/hashicorp/nomad/issues/13139
This commit introduces the parameter preventRescheduleOnLost, which indicates that the task group can't afford to have multiple instances running at the same time. When a node goes down, its allocations will be registered as unknown but no replacements will be rescheduled. If the lost node comes back up, the allocs will reconnect and continue to run.
If max_client_disconnect is also enabled and the group has a reschedule policy, an error will be returned.
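A hedged sketch of that validation (simplified fields, not the real jobspec structs):
```go
// Simplified validation sketch; the TaskGroup fields here are stand-ins for
// the real jobspec structs.
package jobspec

import (
	"errors"
	"time"
)

type TaskGroup struct {
	PreventRescheduleOnLost bool
	MaxClientDisconnect     *time.Duration
	RescheduleAttempts      int
}

// Validate rejects the combination of prevent-reschedule-on-lost,
// max_client_disconnect, and a reschedule policy.
func Validate(tg *TaskGroup) error {
	if tg.PreventRescheduleOnLost && tg.MaxClientDisconnect != nil && tg.RescheduleAttempts > 0 {
		return errors.New("max_client_disconnect cannot be combined with prevent_reschedule_on_lost and a reschedule policy")
	}
	return nil
}
```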
Implements issue #10366
Co-authored-by: Dom Lavery <dom@circleci.com>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
Clients prior to Nomad 1.7 cannot support the new workload identity-based
authentication to Consul and Vault. Add an implicit Nomad version constraint on
job submission for task groups that use the new workflow.
Includes a constraint test showing same-version prerelease handling.
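Something along these lines (the exact version string and operand are
assumptions for illustration, not taken from the change itself):
```go
// Assumed shape of the implicit constraint; the version string and operand
// here are illustrative guesses, not copied from the change.
package constraints

type Constraint struct {
	LTarget string
	RTarget string
	Operand string
}

// ImplicitIdentityConstraint filters out clients that predate workload
// identity support for Consul and Vault.
func ImplicitIdentityConstraint() *Constraint {
	return &Constraint{
		LTarget: "${attr.nomad.version}",
		// A "-a" prerelease suffix is a common way to make ">= 1.7.0" also
		// match 1.7.0 prereleases, the "same-version prerelease" case the
		// test covers.
		RTarget: ">= 1.7.0-a",
		Operand: "semver",
	}
}
```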
* Update distinct_host feasibility checking to honor the job's namespace. Fixes #9792
* Added test to verify original condition and that fix resolved it.
* Added documentation
This change fixes a bug within the generic scheduler which meant
duplicate alloc indexes (names) could be submitted to the plan
applier and written to state. The bug originates from the
placement calculation's assumption that the names of allocations
being replaced can be blindly copied to their replacements. This
is not correct in all cases, particularly when dealing with
canaries.
The fix updates the alloc name index tracker to include minor
duplicate tracking. This can be used when computing placements to
ensure duplicates are found and a new name is picked before the
plan is submitted. The name index tracking is now passed from the
reconciler to the generic scheduler via the results, so it does
not have to be regenerated or rebuilt in another data structure.
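A rough sketch of the duplicate tracking idea (this is not the real alloc name
index type, just an illustration):
```go
// Not the real alloc name index tracker; a sketch of detecting duplicate
// indexes and picking a fresh one before the plan is submitted.
package names

type NameIndex struct {
	used map[uint]struct{}
}

func New() *NameIndex {
	return &NameIndex{used: map[uint]struct{}{}}
}

// Claim records an index and reports whether it was already in use.
func (n *NameIndex) Claim(idx uint) (duplicate bool) {
	if _, ok := n.used[idx]; ok {
		return true
	}
	n.used[idx] = struct{}{}
	return false
}

// Next claims and returns the lowest index not yet in use, e.g. for a
// placement whose copied name turned out to be a duplicate.
func (n *NameIndex) Next() uint {
	for i := uint(0); ; i++ {
		if !n.Claim(i) {
			return i
		}
	}
}
```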
This PR fixes a long-lived bug where disconnecting allocations were never rescheduled according to their reschedule policy, only replaced because the group count was short. The default reschedule delay for service and batch jobs is 30 and 5 seconds respectively; to properly reschedule disconnected allocs, they need to be able to be queued for a later reschedule, a path that was not handled before. This PR introduces a way to handle such allocations.
* core: plumbing to support numa aware scheduling
* core: apply node resources compatibility upon fsm restore
Handle the case where an upgraded server dequeues an evaluation before
a client triggers a new fingerprint - which would be needed to cause
the compatibility fix to run. By running the compat fix on restore the
server will immediately have the compatible pseudo topology to use.
* lint: learn how to spell pseudo
Host volume checks were treated as regular feasibility checks. This had two
unintended consequences.
The first happened when scheduling an allocation with a host volume on a
set of nodes with the same computed class but where only some of them
had the desired host volume.
If the first node evaluated did not have the host volume, the entire
node class was considered ineligible for the task group.
```go
// Run the job feasibility checks.
for _, check := range w.jobCheckers {
	feasible := check.Feasible(option)
	if !feasible {
		// If the job hasn't escaped, set it to be ineligible since it
		// failed a job check.
		if !jobEscaped {
			evalElig.SetJobEligibility(false, option.ComputedClass)
		}
		continue OUTER
	}
}
```
This results in all nodes with the same computed class to be skipped,
even if they do have the desired host volume.
```go
switch evalElig.JobStatus(option.ComputedClass) {
case EvalComputedClassIneligible:
	// Fast path the ineligible case
	metrics.FilterNode(option, "computed class ineligible")
	continue
```
The second consequence is somewhat the opposite. When an allocation has
a host volume with `per_alloc = true` the node must have a host volume
that matches the allocation index, so each allocation is likely to be
placed in different nodes.
But when the first allocation found a node match, it registered the node
class as eligible for the task group.
```go
// Set the task group eligibility if the constraints weren't escaped and
// it hasn't been set before.
if !tgEscaped && tgUnknown {
	evalElig.SetTaskGroupEligibility(true, w.tg, option.ComputedClass)
}
```
This could cause other allocations to be placed on nodes without the
expected host volume because of the computed node class fast path. The
node feasibility for the volume was never checked.
```go
case EvalComputedClassEligible:
	// Fast path the eligible case
	if w.available(option) {
		return option
	}

	// We match the class but are temporarily unavailable
	continue OUTER
```
These problems did not happen with CSI volumes, somewhat by accident.
Since the `CSIVolumeChecker` was not placed in the `tgCheckers` list it
did not cause the node class to be considered ineligible on failure
(avoiding the first problem).
And, as illustrated in the code snippet above, the eligible node class
fast path checks `tgAvailable` (where `CSIVolumeChecker` is placed)
before returning the option (avoiding the second problem).
By placing `HostVolumeChecker` in the `tgAvailable` list instead of
`tgCheckers`, we avoid these problems for host volume feasibility as well.
No functional changes, just cleaning up deprecated usages that are
removed in v2 and replacing one call of `.Slice` with `.ForEach` to avoid
making the intermediate copy.
To support Workload Identity with Consul for templates, we want templates to be
able to use the WI created at the task scope (either implicitly or set by the
user). But to allow different tasks within a group to be assigned to different
clusters as we're doing for Vault, we need to be able to set the `consul` block
with its `cluster` field at the task level to override the group.
Allows for multiple `identity{}` blocks for tasks along with user-specified audiences. This is a building block to allow workload identities to be used with Consul, Vault and 3rd party JWT based auth methods.
Expiration is still unimplemented and is necessary for JWTs to be used securely, so that's up next.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* build: update to go1.21
* go: eliminate helpers in favor of min/max
* build: run go mod tidy
* build: swap depguard for semgrep
* command: fixup broken tls error check on go1.21