The NUMA topology struct field `NodeIDs` is a `idset.Set`, which has no public
members. As a result, this field is never serialized via msgpack and persisted
in state. When `numa.affinity = "prefer"`, the scheduler dereferences this nil
field and panics the scheduler worker.
Ideally we would fix this by adding a msgpack serialization extension, but
because the field already exists and is just always empty, this breaks RPC wire
compatibility across upgrades. Instead, create a new field that's populated at
the same time we populate the more useful `idset.Set`, and repopulate the set on
demand.
Fixes: https://hashicorp.atlassian.net/browse/NET-9924
When an allocation fails it triggers an evaluation. The evaluation is processed
and the scheduler sees it needs to reschedule, which triggers a follow-up
eval. The follow-up eval creates a plan to `(stop 1) (place 1)`. The replacement
alloc has a `RescheduleTracker` (or gets its `RescheduleTracker` updated).
But in the case where the follow-up eval can't place all allocs (there aren't
enough resources), it can create a partial plan to `(stop 1) (place 0)`. It then
creates a blocked eval. The plan applier stops the failed alloc. Then when the
blocked eval is processed, the job is missing an allocation, so the scheduler
creates a new allocation. This allocation is _not_ a replacement from the
perspective of the scheduler, so it's not handed off a `RescheduleTracker`.
This changeset fixes this by annotating the reschedule tracker whenever the
scheduler can't place a replacement allocation. We check this annotation for
allocations that have the `stop` desired status when filtering out allocations
to pass to the reschedule tracker. I've also included tests that cover this case
and expands coverage of the relevant area of the code.
Fixes: https://github.com/hashicorp/nomad/issues/12147
Fixes: https://github.com/hashicorp/nomad/issues/17072
Some of our scheduler tests use the `AllocName` function from the structs
package incorrectly. This function should always receive the `Job.ID` and not
the `Job.Name`. Fix this to prevent future bugs from copy-pasting usage around.
While working on #20462#12319 I found that some of our scheduler tests around
down nodes or disconnected clients were enforcing invariants that were
unclear. This changeset pulls out some minor refactorings so that the bug fix PR
is easier to review. This includes:
* Migrating a few tests from `testify` to `shoenig/test` that I'm going to touch
in #12319 anyways.
* Adding test names to the node down test
* Update the disconnected client test so that we always re-process the
pending/blocked eval it creates; this eliminates 2 redundant sub-tests.
* Update the disconnected client test assertions so that they're explicit in the
test setup rather than implied by whether we re-process the pending/blocked
eval.
Ref: https://github.com/hashicorp/nomad/issues/20462
Ref: https://github.com/hashicorp/nomad/pull/12319
While working on #20462, I discovered that some of the scheduler tests for
disconnected clients making long blocking queries. The tests used
`testutil.WaitForResult` to wait for an evaluation to be written to the state
store. The evaluation was never written, but the tests were not correctly
returning an error for an empty query. This resulted in the tests blocking for
5s and then continuing anyways.
In practice, the evaluation is never written to the state store as part of the
test harness `Process` method, so this test assertion was meaningless. Remove
the broken assertion from the two top-level tests that used it, and upgrade
these tests to use `shoenig/test` in the process. This will save us ~50s per
test run.
This commit introduces the new options for reconciling a reconnecting allocation and its replacement:
Best score (Current implementation)
Keep original
Keep replacement
Keep the one that has run the longest time
It is achieved by adding a new dependency to the allocReconciler that calls the corresponding function depending on the task group's disconnect strategy. For more detailed information, refer to the new stanza for disconnected clientes RFC.
It resolves 15144
* tests: swap testify for test in plugins/csi/client_test.go
* tests: swap testify for test in testutil/
* tests: swap testify for test in host_test.go
* tests: swap testify for test in plugin_test.go
* tests: swap testify for test in utils_test.go
* tests: swap testify for test in scheduler/
* tests: swap testify for test in parse_test.go
* tests: swap testify for test in attribute_test.go
* tests: swap testify for test in plugins/drivers/
* tests: swap testify for test in command/
* tests: fixup some test usages
* go: run go mod tidy
* windows: cpuset test only on linux
This PR is the first on two that will implement the new Disconnect block. In this PR the new block is introduced to be backwards compatible with the fields it will replace. For more information refer to this RFC and this ticket.
When an allocation can't be placed because of a port collision the
resulting blocked eval is expected to have a metric reporting the port
that caused the conflict, but this metrics was not being emitted when
preemption was enabled.
Add support for Consul Enterprise admin partitions. We added fingerprinting in
https://github.com/hashicorp/nomad/pull/19485. This PR adds a `consul.partition`
field. The expectation is that most users will create a mapping of Nomad node
pool to Consul admin partition. But we'll also create an implicit constraint for
the fingerprinted value.
Fixes: https://github.com/hashicorp/nomad/issues/13139
This commit introduces the parameter preventRescheduleOnLost which indicates that the task group can't afford to have multiple instances running at the same time. In the case of a node going down, its allocations will be registered as unknown but no replacements will be rescheduled. If the lost node comes back up, the allocs will reconnect and continue to run.
In case of max_client_disconnect also being enabled, if there is a reschedule policy, an error will be returned.
Implements issue #10366
Co-authored-by: Dom Lavery <dom@circleci.com>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
Clients prior to Nomad 1.7 cannot support the new workload identity-based
authentication to Consul and Vault. Add an implicit Nomad version constraint on
job submission for task groups that use the new workflow.
Includes a constraint test showing same-version prelease handling.
* Update distinct_host feasibility checking to honor the job's namespace. Fixes#9792
* Added test to verify original condition and that fix resolved it.
* Added documentation
This change fixes a bug within the generic scheduler which meant
duplicate alloc indexes (names) could be submitted to the plan
applier and written to state. The bug originates from the
placements calculation notion that names of allocations being
replaced are blindly copied to their replacement. This is not
correct in all cases, particularly when dealing with canaries.
The fix updates the alloc name index tracker to include minor
duplicate tracking. This can be used when computing placements to
ensure duplicate are found, and a new name picked before the plan
is submitted. The name index tracking is now passed from the
reconciler to the generic scheduler via the results, so this does
not have to be regenerated, or another data structure used.
This PR fixes a long lived bug, where disconnecting allocations where never rescheduled by their policy but because the group count was short. The default reschedule time for services and batches is 30 and 5 seconds respectively, in order to properly reschedule disconnected allocs, they need to be able to be rescheduled for later, a path that was not handled before. This PR introduces a way to handle such allocations.
* core: plumbing to support numa aware scheduling
* core: apply node resources compatibility upon fsm rstore
Handle the case where an upgraded server dequeus an evaluation before
a client triggers a new fingerprint - which would be needed to cause
the compatibility fix to run. By running the compat fix on restore the
server will immediately have the compatible pseudo topology to use.
* lint: learn how to spell pseudo
Host volumes were considered regular feasibility checks. This had two
unintended consequences.
The first happened when scheduling an allocation with a host volume on a
set of nodes with the same computed class but where only some of them
had the desired host volume.
If the first node evaluated did not have the host volume, the entire
node class was considered ineligible for the task group.
```go
// Run the job feasibility checks.
for _, check := range w.jobCheckers {
feasible := check.Feasible(option)
if !feasible {
// If the job hasn't escaped, set it to be ineligible since it
// failed a job check.
if !jobEscaped {
evalElig.SetJobEligibility(false, option.ComputedClass)
}
continue OUTER
}
}
```
This results in all nodes with the same computed class to be skipped,
even if they do have the desired host volume.
```go
switch evalElig.JobStatus(option.ComputedClass) {
case EvalComputedClassIneligible:
// Fast path the ineligible case
metrics.FilterNode(option, "computed class ineligible")
continue
```
The second consequence is somewhat the opposite. When an allocation has
a host volume with `per_alloc = true` the node must have a host volume
that matches the allocation index, so each allocation is likely to be
placed in different nodes.
But when the first allocation found a node match, it registered the node
class as eligible for the task group.
```go
// Set the task group eligibility if the constraints weren't escaped and
// it hasn't been set before.
if !tgEscaped && tgUnknown {
evalElig.SetTaskGroupEligibility(true, w.tg, option.ComputedClass)
}
```
This could cause other allocations to be placed on nodes without the
expected host volume because of the computed node class fast path. The
node feasibility for the volume was never checked.
```go
case EvalComputedClassEligible:
// Fast path the eligible case
if w.available(option) {
return option
}
// We match the class but are temporarily unavailable
continue OUTER
```
These problems did not happen with CSI volumes kind of accidentally.
Since the `CSIVolumeChecker` was not placed in the `tgCheckers` list it
did not cause the node class to be considered ineligible on failure
(avoiding the first problem).
And, as illustrated in the code snippet above, the eligible node class
fast path checks `tgAvailable` (where `CSIVolumeChecker` is placed)
before returning the option (avoiding the second problem).
By also placing `HostVolumeChecker` in the `tgAvailable` list instead of
`tgCheckers` we also avoid these problems on host volume feasibility.
No functional changes, just cleaning up deprecated usages that are
removed in v2 and replace one call of .Slice with .ForEach to avoid
making the intermediate copy.
To support Workload Identity with Consul for templates, we want templates to be
able to use the WI created at the task scope (either implicitly or set by the
user). But to allow different tasks within a group to be assigned to different
clusters as we're doing for Vault, we need to be able to set the `consul` block
with its `cluster` field at the task level to override the group.
Allows for multiple `identity{}` blocks for tasks along with user-specified audiences. This is a building block to allow workload identities to be used with Consul, Vault and 3rd party JWT based auth methods.
Expiration is still unimplemented and is necessary for JWTs to be used securely, so that's up next.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* build: update to go1.21
* go: eliminate helpers in favor of min/max
* build: run go mod tidy
* build: swap depguard for semgrep
* command: fixup broken tls error check on go1.21
When the scheduler assigns a device instance, it iterates over the feasible
devices and then picks the first instance with availability. If the jobspec uses
a constraint on device ID, this can lead to buggy/surprising behavior where the
node's device matches the constraint but then the individual device instance
does not.
Add a second filter based on the `${device.ids}` constraint after selecting a
node's device to ensure the device instance ID falls within the constraint as
well.
Fixes: #18112
In #18054 we introduced a new field `render_templates` in the `restart`
block. Previously changes to the `restart` block were always non-destructive in
the scheduler but we now need to check the new field so that we can update the
template runner. The check assumed that the block was always non-nil, which
causes panics in our scheduler tests.
This feature is necessary when user want to explicitly re-render all templates on task restart.
E.g. to fetch all new secrets from Vault, even if the lease on the existing secrets has not been expired.
Although most of the time jobs will be assigned to a single node pool, users may
want to set the node pool to "all" and then constraint to a subset of node
pools. Add support for setting a contraint like `${node.pool}`.
Implement scheduler support for node pool:
* When a scheduler is invoked, we get a set of the ready nodes in the DCs that
are allowed for that job. Extend the filter to include the node pool.
* Ensure that changes to a job's node pool are picked up as destructive
allocation updates.
* Add `NodesInPool` as a metric to all reporting done by the scheduler.
* Add the node-in-pool the filter to the `Node.Register` RPC so that we don't
generate spurious evals for nodes in the wrong pool.
When calculating the score in the `SpreadIterator`, the score boost is
proportional to the difference between the current and desired count. But when
there are implicit spread targets, the current count is the sum of the possible
implicit targets, which results in incorrect scoring unless there's only one
implicit target.
This changeset updates the `propertySet` struct to accept a set of explicit
target values so it can detect when a property value falls into the implicit set
and should be combined with other implicit values.
Fixes: #11823
When spread targets have a percent value of zero it's possible for them to
return -Inf scoring because of a float divide by zero. This is very hard for
operators to debug because the string "-Inf" is returned in the API and that
breaks the presentation of debugging data.
Most scoring iterators are bracketed to -1/+1, but spread iterators do not so
that they can handle greatly unbalanced scoring so we can't simply return a -1
score without generating a score that might be greater than the negative scores
set by other spread targets. Instead, track the lowest-seen spread boost and use
that as the spread boost for any cases where we'd divide by zero.
Fixes: #8863
When the server restarts for the upgrade, it loads the `structs.Job` from the
Raft snapshot/logs. The jobspec has long since been parsed, so none of the
guards around the default value are in play. The empty field value for `Enabled`
is the zero value, which is false.
This doesn't impact any running allocation because we don't replace running
allocations when either the client or server restart. But as soon as any
allocation gets rescheduled (ex. you drain all your clients during upgrades),
it'll be using the `structs.Job` that the server has, which has `Enabled =
false`, and logs will not be collected.
This changeset fixes the bug by adding a new field `Disabled` which defaults to
false (so that the zero value works), and deprecates the old field.
Fixes#17076
Some Nomad users ship application logs out-of-band via syslog. For these users
having `logmon` (and `docker_logger`) running is unnecessary overhead. Allow
disabling the logmon and pointing the task's stdout/stderr to /dev/null.
This changeset is the first of several incremental improvements to log
collection short of full-on logging plugins. The next step will likely be to
extend the internal-only task driver configuration so that cluster
administrators can turn off log collection for the entire driver.
---
Fixes: #11175
Co-authored-by: Thomas Weber <towe75@googlemail.com>
* Honor value for distinct_hosts constraint
* Add test for feasibility checking for `false`
---------
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>