In the spirit of #25909, this PR removes testify dependencies from the scheduler
package and removes `reflect.DeepEqual` usage along the way. This is again a
combination of semgrep and hx editing magic.
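For reference, a minimal sketch of what the swap looks like, using the `must`
package from `github.com/shoenig/test` (the function under test here is
hypothetical):
```go
package scheduler

import (
	"testing"

	"github.com/shoenig/test/must"
)

// TestSwapExample shows the post-swap style: must.NoError and must.Eq
// replace testify's require.NoError and require.Equal, and must.Eq's
// deep comparison also stands in for the removed reflect.DeepEqual.
func TestSwapExample(t *testing.T) {
	got, err := someFunc() // someFunc is a hypothetical function under test
	must.NoError(t, err)
	must.Eq(t, "expected", got)
}
```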
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Some of our scheduler tests use the `AllocName` function from the structs
package incorrectly. This function should always receive the `Job.ID`, not
the `Job.Name`. Fix this to prevent the incorrect usage from being
copy-pasted into future bugs.
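A minimal sketch of the corrected call, assuming `AllocName`'s
`(job, group string, idx uint)` signature from the structs package:
```go
package scheduler

import "github.com/hashicorp/nomad/nomad/structs"

// allocName shows the corrected usage: always pass Job.ID, never
// Job.Name. For a job with ID "example" and task group "cache" this
// yields "example.cache[0]".
func allocName(job *structs.Job, group string, idx uint) string {
	return structs.AllocName(job.ID, group, idx)
}
```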
* tests: swap testify for test in plugins/csi/client_test.go
* tests: swap testify for test in testutil/
* tests: swap testify for test in host_test.go
* tests: swap testify for test in plugin_test.go
* tests: swap testify for test in utils_test.go
* tests: swap testify for test in scheduler/
* tests: swap testify for test in parse_test.go
* tests: swap testify for test in attribute_test.go
* tests: swap testify for test in plugins/drivers/
* tests: swap testify for test in command/
* tests: fixup some test usages
* go: run go mod tidy
* windows: cpuset test only on linux
Host volume checks were run as regular feasibility checks. This had two
unintended consequences.
The first happened when scheduling an allocation with a host volume on a
set of nodes with the same computed class but where only some of them
had the desired host volume.
If the first node evaluated did not have the host volume, the entire
node class was considered ineligible for the task group.
```go
// Run the job feasibility checks.
for _, check := range w.jobCheckers {
	feasible := check.Feasible(option)
	if !feasible {
		// If the job hasn't escaped, set it to be ineligible since it
		// failed a job check.
		if !jobEscaped {
			evalElig.SetJobEligibility(false, option.ComputedClass)
		}
		continue OUTER
	}
}
```
This causes all nodes with the same computed class to be skipped,
even if they do have the desired host volume.
```go
switch evalElig.JobStatus(option.ComputedClass) {
case EvalComputedClassIneligible:
	// Fast path the ineligible case
	metrics.FilterNode(option, "computed class ineligible")
	continue
```
The second consequence is somewhat the opposite. When an allocation has
a host volume with `per_alloc = true`, the node must have a host volume
that matches the allocation index, so each allocation is likely to be
placed on a different node.
But when the first allocation found a node match, it registered the node
class as eligible for the task group.
```go
// Set the task group eligibility if the constraints weren't escaped and
// it hasn't been set before.
if !tgEscaped && tgUnknown {
	evalElig.SetTaskGroupEligibility(true, w.tg, option.ComputedClass)
}
```
This could cause other allocations to be placed on nodes without the
expected host volume because of the computed node class fast path. The
node feasibility for the volume was never checked.
```go
case EvalComputedClassEligible:
	// Fast path the eligible case
	if w.available(option) {
		return option
	}

	// We match the class but are temporarily unavailable
	continue OUTER
```
These problems did not happen with CSI volumes, somewhat by accident.
Since the `CSIVolumeChecker` was not placed in the `tgCheckers` list, it
did not cause the node class to be considered ineligible on failure
(avoiding the first problem).
And, as illustrated in the code snippet above, the eligible node class
fast path checks `tgAvailable` (where `CSIVolumeChecker` is placed)
before returning the option (avoiding the second problem).
By placing `HostVolumeChecker` in the `tgAvailable` list instead of
`tgCheckers`, we avoid these problems for host volume feasibility as well.
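A minimal, self-contained sketch of that split (simplified names; not the
actual `FeasibilityWrapper` implementation): checkers in `tgCheckers` feed the
computed-class eligibility cache, while checkers in `tgAvailable` run against
every candidate node and never taint a class.
```go
package scheduler

// feasibilityChecker is a pared-down stand-in for the scheduler's
// FeasibilityChecker interface.
type feasibilityChecker interface {
	Feasible(node string) bool
}

// wrapper sketches the split: tgCheckers results are cached per
// computed class, tgAvailable results are re-evaluated for every node.
type wrapper struct {
	tgCheckers  []feasibilityChecker
	tgAvailable []feasibilityChecker
	classElig   map[string]bool // computed class -> cached eligibility, initialized by the caller
}

func (w *wrapper) feasible(node, computedClass string) bool {
	if elig, ok := w.classElig[computedClass]; ok && !elig {
		return false // fast path: class already known ineligible
	}
	for _, c := range w.tgCheckers {
		if !c.Feasible(node) {
			// A failure here taints the entire computed class.
			w.classElig[computedClass] = false
			return false
		}
	}
	// Availability checks (where HostVolumeChecker now lives, next to
	// CSIVolumeChecker) run per node and never touch the class cache.
	for _, c := range w.tgAvailable {
		if !c.Feasible(node) {
			return false
		}
	}
	w.classElig[computedClass] = true
	return true
}
```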
The CSI `CreateVolume` RPC is idempotent provided that the topology,
capabilities, and parameters are unchanged. CSI volumes have many
user-defined fields that are immutable once set, and many fields that
are not user-settable.
Update the `Register` RPC so that updating a volume via the API merges
onto any existing volume without touching Nomad-controlled fields,
while validating it with the same strict requirements expected for
idempotent `CreateVolume` RPCs.
Also, clarify that this state store method is used for everything, not just
for the `Register` RPC.
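A hypothetical sketch of those merge semantics, using a pared-down volume
shape rather than the real `structs.CSIVolume`:
```go
package state

import (
	"fmt"
	"maps"
)

// csiVolume is an illustrative subset; the real structs.CSIVolume
// carries many more fields.
type csiVolume struct {
	ID          string
	Name        string            // user-settable, mutable
	Parameters  map[string]string // user-settable, immutable once set
	CreateIndex uint64            // Nomad-controlled
	ModifyIndex uint64            // Nomad-controlled
}

// mergeOnRegister sketches the Register semantics: the update merges
// onto the existing volume, immutable fields must match (mirroring the
// idempotency rules for CreateVolume), and Nomad-controlled fields are
// never taken from the request.
func mergeOnRegister(existing, update *csiVolume) (*csiVolume, error) {
	if existing == nil {
		return update, nil // brand-new volume: accept the request as-is
	}
	if len(existing.Parameters) > 0 && !maps.Equal(existing.Parameters, update.Parameters) {
		return nil, fmt.Errorf("volume %s: parameters are immutable once set", existing.ID)
	}
	merged := *existing
	merged.Name = update.Name // mutable user field
	if merged.Parameters == nil {
		merged.Parameters = update.Parameters
	}
	// CreateIndex and ModifyIndex stay untouched here; the state store
	// assigns them inside the transaction.
	return &merged, nil
}
```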
This PR implements a new "System Batch" scheduler type. Jobs can
make use of this new scheduler by setting their type to 'sysbatch'.
As the name implies, sysbatch can be thought of as a hybrid between
system and batch jobs - it is for running short-lived jobs intended to
run on every compatible node in the cluster.
As with batch jobs, sysbatch jobs can also be periodic and/or parameterized
dispatch jobs. A sysbatch job is considered complete when it has run to a
terminal state (success, or failed after retries) on all compatible nodes.
Feasibility and preemption are governed the same as with system jobs. In
this PR, the update stanza is not yet supported. The update stanza is still
limited in functionality for the underlying system scheduler, and is
not yet useful for sysbatch jobs. Further work in #4740 will improve
support for the update stanza and deployments.
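Opting in takes only the job type, sketched below assuming the
`structs.JobTypeSysBatch` constant this PR introduces:
```go
package scheduler

import "github.com/hashicorp/nomad/nomad/structs"

// asSysbatch switches a job to the new scheduler; periodic and
// parameterized dispatch configuration carry over as with batch jobs.
func asSysbatch(job *structs.Job) {
	job.Type = structs.JobTypeSysBatch // "sysbatch"
}
```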
Closes #2527
Add a `PerAlloc` field to volume requests that directs the scheduler to test
feasibility for volumes with a source ID that includes the allocation index
suffix (ex. `[0]`), rather than the exact source ID.
Read the `PerAlloc` field when making the volume claim at the client to
determine if the allocation index suffix (ex. `[0]`) should be added to the
volume source ID.
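A minimal sketch of the suffixing rule (the helper is hypothetical; the `[n]`
format matches the allocation-name suffix):
```go
package scheduler

import "fmt"

// perAllocSource is a hypothetical helper showing the rule: with
// per_alloc = true, allocation index 0 of volume source "data"
// requests the volume "data[0]"; otherwise the source is used as-is.
func perAllocSource(source string, perAlloc bool, allocIndex uint) string {
	if !perAlloc {
		return source
	}
	return fmt.Sprintf("%s[%d]", source, allocIndex)
}
```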
* use msgtype in upsert node
adds message type to signature for upsert node, update tests, remove placeholder method
* UpsertAllocs msg type test setup
* use upsertallocs with msg type in signature
update test usage of delete node
delete placeholder msgtype method
* add msgtype to upsert evals signature, update test call sites with test setup msg type
handle snapshot upsert eval outside of FSM and ignore eval event
remove placeholder upsertevalsmsgtype
handle job plan rpc and prevent event creation for plan
msgtype cleanup upsertnodeevents
updatenodedrain msgtype
msg type 0 is a node registration event, so set the default to the ignore type
* fix named import
* fix signature ordering on upsertnode to match
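For reference, the rough shape of the signature change these commits describe
(exact parameter lists in Nomad may differ):
```go
package state

import "github.com/hashicorp/nomad/nomad/structs"

// StateStore is elided to its relevant surface here.
type StateStore struct{}

// UpsertNode now leads with the raft message type so the FSM can
// decide which events to emit, or suppress them (e.g. for plan applies).
func (s *StateStore) UpsertNode(msgType structs.MessageType, index uint64, node *structs.Node) error {
	// ... persist the node, then publish an event keyed on msgType
	return nil
}
```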
* state_store: csi volumes/plugins store the index in the txn
* nomad: csi_endpoint_test require index checks need uint64()
* nomad: other tests using int 0 not uint64(0)
* structs: pass index into New, but not other struct methods
* state_store: csi plugin indexes, use new struct interface
* nomad: csi_endpoint_test check index/query meta (on explicit 0)
* structs: NewCSIVolume takes an index arg now
* scheduler/test: NewCSIVolume takes an index arg now
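In test code, the constructor change looks roughly like this (index value
illustrative):
```go
package scheduler

import "github.com/hashicorp/nomad/nomad/structs"

// newTestVolume reflects the new constructor: NewCSIVolume now takes
// the raft index up front instead of having it set separately.
func newTestVolume() *structs.CSIVolume {
	return structs.NewCSIVolume("volume-id", 1000)
}
```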
This brings test code and mocks up to date with the fingerprinter. This was a slightly larger change than I anticipated, but I think it's good for two reasons:
1. More semantically correct. `os.name` is something like "Windows 10 Pro" or "Ubuntu", while `kernel.name` is "windows" or "linux". `os.version` and `kernel.version` match these semantics.
2. `kernel.name` is much easier to grep for than `os`, which is helpful because oracle can't help us with strings.
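Illustrative attribute values under the new semantics (exact strings vary by
platform):
```go
package fingerprint

// os.* describes the product; kernel.* describes the platform.
var exampleAttributes = map[string]string{
	"os.name":        "Ubuntu", // or "Windows 10 Pro"
	"os.version":     "20.04",
	"kernel.name":    "linux", // or "windows"
	"kernel.version": "5.4.0-66-generic",
}
```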