scheduler: fix quadratic performance with spread blocks (#11712)

When the scheduler picks a node for each evaluation, the `LimitIterator` provides at most 2 eligible nodes for the `MaxScoreIterator` to choose from. This keeps scheduling fast while producing acceptable results because the results are binpacked.

Jobs with a `spread` block (or node affinity) remove this limit in order to produce correct spread scoring. This means that every allocation within a job with a `spread` block is evaluated against _all_ eligible nodes.

Operators of large clusters have reported that jobs with `spread` blocks that are eligible on a large number of nodes can take longer to evaluate than the nack timeout (60s). Typical evaluations are processed in milliseconds.

In practice, it's not necessary to evaluate every eligible node for every allocation on large clusters, because the `RandomIterator` at the base of the scheduler stack produces enough variation in each pass that the likelihood of an uneven spread is negligible. Note that feasibility is checked before the limit, so this only impacts the number of _eligible_ nodes available for scoring, not the total number of nodes.

This changeset sets the iterator limit for "large" `spread` block and node affinity jobs to be equal to the number of desired allocations. This brings an example problematic job evaluation down from ~3min to ~10s. The included tests ensure that we have acceptable spread results across a variety of large cluster topologies.
2026-01-06 18:35:44 +03:00 · 2021-12-21 10:10:01 -05:00
parent 20bbdba041
commit 2d4e5b8fe9
4 changed files with 260 additions and 3 deletions
--- a/website/content/docs/job-specification/spread.mdx
+++ b/website/content/docs/job-specification/spread.mdx
@@ -54,8 +54,12 @@ spread stanza. Spread scores are combined with other scoring factors such as bin

 A job or task group can have more than one spread criteria, with weights to express relative preference.

-Spread criteria are treated as a soft preference by the Nomad scheduler.
-If no nodes match a given spread criteria, placement is still successful.
+Spread criteria are treated as a soft preference by the Nomad
+scheduler. If no nodes match a given spread criteria, placement is
+still successful. To avoid scoring every node for every placement,
+allocations may not be perfectly spread. Spread works best on
+attributes with similar number of nodes: identically configured racks
+or similarly configured datacenters.

 Spread may be expressed on [attributes][interpolation] or [client metadata][client-meta].
 Additionally, spread may be specified at the [job][job] and [group][group] levels for ultimate flexibility. Job level spread criteria are inherited by all task groups in the job.