Commit Graph

20 Commits

Author SHA1 Message Date
Tim Gross
c49359ad58 scheduler: prevent panic in spread iterator during alloc stop
The spread iterator can panic when processing an evaluation, resulting
in an unrecoverable state in the cluster. Whenever a panicked server
restarts and quorum is restored, the next server to dequeue the
evaluation will panic.

To trigger this state:
* The job must have `max_parallel = 0` and `canary >= 1`.
* The job must not have a `spread` block.
* The job must have a previous version.
* The previous version must have a `spread` block and at least one
  failed allocation.

In this scenario, the desired changes include
`(place 1+) (stop 1+) (ignore n) (canary 1)`. Before the scheduler can
place the canary
allocation, it tries to find out which allocations can be
stopped. This passes back through the stack so that we can determine
previous-node penalties, etc. We call `SetJob` on the stack with the
previous version of the job, which will include assessing the `spread`
block (even though the results are unused). The task group spread info
state from that pass through the spread iterator is not reset when we
call `SetJob` again. When the new job version iterates over the
`groupPropertySets`, it will get an empty `spreadAttributeMap`,
resulting in an unexpected nil pointer dereference.

This changeset resets the spread iterator's internal state when setting
the job, adds logging with a bypass around the bug in case we hit
similar cases, and adds a test that panics the scheduler without the
patch.
2022-02-09 19:53:06 -05:00
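
The stale-state failure described in c49359ad58 can be illustrated with a toy model. This is a minimal sketch, not Nomad's actual code: the names `groupPropertySets`, `spreadAttributeMap`, and `SetJob` come from the commit message, while the struct shapes, the `resetOnSetJob` flag, the `Score` method, and the `scoreAfterVersionChange` helper are hypothetical and exist only to show how state kept from the previous job version can lead to a nil pointer dereference, and how resetting in `SetJob` avoids it.

```go
package main

import "fmt"

// spreadInfo is a hypothetical stand-in for the per-attribute spread details.
type spreadInfo struct {
	weight int
}

// job is a pared-down, hypothetical job: only the attributes it spreads over.
type job struct {
	spreadAttrs []string
}

// spreadIterator is a toy model, not Nomad's actual type. The field names
// echo the commit message; their shapes here are assumptions.
type spreadIterator struct {
	groupPropertySets map[string][]string               // task group -> attributes to score
	tgSpreadInfo      map[string]map[string]*spreadInfo // task group -> attribute -> details
	resetOnSetJob     bool                              // true models the patched behavior
}

func newSpreadIterator(resetOnSetJob bool) *spreadIterator {
	return &spreadIterator{
		groupPropertySets: map[string][]string{},
		tgSpreadInfo:      map[string]map[string]*spreadInfo{},
		resetOnSetJob:     resetOnSetJob,
	}
}

// SetJob loads a job version for one task group.
func (s *spreadIterator) SetJob(tg string, j job) {
	if s.resetOnSetJob {
		// The fix: drop everything left over from the previous job version.
		s.groupPropertySets = map[string][]string{}
		s.tgSpreadInfo = map[string]map[string]*spreadInfo{}
	}
	// Property sets are built once per task group, so without the reset
	// the entry from an earlier job version is kept.
	if _, ok := s.groupPropertySets[tg]; !ok {
		s.groupPropertySets[tg] = append([]string(nil), j.spreadAttrs...)
	}
	// Spread details are rebuilt from the job being set.
	info := map[string]*spreadInfo{}
	for _, attr := range j.spreadAttrs {
		info[attr] = &spreadInfo{weight: 100}
	}
	s.tgSpreadInfo[tg] = info
}

// Score sums the weights for every tracked attribute of the task group.
// It dereferences the looked-up details without a nil check, which is the
// failure mode this sketch is meant to show.
func (s *spreadIterator) Score(tg string) int {
	total := 0
	spreadAttributeMap := s.tgSpreadInfo[tg]
	for _, attr := range s.groupPropertySets[tg] {
		total += spreadAttributeMap[attr].weight // nil pointer if attr is stale
	}
	return total
}

// scoreAfterVersionChange sets the previous job version, then the new one,
// then scores. It reports a panic as an error rather than crashing.
func scoreAfterVersionChange(resetOnSetJob bool) (score int, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("scheduler panicked: %v", r)
		}
	}()
	it := newSpreadIterator(resetOnSetJob)
	it.SetJob("web", job{spreadAttrs: []string{"${node.datacenter}"}}) // previous version, has spread
	it.SetJob("web", job{})                                            // new version, no spread
	return it.Score("web"), nil
}
```

In this model, `scoreAfterVersionChange(false)` returns an error wrapping the nil-dereference panic, while `scoreAfterVersionChange(true)` returns a score of 0 and no error.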
Preetha Appan
be897cadc3 More error->debug for logging in the bin packing iterator 2019-12-12 15:50:16 -06:00
Preetha Appan
566dd71486 Fix comment and assert score in test case 2019-05-15 12:35:57 -05:00
Nick Ethier
5709bf7b54 fix missing brace 2019-05-15 13:02:04 -04:00
Nick Ethier
ea843a507a scheduler: add check to prohibit returning inf during spread boost calculation 2019-05-15 13:00:24 -04:00
Alex Dadgar
bc42873e07 Change types of weights on spread/affinity 2019-01-30 12:20:38 -08:00
Alex Dadgar
260b566c91 server 2018-09-15 16:23:13 -07:00
Preetha Appan
f6cbfbfef6 Track top k nodes by norm score rather than top k nodes per scorer 2018-09-04 16:10:11 -05:00
Preetha Appan
72570e0698 fix linting error 2018-09-04 16:10:11 -05:00
Preetha Appan
31b2102055 Fix scoring logic for uneven spread to incorporate current alloc count
Also addressed other small code review comments
2018-09-04 16:10:11 -05:00
Preetha Appan
1ac696da56 more cleanup 2018-09-04 16:10:11 -05:00
Preetha Appan
fc48be3656 added some unit tests for -1 spread score 2018-09-04 16:10:11 -05:00
Preetha Appan
f881c4f266 comment and formatting cleanup 2018-09-04 16:10:11 -05:00
Preetha Appan
2dfdd4874f fix scoring algorithm when min count == current count 2018-09-04 16:10:11 -05:00
Preetha Appan
35bda8c975 Remove hardcoded boosts for even spread.
Instead, calculate them based on the delta between current and minimum value
2018-09-04 16:10:11 -05:00
Preetha Appan
7a5791f39e Implement support for even spread across datacenters, with unit test 2018-09-04 16:10:11 -05:00
Preetha Appan
56de0d0a11 Support implicit spread target to account for remaining desired counts 2018-09-04 16:10:11 -05:00
Preetha Appan
5f1d40e4c3 fix comments 2018-09-04 16:10:11 -05:00
Preetha Appan
bf84a5985a Include spreads configured at job level when precomputing weights/desired counts. 2018-09-04 16:10:11 -05:00
Preetha Appan
fd697272a7 Implement spread iterator that scores according to percentage of desired count in each target.
Added this as a new step in the stack and some unit tests
2018-09-04 16:10:11 -05:00