Commit Graph

55 Commits

Tim Gross
2d4e5b8fe9 scheduler: fix quadratic performance with spread blocks (#11712)
When the scheduler picks a node for each evaluation, the
`LimitIterator` provides at most 2 eligible nodes for the
`MaxScoreIterator` to choose from. This keeps scheduling fast while
producing acceptable results because the results are binpacked.

Jobs with a `spread` block (or node affinity) remove this limit in
order to produce correct spread scoring. This means that every
allocation within a job with a `spread` block is evaluated against
_all_ eligible nodes. Operators of large clusters have reported that
jobs with `spread` blocks for which a large number of nodes are eligible
can take longer than the nack timeout (60s) to evaluate. Typical
evaluations are processed in milliseconds.

In practice, it's not necessary to evaluate every eligible node for
every allocation on large clusters, because the `RandomIterator` at
the base of the scheduler stack produces enough variation in each pass
that the likelihood of an uneven spread is negligible. Note that
feasibility is checked before the limit, so this only impacts the
number of _eligible_ nodes available for scoring, not the total number
of nodes.

This changeset sets the iterator limit for "large" `spread` block and
node affinity jobs to be equal to the number of desired
allocations. This brings an example problematic job evaluation down
from ~3min to ~10s. The included tests ensure that we have acceptable
spread results across a variety of large cluster topologies.
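
A minimal sketch of that limit policy in Go; the function and parameter names are illustrative, not Nomad's actual identifiers:
```go
package main

import "fmt"

// scoreLimit sketches the new policy: jobs without spread/affinity keep the
// fast default of 2 candidates, while spread/affinity jobs are capped at one
// scoring candidate per desired allocation instead of all eligible nodes.
func scoreLimit(hasSpreadOrAffinity bool, eligibleNodes, desiredAllocs int) int {
	const defaultLimit = 2 // MaxScoreIterator normally chooses between 2 nodes
	if !hasSpreadOrAffinity {
		return defaultLimit
	}
	if desiredAllocs < eligibleNodes {
		// The RandomIterator's shuffling keeps the spread acceptably even
		// without scoring every eligible node.
		return desiredAllocs
	}
	return eligibleNodes
}

func main() {
	fmt.Println(scoreLimit(false, 5000, 30)) // 2
	fmt.Println(scoreLimit(true, 5000, 30))  // 30
	fmt.Println(scoreLimit(true, 10, 30))    // 10
}
```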
2021-12-21 10:10:01 -05:00
Seth Hoenig
61ee443ee6 core: implement system batch scheduler
This PR implements a new "System Batch" scheduler type. Jobs can
make use of this new scheduler by setting their type to 'sysbatch'.

As the name implies, sysbatch can be thought of as a hybrid between
system and batch jobs: it is for running short-lived jobs intended to
run on every compatible node in the cluster.

As with batch jobs, sysbatch jobs can also be periodic and/or parameterized
dispatch jobs. A sysbatch job is considered complete when it has run
on all compatible nodes and reached a terminal state (success, or failure
after exhausting retries).
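
A rough sketch of that completion rule, with illustrative types rather than Nomad's internals:
```go
package main

import "fmt"

type allocStatus string

const (
	statusRunning  allocStatus = "running"
	statusComplete allocStatus = "complete"
	statusFailed   allocStatus = "failed" // terminal once retries are exhausted
)

// sysbatchComplete reports whether every compatible node has run the job to
// a terminal state, which is the completion condition described above.
func sysbatchComplete(statusByNode map[string]allocStatus, compatibleNodes []string) bool {
	for _, node := range compatibleNodes {
		status, ok := statusByNode[node]
		if !ok || (status != statusComplete && status != statusFailed) {
			return false // not yet placed, or still running
		}
	}
	return true
}

func main() {
	nodes := []string{"node-a", "node-b"}
	fmt.Println(sysbatchComplete(map[string]allocStatus{"node-a": statusComplete}, nodes)) // false
	fmt.Println(sysbatchComplete(map[string]allocStatus{
		"node-a": statusComplete,
		"node-b": statusFailed,
	}, nodes)) // true
}
```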

Feasibility and preemption are governed the same as with system jobs. In
this PR, the update stanza is not yet supported. The update stanza is still
limited in functionality for the underlying system scheduler, and is
not useful yet for sysbatch jobs. Further work in #4740 will improve
support for the update stanza and deployments.

Closes #2527
2021-08-03 10:30:47 -04:00
Tim Gross
2b63a093ac quotas: evaluate quota feasibility last in scheduler (#10753)
The `QuotaIterator` is used as the source of nodes passed into feasibility
checking for constraints. Every node that passes the quota check counts the
allocation resources against the quota, so we count nodes that will later be
filtered out by constraints. Therefore, for jobs with constraints, nodes that
are feasibility checked but fail have already been counted against quotas. This
failure mode is order dependent: if all the unfiltered nodes happen to be quota
checked first, everything works as expected.

This changeset moves the `QuotaIterator` to happen last among all feasibility
checkers (but before ranking). The `QuotaIterator` will never receive filtered
nodes so it will calculate quotas correctly.
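
The ordering matters because each checker only sees nodes that survived the checkers before it. A toy sketch with hypothetical types:
```go
package main

import "fmt"

type node struct {
	name  string
	class string
}

type checker func(n node) bool

// quotaChecker counts every node it sees against the quota, mimicking the
// side effect described above.
func quotaChecker(counted *int) checker {
	return func(n node) bool {
		*counted++
		return true
	}
}

func constraintChecker(wantClass string) checker {
	return func(n node) bool { return n.class == wantClass }
}

func runCheckers(nodes []node, checks []checker) (feasible []node) {
	for _, n := range nodes {
		ok := true
		for _, c := range checks {
			if !c(n) {
				ok = false
				break // later checkers never see this node
			}
		}
		if ok {
			feasible = append(feasible, n)
		}
	}
	return feasible
}

func main() {
	nodes := []node{{"a", "gpu"}, {"b", "cpu"}, {"c", "gpu"}}
	var counted int
	// With the quota check last, only the two feasible "gpu" nodes count.
	runCheckers(nodes, []checker{constraintChecker("gpu"), quotaChecker(&counted)})
	fmt.Println("counted against quota:", counted) // 2, not 3
}
```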
2021-06-14 10:11:40 -04:00
Mahmood Ali
28b8767b27 Allow configuring memory oversubscription (#10466)
Cluster operators want to have better control over memory
oversubscription and may want to enable/disable it based on their
experience.

This PR adds a scheduler configuration field to control memory
oversubscription. It's an additional field that can be set in the [API via Scheduler Config](https://www.nomadproject.io/api-docs/operator/scheduler) or [the agent server config](https://www.nomadproject.io/docs/configuration/server#configuring-scheduler-config).

I opted to make memory oversubscription opt-in, but I'm happy to change it. To enable it, operators should call the API with:
```json
{
  "MemoryOversubscriptionEnabled": true
}
```

If memory oversubscription is disabled, jobs that specify `memory_max` will get a "Memory oversubscription is not
enabled" warning, but the jobs will be accepted without access to
the additional memory.

The warning message looks like this:
```
$ nomad job run /tmp/j
Job Warnings:
1 warning(s):

* Memory oversubscription is not enabled; Task cache.redis memory_max value will be ignored

==> Monitoring evaluation "7c444157"
    Evaluation triggered by job "example"
==> Monitoring evaluation "7c444157"
    Evaluation within deployment: "9d826f13"
    Allocation "aa5c3cad" created: node "9272088e", group "cache"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "7c444157" finished with status "complete"

# then you can examine the Alloc AllocatedResources to validate whether the task is allowed to exceed memory:
$ nomad alloc status -json aa5c3cad | jq '.AllocatedResources.Tasks["redis"].Memory'
{
  "MemoryMB": 256,
  "MemoryMaxMB": 0
}
```
2021-04-29 22:09:56 -04:00
Tim Gross
7c7569674c CSI: unique volume per allocation
Add a `PerAlloc` field to volume requests that directs the scheduler to test
feasibility for volumes with a source ID that includes the allocation index
suffix (ex. `[0]`), rather than the exact source ID.

Read the `PerAlloc` field when making the volume claim at the client to
determine if the allocation index suffix (ex. `[0]`) should be added to the
volume source ID.
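
A minimal sketch of the suffixing behavior, using a hypothetical helper name:
```go
package main

import "fmt"

// perAllocSourceID appends the allocation index suffix (e.g. "[0]") when the
// volume request sets PerAlloc, per the description above.
func perAllocSourceID(source string, allocIndex uint, perAlloc bool) string {
	if !perAlloc {
		return source
	}
	return fmt.Sprintf("%s[%d]", source, allocIndex)
}

func main() {
	fmt.Println(perAllocSourceID("csi-volume", 0, true))  // csi-volume[0]
	fmt.Println(perAllocSourceID("csi-volume", 0, false)) // csi-volume
}
```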
2021-03-18 15:35:11 -04:00
Mahmood Ali
5720266c91 Respect alloc job version for lost/failed allocs
This change fixes a bug where lost/failed allocations are replaced by
allocations with the latest job version, even if that version hasn't been
promoted yet.

Now, when generating a plan for lost/failed allocations, the scheduler
first checks whether the current deployment is in the Canary stage, and if
so, it ensures that any lost/failed allocation is replaced with one at the
latest promoted version instead.
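
A sketch of the version-selection rule, with hypothetical names:
```go
package main

import "fmt"

// replacementVersion picks the job version used when replacing a lost or
// failed allocation: during a canary deployment the replacement stays on the
// latest promoted version rather than jumping to the unpromoted latest one.
func replacementVersion(inCanary bool, latestPromoted, latest uint64) uint64 {
	if inCanary {
		return latestPromoted
	}
	return latest
}

func main() {
	fmt.Println(replacementVersion(true, 3, 4))  // 3: canary not promoted yet
	fmt.Println(replacementVersion(false, 4, 4)) // 4
}
```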
2020-08-19 09:52:48 -04:00
Nick Ethier
60c301758c scheduler: do network feasibility checking for system jobs (#8256) 2020-06-24 16:01:00 -04:00
Nick Ethier
ad8ced3873 multi-interface network support 2020-06-19 09:42:10 -04:00
Nick Ethier
33ce12cda9 CNI Implementation (#7518) 2020-06-18 11:05:29 -07:00
Mahmood Ali
9f11857ad1 Open source Preemption code
Nomad 0.12 OSS is to include the preemption feature.

This commit moves the private code for managing preemption to OSS
repository.
2020-05-27 15:02:01 -04:00
Mahmood Ali
5078e0cfed tests and some clean up 2020-05-01 13:13:30 -04:00
Charlie Voiselle
1af6a2adf1 Wiring algorithm to scheduler calls 2020-05-01 13:13:29 -04:00
Michael Schurter
a61a775b62 core: fix comment on system stack
This makes me do a double take every time I run into it, so what if we
just changed it?
2020-04-09 15:19:11 -07:00
Lang Martin
ce9dbe619f csi: the scheduler allows a job with a volume write claim to be updated (#7438)
* nomad/structs/csi: split CanWrite into health, in use

* scheduler/scheduler: expose AllocByID in the state interface

* nomad/state/state_store_test

* scheduler/stack: SetJobID on the matcher

* scheduler/feasible: when a volume writer is in use, check if it's us

* scheduler/feasible: remove SetJob

* nomad/state/state_store: denormalize allocs before Claim

* nomad/structs/csi: return errors on claim, with context

* nomad/csi_endpoint_test: new alloc doesn't look like an update

* nomad/state/state_store_test: change test reference to CanWrite
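
The key feasibility rule from the bullets above ("when a volume writer is in use, check if it's us"), sketched with illustrative types:
```go
package main

import "fmt"

// writeFeasible mirrors the "is the current writer us?" check: a node passes
// if the volume has no writer, or the existing write claim belongs to the
// same job being updated. Names are illustrative, not Nomad's.
func writeFeasible(currentWriterJobID, candidateJobID string) bool {
	return currentWriterJobID == "" || currentWriterJobID == candidateJobID
}

func main() {
	fmt.Println(writeFeasible("", "web"))      // true: volume is unclaimed
	fmt.Println(writeFeasible("web", "web"))   // true: the writer is us (an update)
	fmt.Println(writeFeasible("other", "web")) // false: claimed by another job
}
```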
2020-03-23 21:21:04 -04:00
Lang Martin
9c9a0c5eb5 csi: volume ids are only unique per namespace (#7358)
* nomad/state/schema: use the namespace compound index

* scheduler/scheduler: CSIVolumeByID interface signature namespace

* scheduler/stack: SetJob on CSIVolumeChecker to capture namespace

* scheduler/feasible: pass the captured namespace to CSIVolumeByID

* nomad/state/state_store: use namespace in csi_volume index

* nomad/fsm: pass namespace to CSIVolumeDeregister & Claim

* nomad/core_sched: pass the namespace in volumeClaimReap

* nomad/node_endpoint_test: namespaces in Claim testing

* nomad/csi_endpoint: pass RequestNamespace to state.*

* nomad/csi_endpoint_test: appropriately failed test

* command/alloc_status_test: appropriately failed test

* node_endpoint_test: avoid notTheNamespace for the job

* scheduler/feasible_test: call SetJob to capture the namespace

* nomad/csi_endpoint: ACL check the req namespace, query by namespace

* nomad/state/state_store: remove deregister namespace check

* nomad/state/state_store: remove unused CSIVolumes

* scheduler/feasible: CSIVolumeChecker SetJob -> SetNamespace

* nomad/csi_endpoint: ACL check

* nomad/state/state_store_test: remove call to state.CSIVolumes

* nomad/core_sched_test: job namespace match so claim gc works
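
The core idea of the namespace-compound index from the bullets above, sketched (illustrative, not the real memdb schema):
```go
package main

import "fmt"

// volumeKey keys volumes by (namespace, volume ID), so the same volume ID
// can exist independently in different namespaces.
type volumeKey struct {
	namespace string
	id        string
}

func main() {
	volumes := map[volumeKey]string{
		{"default", "vol-1"}: "plugin-a",
		{"team-b", "vol-1"}:  "plugin-b", // same ID, different namespace
	}
	fmt.Println(volumes[volumeKey{"default", "vol-1"}]) // plugin-a
	fmt.Println(volumes[volumeKey{"team-b", "vol-1"}])  // plugin-b
}
```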
2020-03-23 13:59:25 -04:00
Lang Martin
f370e25843 CSI: Scheduler knows about CSI constraints and availability (#6995)
* structs: piggyback csi volumes on host volumes for job specs

* state_store: CSIVolumeByID always includes plugins, matches usecase

* scheduler/feasible: csi volume checker

* scheduler/stack: add csi volumes

* contributing: update rpc checklist

* scheduler: add volumes to State interface

* scheduler/feasible: introduce new checker collection tgAvailable

* scheduler/stack: taskGroupCSIVolumes checker is transient

* state_store CSIVolumeDenormalizePlugins comment clarity

* structs: remove TODO comment in TaskGroup Validate

* scheduler/feasible: CSIVolumeChecker hasPlugins improve comment

* scheduler/feasible_test: set t.Parallel

* Update nomad/state/state_store.go

Co-Authored-By: Danielle <dani@hashicorp.com>

* Update scheduler/feasible.go

Co-Authored-By: Danielle <dani@hashicorp.com>

* structs: lift ControllerRequired to each volume

* state_store: store plug.ControllerRequired, use it for volume health

* feasible: csi match fast path remove stale host volume copied logic

* scheduler/feasible: improve comments

Co-authored-by: Danielle <dani@builds.terrible.systems>
2020-03-23 13:58:29 -04:00
Danielle Lancashire
709abbc675 scheduler: Add a feasibility checker for Host Vols 2019-08-12 15:39:08 +02:00
Preetha Appan
4743561396 Refactor scheduler package to enable preemption for batch/service jobs 2019-04-10 20:24:01 -05:00
Preetha Appan
6966e3c3e8 Make preemption config a struct to allow for enabling based on scheduler type 2018-10-30 11:06:32 -05:00
Preetha Appan
2143fa2ab7 Use scheduler config from state store to enable/disable preemption 2018-10-30 11:06:32 -05:00
Alex Dadgar
670c7e57dc add to stack 2018-10-13 12:27:49 -07:00
Alex Dadgar
49c2d4f775 Scheduler uses allocated resources 2018-10-02 17:08:25 -07:00
Preetha Appan
fd697272a7 Implement spread iterator that scores according to percentage of desired count in each target.
Added this as a new step in the stack, along with unit tests.
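
A rough sketch of percentage-based spread scoring: nodes in targets below their desired percentage score higher. The formula here is a hypothetical illustration; the real iterator's boost calculation may differ:
```go
package main

import "fmt"

// spreadScore returns a boost in [-1, 1]: positive when the target is under
// its desired percentage of placements, negative when over.
func spreadScore(desiredPct float64, placedInTarget, totalPlaced int) float64 {
	if totalPlaced == 0 {
		return desiredPct / 100 // nothing placed yet: favor bigger targets
	}
	actualPct := 100 * float64(placedInTarget) / float64(totalPlaced)
	return (desiredPct - actualPct) / 100
}

func main() {
	// Target dc1 wants 70% of allocations but currently has 50%.
	fmt.Printf("%.2f\n", spreadScore(70, 5, 10)) // 0.20: under target, boost
	// Target dc2 wants 30% but currently has 50%.
	fmt.Printf("%.2f\n", spreadScore(30, 5, 10)) // -0.20: over target, penalize
}
```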
2018-09-04 16:10:11 -05:00
Preetha Appan
b5042067e7 Remove unnecessary reset 2018-09-04 16:10:11 -05:00
Preetha Appan
00924555a8 Implement affinity support in generic scheduler 2018-09-04 16:10:11 -05:00
Preetha Appan
a2cdb5d6c0 Add more clarification in comment 2018-01-31 09:58:05 -06:00
Preetha Appan
8d1395ea16 Better score threshold 2018-01-31 09:58:05 -06:00
Preetha Appan
3429dfa716 Limit iterator uses a score threshold and a maxSkip value to be able to skip lower scoring nodes 2018-01-31 09:58:05 -06:00
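A simplified sketch of the skipping behavior described in the commit above, with illustrative names:
```go
package main

import "fmt"

// pickWithSkip returns up to limit scores, passing over at most maxSkip
// scores below threshold in the hope that better ones follow; skipped
// scores are used as a fallback if the limit cannot otherwise be met.
func pickWithSkip(scores []float64, limit int, threshold float64, maxSkip int) []float64 {
	var picked, skipped []float64
	for _, s := range scores {
		if len(picked) == limit {
			break
		}
		if s < threshold && len(skipped) < maxSkip {
			skipped = append(skipped, s)
			continue
		}
		picked = append(picked, s)
	}
	for _, s := range skipped {
		if len(picked) == limit {
			break
		}
		picked = append(picked, s) // fall back to lower-scoring nodes
	}
	return picked
}

func main() {
	scores := []float64{0.2, 0.9, 0.3, 0.8}
	fmt.Println(pickWithSkip(scores, 2, 0.5, 2)) // [0.9 0.8]: low scores skipped
}
```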
Preetha Appan
4cbef07d37 Prevent side effect modification of select options when preferred nodes are set 2018-01-31 09:56:53 -06:00
Preetha Appan
c6c0741bd8 Add helper methods, use require and other code review feedback 2018-01-31 09:56:53 -06:00
Preetha Appan
5ecb7895bb Reschedule previous allocs and track their reschedule attempts 2018-01-31 09:56:53 -06:00
Alex Dadgar
f6fbb36054 sync 2017-10-13 14:36:02 -07:00
Alex Dadgar
653a1c37f6 Split distinct property and host iterator and add iterator to system stack 2017-03-08 19:00:10 -08:00
Alex Dadgar
e2ee3f4904 Double the anti-affinity for placing same task group on node 2017-03-06 11:52:53 -08:00
Diptanu Choudhury
a6e0077f72 Implemented SetPrefferingNodes in stack 2016-08-30 16:17:50 -07:00
Diptanu Choudhury
7da66e169c Making the scheduler use LocalDisk instead of Resources.DiskMB 2016-08-25 12:27:42 -05:00
Alex Dadgar
d487295960 Fix computed class when the job has multiple task groups 2016-02-03 21:22:18 -08:00
Alex Dadgar
450252f8ae Respond to comments 2016-01-26 16:43:42 -08:00
Alex Dadgar
0ad3575897 FeasibilityWrapper uses computed node class eligibility to call feasibility checks minimally 2016-01-26 15:16:43 -08:00
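The memoization idea in the commit above, sketched with illustrative types: nodes sharing a computed class get a single feasibility check.
```go
package main

import "fmt"

type node struct {
	name          string
	computedClass string
}

// feasibleByClass caches feasibility results per computed node class, so the
// (possibly expensive) check runs once per class instead of once per node.
func feasibleByClass(nodes []node, check func(node) bool) []node {
	eligible := make(map[string]bool) // computed class -> result
	var out []node
	for _, n := range nodes {
		ok, seen := eligible[n.computedClass]
		if !seen {
			ok = check(n)
			eligible[n.computedClass] = ok
		}
		if ok {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	nodes := []node{{"a", "c1"}, {"b", "c1"}, {"c", "c2"}}
	checks := 0
	isFeasible := func(n node) bool { checks++; return n.computedClass == "c1" }
	fmt.Println(len(feasibleByClass(nodes, isFeasible)), checks) // 2 2: three nodes, two checks
}
```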
Alex Dadgar
2ab5790b6f Rename Dynamic -> ProposedAllocConstraintIterator 2015-10-26 14:12:54 -07:00
Alex Dadgar
9572878a92 Add dynamic constraint to generic_scheduler 2015-10-22 15:09:03 -07:00
Alex Dadgar
2405101328 Remove base nodes from stack constructors 2015-10-16 17:05:23 -07:00
Alex Dadgar
7feb5f1978 Refactor task group constraint logic in generic/system stack 2015-10-16 14:00:51 -07:00
Alex Dadgar
b24f48a4ed System scheduler and system stack 2015-10-14 18:39:44 -07:00
Armon Dadgar
5fb980bc53 scheduler: do not skip job anti-affinity 2015-09-22 22:20:07 -07:00
Armon Dadgar
ca67742fbb scheduler: thread through the TaskResources 2015-09-13 15:20:50 -07:00
Armon Dadgar
924bf123a1 scheduler: binpacker makes network offers 2015-09-13 14:31:32 -07:00
Armon Dadgar
40b84e3023 scheduler: recompute scan limit on SetNodes 2015-09-11 12:03:41 -07:00
Armon Dadgar
efdf717991 scheduler: allow updating the base nodes 2015-09-07 11:30:13 -07:00
Armon Dadgar
f2327acbe1 scheduler: adding job anti-affinity to the generic stack 2015-08-16 10:37:11 -07:00