Commit Graph

55 Commits

Tim Gross
2d4e5b8fe9 scheduler: fix quadratic performance with spread blocks (#11712)
When the scheduler picks a node for each evaluation, the
`LimitIterator` provides at most 2 eligible nodes for the
`MaxScoreIterator` to choose from. This keeps scheduling fast while
producing acceptable results because the results are binpacked.

Jobs with a `spread` block (or node affinity) remove this limit in
order to produce correct spread scoring. This means that every
allocation within a job with a `spread` block is evaluated against
_all_ eligible nodes. Operators of large clusters have reported that
jobs with `spread` blocks for which a large number of nodes are eligible
can take longer than the nack timeout (60s) to evaluate. Typical
evaluations are processed in milliseconds.

In practice, it's not necessary to evaluate every eligible node for
every allocation on large clusters, because the `RandomIterator` at
the base of the scheduler stack produces enough variation in each pass
that the likelihood of an uneven spread is negligible. Note that
feasibility is checked before the limit, so this only impacts the
number of _eligible_ nodes available for scoring, not the total number
of nodes.

This changeset sets the iterator limit for "large" `spread` block and
node affinity jobs to be equal to the number of desired
allocations. This brings an example problematic job evaluation down
from ~3min to ~10s. The included tests ensure that we have acceptable
spread results across a variety of large cluster topologies.
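
A minimal sketch of that limit policy in Go; the function and parameter names are illustrative, not Nomad's actual identifiers:
```go
package main

import "fmt"

// scoreLimit sketches the new policy: jobs without spread/affinity keep the
// fast default of 2 candidates, while spread/affinity jobs are capped at one
// scoring candidate per desired allocation instead of all eligible nodes.
func scoreLimit(hasSpreadOrAffinity bool, eligibleNodes, desiredAllocs int) int {
	const defaultLimit = 2 // MaxScoreIterator normally chooses between 2 nodes
	if !hasSpreadOrAffinity {
		return defaultLimit
	}
	if desiredAllocs < eligibleNodes {
		// The RandomIterator's shuffling keeps the spread acceptably even
		// without scoring every eligible node.
		return desiredAllocs
	}
	return eligibleNodes
}

func main() {
	fmt.Println(scoreLimit(false, 5000, 30)) // 2
	fmt.Println(scoreLimit(true, 5000, 30))  // 30
	fmt.Println(scoreLimit(true, 10, 30))    // 10
}
```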
2021-12-21 10:10:01 -05:00
Seth Hoenig
61ee443ee6 core: implement system batch scheduler
This PR implements a new "System Batch" scheduler type. Jobs can
make use of this new scheduler by setting their type to 'sysbatch'.

As the name implies, sysbatch can be thought of as a hybrid between
system and batch jobs: it is for running short-lived jobs intended to
run on every compatible node in the cluster.

As with batch jobs, sysbatch jobs can also be periodic and/or parameterized
dispatch jobs. A sysbatch job is considered complete when it has run
on all compatible nodes and reached a terminal state (success, or failure
after exhausting retries).
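
A rough sketch of that completion rule, with illustrative types rather than Nomad's internals:
```go
package main

import "fmt"

type allocStatus string

const (
	statusRunning  allocStatus = "running"
	statusComplete allocStatus = "complete"
	statusFailed   allocStatus = "failed" // terminal once retries are exhausted
)

// sysbatchComplete reports whether every compatible node has run the job to
// a terminal state, which is the completion condition described above.
func sysbatchComplete(statusByNode map[string]allocStatus, compatibleNodes []string) bool {
	for _, node := range compatibleNodes {
		status, ok := statusByNode[node]
		if !ok || (status != statusComplete && status != statusFailed) {
			return false // not yet placed, or still running
		}
	}
	return true
}

func main() {
	nodes := []string{"node-a", "node-b"}
	fmt.Println(sysbatchComplete(map[string]allocStatus{"node-a": statusComplete}, nodes)) // false
	fmt.Println(sysbatchComplete(map[string]allocStatus{
		"node-a": statusComplete,
		"node-b": statusFailed,
	}, nodes)) // true
}
```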

Feasibility and preemption are governed the same as with system jobs. In
this PR, the update stanza is not yet supported. The update stanza is still
limited in functionality for the underlying system scheduler, and is
not useful yet for sysbatch jobs. Further work in #4740 will improve
support for the update stanza and deployments.

Closes #2527
2021-08-03 10:30:47 -04:00
Tim Gross
2b63a093ac quotas: evaluate quota feasibility last in scheduler (#10753)
The `QuotaIterator` is used as the source of nodes passed into feasibility
checking for constraints. Every node that passes the quota check counts the
allocation resources against the quota, so we count nodes that will later be
filtered out by constraints. Therefore, for jobs with constraints, nodes that
are feasibility checked but fail have already been counted against quotas. This
failure mode is order dependent: if all the unfiltered nodes happen to be quota
checked first, everything works as expected.

This changeset moves the `QuotaIterator` to happen last among all feasibility
checkers (but before ranking). The `QuotaIterator` will never receive filtered
nodes so it will calculate quotas correctly.
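
The ordering matters because each checker only sees nodes that survived the checkers before it. A toy sketch with hypothetical types:
```go
package main

import "fmt"

type node struct {
	name  string
	class string
}

type checker func(n node) bool

// quotaChecker counts every node it sees against the quota, mimicking the
// side effect described above.
func quotaChecker(counted *int) checker {
	return func(n node) bool {
		*counted++
		return true
	}
}

func constraintChecker(wantClass string) checker {
	return func(n node) bool { return n.class == wantClass }
}

func runCheckers(nodes []node, checks []checker) (feasible []node) {
	for _, n := range nodes {
		ok := true
		for _, c := range checks {
			if !c(n) {
				ok = false
				break // later checkers never see this node
			}
		}
		if ok {
			feasible = append(feasible, n)
		}
	}
	return feasible
}

func main() {
	nodes := []node{{"a", "gpu"}, {"b", "cpu"}, {"c", "gpu"}}
	var counted int
	// With the quota check last, only the two feasible "gpu" nodes count.
	runCheckers(nodes, []checker{constraintChecker("gpu"), quotaChecker(&counted)})
	fmt.Println("counted against quota:", counted) // 2, not 3
}
```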
2021-06-14 10:11:40 -04:00
Mahmood Ali
28b8767b27 Allow configuring memory oversubscription (#10466)
Cluster operators want to have better control over memory
oversubscription and may want to enable/disable it based on their
experience.

This PR adds a scheduler configuration field to control memory
oversubscription. It's an additional field that can be set in the [API via Scheduler Config](https://www.nomadproject.io/api-docs/operator/scheduler) or [the agent server config](https://www.nomadproject.io/docs/configuration/server#configuring-scheduler-config).

I opted to make memory oversubscription opt-in, but I'm happy to change it. To enable it, operators should call the API with:
```json
{
  "MemoryOversubscriptionEnabled": true
}
```

If memory oversubscription is disabled, jobs that specify `memory_max` will get a "Memory oversubscription is not
enabled" warning, but the jobs will be accepted without access to
the additional memory.

The warning message looks like this:
```
$ nomad job run /tmp/j
Job Warnings:
1 warning(s):

* Memory oversubscription is not enabled; Task cache.redis memory_max value will be ignored

==> Monitoring evaluation "7c444157"
    Evaluation triggered by job "example"
==> Monitoring evaluation "7c444157"
    Evaluation within deployment: "9d826f13"
    Allocation "aa5c3cad" created: node "9272088e", group "cache"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "7c444157" finished with status "complete"

# then you can examine the Alloc AllocatedResources to validate whether the task is allowed to exceed memory:
$ nomad alloc status -json aa5c3cad | jq '.AllocatedResources.Tasks["redis"].Memory'
{
  "MemoryMB": 256,
  "MemoryMaxMB": 0
}
```
2021-04-29 22:09:56 -04:00
Tim Gross
7c7569674c CSI: unique volume per allocation
Add a `PerAlloc` field to volume requests that directs the scheduler to test
feasibility for volumes with a source ID that includes the allocation index
suffix (ex. `[0]`), rather than the exact source ID.

Read the `PerAlloc` field when making the volume claim at the client to
determine if the allocation index suffix (ex. `[0]`) should be added to the
volume source ID.
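
A minimal sketch of the suffixing behavior, using a hypothetical helper name:
```go
package main

import "fmt"

// perAllocSourceID appends the allocation index suffix (e.g. "[0]") when the
// volume request sets PerAlloc, per the description above.
func perAllocSourceID(source string, allocIndex uint, perAlloc bool) string {
	if !perAlloc {
		return source
	}
	return fmt.Sprintf("%s[%d]", source, allocIndex)
}

func main() {
	fmt.Println(perAllocSourceID("csi-volume", 0, true))  // csi-volume[0]
	fmt.Println(perAllocSourceID("csi-volume", 0, false)) // csi-volume
}
```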
2021-03-18 15:35:11 -04:00
Mahmood Ali
5720266c91 Respect alloc job version for lost/failed allocs
This change fixes a bug where lost/failed allocations are replaced by
allocations with the latest job version, even if that version hasn't been
promoted yet.

Now, when generating a plan for lost/failed allocations, the scheduler
first checks whether the current deployment is in the Canary stage, and if
so, it ensures that any lost/failed allocation is replaced with one at the
latest promoted version instead.
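
A sketch of the version-selection rule, with hypothetical names:
```go
package main

import "fmt"

// replacementVersion picks the job version used when replacing a lost or
// failed allocation: during a canary deployment the replacement stays on the
// latest promoted version rather than jumping to the unpromoted latest one.
func replacementVersion(inCanary bool, latestPromoted, latest uint64) uint64 {
	if inCanary {
		return latestPromoted
	}
	return latest
}

func main() {
	fmt.Println(replacementVersion(true, 3, 4))  // 3: canary not promoted yet
	fmt.Println(replacementVersion(false, 4, 4)) // 4
}
```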
2020-08-19 09:52:48 -04:00
Nick Ethier
60c301758c scheduler: do network feasibility checking for system jobs (#8256) 2020-06-24 16:01:00 -04:00
Nick Ethier
ad8ced3873 multi-interface network support 2020-06-19 09:42:10 -04:00
Nick Ethier
33ce12cda9 CNI Implementation (#7518) 2020-06-18 11:05:29 -07:00
Mahmood Ali
9f11857ad1 Open source Preemption code
Nomad 0.12 OSS is to include the preemption feature.

This commit moves the private code for managing preemption to OSS
repository.
2020-05-27 15:02:01 -04:00
Mahmood Ali
5078e0cfed tests and some clean up 2020-05-01 13:13:30 -04:00
Charlie Voiselle
1af6a2adf1 Wiring algorithm to scheduler calls 2020-05-01 13:13:29 -04:00
Michael Schurter
a61a775b62 core: fix comment on system stack
This makes me do a double take every time I run into it, so what if we
just changed it?
2020-04-09 15:19:11 -07:00
Lang Martin
ce9dbe619f csi: the scheduler allows a job with a volume write claim to be updated (#7438)
* nomad/structs/csi: split CanWrite into health, in use

* scheduler/scheduler: expose AllocByID in the state interface

* nomad/state/state_store_test

* scheduler/stack: SetJobID on the matcher

* scheduler/feasible: when a volume writer is in use, check if it's us

* scheduler/feasible: remove SetJob

* nomad/state/state_store: denormalize allocs before Claim

* nomad/structs/csi: return errors on claim, with context

* nomad/csi_endpoint_test: new alloc doesn't look like an update

* nomad/state/state_store_test: change test reference to CanWrite
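
The key feasibility rule from the bullets above ("when a volume writer is in use, check if it's us"), sketched with illustrative types:
```go
package main

import "fmt"

// writeFeasible mirrors the "is the current writer us?" check: a node passes
// if the volume has no writer, or the existing write claim belongs to the
// same job being updated. Names are illustrative, not Nomad's.
func writeFeasible(currentWriterJobID, candidateJobID string) bool {
	return currentWriterJobID == "" || currentWriterJobID == candidateJobID
}

func main() {
	fmt.Println(writeFeasible("", "web"))      // true: volume is unclaimed
	fmt.Println(writeFeasible("web", "web"))   // true: the writer is us (an update)
	fmt.Println(writeFeasible("other", "web")) // false: claimed by another job
}
```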
2020-03-23 21:21:04 -04:00
Lang Martin
9c9a0c5eb5 csi: volume ids are only unique per namespace (#7358)
* nomad/state/schema: use the namespace compound index

* scheduler/scheduler: CSIVolumeByID interface signature namespace

* scheduler/stack: SetJob on CSIVolumeChecker to capture namespace

* scheduler/feasible: pass the captured namespace to CSIVolumeByID

* nomad/state/state_store: use namespace in csi_volume index

* nomad/fsm: pass namespace to CSIVolumeDeregister & Claim

* nomad/core_sched: pass the namespace in volumeClaimReap

* nomad/node_endpoint_test: namespaces in Claim testing

* nomad/csi_endpoint: pass RequestNamespace to state.*

* nomad/csi_endpoint_test: appropriately failed test

* command/alloc_status_test: appropriately failed test

* node_endpoint_test: avoid notTheNamespace for the job

* scheduler/feasible_test: call SetJob to capture the namespace

* nomad/csi_endpoint: ACL check the req namespace, query by namespace

* nomad/state/state_store: remove deregister namespace check

* nomad/state/state_store: remove unused CSIVolumes

* scheduler/feasible: CSIVolumeChecker SetJob -> SetNamespace

* nomad/csi_endpoint: ACL check

* nomad/state/state_store_test: remove call to state.CSIVolumes

* nomad/core_sched_test: job namespace match so claim gc works
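
The core idea of the namespace-compound index from the bullets above, sketched (illustrative, not the real memdb schema):
```go
package main

import "fmt"

// volumeKey keys volumes by (namespace, volume ID), so the same volume ID
// can exist independently in different namespaces.
type volumeKey struct {
	namespace string
	id        string
}

func main() {
	volumes := map[volumeKey]string{
		{"default", "vol-1"}: "plugin-a",
		{"team-b", "vol-1"}:  "plugin-b", // same ID, different namespace
	}
	fmt.Println(volumes[volumeKey{"default", "vol-1"}]) // plugin-a
	fmt.Println(volumes[volumeKey{"team-b", "vol-1"}])  // plugin-b
}
```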
2020-03-23 13:59:25 -04:00
Lang Martin
f370e25843 CSI: Scheduler knows about CSI constraints and availability (#6995)
* structs: piggyback csi volumes on host volumes for job specs

* state_store: CSIVolumeByID always includes plugins, matches usecase

* scheduler/feasible: csi volume checker

* scheduler/stack: add csi volumes

* contributing: update rpc checklist

* scheduler: add volumes to State interface

* scheduler/feasible: introduce new checker collection tgAvailable

* scheduler/stack: taskGroupCSIVolumes checker is transient

* state_store CSIVolumeDenormalizePlugins comment clarity

* structs: remove TODO comment in TaskGroup Validate

* scheduler/feasible: CSIVolumeChecker hasPlugins improve comment

* scheduler/feasible_test: set t.Parallel

* Update nomad/state/state_store.go

Co-Authored-By: Danielle <dani@hashicorp.com>

* Update scheduler/feasible.go

Co-Authored-By: Danielle <dani@hashicorp.com>

* structs: lift ControllerRequired to each volume

* state_store: store plug.ControllerRequired, use it for volume health

* feasible: csi match fast path remove stale host volume copied logic

* scheduler/feasible: improve comments

Co-authored-by: Danielle <dani@builds.terrible.systems>
2020-03-23 13:58:29 -04:00
Danielle Lancashire
709abbc675 scheduler: Add a feasibility checker for Host Vols 2019-08-12 15:39:08 +02:00
Preetha Appan
4743561396 Refactor scheduler package to enable preemption for batch/service jobs 2019-04-10 20:24:01 -05:00
Preetha Appan
6966e3c3e8 Make preemption config a struct to allow for enabling based on scheduler type 2018-10-30 11:06:32 -05:00
Preetha Appan
2143fa2ab7 Use scheduler config from state store to enable/disable preemption 2018-10-30 11:06:32 -05:00
Alex Dadgar
670c7e57dc add to stack 2018-10-13 12:27:49 -07:00
Alex Dadgar
49c2d4f775 Scheduler uses allocated resources 2018-10-02 17:08:25 -07:00
Preetha Appan
fd697272a7 Implement spread iterator that scores according to percentage of desired count in each target.
Added this as a new step in the stack, along with unit tests.
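
A rough sketch of percentage-based spread scoring: nodes in targets below their desired percentage score higher. The formula here is a hypothetical illustration; the real iterator's boost calculation may differ:
```go
package main

import "fmt"

// spreadScore returns a boost in [-1, 1]: positive when the target is under
// its desired percentage of placements, negative when over.
func spreadScore(desiredPct float64, placedInTarget, totalPlaced int) float64 {
	if totalPlaced == 0 {
		return desiredPct / 100 // nothing placed yet: favor bigger targets
	}
	actualPct := 100 * float64(placedInTarget) / float64(totalPlaced)
	return (desiredPct - actualPct) / 100
}

func main() {
	// Target dc1 wants 70% of allocations but currently has 50%.
	fmt.Printf("%.2f\n", spreadScore(70, 5, 10)) // 0.20: under target, boost
	// Target dc2 wants 30% but currently has 50%.
	fmt.Printf("%.2f\n", spreadScore(30, 5, 10)) // -0.20: over target, penalize
}
```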
2018-09-04 16:10:11 -05:00
Preetha Appan
b5042067e7 Remove unnecessary reset 2018-09-04 16:10:11 -05:00
Preetha Appan
00924555a8 Implement affinity support in generic scheduler 2018-09-04 16:10:11 -05:00
Preetha Appan
a2cdb5d6c0 Add more clarification in comment 2018-01-31 09:58:05 -06:00
Preetha Appan
8d1395ea16 Better score threshold 2018-01-31 09:58:05 -06:00
Preetha Appan
3429dfa716 Limit iterator uses a score threshold and a maxSkip value to be able to skip lower scoring nodes 2018-01-31 09:58:05 -06:00
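A simplified sketch of the skipping behavior described in the commit above, with illustrative names:
```go
package main

import "fmt"

// pickWithSkip returns up to limit scores, passing over at most maxSkip
// scores below threshold in the hope that better ones follow; skipped
// scores are used as a fallback if the limit cannot otherwise be met.
func pickWithSkip(scores []float64, limit int, threshold float64, maxSkip int) []float64 {
	var picked, skipped []float64
	for _, s := range scores {
		if len(picked) == limit {
			break
		}
		if s < threshold && len(skipped) < maxSkip {
			skipped = append(skipped, s)
			continue
		}
		picked = append(picked, s)
	}
	for _, s := range skipped {
		if len(picked) == limit {
			break
		}
		picked = append(picked, s) // fall back to lower-scoring nodes
	}
	return picked
}

func main() {
	scores := []float64{0.2, 0.9, 0.3, 0.8}
	fmt.Println(pickWithSkip(scores, 2, 0.5, 2)) // [0.9 0.8]: low scores skipped
}
```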
Preetha Appan
4cbef07d37 Prevent side effect modification of select options when preferred nodes are set 2018-01-31 09:56:53 -06:00
Preetha Appan
c6c0741bd8 Add helper methods, use require and other code review feedback 2018-01-31 09:56:53 -06:00
Preetha Appan
5ecb7895bb Reschedule previous allocs and track their reschedule attempts 2018-01-31 09:56:53 -06:00
Alex Dadgar
f6fbb36054 sync 2017-10-13 14:36:02 -07:00
Alex Dadgar
653a1c37f6 Split distinct property and host iterator and add iterator to system stack 2017-03-08 19:00:10 -08:00
Alex Dadgar
e2ee3f4904 Double the anti-affinity for placing same task group on node 2017-03-06 11:52:53 -08:00
Diptanu Choudhury
a6e0077f72 Implemented SetPrefferingNodes in stack 2016-08-30 16:17:50 -07:00
Diptanu Choudhury
7da66e169c Making the scheduler use LocalDisk instead of Resources.DiskMB 2016-08-25 12:27:42 -05:00
Alex Dadgar
d487295960 Fix computed class when the job has multiple task groups 2016-02-03 21:22:18 -08:00
Alex Dadgar
450252f8ae Respond to comments 2016-01-26 16:43:42 -08:00
Alex Dadgar
0ad3575897 FeasibilityWrapper uses computed node class eligibility to call feasibility checks minimally 2016-01-26 15:16:43 -08:00
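The memoization idea in the commit above, sketched with illustrative types: nodes sharing a computed class get a single feasibility check.
```go
package main

import "fmt"

type node struct {
	name          string
	computedClass string
}

// feasibleByClass caches feasibility results per computed node class, so the
// (possibly expensive) check runs once per class instead of once per node.
func feasibleByClass(nodes []node, check func(node) bool) []node {
	eligible := make(map[string]bool) // computed class -> result
	var out []node
	for _, n := range nodes {
		ok, seen := eligible[n.computedClass]
		if !seen {
			ok = check(n)
			eligible[n.computedClass] = ok
		}
		if ok {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	nodes := []node{{"a", "c1"}, {"b", "c1"}, {"c", "c2"}}
	checks := 0
	isFeasible := func(n node) bool { checks++; return n.computedClass == "c1" }
	fmt.Println(len(feasibleByClass(nodes, isFeasible)), checks) // 2 2: three nodes, two checks
}
```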
Alex Dadgar
2ab5790b6f Rename Dynamic -> ProposedAllocConstraintIterator 2015-10-26 14:12:54 -07:00
Alex Dadgar
9572878a92 Add dynamic constraint to generic_scheduler 2015-10-22 15:09:03 -07:00
Alex Dadgar
2405101328 Remove base nodes from stack constructors 2015-10-16 17:05:23 -07:00
Alex Dadgar
7feb5f1978 Refactor task group constraint logic in generic/system stack 2015-10-16 14:00:51 -07:00
Alex Dadgar
b24f48a4ed System scheduler and system stack 2015-10-14 18:39:44 -07:00
Armon Dadgar
5fb980bc53 scheduler: do not skip job anti-affinity 2015-09-22 22:20:07 -07:00
Armon Dadgar
ca67742fbb scheduler: thread through the TaskResources 2015-09-13 15:20:50 -07:00
Armon Dadgar
924bf123a1 scheduler: binpacker makes network offers 2015-09-13 14:31:32 -07:00
Armon Dadgar
40b84e3023 scheduler: recompute scan limit on SetNodes 2015-09-11 12:03:41 -07:00
Armon Dadgar
efdf717991 scheduler: allow updating the base nodes 2015-09-07 11:30:13 -07:00
Armon Dadgar
f2327acbe1 scheduler: adding job anti-affinity to the generic stack 2015-08-16 10:37:11 -07:00