nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-09 11:55:42 +03:00

Author	SHA1	Message	Date
Tim Gross	5c909213ce	scheduler: add reconciler annotations to completed evals (#26188 ) The output of the reconciler stage of scheduling is only visible via debug-level logs, typically accessible only to the cluster admin. We can give job authors better ability to understand what's happening to their jobs if we expose this information to them in the `eval status` command. Add the reconciler's desired updates to the evaluation struct so it can be exposed in the API. This increases the size of evals by roughly 15% in the state store, or a bit more when there are preemptions (but we expect this will be a small minority of evals). Ref: https://hashicorp.atlassian.net/browse/NMD-818 Fixes: https://github.com/hashicorp/nomad/issues/15564	2025-07-07 09:40:21 -04:00
Tim Gross	9a29df2292	scheduler: emit structured logs from reconciliation (#26169 ) Both the cluster reconciler and node reconciler emit a debug-level log line with their results, but these are unstructured multi-line logs that are annoying for operators to parse. Change these to emit structured key-value pairs like we do everywhere else. Ref: https://hashicorp.atlassian.net/browse/NMD-818 Ref: https://go.hashi.co/rfc/nmd-212	2025-07-01 10:37:44 -04:00
Piotr Kazmierczak	12ddb6db94	scheduler: capture reconciler state in ReconcilerState object (#26088 ) This changeset separates reconciler fields into their own sub-struct to make testing easier and the code more explicit about what fields relate to which state.	2025-06-23 15:36:39 +02:00
Piotr Kazmierczak	b82fd2e159	scheduler: refactor cluster reconciler to avoid hidden state mutation (#26042 ) Cluster reconciler code is notoriously hard to follow because most of its method continuously mutate the fields of the allocReconciler object. Even for top-level methods it makes the code hard to follow, but gets really gnarly with lower-level methods (of which there are many). This changeset proposes a refactoring that makes the vast majority of said methods return explicit values, and avoid mutating object fields.	2025-06-20 07:37:16 +02:00
Piotr Kazmierczak	0ddbc548a3	scheduler: rename reconciliation package to `reconciler` (#26038 ) nouns are better than verbs for package names	2025-06-12 14:36:09 +02:00
Piotr Kazmierczak	199d12865f	scheduler: isolate `feasibility` (#26031 ) This change isolates all the code that deals with node selection in the scheduler into its own package called feasible. --------- Co-authored-by: Tim Gross <tgross@hashicorp.com>	2025-06-11 20:11:04 +02:00
Piotr Kazmierczak	76e3c2961a	scheduler: isolate reconciliation code (#26002 ) This moves all the code of service/batch and system/sysbatch reconciliation into a new reconcile package.	2025-06-10 15:46:39 +02:00
Tim Gross	c205688857	scheduler: fix state corruption from rescheduler tracker updates (#25698 ) In #12319 we fixed a bug where updates to the reschedule tracker would be dropped if the follow-up allocation failed to be placed by the scheduler in the later evaluation. We did this by mutating the previous allocation's reschedule tracker. But we did this without copying the previous allocation first and then making sure the updated copy was in the plan. This is unfortunately unsafe and corrupts the state store on the server where the scheduler ran; it may cause a race condition in RPC handlers and it causes the server to be out of sync with the other servers. This was discovered while trying to make all our tests race-free, but likely impacts production users. Copy the previous allocation before updating the reschedule tracker, and swap out the updated allocation in the plan. This also requires that we include the reschedule tracker in the "normalized" (stripped-down) allocations we send to the leader as part of a plan. Ref: https://github.com/hashicorp/nomad/pull/12319 Fixes: https://hashicorp.atlassian.net/browse/NET-12357	2025-04-18 08:42:54 -04:00
Tim Gross	dc58f247ed	docs: clarify reschedule, migrate, and replacement terminology (#24929 ) Our vocabulary around scheduler behaviors outside of the `reschedule` and `migrate` blocks leaves room for confusion around whether the reschedule tracker should be propagated between allocations. There are effectively five different behaviors we need to cover: * restart: when the tasks of an allocation fail and we try to restart the tasks in place. * reschedule: when the `restart` block runs out of attempts (or the allocation fails before tasks even start), and we need to move the allocation to another node to try again. * migrate: when the user has asked to drain a node and we need to move the allocations. These are not failures, so we don't want to propagate the reschedule tracker. * replacement: when a node is lost, we don't count that against the `reschedule` tracker for the allocations on the node (it's not the allocation's "fault", after all). We don't want to run the `migrate` machinery here here either, as we can't contact the down node. To the scheduler, this is effectively the same as if we bumped the `group.count` * replacement for `disconnect.replace = true`: this is a replacement, but the replacement is intended to be temporary, so we propagate the reschedule tracker. Add a section to the `reschedule`, `migrate`, and `disconnect` blocks explaining when each item applies. Update the use of the word "reschedule" in several places where "replacement" is correct, and vice-versa. Fixes: https://github.com/hashicorp/nomad/issues/24918 Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>	2025-02-18 09:31:03 -05:00
Piotr Kazmierczak	611452e1af	stateful deployments: use `TaskGroupVolumeClaim` table to associate volume requests with volume IDs (#24993 ) We introduce an alternative solution to the one presented in #24960 which is based on the state store and not previous-next allocation tracking in the reconciler. This new solution reduces cognitive complexity of the scheduler code at the cost of slightly more boilerplate code, but also opens up new possibilities in the future, e.g., allowing users to explicitly "un-stick" volumes with workloads still running. The diagram below illustrates the new logic: SetVolumes() upsertAllocsImpl() sets ns, job +-----------------checks if alloc requests tg in the scheduler v sticky vols and consults \| +-----------------------+ state. If there is no claim, \| \| TaskGroupVolumeClaim: \| it creates one. \| \| - namespace \| \| \| - jobID \| \| \| - tg name \| \| \| - vol ID \| v \| uniquely identify vol \| hasVolumes() +----+------------------+ consults the state \| ^ and returns true \| \| DeleteJobTxn() if there's a match <-----------+ +---------------removes the claim from or if there is no the state previous claim \| \| \| \| +-----------------------------+ +------------------------------------------------------+ scheduler state store	2025-02-07 17:41:01 +01:00
Michael Smithhisler	47c14ddf28	remove remote task execution code (#24909 )	2025-01-29 08:08:34 -05:00
Piotr Kazmierczak	8cbb74786c	stateful deployments: find feasible node for sticky host volumes (#24558 ) This changeset implements node feasibility checks for sticky host volumes.	2024-12-19 09:25:55 -05:00
Tim Gross	fa70267787	scheduler: `RescheduleTracker` dropped if follow-up fails placements (#12319 ) When an allocation fails it triggers an evaluation. The evaluation is processed and the scheduler sees it needs to reschedule, which triggers a follow-up eval. The follow-up eval creates a plan to `(stop 1) (place 1)`. The replacement alloc has a `RescheduleTracker` (or gets its `RescheduleTracker` updated). But in the case where the follow-up eval can't place all allocs (there aren't enough resources), it can create a partial plan to `(stop 1) (place 0)`. It then creates a blocked eval. The plan applier stops the failed alloc. Then when the blocked eval is processed, the job is missing an allocation, so the scheduler creates a new allocation. This allocation is _not_ a replacement from the perspective of the scheduler, so it's not handed off a `RescheduleTracker`. This changeset fixes this by annotating the reschedule tracker whenever the scheduler can't place a replacement allocation. We check this annotation for allocations that have the `stop` desired status when filtering out allocations to pass to the reschedule tracker. I've also included tests that cover this case and expands coverage of the relevant area of the code. Fixes: https://github.com/hashicorp/nomad/issues/12147 Fixes: https://github.com/hashicorp/nomad/issues/17072	2024-06-10 11:15:40 -04:00
James Rasell	3c8eb54dfc	scheduler: ensure dup alloc names are fixed before plan submit. (#18873 ) This change fixes a bug within the generic scheduler which meant duplicate alloc indexes (names) could be submitted to the plan applier and written to state. The bug originates from the placements calculation notion that names of allocations being replaced are blindly copied to their replacement. This is not correct in all cases, particularly when dealing with canaries. The fix updates the alloc name index tracker to include minor duplicate tracking. This can be used when computing placements to ensure duplicate are found, and a new name picked before the plan is submitted. The name index tracking is now passed from the reconciler to the generic scheduler via the results, so this does not have to be regenerated, or another data structure used.	2023-10-27 14:16:41 +01:00
hashicorp-copywrite[bot]	a9d61ea3fd	Update copyright file headers to BUSL-1.1	2023-08-10 17:27:29 -05:00
Luiz Aoqui	13ee343853	core: remove unnecessary call to SetNodes and adds DC downgrade test (#17655 )	2023-06-22 13:26:14 -04:00
Luiz Aoqui	f4c7182873	node pools: apply node pool scheduler configuration (#17598 )	2023-06-21 20:31:50 -04:00
Tim Gross	9a6078a2ae	node pools: implement support in scheduler (#17443 ) Implement scheduler support for node pool: * When a scheduler is invoked, we get a set of the ready nodes in the DCs that are allowed for that job. Extend the filter to include the node pool. * Ensure that changes to a job's node pool are picked up as destructive allocation updates. * Add `NodesInPool` as a metric to all reporting done by the scheduler. * Add the node-in-pool the filter to the `Node.Register` RPC so that we don't generate spurious evals for nodes in the wrong pool.	2023-06-07 10:39:03 -04:00
hashicorp-copywrite[bot]	f005448366	[COMPLIANCE] Add Copyright and License Headers	2023-04-10 15:36:59 +00:00
Tim Gross	b8a472d692	ephemeral disk: `migrate` should imply `sticky` (#16826 ) The `ephemeral_disk` block's `migrate` field allows for best-effort migration of the ephemeral disk data to new nodes. The documentation says the `migrate` field is only respected if `sticky=true`, but in fact if client ACLs are not set the data is migrated even if `sticky=false`. The existing behavior when client ACLs are disabled has existed since the early implementation, so "fixing" that case now would silently break backwards compatibility. Additionally, having `migrate` not imply `sticky` seems nonsensical: it suggests that if we place on a new node we migrate the data but if we place on the same node, we throw the data away! Update so that `migrate=true` implies `sticky=true` as follows: * The failure mode when client ACLs are enabled comes from the server not passing along a migration token. Update the server so that the server provides a migration token whenever `migrate=true` and not just when `sticky=true` too. * Update the scheduler so that `migrate` implies `sticky`. * Update the client so that we check for `migrate \|\| sticky` where appropriate. * Refactor the E2E tests to move them off the old framework and make the intention of the test more clear.	2023-04-07 16:33:45 -04:00
Piotr Kazmierczak	949a6f60c7	renamed stanza to block for consistency with other projects (#15941 )	2023-01-30 15:48:43 +01:00
Luiz Aoqui	6a3cf74f32	scheduler: log stack in case of panic (#15303 )	2022-11-17 18:59:33 -05:00
Luiz Aoqui	7828c02a50	Update alloc after reconnect and enforece client heartbeat order (#15068 ) * scheduler: allow updates after alloc reconnects When an allocation reconnects to a cluster the scheduler needs to run special logic to handle the reconnection, check if a replacement was create and stop one of them. If the allocation kept running while the node was disconnected, it will be reconnected with `ClientStatus: running` and the node will have `Status: ready`. This combination is the same as the normal steady state of allocation, where everything is running as expected. In order to differentiate between the two states (an allocation that is reconnecting and one that is just running) the scheduler needs an extra piece of state. The current implementation uses the presence of a `TaskClientReconnected` task event to detect when the allocation has reconnected and thus must go through the reconnection process. But this event remains even after the allocation is reconnected, causing all future evals to consider the allocation as still reconnecting. This commit changes the reconnect logic to use an `AllocState` to register when the allocation was reconnected. This provides the following benefits: - Only a limited number of task states are kept, and they are used for many other events. It's possible that, upon reconnecting, several actions are triggered that could cause the `TaskClientReconnected` event to be dropped. - Task events are set by clients and so their timestamps are subject to time skew from servers. This prevents using time to determine if an allocation reconnected after a disconnect event. - Disconnect events are already stored as `AllocState` and so storing reconnects there as well makes it the only source of information required. With the new logic, the reconnection logic is only triggered if the last `AllocState` is a disconnect event, meaning that the allocation has not been reconnected yet. After the reconnection is handled, the new `ClientStatus` is store in `AllocState` allowing future evals to skip the reconnection logic. * scheduler: prevent spurious placement on reconnect When a client reconnects it makes two independent RPC calls: - `Node.UpdateStatus` to heartbeat and set its status as `ready`. - `Node.UpdateAlloc` to update the status of its allocations. These two calls can happen in any order, and in case the allocations are updated before a heartbeat it causes the state to be the same as a node being disconnected: the node status will still be `disconnected` while the allocation `ClientStatus` is set to `running`. The current implementation did not handle this order of events properly, and the scheduler would create an unnecessary placement since it considered the allocation was being disconnected. This extra allocation would then be quickly stopped by the heartbeat eval. This commit adds a new code path to handle this order of events. If the node is `disconnected` and the allocation `ClientStatus` is `running` the scheduler will check if the allocation is actually reconnecting using its `AllocState` events. * rpc: only allow alloc updates from `ready` nodes Clients interact with servers using three main RPC methods: - `Node.GetAllocs` reads allocation data from the server and writes it to the client. - `Node.UpdateAlloc` reads allocation from from the client and writes them to the server. - `Node.UpdateStatus` writes the client status to the server and is used as the heartbeat mechanism. These three methods are called periodically by the clients and are done so independently from each other, meaning that there can't be any assumptions in their ordering. This can generate scenarios that are hard to reason about and to code for. For example, when a client misses too many heartbeats it will be considered `down` or `disconnected` and the allocations it was running are set to `lost` or `unknown`. When connectivity is restored the to rest of the cluster, the natural mental model is to think that the client will heartbeat first and then update its allocations status into the servers. But since there's no inherit order in these calls the reverse is just as possible: the client updates the alloc status and then heartbeats. This results in a state where allocs are, for example, `running` while the client is still `disconnected`. This commit adds a new verification to the `Node.UpdateAlloc` method to reject updates from nodes that are not `ready`, forcing clients to heartbeat first. Since this check is done server-side there is no need to coordinate operations client-side: they can continue sending these requests independently and alloc update will succeed after the heartbeat is done. * chagelog: add entry for #15068 * code review * client: skip terminal allocations on reconnect When the client reconnects with the server it synchronizes the state of its allocations by sending data using the `Node.UpdateAlloc` RPC and fetching data using the `Node.GetClientAllocs` RPC. If the data fetch happens before the data write, `unknown` allocations will still be in this state and would trigger the `allocRunner.Reconnect` flow. But when the server `DesiredStatus` for the allocation is `stop` the client should not reconnect the allocation. * apply more code review changes * scheduler: persist changes to reconnected allocs Reconnected allocs have a new AllocState entry that must be persisted by the plan applier. * rpc: read node ID from allocs in UpdateAlloc The AllocUpdateRequest struct is used in three disjoint use cases: 1. Stripped allocs from clients Node.UpdateAlloc RPC using the Allocs, and WriteRequest fields 2. Raft log message using the Allocs, Evals, and WriteRequest fields 3. Plan updates using the AllocsStopped, AllocsUpdated, and Job fields Adding a new field that would only be used in one these cases (1) made things more confusing and error prone. While in theory an AllocUpdateRequest could send allocations from different nodes, in practice this never actually happens since only clients call this method with their own allocations. * scheduler: remove logic to handle exceptional case This condition could only be hit if, somehow, the allocation status was set to "running" while the client was "unknown". This was addressed by enforcing an order in "Node.UpdateStatus" and "Node.UpdateAlloc" RPC calls, so this scenario is not expected to happen. Adding unnecessary code to the scheduler makes it harder to read and reason about it. * more code review * remove another unused test	2022-11-04 16:25:11 -04:00
Derek Strickland	f5de802993	system_scheduler: support disconnected clients (#12555 ) * structs: Add helper method for checking if alloc is configured to disconnect * system_scheduler: Add support for disconnected clients	2022-04-15 09:31:32 -04:00
Tim Gross	6a49a0fb81	set minimum version for disconnected client mode to 1.3.0 (#12530 )	2022-04-08 16:48:37 -04:00
Derek Strickland	8ac3e642e6	reconciler: 2 phase reconnects and tests (#12333 ) * structs: Add alloc.Expired & alloc.Reconnected functions. Add Reconnect eval trigger by. * node_endpoint: Emit new eval for reconnecting unknown allocs. * filterByTainted: handle 2 phase commit filtering rules. * reconciler: Append AllocState on disconnect. Logic updates from testing and 2 phase reconnects. * allocs: Set reconnect timestamp. Destroy if not DesiredStatusRun. Watch for unknown status.	2022-04-05 17:13:10 -04:00
Derek Strickland	bab317300e	Add description for allocs stopped due to reconnect (#12270 )	2022-04-05 17:12:23 -04:00
Derek Strickland	6329f44148	disconnected clients: ensure servers meet minimum required version (#12202 ) * planner: expose ServerMeetsMinimumVersion via Planner interface * filterByTainted: add flag indicating disconnect support * allocReconciler: accept and pass disconnect support flag * tests: update dependent tests	2022-04-05 17:12:23 -04:00
Derek Strickland	35752655b0	disconnected clients: Add reconnect task event (#12133 ) * Add TaskClientReconnectedEvent constant * Add allocRunner.Reconnect function to manage task state manually * Removes server-side push	2022-04-05 17:12:23 -04:00
Derek Strickland	786180601d	reconciler: support disconnected clients (#12058 ) * Add merge helper for string maps * structs: add statuses, MaxClientDisconnect, and helper funcs * taintedNodes: Include disconnected nodes * upsertAllocsImpl: don't use existing ClientStatus when upserting unknown * allocSet: update filterByTainted and add delayByMaxClientDisconnect * allocReconciler: support disconnecting and reconnecting allocs * GenericScheduler: upsert unknown and queue reconnecting Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-04-05 17:10:37 -04:00
Derek Strickland	cefc58dd7b	reconciler: refactor `computeGroup` (#12033 ) The allocReconciler's computeGroup function contained a significant amount of inline logic that was difficult to understand the intent of. This commit extracts inline logic into the following intention revealing subroutines. It also includes updates to the function internals also aimed at improving maintainability and renames some existing functions for the same purpose. New or renamed functions include. Renamed functions - handleGroupCanaries -> cancelUnneededCanaries - handleDelayedLost -> createLostLaterEvals - handeDelayedReschedules -> createRescheduleLaterEvals New functions - filterAndStopAll - initializeDeploymentState - requiresCanaries - computeCanaries - computeUnderProvisionedBy - computeReplacements - computeDestructiveUpdates - computeMigrations - createDeployment - isDeploymentComplete	2022-02-10 16:24:51 -05:00
Tim Gross	f811169267	scheduler: recover from panic (#12009 ) If processing a specific evaluation causes the scheduler (and therefore the entire server) to panic, that evaluation will never get a chance to be nack'd and cleared from the state store. It will get dequeued by another scheduler, causing that server to panic, and so forth until all servers are in a panic loop. This prevents the operator from intervening to remove the evaluation or update the state. Recover the goroutine from the top-level `Process` methods for each scheduler so that this condition can be detected without panicking the server process. This will lead to a loop of recovering the scheduler goroutine until the eval can be removed or nack'd, but that's much better than taking a downtime.	2022-02-07 11:47:53 -05:00
Luiz Aoqui	8a427a470a	scheduler: detect and log unexpected scheduling collisions (#11793 )	2022-01-14 20:09:14 -05:00
James Rasell	80dcae7216	core: allow setting and propagation of eval priority on job de/registration (#11532 ) This change modifies the Nomad job register and deregister RPCs to accept an updated option set which includes eval priority. This param is optional and override the use of the job priority to set the eval priority. In order to ensure all evaluations as a result of the request use the same eval priority, the priority is shared to the allocReconciler and deploymentWatcher. This creates a new distinction between eval priority and job priority. The Nomad agent HTTP API has been modified to allow setting the eval priority on job update and delete. To keep consistency with the current v1 API, job update accepts this as a payload param; job delete accepts this as a query param. Any user supplied value is validated within the agent HTTP handler removing the need to pass invalid requests to the server. The register and deregister opts functions now all for setting the eval priority on requests. The change includes a small change to the DeregisterOpts function which handles nil opts. This brings the function inline with the RegisterOpts.	2021-11-23 09:23:31 +01:00
Mahmood Ali	56a7cc61d0	scheduler: stop allocs in unrelated nodes (#11391 ) The system scheduler should leave allocs on draining nodes as-is, but stop node stop allocs on nodes that are no longer part of the job datacenters. Previously, the scheduler did not make the distinction and left system job allocs intact if they are already running. I've added a failing test first, which you can see in https://app.circleci.com/jobs/github/hashicorp/nomad/179661 . Fixes https://github.com/hashicorp/nomad/issues/11373	2021-10-27 07:04:13 -07:00
Seth Hoenig	61ee443ee6	core: implement system batch scheduler This PR implements a new "System Batch" scheduler type. Jobs can make use of this new scheduler by setting their type to 'sysbatch'. Like the name implies, sysbatch can be thought of as a hybrid between system and batch jobs - it is for running short lived jobs intended to run on every compatible node in the cluster. As with batch jobs, sysbatch jobs can also be periodic and/or parameterized dispatch jobs. A sysbatch job is considered complete when it has been run on all compatible nodes until reaching a terminal state (success or failed on retries). Feasibility and preemption are governed the same as with system jobs. In this PR, the update stanza is not yet supported. The update stanza is sill limited in functionality for the underlying system scheduler, and is not useful yet for sysbatch jobs. Further work in #4740 will improve support for the update stanza and deployments. Closes #2527	2021-08-03 10:30:47 -04:00
Michael Schurter	2e6eb84a57	Merge pull request #10248 from hashicorp/f-remotetask-2021 core: propagate remote task handles	2021-04-30 08:57:26 -07:00
Michael Schurter	d3d6c60e63	clarify docs from pr comments	2021-04-30 08:31:31 -07:00
Luiz Aoqui	c7114921fa	Add metrics for blocked eval resources (#10454 ) * add metrics for blocked eval resources * docs: add new blocked_evals metrics * fix to call `pruneStats` instead of `stats.prune` directly	2021-04-29 15:03:45 -04:00
Michael Schurter	d50fb2a00e	core: propagate remote task handles Add a new driver capability: RemoteTasks. When a task is run by a driver with RemoteTasks set, its TaskHandle will be propagated to the server in its allocation's TaskState. If the task is replaced due to a down node or draining, its TaskHandle will be propagated to its replacement allocation. This allows tasks to be scheduled in remote systems whose lifecycles are disconnected from the Nomad node's lifecycle. See https://github.com/hashicorp/nomad-driver-ecs for an example ECS remote task driver.	2021-04-27 15:07:03 -07:00
Tim Gross	7c7569674c	CSI: unique volume per allocation Add a `PerAlloc` field to volume requests that directs the scheduler to test feasibility for volumes with a source ID that includes the allocation index suffix (ex. `[0]`), rather than the exact source ID. Read the `PerAlloc` field when making the volume claim at the client to determine if the allocation index suffix (ex. `[0]`) should be added to the volume source ID.	2021-03-18 15:35:11 -04:00
Kris Hicks	85ed8ddd4f	Add gosimple linter (#9590 )	2020-12-09 11:05:18 -08:00
Kris Hicks	be6e5e9e6d	scheduler: Fix always-false sort func (#9547 ) Co-authored-by: Mahmood Ali <mahmood@hashicorp.com>	2020-12-08 09:57:47 -08:00
Michael Schurter	a55f46e9ba	api: add field filters to /v1/{allocations,nodes} Fixes #9017 The ?resources=true query parameter includes resources in the object stub listings. Specifically: - For `/v1/nodes?resources=true` both the `NodeResources` and `ReservedResources` field are included. - For `/v1/allocations?resources=true` the `AllocatedResources` field is included. The ?task_states=false query parameter removes TaskStates from /v1/allocations responses. (By default TaskStates are included.)	2020-10-14 10:35:22 -07:00
Mahmood Ali	6f6a93b262	Handle migration of non-deployment jobs This handles the case where a job when from no-deployment to deployment with canaries. Consider a case where a `max_parallel=0` job is submitted as version 0, then an update is submitted with `max_parallel=1, canary=1` as verion 1. In this case, we will have 1 canary alloc, and all remaining allocs will be version 0. Until the deployment is promoted, we ought to replace the canaries with version 0 job (which isn't associated with a deployment).	2020-08-26 10:36:34 -04:00
Mahmood Ali	92bb3728c9	tweak stack job manipulation To address review comments	2020-08-25 17:37:19 -04:00
Mahmood Ali	cb038b1a8c	Have Plan.AppendAlloc accept the job	2020-08-25 17:22:09 -04:00
Mahmood Ali	5720266c91	Respect alloc job version for lost/failed allocs This change fixes a bug where lost/failed allocations are replaced by allocations with the latest versions, even if the version hasn't been promoted yet. Now, when generating a plan for lost/failed allocations, the scheduler first checks if the current deployment is in Canary stage, and if so, it ensures that any lost/failed allocations is replaced one with the latest promoted version instead.	2020-08-19 09:52:48 -04:00
Mahmood Ali	b00d226c40	this is OSS	2020-06-22 10:28:45 -04:00
Nick Ethier	ad8ced3873	multi-interface network support	2020-06-19 09:42:10 -04:00

1 2 3 4 5

224 Commits