From dc58f247ed2b088c354842607e7e5fae7b7060d9 Mon Sep 17 00:00:00 2001
From: Tim Gross
Date: Tue, 18 Feb 2025 09:31:03 -0500
Subject: [PATCH] docs: clarify reschedule, migrate, and replacement
 terminology (#24929)

Our vocabulary around scheduler behaviors outside of the `reschedule` and
`migrate` blocks leaves room for confusion around whether the reschedule
tracker should be propagated between allocations. There are effectively five
different behaviors we need to cover:

* restart: when the tasks of an allocation fail and we try to restart the
  tasks in place.
* reschedule: when the `restart` block runs out of attempts (or the
  allocation fails before tasks even start), and we need to move the
  allocation to another node to try again.
* migrate: when the user has asked to drain a node and we need to move the
  allocations. These are not failures, so we don't want to propagate the
  reschedule tracker.
* replacement: when a node is lost, we don't count that against the
  `reschedule` tracker for the allocations on the node (it's not the
  allocation's "fault", after all). We don't want to run the `migrate`
  machinery here either, as we can't contact the down node. To the scheduler,
  this is effectively the same as if we bumped the `group.count`.
* replacement for `disconnect.replace = true`: this is a replacement, but the
  replacement is intended to be temporary, so we propagate the reschedule
  tracker.

Add a section to the `reschedule`, `migrate`, and `disconnect` blocks
explaining when each item applies. Update the use of the word "reschedule" in
several places where "replacement" is correct, and vice-versa.

Fixes: https://github.com/hashicorp/nomad/issues/24918

Co-authored-by: Aimee Ukasick
---
 command/job_restart.go                        | 25 ++++++------
 scheduler/generic_sched.go                    |  9 +++--
 website/content/docs/commands/job/restart.mdx | 34 ++++++++---------
 website/content/docs/configuration/server.mdx | 10 ++---
 .../docs/job-specification/disconnect.mdx     | 32 +++++++++++-----
 .../content/docs/job-specification/group.mdx  | 38 +++++++++----------
 .../docs/job-specification/migrate.mdx        | 10 +++++
 .../docs/job-specification/reschedule.mdx     | 22 ++++++++---
 .../docs/job-specification/restart.mdx        |  6 ++-
 9 files changed, 111 insertions(+), 75 deletions(-)

diff --git a/command/job_restart.go b/command/job_restart.go
index 30385b85c..5848f13f6 100644
--- a/command/job_restart.go
+++ b/command/job_restart.go
@@ -124,18 +124,18 @@ Usage: nomad job restart [options] <job>
   batch. It is also possible to specify additional time to wait between
   batches.

-  Allocations can be restarted in-place or rescheduled. When restarting
-  in-place the command may target specific tasks in the allocations, restart
-  only tasks that are currently running, or restart all tasks, even the ones
-  that have already run. Allocations can also be targeted by group. When both
-  groups and tasks are defined only the tasks for the allocations of those
-  groups are restarted.
+  You may restart allocations in-place or migrate them. When restarting
+  in-place, the command may target specific tasks in the allocations, restart
+  only tasks that are currently running, or restart all tasks, even the ones
+  that have already run. You may also target allocations by group and task.
+  When you define both groups and tasks, Nomad restarts only the tasks for
+  the allocations of those groups.
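For illustration only (not part of the patch), a minimal job sketch of where
each behavior above is configured; every value is an assumption rather than a
default:

```hcl
job "example" {
  group "cache" {
    count = 2

    # restart: in-place retries on the same client
    restart {
      attempts = 2
      interval = "30m"
      delay    = "15s"
      mode     = "fail"
    }

    # reschedule: consulted once restart attempts are exhausted, or when
    # the allocation fails before its tasks even start
    reschedule {
      attempts  = 3
      interval  = "24h"
      unlimited = false
    }

    # migrate: consulted only for node drains; no reschedule tracker involved
    migrate {
      max_parallel = 1
    }

    # disconnect.replace = true: a temporary replacement that propagates
    # the reschedule tracker
    disconnect {
      lost_after = "6h"
      replace    = true
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:7"
      }
    }
  }
}
```

Replacement after a lost node is the remaining case: it consults none of these
blocks and behaves, from the scheduler's perspective, as if `group.count` had
been bumped.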
-  When rescheduling, the current allocations are stopped triggering the Nomad
-  scheduler to create replacement allocations that may be placed in different
+  When migrating, Nomad stops the current allocations, triggering the Nomad
+  scheduler to create new allocations that may be placed on different
   clients. The command waits until the new allocations have client status
-  'ready' before proceeding with the remaining batches. Services health checks
-  are not taken into account.
+  'ready' before proceeding with the remaining batches. The command does not
+  consider service health checks.

   By default the command restarts all running tasks in-place with one
   allocation per batch.
@@ -183,12 +183,13 @@ Restart Options:
     proceed. If 'fail' the command exits immediately. Defaults to 'ask'.

   -reschedule
-    If set, allocations are stopped and rescheduled instead of restarted
+    If set, allocations are stopped and migrated instead of restarted
     in-place. Since the group is not modified the restart does not create a
     new deployment, and so values defined in 'update' blocks, such as
     'max_parallel', are not taken into account. This option cannot be used
     with '-task'. Only jobs of type 'batch', 'service', and 'system' can be
-    rescheduled.
+    migrated. Note that despite the name of this flag, this command migrates
+    but does not reschedule allocations, so it ignores the 'reschedule' block.

   -task=<task-name>
     Specify the task to restart. Can be specified multiple times. If groups are
diff --git a/scheduler/generic_sched.go b/scheduler/generic_sched.go
index 89a889d66..e341625cd 100644
--- a/scheduler/generic_sched.go
+++ b/scheduler/generic_sched.go
@@ -469,7 +469,8 @@ func (s *GenericScheduler) computeJobAllocs() error {
 	return s.computePlacements(destructive, place, results.taskGroupAllocNameIndexes)
 }

-// downgradedJobForPlacement returns the job appropriate for non-canary placement replacement
+// downgradedJobForPlacement returns the previous stable version of the job for
+// downgrading a placement for non-canaries
 func (s *GenericScheduler) downgradedJobForPlacement(p placementResult) (string, *structs.Job, error) {
 	ns, jobID := s.job.Namespace, s.job.ID
 	tgName := p.TaskGroup().Name
@@ -587,8 +588,8 @@ func (s *GenericScheduler) computePlacements(destructive, place []placementResul
 	}

 	// Check if we should stop the previous allocation upon successful
-	// placement of its replacement. This allow atomic placements/stops. We
-	// stop the allocation before trying to find a replacement because this
+	// placement of the new alloc. This allows atomic placements/stops. We
+	// stop the allocation before trying to place the new alloc because this
 	// frees the resources currently used by the previous allocation.
 	stopPrevAlloc, stopPrevAllocDesc := missing.StopPreviousAlloc()
 	prevAllocation := missing.PreviousAllocation()
@@ -715,7 +716,7 @@ func (s *GenericScheduler) computePlacements(destructive, place []placementResul
 	// Track the fact that we didn't find a placement
 	s.failedTGAllocs[tg.Name] = s.ctx.Metrics()

-	// If we weren't able to find a replacement for the allocation, back
+	// If we weren't able to find a placement for the allocation, back
 	// out the fact that we asked to stop the allocation.
 	if stopPrevAlloc {
 		s.plan.PopUpdate(prevAllocation)
diff --git a/website/content/docs/commands/job/restart.mdx b/website/content/docs/commands/job/restart.mdx
index 6f87e40a8..0f04619f5 100644
--- a/website/content/docs/commands/job/restart.mdx
+++ b/website/content/docs/commands/job/restart.mdx
@@ -32,18 +32,17 @@ The command can operate in batches and wait until all restarted or rescheduled
 allocations are running again before proceeding to the next batch. It is also
 possible to specify additional time to wait between batches.

-Allocations can be restarted in-place or rescheduled. When restarting
-in-place the command may target specific tasks in the allocations, restart
-only tasks that are currently running, or restart all tasks, even the ones
-that have already run. Allocations can also be targeted by groups and tasks.
-When both groups and tasks are defined only the tasks for the allocations of
-those groups are restarted.
+You may restart allocations in-place or migrate them. When restarting in-place,
+the command may target specific tasks in the allocations, restart only tasks
+that are currently running, or restart all tasks, even the ones that have
+already run. You may also target allocations by group and task. When you define
+both groups and tasks, Nomad restarts only the tasks for the allocations of
+those groups.

-When rescheduling, the current allocations are stopped triggering the Nomad
-scheduler to create replacement allocations that may be placed in different
-clients. The command waits until the new allocations have client status `ready`
-before proceeding with the remaining batches. Services health checks are not
-taken into account.
+When migrating, Nomad stops the current allocations, triggering the Nomad
+scheduler to create new allocations that may be placed on different clients.
+The command waits until the new allocations have client status `ready` before
+proceeding with the remaining batches. The command does not consider service
+health checks.

 By default the command restarts all running tasks in-place with one allocation
 per batch.
@@ -82,12 +81,13 @@ of the exact job ID.
   shutdown or restart. Note that using this flag will result in failed network
   connections to the allocation being restarted.

-- `-reschedule`: If set, allocations are stopped and rescheduled instead of
-  restarted in-place. Since the group is not modified the restart does not
-  create a new deployment, and so values defined in [`update`][] blocks, such
-  as [`max_parallel`][], are not taken into account. This option cannot be used
-  with `-task`. Only jobs of type `batch`, `service`, and `system` can be
-  rescheduled.
+- `-reschedule`: If set, Nomad stops and migrates allocations instead of
+  restarting in-place. Since the group is not modified, the restart does not
+  create a new deployment, and so values defined in [`update`][] blocks, such as
+  [`max_parallel`][], are not considered. This option cannot be used with
+  `-task`. You may only migrate jobs of type `batch`, `service`, and `system`.
+  Note that despite the name of this flag, this command migrates but does not
+  reschedule allocations, so it ignores the `reschedule` block.

 - `-on-error=<ask|fail>`: Determines what action to take when an error happens
   during a restart batch.
   If `ask` the command stops and waits for user confirmation on how to
   proceed. If `fail` the command exits immediately. Defaults to `ask`.

diff --git a/website/content/docs/configuration/server.mdx b/website/content/docs/configuration/server.mdx
index afc69dd26..ffcd2993d 100644
--- a/website/content/docs/configuration/server.mdx
+++ b/website/content/docs/configuration/server.mdx
@@ -438,17 +438,17 @@ Nomad Clients periodically heartbeat to Nomad Servers to confirm they are
 operating as expected. Nomad Clients which do not heartbeat in the specified
 amount of time are considered `down` and their allocations are marked as `lost`
 or `disconnected` (if [`disconnect.lost_after`][disconnect.lost_after] is set)
-and rescheduled.
+and replaced.

 The various heartbeat related parameters allow you to tune the following
 tradeoffs:

-- The longer the heartbeat period, the longer a `down` Client's workload will
-  take to be rescheduled.
+- The longer the heartbeat period, the longer Nomad takes to replace a `down`
+  Client's workload.
 - The shorter the heartbeat period, the more likely transient network issues,
   leader elections, and other temporary issues could cause a perfectly
   functional Client and its workloads to be marked as `down` and the work
-  rescheduled.
+  replaced.

 While Nomad Clients can connect to any Server, all heartbeats are forwarded to
 the leader for processing. Since this heartbeat processing consumes resources,
@@ -510,7 +510,7 @@ system has for a delay in noticing crashed Clients. For example a
 `failover_heartbeat_ttl` of 30 minutes may give even the slowest clients in the
 largest clusters ample time to heartbeat after an election. However if the
 election was due to a datacenter-wide failure affecting Clients, it will be 30
-minutes before Nomad recognizes that they are `down` and reschedules their
+minutes before Nomad recognizes that they are `down` and replaces their
 work.

 [encryption]: /nomad/tutorials/transport-security/security-gossip-encryption 'Nomad Encryption Overview'
diff --git a/website/content/docs/job-specification/disconnect.mdx b/website/content/docs/job-specification/disconnect.mdx
index f54ec1a40..71060c143 100644
--- a/website/content/docs/job-specification/disconnect.mdx
+++ b/website/content/docs/job-specification/disconnect.mdx
@@ -14,7 +14,14 @@ description: |-
 The `disconnect` block describes the system's behavior in case of a network
 partition. By default, without a `disconnect` block, if an allocation is on a
 node that misses heartbeats, the allocation will be marked `lost` and will be
-rescheduled.
+replaced.
+
+Replacement happens when a node is lost. When a node is drained, Nomad
+[migrates][] the allocations instead, and Nomad ignores the `disconnect`
+block. When a Nomad agent fails to set up the allocation or the tasks of an
+allocation fail more than their [`restart`][] block allows, Nomad
+[reschedules][] the allocations and ignores the `disconnect` block.

 ```hcl
 job "docs" {
@@ -51,11 +58,12 @@ same `disconnect` block.

 Refer to [the Lost After section][lost-after] for more details.

-- `replace` `(bool: false)` - Specifies if the disconnected allocation should
-  be replaced by a new one rescheduled on a different node. If false and the
-  node it is running on becomes disconnected or goes down, this allocation
-  won't be rescheduled and will be reported as `unknown` until the node reconnects,
-  or until the allocation is manually stopped:
+- `replace` `(bool: false)` - Specifies if Nomad should replace the disconnected
+  allocation with a new one rescheduled on a different node.
+  Nomad considers the replacement allocation a reschedule and obeys the job's
+  [`reschedule`][] block. If false and the node the allocation is running on
+  disconnects or goes down, Nomad does not replace this allocation and reports
+  `unknown` until the node reconnects, or until you manually stop the
+  allocation.

   ```plaintext
   `nomad alloc stop <alloc_id>`
   ```

@@ -84,7 +92,7 @@ same `disconnect` block.
   - `keep_original`: Always keep the original allocation. Bear in mind when
     choosing this option, it may have crashed while the client was
     disconnected.
-  - `keep_replacement`: Always keep the allocation that was rescheduled
+  - `keep_replacement`: Always keep the allocation that was created
     to replace the disconnected one.
   - `best_score`: Keep the allocation running on the node with the best score.
@@ -102,17 +110,17 @@ The following examples only show the `disconnect` blocks. Remember that the

 This example shows how `stop_on_client_after` interacts with
 other blocks. For the `first` group, after the default 10 second
 [`heartbeat_grace`] window expires and 90 more seconds passes, the
-server will reschedule the allocation. The client will wait 90 seconds
+server replaces the allocation. The client waits 90 seconds
 before sending a stop signal (`SIGTERM`) to the `first-task` task.
 After 15 more seconds because of the task's `kill_timeout`, the client
 will send `SIGKILL`. The `second` group does not have
-`stop_on_client_after`, so the server will reschedule the
+`stop_on_client_after`, so the server replaces the
 allocation after the 10 second [`heartbeat_grace`] expires. It will
 not be stopped on the client, regardless of how long the client is out
 of touch.

 Note that if the server's clocks are not closely synchronized with
-each other, the server may reschedule the group before the client has
+each other, the server may replace the group before the client has
 stopped the allocation. Operators should ensure that clock drift
 between servers is as small as possible.

@@ -217,3 +225,7 @@ group "second" {
 [stop-after]: /nomad/docs/job-specification/disconnect#stop-after
 [lost-after]: /nomad/docs/job-specification/disconnect#lost-after
 [`reconcile`]: /nomad/docs/job-specification/disconnect#reconcile
+[migrates]: /nomad/docs/job-specification/migrate
+[`restart`]: /nomad/docs/job-specification/restart
+[reschedules]: /nomad/docs/job-specification/reschedule
+[`reschedule`]: /nomad/docs/job-specification/reschedule
diff --git a/website/content/docs/job-specification/group.mdx b/website/content/docs/job-specification/group.mdx
index f8338e212..c2c8dc134 100644
--- a/website/content/docs/job-specification/group.mdx
+++ b/website/content/docs/job-specification/group.mdx
@@ -48,9 +48,9 @@ job "docs" {
   ephemeral disk requirements of the group. Ephemeral disks can be marked as
   sticky and support live data migrations.

-- `disconnect` ([disconnect][]: nil) - Specifies the disconnect
-  strategy for the server and client for all tasks in this group in case of a
-  network partition. The tasks can be left unconnected, stopped or replaced
+- `disconnect` ([disconnect][]: nil) - Specifies the disconnect
+  strategy for the server and client for all tasks in this group in case of a
+  network partition. The tasks can be left unconnected, stopped or replaced
   when the client disconnects. The policy for reconciliation in case the client
   regains connectivity is also specified here.

@@ -65,14 +65,14 @@ job "docs" {
   requirements and configuration, including static and dynamic port
   allocations, for the group.
-- `prevent_reschedule_on_lost` `(bool: false)` - Defines the reschedule behaviour
-  of an allocation when the node it is running on misses heartbeats.
-  When enabled, if the node it is running on becomes disconnected
-  or goes down, this allocations wont be rescheduled and will show up as `unknown`
-  until the node comes back up or it is manually restarted.
+- `prevent_reschedule_on_lost` `(bool: false)` - Defines the replacement
+  behavior of an allocation when the node it is running on misses heartbeats.
+  When enabled, if the node disconnects or goes down, Nomad does not replace
+  this allocation and shows it as `unknown` until the node reconnects or you
+  manually restart the allocation.

-  This behaviour will only modify the reschedule process on the server.
-  To modify the allocation behaviour on the client, see
+  This behavior only modifies the replacement process on the server. To
+  modify the allocation behavior on the client, refer to
   [`stop_after_client_disconnect`](#stop_after_client_disconnect).

   The `unknown` allocation has to be manually stopped to run it again.

   Setting `max_client_disconnect` and `prevent_reschedule_on_lost = true` at the
   same time requires that [rescheduling is disabled entirely][`disable_rescheduling`].

-  This field was deprecated in favour of `replace` on the [`disconnect`] block,
+  We deprecated this field in favor of `replace` on the [`disconnect`] block;
   see [example below][disconect_migration] for more details about migrating.

- `reschedule` ([Reschedule][]: nil) - Allows to specify a
@@ -299,18 +299,18 @@ issues with stateful tasks or tasks with long restart times. Instead, an
 operator may desire that these allocations reconnect without a restart.

 When `max_client_disconnect` or `disconnect.lost_after` is specified,
-the Nomad server will mark clients that fail to heartbeat as "disconnected"
+the Nomad server marks clients that fail to heartbeat as "disconnected"
 rather than "down", and will mark allocations on a disconnected client as
 "unknown" rather than "lost". These allocations may continue to run on the
 disconnected client. Replacement allocations will be scheduled according to the
-allocations' `disconnect.replace` settings. until the disconnected client
-reconnects. Once a disconnected client reconnects, Nomad will compare the "unknown"
-allocations with their replacements and will decide which ones to keep according
-to the `disconnect.replace` setting. If the `max_client_disconnect` or
-`disconnect.losta_after` duration expires before the client reconnects,
+allocations' `disconnect.replace` settings until the disconnected client
+reconnects. Once a disconnected client reconnects, Nomad compares the "unknown"
+allocations with their replacements and decides which ones to keep according
+to the `disconnect.replace` setting. If the `max_client_disconnect` or
+`disconnect.lost_after` duration expires before the client reconnects,
 the allocations will be marked "lost".

 Clients that contain "unknown" allocations will transition to "disconnected"
-rather than "down" until the last `max_client_disconnect` or `disconnect.lost_after`
+rather than "down" until the last `max_client_disconnect` or `disconnect.lost_after`
 duration has expired.

 In the example code below, if both of these task groups were placed on the same
@@ -390,7 +390,7 @@ will remain as `unknown` and won't be rescheduled.
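A hedged sketch of the `unknown` allocation lifecycle described above (group
name and values are illustrative, not from the patch): the client keeps running
the allocation, the server waits out `lost_after`, and `replace` plus
`reconcile` decide what happens around reconnection:

```hcl
group "stateful" {
  count = 1

  disconnect {
    # Allocations on a client that stops heartbeating are marked "unknown"
    # and kept for up to 6 hours before being marked "lost".
    lost_after = "6h"

    # With replace = false, no replacement is scheduled; the group runs
    # below count until the node reconnects or the allocation is stopped.
    replace = false

    # On reconnection, keep the original allocation rather than comparing
    # scores against a replacement.
    reconcile = "keep_original"
  }

  task "db" {
    driver = "docker"

    config {
      image = "postgres:16"
    }
  }
}
```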
 #### Migration to `disconnect` block

 The new configuration fields in the disconnect block work exactly the same as the
-ones they are replacing:
+ones they are replacing:
 * `stop_after_client_disconnect` is replaced by `stop_after`
 * `max_client_disconnect` is replaced by `lost_after`
 * `prevent_reschedule_on_lost` is replaced by `replace`
diff --git a/website/content/docs/job-specification/migrate.mdx b/website/content/docs/job-specification/migrate.mdx
index e92348246..cf0006411 100644
--- a/website/content/docs/job-specification/migrate.mdx
+++ b/website/content/docs/job-specification/migrate.mdx
@@ -22,6 +22,13 @@ If specified at the job level, the configuration will apply to all groups
 within the job. Only service jobs with a count greater than 1 support migrate
 blocks.

+Migrating happens when a Nomad node is drained. When a node is lost, Nomad
+[replaces][] the allocations instead and ignores the `migrate` block. When the
+agent fails to set up the allocation, or the tasks of an allocation fail more
+than their [`restart`][] block allows, Nomad [reschedules][] the allocations
+instead and ignores the `migrate` block.
+
 ```hcl
 job "docs" {
   migrate {
@@ -78,3 +85,6 @@ on node draining.
 [count]: /nomad/docs/job-specification/group#count
 [drain]: /nomad/docs/commands/node/drain
 [deadline]: /nomad/docs/commands/node/drain#deadline
+[replaces]: /nomad/docs/job-specification/disconnect#replace
+[`restart`]: /nomad/docs/job-specification/restart
+[reschedules]: /nomad/docs/job-specification/reschedule
diff --git a/website/content/docs/job-specification/reschedule.mdx b/website/content/docs/job-specification/reschedule.mdx
index dfdf1ed52..f436ef63c 100644
--- a/website/content/docs/job-specification/reschedule.mdx
+++ b/website/content/docs/job-specification/reschedule.mdx
@@ -22,15 +22,21 @@ description: >-
 ]}
 />

-The `reschedule` block specifies the group's rescheduling strategy. If specified at the job
-level, the configuration will apply to all groups within the job. If the
-reschedule block is present on both the job and the group, they are merged with
-the group block taking the highest precedence and then the job.
+The `reschedule` block specifies the group's rescheduling strategy. If specified
+at the job level, the configuration will apply to all groups within the job. If
+the reschedule block is present on both the job and the group, they are merged
+with the group block taking the highest precedence and then the job.

-Nomad will attempt to schedule the allocation on another node if any of its
-task statuses become `failed`. The scheduler prefers to create a replacement
+Nomad will attempt to schedule the allocation on another node if any of its task
+statuses become `failed`. The scheduler prefers to create a replacement
 allocation on a node that was not used by a previous allocation.

+Rescheduling happens when the Nomad agent fails to set up the allocation or the
+tasks of an allocation fail more than their [`restart`][] block allows. When a
+node is drained, Nomad [migrates][] the allocations instead and ignores the
+`reschedule` block. When a node is lost, Nomad [replaces][] the allocations
+instead and ignores the `reschedule` block.
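As an illustrative aside (all values assumed, not defaults), a `reschedule`
block such as the following is consulted only for the failure cases named
above, never for drains or lost nodes:

```hcl
reschedule {
  attempts       = 5      # up to five reschedules per interval
  interval       = "1h"
  delay          = "30s"  # wait before the first reschedule
  delay_function = "exponential"
  max_delay      = "10m"  # cap on the growing delay
  unlimited      = false
}
```

With `exponential`, successive rescheduling attempts would wait roughly 30s,
1m, 2m, 4m, and 8m, with later delays capped at `max_delay`.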
+ ```hcl job "docs" { @@ -131,3 +137,7 @@ job "docs" { ``` [`progress_deadline`]: /nomad/docs/job-specification/update#progress_deadline +[`restart`]: /nomad/docs/job-specification/restart +[migrates]: /nomad/docs/job-specification/migrate +[replaces]: /nomad/docs/job-specification/disconnect#replace +[reschedules]: /nomad/docs/job-specification/reschedule diff --git a/website/content/docs/job-specification/restart.mdx b/website/content/docs/job-specification/restart.mdx index 7347deba6..1803d72dc 100644 --- a/website/content/docs/job-specification/restart.mdx +++ b/website/content/docs/job-specification/restart.mdx @@ -14,7 +14,8 @@ description: The "restart" block configures a group's behavior on task failure. /> The `restart` block configures a task's behavior on task failure. Restarts -happen on the client that is running the task. +happen on the client that is running the task. Restarts are different from +[rescheduling][], which happens when the tasks run out of restart attempts. ```hcl job "docs" { @@ -88,7 +89,7 @@ level, so that the Connect sidecar can inherit the default `restart`. than `attempts` times in an interval. For a detailed explanation of these values and their behavior, please see the [mode values section](#mode-values). -- `render_templates` `(bool: false)` - Specifies whether to re-render all +- `render_templates` `(bool: false)` - Specifies whether to re-render all templates when a task is restarted. If set to `true`, all templates will be re-rendered when the task restarts. This can be useful for re-fetching Vault secrets, even if the lease on the existing secrets has not yet expired. @@ -192,3 +193,4 @@ restart { [sidecar_task]: /nomad/docs/job-specification/sidecar_task [`reschedule`]: /nomad/docs/job-specification/reschedule +[rescheduling]: /nomad/docs/job-specification/reschedule
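Finally, a minimal sketch (values assumed) of the handoff this section
describes: the `restart` block acts in place on the client, and only when it
gives up with `mode = "fail"` does the group's `reschedule` policy take over:

```hcl
group "web" {
  restart {
    attempts = 3
    interval = "10m"
    delay    = "15s"
    mode     = "fail" # after 3 failed in-place restarts, fail the allocation
  }

  reschedule {
    # Unlimited reschedule attempts with exponentially growing delays.
    delay          = "30s"
    delay_function = "exponential"
    max_delay      = "1h"
    unlimited      = true
  }

  task "app" {
    driver = "docker"

    config {
      image = "nginx:1.27"
    }
  }
}
```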