docs: clarify reschedule, migrate, and replacement terminology (#24929)
Our vocabulary around scheduler behaviors outside of the `reschedule` and
`migrate` blocks leaves room for confusion around whether the reschedule
tracker should be propagated between allocations. There are effectively five
different behaviors we need to cover:

* restart: when the tasks of an allocation fail and we try to restart the
  tasks in place.
* reschedule: when the `restart` block runs out of attempts (or the
  allocation fails before tasks even start), and we need to move the
  allocation to another node to try again.
* migrate: when the user has asked to drain a node and we need to move the
  allocations. These are not failures, so we don't want to propagate the
  reschedule tracker.
* replacement: when a node is lost, we don't count that against the
  `reschedule` tracker for the allocations on the node (it's not the
  allocation's "fault", after all). We don't want to run the `migrate`
  machinery here either, as we can't contact the down node. To the scheduler,
  this is effectively the same as if we bumped the `group.count`.
* replacement for `disconnect.replace = true`: this is a replacement, but the
  replacement is intended to be temporary, so we propagate the reschedule
  tracker.

Add a section to the `reschedule`, `migrate`, and `disconnect` blocks
explaining when each item applies. Update the use of the word "reschedule" in
several places where "replacement" is correct, and vice-versa.

Fixes: https://github.com/hashicorp/nomad/issues/24918

Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
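For context, these behaviors map onto separate job-specification blocks. A
rough sketch of how they fit together in one group (the group name and values
are illustrative, not part of this change):

```hcl
job "docs" {
  group "example" {
    count = 2

    # restart: tried first, in place on the same client
    restart {
      attempts = 2
      interval = "30m"
      delay    = "15s"
      mode     = "fail"
    }

    # reschedule: used once restart attempts are exhausted; moves the
    # allocation to another node and carries the reschedule tracker with it
    reschedule {
      attempts  = 3
      interval  = "24h"
      unlimited = false
    }

    # migrate: used only for node drains; not a failure, so the
    # reschedule tracker is not propagated
    migrate {
      max_parallel = 1
    }

    # disconnect.replace = true: a replacement that is intended to be
    # temporary, so the reschedule tracker is propagated
    disconnect {
      lost_after = "6h"
      replace    = true
    }
  }
}
```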
@@ -124,18 +124,18 @@ Usage: nomad job restart [options] <job>
   batch. It is also possible to specify additional time to wait between
   batches.
 
-  Allocations can be restarted in-place or rescheduled. When restarting
-  in-place the command may target specific tasks in the allocations, restart
-  only tasks that are currently running, or restart all tasks, even the ones
-  that have already run. Allocations can also be targeted by group. When both
-  groups and tasks are defined only the tasks for the allocations of those
-  groups are restarted.
+  You may restart in-place or migrated allocations. When restarting in-place,
+  the command may target specific tasks in the allocations, restart only tasks
+  that are currently running, or restart all tasks, even the ones that have
+  already run. Groups and tasks can also target allocations. When you define
+  both groups and tasks, Nomad restarts only the tasks for the allocations of
+  those groups.
 
-  When rescheduling, the current allocations are stopped triggering the Nomad
-  scheduler to create replacement allocations that may be placed in different
+  When migrating, Nomad stops the current allocations, triggering the Nomad
+  scheduler to create new allocations that may be placed in different
   clients. The command waits until the new allocations have client status
-  'ready' before proceeding with the remaining batches. Services health checks
-  are not taken into account.
+  'ready' before proceeding with the remaining batches. The command does not
+  consider service health checks.
 
   By default the command restarts all running tasks in-place with one
   allocation per batch.
@@ -183,12 +183,13 @@ Restart Options:
     proceed. If 'fail' the command exits immediately. Defaults to 'ask'.
 
   -reschedule
-    If set, allocations are stopped and rescheduled instead of restarted
+    If set, allocations are stopped and migrated instead of restarted
     in-place. Since the group is not modified the restart does not create a new
     deployment, and so values defined in 'update' blocks, such as
     'max_parallel', are not taken into account. This option cannot be used with
     '-task'. Only jobs of type 'batch', 'service', and 'system' can be
-    rescheduled.
+    migrated. Note that despite the name of this flag, this command migrates but
+    does not reschedule allocations, so it ignores the 'reschedule' block.
 
   -task=<task-name>
     Specify the task to restart. Can be specified multiple times. If groups are
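For reference, the help text above corresponds to invocations like the
following sketch. The job name is a placeholder, and `-batch-size` is assumed
here as the batching flag this help text alludes to:

```plaintext
# Restart all running tasks in place, one allocation per batch (the default)
nomad job restart example

# Stop and migrate allocations instead, two per batch, exiting on any error
nomad job restart -reschedule -batch-size=2 -on-error=fail example
```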
@@ -469,7 +469,8 @@ func (s *GenericScheduler) computeJobAllocs() error {
     return s.computePlacements(destructive, place, results.taskGroupAllocNameIndexes)
 }
 
-// downgradedJobForPlacement returns the job appropriate for non-canary placement replacement
+// downgradedJobForPlacement returns the previous stable version of the job for
+// downgrading a placement for non-canaries
 func (s *GenericScheduler) downgradedJobForPlacement(p placementResult) (string, *structs.Job, error) {
     ns, jobID := s.job.Namespace, s.job.ID
     tgName := p.TaskGroup().Name
@@ -587,8 +588,8 @@ func (s *GenericScheduler) computePlacements(destructive, place []placementResul
         }
 
         // Check if we should stop the previous allocation upon successful
-        // placement of its replacement. This allow atomic placements/stops. We
-        // stop the allocation before trying to find a replacement because this
+        // placement of the new alloc. This allow atomic placements/stops. We
+        // stop the allocation before trying to place the new alloc because this
         // frees the resources currently used by the previous allocation.
         stopPrevAlloc, stopPrevAllocDesc := missing.StopPreviousAlloc()
         prevAllocation := missing.PreviousAllocation()
@@ -715,7 +716,7 @@ func (s *GenericScheduler) computePlacements(destructive, place []placementResul
             // Track the fact that we didn't find a placement
             s.failedTGAllocs[tg.Name] = s.ctx.Metrics()
 
-            // If we weren't able to find a replacement for the allocation, back
+            // If we weren't able to find a placement for the allocation, back
             // out the fact that we asked to stop the allocation.
             if stopPrevAlloc {
                 s.plan.PopUpdate(prevAllocation)
@@ -32,18 +32,17 @@ The command can operate in batches and wait until all restarted or
 rescheduled allocations are running again before proceeding to the next batch.
 It is also possible to specify additional time to wait between batches.
 
-Allocations can be restarted in-place or rescheduled. When restarting
-in-place the command may target specific tasks in the allocations, restart
-only tasks that are currently running, or restart all tasks, even the ones
-that have already run. Allocations can also be targeted by groups and tasks.
-When both groups and tasks are defined only the tasks for the allocations of
-those groups are restarted.
+You may restart in-place or migrated allocations. When restarting in-place, the
+command may target specific tasks in the allocations, restart only tasks that
+are currently running, or restart all tasks, even the ones that have already
+run. Groups and tasks can also target allocations. When you define both groups
+and tasks, Nomad restarts only the tasks for the allocations of those groups.
 
-When rescheduling, the current allocations are stopped triggering the Nomad
-scheduler to create replacement allocations that may be placed in different
-clients. The command waits until the new allocations have client status `ready`
-before proceeding with the remaining batches. Services health checks are not
-taken into account.
+When migrating, Nomad stops the current allocations, triggering the Nomad
+scheduler to create new allocations that may be placed in different clients. The
+command waits until the new allocations have client status `ready` before
+proceeding with the remaining batches. The command does not consider service
+health checks.
 
 By default the command restarts all running tasks in-place with one allocation
 per batch.
@@ -82,12 +81,13 @@ of the exact job ID.
   shutdown or restart. Note that using this flag will result in failed network
   connections to the allocation being restarted.
 
-- `-reschedule`: If set, allocations are stopped and rescheduled instead of
-  restarted in-place. Since the group is not modified the restart does not
-  create a new deployment, and so values defined in [`update`][] blocks, such
-  as [`max_parallel`][], are not taken into account. This option cannot be used
-  with `-task`. Only jobs of type `batch`, `service`, and `system` can be
-  rescheduled.
+- `-reschedule`: If set, Nomad stops and migrates allocations instead of
+  restarting in-place. Since the group is not modified, the restart does not
+  create a new deployment, and so values defined in [`update`][] blocks, such as
+  [`max_parallel`][], are not considered. This option cannot be used with
+  `-task`. You may only migrate jobs of type `batch`, `service`, and `system`.
+  Note that despite the name of this flag, this command migrates but does not
+  reschedule allocations, so it ignores the `reschedule` block.
 
 - `-on-error=<ask|fail>`: Determines what action to take when an error happens
   during a restart batch. If `ask` the command stops and waits for user
@@ -438,17 +438,17 @@ Nomad Clients periodically heartbeat to Nomad Servers to confirm they are
 operating as expected. Nomad Clients which do not heartbeat in the specified
 amount of time are considered `down` and their allocations are marked as `lost`
 or `disconnected` (if [`disconnect.lost_after`][disconnect.lost_after] is set)
-and rescheduled.
+and replaced.
 
 The various heartbeat related parameters allow you to tune the following
 tradeoffs:
 
-- The longer the heartbeat period, the longer a `down` Client's workload will
-  take to be rescheduled.
+- The longer the heartbeat period, the longer Nomad takes to replace a `down`
+  Client's workload.
 - The shorter the heartbeat period, the more likely transient network issues,
   leader elections, and other temporary issues could cause a perfectly
   functional Client and its workloads to be marked as `down` and the work
-  rescheduled.
+  replaced.
 
 While Nomad Clients can connect to any Server, all heartbeats are forwarded to
 the leader for processing. Since this heartbeat processing consumes resources,
@@ -510,7 +510,7 @@ system has for a delay in noticing crashed Clients. For example a
 `failover_heartbeat_ttl` of 30 minutes may give even the slowest clients in the
 largest clusters ample time to heartbeat after an election. However if the
 election was due to a datacenter-wide failure affecting Clients, it will be 30
-minutes before Nomad recognizes that they are `down` and reschedules their
+minutes before Nomad recognizes that they are `down` and replaces their
 work.
 
 [encryption]: /nomad/tutorials/transport-security/security-gossip-encryption 'Nomad Encryption Overview'
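These tradeoffs are tuned in the agent's `server` block. A minimal sketch,
assuming the parameters discussed on the page above (values are illustrative,
not recommendations):

```hcl
server {
  enabled = true

  # Grace period beyond the TTL before a client is marked down.
  heartbeat_grace = "10s"

  # Lower bound on the TTL the leader hands out to clients; raising it
  # trades slower failure detection for less load on the leader.
  min_heartbeat_ttl = "10s"

  # Cap on how many heartbeats per second the leader processes.
  max_heartbeats_per_second = 50.0

  # TTL granted to all clients after a leader election, as in the
  # 30-minute example above.
  failover_heartbeat_ttl = "5m"
}
```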
@@ -14,7 +14,14 @@ description: |-
 The `disconnect` block describes the system's behavior in case of a network
 partition. By default, without a `disconnect` block, if an allocation is on a
 node that misses heartbeats, the allocation will be marked `lost` and will be
-rescheduled.
+replaced.
+
+Replacement happens when a node is lost. When a node is drained, Nomad
+[migrates][] the allocations instead, and Nomad ignores the `disconnect`
+block. When a Nomad agent fails to set up the allocation or the tasks of an
+allocation fail more than their [`restart`][] block allows, Nomad
+[reschedules][] the allocations and ignores the `disconnect`.
+
 
 ```hcl
 job "docs" {
@@ -51,11 +58,12 @@ same `disconnect` block.
 
   Refer to [the Lost After section][lost-after] for more details.
 
-- `replace` `(bool: false)` - Specifies if the disconnected allocation should
-  be replaced by a new one rescheduled on a different node. If false and the
-  node it is running on becomes disconnected or goes down, this allocation
-  won't be rescheduled and will be reported as `unknown` until the node reconnects,
-  or until the allocation is manually stopped:
+- `replace` `(bool: false)` - Specifies if Nomad should replace the disconnected
+  allocation with a new one rescheduled on a different node. Nomad considers the
+  replacement allocation a reschedule and obeys the job's [`reschedule`][]
+  block. If false and the node the allocation is running on disconnects
+  or goes down, Nomad does not replace this allocation and reports `unknown`
+  until the node reconnects, or until you manually stop the allocation.
 
   ```plaintext
   `nomad alloc stop <alloc ID>`
@@ -84,7 +92,7 @@ same `disconnect` block.
   - `keep_original`: Always keep the original allocation. Bear in mind
     when choosing this option, it can have crashed while the client was
     disconnected.
-  - `keep_replacement`: Always keep the allocation that was rescheduled
+  - `keep_replacement`: Always keep the allocation that was replaced
     to replace the disconnected one.
   - `best_score`: Keep the allocation running on the node with the best
     score.
@@ -102,17 +110,17 @@ The following examples only show the `disconnect` blocks. Remember that the
 This example shows how `stop_on_client_after` interacts with
 other blocks. For the `first` group, after the default 10 second
 [`heartbeat_grace`] window expires and 90 more seconds passes, the
-server will reschedule the allocation. The client will wait 90 seconds
+server replaces the allocation. The client waits 90 seconds
 before sending a stop signal (`SIGTERM`) to the `first-task`
 task. After 15 more seconds because of the task's `kill_timeout`, the
 client will send `SIGKILL`. The `second` group does not have
-`stop_on_client_after`, so the server will reschedule the
+`stop_on_client_after`, so the server replaces the
 allocation after the 10 second [`heartbeat_grace`] expires. It will
 not be stopped on the client, regardless of how long the client is out
 of touch.
 
 Note that if the server's clocks are not closely synchronized with
-each other, the server may reschedule the group before the client has
+each other, the server may replace the group before the client has
 stopped the allocation. Operators should ensure that clock drift
 between servers is as small as possible.
 
@@ -217,3 +225,7 @@ group "second" {
 [stop-after]: /nomad/docs/job-specification/disconnect#stop-after
 [lost-after]: /nomad/docs/job-specification/disconnect#lost-after
 [`reconcile`]: /nomad/docs/job-specification/disconnect#reconcile
+[migrates]: /nomad/docs/job-specification/migrate
+[`restart`]: /nomad/docs/job-specification/restart
+[reschedules]: /nomad/docs/job-specification/reschedule
+[`reschedule`]: /nomad/docs/job-specification/reschedule
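Putting the documented parameters together, a `disconnect` block that opts
into temporary replacement might look like this sketch (group name and
durations illustrative):

```hcl
group "cache" {
  disconnect {
    # Keep allocations "unknown" rather than "lost" for up to six hours
    # after their node stops heartbeating.
    lost_after = "6h"

    # Replace the disconnected allocation on another node; per the docs
    # above, the replacement counts against the job's reschedule block.
    replace = true

    # When the node reconnects, keep whichever allocation has the
    # better placement score.
    reconcile = "best_score"
  }
}
```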
@@ -48,9 +48,9 @@ job "docs" {
   ephemeral disk requirements of the group. Ephemeral disks can be marked as
   sticky and support live data migrations.
 
 - `disconnect` <code>([disconnect][]: nil)</code> - Specifies the disconnect
   strategy for the server and client for all tasks in this group in case of a
   network partition. The tasks can be left unconnected, stopped or replaced
   when the client disconnects. The policy for reconciliation in case the client
   regains connectivity is also specified here.
 
@@ -65,14 +65,14 @@ job "docs" {
   requirements and configuration, including static and dynamic port allocations,
   for the group.
 
-- `prevent_reschedule_on_lost` `(bool: false)` - Defines the reschedule behaviour
-  of an allocation when the node it is running on misses heartbeats.
-  When enabled, if the node it is running on becomes disconnected
-  or goes down, this allocations wont be rescheduled and will show up as `unknown`
-  until the node comes back up or it is manually restarted.
+- `prevent_reschedule_on_lost` `(bool: false)` - Defines the replacement
+  behavior of an allocation when the node it is running on misses heartbeats.
+  When enabled, if the node disconnects or goes down,
+  Nomad does not replace this allocation and shows it as `unknown` until the node
+  reconnects or you manually restart the node.
 
-  This behaviour will only modify the reschedule process on the server.
-  To modify the allocation behaviour on the client, see
+  This behavior only modifies the replacement process on the server. To
+  modify the allocation behavior on the client, refer to
   [`stop_after_client_disconnect`](#stop_after_client_disconnect).
 
   The `unknown` allocation has to be manually stopped to run it again.
@@ -84,7 +84,7 @@ job "docs" {
   Setting `max_client_disconnect` and `prevent_reschedule_on_lost = true` at the
   same time requires that [rescheduling is disabled entirely][`disable_rescheduling`].
 
-  This field was deprecated in favour of `replace` on the [`disconnect`] block,
+  We deprecated this field in favor of `replace` on the [`disconnect`] block,
   see [example below][disconect_migration] for more details about migrating.
 
 - `reschedule` <code>([Reschedule][]: nil)</code> - Allows to specify a
@@ -299,18 +299,18 @@ issues with stateful tasks or tasks with long restart times.
 
 Instead, an operator may desire that these allocations reconnect without a
 restart. When `max_client_disconnect` or `disconnect.lost_after` is specified,
-the Nomad server will mark clients that fail to heartbeat as "disconnected"
+the Nomad server marks clients that fail to heartbeat as "disconnected"
 rather than "down", and will mark allocations on a disconnected client as
 "unknown" rather than "lost". These allocations may continue to run on the
 disconnected client. Replacement allocations will be scheduled according to the
 allocations' `disconnect.replace` settings. until the disconnected client
-reconnects. Once a disconnected client reconnects, Nomad will compare the "unknown"
-allocations with their replacements and will decide which ones to keep according
+reconnects. Once a disconnected client reconnects, Nomad compares the "unknown"
+allocations with their replacements and decides which ones to keep according
 to the `disconnect.replace` setting. If the `max_client_disconnect` or
 `disconnect.losta_after` duration expires before the client reconnects,
 the allocations will be marked "lost".
 Clients that contain "unknown" allocations will transition to "disconnected"
 rather than "down" until the last `max_client_disconnect` or `disconnect.lost_after`
 duration has expired.
 
 In the example code below, if both of these task groups were placed on the same
@@ -390,7 +390,7 @@ will remain as `unknown` and won't be rescheduled.
 #### Migration to `disconnect` block
 
 The new configuration fileds in the disconnect block work exactly the same as the
 ones they are replacing:
 * `stop_after_client_disconnect` is replaced by `stop_after`
 * `max_client_disconnect` is replaced by `lost_after`
 * `prevent_reschedule_on_lost` is replaced by `replace`
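Applied to a group, the field mapping listed in the last hunk above looks
roughly like this sketch (durations illustrative). Note that per the docs
quoted above, `prevent_reschedule_on_lost = true` corresponds to
`replace = false`, since both mean the allocation is left `unknown`:

```hcl
# Before: deprecated group-level fields
group "cache" {
  stop_after_client_disconnect = "90s"
  max_client_disconnect        = "3h"
  prevent_reschedule_on_lost   = true
}

# After: the equivalent disconnect block
group "cache" {
  disconnect {
    stop_after = "90s"
    lost_after = "3h"
    replace    = false # inverted sense of prevent_reschedule_on_lost
  }
}
```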
@@ -22,6 +22,13 @@ If specified at the job level, the configuration will apply to all groups
 within the job. Only service jobs with a count greater than 1 support migrate
 blocks.
 
+Migrating happens when a Nomad node is drained. When a node is lost, Nomad
+[replaces][] the allocations instead and ignores the `migrate` block. When the
+agent fails to set up the allocation or the tasks of an allocation more than
+their [`restart`][] block allows, Nomad [reschedules][] the allocations instead
+and ignores the `migrate` block.
+
 
 ```hcl
 job "docs" {
   migrate {
@@ -78,3 +85,6 @@ on node draining.
 [count]: /nomad/docs/job-specification/group#count
 [drain]: /nomad/docs/commands/node/drain
 [deadline]: /nomad/docs/commands/node/drain#deadline
+[replaces]: /nomad/docs/job-specification/disconnect#replace
+[`restart`]: /nomad/docs/job-specification/restart
+[reschedules]: /nomad/docs/job-specification/reschedule
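For reference, a `migrate` block controlling drain behavior might look like
this sketch (the values shown are the block's documented defaults, included
here for illustration):

```hcl
group "web" {
  count = 3

  migrate {
    # How many allocations to move at once during a node drain.
    max_parallel = 1

    # Judge replacement health by registered checks or by task states.
    health_check = "checks"

    # A replacement must stay healthy this long before the next batch.
    min_healthy_time = "10s"

    # Mark the migration failed if health is not reached by this deadline.
    healthy_deadline = "5m"
  }
}
```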
@@ -22,15 +22,21 @@ description: >-
 ]}
 />
 
-The `reschedule` block specifies the group's rescheduling strategy. If specified at the job
-level, the configuration will apply to all groups within the job. If the
-reschedule block is present on both the job and the group, they are merged with
-the group block taking the highest precedence and then the job.
+The `reschedule` block specifies the group's rescheduling strategy. If specified
+at the job level, the configuration will apply to all groups within the job. If
+the reschedule block is present on both the job and the group, they are merged
+with the group block taking the highest precedence and then the job.
 
-Nomad will attempt to schedule the allocation on another node if any of its
-task statuses become `failed`. The scheduler prefers to create a replacement
+Nomad will attempt to schedule the allocation on another node if any of its task
+statuses become `failed`. The scheduler prefers to create a replacement
 allocation on a node that was not used by a previous allocation.
 
+Rescheduling happens when the Nomad agent fails to set up the allocation or the
+tasks of an allocation fail more than their [`restart`][] block allows. When a
+node is drained, Nomad [migrates][] the allocations instead and ignores the
+`reschedule` block. When a node is lost, Nomad [replaces][] the allocations
+instead and ignores the `reschedule` block.
+
 
 ```hcl
 job "docs" {
@@ -131,3 +137,7 @@ job "docs" {
 ```
 
 [`progress_deadline`]: /nomad/docs/job-specification/update#progress_deadline
+[`restart`]: /nomad/docs/job-specification/restart
+[migrates]: /nomad/docs/job-specification/migrate
+[replaces]: /nomad/docs/job-specification/disconnect#replace
+[reschedules]: /nomad/docs/job-specification/reschedule
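A sketch of a `reschedule` block using exponential backoff (values
illustrative):

```hcl
group "web" {
  reschedule {
    # Wait 30s before the first reschedule, then back off exponentially
    # up to one hour between attempts.
    delay          = "30s"
    delay_function = "exponential"
    max_delay      = "1h"

    # Keep trying on new nodes indefinitely; with unlimited = true,
    # attempts and interval are not used.
    unlimited = true
  }
}
```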
@@ -14,7 +14,8 @@ description: The "restart" block configures a group's behavior on task failure.
 />
 
 The `restart` block configures a task's behavior on task failure. Restarts
-happen on the client that is running the task.
+happen on the client that is running the task. Restarts are different from
+[rescheduling][], which happens when the tasks run out of restart attempts.
 
 ```hcl
 job "docs" {
@@ -88,7 +89,7 @@ level, so that the Connect sidecar can inherit the default `restart`.
 than `attempts` times in an interval. For a detailed explanation of these
 values and their behavior, please see the [mode values section](#mode-values).
 
 - `render_templates` `(bool: false)` - Specifies whether to re-render all
   templates when a task is restarted. If set to `true`, all templates will be re-rendered
   when the task restarts. This can be useful for re-fetching Vault secrets, even if the
   lease on the existing secrets has not yet expired.
@@ -192,3 +193,4 @@ restart {
 
 [sidecar_task]: /nomad/docs/job-specification/sidecar_task
 [`reschedule`]: /nomad/docs/job-specification/reschedule
+[rescheduling]: /nomad/docs/job-specification/reschedule
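And a `restart` block showing the hand-off described above: the client retries
in place until attempts run out, after which rescheduling takes over (values
illustrative):

```hcl
group "web" {
  restart {
    # Try up to three in-place restarts within each 30-minute interval,
    # waiting 15s between tries.
    attempts = 3
    interval = "30m"
    delay    = "15s"

    # "fail" stops restarting once attempts are exhausted, which is what
    # triggers rescheduling; "delay" would instead wait out the interval
    # and keep restarting on the same client.
    mode = "fail"

    # Re-render templates (for example, Vault secrets) on each restart.
    render_templates = true
  }
}
```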