mirror of
https://github.com/kemko/nomad.git
synced 2026-01-06 18:35:44 +03:00
Our vocabulary around scheduler behaviors outside of the `reschedule` and `migrate` blocks leaves room for confusion around whether the reschedule tracker should be propagated between allocations. There are effectively five different behaviors we need to cover: * restart: when the tasks of an allocation fail and we try to restart the tasks in place. * reschedule: when the `restart` block runs out of attempts (or the allocation fails before tasks even start), and we need to move the allocation to another node to try again. * migrate: when the user has asked to drain a node and we need to move the allocations. These are not failures, so we don't want to propagate the reschedule tracker. * replacement: when a node is lost, we don't count that against the `reschedule` tracker for the allocations on the node (it's not the allocation's "fault", after all). We don't want to run the `migrate` machinery here here either, as we can't contact the down node. To the scheduler, this is effectively the same as if we bumped the `group.count` * replacement for `disconnect.replace = true`: this is a replacement, but the replacement is intended to be temporary, so we propagate the reschedule tracker. Add a section to the `reschedule`, `migrate`, and `disconnect` blocks explaining when each item applies. Update the use of the word "reschedule" in several places where "replacement" is correct, and vice-versa. Fixes: https://github.com/hashicorp/nomad/issues/24918 Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
144 lines
4.9 KiB
Plaintext
144 lines
4.9 KiB
Plaintext
---
|
|
layout: docs
|
|
page_title: reschedule Block - Job Specification
|
|
description: >-
|
|
The "reschedule" block specifies the group's rescheduling strategy upon
|
|
|
|
allocation failures. Nomad will only attempt to reschedule failed allocations
|
|
on
|
|
|
|
to another node only after any local
|
|
[restarts](/nomad/docs/job-specification/restart)
|
|
|
|
have been exceeded.
|
|
---
|
|
|
|
# `reschedule` Block
|
|
|
|
<Placement
|
|
groups={[
|
|
['job', 'reschedule'],
|
|
['job', 'group', 'reschedule'],
|
|
]}
|
|
/>
|
|
|
|
The `reschedule` block specifies the group's rescheduling strategy. If specified
|
|
at the job level, the configuration will apply to all groups within the job. If
|
|
the reschedule block is present on both the job and the group, they are merged
|
|
with the group block taking the highest precedence and then the job.
|
|
|
|
Nomad will attempt to schedule the allocation on another node if any of its task
|
|
statuses become `failed`. The scheduler prefers to create a replacement
|
|
allocation on a node that was not used by a previous allocation.
|
|
|
|
Rescheduling happens when the Nomad agent fails to set up the allocation or the
|
|
tasks of an allocation fail more than their [`restart`][] block allows. When a
|
|
node is drained, Nomad [migrates][] the allocations instead and ignores the
|
|
`reschedule` block. When a node is lost, Nomad [replaces][] the allocations
|
|
instead and ignores the `reschedule` block.
|
|
|
|
|
|
```hcl
|
|
job "docs" {
|
|
group "example" {
|
|
reschedule {
|
|
attempts = 15
|
|
interval = "1h"
|
|
delay = "30s"
|
|
delay_function = "exponential"
|
|
max_delay = "120s"
|
|
unlimited = false
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
~> The reschedule block does not apply to `system` or `sysbatch` jobs because
|
|
they run on every node.
|
|
|
|
## `reschedule` Parameters
|
|
|
|
- `attempts` `(int: <varies>)` - Specifies the number of reschedule attempts
|
|
allowed in the configured interval. Defaults vary by job type, see below
|
|
for more information.
|
|
|
|
- `interval` `(string: <varies>)` - Specifies the sliding window which begins
|
|
when the first reschedule attempt starts and ensures that only `attempts`
|
|
number of reschedule happen within it. If more than `attempts` number of
|
|
failures happen with this interval, Nomad will not reschedule any more.
|
|
|
|
- `delay` `(string: <varies>)` - Specifies the duration to wait before attempting
|
|
to reschedule a failed task. This is specified using a label suffix like "30s" or "1h".
|
|
Delay cannot be less than 5 seconds.
|
|
|
|
- `delay_function` `(string: <varies>)` - Specifies the function that is used to
|
|
calculate subsequent reschedule delays. The initial delay is specified by the delay parameter.
|
|
`delay_function` has three possible values which are described below.
|
|
|
|
- `constant` - The delay between reschedule attempts stays constant at the `delay` value.
|
|
- `exponential` - The delay between reschedule attempts doubles.
|
|
- `fibonacci` - The delay between reschedule attempts is calculated by adding the two most recent
|
|
delays applied. For example if `delay` is set to 5 seconds, the next five reschedule attempts will be
|
|
delayed by 5 seconds, 5 seconds, 10 seconds, 15 seconds, and 25 seconds respectively.
|
|
|
|
- `max_delay` `(string: <varies>)` - is an upper bound on the delay beyond which it will not increase. This parameter
|
|
is used when `delay_function` is `exponential` or `fibonacci`, and is ignored when `constant` delay is used.
|
|
|
|
- `unlimited` `(boolean:<varies>)` - `unlimited` enables unlimited reschedule attempts. If this is
|
|
set to `true` the `attempts` and `interval` fields are not used. The [`progress_deadline`][]
|
|
parameter within the update block is still adhered to when this is set to `true`, meaning no more
|
|
reschedule attempts are triggered once the [`progress_deadline`][] is reached.
|
|
|
|
Information about reschedule attempts are displayed in the CLI and API for
|
|
allocations. Rescheduling is enabled by default for service and batch jobs
|
|
with the options shown below.
|
|
|
|
### `reschedule` Parameter Defaults
|
|
|
|
The values for the `reschedule` parameters vary by job type. Below are the
|
|
defaults by job type:
|
|
|
|
- The default batch reschedule policy is:
|
|
|
|
```hcl
|
|
reschedule {
|
|
attempts = 1
|
|
interval = "24h"
|
|
unlimited = false
|
|
delay = "5s"
|
|
delay_function = "constant"
|
|
}
|
|
```
|
|
|
|
- The default service reschedule policy is:
|
|
|
|
```hcl
|
|
reschedule {
|
|
delay = "30s"
|
|
delay_function = "exponential"
|
|
max_delay = "1h"
|
|
unlimited = true
|
|
}
|
|
```
|
|
|
|
### Disabling rescheduling
|
|
|
|
To disable rescheduling, set the `attempts` parameter to zero and `unlimited` to false.
|
|
|
|
```hcl
|
|
job "docs" {
|
|
group "example" {
|
|
reschedule {
|
|
attempts = 0
|
|
unlimited = false
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
[`progress_deadline`]: /nomad/docs/job-specification/update#progress_deadline
|
|
[`restart`]: /nomad/docs/job-specification/restart
|
|
[migrates]: /nomad/docs/job-specification/migrate
|
|
[replaces]: /nomad/docs/job-specification/disconnect#replace
|
|
[reschedules]: /nomad/docs/job-specification/reschedule
|