nomad/website/content/docs/job-specification/reschedule.mdx
Tim Gross dc58f247ed docs: clarify reschedule, migrate, and replacement terminology (#24929)
Our vocabulary around scheduler behaviors outside of the `reschedule` and
`migrate` blocks leaves room for confusion around whether the reschedule tracker
should be propagated between allocations. There are effectively five different
behaviors we need to cover:

* restart: when the tasks of an allocation fail and we try to restart the tasks
  in place.

* reschedule: when the `restart` block runs out of attempts (or the allocation
  fails before tasks even start), and we need to move
  the allocation to another node to try again.

* migrate: when the user has asked to drain a node and we need to move the
  allocations. These are not failures, so we don't want to propagate the
  reschedule tracker.

* replacement: when a node is lost, we don't count that against the `reschedule`
  tracker for the allocations on the node (it's not the allocation's "fault",
  after all). We don't want to run the `migrate` machinery here either, as we
  can't contact the down node. To the scheduler, this is effectively the same as
  if we bumped the `group.count`.

* replacement for `disconnect.replace = true`: this is a replacement, but the
  replacement is intended to be temporary, so we propagate the reschedule tracker.

Add a section to the `reschedule`, `migrate`, and `disconnect` blocks explaining
when each item applies. Update the use of the word "reschedule" in several
places where "replacement" is correct, and vice-versa.

Fixes: https://github.com/hashicorp/nomad/issues/24918
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
2025-02-18 09:31:03 -05:00


---
layout: docs
page_title: reschedule Block - Job Specification
description: >-
  The "reschedule" block specifies the group's rescheduling strategy upon
  allocation failures. Nomad attempts to reschedule failed allocations onto
  another node only after any local
  [restarts](/nomad/docs/job-specification/restart) have been exhausted.
---
# `reschedule` Block

<Placement
  groups={[
    ['job', 'reschedule'],
    ['job', 'group', 'reschedule'],
  ]}
/>

The `reschedule` block specifies the group's rescheduling strategy. If specified
at the job level, the configuration applies to all groups within the job. If
the `reschedule` block is present on both the job and the group, Nomad merges
them, with the group block taking precedence over the job block.

Nomad attempts to schedule the allocation on another node if any of its task
statuses becomes `failed`. The scheduler prefers to create a replacement
allocation on a node that was not used by a previous allocation.

Rescheduling happens when the Nomad agent fails to set up the allocation or the
tasks of an allocation fail more than their [`restart`][] block allows. When a
node is drained, Nomad [migrates][] the allocations instead and ignores the
`reschedule` block. When a node is lost, Nomad [replaces][] the allocations
instead and ignores the `reschedule` block.

```hcl
job "docs" {
  group "example" {
    reschedule {
      attempts       = 15
      interval       = "1h"
      delay          = "30s"
      delay_function = "exponential"
      max_delay      = "120s"
      unlimited      = false
    }
  }
}
```

~> The reschedule block does not apply to `system` or `sysbatch` jobs because
they run on every node.
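
To illustrate the job-level and group-level merging described above, here is a minimal sketch. The group names `api` and `worker` are hypothetical, tasks are omitted for brevity, and the comments assume that Nomad merges the two blocks field by field, so a group only needs to set the values it wants to override.

```hcl
job "docs" {
  # Job-level policy: applies to every group in the job unless a group
  # sets its own value for a field.
  reschedule {
    attempts  = 3
    interval  = "1h"
    delay     = "30s"
    unlimited = false
  }

  group "api" {
    # No reschedule block here, so this group uses the job-level policy.
  }

  group "worker" {
    # The group-level value takes precedence for `attempts`.
    # Assumption: the fields left unset here (interval, delay, unlimited)
    # are inherited from the job-level block.
    reschedule {
      attempts = 10
    }
  }
}
```
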
## `reschedule` Parameters
- `attempts` `(int: <varies>)` - Specifies the number of reschedule attempts
  allowed in the configured interval. Defaults vary by job type; see below
  for more information.

- `interval` `(string: <varies>)` - Specifies the sliding window, which begins
  when the first reschedule attempt starts, and ensures that only `attempts`
  reschedules happen within it. If more than `attempts` failures happen
  within this interval, Nomad does not reschedule any further.

- `delay` `(string: <varies>)` - Specifies the duration to wait before
  attempting to reschedule a failed allocation. This is specified using a label
  suffix like "30s" or "1h". The delay cannot be less than 5 seconds.

- `delay_function` `(string: <varies>)` - Specifies the function that is used to
  calculate subsequent reschedule delays. The initial delay is specified by the
  `delay` parameter. `delay_function` has three possible values:

  - `constant` - The delay between reschedule attempts stays constant at the
    `delay` value.
  - `exponential` - The delay between reschedule attempts doubles with each
    attempt.
  - `fibonacci` - The delay between reschedule attempts is calculated by adding
    the two most recent delays applied. For example, if `delay` is set to 5
    seconds, the next five reschedule attempts are delayed by 5 seconds,
    5 seconds, 10 seconds, 15 seconds, and 25 seconds respectively.

- `max_delay` `(string: <varies>)` - Specifies an upper bound on the delay,
  beyond which it does not increase. This parameter is used when
  `delay_function` is `exponential` or `fibonacci`, and is ignored when
  `constant` is used.

- `unlimited` `(boolean: <varies>)` - Enables unlimited reschedule attempts. If
  this is set to `true`, the `attempts` and `interval` fields are not used. The
  [`progress_deadline`][] parameter within the `update` block is still adhered
  to when this is set to `true`, meaning no more reschedule attempts are
  triggered once the [`progress_deadline`][] is reached.

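
As a rough illustration of how these parameters combine, the sketch below configures three hypothetical groups, one per `delay_function` value. The group names are placeholders and tasks are omitted for brevity. The delay sequences in the comments are worked out by hand from the parameter descriptions above, assuming failures occur back to back; they are not captured Nomad output.

```hcl
job "docs" {
  # Bounded attempts: at most 2 reschedules within the 30-minute window,
  # each after a fixed 15-second delay. max_delay is ignored for "constant".
  group "constant" {
    reschedule {
      attempts       = 2
      interval       = "30m"
      delay          = "15s"
      delay_function = "constant"
      unlimited      = false
    }
  }

  # Doubling delays, bounded by max_delay.
  # Expected delays: 30s, 60s, 120s, then held at the 120s cap.
  group "exponential" {
    reschedule {
      delay          = "30s"
      delay_function = "exponential"
      max_delay      = "120s"
      unlimited      = true
    }
  }

  # Each delay is the sum of the two most recent delays, bounded by max_delay.
  # Expected delays: 5s, 5s, 10s, 15s, then 20s (25s capped by max_delay).
  group "fibonacci" {
    reschedule {
      delay          = "5s"
      delay_function = "fibonacci"
      max_delay      = "20s"
      unlimited      = true
    }
  }
}
```

Choosing `exponential` or `fibonacci` with a `max_delay` spaces out repeated failures without letting the wait grow indefinitely, while `constant` keeps retry timing predictable.
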
Information about reschedule attempts is displayed in the CLI and API for
allocations. Rescheduling is enabled by default for service and batch jobs
with the options shown below.

### `reschedule` Parameter Defaults
The values for the `reschedule` parameters vary by job type. Below are the
defaults by job type:
- The default batch reschedule policy is:

  ```hcl
  reschedule {
    attempts       = 1
    interval       = "24h"
    unlimited      = false
    delay          = "5s"
    delay_function = "constant"
  }
  ```

- The default service reschedule policy is:

  ```hcl
  reschedule {
    delay          = "30s"
    delay_function = "exponential"
    max_delay      = "1h"
    unlimited      = true
  }
  ```

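
When only one of these defaults needs to change, a partial block may be enough. The sketch below is an illustration under an assumption: that fields left unset fall back to the defaults for the job type, in the same way group-level values are merged over job-level ones. Verify this against your Nomad version before relying on it. The `max_delay` value of `"10m"` is an arbitrary example.

```hcl
job "docs" {
  type = "service"

  group "example" {
    # Only max_delay is overridden here.
    # Assumption: delay ("30s"), delay_function ("exponential"), and
    # unlimited (true) keep the service defaults listed above.
    reschedule {
      max_delay = "10m"
    }
  }
}
```
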
### Disabling rescheduling
To disable rescheduling, set the `attempts` parameter to zero and `unlimited` to false.
```hcl
job "docs" {
  group "example" {
    reschedule {
      attempts  = 0
      unlimited = false
    }
  }
}
```

[`progress_deadline`]: /nomad/docs/job-specification/update#progress_deadline
[`restart`]: /nomad/docs/job-specification/restart
[migrates]: /nomad/docs/job-specification/migrate
[replaces]: /nomad/docs/job-specification/disconnect#replace
[reschedules]: /nomad/docs/job-specification/reschedule