nomad/website/content/docs/job-specification/restart.mdx
Tim Gross dc58f247ed docs: clarify reschedule, migrate, and replacement terminology (#24929)
Our vocabulary around scheduler behaviors outside of the `reschedule` and
`migrate` blocks leaves room for confusion around whether the reschedule tracker
should be propagated between allocations. There are effectively five different
behaviors we need to cover:

* restart: when the tasks of an allocation fail and we try to restart the tasks
  in place.

* reschedule: when the `restart` block runs out of attempts (or the allocation
  fails before tasks even start), and we need to move
  the allocation to another node to try again.

* migrate: when the user has asked to drain a node and we need to move the
  allocations. These are not failures, so we don't want to propagate the
  reschedule tracker.

* replacement: when a node is lost, we don't count that against the `reschedule`
  tracker for the allocations on the node (it's not the allocation's "fault",
  after all). We don't want to run the `migrate` machinery here either, as we
  can't contact the down node. To the scheduler, this is effectively the same as
  if we bumped the `group.count`.

* replacement for `disconnect.replace = true`: this is a replacement, but the
  replacement is intended to be temporary, so we propagate the reschedule tracker.

Add a section to the `reschedule`, `migrate`, and `disconnect` blocks explaining
when each item applies. Update the use of the word "reschedule" in several
places where "replacement" is correct, and vice-versa.

Fixes: https://github.com/hashicorp/nomad/issues/24918
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
2025-02-18 09:31:03 -05:00

---
layout: docs
page_title: restart Block - Job Specification
description: The "restart" block configures a group's behavior on task failure.
---
# `restart` Block
<Placement
groups={[
['job', 'group', 'restart'],
['job', 'group', 'task', 'restart'],
]}
/>
The `restart` block configures a task's behavior on task failure. Restarts
happen on the client that is running the task. Restarts are different from
[rescheduling][], which happens when the tasks run out of restart attempts.
```hcl
job "docs" {
  group "example" {
    restart {
      attempts = 3
      delay    = "30s"
    }
  }
}
```
If specified at the group level, the configuration is inherited by all
tasks in the group, including any [sidecar tasks][sidecar_task]. If
also present on the task, the policy is merged with the restart policy
from the encapsulating task group.
For example, assuming that the task group restart policy is:
```hcl
restart {
  interval         = "30m"
  attempts         = 2
  delay            = "15s"
  mode             = "fail"
  render_templates = true
}
```
and the individual task restart policy is:
```hcl
restart {
  attempts = 5
}
```
then the effective restart policy for the task will be:
```hcl
restart {
  interval         = "30m"
  attempts         = 5
  delay            = "15s"
  mode             = "fail"
  render_templates = true
}
```
Because sidecar tasks don't accept a `restart` block, we recommend setting
`restart` at the task level for jobs with sidecar tasks, so that the Connect
sidecar inherits the default `restart` policy rather than a group-level one.
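As a minimal sketch of this recommendation, the following job sets `restart` on the application task rather than on the group. The service name, port, task name, and image are hypothetical placeholders, not values taken from this page.

```hcl
job "docs" {
  group "example" {
    network {
      mode = "bridge"
    }

    # Hypothetical Connect-enabled service; Nomad injects the sidecar task for it.
    service {
      name = "example-api"
      port = "9001"

      connect {
        sidecar_service {}
      }
    }

    # Hypothetical application task.
    task "api" {
      driver = "docker"

      config {
        image = "example/api:1.0"
      }

      # Declared on the task, so it applies to this task only and is not
      # inherited by the sidecar.
      restart {
        attempts = 5
        delay    = "30s"
      }
    }
  }
}
```

With no group-level `restart` block present, there is no group policy for the sidecar to inherit.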
## `restart` Parameters
- `attempts` `(int: <varies>)` - Specifies the number of restarts allowed in the
configured interval. Defaults vary by job type; see below for more
information.
- `delay` `(string: "15s")` - Specifies the duration to wait before restarting a
task. This is specified using a label suffix like "30s" or "1h". A random
jitter of up to 25% is added to the delay.
- `interval` `(string: <varies>)` - Specifies the duration that begins when the
first task starts and within which at most `attempts` restarts are allowed.
If more than `attempts` failures happen within the interval, behavior is
controlled by `mode`. This is specified using a label suffix like "30s" or
"1h". Defaults vary by job type; see below for more information.
- `mode` `(string: "fail")` - Controls the behavior when the task fails more
than `attempts` times in an interval. For a detailed explanation of these
values and their behavior, please see the [mode values section](#mode-values).
- `render_templates` `(bool: false)` - Specifies whether to re-render all
templates when a task is restarted. If set to `true`, all templates will be re-rendered
when the task restarts. This can be useful for re-fetching Vault secrets, even if the
lease on the existing secrets has not yet expired.
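The parameters above can be combined in a single block. The following is an illustrative sketch only; the values are arbitrary choices for demonstration, not recommendations.

```hcl
restart {
  # Allow at most 4 restarts within each interval.
  attempts = 4

  # Interval within which the restarts are counted.
  interval = "1h"

  # Wait 30s before each restart; random jitter of up to 25% is added.
  delay = "30s"

  # Once the attempts are used up, wait for the next interval instead of
  # failing the allocation.
  mode = "delay"

  # Re-render all templates each time the task restarts.
  render_templates = true
}
```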
### `restart` Parameter Defaults
The values for many of the `restart` parameters vary by job type. Here are the
defaults by job type:
- The default batch restart policy is:
```hcl
restart {
  attempts         = 3
  delay            = "15s"
  interval         = "24h"
  mode             = "fail"
  render_templates = false
}
```
- The default service and system job restart policy is:
```hcl
restart {
  interval         = "30m"
  attempts         = 2
  delay            = "15s"
  mode             = "fail"
  render_templates = false
}
```
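As a hedged illustration of how these defaults interact with a partial block, the sketch below overrides only `attempts` in a batch job. It assumes that unset parameters fall back to the batch defaults listed above, in the same way that task-level and group-level policies are merged; verify this against your Nomad version.

```hcl
job "docs" {
  type = "batch"

  group "example" {
    restart {
      # Only attempts is overridden. Under the assumption above, delay,
      # interval, mode, and render_templates keep the batch defaults.
      attempts = 1
    }
  }
}
```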
### Disabling restart
To disable restarting, set the `attempts` parameter to zero and `mode` to `"fail"`.
```hcl
job "docs" {
  group "example" {
    restart {
      attempts = 0
      mode     = "fail"
    }
  }
}
```
### `mode` Values
This section details the specific values for the `mode` parameter of the
`restart` block. The mode is always specified as a string:
```hcl
restart {
  mode = "..."
}
```
- `"delay"` - Instructs the client to wait until the next `interval`
before restarting the task.
- `"fail"` - Instructs the client not to attempt to restart the task
once the number of `attempts` has been used. This is the default
behavior. This mode is useful for non-idempotent jobs which are
unlikely to succeed after a few failures. The allocation will be
marked as failed and the scheduler will attempt to reschedule the
allocation according to the [`reschedule`] block.
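To illustrate how `"fail"` mode hands off to rescheduling, the following sketch pairs a `restart` block with a `reschedule` block in the same group. The `reschedule` values are illustrative assumptions; see the [`reschedule`] documentation for the parameters and their defaults.

```hcl
job "docs" {
  group "example" {
    # After 2 failed restarts within 30 minutes, the allocation is marked
    # as failed on the client.
    restart {
      attempts = 2
      interval = "30m"
      delay    = "15s"
      mode     = "fail"
    }

    # The scheduler then decides whether and when to place the allocation
    # again, according to this block.
    reschedule {
      delay          = "30s"
      delay_function = "exponential"
      max_delay      = "1h"
      unlimited      = true
    }
  }
}
```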
### `restart` Examples
With the following `restart` block, a failing task will restart 3
times with 15 seconds between attempts, and then wait until the next
10-minute interval before making another 3 attempts. The task restart
will never fail the entire allocation.
```hcl
restart {
  attempts = 3
  delay    = "15s"
  interval = "10m"
  mode     = "delay"
}
```
With the following `restart` block, a task that fails after 1
minute, after 2 minutes, and after 3 minutes will be restarted each
time. If it fails again before the 10-minute interval ends, the entire
allocation will be marked as failed and the scheduler will follow the group's
[`reschedule`] specification, possibly resulting in a new evaluation.
```hcl
restart {
  attempts = 3
  delay    = "15s"
  interval = "10m"
  mode     = "fail"
}
```
[sidecar_task]: /nomad/docs/job-specification/sidecar_task
[`reschedule`]: /nomad/docs/job-specification/reschedule
[rescheduling]: /nomad/docs/job-specification/reschedule