Files
nomad/website/content/docs/job-specification/disconnect.mdx
Aimee Ukasick 986f3c727a Docs: SEO job spec section (#25612)
* action page

* change all page_title fields

* update title

* constraint through migrate pages

* update page title and heading to use sentence case

* fix front matter description

* Apply suggestions from code review

Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>

---------

Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>
2025-05-19 09:02:07 -05:00

230 lines
8.3 KiB
Plaintext

---
layout: docs
page_title: disconnect block in the job specification
description: |-
Describe system behavior for disconnected allocations in the `disconnect` block of the Nomad job specification. Specify whether Nomad should replace a disconnected allocation. Configure node reconciliation behavior. Set a duration for Nomad to attempt allocation reconnection and a duration after which a disconnected Nomad client stops its allocations. Review examples for stop after and lost after configuration.
---
# `disconnect` block in the job specification
<Placement groups={['job', 'group', 'disconnect']} />
The `disconnect` block describes the system's behavior in case of a network
partition. By default, without a `disconnect` block, if an allocation is on a
node that misses heartbeats, the allocation will be marked `lost` and will be
replaced.
Replacement happens when a node is lost. When a node is drained, Nomad
[migrates][] the allocations instead, and Nomad ignores the `disconnect`
block. When a Nomad agent fails to set up the allocation or the tasks of an
allocation fail more than their [`restart`][] block allows, Nomad
[reschedules][] the allocations and ignores the `disconnect`.
```hcl
job "docs" {
group "example" {
disconnect {
lost_after = "6h"
replace = false
reconcile = "keep_original"
}
}
group "example2" {
disconnect {
stop_on_client_after = "12h"
replace = false
reconcile = "keep_original"
}
}
}
```
~> Note that you cannot use both`lost_after` and `stop_on_client_after` in the
same `disconnect` block.
## Parameters
- `lost_after` `(string: "")` - Specifies a duration during which a Nomad client
will attempt to reconnect allocations after it fails to heartbeat in the
[`heartbeat_grace`][] window. It defaults to "", which is equivalent to
having the disconnect block be nil.
You cannot use `lost_after` and `stop_on_client_after` in the same
`disconnect` block.
Refer to [the Lost After section][lost-after] for more details.
- `replace` `(bool: false)` - Specifies if Nomad should replace the disconnected
allocation with a new one rescheduled on a different node. Nomad considers the
replacement allocation a reschedule and obeys the job's [`reschedule`][]
block. If false and the node the allocation is running on disconnects
or goes down, Nomad does not replace this allocation and reports `unknown`
until the node reconnects, or until you manually stop the allocation.
```plaintext
`nomad alloc stop <alloc ID>`
```
If true, a new alloc will be placed immediately upon the node becoming
disconnected.
- `stop_on_client_after` `(string: "")` - Specifies a duration after which a
disconnected Nomad client will stop its allocations. Setting
`stop_on_client_after` shorter than `lost_after` and `replace = false` at the
same time is not permitted and will cause a validation error, because this
would lead to a state where no allocations can be scheduled.
The Nomad client process must be running for this to occur.
You cannot use `stop_on_client_after` and `lost_after` in the same
`disconnect` block.
Refer to [the Stop After section][stop-after] for more details.
- `reconcile` `(string: "best_score")` - Specifies which allocation to keep once
the previously disconnected node regains connectivity.
It has four possible values which are described below:
- `keep_original`: Always keep the original allocation. Bear in mind
when choosing this option, it can have crashed while the client was
disconnected.
- `keep_replacement`: Always keep the allocation that was replaced
to replace the disconnected one.
- `best_score`: Keep the allocation running on the node with the best
score.
- `longest_running`: Keep the allocation that has been up and running
continuously for the longest time.
## Examples
The following examples only show the `disconnect` blocks. Remember that the
`disconnect` block is only valid in the placements listed previously.
### Stop After
This example shows how `stop_on_client_after` interacts with
other blocks. For the `first` group, after the default 10 second
[`heartbeat_grace`] window expires and 90 more seconds passes, the
server replaces the allocation. The client waits 90 seconds
before sending a stop signal (`SIGTERM`) to the `first-task`
task. After 15 more seconds because of the task's `kill_timeout`, the
client will send `SIGKILL`. The `second` group does not have
`stop_on_client_after`, so the server replaces the
allocation after the 10 second [`heartbeat_grace`] expires. It will
not be stopped on the client, regardless of how long the client is out
of touch.
Note that if the server's clocks are not closely synchronized with
each other, the server may replace the group before the client has
stopped the allocation. Operators should ensure that clock drift
between servers is as small as possible.
Note also that a group using this feature will be stopped on the
client if the Nomad server cluster fails, since the client will be
unable to contact any server in that case. Groups opting in to this
feature are therefore exposed to an additional runtime dependency and
potential point of failure.
```hcl
group "first" {
disconnect {
stop_on_client_after = "90s"
}
task "first-task" {
kill_timeout = "15s"
}
}
group "second" {
task "second-task" {
kill_timeout = "5s"
}
}
```
### Lost After
By default, allocations running on a client that fails to heartbeat will be
marked "lost". When a client reconnects, its allocations, which may still be
healthy, will restart because they have been marked "lost". This can cause
issues with stateful tasks or tasks with long restart times.
Instead, an operator may desire that these allocations reconnect without a
restart. When `lost_after` is specified, the Nomad server will mark
clients that fail to heartbeat as "disconnected" rather than "down", and will
mark allocations on a disconnected client as "unknown" rather than "lost".
These allocations may continue to run on the disconnected client. Replacement
allocations will be scheduled according to the allocations' `replace` settings
until the disconnected client reconnects. Once a disconnected client reconnects,
Nomad will compare the "unknown" allocations with their replacements will
decide which ones to keep according to the `reconcile` setting.
If the `lost_after` duration expires before the client reconnects,
the allocations will be marked "lost". Clients that contain "unknown"
allocations will transition to "disconnected" rather than "down" until the last
`lost_after` duration has expired.
In the example code below, if both of these task groups were placed on the same
client and that client experienced a network outage, both of the group's
allocations would be marked as "disconnected" at two minutes because of the
client's `heartbeat_grace` value of "2m". If the network outage continued for
eight hours, and the client continued to fail to heartbeat, the client would
remain in a "disconnected" state, as the first group's `lost_after`
is twelve hours. Once all groups' `lost_after` durations are
exceeded, in this case in twelve hours, the client node will be marked as "down"
and the allocation will be marked as "lost". If the client had reconnected
before twelve hours had passed, the allocations would gracefully reconnect
using the strategy defined by [`reconcile`].
Lost After is useful for edge deployments, or scenarios when
operators want zero on-client downtime due to node connectivity issues. This
setting cannot be used with `stop_on_client_after`.
```hcl
# server_config.hcl
server {
enabled = true
heartbeat_grace = "2m"
}
```
```hcl
# jobspec.nomad
group "first" {
disconnect {
lost_after = "12h"
reconcile = "best_score"
}
task "first-task" {
...
}
}
group "second" {
disconnect {
lost_after = "12h"
reconcile = "keep_original"
}
task "second-task" {
...
}
}
```
[`heartbeat_grace`]: /nomad/docs/configuration/server#heartbeat_grace
[stop-after]: /nomad/docs/job-specification/disconnect#stop-after
[lost-after]: /nomad/docs/job-specification/disconnect#lost-after
[`reconcile`]: /nomad/docs/job-specification/disconnect#reconcile
[migrates]: /nomad/docs/job-specification/migrate
[`restart`]: /nomad/docs/job-specification/restart
[reschedules]: /nomad/docs/job-specification/reschedule
[`reschedule`]: /nomad/docs/job-specification/reschedule