diff --git a/contributing/architecture-eval-triggers.md b/contributing/architecture-eval-triggers.md
index ece5c9aa5..af11f1290 100644
--- a/contributing/architecture-eval-triggers.md
+++ b/contributing/architecture-eval-triggers.md
@@ -75,7 +75,7 @@ The list below covers each trigger and what can trigger it.
* **job-scaling**: Scaling a Job will result in 1 Evaluation created, plus any
follow-up Evaluations associated with scheduling, planning, or deployments.
* **max-disconnect-timeout**: When an Allocation is in the `unknown` state for
- longer than the [`max_client_disconnect`][] window, the scheduler will create
+ longer than the [`disconnect.lost_after`][] window, the scheduler will create
1 Evaluation.
* **reconnect**: When a Node in the `disconnected` state reconnects, Nomad will
create 1 Evaluation per job with an allocation on the reconnected Node.
@@ -256,4 +256,4 @@ and eventually need to be garbage collected.
[`structs.go`]: https://github.com/hashicorp/nomad/blob/v1.4.0-beta.1/nomad/structs/structs.go#L10857-L10875
[`update`]: https://developer.hashicorp.com/nomad/docs/job-specification/update
[`restart` attempts]: https://developer.hashicorp.com/nomad/docs/job-specification/restart
-[`max_client_disconnect`]: https://developer.hashicorp.com/nomad/docs/job-specification/group#max-client-disconnect
+[`disconnect.lost_after`]: https://developer.hashicorp.com/nomad/docs/job-specification/disconnect#lost_after
diff --git a/website/content/docs/configuration/server.mdx b/website/content/docs/configuration/server.mdx
index 0ce01ad84..609a10249 100644
--- a/website/content/docs/configuration/server.mdx
+++ b/website/content/docs/configuration/server.mdx
@@ -424,7 +424,7 @@ server {
Nomad Clients periodically heartbeat to Nomad Servers to confirm they are
operating as expected. Nomad Clients which do not heartbeat in the specified
amount of time are considered `down` and their allocations are marked as `lost`
-or `disconnected` (if [`max_client_disconnect`][max_client_disconnect] is set)
+or `disconnected` (if [`disconnect.lost_after`][disconnect.lost_after] is set)
and rescheduled.
The various heartbeat related parameters allow you to tune the following
@@ -509,6 +509,6 @@ work.
[`nomad operator gossip keyring generate`]: /nomad/docs/commands/operator/gossip/keyring-generate
[search]: /nomad/docs/configuration/search
[encryption key]: /nomad/docs/operations/key-management
-[max_client_disconnect]: /nomad/docs/job-specification/group#max-client-disconnect
+[disconnect.lost_after]: /nomad/docs/job-specification/disconnect#lost_after
[herd]: https://en.wikipedia.org/wiki/Thundering_herd_problem
[wi]: /nomad/docs/concepts/workload-identity
diff --git a/website/content/docs/job-specification/disconnect.mdx b/website/content/docs/job-specification/disconnect.mdx
new file mode 100644
index 000000000..c5c7db1da
--- /dev/null
+++ b/website/content/docs/job-specification/disconnect.mdx
@@ -0,0 +1,201 @@
+---
+layout: docs
+page_title: disconnect Block - Job Specification
+description: |-
+ The "disconnect" block describes the behavior of both the Nomad server and
+ client in case of a network partition, as well as how to reconcile the workloads
+ in case of a reconnection.
+---
+
+# `disconnect` Block
+
+
+
+The `disconnect` block describes the system's behavior in case of a network
+partition. By default, without a `disconnect` block, if an allocation is on a
+node that misses heartbeats, the allocation will be marked `lost` and will be
+rescheduled.
+
+```hcl
+ job "docs" {
+ group "example" {
+ disconnect {
+ lost_after = "6h"
+ stop_after = "2h"
+ replace = false
+ reconcile = "keep_original"
+ }
+ }
+ }
+```
+
+## `disconnect` Parameters
+
+- `lost_after` `(string: "")` - Specifies a duration during which a Nomad client
+ will attempt to reconnect allocations after it fails to heartbeat
+ in the [`heartbeat_grace`][] window. It defaults to "" which is equivalent to
+ having the disconnect block be nil.
+
+ See [the example code below](#lost-after) for more details. This setting cannot
+ be used with [`stop_after`].
+
+- `replace` `(bool: false)` - Specifies if the disconnected allocation should
+ be replaced by a new one rescheduled on a different node. If false and the
+ node it is running on becomes disconnected or goes down, this allocation
+ won't be rescheduled and will be reported as `unknown` until the node reconnects,
+ or until the allocation is manually stopped:
+
+ ```plaintext
+ `nomad alloc stop `
+ ```
+
+ If true, a new alloc will be placed immediately upon the node becoming
+ disconnected.
+
+- `stop_after` `(string: "")` - Specifies a duration after which a disconnected
+ Nomad client will stop its allocations. Setting `stop_after` shorter than
+ `lost_after` and `replace = false` at the same time is not permitted and
+ will cause a validation error, because this would lead to a state where no
+ allocations can be scheduled.
+
+ The Nomad client process must be running for this to occur. This setting
+ cannot be used with [`lost_after`].
+
+- `reconcile` `(string: "best_score")` - Specifies which allocation to keep once
+ the previously disconnected node regains connectivity.
+ It has four possible values which are described below:
+
+ - `keep_original`: Always keep the original allocation. Bear in mind
+ when choosing this option that the original allocation may have crashed
+ while the client was disconnected.
+ - `keep_replacement`: Always keep the allocation that was rescheduled
+ to replace the disconnected one.
+ - `best_score`: Keep the allocation running on the node with the best
+ score.
+ - `longest_running`: Keep the allocation that has been up and running
+ continuously for the longest time.
+
+
+## `disconnect` Examples
+
+The following examples only show the `disconnect` blocks. Remember that the
+`disconnect` block is only valid in the placements listed above.
+
+### Stop After
+
+This example shows how `stop_after` interacts with
+other blocks. For the `first` group, after the default 10 second
+[`heartbeat_grace`] window expires and 90 more seconds pass, the
+server will reschedule the allocation. The client will wait 90 seconds
+before sending a stop signal (`SIGTERM`) to the `first-task`
+task. After 15 more seconds because of the task's `kill_timeout`, the
+client will send `SIGKILL`. The `second` group does not have
+`stop_after`, so the server will reschedule the
+allocation after the 10 second [`heartbeat_grace`] expires. It will
+not be stopped on the client, regardless of how long the client is out
+of touch.
+
+Note that if the servers' clocks are not closely synchronized with
+each other, the server may reschedule the group before the client has
+stopped the allocation. Operators should ensure that clock drift
+between servers is as small as possible.
+
+Note also that a group using this feature will be stopped on the
+client if the Nomad server cluster fails, since the client will be
+unable to contact any server in that case. Groups opting in to this
+feature are therefore exposed to an additional runtime dependency and
+potential point of failure.
+
+```hcl
+group "first" {
+  disconnect {
+    stop_after = "90s"
+  }
+  task "first-task" {
+    kill_timeout = "15s"
+  }
+}
+
+group "second" {
+  task "second-task" {
+    kill_timeout = "5s"
+  }
+}
+```
+
+### Lost After
+
+By default, allocations running on a client that fails to heartbeat will be
+marked "lost". When a client reconnects, its allocations, which may still be
+healthy, will restart because they have been marked "lost". This can cause
+issues with stateful tasks or tasks with long restart times.
+
+Instead, an operator may desire that these allocations reconnect without a
+restart. When `lost_after` is specified, the Nomad server will mark
+clients that fail to heartbeat as "disconnected" rather than "down", and will
+mark allocations on a disconnected client as "unknown" rather than "lost".
+These allocations may continue to run on the disconnected client. Replacement
+allocations will be scheduled according to the allocations' `replace` settings
+until the disconnected client reconnects. Once a disconnected client reconnects,
+Nomad will compare the "unknown" allocations with their replacements and will
+decide which ones to keep according to the `reconcile` setting.
+If the `lost_after` duration expires before the client reconnects,
+the allocations will be marked "lost". Clients that contain "unknown"
+allocations will transition to "disconnected" rather than "down" until the last
+`lost_after` duration has expired.
+
+In the example code below, if both of these task groups were placed on the same
+client and that client experienced a network outage, both groups'
+allocations would be marked as "disconnected" at two minutes because of the
+client's `heartbeat_grace` value of "2m". If the network outage continued for
+eight hours, and the client continued to fail to heartbeat, the client would
+remain in a "disconnected" state, as the first group's `lost_after`
+is twelve hours. Once all groups' `lost_after` durations are
+exceeded, in this case in twelve hours, the client node will be marked as "down"
+and the allocations will be marked as "lost". If the client had reconnected
+before twelve hours had passed, the allocations would gracefully reconnect
+using the strategy defined by [`reconcile`].
+
+`lost_after` is useful for edge deployments, or scenarios where
+operators want zero on-client downtime due to node connectivity issues. This
+setting cannot be used with [`stop_after`].
+
+```hcl
+# server_config.hcl
+
+server {
+ enabled = true
+ heartbeat_grace = "2m"
+}
+```
+
+```hcl
+# jobspec.nomad
+
+group "first" {
+ disconnect {
+ lost_after = "12h"
+ reconcile = "best_score"
+ }
+
+ task "first-task" {
+ ...
+ }
+}
+
+group "second" {
+ disconnect {
+ lost_after = "12h"
+ reconcile = "keep_original"
+ }
+
+ task "second-task" {
+ ...
+ }
+}
+```
+
+[`heartbeat_grace`]: /nomad/docs/configuration/server#heartbeat_grace
+[`stop_after`]: /nomad/docs/job-specification/disconnect#stop_after
+[`lost_after`]: /nomad/docs/job-specification/disconnect#lost_after
+[`reconcile`]: /nomad/docs/job-specification/disconnect#reconcile
\ No newline at end of file
diff --git a/website/content/docs/job-specification/group.mdx b/website/content/docs/job-specification/group.mdx
index 364577c08..f8338e212 100644
--- a/website/content/docs/job-specification/group.mdx
+++ b/website/content/docs/job-specification/group.mdx
@@ -48,6 +48,12 @@ job "docs" {
ephemeral disk requirements of the group. Ephemeral disks can be marked as
sticky and support live data migrations.
+- `disconnect` ([disconnect][]: nil) - Specifies the disconnect
+ strategy for the server and client for all tasks in this group in case of a
+ network partition. The tasks can be left unconnected, stopped or replaced
+ when the client disconnects. The policy for reconciliation in case the client
+ regains connectivity is also specified here.
+
- `meta` ([Meta][]: nil) - Specifies a key-value map that annotates
with user-defined metadata.
@@ -59,10 +65,6 @@ job "docs" {
requirements and configuration, including static and dynamic port allocations,
for the group.
-- `reschedule` ([Reschedule][]: nil) - Allows to specify a
- rescheduling strategy. Nomad will then attempt to schedule the task on another
- node if any of the group allocation statuses become "failed".
-
- `prevent_reschedule_on_lost` `(bool: false)` - Defines the reschedule behaviour
of an allocation when the node it is running on misses heartbeats.
When enabled, if the node it is running on becomes disconnected
@@ -82,6 +84,13 @@ job "docs" {
Setting `max_client_disconnect` and `prevent_reschedule_on_lost = true` at the
same time requires that [rescheduling is disabled entirely][`disable_rescheduling`].
+ This field has been deprecated in favour of `replace` in the [`disconnect`][disconnect] block;
+ see the [example below][disconect_migration] for more details about migrating.
+
+- `reschedule` ([Reschedule][]: nil) - Allows to specify a
+ rescheduling strategy. Nomad will then attempt to schedule the task on another
+ node if any of the group allocation statuses become "failed".
+
- `restart` ([Restart][]: nil) - Specifies the restart policy for
all tasks in this group. If omitted, a default policy exists for each job
type, which can be found in the [restart block documentation][restart].
@@ -115,12 +124,16 @@ job "docs" {
The Nomad client process must be running for this to occur. This setting
cannot be used with [`max_client_disconnect`].
+ This field has been deprecated in favour of `stop_after` in the [`disconnect`][disconnect] block.
+
- `max_client_disconnect` `(string: "")` - Specifies a duration during which a
Nomad client will attempt to reconnect allocations after it fails to heartbeat
in the [`heartbeat_grace`] window. See [the example code
below][max-client-disconnect] for more details. This setting cannot be used
with [`stop_after_client_disconnect`].
+ This field has been deprecated in favour of `lost_after` in the [`disconnect`][disconnect] block.
+
- `task` ([Task][]: <required>) - Specifies one or more tasks to run
within this group. This can be specified multiple times, to add a task as part
of the group.
@@ -285,17 +298,20 @@ healthy, will restart because they have been marked "lost". This can cause
issues with stateful tasks or tasks with long restart times.
Instead, an operator may desire that these allocations reconnect without a
-restart. When `max_client_disconnect` is specified, the Nomad server will mark
-clients that fail to heartbeat as "disconnected" rather than "down", and will
-mark allocations on a disconnected client as "unknown" rather than "lost". These
-allocations may continue to run on the disconnected client. Replacement
-allocations will be scheduled according to the allocations' reschedule policy
-until the disconnected client reconnects. Once a disconnected client reconnects,
-Nomad will compare the "unknown" allocations with their replacements and keep
-the one with the best node score. If the `max_client_disconnect` duration
-expires before the client reconnects, the allocations will be marked "lost".
+restart. When `max_client_disconnect` or `disconnect.lost_after` is specified,
+the Nomad server will mark clients that fail to heartbeat as "disconnected"
+rather than "down", and will mark allocations on a disconnected client as
+"unknown" rather than "lost". These allocations may continue to run on the
+disconnected client. Replacement allocations will be scheduled according to the
+allocations' `disconnect.replace` settings until the disconnected client
+reconnects. Once a disconnected client reconnects, Nomad will compare the "unknown"
+allocations with their replacements and will decide which ones to keep according
+to the `disconnect.reconcile` setting. If the `max_client_disconnect` or
+`disconnect.lost_after` duration expires before the client reconnects,
+the allocations will be marked "lost".
Clients that contain "unknown" allocations will transition to "disconnected"
-rather than "down" until the last `max_client_disconnect` duration has expired.
+rather than "down" until the last `max_client_disconnect` or `disconnect.lost_after`
+duration has expired.
In the example code below, if both of these task groups were placed on the same
client and that client experienced a network outage, both of the group's
@@ -371,6 +387,45 @@ If [`max_client_disconnect`](#max_client_disconnect) is set and
the node will be transition from `disconnected` to `down`. The allocation
will remain as `unknown` and won't be rescheduled.
+#### Migration to `disconnect` block
+
+The new configuration fields in the `disconnect` block work the same way as the
+ones they are replacing:
+ * `stop_after_client_disconnect` is replaced by `stop_after`
+ * `max_client_disconnect` is replaced by `lost_after`
+ * `prevent_reschedule_on_lost` is replaced by `replace`, with the value inverted
+
+To keep the same behaviour as the old configuration upon reconnection, the
+`reconcile` option should be set to `best_score`.
+
+The following example shows how to migrate from the old configuration to the new one:
+
+```hcl
+job "docs" {
+ group "example" {
+ max_client_disconnect = "6h"
+ stop_after_client_disconnect = "2h"
+ prevent_reschedule_on_lost = true
+ }
+}
+```
+The configuration above can be directly translated to:
+
+```hcl
+job "docs" {
+ group "example" {
+ disconnect {
+ lost_after = "6h"
+ stop_after = "2h"
+ replace = false
+ reconcile = "best_score"
+ }
+ }
+ }
+```
+
+All usage constraints still apply to the `disconnect` block as they did before:
+ - `stop_after` and `lost_after` can't be used together.
[task]: /nomad/docs/job-specification/task 'Nomad task Job Specification'
[job]: /nomad/docs/job-specification/job 'Nomad job Job Specification'
@@ -389,6 +444,7 @@ will remain as `unknown` and won't be rescheduled.
[migrate]: /nomad/docs/job-specification/migrate 'Nomad migrate Job Specification'
[network]: /nomad/docs/job-specification/network 'Nomad network Job Specification'
[reschedule]: /nomad/docs/job-specification/reschedule 'Nomad reschedule Job Specification'
+[disconnect]: /nomad/docs/job-specification/disconnect 'Nomad disconnect Job Specification'
[restart]: /nomad/docs/job-specification/restart 'Nomad restart Job Specification'
[service]: /nomad/docs/job-specification/service 'Nomad service Job Specification'
[service_discovery]: /nomad/docs/integrations/consul-integration#service-discovery 'Nomad Service Discovery'
@@ -396,3 +452,4 @@ will remain as `unknown` and won't be rescheduled.
[vault]: /nomad/docs/job-specification/vault 'Nomad vault Job Specification'
[volume]: /nomad/docs/job-specification/volume 'Nomad volume Job Specification'
[`consul.name`]: /nomad/docs/configuration/consul#name
+[disconect_migration]: /nomad/docs/job-specification/group#migration-to-disconnect-block
diff --git a/website/content/docs/upgrade/upgrade-specific.mdx b/website/content/docs/upgrade/upgrade-specific.mdx
index 071932051..87a039d50 100644
--- a/website/content/docs/upgrade/upgrade-specific.mdx
+++ b/website/content/docs/upgrade/upgrade-specific.mdx
@@ -14,6 +14,12 @@ their upgrades as a result of new features or changed behavior. This page is
used to document those details separately from the standard upgrade flow.
## Nomad 1.8.0
+Nomad 1.8.0 introduces a `disconnect` block that groups all the configuration
+options related to client and server behavior during a disconnection. It
+deprecates the fields `stop_after_client_disconnect`, `max_client_disconnect`,
+and `prevent_reschedule_on_lost`. This block also introduces new options for
+reconciling allocations if the client regains connectivity.
+
#### Removal of `raw_exec` option `no_cgroups`
diff --git a/website/data/docs-nav-data.json b/website/data/docs-nav-data.json
index f6d5f7585..51fa8c4c9 100644
--- a/website/data/docs-nav-data.json
+++ b/website/data/docs-nav-data.json
@@ -1703,6 +1703,10 @@
"title": "expose",
"path": "job-specification/expose"
},
+ {
+ "title": "disconnect",
+ "path": "job-specification/disconnect"
+ },
{
"title": "gateway",
"path": "job-specification/gateway"