---
layout: docs
page_title: group Block - Job Specification
description: |-
  The "group" block defines a series of tasks that should be co-located on the
  same Nomad client. Any task within a group will be placed on the same client.
---

# `group` Block

<Placement groups={['job', 'group']} />

The `group` block defines a series of tasks that should be co-located on the
same Nomad client. Any [task][] within a group will be placed on the same
client.

```hcl
job "docs" {
  group "example" {
    # ...
  }
}
```

## `group` Parameters

- `constraint` <code>([Constraint][]: nil)</code> -
  This can be provided multiple times to define additional constraints.

- `affinity` <code>([Affinity][]: nil)</code> - This can be provided
  multiple times to define preferred placement criteria.

- `spread` <code>([Spread][spread]: nil)</code> - This can be provided
  multiple times to define criteria for spreading allocations across a
  node attribute or metadata. See the
  [Nomad spread reference](/nomad/docs/job-specification/spread) for more details.

- `count` `(int)` - Specifies the number of instances that should be running
  for this group. This value must be non-negative. It defaults to the
  `min` value specified in the [`scaling`](/nomad/docs/job-specification/scaling)
  block, if present; otherwise, it defaults to `1`.

- `consul` <code>([Consul][consul]: nil)</code> - Specifies Consul configuration
  options specific to the group. These options will be applied to all tasks and
  services in the group unless a task has its own `consul` block.

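  A minimal sketch of a group-level `consul` block (the namespace value here is
  illustrative; see the [Consul][consul] block reference for the full set of
  options):

  ```hcl
  group "example" {
    consul {
      # All tasks and services in this group use this Consul namespace
      # unless a task defines its own consul block.
      namespace = "dev"
    }
  }
  ```
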
- `ephemeral_disk` <code>([EphemeralDisk][]: nil)</code> - Specifies the
  ephemeral disk requirements of the group. Ephemeral disks can be marked as
  sticky and support live data migrations.

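  For example, a sketch of a group requesting a sticky 500 MB disk whose data
  is migrated to the replacement allocation (values are illustrative; see the
  [EphemeralDisk][] reference for defaults):

  ```hcl
  group "example" {
    ephemeral_disk {
      migrate = true  # attempt to migrate data to a replacement allocation
      size    = 500   # disk size in MB
      sticky  = true  # prefer placing updated allocations on the same node
    }
  }
  ```
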
- `disconnect` <code>([disconnect][]: nil)</code> - Specifies the disconnect
  strategy for the server and client for all tasks in this group in case of a
  network partition. The tasks can be left unconnected, stopped, or replaced
  when the client disconnects. The policy for reconciliation when the client
  regains connectivity is also specified here.

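  A minimal sketch, using the field names shown in the [migration example
  below][disconect_migration]:

  ```hcl
  group "example" {
    disconnect {
      lost_after = "6h"          # mark allocations "lost" if the client stays disconnected longer
      replace    = false         # equivalent to prevent_reschedule_on_lost = true
      reconcile  = "best_score"  # reconciliation strategy applied on reconnect
    }
  }
  ```
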
- `meta` <code>([Meta][]: nil)</code> - Specifies a key-value map that annotates
  the group with user-defined metadata.

- `migrate` <code>([Migrate][]: nil)</code> - Specifies the group strategy for
  migrating off of draining nodes. Only service jobs with a count greater than
  1 support migrate blocks.

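  As an illustrative sketch (parameter values are assumptions; see the
  [Migrate][] reference for defaults):

  ```hcl
  group "example" {
    count = 3

    migrate {
      max_parallel     = 1         # migrate one allocation at a time
      health_check     = "checks"  # judge health by task and check status
      min_healthy_time = "10s"     # time a migrated allocation must stay healthy
      healthy_deadline = "5m"      # give up if not healthy within this window
    }
  }
  ```
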
- `network` <code>([Network][]: <optional>)</code> - Specifies the network
  requirements and configuration, including static and dynamic port allocations,
  for the group.

- `prevent_reschedule_on_lost` `(bool: false)` - Defines the reschedule behaviour
  of an allocation when the node it is running on misses heartbeats.
  When enabled, if the node becomes disconnected or goes down, its allocations
  won't be rescheduled and will show up as `unknown` until the node comes back
  up or the allocation is manually restarted.

  This behaviour will only modify the reschedule process on the server.
  To modify the allocation behaviour on the client, see
  [`stop_after_client_disconnect`](#stop_after_client_disconnect).

  The `unknown` allocation has to be manually stopped to run it again:

  ```plaintext
  nomad alloc stop <alloc ID>
  ```

  Setting `max_client_disconnect` and `prevent_reschedule_on_lost = true` at the
  same time requires that [rescheduling is disabled entirely][`disable_rescheduling`].

  This field was deprecated in favour of `replace` on the [`disconnect`] block;
  see the [example below][disconect_migration] for more details about migrating.

- `reschedule` <code>([Reschedule][]: nil)</code> - Allows specifying a
  rescheduling strategy. Nomad will then attempt to schedule the task on another
  node if any of the group's allocation statuses become "failed".

- `restart` <code>([Restart][]: nil)</code> - Specifies the restart policy for
  all tasks in this group. If omitted, a default policy exists for each job
  type, which can be found in the [restart block documentation][restart].

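  For instance, a sketch of a policy that restarts a failed task twice within a
  30-minute window before failing the allocation (values are illustrative; see
  the [restart block documentation][restart] for per-job-type defaults):

  ```hcl
  group "example" {
    restart {
      attempts = 2      # restarts allowed per interval
      interval = "30m"  # sliding window for counting attempts
      delay    = "15s"  # wait between restarts
      mode     = "fail" # fail the allocation once attempts are exhausted
    }
  }
  ```
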
- `service` <code>([Service][]: nil)</code> - Specifies integrations with Nomad
  or [Consul](/nomad/docs/configuration/consul) for service discovery. Nomad
  automatically registers each service when an allocation is started and
  de-registers them when the allocation is destroyed.

- `shutdown_delay` `(string: "0s")` - Specifies the duration to wait when
  stopping a group's tasks. The delay occurs between Consul or Nomad service
  deregistration and sending each task a shutdown signal. Ideally, services
  would fail health checks once they receive a shutdown signal. Alternatively,
  `shutdown_delay` may be set to give in-flight requests time to complete
  before shutting down. A group-level `shutdown_delay` will run regardless
  of whether there are any defined group [services](/nomad/docs/job-specification/group#service)
  and only applies to those services. In addition, tasks may have their own
  [`shutdown_delay`](/nomad/docs/job-specification/task#shutdown_delay) which waits
  between de-registering task services and stopping the task.

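  A sketch combining a group-level delay with a task-level one (durations are
  illustrative):

  ```hcl
  group "example" {
    # Wait 10s between deregistering group services and signaling tasks,
    # giving load balancers time to drain in-flight requests.
    shutdown_delay = "10s"

    task "api" {
      # Additionally wait 5s between deregistering this task's own
      # services and sending it the shutdown signal.
      shutdown_delay = "5s"
      # ...
    }
  }
  ```
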
- `stop_after_client_disconnect` `(string: "")` - Specifies a duration after
  which a Nomad client will stop allocations, if it cannot communicate with the
  servers. By default, a client will not stop an allocation until explicitly
  told to by a server. A client that fails to heartbeat to a server within the
  [`heartbeat_grace`] window will be marked "lost", as will any allocations
  running on it, and Nomad will schedule replacement allocations. The replaced
  allocations will normally continue to run on the non-responsive client. But
  you may want them to stop instead, for example when allocations require
  exclusive access to an external resource. When specified, the Nomad client
  will stop them after this duration.
  The Nomad client process must be running for this to occur. This setting
  cannot be used with [`max_client_disconnect`].

  This field was deprecated in favour of `stop_after` on the [`disconnect`] block.

- `max_client_disconnect` `(string: "")` - Specifies a duration during which a
  Nomad client will attempt to reconnect allocations after it fails to heartbeat
  in the [`heartbeat_grace`] window. See [the example code
  below][max-client-disconnect] for more details. This setting cannot be used
  with [`stop_after_client_disconnect`].

  This field was deprecated in favour of `lost_after` on the [`disconnect`] block.

- `task` <code>([Task][]: <required>)</code> - Specifies one or more tasks to run
  within this group. This can be specified multiple times to add a task as part
  of the group.

- `update` <code>([Update][update]: nil)</code> - Specifies the task's update
  strategy. When omitted, a default update strategy is applied.

- `vault` <code>([Vault][]: nil)</code> - Specifies the set of Vault policies
  required by all tasks in this group. Overrides a `vault` block set at the
  `job` level.

- `volume` <code>([Volume][]: nil)</code> - Specifies the volumes that are
  required by tasks within the group.

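  As a brief sketch, a group might request a host volume and mount it into a
  task with a task-level `volume_mount` block (the volume and path names here
  are illustrative; see the [Volume][] reference for details):

  ```hcl
  group "example" {
    volume "data" {
      type      = "host"
      source    = "data-dir"  # must match a host volume configured on the client
      read_only = false
    }

    task "app" {
      volume_mount {
        volume      = "data"
        destination = "/var/lib/data"  # mount point inside the task
      }
      # ...
    }
  }
  ```
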
## `group` Examples

The following examples only show the `group` blocks. Remember that the
`group` block is only valid in the placements listed above.

### Specifying Count

This example specifies that 5 instances of the tasks within this group should be
running:

```hcl
group "example" {
  count = 5
}
```

### Tasks with Constraint

This example shows two abbreviated tasks with a constraint on the group. This
will restrict the tasks to 64-bit operating systems.

```hcl
group "example" {
  constraint {
    attribute = "${attr.cpu.arch}"
    value     = "amd64"
  }

  task "cache" {
    # ...
  }

  task "server" {
    # ...
  }
}
```

### Metadata

This example shows arbitrary user-defined metadata on the group:

```hcl
group "example" {
  meta {
    my-key = "my-value"
  }
}
```

### Network

This example shows network constraints as specified in the [network][] block
which uses the `bridge` networking mode, dynamically allocates two ports, and
statically allocates one port:

```hcl
group "example" {
  network {
    mode = "bridge"

    port "http" {}
    port "https" {}

    port "lb" {
      static = "8889"
    }
  }
}
```

### Service Discovery

This example creates a service in Consul. To read more about service discovery
in Nomad, please see the [Nomad service discovery documentation][service_discovery].

```hcl
group "example" {
  network {
    port "api" {}
  }

  service {
    name = "example"
    port = "api"
    tags = ["default"]

    check {
      type     = "tcp"
      interval = "10s"
      timeout  = "2s"
    }
  }

  task "api" { ... }
}
```

### Stop After Client Disconnect

This example shows how `stop_after_client_disconnect` interacts with
other blocks. For the `first` group, after the default 10 second
[`heartbeat_grace`] window expires and 90 more seconds pass, the
server will reschedule the allocation. The client will wait 90 seconds
before sending a stop signal (`SIGTERM`) to the `first-task`
task. After 15 more seconds, because of the task's `kill_timeout`, the
client will send `SIGKILL`. The `second` group does not have
`stop_after_client_disconnect`, so the server will reschedule the
allocation after the 10 second [`heartbeat_grace`] expires. It will
not be stopped on the client, regardless of how long the client is out
of touch.

Note that if the servers' clocks are not closely synchronized with
each other, the server may reschedule the group before the client has
stopped the allocation. Operators should ensure that clock drift
between servers is as small as possible.

Note also that a group using this feature will be stopped on the
client if the Nomad server cluster fails, since the client will be
unable to contact any server in that case. Groups opting in to this
feature are therefore exposed to an additional runtime dependency and
potential point of failure.

```hcl
group "first" {
  stop_after_client_disconnect = "90s"

  task "first-task" {
    kill_timeout = "15s"
  }
}

group "second" {
  task "second-task" {
    kill_timeout = "5s"
  }
}
```

### Max Client Disconnect

`max_client_disconnect` specifies a duration during which a Nomad client will
attempt to reconnect allocations after it fails to heartbeat in the
[`heartbeat_grace`] window.

By default, allocations running on a client that fails to heartbeat will be
marked "lost". When a client reconnects, its allocations, which may still be
healthy, will restart because they have been marked "lost". This can cause
issues with stateful tasks or tasks with long restart times.

Instead, an operator may desire that these allocations reconnect without a
restart. When `max_client_disconnect` or `disconnect.lost_after` is specified,
the Nomad server will mark clients that fail to heartbeat as "disconnected"
rather than "down", and will mark allocations on a disconnected client as
"unknown" rather than "lost". These allocations may continue to run on the
disconnected client. Replacement allocations will be scheduled according to the
allocations' `disconnect.replace` settings until the disconnected client
reconnects. Once a disconnected client reconnects, Nomad will compare the "unknown"
allocations with their replacements and will decide which ones to keep according
to the `disconnect.replace` setting. If the `max_client_disconnect` or
`disconnect.lost_after` duration expires before the client reconnects,
the allocations will be marked "lost".

Clients that contain "unknown" allocations will transition to "disconnected"
rather than "down" until the last `max_client_disconnect` or `disconnect.lost_after`
duration has expired.

In the example code below, if both of these task groups were placed on the same
client and that client experienced a network outage, both groups'
allocations would be marked as "disconnected" at two minutes because of the
client's `heartbeat_grace` value of "2m". If the network outage continued for
eight hours, and the client continued to fail to heartbeat, the client would
remain in a "disconnected" state, as the first group's `max_client_disconnect`
is twelve hours. Once all groups' `max_client_disconnect` durations are
exceeded, in this case in twelve hours, the client node will be marked as "down"
and the allocations will be marked as "lost". If the client had reconnected
before twelve hours had passed, the allocations would gracefully reconnect
without a restart.

Max Client Disconnect is useful for edge deployments, or scenarios when
operators want zero on-client downtime due to node connectivity issues. This
setting cannot be used with [`stop_after_client_disconnect`].

```hcl
# server_config.hcl

server {
  enabled         = true
  heartbeat_grace = "2m"
}
```

```hcl
# jobspec.nomad

group "first" {
  max_client_disconnect = "12h"

  task "first-task" {
    ...
  }
}

group "second" {
  max_client_disconnect = "6h"

  task "second-task" {
    ...
  }
}
```

#### Max Client Disconnect and Prevent Reschedule On Lost

Setting `max_client_disconnect` and `prevent_reschedule_on_lost = true` at the
same time requires that [rescheduling is disabled entirely][`disable_rescheduling`].

```hcl
# jobspec.nomad

group "first" {
  max_client_disconnect      = "12h"
  prevent_reschedule_on_lost = true

  reschedule {
    attempts  = 0
    unlimited = false
  }

  task "first-task" {
    ...
  }
}
```

If [`max_client_disconnect`](#max_client_disconnect) is set and
`prevent_reschedule_on_lost = true`, allocations on disconnected nodes will be
`unknown` until the `max_client_disconnect` window expires, at which point
the node will transition from `disconnected` to `down`. The allocation
will remain `unknown` and won't be rescheduled.

#### Migration to `disconnect` block

The new configuration fields in the `disconnect` block work exactly the same as the
ones they are replacing:

* `stop_after_client_disconnect` is replaced by `stop_after`
* `max_client_disconnect` is replaced by `lost_after`
* `prevent_reschedule_on_lost` is replaced by `replace`

To keep the same behaviour as the old configuration upon reconnection, the
`reconcile` option should be set to `best_score`.

The following example shows how to migrate from the old configuration to the new one:

```hcl
job "docs" {
  group "example" {
    max_client_disconnect        = "6h"
    stop_after_client_disconnect = "2h"
    prevent_reschedule_on_lost   = true
  }
}
```

This configuration can be directly translated to:

```hcl
job "docs" {
  group "example" {
    disconnect {
      lost_after = "6h"
      stop_after = "2h"
      replace    = false
      reconcile  = "best_score"
    }
  }
}
```

All usage constraints still apply with the `disconnect` block as they did before:

- `stop_after` and `lost_after` can't be used together.

[task]: /nomad/docs/job-specification/task 'Nomad task Job Specification'
[job]: /nomad/docs/job-specification/job 'Nomad job Job Specification'
[constraint]: /nomad/docs/job-specification/constraint 'Nomad constraint Job Specification'
[consul]: /nomad/docs/job-specification/consul
[consul_namespace]: /nomad/docs/commands/job/run#consul-namespace
[spread]: /nomad/docs/job-specification/spread 'Nomad spread Job Specification'
[affinity]: /nomad/docs/job-specification/affinity 'Nomad affinity Job Specification'
[ephemeraldisk]: /nomad/docs/job-specification/ephemeral_disk 'Nomad ephemeral_disk Job Specification'
[`heartbeat_grace`]: /nomad/docs/configuration/server#heartbeat_grace
[`max_client_disconnect`]: /nomad/docs/job-specification/group#max_client_disconnect
[`disable_rescheduling`]: /nomad/docs/job-specification/reschedule#disabling-rescheduling
[max-client-disconnect]: /nomad/docs/job-specification/group#max-client-disconnect 'the example code below'
[`stop_after_client_disconnect`]: /nomad/docs/job-specification/group#stop_after_client_disconnect
[meta]: /nomad/docs/job-specification/meta 'Nomad meta Job Specification'
[migrate]: /nomad/docs/job-specification/migrate 'Nomad migrate Job Specification'
[network]: /nomad/docs/job-specification/network 'Nomad network Job Specification'
[reschedule]: /nomad/docs/job-specification/reschedule 'Nomad reschedule Job Specification'
[disconnect]: /nomad/docs/job-specification/disconnect 'Nomad disconnect Job Specification'
[`disconnect`]: /nomad/docs/job-specification/disconnect 'Nomad disconnect Job Specification'
[restart]: /nomad/docs/job-specification/restart 'Nomad restart Job Specification'
[service]: /nomad/docs/job-specification/service 'Nomad service Job Specification'
[service_discovery]: /nomad/docs/integrations/consul-integration#service-discovery 'Nomad Service Discovery'
[update]: /nomad/docs/job-specification/update 'Nomad update Job Specification'
[vault]: /nomad/docs/job-specification/vault 'Nomad vault Job Specification'
[volume]: /nomad/docs/job-specification/volume 'Nomad volume Job Specification'
[`consul.name`]: /nomad/docs/configuration/consul#name
[disconect_migration]: /nomad/docs/job-specification/group#migration-to-disconnect-block