---
layout: docs
page_title: server Block in Agent Configuration
description: |-
  Configure Nomad agent server mode in the `server` block of a Nomad agent
  configuration. Server mode lets the agent participate in scheduling
  decisions, register with service discovery, and handle join failures.
  Configure bootstrapping, authoritative region, redundancy zone, data
  directory, Nomad cluster behavior, client heartbeat period, schedulers,
  garbage collection, Raft and Raft's BoltDB store, OIDC for workload
  identity, leader plan rejection, as well as job priority, job source content
  size, and tracked job versions.
---

# `server` Block in Agent Configuration

<Placement groups={['server']} />

This page provides reference information for configuring Nomad agent server mode
in the `server` block of a Nomad agent configuration. Server mode lets the agent
participate in scheduling decisions, register with service discovery, and handle
join failures. Configure bootstrapping, authoritative region, redundancy zone,
data directory, Nomad cluster behavior, client heartbeat period, schedulers,
garbage collection, Raft and Raft's BoltDB store, OIDC for workload identity,
and leader plan rejection, as well as job priority, job source content size, and
tracked job versions.

```hcl
server {
  enabled          = true
  bootstrap_expect = 3

  server_join {
    retry_join     = [ "1.1.1.1", "2.2.2.2" ]
    retry_max      = 3
    retry_interval = "15s"
  }
}
```

## `server` Parameters

- `authoritative_region` `(string: "")` - Specifies the authoritative region,
  which provides a single source of truth for global configurations such as ACL
  Policies and global ACL tokens in multi-region, federated deployments.
  Non-authoritative regions will replicate from the authoritative to act as a
  mirror. By default, the local region is assumed to be authoritative. Setting
  `authoritative_region` assumes that ACLs have been bootstrapped in the
  authoritative region. Refer to [Configure for multiple regions][] in the ACLs
  tutorial.

- `bootstrap_expect` `(int: required)` - Specifies the number of server nodes to
  wait for before bootstrapping. It is most common to use the odd-numbered
  integers `3` or `5` for this value, depending on the cluster size. A value of
  `1` does not provide any fault tolerance and is not recommended for production
  use cases.

- `data_dir` `(string: "")` - Specifies the directory to use for server-specific
  data, including the replicated log. When this parameter is empty, Nomad will
  generate the path using the [top-level `data_dir`][top_level_data_dir] suffixed
  with `server`, like `"/opt/nomad/server"`. The
  [top-level data_dir][top_level_data_dir] must be set, even when setting this
  value. This must be an absolute path. Nomad will create the directory on the
  host if it does not exist when the agent process starts.

- `enabled` `(bool: false)` - Specifies if this agent should run in server mode.
  All other server options depend on this value being set.

- `enabled_schedulers` `(array<string>: [all])` - Specifies which sub-schedulers
  this server will handle. This can be used to restrict the evaluations that
  worker threads will dequeue for processing.

- `enable_event_broker` `(bool: true)` - Specifies if this server will generate
  events for its event stream.

- `encrypt` `(string: "")` - Specifies the secret key to use for encryption of
  Nomad server's gossip network traffic. This key must be 32 bytes that are
  [RFC4648] "URL and filename safe" base64-encoded. You can generate an
  appropriately-formatted key with the [`nomad operator gossip keyring
  generate`] command. The provided key is automatically persisted to the data
  directory and loaded automatically whenever the agent is restarted. This means
  that to encrypt Nomad server's gossip protocol, this option only needs to be
  provided once on each agent's initial startup sequence. If it is provided
  after Nomad has been initialized with an encryption key, then the provided key
  is ignored and a warning will be displayed. Refer to the [encryption
  documentation][encryption] for more details on this option and its impact on
  the cluster, and to the combined sketch after this parameter list.

- `event_buffer_size` `(int: 100)` - Specifies the number of events generated
  by the server to be held in memory. Increasing this value enables new
  subscribers to have a larger look-back window when initially subscribing.
  Decreasing this value lowers the amount of memory used for the event buffer.

- `node_gc_threshold` `(string: "24h")` - Specifies how long a node must be in a
  terminal state before it is garbage collected and purged from the system. This
  is specified using a label suffix like "30s" or "1h".

- `job_gc_interval` `(string: "5m")` - Specifies the interval between the job
  garbage collections. Only jobs that have been terminal for at least
  `job_gc_threshold` will be collected. Lowering the interval will perform more
  frequent but smaller collections. Raising the interval will perform collections
  less frequently but collect more jobs at a time. Reducing this interval is
  useful if there is a large throughput of tasks, leading to a large set of
  dead jobs. This is specified using a label suffix like "30s" or "3m".

- `job_gc_threshold` `(string: "4h")` - Specifies the minimum time a job must be
  in the terminal state before it is eligible for garbage collection. This is
  specified using a label suffix like "30s" or "1h".

- `eval_gc_threshold` `(string: "1h")` - Specifies the minimum time an
  evaluation must be in the terminal state before it is eligible for garbage
  collection. This is specified using a label suffix like "30s" or "1h". Note
  that batch job evaluations are controlled via `batch_eval_gc_threshold`.

- `batch_eval_gc_threshold` `(string: "24h")` - Specifies the minimum time an
  evaluation stemming from a batch job must be in the terminal state before it is
  eligible for garbage collection. This is specified using a label suffix like
  "30s" or "1h". Note that the threshold is a necessary but insufficient condition
  for collection, and the most recent evaluation won't be garbage collected even if
  it breaches the threshold.

- `deployment_gc_threshold` `(string: "1h")` - Specifies the minimum time a
  deployment must be in the terminal state before it is eligible for garbage
  collection. This is specified using a label suffix like "30s" or "1h".

- `csi_volume_claim_gc_interval` `(string: "5m")` - Specifies the interval
  between CSI volume claim garbage collections.

- `csi_volume_claim_gc_threshold` `(string: "1h")` - Specifies the minimum age of
  a CSI volume before it is eligible to have its claims garbage collected.
  This is specified using a label suffix like "30s" or "1h".

- `csi_plugin_gc_threshold` `(string: "1h")` - Specifies the minimum age of a
  CSI plugin before it is eligible for garbage collection if not in use.
  This is specified using a label suffix like "30s" or "1h".

- `acl_token_gc_threshold` `(string: "1h")` - Specifies the minimum age of an
  expired ACL token before it is eligible for garbage collection. This is
  specified using a label suffix like "30s" or "1h".

- `default_scheduler_config` <code>(<a href="/nomad/api-docs/operator/scheduler#update-scheduler-configuration">scheduler_configuration</a>: nil)</code> - Specifies the initial default scheduler config when
  bootstrapping the cluster. This parameter is ignored once the cluster is
  bootstrapped or the value is updated through the [API
  endpoint][update-scheduler-config]. Refer to [the example
  section](#configuring-scheduler-config) for more details.

- `heartbeat_grace` `(string: "10s")` - Specifies the additional time given
  beyond the heartbeat TTL of Clients to account for network and processing
  delays and clock skew. This is specified using a label suffix like "30s" or
  "1h". Refer to the [Client Heartbeats](#client-heartbeats) section for
  details.

- `min_heartbeat_ttl` `(string: "10s")` - Specifies the minimum time between
  Client heartbeats. This is used as a floor to prevent excessive updates. This
  is specified using a label suffix like "30s" or "1h". Refer to the [Client
  Heartbeats](#client-heartbeats) section for details.

- `failover_heartbeat_ttl` `(string: "5m")` - The time by which all Clients must
  heartbeat after a Server leader election. This is specified using a label
  suffix like "30s" or "1h". Refer to the [Client
  Heartbeats](#client-heartbeats) section for details.

- `max_heartbeats_per_second` `(float: 50.0)` - Specifies the maximum target
  rate of heartbeats being processed per second. This allows the TTL to be
  increased to meet the target rate. Refer to the [Client
  Heartbeats](#client-heartbeats) section for details.

- `non_voting_server` `(bool: false)` - (Enterprise-only) Specifies whether
  this server will act as a non-voting member of the cluster to help provide
  read scalability.

- `num_schedulers` `(int: [num-cores])` - Specifies the number of parallel
  scheduler threads to run. This can be as many as one per core, or `0` to
  disallow this server from making any scheduling decisions. This defaults to
  the number of CPU cores.

- `license_path` `(string: "")` - Specifies the path to load a Nomad Enterprise
  license from. This must be an absolute path
  (ex. `/etc/nomad.d/license.hclic`). The license can also be set by setting
  `NOMAD_LICENSE_PATH` or by setting `NOMAD_LICENSE` as the entire license
  value. `license_path` has the highest precedence, followed by `NOMAD_LICENSE`
  and then `NOMAD_LICENSE_PATH`.

- `plan_rejection_tracker` <code>([PlanRejectionTracker](#plan_rejection_tracker-parameters))</code> -
  Configuration for the plan rejection tracker that the Nomad leader uses to
  track the history of plan rejections.

- `raft_boltdb` - This is a nested object that allows configuring options for
  Raft's BoltDB based log store.

  - `no_freelist_sync` - Setting this to `true` will disable syncing the BoltDB
    freelist to disk within the `raft.db` file. Not syncing the freelist to disk
    will reduce disk IO required for write operations at the expense of longer
    server startup times.

- `raft_protocol` `(int: 3)` - Specifies the Raft protocol version to use when
  communicating with other Nomad servers. This affects available Autopilot
  features and is typically not required as the agent internally knows the
  latest version, but may be useful in some upgrade scenarios. Must be `3` in
  Nomad v1.4 or later.

- `raft_multiplier` `(int: 1)` - An integer multiplier used by Nomad servers to
  scale key Raft timing parameters. Omitting this value or setting it to 0 uses
  the default timing described in the following paragraph. Lower values are used
  to tighten timing and increase sensitivity while higher values relax timings
  and reduce sensitivity. Tuning this affects the time it takes Nomad to detect
  leader failures and to perform leader elections, at the expense of requiring
  more network and CPU resources for better performance. The maximum allowed
  value is 10.

  By default, Nomad will use the highest-performance timing, currently equivalent
  to setting this to a value of 1. Increasing the timings makes leader election
  less likely during periods of networking issues or resource starvation. Since
  leader elections pause Nomad's normal work, it may be beneficial for slow or
  unreliable networks to wait longer before electing a new leader. The trade-off
  when raising this value is that during network partitions or other events
  (server crash) where a leader is lost, Nomad will not elect a new leader for
  a longer period of time than the default. The [`nomad.nomad.leader.barrier` and
  `nomad.raft.leader.lastContact` metrics](/nomad/docs/operations/metrics-reference) are a good
  indicator of how often leader elections occur and of Raft latency.

- `raft_snapshot_threshold` `(int: 8192)` - Specifies the minimum number of
  Raft logs to be written to disk before the node is allowed to take a snapshot.
  This reduces the frequency and impact of creating snapshots. During node
  startup, Raft restores the latest snapshot and then applies the individual
  logs to catch the node up to the last known state. This value can be tuned
  during operation by a hot configuration reload.

- `raft_snapshot_interval` `(string: "120s")` - Specifies the minimum time between
  checks if Raft should perform a snapshot. The Raft library randomly staggers
  between this value and twice this value to avoid the entire cluster performing
  a snapshot at once. Nodes are eligible to snapshot once they have exceeded the
  `raft_snapshot_threshold`. This value can be tuned during operation by a hot
  configuration reload.

- `raft_trailing_logs` `(int: 10240)` - Specifies how many logs are retained
  after a snapshot. These logs are used so that Raft can quickly replay logs on
  a follower instead of being forced to send an entire snapshot. This value can
  be tuned during operation by a hot configuration reload.

- `redundancy_zone` `(string: "")` - (Enterprise-only) Specifies the redundancy
  zone that this server will be a part of for Autopilot management. For more
  information, refer to the [Autopilot Guide](/nomad/tutorials/manage-clusters/autopilot).

- `rejoin_after_leave` `(bool: false)` - Specifies if Nomad will ignore a
  previous leave and attempt to rejoin the cluster when starting. By default,
  Nomad treats leave as a permanent intent and does not attempt to join the
  cluster again when starting. This flag allows the previous state to be used to
  rejoin the cluster.

- `root_key_gc_interval` `(string: "10m")` - Specifies the interval between
  [encryption key][] metadata garbage collections.

- `root_key_gc_threshold` `(string: "1h")` - Specifies the minimum time after
  the `root_key_rotation_threshold` has passed that an [encryption key][] must
  exist before it can be eligible for garbage collection.

- `root_key_rotation_threshold` `(string: "720h")` - Specifies the lifetime of
  an active [encryption key][] before it is automatically rotated on the next
  garbage collection interval. Nomad will prepublish the replacement key at half
  the `root_key_rotation_threshold` time so external consumers of Workload
  Identity have time to obtain the new public key from the [JWKS URL][] before
  it is used.

- `server_join` <code>([server_join][server-join]: nil)</code> - Specifies
  how the Nomad server will connect to other Nomad servers. The `retry_join`
  fields may directly specify the server address or use go-discover syntax for
  auto-discovery. Refer to the [server_join documentation][server-join] for more detail.

- `start_timeout` `(string: "30s")` - A timeout applied to the server setup and
  startup processes. These processes (keyring decryption) are expected to
  complete before the server is considered healthy, and if the timeout is
  reached before they complete, the server will exit. Without this timeout, the
  server could hang indefinitely waiting for these processes.

- `upgrade_version` `(string: "")` - A custom version of the format X.Y.Z to use
  in place of the Nomad version when custom upgrades are enabled in Autopilot.
  For more information, refer to the [Autopilot Guide](/nomad/tutorials/manage-clusters/autopilot).

- `search` <code>([search][search]: nil)</code> - Specifies configuration parameters
  for the Nomad search API.

- `job_max_priority` `(int: 100)` - Specifies the maximum priority that can be assigned to a job.
  A valid value must be between `100` and `32766`.

- `job_default_priority` `(int: 50)` - Specifies the default priority assigned to a job.
  A valid value must be between `50` and `job_max_priority`.

- `job_max_source_size` `(string: "1M")` - Specifies the size limit of the associated
  job source content when registering a job. Note this is not a limit on the actual
  size of a job. If the limit is exceeded, the original source is simply discarded
  and no error is returned from the job API.

- `job_tracked_versions` `(int: 6)` - Specifies the number of historic job versions that
  are kept.

- `oidc_issuer` `(string: "")` - Specifies the Issuer URL for [Workload
  Identity][wi] JWTs. For example, `"https://nomad.example.com"`. If set, the
  `/.well-known/openid-configuration` HTTP endpoint is enabled for third
  parties to discover Nomad's OIDC configuration. Once set, `oidc_issuer`
  *cannot be changed* without invalidating Workload Identities that have the
  old issuer claim. For this reason it is suggested to set `oidc_issuer` to a
  proxy in front of Nomad's HTTP API to ensure a stable DNS name can be used
  instead of a potentially ephemeral Nomad server IP.

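The following sketch combines several of the parameters above in one `server`
block. All values are illustrative assumptions rather than recommendations: the
gossip key is a placeholder, and the timeout and priority values simply restate
the defaults discussed above.

```hcl
server {
  enabled          = true
  bootstrap_expect = 3

  # Placeholder value; generate a real 32-byte base64 key with
  # `nomad operator gossip keyring generate`.
  encrypt = "M8l9u6HJqnEw1g5B7kCt0RxQyZ2sVdJh4TfP3aNbXo0="

  # Fail startup instead of hanging if setup work such as keyring
  # decryption stalls (restates the default of "30s").
  start_timeout = "30s"

  # Trade longer server startup for less write IO on the Raft log.
  raft_boltdb {
    no_freelist_sync = true
  }

  # Job priority limits; these restate the defaults.
  job_max_priority     = 100
  job_default_priority = 50
}
```
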
### Deprecated Parameters

- `retry_join` `(array<string>: [])` - Specifies a list of server addresses to
  retry joining if the first attempt fails. This is similar to
  [`start_join`](#start_join), but is only invoked if the initial join attempt
  fails. The list of addresses will be tried in the order specified, until one
  succeeds. After one succeeds, no further addresses will be contacted. This is
  useful for cases where we know the address will become available eventually.
  Use `retry_join` with an array as a replacement for `start_join`; **do not use
  both options**. Refer to the [server_join][server-join]
  section for more information on the format of the string. This field is
  deprecated in favor of the [server_join block][server-join].

- `retry_interval` `(string: "30s")` - Specifies the time to wait between retry
  join attempts. This field is deprecated in favor of the [server_join
  block][server-join].

- `retry_max` `(int: 0)` - Specifies the maximum number of join attempts to be
  made before exiting with a return code of 1. By default, this is set to 0
  which is interpreted as infinite retries. This field is deprecated in favor of
  the [server_join block][server-join].

- `start_join` `(array<string>: [])` - Specifies a list of server addresses to
  join on startup. If Nomad is unable to join with any of the specified
  addresses, agent startup will fail. Refer to the [server address
  format](/nomad/docs/configuration/server_join#server-address-format)
  section for more information on the format of the string. This field is
  deprecated in favor of the [server_join block][server-join].

### `plan_rejection_tracker` Parameters

The leader plan rejection tracker can be adjusted to prevent evaluations from
getting stuck due to always being scheduled to a client that may have an
unexpected issue. Refer to [Monitoring Nomad][monitoring_nomad_progress] for
more details.

- `enabled` `(bool: false)` - Specifies if plan rejections should be tracked.

- `node_threshold` `(int: 100)` - The number of plan rejections for a node
  within the `node_window` to trigger a client to be set as ineligible.

- `node_window` `(string: "5m")` - The time window for when plan rejections for
  a node should be considered.

If you observe too many false positives (clients being marked as ineligible
even though they don't present any problem), you may want to increase
`node_threshold`.

If you notice jobs not being scheduled due to plan rejections for the same
`node_id` while the client is not being set as ineligible, you can try
increasing the `node_window` so more historical rejections are taken into
account.

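As a sketch of this tuning (the numbers are illustrative assumptions, not
recommendations), a tracker that tolerates more rejections over a longer window
than the defaults could look like this:

```hcl
server {
  enabled = true

  plan_rejection_tracker {
    enabled        = true
    node_threshold = 150   # default is 100
    node_window    = "10m" # default is "5m"
  }
}
```
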
## `server` Examples

### Common Setup

This example shows a common Nomad agent `server` configuration block. The two
IP addresses could also be DNS names, and should point to the other Nomad
servers in the cluster.

```hcl
server {
  enabled          = true
  bootstrap_expect = 3

  server_join {
    retry_join     = [ "1.1.1.1", "2.2.2.2" ]
    retry_max      = 3
    retry_interval = "15s"
  }
}
```

### Configuring Data Directory

This example shows configuring a custom data directory for the server data.

```hcl
server {
  data_dir = "/opt/nomad/server"
}
```

### Automatic Bootstrapping

The Nomad servers can automatically bootstrap if Consul is configured. For a
more detailed explanation, refer to the
[automatic Nomad bootstrapping documentation](/nomad/tutorials/manage-clusters/clustering).

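As a minimal sketch, assuming a Consul agent is reachable at the default local
address, no join configuration is required in the `server` block itself:

```hcl
server {
  enabled          = true
  bootstrap_expect = 3
}

# Assumption: a local Consul agent. Nomad registers itself with Consul and
# discovers the other servers automatically.
consul {
  address = "127.0.0.1:8500"
}
```
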
### Restricting Schedulers

This example shows restricting the schedulers that are enabled as well as the
maximum number of cores to utilize when participating in scheduling decisions:

```hcl
server {
  enabled            = true
  enabled_schedulers = ["batch", "service"]
  num_schedulers     = 7
}
```

### Bootstrapping with a Custom Scheduler Config ((#configuring-scheduler-config))

While [bootstrapping a cluster], you can use the `default_scheduler_config` block
to prime the cluster with a [`SchedulerConfig`][update-scheduler-config]. The
scheduler configuration determines which scheduling algorithm is used (spread
scheduling or binpacking) and which job types are eligible for preemption.

~> **Warning:** Once the cluster is bootstrapped, you must configure this using
   the [update scheduler configuration][update-scheduler-config] API. This
   option is only consulted during bootstrap.

The structure matches the [Update Scheduler Config][update-scheduler-config] API
endpoint, which you should consult for canonical documentation. However, the
attribute names must be adapted to HCL syntax by using snake case rather than
camel case.

This example shows configuring spread scheduling and enabling preemption for all
job-type schedulers.

```hcl
server {
  default_scheduler_config {
    scheduler_algorithm             = "spread"
    memory_oversubscription_enabled = true
    reject_job_registration         = false
    pause_eval_broker               = false

    preemption_config {
      batch_scheduler_enabled    = true
      system_scheduler_enabled   = true
      service_scheduler_enabled  = true
      sysbatch_scheduler_enabled = true
    }
  }
}
```

## Client Heartbeats ((#client-heartbeats))

~> This is an advanced topic. It is most beneficial to clusters over 1,000
   nodes or with unreliable networks or nodes (e.g., some edge deployments).

Nomad Clients periodically heartbeat to Nomad Servers to confirm they are
operating as expected. Nomad Clients which do not heartbeat in the specified
amount of time are considered `down` and their allocations are marked as `lost`
or `disconnected` (if [`disconnect.lost_after`][disconnect.lost_after] is set)
and replaced.

The various heartbeat related parameters allow you to tune the following
tradeoffs:

- The longer the heartbeat period, the longer Nomad takes to replace a `down`
  Client's workload.
- The shorter the heartbeat period, the more likely transient network issues,
  leader elections, and other temporary issues could cause a perfectly
  functional Client and its workloads to be marked as `down` and the work
  replaced.

While Nomad Clients can connect to any Server, all heartbeats are forwarded to
the leader for processing. Since this heartbeat processing consumes resources,
Nomad adjusts the rate at which Clients heartbeat based on cluster size. The
goal is to try to keep the resource cost of processing heartbeats constant
regardless of cluster size.

The base formula for determining how often a Client must heartbeat is:

```
<number of Clients> / <max_heartbeats_per_second>
```

Other factors modify this base TTL:

- A random factor up to `2x` is added to the base TTL to prevent the
  [thundering herd][herd] problem where a large number of clients attempt to
  heartbeat at exactly the same time.
- [`min_heartbeat_ttl`](#min_heartbeat_ttl) is used as the lower bound to
  prevent small clusters from sending excessive heartbeats.
- [`heartbeat_grace`](#heartbeat_grace) is the amount of _extra_ time the
  leader will wait for a heartbeat beyond the base heartbeat.
- After a leader election all Clients are given up to `failover_heartbeat_ttl`
  to successfully heartbeat. This gives Clients time to discover a functioning
  Server in case they were directly connected to a leader that crashed.

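Plugging the defaults into this formula for a 1,000-Client cluster gives a 20
second base TTL, which the random factor stretches into the 20s - 40s range
shown in the table below:

```
1000 / 50 = 20s base TTL
20s * 2   = 40s upper bound after the random factor
```
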
For example, given the default values for heartbeat parameters, different-sized
clusters will use the following TTLs for the heartbeats. Note that the
`Server TTL` simply adds the `heartbeat_grace` parameter to the TTL Clients are
given.

| Clients | Client TTL  | Server TTL  | Safe after elections |
| ------- | ----------- | ----------- | -------------------- |
| 10      | 10s - 20s   | 20s - 30s   | yes                  |
| 100     | 10s - 20s   | 20s - 30s   | yes                  |
| 1000    | 20s - 40s   | 30s - 50s   | yes                  |
| 5000    | 100s - 200s | 110s - 210s | yes                  |
| 10000   | 200s - 400s | 210s - 410s | NO                   |

Regardless of size, all clients will have a Server TTL of
`failover_heartbeat_ttl` after a leader election. It should always be larger
than the maximum Client TTL for your cluster size in order to prevent marking
live Clients as `down`.

For clusters over 5000 Clients, you should increase `failover_heartbeat_ttl`
using the following formula:

```
(2 * (<number of Clients> / <max_heartbeats_per_second>)) + (10 * <min_heartbeat_ttl>)

# For example with 6000 Clients:
(2 * (6000 / 50)) + (10 * 10) = 340s (5m40s)
```

This ensures Clients have some additional time to fail over even if they were
told to heartbeat after the maximum interval.

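Expressed as configuration, the worked example above (a roughly 6,000-Client
cluster with default heartbeat settings) might look like this sketch:

```hcl
server {
  enabled = true

  # (2 * (6000 / 50)) + (10 * 10) = 340s
  failover_heartbeat_ttl = "340s"
}
```
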
The actual value used should take into consideration how much tolerance your
system has for a delay in noticing crashed Clients. For example, a
`failover_heartbeat_ttl` of 30 minutes may give even the slowest clients in the
largest clusters ample time to heartbeat after an election. However, if the
election was due to a datacenter-wide failure affecting Clients, it will be 30
minutes before Nomad recognizes that they are `down` and replaces their work.

[encryption]: /nomad/tutorials/transport-security/security-gossip-encryption 'Nomad Encryption Overview'
[server-join]: /nomad/docs/configuration/server_join 'Server Join'
[update-scheduler-config]: /nomad/api-docs/operator/scheduler#update-scheduler-configuration 'Scheduler Config'
[bootstrapping a cluster]: /nomad/docs/faq#bootstrapping
[rfc4648]: https://tools.ietf.org/html/rfc4648#section-5
[monitoring_nomad_progress]: /nomad/docs/operations/monitoring-nomad#progress
[`nomad operator gossip keyring generate`]: /nomad/docs/commands/operator/gossip/keyring-generate
[search]: /nomad/docs/configuration/search
[encryption key]: /nomad/docs/operations/key-management
[disconnect.lost_after]: /nomad/docs/job-specification/disconnect#lost_after
[herd]: https://en.wikipedia.org/wiki/Thundering_herd_problem
[wi]: /nomad/docs/concepts/workload-identity
[Configure for multiple regions]: /nomad/tutorials/access-control/access-control-bootstrap#configure-for-multiple-regions
[top_level_data_dir]: /nomad/docs/configuration#data_dir
[JWKS URL]: /nomad/api-docs/operator/keyring#list-active-public-keys