---
layout: docs
page_title: Autopilot
description: |-
  Configure and use Autopilot features to help maintain your cluster. Monitor
  server health, upgrade and clean up servers, and use redundancy zones.
---

# Autopilot

Autopilot is a set of features added in Nomad 0.8 to allow for automatic,
operator-friendly management of Nomad servers. It includes cleanup of dead
servers, monitoring the state of the Raft cluster, and stable server
introduction.

To enable Autopilot features (with the exception of dead server cleanup), the
`raft_protocol` setting in the [server stanza] must be set to 3 on all servers.
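
As a minimal sketch, each server's agent configuration would then include
something like the following (the `enabled` flag is shown only to keep the
example self-contained):

```hcl
server {
  enabled       = true
  # Raft protocol version 3 is required for most Autopilot features.
  raft_protocol = 3
}
```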

## Configuration

The configuration of Autopilot is loaded by the leader from the agent's
[Autopilot settings] when initially bootstrapping the cluster:

```hcl
autopilot {
  cleanup_dead_servers      = true
  last_contact_threshold    = "200ms"
  max_trailing_logs         = 250
  server_stabilization_time = "10s"
  enable_redundancy_zones   = false
  disable_upgrade_migration = false
  enable_custom_upgrades    = false
}
```

After bootstrapping, the configuration can be viewed or modified either via the
[`operator autopilot`] subcommand or the
[`/v1/operator/autopilot/configuration`] HTTP endpoint:

View the configuration.

```shell-session
$ nomad operator autopilot get-config
CleanupDeadServers = true
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 10s
EnableRedundancyZones = false
DisableUpgradeMigration = false
EnableCustomUpgrades = false
```

Update the configuration.

```shell-session
$ nomad operator autopilot set-config -cleanup-dead-servers=false
Configuration updated!
```

View the configuration to confirm your changes.

```shell-session
$ nomad operator autopilot get-config
CleanupDeadServers = false
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 10s
EnableRedundancyZones = false
DisableUpgradeMigration = false
EnableCustomUpgrades = false
```
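
The same settings can also be changed through the HTTP endpoint directly. As a
sketch, assuming a local agent on the default port and no ACL token:

```shell-session
$ curl -X PUT localhost:4646/v1/operator/autopilot/configuration \
    --data '{"CleanupDeadServers": false, "LastContactThreshold": "200ms", "MaxTrailingLogs": 250, "ServerStabilizationTime": "10s"}'
```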

## Dead server cleanup

Dead servers will periodically be cleaned up and removed from the Raft peer set,
to prevent them from interfering with the quorum size and leader elections. This
cleanup will also happen whenever a new server is successfully added to the
cluster.

Prior to Autopilot, it would take 72 hours for dead servers to be automatically
reaped, or operators had to script a `nomad force-leave`. If another server
failure occurred, it could jeopardize the quorum, even if the failed Nomad
server had been automatically replaced. Autopilot helps prevent these kinds of
outages by quickly removing failed servers as soon as a replacement Nomad server
comes online. When servers are removed by the cleanup process, they enter the
"left" state.

This option can be disabled by running `nomad operator autopilot set-config`
with the `-cleanup-dead-servers=false` option.
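
If you do disable automatic cleanup, a failed server can still be removed
manually with `nomad server force-leave`; a sketch assuming a dead server named
`node3.global`:

```shell-session
$ nomad server force-leave node3.global
```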

## Server health checking

An internal health check runs on the leader to monitor the stability of servers.
A server is considered healthy if all of the following conditions are true:

- Its status according to Serf is 'Alive'
- The time since its last contact with the current leader is below
  `LastContactThreshold`
- Its latest Raft term matches the leader's term
- The number of Raft log entries it trails the leader by does not exceed
  `MaxTrailingLogs`
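
The `LastContactThreshold` and `MaxTrailingLogs` values come from the Autopilot
configuration and can be tuned at runtime; for example, a sketch that loosens
both limits (the values are illustrative):

```shell-session
$ nomad operator autopilot set-config -last-contact-threshold=500ms -max-trailing-logs=500
Configuration updated!
```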

The status of these health checks can be viewed through the
[`/v1/operator/autopilot/health`] HTTP endpoint, with a top level `Healthy`
field indicating the overall status of the cluster:

```shell-session
$ curl localhost:4646/v1/operator/autopilot/health
{
  "Healthy": true,
  "FailureTolerance": 0,
  "Servers": [
    {
      "ID": "e349749b-3303-3ddf-959c-b5885a0e1f6e",
      "Name": "node1",
      "Address": "127.0.0.1:4647",
      "SerfStatus": "alive",
      "Version": "0.8.0",
      "Leader": true,
      "LastContact": "0s",
      "LastTerm": 2,
      "LastIndex": 10,
      "Healthy": true,
      "Voter": true,
      "StableSince": "2017-03-28T18:28:52Z"
    },
    {
      "ID": "e35bde83-4e9c-434f-a6ef-453f44ee21ea",
      "Name": "node2",
      "Address": "127.0.0.1:4747",
      "SerfStatus": "alive",
      "Version": "0.8.0",
      "Leader": false,
      "LastContact": "35.371007ms",
      "LastTerm": 2,
      "LastIndex": 10,
      "Healthy": true,
      "Voter": false,
      "StableSince": "2017-03-28T18:29:10Z"
    }
  ]
}
```

## Stable server introduction

When a new server is added to the cluster, there is a waiting period where it
must be healthy and stable for a certain amount of time before being promoted to
a full, voting member. This can be configured via the `ServerStabilizationTime`
setting.
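
For example, to require new servers to stay healthy for 30 seconds (an
illustrative value) before promotion:

```shell-session
$ nomad operator autopilot set-config -server-stabilization-time=30s
Configuration updated!
```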

## Server read and scheduling scaling <EnterpriseAlert inline />

With the [`non_voting_server`] option, a server can be explicitly marked as a
non-voter and will never be promoted to a voting member. This can be useful when
more read scaling is needed; being a non-voter means that the server will still
have data replicated to it, but it will not be part of the quorum that the
leader must wait for before committing log entries. Non-voting servers can also
act as scheduling workers to increase scheduling throughput in large clusters.
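
A minimal sketch of the relevant server configuration for a dedicated
non-voting server:

```hcl
server {
  enabled           = true
  # Replicate data and schedule work, but never vote in Raft elections
  # (Nomad Enterprise only).
  non_voting_server = true
}
```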

## Redundancy zones <EnterpriseAlert inline />

Prior to Autopilot, it was difficult to deploy servers in a way that took
advantage of isolated failure domains such as AWS Availability Zones; users
would be forced to either have an overly-large quorum (2-3 nodes per AZ) or give
up redundancy within an AZ by deploying only one server in each.

If the `EnableRedundancyZones` setting is set, Nomad will use its value to look
for a zone in each server's specified [`redundancy_zone`] field.

Here's an example showing how to configure this:

```hcl
/* config.hcl */
server {
  redundancy_zone = "west-1"
}
```

```shell-session
$ nomad operator autopilot set-config -enable-redundancy-zones=true
Configuration updated!
```

Nomad will then use these values to partition the servers by redundancy zone,
and will aim to keep one voting server per zone. Extra servers in each zone will
stay as non-voters on standby to be promoted if the active voter leaves or dies.
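
You can check which servers ended up as voters with `nomad operator raft list-peers`;
the output below is only a sketch of what one zone with a standby server might
look like (names, IDs, and addresses are illustrative):

```shell-session
$ nomad operator raft list-peers
Node          ID                                    Address         State     Voter  RaftProtocol
node1.global  e349749b-3303-3ddf-959c-b5885a0e1f6e  127.0.0.1:4647  leader    true   3
node2.global  e35bde83-4e9c-434f-a6ef-453f44ee21ea  127.0.0.1:4747  follower  false  3
```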

## Upgrade migrations <EnterpriseAlert inline />

Autopilot in Nomad Enterprise supports upgrade migrations by default. To disable
this functionality, set `DisableUpgradeMigration` to true.
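
For example, to turn the behavior off at runtime:

```shell-session
$ nomad operator autopilot set-config -disable-upgrade-migration=true
Configuration updated!
```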

When a new server is added and Autopilot detects that its Nomad version is newer
than that of the existing servers, Autopilot will avoid promoting the new server
until enough newer-versioned servers have been added to the cluster. When the
count of new servers equals or exceeds that of the old servers, Autopilot will
begin promoting the new servers to voters and demoting the old servers. After
this is finished, the old servers can be safely removed from the cluster.

To check the Nomad version of the servers, either the [autopilot health]
endpoint or the `nomad server members` command can be used:

```shell-session
$ nomad server members
Name   Address    Port  Status  Leader  Protocol  Build  Datacenter  Region
node1  127.0.0.1  4648  alive   true    3         0.7.1  dc1         global
node2  127.0.0.1  4748  alive   false   3         0.7.1  dc1         global
node3  127.0.0.1  4848  alive   false   3         0.7.1  dc1         global
node4  127.0.0.1  4948  alive   false   3         0.8.0  dc1         global
```

### Migrations without a Nomad version change

The `EnableCustomUpgrades` field can be used to override the version information
used during a migration, so that the migration logic can be used for updating
the cluster when changing configuration.

If the `EnableCustomUpgrades` setting is set to `true`, Nomad will use its value
to look for a version in each server's specified [`upgrade_version`] tag. The
upgrade logic will follow semantic versioning and the `upgrade_version` must be
in the form of either `X`, `X.Y`, or `X.Y.Z`.
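
As a sketch, each server in the new set might be tagged with an arbitrary
version (the value `1.1.0` below is illustrative), and custom upgrades enabled
afterward with `nomad operator autopilot set-config -enable-custom-upgrades=true`:

```hcl
server {
  # Arbitrary semantic version used only by Autopilot's migration logic.
  upgrade_version = "1.1.0"
}
```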

[`/v1/operator/autopilot/configuration`]: /nomad/api-docs/operator/autopilot
[`/v1/operator/autopilot/health`]: /nomad/api-docs/operator/autopilot#read-health
[`non_voting_server`]: /nomad/docs/configuration/server#non_voting_server
[`operator autopilot`]: /nomad/commands/operator
[`redundancy_zone`]: /nomad/docs/configuration/server#redundancy_zone
[`upgrade_version`]: /nomad/docs/configuration/server#upgrade_version
[autopilot health]: /nomad/api-docs/operator/autopilot#read-health
[autopilot settings]: /nomad/docs/configuration/autopilot
[nomad enterprise]: https://www.hashicorp.com/products/nomad
[server stanza]: /nomad/docs/configuration/server
[version upgrade section]: /nomad/docs/upgrade/upgrade-specific#raft-protocol-version-compatibility