---
layout: docs
page_title: Federated cluster failure scenarios
description: |-
  Review failure scenarios in multi-region federated cluster deployments. Learn which Nomad features continue to work under federated and authoritative region failures.
---

# Federated cluster failure scenarios

When you run Nomad in federated mode, the impact of a failure depends on whether the affected
region is the authoritative region and on the failure mode. In a soft failure, the region's
servers have lost quorum but the Nomad processes are still up, running, and reachable. In a hard
failure, the region's servers are completely unreachable, as if the underlying hardware had been
terminated (cloud) or powered off (on-prem).

The scenarios are based on a Nomad deployment running three federated regions:
* `asia-south-1`
* `europe-west-1` - authoritative region
* `us-east-1`

## Federated region failure: soft
In this situation, the region `asia-south-1` has lost leadership, but its servers are up and
reachable.

All server logs in the impacted region have entries such as this example.
```console
[ERROR] nomad/worker.go:504: worker: failed to dequeue evaluation: worker_id=d19e6bb5-5ec9-8f75-9caf-47e2513fe28d error="No cluster leader"
```

✅ Request forwarding continues to work between all federated regions that are running with
leadership.

🟨 API requests to the impacted region, whether sent directly or through request forwarding,
fail unless they use the `stale=true` flag.

✅ Creation and deletion of replicated objects, such as namespaces, are written to the
authoritative region.

✅ Federated regions with leadership continue to replicate objects from the authoritative
region.

✅ Creation of local ACL tokens continues to work for all regions with leadership.

✅ Jobs **without** the [`multiregion`][] block deploy to all regions with leadership.

❌ Jobs **with** the [`multiregion`][] block defined fail to deploy.

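
The `stale=true` behavior noted for soft failures can be illustrated with a small helper that builds a read URL for the HTTP API. This is a minimal sketch, not an official client: it assumes the Nomad HTTP API's `region` and `stale` query parameters, and the helper name `nomad_read_url`, the agent address, and the `/v1/jobs` path are illustrative choices rather than anything this page prescribes.

```python
from urllib.parse import urlencode


def nomad_read_url(addr, path, region=None, stale=False):
    """Build a read URL for the Nomad HTTP API (hypothetical helper).

    region: target region for request forwarding (assumed `region` parameter).
    stale:  allow a server without leadership to answer the read.
    """
    params = {}
    if region:
        params["region"] = region
    if stale:
        params["stale"] = "true"
    base = addr.rstrip("/") + "/" + path.lstrip("/")
    query = urlencode(params)
    return f"{base}?{query}" if query else base


# Read jobs from the leaderless region, accepting possibly stale data.
url = nomad_read_url(
    "http://127.0.0.1:4646", "/v1/jobs", region="asia-south-1", stale=True
)
print(url)  # http://127.0.0.1:4646/v1/jobs?region=asia-south-1&stale=true
```

A stale read may return data that lags behind the last committed state, so it suits inspection during an outage rather than decisions that need the latest writes.
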
## Federated region failure: hard
In this situation, the region `asia-south-1` has gone down. When this happens, the Nomad server
logs in the other regions have entries similar to this example.
```console
[DEBUG] go-hclog@v1.6.3/stdlog.go:58: nomad: memberlist: Failed UDP ping: asia-south-1-server-1.asia-south-1 (timeout reached)
[INFO] go-hclog@v1.6.3/stdlog.go:60: nomad: memberlist: Suspect asia-south-1-server-1.asia-south-1 has failed, no acks received
[DEBUG] go-hclog@v1.6.3/stdlog.go:58: nomad: memberlist: Initiating push/pull sync with: us-east-1-server-1.us-east-1 192.168.1.193:9002
[DEBUG] go-hclog@v1.6.3/stdlog.go:58: nomad: memberlist: Failed UDP ping: asia-south-1-server-1.asia-south-1 (timeout reached)
[INFO] go-hclog@v1.6.3/stdlog.go:60: nomad: memberlist: Suspect asia-south-1-server-1.asia-south-1 has failed, no acks received
```

✅ Request forwarding continues to work between all federated regions that are running with
leadership.

❌ API requests to the impacted region, whether sent directly or through request forwarding,
fail.

✅ Creation and deletion of replicated objects, such as namespaces, are written to the
authoritative region.

✅ Federated regions with leadership continue to replicate objects from the authoritative
region.

✅ Creation of local ACL tokens continues to work for all regions with leadership.

✅ Jobs **without** the [`multiregion`][] block deploy to all regions with leadership.

❌ Jobs **with** the [`multiregion`][] block defined fail to deploy.

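
When triaging a hard failure, it can help to pull the suspected servers out of the memberlist entries shown in this section. The following is a rough sketch that assumes log lines shaped like the sample above, with members named `<server>.<region>`; the function name `suspected_servers` and the regular expression are illustrative and not part of Nomad.

```python
import re

# Matches entries such as:
#   "memberlist: Suspect asia-south-1-server-1.asia-south-1 has failed, ..."
SUSPECT = re.compile(
    r"memberlist: Suspect (?P<server>\S+)\.(?P<region>[^.\s]+) has failed"
)


def suspected_servers(log_lines):
    """Group servers flagged as suspect by region (hypothetical helper)."""
    found = {}
    for line in log_lines:
        match = SUSPECT.search(line)
        if match:
            found.setdefault(match["region"], set()).add(match["server"])
    return {region: sorted(servers) for region, servers in found.items()}


sample = [
    "[DEBUG] go-hclog@v1.6.3/stdlog.go:58: nomad: memberlist: Failed UDP ping: asia-south-1-server-1.asia-south-1 (timeout reached)",
    "[INFO] go-hclog@v1.6.3/stdlog.go:60: nomad: memberlist: Suspect asia-south-1-server-1.asia-south-1 has failed, no acks received",
    "[DEBUG] go-hclog@v1.6.3/stdlog.go:58: nomad: memberlist: Initiating push/pull sync with: us-east-1-server-1.us-east-1 192.168.1.193:9002",
]
result = suspected_servers(sample)
print(result)  # {'asia-south-1': ['asia-south-1-server-1']}
```

A suspect entry on its own does not prove that a region is down. Repeated entries for every server in one region, combined with failing API requests to that region, are a stronger signal.
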
## Authoritative region failure: soft
In this situation, the region `europe-west-1` has lost leadership, but its servers are up and
reachable.

The server logs in the authoritative region have entries such as this example.
```console
[ERROR] nomad/worker.go:504: worker: failed to dequeue evaluation: worker_id=68b3abe2-5e16-8f04-be5a-f76aebb0e59e error="No cluster leader"
```

✅ Request forwarding continues to work between all federated regions that are running with
leadership.

🟨 API requests to the impacted region, whether sent directly or through request forwarding,
fail unless they use the `stale=true` flag.

❌ Creation and deletion of replicated objects, such as namespaces, fail.

❌ Federated regions can still read data to replicate because they use the stale flag, but no
writes can occur in the authoritative region, as described previously.

✅ Creation of local ACL tokens continues to work for all federated regions with leadership.

✅ Jobs **without** the [`multiregion`][] block deploy to all federated regions with
leadership.

❌ Jobs **with** the [`multiregion`][] block defined fail to deploy.

## Authoritative region failure: hard
In this situation, the region `europe-west-1` has gone down. When this happens, the Nomad server
leader logs in the other regions have entries similar to this example.
```console
[ERROR] nomad/leader.go:544: nomad: failed to fetch namespaces from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader.go:1767: nomad: failed to fetch policies from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader.go:2498: nomad: failed to fetch ACL binding rules from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader_ent.go:226: nomad: failed to fetch quota specifications from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader.go:703: nomad: failed to fetch node pools from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader.go:1909: nomad: failed to fetch tokens from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader.go:2083: nomad: failed to fetch ACL Roles from authoritative region: error="rpc error: EOF"
[DEBUG] nomad/leader_ent.go:84: nomad: failed to fetch policies from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader.go:2292: nomad: failed to fetch ACL auth-methods from authoritative region: error="rpc error: EOF"
[DEBUG] go-hclog@v1.6.3/stdlog.go:58: nomad: memberlist: Failed UDP ping: europe-west-1-server-1.europe-west-1 (timeout reached)
[INFO] go-hclog@v1.6.3/stdlog.go:60: nomad: memberlist: Suspect europe-west-1-server-1.europe-west-1 has failed, no acks received
[DEBUG] go-hclog@v1.6.3/stdlog.go:58: nomad: memberlist: Failed UDP ping: europe-west-1-server-1.europe-west-1 (timeout reached)
```

✅ Request forwarding continues to work between all federated regions that are running with
leadership.

❌ API requests to the impacted region, whether sent directly or through request forwarding,
fail.

❌ Creation and deletion of replicated objects, such as namespaces, fail.

❌ Federated regions with leadership cannot replicate the objects listed in the logs.

✅ Creation of local ACL tokens continues to work for all regions with leadership.

✅ Jobs **without** the [`multiregion`][] block deploy to regions with leadership.

❌ Jobs **with** the [`multiregion`][] block defined fail to deploy.

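
The leader log entries in this section name each object type that can no longer be replicated. As an illustration only, the sketch below extracts those types from log lines, assuming the message wording stays as shown in the sample; the function name `stalled_replication` and the pattern are hypothetical and not part of Nomad.

```python
import re

# Matches entries such as:
#   "nomad: failed to fetch namespaces from authoritative region: ..."
FETCH_FAILED = re.compile(
    r"failed to fetch (?P<kind>.+?) from authoritative region"
)


def stalled_replication(log_lines):
    """List object types that failed to replicate, in first-seen order."""
    kinds = []
    for line in log_lines:
        match = FETCH_FAILED.search(line)
        if match and match["kind"] not in kinds:
            kinds.append(match["kind"])
    return kinds


sample = [
    '[ERROR] nomad/leader.go:544: nomad: failed to fetch namespaces from authoritative region: error="rpc error: EOF"',
    '[ERROR] nomad/leader.go:1767: nomad: failed to fetch policies from authoritative region: error="rpc error: EOF"',
    '[ERROR] nomad/leader.go:2498: nomad: failed to fetch ACL binding rules from authoritative region: error="rpc error: EOF"',
    '[DEBUG] nomad/leader_ent.go:84: nomad: failed to fetch policies from authoritative region: error="rpc error: EOF"',
]
kinds = stalled_replication(sample)
print(kinds)  # ['namespaces', 'policies', 'ACL binding rules']
```

Duplicate entries, such as the repeated `policies` line, collapse into a single item so that the result reads as a checklist of affected object types.
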
[`multiregion`]: /nomad/docs/job-specification/multiregion