---
layout: docs
page_title: Federated cluster failure scenarios
description: |-
  Review failure scenarios in multi-region federated cluster deployments. Learn
  which Nomad features continue to work under federated and authoritative region
  failures.
---

# Federated cluster failure scenarios
When you run Nomad in federated mode, the impact of a failure depends on whether the impacted
region is the authoritative region and on the failure mode. In a soft failure, the region's
servers have lost quorum, but the Nomad processes are still up, running, and reachable. In a hard
failure, the region's servers are completely unreachable, as if the underlying hardware had been
terminated (cloud) or powered off (on-prem).

The scenarios are based on a Nomad deployment running three federated regions:

* `asia-south-1`
* `europe-west-1` - authoritative region
* `us-east-1`
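
As a quick point of reference for the scenarios below, `nomad server members` lists every Nomad
server in the shared gossip pool across all federated regions, and the `-region` flag lets you aim
a command or API call at a specific region. The hostnames in the log samples on this page follow
this three-region layout.

```shell-session
$ # List the servers of all three federated regions from any reachable server.
$ nomad server members

$ # Target a specific region explicitly; requests are forwarded as needed.
$ nomad job status -region=asia-south-1
```
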
## Federated region failure: soft
In this situation, the region `asia-south-1` has lost leadership, but the servers are still up and
reachable.
All server logs in the impacted region have entries such as this example.
```console
[ERROR] nomad/worker.go:504: worker: failed to dequeue evaluation: worker_id=d19e6bb5-5ec9-8f75-9caf-47e2513fe28d error="No cluster leader"
```

✅ Request forwarding continues to work between all federated regions that are running with
leadership.

🟨 API requests to the impacted region, whether sent directly or via request forwarding, fail
unless they use the `stale=true` flag, as shown in the example after this list.

✅ Creation and deletion of replicated objects, such as namespaces, are written to the
authoritative region.

✅ Any federated region with leadership is able to continue replicating all objects detailed
previously.

✅ Creation of local ACL tokens continues to work for all regions with leadership.

✅ Jobs **without** the [`multiregion`][] block deploy to all regions with leadership.

❌ Jobs **with** the [`multiregion`][] block defined fail to deploy.

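The following is a minimal sketch of the stale-read behavior. The hostname is a placeholder; the
`stale=true` query parameter on the HTTP API and the `-stale` flag on
`nomad operator raft list-peers` allow a non-leader server to answer the read.

```shell-session
$ # Reads served by the impacted region fail while it has no leader...
$ curl "http://asia-south-1-server-1.example.com:4646/v1/jobs?region=asia-south-1"
No cluster leader

$ # ...unless the request allows a stale read from a follower.
$ curl "http://asia-south-1-server-1.example.com:4646/v1/jobs?region=asia-south-1&stale=true"

$ # The equivalent CLI pattern for inspecting the impacted region's Raft peers.
$ nomad operator raft list-peers -region=asia-south-1 -stale=true
```
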
## Federated region failure: hard
In this situation the region `asia-south-1` has gone down. When this happens, the Nomad server logs
for the other regions have log entries similar to this example.
```console
[DEBUG] go-hclog@v1.6.3/stdlog.go:58: nomad: memberlist: Failed UDP ping: asia-south-1-server-1.asia-south-1 (timeout reached)
[INFO] go-hclog@v1.6.3/stdlog.go:60: nomad: memberlist: Suspect asia-south-1-server-1.asia-south-1 has failed, no acks received
[DEBUG] go-hclog@v1.6.3/stdlog.go:58: nomad: memberlist: Initiating push/pull sync with: us-east-1-server-1.us-east-1 192.168.1.193:9002
[DEBUG] go-hclog@v1.6.3/stdlog.go:58: nomad: memberlist: Failed UDP ping: asia-south-1-server-1.asia-south-1 (timeout reached)
[INFO] go-hclog@v1.6.3/stdlog.go:60: nomad: memberlist: Suspect asia-south-1-server-1.asia-south-1 has failed, no acks received
```
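
Beyond the log entries, the gossip pool reflects a hard failure: once the suspicion timeout
expires, the impacted region's servers are reported as failed. As an illustrative check, run the
following from a CLI pointed at any surviving region.

```shell-session
$ # Servers from asia-south-1 appear with a failed status in the member list.
$ nomad server members
```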

✅ Request forwarding continues to work between all federated regions that are running with
leadership.

❌ API requests to the impacted region, whether sent directly or via request forwarding, fail.

✅ Creation and deletion of replicated objects, such as namespaces, are written to the
authoritative region.

✅ Any federated region with leadership continues to replicate all objects detailed previously.

✅ Creation of local ACL tokens continues to work for all regions with leadership.

✅ Jobs **without** the [`multiregion`][] block deploy to all regions with leadership.

❌ Jobs **with** the [`multiregion`][] block defined fail to deploy.

## Authoritative region failure: soft
In this situation, the region `europe-west-1` has lost leadership, but the servers are still up
and reachable.
The server logs in the authoritative region have entries such as this example.
```console
[ERROR] nomad/worker.go:504: worker: failed to dequeue evaluation: worker_id=68b3abe2-5e16-8f04-be5a-f76aebb0e59e error="No cluster leader"
```

✅ Request forwarding continues to work between all federated regions that are running with
leadership.

🟨 API requests to the impacted region, whether sent directly or via request forwarding, fail
unless they use the `stale=true` flag.

❌ Creation and deletion of replicated objects, such as namespaces, fail. The example after this
list illustrates the difference between replicated and local writes.

❌ Federated regions are still able to read the data they replicate, because replication uses the
stale flag, but no writes can reach the authoritative region, as described previously.

✅ Creation of local ACL tokens continues to work for all federated regions with leadership.

✅ Jobs **without** the [`multiregion`][] block deploy to all federated regions with leadership.

❌ Jobs **with** the [`multiregion`][] block defined fail to deploy.

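The following sketch contrasts a replicated write with a local write during this scenario. The
namespace name, token name, and policy name are placeholder values.

```shell-session
$ # Namespaces are replicated objects, so the write must go through the
$ # authoritative region and fails while europe-west-1 has no leader.
$ nomad namespace apply -description "placeholder namespace" prod-web

$ # Local (non-global) ACL tokens are written by the receiving region and
$ # continue to work in any federated region that still has leadership.
$ nomad acl token create -name="example" -type="client" -policy="example-policy" -global=false -region=us-east-1
```
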
## Authoritative region failure: hard
In this situation the region `europe-west-1` has gone down. When this happens, the Nomad server
leader logs for the other regions have log entries similar to this example.
```console
[ERROR] nomad/leader.go:544: nomad: failed to fetch namespaces from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader.go:1767: nomad: failed to fetch policies from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader.go:2498: nomad: failed to fetch ACL binding rules from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader_ent.go:226: nomad: failed to fetch quota specifications from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader.go:703: nomad: failed to fetch node pools from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader.go:1909: nomad: failed to fetch tokens from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader.go:2083: nomad: failed to fetch ACL Roles from authoritative region: error="rpc error: EOF"
[DEBUG] nomad/leader_ent.go:84: nomad: failed to fetch policies from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader.go:2292: nomad: failed to fetch ACL auth-methods from authoritative region: error="rpc error: EOF"
[DEBUG] go-hclog@v1.6.3/stdlog.go:58: nomad: memberlist: Failed UDP ping: europe-west-1-server-1.europe-west-1 (timeout reached)
[INFO] go-hclog@v1.6.3/stdlog.go:60: nomad: memberlist: Suspect europe-west-1-server-1.europe-west-1 has failed, no acks received
[DEBUG] go-hclog@v1.6.3/stdlog.go:58: nomad: memberlist: Failed UDP ping: europe-west-1-server-1.europe-west-1 (timeout reached)
```

✅ Request forwarding continues to work between all federated regions that are running with
leadership.

❌ API requests to the impacted region, whether sent directly or via request forwarding, fail.

❌ Creation and deletion of replicated objects, such as namespaces, fail.

❌ Federated regions with leadership are not able to replicate the objects detailed in the log
example above.

✅ Creation of local ACL tokens continues to work for all regions with leadership.

✅ Jobs **without** the [`multiregion`][] block deploy to all regions with leadership.

❌ Jobs **with** the [`multiregion`][] block defined fail to deploy.

[`multiregion`]: /nomad/docs/job-specification/multiregion