---
layout: docs
page_title: Federated cluster failure scenarios
description: |-
  Review failure scenarios in multi-region federated cluster deployments. Learn
  which Nomad features continue to work under federated and authoritative
  region failures.
---

# Federated cluster failure scenarios

When running Nomad in federated mode, the failure behavior and its impact depend on whether the
impacted region is the authoritative region, and on the failure mode. In a soft failure, the
region's servers have lost quorum but the Nomad processes are still up, running, and reachable.
In a hard failure, the regional servers are completely unreachable, as if the underlying hardware
had been terminated (cloud) or powered off (on-prem).

The scenarios are based on a Nomad deployment running three federated regions:

* `asia-south-1`
* `europe-west-1` - authoritative region
* `us-east-1`

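Each server participates in federation by naming the same authoritative region in its agent
configuration. The following is a minimal sketch, not a prescriptive setup: the
`bootstrap_expect` value is an assumption, and the region names come from the scenario above.

```hcl
# Agent configuration for a server in the asia-south-1 region.
region = "asia-south-1"

server {
  enabled          = true
  bootstrap_expect = 3

  # Every server names the same authoritative region, which holds the
  # source-of-truth copies of replicated objects such as namespaces.
  authoritative_region = "europe-west-1"
}
```

Once the servers are up, regions federate by joining a server in another region, for example
with `nomad server join`.
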
## Federated region failure: soft

In this situation, the region `asia-south-1` has lost leadership, but the servers are reachable
and up.

All server logs in the impacted region have entries such as this example.

```console
[ERROR] nomad/worker.go:504: worker: failed to dequeue evaluation: worker_id=d19e6bb5-5ec9-8f75-9caf-47e2513fe28d error="No cluster leader"
```

✅ Request forwarding continues to work between all federated regions that are running with
leadership.

🟨 API requests, either made directly or forwarded to the impacted region, fail unless they use
the `stale=true` flag; see the example after this list.

✅ Creation and deletion of replicated objects, such as namespaces, are written to the
authoritative region.

✅ Any federated region with leadership is able to continue replicating all objects detailed
previously.

✅ Creation of local ACL tokens continues to work for all regions with leadership.

✅ Jobs **without** the [`multiregion`][] block deploy to all regions with leadership.

❌ Jobs **with** the [`multiregion`][] block defined fail to deploy.

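Reads that can tolerate staleness may still be served by the remaining servers in the impacted
region. As a hedged sketch, assuming a local agent address, a direct API request against the
impacted region might look like this:

```console
$ curl "http://127.0.0.1:4646/v1/jobs?region=asia-south-1&stale=true"
```

The `stale=true` query parameter lets any server answer the read rather than requiring the
leader, at the cost of potentially stale results.
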
## Federated region failure: hard

In this situation, the region `asia-south-1` has gone down. When this happens, the Nomad server
logs for the other regions have entries similar to this example.

```console
[DEBUG] go-hclog@v1.6.3/stdlog.go:58: nomad: memberlist: Failed UDP ping: asia-south-1-server-1.asia-south-1 (timeout reached)
[INFO] go-hclog@v1.6.3/stdlog.go:60: nomad: memberlist: Suspect asia-south-1-server-1.asia-south-1 has failed, no acks received
[DEBUG] go-hclog@v1.6.3/stdlog.go:58: nomad: memberlist: Initiating push/pull sync with: us-east-1-server-1.us-east-1 192.168.1.193:9002
[DEBUG] go-hclog@v1.6.3/stdlog.go:58: nomad: memberlist: Failed UDP ping: asia-south-1-server-1.asia-south-1 (timeout reached)
[INFO] go-hclog@v1.6.3/stdlog.go:60: nomad: memberlist: Suspect asia-south-1-server-1.asia-south-1 has failed, no acks received
```

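To confirm how the gossip pool views the failed servers, you can list the server members from
any healthy region; the `asia-south-1` servers appear with a status other than `alive`.

```console
$ nomad server members
```
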
✅ Request forwarding continues to work between all federated regions that are running with
leadership.

❌ API requests, either made directly or forwarded to the impacted region, fail.

✅ Creation and deletion of replicated objects, such as namespaces, are written to the
authoritative region.

✅ Any federated region with leadership continues to replicate all objects detailed previously.

✅ Creation of local ACL tokens continues to work for all regions that are running with
leadership.

✅ Jobs **without** the [`multiregion`][] block deploy to all regions with leadership.

❌ Jobs **with** the [`multiregion`][] block defined fail to deploy; the sketch after this list
shows the shape of such a block.

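For reference, this is the kind of job that fails to deploy whenever one of its target regions
is down or without leadership. A minimal sketch: the counts and datacenter names are
assumptions, and the region names come from the scenario above.

```hcl
job "example" {
  multiregion {
    strategy {
      max_parallel = 1
      on_failure   = "fail_all"
    }

    region "europe-west-1" {
      count       = 2
      datacenters = ["europe-west-1a"]
    }

    region "us-east-1" {
      count       = 2
      datacenters = ["us-east-1a"]
    }
  }

  # Task groups omitted for brevity.
}
```
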
## Authoritative region failure: soft

In this situation, the region `europe-west-1` has lost leadership, but the servers are reachable
and up.

The server logs in the authoritative region have entries such as this example.

```console
[ERROR] nomad/worker.go:504: worker: failed to dequeue evaluation: worker_id=68b3abe2-5e16-8f04-be5a-f76aebb0e59e error="No cluster leader"
```

✅ Request forwarding continues to work between all federated regions that are running with
leadership.

🟨 API requests, either made directly or forwarded to the impacted region, fail unless they use
the `stale=true` flag.

❌ Creation and deletion of replicated objects, such as namespaces, fail.

❌ Federated regions are still able to read the data they replicate, because replication uses
the stale flag, but no writes can occur to the authoritative region, as described previously.

✅ Creation of local ACL tokens continues to work for all federated regions that are running
with leadership; see the example after this list.

✅ Jobs **without** the [`multiregion`][] block deploy to all federated regions that are running
with leadership.

❌ Jobs **with** the [`multiregion`][] block defined fail to deploy.

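Local ACL token creation is a write handled entirely within each region, which is why it keeps
working while the authoritative region has no leader. A hedged sketch, with an assumed token
name and policy:

```console
$ nomad acl token create -name="ops-local" -policy="ops-read" -type="client"
```

By contrast, global tokens (created with `-global=true`) are written to the authoritative
region, so their creation fails in these scenarios.
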
## Authoritative region failure: hard

In this situation, the region `europe-west-1` has gone down. When this happens, the Nomad server
leader logs for the other regions have entries similar to this example.

```console
[ERROR] nomad/leader.go:544: nomad: failed to fetch namespaces from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader.go:1767: nomad: failed to fetch policies from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader.go:2498: nomad: failed to fetch ACL binding rules from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader_ent.go:226: nomad: failed to fetch quota specifications from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader.go:703: nomad: failed to fetch node pools from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader.go:1909: nomad: failed to fetch tokens from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader.go:2083: nomad: failed to fetch ACL Roles from authoritative region: error="rpc error: EOF"
[DEBUG] nomad/leader_ent.go:84: nomad: failed to fetch policies from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader.go:2292: nomad: failed to fetch ACL auth-methods from authoritative region: error="rpc error: EOF"
[DEBUG] go-hclog@v1.6.3/stdlog.go:58: nomad: memberlist: Failed UDP ping: europe-west-1-server-1.europe-west-1 (timeout reached)
[INFO] go-hclog@v1.6.3/stdlog.go:60: nomad: memberlist: Suspect europe-west-1-server-1.europe-west-1 has failed, no acks received
[DEBUG] go-hclog@v1.6.3/stdlog.go:58: nomad: memberlist: Failed UDP ping: europe-west-1-server-1.europe-west-1 (timeout reached)
```

✅ Request forwarding continues to work between all federated regions that are running with
leadership.

❌ API requests, either made directly or forwarded to the impacted region, fail.

❌ Creation and deletion of replicated objects, such as namespaces, fail.

❌ Federated regions with leadership are not able to replicate the objects detailed in the logs.

✅ Creation of local ACL tokens continues to work for all regions with leadership.

✅ Jobs **without** the [`multiregion`][] block deploy to all regions with leadership.

❌ Jobs **with** the [`multiregion`][] block defined fail to deploy.

[`multiregion`]: /nomad/docs/job-specification/multiregion