nomad/website/content/docs/operations/federation/index.mdx
Aimee Ukasick 4dfedf1aef add top-level heading so the page renders correctly (#24491)
Add opening paragraph; update description
2024-11-19 11:10:10 -06:00

---
layout: docs
page_title: Federated cluster operations
description: |-
  Operational considerations for running Nomad multi-region federated clusters as well as instructions for migrating the authoritative region to a federated region.
---
# Federated cluster operations
This page lists operational considerations for running multi-region federated
clusters as well as instructions for migrating the authoritative region to a
federated region.
## Operational considerations
When operating multi-region federated Nomad clusters, consider the following:
* **Regular snapshots**: You can back up Nomad server state using the
[`nomad operator snapshot save`][] and [`nomad operator snapshot agent`][] commands. Performing
regular backups expedites disaster recovery. The right cadence depends on how quickly your
cluster state changes and on your internal SLAs. Regularly test snapshots using the
[`nomad operator snapshot restore`][] command to ensure they work.
* **Local ACL management tokens**: You need local management tokens to perform federated cluster
administration when the authoritative region is down. Make sure you have existing break-glass
tokens available for each region.
* **Known paths for creating local ACL tokens**: If the authoritative region fails, creation of
global ACL tokens also fails. In that situation, being able to create local ACL tokens lets
you continue to interact with each available federated region.
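
To illustrate the regular-snapshots consideration above, the following is a minimal sketch of a configuration file for `nomad operator snapshot agent`. The file name, API address, region name `east`, interval, retention count, and storage path are all assumptions chosen for the example. Confirm the block and key names, and whether the snapshot agent is available in your Nomad edition, against the command's documentation before relying on it.

```hcl
# snapshot-agent.hcl (hypothetical file name)
# Sketch only: every value below is an illustrative assumption.

nomad {
  # API address of a server in the region being backed up.
  address = "http://127.0.0.1:4646"
  region  = "east" # hypothetical region name
}

snapshot {
  interval = "1h" # how often to take a snapshot
  retain   = 24   # how many snapshots to keep
}

local_storage {
  # Directory where snapshot files are written (hypothetical path).
  path = "/var/lib/nomad-snapshots"
}
```

Because each region keeps its own server state, a setup like this would typically be repeated for every region, followed by periodic restore tests with [`nomad operator snapshot restore`][].
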
## Authoritative and federated regions
* **Can non-authoritative regions continue to operate if the authoritative region is unreachable?**
Yes. Federation failures never interrupt running workloads, the scheduling of new workloads,
or the rescheduling of failed workloads.
See [Failure Scenarios][failure_scenarios] for details.
* **Can the authoritative region be deployed with servers only?** Yes, deploying the Nomad
authoritative region with servers only, without clients, works as expected. This servers-only
approach can expedite disaster recovery of the region because restoration does not include
objects such as nodes, jobs, or allocations, which are large and require compute-intensive
reconciliation after restoration.
* **Can I migrate the authoritative region to a currently federated region?** Yes, by
following these steps:
  1. Update the [`authoritative_region`][] configuration parameter on the servers in the region
     that will become authoritative.
  1. Restart the server processes in the new authoritative region and ensure all data is present
     in state as expected. If the network was partitioned as part of the failure of the original
     authoritative region, some writes to replicated objects may not have been replicated to
     the federated regions.
  1. Update the [`authoritative_region`][] configuration parameter on the federated region servers
     and restart their processes.
* **Can federated regions be bootstrapped while the authoritative region is down?** No, they
cannot.
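
The migration steps listed above all come down to changing one server setting and restarting. As a minimal sketch, assuming the original authoritative region was named `east` and the new one is `west` (both region names are hypothetical), the `server` block on each server would end up looking something like this:

```hcl
# Server agent configuration fragment (sketch).
# "east" and "west" are hypothetical region names used for illustration.

server {
  enabled = true

  # Previously "east". Change this on the servers in the new authoritative
  # region first, restart them and verify state, and only then apply the
  # same change to the servers in the remaining federated regions.
  authoritative_region = "west"
}
```

Applying the change to the new authoritative region first, as the steps describe, gives you a chance to check for missing replicated objects before the other regions start replicating from it.
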

[`nomad operator snapshot save`]: /nomad/docs/commands/operator/snapshot/save
[`nomad operator snapshot agent`]: /nomad/docs/commands/operator/snapshot/agent
[`nomad operator snapshot restore`]: /nomad/docs/commands/operator/snapshot/restore
[failure_scenarios]: /nomad/docs/operations/federation/failure
[`authoritative_region`]: /nomad/docs/configuration/server#authoritative_region