---
layout: docs
page_title: Production reference architecture
description: |-
  Review the recommended compute and networking resources for provisioning a Nomad Enterprise cluster in a production environment.
---

# Production reference architecture

This document provides recommended practices and a reference architecture for
Nomad production deployments. This reference architecture conveys a general
architecture that should be adapted to accommodate the specific needs of each
implementation.

The following topics are addressed:

- [Reference Architecture](#ra)
- [Deployment Topology within a Single Region](#one-region)
- [Deployment Topology across Multiple Regions](#multi-region)
- [Network Connectivity Details](#net)
- [Deployment System Requirements](#system-reqs)
- [High Availability](#high-availability)
- [Failure Scenarios](#failure-scenarios)

<!-- vale Google.We = NO -->

This document describes deploying a Nomad cluster in combination with, or with
access to, a [Consul cluster][]. We recommend the use of Consul with Nomad to
provide automatic clustering, service discovery, health checking and dynamic
configuration.

<a name="ra"></a>

## Reference architecture

A Nomad cluster typically comprises three or five servers (but no more than
seven) and a number of client agents. Nomad differs slightly from Consul in
that it divides infrastructure into [regions][glossary-regions], each of which
is served by one Nomad server cluster but can manage multiple
[datacenters][glossary-dc] or availability zones. For example, a _US Region_
can include datacenters _us-east-1_ and _us-west-2_.
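
As a rough illustration, the region and datacenter an agent belongs to are set
in its configuration file. The snippet below is a minimal sketch that uses the
example names above as placeholder values.

```hcl
# Nomad agent configuration fragment (placeholder values).
# This agent is part of the "us" region and the "us-east-1" datacenter.
region     = "us"
datacenter = "us-east-1"
```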

In a Nomad multi-region architecture, communication happens via [WAN gossip][].
Additionally, Nomad can integrate easily with Consul to provide features such as
automatic clustering, service discovery, and dynamic configuration. Thus we
recommend you use Consul with Nomad to simplify the deployment.

<!-- vale Google.We = YES -->

In cloud environments, a single cluster may be deployed across multiple
availability zones. For example, in AWS each Nomad server can be deployed to an
associated EC2 instance, and those EC2 instances distributed across multiple
AZs. Similarly, Nomad server clusters can be deployed to multiple cloud regions
to allow for region-level HA scenarios.

For more information on Nomad server cluster design, see the [cluster
requirements documentation][requirements].

The design shared in this document is the recommended architecture for
production environments, as it provides flexibility and resilience. Nomad
utilizes an existing Consul server cluster; however, the deployment design of
the Consul server cluster is outside the scope of this document.

Nomad-to-Consul connectivity is over HTTP and should be secured with TLS and a
Consul ACL token, so that all traffic is encrypted and requests are
authenticated. This is done using Nomad's
[Automatic Clustering with Consul][consul-clustering].
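
A minimal sketch of the Nomad agent's `consul` block is shown below. It assumes
a local Consul agent listening for HTTPS on port 8501; the address, CA path, and
token are placeholders to replace with your own values.

```hcl
# Nomad agent "consul" block (placeholder values): talk to the local Consul
# agent over TLS and authenticate with a Consul ACL token.
consul {
  address = "127.0.0.1:8501"
  ssl     = true
  ca_file = "/etc/nomad.d/tls/consul-ca.pem"
  token   = "REPLACE_WITH_CONSUL_ACL_TOKEN"

  # Let Nomad register itself in Consul and use it to discover other agents.
  auto_advertise   = true
  server_auto_join = true
  client_auto_join = true
}
```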

<a name="one-region"></a>

### Deployment topology within a single region

A single Nomad cluster is recommended for applications deployed in the same region.

Each cluster is expected to have either three or five servers.
This strikes a balance between availability in the case of failure and
performance, as [Raft][] consensus gets progressively
slower as more servers are added.
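
As a hedged sketch, the expected server count is expressed in the server
agent's configuration through `bootstrap_expect`; the value below assumes a
three-server cluster.

```hcl
# Server agent fragment: leader election starts once three servers have joined.
server {
  enabled          = true
  bootstrap_expect = 3
}
```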

The time taken by a new server to join an existing large cluster may increase as
the size of the cluster increases.

#### Reference diagram

[![Reference diagram][img-reference-diagram]][img-reference-diagram]

<a name="multi-region"></a>

### Deployment topology across multiple regions

By deploying Nomad server clusters in multiple regions, you can interact with
the Nomad servers by targeting any region from any Nomad server, even if that
server resides in a separate region. However, most data is not replicated
between regions because they are fully independent clusters. The exceptions
which _are_ replicated between regions are:

- [ACL policies and global tokens][acl]
- [Sentinel policies in Nomad Enterprise][sentinel]

Nomad server clusters in different datacenters can be federated using WAN links.
The server clusters can be joined to communicate over the WAN on port `4648`.
This same port is used for single datacenter deployments over LAN as well.
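
For reference, the block below simply restates Nomad's default port
assignments; it is illustrative only. Once two regional clusters are running,
joining any server in the other region over the serf port (for example with the
`nomad server join` command) federates the regions.

```hcl
# Default Nomad port assignments, stated explicitly for clarity.
ports {
  http = 4646 # HTTP API
  rpc  = 4647 # server and client RPC within a region
  serf = 4648 # gossip, including WAN federation between regions
}
```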

Additional documentation is available to learn more about Nomad cluster
[federation][].

<a name="net"></a>

## Network connectivity details

[![Nomad network diagram][img-nomad-net]][img-nomad-net]

Nomad servers are expected to communicate in high bandwidth, low latency
network environments, with sub-10 millisecond latencies between cluster
members. Nomad servers can be spread across cloud regions or datacenters
if they satisfy these latency requirements.

Nomad client clusters require the ability to receive traffic as noted in
the Network Connectivity Details; however, clients can be separated into any
type of infrastructure (multi-cloud, on-prem, virtual, bare metal, etc.) as long
as they are reachable and can receive job requests from the Nomad servers.
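
When servers or clients are spread across availability zones or datacenters,
each agent must advertise addresses that the rest of the cluster can actually
reach. The fragment below is a minimal sketch; the IP address is a placeholder
(typically the host's private IP in cloud environments).

```hcl
# Advertise reachable addresses to other cluster members (placeholder IP).
advertise {
  http = "10.0.1.10"
  rpc  = "10.0.1.10"
  serf = "10.0.1.10"
}
```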

Additional documentation is available to learn more about [Nomad networking][].

<a name="system-reqs"></a>

## Deployment system requirements

Nomad server agents are responsible for maintaining the cluster state,
responding to RPC queries (read operations), and processing all write
operations. Given that Nomad server agents do most of the heavy lifting, server
sizing is critical for the overall performance efficiency and health of the
Nomad cluster.

### Nomad servers

<!-- vale Vale.Spelling = NO --> <!-- vale Google.Colons = NO --> <!-- vale Google.WordList = NO -->

| Type  | CPU       | Memory       | Disk   | Typical Cloud Instance Types               |
| ----- | --------- | ------------ | ------ | ------------------------------------------ |
| Small | 2-4 core  | 8-16 GB RAM  | 50 GB  | **AWS**: m5.large, m5.xlarge               |
|       |           |              |        | **Azure**: Standard_D2_v3, Standard_D4_v3  |
|       |           |              |        | **GCP**: n2-standard-2, n2-standard-4      |
| Large | 8-16 core | 32-64 GB RAM | 100 GB | **AWS**: m5.2xlarge, m5.4xlarge            |
|       |           |              |        | **Azure**: Standard_D8_v3, Standard_D16_v3 |
|       |           |              |        | **GCP**: n2-standard-8, n2-standard-16     |

<!-- vale Vale.Spelling = YES --> <!-- vale Google.Colons = YES --> <!-- vale Google.WordList = YES -->

#### Hardware sizing considerations

- The small size would be appropriate for most initial production
  deployments, or for development/testing environments.

- The large size is for production environments where there is a
  consistently high workload.

<Note>

For large workloads, ensure that the disks support a high number of
IOPS to keep up with the rapid Raft log update rate.

</Note>

Nomad clients can be set up with specialized workloads as well. For example, if
workloads require GPU processing, a Nomad datacenter can be created to serve
those GPU-specific jobs and joined to a Nomad server cluster. For more
information on specialized workloads, see the documentation on [job
constraints][] to target specific client nodes.
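
For instance, clients on GPU hosts can be tagged with a node class in their
agent configuration (`client { node_class = "gpu" }`), and jobs can then
constrain placement to that class. The job below is a minimal, hypothetical
sketch; the job name, datacenter, image, and class value are all placeholders.

```hcl
job "gpu-batch" {
  datacenters = ["us-east-1"]

  # Only place this job on clients whose agent configuration sets
  # node_class = "gpu".
  constraint {
    attribute = "${node.class}"
    value     = "gpu"
  }

  group "workers" {
    task "train" {
      driver = "docker"

      config {
        image = "example/train:latest"
      }
    }
  }
}
```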

## High availability

A Nomad server cluster is the highly available unit of deployment within a
single datacenter. A recommended approach is to deploy a three or five node
Nomad server cluster. With this configuration, during a Nomad server outage,
failover is handled immediately without human intervention.

When setting up high availability across regions, multiple Nomad server clusters
are deployed and connected via WAN gossip. Nomad clusters in regions are fully
independent from each other and do not share jobs, clients, or state. Data
residing in a single region-specific cluster is not replicated to other clusters
in other regions.

## Failure scenarios

Typical distribution in a cloud environment is to spread Nomad server nodes into
separate Availability Zones (AZs) within a high bandwidth, low latency network,
such as an AWS Region. The diagram below shows Nomad servers deployed in
multiple AZs promoting a single voting member per AZ and providing both AZ-level
and node-level failure protection.

[![Nomad fault tolerance][img-fault-tolerance]][img-fault-tolerance]
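
The single-voting-member-per-AZ layout in the diagram corresponds to autopilot
redundancy zones, a Nomad Enterprise feature. The fragment below is a sketch
under that assumption; the zone name is a placeholder and is set per server to
match its AZ.

```hcl
# Nomad Enterprise: treat each availability zone as a redundancy zone so that
# only one server per zone votes and the others act as hot standbys.
autopilot {
  enable_redundancy_zones = true
}

server {
  enabled          = true
  bootstrap_expect = 3
  redundancy_zone  = "us-east-1a" # placeholder AZ name
}
```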

Additional documentation is available to learn more about [cluster sizing and
failure tolerances][sizing] as well as [outage recovery][].

### Availability zone failure

In the event of a single AZ failure, only a single Nomad server is affected,
which would not impact job scheduling as long as there is still a Raft quorum
(that is, two available servers in a three-server cluster, or three available
servers in a five-server cluster). More generally:

<div align="center">quorum = floor( count(members) / 2) + 1</div>

There are two scenarios that could occur should an AZ fail in a multiple AZ
setup: leader loss or follower loss.

#### Leader server loss

If the AZ containing the Nomad leader server fails, the remaining quorum members
would elect a new leader. The new leader then begins to accept new log entries and
replicates these entries to the remaining followers.

#### Follower server loss

If the AZ containing a Nomad follower server fails, there is no immediate impact
to the Nomad leader server or cluster operations. However, there still must be a
Raft quorum in order to properly manage a future failure of the Nomad leader
server.

### Region failure

In the event of a region-level failure (one affecting an entire Nomad server
cluster), clients are still able to submit jobs to another region that is
properly federated. However, data loss is likely because Nomad server clusters
do not replicate their data to clusters in other regions. See
[Multi-region Federation][federation] for more setup information.
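
Jobs can target a surviving region explicitly through the job specification's
`region` field, which is how work can keep flowing during such an outage. The
job below is a hypothetical sketch; the region, datacenter, and image are
placeholders.

```hcl
job "web" {
  # Run this job in the "eu" region; a federated server in any region
  # forwards the registration to the eu servers.
  region      = "eu"
  datacenters = ["eu-west-1"]

  group "web" {
    task "server" {
      driver = "docker"

      config {
        image = "nginx:stable"
      }
    }
  }
}
```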

## Next steps

Read [Deployment Guide][deployment-guide] to learn the steps required to
install and configure a single HashiCorp Nomad cluster that uses Consul.

[acl]: /nomad/secure/acl/bootstrap
[consul cluster]: /nomad/docs/networking/consul
[deployment-guide]: /nomad/tutorials/enterprise/production-deployment-guide-vm-with-consul
[img-fault-tolerance]: /img/deploy/nomad_fault_tolerance.png
[img-nomad-net]: /img/deploy/nomad_network_arch_0-1x.png
[img-reference-diagram]: /img/deploy/nomad_reference_diagram.png
[job constraints]: /nomad/docs/job-specification/constraint
[federation]: /nomad/docs/deploy/clusters/federate-regions
[nomad networking]: /nomad/docs/deploy/production/requirements#network-topology
[nomad server federation]: /nomad/docs/deploy/clusters/federate-regions
[outage recovery]: /nomad/docs/manage/outage-recovery
[raft]: https://raft.github.io/
[requirements]: /nomad/docs/deploy/production/requirements
[sentinel]: /nomad/docs/govern/sentinel
[sizing]: /nomad/docs/architecture/cluster/consensus#deployment_table
[wan gossip]: /nomad/docs/architecture/security/gossip
[consul-clustering]: /nomad/docs/deploy/clusters/connect-nodes
[glossary-regions]: /nomad/docs/glossary#regions
[glossary-dc]: /nomad/docs/glossary#datacenters