---
layout: docs
page_title: How Nomad Uses CPU
description: Learn how Nomad discovers and uses CPU resources on nodes in order to place and run workloads. Nomad Enterprise can use non-uniform memory access (NUMA) scheduling, which is optimized for the NUMA topology of a client node.
---

# How Nomad Uses CPUs

This page provides conceptual information on how Nomad discovers and uses CPU resources on nodes in order to place and run workloads.

Every Nomad node has a Central Processing Unit (CPU) providing the computational
power needed for running operating system processes. Nomad uses the CPU to
run tasks defined by the Nomad job submitter. For Nomad to know which nodes
have sufficient capacity for running a given task, each node in the cluster
is fingerprinted to gather information about the performance characteristics
of its CPU. The two metrics associated with each Nomad node with regard to
CPU performance are its bandwidth (how much it can _compute_) and the number
of cores.

Modern CPUs may contain heterogeneous core types. Apple introduced the M1 CPU
in 2020, which contains both _performance_ (P-Core) and _efficiency_ (E-Core)
core types. Each core type operates at a different base frequency. Intel
introduced a similar topology in its Raptor Lake chips in 2022. When
fingerprinting the characteristics of a CPU, Nomad takes these advanced CPU
topologies into account.

[![Performance and efficiency core fingerprinting](/img/nomad-pe-cores.png)](/img/nomad-pe-cores.png)

## Calculating CPU resources

The total CPU bandwidth of a Nomad node is the sum, for each core type in the
CPU, of that core type's base frequency multiplied by its number of cores.

```
bandwidth = (p_cores * p_frequency) + (e_cores * e_frequency)
```

The total number of cores is computed by summing the number of P-Cores and the
number of E-Cores.

```
cores = p_cores + e_cores
```

Nomad does not distinguish between logical and physical CPU cores. One of the
defining differences between the P-Core and E-Core types is that the E-Cores
do not support hyperthreading, whereas P-Cores do. As such, a single physical
P-Core is presented as 2 logical cores, and a single E-Core is presented as 1
logical core.

The example below is from a Nomad node with an Intel i9-13900 CPU. It is made
up of mixed core types, with a P-Core base frequency of 2 GHz and an E-Core
base frequency of 1.5 GHz.

These characteristics are reflected in the `cpu.frequency.performance` and
`cpu.frequency.efficiency` node attributes, respectively.

```text
cpu.arch = amd64
cpu.frequency.efficiency = 1500
cpu.frequency.performance = 2000
cpu.modelname = 13th Gen Intel(R) Core(TM) i9-13900
cpu.numcores = 32
cpu.numcores.efficiency = 16
cpu.numcores.performance = 16
cpu.reservablecores = 32
cpu.totalcompute = 56000
cpu.usablecompute = 56000
```
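
Applying the bandwidth formula above to these fingerprinted values shows where
the `cpu.totalcompute` figure comes from:

```
bandwidth = (p_cores * p_frequency) + (e_cores * e_frequency)
          = (16 * 2000 MHz)         + (16 * 1500 MHz)
          = 56000 MHz
```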

## Reserving CPU resources

In the fingerprinted node attributes, `cpu.totalcompute` indicates the total
amount of CPU bandwidth the processor is capable of delivering. In some cases it
may be beneficial to reserve some amount of a node's CPU resources for use by
the operating system and other non-Nomad processes. This can be done in client
configuration.

The amount of reserved CPU can be specified as bandwidth in MHz via `cpu`.

```hcl
client {
  reserved {
    cpu = 3000 # mhz
  }
}
```

Alternatively, it can be specified as a set of `cores` on which Nomad does not
schedule tasks. This capability is available on Linux systems only.

```hcl
client {
  reserved {
    cores = "0-3"
  }
}
```

When the CPU is constrained by one of the above configurations, the node
attribute `cpu.usablecompute` indicates the total amount of CPU bandwidth
available for scheduling Nomad tasks.
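
For example, on the i9-13900 node fingerprinted above, reserving `cpu = 3000`
would leave the remaining bandwidth available for tasks:

```
cpu.totalcompute  = 56000
cpu.usablecompute = 56000 - 3000 = 53000
```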

## Allocating CPU resources

When scheduling jobs, a task must specify how much CPU resource should be
allocated on its behalf. This can be done in terms of bandwidth in MHz with the
`cpu` attribute. On Linux under cgroups v1, Nomad maps this MHz value directly
into a [cpu.share][]. On Linux under cgroups v2, Nomad converts the MHz value to
a [cpu.weight][] proportional to the equivalent cgroups v1 `cpu.share`.

```hcl
task {
  resources {
    cpu = 2000 # mhz
  }
}
```

Note that the isolation mechanism around CPU resources depends on each task
driver and its configuration. The standard behavior is that Nomad ensures a task
has access to _at least_ as much CPU bandwidth as it has been allocated. If a
node has idle CPU capacity, a task may therefore use additional CPU resources.
Some task drivers can limit a task to only the amount of bandwidth allocated to
it, as described in the [CPU hard limits](#cpu-hard-limits) section below.

### Relative CPU shares/weights on Linux

Linux cgroups are hierarchical, and the `cpu.share`/`cpu.weight` values reflect
relative weights within a given subtree. Nomad creates its own cgroup subtree
(`nomad.slice`) on startup, and all `cpu.share`/`cpu.weight` values that Nomad
writes are relative between processes within that slice. The `nomad.slice`
subtree is itself relative to another subtree on the host. For example, a host
running systemd might have the following slices:

```
/sys/fs/cgroup
├── nomad.slice
│   ├── reserve.slice
│   └── share.slice
│       ├── 912dcc05-61e1-53cb-5489-a976a1231960.task.scope
│       ├── 247e706a-6df8-4123-89b3-1bcf2846b503.task.scope
│       └── 586c0c58-3d50-4730-b4ad-022076d3c6a4.task.scope
├── system.slice
│   ├── journald.service
│   └── (various system services, etc.)
└── user-1000.slice
    ├── session-1.scope
    └── (various user services, etc.)
```

If the task `912dcc05` has `resources.cpu = 1024` and tasks `247e706a` and
`586c0c58` have `resources.cpu = 512`, then `912dcc05` will get 50% of the CPU
resources available to the `nomad.slice` and tasks `247e706a` and `586c0c58`
will get 25% each. (The `reserve.slice` and `share.slice` are passthrough for
cpu shares here.) But together they'll get 33% of the total host's CPU resources
unless the `nomad.slice`, `system.slice`, or `user-1000.slice` have something
other than the default 1024 shares. The 1024 value is only meaningful within the
context of the Nomad slice.
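
The percentages above follow directly from the ratio of shares within each
subtree, assuming the default 1024 shares for every top-level slice:

```
Within nomad.slice:
  912dcc05: 1024 / (1024 + 512 + 512) = 50%
  247e706a:  512 / (1024 + 512 + 512) = 25%
  586c0c58:  512 / (1024 + 512 + 512) = 25%

Across the host (three equally weighted top-level slices):
  nomad.slice: 1024 / (1024 + 1024 + 1024) ≈ 33%
```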

### Allocating cores

On Linux systems, Nomad supports reserving whole CPU cores specifically for a
task. No task will be allowed to run on a CPU core reserved for another task.

```hcl
task {
  resources {
    cores = 4
  }
}
```

We recommend using `resources.cores` for tasks that require high CPU performance
to give those tasks exclusive access to CPU bandwidth. Sidecar tasks in the same
allocation can use `resources.cpu` to get a proportional share of the remaining
CPU on the node.
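
A minimal sketch of this pattern, using a hypothetical group with a `server`
task and a `log-shipper` sidecar (the task and group names are illustrative):

```hcl
group "app" {
  # Main workload gets exclusive use of two whole cores.
  task "server" {
    resources {
      cores  = 2
      memory = 1024
    }
  }

  # Hypothetical sidecar shares whatever CPU bandwidth remains on the node.
  task "log-shipper" {
    resources {
      cpu    = 200 # mhz
      memory = 128
    }
  }
}
```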

Nomad Enterprise supports NUMA aware scheduling, which enables operators to
more finely control which CPU cores may be reserved for tasks.

### CPU hard limits

Some task drivers support the configuration option `cpu_hard_limit`. If enabled,
this option prevents tasks from bursting above their CPU limit even when there
is idle capacity on the node. The tradeoff is consistency versus utilization: a
task with too few CPU resources may operate fine until another task is placed
on the node, reducing the available CPU bandwidth and potentially disrupting the
underprovisioned task.
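
For example, a sketch using the Docker task driver, which exposes
`cpu_hard_limit` in its task `config` block (the image name is illustrative):

```hcl
task "web" {
  driver = "docker"

  config {
    image          = "nginx:1.27" # example image
    cpu_hard_limit = true
  }

  resources {
    cpu = 2000 # mhz
  }
}
```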

### CPU environment variables

To help tasks understand the resources available to them, Nomad sets the
following environment variables in their runtime environment.

- `NOMAD_CPU_LIMIT` - The amount of CPU bandwidth allocated on behalf of the
  task.
- `NOMAD_CPU_CORES` - The set of cores in [cpuset][] notation reserved for the
  task. This value is only set if `resources.cores` is configured.

```sh
NOMAD_CPU_CORES=3-5
NOMAD_CPU_LIMIT=9000
```

## NUMA

Nomad clients are commonly provisioned on real hardware in an on-premises
environment or in the cloud on large `.metal` instance types. In either case, it
is likely the underlying server is designed around a [non-uniform memory access (NUMA) topology][numa_wiki].
Servers that contain multiple CPU sockets or multiple RAM banks per CPU socket
are characterized by the non-uniform access times involved in accessing system
memory.

[![Diagram of an example NUMA topology](/img/nomad-numa.png)](/img/nomad-numa.png)

The simplified example machine above has the following topology:

- 2 physical CPU sockets
- 4 system memory banks, 2 per socket
- 8 physical CPU cores (4 per socket)
- 2 logical CPU cores per physical core
- 4 PCI devices, 1 per memory bank

### Optimizing performance

Operating system processes take longer to access memory across a NUMA boundary.

Using the example above, if a task is scheduled on Core 0, accessing memory in
Mem 1 might take 20% longer than accessing memory in Mem 0, and accessing memory
in Mem 2 might take 300% longer.

The extreme differences are due to various physical hardware limitations. A core
accessing memory in its own NUMA node is optimal. Programs that perform a high
throughput of reads or writes to and from system memory will have their
performance substantially hindered if they do not optimize their spatial
locality with regard to the system's NUMA topology.

### SLIT tables

Modern machines define System Locality Distance Information (SLIT) tables
in their firmware. These tables are understood and made referenceable by the
Linux kernel. There are two key pieces of information provided by SLIT tables:
- Which CPU cores belong to which NUMA nodes
- The penalty incurred for accessing each NUMA node from a core in every other
  NUMA node

The `lscpu` command describes the core associativity on a machine. For example,
on an `r6a.metal` EC2 instance:

```shell-session
$ lscpu | grep NUMA
NUMA node(s):        4
NUMA node0 CPU(s):   0-23,96-119
NUMA node1 CPU(s):   24-47,120-143
NUMA node2 CPU(s):   48-71,144-167
NUMA node3 CPU(s):   72-95,168-191
```

And the associated performance degradations are available via `numactl`:

```shell-session
$ numactl -H
available: 4 nodes (0-3)
...
node distances:
node   0   1   2   3
  0:  10  12  32  32
  1:  12  10  32  32
  2:  32  32  10  12
  3:  32  32  12  10
```

These SLIT table node distance values are presented as approximate relative
ratios. The value of 10 represents the optimal situation, where a memory access
occurs from a CPU that is part of the same NUMA node. A value of 20 indicates a
200% performance degradation, 30 indicates 300%, and so on.
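
Reading the `numactl` output above as ratios against the local distance of 10:

```
node 0 -> node 1: 12 / 10 = 1.2x the local access time
node 0 -> node 2: 32 / 10 = 3.2x the local access time
node 0 -> node 3: 32 / 10 = 3.2x the local access time
```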

### Node attributes

Nomad clients fingerprint the machine's NUMA topology and export the core
associativity as node attributes. This data gives a Nomad operator a better
understanding of when it might be useful to enable NUMA aware scheduling for
certain workloads.

```
numa.node.count = 4
numa.node0.cores = 0-23,96-119
numa.node1.cores = 24-47,120-143
numa.node2.cores = 48-71,144-167
numa.node3.cores = 72-95,168-191
```
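
These values can be referenced like any other node attribute. As an illustrative
sketch (not part of the original page), a job could constrain placement to
clients that expose more than one NUMA node:

```hcl
constraint {
  attribute = "${attr.numa.node.count}"
  operator  = ">="
  value     = "2"
}
```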

## NUMA aware scheduling <EnterpriseAlert inline />

Nomad Enterprise is capable of scheduling tasks in a way that is optimized for
the NUMA topology of a client node. Nomad is able to correlate CPU cores with
memory nodes and assign tasks to run on specific CPU cores so as to minimize any
cross-memory node access patterns. Additionally, Nomad is able to correlate
devices to memory nodes and enable NUMA-aware scheduling to take device
associativity into account when making scheduling decisions.

A task may specify a `numa` block indicating its NUMA optimization preference.
This example allocates a `1080ti` GPU and ensures it is on the same NUMA node as
the 4 CPU cores reserved for the task.

```hcl
task {
  resources {
    cores  = 4
    memory = 2048

    device "nvidia/gpu/1080ti" {
      count = 1
    }

    numa {
      affinity = "require"
      devices  = [
        "nvidia/gpu/1080ti"
      ]
    }
  }
}
```

### `affinity` options

This is a required field. There are three supported affinity options:
`none`, `prefer`, and `require`, each with its own advantages and tradeoffs.

#### option `none`

In the `none` mode, the Nomad scheduler takes advantage of jobs that have no
NUMA affinity preference to help reduce core fragmentation within NUMA nodes. It
does so by bin-packing the core requests of these jobs onto the NUMA nodes with
the fewest unused cores available.

The `none` mode is the default mode if the `numa` block is not specified.

```hcl
resources {
  cores = 4
  numa {
    affinity = "none"
  }
}
```

#### option `prefer`

In the `prefer` mode, the Nomad scheduler uses the hardware topology of a node
to calculate an optimized selection of available cores, but does not limit
those cores to come from a single NUMA node.

```hcl
resources {
  cores = 4
  numa {
    affinity = "prefer"
  }
}
```

#### option `require`

In the `require` mode, the Nomad scheduler uses the topology of each potential
client to find a set of available CPU cores that belong to the same NUMA node.
If no such set of cores can be found, that node is marked exhausted for the
`numa-cores` resource.

```hcl
resources {
  cores = 4
  numa {
    affinity = "require"
  }
}
```

### `devices` options

`devices` is an optional list of devices that must be colocated on the NUMA node
along with the allocated CPU cores.

The following diagram shows how a set of devices can be correlated to CPU and
memory.

[![Diagram correlating devices to CPU and memory](/img/nomad-devices-correlate-cpu-memory.png)](/img/nomad-devices-correlate-cpu-memory.png)

This example declares three devices and configures two in the `numa` block.

```hcl
task {
  resources {
    cores  = 8
    memory = 16384

    device "nvidia/gpu/H100" {
      count = 2
    }
    device "intel/net/XXVDA2" {
      count = 1
    }
    device "xilinx/fpga/X7" {
      count = 1
    }

    numa {
      affinity = "require"
      devices = [
        "nvidia/gpu/H100",
        "intel/net/XXVDA2"
      ]
    }
  }
}
```

## Virtual CPU fingerprinting

When running on a virtualized host such as Amazon EC2, Nomad makes use of the
`dmidecode` tool to detect CPU performance data. Some Linux distributions
require installing the `dmidecode` package manually.
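
For example, on Debian-based distributions the package can typically be
installed with `apt`:

```shell-session
$ sudo apt-get install dmidecode
```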

[cpuset]: https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v1/cpusets.html
[cpu.share]: https://www.redhat.com/sysadmin/cgroups-part-two
[cpu.weight]: https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#weights
[numa_wiki]: https://en.wikipedia.org/wiki/Non-uniform_memory_access