---
layout: docs
page_title: How Nomad Uses CPU
description: Learn how Nomad discovers and uses CPU resources on nodes in order to place and run workloads. Nomad Enterprise can use non-uniform memory access (NUMA) scheduling, which is optimized for the NUMA topology of a client node.
---

# How Nomad Uses CPUs

This page provides conceptual information on how Nomad discovers and uses CPU resources on nodes in order to place and run workloads.

Every Nomad node has a Central Processing Unit (CPU) providing the computational
power needed for running operating system processes. Nomad uses the CPU to
run tasks defined by the Nomad job submitter. For Nomad to know which nodes
have sufficient capacity for running a given task, each node in the cluster
is fingerprinted to gather information about the performance characteristics
of its CPU. The two metrics associated with each Nomad node with regard to
CPU performance are its bandwidth (how much it can _compute_) and the number
of cores.

Modern CPUs may contain heterogeneous core types. Apple introduced the M1 CPU
in 2020, which contains both _performance_ (P-Core) and _efficiency_ (E-Core)
core types. Each core type operates at a different base frequency. Intel
introduced a similar topology in its Raptor Lake chips in 2022. When
fingerprinting the characteristics of a CPU, Nomad takes these advanced CPU
topologies into account.

[![Performance and efficiency core fingerprinting](/img/nomad-pe-cores.png)](/img/nomad-pe-cores.png)

## Calculating CPU resources

The total CPU bandwidth of a Nomad node is the sum, for each core type in the
CPU, of that core type's base frequency multiplied by its number of cores.

```
bandwidth = (p_cores * p_frequency) + (e_cores * e_frequency)
```

The total number of cores is computed by summing the number of P-Cores and the
number of E-Cores.

```
cores = p_cores + e_cores
```

Nomad does not distinguish between logical and physical CPU cores. One of the
defining differences between the P-Core and E-Core types is that the E-Cores
do not support hyperthreading, whereas P-Cores do. As such, a single physical
P-Core is presented as 2 logical cores, and a single E-Core is presented as 1
logical core.

The example below is from a Nomad node with an Intel i9-13900 CPU. It is made
up of mixed core types, with a P-Core base frequency of 2 GHz and an E-Core
base frequency of 1.5 GHz.

These characteristics are reflected in the `cpu.frequency.performance` and
`cpu.frequency.efficiency` node attributes, respectively.

```text
cpu.arch = amd64
cpu.frequency.efficiency = 1500
cpu.frequency.performance = 2000
cpu.modelname = 13th Gen Intel(R) Core(TM) i9-13900
cpu.numcores = 32
cpu.numcores.efficiency = 16
cpu.numcores.performance = 16
cpu.reservablecores = 32
cpu.totalcompute = 56000
cpu.usablecompute = 56000
```
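
Applying the bandwidth formula above to these fingerprinted values shows where
the `cpu.totalcompute` figure comes from:

```
bandwidth = (p_cores * p_frequency) + (e_cores * e_frequency)
          = (16 * 2000 MHz)         + (16 * 1500 MHz)
          = 56000 MHz
```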

## Reserving CPU resources

In the fingerprinted node attributes, `cpu.totalcompute` indicates the total
amount of CPU bandwidth the processor is capable of delivering. In some cases it
may be beneficial to reserve some amount of a node's CPU resources for use by
the operating system and other non-Nomad processes. This can be done in client
configuration.

The amount of reserved CPU can be specified as bandwidth in MHz via `cpu`.

```hcl
client {
  reserved {
    cpu = 3000 # mhz
  }
}
```

Alternatively, it can be specified as a set of `cores` on which Nomad does not
schedule tasks. This capability is available on Linux systems only.

```hcl
client {
  reserved {
    cores = "0-3"
  }
}
```

When the CPU is constrained by one of the above configurations, the node
attribute `cpu.usablecompute` indicates the total amount of CPU bandwidth
available for scheduling Nomad tasks.
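
For example, on the i9-13900 node fingerprinted above, reserving `cpu = 3000`
would leave the remaining bandwidth available for tasks:

```
cpu.totalcompute  = 56000
cpu.usablecompute = 56000 - 3000 = 53000
```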

## Allocating CPU resources

When scheduling jobs, a task must specify how much CPU resource should be
allocated on its behalf. This can be done in terms of bandwidth in MHz with the
`cpu` attribute. On Linux under cgroups v1, Nomad maps this MHz value directly
into a [cpu.share][]. On Linux under cgroups v2, Nomad converts the MHz value to
a [cpu.weight][] proportional to the equivalent cgroups v1 `cpu.share`.

```hcl
task {
  resources {
    cpu = 2000 # mhz
  }
}
```

Note that the isolation mechanism around CPU resources depends on each task
driver and its configuration. The standard behavior is that Nomad ensures a task
has access to _at least_ as much CPU bandwidth as it has been allocated. If a
node has idle CPU capacity, a task may therefore use additional CPU resources.
Some task drivers can limit a task to only the amount of bandwidth allocated to
it, as described in the [CPU hard limits](#cpu-hard-limits) section below.

### Relative CPU shares/weights on Linux

Linux cgroups are hierarchical, and the `cpu.share`/`cpu.weight` values reflect
relative weights within a given subtree. Nomad creates its own cgroup subtree
(`nomad.slice`) on startup, and all `cpu.share`/`cpu.weight` values that Nomad
writes are relative between processes within that slice. The `nomad.slice`
subtree is itself relative to another subtree on the host. For example, a host
running systemd might have the following slices:

```
/sys/fs/cgroup
├── nomad.slice
│   ├── reserve.slice
│   └── share.slice
│       ├── 912dcc05-61e1-53cb-5489-a976a1231960.task.scope
│       ├── 247e706a-6df8-4123-89b3-1bcf2846b503.task.scope
│       └── 586c0c58-3d50-4730-b4ad-022076d3c6a4.task.scope
├── system.slice
│   ├── journald.service
│   └── (various system services, etc.)
└── user-1000.slice
    ├── session-1.scope
    └── (various user services, etc.)
```

If the task `912dcc05` has `resources.cpu = 1024` and tasks `247e706a` and
`586c0c58` have `resources.cpu = 512`, then `912dcc05` will get 50% of the CPU
resources available to the `nomad.slice` and tasks `247e706a` and `586c0c58`
will get 25% each. (The `reserve.slice` and `share.slice` are passthrough for
cpu shares here.) But together they'll get 33% of the total host's CPU resources
unless the `nomad.slice`, `system.slice`, or `user-1000.slice` have something
other than the default 1024 shares. The 1024 value is only meaningful within the
context of the Nomad slice.
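
The percentages above follow directly from the ratio of shares within each
subtree, assuming the default 1024 shares for every top-level slice:

```
Within nomad.slice:
  912dcc05: 1024 / (1024 + 512 + 512) = 50%
  247e706a:  512 / (1024 + 512 + 512) = 25%
  586c0c58:  512 / (1024 + 512 + 512) = 25%

Across the host (three equally weighted top-level slices):
  nomad.slice: 1024 / (1024 + 1024 + 1024) ≈ 33%
```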

### Allocating cores

On Linux systems, Nomad supports reserving whole CPU cores specifically for a
task. No task will be allowed to run on a CPU core reserved for another task.

```hcl
task {
  resources {
    cores = 4
  }
}
```

We recommend using `resources.cores` for tasks that require high CPU performance
to give those tasks exclusive access to CPU bandwidth. Sidecar tasks in the same
allocation can use `resources.cpu` to get a proportional share of the remaining
CPU on the node.
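
A minimal sketch of this pattern, using a hypothetical group with a `server`
task and a `log-shipper` sidecar (the task and group names are illustrative):

```hcl
group "app" {
  # Main workload gets exclusive use of two whole cores.
  task "server" {
    resources {
      cores  = 2
      memory = 1024
    }
  }

  # Hypothetical sidecar shares whatever CPU bandwidth remains on the node.
  task "log-shipper" {
    resources {
      cpu    = 200 # mhz
      memory = 128
    }
  }
}
```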

Nomad Enterprise supports NUMA aware scheduling, which enables operators to
more finely control which CPU cores may be reserved for tasks.

### CPU hard limits

Some task drivers support the configuration option `cpu_hard_limit`. If enabled,
this option prevents tasks from bursting above their CPU limit even when there
is idle capacity on the node. The tradeoff is consistency versus utilization: a
task with too few CPU resources may operate fine until another task is placed
on the node, reducing the available CPU bandwidth and potentially disrupting the
underprovisioned task.
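
For example, a sketch using the Docker task driver, which exposes
`cpu_hard_limit` in its task `config` block (the image name is illustrative):

```hcl
task "web" {
  driver = "docker"

  config {
    image          = "nginx:1.27" # example image
    cpu_hard_limit = true
  }

  resources {
    cpu = 2000 # mhz
  }
}
```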

### CPU environment variables

To help tasks understand the resources available to them, Nomad sets the
following environment variables in their runtime environment.

- `NOMAD_CPU_LIMIT` - The amount of CPU bandwidth allocated on behalf of the
  task.
- `NOMAD_CPU_CORES` - The set of cores in [cpuset][] notation reserved for the
  task. This value is only set if `resources.cores` is configured.

```sh
NOMAD_CPU_CORES=3-5
NOMAD_CPU_LIMIT=9000
```

## NUMA

Nomad clients are commonly provisioned on real hardware in an on-premises
environment or in the cloud on large `.metal` instance types. In either case, it
is likely the underlying server is designed around a [non-uniform memory access (NUMA) topology][numa_wiki].
Servers that contain multiple CPU sockets or multiple RAM banks per CPU socket
are characterized by the non-uniform access times involved in accessing system
memory.

[![Diagram of an example NUMA topology](/img/nomad-numa.png)](/img/nomad-numa.png)

The simplified example machine above has the following topology:

- 2 physical CPU sockets
- 4 system memory banks, 2 per socket
- 8 physical CPU cores (4 per socket)
- 2 logical CPU cores per physical core
- 4 PCI devices, 1 per memory bank

### Optimizing performance

Operating system processes take longer to access memory across a NUMA boundary.

Using the example above, if a task is scheduled on Core 0, accessing memory in
Mem 1 might take 20% longer than accessing memory in Mem 0, and accessing memory
in Mem 2 might take 300% longer.

The extreme differences are due to various physical hardware limitations. A core
accessing memory in its own NUMA node is optimal. Programs that perform a high
throughput of reads or writes to and from system memory will have their
performance substantially hindered if they do not optimize their spatial
locality with regard to the system's NUMA topology.

### SLIT tables

Modern machines define System Locality Distance Information (SLIT) tables
in their firmware. These tables are understood and made referenceable by the
Linux kernel. There are two key pieces of information provided by SLIT tables:
- Which CPU cores belong to which NUMA nodes
- The penalty incurred for accessing each NUMA node from a core in every other
  NUMA node

The `lscpu` command describes the core associativity on a machine. For example,
on an `r6a.metal` EC2 instance:

```shell-session
$ lscpu | grep NUMA
NUMA node(s):        4
NUMA node0 CPU(s):   0-23,96-119
NUMA node1 CPU(s):   24-47,120-143
NUMA node2 CPU(s):   48-71,144-167
NUMA node3 CPU(s):   72-95,168-191
```

And the associated performance degradations are available via `numactl`:

```shell-session
$ numactl -H
available: 4 nodes (0-3)
...
node distances:
node   0   1   2   3
  0:  10  12  32  32
  1:  12  10  32  32
  2:  32  32  10  12
  3:  32  32  12  10
```

These SLIT table node distance values are presented as approximate relative
ratios. The value of 10 represents the optimal situation, where a memory access
occurs from a CPU that is part of the same NUMA node. A value of 20 indicates a
200% performance degradation, 30 indicates 300%, and so on.
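
Reading the `numactl` output above as ratios against the local distance of 10:

```
node 0 -> node 1: 12 / 10 = 1.2x the local access time
node 0 -> node 2: 32 / 10 = 3.2x the local access time
node 0 -> node 3: 32 / 10 = 3.2x the local access time
```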

### Node attributes

Nomad clients fingerprint the machine's NUMA topology and export the core
associativity as node attributes. This data gives a Nomad operator a better
understanding of when it might be useful to enable NUMA aware scheduling for
certain workloads.

```
numa.node.count = 4
numa.node0.cores = 0-23,96-119
numa.node1.cores = 24-47,120-143
numa.node2.cores = 48-71,144-167
numa.node3.cores = 72-95,168-191
```
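
These values can be referenced like any other node attribute. As an illustrative
sketch (not part of the original page), a job could constrain placement to
clients that expose more than one NUMA node:

```hcl
constraint {
  attribute = "${attr.numa.node.count}"
  operator  = ">="
  value     = "2"
}
```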

## NUMA aware scheduling <EnterpriseAlert inline />

Nomad Enterprise is capable of scheduling tasks in a way that is optimized for
the NUMA topology of a client node. Nomad is able to correlate CPU cores with
memory nodes and assign tasks to run on specific CPU cores so as to minimize any
cross-memory node access patterns. Additionally, Nomad is able to correlate
devices to memory nodes and enable NUMA-aware scheduling to take device
associativity into account when making scheduling decisions.

A task may specify a `numa` block indicating its NUMA optimization preference.
This example allocates a `1080ti` GPU and ensures it is on the same NUMA node as
the 4 CPU cores reserved for the task.

```hcl
task {
  resources {
    cores  = 4
    memory = 2048

    device "nvidia/gpu/1080ti" {
      count = 1
    }

    numa {
      affinity = "require"
      devices  = [
        "nvidia/gpu/1080ti"
      ]
    }
  }
}
```

### `affinity` options

This is a required field. There are three supported affinity options:
`none`, `prefer`, and `require`, each with its own advantages and tradeoffs.

#### option `none`

In the `none` mode, the Nomad scheduler takes advantage of jobs that have no
NUMA affinity preference to help reduce core fragmentation within NUMA nodes. It
does so by bin-packing the core requests of these jobs onto the NUMA nodes with
the fewest unused cores available.

The `none` mode is the default mode if the `numa` block is not specified.

```hcl
resources {
  cores = 4
  numa {
    affinity = "none"
  }
}
```

#### option `prefer`

In the `prefer` mode, the Nomad scheduler uses the hardware topology of a node
to calculate an optimized selection of available cores, but does not limit
those cores to come from a single NUMA node.

```hcl
resources {
  cores = 4
  numa {
    affinity = "prefer"
  }
}
```

#### option `require`

In the `require` mode, the Nomad scheduler uses the topology of each potential
client to find a set of available CPU cores that belong to the same NUMA node.
If no such set of cores can be found, that node is marked exhausted for the
`numa-cores` resource.

```hcl
resources {
  cores = 4
  numa {
    affinity = "require"
  }
}
```

### `devices` options

`devices` is an optional list of devices that must be colocated on the NUMA node
along with the allocated CPU cores.

The following diagram shows how a set of devices can be correlated to CPU and
memory.

[![Diagram correlating devices to CPU and memory](/img/nomad-devices-correlate-cpu-memory.png)](/img/nomad-devices-correlate-cpu-memory.png)

This example declares three devices and configures two in the `numa` block.

```hcl
task {
  resources {
    cores  = 8
    memory = 16384

    device "nvidia/gpu/H100" {
      count = 2
    }
    device "intel/net/XXVDA2" {
      count = 1
    }
    device "xilinx/fpga/X7" {
      count = 1
    }

    numa {
      affinity = "require"
      devices = [
        "nvidia/gpu/H100",
        "intel/net/XXVDA2"
      ]
    }
  }
}
```

## Virtual CPU fingerprinting

When running on a virtualized host such as Amazon EC2, Nomad makes use of the
`dmidecode` tool to detect CPU performance data. Some Linux distributions
require installing the `dmidecode` package manually.
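
For example, on Debian-based distributions the package can typically be
installed with `apt`:

```shell-session
$ sudo apt-get install dmidecode
```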

[cpuset]: https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v1/cpusets.html
[cpu.share]: https://www.redhat.com/sysadmin/cgroups-part-two
[cpu.weight]: https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#weights
[numa_wiki]: https://en.wikipedia.org/wiki/Non-uniform_memory_access