---
layout: docs
page_title: CPU
description: Learn about how Nomad manages CPU resources.
---

# Modern Processors

Every Nomad node has a Central Processing Unit (CPU) providing the computational
power needed for running operating system processes. Nomad uses the CPU to
run tasks defined by the Nomad job submitter. For Nomad to know which nodes
have sufficient capacity for running a given task, each node in the cluster
is fingerprinted to gather information about the performance characteristics
of its CPU. The two metrics associated with each Nomad node with regard to
CPU performance are its bandwidth (how much it can _compute_) and the number
of cores.

Modern CPUs may contain heterogeneous core types. Apple introduced the M1 CPU
in 2020, which contains both _performance_ (P-Core) and _efficiency_ (E-Core)
types. Each core type operates at a different base frequency. Intel uses a
similar topology in its Raptor Lake chips, released in 2022. When fingerprinting
the characteristics of a CPU, Nomad is capable of taking these advanced CPU
topologies into account.

[![PE Cores](/img/nomad-pe-cores.png)](/img/nomad-pe-cores.png)

## Calculating CPU Resources

The total CPU bandwidth of a Nomad node is the sum, over each core type, of
the frequency of that core type multiplied by the number of cores of that type
in the CPU.

```
bandwidth = (p_cores * p_frequency) + (e_cores * e_frequency)
```

The total number of cores is computed by summing the number of P-Cores and the
number of E-Cores.

```
cores = p_cores + e_cores
```

Nomad does not distinguish between logical and physical CPU cores. One of the
defining differences between the P-Core and E-Core types is that the E-Cores
do not support hyperthreading, whereas P-Cores do. As such, a single physical
P-Core is presented as 2 logical cores, and a single E-Core is presented as 1
logical core.

The example below is from a Nomad node with an Intel i9-13900 CPU.
It is made up of mixed core types, with a P-Core base frequency of 2 GHz and an
E-Core base frequency of 1.5 GHz.

These characteristics are reflected in the `cpu.frequency.performance` and
`cpu.frequency.efficiency` node attributes respectively.

```text
cpu.arch                  = amd64
cpu.frequency.efficiency  = 1500
cpu.frequency.performance = 2000
cpu.modelname             = 13th Gen Intel(R) Core(TM) i9-13900
cpu.numcores              = 32
cpu.numcores.efficiency   = 16
cpu.numcores.performance  = 16
cpu.reservablecores       = 32
cpu.totalcompute          = 56000
cpu.usablecompute         = 56000
```

Applying the bandwidth formula to these attributes gives
`(16 * 2000) + (16 * 1500) = 56000` MHz, which matches `cpu.totalcompute`.

## Reserving CPU Resources

In the fingerprinted node attributes, `cpu.totalcompute` indicates the total
amount of CPU bandwidth the processor is capable of delivering. In some cases it
may be beneficial to reserve some amount of a node's CPU resources for use by the
operating system and other non-Nomad processes. This can be done in client
configuration.

The amount of reserved CPU can be specified in terms of bandwidth in MHz via
`cpu`.

```hcl
client {
  reserved {
    cpu = 3000 # mhz
  }
}
```

Alternatively, reserve a specific set of `cores` on which Nomad tasks must not
be scheduled. This capability is available on Linux systems only.

```hcl
client {
  reserved {
    cores = "0-3"
  }
}
```

When the CPU is constrained by one of the above configurations, the node
attribute `cpu.usablecompute` indicates the total amount of CPU bandwidth
available for scheduling of Nomad tasks.

## Allocating CPU Resources

When scheduling jobs, a Task must specify how much CPU resource should be
allocated on its behalf. This can be done in terms of bandwidth in MHz with the
`cpu` attribute. This MHz value is translated directly into [cpushares][] on
Linux systems.

```hcl
task {
  resources {
    cpu = 2000 # mhz
  }
}
```

Note that the isolation mechanism around CPU resources is dependent on each
task driver and its configuration. The standard behavior is that Nomad ensures
a task has access to _at least_ its allocated CPU bandwidth.
If a node has idle CPU capacity, a task may use additional CPU resources. Some
task drivers enable limiting a task to use only the amount of bandwidth
allocated to the task, described in the [CPU Hard Limits](#cpu-hard-limits)
section below.

On Linux systems, Nomad supports reserving whole CPU cores specifically for a
task. No task will be allowed to run on a CPU core reserved for another task.

```hcl
task {
  resources {
    cores = 4
  }
}
```

Nomad Enterprise supports NUMA aware scheduling, which enables operators to
more finely control which CPU cores may be reserved for tasks.

### CPU Hard Limits

Some task drivers support the configuration option `cpu_hard_limit`. If enabled,
this option restricts tasks from bursting above their CPU limit even when there
is idle capacity on the node. The tradeoff is consistency versus utilization.
A task with too few CPU resources may operate fine until another task is placed
on the node, causing a reduction in available CPU bandwidth, which could cause
disruption for the underprovisioned task.

### CPU Environment Variables

To help tasks understand the resources available to them, Nomad sets the
following environment variables in their runtime environment.

- `NOMAD_CPU_LIMIT` - The amount of CPU bandwidth allocated on behalf of the
  task.
- `NOMAD_CPU_CORES` - The set of cores in [cpuset][] notation reserved for the
  task. This value is only set if `resources.cores` is configured.

```sh
NOMAD_CPU_CORES=3-5
NOMAD_CPU_LIMIT=9000
```

# NUMA

Nomad clients are commonly provisioned on real hardware in an on-premises
environment or in the cloud on large `.metal` instance types. In either case it
is likely the underlying server is designed around a [NUMA topology][numa_wiki].
Servers that contain multiple CPU sockets or multiple RAM banks per CPU socket
are characterized by the non-uniform access times involved in accessing system
memory.

[![NUMA](/img/nomad-numa.png)](/img/nomad-numa.png)

The simplified example machine above has the following topology:

- 2 physical CPU sockets
- 4 system memory banks, 2 per socket
- 8 physical CPU cores (4 per socket)
- 2 logical CPU cores per physical core
- 4 PCI devices, 1 per memory bank

### Optimizing performance

Operating system processes take longer to access memory across a NUMA boundary.

Using the example above, if a task is scheduled on Core 0, accessing memory in
Mem 1 might take 20% longer than accessing memory in Mem 0, and accessing
memory in Mem 2 might take 300% longer.

The extreme differences are due to various physical hardware limitations. A core
accessing memory in its own NUMA node is optimal. Programs which perform a high
throughput of reads or writes to/from system memory will have their performance
substantially hindered by not optimizing their spatial locality with regard to
the system's NUMA topology.

### SLIT tables

Modern machines will define System Locality Distance Information (SLIT) tables
in their firmware. These tables are understood and made referenceable by the
Linux kernel. There are two key pieces of information provided by SLIT tables:

- Which CPU cores belong to which NUMA nodes
- The penalty incurred for accessing each NUMA node from a core in every other
  NUMA node

The `lscpu` command can be used to describe the core associativity on a machine.
For example on an `r6a.metal` EC2 instance:

```shell-session
$ lscpu | grep NUMA
NUMA node(s):          4
NUMA node0 CPU(s):     0-23,96-119
NUMA node1 CPU(s):     24-47,120-143
NUMA node2 CPU(s):     48-71,144-167
NUMA node3 CPU(s):     72-95,168-191
```

And the associated performance degradations are available via `numactl`:

```shell-session
$ numactl -H
available: 4 nodes (0-3)
...
node distances:
node   0   1   2   3
  0:  10  12  32  32
  1:  12  10  32  32
  2:  32  32  10  12
  3:  32  32  12  10
```

These SLIT table "node distance" values are presented as approximate relative
ratios. The value of 10 represents an optimal situation where a memory access
is occurring from a CPU that is part of the same NUMA node. A value of 20
indicates an access roughly twice as costly as a local one, 30 roughly three
times as costly, and so on.

### Node Attributes

Nomad clients will fingerprint the machine's NUMA topology and export the core
associativity as node attributes. This data can give a Nomad operator a better
understanding of when it might be useful to enable NUMA aware scheduling for
certain workloads.

```
numa.node.count  = 4
numa.node0.cores = 0-23,96-119
numa.node1.cores = 24-47,120-143
numa.node2.cores = 48-71,144-167
numa.node3.cores = 72-95,168-191
```

## NUMA aware scheduling

Nomad Enterprise is capable of scheduling tasks in a way that is optimized for
the NUMA topology of a client node. A task may specify a `numa` block indicating
its NUMA optimization preference.

```hcl
task {
  resources {
    cores = 6

    numa {
      affinity = "require"
    }
  }
}
```

### `affinity` Options

There are three supported affinity options: `none`, `prefer`, and `require`,
each with its own advantages and tradeoffs.

#### option `none`

In the `none` mode the Nomad scheduler takes advantage of the fact that these
jobs have no NUMA affinity preference to help reduce core fragmentation within
NUMA nodes. It does so by bin-packing the core requests of these jobs onto the
NUMA nodes with the fewest unused cores available.

The `none` mode is the default mode if the `numa` block is not specified.

```hcl
resources {
  cores = 4

  numa {
    affinity = "none"
  }
}
```

#### option `prefer`

In the `prefer` mode the Nomad scheduler uses the hardware topology of a node
to calculate an optimized selection of available cores, but does not require
those cores to come from a single NUMA node.

```hcl
resources {
  cores = 4

  numa {
    affinity = "prefer"
  }
}
```

#### option `require`

In the `require` mode the Nomad scheduler uses the topology of each potential
client to find a set of available CPU cores that belong to the same NUMA node.
If no such set of cores can be found, that node is marked exhausted for the
resource of `numa-cores`.

```hcl
resources {
  cores = 4

  numa {
    affinity = "require"
  }
}
```

## Virtual CPU Fingerprinting

When running on a virtualized host such as Amazon EC2, Nomad makes use of the
`dmidecode` tool to detect CPU performance data. Some Linux distributions will
require installing the `dmidecode` package manually.

[cpuset]: https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v1/cpusets.html
[cpushares]: https://www.redhat.com/sysadmin/cgroups-part-two
[numa_wiki]: https://en.wikipedia.org/wiki/Non-uniform_memory_access
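
Before starting a Nomad client on a virtualized host, it may help to confirm
that `dmidecode` is actually on the `PATH`. The snippet below is a minimal
sketch of such a check, not something Nomad ships: `have_tool` and
`show_cpu_dmi` are hypothetical helper names introduced here for illustration,
and the sketch makes no claim about which DMI fields Nomad reads.

```shell
#!/bin/sh

# have_tool (hypothetical helper): succeeds if the named command is on PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# show_cpu_dmi (hypothetical helper): prints the DMI processor records.
# "-t processor" limits dmidecode output to processor entries; reading DMI
# data normally requires root privileges.
show_cpu_dmi() {
  dmidecode -t processor
}

if have_tool dmidecode; then
  echo "dmidecode is installed"
else
  echo "dmidecode is missing; install the dmidecode package"
fi
```

If the tool is present, calling `show_cpu_dmi` as root prints the processor
records, which can be compared by hand against the `cpu.*` attributes reported
for the node.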