---
layout: docs
page_title: NVIDIA GPU device plugin
description: The NVIDIA GPU device plugin exposes NVIDIA GPUs to Nomad so you can run workloads on GPU hardware. Install and configure the plugin on your Nomad clients, and configure job tasks to use GPUs.
---

# NVIDIA GPU device plugin

Name: `nomad-device-nvidia`

Use the NVIDIA device plugin to expose NVIDIA graphics processing units (GPUs)
to Nomad. The plugin automatically supports [Multi-Instance GPU (MIG)][mig].

The NVIDIA device plugin uses [NVML] bindings to gather data about the
available NVIDIA devices and then exposes them via the [Fingerprint RPC]. The
plugin detects whether Multi-Instance GPU is enabled on a GPU and, when it is,
fingerprints each instance as an individual GPU. You may exclude GPUs from
fingerprinting by setting the [`ignored_gpu_ids` field](#plugin-configuration).

## Fingerprinted Attributes

| Attribute        | Unit     |
|------------------|----------|
| memory           | MiB      |
| power            | W (Watt) |
| bar1             | MiB      |
| driver_version   | string   |
| cores_clock      | MHz      |
| memory_clock     | MHz      |
| pci_bandwidth    | MB/s     |
| display_state    | string   |
| persistence_mode | string   |

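Fingerprinted attributes can be referenced from a job's `device` block to steer
placement. As a sketch (the memory threshold here is illustrative), a task
could require a minimum amount of GPU memory:

```hcl
resources {
  device "nvidia/gpu" {
    # Only place this task on a GPU with at least 4 GiB of memory
    # (threshold is illustrative).
    constraint {
      attribute = "${device.attr.memory}"
      operator  = ">="
      value     = "4 GiB"
    }
  }
}
```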
## Runtime Environment

The `nomad-device-nvidia` device plugin exposes the following environment
variable to the task:

- `NVIDIA_VISIBLE_DEVICES` - List of NVIDIA GPU IDs available to the task.

### Additional Task Configurations

A task can set additional environment variables to influence the runtime
environment. Refer to [NVIDIA's
documentation](https://github.com/NVIDIA/nvidia-container-runtime#environment-variables-oci-spec).

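For example, a task might narrow the driver capabilities exposed to its
container via the variables described there. This is a sketch; the variable
name comes from NVIDIA's container runtime specification and the value is
illustrative:

```hcl
task "cuda-task" {
  driver = "docker"

  env {
    # Expose only compute and utility (nvidia-smi) capabilities to the
    # container. Illustrative; refer to NVIDIA's documentation for the
    # full list of supported values.
    NVIDIA_DRIVER_CAPABILITIES = "compute,utility"
  }
}
```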
## Installation Requirements

To use the `nomad-device-nvidia` device plugin, the following prerequisites
must be met:

1. GNU/Linux x86_64 with kernel version > 3.10
2. NVIDIA GPU with architecture > Fermi (2.1)
3. NVIDIA drivers >= 340.29 with the `nvidia-smi` binary
4. Docker v19.03+

### Container Toolkit Installation

Follow the [NVIDIA Container Toolkit installation instructions][nvidia_container_toolkit]
from NVIDIA to prepare a machine to run Docker containers with NVIDIA GPUs. You
should then be able to run the following command to test your environment; it
should produce meaningful `nvidia-smi` output.

```shell
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
```

## Plugin Configuration

```hcl
plugin "nomad-device-nvidia" {
  config {
    enabled            = true
    ignored_gpu_ids    = ["GPU-fef8089b", "GPU-ac81e44d"]
    fingerprint_period = "1m"
  }
}
```

The `nomad-device-nvidia` device plugin supports the following configuration in
the agent config:

- `enabled` `(bool: true)` - Controls whether the plugin is enabled and
  running.

- `ignored_gpu_ids` `(array<string>: [])` - Specifies the set of GPU UUIDs to
  ignore when fingerprinting.

- `fingerprint_period` `(string: "1m")` - The period at which to fingerprint
  for device changes.

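Taken together, a minimal client agent configuration enabling the plugin might
look like the following sketch (the `plugin_dir` path is illustrative; refer to
the [`plugin`] and [`plugin_dir`] agent settings):

```hcl
# Client agent configuration. The plugin binary must be present in
# plugin_dir; the path below is illustrative.
plugin_dir = "/opt/nomad/plugins"

plugin "nomad-device-nvidia" {
  config {
    enabled            = true
    fingerprint_period = "1m"
  }
}
```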
## Limitations

The NVIDIA integration only works with task drivers that natively integrate
with NVIDIA's [container runtime
library](https://github.com/NVIDIA/libnvidia-container).

Nomad has tested support with the [`docker` driver][docker-driver].

## Source Code & Compiled Binaries

The source code for this plugin is available at
[hashicorp/nomad-device-nvidia][source]. You can also find pre-built binaries
on the [releases page][nvidia_plugin_download].

## Examples

Inspect a node with a GPU:

```shell-session
$ nomad node status 4d46e59f

// ...TRUNCATED...

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB
```

Display detailed statistics on a node with a GPU:

```shell-session
$ nomad node status -stats 4d46e59f

// ...TRUNCATED...

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

// ...TRUNCATED...

Device Stats
Device = nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]
    BAR1 buffer state   = 2 / 16384 MiB
    Decoder utilization = 0 %
    ECC L1 errors       = 0
    ECC L2 errors       = 0
    ECC memory errors   = 0
    Encoder utilization = 0 %
    GPU utilization     = 0 %
    Memory state        = 0 / 11441 MiB
    Memory utilization  = 0 %
    Power usage         = 37 / 149 W
    Temperature         = 34 C
```

Run the following example job to see that the GPU was mounted in the
container:

```hcl
job "gpu-test" {
  datacenters = ["dc1"]
  type        = "batch"

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image   = "nvidia/cuda:11.0-base"
        command = "nvidia-smi"
      }

      resources {
        device "nvidia/gpu" {
          count = 1

          # Add an affinity for a particular model
          affinity {
            attribute = "${device.model}"
            value     = "Tesla K80"
            weight    = 50
          }
        }
      }
    }
  }
}
```

```shell-session
$ nomad run example.nomad.hcl
==> Monitoring evaluation "21bd7584"
    Evaluation triggered by job "gpu-test"
    Allocation "d250baed" created: node "4d46e59f", group "smi"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "21bd7584" finished with status "complete"

$ nomad alloc status d250baed

// ...TRUNCATED...

Task "smi" is "dead"
Task Resources
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/300 MiB  300 MiB

Device Stats
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Task Events:
Started At     = 2019-01-23T18:25:32Z
Finished At    = 2019-01-23T18:25:34Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2019-01-23T18:25:34Z  Terminated  Exit Code: 0
2019-01-23T18:25:32Z  Started     Task started by client
2019-01-23T18:25:29Z  Task Setup  Building Task Directory
2019-01-23T18:25:29Z  Received    Task received by client

$ nomad alloc logs d250baed
Wed Jan 23 18:25:32 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00004477:00:00.0 Off |                    0 |
| N/A   33C    P8    37W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

[NVML]: https://github.com/NVIDIA/go-nvml
[Fingerprint RPC]: /nomad/docs/concepts/plugins/devices#fingerprint-context-context-chan-fingerprintresponse-error
[mig]: https://www.nvidia.com/en-us/technologies/multi-instance-gpu/
[docker-driver]: /nomad/docs/drivers/docker 'Nomad docker Driver'
[exec-driver]: /nomad/docs/drivers/exec 'Nomad exec Driver'
[java-driver]: /nomad/docs/drivers/java 'Nomad java Driver'
[`plugin`]: /nomad/docs/configuration/plugin
[`plugin_dir`]: /nomad/docs/configuration#plugin_dir
[nvidia_plugin_download]: https://releases.hashicorp.com/nomad-device-nvidia/
[nvidia_container_toolkit]: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
[source]: https://github.com/hashicorp/nomad-device-nvidia