---
layout: docs
page_title: 'Device Plugins: Nvidia'
description: The Nvidia Device Plugin detects and makes Nvidia devices available to tasks.
---

# Nvidia GPU Device Plugin

Name: `nomad-device-nvidia`

Use the NVIDIA device plugin to expose NVIDIA GPUs to Nomad. The plugin
automatically supports [Multi-Instance GPU (MIG)][mig].

The NVIDIA device plugin uses [NVML] bindings to query the available NVIDIA
devices and then exposes them to Nomad via the [Fingerprint RPC]. The plugin
detects whether a GPU has Multi-Instance GPU enabled and, when it is,
fingerprints each instance as an individual GPU. You may exclude GPUs from
fingerprinting by setting the [`ignored_gpu_ids`
field](#plugin-configuration).

## Fingerprinted Attributes

| Attribute        | Unit     |
| ---------------- | -------- |
| memory           | MiB      |
| power            | W (Watt) |
| bar1             | MiB      |
| driver_version   | string   |
| cores_clock      | MHz      |
| memory_clock     | MHz      |
| pci_bandwidth    | MB/s     |
| display_state    | string   |
| persistence_mode | string   |

## Runtime Environment

The `nomad-device-nvidia` device plugin exposes the following environment
variables:

- `NVIDIA_VISIBLE_DEVICES` - List of NVIDIA GPU IDs available to the task.

### Additional Task Configurations

Tasks may set additional environment variables to influence the runtime
environment. See [NVIDIA's
documentation](https://github.com/NVIDIA/nvidia-container-runtime#environment-variables-oci-spec).

## Installation Requirements

To use the `nomad-device-nvidia` device plugin, the following prerequisites
must be met:

1. GNU/Linux x86_64 with kernel version > 3.10
2. NVIDIA GPU with an architecture newer than Fermi (2.1)
3. NVIDIA drivers >= 340.29 with the `nvidia-smi` binary
4. Docker v19.03+

### Container Toolkit Installation

Follow the [NVIDIA Container Toolkit installation
instructions][nvidia_container_toolkit] from NVIDIA to prepare a machine to
run Docker containers with NVIDIA GPUs. You should then be able to run the
following command and get meaningful `nvidia-smi` output:

```shell
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
```

## Plugin Configuration

```hcl
plugin "nomad-device-nvidia" {
  config {
    enabled            = true
    ignored_gpu_ids    = ["GPU-fef8089b", "GPU-ac81e44d"]
    fingerprint_period = "1m"
  }
}
```

The `nomad-device-nvidia` device plugin supports the following configuration
options in the agent config:

- `enabled` `(bool: true)` - Controls whether the plugin is enabled and
  running.

- `ignored_gpu_ids` `(array: [])` - Specifies the set of GPU UUIDs to ignore
  when fingerprinting.

- `fingerprint_period` `(string: "1m")` - The period at which to fingerprint
  for device changes.

## Limitations

The NVIDIA integration only works with task drivers that natively integrate
with NVIDIA's [container runtime
library](https://github.com/NVIDIA/libnvidia-container). Nomad has tested
support with the [`docker` driver][docker-driver].

## Source Code & Compiled Binaries

The source code for this plugin lives in the
[hashicorp/nomad-device-nvidia][source] repository. You can also find
pre-built binaries on the [releases page][nvidia_plugin_download].

## Examples

Inspect a node with a GPU:

```shell-session
$ nomad node status 4d46e59f
// ...TRUNCATED...
Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB
```

Display detailed statistics on a node with a GPU:

```shell-session
$ nomad node status -stats 4d46e59f
// ...TRUNCATED...
Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB
// ...TRUNCATED...
Device Stats
Device              = nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]
BAR1 buffer state   = 2 / 16384 MiB
Decoder utilization = 0 %
ECC L1 errors       = 0
ECC L2 errors       = 0
ECC memory errors   = 0
Encoder utilization = 0 %
GPU utilization     = 0 %
Memory state        = 0 / 11441 MiB
Memory utilization  = 0 %
Power usage         = 37 / 149 W
Temperature         = 34 C
```
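Scheduling can also filter on the fingerprinted attributes listed earlier,
which Nomad exposes to constraints and affinities as `${device.attr.<name>}`.
The following is a minimal sketch of a `device` block that requires a minimum
amount of GPU memory; the 10 GiB threshold is only an illustrative value:

```hcl
device "nvidia/gpu" {
  count = 1

  # Only place this task on GPUs that fingerprinted with
  # at least 10 GiB of memory.
  constraint {
    attribute = "${device.attr.memory}"
    operator  = ">="
    value     = "10 GiB"
  }
}
```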
Run the following example job to see that the GPU was mounted in the
container:

```hcl
job "gpu-test" {
  datacenters = ["dc1"]
  type        = "batch"

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image   = "nvidia/cuda:11.0-base"
        command = "nvidia-smi"
      }

      resources {
        device "nvidia/gpu" {
          count = 1

          # Add an affinity for a particular model
          affinity {
            attribute = "${device.model}"
            value     = "Tesla K80"
            weight    = 50
          }
        }
      }
    }
  }
}
```

```shell-session
$ nomad run example.nomad.hcl
==> Monitoring evaluation "21bd7584"
    Evaluation triggered by job "gpu-test"
    Allocation "d250baed" created: node "4d46e59f", group "smi"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "21bd7584" finished with status "complete"

$ nomad alloc status d250baed
// ...TRUNCATED...
Task "smi" is "dead"

Task Resources
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/300 MiB  300 MiB

Device Stats
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Task Events:
Started At     = 2019-01-23T18:25:32Z
Finished At    = 2019-01-23T18:25:34Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2019-01-23T18:25:34Z  Terminated  Exit Code: 0
2019-01-23T18:25:32Z  Started     Task started by client
2019-01-23T18:25:29Z  Task Setup  Building Task Directory
2019-01-23T18:25:29Z  Received    Task received by client

$ nomad alloc logs d250baed
Wed Jan 23 18:25:32 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00004477:00:00.0 Off |                    0 |
| N/A   33C    P8    37W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

[NVML]: https://github.com/NVIDIA/go-nvml
[Fingerprint RPC]: /nomad/docs/concepts/plugins/devices#fingerprint-context-context-chan-fingerprintresponse-error
[mig]: https://www.nvidia.com/en-us/technologies/multi-instance-gpu/
[docker-driver]: /nomad/docs/drivers/docker 'Nomad docker Driver'
[exec-driver]: /nomad/docs/drivers/exec 'Nomad exec Driver'
[java-driver]: /nomad/docs/drivers/java 'Nomad java Driver'
[`plugin`]: /nomad/docs/configuration/plugin
[`plugin_dir`]: /nomad/docs/configuration#plugin_dir
[nvidia_plugin_download]: https://releases.hashicorp.com/nomad-device-nvidia/
[nvidia_container_toolkit]: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
[source]: https://github.com/hashicorp/nomad-device-nvidia
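Finally, as noted under [Additional Task
Configurations](#additional-task-configurations), a task can steer the NVIDIA
container runtime through ordinary environment variables. A minimal sketch,
assuming the `NVIDIA_DRIVER_CAPABILITIES` variable described in NVIDIA's
documentation; treat the value as an illustration rather than a
recommendation:

```hcl
task "smi" {
  driver = "docker"

  config {
    image   = "nvidia/cuda:11.0-base"
    command = "nvidia-smi"
  }

  # NVIDIA_DRIVER_CAPABILITIES selects which driver libraries the
  # container runtime mounts; "utility" is sufficient for nvidia-smi.
  env {
    NVIDIA_DRIVER_CAPABILITIES = "utility"
  }

  resources {
    device "nvidia/gpu" {}
  }
}
```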