---
layout: docs
page_title: 'Device Plugins: Nvidia'
description: The Nvidia Device Plugin detects and makes Nvidia devices available to tasks.
---

# Nvidia GPU Device Plugin

Name: `nomad-device-nvidia`

The Nvidia device plugin is used to expose Nvidia GPUs to Nomad.

~> **Note**: The Nvidia device plugin setup has changed in Nomad 1.2. You must
add a [`plugin`] block to your clients' configuration and install the
[external Nvidia device plugin][nvidia_plugin_download] into their
[`plugin_dir`] prior to upgrading. See the plugin options below for an
example. The job specification remains the same.

## Fingerprinted Attributes
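The plugin reports the following attributes for each detected device. These
attributes can be referenced in job constraints and affinities, as shown in
the sketch after the table.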
| Attribute          | Unit      |
| ------------------ | --------- |
| `memory`           | MiB       |
| `power`            | W (Watt)  |
| `bar1`             | MiB       |
| `driver_version`   | string    |
| `cores_clock`      | MHz       |
| `memory_clock`     | MHz       |
| `pci_bandwidth`    | MB/s      |
| `display_state`    | string    |
| `persistence_mode` | string    |
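For example, a job can use these attributes in a `device` block to steer
placement onto particular GPUs. The following is a sketch only; the memory
thresholds and affinity weight are illustrative values, not recommendations:

```hcl
resources {
  device "nvidia/gpu" {
    count = 1

    # Require GPUs with at least 4 GiB of memory (illustrative threshold)
    constraint {
      attribute = "${device.attr.memory}"
      operator  = ">="
      value     = "4 GiB"
    }

    # Prefer GPUs with more memory when several are eligible (illustrative weight)
    affinity {
      attribute = "${device.attr.memory}"
      operator  = ">="
      value     = "8 GiB"
      weight    = 50
    }
  }
}
```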
## Runtime Environment

The `nomad-device-nvidia` device plugin exposes the following environment variables:

- `NVIDIA_VISIBLE_DEVICES` - List of Nvidia GPU IDs available to the task.

### Additional Task Configurations

Additional environment variables can be set by the task to influence the runtime environment. See [Nvidia's documentation](https://github.com/NVIDIA/nvidia-container-runtime#environment-variables-oci-spec).

## Installation Requirements

To use the `nomad-device-nvidia` device plugin, the following prerequisites must be met:

1. GNU/Linux x86_64 with kernel version > 3.10
2. NVIDIA GPU with Architecture > Fermi (2.1)
3. NVIDIA drivers >= 340.29 with binary `nvidia-smi`
4. Docker v19.03+

### Container Toolkit Installation

Follow the [NVIDIA Container Toolkit installation instructions][nvidia_container_toolkit] from Nvidia to prepare a machine to run Docker containers with Nvidia GPUs. You should be able to run the following command to test your environment and produce meaningful output.

```shell
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
```

## Plugin Configuration

```hcl
plugin "nomad-device-nvidia" {
  config {
    enabled            = true
    ignored_gpu_ids    = ["GPU-fef8089b", "GPU-ac81e44d"]
    fingerprint_period = "1m"
  }
}
```

The `nomad-device-nvidia` device plugin supports the following configuration options in the agent config:

- `enabled` `(bool: true)` - Controls whether the plugin should be enabled and running.

- `ignored_gpu_ids` `(array: [])` - Specifies the set of GPU UUIDs that should be ignored when fingerprinting.

- `fingerprint_period` `(string: "1m")` - The period in which to fingerprint for device changes.

## Limitations

The Nvidia integration only works with drivers that natively integrate with Nvidia's [container runtime library](https://github.com/NVIDIA/libnvidia-container).

Nomad has tested support with the [`docker` driver][docker-driver]. Support for [`lxc`][lxc-driver] should be possible by installing the [Nvidia hook][nvidia_hook] but is not tested or documented by Nomad.

## Source Code & Compiled Binaries

The source code for this plugin can be found at [hashicorp/nomad-device-nvidia][source]. You can also find pre-built binaries on the [releases page][nvidia_plugin_download].
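To use a pre-built binary, place it in the client's [`plugin_dir`] and enable it with the `plugin` block described above. The following is a minimal sketch of the relevant client agent settings; the directory path is illustrative:

```hcl
# Client agent configuration (the plugin_dir path is illustrative)
plugin_dir = "/opt/nomad/plugins"

plugin "nomad-device-nvidia" {
  config {
    enabled = true
  }
}
```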
## Examples

Inspect a node with a GPU:

```shell-session
$ nomad node status 4d46e59f
// ...TRUNCATED...

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB
```

Display detailed statistics on a node with a GPU:

```shell-session
$ nomad node status -stats 4d46e59f
// ...TRUNCATED...

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

// ...TRUNCATED...

Device Stats
Device = nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]
BAR1 buffer state   = 2 / 16384 MiB
Decoder utilization = 0 %
ECC L1 errors       = 0
ECC L2 errors       = 0
ECC memory errors   = 0
Encoder utilization = 0 %
GPU utilization     = 0 %
Memory state        = 0 / 11441 MiB
Memory utilization  = 0 %
Power usage         = 37 / 149 W
Temperature         = 34 C
```

Run the following example job to see that the GPU was mounted in the container:

```hcl
job "gpu-test" {
  datacenters = ["dc1"]
  type        = "batch"

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image   = "nvidia/cuda:11.0-base"
        command = "nvidia-smi"
      }

      resources {
        device "nvidia/gpu" {
          count = 1

          # Add an affinity for a particular model
          affinity {
            attribute = "${device.model}"
            value     = "Tesla K80"
            weight    = 50
          }
        }
      }
    }
  }
}
```

```shell-session
$ nomad run example.nomad.hcl
==> Monitoring evaluation "21bd7584"
    Evaluation triggered by job "gpu-test"
    Allocation "d250baed" created: node "4d46e59f", group "smi"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "21bd7584" finished with status "complete"

$ nomad alloc status d250baed
// ...TRUNCATED...

Task "smi" is "dead"
Task Resources
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/300 MiB  300 MiB

Device Stats
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Task Events:
Started At     = 2019-01-23T18:25:32Z
Finished At    = 2019-01-23T18:25:34Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2019-01-23T18:25:34Z  Terminated  Exit Code: 0
2019-01-23T18:25:32Z  Started     Task started by client
2019-01-23T18:25:29Z  Task Setup  Building Task Directory
2019-01-23T18:25:29Z  Received    Task received by client

$ nomad alloc logs d250baed
Wed Jan 23 18:25:32 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00004477:00:00.0 Off |                    0 |
| N/A   33C    P8    37W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

[docker-driver]: /nomad/docs/drivers/docker 'Nomad docker Driver'
[exec-driver]: /nomad/docs/drivers/exec 'Nomad exec Driver'
[java-driver]: /nomad/docs/drivers/java 'Nomad java Driver'
[lxc-driver]: /nomad/plugins/drivers/community/lxc 'Nomad lxc Driver'
[`plugin`]: /nomad/docs/configuration/plugin
[`plugin_dir`]: /nomad/docs/configuration#plugin_dir
[nvidia_hook]: https://github.com/lxc/lxc/blob/master/hooks/nvidia
[nvidia_plugin_download]: https://releases.hashicorp.com/nomad-device-nvidia/
[nvidia_container_toolkit]: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
[source]: https://github.com/hashicorp/nomad-device-nvidia