---
layout: docs
page_title: 'Device Plugins: Nvidia'
description: The Nvidia Device Plugin detects and makes Nvidia devices available to tasks.
---

# Nvidia GPU Device Plugin

Name: `nomad-device-nvidia`

Use the NVIDIA device plugin to expose NVIDIA GPUs to Nomad. The plugin
automatically supports [Multi-Instance GPU (MIG)][mig].

The NVIDIA device plugin uses [NVML] bindings to query the available NVIDIA
devices and then exposes them to Nomad via the [Fingerprint RPC]. The plugin
detects whether a GPU has Multi-Instance GPU enabled and, when it is,
fingerprints each instance as an individual GPU. You may exclude GPUs from
fingerprinting by setting the [`ignored_gpu_ids`
field](#plugin-configuration).

## Fingerprinted Attributes

| Attribute        | Unit     |
| ---------------- | -------- |
| memory           | MiB      |
| power            | W (Watt) |
| bar1             | MiB      |
| driver_version   | string   |
| cores_clock      | MHz      |
| memory_clock     | MHz      |
| pci_bandwidth    | MB/s     |
| display_state    | string   |
| persistence_mode | string   |

## Runtime Environment

The `nomad-device-nvidia` device plugin exposes the following environment
variables:

- `NVIDIA_VISIBLE_DEVICES` - List of NVIDIA GPU IDs available to the task.

### Additional Task Configurations

Tasks may set additional environment variables to influence the runtime
environment. See [NVIDIA's
documentation](https://github.com/NVIDIA/nvidia-container-runtime#environment-variables-oci-spec).

## Installation Requirements

To use the `nomad-device-nvidia` device plugin, the following prerequisites
must be met:

1. GNU/Linux x86_64 with kernel version > 3.10
2. NVIDIA GPU with an architecture newer than Fermi (2.1)
3. NVIDIA drivers >= 340.29 with the `nvidia-smi` binary
4. Docker v19.03+

### Container Toolkit Installation

Follow the [NVIDIA Container Toolkit installation
instructions][nvidia_container_toolkit] from NVIDIA to prepare a machine to
run Docker containers with NVIDIA GPUs. You should then be able to run the
following command and get meaningful `nvidia-smi` output:

```shell
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
```

## Plugin Configuration

```hcl
plugin "nomad-device-nvidia" {
  config {
    enabled            = true
    ignored_gpu_ids    = ["GPU-fef8089b", "GPU-ac81e44d"]
    fingerprint_period = "1m"
  }
}
```

The `nomad-device-nvidia` device plugin supports the following configuration
options in the agent config:

- `enabled` `(bool: true)` - Controls whether the plugin is enabled and
  running.

- `ignored_gpu_ids` `(array: [])` - Specifies the set of GPU UUIDs to ignore
  when fingerprinting.

- `fingerprint_period` `(string: "1m")` - The period at which to fingerprint
  for device changes.

## Limitations

The NVIDIA integration only works with task drivers that natively integrate
with NVIDIA's [container runtime
library](https://github.com/NVIDIA/libnvidia-container). Nomad has tested
support with the [`docker` driver][docker-driver].

## Source Code & Compiled Binaries

The source code for this plugin lives in the
[hashicorp/nomad-device-nvidia][source] repository. You can also find
pre-built binaries on the [releases page][nvidia_plugin_download].

## Examples

Inspect a node with a GPU:

```shell-session
$ nomad node status 4d46e59f
// ...TRUNCATED...
Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB
```

Display detailed statistics on a node with a GPU:

```shell-session
$ nomad node status -stats 4d46e59f
// ...TRUNCATED...
Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB
// ...TRUNCATED...
Device Stats
Device              = nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]
BAR1 buffer state   = 2 / 16384 MiB
Decoder utilization = 0 %
ECC L1 errors       = 0
ECC L2 errors       = 0
ECC memory errors   = 0
Encoder utilization = 0 %
GPU utilization     = 0 %
Memory state        = 0 / 11441 MiB
Memory utilization  = 0 %
Power usage         = 37 / 149 W
Temperature         = 34 C
```
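Scheduling can also filter on the fingerprinted attributes listed earlier,
which Nomad exposes to constraints and affinities as `${device.attr.<name>}`.
The following is a minimal sketch of a `device` block that requires a minimum
amount of GPU memory; the 10 GiB threshold is only an illustrative value:

```hcl
device "nvidia/gpu" {
  count = 1

  # Only place this task on GPUs that fingerprinted with
  # at least 10 GiB of memory.
  constraint {
    attribute = "${device.attr.memory}"
    operator  = ">="
    value     = "10 GiB"
  }
}
```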
Run the following example job to see that the GPU was mounted in the
container:

```hcl
job "gpu-test" {
  datacenters = ["dc1"]
  type        = "batch"

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image   = "nvidia/cuda:11.0-base"
        command = "nvidia-smi"
      }

      resources {
        device "nvidia/gpu" {
          count = 1

          # Add an affinity for a particular model
          affinity {
            attribute = "${device.model}"
            value     = "Tesla K80"
            weight    = 50
          }
        }
      }
    }
  }
}
```

```shell-session
$ nomad run example.nomad.hcl
==> Monitoring evaluation "21bd7584"
    Evaluation triggered by job "gpu-test"
    Allocation "d250baed" created: node "4d46e59f", group "smi"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "21bd7584" finished with status "complete"

$ nomad alloc status d250baed
// ...TRUNCATED...
Task "smi" is "dead"

Task Resources
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/300 MiB  300 MiB

Device Stats
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Task Events:
Started At     = 2019-01-23T18:25:32Z
Finished At    = 2019-01-23T18:25:34Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2019-01-23T18:25:34Z  Terminated  Exit Code: 0
2019-01-23T18:25:32Z  Started     Task started by client
2019-01-23T18:25:29Z  Task Setup  Building Task Directory
2019-01-23T18:25:29Z  Received    Task received by client

$ nomad alloc logs d250baed
Wed Jan 23 18:25:32 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00004477:00:00.0 Off |                    0 |
| N/A   33C    P8    37W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

[NVML]: https://github.com/NVIDIA/go-nvml
[Fingerprint RPC]: /nomad/docs/concepts/plugins/devices#fingerprint-context-context-chan-fingerprintresponse-error
[mig]: https://www.nvidia.com/en-us/technologies/multi-instance-gpu/
[docker-driver]: /nomad/docs/drivers/docker 'Nomad docker Driver'
[exec-driver]: /nomad/docs/drivers/exec 'Nomad exec Driver'
[java-driver]: /nomad/docs/drivers/java 'Nomad java Driver'
[`plugin`]: /nomad/docs/configuration/plugin
[`plugin_dir`]: /nomad/docs/configuration#plugin_dir
[nvidia_plugin_download]: https://releases.hashicorp.com/nomad-device-nvidia/
[nvidia_container_toolkit]: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
[source]: https://github.com/hashicorp/nomad-device-nvidia
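Finally, as noted under [Additional Task
Configurations](#additional-task-configurations), a task can steer the NVIDIA
container runtime through ordinary environment variables. A minimal sketch,
assuming the `NVIDIA_DRIVER_CAPABILITIES` variable described in NVIDIA's
documentation; treat the value as an illustration rather than a
recommendation:

```hcl
task "smi" {
  driver = "docker"

  config {
    image   = "nvidia/cuda:11.0-base"
    command = "nvidia-smi"
  }

  # NVIDIA_DRIVER_CAPABILITIES selects which driver libraries the
  # container runtime mounts; "utility" is sufficient for nvidia-smi.
  env {
    NVIDIA_DRIVER_CAPABILITIES = "utility"
  }

  resources {
    device "nvidia/gpu" {}
  }
}
```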