---
layout: docs
page_title: NVIDIA GPU device plugin
description: The NVIDIA GPU device plugin exposes NVIDIA GPUs to Nomad so you can run workloads on GPU hardware. Install and configure the plugin on your Nomad clients, and configure job tasks to use GPUs.
---

# NVIDIA GPU device plugin

Name: `nomad-device-nvidia`

Use the NVIDIA device plugin to expose NVIDIA graphics processing units (GPUs)
to Nomad. The plugin automatically supports [Multi-Instance GPU (MIG)][mig].

The NVIDIA device plugin uses [NVML] bindings to gather data about the
available NVIDIA devices and then exposes them via the [Fingerprint RPC]. The
plugin detects whether Multi-Instance GPU is enabled on a GPU and, when it is,
fingerprints each instance as an individual GPU. You may exclude GPUs from
fingerprinting by setting the [`ignored_gpu_ids` field](#plugin-configuration).

## Fingerprinted Attributes

| Attribute        | Unit     |
|------------------|----------|
| memory           | MiB      |
| power            | W (Watt) |
| bar1             | MiB      |
| driver_version   | string   |
| cores_clock      | MHz      |
| memory_clock     | MHz      |
| pci_bandwidth    | MB/s     |
| display_state    | string   |
| persistence_mode | string   |

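Fingerprinted attributes can be referenced from a job's `device` block to steer
placement. As a sketch (the memory threshold here is illustrative), a task
could require a minimum amount of GPU memory:

```hcl
resources {
  device "nvidia/gpu" {
    # Only place this task on a GPU with at least 4 GiB of memory
    # (threshold is illustrative).
    constraint {
      attribute = "${device.attr.memory}"
      operator  = ">="
      value     = "4 GiB"
    }
  }
}
```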
## Runtime Environment

The `nomad-device-nvidia` device plugin exposes the following environment
variable to the task:

- `NVIDIA_VISIBLE_DEVICES` - List of NVIDIA GPU IDs available to the task.

### Additional Task Configurations

A task can set additional environment variables to influence the runtime
environment. Refer to [NVIDIA's
documentation](https://github.com/NVIDIA/nvidia-container-runtime#environment-variables-oci-spec).

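For example, a task might narrow the driver capabilities exposed to its
container via the variables described there. This is a sketch; the variable
name comes from NVIDIA's container runtime specification and the value is
illustrative:

```hcl
task "cuda-task" {
  driver = "docker"

  env {
    # Expose only compute and utility (nvidia-smi) capabilities to the
    # container. Illustrative; refer to NVIDIA's documentation for the
    # full list of supported values.
    NVIDIA_DRIVER_CAPABILITIES = "compute,utility"
  }
}
```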
## Installation Requirements

To use the `nomad-device-nvidia` device plugin, the following prerequisites
must be met:

1. GNU/Linux x86_64 with kernel version > 3.10
2. NVIDIA GPU with architecture > Fermi (2.1)
3. NVIDIA drivers >= 340.29 with the `nvidia-smi` binary
4. Docker v19.03+

### Container Toolkit Installation

Follow the [NVIDIA Container Toolkit installation instructions][nvidia_container_toolkit]
from NVIDIA to prepare a machine to run Docker containers with NVIDIA GPUs. You
should then be able to run the following command to test your environment; it
should produce meaningful `nvidia-smi` output.

```shell
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
```

## Plugin Configuration

```hcl
plugin "nomad-device-nvidia" {
  config {
    enabled            = true
    ignored_gpu_ids    = ["GPU-fef8089b", "GPU-ac81e44d"]
    fingerprint_period = "1m"
  }
}
```

The `nomad-device-nvidia` device plugin supports the following configuration in
the agent config:

- `enabled` `(bool: true)` - Controls whether the plugin is enabled and
  running.

- `ignored_gpu_ids` `(array<string>: [])` - Specifies the set of GPU UUIDs to
  ignore when fingerprinting.

- `fingerprint_period` `(string: "1m")` - The period at which to fingerprint
  for device changes.

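Taken together, a minimal client agent configuration enabling the plugin might
look like the following sketch (the `plugin_dir` path is illustrative; refer to
the [`plugin`] and [`plugin_dir`] agent settings):

```hcl
# Client agent configuration. The plugin binary must be present in
# plugin_dir; the path below is illustrative.
plugin_dir = "/opt/nomad/plugins"

plugin "nomad-device-nvidia" {
  config {
    enabled            = true
    fingerprint_period = "1m"
  }
}
```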
## Limitations

The NVIDIA integration only works with task drivers that natively integrate
with NVIDIA's [container runtime
library](https://github.com/NVIDIA/libnvidia-container).

Nomad has tested support with the [`docker` driver][docker-driver].

## Source Code & Compiled Binaries

The source code for this plugin is available at
[hashicorp/nomad-device-nvidia][source]. You can also find pre-built binaries
on the [releases page][nvidia_plugin_download].

## Examples

Inspect a node with a GPU:

```shell-session
$ nomad node status 4d46e59f

// ...TRUNCATED...

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB
```

Display detailed statistics on a node with a GPU:

```shell-session
$ nomad node status -stats 4d46e59f

// ...TRUNCATED...

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

// ...TRUNCATED...

Device Stats
Device = nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]
    BAR1 buffer state   = 2 / 16384 MiB
    Decoder utilization = 0 %
    ECC L1 errors       = 0
    ECC L2 errors       = 0
    ECC memory errors   = 0
    Encoder utilization = 0 %
    GPU utilization     = 0 %
    Memory state        = 0 / 11441 MiB
    Memory utilization  = 0 %
    Power usage         = 37 / 149 W
    Temperature         = 34 C
```

Run the following example job to see that the GPU was mounted in the
container:

```hcl
job "gpu-test" {
  datacenters = ["dc1"]
  type        = "batch"

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image   = "nvidia/cuda:11.0-base"
        command = "nvidia-smi"
      }

      resources {
        device "nvidia/gpu" {
          count = 1

          # Add an affinity for a particular model
          affinity {
            attribute = "${device.model}"
            value     = "Tesla K80"
            weight    = 50
          }
        }
      }
    }
  }
}
```

```shell-session
$ nomad run example.nomad.hcl
==> Monitoring evaluation "21bd7584"
    Evaluation triggered by job "gpu-test"
    Allocation "d250baed" created: node "4d46e59f", group "smi"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "21bd7584" finished with status "complete"

$ nomad alloc status d250baed

// ...TRUNCATED...

Task "smi" is "dead"
Task Resources
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/300 MiB  300 MiB

Device Stats
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Task Events:
Started At     = 2019-01-23T18:25:32Z
Finished At    = 2019-01-23T18:25:34Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2019-01-23T18:25:34Z  Terminated  Exit Code: 0
2019-01-23T18:25:32Z  Started     Task started by client
2019-01-23T18:25:29Z  Task Setup  Building Task Directory
2019-01-23T18:25:29Z  Received    Task received by client

$ nomad alloc logs d250baed
Wed Jan 23 18:25:32 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00004477:00:00.0 Off |                    0 |
| N/A   33C    P8    37W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

[NVML]: https://github.com/NVIDIA/go-nvml
[Fingerprint RPC]: /nomad/docs/concepts/plugins/devices#fingerprint-context-context-chan-fingerprintresponse-error
[mig]: https://www.nvidia.com/en-us/technologies/multi-instance-gpu/
[docker-driver]: /nomad/docs/drivers/docker 'Nomad docker Driver'
[exec-driver]: /nomad/docs/drivers/exec 'Nomad exec Driver'
[java-driver]: /nomad/docs/drivers/java 'Nomad java Driver'
[`plugin`]: /nomad/docs/configuration/plugin
[`plugin_dir`]: /nomad/docs/configuration#plugin_dir
[nvidia_plugin_download]: https://releases.hashicorp.com/nomad-device-nvidia/
[nvidia_container_toolkit]: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
[source]: https://github.com/hashicorp/nomad-device-nvidia