docs: update nvidia driver documentation

notably:
- name of the compiled binary is 'nomad-device-nvidia', not 'nvidia-gpu'
- link to Nvidia docs for installing the container runtime toolkit
- list docker v19.03 as minimum version, to track with nvidia's new container runtime toolkit
Seth Hoenig
2022-05-02 09:11:05 -05:00
parent dfda28daab
commit d352ab25c4


@@ -6,7 +6,7 @@ description: The Nvidia Device Plugin detects and makes Nvidia devices available
# Nvidia GPU Device Plugin
-Name: `nvidia-gpu`
+Name: `nomad-device-nvidia`
The Nvidia device plugin is used to expose Nvidia GPUs to Nomad.
@@ -97,23 +97,29 @@ documentation](https://github.com/NVIDIA/nvidia-container-runtime#environment-va
## Installation Requirements
-In order to use the `nvidia-gpu` the following prerequisites must be met:
+In order to use the `nomad-device-nvidia` device driver the following prerequisites must be met:
1. GNU/Linux x86_64 with kernel version > 3.10
2. NVIDIA GPU with Architecture > Fermi (2.1)
3. NVIDIA drivers >= 340.29 with binary `nvidia-smi`
+4. Docker v19.03+
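Before involving Docker at all, it can be worth confirming that the host driver
and `nvidia-smi` binary from prerequisite 3 are working. A minimal check, as a
sketch (the query flags are standard `nvidia-smi` options):

```shell-session
$ # Confirm the kernel driver and nvidia-smi binary are installed and working
$ nvidia-smi --query-gpu=name,driver_version --format=csv
```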
-### Docker Driver Requirements
+### Container Toolkit Installation
+Follow the [NVIDIA Container Toolkit installation instructions][nvidia_container_toolkit]
+from Nvidia to prepare a machine to use docker containers with Nvidia GPUs. You should
+be able to run this simple command to test your environment and produce meaningful
+output.
+
+```shell
+docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
+```
-The Nvidia driver plugin currently only supports the older v1.0 version of the
-Docker driver provided by Nvidia. In order to use the Nvidia driver plugin with
-the Docker driver, please follow the installation instructions for
-[`nvidia-container-runtime`](https://github.com/nvidia/nvidia-container-runtime#installation).
## Plugin Configuration
```hcl
plugin "nvidia-gpu" {
plugin "nomad-device-nvidia" {
config {
enabled = true
ignored_gpu_ids = ["GPU-fef8089b", "GPU-ac81e44d"]
@@ -122,7 +128,7 @@ plugin "nvidia-gpu" {
}
```
-The `nvidia-gpu` device plugin supports the following configuration in the agent
+The `nomad-device-nvidia` device plugin supports the following configuration in the agent
config:
- `enabled` `(bool: true)` - Control whether the plugin should be enabled and running.
@@ -133,17 +139,20 @@ config:
- `fingerprint_period` `(string: "1m")` - The period in which to fingerprint for
device changes.
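Putting the pieces together: the plugin binary is loaded from the agent's
[`plugin_dir`] and configured with a [`plugin`] block. A minimal agent
configuration sketch (the directory path is an illustrative assumption):

```hcl
# agent.hcl -- assumes the nomad-device-nvidia binary has been copied
# into the plugin directory below
plugin_dir = "/opt/nomad/plugins"

plugin "nomad-device-nvidia" {
  config {
    enabled            = true
    fingerprint_period = "1m"
  }
}
```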
-## Restrictions
+## Limitations
The Nvidia integration only works with task drivers that natively integrate with
Nvidia's [container runtime
library](https://github.com/NVIDIA/libnvidia-container).
-Nomad has tested support with the [`docker` driver][docker-driver] and plans to
-bring support to the built-in [`exec`][exec-driver] and [`java`][java-driver]
-drivers. Support for [`lxc`][lxc-driver] should be possible by installing the
-[Nvidia hook](https://github.com/lxc/lxc/blob/master/hooks/nvidia) but is not
-tested or documented by Nomad.
+Nomad has tested support with the [`docker` driver][docker-driver]. Support for
+[`lxc`][lxc-driver] should be possible by installing the [Nvidia hook][nvidia_hook]
+but is not tested or documented by Nomad.
+
+## Source Code & Compiled Binaries
+
+The source code for this plugin can be found at [hashicorp/nomad-device-nvidia][source]. You
+can also find pre-built binaries on the [releases page][nvidia_plugin_download].
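For example, to fetch a release and install it into the agent's plugin
directory (the version below is a placeholder; the URL follows the usual
releases.hashicorp.com layout, so check the releases page for current
versions):

```shell-session
$ # 1.0.0 is a placeholder; substitute the latest release for your platform
$ curl -fsSL -o nomad-device-nvidia.zip \
    https://releases.hashicorp.com/nomad-device-nvidia/1.0.0/nomad-device-nvidia_1.0.0_linux_amd64.zip
$ unzip nomad-device-nvidia.zip -d /opt/nomad/plugins
```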
## Examples
@@ -151,68 +160,19 @@ Inspect a node with a GPU:
```shell-session
$ nomad node status 4d46e59f
ID             = 4d46e59f
Name           = nomad
Class          = <none>
DC             = dc1
Drain          = false
Eligibility    = eligible
Status         = ready
Uptime         = 19m43s
Driver Status  = docker,mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered

Allocated Resources
CPU           Memory      Disk
0/15576 MHz   0 B/55 GiB  0 B/28 GiB

Allocation Resource Utilization
CPU           Memory
0/15576 MHz   0 B/55 GiB

Host Resource Utilization
CPU              Memory          Disk
2674/15576 MHz   1.5 GiB/55 GiB  3.0 GiB/31 GiB

// ...TRUNCATED...

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Allocations
No allocations placed
```
Display detailed statistics on a node with a GPU:
```shell-session
$ nomad node status -stats 4d46e59f
ID             = 4d46e59f
Name           = nomad
Class          = <none>
DC             = dc1
Drain          = false
Eligibility    = eligible
Status         = ready
Uptime         = 19m59s
Driver Status  = docker,mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered

Allocated Resources
CPU           Memory      Disk
0/15576 MHz   0 B/55 GiB  0 B/28 GiB

Allocation Resource Utilization
CPU           Memory
0/15576 MHz   0 B/55 GiB

Host Resource Utilization
CPU              Memory          Disk
2673/15576 MHz   1.5 GiB/55 GiB  3.0 GiB/31 GiB

// ...TRUNCATED...

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB
@@ -232,9 +192,6 @@ Memory state = 0 / 11441 MiB
Memory utilization  = 0 %
Power usage         = 37 / 149 W
Temperature         = 34 C

Allocations
No allocations placed
```
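The same numbers are available programmatically: the Nomad client exposes host
and device statistics over the HTTP API. A sketch (the `/v1/client/stats`
endpoint and `node_id` parameter are part of the standard client API; the
address is an assumption):

```shell-session
$ # Query client stats, including device stats, for a specific node
$ curl -s "http://localhost:4646/v1/client/stats?node_id=4d46e59f"
```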
Run the following example job to see that the GPU was mounted in the container:
@@ -250,7 +207,7 @@ job "gpu-test" {
driver = "docker"
config {
image = "nvidia/cuda:9.0-base"
image = "nvidia/cuda:11.0-base"
command = "nvidia-smi"
}
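Since the hunk above shows only the changed fragment, here is a complete job
file consistent with it, as a sketch (the `device "nvidia/gpu"` stanza is the
standard way a Nomad job requests a device; the group layout and count are
inferred from the output below):

```hcl
job "gpu-test" {
  datacenters = ["dc1"]
  type        = "batch"

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image   = "nvidia/cuda:11.0-base"
        command = "nvidia-smi"
      }

      resources {
        # Ask the scheduler for one Nvidia GPU
        device "nvidia/gpu" {
          count = 1
        }
      }
    }
  }
}
```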
@@ -280,18 +237,8 @@ $ nomad run example.nomad
==> Evaluation "21bd7584" finished with status "complete"
$ nomad alloc status d250baed
ID                  = d250baed
Eval ID             = 21bd7584
Name                = gpu-test.smi[0]
Node ID             = 4d46e59f
Job ID              = example
Job Version         = 0
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description = <none>
Created             = 7s ago
Modified            = 2s ago
// ...TRUNCATED...
Task "smi" is "dead"
Task Resources
@@ -334,10 +281,14 @@ Wed Jan 23 18:25:32 2019
+-----------------------------------------------------------------------------+
```
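Once the allocation completes, the nvidia-smi output shown above can also be
read back from the task logs:

```shell-session
$ nomad alloc logs d250baed smi
```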
[docker-driver]: /docs/drivers/docker 'Nomad docker Driver'
-[exec-driver]: /docs/drivers/exec 'Nomad exec Driver'
-[java-driver]: /docs/drivers/java 'Nomad java Driver'
[lxc-driver]: /plugins/drivers/community/lxc 'Nomad lxc Driver'
[`plugin`]: /docs/configuration/plugin
[`plugin_dir`]: /docs/configuration#plugin_dir
+[nvidia_hook]: https://github.com/lxc/lxc/blob/master/hooks/nvidia
+[nvidia_plugin_download]: https://releases.hashicorp.com/nomad-device-nvidia/
+[nvidia_container_toolkit]: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
+[source]: https://github.com/hashicorp/nomad-device-nvidia