mirror of
https://github.com/kemko/nomad.git
synced 2026-01-06 18:35:44 +03:00
docs: update nvidia driver documentation
notably:
- name of the compiled binary is 'nomad-device-nvidia', not 'nvidia-gpu'
- link to Nvidia docs for installing the container runtime toolkit
- list docker v19.03 as minimum version, to track with nvidia's new container runtime toolkit
@@ -6,7 +6,7 @@ description: The Nvidia Device Plugin detects and makes Nvidia devices available

# Nvidia GPU Device Plugin

Name: `nvidia-gpu`
Name: `nomad-device-nvidia`

The Nvidia device plugin is used to expose Nvidia GPUs to Nomad.

@@ -97,23 +97,29 @@ documentation](https://github.com/NVIDIA/nvidia-container-runtime#environment-va

## Installation Requirements

In order to use the `nvidia-gpu` the following prerequisites must be met:
In order to use the `nomad-device-nvidia` device driver the following prerequisites must be met:

1. GNU/Linux x86_64 with kernel version > 3.10
2. NVIDIA GPU with Architecture > Fermi (2.1)
3. NVIDIA drivers >= 340.29 with binary `nvidia-smi`
4. Docker v19.03+
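
As a rough illustration, the version requirements in the list above could be checked on a client node before installing the plugin. The sketch below is not part of Nomad or the plugin: the function names `version_ge`, `version_gt`, and `check_prereqs` are hypothetical, and it assumes GNU `sort -V`, an `nvidia-smi` that supports `--query-gpu=driver_version`, and a reachable Docker daemon. It does not check the GPU architecture requirement.

```shell
# Hypothetical pre-flight check; none of these function names come from Nomad.

# version_ge A B: succeeds when version A >= version B (relies on GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# version_gt A B: succeeds when version A is strictly greater than version B.
version_gt() {
  [ "$1" != "$2" ] && version_ge "$1" "$2"
}

# check_prereqs: prints the first unmet requirement and returns non-zero.
check_prereqs() {
  local kernel driver docker_ver

  # Strip any distribution suffix such as "-91-generic" from the kernel release.
  kernel="$(uname -r | cut -d- -f1)"
  version_gt "$kernel" "3.10" || { echo "kernel $kernel is not > 3.10" >&2; return 1; }

  command -v nvidia-smi >/dev/null || { echo "nvidia-smi not found" >&2; return 1; }
  driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
  version_ge "$driver" "340.29" || { echo "driver $driver is older than 340.29" >&2; return 1; }

  command -v docker >/dev/null || { echo "docker not found" >&2; return 1; }
  docker_ver="$(docker version --format '{{.Server.Version}}')"
  version_ge "$docker_ver" "19.03" || { echo "Docker $docker_ver is older than 19.03" >&2; return 1; }

  echo "all checked prerequisites met"
}
```

Sourcing the file and running `check_prereqs` reports the first requirement that is not met. The comparison uses version sort rather than plain string comparison so that, for example, kernel 3.9 is correctly treated as older than 3.10.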

### Docker Driver Requirements
### Container Toolkit Installation

Follow the [NVIDIA Container Toolkit installation instructions][nvidia_container_toolkit]
from Nvidia to prepare a machine to use docker containers with Nvidia GPUs. You should
be able to run this simple command to test your environment and produce meaningful
output.

```shell
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
```

The Nvidia driver plugin currently only supports the older v1.0 version of the
Docker driver provided by Nvidia. In order to use the Nvidia driver plugin with
the Docker driver, please follow the installation instructions for
[`nvidia-container-runtime`](https://github.com/nvidia/nvidia-container-runtime#installation).

## Plugin Configuration

```hcl
plugin "nvidia-gpu" {
plugin "nomad-device-nvidia" {
  config {
    enabled = true
    ignored_gpu_ids = ["GPU-fef8089b", "GPU-ac81e44d"]
@@ -122,7 +128,7 @@ plugin "nvidia-gpu" {
}
```

The `nvidia-gpu` device plugin supports the following configuration in the agent
The `nomad-device-nvidia` device plugin supports the following configuration in the agent
config:

- `enabled` `(bool: true)` - Control whether the plugin should be enabled and running.
@@ -133,17 +139,20 @@ config:
- `fingerprint_period` `(string: "1m")` - The period in which to fingerprint for
  device changes.

## Restrictions
## Limitations

The Nvidia integration only works with drivers that natively integrate with
Nvidia's [container runtime
library](https://github.com/NVIDIA/libnvidia-container).

Nomad has tested support with the [`docker` driver][docker-driver] and plans to
bring support to the built-in [`exec`][exec-driver] and [`java`][java-driver]
drivers. Support for [`lxc`][lxc-driver] should be possible by installing the
[Nvidia hook](https://github.com/lxc/lxc/blob/master/hooks/nvidia) but is not
tested or documented by Nomad.
Nomad has tested support with the [`docker` driver][docker-driver]. Support for
[`lxc`][lxc-driver] should be possible by installing the [Nvidia hook][nvidia_hook]
but is not tested or documented by Nomad.

## Source Code & Compiled Binaries

The source code for this plugin can be found at hashicorp/nomad-device-nvidia. You
can also find pre-built binaries on the [releases page][nvidia_plugin_download].

## Examples

@@ -151,68 +160,19 @@ Inspect a node with a GPU:

```shell-session
$ nomad node status 4d46e59f
ID = 4d46e59f
Name = nomad
Class = <none>
DC = dc1
Drain = false
Eligibility = eligible
Status = ready
Uptime = 19m43s
Driver Status = docker,mock_driver,raw_exec

Node Events
Time Subsystem Message
2019-01-23T18:25:18Z Cluster Node registered

Allocated Resources
CPU Memory Disk
0/15576 MHz 0 B/55 GiB 0 B/28 GiB

Allocation Resource Utilization
CPU Memory
0/15576 MHz 0 B/55 GiB

Host Resource Utilization
CPU Memory Disk
2674/15576 MHz 1.5 GiB/55 GiB 3.0 GiB/31 GiB
// ...TRUNCATED...

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416] 0 / 11441 MiB

Allocations
No allocations placed
```

Display detailed statistics on a node with a GPU:

```shell-session
$ nomad node status -stats 4d46e59f
ID = 4d46e59f
Name = nomad
Class = <none>
DC = dc1
Drain = false
Eligibility = eligible
Status = ready
Uptime = 19m59s
Driver Status = docker,mock_driver,raw_exec

Node Events
Time Subsystem Message
2019-01-23T18:25:18Z Cluster Node registered

Allocated Resources
CPU Memory Disk
0/15576 MHz 0 B/55 GiB 0 B/28 GiB

Allocation Resource Utilization
CPU Memory
0/15576 MHz 0 B/55 GiB

Host Resource Utilization
CPU Memory Disk
2673/15576 MHz 1.5 GiB/55 GiB 3.0 GiB/31 GiB
// ...TRUNCATED...

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416] 0 / 11441 MiB
@@ -232,9 +192,6 @@ Memory state = 0 / 11441 MiB
Memory utilization = 0 %
Power usage = 37 / 149 W
Temperature = 34 C

Allocations
No allocations placed
```

Run the following example job to see that the GPU was mounted in the
@@ -250,7 +207,7 @@ job "gpu-test" {
      driver = "docker"

      config {
        image = "nvidia/cuda:9.0-base"
        image = "nvidia/cuda:11.0-base"
        command = "nvidia-smi"
      }

@@ -280,18 +237,8 @@ $ nomad run example.nomad
==> Evaluation "21bd7584" finished with status "complete"

$ nomad alloc status d250baed
ID = d250baed
Eval ID = 21bd7584
Name = gpu-test.smi[0]
Node ID = 4d46e59f
Job ID = example
Job Version = 0
Client Status = complete
Client Description = All tasks have completed
Desired Status = run
Desired Description = <none>
Created = 7s ago
Modified = 2s ago

// ...TRUNCATED...

Task "smi" is "dead"
Task Resources
@@ -334,10 +281,14 @@ Wed Jan 23 18:25:32 2019
+-----------------------------------------------------------------------------+
```


[docker-driver]: /docs/drivers/docker 'Nomad docker Driver'
[exec-driver]: /docs/drivers/exec 'Nomad exec Driver'
[java-driver]: /docs/drivers/java 'Nomad java Driver'
[lxc-driver]: /plugins/drivers/community/lxc 'Nomad lxc Driver'
[`plugin`]: /docs/configuration/plugin
[`plugin_dir`]: /docs/configuration#plugin_dir
[nvidia_hook]: https://github.com/lxc/lxc/blob/master/hooks/nvidia
[nvidia_plugin_download]: https://releases.hashicorp.com/nomad-device-nvidia/
[nvidia_container_toolkit]: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
[source]: https://github.com/hashicorp/nomad-device-nvidia