diff --git a/website/content/plugins/devices/nvidia.mdx b/website/content/plugins/devices/nvidia.mdx
index 7c5887956..b1106a4ac 100644
--- a/website/content/plugins/devices/nvidia.mdx
+++ b/website/content/plugins/devices/nvidia.mdx
@@ -6,7 +6,7 @@ description: The Nvidia Device Plugin detects and makes Nvidia devices available
 
 # Nvidia GPU Device Plugin
 
-Name: `nvidia-gpu`
+Name: `nomad-device-nvidia`
 
 The Nvidia device plugin is used to expose Nvidia GPUs to Nomad.
 
@@ -97,23 +97,29 @@ documentation](https://github.com/NVIDIA/nvidia-container-runtime#environment-va
 
 ## Installation Requirements
 
-In order to use the `nvidia-gpu` the following prerequisites must be met:
+To use the `nomad-device-nvidia` device plugin, the following prerequisites must be met:
 
 1. GNU/Linux x86_64 with kernel version > 3.10
 2. NVIDIA GPU with Architecture > Fermi (2.1)
 3. NVIDIA drivers >= 340.29 with binary `nvidia-smi`
+4. Docker v19.03+
 
-### Docker Driver Requirements
+### Container Toolkit Installation
+
+Follow the [NVIDIA Container Toolkit installation instructions][nvidia_container_toolkit]
+from Nvidia to prepare a machine to run Docker containers with Nvidia GPUs. You should
+then be able to run the following command to verify your environment; it should print
+`nvidia-smi` output listing the GPUs on the machine.
+
+```shell
+docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
+```
 
-The Nvidia driver plugin currently only supports the older v1.0 version of the
-Docker driver provided by Nvidia. In order to use the Nvidia driver plugin with
-the Docker driver, please follow the installation instructions for
-[`nvidia-container-runtime`](https://github.com/nvidia/nvidia-container-runtime#installation).
 ## Plugin Configuration
 
 ```hcl
-plugin "nvidia-gpu" {
+plugin "nomad-device-nvidia" {
   config {
     enabled = true
     ignored_gpu_ids = ["GPU-fef8089b", "GPU-ac81e44d"]
@@ -122,7 +128,7 @@ plugin "nvidia-gpu" {
 }
 ```
 
-The `nvidia-gpu` device plugin supports the following configuration in the agent
+The `nomad-device-nvidia` device plugin supports the following configuration in the agent
 config:
 
 - `enabled` `(bool: true)` - Control whether the plugin should be enabled and
   running.
@@ -133,17 +139,20 @@ config:
 
 - `fingerprint_period` `(string: "1m")` - The period in which to fingerprint for
   device changes.
 
-## Restrictions
+## Limitations
 
-The Nvidia integration only works with drivers who natively integrate with
+The Nvidia integration only works with drivers that natively integrate with
 Nvidia's [container runtime library](https://github.com/NVIDIA/libnvidia-container).
 
-Nomad has tested support with the [`docker` driver][docker-driver] and plans to
-bring support to the built-in [`exec`][exec-driver] and [`java`][java-driver]
-drivers. Support for [`lxc`][lxc-driver] should be possible by installing the
-[Nvidia hook](https://github.com/lxc/lxc/blob/master/hooks/nvidia) but is not
-tested or documented by Nomad.
+Nomad has tested support with the [`docker` driver][docker-driver]. Support for
+[`lxc`][lxc-driver] should be possible by installing the [Nvidia hook][nvidia_hook]
+but is not tested or documented by Nomad.
+
+## Source Code & Compiled Binaries
+
+The source code for this plugin is available at [hashicorp/nomad-device-nvidia][source].
+Pre-built binaries are available on the [releases page][nvidia_plugin_download].
 ## Examples
 
@@ -151,68 +160,19 @@ Inspect a node with a GPU:
 
 ```shell-session
 $ nomad node status 4d46e59f
-ID = 4d46e59f
-Name = nomad
-Class =
-DC = dc1
-Drain = false
-Eligibility = eligible
-Status = ready
-Uptime = 19m43s
-Driver Status = docker,mock_driver,raw_exec
-Node Events
-Time Subsystem Message
-2019-01-23T18:25:18Z Cluster Node registered
-
-Allocated Resources
-CPU Memory Disk
-0/15576 MHz 0 B/55 GiB 0 B/28 GiB
-
-Allocation Resource Utilization
-CPU Memory
-0/15576 MHz 0 B/55 GiB
-
-Host Resource Utilization
-CPU Memory Disk
-2674/15576 MHz 1.5 GiB/55 GiB 3.0 GiB/31 GiB
+// ...TRUNCATED...
 
 Device Resource Utilization
 nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416] 0 / 11441 MiB
-
-Allocations
-No allocations placed
 ```
 
 Display detailed statistics on a node with a GPU:
 
 ```shell-session
 $ nomad node status -stats 4d46e59f
-ID = 4d46e59f
-Name = nomad
-Class =
-DC = dc1
-Drain = false
-Eligibility = eligible
-Status = ready
-Uptime = 19m59s
-Driver Status = docker,mock_driver,raw_exec
-Node Events
-Time Subsystem Message
-2019-01-23T18:25:18Z Cluster Node registered
-
-Allocated Resources
-CPU Memory Disk
-0/15576 MHz 0 B/55 GiB 0 B/28 GiB
-
-Allocation Resource Utilization
-CPU Memory
-0/15576 MHz 0 B/55 GiB
-
-Host Resource Utilization
-CPU Memory Disk
-2673/15576 MHz 1.5 GiB/55 GiB 3.0 GiB/31 GiB
+// ...TRUNCATED...
 Device Resource Utilization
 nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416] 0 / 11441 MiB
@@ -232,9 +192,6 @@ Memory state = 0 / 11441 MiB
 Memory utilization = 0 %
 Power usage = 37 / 149 W
 Temperature = 34 C
-
-Allocations
-No allocations placed
 ```
 
-Run the following example job to see that that the GPU was mounted in the
+Run the following example job to see that the GPU was mounted in the
@@ -250,7 +207,7 @@ job "gpu-test" {
       driver = "docker"
 
       config {
-        image = "nvidia/cuda:9.0-base"
+        image = "nvidia/cuda:11.0-base"
        command = "nvidia-smi"
       }
 
@@ -280,18 +237,8 @@ $ nomad run example.nomad
 ==> Evaluation "21bd7584" finished with status "complete"
 
 $ nomad alloc status d250baed
-ID = d250baed
-Eval ID = 21bd7584
-Name = gpu-test.smi[0]
-Node ID = 4d46e59f
-Job ID = example
-Job Version = 0
-Client Status = complete
-Client Description = All tasks have completed
-Desired Status = run
-Desired Description =
-Created = 7s ago
-Modified = 2s ago
+
+// ...TRUNCATED...
 
 Task "smi" is "dead"
 
 Task Resources
@@ -334,10 +281,14 @@ Wed Jan 23 18:25:32 2019
 +-----------------------------------------------------------------------------+
 ```
+
 [docker-driver]: /docs/drivers/docker 'Nomad docker Driver'
 [exec-driver]: /docs/drivers/exec 'Nomad exec Driver'
 [java-driver]: /docs/drivers/java 'Nomad java Driver'
 [lxc-driver]: /plugins/drivers/community/lxc 'Nomad lxc Driver'
 [`plugin`]: /docs/configuration/plugin
 [`plugin_dir`]: /docs/configuration#plugin_dir
+[nvidia_hook]: https://github.com/lxc/lxc/blob/master/hooks/nvidia
 [nvidia_plugin_download]: https://releases.hashicorp.com/nomad-device-nvidia/
+[nvidia_container_toolkit]: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
+[source]: https://github.com/hashicorp/nomad-device-nvidia
\ No newline at end of file
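
Since this plugin now ships as an external binary rather than being built into Nomad, the "Source Code & Compiled Binaries" section added above implies an agent setup step that could be sketched for readers. A minimal agent configuration, assuming the downloaded `nomad-device-nvidia` binary has been placed in the agent's plugin directory (the path below is illustrative, not prescribed by the diff):

```hcl
# Agent configuration (sketch). The nomad-device-nvidia binary from the
# releases page is placed in plugin_dir before the agent starts; the
# plugin block name must match the binary name.
plugin_dir = "/opt/nomad/plugins"

plugin "nomad-device-nvidia" {
  config {
    enabled = true
  }
}
```

The plugin block mirrors the one already shown in the "Plugin Configuration" section; only `plugin_dir` is new here, and it is documented under the [`plugin_dir`] link reference the page already defines.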
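
The example job in the diff is shown only partially (the `driver` and `config` blocks); the part that actually requests a GPU is a `device` block inside the task's `resources`, per Nomad's device scheduling model. A minimal sketch, with illustrative count and constraint values:

```hcl
# Task resources requesting one Nvidia GPU (sketch; values illustrative).
resources {
  device "nvidia/gpu" {
    count = 1

    # Optionally constrain placement to a particular GPU model.
    constraint {
      attribute = "${device.model}"
      operator  = "regexp"
      value     = "Tesla"
    }
  }
}
```

Without a `device` block the task is scheduled as an ordinary Docker task and no GPU is mounted, so `nvidia-smi` in the example would fail.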