From 4580fb8dd4fbd313119d232c01f9969dec859338 Mon Sep 17 00:00:00 2001
From: Dominic Sloan-Murphy
Date: Wed, 15 Jan 2025 14:16:23 +0000
Subject: [PATCH 1/2] docs: add initial GPU explanation

---
 .custom_wordlist.txt          |  4 ++++
 explanation/gpus/driver.md    |  8 ++++++++
 explanation/gpus/index.md     | 16 ++++++++++++++++
 explanation/gpus/slurmconf.md | 51 +++++++++++++++++++++++++++++++++++++++++++
 explanation/index.md          |  6 +++---
 5 files changed, 82 insertions(+), 3 deletions(-)
 create mode 100644 explanation/gpus/driver.md
 create mode 100644 explanation/gpus/index.md
 create mode 100644 explanation/gpus/slurmconf.md

diff --git a/.custom_wordlist.txt b/.custom_wordlist.txt
index 40061e1..b26a7c0 100644
--- a/.custom_wordlist.txt
+++ b/.custom_wordlist.txt
@@ -14,3 +14,7 @@ PyPI
 GPUs
 integrations
 autonomizing
+GRES
+PCIe
+Nvidia
+RESource
diff --git a/explanation/gpus/driver.md b/explanation/gpus/driver.md
new file mode 100644
index 0000000..27863e9
--- /dev/null
+++ b/explanation/gpus/driver.md
@@ -0,0 +1,8 @@
+(driver)=
+# Driver auto-install
+
+Charmed HPC installs GPU drivers when the `slurmd` charm is deployed on a compute node equipped with a supported Nvidia GPU. Driver detection is performed via the API of [`ubuntu-drivers-common`](https://documentation.ubuntu.com/server/how-to/graphics/install-nvidia-drivers/#the-recommended-way-ubuntu-drivers-tool), a package that examines the node's hardware, determines the appropriate third-party drivers, and recommends a set of driver packages to install from the Ubuntu repositories.
+
+## Libraries used
+
+- [`ubuntu-drivers-common`](https://github.com/canonical/ubuntu-drivers-common), from GitHub.
diff --git a/explanation/gpus/index.md b/explanation/gpus/index.md
new file mode 100644
index 0000000..8f7c318
--- /dev/null
+++ b/explanation/gpus/index.md
@@ -0,0 +1,16 @@
+(gpus)=
+# GPUs
+
+A Graphics Processing Unit (GPU) is a specialized hardware resource originally designed to accelerate computer graphics calculations, but now widely used for general-purpose computing across many fields. GPU-enabled workloads are supported on a Charmed HPC cluster, with the necessary driver and workload manager configuration handled automatically by the charms.
+
+- {ref}`driver`
+- {ref}`slurmconf`
+
+```{toctree}
+:titlesonly:
+:maxdepth: 1
+:hidden:
+
+Driver auto-install <driver>
+Slurm enlistment <slurmconf>
+```
diff --git a/explanation/gpus/slurmconf.md b/explanation/gpus/slurmconf.md
new file mode 100644
index 0000000..b397768
--- /dev/null
+++ b/explanation/gpus/slurmconf.md
@@ -0,0 +1,51 @@
+(slurmconf)=
+# Slurm enlistment
+
+To allow cluster users to submit jobs requesting GPUs, detected GPUs are automatically added to the [Generic RESource (GRES) Slurm configuration](https://slurm.schedmd.com/gres.html). GRES is the Slurm feature that enables scheduling of arbitrary generic resources, including GPUs; once enlisted, a GPU can be requested at submission time with, for example, `sbatch --gres=gpu:tesla_t4:1`.
+
+## Device details
+
+GPU details are gathered by [`pynvml`](https://pypi.org/project/nvidia-ml-py/), the official Python bindings for the Nvidia Management Library (NVML), which allow GPU counts, associated device files, and model names to be queried from the driver. For compatibility with Slurm configuration files, retrieved model names are converted to lowercase and whitespace is replaced with underscores: “Tesla T4” becomes `tesla_t4`, for example.
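+
+For illustration, the following minimal sketch (not the charm's exact implementation) gathers the same details using only the `nvidia-ml-py` bindings, assuming a working Nvidia driver on the node:
+
+```python
+import pynvml
+
+pynvml.nvmlInit()
+try:
+    for i in range(pynvml.nvmlDeviceGetCount()):
+        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
+        name = pynvml.nvmlDeviceGetName(handle)
+        if isinstance(name, bytes):  # older releases of the bindings return bytes
+            name = name.decode()
+        # Normalize the model name for Slurm: lowercase, whitespace to underscores
+        gres_type = "_".join(name.lower().split())
+        minor = pynvml.nvmlDeviceGetMinorNumber(handle)  # maps to /dev/nvidia<minor>
+        print(f"GPU {i}: type={gres_type} file=/dev/nvidia{minor}")
+finally:
+    pynvml.nvmlShutdown()
+```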
+
+## Slurm configuration
+
+Each GPU-equipped node is added to the `gres.conf` configuration file as its own `NodeName` entry, following the format defined in the [Slurm `gres.conf` documentation](https://slurm.schedmd.com/gres.conf.html). Individual `NodeName` entries are used, rather than a single entry per GRES resource, to better support heterogeneous environments, such as a cluster where the same model of GPU does not map to the same device file on every compute node.
+
+In `slurm.conf`, each GPU-equipped node's `Gres=` element holds a comma-separated list giving the name, type, and count of each GPU on the node.
+
+For example, a Microsoft Azure Standard_NC24ads_A100_v4 node, equipped with an Nvidia A100 PCIe GPU, is given a node configuration in `slurm.conf` of:
+
+```
+NodeName=juju-e33208-1 CPUs=24 Boards=1 SocketsPerBoard=1 CoresPerSocket=24 ThreadsPerCore=1 RealMemory=221446 Gres=gpu:nvidia_a100_80gb_pcie:1 MemSpecLimit=1024
+```
+
+and a corresponding `gres.conf` line of:
+
+```
+NodeName=juju-e33208-1 Name=gpu Type=nvidia_a100_80gb_pcie File=/dev/nvidia0
+```
+
+## Libraries used
+
+- [`pynvml / nvidia-ml-py`](https://pypi.org/project/nvidia-ml-py/), from PyPI.
+
diff --git a/explanation/index.md b/explanation/index.md
index f8c3910..4f618c2 100644
--- a/explanation/index.md
+++ b/explanation/index.md
@@ -2,8 +2,7 @@
 # Explanation
 
 - {ref}`cryptography`
-
-🚧 Under construction 🚧
+- {ref}`GPUs`
 
 ```{toctree}
 :titlesonly:
@@ -11,4 +10,5 @@
 :hidden:
 
 cryptography/index
-```
\ No newline at end of file
+gpus/index
+```

From 9268320b05188bb554265fc345083aa2d268dc77 Mon Sep 17 00:00:00 2001
From: Dominic Sloan-Murphy
Date: Thu, 16 Jan 2025 09:56:09 +0000
Subject: [PATCH 2/2] docs(gpu): reformat Azure instance name

---
 explanation/gpus/slurmconf.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/explanation/gpus/slurmconf.md b/explanation/gpus/slurmconf.md
index b397768..16b5cc7 100644
--- a/explanation/gpus/slurmconf.md
+++ b/explanation/gpus/slurmconf.md
@@ -33,7 +33,7 @@ Each GPU-equipped node is added to the `gres.conf` configuration file as its own
 
 In `slurm.conf`, each GPU-equipped node's `Gres=` element holds a comma-separated list giving the name, type, and count of each GPU on the node.
 
-For example, a Microsoft Azure Standard_NC24ads_A100_v4 node, equipped with an Nvidia A100 PCIe GPU, is given a node configuration in `slurm.conf` of:
+For example, a Microsoft Azure `Standard_NC24ads_A100_v4` node, equipped with an Nvidia A100 PCIe GPU, is given a node configuration in `slurm.conf` of:
 
 ```
 NodeName=juju-e33208-1 CPUs=24 Boards=1 SocketsPerBoard=1 CoresPerSocket=24 ThreadsPerCore=1 RealMemory=221446 Gres=gpu:nvidia_a100_80gb_pcie:1 MemSpecLimit=1024