From f7ac931c2dc9ddaa741c36819331cc0145ca424c Mon Sep 17 00:00:00 2001
From: Ahmad Nawab
Date: Mon, 12 Jun 2023 08:22:12 +0000
Subject: [PATCH] Added GPU offload instructions to the README

---
 README.md | 38 +++++++++++++++++++-------------------
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/README.md b/README.md
index f17adbde..d7a1145e 100644
--- a/README.md
+++ b/README.md
@@ -47,6 +47,7 @@ Further optional dependencies:
 - multio (see https://github.com/ecmwf/multio)
 - ocean model (e.g. NEMO or FESOM)
 - fypp (see https://github.com/aradi/fypp)
+- loki (see https://github.com/ecmwf-ifs/loki)
 
 Some driver scripts to run tests and validate results rely on availability of:
 - md5sum (part of GNU Coreutils; on MacOS, install with `brew install coreutils`)
@@ -218,36 +219,35 @@ Note that only `ecwam-run-model` currently supports MPI.
 
 Running with source-term computation offloaded to the GPU
 =========================================================
+The calculation of the source terms in ecWam, i.e. the physics, can be offloaded for GPU execution. GPU-optimised code is
+generated at build-time using ECMWF's source-to-source translation toolchain Loki. Currently, two Loki transformations are
+supported (in ascending order of performance):
+- Single-column-coalesced (scc): fuses vector loops and promotes them to the outermost level to target the SIMT execution model
+- scc-stack: the scc transformation combined with a pool allocator for temporary arrays (the default)
 
-ecWam can be run with the source-term computation offloaded to the GPU. Please note that this is under active development
-and will change frequently.
-
-Single-node multi-GPU runs are also supported.
+Currently, only the OpenACC programming model is supported.
 
 Building
 --------
+The recommended way to build the GPU-enabled variants is to use the provided bundle and pass the `--with-loki --with-acc`
+options. A specific Loki transformation can be chosen at build-time via the `--loki-mode=` bundle option.
+
+The ecwam-bundle also provides appropriate arch files for the nvhpc suite on the ECMWF ATOS system.
 
-The [ecwam-bundle](https://git.ecmwf.int/users/nawd/repos/ecwam-bundle/browse?at=refs%2Fheads%2Fnaan-phys-gpu) is the recommended build option
-for the ecWam GPU enabled variant. The option `--with-phys-gpu` has to be specified at the build step. Arch files are provided for the nvhpc
-suite on the ECMWF ATOS system.
+Running
+-------
+No extra run-time options are needed to run the GPU-enabled ecWam. Note that if ecWam is built with the `--with-loki` and
+`--with-acc` bundle arguments, the source-term computation will always be offloaded for GPU execution.
+For multi-GPU runs, the number of GPUs maps to the number of MPI ranks, so multiple GPUs can be requested by launching
+with multiple MPI ranks. The mapping of MPI ranks to GPUs assumes at most 4 GPUs per host node.
 
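+For example, on a Slurm-based system a single-node run using 4 GPUs could be requested by launching with 4 MPI ranks.
+The snippet below is only an illustration and assumes the driver scripts pick up the MPI launch command from the `LAUNCH`
+environment variable; the exact launcher command and options are system specific:
+```
+# illustration only: request 4 MPI ranks, and hence 4 GPUs, on a single node
+export LAUNCH="srun -n 4"
+```
+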
 Environment variables
 ---------------------
-In its current guise, the CUDA runtime is used to manage temporary arrays and needs a large `NV_ACC_CUDA_HEAPSIZE`, e.g.
+The loki-scc variant uses the CUDA runtime to manage temporary arrays and needs a large `NV_ACC_CUDA_HEAPSIZE`, e.g.
 `NV_ACC_CUDA_HEAPSIZE=8G`.
-Currently, the nvhpc compiler suite cannot be used with the hpcx-openmpi suite and must instead use the version of openmpi
-bundled within. It's location is specified via the `MPI_HOME` environment variable at build-time. At run-time, we must
-specify the location of the `mpirun` executable manually, even if running with one process. This can be done via either of
-the following two options:
-```
-export LAUNCH="$MPI_HOME/bin/mpirun -np 1"
-export LAUNCH="srun -n 1"
-```
-
-Please note that `env.sh` must be sourced to set `MPI_HOME`. For running with multiple OpenMP threads and grids finer than `O48`,
-`OMP_STACKSIZE` should be set to at least `256M`.
+For running with multiple OpenMP threads and grids finer than `O48`, `OMP_STACKSIZE` should be set to at least `256M`.
 
 Known issues
 ============