
Commit

Added GPU offload instructions to the README
awnawab committed Mar 25, 2024
1 parent e7d18af commit f7ac931
Showing 1 changed file with 19 additions and 19 deletions.
38 changes: 19 additions & 19 deletions README.md
@@ -47,6 +47,7 @@ Further optional dependencies:
- multio (see https://github.com/ecmwf/multio)
- ocean model (e.g. NEMO or FESOM)
- fypp (see https://github.com/aradi/fypp)
- loki (see https://github.com/ecmwf-ifs/loki)

Some driver scripts to run tests and validate results rely on availability of:
- md5sum (part of GNU Coreutils; on MacOS, install with `brew install coreutils`)
@@ -218,36 +219,35 @@ Note that only `ecwam-run-model` currently supports MPI.

Running with source-term computation offloaded to the GPU
=========================================================
The calculation of the source terms in ecWam, i.e. the physics, can be offloaded for GPU execution. GPU-optimised code is
generated at build time using ECMWF's source-to-source translation toolchain Loki. Currently, two Loki transformations are
supported (in ascending order of performance):
- Single-column-coalesced (scc): fuses the vector loops and promotes them to the outermost level to target the SIMT execution model
- scc-stack: the scc transformation with a pool allocator used to allocate temporary arrays (the default)

Please note that GPU offload is under active development and will change frequently.

Single-node multi-GPU runs are also supported.
Currently, only the OpenACC programming model is supported.

Building
--------
The recommended option for building the GPU-enabled variants is to use the provided
[ecwam-bundle](https://git.ecmwf.int/users/nawd/repos/ecwam-bundle/browse?at=refs%2Fheads%2Fnaan-phys-gpu) and to pass the
`--with-loki --with-acc` options at the build step. Different Loki transformations can also be chosen at build time via the
bundle option `--loki-mode=<trafo>`.

The ecwam-bundle also provides appropriate arch files for the nvhpc suite on the ECMWF ATOS system.
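
For example, a build with the default scc-stack transformation might look roughly like the sketch below. The
`create`/`build` sub-commands of the bundle wrapper and the arch file path are assumptions and may differ on your system;
only the `--with-loki`, `--with-acc` and `--loki-mode` options are taken from the notes above.
```
./ecwam-bundle create                             # fetch the bundled packages (assumed wrapper command)
./ecwam-bundle build --with-loki --with-acc \
                     --loki-mode=scc-stack \
                     --arch=arch/ecmwf/hpc2020/nvhpc   # hypothetical arch file path
```
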
Running
-------
No extra run-time options are needed to run the GPU-enabled ecWam. Please note that this means that if ecWam is built with the
`--with-loki` and `--with-acc` bundle arguments, the source-term computation will always be offloaded for GPU execution.
For multi-GPU runs, the number of GPUs maps to the number of MPI ranks, so multiple GPUs can be requested by launching
with multiple MPI ranks. The mapping of MPI ranks to GPUs assumes at most 4 GPUs per host node.
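
For instance, a single-node run on 4 GPUs can be requested by launching 4 MPI ranks. A sketch using the `LAUNCH`
variable described under "Environment variables" below:
```
# 4 MPI ranks -> 4 GPUs on one host node (at most 4 GPUs per node are assumed)
export LAUNCH="$MPI_HOME/bin/mpirun -np 4"   # or: export LAUNCH="srun -n 4"
```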

Environment variables
---------------------

The loki-scc variant uses the CUDA runtime to manage temporary arrays and needs a large `NV_ACC_CUDA_HEAPSIZE`, e.g.
`NV_ACC_CUDA_HEAPSIZE=8G`.
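
This can be exported in the run environment before launching, for example:
```
# enlarge the device heap used by the CUDA runtime for temporary arrays
export NV_ACC_CUDA_HEAPSIZE=8G
```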

Currently, the nvhpc compiler suite cannot be used with the hpcx-openmpi suite and must instead use the version of openmpi
bundled within it. Its location is specified via the `MPI_HOME` environment variable at build time. At run time, we must
specify the location of the `mpirun` executable manually, even if running with one process. This can be done via either of
the following two options:
```
export LAUNCH="$MPI_HOME/bin/mpirun -np 1"
export LAUNCH="srun -n 1"
```

Please note that `env.sh` must be sourced to set `MPI_HOME`. For running with multiple OpenMP threads and grids finer than `O48`,
`OMP_STACKSIZE` should be set to at least `256M`.
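
Putting the run-time environment together, a minimal setup could look like the following sketch (the location of `env.sh`
depends on your build directory):
```
source env.sh                   # sets MPI_HOME, needed to locate mpirun
export OMP_STACKSIZE=256M       # for multiple OpenMP threads on grids finer than O48
export NV_ACC_CUDA_HEAPSIZE=8G
export LAUNCH="$MPI_HOME/bin/mpirun -np 1"
```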

Known issues
============