Run the entire advection loop on device #34

awnawab · 2024-10-08T08:21:56Z

This PR hoists the data transfers out of the advection loop, making the entire timestep device resident and resulting in big reductions in the data offload cost. The attached plot shows the effects of the optimisations included herein.

The most invasive change is to OUTBLOCK. Whilst not a very computationally heavy routine, it needs to run on device in order to prevent copying all the state fields back to host. The indirection + accumulation pattern in OUTBLOCK wasn't compatible with how scalars are privatised to GPU threads. @jkousal32, your recent ifs-source PR conflicts with some of these changes; I'll resolve that conflict once I merge this back into develop.

One more optimisation PR is planned after this, which will focus on improving the GPU performance of the physics kernel.

awnawab · 2024-10-08T13:06:47Z

@jkousal32 and @jrbidlot, could you please also review this PR? Many thanks!

jrbidlot · 2024-10-08T14:03:32Z

Ahmad,
silly question. Changing outblock affects the output, and many of those outputs are purely diagnostic parameters.
Have you checked that the output file is the same in both cases?

awnawab · 2024-10-08T14:18:07Z

> Ahmad, silly question. Changing outblock affects the output, and many of those outputs are purely diagnostic parameters. Have you checked that the output file is the same in both cases?

The validation passes both on CPU and GPU after these changes, so I am reasonably confident I haven't changed the results. But of course I will double check and run checksums on the output file before and after this change.

Just to be sure, the relevant file to check would be the path stored in CNORMWAMOUT_FILE right?

awnawab · 2024-10-09T10:43:41Z

> Ahmad, silly question. Changing outblock affects the output, and many of those outputs are purely diagnostic parameters. Have you checked that the output file is the same in both cases?

Just confirmed (using sha256sum) that the contents of CNORMWAMOUT_FILE are identical on CPU before and after this PR. On GPU they are not, but that is expected because now that we run more of ecWAM on device, we will get small round-off differences.
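
For reference, the before/after check described above can be sketched in a few lines of Python. This is only an illustration of the same idea as running sha256sum on both files; the function names and the file paths in the usage note are hypothetical and not part of ecWAM.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large output files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def outputs_identical(before, after):
    """True if the two files are byte-for-byte identical (same digest)."""
    return sha256_of(before) == sha256_of(after)
```

A call such as `outputs_identical("norms_before.txt", "norms_after.txt")` (hypothetical paths) would return `True` for the CPU runs described above. A byte-level comparison like this is expected to fail for the GPU runs, since round-off differences change the file contents.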

jkousal32

looks okay to me!

jrbidlot

Hi Ahmad,
I think there is a bug in outblock around old line 555

The tricky bit with outblock is that it produces many purely diagnostic outputs, so a bug there would not be picked up by checking the standard output norms.

Did you test with all possible output turned on?

jrbidlot · 2024-10-10T08:58:36Z

src/ecwam/outblock.F90

-        IR=IR+1
-        IF (IPFGTBL(IR) /= 0) THEN
-          CALL SEBTMEAN (KIJS, KIJL, FL2ND, TEWH(IH-1), TEWH(IH), BOUT(KIJS,ITOBOUT(IR)))
+        IF (IPFGTBL(59) /= 0) THEN


The old code incremented IR from 59 to 59+NTEWH, but the new code only uses 59 as an index, so all output will go to index 59.
Index 60 and onwards will then pick up the wrong parameters.
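
To make the indexing problem concrete, here is a small Python sketch (not the Fortran from outblock.F90) contrasting a running index with a hard-coded one. The function names, the `enabled` table and the choice of three bands are hypothetical stand-ins for IPFGTBL and NTEWH; the exact start and end of the real index range are assumptions.

```python
def fill_slots_running(first_slot, n_bands, enabled):
    """Mimic the old loop: the slot index advances once per band."""
    written = []
    slot = first_slot - 1
    for _ in range(n_bands):
        slot += 1  # corresponds to IR = IR + 1
        if enabled.get(slot, False):
            written.append(slot)
    return written


def fill_slots_fixed(first_slot, n_bands, enabled):
    """Mimic the regression: every band tests and writes the same slot."""
    written = []
    for _ in range(n_bands):
        if enabled.get(first_slot, False):
            written.append(first_slot)
    return written
```

With three bands and slots 59 to 61 enabled, `fill_slots_running(59, 3, ...)` touches 59, 60 and 61, whereas `fill_slots_fixed(59, 3, ...)` writes to slot 59 three times and never fills 60 or 61.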

awnawab · 2024-10-11

Excellent catch! There was another similar loop over NTRAIN earlier that I also missed. I've now fixed it and confirmed that the norms are identical even if all 87 output parameters are enabled. Many thanks for spotting this 🙏

jrbidlot

We got it working! Very good.

It is always tricky when dealing with the diagnostic output.
We need to remember to trigger the full output from the model in that case.

mlange05

Looks good to me.

wdeconinck · 2024-10-28T10:16:37Z

src/ecwam/CMakeLists.txt

@@ -445,7 +445,7 @@ ecbuild_add_library(
                     $<${HAVE_ACC}:OpenACC::OpenACC_Fortran>
    PUBLIC_INCLUDES  $<INSTALL_INTERFACE:include>
    PRIVATE_INCLUDES ${CMAKE_CURRENT_SOURCE_DIR}
-    PRIVATE_DEFINITIONS ${ECWAM_PRIVATE_DEFINITIONS}


The leading underscore in _CUDA gives the impression that it is defined by the compiler. It would probably be better to use a WAM_HAVE_CUDA definition, as is done for some other defines.

awnawab · 2024-10-30T16:25:01Z

Many thanks @wdeconinck for spotting that, it is indeed more intuitive to rename the preproc def as you suggest. As discussed offline, if you are happy with the fix, could you please merge the branch? I will address the failing macOS CI tests in a follow-up PR.

awnawab added 15 commits October 7, 2024 16:42

Move data transfers up to WAMODEL

2771362

Offload OUTBS computation to GPU

fcf8392

Move data transfers out of ADVECTION loop

ee19884

Use asynchronous data transfers

1823420

O320: add single precision validation hashes and lower nproma to 64

e5d589f

Move WVPRPT_LAND data movement to WAMODEL

05c5332

Time advection loop

1fd99ee

Allocate MIJ and XLLWS only once

0ebda17

Delete GPU allocations at the end of WAMODEL

11a679c

CUDA: enable pinning of fields

ebb0116

TIMINGS: add timers for MPI,I/O

15d8296

LOKI-SCC-STACK: normalise LBOUNDS in FNDPRT and inline PEAKFRI

eda0ccf

Switch to DR_HOOK timers

d4c9b34

Don't explicitly map MPI rank to GPU

8851664

ACC: use data regions rather than hard-coded parallel gang loops

8ef50b1

awnawab changed the title ~~Naan port wamodel~~ Run the entire advection loop on device Oct 8, 2024

awnawab marked this pull request as ready for review October 8, 2024 13:05

awnawab requested review from wdeconinck and mlange05 October 8, 2024 13:06

jkousal32 approved these changes Oct 9, 2024

jrbidlot suggested changes Oct 10, 2024

awnawab added 2 commits October 10, 2024 17:09

OUTBLOCK: test GPU ready version for all 87 output parameters

2fad889

Rebase cleanup

b4d3002

jrbidlot approved these changes Oct 11, 2024

mlange05 approved these changes Oct 14, 2024

wdeconinck reviewed Oct 28, 2024

Rename preproc def to guard cuda functionality

4c96ec0

wdeconinck merged commit fb68c2c into develop-1.3 Oct 31, 2024
17 of 21 checks passed

awnawab deleted the naan-port-wamodel branch November 6, 2024 08:30
