OpenACC port of wave propagation (ACCV6) #6
Conversation
Many many thanks for all the great work in this PR Fabio (and all others involved of course)! I compared O320 run-times on 2 A100 GPUs to 2 AMD Rome CPUs. I measured a 4.3x speedup on the GPU, including the data offload time for some of the smaller arrays but excluding the time taken to offload `FL1`. If I include the offload time for `FL1`, the speedup drops to 1.3x. The former number is however the more important one, since data offload costs will be negligible once I've refactored the time-step, and a >4x speedup is extremely impressive indeed!
There are nevertheless a few things which I would like you to address:

- The code now suffers heavily from memory leaks on device, and I was unable to even run it on 4 A100 40GB GPUs without restoring the FIELD_API destructors. I believe these are chiefly responsible for the memory leak, but there are also several `!$acc enter data` statements without matching `!$acc exit data delete` statements. I think these should be added to the code to avoid potential memory leaks (see the sketch after this list).
- Validation on both GPU variants still passes, but the CPU-only ecWAM variant fails validation after the changes in this PR. I initially suspected unguarded `!$acc parallel/kernels` statements to be the cause, as you can tell from my comments in the code, but now I am leaning more towards some of the manually implemented loop optimizations. The NVIDIA CPU compiler is known to be fragile when trying to vectorize large loops, such as we now have in `CTUW` after the loop unrolling. The source of the error needs to be identified and resolved.
- The performance of some of the newly ported GPU kernels is linked to the number of CPU OpenMP threads because of the manual OpenMP partitioning; the outermost bounds of some of the loop nests are determined by the number of available OpenMP threads. As a consequence, the GPU code runs slower as the OpenMP thread count increases (I measured a 2x speedup when running with 1 OpenMP thread as compared to 64). This obviously shouldn't be the case, and I've suggested a possible easy fix in the code.
- CUDA-aware MPI is still missing, and the subroutine `PROENVHALO` has also not been ported as a consequence. Unless I have misunderstood something, CUDA-aware MPI could be added by replacing calls to `MPL_SEND/RECV` with raw MPI calls in `MPEXCHNG`. @reuterbal can confirm whether I am correct on this.
- Generally, the code has been left in a bit of a messy state. I am referring here to many (!) commented-out loop nests or OpenACC statements, newly added module imports that are unused, lots of whitespace changes, etc. Whilst none of these individually is that serious, added up they do hurt the readability of the code, and I would be grateful if the code could be cleaned up.
Thanks once again for the amazing work. Once the points above are addressed, this will really be a huge step forward in the capabilities of the ecWAM model!
```diff
@@ -78,4 +78,5 @@ MODULE YOWMAP
 ! (i.e. NO LAND AND DEEP WATER).
 !
 ! ----------------------------------------------------------------------
+
 END MODULE YOWMAP
```
Please revert this whitespace change.
Done
src/ecwam/wamintgr_loki_gpu.F90 (Outdated)
```diff
@@ -245,12 +259,6 @@ SUBROUTINE WAMINTGR_LOKI_GPU(CDTPRA, CDATE, CDATEWH, CDTIMP, CDTIMPNEXT, &
 CALL SRC_CONTRIBS%ENSURE_HOST()

-!$loki update_host
-CALL WVPRPT_FIELD%FINAL()
```
Why were these FIELD_API destructors removed? This was causing a memory leak on device, meaning I wasn't able to run an O320 grid even on 4 40GB A100s without restoring these destructors.
The FIELD_API destructors have been restored now that the new version of field_api has been installed.
src/ecwam/propag_wam.F90 (Outdated)
```fortran
ENDDO
ENDDO
ENDIF

ENDDO
!$acc end kernels

!F !$acc kernels loop independent private(KIJS, IJSB, KIJL, IJLB)
```
Similar to here, a lot of mid-development (I assume) comments have been left in the code. Whilst this isn't a big problem, could these please be removed? In my opinion they make the code less readable. Thanks!
src/ecwam/propag_wam.F90 (Outdated)
```fortran
#ifdef WAM_HAVE_UNWAM
USE UNWAM , ONLY : PROPAG_UNWAM
#endif

USE YOMHOOK , ONLY : LHOOK, DR_HOOK, JPHOOK

USE NVTX
```
Is there a performance penalty associated with NVTX markers? If so, could they please be removed? If not, could we please guard them with an appropriate ifdef (probably not `_OPENACC`, because we would like to also run the code on LUMI at some point)?
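Something like the following guard is what I have in mind; note that the macro name `WAM_HAVE_NVTX` is hypothetical and would need to be wired into the build system:

```fortran
! Sketch only: WAM_HAVE_NVTX is a hypothetical build-time macro, not
! an existing ecWAM build option.
SUBROUTINE DEMO_GUARDED_MARKERS()
#ifdef WAM_HAVE_NVTX
  USE NVTX
#endif
  IMPLICIT NONE

#ifdef WAM_HAVE_NVTX
  CALL nvtxStartRange("demo region")   ! profiling marker, NVIDIA only
#endif
  ! ... work to be profiled ...
#ifdef WAM_HAVE_NVTX
  CALL nvtxEndRange
#endif
END SUBROUTINE DEMO_GUARDED_MARKERS
```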
Unless I'm misreading this, the NVTX markers have been commented out anyway, so I would suggest removing them altogether.
Done
src/ecwam/propag_wam.F90 (Outdated)
```fortran
& WLATN ,WLONN ,WCORN ,WKPMN ,WMPMN , &
& LLWLATN ,LLWLONN ,LLWCORN ,LLWKPMN ,LLWMPMN , &
& SUMWN , &
& JXO ,JYO ,KCR ,KPM ,MPM
```
Unless I am mistaken, I don't think these new module imports are being used; the openacc statements using them seem to be commented out. If so, could these please be removed?
Done
src/ecwam/ctuwupdt.F90 (Outdated)
```diff
@@ -175,22 +187,37 @@ SUBROUTINE CTUWUPDT (IJS, IJL, NINF, NSUP, &
 IF (.NOT. ALLOCATED(LLWMPMN)) ALLOCATE(LLWMPMN(NANG,NFRE_RED,-1:1))
 ENDIF

+!$acc enter data copyin(sumwn,LLWKPMN, WLATN,WLONN,WCORN,WKPMN)
```
See comment below about deallocating device memory.
Done
src/ecwam/ctuwupdt.F90 (Outdated)
```fortran
IF (.NOT. ALLOCATED(KPM)) ALLOCATE(KPM(NANG,-1:1))
IF (.NOT. ALLOCATED(JXO)) ALLOCATE(JXO(NANG,2))
IF (.NOT. ALLOCATED(JYO)) ALLOCATE(JYO(NANG,2))
IF (.NOT. ALLOCATED(KCR)) ALLOCATE(KCR(NANG,4))

!$ACC ENTER DATA COPYIN(KLON, KLAT, KCOR, JXO, JYO, KCR)
```
See comment below about freeing up device memory.
Done
src/ecwam/ctuwupdt.F90 (Outdated)
```fortran
USE YOMHOOK , ONLY : LHOOK, DR_HOOK, JPHOOK
!USE CTUWINI_MOD , ONLY : CTUWINI
```
Please remove commented out module imports.
Done
```diff
@@ -603,23 +710,28 @@ SUBROUTINE CTUW (DELPRO, MSTART, MEND, &
 !!!!!!INCLUDE THE BLOCKING COEFFICIENTS INTO THE WEIGHTS OF THE
 !     SURROUNDING POINTS.

+! call nvtxStartRange("ctuw: Loop 4")
+!$acc parallel loop collapse(3)
```
Please guard openacc parallel pragmas behind an ifdef.
My understanding is that the openacc parallel pragmas are skipped when not needed. Is there any other reason to guard them with an ifdef?
Yes, you are right, these can be left in 👍
```fortran
!* COMPUTE COS PHI FACTOR FOR ADJOINING GRID POINT.
! (for all grid points)
!$acc parallel loop independent collapse(2) private(KY,KK,KKM)
```
Can we please guard openacc parallel clauses behind ifdefs?
Thank you very much for this contribution and the great performance numbers. Thanks also to @awnawab for the detailed review and for confirming your results. I agree with all the comments and would indeed appreciate if these could be addressed, as well as testing whether GPU-aware MPI could be added in `MPEXCHNG`.
I have just approved the CI run (apologies, overlooked that this was pending) and we should aim for this to pass all checks.
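For reference, the kind of change being discussed for `MPEXCHNG` would look roughly like the sketch below, where `host_data` exposes the device addresses of the buffers to a CUDA-aware MPI library. The routine and argument names are purely illustrative; this is not the actual `MPEXCHNG` interface, and it assumes the buffers are already present on device:

```fortran
! Sketch only: names (SENDBUF, RECVBUF, IDEST, ISRC) are illustrative.
! With a CUDA-aware MPI library, passing device addresses lets the
! exchange run GPU-to-GPU without staging through the host.
SUBROUTINE DEMO_GPU_AWARE_EXCHANGE(SENDBUF, RECVBUF, N, IDEST, ISRC, ICOMM)
  USE MPI
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: N, IDEST, ISRC, ICOMM
  REAL, INTENT(IN)    :: SENDBUF(N)
  REAL, INTENT(OUT)   :: RECVBUF(N)
  INTEGER :: IERR, ISTATUS(MPI_STATUS_SIZE)

  ! host_data makes the device copies of the buffers visible to MPI
  !$acc host_data use_device(SENDBUF, RECVBUF)
  CALL MPI_SENDRECV(SENDBUF, N, MPI_REAL, IDEST, 0, &
       &            RECVBUF, N, MPI_REAL, ISRC,  0, &
       &            ICOMM, ISTATUS, IERR)
  !$acc end host_data
END SUBROUTINE DEMO_GPU_AWARE_EXCHANGE
```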
More a general note, mostly directed at us rather than you, and therefore nothing that requires action here: others have pointed out a performance penalty due to the use of `acc kernels` on LUMI, and recommended using `acc parallel` wherever possible. The suggestion was that the outlining implied by the kernels directive adds additional launch latency. As far as I'm aware, this is not the case on NVIDIA platforms, though. Hence, we may want to pay attention to this when we attempt to port this to LUMI.
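For concreteness, the two directive styles side by side; the loop bodies are placeholders:

```fortran
! With "kernels" the compiler outlines the region and decides for
! itself what to parallelise; with "parallel loop" the mapping is
! explicit, which reportedly avoids the extra launch latency on LUMI.
SUBROUTINE DEMO_DIRECTIVE_STYLES(A, B, N)
  INTEGER, INTENT(IN) :: N
  REAL, INTENT(INOUT) :: A(N), B(N)
  INTEGER :: I

  !$acc kernels            ! compiler-outlined region
  DO I = 1, N
    A(I) = A(I) + B(I)
  ENDDO
  !$acc end kernels

  !$acc parallel loop      ! explicit parallel mapping
  DO I = 1, N
    B(I) = 2.0*A(I)
  ENDDO
END SUBROUTINE DEMO_DIRECTIVE_STYLES
```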
src/ecwam/ctuw.F90 (Outdated)
```fortran
! LOOP OVER GRID POINTS
! ---------------------

#IFNDEF _OPENACC
```
Most compilers don't like upper-case preprocessor statements, could you please convert them to lower-case? `#ifndef` here, and similarly throughout the rest of the file.
Done
@fdisante do you need support with making the requested changes?
Hi Thomas, I don't think I need any help, but if I do, I'll let you know. I apologize for the delay.
removed comments in propags2
…paces in mpexchng and mubuf
Hi @fdisante. Just as a heads up, don't worry about the failing macos tests. That is a known problem unrelated to your work that Willem has already fixed.
From the mail discussions I understood that this last build failure is harmless; are we ready to close this pull request?
@fdisante As discussed offline, could you give us notice when you think the code is ready for another review?
Hello everyone, I was working on porting the proenvhalo routine as discussed during the last meeting. Unfortunately I'm still getting failed tests but, since it is a routine that just does assignments, in my opinion it can be left running on the CPU. This should not affect the performance of the code: the exchange of data between host and device is very limited, and the total wall time of this subroutine is below 0.00%. If you agree, we can mark the pull request ready for review. Otherwise, if you think this subroutine is worth porting to GPU, I need an iteration with @awnawab to fix the problem I'm having with the failed final checks (after Wednesday, since the system is down for maintenance today and tomorrow). Thanks a lot for all your feedback!
Thanks a lot @fdisante for continuing to grind through the PR 🙏 Could you please share your attempt at porting PROENVHALO?
Hi @awnawab, unfortunately I cannot access the system since it is in maintenance, and my porting efforts are also unavailable. As soon as the system is up, I can share the code with you.
Hi everyone, thanks to @awnawab's help, I've finally completed the requested changes for the pull request. Please feel free to start reviewing it at your convenience. I appreciate your time!
Thanks a lot @fdisante 🙏 As agreed offline, I'll remove Piero's commit so he doesn't have to sign the CLA. While I review the PR again, could you please confirm we haven't deteriorated the performance of the CPU-only code? The O320 grid should be sufficient. And when you gather these timings, could you please compile both the baseline and your contribution with the -march=core-avx2 flag?
Thanks @awnawab, the CPU code performance hasn't deteriorated. However, I haven't found an equivalent for -march=core-avx2 in NVHPC. I tried -Mvect=simd:256 but didn't see any significant changes in the simulation timing. Attached you can find an image of the CPU code performance for both versions:
Force-pushed from c122a58 to 54e56ef
Many thanks @fdisante for addressing all the requested changes 🙏 I am happy to report that after adding CUDA-aware MPI and porting `PROENVHALO`, the speedup (excluding data offload) for the wave propagation kernel has increased from 4x to an impressive 7x! The overall application wall-time is now also 15% less than the CPU-only variant, which bodes very well for the gains we can achieve without all the redundant data transfers (the timings above refer to 2 AMD Rome CPUs vs 2 A100 GPUs).
The nvhpc CI builds are known to be flaky and are in fact disabled on the main branch (they fail when building eccodes, which has nothing to do with your work). We don't have to worry about the nvhpc or macos tests failing.
I can only echo what @awnawab said: Many thanks for this great work and the impressive performance results. GTG from me!
```diff
@@ -4,9 +4,9 @@ frequencies: 29
 bathymetry: ETOPO1

 advection:
-  timestep: 450
+  timestep: 225
```
Just checking what the reason is behind this time step reduction. @awnawab is that intentional or should this be reverted to the original 450?
Well spotted, but this is intentional 😄