OpenACC port of wave propagation (ACCV6) #6
Conversation
Many many thanks for all the great work in this PR Fabio (and all others involved of course)! I compared O320 run-times on 2 A100 GPUs to 2 AMD Rome CPUs. I measured a 4.3x speedup on the GPU, including the data offload time for some of the smaller arrays but excluding the time taken to offload `FL1`. If I include the offload time for `FL1`, the speedup drops to 1.3x. The former number is however the more important one, since data offload costs will be negligible once I've refactored the time-step, and a >4x speedup is extremely impressive indeed!
There are nevertheless a few things which I would like you to address:

- The code now suffers heavily from memory leaks on device, and I was unable to even run it on 4 A100 40GB GPUs without restoring the FIELD_API destructors. I believe these are chiefly responsible for the memory leak, but there are also several `!$acc enter data` statements without matching `!$acc exit data delete` statements. I think these should be added to the code to avoid potential memory leaks (see the sketch after this list).
- Validation on both GPU variants still passes, but the CPU-only ecWAM variant fails validation after the changes in this PR. I initially suspected unguarded `!$acc parallel/kernels` statements to be the cause, as you can tell from my comments in the code, but now I am leaning more towards some of the manually implemented loop optimizations. The NVIDIA CPU compiler is known to be fragile when trying to vectorize large loops, such as we now have in `CTUW` after the loop unrolling. The source of the error needs to be identified and resolved.
- The performance of some of the newly ported GPU kernels is linked to the number of CPU OpenMP threads because of the manual OpenMP partitioning; the outermost bounds of some of the loop nests are determined by the number of available OpenMP threads. As a consequence, the GPU code runs slower as the OpenMP thread count increases (I measured a 2x speedup when running with 1 OpenMP thread as compared to 64). This obviously shouldn't be the case, and I've suggested a possible easy fix in the code.
- CUDA-aware MPI is still missing, and the subroutine `PROENVHALO` has also not been ported as a consequence. Unless I have misunderstood something, CUDA-aware MPI could be added by replacing calls to `MPL_SEND/RECV` with raw MPI calls in `MPEXCHNG`. @reuterbal can confirm whether I am correct on this.
- Generally, the code has been left in a bit of a messy state. I am referring here to many (!) commented-out loop nests or OpenACC statements, newly added module imports that are unused, lots of whitespace changes, etc. Whilst none of these individually is that serious, added up they do hurt the readability of the code, and I would be grateful if the code could be cleaned up.
Thanks once again for the amazing work. Once the points above are addressed, this will really be a huge step forward in the capabilities of the ecWAM model!
```diff
@@ -78,4 +78,5 @@ MODULE YOWMAP
 ! (i.e. NO LAND AND DEEP WATER).
 !
 ! ----------------------------------------------------------------------
+
 END MODULE YOWMAP
```
Please revert this whitespace change.
Done
src/ecwam/wamintgr_loki_gpu.F90 (Outdated)
```diff
@@ -245,12 +259,6 @@ SUBROUTINE WAMINTGR_LOKI_GPU(CDTPRA, CDATE, CDATEWH, CDTIMP, CDTIMPNEXT, &
 CALL SRC_CONTRIBS%ENSURE_HOST()

-!$loki update_host
-CALL WVPRPT_FIELD%FINAL()
```
Why were these FIELD_API destructors removed? This was causing a memory leak on device, meaning I wasn't able to run an O320 grid even on 4 40GB A100s without restoring these destructors.
The FIELD_API destructors have been restored now that the new version of field_api has been installed.
src/ecwam/propag_wam.F90 (Outdated)
```fortran
ENDDO
ENDDO
ENDIF

ENDDO
!$acc end kernels

!F !$acc kernels loop independent private(KIJS, IJSB, KIJL, IJLB)
```
Similar to here, a lot of mid-development (I assume) comments have been left in the code. Whilst this isn't a big problem, could these please be removed? In my opinion they make the code less readable. Thanks!
src/ecwam/propag_wam.F90 (Outdated)
```fortran
#ifdef WAM_HAVE_UNWAM
USE UNWAM , ONLY : PROPAG_UNWAM
#endif

USE YOMHOOK , ONLY : LHOOK, DR_HOOK, JPHOOK

USE NVTX
```
Is there a performance penalty associated with NVTX markers? If so, could they please be removed? If not, could we please guard them with an appropriate ifdef (probably not `_OPENACC`, because we would like to also run the code on LUMI at some point)?
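Something like the following guard is what I have in mind; note that the macro name `WAM_HAVE_NVTX` is hypothetical and would need to be wired into the build system:

```fortran
! Sketch only: WAM_HAVE_NVTX is a hypothetical build-time macro, not
! an existing ecWAM build option.
SUBROUTINE DEMO_GUARDED_MARKERS()
#ifdef WAM_HAVE_NVTX
  USE NVTX
#endif
  IMPLICIT NONE

#ifdef WAM_HAVE_NVTX
  CALL nvtxStartRange("demo region")   ! profiling marker, NVIDIA only
#endif
  ! ... work to be profiled ...
#ifdef WAM_HAVE_NVTX
  CALL nvtxEndRange
#endif
END SUBROUTINE DEMO_GUARDED_MARKERS
```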
Unless I'm misreading this, the NVTX markers have been commented out anyway, so I would suggest removing them altogether.
Done
src/ecwam/propag_wam.F90 (Outdated)
```fortran
& WLATN ,WLONN ,WCORN ,WKPMN ,WMPMN , &
& LLWLATN ,LLWLONN ,LLWCORN ,LLWKPMN ,LLWMPMN , &
& SUMWN , &
& JXO ,JYO ,KCR ,KPM ,MPM
```
Unless I am mistaken, I don't think these new module imports are being used; the openacc statements using them seem to be commented out. If so, could these please be removed?
Done
src/ecwam/ctuwupdt.F90 (Outdated)
```diff
@@ -175,22 +187,37 @@ SUBROUTINE CTUWUPDT (IJS, IJL, NINF, NSUP, &
 IF (.NOT. ALLOCATED(LLWMPMN)) ALLOCATE(LLWMPMN(NANG,NFRE_RED,-1:1))
 ENDIF

+!$acc enter data copyin(sumwn,LLWKPMN, WLATN,WLONN,WCORN,WKPMN)
```
See comment below about deallocating device memory.
Done
src/ecwam/ctuwupdt.F90 (Outdated)
```fortran
IF (.NOT. ALLOCATED(KPM)) ALLOCATE(KPM(NANG,-1:1))
IF (.NOT. ALLOCATED(JXO)) ALLOCATE(JXO(NANG,2))
IF (.NOT. ALLOCATED(JYO)) ALLOCATE(JYO(NANG,2))
IF (.NOT. ALLOCATED(KCR)) ALLOCATE(KCR(NANG,4))

!$ACC ENTER DATA COPYIN(KLON, KLAT, KCOR, JXO, JYO, KCR)
```
See comment below about freeing up device memory.
Done
src/ecwam/ctuwupdt.F90 (Outdated)
```fortran
USE YOMHOOK , ONLY : LHOOK, DR_HOOK, JPHOOK
!USE CTUWINI_MOD , ONLY : CTUWINI
```
Please remove commented out module imports.
Done
```diff
@@ -603,23 +710,28 @@ SUBROUTINE CTUW (DELPRO, MSTART, MEND, &
 !!!!!!INCLUDE THE BLOCKING COEFFICIENTS INTO THE WEIGHTS OF THE
 !     SURROUNDING POINTS.

+! call nvtxStartRange("ctuw: Loop 4")
+!$acc parallel loop collapse(3)
```
Please guard openacc parallel pragmas behind an ifdef.
My understanding is that the openacc parallel pragmas are skipped when not needed. Is there any other reason to guard them with an ifdef?
Yes, you are right, these can be left in 👍
```fortran
!* COMPUTE COS PHI FACTOR FOR ADJOINING GRID POINT.
! (for all grid points)
!$acc parallel loop independent collapse(2) private(KY,KK,KKM)
```
Can we please guard openacc parallel clauses behind ifdefs?
Thank you very much for this contribution and the great performance numbers. Thanks also to @awnawab for the detailed review and for confirming your results. I agree with all the comments and would indeed appreciate if these could be addressed, as well as testing whether GPU-aware MPI could be added in `MPEXCHNG`.
I have just approved the CI run (apologies, overlooked that this was pending) and we should aim for this to pass all checks.
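For reference, the kind of change being discussed for `MPEXCHNG` would look roughly like the sketch below, where `host_data` exposes the device addresses of the buffers to a CUDA-aware MPI library. The routine and argument names are purely illustrative; this is not the actual `MPEXCHNG` interface, and it assumes the buffers are already present on device:

```fortran
! Sketch only: names (SENDBUF, RECVBUF, IDEST, ISRC) are illustrative.
! With a CUDA-aware MPI library, passing device addresses lets the
! exchange run GPU-to-GPU without staging through the host.
SUBROUTINE DEMO_GPU_AWARE_EXCHANGE(SENDBUF, RECVBUF, N, IDEST, ISRC, ICOMM)
  USE MPI
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: N, IDEST, ISRC, ICOMM
  REAL, INTENT(IN)    :: SENDBUF(N)
  REAL, INTENT(OUT)   :: RECVBUF(N)
  INTEGER :: IERR, ISTATUS(MPI_STATUS_SIZE)

  ! host_data makes the device copies of the buffers visible to MPI
  !$acc host_data use_device(SENDBUF, RECVBUF)
  CALL MPI_SENDRECV(SENDBUF, N, MPI_REAL, IDEST, 0, &
       &            RECVBUF, N, MPI_REAL, ISRC,  0, &
       &            ICOMM, ISTATUS, IERR)
  !$acc end host_data
END SUBROUTINE DEMO_GPU_AWARE_EXCHANGE
```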
More a general note, mostly directed at us rather than you, and therefore nothing that requires action here: others have pointed out a performance penalty due to the use of `acc kernels` on LUMI, and recommended using `acc parallel` wherever possible. The suggestion was that the outlining implied by the kernels directive adds additional launch latency. As far as I'm aware, this is not the case on NVIDIA platforms, though. Hence, we may want to pay attention to this when we attempt to port this to LUMI.
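For concreteness, the two directive styles side by side; the loop bodies are placeholders:

```fortran
! With "kernels" the compiler outlines the region and decides for
! itself what to parallelise; with "parallel loop" the mapping is
! explicit, which reportedly avoids the extra launch latency on LUMI.
SUBROUTINE DEMO_DIRECTIVE_STYLES(A, B, N)
  INTEGER, INTENT(IN) :: N
  REAL, INTENT(INOUT) :: A(N), B(N)
  INTEGER :: I

  !$acc kernels            ! compiler-outlined region
  DO I = 1, N
    A(I) = A(I) + B(I)
  ENDDO
  !$acc end kernels

  !$acc parallel loop      ! explicit parallel mapping
  DO I = 1, N
    B(I) = 2.0*A(I)
  ENDDO
END SUBROUTINE DEMO_DIRECTIVE_STYLES
```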
src/ecwam/ctuw.F90 (Outdated)
```fortran
! LOOP OVER GRID POINTS
! ---------------------

#IFNDEF _OPENACC
```
Most compilers don't like upper-case preprocessor statements, could you please convert them to lower-case? `#ifndef` here, and similarly throughout the rest of the file.
Done
@fdisante do you need support with making the requested changes?
Hi Thomas, I don't think I need any help, but if I do, I'll let you know. I apologize for the delay.
removed comments in propags2
…paces in mpexchng and mubuf
Hi @fdisante. Just as a heads up, don't worry about the failing macos tests. That is a known problem unrelated to your work that Willem has already fixed.
From the mail discussions I understood that this last build failure is harmless; are we ready to close this pull request?
@fdisante As discussed offline, could you give us notice when you think the code is ready for another review?
Hello everyone, I was working on porting the proenvhalo routine as discussed during the last meeting. Unfortunately I'm still getting failed tests but, since it is a routine that just does assignments, in my opinion it can be left running on the CPU. This should not affect the performance of the code: the exchange of data between host and device is very limited, and the total wall time of this subroutine is below 0.00%. If you agree, we can mark the pull request ready for review. Otherwise, if you think this subroutine is worth porting to GPU, I need an iteration with @awnawab to fix the problem I'm having with the failed final checks (after Wednesday, since the system is down for maintenance today and tomorrow). Thanks a lot for all your feedback!
Thanks a lot @fdisante for continuing to grind through the PR 🙏 Could you please share your attempt at porting PROENVHALO?
Hi @awnawab, unfortunately I cannot access the system since it is in maintenance, and my porting efforts are also unavailable. As soon as the system is up, I can share the code with you.
Hi everyone, thanks to @awnawab's help, I've finally completed the requested changes for the pull request. Please feel free to start reviewing it at your convenience. I appreciate your time!
Thanks a lot @fdisante 🙏 As agreed offline, I'll remove Piero's commit so he doesn't have to sign the CLA. While I review the PR again, could you please confirm we haven't deteriorated the performance of the CPU-only code? The O320 grid should be sufficient. And when you gather these timings, could you please compile both the baseline and your contribution with the -march=core-avx2 flag?
Thanks @awnawab, the CPU code performance hasn't deteriorated. However, I haven't found an equivalent for -march=core-avx2 in NVHPC. I tried -Mvect=simd:256 but didn't see any significant changes in the simulation timing. Attached you can find an image of the CPU code performance for both versions:
Force-pushed from c122a58 to 54e56ef
Many thanks @fdisante for addressing all the requested changes 🙏 I am happy to report that after adding CUDA-aware MPI and porting `PROENVHALO`, the speedup (excluding data offload) for the wave propagation kernel has increased from 4x to an impressive 7x! The overall application wall-time is now also 15% less than the CPU-only variant, which bodes very well for the gains we can achieve without all the redundant data transfers (the timings above refer to 2 AMD Rome CPUs vs 2 A100 GPUs).
The nvhpc CI builds are known to be flaky and are in fact disabled on the main branch (they fail when building eccodes, which has nothing to do with your work). We don't have to worry about the nvhpc or macos tests failing.
I can only echo what @awnawab said: Many thanks for this great work and the impressive performance results. GTG from me!
```diff
@@ -4,9 +4,9 @@ frequencies: 29
 bathymetry: ETOPO1

 advection:
-  timestep: 450
+  timestep: 225
```
Just checking what the reason is behind this time step reduction. @awnawab is that intentional or should this be reverted to the original 450?
Well spotted, but this is intentional 😄