-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Run the entire advection loop on device #34
Conversation
@jksoual32 and @jrbidlot, could you please also review this PR? Many thanks! |
Ahmad, |
The validation passes both on CPU and GPU after these changes, so I am reasonably confident I haven't changed the results. But of course I will double check and run checksums on the output file before and after this change. Just to be sure, the relevant file to check would be the path stored in |
Just confirmed (using sha256sum) that the contents of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks okay to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Ahmad,
I think there is a bug in outblock around old line 555
The tricky bit with outblock is that it produces many purely diagnostic output and so it would not be picked-up by checking the standard output norms.
Did you test with all possible output turned on?
src/ecwam/outblock.F90
Outdated
IR=IR+1 | ||
IF (IPFGTBL(IR) /= 0) THEN | ||
CALL SEBTMEAN (KIJS, KIJL, FL2ND, TEWH(IH-1), TEWH(IH), BOUT(KIJS,ITOBOUT(IR))) | ||
IF (IPFGTBL(59) /= 0) THEN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The old code incremented IR from 59 to 59+NTEWH, but the new code only has 59 as an index and so all output will go to index 59
The afterward 60 and so one will pick up the wrong parameters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent catch! There was another similar loop over NTRAIN
earlier that I also missed. I've now fixed it and confirmed that the norms are identical even if all 87 output parameters are enabled. Many thanks for spotting this 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We got it working! Very good.
It is always tricky when dealing with the diagnostic output.
We need to remember to trigger the full output from the model in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
@@ -445,7 +445,7 @@ ecbuild_add_library( | |||
$<${HAVE_ACC}:OpenACC::OpenACC_Fortran> | |||
PUBLIC_INCLUDES $<INSTALL_INTERFACE:include> | |||
PRIVATE_INCLUDES ${CMAKE_CURRENT_SOURCE_DIR} | |||
PRIVATE_DEFINITIONS ${ECWAM_PRIVATE_DEFINITIONS} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The underscore in _CUDA gives the impression it is defined by the compiler. Probably it is better to use a WAM_HAVE_CUDA definition as for some other defines.
Many thanks @wdeconinck for spotting that, it is indeed more intuitive to rename the preproc def as you suggest. As discussed offline, if you are happy with the fix, could you please merge the branch? I will address the failing macos CI tests in a follow up PR. |
This PR hoists the data transfers out of the advection loop, making the entire timestep device resident and resulting in big reductions in the data offload cost. The attached plot shows the effects of the optimisations included herein.
The most invasive change is to
OUTBLOCK
. Whilst not a very computationally heavy routine, it needs to run on device in order to prevent copying all the state fields back to host. The indirection + accumulation pattern inOUTBLOCK
wasn't compatible with how scalars are privatised to GPU threads. @jkousal32 your recent ifs-source PR conflicts with some of these changes, I'll resolve that conflict once I merge this back into develop.One more optimisation PR is planned after this, which will focus on improving the GPU performance of the physics kernel.