Temporary array reductions #13
Conversation
@jrbidlot and @jkousal32 could you also please review this PR? Thanks 😄
Looks good to me, just some questions really; see below.
Some of the changes seem to go further than just reducing arrays.
@@ -382,7 +370,7 @@ SUBROUTINE IMPLSCH (KIJS, KIJL, FL1, &
  & EMEAN, FMEAN, F1MEAN, AKMEAN, XKMEAN)

! MEAN FREQUENCY CHARACTERISTIC FOR WIND SEA
  CALL FEMEANWS(KIJS, KIJL, FL1, XLLWS, EMEANWS, FMEANWS)
Is this swap of arguments intentional?
Yes, it is: the swap makes one of the arguments optional and moves it to the last position. If I remember correctly, this allowed a temporary array to be deleted from implsch.
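The reordering described in this reply can be sketched with a minimal Python analogue. The function and argument names mirror the Fortran interface above, but the bodies are purely illustrative, not the actual FEMEANWS computation:

```python
def femeanws(kijs, kijl, fl1, xllws, emeanws, fmeanws=None):
    """Sketch: the optional output argument is moved to the last
    position so callers that don't need it can simply omit it,
    instead of allocating a throwaway temporary array to pass in."""
    # Illustrative stand-in computation over the KIJS:KIJL slice.
    emeanws["value"] = sum(fl1[kijs - 1:kijl])
    if fmeanws is not None:  # only computed when the caller asks for it
        fmeanws["value"] = emeanws["value"] / (kijl - kijs + 1)

# A caller that does not need FMEANWS no longer allocates a dummy array:
e = {"value": 0.0}
femeanws(1, 3, [1.0, 2.0, 3.0], None, e)

# A caller that does need it passes it as the trailing argument:
e2, f = {"value": 0.0}, {"value": 0.0}
femeanws(1, 3, [1.0, 2.0, 3.0], None, e2, f)
```

In Fortran the same effect is achieved with an OPTIONAL dummy argument checked via PRESENT(); placing it last keeps existing positional call sites valid.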
src/ecwam/implsch.F90 (outdated)
@@ -347,7 +335,7 @@ SUBROUTINE IMPLSCH (KIJS, KIJL, FL1, &
  GTEMP1 = MAX((1.0_JWRB-DELT5*FLD(IJ,K,M)),1.0_JWRB)
  GTEMP2 = DELT*SL(IJ,K,M)/GTEMP1
  FLHAB = ABS(GTEMP2)
  FLHAB = MIN(FLHAB,TEMP(IJ,M))
Is this inlining of TEMP supposed to improve performance?
It has no effect on CPU, but temporary array allocation does have a big penalty on GPU. The Loki pool allocator, which was implemented after I made this change, mitigates this to a large extent. Nevertheless, it is better to avoid a temporary array when eliminating it only costs two extra multiplications.
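The trade-off in this reply, recomputing a cheap product instead of staging it in a temporary array, can be illustrated with a small hedged sketch. The names and the limiter formula are hypothetical stand-ins, not the actual ecWAM code:

```python
# Before: a temporary array staged a per-point limiter value.
def limiter_with_temp(delt, usfm, n):
    temp = [delt * usfm[ij] for ij in range(n)]  # temporary array
    return [min(abs(2.0 * usfm[ij]), temp[ij]) for ij in range(n)]

# After: the multiplication is repeated inside the loop instead,
# trading a handful of FLOPs for one fewer (device) allocation.
def limiter_inlined(delt, usfm, n):
    # bit-identical result, no temporary array
    return [min(abs(2.0 * usfm[ij]), delt * usfm[ij]) for ij in range(n)]

usfm = [0.5, 1.5, 2.5]
assert limiter_with_temp(0.1, usfm, 3) == limiter_inlined(0.1, usfm, 3)
```

On CPU the compiler often does this for you; on GPU the win is avoiding the allocation and the extra memory traffic entirely.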
@@ -259,7 +245,7 @@ SUBROUTINE SDISSIP_ARD (KIJS, KIJL, FL1, FLD, SL, &
  DO M2=1,M-NDIKCUMUL
    DO KK=0,NANGD
      DO IJ=KIJS,KIJL
        WCUMUL(IJ,KK,M2)=SQRT(ABS(C_C(IJ,M)+C_C(IJ,M2)-2.0_JWRB*C_(IJ,M)*C_(IJ,M2)*COSDTH(KK)))*TRPZ_DSIP(IJ,M2)
Is COS(KK*DELTH) faster than using the precomputed COSDTH?
Temporaries whose size is not a multiple of NPROMA can suffer from misaligned addresses on device in the Loki pool allocator. Of course, we can guard against this by padding each allocation to ensure it is a multiple of 8, but for single precision runs I would like to avoid this padding. These changes are all performance neutral on the CPU.
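The padding guard mentioned in this reply can be sketched as simple round-up arithmetic. This is a generic illustration of the idea, not Loki's actual allocator code:

```python
def padded_extent(n, align=8):
    """Round an allocation extent up to a multiple of `align` so each
    slot handed out by a pool allocator starts on an aligned address.
    The cost is up to (align - 1) wasted elements per allocation,
    which is the overhead one would like to avoid in single precision."""
    return ((n + align - 1) // align) * align

# An extent that is not a multiple of the alignment is rounded up;
# one that already is stays untouched.
assert padded_extent(1023) == 1024
assert padded_extent(1024) == 1024
```

Sizing temporaries as a multiple of NPROMA from the start makes this padding, and its single-precision memory overhead, unnecessary.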
Hi Ahmad,
I assume this large set of changes is necessary.
The slicing of the arrays (KIJS:KIJL) comes from the time when the code had NPROMA blocks and vectors were split into KIJS:KIJL slices, where KIJS could be different from 1.
Removing the slicing now essentially means that KIJS is always 1 and KIJL is the dimension.
Does it make sense to keep KIJS, rather than changing it to 1?
These are quite a lot of changes. How did you test them?
Otherwise, apologies: I have made use of a lot of temporary arrays. It was something we used to do a lot with the old code.
Hi Jean, thanks for the feedback. I'll try and answer point by point:
Good to know the NPROMA requirements; I will try to adhere to them.
Force-pushed from 10ac544 to 9085bd3.
After restoring some of the changes, this branch is now bit-identical to main in coupled IFS+wam+nemo and IFS+wam runs. I have also added a unit-test for it. Once the 3 GPU offload PRs are merged, I will also file a PR to ifs so that they run through the entire ecflow CI suite and we can be sure results are unaffected for all the possible configurations we might be interested in.
For the testing: note that CY49R1 will actually activate this configuration.
Ah, good to know! The coupled runs I did were compared to a cy49r1 baseline, so the changes are safe for that configuration too. Nevertheless, I will update the unit-tests so we have one that matches the above configuration. |
From what I can see, these seem like sensible changes to me.
All looks good to me too. Very nice to see how the optimization works and what the best practices are for enabling efficient running on GPU.
This is the first of 3 PRs aimed at merging the initial GPU port of ecWAM into the main branch. The focus of the current PR is to make small, bit-identical changes to the source-term computations to reduce the number of temporary arrays. Due to the design of one of Loki's GPU memory management recipes, it is especially important to avoid temporary arrays that lack compile-time parameter bounds and do not have NPROMA as the leading dimension.
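Why the leading dimension matters can be sketched generically. This is illustrative Python with hypothetical shapes, not the ecWAM declarations: in Fortran's column-major layout, the leading dimension varies fastest in memory, so with NPROMA leading, consecutive grid points are contiguous, which is what a slab-based pool allocator and coalesced GPU accesses generally rely on.

```python
NPROMA = 8  # illustrative block size (grid points per block)
NFRE = 3    # illustrative number of frequencies

def fortran_offset(ij, m, lead=NPROMA):
    """Linear memory offset of A(ij, m) for a Fortran (column-major)
    array A(lead, NFRE): the leading index ij varies fastest, so
    adjacent grid points sit next to each other for a fixed m."""
    return (ij - 1) + (m - 1) * lead

# Stepping to the next grid point moves one element in memory ...
assert fortran_offset(2, 1) - fortran_offset(1, 1) == 1
# ... while stepping in frequency jumps by the whole leading dimension.
assert fortran_offset(1, 2) - fortran_offset(1, 1) == NPROMA
```

With NPROMA leading and known at compile time, every temporary's extent is an aligned multiple of the block size, which sidesteps the misalignment issue discussed earlier in the thread.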