
Temporary array reductions #13

Merged: wdeconinck merged 19 commits into main from naan-temp-array-reductions on Mar 27, 2024
Conversation

awnawab (Contributor) commented Mar 25, 2024

This is the first of 3 PRs aimed at merging the initial GPU port of ecWAM to the main branch. The focus of the current PR is to make small, bit-identical changes to the source-term computations that reduce the number of temporary arrays. Due to the design of one of Loki's GPU memory management recipes, it is especially important to avoid temporary arrays whose bounds are not compile-time parameters and whose leading dimension is not NPROMA.
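
A declaration-level sketch of the distinction (illustrative names and values, not actual ecWAM declarations):

    SUBROUTINE SHAPE_SKETCH(KIJS, KIJL)
      IMPLICIT NONE
      INTEGER, PARAMETER :: JWRB = SELECTED_REAL_KIND(13, 300) ! stand-in for the working-precision kind
      INTEGER, PARAMETER :: NPROMA = 32, NANG_PARAM = 24       ! stand-in compile-time values
      INTEGER, INTENT(IN) :: KIJS, KIJL

      ! Problematic for the Loki pool allocator: a runtime leading
      ! bound that is not NPROMA.
      REAL(KIND=JWRB) :: TEMP(KIJS:KIJL, NANG_PARAM)

      ! Fine: the leading dimension is the block size NPROMA, even if
      ! loops only ever touch the KIJS:KIJL slice.
      REAL(KIND=JWRB) :: WORK(NPROMA, NANG_PARAM)

      ! Also fine: all bounds are compile-time parameters.
      REAL(KIND=JWRB) :: BUF(NANG_PARAM)
    END SUBROUTINE SHAPE_SKETCH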

awnawab requested a review from wdeconinck (March 25, 2024 10:30)
awnawab (Contributor, Author) commented Mar 25, 2024

@jrbidlot and @jkousal32 could you also please review this PR? Thanks 😄

wdeconinck (Collaborator) left a comment

Looks good to me, just some questions really; see below.
Some changes seem to go further than just reducing temporary arrays.

@@ -382,7 +370,7 @@ SUBROUTINE IMPLSCH (KIJS, KIJL, FL1, &
& EMEAN, FMEAN, F1MEAN, AKMEAN, XKMEAN)

! MEAN FREQUENCY CHARACTERISTIC FOR WIND SEA
CALL FEMEANWS(KIJS, KIJL, FL1, XLLWS, EMEANWS, FMEANWS)
wdeconinck (Collaborator):

Is this swap of arguments intentional?

awnawab (Contributor, Author):

Yes it is: one of the arguments is now optional, and moving it to the end lets callers simply omit it. This allowed a temporary array to be deleted from implsch, if I remember correctly.
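
Concretely, the pattern looks something like this (illustrative names, not the actual FEMEANWS interface):

    SUBROUTINE FEMEAN_LIKE(KIJS, KIJL, FL1, FMEAN, EMEAN)
      ! Moving the argument some callers don't need to the end and
      ! marking it OPTIONAL lets those callers omit it, so they no
      ! longer need a throwaway temporary to pass in.
      IMPLICIT NONE
      INTEGER, PARAMETER :: JWRB = SELECTED_REAL_KIND(13, 300) ! stand-in kind
      INTEGER, INTENT(IN) :: KIJS, KIJL
      REAL(KIND=JWRB), INTENT(IN)  :: FL1(:,:,:)
      REAL(KIND=JWRB), INTENT(OUT) :: FMEAN(:)
      REAL(KIND=JWRB), INTENT(OUT), OPTIONAL :: EMEAN(:)

      FMEAN(KIJS:KIJL) = 0.0_JWRB      ! placeholder computation
      IF (PRESENT(EMEAN)) THEN
        EMEAN(KIJS:KIJL) = 0.0_JWRB    ! computed only when requested
      END IF
    END SUBROUTINE FEMEAN_LIKE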

src/ecwam/sinput_jan.F90 (review thread resolved)
@@ -347,7 +335,7 @@ SUBROUTINE IMPLSCH (KIJS, KIJL, FL1, &
GTEMP1 = MAX((1.0_JWRB-DELT5*FLD(IJ,K,M)),1.0_JWRB)
GTEMP2 = DELT*SL(IJ,K,M)/GTEMP1
FLHAB = ABS(GTEMP2)
FLHAB = MIN(FLHAB,TEMP(IJ,M))
wdeconinck (Collaborator):

Is this inlining of TEMP supposed to improve performance?

awnawab (Contributor, Author) commented Mar 26, 2024

It has no effect on the CPU, but temporary array allocation does carry a big penalty on GPU. The Loki pool allocator, which was implemented after I made this change, mitigates this to a large extent. Nevertheless, it is better to avoid a temporary array when eliminating it costs only a couple of multiplications.
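
Schematically (the expression below is a stand-in, not the real definition of TEMP in implsch; the point is only the shape of the rewrite):

    PROGRAM INLINE_TEMP_DEMO
      IMPLICIT NONE
      INTEGER, PARAMETER :: JWRB = SELECTED_REAL_KIND(13, 300)
      INTEGER, PARAMETER :: KIJS = 1, KIJL = 4, NFRE = 3
      REAL(KIND=JWRB) :: TEMP(KIJS:KIJL, NFRE)   ! the temporary to eliminate
      REAL(KIND=JWRB) :: COEF(NFRE), DEPTH(KIJS:KIJL)
      REAL(KIND=JWRB) :: FLHAB_OLD, FLHAB_NEW
      INTEGER :: IJ, M

      COEF  = 0.5_JWRB
      DEPTH = 2.0_JWRB

      ! Before: fill a per-block temporary, then read it back later.
      DO M = 1, NFRE
        DO IJ = KIJS, KIJL
          TEMP(IJ, M) = COEF(M)*DEPTH(IJ)
        END DO
      END DO
      FLHAB_OLD = MIN(1.0_JWRB, TEMP(KIJS, 1))

      ! After: recompute the product inline; TEMP can be deleted.
      FLHAB_NEW = MIN(1.0_JWRB, COEF(1)*DEPTH(KIJS))

      PRINT *, FLHAB_OLD == FLHAB_NEW   ! T: the rewrite is bit-identical
    END PROGRAM INLINE_TEMP_DEMO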

@@ -259,7 +245,7 @@ SUBROUTINE SDISSIP_ARD (KIJS, KIJL, FL1, FLD, SL, &
DO M2=1,M-NDIKCUMUL
DO KK=0,NANGD
DO IJ=KIJS,KIJL
WCUMUL(IJ,KK,M2)=SQRT(ABS(C_C(IJ,M)+C_C(IJ,M2)-2.0_JWRB*C_(IJ,M)*C_(IJ,M2)*COSDTH(KK)))*TRPZ_DSIP(IJ,M2)
wdeconinck (Collaborator):

Is COS(KK*DELTH) faster than using the precomputed COSDTH ?

awnawab (Contributor, Author):

Temporaries whose size is not a multiple of NPROMA can suffer from misaligned addresses on device with the Loki pool allocator. Of course, we could guard against this by padding each allocation to a multiple of 8, but for single-precision runs I would like to avoid that padding. These changes are all performance-neutral on the CPU.
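
The padding guard would amount to something like this (a sketch, not Loki's actual implementation):

    PROGRAM PAD_DEMO
      ! Round an allocation up to a multiple of 8 elements so the next
      ! slab handed out by the pool starts on an aligned boundary.
      IMPLICIT NONE
      INTEGER :: NSIZE, NPADDED
      NSIZE = 37                        ! e.g. an odd-sized temporary
      NPADDED = ((NSIZE + 7) / 8) * 8   ! integer division floors, so this rounds up
      PRINT *, NSIZE, NPADDED           ! prints 37, 40
    END PROGRAM PAD_DEMO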

jrbidlot (Contributor) left a comment

Hi Ahmad,
I assume this large set of changes is necessary.
The slicing of the arrays (KIJS:KIJL) dates from the time before the code had NPROMA blocks, when vectors were split into KIJS:KIJL slices and KIJS could be different from 1.
Removing the slicing now essentially means that KIJS is always 1 and KIJL is the dimension.
Does it make sense to keep KIJS, rather than changing it to 1?

jrbidlot (Contributor) commented

These are quite substantial changes. How did you test them?
Some modified routines are only called for a specific choice of the wave physics package. Did you try different options for IPHYS (0 and 1)?
Some other changes only affect the post-processing routines; I don't think the default tests cover those.

jrbidlot (Contributor) commented

Otherwise, apologies: I have made use of a lot of temporary arrays. It was something we used to do a lot with the old code.
If the refactored code still works fine on CPU and this is necessary for GPU, we will need to avoid temporary arrays in future changes.

awnawab (Contributor, Author) commented Mar 26, 2024

Hi Jean,

Thanks for the feedback. I'll try and answer point-by-point:

  1. The change only affects the allocation of the arrays; the loop bounds are still set via KIJS and KIJL. I think it is better to fix the allocation size to NPROMA even if we loop over smaller parts of it (see the sketch after this list). Setting the leading dimension to NPROMA for "single-column" routines (which doesn't apply here physically, but does computationally) is also part of the new ifs-arpege coding standard (https://github.com/ecmwf-ifs/ifs-arpege-coding-standards/blob/main/fortran/rules/L12.rst).
  2. That is a good point; the test coverage needs to be wider for such an invasive change. I will add a test for the alternate IPHYS to the CI as well. I will also test this in a Tco399 forecast experiment coupled with the IFS and NEMO.
  3. Temporary arrays in general are not too bad, but I would be grateful if we could avoid temporary arrays that don't have NPROMA as the leading dimension.
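
For point 1, a minimal sketch of the convention (illustrative routine and block size, not actual ecWAM code):

    SUBROUTINE MEAN_ENERGY(KIJS, KIJL, FL1, EMEAN)
      ! The allocation is fixed to the block size NPROMA, while the
      ! loop bounds still honour the active slice KIJS:KIJL.
      IMPLICIT NONE
      INTEGER, PARAMETER :: JWRB = SELECTED_REAL_KIND(13, 300) ! stand-in kind
      INTEGER, PARAMETER :: NPROMA = 32                        ! stand-in block size
      INTEGER, INTENT(IN) :: KIJS, KIJL
      REAL(KIND=JWRB), INTENT(IN)  :: FL1(NPROMA)
      REAL(KIND=JWRB), INTENT(OUT) :: EMEAN(NPROMA)  ! full block size
      INTEGER :: IJ

      DO IJ = KIJS, KIJL                             ! active slice only
        EMEAN(IJ) = FL1(IJ)**2                       ! placeholder computation
      END DO
    END SUBROUTINE MEAN_ENERGY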

jrbidlot (Contributor) commented

Good to know the NPROMA requirement; I will try to adhere to it.
Yes, the extra testing will be good.

awnawab force-pushed the naan-temp-array-reductions branch from 10ac544 to 9085bd3 (March 27, 2024 12:28)
awnawab (Contributor, Author) commented Mar 27, 2024

After restoring some of the changes, this branch is now bit-identical to main in coupled IFS+wam+nemo and IFS+wam runs. I have also added a unit test for IPHYS=0 and LLGCBZ0=T; I saw that the ifs-source CI forecast experiment uses the latter option, which results in a different code path being followed in the physics.

Once the 3 GPU offload PRs are merged, I will also file a PR to ifs so that the changes run through the entire ecflow CI suite and we can be sure results are unaffected for all the possible configurations we might be interested in.

awnawab requested reviews from jrbidlot and wdeconinck (March 27, 2024 13:30)
jrbidlot (Contributor) commented

For the testing: CY49R1 will actually activate LLGCBZ0=T and LLNORMAGAM=T with IPHYS=1.

awnawab (Contributor, Author) commented Mar 27, 2024

Ah, good to know! The coupled runs I did were compared to a CY49R1 baseline, so the changes are safe for that configuration too. Nevertheless, I will update the unit tests so we have one that matches the above configuration.
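
Such a test would pin the three switches named above, e.g. via a namelist fragment like the following (the group name NAML_TEST is an assumption; IPHYS, LLGCBZ0, and LLNORMAGAM are the actual switches):

    &NAML_TEST
      IPHYS      = 1,       ! wave physics package
      LLGCBZ0    = .TRUE.,
      LLNORMAGAM = .TRUE.,  ! the CY49R1 defaults noted by jrbidlot
    /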

jrbidlot (Contributor) left a comment

From what I can see, these seem like sensible changes.

jkousal32 left a comment

All looks good to me too. Very nice to see how the optimization works, and what the best practices are to enable efficient running on GPU.

wdeconinck merged commit dd11e32 into main on Mar 27, 2024. 9 checks passed.
awnawab deleted the naan-temp-array-reductions branch on March 28, 2024.