Temporary array reductions #13
Conversation
@jrbidlot and @jkousal32 could you also please review this PR? Thanks 😄
Looks good to me, just some questions really; see below.
Some of the changes seem to go further than just reducing arrays.
@@ -382,7 +370,7 @@ SUBROUTINE IMPLSCH (KIJS, KIJL, FL1, &
  & EMEAN, FMEAN, F1MEAN, AKMEAN, XKMEAN)

! MEAN FREQUENCY CHARACTERISTIC FOR WIND SEA
  CALL FEMEANWS(KIJS, KIJL, FL1, XLLWS, EMEANWS, FMEANWS)
Is this swap of arguments intentional?
Yes, it is: the swap makes one of the arguments optional and moves it to the last position. If I remember correctly, this allowed a temporary array to be deleted from implsch.
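The reordering described in this reply can be sketched with a minimal Python analogue. The function and argument names mirror the Fortran interface above, but the bodies are purely illustrative, not the actual FEMEANWS computation:

```python
def femeanws(kijs, kijl, fl1, xllws, emeanws, fmeanws=None):
    """Sketch: the optional output argument is moved to the last
    position so callers that don't need it can simply omit it,
    instead of allocating a throwaway temporary array to pass in."""
    # Illustrative stand-in computation over the KIJS:KIJL slice.
    emeanws["value"] = sum(fl1[kijs - 1:kijl])
    if fmeanws is not None:  # only computed when the caller asks for it
        fmeanws["value"] = emeanws["value"] / (kijl - kijs + 1)

# A caller that does not need FMEANWS no longer allocates a dummy array:
e = {"value": 0.0}
femeanws(1, 3, [1.0, 2.0, 3.0], None, e)

# A caller that does need it passes it as the trailing argument:
e2, f = {"value": 0.0}, {"value": 0.0}
femeanws(1, 3, [1.0, 2.0, 3.0], None, e2, f)
```

In Fortran the same effect is achieved with an OPTIONAL dummy argument checked via PRESENT(); placing it last keeps existing positional call sites valid.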
src/ecwam/implsch.F90 (outdated)
@@ -347,7 +335,7 @@ SUBROUTINE IMPLSCH (KIJS, KIJL, FL1, &
  GTEMP1 = MAX((1.0_JWRB-DELT5*FLD(IJ,K,M)),1.0_JWRB)
  GTEMP2 = DELT*SL(IJ,K,M)/GTEMP1
  FLHAB = ABS(GTEMP2)
  FLHAB = MIN(FLHAB,TEMP(IJ,M))
Is this inlining of TEMP supposed to improve performance?
It has no effect on CPU, but temporary array allocation does have a big penalty on GPU. The Loki pool allocator, which was implemented after I made this change, mitigates this to a large extent. Nevertheless, it is better to avoid a temporary array when eliminating it only costs two extra multiplications.
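The trade-off in this reply, recomputing a cheap product instead of staging it in a temporary array, can be illustrated with a small hedged sketch. The names and the limiter formula are hypothetical stand-ins, not the actual ecWAM code:

```python
# Before: a temporary array staged a per-point limiter value.
def limiter_with_temp(delt, usfm, n):
    temp = [delt * usfm[ij] for ij in range(n)]  # temporary array
    return [min(abs(2.0 * usfm[ij]), temp[ij]) for ij in range(n)]

# After: the multiplication is repeated inside the loop instead,
# trading a handful of FLOPs for one fewer (device) allocation.
def limiter_inlined(delt, usfm, n):
    # bit-identical result, no temporary array
    return [min(abs(2.0 * usfm[ij]), delt * usfm[ij]) for ij in range(n)]

usfm = [0.5, 1.5, 2.5]
assert limiter_with_temp(0.1, usfm, 3) == limiter_inlined(0.1, usfm, 3)
```

On CPU the compiler often does this for you; on GPU the win is avoiding the allocation and the extra memory traffic entirely.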
@@ -259,7 +245,7 @@ SUBROUTINE SDISSIP_ARD (KIJS, KIJL, FL1, FLD, SL, &
  DO M2=1,M-NDIKCUMUL
    DO KK=0,NANGD
      DO IJ=KIJS,KIJL
        WCUMUL(IJ,KK,M2)=SQRT(ABS(C_C(IJ,M)+C_C(IJ,M2)-2.0_JWRB*C_(IJ,M)*C_(IJ,M2)*COSDTH(KK)))*TRPZ_DSIP(IJ,M2)
Is COS(KK*DELTH) faster than using the precomputed COSDTH?
Temporaries whose size is not a multiple of NPROMA can suffer from misaligned addresses on device in the Loki pool allocator. Of course, we can guard against this by padding each allocation to ensure it is a multiple of 8, but for single precision runs I would like to avoid this padding. These changes are all performance neutral on the CPU.
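The padding guard mentioned in this reply can be sketched as simple round-up arithmetic. This is a generic illustration of the idea, not Loki's actual allocator code:

```python
def padded_extent(n, align=8):
    """Round an allocation extent up to a multiple of `align` so each
    slot handed out by a pool allocator starts on an aligned address.
    The cost is up to (align - 1) wasted elements per allocation,
    which is the overhead one would like to avoid in single precision."""
    return ((n + align - 1) // align) * align

# An extent that is not a multiple of the alignment is rounded up;
# one that already is stays untouched.
assert padded_extent(1023) == 1024
assert padded_extent(1024) == 1024
```

Sizing temporaries as a multiple of NPROMA from the start makes this padding, and its single-precision memory overhead, unnecessary.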
Hi Ahmad,
I assume this large set of changes is necessary.
The slicing of the arrays (KIJS:KIJL) comes from the time when the code had NPROMA blocks and vectors were split into KIJS:KIJL slices, where KIJS could be different from 1.
Removing the slicing now essentially means that KIJS is always 1 and KIJL is the dimension.
Does it make sense to keep KIJS, rather than changing it to 1?
These are quite a lot of changes. How did you test them?
Otherwise, apologies: I have made use of a lot of temporary arrays. It was something we used to do a lot with the old code.
Hi Jean, thanks for the feedback. I'll try and answer point by point:
Good to know the NPROMA requirements; I will try to adhere to them.
Force-pushed from 10ac544 to 9085bd3.
After restoring some of the changes, this branch is now bit-identical to main in coupled IFS+wam+nemo and IFS+wam runs. I have also added a unit-test for it. Once the 3 GPU offload PRs are merged, I will also file a PR to ifs so that they run through the entire ecflow CI suite and we can be sure results are unaffected for all the possible configurations we might be interested in.
For the testing: note that CY49R1 will actually activate this configuration.
Ah, good to know! The coupled runs I did were compared to a cy49r1 baseline, so the changes are safe for that configuration too. Nevertheless, I will update the unit-tests so we have one that matches the above configuration. |
From what I can see, these seem like sensible changes to me.
All looks good to me too. Very nice to see how the optimization works and what the best practices are for enabling efficient running on GPU.
This is the first of 3 PRs aimed at merging the initial GPU port of ecWAM into the main branch. The focus of the current PR is to make small, bit-identical changes to the source-term computations to reduce the number of temporary arrays. Due to the design of one of Loki's GPU memory management recipes, it is especially important to avoid temporary arrays that lack compile-time parameter bounds and do not have NPROMA as the leading dimension.
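Why the leading dimension matters can be sketched generically. This is illustrative Python with hypothetical shapes, not the ecWAM declarations: in Fortran's column-major layout, the leading dimension varies fastest in memory, so with NPROMA leading, consecutive grid points are contiguous, which is what a slab-based pool allocator and coalesced GPU accesses generally rely on.

```python
NPROMA = 8  # illustrative block size (grid points per block)
NFRE = 3    # illustrative number of frequencies

def fortran_offset(ij, m, lead=NPROMA):
    """Linear memory offset of A(ij, m) for a Fortran (column-major)
    array A(lead, NFRE): the leading index ij varies fastest, so
    adjacent grid points sit next to each other for a fixed m."""
    return (ij - 1) + (m - 1) * lead

# Stepping to the next grid point moves one element in memory ...
assert fortran_offset(2, 1) - fortran_offset(1, 1) == 1
# ... while stepping in frequency jumps by the whole leading dimension.
assert fortran_offset(1, 2) - fortran_offset(1, 1) == NPROMA
```

With NPROMA leading and known at compile time, every temporary's extent is an aligned multiple of the block size, which sidesteps the misalignment issue discussed earlier in the thread.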