Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CUDA variant optimised beyond k-caching #104

Merged
merged 4 commits into from
Dec 20, 2024

Conversation

MichaelSt98
Copy link
Contributor

Optimisation beyond k-caching:

likely still potential for further optimisation ...

Implemented with input from @lukasm91


k-caching vs "opt"-variant performance (compiled with nvhpc 22.11 and executed with 1 262144 128):

  • double-precision
    • k-caching: 600 - 700 GF/s
    • "opt": 700 - 850 GF/s
  • single-precision
    • k-caching: 2300 - 3000 GF/s
    • "opt": 2700 - 5000 GF/s

Copy link
Collaborator

@reuterbal reuterbal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really impressive!

  • I left some comments on the CMake side, which would be good to take care of
  • Tests are currently failing, not sure why
  • Please describe the new variant in the README, also adding some details on what constitutes the optimisation

Comment on lines 230 to 232
target_compile_options(dwarf-cloudsc-c-cuda-opt-lib PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:
-O3 -use_fast_math -lineinfo -maxrregcount=128 -gencode arch=compute_${CMAKE_CUDA_ARCHITECTURES},code=sm_${CMAKE_CUDA_ARCHITECTURES}>)
# -O0 -g -G -maxrregcount=128 -gencode arch=compute_${CMAKE_CUDA_ARCHITECTURES},code=sm_${CMAKE_CUDA_ARCHITECTURES}>)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe switch according to build-type?

Suggested change
target_compile_options(dwarf-cloudsc-c-cuda-opt-lib PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:
-O3 -use_fast_math -lineinfo -maxrregcount=128 -gencode arch=compute_${CMAKE_CUDA_ARCHITECTURES},code=sm_${CMAKE_CUDA_ARCHITECTURES}>)
# -O0 -g -G -maxrregcount=128 -gencode arch=compute_${CMAKE_CUDA_ARCHITECTURES},code=sm_${CMAKE_CUDA_ARCHITECTURES}>)
if(CMAKE_BUILD_TYPE STREQUAL "Debug")
target_compile_options(dwarf-cloudsc-c-cuda-opt-lib PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:-O0 -g -G>)
else()
target_compile_options(dwarf-cloudsc-c-cuda-opt-lib PRIVATE
$<$<COMPILE_LANGUAGE:CUDA>:-O3 -use_fast_math -lineinfo>)
endif()
target_compile_options(dwarf-cloudsc-c-cuda-opt-lib PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:-maxrregcount=128 -gencode arch=compute_${CMAKE_CUDA_ARCHITECTURES},code=sm_${CMAKE_CUDA_ARCHITECTURES}>)

${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES}
)
if (NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
target_compile_options(dwarf-cloudsc-c-cuda-opt-lib PRIVATE $<$<COMPILE_LANGUAGE:CUDA>>)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks a little weird. Could this be the reason for the failing tests?

target_include_directories(
dwarf-cloudsc-c-cuda-opt-lib
PUBLIC
${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What precisely do we need from that?
It would be better to link against CUDA::cudart (or similar, see https://cmake.org/cmake/help/latest/module/FindCUDAToolkit.html#cuda-toolkit-rt-lib) as a PUBLIC_LIBS target instead

@@ -54,7 +54,7 @@ if( HAVE_CLOUDSC_C_CUDA )
target_compile_options(dwarf-cloudsc-c-cuda-lib PRIVATE $<$<COMPILE_LANGUAGE:CUDA>>)
else()
target_compile_options(dwarf-cloudsc-c-cuda-lib PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:
--ptxas-options=-O3 -use_fast_math -maxrregcount=128 -gencode arch=compute_${CMAKE_CUDA_ARCHITECTURES},code=sm_${CMAKE_CUDA_ARCHITECTURES}>)
--ptxas-options=-O3 -use_fast_math -lineinfo -maxrregcount=128 -gencode arch=compute_${CMAKE_CUDA_ARCHITECTURES},code=sm_${CMAKE_CUDA_ARCHITECTURES}>)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest collecting these flags in variables, separating between optimisation and base flags. Something like

set( CLOUDSC_CUDA_OPT_FLAGS "..." )
set( CLOUDSC_CUDA_FLAGS "..." )

The first one can then be build_type dependent, e.g., use the -O0 flags for debug as suggested in the other comment, and then apply them to each target as

target_compile_options< dwarf-... PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:--ptxas-options=${CLOUDSC_CUDA_OPT_FLAGS} ${CLOUDSC_CUDA_FLAGS}>)

That way you have to change them only in one place.

@MichaelSt98
Copy link
Contributor Author

This is really impressive!

  • I left some comments on the CMake side, which would be good to take care of
  • Tests are currently failing, not sure why
  • Please describe the new variant in the README, also adding some details on what constitutes the optimisation

Tests are failing due to the compute capability?!

/home/runner/work/dwarf-p-cloudsc/dwarf-p-cloudsc/nvhpc-install/Linux_x86_64/21.9/cuda/11.4/include/cuda/std/detail/__atomic:11:4: error: #error "CUDA atomics are only supported for sm_60 and up on *nix and sm_70 and up on Windows."
[2891](https://github.com/ecmwf-ifs/dwarf-p-cloudsc/actions/runs/12279542957/job/34263839987?pr=104#step:11:2892)
   11 | #  error "CUDA atomics are only supported for sm_60 and up on *nix and sm_70 and up on Windows."
[2892](https://github.com/ecmwf-ifs/dwarf-p-cloudsc/actions/runs/12279542957/job/34263839987?pr=104#step:11:2893)
      |    ^~~~~
[2893](https://github.com/ecmwf-ifs/dwarf-p-cloudsc/actions/runs/12279542957/job/34263839987?pr=104#step:11:2894)
make[2]: *** [cloudsc-dwarf/src/cloudsc_cuda/CMakeFiles/dwarf-cloudsc-c-cuda-opt-lib.dir/build.make:95: cloudsc-dwarf/src/cloudsc_cuda/CMakeFiles/dwarf-cloudsc-c-cuda-opt-lib.dir/cloudsc/cloudsc_c_opt.cu.o] Error 1

Therefore, I guess we have to disable the new CUDA optimised variant for testing?!

@reuterbal
Copy link
Collaborator

It shouldn't be run but it should be built.
Try setting the CUDA architecture to 80 in the Github toolchain file: https://github.com/ecmwf-ifs/dwarf-p-cloudsc/blob/main/arch/toolchains/github-ubuntu-nvhpc.cmake

@MichaelSt98 MichaelSt98 force-pushed the nams-cuda-beyond-k-caching branch from 99afc47 to 7d92eef Compare December 19, 2024 14:13
@MichaelSt98 MichaelSt98 force-pushed the nams-cuda-beyond-k-caching branch from 7d92eef to 6beb1d9 Compare December 19, 2024 14:28
Copy link
Collaborator

@reuterbal reuterbal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, many thanks!

I know that the CMake changes are a bit tedious but I really like how it looks now!

Copy link
Collaborator

@mlange05 mlange05 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brilliant! Very, very impressive. :shipit:

@reuterbal reuterbal merged commit 8cec441 into develop Dec 20, 2024
32 checks passed
@reuterbal reuterbal deleted the nams-cuda-beyond-k-caching branch December 20, 2024 08:57
@marsdeno
Copy link
Contributor

My hat's off 🙇

@reuterbal
Copy link
Collaborator

A peculiar addition to this: @MichaelSt98's CMake changes make sure to now apply --ptxas-options=-O3 -O3, which seems to tip the double-precision performance over the (effective) 1TF/s barrier for throughput, and I'm seeing about ~4.3TF/s in single-precision.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants