
Add transposer kernels for reordering field data. #29

Merged · 8 commits into xcompact3d:main · Feb 6, 2024

Conversation

semi-h
Member

@semi-h semi-h commented Jan 24, 2024

We need to reorder field data as we switch between x, y, and z orientations.

First commit adds transposing operation from x to y.

The select type stuff we have to do here is a bit annoying, mainly due to the extension of the base field type in the CUDA backend to support device arrays in Fortran. Any suggestions here are welcome @JamieJQuinn.

@semi-h semi-h requested a review from Nanoseb January 29, 2024 10:58
@semi-h changed the title from "[WIP] Add transposer kernels for reordering field data." to "Add transposer kernels for reordering field data." on Jan 29, 2024

@Nanoseb Nanoseb left a comment


Sorry I can't really comment on the core of the code, but just had a couple of remarks about code duplication. Feel free to ignore them though.

One last thing: that's the kind of feature for which a unit test makes a lot of sense and would be fairly easy to write. Take some data, transpose it several times and back to its original orientation, and check that it is still identical to the original.
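The round-trip idea above can be sketched in Python; this is only a toy model in which a field is a plain 3-D nested list and `reorder_x2y` is a simple axis swap, not the actual Fortran/CUDA kernel (the real test would drive those directly):

```python
def reorder_x2y(u):
    """Swap the first two axes of a 3-D nested list (a stand-in for the x->y kernel)."""
    nx, ny, nz = len(u), len(u[0]), len(u[0][0])
    return [[[u[i][j][k] for k in range(nz)] for i in range(nx)] for j in range(ny)]

def test_reorder_round_trip():
    # Fill a field with unique values, reorder there and back, and demand
    # that the result is identical to the original.
    u = [[[100 * i + 10 * j + k for k in range(4)] for j in range(3)] for i in range(2)]
    assert reorder_x2y(reorder_x2y(u)) == u
```

Because an x-to-y swap is its own inverse, one application followed by a second must reproduce the input exactly; chaining several reorders and their inverses generalises the same check.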

src/cuda/backend.f90 (resolved)

contains

attributes(global) subroutine trans_x2y_k(u_y, u_x, nz)
Collaborator

Similarly, I feel like these can be refactored into a single subroutine

Member Author

I'm not sure we can merge all these kernels into one while making sure the performance is good. This is specific to the CUDA backend though; on the OpenMP backend I think it is possible. What you have in a CUDA kernel is like the bit of code inside some number of nested loops, and any conditional checks in there can hurt performance.

Collaborator

yes ok, that's the part I wasn't sure about.

real(dp), device, pointer, dimension(:, :, :) :: u_d, u_y_d
type(dim3) :: blocks, threads

select type(u); type is (cuda_field_t); u_d => u%data_d; end select
Member

WRT comment that "select type stuff is annoying" - I agree. One possible way to make it slightly better would be

subroutine set_data_pointer(u, u_d)
  class(field_t), intent(in) :: u
  real(dp), device, pointer, dimension(:, :, :), intent(out) :: u_d
  select type(u)
  type is (cuda_field_t)
    u_d => u%data_d
  end select
end subroutine

and then the code in the main subroutine should be

call set_data_pointer(u, u_d)
call set_data_pointer(u_y, u_y_d)

which is perhaps a little better.

The other option would be to rewrite this as a wrapper subroutine accepting class(field_t), which then calls the actual subroutine with type(cuda_field_t) arguments. The wrapper would need to do the select type dance (and should probably use class default for some error checking), but the actual implementation would be free of this complication, which may or may not be an improvement.
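For illustration only, the two options can be sketched in Python, with `isinstance` standing in for select type; all names here are invented analogues of the Fortran types, not the project's actual API:

```python
class FieldT:
    """Stand-in for the abstract class(field_t)."""

class CudaFieldT(FieldT):
    """Stand-in for cuda_field_t, which carries a device array."""
    def __init__(self, data_d):
        self.data_d = data_d

def set_data_pointer(u):
    """Option 1: one small helper hides the 'select type' dance everywhere."""
    if isinstance(u, CudaFieldT):
        return u.data_d
    # The 'class default' branch: fail loudly on an unexpected field type.
    raise TypeError(f"expected CudaFieldT, got {type(u).__name__}")

def reorder_impl(u_d, u_y_d):
    """Option 2's inner routine: works on concrete device arrays only."""
    pass  # the actual kernel launch would go here

def reorder(u, u_y):
    """Option 2: a thin wrapper does the dispatch once, then calls the impl."""
    reorder_impl(set_data_pointer(u), set_data_pointer(u_y))
```

Either way the type check happens in exactly one place, and an unexpected type produces an error instead of a silently unassociated pointer.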

Member Author

I think we need something like this for sure, but I was planning to deal with it later on when fixing the allocator. The issue is mentioned in #24. I think we can pass the size we need into this set_data_pointer function and handle the bounds remapping there.

Member

I agree, better not to clutter the change.

@semi-h
Member Author

semi-h commented Jan 30, 2024

Refactored all the transpose functions into one as we agreed, but kept the CUDA kernels as they are due to performance concerns.

Also, renamed transpose and trans to reorder. The name transpose stuck from the original Xcompact3D implementation, but the operation we're carrying out is no longer a transpose, I believe. It is always local and just reorders data into the x, y, or z direction so that we can use our 1D tridiagonal solver. What do you think about this? I added all the renaming in a single commit and can easily revert if people disagree.

@pbartholomew08
Member

pbartholomew08 commented Jan 30, 2024

> Also, renamed transpose and trans into reorder. [...] What do you think about this?

Reorder seems sensible to me; alternatives (only if you're unhappy with it) could be "orient" or "rotate", but I see no need to change from reorder.


@Nanoseb Nanoseb left a comment


This looks good to me. And same, happy with reorder for the terminology.

src/common.f90 (resolved)
@semi-h
Member Author

semi-h commented Feb 5, 2024

Added a simple test, please let me know if there is anything you want to add. I plan to merge tomorrow.

@Nanoseb
Collaborator

Nanoseb commented Feb 5, 2024

Looking good to me. I think in the future we could move checkperf into a separate test module because it is used in other tests too. But this is outside the scope of this PR.


@JamieJQuinn JamieJQuinn left a comment


Minor changes, except possibly the logging system, but that should go in its own PR anyway. The whole reordering process seems much cleaner now.

call reorder_z2y<<<blocks, threads>>>(u_o_d, u_i_d, &
self%nx_loc, self%nz_loc)
case default
print *, 'Transpose direction is undefined.'
Collaborator

Would be useful to report errors to an appropriate output stream, e.g. stderr via something like https://stackoverflow.com/a/8508757. Maybe we want to do this with a global logging system though. Thoughts?

Contributor

yes please for a more appropriate output stream!

Member Author

We make use of this in the tests already. Changed the print here accordingly.

Could you give more details about the global logging system you have in mind? I haven't really used something like that. Maybe we can start an issue on this for further discussion.

Collaborator

see #30

src/cuda/kernels_reorder.f90 (two review threads, both resolved)
@@ -0,0 +1,228 @@
program test_cuda_reorder
Collaborator

These tests are great but the kernels are being directly tested and the higher level reorder_cuda doesn't get tested. Might be worth adding a few tests for that specifically.

Member Author

I think I'll keep it as is for now, but I'll be happy to have a chat later about best practices regarding testing. I'm tempted to test a single piece of functionality only, ideally isolating it from the rest, and reorder_cuda would require instantiating at least the backend and the allocator, and also setting the class variables correctly so that the reorder subroutines run correctly. The way we set these class variables is still not finalised; it is currently done in the main program. The plan is to move this into a dedicated function in the backend, which reads the inputs from a config file.

@JamieJQuinn
Collaborator

Can someone also comment here on the reasoning behind refactoring the individual reorder functions (e.g. trans_x2y) into the switch-style reorder(u_y, u, RDR_X2Y)?

To me, the calling code remains semantically identical so there's no reduction in complexity there, and there has to be a separate kernel for every direction anyway, so we haven't reduced complexity on the backend. I feel like I'm missing the advantage of introducing this new switch statement.

@Nanoseb
Collaborator

Nanoseb commented Feb 6, 2024

> Can someone also comment here the reasoning behind refactoring the individual reorder functions (e.g. trans_x2y) into the switch-style reorder(u_y, u, RDR_X2Y)?

To me reorder(... RDR_X2Y) feels cleaner because you have a single subroutine to do the work; you don't have to set up six different ones, all with the same set of input and output definitions, which is error prone. When reading the code, the single call also felt simpler and more concise.

@semi-h
Member Author

semi-h commented Feb 6, 2024

> Can someone also comment here the reasoning behind refactoring the individual reorder functions (e.g. trans_x2y) into the switch-style reorder(u_y, u, RDR_X2Y)?

It helped us remove some of the backend procedures, and we ended up with fewer lines overall.
Also, we need separate kernels in the CUDA backend, but in the OpenMP backend we can actually have a single subroutine doing all the reorderings based on the switch parameter.

@semi-h semi-h merged commit 18fe0bd into xcompact3d:main Feb 6, 2024
5 participants