-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add transposer kernels for reordering field data. #29
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I can't really comment on the core of the code, but just had a couple of remarks about code duplication. Feel free to ignore them though.
One last thing, that's the kind of feature for which an unit test make a lot of sense and will be fairly easy. Just have some data, transpose it several times and back to its original position and test if it is still the same as before.
src/cuda/kernels_trans.f90
Outdated
|
||
contains | ||
|
||
attributes(global) subroutine trans_x2y_k(u_y, u_x, nz) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, I feel like these can be refactored into a single subroutine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if we can merge all these kernels into one while making sure the performance is good. This is specific to CUDA backend tough, on the OpenMP backend I think it is possible. What you have in CUDA kernels is like the bit of code you have inside some number of nested loops, and any conditional checks over there can make performance bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes ok, that's the part I wasn't sure about.
src/cuda/backend.f90
Outdated
real(dp), device, pointer, dimension(:, :, :) :: u_d, u_y_d | ||
type(dim3) :: blocks, threads | ||
|
||
select type(u); type is (cuda_field_t); u_d => u%data_d; end select |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WRT comment that "select type stuff is annoying" - I agree. One possible way to make it slightly better would be
subroutine set_data_pointer(u, u_d)
class(field_t), intent(in) :: u
real(dp), device, pointer, dimension(:, :, :), intent(out) :: u_d
select type(u)
type is (cuda_field_t)
u_d => u%data_d
end select
end subroutine
and then the code in the main subroutine should be
call set_data_pointer(u, u_d)
call set_data_pointer(u_y, u_y_d)
which is perhaps a little better.
The other option would be to rewrite this as a wrapper subroutine accepting class(field_t)
and then it calls the actual subroutine with type(cuda_field_t)
arguments. The wrapper subroutine would need to do the select type dance (and should probably use class default
for some error checking), but the actual implementation would be free if this complication which may or may not be an improvement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need something like this for sure, but was planning to deal with it later on when fixing the allocator. The issue is mentioned here #24. And I think we can pass the size we need into this set_data_pointer
function and handle the bounds remapping there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, better not to clutter the change.
Refactored all the transpose functions into one as we agreed, but kept the cuda kernels as they are due to performance concerns. Also, renamed transpose and trans into reorder. The name tranpose stuck from the original Xcompact3D implementation, but the operation we're carrying out is not any more a transpose I believe. It is always local and just reorders data into x, y, or z directions so that we can use our 1D tridiagonal solver. What do you think about this? I added all the renaming stuff in a single commit and easily convert back if people disagree. |
reorder seems sensible to me, alternatives (only if you're unhappy with it) could be "orient" or "rotate", but I see no need to change from reorder. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. And same, happy with reorder
for the terminology.
Added a simple test, please let me know if there is anything you want to add. I plan to merge tomorrow. |
Looking good to me. I think in the future we could move |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor changes except possible logging system but that should go in its own PR anyway. Whole reordering processes seems much cleaner now.
src/cuda/backend.f90
Outdated
call reorder_z2y<<<blocks, threads>>>(u_o_d, u_i_d, & | ||
self%nx_loc, self%nz_loc) | ||
case default | ||
print *, 'Transpose direction is undefined.' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be useful to report errors to an appropriate output stream, e.g. stderr via something like https://stackoverflow.com/a/8508757. Maybe we want to do this with a global logging system though. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes please for a more appropriate output stream!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We make use of this in the tests already. Changed the print here accordingly.
Could you give more details about a global logging system you have in mind? I haven't really used something like that. Maybe we can start an issue on this for further discussions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see #30
@@ -0,0 +1,228 @@ | |||
program test_cuda_reorder |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests are great but the kernels are being directly tested and the higher level reorder_cuda
doesn't get tested. Might be worth adding a few tests for that specifically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'll keep it as is for now, but I'll be happy to have a chat later about the best practices regarding testing. I'm tempted to test a single functionality only and ideally isolating it from the rest, and reorder_cuda would require instantiating the backend and the allocator at least, and also setting the class variables correctly so that the reorder subroutines are run currectly. The way we set these class variables are still not finalised, and currently done in the main program. Plan is to move them into a dedicated function in the backend, which reads the inputs from a config file.
Can someone also comment here the reasoning behind refactoring the individual reorder functions (e.g. To me, the calling code remains semantically identical so there's no reduction in complexity there, and there has to be a separate kernel for every direction anyway, so we haven't reduced complexity on the backend. I feel like I'm missing the advantage of introducing this new switch statement. |
To me |
It helped us to remove some of the backend procedures and we ended up having fewer lines. |
We need to reorder field data as we switch between x, y, and z orientations.
First commit adds transposing operation from x to y.
Select type stuff we have to do here is a bit annoying, and it is mainly due to the extention of the base field type in CUDA backend to support device arrays in Fortran. Any suggestion here is welcome @JamieJQuinn.