-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add device regions to non-banded routines #140
Add device regions to non-banded routines #140
Conversation
63ccb95
to
07bdefe
Compare
07bdefe
to
c20af46
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not yet done reviewing files (there are lots), but giving some feedback now.
As discussed, either rebase or merge to fix the conflicts. If merge, I like to fast-forward master to avoid having 2 consecutive merges, with the 2nd doing nothing. Unfortunately, github doesn't have a way to do that through their web interface, so I do it locally and push. Just let me know when it's ready and I can do that. |
I did the merge + push-to-master approach, which appears to have worked correctly. |
This adds arbitrary regions for the all of the batched device functions, except the banded norms. Handling the bandwidth for non-uniform tile sizes isn't totally obvious and is probably broken in the other banded routines too. So, I'm leaving that for the future.
The core logic for building the regions is refactored into a couple functions in
src/internal/internal_batch.hh
. That's probably the place to start, in order to understand the rest of the changes.device_regions_build
is the key function there, which builds the regions as a vector ofdevice_regions_params
objects. This function and struct are both templated on several parameters in order to avoid too large of an overhead relative to manually inlining this logic everywhere. This function has several moving parts to address all of the ways we organize our batch kernels. Please let me know if anything needs more comments/explanations.On guyot, the performance seems to be the same or maybe a touch better. The norms seem to be a little slower. But, it's a little hard to tell with the system's noise.
Edit: I noticed that I missed trsmA and the he2hb variants of gemm, trmm, and her2k. I've addressed these routines.