Back-of-the-envelope estimate for performance (roofline plot); see the roofline sketch after this list.
Decrease the amount of shared memory used per kernel (note that some shared memory is reserved by the CUDA runtime). Useful if shared-memory usage is limiting occupancy, or if performance is limited by the number of operations in flight.
Increase the arithmetic intensity by reducing the number of loads and stores relative to the arithmetic performed. Can be done by computing more than one C element per thread; see the register-tiling sketch after this list.
GPUs support vectorized memory access. Instead of 32-bit (.32) load/store instructions one should aim for 128-bit (.128) instructions when loading data. Verifying this may require looking at the generated SASS code; see the float4 sketch after this list.
Avoid shared-memory bank conflicts; see the padding sketch after this list.
Double buffering (?): prefetch the next tile into shared memory while computing on the current one; see the sketch after this list.
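
A minimal sketch of the back-of-the-envelope roofline estimate, assuming an N^3 SGEMM; the peak throughput and bandwidth numbers below are placeholders, not measured values, so plug in the actual GPU's figures:

```cuda
#include <cstdio>

// Rough roofline estimate for an M x N x K SGEMM.
int main() {
    const double M = 4096, N = 4096, K = 4096;

    const double flops = 2.0 * M * N * K;                 // one FMA = 2 flops per (i, j, k)
    const double bytes = 4.0 * (M * K + K * N + M * N);   // ideal case: each matrix touched once

    const double arithmetic_intensity = flops / bytes;    // flops per byte of DRAM traffic

    const double peak_flops = 19.5e12;   // placeholder FP32 peak [flop/s]
    const double peak_bw    = 1.5e12;    // placeholder DRAM bandwidth [byte/s]
    const double ridge      = peak_flops / peak_bw;       // machine balance [flop/byte]

    printf("AI = %.1f flop/byte, ridge point = %.1f flop/byte\n", arithmetic_intensity, ridge);
    printf("kernel is %s-bound in the ideal case\n",
           arithmetic_intensity > ridge ? "compute" : "memory");
    return 0;
}
```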
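A sketch of computing more than one C element per thread (register tiling / thread coarsening). The TM factor, square row-major matrices, and lack of bounds checks are simplifying assumptions:

```cuda
// Each thread computes TM consecutive rows of one C column instead of a single
// element, so every loaded B value is reused TM times in registers (higher
// arithmetic intensity). Illustrative only; no bounds handling.
#define TM 4

__global__ void sgemm_coarsened(const float* A, const float* B, float* C, int N) {
    int col  = blockIdx.x * blockDim.x + threadIdx.x;
    int row0 = (blockIdx.y * blockDim.y + threadIdx.y) * TM;  // first of TM rows

    float acc[TM] = {0.0f};
    for (int k = 0; k < N; ++k) {
        float b = B[k * N + col];                 // loaded once, reused TM times
        for (int m = 0; m < TM; ++m)
            acc[m] += A[(row0 + m) * N + k] * b;
    }
    for (int m = 0; m < TM; ++m)
        C[(row0 + m) * N + col] = acc[m];
}
```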
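A sketch of getting 128-bit instead of 32-bit memory instructions by loading float4. It assumes 16-byte-aligned buffers (cudaMalloc guarantees this) and a length divisible by 4; whether .128 instructions are actually emitted can be checked in the SASS with cuobjdump or nvdisasm:

```cuda
// Copy kernel illustrating vectorized access: reinterpreting float* as float4*
// typically compiles to one 128-bit load/store per thread instead of four
// 32-bit ones.
__global__ void copy_vec4(const float* __restrict__ in, float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n / 4) {
        float4 v = reinterpret_cast<const float4*>(in)[i];
        reinterpret_cast<float4*>(out)[i] = v;
    }
}
```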
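One common way to avoid shared-memory bank conflicts is to pad the shared tile by one column. A minimal demonstration using a transpose; the tile size and the assumption that n is divisible by TILE are mine:

```cuda
#define TILE 32

// Padding the tile to TILE+1 columns shifts each row by one bank, so the
// column-wise read after the transpose no longer hits the same bank for all
// threads of a warp. Launch with blockDim = (TILE, TILE).
__global__ void transpose_padded(const float* in, float* out, int n) {
    __shared__ float tile[TILE][TILE + 1];   // +1 column avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];  // column read, conflict-free
}
```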
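A rough sketch of double buffering in a tiled SGEMM: tile t+1 is loaded into one shared buffer while tile t is consumed from the other, so only one __syncthreads per iteration is needed. Square matrices with N divisible by TILE are assumed; this is illustrative, not tuned:

```cuda
#define TILE 16

__global__ void sgemm_double_buffered(const float* A, const float* B, float* C, int N) {
    __shared__ float As[2][TILE][TILE];   // two buffers: compute on one,
    __shared__ float Bs[2][TILE][TILE];   // load the next tile into the other

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;
    int numTiles = N / TILE;

    // Preload tile 0 into buffer 0.
    As[0][ty][tx] = A[row * N + tx];
    Bs[0][ty][tx] = B[ty * N + col];
    __syncthreads();

    float acc = 0.0f;
    for (int t = 0; t < numTiles; ++t) {
        int cur = t & 1, nxt = cur ^ 1;

        // Start filling the other buffer with tile t+1 (if there is one).
        if (t + 1 < numTiles) {
            As[nxt][ty][tx] = A[row * N + (t + 1) * TILE + tx];
            Bs[nxt][ty][tx] = B[((t + 1) * TILE + ty) * N + col];
        }

        // Compute on the current buffer.
        for (int k = 0; k < TILE; ++k)
            acc += As[cur][ty][k] * Bs[cur][k][tx];

        __syncthreads();   // make the prefetched tile visible before switching buffers
    }
    C[row * N + col] = acc;
}
```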
Scope: making an exercise based on matrix-matrix multiplication.
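
As a possible starting point for the exercise, a naive baseline kernel that the listed optimizations could be applied to step by step (square row-major matrices and the launch configuration below are assumptions):

```cuda
// Naive SGEMM baseline: one thread per C element, all operands read from
// global memory on every iteration.
__global__ void sgemm_naive(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

// Example launch:
//   dim3 block(16, 16);
//   dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
//   sgemm_naive<<<grid, block>>>(dA, dB, dC, N);
```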
Optimization guide for CUDA:
List of optimizations: see the items above.