-
The two tiled MMAs you posted use different numbers of threads: the first one uses 256 threads, whereas the second one uses 32. Is that intentional? I think not.
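To make the thread-count mismatch concrete, here is a minimal sketch of the arithmetic, assuming a tiled MMA uses (threads per atom) × (number of atoms in the atom layout) threads. The helper `tiled_mma_threads` is hypothetical and not part of CuTe, and the 16x16x1 and 2x2x1 atom layouts are assumed examples chosen only because they reproduce the 256 and 32 figures above; they are not necessarily the layouts from the original post. In real CuTe code the same number should be available as `size(tiled_mma)`.

```cpp
#include <cassert>

// Hypothetical helper (not a CuTe API): number of threads a tiled MMA uses,
// assuming threads = threads_per_atom * atoms along M * atoms along N * atoms along K.
constexpr int tiled_mma_threads(int threads_per_atom,
                                int atoms_m, int atoms_n, int atoms_k = 1) {
    return threads_per_atom * atoms_m * atoms_n * atoms_k;
}

// UniversalFMA is a single-thread atom.
// The 16x16x1 atom layout is an assumed example.
constexpr int fma_threads = tiled_mma_threads(1, 16, 16);

// An SM70 8x8x4 atom is executed cooperatively by 8 threads (a quad-pair).
// The 2x2x1 atom layout is an assumed example.
constexpr int sm70_threads = tiled_mma_threads(8, 2, 2);

// Compile-time checks of the arithmetic above.
static_assert(fma_threads == 256, "1 thread per atom * 16 * 16 atoms");
static_assert(sm70_threads == 32, "8 threads per atom * 2 * 2 atoms");
```

If the kernel launch and the per-thread partitioning were written for one of these thread counts and only the atom was swapped, part of the output tile may never be computed, which would be consistent with only some GEMM outputs being correct.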
-
Hello,
I am studying CuTe and got stuck when I tried to replace a `UniversalFMA` atom with something non-trivial. Here is what I have:
Using the above configuration results in correct GEMM outputs from the CUDA kernel that I implemented.
However, if I change the MMA atom to something else, say `cute::SM70_8x8x4_F16F16F16F16_NT`, only some of the GEMM outputs are correct. Some of the configurations I tried include:
But I have had no success after testing a good number of configurations that I believed were correct.
Can any expert please suggest what to set in my case? Thank you very much.