Configure Non-Trivial TiledMMAs #1985

leimao · 2024-12-15T08:05:33Z

Hello,

I am studying CuTe and got stuck when I tried to replace a UniversalFMA atom with something non-trivial.

Here is what I have:

1. A: [BLK_M, BLK_K], column-major, where BLK_M = 256, BLK_K = 8.
2. B: [BLK_N, BLK_K], column-major, where BLK_N = 256, BLK_K = 8.
3. thread_layout_C: [THR_M, THR_N] column-major, where THR_M = 16, THR_N = 16.
4. MMA Atom: cute::UniversalFMA<cute::half_t, cute::half_t, cute::half_t>{}
5. TiledMMA: cute::make_tiled_mma(cute::UniversalFMA<cute::half_t, cute::half_t, cute::half_t>{}, thread_layout_C)

Using the above configuration results in correct GEMM outputs from the CUDA kernel that I implemented.

    auto mma{cute::make_tiled_mma(cute::UniversalFMA<cute::half_t, cute::half_t, cute::half_t>{}, thread_layout_C)};

However, if I change the MMA atom to something else, say cute::SM70_8x8x4_F16F16F16F16_NT, only some of the GEMM outputs are correct.

Some of the configurations I tried include:

    auto mma_atom{cute::MMA_Atom<cute::SM70_8x8x4_F16F16F16F16_NT>{}};
    auto mma_layout{cute::make_layout(
        cute::make_shape(cute::Int<2>{}, cute::Int<2>{}),
        cute::make_stride(cute::Int<1>{}, cute::Int<2>{}))};
    auto mma_tile{cute::make_tile(cute::Int<32>{}, cute::Int<32>{}, cute::Int<4>{})};
    auto mma{cute::make_tiled_mma(mma_atom, mma_layout, mma_tile)};

However, I have had no success after testing a good number of configurations that I believed were correct.

Can any expert please suggest what to set in my case? Thank you very much.

thakkarV · 2024-12-15T08:13:18Z

Collaborator

The two tiled MMAs you posted use different numbers of threads. The first one uses 256 threads, whereas the second one uses 32. Is that intentional? I think not.
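The thread counts mentioned above can be reproduced with a little arithmetic. The following is only a minimal sketch in plain C++ (it does not use CuTe), assuming the thread count of a tiled MMA is the number of threads in one atom times the number of atoms in the atom layout. The helper names `tiled_mma_threads` and `atoms_needed` are hypothetical and exist only for illustration.

```cpp
#include <cstddef>

// Hypothetical helper (not a CuTe function): number of threads used by a
// tiled MMA, assuming it equals threads-per-atom times the number of atoms
// in an atoms_m x atoms_n atom layout.
constexpr std::size_t tiled_mma_threads(std::size_t threads_per_atom,
                                        std::size_t atoms_m,
                                        std::size_t atoms_n)
{
    return threads_per_atom * atoms_m * atoms_n;
}

// Hypothetical helper: number of atoms required to occupy a whole thread block.
constexpr std::size_t atoms_needed(std::size_t block_threads,
                                   std::size_t threads_per_atom)
{
    return block_threads / threads_per_atom;
}

// UniversalFMA is a single-thread atom; a 16 x 16 thread layout gives 256 threads.
static_assert(tiled_mma_threads(1, 16, 16) == 256, "first configuration");

// SM70_8x8x4_F16F16F16F16_NT is an 8-thread atom; a 2 x 2 atom layout gives 32 threads.
static_assert(tiled_mma_threads(8, 2, 2) == 32, "second configuration");

// Filling a 256-thread block with 8-thread atoms takes 32 atoms.
static_assert(atoms_needed(256, 8) == 32, "atoms for a 256-thread block");
```

Under the same assumption, any atom layout whose size is 32 would match a 256-thread block when the atom itself spans 8 threads.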

4 replies

leimao Dec 15, 2024
Author

Thank you @thakkarV. I am a little inexperienced with this. Are you suggesting that the second tiled MMA should also use 256 threads (I want 256 threads per thread block), i.e. cute::size(mma_layout) == 32, given that the MMA atom cute::MMA_Atom<cute::SM70_8x8x4_F16F16F16F16_NT>{} is defined for 8 threads?

leimao Dec 15, 2024
Author

I have also tried something like this

    auto mma_atom{cute::MMA_Atom<cute::SM70_8x8x4_F16F16F16F16_NT>{}};
    auto mma_layout{cute::make_layout(
        cute::make_shape(cute::Int<4>{}, cute::Int<8>{}),
        cute::make_stride(cute::Int<1>{}, cute::Int<4>{}))};
    auto mma{cute::make_tiled_mma(mma_atom, mma_layout)};

and it does not work either. I did not pass an mma_tile to repeat the MMA over additional values, because I assume it is just an optional optimization?
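For reference, here is a rough sketch of the arithmetic behind the 4 x 8 atom layout above, written in plain C++ rather than CuTe. It assumes (this is an assumption, not something confirmed in this thread) that when no mma_tile is passed, one pass of the tiled MMA covers the atom shape multiplied by the atom layout in M and N, and that the rest of the block tile is covered by repeating that pass. `Extent3`, `tiled_mma_extent`, and `passes` are hypothetical names.

```cpp
#include <cstddef>

// Hypothetical M x N x K extent; not a CuTe type.
struct Extent3
{
    std::size_t m;
    std::size_t n;
    std::size_t k;
};

// Hypothetical helper: extent covered by one pass of a tiled MMA built from
// `atom` replicated atoms_m x atoms_n times, assuming no permutation tile.
constexpr Extent3 tiled_mma_extent(Extent3 atom,
                                   std::size_t atoms_m,
                                   std::size_t atoms_n)
{
    return Extent3{atom.m * atoms_m, atom.n * atoms_n, atom.k};
}

// Hypothetical helper: how many passes are needed along one mode of the block tile.
constexpr std::size_t passes(std::size_t block_extent, std::size_t mma_extent)
{
    return block_extent / mma_extent;
}

// An 8x8x4 atom replicated over a 4 x 8 atom layout.
constexpr Extent3 kExtent = tiled_mma_extent(Extent3{8, 8, 4}, 4, 8);

static_assert(kExtent.m == 32, "M extent of one pass");
static_assert(kExtent.n == 64, "N extent of one pass");
static_assert(kExtent.k == 4, "K extent of one pass");

// BLK_M = 256, BLK_N = 256, BLK_K = 8 from the original post.
static_assert(passes(256, kExtent.m) == 8, "passes along M");
static_assert(passes(256, kExtent.n) == 4, "passes along N");
static_assert(passes(8, kExtent.k) == 2, "passes along K");
```

If that assumption holds, the 256 x 256 x 8 block tile is still fully covered without an mma_tile, only with more repetitions per thread.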

leimao Dec 15, 2024
Author

It turns out that I was using the predicated version of axpby after completing the GEMM, and the number of predicates was smaller than what is expected when the new MMA atom is used.

leimao Dec 15, 2024
Author

I wonder whether CuTe could statically check the size compatibility between the predicate tensor and the other tensors in the functions that accept predicates. If so, this kind of oversight could be caught by the compiler.
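To illustrate the kind of check being proposed, here is a minimal sketch in plain C++. It is not CuTe code and does not reflect how CuTe implements its predicated functions; `StaticTensor` and `axpby_if` are hypothetical stand-ins for a statically sized fragment and a predicated axpby.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for a statically sized fragment; not a CuTe type.
template <class T, std::size_t N>
struct StaticTensor
{
    std::array<T, N> data{};
    static constexpr std::size_t size() { return N; }
};

// Hypothetical predicated axpby: y[i] = alpha * x[i] + beta * y[i] wherever
// pred[i] is true. The sizes are template parameters, so a mismatch between
// the predicate tensor and the data tensors fails at compile time.
template <class T, std::size_t NX, std::size_t NY, std::size_t NP>
void axpby_if(T alpha, StaticTensor<T, NX> const& x,
              T beta, StaticTensor<T, NY>& y,
              StaticTensor<bool, NP> const& pred)
{
    static_assert(NX == NY, "x and y must have the same static size");
    static_assert(NP == NY, "predicate tensor must have the same static size as y");
    for (std::size_t i = 0; i < NY; ++i)
    {
        if (pred.data[i])
        {
            y.data[i] = alpha * x.data[i] + beta * y.data[i];
        }
    }
}
```

With this sketch, calling `axpby_if` with a `StaticTensor<bool, 2>` predicate and 4-element data tensors would be rejected by the compiler instead of silently leaving some outputs unwritten.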
