
Replace mamba2 mamba_chunk_scan_combined triton kernel by simple_gla triton kernel #49

Merged · 6 commits merged into fla-org:main on Aug 18, 2024

Conversation

@learning-chip (Contributor) commented Aug 18, 2024

Follow-up to #39 (comment).

Eventually this will allow the e2e mamba2 example (#39) to run without depending on the original mamba_ssm repo.

This PR adds unit tests to ensure equivalence between chunk_simple_gla / torch_simple_gla / torch_simple_gla_recurrent (under fla.ops.simple_gla in this repository) and mamba_chunk_scan_combined / ssd_minimal_discrete (in the mamba_ssm repository). A sketch of such an equivalence check is shown after the test output below.

Unit test output from this PR:

$ pytest -v ./test_simple_gla_for_mamba2.py
====================================================== test session starts ======================================================
collected 6 items                                                                                                               

test_simple_gla_for_mamba2.py::test_gla_to_mamba2[float32-True] PASSED                                                    [ 16%]
test_simple_gla_for_mamba2.py::test_gla_to_mamba2[float32-False] PASSED                                                   [ 33%]
test_simple_gla_for_mamba2.py::test_gla_to_mamba2[float16-True] PASSED                                                    [ 50%]
test_simple_gla_for_mamba2.py::test_gla_to_mamba2[float16-False] PASSED                                                   [ 66%]
test_simple_gla_for_mamba2.py::test_gla_to_mamba2[bfloat16-True] PASSED                                                   [ 83%]
test_simple_gla_for_mamba2.py::test_gla_to_mamba2[bfloat16-False] PASSED                                                  [100%]
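
For reference, a minimal sketch of this kind of equivalence check, written against a naive torch recurrence. It is illustrative only: the chunk_simple_gla signature, the shapes, and the tolerances are assumptions, not the actual test file from this PR.

```python
# Hypothetical sketch only; the real test lives in test_simple_gla_for_mamba2.py.
# The chunk_simple_gla signature/return value is an assumption.
import pytest
import torch
from fla.ops.simple_gla import chunk_simple_gla


def naive_simple_gla(q, k, v, g, scale):
    # q, k: [B, H, T, K]; v: [B, H, T, V]; g: [B, H, T] log-space scalar gate per head
    B, H, T, K = q.shape
    S = q.new_zeros(B, H, K, v.shape[-1], dtype=torch.float32)
    o = torch.empty_like(v, dtype=torch.float32)
    for t in range(T):
        # S_t = exp(g_t) * S_{t-1} + k_t^T v_t ;  o_t = (q_t * scale) @ S_t
        S = g[:, :, t, None, None].float().exp() * S \
            + k[:, :, t, :, None].float() * v[:, :, t, None, :].float()
        o[:, :, t] = torch.einsum('bhk,bhkv->bhv', q[:, :, t].float() * scale, S)
    return o.to(v.dtype)


@pytest.mark.parametrize("dtype", [torch.float32, torch.float16, torch.bfloat16])
def test_chunk_matches_naive(dtype):
    torch.manual_seed(0)
    B, H, T, K, V = 2, 4, 64, 32, 32
    q, k = (torch.randn(B, H, T, K, dtype=dtype, device='cuda') for _ in range(2))
    v = torch.randn(B, H, T, V, dtype=dtype, device='cuda')
    g = torch.nn.functional.logsigmoid(torch.randn(B, H, T, device='cuda')).to(dtype)
    out = chunk_simple_gla(q, k, v, g, scale=K ** -0.5)  # kwargs assumed
    out = out[0] if isinstance(out, tuple) else out
    torch.testing.assert_close(out, naive_simple_gla(q, k, v, g, K ** -0.5),
                               rtol=1e-2, atol=1e-2)
```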

Differences between the simple_gla kernel and the "mamba2_ssd" kernel (a conversion sketch follows this list):

  • mamba2_ssd uses input/output layout [batch, seq, head, hidden], while simple_gla uses [batch, head, seq, hidden]
  • mamba2_ssd does not apply the attention-inspired scaling q * (DK ** -0.5)
  • mamba2_ssd takes an extra dt input for discretization, but dt can easily be absorbed into the gating matrix A, as is done in the mamba2 example
  • mamba2_ssd's fused kernel does not take a time-varying A (though the minimal torch version does), probably because the time dependence is expressed through dt rather than A_t; simple_gla supports a time-varying g directly
  • mamba2_ssd uses grouped-query attention, while simple_gla (like the other kernels in this repo?) always uses the same number of heads for Q, K, and V. For now, the tests force the same number of heads.

Ref. Section 7.2 of the Mamba-2 paper (figure: group_query).
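
A rough sketch of the input mapping the list above describes. Head grouping (GQA) is ignored, the D/z paths are omitted, and the chunk_simple_gla call signature is an assumption.

```python
# Rough mapping sketch, not the implementation used in the mamba2 example.
import torch
from fla.ops.simple_gla import chunk_simple_gla


def mamba2_ssd_via_simple_gla(x, dt, A, B, C):
    """x: [batch, seq, head, head_dim], dt: [batch, seq, head],
    A: [head] (negative), B/C: [batch, seq, head, state_dim]."""
    # 1. layout: [batch, seq, head, dim] -> [batch, head, seq, dim]
    q = C.transpose(1, 2)                        # C plays the role of queries
    k = B.transpose(1, 2)                        # B plays the role of keys
    v = (x * dt.unsqueeze(-1)).transpose(1, 2)   # dt enters the input term of the SSD recurrence
    # 2. absorb dt into the log-space gate: g_t = A_h * dt_t
    g = (A[None, None, :] * dt).transpose(1, 2)  # [batch, head, seq]
    # 3. disable simple_gla's attention-style scaling, which mamba2_ssd does not apply
    o = chunk_simple_gla(q, k, v, g, scale=1.0)  # scale kwarg assumed
    o = o[0] if isinstance(o, tuple) else o
    return o.transpose(1, 2)                     # back to [batch, seq, head, head_dim]
```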

Todo:

FYI @DanFosing @yzhangcs @sustcsonglin

@yzhangcs (Member) commented

@learning-chip Very cool contribution! I think it would be great if you added some benchmarks comparing the simple_gla and mamba2 kernels, like in https://github.com/sustcsonglin/flash-linear-attention/blob/main/benchmarks/ops/benchmark_gla.py.
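
For example, a hypothetical skeleton along the lines of that benchmark script. Both kernel signatures below are assumptions; adjust the calls to the installed fla and mamba_ssm versions.

```python
# Hypothetical benchmark skeleton, not benchmarks/ops/benchmark_gla.py itself.
import torch
import triton
from fla.ops.simple_gla import chunk_simple_gla
from mamba_ssm.ops.triton.ssd_combined import mamba_chunk_scan_combined

B, H, D, N, dtype = 8, 16, 64, 128, torch.bfloat16

for T in [256, 512, 1024, 2048, 4096]:
    # simple_gla layout: [batch, head, seq, dim]
    q, k = (torch.randn(B, H, T, N, device='cuda', dtype=dtype) for _ in range(2))
    v = torch.randn(B, H, T, D, device='cuda', dtype=dtype)
    g = torch.nn.functional.logsigmoid(torch.randn(B, H, T, device='cuda'))
    # mamba2_ssd layout: [batch, seq, head, dim] plus dt and a per-head A
    x = torch.randn(B, T, H, D, device='cuda', dtype=dtype)
    dt = torch.rand(B, T, H, device='cuda', dtype=dtype)
    A = -torch.rand(H, device='cuda')
    Bm = torch.randn(B, T, 1, N, device='cuda', dtype=dtype)  # single B/C group (no GQA)
    Cm = torch.randn(B, T, 1, N, device='cuda', dtype=dtype)

    t_gla = triton.testing.do_bench(lambda: chunk_simple_gla(q, k, v, g))
    t_ssd = triton.testing.do_bench(
        lambda: mamba_chunk_scan_combined(x, dt, A, Bm, Cm, chunk_size=64))
    print(f'T={T}: chunk_simple_gla {t_gla:.3f} ms vs mamba_chunk_scan_combined {t_ssd:.3f} ms')
```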

@yzhangcs (Member) commented

I will be working on GQA soon.

@yzhangcs marked this pull request as ready for review on Aug 18, 2024, 17:40
@yzhangcs merged commit 9aa2480 into fla-org:main on Aug 18, 2024
1 check passed
@learning-chip (Contributor, Author) commented

> add some benchmarks regarding simple_gla and mamba2 kernels like in https://github.com/sustcsonglin/flash-linear-attention/blob/main/benchmarks/ops/benchmark_gla.py.

Some quick results: #50
