Fused moe gemm + silu activation kernel #710

Draft
Chi-Chu319 wants to merge 12 commits into base: main_perf

Conversation

@Chi-Chu319 commented Jan 23, 2025

!!Merge only after the tianxing/moe_quantization branch is merged!!

Implemented the moe gemm + silu_and_mul fusion. The fusion logic (sketched in code right after this list) is:

  • Load the B pointers in an interleaved manner: [0, 0 + BLOCK_N // 2, 1, 1 + BLOCK_N // 2, 2, 2 + BLOCK_N // 2, ..., BLOCK_N // 2 - 1, BLOCK_N - 1]
  • After the gemm, reshape the product to (BLOCK_SIZE_M, BLOCK_SIZE_N // 2, 2) and split it along the last dimension to obtain silu_acc and mul_acc
  • Apply silu to silu_acc and multiply the result with mul_acc
  • Store the result
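A minimal sketch of the idea as a plain dense GEMM in Triton, with no expert routing, top-k weighting, or boundary masking, and assuming M, N // 2, and K are multiples of the block sizes. The kernel name, argument names, and the gate/up column layout of B are illustrative assumptions, not the PR's actual code:

```python
import triton
import triton.language as tl


@triton.jit
def gemm_silu_and_mul_kernel(
    a_ptr, b_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # A is (M, K), B is (K, N), C is (M, N // 2). Columns [0, N // 2) of B are the
    # "gate" projection and columns [N // 2, N) the "up" projection (assumed layout).
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_k = tl.arange(0, BLOCK_K)
    # Output columns produced by this program.
    offs_half = pid_n * (BLOCK_N // 2) + tl.arange(0, BLOCK_N // 2)
    # Interleave the gate/up columns of B as [g0, u0, g1, u1, ...] so that the
    # (gate, up) pair of every output column lands in adjacent accumulator columns.
    offs_bn = tl.reshape(tl.join(offs_half, offs_half + N // 2), (BLOCK_N,))

    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_bn[None, :] * stride_bn

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, tl.cdiv(K, BLOCK_K)):
        a = tl.load(a_ptrs)
        b = tl.load(b_ptrs)
        acc = tl.dot(a, b, acc)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk

    # Fused epilogue: reshape to (BLOCK_M, BLOCK_N // 2, 2), split off the gate/up pair,
    # apply silu(x) = x * sigmoid(x) to the gate half, and multiply with the up half.
    acc = tl.reshape(acc, (BLOCK_M, BLOCK_N // 2, 2))
    gate, up = tl.split(acc)
    out = gate * tl.sigmoid(gate) * up

    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_half[None, :] * stride_cn
    tl.store(c_ptrs, out.to(c_ptr.dtype.element_ty))
```

Because of the interleaved load, the epilogue never re-reads the intermediate activations from global memory, which is the point of the fusion; the grid would be (M // BLOCK_M, N // BLOCK_N).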

You can benchmark the fused kernel (moe gemm + silu activation) with the -use_silu_activation flag.
Also implemented a separate silu_and_mul kernel, so the moe gemm + silu activation combo can be benchmarked as separate Triton kernels with the -use_silu_activation_non_fused flag; a sketch of such a standalone kernel follows below.
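For the non-fused path, a standalone silu_and_mul kernel could look like the sketch below (again an illustration, not necessarily the PR's exact kernel): it takes x of shape (num_tokens, 2 * d) and writes silu(x[:, :d]) * x[:, d:] into out of shape (num_tokens, d).

```python
import triton
import triton.language as tl


@triton.jit
def silu_and_mul_kernel(out_ptr, x_ptr, d, BLOCK: tl.constexpr):
    # One program per (token, column block), launched on a
    # (num_tokens, cdiv(d, BLOCK)) grid; x and out are assumed contiguous.
    token = tl.program_id(0)
    offs = tl.program_id(1) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < d
    gate = tl.load(x_ptr + token * 2 * d + offs, mask=mask).to(tl.float32)
    up = tl.load(x_ptr + token * 2 * d + d + offs, mask=mask).to(tl.float32)
    out = gate * tl.sigmoid(gate) * up  # silu(gate) * up
    tl.store(out_ptr + token * d + offs, out.to(out_ptr.dtype.element_ty), mask=mask)
```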

The fused kernel:

| Model | M | N | K | E | top_k | Time (ms) | TFLOPS | Bandwidth (GB/s) |
|---|---|---|---|---|---|---|---|---|
| mistral-7B | 4096 | 14336 | 4096 | 8 | 2 | 2.363253 | 414.184572 | 462.545290 |
| mistral-22B | 4096 | 16384 | 6144 | 8 | 2 | 4.062578 | 413.183111 | 443.231723 |

moe gemm + silu activation as separate kernels:

| Model | M | N | K | E | top_k | Time (ms) | TFLOPS | Bandwidth (GB/s) |
|---|---|---|---|---|---|---|---|---|
| mistral-7B | 4096 | 14336 | 4096 | 8 | 2 | 2.358011 | 382.156425 | 449.828031 |
| mistral-22B | 4096 | 16384 | 6144 | 8 | 2 | 4.288173 | 386.892861 | 431.299634 |

The core Triton team is a small number of people, and we receive many PRs (thank
you!). To help us review your code more quickly, if you are a new
contributor (fewer than 3 PRs merged) we ask that you complete the following
tasks and include the filled-out checklist in your PR description.

Complete the following tasks before sending your PR, and replace [ ] with
[x] to indicate you have done them.

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because FILL THIS IN.
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section. (Usually running Python code
      and using the instructions it generates is not minimal.)

@Chi-Chu319 requested a review from vgokhale January 23, 2025 09:08
@Chi-Chu319 self-assigned this Jan 23, 2025
@Chi-Chu319 changed the title from Tianxing/fused moe single gemm to Fused moe gemm + silu activation kernel Jan 23, 2025
@Chi-Chu319 mentioned this pull request Jan 23, 2025