[MoE][PyTorch] Add mask-based MoE permutation #1373

hxbai · 2024-12-13T04:49:02Z

Description

Add fused kernels for mask-based token permutation and local chunk permutation. The kernels are implemented in OpenAI Triton.
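
For readers new to the idea, here is a minimal pure-Python sketch of what mask-based permutation computes. It assumes a routing mask of shape [num_tokens][num_experts] and groups token rows by expert while keeping the original token order within each expert. The helper names permute_by_mask and unpermute_by_indices are hypothetical, plain lists stand in for tensors, and this is a reference for the semantics only, not the fused Triton kernel from this PR.

```python
from typing import List, Sequence, Tuple


def permute_by_mask(
    tokens: Sequence[Sequence[float]], routing_mask: Sequence[Sequence[int]]
) -> Tuple[List[List[float]], List[int]]:
    """Group token rows by expert (hypothetical reference helper).

    routing_mask[t][e] is truthy when token t is routed to expert e.
    Rows for expert 0 come first, then expert 1, and so on.
    """
    num_experts = len(routing_mask[0]) if routing_mask else 0
    sorted_indices = [
        t
        for e in range(num_experts)
        for t, row in enumerate(routing_mask)
        if row[e]
    ]
    permuted = [list(tokens[t]) for t in sorted_indices]
    return permuted, sorted_indices


def unpermute_by_indices(
    permuted: Sequence[Sequence[float]], sorted_indices: Sequence[int], num_tokens: int
) -> List[List[float]]:
    """Scatter rows back to token order, summing copies of the same token.

    Assumption: no routing probabilities are applied in this sketch.
    """
    hidden = len(permuted[0]) if permuted else 0
    out = [[0.0] * hidden for _ in range(num_tokens)]
    for row, t in zip(permuted, sorted_indices):
        for h, value in enumerate(row):
            out[t][h] += value
    return out


tokens = [[1.0], [2.0], [3.0]]
mask = [[1, 0], [1, 1], [0, 1]]  # token 1 goes to both experts
permuted, indices = permute_by_mask(tokens, mask)
# permuted == [[1.0], [2.0], [2.0], [3.0]], indices == [0, 1, 1, 2]
restored = unpermute_by_indices(permuted, indices, num_tokens=3)
# restored == [[1.0], [4.0], [3.0]] because token 1 is summed over two experts
```

The summation in the unpermute step is the kind of reduction that later comes up in the tolerance discussion.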

Related commit in Megatron-LM NVIDIA/Megatron-LM@ac0474d

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Non-breaking API changes in te.pytorch.permutation.moe_permute and te.pytorch.permutation.moe_unpermute
Add a new API, te.pytorch.permutation.moe_sort_chunks_by_indices
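
As a rough illustration of the local chunk permutation, the sketch below splits rows into contiguous chunks and concatenates the chunks in a new order. The exact signature of moe_sort_chunks_by_indices is not shown in this thread, so the function name sort_chunks_by_indices_reference and its parameters split_sizes and sorted_indices are assumptions made for illustration; plain lists stand in for tensors.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sort_chunks_by_indices_reference(
    rows: Sequence[T], split_sizes: Sequence[int], sorted_indices: Sequence[int]
) -> List[T]:
    """Reorder contiguous chunks of rows (hypothetical reference helper).

    rows is split into chunks of lengths split_sizes; the output is the
    concatenation of chunks[i] for i in sorted_indices.
    """
    if sum(split_sizes) != len(rows):
        raise ValueError("split_sizes must sum to the number of rows")
    chunks = []
    start = 0
    for size in split_sizes:
        chunks.append(list(rows[start : start + size]))
        start += size
    out: List[T] = []
    for i in sorted_indices:
        out.extend(chunks[i])
    return out


rows = ["a0", "a1", "b0", "c0", "c1"]
reordered = sort_chunks_by_indices_reference(rows, [2, 1, 2], [2, 0, 1])
# reordered == ["c0", "c1", "a0", "a1", "b0"]
```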

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

timmoon10 · 2025-01-08T19:03:56Z

transformer_engine/pytorch/permutation.py

 ]


-class _moe_permute(torch.autograd.Function):
-    """functional Permute"""
+class _moe_permute_indice_map(torch.autograd.Function):


Suggested change

-class _moe_permute_indice_map(torch.autograd.Function):
+class _moe_permute_index_map(torch.autograd.Function):

We should make sure to use "index" in user-facing APIs like moe_permute/moe_unpermute.

OK, modified.

timmoon10 · 2025-01-08T21:16:00Z

transformer_engine/pytorch/permutation.py

 import warnings
 from typing import Tuple
 import torch

 import transformer_engine_torch as tex
-from .constants import TE_DType
-from .float8_tensor import Float8Tensor
+import transformer_engine.pytorch.triton.permutation as triton_permuataion


Nit:

Suggested change

-import transformer_engine.pytorch.triton.permutation as triton_permuataion
+import transformer_engine.pytorch.triton.permutation as triton_permutation

timmoon10 · 2025-01-08T21:23:41Z

transformer_engine/pytorch/permutation.py

+            if ctx.fp8:
+                assert isinstance(
+                    permuted_act_grad, Float8Tensor
+                ), "Grad of the output must be in Float8Tensor type for FP8 moe_permute."


Couldn't we decouple FP8 in the forward and backward?

Suggested change

-            if ctx.fp8:
-                assert isinstance(
-                    permuted_act_grad, Float8Tensor
-                ), "Grad of the output must be in Float8Tensor type for FP8 moe_permute."
+            fp8 = isinstance(permuted_act_grad, Float8Tensor)
+            if fp8:

If there are no obstacles, we could also do the same thing for _moe_unpermute_mask_map and _moe_chunk_sort.

Modified. The backward pass now follows the dtype of the incoming grad tensor.
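
A tiny sketch of the decoupling discussed in this thread: the backward path is chosen from the type of the incoming gradient rather than from a flag saved during forward. Float8TensorStub and backward_precision are hypothetical stand-ins for illustration and are not part of Transformer Engine.

```python
class Float8TensorStub:
    """Hypothetical stand-in for Float8Tensor, for illustration only."""

    def __init__(self, data):
        self.data = data


def backward_precision(permuted_act_grad) -> str:
    """Choose the backward path from the gradient itself.

    Nothing recorded in forward is consulted, so an FP8 forward can be
    paired with a high-precision backward and vice versa.
    """
    fp8 = isinstance(permuted_act_grad, Float8TensorStub)
    return "fp8" if fp8 else "high_precision"


backward_precision(Float8TensorStub([1, 2]))  # "fp8"
backward_precision([1.0, 2.0])  # "high_precision"
```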

timmoon10 · 2025-01-08T21:50:21Z

tests/pytorch/test_permutation.py

+    # Results Check
+    #
+    ###################################################################################################################################
+    tols = dtype_tols(te_dtype)


Shouldn't we expect bit-wise exact results?

Suggested change

-    tols = dtype_tols(te_dtype)
+    tols = { "atol": 0, "rtol": 0 }

Hi Tim, I made some modifications here; the test now uses two sets of tolerances.

We cannot require bit-wise matching in every case. First, in the FP8 case the PyTorch reference function computes in FP32. Second, the unpermutation kernels perform reductions, so permute bwd, unpermute fwd, and unpermute bwd with probs cannot match bit-wise.

For all other cases I switched to bit-wise matching. Is this OK with you?
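
To make the two-tolerance idea concrete, here is a minimal sketch, assuming a test picks exact matching for pure data movement and dtype-based tolerances whenever FP8 or a reduction is involved. The names select_tols and all_close are hypothetical, and the tolerance values are placeholders rather than the ones used in the unit test. The example also shows why a reduction can break bit-wise equality: floating-point addition is not associative, so a different summation order can change the last bits.

```python
from typing import Dict, Sequence

EXACT_TOLS = {"atol": 0.0, "rtol": 0.0}


def select_tols(
    uses_fp8: bool, has_reduction: bool, dtype_tols: Dict[str, float]
) -> Dict[str, float]:
    """Return exact tolerances unless FP8 or a reduction is involved."""
    if uses_fp8 or has_reduction:
        return dict(dtype_tols)
    return dict(EXACT_TOLS)


def all_close(a: Sequence[float], b: Sequence[float], atol: float, rtol: float) -> bool:
    """Elementwise |a - b| <= atol + rtol * |b| check on plain lists."""
    return len(a) == len(b) and all(
        abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b)
    )


# Same three addends, two summation orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

placeholder_tols = {"atol": 1e-12, "rtol": 1e-12}  # illustrative values only
exact = select_tols(False, False, placeholder_tols)
loose = select_tols(False, True, placeholder_tols)

all_close([left_to_right], [right_to_left], **exact)  # False
all_close([left_to_right], [right_to_left], **loose)  # True
```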

timmoon10 · 2025-01-08T21:54:30Z

tests/pytorch/test_permutation.py

+    # Results Check
+    #
+    ###################################################################################################################################
+    tols = dtype_tols(te_dtype)


We should expect bit-wise exact results.

Suggested change

-    tols = dtype_tols(te_dtype)
+    tols = { "atol": 0, "rtol": 0 }

As in the thread above, I switched to bit-wise matching except for FP8.

phu0ngng · 2025-01-10T17:21:50Z

transformer_engine/pytorch/triton/permutation.py

+        mask=(offset < num_tokens),
+        other=0,
+    ).to(tl.int64)
+    expert_token_cumsum = tl.cumsum(expert_token_mask) * expert_token_mask


An interesting way to exclude the zero token_mask. Happy to learn!
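
The trick in the quoted line can be reproduced in plain Python: multiplying the running sum by the mask zeroes out every position that was not selected, while selected positions keep their 1-based rank. This is a sketch with itertools.accumulate standing in for tl.cumsum; masked_cumsum is a hypothetical helper name.

```python
from itertools import accumulate
from typing import List, Sequence


def masked_cumsum(mask: Sequence[int]) -> List[int]:
    """Running count of selected entries, forced to 0 where mask is 0."""
    running = accumulate(mask)
    return [count * m for count, m in zip(running, mask)]


masked_cumsum([0, 1, 1, 0, 1])  # [0, 1, 2, 0, 3]
```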

phu0ngng · 2025-01-10T17:45:28Z

transformer_engine/pytorch/triton/permutation.py

+    chunk_cumsum = tl.load(
+        row_id_map_ptr + pid_m * num_tokens + offset, mask=(offset < num_tokens), other=0
+    )
+
+    workspace_off = tl.arange(0, WORKSPACE_LOAD_WIDTH)
+    chunk_sums = tl.load(workspace_ptr + workspace_off, mask=workspace_off < chunk_idx)
+    chunk_cumsum = tl.where(chunk_cumsum == 0, -1, chunk_cumsum + tl.sum(chunk_sums) - 1)


The three names chunk_cumsum, chunk_sums, and the reassigned chunk_cumsum are quite confusing.
If I understand correctly, I suggest renaming them:

chunk_cumsum (as loaded) -> row_id_within_token_block

chunk_sums -> n_tokens_per_expert

chunk_cumsum (after tl.where) -> row_id

In addition, I think we should move the -1 to pass 1, since it is a correction to the calculation of expert_token_cumsum:

expert_token_cumsum = (tl.cumsum(expert_token_mask) - 1) * expert_token_mask

Thanks, you are right. I renamed them (chunk_sums became n_tokens_per_block rather than n_tokens_per_expert).

As for the -1: if we move it to pass 1, we can no longer easily distinguish row_id 0 from a masked-out entry (mask 0), and we would need an extra mechanism to tell whether a token is masked out. So I left the -1 in pass 2. Do you think that is OK?
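
The argument about where the -1 belongs can be checked with a small pure-Python model of the two passes. This is a simplified sketch for a single expert: pass 1 computes a masked, 1-based cumsum inside each token block, and pass 2 adds the token counts of the preceding blocks, subtracts 1, and maps masked entries to -1. The function names and the block_size parameter are hypothetical, and the real kernel's workspace layout is not modeled.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple


def pass1_block(mask_block: Sequence[int]) -> Tuple[List[int], int]:
    """Masked 1-based cumsum within one block, plus the block's token count."""
    running = accumulate(mask_block)
    ids = [count * m for count, m in zip(running, mask_block)]
    return ids, sum(mask_block)


def pass2_block(block_ids: Sequence[int], n_tokens_before: int) -> List[int]:
    """Turn block-local ranks into global 0-based row ids; 0 means masked out."""
    return [-1 if v == 0 else v + n_tokens_before - 1 for v in block_ids]


def row_ids(mask: Sequence[int], block_size: int) -> List[int]:
    out: List[int] = []
    n_tokens_before = 0
    for start in range(0, len(mask), block_size):
        ids, n_tokens = pass1_block(mask[start : start + block_size])
        out.extend(pass2_block(ids, n_tokens_before))
        n_tokens_before += n_tokens
    return out


def pass1_with_early_minus_one(mask_block: Sequence[int]) -> List[int]:
    """Variant with the -1 moved into pass 1, as proposed in the review."""
    running = accumulate(mask_block)
    return [(count - 1) * m for count, m in zip(running, mask_block)]


row_ids([0, 1, 1, 0, 1, 1], block_size=3)  # [-1, 0, 1, -1, 2, 3]
pass1_with_early_minus_one([0, 1, 1])  # [0, 0, 1]
```

In the second call the masked-out first entry and the first selected entry both come out as 0, which is the ambiguity described in the reply above.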


hxbai changed the title ~~[MoE][Common/PyTorch] Add mask-based MoE permutation~~ [MoE][PyTorch] Add mask-based MoE permutation Dec 13, 2024

yaox12 reviewed Dec 13, 2024

transformer_engine/pytorch/permutation.py (resolved)

hxbai and others added 4 commits December 13, 2024 06:05

add mask based moe permutation

7e04f9a

Signed-off-by: Hongxiao Bai <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

664af70

for more information, see https://pre-commit.ci
Signed-off-by: Hongxiao Bai <[email protected]>

change moe_chunk_permute to moe_sort_chunks_by_indices

a8f1daa

Signed-off-by: Hongxiao Bai <[email protected]>

fix __all__ in pytorch/permutation.py

ca94d72

Signed-off-by: Hongxiao Bai <[email protected]>

hxbai force-pushed the permute_fusion branch from 6160104 to ca94d72 Compare December 13, 2024 06:05

phu0ngng self-requested a review January 8, 2025 15:20

timmoon10 reviewed Jan 8, 2025

timmoon10 self-requested a review January 8, 2025 21:57

phu0ngng reviewed Jan 10, 2025

hxbai and others added 5 commits January 16, 2025 04:56

fix func/var names and typos; update tols in UT

fca9406

Signed-off-by: Hongxiao Bai <[email protected]>

Merge branch 'main' into permute_fusion

dc7bbca

update copyright

b493a23

Signed-off-by: Hongxiao Bai <[email protected]>

update doc

2fae821

Signed-off-by: Hongxiao Bai <[email protected]>

minor fix in UT

2b337d9

Signed-off-by: Hongxiao Bai <[email protected]>
