[RFC] Autotune should consider batch size and number of heads #117

Open
sustcsonglin opened this issue Jan 11, 2025 · 1 comment
Labels
enhancement New feature or request urgent

Comments

@sustcsonglin
Collaborator

Proposal

The kernel configuration selected by autotuning should be re-evaluated whenever (batch size × number of heads) changes, rather than reusing a configuration tuned for a different shape.

Rationale

The performance of the autotuned kernel can vary significantly when the product of (batch size × number of heads) changes, because the batch and head dimensions determine how much grid-level parallelism is available: a configuration tuned for one level of parallelism may underutilize or oversubscribe the GPU at another.
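
As a minimal sketch of what this could look like with Triton's `@triton.autotune` decorator (the kernel, its arguments, and the configs below are hypothetical placeholders, not the repo's actual kernels), the key point is listing `B` and `H` in `key=` so that a change in either value triggers a fresh tuning pass:

```python
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        triton.Config({'BT': 32}, num_warps=2),
        triton.Config({'BT': 64}, num_warps=4),
        triton.Config({'BT': 64}, num_warps=8),
    ],
    # B and H are scalar kernel arguments; when either changes, Triton
    # re-benchmarks all configs instead of reusing a stale winner.
    key=['B', 'H'],
)
@triton.jit
def copy_kernel(x_ptr, o_ptr, B, H, T, BT: tl.constexpr):
    # Trivial placeholder body: each program copies one tile of BT elements.
    pid = tl.program_id(0)
    offs = pid * BT + tl.arange(0, BT)
    mask = offs < B * H * T
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(o_ptr + offs, x, mask=mask)
```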

@sustcsonglin sustcsonglin added the enhancement New feature or request label Jan 11, 2025
@sustcsonglin sustcsonglin added this to the FLA v1.0.0 release milestone Jan 11, 2025
@sustcsonglin
Collaborator Author

Autotuning should also take the total sequence length into account, as the sequence length dimension provides parallelism in addition to the number of heads and batch size.
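
One way to key on sequence length without re-tuning for every distinct value would be to bucket it first, e.g. by rounding up to the next power of two. This is a sketch of a possible approach, not the repo's current behavior, and `T_BUCKET` is a hypothetical argument name:

```python
import triton


def seq_len_bucket(T: int) -> int:
    # Round the total sequence length up to the next power of two so that
    # nearby lengths share one tuned configuration instead of each distinct
    # length triggering a fresh autotuning pass.
    return triton.next_power_of_2(T)


# The bucketed value would then be passed to the kernel as an extra scalar
# argument and listed in the autotune key, e.g. key=['B', 'H', 'T_BUCKET'].
```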
