[PyTorch] cuBLAS workspace size fix for TP overlap unit test #1415

denera · 2025-01-17T19:08:23Z

Description

Ongoing work on PR #1337 exposed a bug in TP overlap: chunking/splitting a standard 32 MiB cuBLAS workspace causes a CUDA misaligned address error when cuBLAS dispatches an NVJET kernel for some (not all) GEMM sizes. Avoiding this misalignment requires increasing the cuBLAS workspace allocation in the DL framework by a factor equal to the number of concurrent GEMM streams in TP overlap (i.e. 3 * 32 MiB = 96 MiB for 3 concurrent streams).

Bootstrapping Userbuffers in transformer_engine.pytorch.base.initialize_ub() already accounts for this, but the pure GEMM unit test for TP overlap does not use this initialization. This PR corrects the workspace allocation in the pure GEMM unit test to avoid the misaligned address error in CI.
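
The sizing rule described above can be illustrated with a minimal sketch in plain Python. The helper names `tp_overlap_workspace_bytes` and `chunk_offsets` are hypothetical and are not part of Transformer Engine or cuBLAS; the sketch only reproduces the arithmetic (one full 32 MiB chunk per concurrent GEMM stream) and assumes the enlarged workspace is split evenly between streams.

```python
MIB = 1024 * 1024


def tp_overlap_workspace_bytes(num_streams: int, per_stream_mib: int = 32) -> int:
    """Hypothetical helper: total workspace size in bytes so that every
    concurrent GEMM stream gets a full-size chunk of its own."""
    if num_streams < 1:
        raise ValueError("num_streams must be >= 1")
    return num_streams * per_stream_mib * MIB


def chunk_offsets(total_bytes: int, num_streams: int) -> list[int]:
    """Hypothetical helper: byte offset of each stream's chunk, assuming
    the workspace is split evenly between the streams."""
    chunk = total_bytes // num_streams
    return [i * chunk for i in range(num_streams)]


total = tp_overlap_workspace_bytes(3)
print(total // MIB)  # 96
print([offset // MIB for offset in chunk_offsets(total, 3)])  # [0, 32, 64]
```

With 3 concurrent streams, each chunk is again a standard 32 MiB workspace, rather than roughly a third of one as happens when a single 32 MiB allocation is split.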

Fixes #1332

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Alp Dener <[email protected]>

ksivaman

LGTM

denera · 2025-01-17T22:32:15Z

/te-ci pytorch L1

fixed workspace allocation for TP overlap test with pure GEMM

c13a81b

Signed-off-by: Alp Dener <[email protected]>

denera added the bug label Jan 17, 2025

denera requested review from timmoon10, ptrendx, erhoo82 and ksivaman January 17, 2025 19:08

denera self-assigned this Jan 17, 2025

ksivaman approved these changes Jan 17, 2025
