Support `store_param_remainders` feature from Apex in TE Fused Adam #1408

sanandaraj5597 · 2025-01-13T23:49:27Z

Description

When the master parameters are in FP32 and the model parameters are in BF16, we can store only the trailing 16 bits of each master parameter (the remainder) and reconstruct the FP32 master param from the BF16 model param plus the remainder.

This halves the master parameter memory usage.
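The bit split described above can be illustrated with a minimal pure-Python sketch. It is not the kernel code from this PR: the helper names (`fp32_bits`, `bits_to_fp32`, `split_fp32`, `reconstruct_fp32`) are hypothetical, and it assumes the BF16 value is obtained by plain truncation of the FP32 bit pattern. An implementation that rounds to nearest when producing BF16 would need an extra correction step on top of this.

```python
import struct
from typing import Tuple


def fp32_bits(x: float) -> int:
    """Return the 32-bit IEEE-754 pattern of x after rounding to FP32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_fp32(bits: int) -> float:
    """Interpret a 32-bit pattern as an FP32 value."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def split_fp32(x: float) -> Tuple[int, int]:
    """Split an FP32 value into (bf16_bits, remainder).

    Assumption: BF16 is taken by truncation, so bf16_bits is simply the
    upper 16 bits (sign, exponent, top 7 mantissa bits) and remainder is
    the lower 16 mantissa bits.
    """
    bits = fp32_bits(x)
    return bits >> 16, bits & 0xFFFF


def reconstruct_fp32(bf16_bits: int, remainder: int) -> float:
    """Rebuild the FP32 master value from the BF16 bits and the remainder."""
    return bits_to_fp32((bf16_bits << 16) | remainder)


if __name__ == "__main__":
    master = 0.1
    hi, lo = split_fp32(master)
    # Only `lo` (16 bits) has to be kept as optimizer state; `hi` is the
    # BF16 model parameter that already exists.
    restored = reconstruct_fp32(hi, lo)
    print(hex(hi), hex(lo), restored == bits_to_fp32(fp32_bits(master)))
```

Under this truncation assumption the round trip is bit-exact, so the optimizer state only needs 2 bytes per parameter instead of 4.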

Signed-off-by: Selvaraj Anandaraj <[email protected]>


MaciejBalaNV · 2025-01-17T13:27:10Z

transformer_engine/pytorch/optimizers/fused_adam.py

@@ -243,13 +256,14 @@ def _apply_scale(self, state_name, unscaled_state, scaled_state, scale):
            unscaled_state.mul_(rscale)
            scaled_state.copy_(unscaled_state)

-    def get_unscaled_state(self, param, state_name):
+    def get_unscaled_state(self, param, state_name, store_param_remainders=False):


The default value of `store_param_remainders` is False here, but it's True by default in the constructor. I think that's misleading; why not just set it to True here as well?

sanandaraj5597 · 2025-01-17

I don't want to store param remainders for any `state_name` other than `master_params`, which is why it defaults to False.

MaciejBalaNV · 2025-01-17T14:36:12Z

I'm getting NaNs when using this feature. You can reproduce it by running test_fused_optimizer tests, after setting store_param_remainders=True in _initialize_state method (otherwise it fails earlier) and by commenting out torch.testing.assert_close(ref_params, master_params) check (this is expected to fail, since we now keep master_params as int16).

Still, with all these changes, the tests fail at torch.testing.assert_close(ref_params, model_params_to_fp32, rtol=1e-2, atol=1e-2, equal_nan=True) with an error message that weights are NaN.

Selvaraj Anandaraj and others added 6 commits January 13, 2025 14:35

Initial commit

9072c5f

Signed-off-by: Selvaraj Anandaraj <[email protected]>

Fixed compilation errors

7d5d0dc

Signed-off-by: Selvaraj Anandaraj <[email protected]>

Merge branch 'main' into param_remainder

bcaf16d

Fixed syntax errors

979a4c1

Signed-off-by: Selvaraj Anandaraj <[email protected]>

Merge branch 'param_remainder' of https://github.com/sanandaraj5597/T…

75b737c

…ransformerEngine into param_remainder

[pre-commit.ci] auto fixes from pre-commit.com hooks

887432d

for more information, see https://pre-commit.ci
