[Bug] Loss error: flat_fp32_avg_grads is not unscaled when HybridZeroOptimizer uses cpu_offload #398

Open
fengsibo opened this issue Jan 1, 2025 · 2 comments
Labels
bug Something isn't working

Comments

@fengsibo
Contributor

fengsibo commented Jan 1, 2025

Describe the bug

single_grad_partition_groups.append(flat_fp32_avg_grads)

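single_grad_partition_groups.append(flat_fp32_avg_grads) collects flat_fp32_avg_grads so that unscale_and_clip_grad can process it. With cpu_offload enabled, however, self._fp32_flat_param_groups_of_current_rank[group_id].grad = flat_fp32_avg_grads.to(device) moves the grad tensor to the CPU, and a cross-device .to() returns a new copy. Gradient clipping is therefore applied only to the device tensors in single_grad_partition_groups, while the CPU grad that is actually used in the computation keeps its previous values. This leads to abnormal loss.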


            single_grad_partition_groups.append(flat_fp32_avg_grads)
            device = self._fp32_flat_param_groups_of_current_rank[group_id].device
            self._fp32_flat_param_groups_of_current_rank[group_id].grad = flat_fp32_avg_grads.to(device)
        # unscale and clip grads
        # get the global norm
        global_norm_groups = {}
        if self._clip_grad_norm > 0:
            for group_name, norm in norms.items():
                global_norm_groups[group_name] = norm**0.5

        # the following operations are performed only on the rank to which parameters are assigned.
        if gpc.config.model.dtype is not torch.float32:
            if len(single_grad_partition_groups) != 0 and self._clip_grad_norm > 0:
                self._unscale_and_clip_grads(
                    single_grad_partition_groups,
                    list(global_norm_groups.values()),
                    loss_scale,
                )
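
The aliasing problem in the excerpt above can be sketched without a GPU. This is a minimal illustration, not InternEvo code: FakeTensor, unscale_ and step are hypothetical names, and FakeTensor only mimics the documented torch.Tensor.to behaviour of returning the same object when the device already matches and a new copy otherwise. The last case shows one possible remedy (collecting the tensor that is actually assigned to .grad); it is an assumption and not necessarily what the fix PR does.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a tensor; not part of PyTorch or InternEvo."""

    data: List[float]
    device: str = "cuda"

    def to(self, device: str) -> "FakeTensor":
        # Same device: return self. Different device: return a new copy.
        if device == self.device:
            return self
        return FakeTensor(list(self.data), device)

    def mul_(self, factor: float) -> "FakeTensor":
        # In-place scaling, standing in for the unscale/clip step.
        self.data = [x * factor for x in self.data]
        return self


def unscale_(grads: List[FakeTensor], loss_scale: float) -> None:
    # Divide every collected gradient by the loss scale, in place.
    for g in grads:
        g.mul_(1.0 / loss_scale)


def step(param_device: str, collect_param_grad: bool) -> List[float]:
    flat_grad = FakeTensor([1024.0, 2048.0], device="cuda")
    # With offload, param_device is "cpu" and this is a separate copy.
    param_grad = flat_grad.to(param_device)
    # Reported behaviour: collect flat_grad. Assumed remedy: collect param_grad.
    collected = [param_grad if collect_param_grad else flat_grad]
    unscale_(collected, loss_scale=1024.0)
    # Return the gradient the optimizer step would actually read.
    return param_grad.data


no_offload = step("cuda", collect_param_grad=False)   # same object, unscaled
offload_bug = step("cpu", collect_param_grad=False)   # copy left untouched
offload_fixed = step("cpu", collect_param_grad=True)  # copy is unscaled
print(no_offload, offload_bug, offload_fixed)
```

Without offload the collected tensor and the parameter's grad are the same object, so the in-place unscale reaches the optimizer. With offload they are two objects, and only the one that is collected gets unscaled.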

Environment

internevo: 5ad2eb0

Enabling offload_cpu in HybridZeroOptimizer is enough to reproduce the problem.

Other information

No response

@fengsibo
Contributor Author

fengsibo commented Jan 1, 2025

fix PR: #399

@fengsibo
Contributor Author

fengsibo commented Jan 2, 2025

[image]
