How to save self-defined model with deepspeed zero 3? #3320

Open
amoyplane opened this issue Jan 2, 2025 · 0 comments

System Info

- `Accelerate` version: 1.0.1
- Python version: 3.10.0
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 128.00 GB
- GPU type: NVIDIA H20
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - mixed_precision: no
        - use_cpu: False
        - debug: False
        - num_processes: 4
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - deepspeed_config: {'gradient_accumulation_steps': 1, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': True, 'zero3_save_16bit_model': True, 'zero_stage': 3}                                                                                               
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

My custom model inherits from torch.nn.Module.
I am training this model on 4 H20 GPUs with DeepSpeed ZeRO-3.
I am trying to save a checkpoint with this code:
```python
# save model
if (idx % args.save_per_steps == 0) and (idx != 0):
    accelerator.wait_for_everyone()
    if accelerator.is_local_main_process:
        accelerator.print('Saving model ...')
        save_dir = os.path.join(args.save_path, args.save_name + '_epoch_' + str(epoch) + '_step_' + str(idx))
        accelerator.print('Getting state dict ...')
        state_dict = accelerator.get_state_dict(model)
        accelerator.print('Unwrapping model ...')
        unwrapped_model = accelerator.unwrap_model(model)
        accelerator.print('Saving checkpoint ...')
        unwrapped_model.save_checkpoint(save_dir, idx, state_dict)
        accelerator.print('Model saved!')
    accelerator.wait_for_everyone()
```
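A reworked sketch I am considering, assuming the hang happens because `get_state_dict` performs a collective gather of the ZeRO-3 shards and therefore has to run on every rank (only the write to disk would be restricted to the main process), and assuming a plain `torch.save` of the gathered state dict is acceptable for a custom module:

```python
import os
import torch

# save model
if (idx % args.save_per_steps == 0) and (idx != 0):
    accelerator.wait_for_everyone()
    # ZeRO-3 shards parameters across ranks, so the gather inside
    # get_state_dict must be entered by every process, not just the main one.
    state_dict = accelerator.get_state_dict(model)
    if accelerator.is_main_process:
        save_dir = os.path.join(
            args.save_path, f'{args.save_name}_epoch_{epoch}_step_{idx}')
        os.makedirs(save_dir, exist_ok=True)
        # Only the main process touches the filesystem.
        torch.save(state_dict, os.path.join(save_dir, 'pytorch_model.bin'))
    accelerator.wait_for_everyone()
```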

Expected behavior

The code gets stuck while getting the state dict.
I also tried `accelerator.save_model`, but that did not work either.

What is the recommended way to save and load a large model trained with DeepSpeed ZeRO-3?
Thank you very much.
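
For reference, the other pattern I expected to be able to use (a sketch, assuming `accelerator.save_state` / `accelerator.load_state` cover DeepSpeed checkpoints and are meant to be called on every rank) is:

```python
# Saves model, optimizer, scheduler and RNG state; with DeepSpeed enabled this
# should delegate to the engine's own checkpoint format. Call on all ranks.
accelerator.save_state(save_dir)

# Later, to resume from the same directory:
accelerator.load_state(save_dir)
```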
