How to save self-defined model with deepspeed zero 3? #3320

Open
amoyplane opened this issue Jan 2, 2025 · 0 comments

System Info

- `Accelerate` version: 1.0.1
- Python version: 3.10.0
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 128.00 GB
- GPU type: NVIDIA H20
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - mixed_precision: no
        - use_cpu: False
        - debug: False
        - num_processes: 4
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - deepspeed_config: {'gradient_accumulation_steps': 1, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': True, 'zero3_save_16bit_model': True, 'zero_stage': 3}                                                                                               
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

My custom model inherits from torch.nn.Module.
I am training this model on 4 H20 GPUs with DeepSpeed ZeRO-3.
I am trying to save a checkpoint with this code:
```python
# save model
if (idx % args.save_per_steps == 0) and (idx != 0):
    accelerator.wait_for_everyone()
    if accelerator.is_local_main_process:
        accelerator.print('Saving model ...')
        save_dir = os.path.join(args.save_path, args.save_name + '_epoch_' + str(epoch) + '_step_' + str(idx))
        accelerator.print('Getting state dict ...')
        state_dict = accelerator.get_state_dict(model)
        accelerator.print('Unwrapping model ...')
        unwrapped_model = accelerator.unwrap_model(model)
        accelerator.print('Saving checkpoint ...')
        unwrapped_model.save_checkpoint(save_dir, idx, state_dict)
        accelerator.print('Model saved!')
    accelerator.wait_for_everyone()
```
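A reworked sketch I am considering, assuming the hang happens because `get_state_dict` performs a collective gather of the ZeRO-3 shards and therefore has to run on every rank (only the write to disk would be restricted to the main process), and assuming a plain `torch.save` of the gathered state dict is acceptable for a custom module:

```python
import os
import torch

# save model
if (idx % args.save_per_steps == 0) and (idx != 0):
    accelerator.wait_for_everyone()
    # ZeRO-3 shards parameters across ranks, so the gather inside
    # get_state_dict must be entered by every process, not just the main one.
    state_dict = accelerator.get_state_dict(model)
    if accelerator.is_main_process:
        save_dir = os.path.join(
            args.save_path, f'{args.save_name}_epoch_{epoch}_step_{idx}')
        os.makedirs(save_dir, exist_ok=True)
        # Only the main process touches the filesystem.
        torch.save(state_dict, os.path.join(save_dir, 'pytorch_model.bin'))
    accelerator.wait_for_everyone()
```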

Expected behavior

The code gets stuck while getting the state dict.
I also tried `accelerator.save_model`, but that did not work either.

What is the recommended way to save and load a large model trained with DeepSpeed ZeRO-3?
Thank you very much.
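
For reference, the other pattern I expected to be able to use (a sketch, assuming `accelerator.save_state` / `accelerator.load_state` cover DeepSpeed checkpoints and are meant to be called on every rank) is:

```python
# Saves model, optimizer, scheduler and RNG state; with DeepSpeed enabled this
# should delegate to the engine's own checkpoint format. Call on all ranks.
accelerator.save_state(save_dir)

# Later, to resume from the same directory:
accelerator.load_state(save_dir)
```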
