One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
My own task or dataset (give details below)
Reproduction
My custom model inherits from torch.nn.Module.
I am training this model on 4 H20 GPUs using DeepSpeed ZeRO-3.
I am trying to save a checkpoint with this code:
```
# Save the model every args.save_per_steps steps (skipping step 0).
if (idx % args.save_per_steps == 0) and (idx != 0):
    accelerator.wait_for_everyone()
    if accelerator.is_local_main_process:
        accelerator.print('Saving model ...')
        save_dir = os.path.join(args.save_path, args.save_name + '_epoch_' + str(epoch) + '_step_' + str(idx))
        accelerator.print('Getting state dict ...')
        # This is the call where the run hangs.
        state_dict = accelerator.get_state_dict(model)
        accelerator.print('Unwrapping model ...')
        unwrapped_model = accelerator.unwrap_model(model)
        accelerator.print('Saving checkpoint ...')
        # save_checkpoint is a method of my custom model.
        unwrapped_model.save_checkpoint(save_dir, idx, state_dict)
        accelerator.print('Model saved!')
    accelerator.wait_for_everyone()
```
Expected behavior
The code hangs when getting the state dict.
I also tried accelerator.save_model, but that did not work either.
What is the recommended way to save and load a large model trained with DeepSpeed ZeRO-3?
Thank you very much.
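One likely cause of the hang, offered as a hedged reading of the snippet rather than a confirmed diagnosis: under ZeRO-3 the parameters are sharded across ranks, so `accelerator.get_state_dict(model)` has to gather them collectively. If only the main process reaches that call, it waits for ranks that never join. The sketch below calls `get_state_dict` on every rank and restricts only the file write to the main process. The helper name `save_zero3_checkpoint` and the default file name are hypothetical, and it uses `accelerator.save` in place of the custom `save_checkpoint` method from the report.

```python
import os


def save_zero3_checkpoint(accelerator, model, save_dir, filename="pytorch_model.bin"):
    """Gather the full state dict on all ranks, then write it on the main process only.

    `accelerator` is assumed to be an accelerate.Accelerator (or any object that
    exposes the same four members used here). The helper name and the default
    file name are hypothetical.
    """
    # Make sure every rank has finished the current step before gathering.
    accelerator.wait_for_everyone()

    # Collective call: every rank must execute it, not just the main process.
    state_dict = accelerator.get_state_dict(model)

    # Only one process touches the filesystem.
    if accelerator.is_main_process:
        os.makedirs(save_dir, exist_ok=True)
        accelerator.save(state_dict, os.path.join(save_dir, filename))

    # Keep the other ranks from racing ahead while the file is being written.
    accelerator.wait_for_everyone()
    return state_dict
```

In the training loop this would replace the body of the `if (idx % args.save_per_steps == 0)` branch, called without an `is_local_main_process` guard around it. As far as I know, gathering the full weights under ZeRO-3 also requires `stage3_gather_16bit_weights_on_model_save` to be enabled in the DeepSpeed config. For resuming training, `accelerator.save_state(output_dir)` and `accelerator.load_state(output_dir)`, called on all ranks, may be the better fit, since they keep the checkpoint sharded instead of consolidating it.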