Feature request
Currently, checkpoints are saved by the `_save_checkpoint()` method, which saves the model, optionally the optimizer, and finally the Trainer state.
On resume, `resume_from_checkpoint` gets the checkpoint directory from `get_last_checkpoint`. The model and related objects are then loaded via `self._load_from_checkpoint()`, and the trainer state is loaded via `TrainerState.load_from_json`.
If the training program is killed in the middle of checkpointing, the directory is created but some of its files are missing. For example, if the trainer state has not been written yet, `TrainerState.load_from_json` raises an error and training cannot resume at all.
Proposal: if loading either the model or the trainer state fails in `resume_from_checkpoint` because the last checkpoint directory is incomplete, fall back to the second-last checkpoint directory for resuming.
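For illustration, a minimal sketch of the fallback, assuming checkpoint directories follow the usual `checkpoint-<step>` naming and that `trainer_state.json` is the last file written (as described above). `sorted_checkpoints` and `resolve_resume_checkpoint` are hypothetical helpers, not existing Trainer methods; the real resume logic lives inside `Trainer.train()` / `get_last_checkpoint`.

```python
# Minimal sketch of the proposed fallback, not the actual Trainer code path.
# sorted_checkpoints / resolve_resume_checkpoint are hypothetical helpers.
import os
import re
from typing import List, Optional

from transformers import TrainerState

TRAINER_STATE_NAME = "trainer_state.json"  # written last by _save_checkpoint()


def sorted_checkpoints(output_dir: str) -> List[str]:
    """Return checkpoint-* directories under output_dir, oldest step first."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    found = []
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path):
            found.append((int(match.group(1)), path))
    return [path for _, path in sorted(found)]


def resolve_resume_checkpoint(output_dir: str) -> Optional[str]:
    """Walk checkpoints newest-to-oldest and return the first one whose
    trainer state loads cleanly, skipping directories left incomplete
    by a job killed mid-save."""
    for checkpoint in reversed(sorted_checkpoints(output_dir)):
        try:
            TrainerState.load_from_json(os.path.join(checkpoint, TRAINER_STATE_NAME))
            return checkpoint
        except (OSError, ValueError):
            # Missing or truncated trainer_state.json: try the previous checkpoint.
            continue
    return None
```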
Motivation
Currently, our job can be killed in the middle of checkpointing and then cannot resume, because the last checkpoint is incomplete. We have to manually delete that folder to resume from the previous checkpoint.
Some info (not sure if relevant): I start the training with `accelerate launch`.
Your contribution
I am willing to submit a PR for this if this feature seems acceptable.
We would need to wrap the two loads in a try/except block and, if either fails, retry using the second-last directory.
It looks like `_load_from_checkpoint` is called in quite a few places, with special handling for FSDP etc., which would complicate things. Alternatively, we could keep a list of expected files and check that they all exist; if any are missing, load the second-last directory instead (see the sketch below).
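A rough sketch of that alternative. The file list here is purely illustrative, since the actual set depends on the save format (safetensors vs. bin, FSDP/DeepSpeed shards, whether the optimizer is saved, etc.):

```python
# Rough sketch of the expected-files completeness check.
# EXPECTED_FILES is illustrative only; the real set varies with the save format.
import os

EXPECTED_FILES = ["trainer_state.json", "config.json"]


def is_checkpoint_complete(checkpoint_dir: str) -> bool:
    """Return True only if every expected file exists in the checkpoint dir."""
    return all(
        os.path.isfile(os.path.join(checkpoint_dir, name)) for name in EXPECTED_FILES
    )
```

A helper like `resolve_resume_checkpoint` in the earlier sketch could then call `is_checkpoint_complete` instead of attempting the load, which avoids wrapping every backend-specific load path in a try/except.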