
Trainer: Use second last checkpoint if last checkpoint loading fails #35525

Open · SilverSoldier opened this issue Jan 6, 2025 · 0 comments · May be fixed by #35580
Labels: Feature request

SilverSoldier commented Jan 6, 2025

Feature request

Currently, the checkpoint is saved in the _save_checkpoint() method, which writes the model, optionally the optimizer, and finally the Trainer state.

The resume_from_checkpoint() path gets the checkpoint directory from the get_last_checkpoint function. The model etc. are then loaded via self._load_from_checkpoint(), and the trainer state is loaded via TrainerState.load_from_json.

If the training program ends abruptly in the middle of checkpointing, the directory is created but some of its files are missing. For example, if the trainer state was not yet written, TrainerState.load_from_json throws an error and training cannot resume at all.

Proposal: if loading either the model or the trainer state fails in resume_from_checkpoint() because the last checkpoint directory is incomplete, fall back to the second-last checkpoint folder for resuming.
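As a rough illustration, here is a minimal sketch of that fallback. It assumes the usual checkpoint-<step> directory naming and that trainer_state.json is the last file _save_checkpoint() writes; resolve_resume_checkpoint is a hypothetical helper, not an existing Trainer API:

```python
import os
import re

from transformers.trainer_callback import TrainerState

# Checkpoint directories produced by Trainer are named "checkpoint-<global_step>".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def resolve_resume_checkpoint(output_dir):
    """Return the newest checkpoint directory that loads cleanly, falling back
    to older ones when the latest is incomplete. Returns None if none work."""
    candidates = [
        d
        for d in os.listdir(output_dir)
        if _CHECKPOINT_RE.match(d) and os.path.isdir(os.path.join(output_dir, d))
    ]
    # Newest first, ordered by the global step encoded in the directory name.
    candidates.sort(key=lambda d: int(_CHECKPOINT_RE.match(d).group(1)), reverse=True)

    for name in candidates:
        checkpoint = os.path.join(output_dir, name)
        try:
            # trainer_state.json is written at the end of _save_checkpoint(),
            # so if it parses, the directory was most likely written completely.
            TrainerState.load_from_json(os.path.join(checkpoint, "trainer_state.json"))
            return checkpoint
        except (OSError, ValueError):
            # Missing or truncated state file: treat this checkpoint as
            # incomplete and try the next older one.
            continue
    return None
```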

Motivation

Currently, our job can be killed in the middle of checkpointing and is then unable to resume because the last checkpoint is incomplete; we have to manually delete that folder to resume from the previous checkpoint.

Possibly relevant: I use accelerate launch to start the training.

Your contribution

I am willing to submit a PR for this if this feature seems acceptable.

We would need to wrap the two loads in a try/except block and, if either of them fails, retry using the second-last directory.
However, checkpoint loading happens in a lot of places, with special handling for FSDP etc., which would complicate things. Alternatively, we could keep a list of expected files and check that they all exist, falling back to the second-last directory otherwise (see the sketch below).
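A short sketch of that expected-files check, assuming the standard checkpoint-<step> naming; the helper names and the file list are illustrative, and a real implementation would also have to cover the FSDP/DeepSpeed file layouts:

```python
import os

# Illustrative minimum: a checkpoint is only usable if these files exist.
# A real list would also include the model/optimizer files for the active backend.
EXPECTED_FILES = ["trainer_state.json"]


def is_complete_checkpoint(checkpoint_dir):
    return all(os.path.isfile(os.path.join(checkpoint_dir, f)) for f in EXPECTED_FILES)


def get_last_complete_checkpoint(output_dir):
    """Like get_last_checkpoint, but skip directories missing expected files."""
    checkpoints = sorted(
        (
            d
            for d in os.listdir(output_dir)
            if d.startswith("checkpoint-") and d.split("-")[-1].isdigit()
        ),
        key=lambda d: int(d.split("-")[-1]),
    )
    for name in reversed(checkpoints):
        path = os.path.join(output_dir, name)
        if os.path.isdir(path) and is_complete_checkpoint(path):
            return path
    return None
```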

SilverSoldier added the Feature request label on Jan 6, 2025
SilverSoldier changed the title from "Use second last checkpoint if last checkpoint loading fails" to "Trainer: Use second last checkpoint if last checkpoint loading fails" on Jan 6, 2025