
Trainer: Use second last checkpoint if last checkpoint loading fails #35525

Open · SilverSoldier opened this issue Jan 6, 2025 · 0 comments · May be fixed by #35580
Labels: Feature request

SilverSoldier commented Jan 6, 2025

Feature request

Currently, the checkpoint is saved in the _save_checkpoint() method, which writes the model, optionally the optimizer, and finally the Trainer state.

The resume_from_checkpoint() path gets the checkpoint directory from the get_last_checkpoint function. The model etc. are then loaded via self._load_from_checkpoint(), and the trainer state is loaded via TrainerState.load_from_json.

If the training program ends abruptly in the middle of checkpointing, the directory is created but some of its files are missing. For example, if the trainer state was not yet written, TrainerState.load_from_json throws an error and training cannot resume at all.

Proposal: if loading either the model or the trainer state fails in resume_from_checkpoint() because the last checkpoint directory is incomplete, fall back to the second-last checkpoint folder for resuming.
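As a rough illustration, here is a minimal sketch of that fallback. It assumes the usual checkpoint-<step> directory naming and that trainer_state.json is the last file _save_checkpoint() writes; resolve_resume_checkpoint is a hypothetical helper, not an existing Trainer API:

```python
import os
import re

from transformers.trainer_callback import TrainerState

# Checkpoint directories produced by Trainer are named "checkpoint-<global_step>".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def resolve_resume_checkpoint(output_dir):
    """Return the newest checkpoint directory that loads cleanly, falling back
    to older ones when the latest is incomplete. Returns None if none work."""
    candidates = [
        d
        for d in os.listdir(output_dir)
        if _CHECKPOINT_RE.match(d) and os.path.isdir(os.path.join(output_dir, d))
    ]
    # Newest first, ordered by the global step encoded in the directory name.
    candidates.sort(key=lambda d: int(_CHECKPOINT_RE.match(d).group(1)), reverse=True)

    for name in candidates:
        checkpoint = os.path.join(output_dir, name)
        try:
            # trainer_state.json is written at the end of _save_checkpoint(),
            # so if it parses, the directory was most likely written completely.
            TrainerState.load_from_json(os.path.join(checkpoint, "trainer_state.json"))
            return checkpoint
        except (OSError, ValueError):
            # Missing or truncated state file: treat this checkpoint as
            # incomplete and try the next older one.
            continue
    return None
```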

Motivation

Currently, our job can be killed in the middle of checkpointing and is then unable to resume because the last checkpoint is incomplete; we have to manually delete that folder to resume from the previous checkpoint.

Possibly relevant: I use accelerate launch to start the training.

Your contribution

I am willing to submit a PR for this if this feature seems acceptable.

We would need to wrap the two loads in a try/except block and, if either of them fails, retry using the second-last directory.
However, checkpoint loading happens in a lot of places, with special handling for FSDP etc., which would complicate things. Alternatively, we could keep a list of expected files and check that they all exist, falling back to the second-last directory otherwise (see the sketch below).
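A short sketch of that expected-files check, assuming the standard checkpoint-<step> naming; the helper names and the file list are illustrative, and a real implementation would also have to cover the FSDP/DeepSpeed file layouts:

```python
import os

# Illustrative minimum: a checkpoint is only usable if these files exist.
# A real list would also include the model/optimizer files for the active backend.
EXPECTED_FILES = ["trainer_state.json"]


def is_complete_checkpoint(checkpoint_dir):
    return all(os.path.isfile(os.path.join(checkpoint_dir, f)) for f in EXPECTED_FILES)


def get_last_complete_checkpoint(output_dir):
    """Like get_last_checkpoint, but skip directories missing expected files."""
    checkpoints = sorted(
        (
            d
            for d in os.listdir(output_dir)
            if d.startswith("checkpoint-") and d.split("-")[-1].isdigit()
        ),
        key=lambda d: int(d.split("-")[-1]),
    )
    for name in reversed(checkpoints):
        path = os.path.join(output_dir, name)
        if os.path.isdir(path) and is_complete_checkpoint(path):
            return path
    return None
```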

SilverSoldier added the Feature request label on Jan 6, 2025
SilverSoldier changed the title from "Use second last checkpoint if last checkpoint loading fails" to "Trainer: Use second last checkpoint if last checkpoint loading fails" on Jan 6, 2025