Resuming the training through checkpoint with tez #14

vikas-nexcom · 2021-01-15T11:12:06Z

Hi,

I am wondering if it is possible to pick up a saved model and resume/continue the training with tez. I am new to pytorch. Here is what I tried:

class Bert(tez.Model):

    def __init__(self, num_classes, num_train_steps=None):

        super().__init__()
        self.bert = transformers.BertModel.from_pretrained(
           'bert-base-uncased, 
            return_dict=False
            )

        if config.RETRAINING: # set to True
            self.bert.load(
            'demo.bin', 
            device='cuda')    

        self.bert_drop = nn.Dropout(0.3)
        self.out = nn.Linear(self.bert.config.hidden_size, num_classes)

and it doesn't work. I am not sure what I am missing. I found this for pytorch:

https://discuss.pytorch.org/t/loading-a-saved-model-for-continue-training/17244

but I am not sure how to use this together with tez.

The text was updated successfully, but these errors were encountered:

abhishekkrthakur · 2021-01-23T12:56:43Z

Sure, just do model.load() and you can re-train. you might also want to load the state of optimizer and scheduler. ill add support for saving them.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Resuming the training through checkpoint with tez #14

Resuming the training through checkpoint with tez #14

vikas-nexcom commented Jan 15, 2021 •

edited

Loading

abhishekkrthakur commented Jan 23, 2021

Resuming the training through checkpoint with tez #14

Resuming the training through checkpoint with tez #14

Comments

vikas-nexcom commented Jan 15, 2021 • edited Loading

abhishekkrthakur commented Jan 23, 2021

vikas-nexcom commented Jan 15, 2021 •

edited

Loading