Resuming the training through checkpoint with tez #14

Open
vikas-nexcom opened this issue Jan 15, 2021 · 1 comment


vikas-nexcom commented Jan 15, 2021

Hi,

I am wondering whether it is possible to pick up a saved model and resume training with tez. I am new to PyTorch. Here is what I tried:

class Bert(tez.Model):
    def __init__(self, num_classes, num_train_steps=None):
        super().__init__()
        self.bert = transformers.BertModel.from_pretrained(
            'bert-base-uncased',
            return_dict=False
        )

        if config.RETRAINING:  # set to True
            self.bert.load(
                'demo.bin',
                device='cuda'
            )

        self.bert_drop = nn.Dropout(0.3)
        self.out = nn.Linear(self.bert.config.hidden_size, num_classes)

This doesn't work, and I am not sure what I am missing. I found this thread for PyTorch:

https://discuss.pytorch.org/t/loading-a-saved-model-for-continue-training/17244

but I am not sure how to use this together with tez.

abhishekkrthakur (Owner) commented

Sure, just call model.load() on the model and you can retrain. You might also want to load the state of the optimizer and scheduler. I'll add support for saving them.
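
For anyone landing here before that support exists, below is a minimal sketch of the general pattern in plain PyTorch. It is not tez's API. It saves the model, optimizer and scheduler state together and restores all three before training continues. The helper names (`save_checkpoint`, `load_checkpoint`, `build`), the file name `checkpoint.bin` and the tiny `nn.Linear` model are assumptions made for illustration only. With tez, the reply above suggests calling `load()` on the `tez.Model` instance itself (for example `model.load('demo.bin', device='cuda')`, assuming the arguments used in the question) rather than on `self.bert` inside `__init__`.

```python
import os
import tempfile

import torch
import torch.nn as nn


def save_checkpoint(path, model, optimizer, scheduler, epoch):
    # Hypothetical helper: bundle everything needed to resume into one file.
    torch.save(
        {
            "epoch": epoch,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
        },
        path,
    )


def load_checkpoint(path, model, optimizer, scheduler, device="cpu"):
    # Hypothetical helper: restore all three states in place and
    # return the epoch at which training should continue.
    checkpoint = torch.load(path, map_location=device)
    model.load_state_dict(checkpoint["model"])
    optimizer.load_state_dict(checkpoint["optimizer"])
    scheduler.load_state_dict(checkpoint["scheduler"])
    return checkpoint["epoch"] + 1


def build():
    # Stand-in for the real model; the objects must be constructed the
    # same way before their saved state can be loaded into them.
    model = nn.Linear(4, 2)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)
    return model, optimizer, scheduler


# First run: one training step, then save.
model, optimizer, scheduler = build()
loss = model(torch.randn(8, 4)).sum()
loss.backward()
optimizer.step()
scheduler.step()

path = os.path.join(tempfile.mkdtemp(), "checkpoint.bin")
save_checkpoint(path, model, optimizer, scheduler, epoch=0)

# Later run: rebuild the same objects, then restore their state.
resumed_model, resumed_optimizer, resumed_scheduler = build()
start_epoch = load_checkpoint(
    path, resumed_model, resumed_optimizer, resumed_scheduler
)
```

Restoring only the model weights also works, but the optimizer's moment estimates and the learning-rate schedule then restart from scratch, which is why the reply recommends loading their state as well.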
