
Add linear from unifold for a clean-path residual transformer #38

Open: wants to merge 8 commits into base: main
Conversation

@hypnopump commented on Dec 12, 2023

Transformer layers were not following clean-path residual principles (the model should start as the identity). This PR borrows the Linear module from UniFold and ensures that all transformer residual branches are initialized to zero at the start of training.
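For reference, a minimal sketch of the zero-init idea, assuming a UniFold-style Linear with an `init` argument (the names and the `"final"` init string are illustrative, not necessarily the exact UniFold API):

```python
import torch.nn as nn


class Linear(nn.Linear):
    """nn.Linear with a configurable initialization scheme."""

    def __init__(self, in_features: int, out_features: int,
                 bias: bool = True, init: str = "default"):
        super().__init__(in_features, out_features, bias=bias)
        if init == "final":
            # Zero the projection so a residual branch ending in this
            # layer contributes nothing at initialization and the
            # surrounding block starts as the identity.
            nn.init.zeros_(self.weight)
            if bias:
                nn.init.zeros_(self.bias)
```

The intent is that the last projection of each attention/FFN residual branch uses this zero initialization, so the whole stack starts at the identity.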

Bonus: the random seed is now set on all GPUs simultaneously, preventing non-reproducible runs in multi-GPU setups.
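A minimal sketch of what seeding every device can look like in PyTorch (an illustrative helper, not the exact change in this PR):

```python
import random

import numpy as np
import torch


def set_seed(seed: int) -> None:
    """Seed Python, NumPy and PyTorch on every visible GPU."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # Seeds all CUDA devices, not just the current one, so every rank
    # in a multi-GPU run draws from the same random stream.
    torch.cuda.manual_seed_all(seed)
```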

New models train faster now (blue is new, red is old):
[Screenshot: training curves, new vs. old model]

@hypnopump marked this pull request as ready for review on December 12, 2023, 18:19
@hypnopump (Author) commented on Dec 16, 2023

The previous update_freq parameter performed gradient accumulation, but did not divide the loss by the number of accumulated steps. It now behaves like gradient_accumulation, so the same learning rate can be used for a given effective_batch_size regardless of the number of GPUs and per-GPU memory consumption.
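A minimal, self-contained sketch of the loss scaling under gradient accumulation (illustrative values, not the actual trainer code in this repo):

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
accumulation_steps = 4  # illustrative value
data = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(8)]

optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = nn.functional.mse_loss(model(x), y)
    # Divide by the number of accumulated micro-batches so the summed
    # gradients match those of one large batch and the same learning
    # rate remains valid for any accumulation setting.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```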

To be fair, this could be a separate PR. I could implement it like this if it's wanted, @guolinke.

See the expected behaviour from Hugging Face's accelerate:
[Screenshot: gradient accumulation example from the accelerate documentation]

Together with the multi-GPU seed, merging this PR will improve the robustness of the training parameters and the ability to dynamically adjust the effective batch size (num_gpus * batch_size * accumulation_steps) given compute availability.
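As a purely illustrative example: with 4 GPUs, a per-GPU batch size of 8 and 4 accumulation steps, the effective batch size is 4 * 8 * 4 = 128; halving the GPU count to 2 and doubling the accumulation steps to 8 keeps the same effective batch size.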
