-
Notifications
You must be signed in to change notification settings - Fork 70
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Figure out why AdamW + gradient accumulation leads to different results for test case #126
Comments
The issue might be with the forward pass. Created this test to verify the forward pass in one batch vs multiple batches, and it seems there are some numerical discrepancies there: https://github.com/sedrick-keh-tri/open_lm/commit/f0aa4e72bf038efc2dd793ca29b8a5bd7727f83f (Update: Seems like this forward test passes with FP32 until threshold=1e-5 but fails with 1e-6 onwards) |
hm, it's strange that the accumulation tests pass with SGD at tol=1e-7 then if the forward pass is failing checks at threshold=1e-6... |
I am very confused at this point. Seems like the backward is generally more precise than the forward... is this something that's even possible? I tried taking your exact branch and just editing the train_one_epoch to return the forward outputs. Forward output checking fails with precision 1e-7, but the backward tests pass fine with 1e-7. The trend of backward being more precise than forward seems to hold for some reason. For bf16, I tried doing single-layer and turning off the norm (this seems to affect the precision somehow). With 1 layer, forward threshold is 1e-2 but backward threshold is 1e-4. |
A useful test here could also just be a short training run with and without grad accum such that we'd expect the curves to be identical. If the model with grad accum is clearly worse then we know something is wrong. If they achieve very similar loss then that supports noise. |
Ran some 160M experiments here: Wandb report Loss curves for accum_freq=1,2,4 look basically identical. Validation perplexities are as follows: For reference, the evaluation perplexity of the other accum=4 experiment is 13.414325512942533, so there seems to be a fair bit of variation even with identical runs |
This is great, thanks @sedrick-keh-tri! I feel comfortable with our grad accum implementation for now, then. Let's leave this issue open in case anyone wants to dig into why the test fails, but continue using grad accum as implemented. |
In #125, we had to switch our gradient accumulation tests from SGD to AdamW to make gradient accumulation tests pass. It's unclear why this is the case; anecdotally, when training models with AdamW, training curves look similar with and without gradient accumulation. This could be a numerical issue, or some specific issue with AdamW that makes gradient accumulation behave differently.
The text was updated successfully, but these errors were encountered: