Difference with changing the gradient accumulation - ZeroEval and AlpacaEval 2 #61
Comments
Hi @sahsaeedi, thanks for using SimPO and reporting your results back to us! If I understand correctly,
Gradient Acc: 16 - Batch-Size-Per-Device: 2 -> Total Batch Size: 128
It seems that with a larger effective batch size, chat abilities are not trained as well. We've seen a similar phenomenon. In our experiments, to ensure a fair comparison, we maintain the same batch size across different methods. Ideally, we would perform a grid search over batch sizes for each method, but preliminary results indicate that the trend remains consistent as long as we use the same SFT model for the different algorithms.
You've observed that chat ability appears to be at odds with ZeroEval, which we've also openly discussed in our repo README. We've identified this issue specifically with Llama 3 instruct models, which are prone to catastrophic forgetting. With Gemma models, however, we find this problem is significantly reduced, and training with PO actually improves chat scores without compromising ZeroEval results. You can find more details in this section of the README. For further study of continued training on instruction-tuned models, I'd suggest using Gemma models.
In summary, I believe your findings highlight a combination of two factors:
Yes, and that's why we have done thorough evaluations with Arena-Hard and WildBench as well, which are two much more challenging benchmarks for evaluating chat abilities of models, and we find the trends of the two largely consistent with AlpacaEval 2. Please let me know if this clears up any confusion! |
Hi @xiamengzhou, thanks for answering my concerns. Your main improvement is on AlpacaEval 2. The difference between DPO and SimPO on MT-Bench is less than 0.1, and on Arena-Hard it is less than 0.5%. So the question remains: is SimPO the SOTA method, or is DPO? |
@sahsaeedi, thank you for your question. We've tested various settings to ensure a fair comparison between DPO and SimPO. Here's what we've observed:
Based on these observations, I am confident in stating that:
We should have conveyed this more clearly in our materials, such as the GitHub repository, preprint, and Twitter posts! Let me know if this answers your question, and I'm happy to have further discussions. |
Hi @sahsaeedi In addition to Mengzhou's answers above, I also wanted to mention the potential issues with the evaluation metrics of MT-Bench and Arena-Hard:
Best, |
Hi,
I fine-tuned LLaMA-3-8b-Instruct on "llama3-ultrafeedback-armorm" with different gradient accumulation settings (all other hyperparameters are the same as in llama-3-8b-instruct-simpo-v2.yaml). For fine-tuning, I used 4 A100 80GB GPUs.
The results on AlpacaEval 2 (I used your config for evaluation):
Gradient Acc: 16 - Batch-Size-Per-Device: 2 - LC: 50.34 - WR: 47.4
Gradient Acc: 128 - Batch-Size-Per-Device: 2 - LC: 34.78 - WR: 31.99
Gradient Acc: 16 - Batch-Size-Per-Device: 4 - LC: 39.16 - WR: 44.97
The results on MMLU-Redux and GSM8k (ZeroEval):
Gradient Acc: 16 - Batch-Size-Per-Device: 2 - MMLU: 43.38 - GSM8k: 58
Gradient Acc: 128 - Batch-Size-Per-Device: 2 - MMLU: 62.38 - GSM8k: 79.68
Gradient Acc: 16 - Batch-Size-Per-Device: 4 - MMLU: 62.1 - GSM8k: 78.85
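For reference, here is a minimal sketch of the effective (total) batch size of each run above. It assumes plain data-parallel training across all 4 GPUs, so that the total is the per-device batch size times the gradient accumulation steps times the number of GPUs. The function and variable names are hypothetical and are not taken from the SimPO training scripts.

```python
def effective_batch_size(per_device_batch_size: int,
                         grad_accum_steps: int,
                         num_devices: int) -> int:
    """Number of samples that contribute to one optimizer step.

    Assumes every device processes its own micro-batch (data parallelism)
    and gradients are accumulated for `grad_accum_steps` micro-batches.
    """
    return per_device_batch_size * grad_accum_steps * num_devices


NUM_GPUS = 4  # 4 x A100 80GB, as stated in the post

# (gradient accumulation steps, batch size per device) for the three runs
runs = [(16, 2), (128, 2), (16, 4)]

sizes = [effective_batch_size(bs, ga, NUM_GPUS) for ga, bs in runs]

for (ga, bs), total in zip(runs, sizes):
    print(f"Gradient Acc: {ga} - Batch-Size-Per-Device: {bs} "
          f"-> Total Batch Size: {total}")
```

Under this assumption, the run with the best AlpacaEval 2 scores has the smallest total batch size of the three, and the other two runs are 2x and 8x larger.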
The model's ability to reason its way to the correct answer improves when we use a larger Gradient Acc. However, its performance on AlpacaEval 2 decreases. How can we conclude that SimPO is better than other methods?
I think AlpacaEval 2 just evaluates the style of the answer, which is not a good way to compare the two models.