Hello, thanks for your work~
I have a few questions about the experiments.
Why do you train for three epochs? (Maybe this is unfair.)
You set n_query = 100 and report only 15 steps, so only 1,500 samples are used in total. Why don't you report the subsequent experimental results?
For my own experiment, I used the same model and dataset (llama2-7b, dolly-15k), but set n_query=500 and train_epoch=1.
I tested three rounds (rd=3, rd=15, rd=20), but the results are not as good as those reported in your paper.
rd=3:
rd=15:
rd=20:
Could you give me an explanation or some guidance?
Any reply will be appreciated.
Hi! Thank you very much for your questions! I hope the following explanations help clarify our methodology and findings.
Why did we train for 3 epochs in each round?
In each "round" of our experiment, we fine-tune the LLaMA model from scratch, using a dataset that increases by n_query data points with every iteration. The decision to train for 3 epochs per round adheres to the Alpaca-Style hyperparameters, as detailed here: https://github.com/tatsu-lab/stanford_alpaca, ensuring consistency across all iterations.
Why didn’t we run experiments for more rounds, i.e., on more datapoints?
The primary goal of our research was to explore efficient instruction tuning with reduced training data. We found that beyond a certain point, adding more data to the training set did not significantly improve performance. In our initial trials we did run more steps, specifically 20-30, but observed only marginal performance gains, if any.
Figure 2 in our paper also shows diminishing returns in performance gain with more steps. This observation supports our finding that a significantly smaller subset of data can be just as effective for instruction tuning as using more data, in line with the conclusions of other studies such as 'Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning'.
How does the choice of n_query impact the results?
We greatly appreciate that you ran your own experiments. The selection of n_query is indeed a crucial factor. In our design, given a fixed overall subset budget, a smaller n_query paired with a larger number of iterations (n_round) allows the model to improve its performance gradually. Each iteration poses a manageable challenge (100 new samples) to the model, and over multiple iterations these smaller additions accumulate into substantial improvements. In contrast, a larger n_query per iteration, such as the 500 used in your experiment, can overwhelm the model's ability to select data points optimally at each step.
Our findings suggest that a lower n_query, like 100, in conjunction with more iterations, is more effective than a higher n_query with fewer iterations. For an extreme case of selecting the entire subset budget in one iteration versus our approach of 100 per round, please refer to Section 4.3 (Dynamic Iteration) in our paper.
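For a rough sense of the budget arithmetic (illustrative numbers only, assuming a 1,500-sample budget), the two settings reach the same total through very different numbers of selection steps:

```python
budget = 1500  # total labeled samples; matches 15 rounds x 100 in the paper

for name, n_query in [("paper setting", 100), ("issue setting", 500)]:
    n_rounds = budget // n_query
    print(f"{name}: {n_rounds} rounds x {n_query} samples = {n_rounds * n_query} total")

# paper setting: 15 rounds x 100 samples = 1500 total
# issue setting: 3 rounds x 500 samples = 1500 total
```

With the same budget, the n_query=100 schedule gets 15 opportunities to re-score the pool, while the n_query=500 schedule gets only 3.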