Impact of acc_grad Setting on Model Performance #6

Open
IceWYB opened this issue Dec 11, 2024 · 5 comments

Comments

@IceWYB

IceWYB commented Dec 11, 2024

Hi, dear authors, thanks for the excellent work!
I am attempting to reproduce the experimental results reported in your paper and noticed some issues. The grad_accumulation_steps argument is specified as 1 in the paper, but is set to 20 in the run_train.sh script. This discrepancy seems to affect the total amount of data used during training. I tried the setting from the paper (grad_accumulation_steps=1, global_batch_size=128) but failed to reproduce the performance metrics. So I wonder whether the grad_accumulation_steps setting is a critical factor causing inconsistencies in the results.
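For reference, this is how I understand the effective global batch size (a minimal sketch; the variable names are mine, not those used in run_train.sh):

```python
# Minimal sketch of the effective global batch size (names are mine, not the script's).
def global_batch_size(num_gpus, per_gpu_batch_size, grad_accumulation_steps):
    # Total number of samples contributing to one optimizer update.
    return num_gpus * per_gpu_batch_size * grad_accumulation_steps

# Paper setting: grad_accumulation_steps = 1, global batch size = 128.
# With grad_accumulation_steps = 20 (run_train.sh), each optimizer update
# aggregates 20x more samples for the same per-GPU batch size and GPU count.
```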
Thank you very much for your attention. I look forward to your reply!

@JosephPai
Collaborator

Hi @IceWYB, thanks for your interest.
As stated in the instructions here, https://github.com/showlab/VideoLISA?tab=readme-ov-file#training, we use 8 nodes (64 A10 24G GPUs), each GPU has a batch size of 2, and grad_accumulation_steps=1, so the global batch size is 64x2=128.
The final performance can be affected by the batch size, grad_accumulation_steps, learning rate, and even the dataset sampling ratios.
May I know how many GPUs you used during training?
Regarding "fail to reproduce the performance metrics", how large is the discrepancy?
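In case it helps, here is a rough sketch (not our exact launch setup; the variable names are illustrative) of how to keep the global batch size at 128 when training on fewer GPUs:

```python
# Rough sketch: keep the global batch size at 128 when scaling down the GPU count.
TARGET_GLOBAL_BATCH_SIZE = 128      # 64 GPUs x batch size 2 x grad_accumulation_steps 1

num_gpus = 8                        # e.g. a single node
per_gpu_batch_size = 2
grad_accumulation_steps = TARGET_GLOBAL_BATCH_SIZE // (num_gpus * per_gpu_batch_size)
print(grad_accumulation_steps)      # 8 -> 8 GPUs x 2 x 8 = 128
```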

@IceWYB
Author

IceWYB commented Dec 12, 2024

Thanks for your quick reply!
I first tried 8 GPUs with batch_size=8 and grad_accumulation_steps=1 to align with the configuration mentioned in the paper. However, after 10 epochs, the results were significantly different, with the mevis_JF metric only reaching 30. I then experimented with many different batch_size and grad_acc_steps settings but still could not replicate the performance. I noticed that, unlike typical setups, the dataset uses a random sampling strategy. I am curious whether this randomness will introduce significant variability in the results. Did you also encounter similar issues during your training process?
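For reference, a generic way to pin the seeds and check whether the variance comes from this random sampling (a PyTorch-style sketch, not the repo's own code):

```python
# Generic seeding before building the dataloaders (not the repo's own code).
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # seeds the CPU RNG
    torch.cuda.manual_seed_all(seed)  # seeds all GPU RNGs
```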
Thank you for your assistance!

@JosephPai
Collaborator

Hi @IceWYB, you are encouraged to monitor the performance with the image-version ReasonSeg during training, which yields more stable results.
I will also try to reproduce the result with smaller scale training this weekend.

@IceWYB
Author

IceWYB commented Dec 14, 2024

Yeah, I agree with you that the reproduced results on mevis are not stable. At first I tried a global batch size of 128 (8 GPUs, batch_size=8, grad_acc_steps=2) and conducted repeated experiments. The first run resulted in a mevis_JF metric of 29.9, while the second run achieved 50.0. I also tried to fix the images/videos used in the dataset but still got variable results, partly because the annotations are still randomly sampled.
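For reference, pinning the annotation choice as well could look something like this (a hypothetical sketch; the function and argument names are mine, not the dataset's):

```python
# Hypothetical sketch: key the RNG on the sample index so that each video
# always draws the same annotation across runs. Names are illustrative only.
import random


def pick_annotation(annotations, sample_idx, base_seed=42):
    rng = random.Random(base_seed + sample_idx)  # per-sample deterministic RNG
    return rng.choice(annotations)
```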
Looking forward to your results with smaller-scale training soon!

@Lexarymade

I'm wondering whether the mevis here refers to the mevis_valid_u.
