
About training replication #8

Open
qirui-chen opened this issue Dec 19, 2024 · 17 comments

Comments

@qirui-chen

Hello authors, thanks for your great work. I encountered an issue while setting up the environment. After installing torch==2.1.0+cu121, I am unable to import torch. It seems that a similar issue was mentioned in link. Could you please double-check the correct version of torch and the installation method? Thank you.

Python 3.10.16 (main, Dec 11 2024, 16:24:50) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/videolisa/lib/python3.10/site-packages/torch/__init__.py", line 1382, in <module>
    from .functional import *  # noqa: F403
  File "/opt/conda/envs/videolisa/lib/python3.10/site-packages/torch/functional.py", line 7, in <module>
    import torch.nn.functional as F
  File "/opt/conda/envs/videolisa/lib/python3.10/site-packages/torch/nn/__init__.py", line 1, in <module>
    from .modules import *  # noqa: F403
  File "/opt/conda/envs/videolisa/lib/python3.10/site-packages/torch/nn/modules/__init__.py", line 2, in <module>
    from .linear import Identity, Linear, Bilinear, LazyLinear
  File "/opt/conda/envs/videolisa/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 7, in <module>
    from .. import functional as F
  File "/opt/conda/envs/videolisa/lib/python3.10/site-packages/torch/nn/functional.py", line 20, in <module>
    from .._jit_internal import boolean_dispatch, _overload, BroadcastingList1, BroadcastingList2, BroadcastingList3
  File "/opt/conda/envs/videolisa/lib/python3.10/site-packages/torch/_jit_internal.py", line 41, in <module>
    import torch.distributed.rpc
  File "/opt/conda/envs/videolisa/lib/python3.10/site-packages/torch/distributed/rpc/__init__.py", line 74, in <module>
    from .server_process_global_profiler import (
  File "/opt/conda/envs/videolisa/lib/python3.10/site-packages/torch/distributed/rpc/server_process_global_profiler.py", line 6, in <module>
    from torch.autograd.profiler_legacy import profile
  File "/opt/conda/envs/videolisa/lib/python3.10/site-packages/torch/autograd/__init__.py", line 442, in <module>
    if not torch._C._autograd_init():
AttributeError: module 'torch.autograd.graph' has no attribute 'GradientEdge'
@qirui-chen changed the title from "About environment" to "About the environment" on Dec 19, 2024
@qirui-chen
Author

qirui-chen commented Dec 19, 2024

In addition, installing torch==2.2.0+cu121 and torchvision==0.17.0 resolves the above issue.

However, the transformers version installed from the commit hash below is 4.41.0.dev0.

pip install git+https://github.com/huggingface/transformers@a98c41798cf6ed99e1ff17e3792d6e06a2ff2ff3

This version works for inference, but during training, the following error occurs.

ImportError: cannot import name 'EncoderDecoderCache' from 'transformers'

I found that the main cause of this issue is that the peft package expects to import this class from transformers, so could the authors provide the specific version of the peft package? This information does not seem to be pinned in the pyproject.toml file.

@qirui-chen changed the title from "About the environment" to "About training replication" on Dec 19, 2024
@qirui-chen
Author

qirui-chen commented Dec 19, 2024

Installing peft==0.12.0 addressed the above issues.
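
For anyone hitting the same chain of errors, this is the full pinned combination that ended up working on my side (the cu121 index URL is my assumption; adjust it to your CUDA setup):

pip install torch==2.2.0 torchvision==0.17.0 --index-url https://download.pytorch.org/whl/cu121
pip install git+https://github.com/huggingface/transformers@a98c41798cf6ed99e1ff17e3792d6e06a2ff2ff3
pip install peft==0.12.0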

However, I found that 1) increasing the batch size greatly increases training time, and 2) increasing the batch size does not add much GPU memory cost. I'd like to know whether the authors observed similar behavior and therefore decided to use multiple 24G GPUs across different nodes. Thanks!

@qirui-chen
Author

Additionally, I want to know whether the released checkpoint on HuggingFace corresponds to VideoLISA-3.8B (One-Token-Seg-All) in the paper. The replicated performance on ReasonVOS is as follows.

Q-Size: 458

......(omitted)

VideoLISA/evaluation/reason_vos/metrics.py:32: RuntimeWarning: invalid value encountered in divide
  j = inters / union
......(omitted)

J: 0.3763731515787202
F: 0.4261801187118281
J&F: 0.4012766351452741

This is strange, since there is little room for error in downloading the provided dataset/checkpoint and running the two-step evaluation. Would the authors be able to point out possible reasons? Thanks.
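
For what it's worth, the RuntimeWarning above is numpy hitting a 0/0 division, i.e. entries where both the prediction and the ground-truth mask are empty, which yields NaN for that entry. A minimal sketch of a guarded region-similarity (J) computation, not the repository's own metrics.py:

import numpy as np

def region_similarity(pred, gt):
    # pred, gt: boolean masks of the same shape
    inters = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    # 0/0 is what triggers "invalid value encountered in divide": both masks
    # are empty. Treating that case as a perfect match (1.0) is an assumption;
    # the benchmark may define it differently.
    return 1.0 if union == 0 else inters / union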

@qirui-chen
Author

qirui-chen commented Dec 21, 2024

Please forgive me for having one more question. I noticed that the training script train_joint.py supports vqa and vid_qa as options of the --dataset parameter, but the final script run_train.sh does not seem to include them in the training data. This might also have contributed to the model's subpar performance on VQA (referencing the NeurIPS rebuttal).

I would like to know whether the training of the released checkpoint used the VisualQA and VideoQA datasets. Additionally, which VideoQA dataset does Tab. 10 of the paper refer to? And why was the VideoQA dataset removed in the final version?

@JosephPai
Collaborator

JosephPai commented Dec 25, 2024

Hi @qirui-chen , thanks for your interest in VideoLISA.
Sorry for the late response, and glad to see that you have successfully resolved the environment issue. We will update the peft version.
Regarding the other questions:

  1. In our experiments, we actually encountered CUDA OOM when increasing the batch size. About the time cost, yes, you are correct: within the CUDA memory constraint, adding more data can greatly increase the time cost.
  2. The current checkpoint is trained for 3k iterations, which corresponds to Tab. 5 and Tab. 6 in the ablation study. We will soon upload the final checkpoint, which is trained for 6k iterations.
  3. The released checkpoint does not involve VideoQA data. In concept, involving VideoQA can boost the reasoning capability of the model. However, as shown in Tab. 10, involving VideoQA may cause performance fluctuations on different datasets. Therefore, we did not use this dataset in the final version.

Feel free to let me know if you have further questions.

@qirui-chen
Author

qirui-chen commented Dec 25, 2024

Thank you for your reply, which has been very helpful to me!!!

In fact, I still have a few more questions and would appreciate your assistance when you have time. They are prioritized as follows:

  1. Could you provide the evaluation code for the ReasonSeg test split, even though the validation set is already included in the training process?
  2. Would it be possible for you to provide the evaluation code for the refCOCO series, as shown in Table 7?
  3. (optional) Could you provide the code for post-processing with XMem+, or just some related guidance in text?

Thank you very much for your help and response!

@JosephPai
Collaborator

Hi @qirui-chen ,

We have updated the evaluation suite for image benchmarks, including ReasonSeg and refCOCO series: https://github.com/showlab/VideoLISA?tab=readme-ov-file#image-benchmarks

Regarding post-optimization, it is non-trivial to integrate XMem2 into another codebase. The best practice is to export the inference results into the XMem2 codebase and run the post-optimization there. Here is the guideline: https://github.com/showlab/VideoLISA?tab=readme-ov-file#post-optimization

Regarding the problem of reproducing the ReasonVOS numbers: we have carefully investigated the issue. The current checkpoint on HuggingFace is already the final version. The performance mismatch originates from a discrepancy between the cleaned code and the old data structure. We have updated the data and the evaluation code. You should be able to reproduce the numbers reported in the paper, except for small numerical differences due to package version differences (torch, transformers, etc.).

Best,
Zechen

@qirui-chen
Author

qirui-chen commented Dec 26, 2024

Thank you very much for your response and for providing the code so quickly!!!

Maybe one last question: does the current training script run_train.sh correspond to the final checkpoint? Specifically, I am wondering about the following two points:

  1. How many iterations should be set, 3k or 6k? (epochs, steps per epoch...)
  2. Should Image VQA (llava_instruct_150k) be included as part of the training data? If so, what should the sample rate be adjusted to?

@JosephPai
Collaborator

Hi @qirui-chen , I just updated the training script that was used to produce the final results in the paper (Tab. 1, 2, and 3).
It should answer most of your questions regarding iterations, data recipe, etc.

Best,
Zechen

@qirui-chen
Author

Thank you for your quick reply, but the updated parameters

--dataset="sem_seg,refer_seg,reason_seg,vos,refer_seg_video,davis"

seem to no longer correspond to the part after line #L225 in dataset.py, because the --dataset argument (a list of tasks) does not directly include davis (a dataset). For example, davis appears to be handled under the ref_vos task at line #330.

@JosephPai
Collaborator

That's a nice catch. We used to treat Davis as an independent dataset, because it was added at a later stage of the project. During code cleaning before open-sourcing, we re-organized it under the ref-vos dataset.

@qirui-chen
Author

Thank you for the reply. Your responses are very helpful to me.

@qirui-chen
Author

qirui-chen commented Dec 29, 2024

Sorry to bother the author again, but I would like to ask why the phrase "Sure, [SEG]" is added to the input prompt during inference. Shouldn't this be part of the model's output? For example, here.

What is stranger is that adding or omitting this phrase doesn't seem to affect the model's output: the model still ends up outputting "[SEG]. <|end|>". I want to know why this happens. Thank you.

@JosephPai
Collaborator

This is a teacher-forcing technique adapted from the work of LISA.
In short, if you manually append "[SEG]. <|end|>", you guarantee that the sequence always contains a [SEG] token, which you can then decode into a mask.
If you do not apply this teacher forcing, the model will generate the response by itself.

  • A well-trained model can successfully output [SEG].
  • A model that is not well trained may output other sentences that do not include the [SEG] token, which will make the evaluation fail.

To ensure smooth evaluation, we adopt the same teacher-forcing technique as LISA.
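
A minimal sketch of this idea with a plain Hugging Face causal LM (not the actual VideoLISA inference code; the base model name and chat template below are assumptions, used only to illustrate the forced prefix):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"  # placeholder base model, assumption
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "<|user|>\nPlease segment the object that is moving.<|end|>\n<|assistant|>\n"
# Teacher forcing at inference time: append the start of the desired answer to
# the prompt, so the final sequence is guaranteed to contain [SEG]. In VideoLISA
# [SEG] is a special token whose hidden state is decoded into a mask; here it is
# plain text for illustration only.
forced = prompt + "Sure, [SEG]"

inputs = tokenizer(forced, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:]))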

@qirui-chen
Author

qirui-chen commented Jan 1, 2025

Thank you for your response; it resolved my issue!!!

I would like to ask if you have encountered situations during inference where the model does not output <|end|> and keeps repeating earlier words until max_new_tokens is reached. I observed this issue while training on Video QA datasets (llava-style format), even when overfitting a few samples. Additionally, I find that the EOS token <|endoftext|> does not seem to be used.

I wonder how to use the generate() function correctly, and whether you could provide some insights on this. I sincerely appreciate your help!

@JosephPai
Collaborator

Both <|end|> and <|endoftext|> should be okay, as long as the model is trained with proper templates. VideoLISA is finetuned from this model, so the chat template and EOS (or EOT) token are also adapted from there.
The reasons for the model not outputting <|end|> can be multifactorial; you may need to check the conv_template, the pre-processing function, the loss curve, etc.
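
On the generation side, one quick thing to verify is that generate() is actually told which token ends a turn; a hedged sketch, again assuming a Phi-3-style tokenizer where <|end|> marks end-of-turn:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"  # placeholder base model, assumption
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("<|user|>\nDescribe the video.<|end|>\n<|assistant|>\n", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=256,
    # Pass both candidate stop tokens; if generate() only knows the default eos,
    # decoding can run to max_new_tokens and start repeating itself.
    eos_token_id=[tokenizer.convert_tokens_to_ids("<|end|>"), tokenizer.eos_token_id],
)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:]))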

@qirui-chen
Author

Thank you for your response.
