
About training replication #8

Open
qirui-chen opened this issue Dec 19, 2024 · 17 comments

Comments

@qirui-chen

Hello authors, thanks for your great work. I encountered an issue while setting up the environment. After installing torch==2.1.0+cu121, I am unable to import torch. It seems that a similar issue was mentioned in link. Could you please double-check the correct version of torch and the installation method? Thank you.

Python 3.10.16 (main, Dec 11 2024, 16:24:50) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/videolisa/lib/python3.10/site-packages/torch/__init__.py", line 1382, in <module>
    from .functional import *  # noqa: F403
  File "/opt/conda/envs/videolisa/lib/python3.10/site-packages/torch/functional.py", line 7, in <module>
    import torch.nn.functional as F
  File "/opt/conda/envs/videolisa/lib/python3.10/site-packages/torch/nn/__init__.py", line 1, in <module>
    from .modules import *  # noqa: F403
  File "/opt/conda/envs/videolisa/lib/python3.10/site-packages/torch/nn/modules/__init__.py", line 2, in <module>
    from .linear import Identity, Linear, Bilinear, LazyLinear
  File "/opt/conda/envs/videolisa/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 7, in <module>
    from .. import functional as F
  File "/opt/conda/envs/videolisa/lib/python3.10/site-packages/torch/nn/functional.py", line 20, in <module>
    from .._jit_internal import boolean_dispatch, _overload, BroadcastingList1, BroadcastingList2, BroadcastingList3
  File "/opt/conda/envs/videolisa/lib/python3.10/site-packages/torch/_jit_internal.py", line 41, in <module>
    import torch.distributed.rpc
  File "/opt/conda/envs/videolisa/lib/python3.10/site-packages/torch/distributed/rpc/__init__.py", line 74, in <module>
    from .server_process_global_profiler import (
  File "/opt/conda/envs/videolisa/lib/python3.10/site-packages/torch/distributed/rpc/server_process_global_profiler.py", line 6, in <module>
    from torch.autograd.profiler_legacy import profile
  File "/opt/conda/envs/videolisa/lib/python3.10/site-packages/torch/autograd/__init__.py", line 442, in <module>
    if not torch._C._autograd_init():
AttributeError: module 'torch.autograd.graph' has no attribute 'GradientEdge'
@qirui-chen changed the title from "About environment" to "About the environment" on Dec 19, 2024
@qirui-chen
Author

qirui-chen commented Dec 19, 2024

In addition, installing torch==2.2.0+cu121 and torchvision==0.17.0 resolves the above issue.

However, the transformers version installed from the commit hash below is 4.41.0.dev0.

pip install git+https://github.com/huggingface/transformers@a98c41798cf6ed99e1ff17e3792d6e06a2ff2ff3

This version works for inference, but during training, the following error occurs.

ImportError: cannot import name 'EncoderDecoderCache' from 'transformers'

I found that the main cause of this issue is that the peft package expects to import this class from transformers, so could the authors provide the specific version of the peft package? This information does not seem to be pinned in the pyproject.toml file.

@qirui-chen changed the title from "About the environment" to "About training replication" on Dec 19, 2024
@qirui-chen
Author

qirui-chen commented Dec 19, 2024

Installing peft==0.12.0 addressed the above issues.
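
For anyone hitting the same chain of errors, this is the full pinned combination that ended up working on my side (the cu121 index URL is my assumption; adjust it to your CUDA setup):

pip install torch==2.2.0 torchvision==0.17.0 --index-url https://download.pytorch.org/whl/cu121
pip install git+https://github.com/huggingface/transformers@a98c41798cf6ed99e1ff17e3792d6e06a2ff2ff3
pip install peft==0.12.0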

However, I found that 1) increasing the batch size greatly increases training time, and 2) increasing the batch size does not add much GPU memory cost. I'd like to know whether the authors observed similar behavior and therefore decided to use multiple 24G GPUs across different nodes. Thanks!

@qirui-chen
Author

Additionally, I want to know whether the released checkpoint on HuggingFace corresponds to VideoLISA-3.8B (One-Token-Seg-All) in the paper. The replicated performance on ReasonVOS is as follows.

Q-Size: 458

......(omitted)

VideoLISA/evaluation/reason_vos/metrics.py:32: RuntimeWarning: invalid value encountered in divide
  j = inters / union
......(omitted)

J: 0.3763731515787202
F: 0.4261801187118281
J&F: 0.4012766351452741

This is strange, since there is little room for error in downloading the provided dataset/checkpoint and running the two-step evaluation. Would the authors be able to point out possible reasons? Thanks.
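
For what it's worth, the RuntimeWarning above is numpy hitting a 0/0 division, i.e. entries where both the prediction and the ground-truth mask are empty, which yields NaN for that entry. A minimal sketch of a guarded region-similarity (J) computation, not the repository's own metrics.py:

import numpy as np

def region_similarity(pred, gt):
    # pred, gt: boolean masks of the same shape
    inters = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    # 0/0 is what triggers "invalid value encountered in divide": both masks
    # are empty. Treating that case as a perfect match (1.0) is an assumption;
    # the benchmark may define it differently.
    return 1.0 if union == 0 else inters / union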

@qirui-chen
Author

qirui-chen commented Dec 21, 2024

Please forgive me for having one more question. I noticed that the training script train_joint.py supports vqa and vid_qa as options of the --dataset parameter, but the final script run_train.sh does not seem to include them in the training data. This might also have contributed to the model's subpar performance on VQA (referencing the NeurIPS rebuttal).

I would like to know whether the training of the released checkpoint used the VisualQA and VideoQA datasets. Additionally, which VideoQA dataset does Tab. 10 of the paper refer to? And why was the VideoQA dataset removed in the final version?

@JosephPai
Collaborator

JosephPai commented Dec 25, 2024

Hi @qirui-chen , thanks for your interest in VideoLISA.
Sorry for the late response, and glad to see that you have successfully resolved the environment issue. We will update the peft version.
Regarding the other questions:

  1. In our experiments, we actually encountered CUDA OOM when increasing the batch size. About the time cost, yes, you are correct: within the CUDA memory constraint, adding more data can greatly increase the time cost.
  2. The current checkpoint is trained for 3k iterations, which corresponds to Tab. 5 and Tab. 6 in the ablation study. We will soon upload the final checkpoint, which is trained for 6k iterations.
  3. The released checkpoint does not involve VideoQA data. In concept, involving VideoQA can boost the reasoning capability of the model. However, as shown in Tab. 10, involving VideoQA may cause performance fluctuations on different datasets. Therefore, we did not use this dataset in the final version.

Feel free to let me know if you have further questions.

@qirui-chen
Author

qirui-chen commented Dec 25, 2024

Thank you for your reply, which has been very helpful to me!!!

In fact, I still have a few more questions and would appreciate your assistance when you have time. They are prioritized as follows:

  1. Could you provide the evaluation code for the ReasonSeg test split, even though the validation set is already included in the training process?
  2. Would it be possible for you to provide the evaluation code for the refCOCO series, as shown in Table 7?
  3. (optional) Could you provide the code for post-processing with XMem+, or just some related guidance in text?

Thank you very much for your help and response!

@JosephPai
Collaborator

Hi @qirui-chen ,

We have updated the evaluation suite for image benchmarks, including ReasonSeg and refCOCO series: https://github.com/showlab/VideoLISA?tab=readme-ov-file#image-benchmarks

Regarding post-optimization, it is non-trivial to integrate XMem2 into another codebase. The best practice is to export the inference results into the XMem2 codebase and run the post-optimization there. Here is the guideline: https://github.com/showlab/VideoLISA?tab=readme-ov-file#post-optimization

Regarding the problem of reproducing the ReasonVOS numbers: we have carefully investigated the issue. The current checkpoint on HuggingFace is already the final version. The performance mismatch originates from a discrepancy between the cleaned code and the old data structure. We have updated the data and the evaluation code. You should be able to reproduce the numbers reported in the paper, except for small numerical differences due to package version differences (torch, transformers, etc.).

Best,
Zechen

@qirui-chen
Author

qirui-chen commented Dec 26, 2024

Thank you very much for your response and for providing the code so quickly!!!

Maybe one last question: does the current training script run_train.sh correspond to the final checkpoint? Specifically, I am wondering about the following two points:

  1. How many iterations should be set, 3k or 6k? (epochs, steps per epoch...)
  2. Should Image VQA (llava_instruct_150k) be included as part of the training data? If so, what should the sample rate be adjusted to?

@JosephPai
Collaborator

Hi @qirui-chen , I just updated the training script that was used to produce the final results in the paper (Tab. 1, 2, and 3).
It should answer most of your questions regarding iterations, data recipe, etc.

Best,
Zechen

@qirui-chen
Author

Thank you for your quick reply, but the updated parameters

--dataset="sem_seg,refer_seg,reason_seg,vos,refer_seg_video,davis"

seem to no longer correspond to the part after line #L225 in dataset.py, because the --dataset argument (a list of tasks) does not directly include davis (a dataset). For example, davis appears to be handled under the ref_vos task at line #330.

@JosephPai
Collaborator

That's a nice catch. We used to treat Davis as an independent dataset, because it was added at a later stage of the project. During code cleaning before open-sourcing, we re-organized it under the ref-vos dataset.

@qirui-chen
Author

Thank you for the reply. Your responses are very helpful to me.

@qirui-chen
Author

qirui-chen commented Dec 29, 2024

Sorry to bother the author again, but I would like to ask why the phrase "Sure, [SEG]" is added to the input prompt during inference. Shouldn't this be part of the model's output? For example, here.

What is stranger is that adding or omitting this phrase doesn't seem to affect the model's output: the model still ends up outputting "[SEG]. <|end|>". I want to know why this happens. Thank you.

@JosephPai
Collaborator

This is a teacher-forcing technique adapted from the work of LISA.
In short, if you manually append "[SEG]. <|end|>", you guarantee that the sequence always contains a [SEG] token, which you can then decode into a mask.
If you do not apply this teacher forcing, the model will generate the response by itself.

  • A well-trained model can successfully output [SEG].
  • A model that is not well trained may output other sentences that do not include the [SEG] token, which will make the evaluation fail.

To ensure smooth evaluation, we adopt the same teacher-forcing technique as LISA.
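
A minimal sketch of this idea with a plain Hugging Face causal LM (not the actual VideoLISA inference code; the base model name and chat template below are assumptions, used only to illustrate the forced prefix):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"  # placeholder base model, assumption
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "<|user|>\nPlease segment the object that is moving.<|end|>\n<|assistant|>\n"
# Teacher forcing at inference time: append the start of the desired answer to
# the prompt, so the final sequence is guaranteed to contain [SEG]. In VideoLISA
# [SEG] is a special token whose hidden state is decoded into a mask; here it is
# plain text for illustration only.
forced = prompt + "Sure, [SEG]"

inputs = tokenizer(forced, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:]))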

@qirui-chen
Author

qirui-chen commented Jan 1, 2025

Thank you for your response; it resolved my issue!!!

I would like to ask if you have encountered situations during inference where the model does not output <|end|> and keeps repeating earlier words until max_new_tokens is reached. I observed this issue while training on Video QA datasets (llava-style format), even when overfitting a few samples. Additionally, I find that the EOS token <|endoftext|> does not seem to be used.

I wonder how to use the generate() function correctly, and whether you could provide some insights on this. I sincerely appreciate your help!

@JosephPai
Collaborator

Both <|end|> and <|endoftext|> should be okay, as long as the model is trained with proper templates. VideoLISA is finetuned from this model, so the chat template and EOS (or EOT) token are also adapted from there.
The reasons for the model not outputting <|end|> can be multifactorial; you may need to check the conv_template, the pre-processing function, the loss curve, etc.
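
On the generation side, one quick thing to verify is that generate() is actually told which token ends a turn; a hedged sketch, again assuming a Phi-3-style tokenizer where <|end|> marks end-of-turn:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"  # placeholder base model, assumption
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("<|user|>\nDescribe the video.<|end|>\n<|assistant|>\n", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=256,
    # Pass both candidate stop tokens; if generate() only knows the default eos,
    # decoding can run to max_new_tokens and start repeating itself.
    eos_token_id=[tokenizer.convert_tokens_to_ids("<|end|>"), tokenizer.eos_token_id],
)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:]))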

@qirui-chen
Author

Thank you for your response.
