One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos

Zechen Bai 1  Tong He 2  Haiyang Mei 1  Pichao Wang 2 

Ziteng Gao 1  Joya Chen 1  Lei Liu 2  Zheng Zhang 2  Mike Zheng Shou 1 

NeurIPS 2024

1 Show Lab, National University of Singapore   2 Amazon 

model arXiv

News

  • [2024-12-27] We updated the ReasonVOS benchmark after fixing some issues.
  • [2024-12-26] We added an example of post-optimization.
  • [2024-12-26] We now support evaluation on image benchmarks, including refCOCO, etc.
  • [2024-12-08] We updated the inference example and evaluation instructions on all datasets.
  • [2024-11-27] We released the ReasonVOS benchmark!
  • [2024-11-26] We released the pre-trained VideoLISA-3.8B on Hugging Face!
  • [2024-11-20] We released the training and inference code.
  • [2024-09-29] We released our paper on arXiv.

TODO

  • Release the inference code.
  • Release the training code.
  • Instructions on supporting more datasets.

Setup Environment

git clone https://github.com/showlab/VideoLISA.git

conda create -n videolisa python=3.10 -y
conda activate videolisa
pip install --upgrade pip  # enable PEP 660 support


# for cuda 11.8
pip install torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu118
# for cuda 12.1
pip install torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu121

pip install -e .

pip install git+https://github.com/huggingface/transformers@a98c41798cf6ed99e1ff17e3792d6e06a2ff2ff3
pip install flash-attn --no-build-isolation

Inference Example

CUDA_VISIBLE_DEVICES=0 python chat.py \
  --version="ZechenBai/VideoLISA-3.8B" \
  --vision_tower="openai/clip-vit-large-patch14-336" \
  --num_frames_dense=4 \
  --num_frames_sparse=32 \
  --save_overlay

> Please input your prompt: In this video, there is something that shocks the cat and makes it jump. Can you find the object?
> Please input the video path: examples/RBrZsgy4-SQ.mp4

Prepare Data for Training

First, please prepare the image data following the instructions in LISA.

Below, we introduce the video datasets used in this project. Note that the data paths for the video datasets are currently hard-coded in each dataset file in the utils folder. You may need to adjust them for your environment.

ReasonVOS

Please refer to BENCHMARK.md

MeViS

Download the dataset from the official release. Then, extract and organize the files. We expect the following directory structure:

mevis
├── train                       // Split Train
│   ├── JPEGImages
│   │   ├── <video #1  >
│   │   ├── <video #2  >
│   │   └── <video #...>
│   │
│   ├── mask_dict.json
│   └── meta_expressions.json
│
├── valid_u                     // Split Val^u
│   ├── JPEGImages
│   │   └── <video ...>
│   │
│   ├── mask_dict.json
│   └── meta_expressions.json
│
└── valid                       // Split Val
    ├── JPEGImages
    │   └── <video ...>
    │
    └── meta_expressions.json
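
Since the dataset paths are hard-coded, it can help to confirm the layout before launching a job. The following is a minimal sketch, not part of the repository: the function name `missing_mevis_entries` and the default root `mevis` are assumptions made for illustration, and the expected entries are copied from the tree above.

```python
from pathlib import Path

# Expected entries per split, taken from the directory tree above.
EXPECTED = {
    "train": ["JPEGImages", "mask_dict.json", "meta_expressions.json"],
    "valid_u": ["JPEGImages", "mask_dict.json", "meta_expressions.json"],
    "valid": ["JPEGImages", "meta_expressions.json"],
}


def missing_mevis_entries(root):
    """Return the relative paths expected under `root` that do not exist."""
    root = Path(root)
    missing = []
    for split, entries in EXPECTED.items():
        for entry in entries:
            if not (root / split / entry).exists():
                missing.append(f"{split}/{entry}")
    return missing


if __name__ == "__main__":
    import sys

    # "mevis" is a placeholder default; pass your own dataset root instead.
    root = sys.argv[1] if len(sys.argv) > 1 else "mevis"
    problems = missing_mevis_entries(root)
    print("Layout OK" if not problems else "Missing: " + ", ".join(problems))
```

The check only looks at the top-level entries of each split; it does not validate the videos inside JPEGImages or the contents of the JSON files.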

Ref-YouTube-VOS and Ref-DAVIS-17

Prepare Ref-YouTube-VOS and Ref-DAVIS-17 datasets following the instructions of ReferFormer.

YouTube-VOS

Download the dataset from the website and organize it as follows:

YTVOS
└── train
    ├── JPEGImages
    ├── Annotations
    └── meta.json

Training

We provide a sample training script in run_train.sh. In our own experiments, we used 8 nodes (64 A10 24 GB GPUs in total) to train the model. Under this setting, we set batch_size=2 and grad_accumulation_steps=1, so the global effective batch size is batch_size*grad_accumulation_steps*num_gpus = 2*1*64 = 128. You can modify these settings to suit your hardware. Note that we did not explore other training hyper-parameters. If you don't have enough GPUs, don't give up: you can still try training the model with a smaller batch size. One tip: if you use a smaller batch size, reducing the learning rate as well might help.
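
The batch-size arithmetic above can be sketched as follows. The linear learning-rate scaling shown here is a common heuristic and an assumption on our side, not a setting validated in this project; `reference_lr` is a placeholder value, so substitute the learning rate from your own copy of run_train.sh.

```python
def global_batch_size(batch_size, grad_accumulation_steps, num_gpus):
    """Effective number of samples per optimizer step across all GPUs."""
    return batch_size * grad_accumulation_steps * num_gpus


def linearly_scaled_lr(reference_lr, reference_global_batch, new_global_batch):
    """Scale the learning rate in proportion to the global batch size."""
    return reference_lr * new_global_batch / reference_global_batch


# Setting described above: 64 GPUs, batch_size=2, no gradient accumulation.
reference = global_batch_size(2, 1, 64)  # 128

# Hypothetical smaller setup: a single node with 8 GPUs.
small = global_batch_size(2, 1, 8)  # 16

# Option 1: keep the global batch size by accumulating gradients instead.
restored = global_batch_size(2, 8, 8)  # 128

# Option 2: accept the smaller batch and reduce the learning rate.
reference_lr = 1e-4  # placeholder, NOT the value used in the paper
scaled = linearly_scaled_lr(reference_lr, reference, small)
print(reference, small, restored, scaled)
```

Gradient accumulation keeps the optimization setting closest to the one used in our experiments, at the cost of longer wall-clock time per step.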

After training finishes, convert the DeepSpeed checkpoint into a single full-precision weight file:

cd ./runs/video-lisa-3.8b-3k-iter/ckpt_model && python zero_to_fp32.py . ../pytorch_model.bin

Weight merging

Since the script performs LoRA training with DeepSpeed by default, you need to merge the LoRA weights back into the model after training.

CUDA_VISIBLE_DEVICES="" python merge_lora_weights_and_save_hf_model.py \
  --version="MBZUAI/LLaVA-Phi-3-mini-4k-instruct" \
  --weight="runs/video-lisa-3.8b-3k-iter/pytorch_model.bin" \
  --save_path="runs/video-lisa-3.8b-3k-iter/merged"

Evaluation

MeViS

Before running the following commands, check the scripts involved and configure the data paths.

# Step 1
bash evaluation/mevis_val_u/run_inference_mevis.sh

# Step 2
bash evaluation/mevis_val_u/run_eval_mevis.sh

ReasonVOS

# Step 1
bash evaluation/reason_vos/run_inference_reason_vos.sh

# Step 2
bash evaluation/reason_vos/run_eval.sh

Ref-YouTube-VOS

bash evaluation/refytvos/run_inference_refytvos.sh

Submit your result to the online evaluation server.

Ref-DAVIS-17

# Step 1
bash evaluation/refdavis/run_inference_refdavis.sh

# Step 2
bash evaluation/refdavis/run_post_process.sh

Image Benchmarks

To support evaluation on image benchmarks, including ReasonSeg and the refCOCO series, we provide a unified script, shown below. First, prepare the image data following the instructions in LISA. After that, run:

deepspeed --master_port=24999 evaluation/eval_img/val.py \
  --version="ZechenBai/VideoLISA-3.8B" \
  --dataset_dir='/data_sdf/LLM_DATA/LISA/datasets' \
  --vision_pretrained="/home/ubuntu/ckpt/SAM/sam_vit_h_4b8939.pth" \
  --vision_tower="openai/clip-vit-large-patch14-336" \
  --num_frames_sparse=32 \
  --num_frames_dense=4 \
  --model_max_length=2048 \
  --eval_only \
  --val_dataset="ReasonSeg|val"

# --val_dataset can be changed to:
# ReasonSeg subsets: ReasonSeg|val, ReasonSeg|test|short, ReasonSeg|val|long, ReasonSeg|val|all
# refCOCO variants: refcoco|unc|testA, refcoco|unc|testB, refcoco+|unc|testA, refcoco+|unc|testB, refcocog|umd|test, refcoco|unc|val, refcoco+|unc|val, refcocog|umd|val
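
To evaluate several splits in one go, the command above can be generated per split. The sketch below only assembles and prints the command lines; the two paths are placeholders (assumptions) that you need to replace, and the split list is a small subset chosen for illustration. Note that the split names contain `|`, which the shell treats as a pipe, so they must be quoted; `shlex.join` takes care of that.

```python
import shlex

# Placeholder paths: replace with your own dataset and SAM checkpoint locations.
DATASET_DIR = "/path/to/LISA/datasets"
SAM_CHECKPOINT = "/path/to/sam_vit_h_4b8939.pth"

# A subset of the --val_dataset values listed above.
VAL_DATASETS = [
    "ReasonSeg|val",
    "refcoco|unc|val",
    "refcoco+|unc|val",
    "refcocog|umd|val",
]


def build_eval_command(val_dataset):
    """Return the evaluation command for one split as an argument list."""
    return [
        "deepspeed",
        "--master_port=24999",
        "evaluation/eval_img/val.py",
        "--version=ZechenBai/VideoLISA-3.8B",
        f"--dataset_dir={DATASET_DIR}",
        f"--vision_pretrained={SAM_CHECKPOINT}",
        "--vision_tower=openai/clip-vit-large-patch14-336",
        "--num_frames_sparse=32",
        "--num_frames_dense=4",
        "--model_max_length=2048",
        "--eval_only",
        f"--val_dataset={val_dataset}",
    ]


for name in VAL_DATASETS:
    # shlex.join quotes arguments containing shell-special characters.
    print(shlex.join(build_eval_command(name)))
```

The printed lines can be pasted into a shell script, or each argument list can be passed directly to `subprocess.run` to run the evaluations one after another.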

Post-optimization

The post-optimization utilized in our paper is implemented based on XMem2.

XMem2 is organized as a self-contained workflow, so it cannot be trivially integrated into another codebase such as VideoLISA. In our experience, the best practice is to import the raw inference results into XMem2 and work within the XMem2 codebase. To facilitate this process, we provide an example of the workflow in xmem2_example.

The workflow generally includes 4 steps:

  1. Select effective masks produced by the [TRK] token. XMem2 supports multiple reference masks, whereas XMem supports only a single reference.
  2. Build a Workspace and import your data into it. Workspace is a concept in XMem2's framework for organizing data.
  3. Run the XMem2 optimization.
  4. Recover the color palette, because XMem2's codebase operates on multi-channel color palettes, while VideoLISA adopts binary masks.

After that, you can proceed with the evaluation.
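
As a rough illustration of step 4, the sketch below converts a palette-indexed mask into a binary mask. It is not the code in xmem2_example: the function name `palette_to_binary` is hypothetical, and it assumes the mask has already been loaded as a 2D grid of palette indices in which index 0 is the background.

```python
def palette_to_binary(index_mask, object_ids=None, foreground=255):
    """Convert a 2D grid of palette indices into a binary mask.

    index_mask: list of rows, each a list of integer palette indices.
                Index 0 is assumed to be the background.
    object_ids: optional set of indices to keep; all non-zero indices
                count as foreground when it is None.
    """
    def is_foreground(value):
        if object_ids is None:
            return value != 0
        return value in object_ids

    return [
        [foreground if is_foreground(value) else 0 for value in row]
        for row in index_mask
    ]


# Toy 2x3 mask with two objects (indices 1 and 2) on a background of 0.
toy_mask = [
    [0, 1, 1],
    [2, 0, 1],
]
print(palette_to_binary(toy_mask))                  # every object is foreground
print(palette_to_binary(toy_mask, object_ids={2}))  # keep only object 2
```

For real frames, the same mapping is usually applied to an array read from the PNG files rather than to nested lists; the list-based version is used here only to keep the example dependency-free.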

Citation

To cite the paper and model, please use the following:

@article{bai2024one,
  title={One token to seg them all: Language instructed reasoning segmentation in videos},
  author={Bai, Zechen and He, Tong and Mei, Haiyang and Wang, Pichao and Gao, Ziteng and Chen, Joya and Liu, Lei and Zhang, Zheng and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2409.19603},
  year={2024}
}

Acknowledgments

This work is heavily based on LISA, LLaVA, LLaVA-pp, Segment-Anything and Phi-3. Thanks to all the authors for their great work!