How do I evaluate the results after stage-1 of training BLIP2? #774

Open
hawkiyc opened this issue Dec 12, 2024 · 6 comments

hawkiyc commented Dec 12, 2024

Hi, developers,

I am adapting your code to build a modified BLIP2 model for time-series input, and I am currently trying to understand the architecture of this framework. I have tested the bash run_scripts/blip2/train/pretrain_stage1.sh command with the COCO dataset (by the way, there are mismatches between images and annotations in the VG dataset, so I removed it), and it seems to work fine. However, I cannot find any script or .yaml file for evaluating the results of stage 1. I have checked the lavis/configs/datasets/coco/defaults_cap.yaml file, and it contains entries for the train, val, and test splits.

defaults_cap.yaml

datasets:
  coco_caption: # name of the dataset builder
    dataset_card: dataset_card/coco_caption.md
    # data_dir: ${env.data_dir}/datasets
    data_type: images # [images|videos|features]

    build_info:
      # Be careful not to append minus sign (-) before split to avoid itemizing
      annotations:
        train:
          url: https://storage.googleapis.com/sfr-vision-language-research/datasets/coco_karpathy_train.json
          md5: aa31ac474cf6250ebb81d18348a07ed8
          storage: coco/annotations/coco_karpathy_train.json
        val:
          url: https://storage.googleapis.com/sfr-vision-language-research/datasets/coco_karpathy_val.json
          md5: b273847456ef5580e33713b1f7de52a0
          storage: coco/annotations/coco_karpathy_val.json
        test:
          url: https://storage.googleapis.com/sfr-vision-language-research/datasets/coco_karpathy_test.json
          md5: 3ff34b0ef2db02d01c37399f6a2a6cd1
          storage: coco/annotations/coco_karpathy_test.json
      images:
        storage: coco/images/

Here is the printed result in the terminal:

Train: data epoch: [4]  [5550/5667]  eta: 0:03:26  lr: 0.000019  loss: 4.0731  loss_itc: 0.9712 (0.9633)  loss_itm: 0.1881 (0.1714)  loss_lm: 2.8563 (2.8436)  time: 1.7917  data: 0.0000  max mem: 27191
Train: data epoch: [4]  [5600/5667]  eta: 0:01:58  lr: 0.000019  loss: 4.1341  loss_itc: 0.9485 (0.9633)  loss_itm: 0.1703 (0.1713)  loss_lm: 2.8336 (2.8436)  time: 1.7898  data: 0.0000  max mem: 27191
Train: data epoch: [4]  [5650/5667]  eta: 0:00:30  lr: 0.000019  loss: 3.8998  loss_itc: 0.9417 (0.9632)  loss_itm: 0.1509 (0.1713)  loss_lm: 2.8545 (2.8438)  time: 1.7882  data: 0.0000  max mem: 27191
Train: data epoch: [4]  [5666/5667]  eta: 0:00:01  lr: 0.000019  loss: 3.9018  loss_itc: 0.9507 (0.9632)  loss_itm: 0.1535 (0.1713)  loss_lm: 2.8405 (2.8438)  time: 1.8221  data: 0.0000  max mem: 27191
Train: data epoch: [4] Total time: 2:47:07 (1.7694 s / it)
INFO - 2024-12-12 03:24:12,536 - base_task - Averaged stats: lr: 0.0000  loss: 3.9783  loss_itc: 0.9632  loss_itm: 0.1713  loss_lm: 2.8438
INFO - 2024-12-12 03:24:12,543 - runner_base - No validation splits found.
INFO - 2024-12-12 03:24:12,598 - runner_base - Saving checkpoint at epoch 4 to /home/revlis_ai/Documents/training_models_temp/LAVIS_with_JoLT/lavis/output/BLIP2/Pretrain_stage1/20241211132/checkpoint_4.pth.
INFO - 2024-12-12 03:24:15,828 - runner_base - Saving checkpoint at epoch 4 to /home/revlis_ai/Documents/training_models_temp/LAVIS_with_JoLT/lavis/output/BLIP2/Pretrain_stage1/20241211132/checkpoint_4.pth.
INFO - 2024-12-12 03:24:23,201 - runner_base - No validation splits found.
INFO - 2024-12-12 03:24:23,203 - runner_base - Training time 13:55:33
[rank0]:[W1212 03:24:24.182641511 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())

Output log file

{
    "run": {
        "task": "image_text_pretrain",
        "lr_sched": "linear_warmup_cosine_lr",
        "init_lr": 0.0001,
        "min_lr": 1e-05,
        "warmup_lr": 1e-06,
        "weight_decay": 0.05,
        "max_epoch": 5,
        "batch_size_train": 100,
        "batch_size_eval": 64,
        "num_workers": 4,
        "warmup_steps": 5000,
        "seed": 42,
        "output_dir": "output/BLIP2/Pretrain_stage1",
        "amp": true,
        "resume_ckpt_path": null,
        "evaluate": false,
        "train_splits": [
            "train"
        ],
        "device": "cuda",
        "world_size": 1,
        "dist_url": "env://",
        "distributed": true,
        "rank": 0,
        "gpu": 0,
        "dist_backend": "nccl"
    },
    "model": {
        "arch": "blip2",
        "load_finetuned": false,
        "pretrained": "https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained.pth",
        "finetuned": "",
        "image_size": 224,
        "drop_path_rate": 0,
        "use_grad_checkpoint": false,
        "vit_precision": "fp16",
        "freeze_vit": true,
        "num_query_token": 32,
        "model_type": "pretrain",
        "load_pretrained": false
    },
    "preprocess": {
        "vis_processor": {
            "train": {
                "name": "blip_image_train",
                "image_size": 224
            },
            "eval": {
                "name": "blip_image_eval",
                "image_size": 224
            }
        },
        "text_processor": {
            "train": {
                "name": "blip_caption"
            },
            "eval": {
                "name": "blip_caption"
            }
        }
    },
    "datasets": {
        "coco_caption": {
            "dataset_card": "dataset_card/coco_caption.md",
            "data_type": "images",
            "build_info": {
                "annotations": {
                    "train": {
                        "url": "https://storage.googleapis.com/sfr-vision-language-research/datasets/coco_karpathy_train.json",
                        "md5": "aa31ac474cf6250ebb81d18348a07ed8",
                        "storage": "coco/annotations/coco_karpathy_train.json"
                    },
                    "val": {
                        "url": "https://storage.googleapis.com/sfr-vision-language-research/datasets/coco_karpathy_val.json",
                        "md5": "b273847456ef5580e33713b1f7de52a0",
                        "storage": "coco/annotations/coco_karpathy_val.json"
                    },
                    "test": {
                        "url": "https://storage.googleapis.com/sfr-vision-language-research/datasets/coco_karpathy_test.json",
                        "md5": "3ff34b0ef2db02d01c37399f6a2a6cd1",
                        "storage": "coco/annotations/coco_karpathy_test.json"
                    }
                },
                "images": {
                    "storage": "coco/images/"
                }
            },
            "vis_processor": {
                "train": {
                    "name": "blip2_image_train",
                    "image_size": 224
                }
            },
            "text_processor": {
                "train": {
                    "name": "blip_caption"
                }
            }
        }
    }
}
{"train_lr": "0.000", "train_loss": "5.582", "train_loss_itc": "1.492", "train_loss_itm": "0.402", "train_loss_lm": "3.688"}
{"train_lr": "0.000", "train_loss": "4.538", "train_loss_itc": "1.097", "train_loss_itm": "0.266", "train_loss_lm": "3.174"}
{"train_lr": "0.000", "train_loss": "4.288", "train_loss_itc": "1.035", "train_loss_itm": "0.222", "train_loss_lm": "3.031"}
{"train_lr": "0.000", "train_loss": "4.110", "train_loss_itc": "0.993", "train_loss_itm": "0.192", "train_loss_lm": "2.925"}
{"train_lr": "0.000", "train_loss": "3.978", "train_loss_itc": "0.963", "train_loss_itm": "0.171", "train_loss_lm": "2.844"}

parth1313 commented

Hey @hawkiyc

I want to train BLIP2; however, I am getting issues like this:
from diffusers import (
  File "/usr/local/lib/python3.11/dist-packages/diffusers/__init__.py", line 3, in <module>
    from .configuration_utils import ConfigMixin
  File "/usr/local/lib/python3.11/dist-packages/diffusers/configuration_utils.py", line 34, in <module>
    from .utils import (
  File "/usr/local/lib/python3.11/dist-packages/diffusers/utils/__init__.py", line 38, in <module>
    from .dynamic_modules_utils import get_class_from_dynamic_module
  File "/usr/local/lib/python3.11/dist-packages/diffusers/utils/dynamic_modules_utils.py", line 29, in <module>
    from huggingface_hub import HfFolder, cached_download, hf_hub_download, model_info
ImportError: cannot import name 'cached_download' from 'huggingface_hub' (/usr/local/lib/python3.11/dist-packages/huggingface_hub/__init__.py)

This happens because some of the packages are not being installed properly due to compatibility issues. For example:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
albucore 0.0.19 requires opencv-python-headless>=4.9.0.80, but you have opencv-python-headless 4.5.5.64 which is incompatible.
albumentations 1.4.20 requires opencv-python-headless>=4.9.0.80, but you have opencv-python-headless 4.5.5.64 which is incompatible.
sentence-transformers 3.3.1 requires transformers<5.0.0,>=4.41.0, but you have transformers 4.26.1 which is incompatible.
ERROR: Could not find a version that satisfies the requirement open3d==0.13.0 (from salesforce-lavis) (from versions: 0.16.0, 0.17.0, 0.18.0, 0.19.0)
ERROR: No matching distribution found for open3d==0.13.0

I am running it on Colab with an A100.

Can you provide a solution?


hawkiyc commented Jan 16, 2025

Hi @parth1313, cached_download was removed from huggingface_hub in v0.26. Downgrading your huggingface_hub to 0.25.* should solve this problem.
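
For example (assuming a pip-based environment; any 0.25.x release should do):

pip install "huggingface_hub==0.25.*"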


parth1313 commented Jan 16, 2025

Thank you for the reply @hawkiyc

I am still getting the issue. Can you tell me the versions of all the libraries you used during pretrain_stage1?

I am using salesforce-lavis==1.0.2 and all other libraries as given in requirements.txt:

contexttimer
decord
diffusers<=0.16.0
einops>=0.4.1
fairscale==0.4.4
ftfy
iopath
ipython
omegaconf
opencv-python-headless==4.5.5.64
opendatasets
packaging
pandas
plotly
pre-commit
pycocoevalcap
pycocotools
python-magic
scikit-image
sentencepiece
spacy
streamlit
timm==0.4.12
torch>=1.10.0
torchvision
tqdm
transformers==4.33.2
webdataset
wheel
torchaudio
soundfile
moviepy
nltk
peft

easydict==1.9
pyyaml_env_tag==0.1
open3d==0.13.0
h5py

Here is what I get when running !pip install salesforce-lavis:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
sentence-transformers 3.3.1 requires transformers<5.0.0,>=4.41.0, but you have transformers 4.26.1 which is incompatible.
albumentations 1.4.20 requires opencv-python-headless>=4.9.0.80, but you have opencv-python-headless 4.5.5.64 which is incompatible.
albucore 0.0.19 requires opencv-python-headless>=4.9.0.80, but you have opencv-python-headless 4.5.5.64 which is incompatible.
Successfully installed antlr4-python3-runtime-4.9.3 braceexpand-0.1.7 cfgv-3.4.0 contexttimer-0.3.3 decord-0.6.0 distlib-0.3.9 fairscale-0.4.4 ftfy-6.3.1 identify-2.6.5 iopath-0.1.10 jedi-0.19.2 nodeenv-1.9.1 omegaconf-2.3.0 opencv-python-headless-4.5.5.64 opendatasets-0.1.22 portalocker-3.1.1 pre-commit-4.0.1 pycocoevalcap-1.2 pydeck-0.9.1 python-magic-0.4.27 salesforce-lavis-1.0.2 streamlit-1.41.1 timm-0.4.12 tokenizers-0.13.3 transformers-4.26.1 virtualenv-20.29.0 watchdog-6.0.0 webdataset-0.2.100

And the following error while running !python evaluate.py --cfg-path lavis/projects/blip2/eval/caption_coco_opt2.7b_eval.yaml:

2025-01-16 14:04:01.761870: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-01-16 14:04:01.780186: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-01-16 14:04:01.801982: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-01-16 14:04:01.808581: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-01-16 14:04:01.825630: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-01-16 14:04:02.871386: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
error: XDG_RUNTIME_DIR not set in the environment.
ALSA lib confmisc.c:855:(parse_card) cannot find card '0'
ALSA lib conf.c:5178:(_snd_config_evaluate) function snd_func_card_inum returned error: No such file or directory
ALSA lib confmisc.c:422:(snd_func_concat) error evaluating strings
ALSA lib conf.c:5178:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1334:(snd_func_refer) error evaluating name
ALSA lib conf.c:5178:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:5701:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2664:(snd_pcm_open_noupdate) Unknown PCM default
ALSA lib confmisc.c:855:(parse_card) cannot find card '0'
ALSA lib conf.c:5178:(_snd_config_evaluate) function snd_func_card_inum returned error: No such file or directory
ALSA lib confmisc.c:422:(snd_func_concat) error evaluating strings
ALSA lib conf.c:5178:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1334:(snd_func_refer) error evaluating name
ALSA lib conf.c:5178:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:5701:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2664:(snd_pcm_open_noupdate) Unknown PCM default
Traceback (most recent call last):
  File "/content/LAVIS/evaluate.py", line 15, in <module>
    import lavis.tasks as tasks
  File "/content/LAVIS/lavis/__init__.py", line 15, in <module>
    from lavis.datasets.builders import *
  File "/content/LAVIS/lavis/datasets/builders/__init__.py", line 8, in <module>
    from lavis.datasets.builders.base_dataset_builder import load_dataset_config
  File "/content/LAVIS/lavis/datasets/builders/base_dataset_builder.py", line 18, in <module>
    from lavis.processors.base_processor import BaseProcessor
  File "/content/LAVIS/lavis/processors/__init__.py", line 29, in <module>
    from lavis.processors.audio_processors import BeatsAudioProcessor
  File "/content/LAVIS/lavis/processors/audio_processors.py", line 17, in <module>
    from lavis.models.beats.Tokenizers import TokenizersConfig, Tokenizers
  File "/content/LAVIS/lavis/models/__init__.py", line 42, in <module>
    from lavis.models.blip2_models.blip2_vicuna_xinstruct import Blip2VicunaXInstruct
  File "/content/LAVIS/lavis/models/blip2_models/blip2_vicuna_xinstruct.py", line 22, in <module>
    from peft import (
  File "/usr/local/lib/python3.11/dist-packages/peft/__init__.py", line 22, in <module>
    from .auto import (
  File "/usr/local/lib/python3.11/dist-packages/peft/auto.py", line 32, in <module>
    from .mapping import MODEL_TYPE_TO_PEFT_MODEL_MAPPING
  File "/usr/local/lib/python3.11/dist-packages/peft/mapping.py", line 25, in <module>
    from .mixed_model import PeftMixedModel
  File "/usr/local/lib/python3.11/dist-packages/peft/mixed_model.py", line 29, in <module>
    from .peft_model import PeftModel
  File "/usr/local/lib/python3.11/dist-packages/peft/peft_model.py", line 37, in <module>
    from transformers import Cache, DynamicCache, EncoderDecoderCache, PreTrainedModel
ImportError: cannot import name 'Cache' from 'transformers' (/usr/local/lib/python3.11/dist-packages/transformers/__init__.py)


hawkiyc commented Jan 17, 2025

Hi @parth1313, the LAVIS framework needs specific versions of transformers; please install it with pip install transformers==4.33.2. This should resolve the error message you encountered. As for the pip conflicts, you may need to downgrade some libraries or frameworks. You can grab my env if you want, but please note that I am using Ubuntu 22.04 and Anaconda.
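
For reference, the version pins mentioned in this thread, collected into one command (assuming a clean environment; this is not guaranteed to resolve every conflict pip reports):

pip install "transformers==4.33.2" "huggingface_hub==0.25.*" "opencv-python-headless==4.5.5.64"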


parth1313 commented Jan 17, 2025

Thank you for the help @hawkiyc

But transformers==4.33.2 is not compatible with salesforce-lavis==1.0.2, as salesforce-lavis 1.0.2 requires transformers>=4.25.0,<4.27, and installing it further gives:

albumentations 1.4.20 requires opencv-python-headless>=4.9.0.80, but you have opencv-python-headless 4.5.5.64 which is incompatible.
albucore 0.0.19 requires opencv-python-headless>=4.9.0.80, but you have opencv-python-headless 4.5.5.64 which is incompatible.

And if I try to install opencv-python-headless>=4.9.0.80, it gives:
salesforce-lavis 1.0.2 requires opencv-python-headless==4.5.5.64, but you have opencv-python-headless 4.9.0.80 which is incompatible.

Moreover, no downgraded version of albucore is compatible with opencv-python-headless 4.5.5.64.

Can you clarify further?


hawkiyc commented Jan 17, 2025

Hi @parth1313,
I'm sorry, I overlooked that you had installed the LAVIS library from pip. I only installed the required libraries with pip install -r requirements.txt and git cloned all the files from the LAVIS repository, because I needed a modified BLIP2 model for my project. In my opinion, cloning the whole repository is the better choice if you want to train your own model.
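
Roughly, that workflow looks like this (repository URL from the public project; adjust paths to your own setup):

git clone https://github.com/salesforce/LAVIS.git
cd LAVIS
pip install -r requirements.txt
bash run_scripts/blip2/train/pretrain_stage1.sh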
