
πŸš€ Vision Transformers From Scratch

SRA Eklavya 2024 ✨

✏️ Table of contents

  • About the project
  • File structure
  • Getting started
  • Theory and Approach
  • Results
  • Applications
  • Contributors
  • Acknowledgements and Resources

⭐ About the project

Aim

The aim of this project is to generate descriptive captions for images by combining the power of Transformers and computer vision.

Description

This project focuses on image captioning using Vision Transformers (ViT), implemented from scratch. Initially, a basic CNN + LSTM approach was employed to establish a baseline. We then transitioned to a more advanced Vision Transformer (ViT) model to leverage its ability to capture long-range dependencies in image data.

Tech Stack

  • Programming Language
  • Deep Learning Frameworks
  • Data Handling
  • Natural Language Processing

Dataset

The project uses the COCO 2017 dataset, a large-scale dataset that provides 5 descriptive captions for each image.
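
As a minimal sketch of loading these captions (assuming pycocotools is installed and the official caption annotations have been downloaded; the path below is only an example):

    from pycocotools.coco import COCO

    # Example path -- adjust to wherever the COCO 2017 annotations live.
    ANN_FILE = "annotations/captions_train2017.json"
    coco = COCO(ANN_FILE)

    # Each image id maps to roughly 5 human-written captions.
    img_id = coco.getImgIds()[0]
    ann_ids = coco.getAnnIds(imgIds=img_id)
    captions = [ann["caption"] for ann in coco.loadAnns(ann_ids)]
    print(captions)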

πŸ“ File structure


πŸ›  Getting started

Installation

  1. Clone the repo
    git clone https://github.com/sneha31415/vision_transformers_from_scratch.git

  2. Navigate to the project directory
    cd vision_transformers_from_scratch

πŸ“ Theory and Approach

CNN + LSTM Model

This is the complete architecture of the CNN + LSTM image captioning model. The CNN encoder finds patterns in the image and encodes them into a feature vector that is passed to the LSTM decoder, which outputs a word at each time step to best describe the image. When the {endseq} token is produced or the maximum sentence length is reached, the generated caption is complete and becomes the output for that particular image.

(Figure: CNN + LSTM image captioning architecture.)

1) Encoder:

A pretrained CNN model (ResNet50) is used for feature extraction, transforming input images into fixed-length feature vectors.

2) Decoder:

An LSTM network is utilized to generate captions by taking image features and previous word embeddings as input to predict the next word.
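
A minimal sketch of this encoder-decoder pairing (assuming PyTorch and a recent torchvision; the layer sizes and names are illustrative, not the project's exact code):

    import torch
    import torch.nn as nn
    import torchvision.models as models

    class EncoderCNN(nn.Module):
        """Pretrained ResNet50 with its classification head removed."""
        def __init__(self, embed_size):
            super().__init__()
            resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
            self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop final fc
            self.fc = nn.Linear(resnet.fc.in_features, embed_size)

        def forward(self, images):                 # (B, 3, H, W)
            with torch.no_grad():                  # keep the CNN frozen
                feats = self.backbone(images).flatten(1)
            return self.fc(feats)                  # (B, embed_size)

    class DecoderLSTM(nn.Module):
        """Predicts the next word from image features + previous word embeddings."""
        def __init__(self, embed_size, hidden_size, vocab_size):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_size)
            self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
            self.fc = nn.Linear(hidden_size, vocab_size)

        def forward(self, features, captions):     # captions: (B, T)
            emb = self.embed(captions)             # (B, T, embed_size)
            # Prepend the image feature as the first "token" of the sequence.
            inputs = torch.cat([features.unsqueeze(1), emb], dim=1)
            out, _ = self.lstm(inputs)
            return self.fc(out)                    # (B, T+1, vocab_size) word scores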

Vision Transformers model (ViT)

What are transformers?

Before heading into Vision Transformers, let's first understand Transformers.
Since their introduction in the 2017 paper Attention Is All You Need by Google Brain, Transformers have sparked enormous interest in their capabilities for NLP.

Transformer Architecture

(Figure: the Transformer architecture.)

In the Transformer model:

  • Encoder: Converts input tokens into continuous representations using self-attention to capture relationships between all tokens simultaneously.
  • Decoder: Generates output tokens by attending to both the encoder’s output and previously generated tokens, using masked self-attention and cross-attention.
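
As an illustration of how these two halves fit together (this uses PyTorch's built-in nn.Transformer purely for demonstration and is not the project's implementation; all dimensions are arbitrary):

    import torch
    import torch.nn as nn

    model = nn.Transformer(d_model=512, nhead=8,
                           num_encoder_layers=6, num_decoder_layers=6,
                           batch_first=True)

    src = torch.rand(2, 20, 512)   # batch of 2, 20 source tokens (already embedded)
    tgt = torch.rand(2, 15, 512)   # 15 target tokens generated so far

    # Causal mask so each target position only attends to earlier positions.
    tgt_mask = model.generate_square_subsequent_mask(tgt.size(1))

    out = model(src, tgt, tgt_mask=tgt_mask)   # (2, 15, 512)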

So... What are Vision Transformers?

Vision Transformers were introduced in the 2020 paper An Image is Worth 16x16 Words.
The Vision Transformer, or ViT, is a model that applies a Transformer-like architecture over patches of an image. The image is split into fixed-size patches, each patch is linearly embedded, position embeddings are added, and the resulting sequence of vectors is fed to a standard Transformer encoder.

🌟 Image Captioning using ViT from scratch πŸ’₯

(Figure: architecture of the ViT-based image captioning model.)

1) ViT Encoder

As depicted in the figure, instead of using a pretrained CNN or Faster R-CNN model to extract spatial or bottom-up features as in previous methods, we divide the original image into a sequence of image patches to match the input format of the Transformer. We use a Conv2D layer (chosen for its performance gains) with stride and kernel size equal to the patch size; alternatively, a linear layer applied to flattened patches works as well. The resulting 4D tensor is then reshaped to 3D to flatten the spatial dimensions, and learnable position embeddings are added, as in the sketch below.
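
A minimal sketch of this patch-embedding step (assuming PyTorch; the image size, patch size, and embedding dimension are illustrative):

    import torch
    import torch.nn as nn

    class PatchEmbedding(nn.Module):
        """Split an image into patches using a Conv2D with kernel = stride = patch size."""
        def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
            super().__init__()
            self.num_patches = (img_size // patch_size) ** 2
            # Equivalent to slicing patches and applying a linear layer, but faster.
            self.proj = nn.Conv2d(in_channels, embed_dim,
                                  kernel_size=patch_size, stride=patch_size)
            # Learnable position embeddings, one per patch.
            self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

        def forward(self, x):                    # (B, 3, 224, 224)
            x = self.proj(x)                     # (B, embed_dim, 14, 14)
            x = x.flatten(2).transpose(1, 2)     # (B, 196, embed_dim): 4D -> 3D
            return x + self.pos_embed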

The encoder of CPTR consists of Nx stacked identical layers, each of which contains a multi-head self-attention (MHA) sublayer followed by a position-wise feed-forward sublayer. MHA contains H parallel heads, and each head hi corresponds to an independent scaled dot-product attention function, which allows the model to jointly attend to different representation subspaces. A sketch of one MHA sublayer follows below.
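
A hedged sketch of such an MHA sublayer with H parallel scaled dot-product heads (illustrative PyTorch, not the project's exact code):

    import torch
    import torch.nn as nn

    class MultiHeadSelfAttention(nn.Module):
        def __init__(self, embed_dim=768, num_heads=12):
            super().__init__()
            assert embed_dim % num_heads == 0
            self.h, self.d_k = num_heads, embed_dim // num_heads
            self.qkv = nn.Linear(embed_dim, 3 * embed_dim)   # project to Q, K, V at once
            self.out = nn.Linear(embed_dim, embed_dim)

        def forward(self, x):                                # (B, N, embed_dim)
            B, N, _ = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            # Split each projection into H heads: (B, H, N, d_k)
            q, k, v = (t.view(B, N, self.h, self.d_k).transpose(1, 2) for t in (q, k, v))
            # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
            attn = (q @ k.transpose(-2, -1)) / self.d_k ** 0.5
            attn = attn.softmax(dim=-1)
            out = (attn @ v).transpose(1, 2).reshape(B, N, self.h * self.d_k)
            return self.out(out)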


2) Transformer Decoder

On the decoder side, we add positional embeddings to the word embedding features, and the decoder takes these embeddings together with the encoder output as its input.

The decoder consists of Nd stacked identical layers, each containing a masked multi-head self-attention sublayer, followed by a multi-head cross-attention sublayer and a feed-forward sublayer, in that order. The output feature of the last decoder layer is used to predict the next word via a linear layer whose output dimension equals the vocabulary size. A sketch of one decoder layer is shown below.
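
A minimal sketch of one such decoder layer plus the final vocabulary projection (assuming PyTorch; the actual project code may organize this differently):

    import torch
    import torch.nn as nn

    class DecoderLayer(nn.Module):
        def __init__(self, embed_dim=512, num_heads=8, ff_dim=2048):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
            self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
            self.ff = nn.Sequential(nn.Linear(embed_dim, ff_dim), nn.ReLU(),
                                    nn.Linear(ff_dim, embed_dim))
            self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(embed_dim) for _ in range(3))

        def forward(self, tgt, memory, tgt_mask):
            # 1) Masked self-attention over the words generated so far.
            x, _ = self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask)
            tgt = self.norm1(tgt + x)
            # 2) Cross-attention: queries from the decoder, keys/values from the encoder output.
            x, _ = self.cross_attn(tgt, memory, memory)
            tgt = self.norm2(tgt + x)
            # 3) Position-wise feed-forward sublayer.
            return self.norm3(tgt + self.ff(tgt))

    # After Nd such layers, project onto the vocabulary to score the next word.
    vocab_size = 10000                     # illustrative value
    to_vocab = nn.Linear(512, vocab_size)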

πŸ€– Results

CNN + LSTM Model

1) Bleu score

  • The CNN + LSTM model achieved a BLEU-1 score of 0.553085 and a BLEU-2 score of 0.333717 (a sketch of how such scores can be computed follows below).
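
For reference, a small sketch of computing corpus-level BLEU-1 and BLEU-2 with NLTK (assuming tokenized captions; the sentences below are only examples, not project outputs):

    from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

    # Each prediction is paired with a list of reference captions (COCO provides ~5 per image).
    references = [[["a", "dog", "runs", "on", "the", "beach"],
                   ["a", "dog", "running", "along", "the", "shore"]]]
    predictions = [["a", "dog", "runs", "on", "the", "sand"]]

    smooth = SmoothingFunction().method1
    bleu1 = corpus_bleu(references, predictions, weights=(1.0, 0, 0, 0), smoothing_function=smooth)
    bleu2 = corpus_bleu(references, predictions, weights=(0.5, 0.5, 0, 0), smoothing_function=smooth)
    print(f"BLEU-1: {bleu1:.3f}, BLEU-2: {bleu2:.3f}")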

2) Predicted Captions

(Figures: predicted captions for four sample images, followed by two failure cases.)

ViT Model


🌎 Applications:

  • Enhanced Image Understanding: Generates more accurate and context-aware captions by capturing complex relationships within images.

  • Accessibility: Improves accessibility for visually impaired users by converting visual information into descriptive text.

  • Image Search and Organization: Enhances image search engines by providing detailed descriptions, aiding in better indexing and retrieval.

  • E-commerce: Provides detailed product descriptions in online catalogs, improving user experience and product discovery.

Contributors

Acknowledgements and Resources
