- About the project
- Getting Started
- Theory and Approach
- Results
- Applications
- Contributors
- Acknowledgements
The aim of this project is to generate descriptive captions for images by combining the power of Transformers and computer vision.
This project focuses on image captioning using Vision Transformers (ViT), implemented from scratch. Initially, a basic CNN + LSTM approach was employed to establish a baseline. We then transitioned to a more advanced Vision Transformer (ViT) model to leverage its capability in capturing long-range dependencies in image data.
The project uses the COCO 2017 dataset, a comprehensive dataset that provides 5 descriptive captions for each image.
- Clone the repo:
  `git clone https://github.com/sneha31415/vision_transformers_from_scratch.git`
- Navigate to the project directory:
  `cd vision_transformers_from_scratch`
This is the complete architecture of the CNN + LSTM image captioning model. The CNN encoder finds patterns in the image and encodes them into a vector, which is passed to the LSTM decoder; the decoder outputs one word at each time step to best describe the image. Generation stops when the {endseq} token is produced or the maximum caption length is reached, and the resulting sentence is the caption for that image.
A pretrained CNN model (ResNet50) is used for feature extraction, transforming input images into fixed-length feature vectors.
An LSTM network is utilized to generate captions by taking image features and previous word embeddings as input to predict the next word.
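To make the baseline concrete, here is a minimal PyTorch-style sketch of such an encoder-decoder pair; the layer sizes and names are illustrative, not the exact ones used in this repo.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CNNEncoder(nn.Module):
    """ResNet50 backbone that maps an image to a fixed-length feature vector."""
    def __init__(self, embed_size=256):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier head
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):                    # (B, 3, H, W)
        feats = self.backbone(images).flatten(1)  # (B, 2048)
        return self.fc(feats)                     # (B, embed_size)

class LSTMDecoder(nn.Module):
    """LSTM that predicts the next word from the image feature and previous words."""
    def __init__(self, embed_size=256, hidden_size=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, img_feats, captions):
        # Prepend the image feature as the first "token" of the input sequence.
        inputs = torch.cat([img_feats.unsqueeze(1), self.embed(captions)], dim=1)
        out, _ = self.lstm(inputs)
        return self.fc(out)                       # (B, T+1, vocab_size)
```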
Before heading into Vision Transformers, let's first understand Transformers.
Since their introduction in the 2017 Google Brain paper Attention Is All You Need, Transformers have attracted enormous interest for their capabilities in NLP.
In the Transformer model:
- Encoder: Converts input tokens into continuous representations using self-attention to capture relationships between all tokens simultaneously.
- Decoder: Generates output tokens by attending to both the encoder's output and previously generated tokens, using masked self-attention and cross-attention (see the attention sketch below).
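As a rough illustration (not this repo's exact code), the scaled dot-product attention at the core of both encoder and decoder, with an optional mask for the decoder's masked self-attention, can be sketched as:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)          # (..., T_q, T_k)
    if mask is not None:
        # Positions where mask == 0 (e.g. future tokens) are blocked.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v
```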
Vision Transformers were introduced in the 2020 paper An Image is Worth 16x16 Words.
The Vision Transformer, or ViT, is a model that applies a Transformer-like architecture over patches of the image. An image is split into fixed-size patches, each of which is linearly embedded; position embeddings are added, and the resulting sequence of vectors is fed to a standard Transformer encoder.
As depicted in the figure, instead of using a pretrained CNN or Faster R-CNN model to extract spatial or bottom-up features as in previous methods, we divide the original image into a sequence of image patches to match the input format of the Transformer. We use a Conv2D layer (for performance reasons) with stride and kernel size equal to the patch size; a linear layer could be used instead. The resulting 4D tensor is then reshaped to 3D to flatten the patches, and learnable position embeddings are added.
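A minimal sketch of this patch-embedding step, assuming a square input image and typical ViT hyperparameters (patch size 16, embedding dimension 768), which are illustrative rather than this repo's exact settings:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split the image into patches with a Conv2D (stride = kernel = patch size),
    flatten the patches into a sequence, and add learnable position embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        num_patches = (img_size // patch_size) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, x):                  # (B, 3, H, W)
        x = self.proj(x)                   # (B, embed_dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)   # (B, num_patches, embed_dim)
        return x + self.pos_embed
```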
The encoder of CPTR consists of Nx stacked identical layers, each consisting of a multi-head self-attention (MHA) sublayer followed by a position-wise feed-forward sublayer. MHA contains H parallel heads, and each head h_i corresponds to an independent scaled dot-product attention function, which allows the model to jointly attend to different subspaces.
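A simplified version of one such encoder layer, using PyTorch's built-in nn.MultiheadAttention rather than a from-scratch implementation; the hyperparameters below are illustrative:

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: multi-head self-attention + feed-forward,
    each with a residual connection and layer normalization."""
    def __init__(self, embed_dim=768, num_heads=12, ff_dim=3072, dropout=0.1):
        super().__init__()
        self.mha = nn.MultiheadAttention(embed_dim, num_heads,
                                         dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim), nn.GELU(), nn.Linear(ff_dim, embed_dim))
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):                  # (B, num_patches, embed_dim)
        attn_out, _ = self.mha(x, x, x)    # self-attention over the patch sequence
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ff(x))
```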
On the decoder side, we add positional embeddings to the word embedding features and take the sum of the encoder output and the results of the decoder's first layer as the input.
The decoder consists of Nd stacked identical layers, each containing a masked multi-head self-attention sublayer, followed by a multi-head cross-attention sublayer and a feed-forward sublayer.
The output feature of the last decoder layer is used to predict the next word via a linear layer whose output dimension equals the vocabulary size.
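A corresponding sketch of one decoder layer plus the final vocabulary projection; again, this is an illustrative PyTorch-style example, not this repo's exact code:

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Masked self-attention over the caption, cross-attention over the encoder
    output, then a feed-forward sublayer, each followed by residual + LayerNorm."""
    def __init__(self, embed_dim=768, num_heads=12, ff_dim=3072):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim), nn.GELU(), nn.Linear(ff_dim, embed_dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(embed_dim) for _ in range(3))

    def forward(self, tgt, memory):
        T = tgt.size(1)
        # Causal mask so each position attends only to earlier words.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=tgt.device), 1)
        x, _ = self.self_attn(tgt, tgt, tgt, attn_mask=causal)
        tgt = self.norm1(tgt + x)
        x, _ = self.cross_attn(tgt, memory, memory)   # attend to encoder output
        tgt = self.norm2(tgt + x)
        return self.norm3(tgt + self.ff(tgt))

# Final projection: last decoder output -> vocabulary logits (vocab size is illustrative)
to_vocab = nn.Linear(768, 10000)
```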
- The CNN + LSTM model achieved a BLEU-1 score of 0.553085 and a BLEU-2 score of 0.333717 (see the BLEU example below).
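For reference, BLEU-1 and BLEU-2 scores like these can be computed with NLTK's corpus_bleu; the captions below are placeholders, not COCO data.

```python
from nltk.translate.bleu_score import corpus_bleu

# One list of reference captions (tokenized) per image, and one hypothesis per image.
references = [[["a", "dog", "runs", "on", "the", "beach"],
               ["a", "dog", "is", "running", "along", "the", "shore"]]]
hypotheses = [["a", "dog", "runs", "on", "the", "sand"]]

bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0))
bleu2 = corpus_bleu(references, hypotheses, weights=(0.5, 0.5, 0, 0))
print(f"BLEU-1: {bleu1:.4f}, BLEU-2: {bleu2:.4f}")
```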
- Enhanced Image Understanding: Generates more accurate and context-aware captions by capturing complex relationships within images.
- Accessibility: Improves accessibility for visually impaired users by converting visual information into descriptive text.
- Image Search and Organization: Enhances image search engines by providing detailed descriptions, aiding better indexing and retrieval.
- E-commerce: Provides detailed product descriptions in online catalogs, improving user experience and product discovery.
- Special thanks to SRA VJTI for Eklavya 2024.
- Heartfelt gratitude to our mentors Aryan Nanda and Abhinav Ananthu for guiding us throughout this project.
- Deep Learning Specialization for their course on Neural Networks and Deep Learning.
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale for understanding Vision Transformers (ViTs).
- Attention Is All You Need for understanding the Transformer architecture.
- Deep Residual Learning for Image Recognition for understanding residual functions.