- About the project
- Getting Started
- Theory and Approach
- Results
- Applications
- Contributors
- Acknowledgements
The aim of this project is to generate descriptive captions for images by combining the power of Transformers and computer vision.
This project focuses on image captioning using Vision Transformers (ViT), implemented from scratch. Initially, a basic CNN + LSTM approach was employed to establish a baseline. We then transitioned to a more advanced Vision Transformer (ViT) model to leverage its capability in capturing long-range dependencies in image data.
The project uses the COCO 2017 dataset, a comprehensive dataset that provides 5 descriptive captions for each image.
- Clone the repo:
  `git clone https://github.com/sneha31415/vision_transformers_from_scratch.git`
- Navigate to the project directory:
  `cd vision_transformers_from_scratch`
This is the complete architecture of the CNN + LSTM image captioning model. The CNN encoder finds patterns in the image and encodes them into a vector, which is passed to the LSTM decoder; the decoder outputs one word at each time step to best describe the image. Generation stops when the {endseq} token is produced or the maximum caption length is reached, and the resulting sentence is the caption for that image.
A pretrained CNN model (ResNet50) is used for feature extraction, transforming input images into fixed-length feature vectors.
An LSTM network is utilized to generate captions by taking image features and previous word embeddings as input to predict the next word.
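To make the baseline concrete, here is a minimal PyTorch-style sketch of such an encoder-decoder pair; the layer sizes and names are illustrative, not the exact ones used in this repo.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CNNEncoder(nn.Module):
    """ResNet50 backbone that maps an image to a fixed-length feature vector."""
    def __init__(self, embed_size=256):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier head
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):                    # (B, 3, H, W)
        feats = self.backbone(images).flatten(1)  # (B, 2048)
        return self.fc(feats)                     # (B, embed_size)

class LSTMDecoder(nn.Module):
    """LSTM that predicts the next word from the image feature and previous words."""
    def __init__(self, embed_size=256, hidden_size=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, img_feats, captions):
        # Prepend the image feature as the first "token" of the input sequence.
        inputs = torch.cat([img_feats.unsqueeze(1), self.embed(captions)], dim=1)
        out, _ = self.lstm(inputs)
        return self.fc(out)                       # (B, T+1, vocab_size)
```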
Before heading into Vision Transformers, let's first understand Transformers.
Since their introduction in the 2017 Google Brain paper Attention Is All You Need, Transformers have attracted enormous interest for their capabilities in NLP.
In the Transformer model:
- Encoder: Converts input tokens into continuous representations using self-attention to capture relationships between all tokens simultaneously.
- Decoder: Generates output tokens by attending to both the encoder's output and previously generated tokens, using masked self-attention and cross-attention (see the attention sketch below).
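As a rough illustration (not this repo's exact code), the scaled dot-product attention at the core of both encoder and decoder, with an optional mask for the decoder's masked self-attention, can be sketched as:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)          # (..., T_q, T_k)
    if mask is not None:
        # Positions where mask == 0 (e.g. future tokens) are blocked.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v
```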
Vision Transformers were introduced in the 2020 paper An Image is Worth 16x16 Words.
The Vision Transformer, or ViT, is a model that applies a Transformer-like architecture over patches of the image. An image is split into fixed-size patches, each of which is linearly embedded; position embeddings are added, and the resulting sequence of vectors is fed to a standard Transformer encoder.
As depicted in the figure, instead of using a pretrained CNN or Faster R-CNN model to extract spatial or bottom-up features as in previous methods, we divide the original image into a sequence of image patches to match the input format of the Transformer. We use a Conv2D layer (for performance reasons) with stride and kernel size equal to the patch size; a linear layer could be used instead. The resulting 4D tensor is then reshaped to 3D to flatten the patches, and learnable position embeddings are added.
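A minimal sketch of this patch-embedding step, assuming a square input image and typical ViT hyperparameters (patch size 16, embedding dimension 768), which are illustrative rather than this repo's exact settings:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split the image into patches with a Conv2D (stride = kernel = patch size),
    flatten the patches into a sequence, and add learnable position embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        num_patches = (img_size // patch_size) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, x):                  # (B, 3, H, W)
        x = self.proj(x)                   # (B, embed_dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)   # (B, num_patches, embed_dim)
        return x + self.pos_embed
```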
The encoder of CPTR consists of Nx stacked identical layers, each consisting of a multi-head self-attention (MHA) sublayer followed by a position-wise feed-forward sublayer. MHA contains H parallel heads, and each head h_i corresponds to an independent scaled dot-product attention function, which allows the model to jointly attend to different subspaces.
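A simplified version of one such encoder layer, using PyTorch's built-in nn.MultiheadAttention rather than a from-scratch implementation; the hyperparameters below are illustrative:

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: multi-head self-attention + feed-forward,
    each with a residual connection and layer normalization."""
    def __init__(self, embed_dim=768, num_heads=12, ff_dim=3072, dropout=0.1):
        super().__init__()
        self.mha = nn.MultiheadAttention(embed_dim, num_heads,
                                         dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim), nn.GELU(), nn.Linear(ff_dim, embed_dim))
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):                  # (B, num_patches, embed_dim)
        attn_out, _ = self.mha(x, x, x)    # self-attention over the patch sequence
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ff(x))
```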
On the decoder side, we add positional embeddings to the word embedding features and take the sum of the encoder output and the results of the decoder's first layer as the input.
The decoder consists of Nd stacked identical layers, each containing a masked multi-head self-attention sublayer, followed by a multi-head cross-attention sublayer and a feed-forward sublayer.
The output feature of the last decoder layer is used to predict the next word via a linear layer whose output dimension equals the vocabulary size.
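A corresponding sketch of one decoder layer plus the final vocabulary projection; again, this is an illustrative PyTorch-style example, not this repo's exact code:

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Masked self-attention over the caption, cross-attention over the encoder
    output, then a feed-forward sublayer, each followed by residual + LayerNorm."""
    def __init__(self, embed_dim=768, num_heads=12, ff_dim=3072):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim), nn.GELU(), nn.Linear(ff_dim, embed_dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(embed_dim) for _ in range(3))

    def forward(self, tgt, memory):
        T = tgt.size(1)
        # Causal mask so each position attends only to earlier words.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=tgt.device), 1)
        x, _ = self.self_attn(tgt, tgt, tgt, attn_mask=causal)
        tgt = self.norm1(tgt + x)
        x, _ = self.cross_attn(tgt, memory, memory)   # attend to encoder output
        tgt = self.norm2(tgt + x)
        return self.norm3(tgt + self.ff(tgt))

# Final projection: last decoder output -> vocabulary logits (vocab size is illustrative)
to_vocab = nn.Linear(768, 10000)
```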
- The CNN + LSTM model achieved a BLEU-1 score of 0.553085 and a BLEU-2 score of 0.333717 (see the BLEU example below).
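For reference, BLEU-1 and BLEU-2 scores like these can be computed with NLTK's corpus_bleu; the captions below are placeholders, not COCO data.

```python
from nltk.translate.bleu_score import corpus_bleu

# One list of reference captions (tokenized) per image, and one hypothesis per image.
references = [[["a", "dog", "runs", "on", "the", "beach"],
               ["a", "dog", "is", "running", "along", "the", "shore"]]]
hypotheses = [["a", "dog", "runs", "on", "the", "sand"]]

bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0))
bleu2 = corpus_bleu(references, hypotheses, weights=(0.5, 0.5, 0, 0))
print(f"BLEU-1: {bleu1:.4f}, BLEU-2: {bleu2:.4f}")
```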
- Enhanced Image Understanding: Generates more accurate and context-aware captions by capturing complex relationships within images.
- Accessibility: Improves accessibility for visually impaired users by converting visual information into descriptive text.
- Image Search and Organization: Enhances image search engines by providing detailed descriptions, aiding better indexing and retrieval.
- E-commerce: Provides detailed product descriptions in online catalogs, improving user experience and product discovery.
- Special thanks to SRA VJTI for Eklavya 2024.
- Heartfelt gratitude to our mentors Aryan Nanda and Abhinav Ananthu for guiding us throughout this project.
- Deep Learning Specialization for their course on Neural Networks and Deep Learning.
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale for understanding Vision Transformers (ViTs).
- Attention Is All You Need for understanding the Transformer architecture.
- Deep Residual Learning for Image Recognition for understanding residual functions.