Merge branch 'johko:main' into madhav_obj_detection
miniMaddy authored Feb 9, 2024
2 parents 94c6c69 + ca56508 commit 0f1babd
Showing 16 changed files with 4,799 additions and 1 deletion.
20 changes: 20 additions & 0 deletions .github/workflows/build_documentation.yml
@@ -0,0 +1,20 @@
name: Build documentation

on:
  push:
    branches:
      - main

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
    with:
      commit_sha: ${{ github.sha }}
      package: computer-vision-course
      package_name: computer-vision-course
      repo_owner: johko
      path_to_docs: computer-vision-course/chapters/
      additional_args: --not_python_module
      languages: en
    secrets:
      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
21 changes: 21 additions & 0 deletions .github/workflows/build_pr_documentation.yml
@@ -0,0 +1,21 @@
name: Build PR Documentation

on:
  pull_request:

concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
    with:
      commit_sha: ${{ github.event.pull_request.head.sha }}
      pr_number: ${{ github.event.number }}
      package: computer-vision-course
      package_name: computer-vision-course
      repo_owner: johko
      path_to_docs: computer-vision-course/chapters/
      additional_args: --not_python_module
      languages: en
21 changes: 21 additions & 0 deletions .github/workflows/quality.yml
@@ -0,0 +1,21 @@
name: Quality checks

on:
  push:
    branches:
      - main
  pull_request:

jobs:
  quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python 3.8
        uses: actions/setup-python@v2
        with:
          python-version: 3.8
      - name: Install Python dependencies
        run: pip install black
      - name: Run quality check
        run: make quality
17 changes: 17 additions & 0 deletions .github/workflows/upload_pr_documentation.yml
@@ -0,0 +1,17 @@
name: Upload PR Documentation

on:
  workflow_run:
    workflows: ["Build PR Documentation"]
    types:
      - completed

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@main
    with:
      package_name: computer-vision-course
      hub_base_path: https://moon-ci-docs.huggingface.co/learn
    secrets:
      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
      comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}
86 changes: 86 additions & 0 deletions chapters/en/Unit 13 - Outlook/i-jepa.mdx
@@ -0,0 +1,86 @@
# Image-based Joint-Embedding Predictive Architecture (I-JEPA)

## Overview

The Image-based Joint-Embedding Predictive Architecture (I-JEPA) is a groundbreaking self-supervised learning model [introduced by Meta AI in 2023](https://ai.meta.com/blog/yann-lecun-ai-model-i-jepa/). It tackles the challenge of understanding images without relying on traditional labels or hand-crafted data augmentations.
To get to know I-JEPA better, let’s first discuss a few concepts.

### Invariance-based vs. Generative Pretraining Methods

Broadly speaking, there are two main approaches to self-supervised learning from images: invariance-based methods and generative methods. Each has its strengths and weaknesses.

- **Invariance-based methods**: The model is trained to produce similar embeddings for different views of the same image. These views are hand-crafted through the image augmentations we are all familiar with, such as rotating, scaling, and cropping (see the sketch after this list). These methods produce representations with a high semantic level, but they introduce strong biases that may be detrimental to certain downstream tasks. For example, image classification and instance segmentation do not require the same invariances.

- **Generative methods**: Here the model tries to reconstruct the input image, which is why these methods are sometimes called reconstruction-based self-supervised learning. Masks hide patches of the input image, and the model tries to reconstruct the corrupted patches at the pixel or token level (keep this point in mind). The masked approach generalizes easily beyond the image modality, but it does not produce representations at the quality level of invariance-based methods. These methods are also computationally expensive and require large datasets for robust training.
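
To make the hand-crafted views of invariance-based methods concrete, here is a minimal sketch that builds two augmented views of the same image with torchvision. The specific transforms and parameter values are arbitrary illustrative choices, not the recipe of any particular method.

```python
import torch
from torchvision import transforms

# Hypothetical augmentation pipeline producing hand-crafted "views" of an image.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),  # cropping + scaling
    transforms.RandomRotation(degrees=15),                # rotating
    transforms.RandomHorizontalFlip(),
])

image = torch.rand(3, 256, 256)  # stand-in for a real image tensor
view_1 = augment(image)
view_2 = augment(image)  # a second, differently augmented view of the same image
# An invariance-based model is trained so that the embeddings of view_1 and view_2 match.
```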

Now let’s talk about Joint-Embedding Architectures.

### Joint-Embedding Architectures

This is a recent and popular approach to self-supervised learning from images in which two networks are trained to produce similar embeddings for different views of the same image. Basically, the two networks are trained to "speak the same language" about different views of the same picture. A common choice is the Siamese architecture, where the two networks share the same weights. But this approach has its own problems:

- **Representation collapse**: A case in which the model produces the same representation regardless of the input.

- **Input compatibility criteria**: Finding a good and appropriate compatibility measure between inputs can be challenging.

An example of a Joint-Embedding Architecture is [VICReg](https://arxiv.org/abs/2105.04906); a minimal sketch of the shared-weights idea follows the tip below.

<Tip>
Different training methods can be employed to train Joint-Embedding Architectures, for example:

- Contrastive methods
- Non-Contrastive methods
- Clustering methods
</Tip>
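
To make the shared-weights idea concrete, below is a minimal, hypothetical sketch of a Siamese joint-embedding setup in PyTorch. The encoder, the negative cosine-similarity objective, and all names are illustrative choices; they are not the exact formulation used by VICReg or I-JEPA (VICReg, for instance, adds variance and covariance regularization terms to fight representation collapse).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical shared encoder: both views pass through the *same* weights.
encoder = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 128),  # 128-d embedding
)

def joint_embedding_loss(view_1, view_2):
    """Encourage the two views of the same image to have similar embeddings."""
    z1 = encoder(view_1)  # (batch, 128)
    z2 = encoder(view_2)  # (batch, 128)
    # Negative cosine similarity: minimized when the embeddings align.
    return -F.cosine_similarity(z1, z2, dim=-1).mean()

# Two augmented views of the same batch of images (see the earlier sketch).
view_1 = torch.randn(8, 3, 224, 224)
view_2 = torch.randn(8, 3, 224, 224)
loss = joint_embedding_loss(view_1, view_2)
```

Trained naively, this objective alone collapses to a constant embedding, which is exactly the representation-collapse problem listed above; contrastive negatives, stop-gradients, or VICReg-style regularizers are the usual remedies.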

So far so good. Now, on to I-JEPA. As a start, the picture below from the I-JEPA paper shows the difference between joint-embedding methods, generative methods, and I-JEPA.

![I-JEPA Comparisons](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/i-jepa-1.png)

### Image-based Joint-Embedding Predictive Architecture (I-JEPA)

I-JEPA tries to improve on both generative and joint-embedding methods. Conceptually, it is similar to generative methods but with the following key differences:

1. **Abstract prediction**: This is arguably the most fascinating aspect of I-JEPA. Remember how generative methods try to reconstruct the corrupted input at the pixel level? Unlike them, I-JEPA predicts the missing content in representation space using a dedicated predictor, which is why this is called abstract prediction. As a result, the model learns more powerful semantic features.

2. **Multi-block masking**: Another design choice that improves the semantic features produced by I-JEPA is masking sufficiently large blocks of the input image (see the sketch below).
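
To illustrate what masking "sufficiently large blocks" could look like, here is a hypothetical sampler that selects a few large rectangular blocks of patches on a ViT patch grid. The grid size, block count, and block scale are illustrative values and do not reproduce the exact sampling distribution from the paper.

```python
import random

def sample_block_masks(grid_size=14, num_blocks=4, block_scale=(0.15, 0.2)):
    """Sample large rectangular blocks of patch indices on a grid_size x grid_size patch grid."""
    masked = set()
    num_patches = grid_size * grid_size
    for _ in range(num_blocks):
        # Each block covers roughly 15-20% of the patches (illustrative values).
        target_area = random.uniform(*block_scale) * num_patches
        height = min(grid_size, max(1, int(target_area ** 0.5)))
        width = min(grid_size, max(1, int(target_area / height)))
        top = random.randint(0, grid_size - height)
        left = random.randint(0, grid_size - width)
        for row in range(top, top + height):
            for col in range(left, left + width):
                masked.add(row * grid_size + col)
    return sorted(masked)

# Indices of patches whose representations the predictor must recover.
target_idx = sample_block_masks()
```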

### I-JEPA Components

The previous diagrams show and compare the I-JEPA architecture. Below is a brief description of its main components:

1. **Target Encoder (y-encoder)**: Encodes the target image; the target blocks are obtained by masking its output.

2. **Context Encoder (x-encoder)**: Encodes a randomly sampled context block from the image to obtain a corresponding patch-level representation.

3. **Predictor**: Takes as input the output of the context encoder and a mask token for each patch we wish to predict, and predicts the representations of the masked target blocks.

The target encoder, context encoder, and predictor all use a Vision Transformer (ViT) architecture; you can find a refresher on ViTs in Unit 3 of this course.
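
Putting the three components together, here is a highly simplified, hypothetical sketch of one I-JEPA training step in PyTorch. The transformer encoders stand in for the ViTs, positional embeddings are omitted, and all module names, sizes, and the EMA momentum are illustrative; this is a sketch of the idea, not the official implementation.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim, num_patches = 256, 196  # e.g. patch tokens from a 14x14 grid

# Stand-ins for the ViT encoders: they map (batch, num_patches, embed_dim)
# patch tokens to patch-level representations of the same shape.
context_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True), num_layers=2
)
target_encoder = copy.deepcopy(context_encoder)  # updated by EMA, not by gradients
for p in target_encoder.parameters():
    p.requires_grad = False

# Predictor: from context representations plus one mask token per target patch,
# predict the target representations (prediction happens in representation space).
mask_token = nn.Parameter(torch.zeros(embed_dim))
predictor = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True), num_layers=1
)

def ijepa_step(patch_tokens, context_idx, target_idx):
    # 1) Target representations: encode the full image, then keep the target patches.
    with torch.no_grad():
        target_repr = target_encoder(patch_tokens)[:, target_idx]      # (B, T, D)
    # 2) Context representations: encode only the visible context patches.
    context_repr = context_encoder(patch_tokens[:, context_idx])       # (B, C, D)
    # 3) Predict the target representations from context + mask tokens.
    mask_tokens = mask_token.expand(patch_tokens.size(0), len(target_idx), -1)
    pred = predictor(torch.cat([context_repr, mask_tokens], dim=1))[:, -len(target_idx):]
    return F.mse_loss(pred, target_repr)

def ema_update(momentum=0.996):
    """Move the target encoder slowly toward the context encoder after each step."""
    with torch.no_grad():
        for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
            p_t.mul_(momentum).add_((1 - momentum) * p_c)

# Toy usage: 8 images, context = first 100 patches, targets from the masking sketch above.
tokens = torch.randn(8, num_patches, embed_dim)
loss = ijepa_step(tokens, context_idx=list(range(100)), target_idx=list(range(150, 196)))
```

In the actual method, the mask tokens carry positional embeddings for the target locations, the context block is sampled so that it does not overlap the target blocks, and only the context encoder and predictor receive gradient updates, with the target encoder following via the exponential moving average.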

The image below from the paper illustrates how I-JEPA works.

![I-JEPA method](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/i-jepa-2.png)

## Why It Matters

So, why I-JEPA? I-JEPA introduces several new design choices while remaining a simple and efficient method for learning semantic image representations without relying on hand-crafted data augmentations. Briefly:

1. I-JEPA outperforms pixel-reconstruction methods such as Masked Autoencoders (MAE) on ImageNet-1K linear probing, semi-supervised 1% ImageNet-1K, and semantic transfer tasks.

2. I-JEPA is competitive with view-invariant pretraining approaches on semantic tasks and achieves better performance on low-level vision tasks such as object counting and depth prediction.

3. By using a simpler model with less rigid inductive bias, I-JEPA is applicable to a wider set of tasks.

4. I-JEPA is also scalable and efficient. Pretraining on ImageNet requires *less than 1200 GPU hours*.

## References

- [I-JEPA paper](https://arxiv.org/abs/2301.08243)

- [Meta's blog post about I-JEPA](https://ai.meta.com/blog/yann-lecun-ai-model-i-jepa/)

- [I-JEPA official GitHub repository](https://github.com/facebookresearch/ijepa)




2 changes: 1 addition & 1 deletion chapters/en/_toctree.yml
@@ -36,7 +36,7 @@
     title: Introduction to Vision Language Models
   - local: Unit 4 - Multimodal Models/transfer_learning
     title: Transfer learning
-  - local: Unit 4 - Mulitmodal Models/supplementary-material
+  - local: Unit 4 - Multimodal Models/supplementary-material
     title: Supplemental reading and resources
 
 - title: Unit 5. Multimodal Models