Merge branch 'johko:main' into madhav_obj_detection
miniMaddy authored Feb 9, 2024
2 parents 94c6c69 + ca56508 commit 0f1babd
Showing 16 changed files with 4,799 additions and 1 deletion.
20 changes: 20 additions & 0 deletions .github/workflows/build_documentation.yml
@@ -0,0 +1,20 @@
name: Build documentation

on:
  push:
    branches:
      - main

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
    with:
      commit_sha: ${{ github.sha }}
      package: computer-vision-course
      package_name: computer-vision-course
      repo_owner: johko
      path_to_docs: computer-vision-course/chapters/
      additional_args: --not_python_module
      languages: en
    secrets:
      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
21 changes: 21 additions & 0 deletions .github/workflows/build_pr_documentation.yml
@@ -0,0 +1,21 @@
name: Build PR Documentation

on:
  pull_request:

concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
    with:
      commit_sha: ${{ github.event.pull_request.head.sha }}
      pr_number: ${{ github.event.number }}
      package: computer-vision-course
      package_name: computer-vision-course
      repo_owner: johko
      path_to_docs: computer-vision-course/chapters/
      additional_args: --not_python_module
      languages: en
21 changes: 21 additions & 0 deletions .github/workflows/quality.yml
@@ -0,0 +1,21 @@
name: Quality checks

on:
  push:
    branches:
      - main
  pull_request:

jobs:
  quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python 3.8
        uses: actions/setup-python@v2
        with:
          python-version: 3.8
      - name: Install Python dependencies
        run: pip install black
      - name: Run quality check
        run: make quality
17 changes: 17 additions & 0 deletions .github/workflows/upload_pr_documentation.yml
@@ -0,0 +1,17 @@
name: Upload PR Documentation

on:
  workflow_run:
    workflows: ["Build PR Documentation"]
    types:
      - completed

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@main
    with:
      package_name: computer-vision-course
      hub_base_path: https://moon-ci-docs.huggingface.co/learn
    secrets:
      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
      comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}
86 changes: 86 additions & 0 deletions chapters/en/Unit 13 - Outlook/i-jepa.mdx
@@ -0,0 +1,86 @@
# Image-based Joint-Embedding Predictive Architecture (I-JEPA)

## Overview

The Image-based Joint-Embedding Predictive Architecture (I-JEPA) is a groundbreaking self-supervised learning model [introduced by Meta AI in 2023](https://ai.meta.com/blog/yann-lecun-ai-model-i-jepa/). It tackles the challenge of understanding images without relying on traditional labels or hand-crafted data augmentations.
To get to know I-JEPA better, let’s first discuss a few concepts.

### Invariance-based vs. Generative Pretraining Methods

Broadly speaking, there are two main approaches to self-supervised learning from images: invariance-based methods and generative methods. Each has its strengths and weaknesses.

- **Invariance-based methods**: The model is trained to produce similar embeddings for different views of the same image. These views are hand-crafted through the image augmentations we are all familiar with, such as rotating, scaling, and cropping (see the sketch after this list). These methods produce representations with a high semantic level, but they introduce strong biases that may be detrimental to certain downstream tasks. For example, image classification and instance segmentation do not require the same invariances.

- **Generative methods**: Here the model tries to reconstruct the input image, which is why these methods are sometimes called reconstruction-based self-supervised learning. Masks hide patches of the input image, and the model tries to reconstruct the corrupted patches at the pixel or token level (keep this point in mind). The masked approach generalizes easily beyond the image modality, but it does not produce representations at the quality level of invariance-based methods. These methods are also computationally expensive and require large datasets for robust training.
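
To make the hand-crafted views of invariance-based methods concrete, here is a minimal sketch that builds two augmented views of the same image with torchvision. The specific transforms and parameter values are arbitrary illustrative choices, not the recipe of any particular method.

```python
import torch
from torchvision import transforms

# Hypothetical augmentation pipeline producing hand-crafted "views" of an image.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),  # cropping + scaling
    transforms.RandomRotation(degrees=15),                # rotating
    transforms.RandomHorizontalFlip(),
])

image = torch.rand(3, 256, 256)  # stand-in for a real image tensor
view_1 = augment(image)
view_2 = augment(image)  # a second, differently augmented view of the same image
# An invariance-based model is trained so that the embeddings of view_1 and view_2 match.
```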

Now let’s talk about Joint-Embedding Architectures.

### Joint-Embedding Architectures

This is a recent and popular approach to self-supervised learning from images in which two networks are trained to produce similar embeddings for different views of the same image. Basically, the two networks are trained to "speak the same language" about different views of the same picture. A common choice is the Siamese architecture, where the two networks share the same weights. But this approach has its own problems:

- **Representation collapse**: A case in which the model produces the same representation regardless of the input.

- **Input compatibility criteria**: Finding a good and appropriate compatibility measure between inputs can be challenging.

An example of a Joint-Embedding Architecture is [VICReg](https://arxiv.org/abs/2105.04906); a minimal sketch of the shared-weights idea follows the tip below.

<Tip>
Different training methods can be employed to train Joint-Embedding Architectures, for example:

- Contrastive methods
- Non-Contrastive methods
- Clustering methods
</Tip>
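
To make the shared-weights idea concrete, below is a minimal, hypothetical sketch of a Siamese joint-embedding setup in PyTorch. The encoder, the negative cosine-similarity objective, and all names are illustrative choices; they are not the exact formulation used by VICReg or I-JEPA (VICReg, for instance, adds variance and covariance regularization terms to fight representation collapse).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical shared encoder: both views pass through the *same* weights.
encoder = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 128),  # 128-d embedding
)

def joint_embedding_loss(view_1, view_2):
    """Encourage the two views of the same image to have similar embeddings."""
    z1 = encoder(view_1)  # (batch, 128)
    z2 = encoder(view_2)  # (batch, 128)
    # Negative cosine similarity: minimized when the embeddings align.
    return -F.cosine_similarity(z1, z2, dim=-1).mean()

# Two augmented views of the same batch of images (see the earlier sketch).
view_1 = torch.randn(8, 3, 224, 224)
view_2 = torch.randn(8, 3, 224, 224)
loss = joint_embedding_loss(view_1, view_2)
```

Trained naively, this objective alone collapses to a constant embedding, which is exactly the representation-collapse problem listed above; contrastive negatives, stop-gradients, or VICReg-style regularizers are the usual remedies.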

So far so good. Now, on to I-JEPA. As a start, the picture below from the I-JEPA paper shows the difference between joint-embedding methods, generative methods, and I-JEPA.

![I-JEPA Comparisons](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/i-jepa-1.png)

### Image-based Joint-Embedding Predictive Architecture (I-JEPA)

I-JEPA tries to improve on both generative and joint-embedding methods. Conceptually, it is similar to generative methods but with the following key differences:

1. **Abstract prediction**: This is arguably the most fascinating aspect of I-JEPA. Remember how generative methods try to reconstruct the corrupted input at the pixel level? Unlike them, I-JEPA predicts the missing content in representation space using a dedicated predictor, which is why this is called abstract prediction. As a result, the model learns more powerful semantic features.

2. **Multi-block masking**: Another design choice that improves the semantic features produced by I-JEPA is masking sufficiently large blocks of the input image (see the sketch below).
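
To illustrate what masking "sufficiently large blocks" could look like, here is a hypothetical sampler that selects a few large rectangular blocks of patches on a ViT patch grid. The grid size, block count, and block scale are illustrative values and do not reproduce the exact sampling distribution from the paper.

```python
import random

def sample_block_masks(grid_size=14, num_blocks=4, block_scale=(0.15, 0.2)):
    """Sample large rectangular blocks of patch indices on a grid_size x grid_size patch grid."""
    masked = set()
    num_patches = grid_size * grid_size
    for _ in range(num_blocks):
        # Each block covers roughly 15-20% of the patches (illustrative values).
        target_area = random.uniform(*block_scale) * num_patches
        height = min(grid_size, max(1, int(target_area ** 0.5)))
        width = min(grid_size, max(1, int(target_area / height)))
        top = random.randint(0, grid_size - height)
        left = random.randint(0, grid_size - width)
        for row in range(top, top + height):
            for col in range(left, left + width):
                masked.add(row * grid_size + col)
    return sorted(masked)

# Indices of patches whose representations the predictor must recover.
target_idx = sample_block_masks()
```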

### I-JEPA Components

The previous diagrams show and compare the I-JEPA architecture. Below is a brief description of its main components:

1. **Target Encoder (y-encoder)**: Encodes the target image; the target blocks are obtained by masking its output.

2. **Context Encoder (x-encoder)**: Encodes a randomly sampled context block from the image to obtain a corresponding patch-level representation.

3. **Predictor**: Takes as input the output of the context encoder and a mask token for each patch we wish to predict, and predicts the representations of the masked target blocks.

The target encoder, context encoder, and predictor all use a Vision Transformer (ViT) architecture; you can find a refresher on ViTs in Unit 3 of this course.
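
Putting the three components together, here is a highly simplified, hypothetical sketch of one I-JEPA training step in PyTorch. The transformer encoders stand in for the ViTs, positional embeddings are omitted, and all module names, sizes, and the EMA momentum are illustrative; this is a sketch of the idea, not the official implementation.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim, num_patches = 256, 196  # e.g. patch tokens from a 14x14 grid

# Stand-ins for the ViT encoders: they map (batch, num_patches, embed_dim)
# patch tokens to patch-level representations of the same shape.
context_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True), num_layers=2
)
target_encoder = copy.deepcopy(context_encoder)  # updated by EMA, not by gradients
for p in target_encoder.parameters():
    p.requires_grad = False

# Predictor: from context representations plus one mask token per target patch,
# predict the target representations (prediction happens in representation space).
mask_token = nn.Parameter(torch.zeros(embed_dim))
predictor = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True), num_layers=1
)

def ijepa_step(patch_tokens, context_idx, target_idx):
    # 1) Target representations: encode the full image, then keep the target patches.
    with torch.no_grad():
        target_repr = target_encoder(patch_tokens)[:, target_idx]      # (B, T, D)
    # 2) Context representations: encode only the visible context patches.
    context_repr = context_encoder(patch_tokens[:, context_idx])       # (B, C, D)
    # 3) Predict the target representations from context + mask tokens.
    mask_tokens = mask_token.expand(patch_tokens.size(0), len(target_idx), -1)
    pred = predictor(torch.cat([context_repr, mask_tokens], dim=1))[:, -len(target_idx):]
    return F.mse_loss(pred, target_repr)

def ema_update(momentum=0.996):
    """Move the target encoder slowly toward the context encoder after each step."""
    with torch.no_grad():
        for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
            p_t.mul_(momentum).add_((1 - momentum) * p_c)

# Toy usage: 8 images, context = first 100 patches, targets from the masking sketch above.
tokens = torch.randn(8, num_patches, embed_dim)
loss = ijepa_step(tokens, context_idx=list(range(100)), target_idx=list(range(150, 196)))
```

In the actual method, the mask tokens carry positional embeddings for the target locations, the context block is sampled so that it does not overlap the target blocks, and only the context encoder and predictor receive gradient updates, with the target encoder following via the exponential moving average.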

The image below from the paper illustrates how I-JEPA works.

![I-JEPA method](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/i-jepa-2.png)

## Why It Matters

So, why I-JEPA? I-JEPA introduces several new design choices while remaining a simple and efficient method for learning semantic image representations without relying on hand-crafted data augmentations. Briefly:

1. I-JEPA outperforms pixel-reconstruction methods such as Masked Autoencoders (MAE) on ImageNet-1K linear probing, semi-supervised 1% ImageNet-1K, and semantic transfer tasks.

2. I-JEPA is competitive with view-invariant pretraining approaches on semantic tasks and achieves better performance on low-level vision tasks such as object counting and depth prediction.

3. By using a simpler model with less rigid inductive bias, I-JEPA is applicable to a wider set of tasks.

4. I-JEPA is also scalable and efficient. Pretraining on ImageNet requires *less than 1200 GPU hours*.

## References

- [I-JEPA paper](https://arxiv.org/abs/2301.08243)

- [Meta's blog post about I-JEPA](https://ai.meta.com/blog/yann-lecun-ai-model-i-jepa/)

- [I-JEPA official GitHub repository](https://github.com/facebookresearch/ijepa)




2 changes: 1 addition & 1 deletion chapters/en/_toctree.yml
@@ -36,7 +36,7 @@
     title: Introduction to Vision Language Models
   - local: Unit 4 - Multimodal Models/transfer_learning
     title: Transfer learning
-  - local: Unit 4 - Mulitmodal Models/supplementary-material
+  - local: Unit 4 - Multimodal Models/supplementary-material
     title: Supplemental reading and resources
 
 - title: Unit 5. Multimodal Models