From d82e0373658f16f020b6f2ff23791a07910f46ac Mon Sep 17 00:00:00 2001
From: Jungnerd <46880056+jungnerd@users.noreply.github.com>
Date: Thu, 30 May 2024 18:51:09 +0900
Subject: [PATCH 01/11] Update introduction-to-video.mdx

added video processing part
---
 .../introduction-to-video.mdx                 | 32 +++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/chapters/en/unit7/video-processing/introduction-to-video.mdx b/chapters/en/unit7/video-processing/introduction-to-video.mdx
index 9a9f95d86..edcc9c320 100644
--- a/chapters/en/unit7/video-processing/introduction-to-video.mdx
+++ b/chapters/en/unit7/video-processing/introduction-to-video.mdx
@@ -36,3 +36,35 @@ Codecs, short for “compressor-decompressor” are software or hardware compone
 There are two main types of codecs; "lossless codecs" and "lossy codecs". Lossless codecs are designed to compress data without any loss of quality, while lossy codecs are more designed to compress by removing some of the data resulting in a loss of quality.
 
 In summary, a video is a dynamic multimedia format that combines a series of individual frames, audio, and often additional metadata. It is used in a wide range of applications and can be tailored for different purposes, whether for entertainment, education, communication, or analysis.
+
+## What is Video Processing?
+
+In the research field of computer vision (CV) and artificial intelligence (AI), video processing is all about automatically analyzing video data to understand and interpret both temporal and spatial features. Video data is simply a sequence of time-varying images, where the information is digitized both spatially and temporally. This allows us to perform detailed analysis and manipulation of the content within each frame of the video.
+
+Video processing has become increasingly important in today's technology-driven world, thanks to the rapid advancements in deep learning (DL) and AI. Traditionally, DL research has focused on images, speech, and text, but video data offers a unique and valuable opportunity for research due to its extensive size and complexity. With millions of videos uploaded daily on platforms like YouTube, video data has become a rich resource, driving AI research and enabling groundbreaking applications.
+
+### Applications of Video Processing
+
+- **Surveillance Systems:**
+Video processing plays a critical role in public safety, crime prevention, and traffic monitoring. It enables the automated detection of suspicious activities, helps identify individuals, and enhances the efficiency of surveillance systems.  
+  
+- **Autonomous Driving:**
+In the realm of autonomous driving, video processing is essential for navigation, obstacle detection, and decision-making processes. It allows self-driving cars to understand their surroundings, recognize road signs, and react to changing environments, ensuring safe and efficient transportation. 
+
+- **Healthcare:**
+Video processing has significant applications in healthcare, including medical diagnostics, surgery, and patient monitoring. It helps analyze medical images, provides real-time feedback during surgical procedures, and continuously monitors patients to detect any abnormalities or emergencies.  
+
+### Challenges in Video Processing
+
+- **Computational Demands:**
+Real-time video analysis requires substantial processing power, which poses a significant challenge in developing and deploying efficient video processing systems. High-performance computing resources are essential to meet these demands.
+
+- **Storage Requirements:**
+High-resolution videos generate large volumes of data, leading to storage challenges. Efficient data compression and management techniques are necessary to handle the vast amounts of video data.
+
+- **Privacy and Ethical Concerns:**
+Video processing, especially in surveillance and healthcare, involves handling sensitive information. Ensuring privacy and addressing ethical concerns related to the misuse of video data are crucial considerations that must be carefully managed.
+
+## Conclusion
+
+Video processing is a dynamic and vital area within AI and CV, offering numerous applications and presenting unique challenges. Its importance in modern technology continues to grow, fueled by advancements in deep learning and the increasing availability of video data. In the following sections, we will dive deeper into the deep learning for video processing. You'll explore state-of-the-art models including 3D CNNs and transformers. Additionally, we'll cover various tasks such as object tracking, action recognition, video stabilization, captioning, summarization, and background subtraction. These topics will provide you with a comprehensive understanding of how deep learning models are applied to different video processing challenges and applications.

From d9211f5fc0cba8a0b1c781de01f3d65e173b36f5 Mon Sep 17 00:00:00 2001
From: Jungnerd <46880056+jungnerd@users.noreply.github.com>
Date: Thu, 30 May 2024 18:53:56 +0900
Subject: [PATCH 02/11] Update introduction-to-video.mdx

---
 .../en/unit7/video-processing/introduction-to-video.mdx   | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/chapters/en/unit7/video-processing/introduction-to-video.mdx b/chapters/en/unit7/video-processing/introduction-to-video.mdx
index edcc9c320..637b54e78 100644
--- a/chapters/en/unit7/video-processing/introduction-to-video.mdx
+++ b/chapters/en/unit7/video-processing/introduction-to-video.mdx
@@ -5,8 +5,6 @@ Of course, the real world of Computer Vision has a lot more to offer. Videos are
 
 Given their importance in our society and research, we also want to talk about them here in our course. In this introduction chapter, you will learn some very basic theory behind videos before going on to have a closer look at video processing.
 
-Let's go! 🤓
-
 ## What is a Video?
 
 An image is a binary, two-dimensional (2D) representation of visual data. A video is a multimedia format that sequentially displays these frames or images.
@@ -67,4 +65,8 @@ Video processing, especially in surveillance and healthcare, involves handling s
 
 ## Conclusion
 
-Video processing is a dynamic and vital area within AI and CV, offering numerous applications and presenting unique challenges. Its importance in modern technology continues to grow, fueled by advancements in deep learning and the increasing availability of video data. In the following sections, we will dive deeper into the deep learning for video processing. You'll explore state-of-the-art models including 3D CNNs and transformers. Additionally, we'll cover various tasks such as object tracking, action recognition, video stabilization, captioning, summarization, and background subtraction. These topics will provide you with a comprehensive understanding of how deep learning models are applied to different video processing challenges and applications.
+Video processing is a dynamic and vital area within AI and CV, offering numerous applications and presenting unique challenges. Its importance in modern technology continues to grow, fueled by advancements in deep learning and the increasing availability of video data. In the following sections, we will dive deeper into the deep learning for video processing. You'll explore state-of-the-art models including 3D CNNs and transformers.  
+
+Additionally, we'll cover various tasks such as object tracking, action recognition, video stabilization, captioning, summarization, and background subtraction. These topics will provide you with a comprehensive understanding of how deep learning models are applied to different video processing challenges and applications.
+
+Let's go! 🤓

From 0c1da84f0b0bf6a646d7b33ecf91abb3cc4c8d1d Mon Sep 17 00:00:00 2001
From: Woojun Jung <46880056+jungnerd@users.noreply.github.com>
Date: Tue, 18 Jun 2024 11:46:10 +0900
Subject: [PATCH 03/11] Added credits as a writer

---
 chapters/en/unit0/welcome/welcome.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/en/unit0/welcome/welcome.mdx b/chapters/en/unit0/welcome/welcome.mdx
index 21a73e5b2..58e0020c6 100644
--- a/chapters/en/unit0/welcome/welcome.mdx
+++ b/chapters/en/unit0/welcome/welcome.mdx
@@ -125,7 +125,7 @@ Our goal was to create a computer vision course that is beginner-friendly and th
 **Unit 7 - Video and Video Processing**
 
 - Reviewers: [Ameed Taylor](https://github.com/atayloraerospace)
-- Writers: [Diwakar Basnet](https://github.com/DiwakarBasnet)
+- Writers: [Diwakar Basnet](https://github.com/DiwakarBasnet), [Woojun Jung](https://github.com/jungnerd)
 
 **Unit 8 - 3D Vision, Scene Rendering, and Reconstruction**
 

From 475d3ffe0372294341881af49f25ee60f0890947 Mon Sep 17 00:00:00 2001
From: Woojun Jung <46880056+jungnerd@users.noreply.github.com>
Date: Thu, 20 Jun 2024 10:00:57 +0900
Subject: [PATCH 04/11] Apply suggestions from code review

Co-authored-by: A Taylor <112668339+ATaylorAerospace@users.noreply.github.com>
---
 .../en/unit7/video-processing/introduction-to-video.mdx     | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/chapters/en/unit7/video-processing/introduction-to-video.mdx b/chapters/en/unit7/video-processing/introduction-to-video.mdx
index 637b54e78..bec44c34f 100644
--- a/chapters/en/unit7/video-processing/introduction-to-video.mdx
+++ b/chapters/en/unit7/video-processing/introduction-to-video.mdx
@@ -37,7 +37,8 @@ In summary, a video is a dynamic multimedia format that combines a series of ind
 
 ## What is Video Processing?
 
-In the research field of computer vision (CV) and artificial intelligence (AI), video processing is all about automatically analyzing video data to understand and interpret both temporal and spatial features. Video data is simply a sequence of time-varying images, where the information is digitized both spatially and temporally. This allows us to perform detailed analysis and manipulation of the content within each frame of the video.
+In the research field of computer vision (CV) and artificial intelligence (AI), video processing involves automatically analyzing video data to understand and interpret both temporal and spatial features. Video data is simply a sequence of time-varying images, where the information is digitized both spatially and temporally. This allows us to perform detailed analysis and manipulation of the content within each frame of the video.
+
 
 Video processing has become increasingly important in today's technology-driven world, thanks to the rapid advancements in deep learning (DL) and AI. Traditionally, DL research has focused on images, speech, and text, but video data offers a unique and valuable opportunity for research due to its extensive size and complexity. With millions of videos uploaded daily on platforms like YouTube, video data has become a rich resource, driving AI research and enabling groundbreaking applications.
 
@@ -65,7 +66,8 @@ Video processing, especially in surveillance and healthcare, involves handling s
 
 ## Conclusion
 
-Video processing is a dynamic and vital area within AI and CV, offering numerous applications and presenting unique challenges. Its importance in modern technology continues to grow, fueled by advancements in deep learning and the increasing availability of video data. In the following sections, we will dive deeper into the deep learning for video processing. You'll explore state-of-the-art models including 3D CNNs and transformers.  
+Video processing is a dynamic and vital area within AI and CV, offering numerous applications and presenting unique challenges. Its importance in modern technology continues to grow, fueled by advancements in deep learning and the increasing availability of video data. In the following sections, we will dive deeper into deep learning for video processing. You'll explore state-of-the-art models including 3D CNNs and transformers.  
+
 
 Additionally, we'll cover various tasks such as object tracking, action recognition, video stabilization, captioning, summarization, and background subtraction. These topics will provide you with a comprehensive understanding of how deep learning models are applied to different video processing challenges and applications.
 

From 158a12024ea681080683bcc6670a498543aa1bb5 Mon Sep 17 00:00:00 2001
From: Woojun Jung <46880056+jungnerd@users.noreply.github.com>
Date: Sat, 20 Jul 2024 12:58:25 +0900
Subject: [PATCH 05/11] Apply suggestions from code review

Co-authored-by: Johannes Kolbe <2843485+johko@users.noreply.github.com>
---
 .../en/unit7/video-processing/introduction-to-video.mdx  | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/chapters/en/unit7/video-processing/introduction-to-video.mdx b/chapters/en/unit7/video-processing/introduction-to-video.mdx
index bec44c34f..0b7cf8e57 100644
--- a/chapters/en/unit7/video-processing/introduction-to-video.mdx
+++ b/chapters/en/unit7/video-processing/introduction-to-video.mdx
@@ -37,10 +37,12 @@ In summary, a video is a dynamic multimedia format that combines a series of ind
 
 ## What is Video Processing?
 
-In the research field of computer vision (CV) and artificial intelligence (AI), video processing involves automatically analyzing video data to understand and interpret both temporal and spatial features. Video data is simply a sequence of time-varying images, where the information is digitized both spatially and temporally. This allows us to perform detailed analysis and manipulation of the content within each frame of the video.
+In the research fields of Computer Vision (CV) and Artificial Intelligence (AI), video processing involves automatically analyzing video data to understand and interpret both temporal and spatial features. Video data is simply a sequence of time-varying images, where the information is digitized both spatially and temporally. This allows us to perform detailed analysis and manipulation of the content within each frame of the video.
 
 
-Video processing has become increasingly important in today's technology-driven world, thanks to the rapid advancements in deep learning (DL) and AI. Traditionally, DL research has focused on images, speech, and text, but video data offers a unique and valuable opportunity for research due to its extensive size and complexity. With millions of videos uploaded daily on platforms like YouTube, video data has become a rich resource, driving AI research and enabling groundbreaking applications.
+
+Video processing has become increasingly important in today's technology-driven world, thanks to the rapid advancements in Deep Learning (DL) and AI. Traditionally, DL research has focused on images, speech, and text, but video data offers a unique and valuable opportunity for research due to its extensive size and complexity. With millions of videos uploaded daily on platforms like YouTube, video data has become a rich resource, driving AI research and enabling groundbreaking applications.
+
 
 ### Applications of Video Processing
 
@@ -66,7 +68,8 @@ Video processing, especially in surveillance and healthcare, involves handling s
 
 ## Conclusion
 
-Video processing is a dynamic and vital area within AI and CV, offering numerous applications and presenting unique challenges. Its importance in modern technology continues to grow, fueled by advancements in deep learning and the increasing availability of video data. In the following sections, we will dive deeper into deep learning for video processing. You'll explore state-of-the-art models including 3D CNNs and transformers.  
+Video processing is a dynamic and vital area within AI and CV, offering numerous applications and presenting unique challenges. Its importance in modern technology continues to grow, fueled by advancements in deep learning and the increasing availability of video data. In the following sections, we will dive deeper into deep learning for video processing. You'll explore state-of-the-art models including 3D CNNs and Transformers.  
+
 
 
 Additionally, we'll cover various tasks such as object tracking, action recognition, video stabilization, captioning, summarization, and background subtraction. These topics will provide you with a comprehensive understanding of how deep learning models are applied to different video processing challenges and applications.

From 26a183384ba604e74c65605677974425e5471ef6 Mon Sep 17 00:00:00 2001
From: Eric <eoulster@gmail.com>
Date: Thu, 1 Aug 2024 17:01:58 -0400
Subject: [PATCH 06/11] Fixed Various Grammatical Issues Across Course

---
 chapters/en/unit13/hyena.mdx                                  | 2 +-
 ...ection.mdx => vision-transformer-for-object-detection.mdx} | 4 ++--
 chapters/en/unit4/multimodal-models/a_multimodal_world.mdx    | 2 +-
 chapters/en/unit9/intro_to_model_optimization.mdx             | 2 +-
 chapters/en/unit9/tools_and_frameworks.mdx                    | 2 +-
 5 files changed, 6 insertions(+), 6 deletions(-)
 rename chapters/en/unit3/vision-transformers/{vision-transformer-for-objection-detection.mdx => vision-transformer-for-object-detection.mdx} (98%)

diff --git a/chapters/en/unit13/hyena.mdx b/chapters/en/unit13/hyena.mdx
index 63535df3b..895d31ff2 100644
--- a/chapters/en/unit13/hyena.mdx
+++ b/chapters/en/unit13/hyena.mdx
@@ -12,7 +12,7 @@ Developed by Hazy Research, it features a subquadratic computational efficiency,
 
 Long convolutions are similar to standard convolutions except the kernel is the size of the input. 
 It is equivalent to having a global receptive field instead of a local one. 
-Having an implicitly parametrized convultion means that the convolution filters values are not directly learnt, instead, learning a function that can recover thoses values is prefered. 
+Having an implicitly parametrized convolution means that the convolution filter values are not directly learned. Instead, learning a function that can recover those values is preferred. 
 
 </Tip>
 
diff --git a/chapters/en/unit3/vision-transformers/vision-transformer-for-objection-detection.mdx b/chapters/en/unit3/vision-transformers/vision-transformer-for-object-detection.mdx
similarity index 98%
rename from chapters/en/unit3/vision-transformers/vision-transformer-for-objection-detection.mdx
rename to chapters/en/unit3/vision-transformers/vision-transformer-for-object-detection.mdx
index 9b341fc9c..c1638d693 100644
--- a/chapters/en/unit3/vision-transformers/vision-transformer-for-objection-detection.mdx
+++ b/chapters/en/unit3/vision-transformers/vision-transformer-for-object-detection.mdx
@@ -29,7 +29,7 @@ To deepen your understanding of the ins-and-outs of object detection, check out
 
 ### The Need to Fine-tune Models in Object Detection 🤔
 
-That is an awesome question. Training an object detection model from scratch means:
+Should you build a new model, or alter an existing one? That is an awesome question. Training an object detection model from scratch means:
 
 - Doing already done research over and over again.
 - Writing repetitive model code, training them, and maintaining different repositories for different use cases.
@@ -59,7 +59,7 @@ So, we are going to fine-tune a lightweight object detection model for doing jus
 
 ### Dataset
 
-For the above scenario, we will use the [hardhat](https://huggingface.co/datasets/hf-vision/hardhat) dataset provided by [Northeaster University China](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/7CBGOS). We can download and load this dataset with  🤗 `datasets`.  
+For the above scenario, we will use the [hardhat](https://huggingface.co/datasets/hf-vision/hardhat) dataset provided by [Northeastern University China](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/7CBGOS). We can download and load this dataset with  🤗 `datasets`.  
 
 ```python
 from datasets import load_dataset
diff --git a/chapters/en/unit4/multimodal-models/a_multimodal_world.mdx b/chapters/en/unit4/multimodal-models/a_multimodal_world.mdx
index ebe7c18ec..fac5ffa49 100644
--- a/chapters/en/unit4/multimodal-models/a_multimodal_world.mdx
+++ b/chapters/en/unit4/multimodal-models/a_multimodal_world.mdx
@@ -89,7 +89,7 @@ A detailed section on multimodal tasks and models with a focus on Vision and Tex
 ## An application of multimodality: Multimodal Search 🔎📲💻
 
 Internet search was the one key advantage Google had, but with the introduction of ChatGPT by OpenAI, Microsoft started out with
-powering up their Bing search engine so that they can crush the competition. It was only restricted initially to LLMs, looking into large corpus of text data but the world around us, mainly social media content, web articles and all possible form of online content is largely multimodal. When we search about an image, the image pops up with a corresponding text to describe it. Won't it be super cool to have another powerful multimodal model which involved both Vision and Text at the same time? This can revolutionize the search landscape hugely, and the core tech involved in this is multimodal learning. We know that many companies also have a large database which is multimodal and mostly unstructured in nature. Multimodal models might help companies with internal search, interactive documentation (chatbots), and many such use cases. This is another domain of Enterprise AI where we leverage AI for organizational intelligence.
+powering up their Bing search engine so that they could crush the competition. Initially, this was restricted to LLMs looking into large corpora of text data, but the world around us, mainly social media content, web articles, and all other forms of online content, is largely multimodal. When we search for an image, the image pops up with a corresponding text to describe it. Wouldn't it be super cool to have another powerful multimodal model that involves both Vision and Text at the same time? This could revolutionize the search landscape, and the core tech involved is multimodal learning. We know that many companies also have large databases that are multimodal and mostly unstructured in nature. Multimodal models might help companies with internal search, interactive documentation (chatbots), and many such use cases. This is another domain of Enterprise AI where we leverage AI for organizational intelligence.
 
 Vision Language Models (VLMs) are models that can understand and process both vision and text modalities. The joint understanding of both modalities lead VLMs to perform various tasks efficiently like Visual Question Answering, Text-to-image search etc. VLMs thus can serve as one of the best candidates for multimodal search. So overall, VLMs should find some way to map text and image pairs to a joint embedding space where each text-image pair is present as an embedding. We can perform various downstream tasks using these embeddings, which can also be used for search. The idea of such a joint space is that image and text embeddings that are similar in meaning will lie close together, enabling us to do searches for images based on text (text-to-image search) or vice-versa.
 
diff --git a/chapters/en/unit9/intro_to_model_optimization.mdx b/chapters/en/unit9/intro_to_model_optimization.mdx
index 2356de8d1..213419fa4 100644
--- a/chapters/en/unit9/intro_to_model_optimization.mdx
+++ b/chapters/en/unit9/intro_to_model_optimization.mdx
@@ -32,7 +32,7 @@ A trade-off exists between accuracy, performance, and resource usage when deploy
 2. Performance is the model's speed and efficiency (latency). This is important so the model can make predictions quickly, even in real time. However, optimizing performance will usually result in decreasing accuracy.
 3. Resource usage is the computational resources needed to perform inference on the model, such as CPU, memory, and storage. Efficient resource usage is crucial if we want to deploy models on devices with certain limitations, such as smartphones or IoT devices.
 
-Image below shows a common computer vision model in terms of model size, accuracy, and latency. Bigger model has high accuracy, but needs more time for inference and big size.
+The image below shows a common computer vision model in terms of model size, accuracy, and latency. A bigger model has higher accuracy but needs more time for inference and has a larger file size.
 
 ![Model Size VS Accuracy](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/model_size_vs_accuracy.png)
 
diff --git a/chapters/en/unit9/tools_and_frameworks.mdx b/chapters/en/unit9/tools_and_frameworks.mdx
index 21eeb5291..63d51a1d8 100644
--- a/chapters/en/unit9/tools_and_frameworks.mdx
+++ b/chapters/en/unit9/tools_and_frameworks.mdx
@@ -6,7 +6,7 @@
 
 The TensorFlow Model Optimization Toolkit is a suite of tools for optimizing machine learning models for deployment. 
 The TensorFlow Lite post-training quantization tool enable users to convert weights to 8 bit precision which reduces the trained model size by about 4 times. 
-The tools also include API for pruning and quantization during training is post-training quantization is insufficient.
+The tools also include APIs for pruning and quantization during training, for cases where post-training quantization is insufficient.
 These help user to reduce latency and inference cost, deploy models to edge devices with restricted resources and optimized execution for existing hardware or new special purpose accelerators.
 
 ### Setup guide

From e1b45a4d50697f210d71c2b578898bc5f444d29a Mon Sep 17 00:00:00 2001
From: Eric <eoulster@gmail.com>
Date: Wed, 14 Aug 2024 09:36:13 -0400
Subject: [PATCH 07/11] Staged change in toctree.yml from name change in title

---
 chapters/en/_toctree.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/en/_toctree.yml b/chapters/en/_toctree.yml
index a9bd6ec08..685fd1247 100644
--- a/chapters/en/_toctree.yml
+++ b/chapters/en/_toctree.yml
@@ -57,7 +57,7 @@
   - title: MobileViT v2
     local: "unit3/vision-transformers/mobilevit"
   - title: FineTuning Vision Transformer for Object Detection
-    local: "unit3/vision-transformers/vision-transformer-for-objection-detection"
+    local: "unit3/vision-transformers/vision-transformer-for-object-detection"
   - title: DEtection TRansformer (DETR)
     local: "unit3/vision-transformers/detr"
   - title: Vision Transformers for Image Segmentation

From 9f1f024544e012232f5152fa22f254eb83e9e0f2 Mon Sep 17 00:00:00 2001
From: SahilCarterr <110806554+SahilCarterr@users.noreply.github.com>
Date: Wed, 21 Aug 2024 00:48:44 +0530
Subject: [PATCH 08/11] Updated Math description to correct rendering

Fixed maths rendering issue #315
---
 .../generative-models/variational_autoencoders.mdx    | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/chapters/en/unit5/generative-models/variational_autoencoders.mdx b/chapters/en/unit5/generative-models/variational_autoencoders.mdx
index a73074032..0040f82cf 100644
--- a/chapters/en/unit5/generative-models/variational_autoencoders.mdx
+++ b/chapters/en/unit5/generative-models/variational_autoencoders.mdx
@@ -7,9 +7,14 @@ Autoencoders are a class of neural networks primarily used for unsupervised lear
 
 ![Vanilla Autoencoder Image - Lilian Weng Blog](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/generative_models/autoencoder.png)
 
-This encoder model consists of an encoder network (represented as \\(g_\phi\\)) and a decoder network (represented as \\(f_\theta\\)). The low-dimensional representation is learned in the bottleneck layer as \\(z\\) and the reconstructed output is represented as \\(x' = f_\theta(g_\phi(x))\\) with the goal of \\(x \approx x'\\).
+This autoencoder model consists of an encoder network (represented as $g_\phi$) and a decoder network (represented as $f_\theta$). The low-dimensional representation is learned in the bottleneck layer as $z$, and the reconstructed output is represented as $x' = f_\theta(g_\phi(x))$ with the goal of $x \approx x'$.
+
+A common loss function used in such vanilla autoencoders is 
+
+$$L(\theta, \phi) = \frac{1}{n}\sum_{i=1}^n (\mathbf{x}^{(i)} - f_\theta(g_\phi(\mathbf{x}^{(i)})))^2$$ 
+
+which tries to minimize the error between the original image and the reconstructed one. This is also known as the `reconstruction loss`.
 
-A common loss function used in such vanilla autoencoders is \\(L(\theta, \phi) = \frac{1}{n}\sum_{i=1}^n (\mathbf{x}^{(i)} - f_\theta(g_\phi(\mathbf{x}^{(i)})))^2\\), which tries to minimize the error between the original image and the reconstructed one and is also known as the `reconstruction loss`.
 
 Autoencoders are useful for tasks such as data denoising, feature learning, and compression. However, traditional autoencoders lack the probabilistic nature that makes VAEs particularly intriguing and also useful for generational tasks.
 
@@ -34,4 +39,4 @@ In summary, VAEs go beyond mere data reconstruction; they generate new samples a
 ## References
 1. [Lilian Weng's Awesome Blog on Autoencoders](https://lilianweng.github.io/posts/2018-08-12-vae/)
 2. [Generative models under a microscope: Comparing VAEs, GANs, and Flow-Based Models](https://medium.com/sciforce/generative-models-under-a-microscope-comparing-vaes-gans-and-flow-based-models-344f20085d83)
-3. [Autoencoders, Variational Autoencoders (VAE) and β-VAE](https://medium.com/@rushikesh.shende/autoencoders-variational-autoencoders-vae-and-%CE%B2-vae-ceba9998773d)
\ No newline at end of file
+3. [Autoencoders, Variational Autoencoders (VAE) and β-VAE](https://medium.com/@rushikesh.shende/autoencoders-variational-autoencoders-vae-and-%CE%B2-vae-ceba9998773d)

From 6989c34aef45a5df1da06cbb990172de84b69c3b Mon Sep 17 00:00:00 2001
From: Arjun Bhammar <112189950+BhammarArjun@users.noreply.github.com>
Date: Fri, 4 Oct 2024 18:01:36 +0530
Subject: [PATCH 09/11] spelling errors

---
 chapters/en/unit1/chapter1/definition.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/en/unit1/chapter1/definition.mdx b/chapters/en/unit1/chapter1/definition.mdx
index 7309406c8..576fb7518 100644
--- a/chapters/en/unit1/chapter1/definition.mdx
+++ b/chapters/en/unit1/chapter1/definition.mdx
@@ -16,7 +16,7 @@ The evolution of computer vision has been marked by a series of incremental adva
 
 Initially, to extract and learn information in an image, you extract features through image-preprocessing techniques (Pre-processing for Computer Vision Tasks). Once you have a group of features describing your image, you use a classical machine learning algorithm on your dataset of features. It is a strategy that already simplifies things from the hard-coded rules, but it still relies on domain knowledge and exhaustive feature engineering. A more state-of-the-art approach arises when deep learning methods and large datasets meet. Deep learning (DL) allows machines to automatically learn complex features from the raw data. This paradigm shift allowed us to build more adaptive and sophisticated models, causing a renaissance in the field.
 
-The seeds of computer vision were sown long before the rise of deep learning models during 1960's, pioneers like David Marr and Hans Moravec wrestled with the fundamental question: Can we get machines to see? Early breakthroughs like edge detection algorithms, object recognition were achived with a mix of cleverness and brute-force which laid the ground work for this developing computer vision systems. Over time, as research and development advanced and hardware capabilities improved, the computer vision community expanded exponentially. This vibrant community is composed of researchers,engineers, data scientists, and passionate hobbyists across the globe coming from a vast arrayof disciplines. With open-source and community driven projects we are witnessing democratized access to cutting-edge tools and technologies helping to create a renaissance in this field.
+The seeds of computer vision were sown long before the rise of deep learning models. During the 1960s, pioneers like David Marr and Hans Moravec wrestled with the fundamental question: Can we get machines to see? Early breakthroughs, such as edge detection and object recognition algorithms, were achieved with a mix of cleverness and brute force, and they laid the groundwork for developing computer vision systems. Over time, as research and development advanced and hardware capabilities improved, the computer vision community expanded rapidly. This vibrant community is composed of researchers, engineers, data scientists, and passionate hobbyists across the globe, coming from a vast array of disciplines. With open-source and community-driven projects, we are witnessing democratized access to cutting-edge tools and technologies, helping to create a renaissance in this field.
 
 ## Interdisciplinary with other fields and Image Understanding
 

From 290812f5f04caec6d078df62edc7c3310ee67275 Mon Sep 17 00:00:00 2001
From: Karan Jakhar <karanjakhar49@gmail.com>
Date: Sun, 20 Oct 2024 16:29:11 +0530
Subject: [PATCH 10/11] Update supplementary-material.mdx

---
 chapters/en/unit4/multimodal-models/supplementary-material.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/en/unit4/multimodal-models/supplementary-material.mdx b/chapters/en/unit4/multimodal-models/supplementary-material.mdx
index b7194e587..28da43453 100644
--- a/chapters/en/unit4/multimodal-models/supplementary-material.mdx
+++ b/chapters/en/unit4/multimodal-models/supplementary-material.mdx
@@ -10,4 +10,4 @@ We hope that you found the unit on multimodal models exciting. If you'd like to
 - [**EE/CS 148, Caltech**](https://gkioxari.github.io/teaching/cs148/) course on Large Language and Vision Models.
 
 In the next unit we will take a look at another kind of Neural Network Models that were revolutionized by multimodality in the last years: **Generative Neural Networks**
-Get you paint brush ready and join us on another exciting adventure in the realm of Computer Vision 🤠
+Get your paintbrush ready and join us on another exciting adventure in the realm of Computer Vision 🤠

From b5f8ef68e77023de5d9143550c9c98b109ba6b42 Mon Sep 17 00:00:00 2001
From: neel429 <77201452+neel429@users.noreply.github.com>
Date: Tue, 29 Oct 2024 20:28:26 -0400
Subject: [PATCH 11/11] Update vlm-intro.mdx

There is a small grammatical mistake in line 96.
---
 chapters/en/unit4/multimodal-models/vlm-intro.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/en/unit4/multimodal-models/vlm-intro.mdx b/chapters/en/unit4/multimodal-models/vlm-intro.mdx
index c3c00b1e2..d92c7292b 100644
--- a/chapters/en/unit4/multimodal-models/vlm-intro.mdx
+++ b/chapters/en/unit4/multimodal-models/vlm-intro.mdx
@@ -94,4 +94,4 @@ One more such dataset called **Winoground** was designed to figure out how good
 ## What's Next?
 The community is moving fast and we can see already lot of amazing work like [FLAVA](https://arxiv.org/abs/2112.04482) which tries to have a single "foundational" model for all the target modalities at once. This is one possible scenario for the future - modality-agnostic foundation models that can read and generate many modalities! But maybe we also see other alternatives developing, one thing we can say for sure is . there is an interesting future ahead. 
 
-To capture more on these recent advances feel free follow the HF's [Transformers Library](https://huggingface.co/docs/transformers/index), and [Diffusers Library](https://huggingface.co/docs/diffusers/index) where we try to add recent advances and models as fast as possible! If you feel like we are missing something important, you can also open an issue for these libraries and contribute code yourself.
\ No newline at end of file
+To keep up with these developments, feel free to follow Hugging Face's [Transformers Library](https://huggingface.co/docs/transformers/index) and [Diffusers Library](https://huggingface.co/docs/diffusers/index), where we try to add recent advances and models as fast as possible! If you feel like we are missing something important, you can also open an issue for these libraries or contribute code yourself.