Merge branch 'main' into stage

Johannes committed Dec 7, 2024
2 parents aae25a2 + 96b87ce commit d71208d
Showing 11 changed files with 59 additions and 15 deletions.
2 changes: 1 addition & 1 deletion chapters/en/_toctree.yml
@@ -59,7 +59,7 @@
- title: MobileViT v2
local: "unit3/vision-transformers/mobilevit"
- title: FineTuning Vision Transformer for Object Detection
local: "unit3/vision-transformers/vision-transformer-for-objection-detection"
local: "unit3/vision-transformers/vision-transformer-for-object-detection"
- title: DEtection TRansformer (DETR)
local: "unit3/vision-transformers/detr"
- title: Vision Transformers for Image Segmentation
2 changes: 1 addition & 1 deletion chapters/en/unit1/chapter1/definition.mdx
@@ -16,7 +16,7 @@ The evolution of computer vision has been marked by a series of incremental adva

Initially, to extract and learn information from an image, you extract features through image-preprocessing techniques (Pre-processing for Computer Vision Tasks). Once you have a group of features describing your image, you use a classical machine learning algorithm on your dataset of features. This strategy already simplifies things compared to hard-coded rules, but it still relies on domain knowledge and exhaustive feature engineering. A more state-of-the-art approach arises when deep learning methods and large datasets meet. Deep learning (DL) allows machines to automatically learn complex features from the raw data. This paradigm shift allowed us to build more adaptive and sophisticated models, causing a renaissance in the field.
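To make the classical pipeline described above concrete, here is a minimal, illustrative sketch (not part of this commit): hand-crafted HOG features feeding a classical classifier. HOG and SVM are just one common choice, and the random images and labels are placeholders for a real dataset.

```python
# Illustrative sketch of the classical recipe: hand-crafted features + classical ML.
# The random "images" and labels below stand in for a real dataset.
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

rng = np.random.default_rng(0)
images = rng.random((20, 64, 64))        # 20 toy grayscale images
labels = rng.integers(0, 2, size=20)     # toy binary labels

features = np.array([hog(img) for img in images])  # feature-engineering step
clf = SVC().fit(features, labels)                  # classical ML on the extracted features
print(clf.predict(features[:3]))
```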

The seeds of computer vision were sown long before the rise of deep learning models during 1960's, pioneers like David Marr and Hans Moravec wrestled with the fundamental question: Can we get machines to see? Early breakthroughs like edge detection algorithms, object recognition were achived with a mix of cleverness and brute-force which laid the ground work for this developing computer vision systems. Over time, as research and development advanced and hardware capabilities improved, the computer vision community expanded exponentially. This vibrant community is composed of researchers,engineers, data scientists, and passionate hobbyists across the globe coming from a vast arrayof disciplines. With open-source and community driven projects we are witnessing democratized access to cutting-edge tools and technologies helping to create a renaissance in this field.
The seeds of computer vision were sown long before the rise of deep learning models. During the 1960s, pioneers like David Marr and Hans Moravec wrestled with the fundamental question: Can we get machines to see? Early breakthroughs like edge detection algorithms and object recognition were achieved with a mix of cleverness and brute force, which laid the groundwork for today's computer vision systems. Over time, as research and development advanced and hardware capabilities improved, the computer vision community expanded exponentially. This vibrant community is composed of researchers, engineers, data scientists, and passionate hobbyists across the globe, coming from a vast array of disciplines. With open-source and community-driven projects, we are witnessing democratized access to cutting-edge tools and technologies, helping to create a renaissance in this field.

## Interdisciplinarity with Other Fields and Image Understanding

2 changes: 1 addition & 1 deletion chapters/en/unit13/hyena.mdx
@@ -12,7 +12,7 @@ Developed by Hazy Research, it features a subquadratic computational efficiency,

Long convolutions are similar to standard convolutions except the kernel is the size of the input.
It is equivalent to having a global receptive field instead of a local one.
Having an implicitly parametrized convultion means that the convolution filters values are not directly learnt, instead, learning a function that can recover thoses values is prefered.
Having an implicitly parametrized convolution means that the convolution filter values are not directly learned. Instead, learning a function that can recover those values is preferred (a rough sketch of this idea is shown below).

</Tip>
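As a rough illustration of the implicit parametrization described in the tip above (an editorial sketch, not code from the Hyena paper or this commit): the kernel is as long as the input, its values come from a small learned network over positions, and the convolution is applied globally via the FFT.

```python
# Rough sketch of an implicitly parametrized long convolution (illustrative only).
import torch
import torch.nn as nn


class ImplicitLongConv(nn.Module):
    def __init__(self, seq_len: int, hidden: int = 32):
        super().__init__()
        self.seq_len = seq_len
        # The kernel values are not stored directly; a small MLP maps positions to them.
        self.kernel_net = nn.Sequential(nn.Linear(1, hidden), nn.GELU(), nn.Linear(hidden, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq_len)
        positions = torch.linspace(0, 1, self.seq_len).unsqueeze(-1)  # (seq_len, 1)
        kernel = self.kernel_net(positions).squeeze(-1)               # kernel as long as the input
        # Circular convolution via FFT: every output position sees the whole input.
        return torch.fft.irfft(torch.fft.rfft(x) * torch.fft.rfft(kernel), n=self.seq_len)


out = ImplicitLongConv(seq_len=128)(torch.randn(4, 128))
print(out.shape)  # torch.Size([4, 128])
```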

@@ -29,7 +29,7 @@ To deepen your understanding of the ins-and-outs of object detection, check out

### The Need to Fine-tune Models in Object Detection 🤔

That is an awesome question. Training an object detection model from scratch means:
Should you build a new model, or alter an existing one? That is an awesome question. Training an object detection model from scratch means:

- Redoing research that has already been done.
- Writing repetitive model code, training models, and maintaining different repositories for different use cases.
@@ -59,7 +59,7 @@ So, we are going to fine-tune a lightweight object detection model for doing jus

### Dataset

For the above scenario, we will use the [hardhat](https://huggingface.co/datasets/hf-vision/hardhat) dataset provided by [Northeaster University China](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/7CBGOS). We can download and load this dataset with 🤗 `datasets`.
For the above scenario, we will use the [hardhat](https://huggingface.co/datasets/hf-vision/hardhat) dataset provided by [Northeastern University China](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/7CBGOS). We can download and load this dataset with 🤗 `datasets`.

```python
from datasets import load_dataset
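# Editorial sketch, not necessarily the file's original code: the dataset id
# comes from the text above; the available splits are not shown in this excerpt.
dataset = load_dataset("hf-vision/hardhat")
print(dataset)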
2 changes: 1 addition & 1 deletion chapters/en/unit4/multimodal-models/a_multimodal_world.mdx
@@ -88,7 +88,7 @@ A detailed section on multimodal tasks and models with a focus on Vision and Tex
## An application of multimodality: Multimodal Search 🔎📲💻

Internet search was the one key advantage Google had, but with the introduction of ChatGPT by OpenAI, Microsoft started out with
powering up their Bing search engine so that they can crush the competition. It was only restricted initially to LLMs, looking into large corpus of text data but the world around us, mainly social media content, web articles and all possible form of online content is largely multimodal. When we search about an image, the image pops up with a corresponding text to describe it. Won't it be super cool to have another powerful multimodal model which involved both Vision and Text at the same time? This can revolutionize the search landscape hugely, and the core tech involved in this is multimodal learning. We know that many companies also have a large database which is multimodal and mostly unstructured in nature. Multimodal models might help companies with internal search, interactive documentation (chatbots), and many such use cases. This is another domain of Enterprise AI where we leverage AI for organizational intelligence.
powering up their Bing search engine so that they could crush the competition. Initially this was restricted to LLMs working over large corpora of text, but the world around us, mainly social media content, web articles, and all possible forms of online content, is largely multimodal. When we search for an image, the image pops up with corresponding text to describe it. Wouldn't it be super cool to have another powerful multimodal model that involves both Vision and Text at the same time? This could revolutionize the search landscape hugely, and the core tech involved is multimodal learning. We know that many companies also have large databases which are multimodal and mostly unstructured in nature. Multimodal models might help companies with internal search, interactive documentation (chatbots), and many such use cases. This is another domain of Enterprise AI where we leverage AI for organizational intelligence.

Vision Language Models (VLMs) are models that can understand and process both vision and text modalities. The joint understanding of both modalities leads VLMs to perform various tasks efficiently, such as Visual Question Answering and text-to-image search. VLMs can thus serve as one of the best candidates for multimodal search. So overall, VLMs should find some way to map text and image pairs to a joint embedding space where each text-image pair is present as an embedding. We can perform various downstream tasks using these embeddings, which can also be used for search. The idea of such a joint space is that image and text embeddings that are similar in meaning will lie close together, enabling us to do searches for images based on text (text-to-image search) or vice-versa.
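As a hedged illustration of such a joint embedding space (an editorial sketch, not part of this commit), one could embed texts and images with a CLIP-style model and rank images by cosine similarity to a text query. The checkpoint choice and the blank placeholder images are assumptions made only for the example.

```python
# Sketch of text-to-image search over a joint embedding space (illustrative only).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Toy "image collection": two solid-color images stand in for a real gallery.
images = [Image.new("RGB", (224, 224), color=c) for c in ["red", "blue"]]
query = "a photo of a red square"

image_inputs = processor(images=images, return_tensors="pt")
text_inputs = processor(text=[query], return_tensors="pt", padding=True)

with torch.no_grad():
    image_embeds = model.get_image_features(**image_inputs)
    text_embeds = model.get_text_features(**text_inputs)

# Cosine similarity between the query embedding and each image embedding.
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
scores = (text_embeds @ image_embeds.T).squeeze(0)
print("best match:", scores.argmax().item())
```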

@@ -10,4 +10,4 @@ We hope that you found the unit on multimodal models exciting. If you'd like to
- [**EE/CS 148, Caltech**](https://gkioxari.github.io/teaching/cs148/) course on Large Language and Vision Models.

In the next unit we will take a look at another kind of neural network model that has been revolutionized by multimodality in recent years: **Generative Neural Networks**.
Get you paint brush ready and join us on another exciting adventure in the realm of Computer Vision 🤠
Get your paint brush ready and join us on another exciting adventure in the realm of Computer Vision 🤠
2 changes: 1 addition & 1 deletion chapters/en/unit4/multimodal-models/vlm-intro.mdx
@@ -94,4 +94,4 @@ One more such dataset called **Winoground** was designed to figure out how good
## What's Next?
The community is moving fast, and we can already see a lot of amazing work like [FLAVA](https://arxiv.org/abs/2112.04482), which tries to have a single "foundational" model for all the target modalities at once. This is one possible scenario for the future: modality-agnostic foundation models that can read and generate many modalities! But maybe we will also see other alternatives developing. One thing we can say for sure is that there is an interesting future ahead.

To capture more on these recent advances feel free follow the HF's [Transformers Library](https://huggingface.co/docs/transformers/index), and [Diffusers Library](https://huggingface.co/docs/diffusers/index) where we try to add recent advances and models as fast as possible! If you feel like we are missing something important, you can also open an issue for these libraries and contribute code yourself.
To capture more of these recent advances, feel free to follow HF's [Transformers Library](https://huggingface.co/docs/transformers/index) and [Diffusers Library](https://huggingface.co/docs/diffusers/index), where we try to add recent advances and models as fast as possible! If you feel like we are missing something important, you can also open an issue for these libraries and contribute code yourself.
11 changes: 8 additions & 3 deletions chapters/en/unit5/generative-models/variational_autoencoders.mdx
@@ -7,9 +7,14 @@ Autoencoders are a class of neural networks primarily used for unsupervised lear

![Vanilla Autoencoder Image - Lilian Weng Blog](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/generative_models/autoencoder.png)

This encoder model consists of an encoder network (represented as \\(g_\phi\\)) and a decoder network (represented as \\(f_\theta\\)). The low-dimensional representation is learned in the bottleneck layer as \\(z\\) and the reconstructed output is represented as \\(x' = f_\theta(g_\phi(x))\\) with the goal of \\(x \approx x'\\).
This encoder model consists of an encoder network (represented as $g_\phi$) and a decoder network (represented as $f_\theta$). The low-dimensional representation is learned in the bottleneck layer as $z$, and the reconstructed output is represented as $x' = f_\theta(g_\phi(x))$ with the goal of $x \approx x'$.

A common loss function used in such vanilla autoencoders is

$$L(\theta, \phi) = \frac{1}{n}\sum_{i=1}^n (\mathbf{x}^{(i)} - f_\theta(g_\phi(\mathbf{x}^{(i)})))^2$$

which tries to minimize the error between the original image and the reconstructed one. This is also known as the `reconstruction loss`.

A common loss function used in such vanilla autoencoders is \\(L(\theta, \phi) = \frac{1}{n}\sum_{i=1}^n (\mathbf{x}^{(i)} - f_\theta(g_\phi(\mathbf{x}^{(i)})))^2\\), which tries to minimize the error between the original image and the reconstructed one and is also known as the `reconstruction loss`.
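To make the reconstruction loss above concrete, here is a minimal editorial sketch (not part of this commit), assuming a PyTorch setup; the tiny linear encoder/decoder and the toy batch are placeholders, not the course's model.

```python
# Minimal sketch of a vanilla autoencoder and its reconstruction (MSE) loss.
import torch
import torch.nn as nn

x = torch.rand(8, 784)                           # toy batch of flattened images
encoder = nn.Linear(784, 32)                     # g_phi: input -> bottleneck z
decoder = nn.Linear(32, 784)                     # f_theta: z -> reconstruction x'

x_hat = decoder(encoder(x))                      # x' = f_theta(g_phi(x))
reconstruction_loss = ((x - x_hat) ** 2).mean()  # mean squared error, as in the formula above
print(reconstruction_loss.item())
```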

Autoencoders are useful for tasks such as data denoising, feature learning, and compression. However, traditional autoencoders lack the probabilistic nature that makes VAEs particularly intriguing and also useful for generative tasks.

@@ -34,4 +39,4 @@ In summary, VAEs go beyond mere data reconstruction; they generate new samples a
## References
1. [Lilian Weng's Awesome Blog on Autoencoders](https://lilianweng.github.io/posts/2018-08-12-vae/)
2. [Generative models under a microscope: Comparing VAEs, GANs, and Flow-Based Models](https://medium.com/sciforce/generative-models-under-a-microscope-comparing-vaes-gans-and-flow-based-models-344f20085d83)
3. [Autoencoders, Variational Autoencoders (VAE) and β-VAE](https://medium.com/@rushikesh.shende/autoencoders-variational-autoencoders-vae-and-%CE%B2-vae-ceba9998773d)
43 changes: 41 additions & 2 deletions chapters/en/unit7/video-processing/introduction-to-video.mdx
@@ -5,8 +5,6 @@ Of course, the real world of Computer Vision has a lot more to offer. Videos are

Given their importance in our society and research, we also want to talk about them here in our course. In this introduction chapter, you will learn some very basic theory behind videos before going on to have a closer look at video processing.

Let's go! 🤓

## What is a Video?

An image is a digital, two-dimensional (2D) representation of visual data. A video is a multimedia format that sequentially displays a series of such images, called frames.
@@ -36,3 +34,44 @@ Codecs, short for “compressor-decompressor” are software or hardware compone
There are two main types of codecs: "lossless codecs" and "lossy codecs". Lossless codecs are designed to compress data without any loss of quality, while lossy codecs compress by removing some of the data, resulting in a loss of quality.

In summary, a video is a dynamic multimedia format that combines a series of individual frames, audio, and often additional metadata. It is used in a wide range of applications and can be tailored for different purposes, whether for entertainment, education, communication, or analysis.

## What is Video Processing?

In the research field of Computer Vision (CV) and Artificial Intelligence (AI), video processing involves automatically analyzing video data to understand and interpret both temporal and spatial features. Video data is simply a sequence of time-varying images, where the information is digitized both spatially and temporally. This allows us to perform detailed analysis and manipulation of the content within each frame of the video.
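As a small editorial sketch of treating a video as a sequence of frames (not part of this commit): OpenCV is one common way to read frames one by one; the file name below is a placeholder.

```python
# Read a video as a sequence of frames (the path below is a placeholder).
import cv2

cap = cv2.VideoCapture("example.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)

frames = []
while True:
    ok, frame = cap.read()          # frame: H x W x 3 array in BGR order
    if not ok:
        break
    frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
cap.release()

print(f"read {len(frames)} frames at {fps:.1f} fps")
```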



Video processing has become increasingly important in today's technology-driven world, thanks to the rapid advancements in Deep Learning (DL) and AI. Traditionally, DL research has focused on images, speech, and text, but video data offers a unique and valuable opportunity for research due to its extensive size and complexity. With millions of videos uploaded daily on platforms like YouTube, video data has become a rich resource, driving AI research and enabling groundbreaking applications.


### Applications of Video Processing

- **Surveillance Systems:**
Video processing plays a critical role in public safety, crime prevention, and traffic monitoring. It enables the automated detection of suspicious activities, helps identify individuals, and enhances the efficiency of surveillance systems.

- **Autonomous Driving:**
In the realm of autonomous driving, video processing is essential for navigation, obstacle detection, and decision-making processes. It allows self-driving cars to understand their surroundings, recognize road signs, and react to changing environments, ensuring safe and efficient transportation.

- **Healthcare:**
Video processing has significant applications in healthcare, including medical diagnostics, surgery, and patient monitoring. It helps analyze medical images, provides real-time feedback during surgical procedures, and continuously monitors patients to detect any abnormalities or emergencies.

### Challenges in Video Processing

- **Computational Demands:**
Real-time video analysis requires substantial processing power, which poses a significant challenge in developing and deploying efficient video processing systems. High-performance computing resources are essential to meet these demands.

- **Storage Requirements:**
High-resolution videos generate large volumes of data, leading to storage challenges. Efficient data compression and management techniques are necessary to handle the vast amounts of video data.

- **Privacy and Ethical Concerns:**
Video processing, especially in surveillance and healthcare, involves handling sensitive information. Ensuring privacy and addressing ethical concerns related to the misuse of video data are crucial considerations that must be carefully managed.

## Conclusion

Video processing is a dynamic and vital area within AI and CV, offering numerous applications and presenting unique challenges. Its importance in modern technology continues to grow, fueled by advancements in deep learning and the increasing availability of video data. In the following sections, we will dive deeper into deep learning for video processing. You'll explore state-of-the-art models including 3D CNNs and Transformers.



Additionally, we'll cover various tasks such as object tracking, action recognition, video stabilization, captioning, summarization, and background subtraction. These topics will provide you with a comprehensive understanding of how deep learning models are applied to different video processing challenges and applications.

Let's go! 🤓
2 changes: 1 addition & 1 deletion chapters/en/unit9/intro_to_model_optimization.mdx
@@ -32,7 +32,7 @@ A trade-off exists between accuracy, performance, and resource usage when deploy
2. Performance is the model's speed and efficiency (latency). This is important so the model can make predictions quickly, even in real time. However, optimizing for performance will usually result in decreased accuracy.
3. Resource usage is the computational resources needed to perform inference on the model, such as CPU, memory, and storage. Efficient resource usage is crucial if we want to deploy models on devices with certain limitations, such as smartphones or IoT devices.

Image below shows a common computer vision model in terms of model size, accuracy, and latency. Bigger model has high accuracy, but needs more time for inference and big size.
The image below compares common computer vision models in terms of model size, accuracy, and latency. A bigger model has higher accuracy, but needs more time for inference and has a larger file size.

![Model Size VS Accuracy](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/model_size_vs_accuracy.png)

2 changes: 1 addition & 1 deletion chapters/en/unit9/tools_and_frameworks.mdx
@@ -6,7 +6,7 @@

The TensorFlow Model Optimization Toolkit is a suite of tools for optimizing machine learning models for deployment.
The TensorFlow Lite post-training quantization tool enables users to convert weights to 8-bit precision, which reduces the trained model size by about 4 times.
The tools also include API for pruning and quantization during training is post-training quantization is insufficient.
The tools also include API for pruning and quantization during training if post-training quantization is insufficient.
These help users reduce latency and inference cost, deploy models to edge devices with restricted resources, and optimize execution for existing hardware or new special-purpose accelerators.
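As a hedged sketch of the post-training quantization workflow mentioned above (not code from this commit; the SavedModel and output paths are placeholders):

```python
# Post-training quantization with the TFLite converter (paths are placeholders).
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # enables post-training quantization
tflite_model = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)
```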

### Setup guide