Bugfix: use device in all Torch models #5026

jacobsela · 2024-10-31T22:06:34Z

Resolves #5271

Summary by CodeRabbit

New Features
- Added device configuration options for machine learning models.
- Enhanced model compatibility with different hardware setups.
Improvements
- Improved device management for GPU and CPU processing.
- More flexible device selection for transformer and AI models.
Technical Updates
- Updated device handling methods across multiple utility classes.
- Introduced device attribute in configuration classes for more precise control.

coderabbitai · 2024-10-31T22:06:40Z

Walkthrough

The changes involve modifications to the device management in the TorchOpenClipModel, TorchYoloNasModel, and transformer classes within the fiftyone/utils/open_clip.py, fiftyone/utils/super_gradients.py, fiftyone/utils/transformers.py, and fiftyone/utils/ultralytics.py files. The updates replace direct calls to .cuda() with .to(self.device) for moving tensors and models to the appropriate device, enhancing compatibility across different hardware configurations.

Changes

File	Change Summary
fiftyone/utils/open_clip.py	Updated `_get_text_features`, `_embed_prompts`, and `_predict_all` methods to use `text.to(self.device)` and `imgs.to(self.device)` for device management.
fiftyone/utils/super_gradients.py	Modified `_load_model` method to use `model.to(self.device)` for transferring the model to the appropriate device.
fiftyone/utils/transformers.py	Introduced `device` attribute in `FiftyOneTransformerConfig` and `FiftyOneZeroShotTransformerConfig`, modified initialization in transformer classes to utilize this attribute for device management.
fiftyone/utils/ultralytics.py	Added `device` attribute to `FiftyOneYOLOModelConfig` and updated the constructor in `FiftyOneYOLOModel` to use `model.to(self.device)`.

Assessment against linked issues

Objective	Addressed	Explanation
Resolve hardcoded CUDA device issue in `apply_model` (#5271)	✅

Possibly related PRs

Transformers GPU Support #4987: The changes in this PR also focus on device management for models, specifically enhancing GPU support in the fiftyone/utils/transformers.py file, which aligns with the device handling improvements made in the main PR's TorchOpenClipModel class.

Suggested reviewers

brimoor

Poem

In the patch of code, a rabbit hops,
With changes made, it never stops.
Through functions and loops, it scurries with glee,
Enhancing the zoo for all to see! 🐇✨

Finishing Touches

📝 Generate Docstrings

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

danielgural · 2024-10-31T22:35:40Z

Still works fine and I can see difference between cpu and cuda. Note for future, this change is not pulled upstream by

fob.compute_similarity(
    dataset,
    model="clip-vit-base32-torch",
    brain_key="img_sim",
    device="cuda",
)

and just noticed. Something for next time :)

@harpreetsahota204 can you run this code when you test:

import fiftyone.brain as fob
model = foz.load_zoo_model("clip-vit-base32-torch", device="cuda")
print(model._model.visual.conv1._parameters["weight"][0].device)

To make sure the model is also multi-gpu

danielgural

LGTM

import fiftyone.brain as fob
from fiftyone import ViewField as F

dataset = foz.load_zoo_dataset('quickstart')
session = fo.launch_app(dataset)
model = foz.load_zoo_model("clip-vit-base32-torch", device="cuda")
embeddings = dataset.compute_embeddings(model)

worked as expected

fiftyone/utils/clip/zoo.py

coderabbitai

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)

fiftyone/utils/super_gradients.py (1)
98-100: Consider adding a docstring note about device flexibility.

Since this change enables flexible device selection, it would be helpful to document this capability in the class or method docstring. This would help users understand that they can use any available GPU.

Add a note like this to the class docstring:
 """FiftyOne wrapper around YOLO-NAS from
 https://github.com/Deci-AI/super-gradients.
+
+The model automatically uses the appropriate device (CPU/GPU) based on availability
+and can work with any CUDA device, not just the default one.

 Args:
     config: a :class:`TorchYoloNasModelConfig`
 """

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 3b3596f and 9a89a70.

📒 Files selected for processing (3)

fiftyone/utils/clip/zoo.py (1 hunks)
fiftyone/utils/open_clip.py (3 hunks)
fiftyone/utils/super_gradients.py (1 hunks)

🔥 Files not summarized due to errors (1)

fiftyone/utils/clip/zoo.py: Error: Server error: no LLM provider could handle the message

🔇 Additional comments (4)

fiftyone/utils/super_gradients.py (1)

99-99: LGTM! Device management improvement.

The change from model.cuda() to model.to(self.device) improves GPU device flexibility, allowing the model to work with any available GPU instead of being restricted to the default one.

Let's verify that the model parameters are correctly moved to the specified device:

✅ Verification successful

Device management change is properly implemented

The verification confirms that the change from model.cuda() to model.to(self.device) is correct and consistent with the codebase's device management pattern. Key findings:

The base TorchImageModel class properly initializes self.device in its constructor:

device = self.config.device
if device is None:
    device = "cuda:0" if torch.cuda.is_available() else "cpu"

All other model implementations (CLIP, SAM, YOLO-NAS) consistently use self.device for device management.
The change aligns with the existing pattern used across the codebase for moving tensors and models to the target device.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if there are any remaining hard-coded device assignments
# that might conflict with the new flexible device management

# Test 1: Look for any remaining .cuda() calls that might bypass device management
rg -g '*.py' '\.cuda\(' 

# Test 2: Check for proper device initialization in the parent class
ast-grep --pattern 'class TorchImageModel:
  $$$
  def __init__($$$):
    $$$
    self.device = $$$
    $$$
'

Length of output: 137

Script:

#!/bin/bash
# Let's try a different approach to verify device management

# Check for device-related code in TorchImageModel and its subclasses
rg -g '*.py' 'class.*TorchImageModel' -A 10

# Look for device initialization patterns
rg -g '*.py' 'self\.device' -B 2 -A 2

# Check for any remaining cuda-specific code
rg -g '*.py' 'cuda|gpu' -B 2 -A 2

Length of output: 26257

fiftyone/utils/open_clip.py (3)

109-109: LGTM: Device handling improvement in text feature extraction.

The change from .cuda() to .to(self.device) is correct and aligns with the PR's objective of supporting multiple GPUs.

121-121: LGTM: Device handling improvement in prompt embedding.

The change from .cuda() to .to(self.device) is correct and consistent with the previous change.

109-109: Verify device consistency across model components.

The device handling changes look good, but we should verify that all model components (model, tensors, etc.) are consistently using the same device.

Also applies to: 121-121, 146-146

✅ Verification successful

Based on the shell script results, I can now generate the final response as the device handling patterns are clear across the codebase.

Device handling is consistent across model components

The device handling in the OpenCLIP implementation follows the established patterns used throughout the codebase:

The model and tensors are consistently moved to the same device using to(self.device)
Device initialization follows the standard pattern of defaulting to CUDA if available, otherwise CPU
All tensor operations (text encoding, image encoding) maintain device consistency

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify device handling consistency across the codebase
# Look for any remaining .cuda() calls or device-related patterns

# Test 1: Check for any remaining .cuda() calls
echo "Checking for remaining .cuda() calls..."
rg '\.cuda\(' 

# Test 2: Check for device-related patterns in TorchOpenClipModel and related classes
echo "Checking device handling patterns..."
rg -A 3 'to\(.*device|\.device|device\s*='

Length of output: 14047

coderabbitai · 2024-11-01T14:05:05Z

fiftyone/utils/open_clip.py

@@ -143,7 +143,7 @@ def _predict_all(self, imgs):
        frame_size = (width, height)

        if self._using_gpu:
-            imgs = imgs.cuda()
+            imgs = imgs.to(self.device)


⚠️ Potential issue

Consider adding error handling for device transfers.

While the device handling change is correct, the subsequent autocast context is hardcoded to "cuda". This might cause issues when running on CPU.

Apply this diff to make it device-aware:

if self._using_gpu: imgs = imgs.to(self.device) - with torch.no_grad(), torch.amp.autocast("cuda"): + with torch.no_grad(), torch.amp.autocast(device_type=self.device.type if self._using_gpu else "cpu"): image_features = self._model.encode_image(imgs) text_features = self._get_text_features()

Also applies to: 147-152

@jacobsela coderabbit raises an interesting point here. Does torch.amp.autocast("cuda") need to be updated?

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Yeah this is valid and will cause problems if not handled. It's in my todo for this week to more thoroughly review the code before moving further with this PR because this message makes me think that there are probably more places I haven't noticed that make hardware assumptions.

brimoor · 2025-01-05T22:47:31Z

@jacobsela friendly ping here; can we get this completed and merged?

deltheil · 2025-01-06T09:27:26Z

Any chance to modify fiftyone/utils/transformers.py in addition (since Hf transformers models are impacted too: see #5271)? Thanks!

jacobsela · 2025-01-06T15:31:44Z

@deltheil That can be done, we'll look into it.

jacobsela · 2025-01-09T02:13:32Z

@brimoor

Currently testing all zoo models. There seem to be some other unrelated issues that may be worth addressing, e.g. #5359 and an error pasted below with open clip.

I'll push the fixes for transformers once git goes back up. (EDIT: up now)

This also makes me think that we may need proper testing when adding new zoo models. The code isn't very consistent. Not sure if it's worth the time sink though.

Current status:

Tested devices ['cpu', 'cuda:2']

Tested models - all pass besides open-clip-torch

classification-transformer-torch
clip-vit-base32-torch
mnasnet0.5-imagenet-torch
keypoint-rcnn-resnet50-fpn-coco-torch
depth-estimation-transformer-torch
segment-anything-2.1-hiera-tiny-image-torch
resnext101-32x8d-imagenet-torch
yolov5s-coco-torch
mobilenet-v2-imagenet-torch
wide-resnet50-2-imagenet-torch
vgg16-imagenet-torch
densenet121-imagenet-torch
densenet201-imagenet-torch
dinov2-vits14-torch
fcn-resnet50-coco-torch
retinanet-resnet50-fpn-coco-torch
densenet161-imagenet-torch
vgg13-bn-imagenet-torch
segment-anything-2-hiera-large-image-torch
resnet152-imagenet-torch
wide-resnet101-2-imagenet-torch
dinov2-vitb14-torch
vgg16-bn-imagenet-torch
deeplabv3-resnet50-coco-torch
vgg13-imagenet-torch
detection-transformer-torch
fcn-resnet101-coco-torch
squeezenet-imagenet-torch
resnet50-imagenet-torch
squeezenet-1.1-imagenet-torch
yolov5l-coco-torch
vgg11-bn-imagenet-torch
vgg19-bn-imagenet-torch
resnet34-imagenet-torch
shufflenetv2-1.0x-imagenet-torch
faster-rcnn-resnet50-fpn-coco-torch
resnet18-imagenet-torch
resnext50-32x4d-imagenet-torch
mnasnet1.0-imagenet-torch
alexnet-imagenet-torch
yolov5x-coco-torch
vgg11-imagenet-torch
mask-rcnn-resnet50-fpn-coco-torch
segment-anything-2.1-hiera-large-image-torch
segment-anything-2-hiera-small-image-torch
googlenet-imagenet-torch
densenet169-imagenet-torch
inception-v3-imagenet-torch
segment-anything-2.1-hiera-small-image-torch
segmentation-transformer-torch
dinov2-vitg14-torch
resnet101-imagenet-torch
segment-anything-2-hiera-base-plus-image-torch
shufflenetv2-0.5x-imagenet-torch
segment-anything-2.1-hiera-base-plus-image-torch
yolov5m-coco-torch
deeplabv3-resnet101-coco-torch
dinov2-vitl14-torch
yolov5n-coco-torch
open-clip-torch
vgg19-imagenet-torch
segment-anything-2-hiera-tiny-image-torch

Not tested models (I need to setup an environment to test all of these):

Model med-sam-2-video-torch rtdetr-l-coco-torch rtdetr-x-coco-torch segment-anything-2-hi segment-anything-2-hi segment-anything-2-hi segment-anything-2-hi segment-anything-2.1- segment-anything-2.1- segment-anything-2.1- segment-anything-2.1- segment-anything-vitb-torch segment-anything-vith-torch segment-anything-vitl-torch yolo-nas-torch yolo11l-coco-torch yolo11l-seg-coco-torch yolo11m-coco-torch yolo11m-seg-coco-torch yolo11n-coco-torch yolo11n-seg-coco-torch yolo11s-coco-torch yolo11s-seg-coco-torch yolo11x-coco-torch yolo11x-seg-coco-torch yolov10l-coco-torch yolov10m-coco-torch yolov10n-coco-torch yolov10s-coco-torch yolov10x-coco-torch yolov8l-coco-torch yolov8l-obb-dotav1-torch yolov8l-oiv7-torch yolov8l-seg-coco-torch yolov8l-world-torch yolov8m-coco-torch yolov8m-obb-dotav1-torch yolov8m-oiv7-torch yolov8m-seg-coco-torch yolov8m-world-torch yolov8n-coco-torch yolov8n-obb-dotav1-torch yolov8n-oiv7-torch yolov8n-seg-coco-torch yolov8s-coco-torch yolov8s-obb-dotav1-torch yolov8s-oiv7-torch yolov8s-seg-coco-torch yolov8s-world-torch yolov8x-coco-torch yolov8x-obb-dotav1-torch yolov8x-oiv7-torch yolov8x-seg-coco-torch yolov8x-world-torch yolov9c-coco-torch yolov9c-seg-coco-torch yolov9e-coco-torch yolov9e-seg-coco-torch Why test was skipped
Model is not an image model
Model does not have a device attribute
Model does not have a device attribute
era-base-plus-video-torch Model is not an image model
era-large-video-torch Model is not an image model
era-small-video-torch Model is not an image model
era-tiny-video-torch Model is not an image model
hiera-base-plus-video-torch Model is not an image model
hiera-large-video-torch Model is not an image model
hiera-small-video-torch Model is not an image model
hiera-tiny-video-torch Model is not an image model
Failed to load model
Failed to load model
Failed to load model
Failed to load model
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute
Model does not have a device attribute

Errors

======================================================================
ERROR: test_all_torch_image_models (main.TestDeviceUsage) (model_name='open-clip-torch', device='cpu', input_format='numpy')

Traceback (most recent call last):
File "/home/jacobs/test_device_usage.py", line 86, in _test_image_model
fo_torch_model.predict_all(dummy_inputs)
File "/home/jacobs/fiftyone/fiftyone/utils/torch.py", line 691, in predict_all
return self._predict_all(imgs)
File "/home/jacobs/fiftyone/fiftyone/utils/open_clip.py", line 137, in _predict_all
imgs = [self._preprocess(img).unsqueeze(0) for img in imgs]
File "/home/jacobs/fiftyone/fiftyone/utils/open_clip.py", line 137, in
imgs = [self._preprocess(img).unsqueeze(0) for img in imgs]
TypeError: 'bool' object is not callable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/jacobs/test_device_usage.py", line 147, in test_all_torch_image_models
self._test_image_model(fo_torch_model, model_name, device, input_format=input_format)
File "/home/jacobs/test_device_usage.py", line 90, in _test_image_model
raise Exception(f"Failed to run model {model_name} on device {device}") from e
Exception: Failed to run model open-clip-torch on device cpu

======================================================================
ERROR: test_all_torch_image_models (main.TestDeviceUsage) (model_name='open-clip-torch', device='cuda:2', input_format='numpy')

Traceback (most recent call last):
File "/home/jacobs/test_device_usage.py", line 86, in _test_image_model
fo_torch_model.predict_all(dummy_inputs)
File "/home/jacobs/fiftyone/fiftyone/utils/torch.py", line 691, in predict_all
return self._predict_all(imgs)
File "/home/jacobs/fiftyone/fiftyone/utils/open_clip.py", line 137, in _predict_all
imgs = [self._preprocess(img).unsqueeze(0) for img in imgs]
File "/home/jacobs/fiftyone/fiftyone/utils/open_clip.py", line 137, in
imgs = [self._preprocess(img).unsqueeze(0) for img in imgs]
TypeError: 'bool' object is not callable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/jacobs/test_device_usage.py", line 147, in test_all_torch_image_models
self._test_image_model(fo_torch_model, model_name, device, input_format=input_format)
File "/home/jacobs/test_device_usage.py", line 90, in _test_image_model
raise Exception(f"Failed to run model {model_name} on device {device}") from e
Exception: Failed to run model open-clip-torch on device cuda:2

Ran 1 test in 275.831s

FAILED (errors=2)

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

fiftyone/utils/transformers.py (2)
326-337: Add device parameter validation and documentation.

The device handling logic is correct, but consider these improvements:

Add validation for the device parameter to ensure only valid values are accepted (e.g., 'cuda', 'cpu', 'cuda:0', etc.)

Document the device parameter in the class docstring.
 """Configuration for a :class:`FiftyOneTransformer`.
 
 Args:
     model (None): a ``transformers`` model
     name_or_path (None): the name or path to a checkpoint file to load
+    device (None): the device to use for model execution (e.g., 'cuda', 'cpu', 'cuda:0').
+        If not specified, uses CUDA if available, otherwise CPU.
 """
759-760: Consider refactoring device initialization to reduce code duplication.

The device initialization pattern is repeated across multiple transformer classes. Consider moving this common functionality to a base class or mixin to promote DRY principles.

Example approach:
class DeviceMixin:
    def _initialize_device(self):
        self.device = torch.device(self.config.device)
        self.model.to(self.device)

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 310a6bd and f57fa2a.

📒 Files selected for processing (1)

fiftyone/utils/transformers.py (4 hunks)

🔇 Additional comments (3)

fiftyone/utils/transformers.py (3)

463-464: LGTM! Device handling follows PyTorch best practices.

The implementation correctly initializes the device from config and moves the model to the appropriate device.

509-510: LGTM! Device handling is consistent.

The implementation maintains consistency with the base class's device handling approach.

Line range hint 326-760: Verify tensor operations use the specified device consistently.

Let's ensure all tensor operations in the codebase consistently use the specified device.

✅ Verification successful

Device handling is properly implemented

All tensor operations in the codebase consistently use the specified device through proper device transfer mechanisms. The implementation in the reviewed code aligns with the rest of the codebase's device handling patterns.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Check for potential tensor operations that might not use the specified device

# Search for tensor operations that might need device handling
rg -A 2 "torch\.(tensor|Tensor|from_numpy|zeros|ones|arange|linspace|empty|full|eye|rand|randn)" fiftyone/utils/

# Search for potential cuda-specific calls that should use device instead
rg "\.cuda" fiftyone/utils/
Length of output: 9099

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

fiftyone/utils/transformers.py (1)
464-465: Consider extracting device management into a mixin class.

The device initialization and model movement logic is duplicated across multiple transformer classes. Consider creating a DeviceManagementMixin class to reduce code duplication and maintain consistency.

Example implementation:
class DeviceManagementMixin:
    def initialize_device(self):
        self.device = torch.device(self.config.device)
        self.model.to(self.device)
Then update the transformer classes to use the mixin:
class FiftyOneTransformer(DeviceManagementMixin, TransformerEmbeddingsMixin, Model):
    def __init__(self, config):
        self.config = config
        self.model = self._load_model(config)
        self.initialize_device()
        self.image_processor = self._load_image_processor()
Also applies to: 510-511, 760-761

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f57fa2a and 0adf306.

📒 Files selected for processing (2)

fiftyone/utils/transformers.py (4 hunks)
fiftyone/utils/ultralytics.py (3 hunks)

🔇 Additional comments (7)

fiftyone/utils/ultralytics.py (3)

23-23: LGTM!

The torch import is correctly added using lazy_import to avoid unnecessary imports.

382-384: LGTM!

The device configuration is correctly implemented with a sensible default that falls back to CPU if CUDA is not available.

397-398: LGTM!

The device initialization and model movement to the specified device are correctly implemented.

fiftyone/utils/transformers.py (4)

326-338: LGTM!

The device configuration logic is well-implemented with proper handling of both string-based model paths and pre-loaded models.

464-465: LGTM!

The device initialization and model movement to the specified device are correctly implemented in FiftyOneTransformer.

510-511: LGTM!

The device initialization and model movement to the specified device are correctly implemented in FiftyOneZeroShotTransformer.

760-761: LGTM!

The device initialization and model movement to the specified device are correctly implemented in FiftyOneZeroShotTransformerForObjectDetection.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

fiftyone/utils/transformers.py (1)

457-458: Consider adding error handling for device availability.

While the device initialization looks correct, it would be beneficial to add error handling for cases where the specified device is not available.

 def __init__(self, config):
     self.config = config
     self.model = self._load_model(config)
-    self.device = torch.device(self.config.device)
-    self.model.to(self.device)
+    try:
+        self.device = torch.device(self.config.device)
+        self.model.to(self.device)
+    except RuntimeError as e:
+        logger.warning(f"Failed to move model to {self.config.device}. Falling back to CPU. Error: {e}")
+        self.device = torch.device("cpu")
+        self.model.to(self.device)
     self.image_processor = self._load_image_processor()

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0adf306 and 06ead81.

📒 Files selected for processing (1)

fiftyone/utils/transformers.py (4 hunks)

🔇 Additional comments (2)

fiftyone/utils/transformers.py (2)

326-328: LGTM: Device configuration with sensible defaults.

The device configuration is well-implemented with a sensible default that automatically selects CUDA if available, falling back to CPU otherwise.

326-328: Verify device compatibility across the codebase.

The changes introduce device management across multiple classes. Let's verify that all model operations consistently use the specified device.

Also applies to: 457-458, 503-504, 753-754

✅ Verification successful

Device compatibility verification successful

All model operations consistently use the specified device across the codebase. Input tensors and models are properly moved to the configured device before processing, maintaining compatibility throughout the model operations.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Check for potential device-related issues in model operations

# Look for tensor operations that might not respect the device setting
rg -A 2 "\.to\(" --type py

# Look for direct cuda() calls that should be replaced with to(self.device)
rg "\.cuda\(" --type py

# Look for device-related patterns in model operations
ast-grep --pattern 'with torch.no_grad():
  $$$
  outputs = $model($$$)
  $$$'
Length of output: 5936

fiftyone/utils/transformers.py

jacobsela · 2025-01-09T05:33:06Z

Still no testing done for MPS

Models tested w/ various coda devices

alexnet-imagenet-torch
classification-transformer-torch
clip-vit-base32-torch
deeplabv3-resnet101-coco-torch
deeplabv3-resnet50-coco-torch
densenet121-imagenet-torch
densenet161-imagenet-torch
densenet169-imagenet-torch
densenet201-imagenet-torch
depth-estimation-transformer-torch
detection-transformer-torch
dinov2-vitb14-torch
dinov2-vitg14-torch
dinov2-vitl14-torch
dinov2-vits14-torch
faster-rcnn-resnet50-fpn-coco-torch
fcn-resnet101-coco-torch
fcn-resnet50-coco-torch
googlenet-imagenet-torch
inception-v3-imagenet-torch
keypoint-rcnn-resnet50-fpn-coco-torch
mask-rcnn-resnet50-fpn-coco-torch
mnasnet0.5-imagenet-torch
mnasnet1.0-imagenet-torch
mobilenet-v2-imagenet-torch
open-clip-torch
resnet101-imagenet-torch
resnet152-imagenet-torch
resnet18-imagenet-torch
resnet34-imagenet-torch
resnet50-imagenet-torch
resnext101-32x8d-imagenet-torch
resnext50-32x4d-imagenet-torch
retinanet-resnet50-fpn-coco-torch
rtdetr-l-coco-torch
rtdetr-x-coco-torch
segment-anything-2-hiera-base-plus-image-torch
segment-anything-2-hiera-large-image-torch
segment-anything-2-hiera-small-image-torch
segment-anything-2-hiera-tiny-image-torch
segment-anything-2.1-hiera-base-plus-image-torch
segment-anything-2.1-hiera-large-image-torch
segment-anything-2.1-hiera-small-image-torch
segment-anything-2.1-hiera-tiny-image-torch
segmentation-transformer-torch
shufflenetv2-0.5x-imagenet-torch
shufflenetv2-1.0x-imagenet-torch
squeezenet-1.1-imagenet-torch
squeezenet-imagenet-torch
vgg11-bn-imagenet-torch
vgg11-imagenet-torch
vgg13-bn-imagenet-torch
vgg13-imagenet-torch
vgg16-bn-imagenet-torch
vgg16-imagenet-torch
vgg19-bn-imagenet-torch
vgg19-imagenet-torch
wide-resnet101-2-imagenet-torch
wide-resnet50-2-imagenet-torch
yolo11l-coco-torch
yolo11l-seg-coco-torch
yolo11m-coco-torch
yolo11m-seg-coco-torch
yolo11n-coco-torch
yolo11n-seg-coco-torch
yolo11s-coco-torch
yolo11s-seg-coco-torch
yolo11x-coco-torch
yolo11x-seg-coco-torch
yolov10l-coco-torch
yolov10m-coco-torch
yolov10n-coco-torch
yolov10s-coco-torch
yolov10x-coco-torch
yolov5l-coco-torch
yolov5m-coco-torch
yolov5n-coco-torch
yolov5s-coco-torch
yolov5x-coco-torch
yolov8l-coco-torch
yolov8l-obb-dotav1-torch
yolov8l-oiv7-torch
yolov8l-seg-coco-torch
yolov8l-world-torch
yolov8m-coco-torch
yolov8m-obb-dotav1-torch
yolov8m-oiv7-torch
yolov8m-seg-coco-torch
yolov8m-world-torch
yolov8n-coco-torch
yolov8n-obb-dotav1-torch
yolov8n-oiv7-torch
yolov8n-seg-coco-torch
yolov8s-coco-torch
yolov8s-obb-dotav1-torch
yolov8s-oiv7-torch
yolov8s-seg-coco-torch
yolov8s-world-torch
yolov8x-coco-torch
yolov8x-obb-dotav1-torch
yolov8x-oiv7-torch
yolov8x-seg-coco-torch
yolov8x-world-torch
yolov9c-coco-torch
yolov9c-seg-coco-torch
yolov9e-coco-torch
yolov9e-seg-coco-torch
zero-shot-classification-transformer-torch
zero-shot-detection-transformer-torch

Models that are still problematic

yolov5 - loads on the coda:0 before being loaded to the device in the argument. Not sure why
hugging face zero shot transformers - expect input_ids argument that isn't passed. probably unrelated error.
open-clip-torch - self._preprocess is a bool instead of a callable. not sure why. probably unrelated.

models that haven't been tested - need to setup env

med-sam-2-video-torch - Model is not an image model
segment-anything-2-hiera-base-plus-video-torch - Model is not an image model
segment-anything-2-hiera-large-video-torch - Model is not an image model
segment-anything-2-hiera-small-video-torch - Model is not an image model
segment-anything-2-hiera-tiny-video-torch - Model is not an image model
segment-anything-2.1-hiera-base-plus-video-torch - Model is not an image model
segment-anything-2.1-hiera-large-video-torch - Model is not an image model
segment-anything-2.1-hiera-small-video-torch - Model is not an image model
segment-anything-2.1-hiera-tiny-video-torch - Model is not an image model
segment-anything-vitb-torch - Failed to load model
segment-anything-vith-torch - Failed to load model
segment-anything-vitl-torch - Failed to load model
yolo-nas-torch - Failed to load model

fiftyone/utils/ultralytics.py

brimoor · 2025-01-09T17:08:22Z

Adding @manushreegangwar and @mwoodson1 as ML team reviewers 😄

brimoor · 2025-01-09T17:16:01Z

@jacobsela can you rebase on latest develop? Looks like these is a merge conflict that would currently prevent merging this.

Also:

yolov5 - loads on the coda:0 before being loaded to the device in the argument. Not sure why

We're using Ultralytics' model here. Can anything be done to address this?

hugging face zero shot transformers - expect input_ids argument that isn't passed. probably unrelated error.

On develop on macOS with CPU, these work for me. Are you seeing something different?

open-clip-torch - self._preprocess is a bool instead of a callable. not sure why. probably unrelated.

On develop on macOS with CPU, this works for me. Are you seeing something different?

https://docs.voxel51.com/model_zoo/models.html#open-clip-torch

danielgural · 2025-01-09T19:12:27Z

I have some scripts sitting around that can test. I will do Mac CPU + MPS (I have M4) and multi GPU. Will kick off runs tonight and hopefully will finish before morning. Will bring back findings

… already loaded not just string

jacobsela · 2025-01-09T23:05:49Z

Resolved the yolo5 issue. When loading from torch hub, it will automatically load the model onto the currently set default device (which is "cuda" when working in a cuda enabled environment). Wrapping line 788 in fiftyone.utils.torch in a with torch.device("cpu") fixes this while maintaining the default device in other parts of the code.

edit: Can't reproduce...

jacobsela · 2025-01-09T23:33:12Z

I'm just going to pass "cpu" to always be the device in the manifest. model is moved to correct device afterwards.

jacobsela · 2025-01-10T00:56:14Z

status:

Rebased locally and force pushed, git merge --no-commit --no-ff bugfix/zoo-clip-support-for-multi-gpu-setups from an up-to-date develop gives no errors. I'm not sure if this was the proper way of doing this.
Fixed the yolov5 issue by updating manifests to by default use CPU. In general it seems that there is no way to pass arguments directly into the entrypoint_fcn of a zoo model from the code, just the manifest. Am I missing something?
For the zero-shot transformers I'm getting the same error ValueError: You have to specify input_ids on cpu in develop. Something may be broken on my env.
SAM2 models have some internal component that defaults to loading on cuda:0. I don't know if it's just my environment or a general thing. I don't know if this is worth the trouble of debugging. Assuming the user has a functioning default "cuda" device with enough free memory they probably wouldn't even notice.

TL;DR
Models that are still up in the air:

video models
yolo-nas (getting error when downloading weights)
open-clip-torch (can't get it to run)
zero shot transformers (can't get them to run)

Works but for whatever reason loads on "cuda" before going to desired device:

sam 2 models

danielgural · 2025-01-10T02:41:38Z

MPS works on all but some transformers due to an aten::upsample_bicubic2d.out operator. Error spits out correctly as "not supported on MPS yet" from torch.

Multi GPU works except for zero-shot-classification-transformer-torch on device cuda

Traceback (most recent call last):
  File "/home/dan/model_testing/jacob_test.py", line 158, in test_all_torch_image_models
    self._test_image_model(fo_torch_model, model_name, device, input_format=input_format)
  File "/home/dan/model_testing/jacob_test.py", line 90, in _test_image_model
    raise Exception(f"Failed to run model {model_name} on device {device}") from e
Exception: Failed to run model zero-shot-classification-transformer-torch on device cuda

+1 to clip input_ids issue.

LGTM for my tests just needs the stated above fixes

jacobsela requested review from harpreetsahota204 and danielgural October 31, 2024 22:06

danielgural previously approved these changes Oct 31, 2024

View reviewed changes

brimoor reviewed Nov 1, 2024

View reviewed changes

fiftyone/utils/clip/zoo.py Outdated Show resolved Hide resolved

brimoor changed the title ~~bugfix~~ Use device in all Torch models Nov 1, 2024

brimoor changed the title ~~Use device in all Torch models~~ Bugfix: use device in all Torch models Nov 1, 2024

jacobsela dismissed danielgural’s stale review via 9a89a70 November 1, 2024 13:58

coderabbitai bot reviewed Nov 1, 2024

View reviewed changes

brimoor changed the base branch from release/v1.0.2 to develop November 7, 2024 23:16

brimoor mentioned this pull request Jan 5, 2025

[BUG] apply_model fails on multi-GPU due to hardcoded CUDA device #5271

Open

3 tasks

coderabbitai bot reviewed Jan 9, 2025

View reviewed changes

fiftyone/utils/transformers.py Show resolved Hide resolved

jacobsela mentioned this pull request Jan 9, 2025

[FR] Support device="mps" for YOLO zoo models #5285

Open

3 tasks

kaixi-wang reviewed Jan 9, 2025

View reviewed changes

fiftyone/utils/ultralytics.py Show resolved Hide resolved

brimoor requested review from danielgural, mwoodson1 and manushreegangwar January 9, 2025 16:57

Jacob Sela added 2 commits January 9, 2025 13:02

bugfix

a0e372e

same bugfix for openCLIP and YOLONAS

c064c2f

Jacob Sela added 7 commits January 9, 2025 13:02

changed .to(torch.device(self.device)) to .to(self.device), object is…

62c2f02

… already loaded not just string

changed autocast to be device dependent

bfd8ad6

enabled device selection in hf transformers

a37fae4

bugfix

7d8d009

added device to ultralytics

e6d6ab7

removed guessing user gpu with hf transformers

bc6132c

bugfix for transformers

ca45a6f

added device=cpu argument for yolo5 models in manifest

fb7b179

jacobsela force-pushed the bugfix/zoo-clip-support-for-multi-gpu-setups branch from 06ead81 to fb7b179 Compare January 10, 2025 00:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bugfix: use device in all Torch models #5026

Bugfix: use device in all Torch models #5026

jacobsela commented Oct 31, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 31, 2024 •

edited

Loading

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

Documentation and Community

danielgural commented Oct 31, 2024 •

edited by brimoor

Loading

danielgural left a comment •

edited by brimoor

Loading

coderabbitai bot left a comment

coderabbitai bot Nov 1, 2024 •

edited

Loading

brimoor Nov 5, 2024

coderabbitai bot Nov 5, 2024

jacobsela Nov 5, 2024

brimoor commented Jan 5, 2025

deltheil commented Jan 6, 2025

jacobsela commented Jan 6, 2025

jacobsela commented Jan 9, 2025 •

edited

Loading

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

jacobsela commented Jan 9, 2025

brimoor commented Jan 9, 2025

brimoor commented Jan 9, 2025

danielgural commented Jan 9, 2025

jacobsela commented Jan 9, 2025 •

edited

Loading

jacobsela commented Jan 9, 2025

jacobsela commented Jan 10, 2025 •

edited

Loading

danielgural commented Jan 10, 2025

Bugfix: use device in all Torch models #5026

Are you sure you want to change the base?

Bugfix: use device in all Torch models #5026

Conversation

jacobsela commented Oct 31, 2024 • edited by coderabbitai bot Loading

Summary by CodeRabbit

coderabbitai bot commented Oct 31, 2024 • edited Loading

Walkthrough

Changes

Assessment against linked issues

Possibly related PRs

Suggested reviewers

Poem

Finishing Touches

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

Documentation and Community

danielgural commented Oct 31, 2024 • edited by brimoor Loading

danielgural left a comment • edited by brimoor Loading

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot Nov 1, 2024 • edited Loading

Choose a reason for hiding this comment

brimoor Nov 5, 2024

Choose a reason for hiding this comment

coderabbitai bot Nov 5, 2024

Choose a reason for hiding this comment

jacobsela Nov 5, 2024

Choose a reason for hiding this comment

brimoor commented Jan 5, 2025

deltheil commented Jan 6, 2025

jacobsela commented Jan 6, 2025

jacobsela commented Jan 9, 2025 • edited Loading

Tested models - all pass besides open-clip-torch

Not tested models (I need to setup an environment to test all of these):

Errors

====================================================================== ERROR: test_all_torch_image_models (main.TestDeviceUsage) (model_name='open-clip-torch', device='cpu', input_format='numpy')

====================================================================== ERROR: test_all_torch_image_models (main.TestDeviceUsage) (model_name='open-clip-torch', device='cuda:2', input_format='numpy')

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

jacobsela commented Jan 9, 2025

Still no testing done for MPS

Models tested w/ various coda devices

Models that are still problematic

models that haven't been tested - need to setup env

brimoor commented Jan 9, 2025

brimoor commented Jan 9, 2025

danielgural commented Jan 9, 2025

jacobsela commented Jan 9, 2025 • edited Loading

jacobsela commented Jan 9, 2025

jacobsela commented Jan 10, 2025 • edited Loading

danielgural commented Jan 10, 2025

jacobsela commented Oct 31, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 31, 2024 •

edited

Loading

danielgural commented Oct 31, 2024 •

edited by brimoor

Loading

danielgural left a comment •

edited by brimoor

Loading

coderabbitai bot Nov 1, 2024 •

edited

Loading

jacobsela commented Jan 9, 2025 •

edited

Loading

======================================================================
ERROR: test_all_torch_image_models (main.TestDeviceUsage) (model_name='open-clip-torch', device='cpu', input_format='numpy')

======================================================================
ERROR: test_all_torch_image_models (main.TestDeviceUsage) (model_name='open-clip-torch', device='cuda:2', input_format='numpy')

jacobsela commented Jan 9, 2025 •

edited

Loading

jacobsela commented Jan 10, 2025 •

edited

Loading