[BUG] apply_model fails on multi-GPU due to hardcoded CUDA device #5271

Open
1 of 3 tasks
deltheil opened this issue Dec 13, 2024 · 2 comments · May be fixed by #5026
Labels
bug Bug fixes

deltheil commented Dec 13, 2024

Describe the problem

On a multi-GPU machine, calling apply_model() with a Hugging Face transformers model raises a runtime error if the model has been moved to a GPU other than the default one.

Code to reproduce issue

import torch
import fiftyone as fo
import fiftyone.zoo as foz
from transformers import MobileNetV2ForImageClassification

dataset = foz.load_zoo_dataset("quickstart", max_samples=25)
model = MobileNetV2ForImageClassification.from_pretrained("google/mobilenet_v2_1.0_224")
assert torch.cuda.device_count() > 1
model.to("cuda:1")
dataset.apply_model(model, label_field="image_classif", skip_failures=False)

This raises a runtime error due to a device mismatch between the model and its preprocessed inputs:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument weight in method wrapper_CUDA__cudnn_convolution)

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 22.04): Ubuntu 24.04 LTS
  • Python version (python --version): Python 3.12.7
  • FiftyOne version (fiftyone --version): FiftyOne v1.1.0, Voxel51, Inc.
  • FiftyOne installed from (pip or source): pip (via rye)

Other info/logs

After model conversion (into a FiftyOne Model), there are three occurrences of a hardcoded "cuda" device, like this:

self.device = (
    "cuda" if next(self.model.parameters()).is_cuda else "cpu"
)

Then, at predict time, self.device is used to move the preprocessed inputs to the GPU:

def _predict(self, inputs):
    with torch.no_grad():
        results = self.model(**inputs.to(self.device))
    return to_classification(results, self.model.config.id2label)

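Hence the mismatch when the model has been moved to a GPU other than cuda:0: the bare "cuda" string carries no device index, so the inputs go to the current default CUDA device instead of the one holding the model.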

This could be fixed by using self.model.device directly and/or, at construction time, storing the attribute as self.device = self.model.device.
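
To illustrate the suggestion, here is a minimal sketch. It is not FiftyOne's actual code: infer_device and DeviceAwareWrapper are hypothetical names, and the inputs are assumed to expose a .to(device) method, as tensors and Hugging Face batch objects do. The real predict path would also wrap the call in torch.no_grad(), omitted here to keep the sketch dependency-free.

```python
def infer_device(model, default="cpu"):
    """Return the device holding the model's first parameter.

    Unlike the bare string "cuda", the returned value keeps the device
    index (e.g. cuda:1). Falls back to `default` for parameter-free models.
    """
    first_param = next(iter(model.parameters()), None)
    return default if first_param is None else first_param.device


class DeviceAwareWrapper:
    """Hypothetical wrapper (not FiftyOne's class) that sends inputs to the
    device the model actually lives on."""

    def __init__(self, model):
        self.model = model
        # Stored once at construction time, index included
        self.device = infer_device(model)

    def predict(self, inputs):
        # Assumes `inputs` is a mapping with a .to(device) method
        return self.model(**inputs.to(self.device))
```

With this, a model moved to cuda:1 would receive its inputs on cuda:1 as well, since the index is no longer dropped.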

Willingness to contribute

The FiftyOne Community encourages bug fix contributions. Would you or another
member of your organization be willing to contribute a fix for this bug to the
FiftyOne codebase?

  • Yes. I can contribute a fix for this bug independently
  • Yes. I would be willing to contribute a fix for this bug with guidance
    from the FiftyOne community
  • No. I cannot contribute a bug fix at this time

cc @brimoor

@deltheil deltheil added the bug Bug fixes label Dec 13, 2024

deltheil commented Dec 17, 2024

I just noticed the problem is not limited to HF transformers models. For example, there is this hardcoded cuda() call:

if self._using_gpu:
    imgs = imgs.cuda()

The same problem occurs with e.g. a TorchImageModel. To reproduce it:

import fiftyone.zoo as foz
from PIL import Image

model = foz.load_zoo_model("clip-vit-base32-torch", device="cuda:1")
y = model.predict(Image.open("test.jpg"))

Fails with:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument weight in method wrapper_CUDA__cudnn_convolution)
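
A possible shape for the fix on this path, as a sketch only: move_batch is a hypothetical helper, not a FiftyOne function, and `device` is assumed to be whatever was requested at load time (e.g. "cuda:1").

```python
def move_batch(imgs, device, using_gpu=True):
    """Move a batch to an explicit device instead of calling .cuda()."""
    if using_gpu:
        # .cuda() with no argument targets the current CUDA device
        # (cuda:0 unless changed); .to(device) keeps the requested index
        imgs = imgs.to(device)
    return imgs
```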

@deltheil deltheil changed the title from "[BUG] Hugging Face Transformers: apply_model error due to hardcoded CUDA device" to "[BUG] apply_model fails on multi-GPU due to hardcoded CUDA device" Dec 17, 2024

brimoor commented Jan 5, 2025

A fix for this is in progress in #5026. We'll get this fixed and merged ASAP!

@brimoor brimoor linked a pull request Jan 5, 2025 that will close this issue