Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Openai-L-14-336 #19

Closed
LKELN opened this issue Dec 9, 2024 · 6 comments
Closed

Openai-L-14-336 #19

LKELN opened this issue Dec 9, 2024 · 6 comments

Comments

@LKELN
Copy link

LKELN commented Dec 9, 2024

Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.FloatTensor instead (while checking arguments for embedding)

input_pixels = processor(images=image, return_tensors="pt").pixel_values.to('cuda')
text_features = l2v.encode(captions, convert_to_tensor=True).to("cuda")
with torch.no_grad(), torch.cuda.amp.autocast():
image_features = model.get_image_features(input_pixels)
text_features = model.get_text_features(text_features)

image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)

When I try to use LLM2CLIP-Openai-L-14-336. this erro is appear?Can you fix it?

@LKELN
Copy link
Author

LKELN commented Dec 9, 2024

the model struct that load by your code is is:
CLIPModel(
(text_model): CLIPTextTransformer(
(embeddings): CLIPTextEmbeddings(
(token_embedding): Embedding(49408, 512)
(position_embedding): Embedding(77, 512)
)
(encoder): CLIPEncoder(
(layers): ModuleList(
(0-11): 12 x CLIPEncoderLayer(
(self_attn): CLIPSdpaAttention(
(k_proj): Linear(in_features=512, out_features=512, bias=True)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(layer_norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(mlp): CLIPMLP(
(activation_fn): QuickGELUActivation()
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
)
(layer_norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
)
)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(vision_model): CLIPVisionTransformer(
(embeddings): CLIPVisionEmbeddings(
(patch_embedding): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
(position_embedding): Embedding(577, 1024)
)
(pre_layrnorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(encoder): CLIPEncoder(
(layers): ModuleList(
(0-23): 24 x CLIPEncoderLayer(
(self_attn): CLIPSdpaAttention(
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(layer_norm1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(mlp): CLIPMLP(
(activation_fn): QuickGELUActivation()
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
)
(layer_norm2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
)
)
(post_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
(visual_projection): Linear(in_features=1024, out_features=1280, bias=False)
(text_projection): Linear(in_features=512, out_features=1280, bias=False)
)

@raytrun
Copy link
Collaborator

raytrun commented Dec 13, 2024

Could you provide a more complete code? The example given here is able to run correctly.

@LKELN
Copy link
Author

LKELN commented Dec 17, 2024

processor = CLIPImageProcessor.from_pretrained("/group/40048/keningliu/tools/models/clip-vit-large-patch14-336")
model_name_or_path = "/group/40048/keningliu/tools/models/LLM2CLIP-Openai-L-14-336" # or /path/to/local/LLM2CLIP-Openai-L-14-336
model = AutoModel.from_pretrained(
model_name_or_path,
torch_dtype=torch.bfloat16,
trust_remote_code=False).to('cuda').eval()
print(model)
print(type(model).mro)
captions = ["a diagram", "a dog", "a cat"]
image_path = "/group/40048/keningliu/tools/FlagData/pipeline.png"

image = Image.open(image_path)
input_pixels = processor(images=image, return_tensors="pt").pixel_values.to('cuda')
text_features = l2v.encode(captions, convert_to_tensor=True).to('cuda')
print(model)
with torch.no_grad(), torch.cuda.amp.autocast():
image_features = model.get_image_features(input_pixels)
text_features = model.get_text_features(text_features)

image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)

@LKELN
Copy link
Author

LKELN commented Dec 17, 2024

I found that when loading the model, the weight files mentioned in your paper were not loaded. Instead, only the CLIP structure was loaded, which makes it impossible to conduct inference.

@raytrun
Copy link
Collaborator

raytrun commented Dec 17, 2024

The model should be loaded in the following manner:
model = AutoModel.from_pretrained(model_name_or_path, torch_dtype=torch.float16, trust_remote_code=True).to('cuda').eval()

trust_remote_code=True is necessary.

@LKELN
Copy link
Author

LKELN commented Dec 19, 2024

Thanks

@LKELN LKELN closed this as completed Dec 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants