
Fix sequence_length option in CLIPTokenizer #2031

Closed

Conversation

james77777778
Collaborator

Fix #2018

There is a bug in CLIPTokenizer: when sequence_length is specified, padding falls back to tf.RaggedTensor.to_tensor, which fills with 0 instead of self.pad_token_id. For the same reason, padding_mask needs to be computed manually rather than derived from tf.RaggedTensor.to_tensor.

This PR resolves the issue.
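
A minimal sketch of the padding behavior in question (not the keras-hub implementation; the pad token id and sequence length below are illustrative assumptions):

```python
import tensorflow as tf

# Illustrative assumptions, not values taken from the actual tokenizer:
PAD_TOKEN_ID = 49407   # hypothetical pad token id
SEQUENCE_LENGTH = 8

# Ragged token ids as produced by tokenization.
token_ids = tf.ragged.constant([[1, 2, 3], [4, 5]])

# Buggy path: to_tensor() pads with its default value 0,
# not with the tokenizer's pad token id.
padded_wrong = token_ids.to_tensor(shape=[None, SEQUENCE_LENGTH])

# Fixed path: pad explicitly with the pad token id, and derive the
# padding mask from the ragged row lengths instead of from to_tensor().
padded = token_ids.to_tensor(
    default_value=PAD_TOKEN_ID, shape=[None, SEQUENCE_LENGTH]
)
padding_mask = tf.sequence_mask(token_ids.row_lengths(), maxlen=SEQUENCE_LENGTH)
```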

@mattdangerw
Member

@james77777778

Sorry for the delay! Finally catching up post holidays.

I'm not sure we really want to support this usage. I'll comment on the bug.

@mattdangerw
Member

Commented on the issue. #2018 (comment)

I think we probably just want to remove the "packing" feature from the tokenizer to agree with the other tokenizers. But we probably need to think about our task design for CLIP more generally. Tentatively closing, but we can reopen if needed!

@mattdangerw mattdangerw closed this Jan 8, 2025