
Fix sequence_length option in CLIPTokenizer #2031

Closed

Conversation

james77777778
Collaborator

Fix #2018

There is a bug in CLIPTokenizer: when sequence_length is specified, padding falls back to tf.RaggedTensor.to_tensor, which fills with 0 instead of self.pad_token_id. For the same reason, padding_mask needs to be computed manually rather than derived from tf.RaggedTensor.to_tensor.

This PR resolves the issue.
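
A minimal sketch of the padding behavior in question (not the keras-hub implementation; the pad token id and sequence length below are illustrative assumptions):

```python
import tensorflow as tf

# Illustrative assumptions, not values taken from the actual tokenizer:
PAD_TOKEN_ID = 49407   # hypothetical pad token id
SEQUENCE_LENGTH = 8

# Ragged token ids as produced by tokenization.
token_ids = tf.ragged.constant([[1, 2, 3], [4, 5]])

# Buggy path: to_tensor() pads with its default value 0,
# not with the tokenizer's pad token id.
padded_wrong = token_ids.to_tensor(shape=[None, SEQUENCE_LENGTH])

# Fixed path: pad explicitly with the pad token id, and derive the
# padding mask from the ragged row lengths instead of from to_tensor().
padded = token_ids.to_tensor(
    default_value=PAD_TOKEN_ID, shape=[None, SEQUENCE_LENGTH]
)
padding_mask = tf.sequence_mask(token_ids.row_lengths(), maxlen=SEQUENCE_LENGTH)
```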

@mattdangerw
Member

@james77777778

Sorry for the delay! Finally catching up post holidays.

I'm not sure we really want to support this usage. I'll comment on the bug.

@mattdangerw
Member

Commented on the issue. #2018 (comment)

I think we probably just want to remove the "packing" feature from the tokenizer to agree with the other tokenizers. But we probably need to think about our task design for CLIP more generally. Tentatively closing, but we can reopen if needed!

@mattdangerw mattdangerw closed this Jan 8, 2025