LlavaNextVideoProcessor -> TypeError: LlavaNextVideoProcessor.call() got an unexpected keyword argument 'legacy' (I have the fix) #35602

inf3rnus · 2025-01-10T00:36:28Z

System Info

Problem's root cause is in ImageTextToTextPipeline class in the image_text_to_text.py pipeline.

Line 438

        model_inputs = self.processor(
            images=images, text=text, return_tensors=self.framework, legacy=False, **processing_kwargs
        ).to(dtype=self.torch_dtype)

Notice how legacy is always specified as False?

If you use this model (llava-hf/LLaVA-NeXT-Video-7B-32K-hf) on transfomers==4.47.1 you will get this error because its config specifies to use the class: LlavaNextVideoProcessor from processing_llava_next_video.py and it's __call__ method is not expecting that kwarg.

The quick fix is this:

Modify __call__ (line 101) in processing_llava_next_video.py

from this:

    def __call__(
        self,
        text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]],
        images: ImageInput = None,
        videos: VideoInput = None,
        padding: Union[bool, str, PaddingStrategy] = False,
        truncation: Union[bool, str, TruncationStrategy] = None,
        max_length: int = None,
        return_tensors: Optional[Union[str, TensorType]] = TensorType.PYTORCH,
    ) -> BatchFeature:

to this:

    def __call__(
        self,
        text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]],
        images: ImageInput = None,
        videos: VideoInput = None,
        padding: Union[bool, str, PaddingStrategy] = False,
        truncation: Union[bool, str, TruncationStrategy] = None,
        max_length: int = None,
        return_tensors: Optional[Union[str, TensorType]] = TensorType.PYTORCH,
        **kwargs, # <-- this guy
    ) -> BatchFeature:

Notice the unused kwargs at the end. This reflects the pattern used for __init__

which looks like this:

    def __init__(
        self,
        video_processor=None,
        image_processor=None,
        tokenizer=None,
        chat_template=None,
        patch_size=None,
        vision_feature_select_strategy=None,
        video_token="<video>",
        image_token="<image>",
        num_additional_image_tokens=0,
        **kwargs, # <-- this guy
    ):

I ain't got time to step through the PR process, so I hope this helps the HF staff either make this quick patch, or solve the problem at a higher level in the code for image_text_to_text.py.

Who can help?

HF staff

Information

The official example scripts
My own modified scripts

Tasks

[x ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...) (image-to-text-to-text)

Reproduction

pipe = pipeline("image-text-to-text", model="llava-hf/LLaVA-NeXT-Video-7B-32K-hf")

messages = {'role': 'user', 'content': [{'type': 'text', 'text': "What's in this image?"}, {'type': 'video'}]}

videos = ["https://huggingface.co/datasets/raushan-testing-hf/videos-test/resolve/main/sample_demo_1.mp4"]

out = pipe(text=messages, videos=videos)

Expected behavior

No exception raised due to an unexpected kwarg.

The text was updated successfully, but these errors were encountered:

zucchini-nlp · 2025-01-10T11:47:01Z

cc @yonigozlan

I guess as quick fix, we need to standardize processor kwargs API for videoLLM also since those models can work in image-text-to-text setting. Which is why I am not surprised that users want to apply image-text-to-text pipeline, as the closest pipeline from existing ones

PS. I had an idea to add video-text-to-text as pipeline, but we know that all video models are also image models. So, not sure how to exactly separate these two. Anyway pipeline idea will go after videos have their own standard video-processors separate from images, so we can can with that :)

yonigozlan · 2025-01-10T15:10:29Z

Agreed for the pipeline, I'm just not sure if it will be intuitive for users to use image-text-to-text with video models.
As for this issue, thank you @inf3rnus for bringing it up! I forgot the video models in the kwargs standardization, I can make a note to address that indeed.
I also think the legacy kwarg for image-text-to-text models/pipeline and the corresponding deprecation warning have been there for a while so it might be time to remove them altogether?

zucchini-nlp · 2025-01-10T15:12:58Z

to remove them altogether

would be nice if we are way beyond the deprecation cycle

inf3rnus added the bug label Jan 10, 2025

inf3rnus changed the title ~~LlavaNextVideoProcessor -> TypeError: LlavaNextVideoProcessor.__call__() got an unexpected keyword argument 'legacy'~~ LlavaNextVideoProcessor -> TypeError: LlavaNextVideoProcessor.__call__() got an unexpected keyword argument 'legacy' (I have the fix) Jan 10, 2025

zucchini-nlp added Core: Pipeline Internals of the library; Pipeline. VLM labels Jan 10, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LlavaNextVideoProcessor -> TypeError: LlavaNextVideoProcessor.call() got an unexpected keyword argument 'legacy' (I have the fix) #35602

LlavaNextVideoProcessor -> TypeError: LlavaNextVideoProcessor.call() got an unexpected keyword argument 'legacy' (I have the fix) #35602

inf3rnus commented Jan 10, 2025 •

edited

Loading

zucchini-nlp commented Jan 10, 2025 •

edited

Loading

yonigozlan commented Jan 10, 2025

zucchini-nlp commented Jan 10, 2025

LlavaNextVideoProcessor -> TypeError: LlavaNextVideoProcessor.__call__() got an unexpected keyword argument 'legacy' (I have the fix) #35602

LlavaNextVideoProcessor -> TypeError: LlavaNextVideoProcessor.__call__() got an unexpected keyword argument 'legacy' (I have the fix) #35602

Comments

inf3rnus commented Jan 10, 2025 • edited Loading

System Info

Who can help?

Information

Tasks

Reproduction

Expected behavior

zucchini-nlp commented Jan 10, 2025 • edited Loading

yonigozlan commented Jan 10, 2025

zucchini-nlp commented Jan 10, 2025

LlavaNextVideoProcessor -> TypeError: LlavaNextVideoProcessor.call() got an unexpected keyword argument 'legacy' (I have the fix) #35602

LlavaNextVideoProcessor -> TypeError: LlavaNextVideoProcessor.call() got an unexpected keyword argument 'legacy' (I have the fix) #35602

inf3rnus commented Jan 10, 2025 •

edited

Loading

zucchini-nlp commented Jan 10, 2025 •

edited

Loading