
LlavaNextVideoProcessor -> TypeError: LlavaNextVideoProcessor.__call__() got an unexpected keyword argument 'legacy' (I have the fix) #35602

Open
inf3rnus opened this issue Jan 10, 2025 · 3 comments
Labels: bug, Core: Pipeline, VLM

Comments

@inf3rnus (Contributor) commented Jan 10, 2025

System Info

The problem's root cause is in the ImageTextToTextPipeline class in the image_text_to_text.py pipeline.

Line 438:

        model_inputs = self.processor(
            images=images, text=text, return_tensors=self.framework, legacy=False, **processing_kwargs
        ).to(dtype=self.torch_dtype)

Notice how legacy is always specified as False?

If you use this model (llava-hf/LLaVA-NeXT-Video-7B-32K-hf) on transformers==4.47.1, you will get this error because its config specifies the processor class LlavaNextVideoProcessor from processing_llava_next_video.py, and that class's __call__ method does not accept that kwarg.

The quick fix is this:

Modify __call__ (line 101) in processing_llava_next_video.py

from this:

    def __call__(
        self,
        text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]],
        images: ImageInput = None,
        videos: VideoInput = None,
        padding: Union[bool, str, PaddingStrategy] = False,
        truncation: Union[bool, str, TruncationStrategy] = None,
        max_length: int = None,
        return_tensors: Optional[Union[str, TensorType]] = TensorType.PYTORCH,
    ) -> BatchFeature:

to this:

    def __call__(
        self,
        text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]],
        images: ImageInput = None,
        videos: VideoInput = None,
        padding: Union[bool, str, PaddingStrategy] = False,
        truncation: Union[bool, str, TruncationStrategy] = None,
        max_length: int = None,
        return_tensors: Optional[Union[str, TensorType]] = TensorType.PYTORCH,
        **kwargs, # <-- this guy
    ) -> BatchFeature:

Notice the catch-all **kwargs at the end, which absorbs and ignores any unexpected keywords. This mirrors the pattern already used for __init__,

which looks like this:

    def __init__(
        self,
        video_processor=None,
        image_processor=None,
        tokenizer=None,
        chat_template=None,
        patch_size=None,
        vision_feature_select_strategy=None,
        video_token="<video>",
        image_token="<image>",
        num_additional_image_tokens=0,
        **kwargs, # <-- this guy
    ):
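
Here is a minimal standalone illustration (hypothetical functions, not transformers code) of why the catch-all **kwargs prevents the crash:

    def call_strict(text, images=None):
        return text

    def call_tolerant(text, images=None, **kwargs):
        # 'legacy' simply lands in kwargs and is ignored
        return text

    call_tolerant("hi", legacy=False)  # fine
    call_strict("hi", legacy=False)    # TypeError: unexpected keyword argument 'legacy'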

I ain't got time to step through the PR process, so I hope this helps the HF staff either make this quick patch, or solve the problem at a higher level in the code for image_text_to_text.py.
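
One possible higher-level fix (a sketch only, not the library's actual solution) would be for the pipeline to check whether the processor's __call__ accepts legacy before forwarding it:

    import inspect

    # Hypothetical guarded version of the pipeline's preprocess step shown above:
    # forward `legacy` only when the processor's __call__ can actually accept it.
    params = inspect.signature(self.processor.__call__).parameters
    accepts_legacy = "legacy" in params or any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if accepts_legacy:
        processing_kwargs["legacy"] = False

    model_inputs = self.processor(
        images=images, text=text, return_tensors=self.framework, **processing_kwargs
    ).to(dtype=self.torch_dtype)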

Who can help?

HF staff

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • [x] An officially supported task in the examples folder (such as GLUE/SQuAD, ...) (image-text-to-text)

Reproduction

    from transformers import pipeline

    pipe = pipeline("image-text-to-text", model="llava-hf/LLaVA-NeXT-Video-7B-32K-hf")

    messages = [
        {"role": "user", "content": [{"type": "text", "text": "What's in this image?"}, {"type": "video"}]}
    ]

    videos = ["https://huggingface.co/datasets/raushan-testing-hf/videos-test/resolve/main/sample_demo_1.mp4"]

    out = pipe(text=messages, videos=videos)
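
Until a fix lands, one user-side workaround (a sketch, untested) is to wrap the processor's __call__ so the stray kwarg is dropped before it reaches the original method:

    from transformers import LlavaNextVideoProcessor

    _orig_call = LlavaNextVideoProcessor.__call__

    def _patched_call(self, *args, **kwargs):
        kwargs.pop("legacy", None)  # discard the kwarg the processor does not accept
        return _orig_call(self, *args, **kwargs)

    LlavaNextVideoProcessor.__call__ = _patched_call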

Expected behavior

No exception raised due to an unexpected kwarg.

@inf3rnus inf3rnus added the bug label Jan 10, 2025
@inf3rnus inf3rnus changed the title LlavaNextVideoProcessor -> TypeError: LlavaNextVideoProcessor.__call__() got an unexpected keyword argument 'legacy' LlavaNextVideoProcessor -> TypeError: LlavaNextVideoProcessor.__call__() got an unexpected keyword argument 'legacy' (I have the fix) Jan 10, 2025
@zucchini-nlp (Member) commented Jan 10, 2025

cc @yonigozlan

I guess, as a quick fix, we need to standardize the processor kwargs API for video LLMs as well, since those models can work in an image-text-to-text setting. That is also why I am not surprised that users want to apply the image-text-to-text pipeline: it is the closest of the existing ones.

PS. I had an idea to add video-text-to-text as a pipeline, but all video models are also image models, so I am not sure how exactly to separate the two. In any case, the pipeline idea will come after videos get their own standard video processors, separate from image processors, so we can wait on that :)
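
For reference, the standardized kwargs pattern used by already-migrated processors looks roughly like this (a sketch based on transformers.processing_utils; the kwargs class name and defaults shown are illustrative):

    from transformers.processing_utils import ProcessingKwargs, ProcessorMixin, Unpack

    class LlavaNextVideoProcessorKwargs(ProcessingKwargs, total=False):
        # Per-modality defaults; the values here are illustrative.
        _defaults = {"text_kwargs": {"padding": False}}

    class LlavaNextVideoProcessor(ProcessorMixin):
        def __call__(self, images=None, text=None, videos=None,
                     **kwargs: Unpack[LlavaNextVideoProcessorKwargs]):
            ...

With this pattern, an extra keyword like legacy no longer raises a TypeError, since the signature accepts arbitrary **kwargs.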

@zucchini-nlp zucchini-nlp added Core: Pipeline Internals of the library; Pipeline. VLM labels Jan 10, 2025
@yonigozlan (Member)

Agreed on the pipeline; I'm just not sure it will be intuitive for users to use image-text-to-text with video models.
As for this issue, thank you @inf3rnus for bringing it up! I missed the video models in the kwargs standardization; I'll make a note to address that.
I also think the legacy kwarg for the image-text-to-text models/pipeline and the corresponding deprecation warning have been there for a while, so it might be time to remove them altogether?

@zucchini-nlp (Member)

> to remove them altogether

Would be nice, if we are well beyond the deprecation cycle.
