[FEATURE REQUEST] Enable Video Training #305
Comments
I agree adding video would be great! While we aren't making major changes to the codebase at the moment, I think you will find this is partially supported already. For instance, the resampler can already take in multiple frames, denoted by F. In the current implementation F is always 1, but if you pass in multiple frames (a video) it will also work. I think you will still need to work on the dataloader, etc., though. Hope this is helpful! I am excited to see what you train :).
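To make the F axis concrete, here is a minimal shape-only sketch. It assumes the resampler input is laid out as (batch, media items T, frames F, visual tokens per frame v, feature dim d) and that the frame axis is folded into the token axis before the latents attend to it. That layout and the helper name `fold_frame_axis` are assumptions for illustration, not something confirmed in this thread, so check them against the actual resampler code.

```python
from typing import Tuple


def fold_frame_axis(shape: Tuple[int, int, int, int, int]) -> Tuple[int, int, int, int]:
    """Map an assumed (b, T, F, v, d) feature shape to (b, T, F * v, d).

    Hypothetical helper: it only does shape arithmetic to show that a
    video (F > 1) lengthens the token sequence the latents attend over.
    """
    b, T, F, v, d = shape
    return (b, T, F * v, d)


# Single image per media item: F = 1, so the sequence length stays at v.
image_shape = fold_frame_axis((2, 3, 1, 256, 1024))

# Eight frames per media item: the sequence becomes 8 * 256 tokens long.
video_shape = fold_frame_axis((2, 3, 8, 256, 1024))
```

Under this assumption the attention cost over visual tokens grows linearly with F, which is one reason to subsample frames in the dataloader.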
Thanks for the information! I’ll explore the resampler’s capability to handle multiple frames and will start working on integrating video support into the dataloader. I’ll keep you updated on my progress and let you know if I need any further assistance. Looking forward to contributing to this enhancement!
May I ask if you have successfully modified the code to take video as input? @simplaj
Is your feature request related to a problem? Please describe.
I have been actively using this repository for multimodal training involving images and text. It has been incredibly helpful for my research and development. However, I am interested in expanding the capabilities to include video-based multimodal training. Currently, the repository does not support video inputs, which limits the scope of applications that can be developed.
Describe the workflow you want to enable.
I would like to enable a workflow where video data can be seamlessly integrated into the existing multimodal training pipeline. This would involve handling video frames as sequential data and allowing the model to learn from both visual and textual information extracted from videos.
Describe your proposed solution.
To address this, I propose the following:
Implement support for video data by extending the current data handling pipeline to process video frames.
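As a rough starting point for the data-handling side, the sketch below picks a fixed number of evenly spaced frame indices from a video, so each clip yields the same number of frames regardless of its length. The function name `sample_frame_indices` and the centre-of-segment sampling strategy are illustrative assumptions, not part of the existing codebase. Decoding the selected frames and applying the image preprocessing would still need to be added.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Return up to `num_samples` evenly spaced frame indices.

    Hypothetical helper: the video is split into `num_samples` equal
    segments and the centre frame of each segment is chosen. If the
    video has fewer frames than requested, every frame is returned.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    if num_samples >= total_frames:
        return list(range(total_frames))
    # Integer arithmetic avoids floating-point rounding at segment centres.
    return [
        ((2 * i + 1) * total_frames) // (2 * num_samples)
        for i in range(num_samples)
    ]


# A 100-frame clip reduced to 4 frames, and a 3-frame clip kept whole.
long_clip = sample_frame_indices(100, 4)
short_clip = sample_frame_indices(3, 8)
```

Short clips come back with fewer than `num_samples` indices, so a real dataloader would also need a padding or frame-repetition policy before batching.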
Describe alternatives you've considered
An alternative solution could be to preprocess videos externally into a sequence of images and then feed these images into the existing image-based pipeline. However, this approach may not fully leverage the temporal information present in videos, and the preprocessing step could introduce additional complexity.
Additional context
Supporting video inputs could significantly enhance the repository's utility for a wider range of applications, such as video captioning, action recognition, and video question answering.
Are you willing to help implement this feature?
Yes, I am very keen to contribute to this feature. I have experience in handling video data and training multimodal models. I expect it might take a few weeks to implement and test the feature, depending on the complexity. I would appreciate any guidance or support from the OpenFlamingo team to ensure seamless integration with the existing codebase.