
Support for audio and video inputs for OpenAI and Google Gemini models #1102

Merged (24 commits) on Jan 11, 2025

Conversation

jjallaire (Collaborator)

This PR adds new ContentAudio and ContentVideo types, alongside the existing ContentText and ContentImage, and implements provider support for audio and video inputs where it is currently available (still quite limited as of now).

Audio

The following models currently support audio inputs:

  • OpenAI: gpt-4o-audio-preview
  • Google/Vertex: Gemini 1.5 and 2.0 models

To include audio in a dataset, use the JSON input format (either standard JSON or JSON Lines). For example, here we include audio alongside some text content:

"input": [
  {
    "role": "user",
    "content": [
        { "type": "audio", "audio": "sample.mp3", "format": "mp3" },
        { "type": "text", "text": "What words are spoken in this audio sample?"}
    ]
  }
]

The "sample.mp3" path is resolved relative to the directory containing the dataset file. The audio file can be specified either as a file path or a base64 encoded Data URL.

If you are constructing chat messages programmatically, the equivalent of the above is:

from inspect_ai.model import ChatMessageUser, ContentAudio, ContentText

input = [
    ChatMessageUser(content=[
        ContentAudio(audio="sample.mp3", format="mp3"),
        ContentText(text="What words are spoken in this audio sample?")
    ])
]

Formats

You can provide audio files in one of two formats:

  • MP3
  • WAV

As demonstrated above, you should specify the format explicitly when including audio input.

Video

The following models currently support video inputs:

  • Google: Gemini 1.5 and 2.0 models

To include video in a dataset, use the JSON input format (either standard JSON or JSON Lines). For example, here we include video alongside some text content:

"input": [
  {
    "role": "user",
    "content": [
        { "type": "video", "video": "video.mp4", "format": "mp4" },
        { "type": "text", "text": "Can you please describe the attached video?"}
    ]
  }
]

The "video.mp4" path is resolved relative to the directory containing the dataset file. The video file can be specified either as a file path or a base64 encoded Data URL.

If you are constructing chat messages programmatically, the equivalent of the above is:

from inspect_ai.model import ChatMessageUser, ContentText, ContentVideo

input = [
    ChatMessageUser(content=[
        ContentVideo(video="video.mp4", format="mp4"),
        ContentText(text="Can you please describe the attached video?")
    ])
]

Formats

You can provide video files in one of three formats:

  • MP4
  • MPEG
  • MOV

As demonstrated above, you should specify the format explicitly when including video input.
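If you construct Data URLs yourself, the format strings above need to be mapped to MIME types. The lookup below is an assumption based on standard MIME registrations, not code from this PR:

```python
# Standard MIME types for the supported audio and video formats
# (note that MOV files use the video/quicktime MIME type).
MEDIA_MIME_TYPES = {
    "mp3": "audio/mpeg",
    "wav": "audio/wav",
    "mp4": "video/mp4",
    "mpeg": "video/mpeg",
    "mov": "video/quicktime",
}

def mime_for_format(format: str) -> str:
    # Look up the MIME type for a dataset "format" field value.
    return MEDIA_MIME_TYPES[format.lower()]
```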

Uploads

When using audio and video with the Google Gemini API, media is first uploaded using the File API and then the URL to the uploaded file is referenced in the chat message. This results in much faster performance for subsequent uses of the media file.

The File API lets you store up to 20GB of files per project, with a per-file maximum size of 2GB. Files are stored for 48 hours. They can be accessed in that period with your API key, but cannot be downloaded from the API. The File API is available at no cost in all regions where the Gemini API is available.

jjallaire requested a review from dragonstyle (January 11, 2025 15:14)
jjallaire merged commit 326285f into main on January 11, 2025 (9 checks passed)
jjallaire deleted the feature/audio-and-video-content branch (January 11, 2025 15:30)