
Support for audio and video inputs for OpenAI and Google Gemini models #1102

Merged (24 commits) on Jan 11, 2025

Conversation

jjallaire (Collaborator)

This PR adds new ContentAudio and ContentVideo types, alongside the existing ContentText and ContentImage, and implements provider support for audio and video inputs where it is currently available (still quite limited as of now).

Audio

The following models currently support audio inputs:

  • OpenAI: gpt-4o-audio-preview
  • Google/Vertex: Gemini 1.5 and 2.0 models

To include audio in a dataset, use the JSON input format (either standard JSON or JSON Lines). For example, here we include audio alongside some text content:

"input": [
  {
    "role": "user",
    "content": [
        { "type": "audio", "audio": "sample.mp3", "format": "mp3" },
        { "type": "text", "text": "What words are spoken in this audio sample?"}
    ]
  }
]

The "sample.mp3" path is resolved relative to the directory containing the dataset file. The audio file can be specified either as a file path or a base64 encoded Data URL.

If you are constructing chat messages programmatically, the equivalent of the above is:

from inspect_ai.model import ChatMessageUser, ContentAudio, ContentText

input = [
    ChatMessageUser(content=[
        ContentAudio(audio="sample.mp3", format="mp3"),
        ContentText(text="What words are spoken in this audio sample?")
    ])
]

Formats

You can provide audio files in one of two formats:

  • MP3
  • WAV

As demonstrated above, you should specify the format explicitly when including audio input.

Video

The following models currently support video inputs:

  • Google: Gemini 1.5 and 2.0 models

To include video in a dataset, use the JSON input format (either standard JSON or JSON Lines). For example, here we include video alongside some text content:

"input": [
  {
    "role": "user",
    "content": [
        { "type": "video", "video": "video.mp4", "format": "mp4" },
        { "type": "text", "text": "Can you please describe the attached video?"}
    ]
  }
]

The "video.mp4" path is resolved relative to the directory containing the dataset file. The video file can be specified either as a file path or a base64 encoded Data URL.

If you are constructing chat messages programmatically, the equivalent of the above is:

from inspect_ai.model import ChatMessageUser, ContentText, ContentVideo

input = [
    ChatMessageUser(content=[
        ContentVideo(video="video.mp4", format="mp4"),
        ContentText(text="Can you please describe the attached video?")
    ])
]

Formats

You can provide video files in one of three formats:

  • MP4
  • MPEG
  • MOV

As demonstrated above, you should specify the format explicitly when including video input.
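If you construct Data URLs yourself, the format strings above need to be mapped to MIME types. The lookup below is an assumption based on standard MIME registrations, not code from this PR:

```python
# Standard MIME types for the supported audio and video formats
# (note that MOV files use the video/quicktime MIME type).
MEDIA_MIME_TYPES = {
    "mp3": "audio/mpeg",
    "wav": "audio/wav",
    "mp4": "video/mp4",
    "mpeg": "video/mpeg",
    "mov": "video/quicktime",
}

def mime_for_format(format: str) -> str:
    # Look up the MIME type for a dataset "format" field value.
    return MEDIA_MIME_TYPES[format.lower()]
```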

Uploads

When using audio and video with the Google Gemini API, media is first uploaded using the File API and then the URL to the uploaded file is referenced in the chat message. This results in much faster performance for subsequent uses of the media file.

The File API lets you store up to 20GB of files per project, with a per-file maximum size of 2GB. Files are stored for 48 hours. They can be accessed in that period with your API key, but cannot be downloaded from the API. The File API is available at no cost in all regions where the Gemini API is available.

jjallaire requested a review from dragonstyle (January 11, 2025 15:14)
jjallaire merged commit 326285f into main on January 11, 2025 (9 checks passed)
jjallaire deleted the feature/audio-and-video-content branch (January 11, 2025 15:30)