
Commit

feat: add audio support with new Audio class and update documentation (#1095)

Co-authored-by: Ivan Leo <[email protected]>
jxnl and ivanleomk authored Oct 20, 2024
1 parent 59f1d6a commit 9a7822a
Showing 11 changed files with 503 additions and 348 deletions.
87 changes: 87 additions & 0 deletions docs/blog/posts/openai-multimodal.md
@@ -0,0 +1,87 @@
---
authors:
- jxnl
categories:
- OpenAI
- Audio
comments: true
date: 2024-10-17
description: Explore the new audio capabilities in OpenAI's Chat Completions API using the gpt-4o-audio-preview model.
draft: false
tags:
- OpenAI
- Audio Processing
- API
- Machine Learning
---

# Audio Support in OpenAI's Chat Completions API

OpenAI has recently introduced audio support in their Chat Completions API, opening up exciting new possibilities for developers working with audio and text interactions. This feature is powered by the new `gpt-4o-audio-preview` model, which brings advanced voice capabilities to the familiar Chat Completions API interface.

## Key Features

The new audio support in the Chat Completions API offers several compelling features:

1. **Flexible Input Handling**: The API can now process any combination of text and audio inputs, allowing for more versatile applications.

2. **Natural, Steerable Voices**: Similar to the Realtime API, developers can use prompting to shape various aspects of the generated audio, including language, pronunciation, and emotional range (see the sketch after this list).

3. **Tool Calling Integration**: The audio support seamlessly integrates with existing tool calling functionality, enabling complex workflows that combine audio, text, and external tools.
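
For the steerable-voice case, a minimal sketch using the plain OpenAI SDK (no `instructor` involved) might look like the following; the prompt wording, voice choice, and output filename are illustrative:

```python
import base64
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],  # request a spoken reply in addition to text
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": "Answer cheerfully and speak slowly: what is the capital of France?",
        }
    ],
)

# The spoken reply arrives as base64-encoded WAV data on the message.
wav_bytes = base64.b64decode(completion.choices[0].message.audio.data)
with open("answer.wav", "wb") as f:
    f.write(wav_bytes)
```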

## Practical Example

To demonstrate how to use this new functionality, let's look at a simple example using the `instructor` library:

"""python
from openai import OpenAI
from pydantic import BaseModel
import instructor
from instructor.multimodal import Audio
import base64

client = instructor.from_openai(OpenAI())

class Person(BaseModel):
    name: str
    age: int

resp = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    response_model=Person,
    modalities=["text"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                "Extract the following information from the audio",
                Audio.from_path("./output.wav"),
            ],
        },
    ],
)

print(resp)
# Expected output: Person(name='Jason', age=20)
"""

In this example, we're using the `gpt-4o-audio-preview` model to extract information from an audio file. The API processes the audio input and returns structured data (a Person object with name and age) based on the content of the audio.
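
Under the hood, `Audio.from_path` reads and base64-encodes the WAV file, and `instructor` converts it into the `input_audio` content part that OpenAI expects (this is what the new `Audio.to_openai` method in this commit produces). Roughly, the content part looks like the sketch below; the data value is shown as a truncated placeholder:

```python
# Approximate shape of the content part that Audio.from_path("./output.wav")
# contributes to the request once instructor converts the message content.
audio_part = {
    "type": "input_audio",
    "input_audio": {
        "data": "<base64-encoded WAV bytes>",  # truncated placeholder
        "format": "wav",
    },
}
```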

## Use Cases

The addition of audio support to the Chat Completions API enables a wide range of applications:

1. **Voice-based Personal Assistants**: Create more natural and context-aware voice interfaces for various applications.

2. **Audio Content Analysis**: Automatically extract information, sentiments, or key points from audio recordings or podcasts.

3. **Language Learning Tools**: Develop interactive language learning applications that can process and respond to spoken language.

4. **Accessibility Features**: Improve accessibility in applications by providing audio-based interactions and text-to-speech capabilities.

## Considerations

While this new feature is exciting, it's important to note that it's best suited for asynchronous use cases that don't require extremely low latencies. For more dynamic and real-time interactions, OpenAI recommends using their Realtime API.

As with any AI-powered feature, it's crucial to consider ethical implications and potential biases in audio processing and generation. Always test thoroughly and consider the diversity of your user base when implementing these features.
42 changes: 41 additions & 1 deletion docs/concepts/multimodal.md
@@ -1,6 +1,6 @@
---
title: Seamless Multimodal Interactions with Instructor
description: Learn how the Image class in Instructor enables seamless handling of images and text across different AI models.
description: Learn how the Image and Audio classes in Instructor enable seamless handling of images, audio, and text across different AI models.
---

# Multimodal
@@ -87,3 +87,43 @@ response = client.chat.completions.create(
autodetect_images=True
)
```

## `Audio`

The `Audio` class represents an audio file that can be loaded from a URL or file path; at the moment, only OpenAI supports audio inputs. You can create an instance using the `from_path` and `from_url` methods, and the `Audio` class will automatically convert the file to base64-encoded audio and include it in the API request.

### Usage

```python
from openai import OpenAI
from pydantic import BaseModel
import instructor
from instructor.multimodal import Audio
import base64

client = instructor.from_openai(OpenAI())


class User(BaseModel):
    name: str
    age: int


with open("./output.wav", "rb") as f:
    encoded_string = base64.b64encode(f.read()).decode("utf-8")

resp = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    response_model=User,
    modalities=["text"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                "Extract the following information from the audio:",
                Audio.from_path("./output.wav"),
            ],
        },
    ],
) # type: ignore

print(resp)
# > name='Jason' age=20
```
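
Loading audio from a URL works the same way via `Audio.from_url`, which downloads the file and base64-encodes it (it currently requires a `.wav` file). A minimal sketch, reusing the `client` and `User` model from above with a hypothetical URL:

```python
from instructor.multimodal import Audio

resp = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    response_model=User,
    modalities=["text"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                "Extract the following information from the audio:",
                # Hypothetical URL; it must point to a WAV file.
                Audio.from_url("https://example.com/sample.wav"),
            ],
        },
    ],
)  # type: ignore

print(resp)
```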
Binary file added examples/openai-audio/output.wav
Binary file not shown.
35 changes: 35 additions & 0 deletions examples/openai-audio/run.py
@@ -0,0 +1,35 @@
from openai import OpenAI
from pydantic import BaseModel
import instructor
from instructor.multimodal import Audio
import base64

client = instructor.from_openai(OpenAI())


class Person(BaseModel):
    name: str
    age: int


with open("./output.wav", "rb") as f:
    encoded_string = base64.b64encode(f.read()).decode("utf-8")

resp = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    response_model=Person,
    modalities=["text"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                "Extract the following information from the audio",
                Audio.from_path("./output.wav"),
            ],
        },
    ],
) # type: ignore

print(resp)
# > Person(name='Jason', age=20)
3 changes: 2 additions & 1 deletion instructor/__init__.py
@@ -3,7 +3,7 @@
from .mode import Mode
from .process_response import handle_response_model
from .distil import FinetuneFormat, Instructions
from .multimodal import Image
from .multimodal import Image, Audio
from .dsl import (
CitationMixin,
Maybe,
@@ -26,6 +26,7 @@
__all__ = [
"Instructor",
"Image",
"Audio",
"from_openai",
"from_litellm",
"AsyncInstructor",
1 change: 1 addition & 0 deletions instructor/batch.py
@@ -141,6 +141,7 @@ def create_from_messages(
messages=messages,
max_tokens=max_tokens,
temperature=temperature,
**kwargs,
),
)
else:
77 changes: 66 additions & 11 deletions instructor/multimodal.py
@@ -19,6 +19,7 @@
import mimetypes
import requests
from pydantic import BaseModel, Field
from .mode import Mode

F = TypeVar("F", bound=Callable[..., Any])
K = TypeVar("K", bound=Hashable)
@@ -50,11 +51,11 @@ class Image(BaseModel):
)

@classmethod
def autodetect(cls, source: str | Path) -> Image:
def autodetect(cls, source: Union[str, Path]) -> Image: # noqa: UP007
"""Attempt to autodetect an image from a source string or Path.
Args:
source (str | Path): The source string or path.
source (Union[str, Path]): The source string or path.
Returns:
An Image if the source is detected to be a valid image.
Raises:
@@ -75,11 +76,11 @@ def autodetect(cls, source: str | Path) -> Image:
raise ValueError("Unable to determine image type or unsupported image format")

@classmethod
def autodetect_safely(cls, source: str | Path) -> Union[Image, str]: # noqa: UP007
def autodetect_safely(cls, source: Union[str, Path]) -> Union[Image, str]: # noqa: UP007
"""Safely attempt to autodetect an image from a source string or path.
Args:
source (str | Path): The source string or path.
source (Union[str, Path]): The source string or path.
Returns:
An Image if the source is detected to be a valid image, otherwise
the source itself as a string.
@@ -146,7 +147,7 @@ def from_url(cls, url: str) -> Image:

@classmethod
@lru_cache
def from_path(cls, path: str | Path) -> Image:
def from_path(cls, path: Union[str, Path]) -> Image: # noqa: UP007
path = Path(path)
if not path.is_file():
raise FileNotFoundError(f"Image file not found: {path}")
@@ -204,8 +205,47 @@ def to_openai(self) -> dict[str, Any]:
raise ValueError("Image data is missing for base64 encoding.")


class Audio(BaseModel):
    """Represents an audio that can be loaded from a URL or file path."""

    source: Union[str, Path] = Field(..., description="URL or file path of the audio") # noqa: UP007
    data: Union[str, None] = Field( # noqa: UP007
        None, description="Base64 encoded audio data", repr=False
    )

    @classmethod
    def from_url(cls, url: str) -> Audio:
        """Create an Audio instance from a URL."""
        assert url.endswith(".wav"), "Audio must be in WAV format"

        response = requests.get(url)
        data = base64.b64encode(response.content).decode("utf-8")
        return cls(source=url, data=data)

    @classmethod
    def from_path(cls, path: Union[str, Path]) -> Audio: # noqa: UP007
        """Create an Audio instance from a file path."""
        path = Path(path)
        assert path.is_file(), f"Audio file not found: {path}"
        assert path.suffix.lower() == ".wav", "Audio must be in WAV format"

        data = base64.b64encode(path.read_bytes()).decode("utf-8")
        return cls(source=str(path), data=data)

    def to_openai(self) -> dict[str, Any]:
        """Convert the Audio instance to OpenAI's API format."""
        return {
            "type": "input_audio",
            "input_audio": {"data": self.data, "format": "wav"},
        }

    def to_anthropic(self) -> dict[str, Any]:
        raise NotImplementedError("Anthropic is not supported yet")


class ImageWithCacheControl(Image):
"""Image with Anthropic prompt caching support."""

cache_control: OptionalCacheControlType = Field(
None, description="Optional Anthropic cache control image"
)
@@ -232,14 +272,18 @@ def to_anthropic(self) -> dict[str, Any]:

def convert_contents(
contents: Union[ # noqa: UP007
list[Union[str, dict[str, Any], Image]], str, dict[str, Any], Image # noqa: UP007
str,
dict[str, Any],
Image,
Audio,
list[Union[str, dict[str, Any], Image, Audio]], # noqa: UP007
],
mode: Mode,
) -> Union[str, list[dict[str, Any]]]: # noqa: UP007
"""Convert content items to the appropriate format based on the specified mode."""
if isinstance(contents, str):
return contents
if isinstance(contents, Image) or isinstance(contents, dict):
if isinstance(contents, (Image, Audio)) or isinstance(contents, dict):
contents = [contents]

converted_contents: list[dict[str, Union[str, Image]]] = [] # noqa: UP007
Expand All @@ -248,7 +292,7 @@ def convert_contents(
converted_contents.append({"type": "text", "text": content})
elif isinstance(content, dict):
converted_contents.append(content)
elif isinstance(content, Image):
elif isinstance(content, (Image, Audio)):
if mode in {Mode.ANTHROPIC_JSON, Mode.ANTHROPIC_TOOLS}:
converted_contents.append(content.to_anthropic())
elif mode in {Mode.GEMINI_JSON, Mode.GEMINI_TOOLS}:
@@ -264,9 +308,15 @@ def convert_messages(
messages: list[
dict[
str,
Union[list[Union[str, dict[str, Any], Image]], str, dict[str, Any], Image], # noqa: UP007
Union[ # noqa: UP007
str,
dict[str, Any],
Image,
Audio,
list[Union[str, dict[str, Any], Image, Audio]], # noqa: UP007
],
]
], # noqa: UP007
],
mode: Mode,
autodetect_images: bool = False,
) -> list[dict[str, Any]]:
@@ -277,11 +327,16 @@ def is_image_params(x: Any) -> bool:
return isinstance(x, dict) and x.get("type") == "image" and "source" in x # type: ignore

for message in messages:
if "type" in message:
if message["type"] in {"audio", "image"}:
converted_messages.append(message) # type: ignore
else:
raise ValueError(f"Unsupported message type: {message['type']}")
role = message["role"]
content = message["content"]
if autodetect_images:
if isinstance(content, list):
new_content: list[Union[str, dict[str, Any], Image]] = [] # noqa: UP007
new_content: list[Union[str, dict[str, Any], Image, Audio]] = [] # noqa: UP007
for item in content:
if isinstance(item, str):
new_content.append(Image.autodetect_safely(item))
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -164,6 +164,7 @@ nav:
- "blog/index.md"
- Concepts:
- Models: 'concepts/models.md'
- Multimodal: 'concepts/multimodal.md'
- Retrying: 'concepts/retrying.md'
- Patching: 'concepts/patching.md'
- Hooks: 'concepts/hooks.md'