
Commit

feat: add audio support with new Audio class and update documentation (#1095)

Co-authored-by: Ivan Leo <[email protected]>
jxnl and ivanleomk authored Oct 20, 2024
1 parent 59f1d6a commit 9a7822a
Showing 11 changed files with 503 additions and 348 deletions.
87 changes: 87 additions & 0 deletions docs/blog/posts/openai-multimodal.md
@@ -0,0 +1,87 @@
---
authors:
- jxnl
categories:
- OpenAI
- Audio
comments: true
date: 2024-10-17
description: Explore the new audio capabilities in OpenAI's Chat Completions API using the gpt-4o-audio-preview model.
draft: false
tags:
- OpenAI
- Audio Processing
- API
- Machine Learning
---

# Audio Support in OpenAI's Chat Completions API

OpenAI has recently introduced audio support in their Chat Completions API, opening up exciting new possibilities for developers working with audio and text interactions. This feature is powered by the new `gpt-4o-audio-preview` model, which brings advanced voice capabilities to the familiar Chat Completions API interface.

## Key Features

The new audio support in the Chat Completions API offers several compelling features:

1. **Flexible Input Handling**: The API can now process any combination of text and audio inputs, allowing for more versatile applications.

2. **Natural, Steerable Voices**: Similar to the Realtime API, developers can use prompting to shape various aspects of the generated audio, including language, pronunciation, and emotional range (see the sketch after this list).

3. **Tool Calling Integration**: The audio support seamlessly integrates with existing tool calling functionality, enabling complex workflows that combine audio, text, and external tools.
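
For the steerable-voice case, a minimal sketch using the plain OpenAI SDK (no `instructor` involved) might look like the following; the prompt wording, voice choice, and output filename are illustrative:

```python
import base64
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],  # request a spoken reply in addition to text
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": "Answer cheerfully and speak slowly: what is the capital of France?",
        }
    ],
)

# The spoken reply arrives as base64-encoded WAV data on the message.
wav_bytes = base64.b64decode(completion.choices[0].message.audio.data)
with open("answer.wav", "wb") as f:
    f.write(wav_bytes)
```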

## Practical Example

To demonstrate how to use this new functionality, let's look at a simple example using the `instructor` library:

"""python
from openai import OpenAI
from pydantic import BaseModel
import instructor
from instructor.multimodal import Audio
import base64

client = instructor.from_openai(OpenAI())

class Person(BaseModel):
    name: str
    age: int

resp = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    response_model=Person,
    modalities=["text"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                "Extract the following information from the audio",
                Audio.from_path("./output.wav"),
            ],
        },
    ],
)

print(resp)
# Expected output: Person(name='Jason', age=20)
"""

In this example, we're using the `gpt-4o-audio-preview` model to extract information from an audio file. The API processes the audio input and returns structured data (a Person object with name and age) based on the content of the audio.
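
Under the hood, `Audio.from_path` reads and base64-encodes the WAV file, and `instructor` converts it into the `input_audio` content part that OpenAI expects (this is what the new `Audio.to_openai` method in this commit produces). Roughly, the content part looks like the sketch below; the data value is shown as a truncated placeholder:

```python
# Approximate shape of the content part that Audio.from_path("./output.wav")
# contributes to the request once instructor converts the message content.
audio_part = {
    "type": "input_audio",
    "input_audio": {
        "data": "<base64-encoded WAV bytes>",  # truncated placeholder
        "format": "wav",
    },
}
```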

## Use Cases

The addition of audio support to the Chat Completions API enables a wide range of applications:

1. **Voice-based Personal Assistants**: Create more natural and context-aware voice interfaces for various applications.

2. **Audio Content Analysis**: Automatically extract information, sentiments, or key points from audio recordings or podcasts.

3. **Language Learning Tools**: Develop interactive language learning applications that can process and respond to spoken language.

4. **Accessibility Features**: Improve accessibility in applications by providing audio-based interactions and text-to-speech capabilities.

## Considerations

While this new feature is exciting, it's important to note that it's best suited for asynchronous use cases that don't require extremely low latencies. For more dynamic and real-time interactions, OpenAI recommends using their Realtime API.

As with any AI-powered feature, it's crucial to consider ethical implications and potential biases in audio processing and generation. Always test thoroughly and consider the diversity of your user base when implementing these features.
42 changes: 41 additions & 1 deletion docs/concepts/multimodal.md
@@ -1,6 +1,6 @@
---
title: Seamless Multimodal Interactions with Instructor
description: Learn how the Image class in Instructor enables seamless handling of images and text across different AI models.
description: Learn how the Image and Audio classes in Instructor enable seamless handling of images, audio, and text across different AI models.
---

# Multimodal
@@ -87,3 +87,43 @@ response = client.chat.completions.create(
autodetect_images=True
)
```

## `Audio`

The `Audio` class represents an audio file that can be loaded from a URL or file path; at the moment, only OpenAI supports audio inputs. You can create an instance using the `from_path` and `from_url` methods, and the `Audio` class will automatically convert the file to base64-encoded audio and include it in the API request.

### Usage

```python
from openai import OpenAI
from pydantic import BaseModel
import instructor
from instructor.multimodal import Audio
import base64

client = instructor.from_openai(OpenAI())


class User(BaseModel):
    name: str
    age: int


with open("./output.wav", "rb") as f:
    encoded_string = base64.b64encode(f.read()).decode("utf-8")

resp = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    response_model=User,
    modalities=["text"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                "Extract the following information from the audio:",
                Audio.from_path("./output.wav"),
            ],
        },
    ],
) # type: ignore

print(resp)
# > name='Jason' age=20
```
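
Loading audio from a URL works the same way via `Audio.from_url`, which downloads the file and base64-encodes it (it currently requires a `.wav` file). A minimal sketch, reusing the `client` and `User` model from above with a hypothetical URL:

```python
from instructor.multimodal import Audio

resp = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    response_model=User,
    modalities=["text"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                "Extract the following information from the audio:",
                # Hypothetical URL; it must point to a WAV file.
                Audio.from_url("https://example.com/sample.wav"),
            ],
        },
    ],
)  # type: ignore

print(resp)
```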
Binary file added examples/openai-audio/output.wav
Binary file not shown.
35 changes: 35 additions & 0 deletions examples/openai-audio/run.py
@@ -0,0 +1,35 @@
from openai import OpenAI
from pydantic import BaseModel
import instructor
from instructor.multimodal import Audio
import base64

client = instructor.from_openai(OpenAI())


class Person(BaseModel):
    name: str
    age: int


with open("./output.wav", "rb") as f:
    encoded_string = base64.b64encode(f.read()).decode("utf-8")

resp = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    response_model=Person,
    modalities=["text"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                "Extract the following information from the audio",
                Audio.from_path("./output.wav"),
            ],
        },
    ],
) # type: ignore

print(resp)
# > Person(name='Jason', age=20)
3 changes: 2 additions & 1 deletion instructor/__init__.py
@@ -3,7 +3,7 @@
from .mode import Mode
from .process_response import handle_response_model
from .distil import FinetuneFormat, Instructions
from .multimodal import Image
from .multimodal import Image, Audio
from .dsl import (
CitationMixin,
Maybe,
@@ -26,6 +26,7 @@
__all__ = [
"Instructor",
"Image",
"Audio",
"from_openai",
"from_litellm",
"AsyncInstructor",
1 change: 1 addition & 0 deletions instructor/batch.py
@@ -141,6 +141,7 @@ def create_from_messages(
messages=messages,
max_tokens=max_tokens,
temperature=temperature,
**kwargs,
),
)
else:
77 changes: 66 additions & 11 deletions instructor/multimodal.py
@@ -19,6 +19,7 @@
import mimetypes
import requests
from pydantic import BaseModel, Field
from .mode import Mode

F = TypeVar("F", bound=Callable[..., Any])
K = TypeVar("K", bound=Hashable)
@@ -50,11 +51,11 @@ class Image(BaseModel):
)

@classmethod
def autodetect(cls, source: str | Path) -> Image:
def autodetect(cls, source: Union[str, Path]) -> Image: # noqa: UP007
"""Attempt to autodetect an image from a source string or Path.
Args:
source (str | Path): The source string or path.
source (Union[str, Path]): The source string or path.
Returns:
An Image if the source is detected to be a valid image.
Raises:
@@ -75,11 +76,11 @@ def autodetect(cls, source: str | Path) -> Image:
raise ValueError("Unable to determine image type or unsupported image format")

@classmethod
def autodetect_safely(cls, source: str | Path) -> Union[Image, str]: # noqa: UP007
def autodetect_safely(cls, source: Union[str, Path]) -> Union[Image, str]: # noqa: UP007
"""Safely attempt to autodetect an image from a source string or path.
Args:
source (str | Path): The source string or path.
source (Union[str, Path]): The source string or path.
Returns:
An Image if the source is detected to be a valid image, otherwise
the source itself as a string.
@@ -146,7 +147,7 @@ def from_url(cls, url: str) -> Image:

@classmethod
@lru_cache
def from_path(cls, path: str | Path) -> Image:
def from_path(cls, path: Union[str, Path]) -> Image: # noqa: UP007
path = Path(path)
if not path.is_file():
raise FileNotFoundError(f"Image file not found: {path}")
@@ -204,8 +205,47 @@ def to_openai(self) -> dict[str, Any]:
raise ValueError("Image data is missing for base64 encoding.")


class Audio(BaseModel):
    """Represents an audio that can be loaded from a URL or file path."""

    source: Union[str, Path] = Field(..., description="URL or file path of the audio") # noqa: UP007
    data: Union[str, None] = Field( # noqa: UP007
        None, description="Base64 encoded audio data", repr=False
    )

    @classmethod
    def from_url(cls, url: str) -> Audio:
        """Create an Audio instance from a URL."""
        assert url.endswith(".wav"), "Audio must be in WAV format"

        response = requests.get(url)
        data = base64.b64encode(response.content).decode("utf-8")
        return cls(source=url, data=data)

    @classmethod
    def from_path(cls, path: Union[str, Path]) -> Audio: # noqa: UP007
        """Create an Audio instance from a file path."""
        path = Path(path)
        assert path.is_file(), f"Audio file not found: {path}"
        assert path.suffix.lower() == ".wav", "Audio must be in WAV format"

        data = base64.b64encode(path.read_bytes()).decode("utf-8")
        return cls(source=str(path), data=data)

    def to_openai(self) -> dict[str, Any]:
        """Convert the Audio instance to OpenAI's API format."""
        return {
            "type": "input_audio",
            "input_audio": {"data": self.data, "format": "wav"},
        }

    def to_anthropic(self) -> dict[str, Any]:
        raise NotImplementedError("Anthropic is not supported yet")


class ImageWithCacheControl(Image):
"""Image with Anthropic prompt caching support."""

cache_control: OptionalCacheControlType = Field(
None, description="Optional Anthropic cache control image"
)
@@ -232,14 +272,18 @@ def to_anthropic(self) -> dict[str, Any]:

def convert_contents(
contents: Union[ # noqa: UP007
list[Union[str, dict[str, Any], Image]], str, dict[str, Any], Image # noqa: UP007
str,
dict[str, Any],
Image,
Audio,
list[Union[str, dict[str, Any], Image, Audio]], # noqa: UP007
],
mode: Mode,
) -> Union[str, list[dict[str, Any]]]: # noqa: UP007
"""Convert content items to the appropriate format based on the specified mode."""
if isinstance(contents, str):
return contents
if isinstance(contents, Image) or isinstance(contents, dict):
if isinstance(contents, (Image, Audio)) or isinstance(contents, dict):
contents = [contents]

converted_contents: list[dict[str, Union[str, Image]]] = [] # noqa: UP007
Expand All @@ -248,7 +292,7 @@ def convert_contents(
converted_contents.append({"type": "text", "text": content})
elif isinstance(content, dict):
converted_contents.append(content)
elif isinstance(content, Image):
elif isinstance(content, (Image, Audio)):
if mode in {Mode.ANTHROPIC_JSON, Mode.ANTHROPIC_TOOLS}:
converted_contents.append(content.to_anthropic())
elif mode in {Mode.GEMINI_JSON, Mode.GEMINI_TOOLS}:
@@ -264,9 +308,15 @@ def convert_messages(
messages: list[
dict[
str,
Union[list[Union[str, dict[str, Any], Image]], str, dict[str, Any], Image], # noqa: UP007
Union[ # noqa: UP007
str,
dict[str, Any],
Image,
Audio,
list[Union[str, dict[str, Any], Image, Audio]], # noqa: UP007
],
]
], # noqa: UP007
],
mode: Mode,
autodetect_images: bool = False,
) -> list[dict[str, Any]]:
@@ -277,11 +327,16 @@ def is_image_params(x: Any) -> bool:
return isinstance(x, dict) and x.get("type") == "image" and "source" in x # type: ignore

for message in messages:
if "type" in message:
if message["type"] in {"audio", "image"}:
converted_messages.append(message) # type: ignore
else:
raise ValueError(f"Unsupported message type: {message['type']}")
role = message["role"]
content = message["content"]
if autodetect_images:
if isinstance(content, list):
new_content: list[Union[str, dict[str, Any], Image]] = [] # noqa: UP007
new_content: list[Union[str, dict[str, Any], Image, Audio]] = [] # noqa: UP007
for item in content:
if isinstance(item, str):
new_content.append(Image.autodetect_safely(item))
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -164,6 +164,7 @@ nav:
- "blog/index.md"
- Concepts:
- Models: 'concepts/models.md'
- Multimodal: 'concepts/multimodal.md'
- Retrying: 'concepts/retrying.md'
- Patching: 'concepts/patching.md'
- Hooks: 'concepts/hooks.md'