Add support for multimodal openai - early version #313

Merged · 9 commits · Jan 13, 2025
Changes from 3 commits
105 changes: 105 additions & 0 deletions adalflow/adalflow/components/model_client/openai_client.py
@@ -1,6 +1,7 @@
"""OpenAI ModelClient integration."""

import os
import base64
from typing import (
Dict,
Sequence,
@@ -51,6 +52,14 @@
log = logging.getLogger(__name__)
T = TypeVar("T")

# Models that support multimodal inputs
MULTIMODAL_MODELS = {
"gpt-4o", # Versatile, high-intelligence flagship model
"gpt-4o-mini", # Fast, affordable small model for focused tasks
"o1", # Reasoning model that excels at complex, multi-step tasks
"o1-mini", # Smaller reasoning model for complex tasks
}


# completion parsing functions; you can combine them into one single chat completion parser
def get_first_message_content(completion: ChatCompletion) -> str:
@@ -332,6 +341,102 @@ def to_dict(self) -> Dict[str, Any]:
output = super().to_dict(exclude=exclude)
return output

def _encode_image(self, image_path: str) -> str:
"""Encode image to base64 string.

Args:
image_path: Path to image file.

Returns:
Base64 encoded image string.
"""
with open(image_path, "rb") as image_file:
return base64.b64encode(image_file.read()).decode("utf-8")

def _prepare_image_content(
self, image_source: Union[str, Dict[str, Any]], detail: str = "auto"
) -> Dict[str, Any]:
"""Prepare image content for API request.

Args:
image_source: Either a path to local image or a URL.
detail: Image detail level ('auto', 'low', or 'high').

Returns:
Formatted image content for API request.
"""
if isinstance(image_source, str):
if image_source.startswith(("http://", "https://")):
return {
"type": "image_url",
"image_url": {"url": image_source, "detail": detail},
}
else:
base64_image = self._encode_image(image_source)
return {
"type": "image_url",
"image_url": {
"url": f"data:image/jpeg;base64,{base64_image}",
"detail": detail,
},
}
return image_source

def generate(
self,
prompt: str,
images: Optional[
Union[str, List[str], Dict[str, Any], List[Dict[str, Any]]]
] = None,
model_kwargs: Optional[Dict[str, Any]] = None,
) -> GeneratorOutput:
"""Generate text response for given prompt and optionally images.

Args:
prompt: Text prompt.
images: Optional image source(s) - can be path(s), URL(s), or formatted dict(s).
model_kwargs: Additional model parameters.

Returns:
GeneratorOutput containing the model's response.
"""
model_kwargs = model_kwargs or {}
model = model_kwargs.get("model", "gpt-4o-mini")
max_tokens = model_kwargs.get("max_tokens", 300)
detail = model_kwargs.get("detail", "auto")

# Check if model supports multimodal inputs when images are provided
if images and model not in MULTIMODAL_MODELS:
return GeneratorOutput(
error=f"Model {model} does not support multimodal inputs. Supported models: {MULTIMODAL_MODELS}"
)

# Prepare message content
if images:
content = [{"type": "text", "text": prompt}]
if not isinstance(images, list):
images = [images]
for img in images:
content.append(self._prepare_image_content(img, detail))
messages = [{"role": "user", "content": content}]
else:
messages = [{"role": "user", "content": prompt}]

try:
response = self.client.chat.completions.create(
model=model,
messages=messages,
max_tokens=max_tokens,
)
return GeneratorOutput(
id=response.id,
data=response.choices[0].message.content,
usage=response.usage.model_dump() if response.usage else None,
raw_response=response.model_dump(),
)
except Exception as e:
return GeneratorOutput(error=str(e))


# if __name__ == "__main__":
# from adalflow.core import Generator
10 changes: 10 additions & 0 deletions adalflow/adalflow/utils/lazy_import.py
@@ -215,3 +215,13 @@ def safe_import(
raise ImportError(f"{install_message}")

return return_modules[0] if len(return_modules) == 1 else return_modules


OPTIONAL_PACKAGES = {
"openai": "openai", # For OpenAI API clients
"transformers": "transformers", # For local models
"torch": "torch", # For PyTorch models
"anthropic": "anthropic", # For Claude models
"groq": "groq", # For Groq models
"cohere": "cohere", # For Cohere models
}
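
A usage sketch (not part of this diff; the `safe_import` signature is assumed from the snippet above):

    from adalflow.utils.lazy_import import safe_import, OPTIONAL_PACKAGES

    # Hypothetical: lazily import openai, surfacing an install hint on failure
    openai = safe_import(
        "openai",
        f"openai is required. Install it with: pip install {OPTIONAL_PACKAGES['openai']}",
    )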
171 changes: 171 additions & 0 deletions docs/source/tutorials/multimodal.rst
@@ -0,0 +1,171 @@
.. _tutorials-multimodal:

Multimodal Generation
=====================

.. raw:: html

<div style="display: flex; justify-content: flex-start; align-items: center; margin-bottom: 20px;">
<a href="https://colab.research.google.com/github/SylphAI-Inc/AdalFlow/blob/main/notebooks/tutorials/adalflow_multimodal.ipynb" target="_blank" style="margin-right: 10px;">
<img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg" style="vertical-align: middle;">
</a>
<a href="https://github.com/SylphAI-Inc/AdalFlow/blob/main/notebooks/tutorials/adalflow_multimodal.ipynb" target="_blank" style="display: flex; align-items: center;">
<img src="https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png" alt="GitHub" style="height: 20px; width: 20px; margin-right: 5px;">
<span style="vertical-align: middle;"> View Source</span>
</a>
</div>

What you will learn
-------------------

1. How to use OpenAI's multimodal capabilities in AdalFlow
2. Different ways to input images (local files, URLs)
3. Controlling image detail levels
4. Working with multiple images

Multimodal Support in OpenAIClient
----------------------------------

The :class:`OpenAIClient` supports both text and image inputs. For multimodal generation, you can use the following models:

- ``gpt-4o``: Versatile, high-intelligence flagship model
- ``gpt-4o-mini``: Fast, affordable small model for focused tasks (default)
- ``o1``: Reasoning model that excels at complex, multi-step tasks
- ``o1-mini``: Smaller reasoning model for complex tasks

The client supports:

- Local image files (automatically encoded to base64)
- Image URLs
- Multiple images in a single request
- Control over image detail level

Basic Usage
-----------

First, install AdalFlow with OpenAI support:

.. code-block:: bash

pip install "adalflow[openai]"

Then you can use the client with the Generator. By default, it uses ``gpt-4o-mini``, but you can specify any supported model:

.. code-block:: python

from adalflow import Generator, OpenAIClient

# Using the default gpt-4o-mini model
generator = Generator(
model_client=OpenAIClient(),
model_kwargs={
"model": "gpt-4o-mini", # or "gpt-4o", "o1", "o1-mini"
"max_tokens": 300
}
)

# Using an image URL
response = generator(
prompt="Describe this image.",
images="https://example.com/image.jpg"
)

# Using the flagship model for more complex tasks
generator_flagship = Generator(
model_client=OpenAIClient(),
model_kwargs={
"model": "gpt-4o",
"max_tokens": 300
}
)
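
You can also pass a local file path; the client encodes it to base64 automatically. The path below is a placeholder:

.. code-block:: python

    # Using a local image file (encoded to base64 under the hood)
    response = generator(
        prompt="What is in this photo?",
        images="path/to/local/photo.jpg",  # placeholder path
    )
    print(response.data)   # the model's text answer
    print(response.error)  # set instead of data if the call failed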

Image Detail Levels
-------------------

The client supports three detail levels:

- ``auto``: Let the model decide based on image size (default)
- ``low``: Low-resolution mode (512px x 512px)
- ``high``: High-resolution mode with detailed crops

.. code-block:: python

generator = Generator(
model_client=OpenAIClient(),
model_kwargs={
"model": "gpt-4o-mini",
"detail": "high" # or "low" or "auto"
}
)

Multiple Images
---------------

You can analyze multiple images in one request:

.. code-block:: python

images = [
"path/to/local/image.jpg",
"https://example.com/image.jpg"
]

response = generator(
prompt="Compare these images.",
images=images
)

Implementation Details
----------------------

The client handles:

1. Image Processing:
- Automatic base64 encoding for local files
- URL validation and formatting
- Detail level configuration

2. API Integration:
- Proper message formatting for OpenAI's vision models
- Error handling and response parsing
- Model compatibility checking
- Usage tracking

3. Output Format:
- Returns standard :class:`GeneratorOutput` format
- Includes model usage information
- Preserves error messages if any occur
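
For reference, this is roughly the message payload ``generate`` builds for one URL image and one local file (a sketch based on ``_prepare_image_content``; the base64 string is abbreviated):

.. code-block:: python

    # Sketch of the Chat Completions payload for a mixed image request
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare these images."},
                # URL images are passed through directly
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/image.jpg", "detail": "auto"},
                },
                # Local files are base64-encoded into a data URL
                {
                    "type": "image_url",
                    "image_url": {"url": "data:image/jpeg;base64,<...>", "detail": "auto"},
                },
            ],
        }
    ]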

Limitations
-----------

Be aware of these limitations when using multimodal features:

1. Model Support and Capabilities:
- Four models available with different strengths:
- ``gpt-4o``: Best for complex visual analysis and detailed understanding
- ``gpt-4o-mini``: Good balance of speed and accuracy for common tasks
- ``o1``: Excels at multi-step reasoning with visual inputs
- ``o1-mini``: Efficient for focused visual reasoning tasks
- The client will return an error if using an unsupported model with images

2. Image Size and Format:
- Maximum file size: 20MB per image
- Supported formats: PNG, JPEG, WEBP, non-animated GIF (a pre-upload validation sketch follows this list)

3. Common Limitations:
- May struggle with:
- Very small or blurry text
- Complex spatial relationships
- Detailed technical diagrams
- Non-Latin text or symbols

4. Cost and Performance Considerations:
- Image inputs increase token usage
- High detail mode uses more tokens
- Consider using:
- ``gpt-4o-mini`` for routine tasks
- ``o1-mini`` for basic reasoning tasks
- ``gpt-4o`` or ``o1`` for complex analysis
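
A minimal pre-upload check against the size and format limits above (the helper below is illustrative, not part of AdalFlow):

.. code-block:: python

    import os

    ALLOWED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp", ".gif"}
    MAX_BYTES = 20 * 1024 * 1024  # 20MB per image

    def validate_image(path: str) -> None:
        """Raise ValueError if a local image exceeds the documented limits."""
        ext = os.path.splitext(path)[1].lower()
        if ext not in ALLOWED_EXTENSIONS:
            raise ValueError(f"Unsupported image format: {ext}")
        if os.path.getsize(path) > MAX_BYTES:
            raise ValueError(f"Image exceeds the 20MB limit: {path}")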

For more details, see the :class:`OpenAIClient` API reference.