Multimodal Embeddings? #414

Open
devlux76 opened this issue Jan 10, 2025 · 1 comment

@devlux76

I'd like to generate embeddings for both images and text using a multimodal model such as llama3.2-vision or minicpm-v, for instance over a PDF document with embedded images.

As far as I can tell this isn't supported, or at least it isn't documented.
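
For context, text-only embeddings already work through the Python client; here's a minimal sketch (the model name is just an example, and nothing in the documented call accepts images):

import ollama

# Text-only embeddings work today. "nomic-embed-text" is an example model;
# there is no documented way to attach images to this call.
response = ollama.embed(model="nomic-embed-text", input="some text extracted from the PDF")
print(len(response["embeddings"][0]))  # dimensionality of the first embedding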

Can someone explain to me what is needed here?

Thanks!

@dijarvrella

Hi @devlux76, this is possible today.

Take a look at this example:

# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "ollama",
# ]
# ///

import os
import sys
import ollama

PROMPT = "Describe the provided image in a few sentences"


def run_inference(model: str, image_path: str):
    # The client accepts a local file path in the "images" field and handles
    # encoding the image before sending it to the model.
    stream = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": PROMPT, "images": [image_path]}],
        stream=True,
    )

    # Print the streamed response as it arrives.
    for chunk in stream:
        print(chunk["message"]["content"], end="", flush=True)


def main():
    if len(sys.argv) != 3:
        print("Usage: python run.py <model_name> <image_path>")
        sys.exit(1)

    model_name = sys.argv[1]
    image_path = sys.argv[2]

    if not os.path.exists(image_path):
        print(f"Error: Image file '{image_path}' does not exist.")
        sys.exit(1)

    run_inference(model_name, image_path)


if __name__ == "__main__":
    main()
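
Thanks to the inline script metadata at the top, you can also run this with uv instead of installing the dependency yourself; the model name and image path below are just placeholders:

uv run run.py llama3.2-vision ./photo.jpg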
