Skip to content

Commit

Permalink
Resolve merge conflicts and make additional fixes and linter
Browse files Browse the repository at this point in the history
  • Loading branch information
fm1320 committed Dec 20, 2024
1 parent bb7cc10 commit f3cfa3e
Show file tree
Hide file tree
Showing 3 changed files with 490 additions and 0 deletions.
13 changes: 13 additions & 0 deletions docs/source/tutorials/embedder.rst
Original file line number Diff line number Diff line change
@@ -1,3 +1,16 @@
.. raw:: html

<div style="display: flex; justify-content: flex-start; align-items: center; margin-bottom: 20px;">
<a href="https://colab.research.google.com/github/SylphAI-Inc/LightRAG/blob/main/notebooks/tutorials/adalflow_embedder.ipynb" target="_blank" style="margin-right: 10px;">
<img alt="Try RAG playbook in Colab" src="https://colab.research.google.com/assets/colab-badge.svg" style="vertical-align: middle;">
</a>
<a href="https://github.com/SylphAI-Inc/AdalFlow/blob/main/tutorials/adalflow_embedder.py" target="_blank" style="display: flex; align-items: center;">
<img src="https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png" alt="GitHub" style="height: 20px; width: 20px; margin-right: 5px;">
<span style="vertical-align: middle;"> Open Source Code</span>
</a>
</div>


.. _tutorials-embedder:

Embedder
Expand Down
339 changes: 339 additions & 0 deletions notebooks/tutorials/adalflow_embedder.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,339 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 🤗 Welcome to AdalFlow!\n",
"## The library to build & auto-optimize any LLM task pipelines\n",
"\n",
"Thanks for trying us out, we're here to provide you with the best LLM application development experience you can dream of 😊 any questions or concerns you may have, [come talk to us on discord,](https://discord.gg/ezzszrRZvT) we're always here to help! ⭐ <i>Star us on <a href=\"https://github.com/SylphAI-Inc/AdalFlow\">Github</a> </i> ⭐\n",
"\n",
"\n",
"# Quick Links\n",
"\n",
"Github repo: https://github.com/SylphAI-Inc/AdalFlow\n",
"\n",
"Full Tutorials: https://adalflow.sylph.ai/index.html#.\n",
"\n",
"Deep dive on each API: check out the [developer notes](https://adalflow.sylph.ai/tutorials/index.html).\n",
"\n",
"Common use cases along with the auto-optimization: check out [Use cases](https://adalflow.sylph.ai/use_cases/index.html).\n",
"\n",
"# Author\n",
"\n",
"This notebook was created by community contributor [Name](Replace_to_github_or_other_social_account).\n",
"\n",
"# Outline\n",
"\n",
"This is a quick introduction of what AdalFlow is capable of. We will cover:\n",
"\n",
"* Simple Chatbot with structured output\n",
"* RAG task pipeline + Data processing pipeline\n",
"* Agent\n",
"\n",
"**Next: Try our [auto-optimization](https://colab.research.google.com/drive/1n3mHUWekTEYHiBdYBTw43TKlPN41A9za?usp=sharing)**\n",
"\n",
"\n",
"# Installation\n",
"\n",
"1. Use `pip` to install the `adalflow` Python package. We will need `openai`, `groq`, and `faiss`(cpu version) from the extra packages.\n",
"\n",
" ```bash\n",
" pip install adalflow[openai,groq,faiss-cpu]\n",
" ```\n",
"2. Setup `openai` and `groq` API key in the environment variables"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from IPython.display import clear_output\n",
"\n",
"!pip install -U adalflow[openai,groq,faiss-cpu]\n",
"\n",
"clear_output()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip uninstall httpx anyio -y\n",
"!pip install \"anyio>=3.1.0,<4.0\"\n",
"!pip install httpx==0.24.1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Set Environment Variables\n",
"\n",
"Run the following code and pass your api key.\n",
"\n",
"Note: for normal `.py` projects, follow our [official installation guide](https://lightrag.sylph.ai/get_started/installation.html).\n",
"\n",
"*Go to [OpenAI](https://platform.openai.com/docs/introduction) and [Groq](https://console.groq.com/docs/) to get API keys if you don't already have.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"from getpass import getpass\n",
"\n",
"# Prompt user to enter their API keys securely\n",
"openai_api_key = getpass(\"Please enter your OpenAI API key: \")\n",
"groq_api_key = getpass(\"Please enter your GROQ API key: \")\n",
"\n",
"\n",
"# Set environment variables\n",
"os.environ[\"OPENAI_API_KEY\"] = openai_api_key\n",
"os.environ[\"GROQ_API_KEY\"] = groq_api_key\n",
"\n",
"print(\"API keys have been set.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Embedder \n",
"\n",
"What you will learn?\n",
"\n",
"- What is Embedder and why is it designed this way?\n",
"\n",
"- When to use Embedder and how to use it?\n",
"\n",
"- How to batch processing with BatchEmbedder?\n",
"\n",
"core.embedder.Embedder class is similar to Generator, it is a user-facing component that orchestrates embedding models via ModelClient and output_processors. Compared with using ModelClient directly, Embedder further simplify the interface and output a standard EmbedderOutput format.\n",
"\n",
"By switching the ModelClient, you can use different embedding models in your task pipeline easily, or even embedd different data such as text, image, etc.\n",
"\n",
"# EmbedderOutput\n",
"core.types.EmbedderOutput is a standard output format of Embedder. It is a subclass of DataClass and it contains the following core fields:\n",
"\n",
"data: a list of embeddings, each embedding if of type core.types.Embedding.\n",
"\n",
"error: Error message if any error occurs during the model inference stage. Failure in the output processing stage will raise an exception instead of setting this field.\n",
"\n",
"raw_response: Used for failed model inference.\n",
"\n",
"Additionally, we add three properties to the EmbedderOutput:\n",
"\n",
"length: The number of embeddings in the data.\n",
"\n",
"embedding_dim: The dimension of the embeddings in the data.\n",
"\n",
"is_normalized: Whether the embeddings are normalized to unit vector or not using numpy.\n",
"\n",
"# Embedder in Action\n",
"We currently support all embedding models from OpenAI and ‘thenlper/gte-base’ from HuggingFace transformers. We will use these two to demonstrate how to use Embedder, one from the API provider and the other using local model. For the local model, you might need to ensure transformers is installed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from adalflow.core.embedder import Embedder\n",
"from adalflow.components.model_client import OpenAIClient\n",
"\n",
"model_kwargs = {\n",
" \"model\": \"text-embedding-3-small\",\n",
" \"dimensions\": 256,\n",
" \"encoding_format\": \"float\",\n",
"}\n",
"\n",
"query = \"What is the capital of China?\"\n",
"\n",
"queries = [query] * 100\n",
"\n",
"\n",
"embedder = Embedder(model_client=OpenAIClient(), model_kwargs=model_kwargs)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"output = embedder(query)\n",
"print(output.length, output.embedding_dim, output.is_normalized)\n",
"# 1 256 True\n",
"output = embedder(queries)\n",
"print(output.length, output.embedding_dim)\n",
"# 100 256"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Use Local Model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from adalflow.core.embedder import Embedder\n",
"from adalflow.components.model_client import TransformersClient\n",
"\n",
"model_kwargs = {\"model\": \"thenlper/gte-base\"}\n",
"local_embedder = Embedder(model_client=TransformersClient(), model_kwargs=model_kwargs)\n",
"\n",
"output = local_embedder(query)\n",
"print(output.length, output.embedding_dim, output.is_normalized)\n",
"# 1 768 True\n",
"\n",
"output = local_embedder(queries)\n",
"print(output.length, output.embedding_dim, output.is_normalized)\n",
"# 100 768 True"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Use Output Processors\n",
"If we want to decreate the embedding dimension to only 256 to save memory, we can customize an additional output processing step and pass it to embedder via the output_processors argument.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from adalflow.core.types import Embedding, EmbedderOutput\n",
"from adalflow.core.functional import normalize_vector\n",
"from typing import List\n",
"from adalflow.core.component import Component\n",
"from copy import deepcopy\n",
"\n",
"class DecreaseEmbeddingDim(Component):\n",
" def __init__(self, old_dim: int, new_dim: int, normalize: bool = True):\n",
" super().__init__()\n",
" self.old_dim = old_dim\n",
" self.new_dim = new_dim\n",
" self.normalize = normalize\n",
" assert self.new_dim < self.old_dim, \"new_dim should be less than old_dim\"\n",
"\n",
" def call(self, input: List[Embedding]) -> List[Embedding]:\n",
" output: EmbedderOutput = deepcopy(input)\n",
" for embedding in output.data:\n",
" old_embedding = embedding.embedding\n",
" new_embedding = old_embedding[: self.new_dim]\n",
" if self.normalize:\n",
" new_embedding = normalize_vector(new_embedding)\n",
" embedding.embedding = new_embedding\n",
" return output.data\n",
"\n",
" def _extra_repr(self) -> str:\n",
" repr_str = f\"old_dim={self.old_dim}, new_dim={self.new_dim}, normalize={self.normalize}\"\n",
" return repr_str"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"local_embedder_256 = Embedder(\n",
" model_client=TransformersClient(),\n",
" model_kwargs=model_kwargs,\n",
" output_processors=DecreaseEmbeddingDim(768, 256),\n",
" )\n",
"print(local_embedder_256)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"output = local_embedder_256(query)\n",
"print(output.length, output.embedding_dim, output.is_normalized)\n",
"# 1 256 True"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Batch Embedding\n",
"\n",
"Especially in data processing pipelines, you can often have more than 1000 queries to embed. We need to chunk our queries into smaller batches to avoid memory overflow. core.embedder.BatchEmbedder is designed to handle this situation. For now, the code is rather simple, but in the future it can be extended to support multi-processing when you use AdalFlow in production data pipeline.\n",
"\n",
"The BatchEmbedder orchestrates the Embedder and handles the batching process. To use it, you need to pass the Embedder and the batch size to the constructor."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from adalflow.core.embedder import BatchEmbedder\n",
"\n",
"batch_embedder = BatchEmbedder(embedder=local_embedder, batch_size=100)\n",
"\n",
"queries = [query] * 1000\n",
"\n",
"response = batch_embedder(queries)\n",
"# 100%|██████████| 11/11 [00:04<00:00, 2.59it/s]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To integrate your own embedding model or from API providers, you need to implement your own subclass of ModelClient.\n",
"\n",
"References\n",
"\n",
"transformers: https://huggingface.co/docs/transformers/en/index\n",
"\n",
"thenlper/gte-base model: https://huggingface.co/thenlper/gte-base\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Issues and feedback\n",
"\n",
"If you encounter any issues, please report them here: [GitHub Issues](https://github.com/SylphAI-Inc/LightRAG/issues).\n",
"\n",
"For feedback, you can use either the [GitHub discussions](https://github.com/SylphAI-Inc/LightRAG/discussions) or [Discord](https://discord.gg/ezzszrRZvT)."
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Loading

0 comments on commit f3cfa3e

Please sign in to comment.