docs: add Azure AI Search + Azure OpenAI RAG recipe notebook #675
base: main
Conversation
Merge Protections: Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit: Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
@vagenas @pablocastro can you review :)

@vagenas plz merge when ready
" raise EnvironmentError(\n", | ||
" \"No GPU or MPS device found. Proceed with CPU only if you understand the performance implications.\"\n", | ||
" )" |
I see the intent here, but raising an error in case of no GPU could be a bit too much — could perhaps just print the message instead.
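For instance, the fallback branch could just warn and continue (a minimal sketch of the suggested change, keeping the rest of the check as-is):

```python
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print("MPS GPU is enabled.")
else:
    # CPU fallback: print a warning instead of raising an error
    device = torch.device("cpu")
    print(
        "No GPU or MPS device found. Proceeding with CPU; "
        "parsing may be noticeably slower."
    )
```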
"source": [ | ||
"from openai import AzureOpenAI\n", | ||
"from azure.search.documents import SearchClient\n", | ||
"import uuid\n", |
`uuid` is not used, I would drop this line.
"# Part 2: Parse the PDF with Docling\n", | ||
"Example: \"State of AI\" slides from a remote link.\n", | ||
"\n", | ||
"You can find the raw powerpoint here: https://docs.google.com/presentation/d/1GmZmoWOa2O92BPrncRcTKa15xvQGhq7g4I4hJSNlC0M/edit?usp=sharing" |
I would use the same URL that is actually parsed below (which is a PDF and not a PowerPoint).
"\n", | ||
"console = Console()\n", | ||
"\n", | ||
"source_url = \"https://ignite2024demo.blob.core.windows.net/state-of-ai-2024/State of AI Report 2024.pdf\"\n", |
- I see this is 212 pages of PDF. If the size was explicitly meant to be in that range, it could help to include a quick heads-up explaining to the user that they would be parsing 200+ pages of PDF. From my point of view, a shorter document would work here.
- If we stick to this document, I would still URL-encode the URL (see the sketch below).
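For the URL encoding, something like this would do (a minimal sketch using only the standard library):

```python
from urllib.parse import quote

raw_url = "https://ignite2024demo.blob.core.windows.net/state-of-ai-2024/State of AI Report 2024.pdf"
# Percent-encode the spaces in the path while leaving the scheme and slashes intact
source_url = quote(raw_url, safe=":/")
print(source_url)
# https://ignite2024demo.blob.core.windows.net/state-of-ai-2024/State%20of%20AI%20Report%202024.pdf
```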
"outputs": [], | ||
"source": [ | ||
"# If running in a new environment, uncomment and run these:\n", | ||
"%pip install docling~=\"2.7.0\"\n", |
I see further below that `HierarchicalChunker` is imported from the `docling_core` import package, but the docling-core distribution package is not installed here. For simplicity, I would:

- require a minimum of docling 2.9.0 (i.e. `docling~="2.9"`), and
- then import `HierarchicalChunker` as `from docling.chunking import HierarchicalChunker`.

Note that if you pin three version components with the `~=` notation (e.g. `docling~="2.9.0"`), it will only consider that minor version (`2.9`) and not include e.g. `2.10` etc.
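Concretely, the proposal would look something like this (a sketch of the suggested install and import):

```python
# Two version components with "~=" allow any 2.x release from 2.9 upwards
%pip install docling~="2.9"

# Import the chunker via the docling meta-package instead of docling_core
from docling.chunking import HierarchicalChunker
```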
"%pip install azure-search-documents==11.5.2\n", | ||
"%pip install azure-identity\n", | ||
"%pip install openai\n", | ||
"%pip install rich\n", | ||
"%pip install torch" |
I would collapse all install statements into a single one.
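For example (combining the packages listed above into one command, using the docling pin suggested in the other comment):

```python
%pip install docling~="2.9" azure-search-documents==11.5.2 azure-identity openai rich torch
```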
```python
from docling_core.transforms.chunker import HierarchicalChunker
```
The import line discussed further above.
```
# RAG with Azure AI Search
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/farzad528/docling/blob/main/docs/examples/rag_azuresearch.ipynb)
```
Let's change the URL to the target one that should work after merging, i.e. using `DS4SD` instead of `farzad528`.
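That is, the badge link would become (only the GitHub org in the path changes):

```
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/rag_azuresearch.ipynb)
```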
"GPU or MPS usage can speed up Docling’s parsing (especially for large PDFs or when OCR/table extraction is needed). However, if no GPU is detected, you can comment out the following checks and proceed with CPU, albeit slower performance." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"CUDA GPU is enabled: NVIDIA A100 80GB PCIe\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"import torch\n", | ||
"\n", | ||
"if torch.cuda.is_available():\n", | ||
" device = torch.device(\"cuda\")\n", | ||
" print(f\"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}\")\n", | ||
"elif torch.backends.mps.is_available():\n", | ||
" device = torch.device(\"mps\")\n", | ||
" print(\"MPS GPU is enabled.\")\n", | ||
"else:\n", | ||
" # Comment out the error if you'd like to allow CPU fallback\n", | ||
" # But be aware parsing could be slower\n", | ||
" raise EnvironmentError(\n", | ||
" \"No GPU or MPS device found. Proceed with CPU only if you understand the performance implications.\"\n", | ||
" )" | ||
] |
Starting with version 2.12, Docling provides more extensive accelerator support — and automatically detects and enables any available accelerator device. You could therefore require `~="2.12"` in the `%pip install`s above and replace this section with a simple heads-up / disclaimer to the users.
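The replacement could be as simple as this (a sketch; the exact wording of the note is up to the author):

```python
# NOTE: Docling >= 2.12 automatically detects and uses any available
# accelerator (CUDA, MPS) and falls back to CPU otherwise. No manual
# device check is needed; just be aware that CPU-only parsing of large
# PDFs will be noticeably slower.
%pip install docling~="2.12"
```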
"AZURE_SEARCH_ENDPOINT = os.getenv(\"AZURE_SEARCH_ENDPOINT\")\n", | ||
"AZURE_SEARCH_KEY = os.getenv(\"AZURE_SEARCH_KEY\")\n", | ||
"AZURE_SEARCH_INDEX_NAME = os.getenv(\"AZURE_SEARCH_INDEX_NAME\")\n", | ||
"AZURE_OPENAI_ENDPOINT = os.getenv(\"AZURE_OPENAI_ENDPOINT\")\n", | ||
"AZURE_OPENAI_API_KEY = os.getenv(\"AZURE_OPENAI_API_KEY\") # Ensure this your Admin Key\n", | ||
"AZURE_OPENAI_CHAT_MODEL = os.getenv(\"AZURE_OPENAI_CHAT_MODEL\") # Using a deployed model named \"gpt-4o\"\n", | ||
"AZURE_OPENAI_API_VERSION = os.getenv(\"AZURE_OPENAI_API_VERSION\", \"2024-10-21\")\n", | ||
"AZURE_OPENAI_EMBEDDINGS = os.getenv(\"AZURE_OPENAI_EMBEDDINGS\") # Using a deployed model named \"text-embeddings-3-small\"" |
This section with `os.getenv()` is not going to work on Colab. Check out how we address this in other examples (see `_get_env_from_colab_or_os()`).
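For reference, the pattern is roughly the following (a sketch: read from Colab secrets when available, otherwise fall back to the environment; shown here for one of the variables):

```python
import os


def _get_env_from_colab_or_os(key):
    try:
        # Only available when running inside Google Colab
        from google.colab import userdata

        try:
            return userdata.get(key)
        except userdata.SecretNotFoundError:
            pass
    except ImportError:
        pass
    return os.getenv(key)


AZURE_SEARCH_ENDPOINT = _get_env_from_colab_or_os("AZURE_SEARCH_ENDPOINT")
```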
@farzad528 thanks for this nice example! 🙌 In order to align and merge:
Azure AI Search RAG Example Using Docling
This notebook demonstrates how to build a Retrieval Augmented Generation (RAG) system using Docling and Azure AI Search. The example showcases document parsing, chunking, vector search integration, and RAG query implementation.
Checklist: