
docs: add Azure AI Search + Azure OpenAI RAG recipe notebook #675

Open

farzad528 wants to merge 10 commits into base: main
Conversation

farzad528

Azure AI Search RAG Example Using Docling

This notebook demonstrates how to build a Retrieval Augmented Generation (RAG) system using Docling and Azure AI Search. The example showcases document parsing, chunking, vector search integration, and RAG query implementation.

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • [ ] Tests have been added, if necessary.


mergify bot commented Jan 4, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

Author

farzad528 commented Jan 5, 2025

@vagenas @pablocastro can you review :)

docs/examples/rag_azuresearch.ipynb: 6 review threads (resolved)
@vagenas vagenas self-assigned this Jan 10, 2025
@farzad528 farzad528 reopened this Jan 10, 2025
Author

farzad528 commented Jan 13, 2025

@vagenas plz merge when ready

Comment on lines +84 to +86
" raise EnvironmentError(\n",
" \"No GPU or MPS device found. Proceed with CPU only if you understand the performance implications.\"\n",
" )"
Contributor

I see the intent here, but raising an error in case of no GPU could be a bit too much — could perhaps just print the message instead.
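A minimal sketch of the softer behavior suggested here (names and structure are an assumption, not the notebook's final code): fall back to CPU with a printed warning instead of raising.

```python
def pick_device() -> str:
    """Return the best available torch device name, warning instead of raising."""
    try:
        import torch
    except ImportError:
        print("torch not installed; defaulting to CPU.")
        return "cpu"
    if torch.cuda.is_available():
        print(f"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}")
        return "cuda"
    # Guard the attribute access so this also works on older torch builds.
    if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
        print("MPS GPU is enabled.")
        return "mps"
    print("No GPU or MPS device found; proceeding on CPU (parsing may be slower).")
    return "cpu"

device = pick_device()
```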

"source": [
"from openai import AzureOpenAI\n",
"from azure.search.documents import SearchClient\n",
"import uuid\n",
Contributor

uuid is not used, I would drop this line.

"# Part 2: Parse the PDF with Docling\n",
"Example: \"State of AI\" slides from a remote link.\n",
"\n",
"You can find the raw powerpoint here: https://docs.google.com/presentation/d/1GmZmoWOa2O92BPrncRcTKa15xvQGhq7g4I4hJSNlC0M/edit?usp=sharing"
Contributor

I would use the same URL that is actually parsed below (which is a PDF and not a PowerPoint).

"\n",
"console = Console()\n",
"\n",
"source_url = \"https://ignite2024demo.blob.core.windows.net/state-of-ai-2024/State of AI Report 2024.pdf\"\n",
Contributor

  1. I see this is 212 pages of PDF. If a document of that size was explicitly intended, it could help to include a quick heads-up telling the user that they would be parsing 200+ pages of PDF. From my point of view a shorter document would work here.
  2. If we stick to this document, I would still URL-encode the URL.
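The URL-encoding suggested above can be sketched with the standard library (the blob URL is the one from the notebook; only the spaces in the path need escaping):

```python
from urllib.parse import quote

source_url = (
    "https://ignite2024demo.blob.core.windows.net/state-of-ai-2024/"
    "State of AI Report 2024.pdf"
)
# Keep the scheme separator ':' and path separators '/' intact;
# everything else (here, the spaces) gets percent-encoded.
encoded_url = quote(source_url, safe=":/")
print(encoded_url)
```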

"outputs": [],
"source": [
"# If running in a new environment, uncomment and run these:\n",
"%pip install docling~=\"2.7.0\"\n",
Contributor

@vagenas commented Jan 13, 2025

I see further below that HierarchicalChunker is imported from the docling_core import package, but the docling-core distribution package is not installed here. For simplicity, I would:

  1. require a minimum of docling 2.9.0 (i.e. docling~="2.9"), and
  2. then import HierarchicalChunker as from docling.chunking import HierarchicalChunker

⚠️ Note: if you include the patch version number in the ~= notation (e.g. docling~="2.9.0") it will only consider that minor version (2.9) and not include e.g. 2.10 etc.
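The `~=` (compatible release) semantics noted above come from PEP 440 and can be illustrated with a small helper (the function name and tuple representation are illustrative, not part of pip's API):

```python
def compatible_release_bounds(spec: str):
    """Bounds implied by PEP 440 '~=spec': ~=X.Y means >=X.Y,<(X+1).0,
    while ~=X.Y.Z means >=X.Y.Z,<X.(Y+1) -- so ~=\"2.9.0\" excludes 2.10."""
    parts = [int(p) for p in spec.split(".")]
    lower = tuple(parts)
    # Drop the last component and bump the new last one for the upper bound.
    upper = tuple(parts[:-2] + [parts[-2] + 1])
    return lower, upper

print(compatible_release_bounds("2.9"))    # allows 2.9 up to (not incl.) 3.0
print(compatible_release_bounds("2.9.0"))  # allows 2.9.x only, excludes 2.10
```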

Comment on lines +40 to +44
"%pip install azure-search-documents==11.5.2\n",
"%pip install azure-identity\n",
"%pip install openai\n",
"%pip install rich\n",
"%pip install torch"
Contributor

I would collapse all install statements into a single one.
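A sketch of the collapsed install cell (the docling pin follows the `~="2.9"` suggestion elsewhere in this thread and is an assumption, not the merged version):

```
%pip install "docling~=2.9" azure-search-documents==11.5.2 azure-identity openai rich torch
```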

}
],
"source": [
"from docling_core.transforms.chunker import HierarchicalChunker\n",
Contributor

The import line discussed further above.

},
"source": [
"# RAG with Azure AI Search\n",
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/farzad528/docling/blob/main/docs/examples/rag_azuresearch.ipynb)\n",
Contributor

Let's change the URL to the target one that should work after merging, i.e. using DS4SD instead of farzad528.

Comment on lines +56 to +87
"GPU or MPS usage can speed up Docling’s parsing (especially for large PDFs or when OCR/table extraction is needed). However, if no GPU is detected, you can comment out the following checks and proceed with CPU, albeit with slower performance."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CUDA GPU is enabled: NVIDIA A100 80GB PCIe\n"
]
}
],
"source": [
"import torch\n",
"\n",
"if torch.cuda.is_available():\n",
" device = torch.device(\"cuda\")\n",
" print(f\"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}\")\n",
"elif torch.backends.mps.is_available():\n",
" device = torch.device(\"mps\")\n",
" print(\"MPS GPU is enabled.\")\n",
"else:\n",
" # Comment out the error if you'd like to allow CPU fallback\n",
" # But be aware parsing could be slower\n",
" raise EnvironmentError(\n",
" \"No GPU or MPS device found. Proceed with CPU only if you understand the performance implications.\"\n",
" )"
]
Contributor

@vagenas commented Jan 13, 2025

Starting with version 2.12, Docling provides more extensive accelerator support — and automatically detects and enables any available accelerator device.
You could therefore require ~="2.12" in the pip installs above and replace this section with some simple heads-up / disclaimer to the users.

Comment on lines +107 to +114
"AZURE_SEARCH_ENDPOINT = os.getenv(\"AZURE_SEARCH_ENDPOINT\")\n",
"AZURE_SEARCH_KEY = os.getenv(\"AZURE_SEARCH_KEY\")\n",
"AZURE_SEARCH_INDEX_NAME = os.getenv(\"AZURE_SEARCH_INDEX_NAME\")\n",
"AZURE_OPENAI_ENDPOINT = os.getenv(\"AZURE_OPENAI_ENDPOINT\")\n",
"AZURE_OPENAI_API_KEY = os.getenv(\"AZURE_OPENAI_API_KEY\")  # Ensure this is your Admin Key\n",
"AZURE_OPENAI_CHAT_MODEL = os.getenv(\"AZURE_OPENAI_CHAT_MODEL\") # Using a deployed model named \"gpt-4o\"\n",
"AZURE_OPENAI_API_VERSION = os.getenv(\"AZURE_OPENAI_API_VERSION\", \"2024-10-21\")\n",
"AZURE_OPENAI_EMBEDDINGS = os.getenv(\"AZURE_OPENAI_EMBEDDINGS\") # Using a deployed model named \"text-embeddings-3-small\""
Contributor

This section with os.getenv() is not going to work on Colab. Check out how we address this e.g. here (see `_get_env_from_colab_or_os()`).

@vagenas
Contributor

vagenas commented Jan 13, 2025

@farzad528 thanks for this nice example! 🙌

In order to align and merge:

  1. [CI] DCO check fails, i.e. commits are missing sign-off. To add your sign-off:
    • In your local branch, run: git rebase HEAD~10 --signoff
    • Force push your changes to overwrite the branch: git push --force-with-lease origin main
  2. [CI] Code styling fails. To address, install & run the commit hooks locally & push the new changes.
  3. I have added some in-line comments to address individual points.
