docs: add Azure AI Search + Azure OpenAI RAG recipe notebook #675
base: main
Conversation
Merge Protections: Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit: Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
@vagenas @pablocastro can you review :)

@vagenas plz merge when ready
" raise EnvironmentError(\n", | ||
" \"No GPU or MPS device found. Proceed with CPU only if you understand the performance implications.\"\n", | ||
" )" |
I see the intent here, but raising an error in case of no GPU could be a bit too much — could perhaps just print the message instead.
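For instance, the fallback branch could just warn and continue (a minimal sketch of the suggested change, keeping the rest of the check as-is):

```python
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print("MPS GPU is enabled.")
else:
    # CPU fallback: print a warning instead of raising an error
    device = torch.device("cpu")
    print(
        "No GPU or MPS device found. Proceeding with CPU; "
        "parsing may be noticeably slower."
    )
```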
"source": [ | ||
"from openai import AzureOpenAI\n", | ||
"from azure.search.documents import SearchClient\n", | ||
"import uuid\n", |
`uuid` is not used, I would drop this line.
"# Part 2: Parse the PDF with Docling\n", | ||
"Example: \"State of AI\" slides from a remote link.\n", | ||
"\n", | ||
"You can find the raw powerpoint here: https://docs.google.com/presentation/d/1GmZmoWOa2O92BPrncRcTKa15xvQGhq7g4I4hJSNlC0M/edit?usp=sharing" |
I would use the same URL that is actually parsed below (which is a PDF and not a PowerPoint).
"\n", | ||
"console = Console()\n", | ||
"\n", | ||
"source_url = \"https://ignite2024demo.blob.core.windows.net/state-of-ai-2024/State of AI Report 2024.pdf\"\n", |
- I see this is 212 pages of PDF. If the size was explicitly meant to be in that range, it could help to include a quick heads-up explaining to the user that they would be parsing 200+ pages of PDF. From my point of view, a shorter document would work here.
- If we stick to this document, I would still URL-encode the URL (see the sketch below).
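For the URL encoding, something like this would do (a minimal sketch using only the standard library):

```python
from urllib.parse import quote

raw_url = "https://ignite2024demo.blob.core.windows.net/state-of-ai-2024/State of AI Report 2024.pdf"
# Percent-encode the spaces in the path while leaving the scheme and slashes intact
source_url = quote(raw_url, safe=":/")
print(source_url)
# https://ignite2024demo.blob.core.windows.net/state-of-ai-2024/State%20of%20AI%20Report%202024.pdf
```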
"outputs": [], | ||
"source": [ | ||
"# If running in a new environment, uncomment and run these:\n", | ||
"%pip install docling~=\"2.7.0\"\n", |
I see further below that `HierarchicalChunker` is imported from the `docling_core` import package, but the docling-core distribution package is not installed here. For simplicity, I would:

- require a minimum of docling 2.9.0 (i.e. `docling~="2.9"`), and
- then import `HierarchicalChunker` as `from docling.chunking import HierarchicalChunker`.

Note that if you pin three version components with the `~=` notation (e.g. `docling~="2.9.0"`), it will only consider that minor version (`2.9`) and not include e.g. `2.10` etc.
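Concretely, the proposal would look something like this (a sketch of the suggested install and import):

```python
# Two version components with "~=" allow any 2.x release from 2.9 upwards
%pip install docling~="2.9"

# Import the chunker via the docling meta-package instead of docling_core
from docling.chunking import HierarchicalChunker
```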
"%pip install azure-search-documents==11.5.2\n", | ||
"%pip install azure-identity\n", | ||
"%pip install openai\n", | ||
"%pip install rich\n", | ||
"%pip install torch" |
I would collapse all install statements into a single one.
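For example (combining the packages listed above into one command, using the docling pin suggested in the other comment):

```python
%pip install docling~="2.9" azure-search-documents==11.5.2 azure-identity openai rich torch
```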
```python
from docling_core.transforms.chunker import HierarchicalChunker
```
The import line discussed further above.
```
# RAG with Azure AI Search
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/farzad528/docling/blob/main/docs/examples/rag_azuresearch.ipynb)
```
Let's change the URL to the target one that should work after merging, i.e. using `DS4SD` instead of `farzad528`.
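That is, the badge link would become (only the GitHub org in the path changes):

```
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/rag_azuresearch.ipynb)
```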
"GPU or MPS usage can speed up Docling’s parsing (especially for large PDFs or when OCR/table extraction is needed). However, if no GPU is detected, you can comment out the following checks and proceed with CPU, albeit slower performance." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"CUDA GPU is enabled: NVIDIA A100 80GB PCIe\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"import torch\n", | ||
"\n", | ||
"if torch.cuda.is_available():\n", | ||
" device = torch.device(\"cuda\")\n", | ||
" print(f\"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}\")\n", | ||
"elif torch.backends.mps.is_available():\n", | ||
" device = torch.device(\"mps\")\n", | ||
" print(\"MPS GPU is enabled.\")\n", | ||
"else:\n", | ||
" # Comment out the error if you'd like to allow CPU fallback\n", | ||
" # But be aware parsing could be slower\n", | ||
" raise EnvironmentError(\n", | ||
" \"No GPU or MPS device found. Proceed with CPU only if you understand the performance implications.\"\n", | ||
" )" | ||
] |
Starting with version 2.12, Docling provides more extensive accelerator support — and automatically detects and enables any available accelerator device. You could therefore require `~="2.12"` in the `%pip install`s above and replace this section with a simple heads-up / disclaimer to the users.
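The replacement could be as simple as this (a sketch; the exact wording of the note is up to the author):

```python
# NOTE: Docling >= 2.12 automatically detects and uses any available
# accelerator (CUDA, MPS) and falls back to CPU otherwise. No manual
# device check is needed; just be aware that CPU-only parsing of large
# PDFs will be noticeably slower.
%pip install docling~="2.12"
```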
"AZURE_SEARCH_ENDPOINT = os.getenv(\"AZURE_SEARCH_ENDPOINT\")\n", | ||
"AZURE_SEARCH_KEY = os.getenv(\"AZURE_SEARCH_KEY\")\n", | ||
"AZURE_SEARCH_INDEX_NAME = os.getenv(\"AZURE_SEARCH_INDEX_NAME\")\n", | ||
"AZURE_OPENAI_ENDPOINT = os.getenv(\"AZURE_OPENAI_ENDPOINT\")\n", | ||
"AZURE_OPENAI_API_KEY = os.getenv(\"AZURE_OPENAI_API_KEY\") # Ensure this your Admin Key\n", | ||
"AZURE_OPENAI_CHAT_MODEL = os.getenv(\"AZURE_OPENAI_CHAT_MODEL\") # Using a deployed model named \"gpt-4o\"\n", | ||
"AZURE_OPENAI_API_VERSION = os.getenv(\"AZURE_OPENAI_API_VERSION\", \"2024-10-21\")\n", | ||
"AZURE_OPENAI_EMBEDDINGS = os.getenv(\"AZURE_OPENAI_EMBEDDINGS\") # Using a deployed model named \"text-embeddings-3-small\"" |
This section with `os.getenv()` is not going to work on Colab. Check out how we address this in other examples (see `_get_env_from_colab_or_os()`).
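For reference, the pattern is roughly the following (a sketch: read from Colab secrets when available, otherwise fall back to the environment; shown here for one of the variables):

```python
import os


def _get_env_from_colab_or_os(key):
    try:
        # Only available when running inside Google Colab
        from google.colab import userdata

        try:
            return userdata.get(key)
        except userdata.SecretNotFoundError:
            pass
    except ImportError:
        pass
    return os.getenv(key)


AZURE_SEARCH_ENDPOINT = _get_env_from_colab_or_os("AZURE_SEARCH_ENDPOINT")
```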
@farzad528 thanks for this nice example! 🙌 In order to align and merge:
Azure AI Search RAG Example Using Docling
This notebook demonstrates how to build a Retrieval Augmented Generation (RAG) system using Docling and Azure AI Search. The example showcases document parsing, chunking, vector search integration, and RAG query implementation.
Checklist: