docs: OpenAI compatible API (#174)

predibase · Jan 10, 2024 · 64739ad
1 parent a90d443
commit 64739ad

Showing 7 changed files with 655 additions and 7 deletions.
diff --git a/README.md b/README.md
@@ -25,6 +25,7 @@ LoRAX (LoRA eXchange) is a framework that allows users to serve thousands of fin
   - [Launch LoRAX Server](#launch-lorax-server)
   - [Prompt via REST API](#prompt-via-rest-api)
   - [Prompt via Python Client](#prompt-via-python-client)
+  - [Chat via OpenAI API](#chat-via-openai-api)
   - [Next steps](#next-steps)
 - [🙇 Acknowledgements](#-acknowledgements)
 - [🗺️ Roadmap](#️-roadmap)
@@ -35,7 +36,7 @@ LoRAX (LoRA eXchange) is a framework that allows users to serve thousands of fin
 - 🏋️‍♀️ **Heterogeneous Continuous Batching:** packs requests for different adapters together into the same batch, keeping latency and throughput nearly constant with the number of concurrent adapters.
 - 🧁 **Adapter Exchange Scheduling:** asynchronously prefetches and offloads adapters between GPU and CPU memory, schedules request batching to optimize the aggregate throughput of the system.
 - 👬 **Optimized Inference:**  high throughput and low latency optimizations including tensor parallelism, pre-compiled CUDA kernels ([flash-attention](https://arxiv.org/abs/2307.08691), [paged attention](https://arxiv.org/abs/2309.06180), [SGMV](https://arxiv.org/abs/2310.18547)), quantization, token streaming.
-- 🚢  **Ready for Production** prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, and distributed tracing with Open Telemetry.
+- 🚢  **Ready for Production:** prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, distributed tracing with OpenTelemetry, and an OpenAI compatible API supporting multi-turn chat conversations.
 - 🤯 **Free for Commercial Use:** Apache 2.0 License. Enough said 😎.
 
 
@@ -134,6 +135,34 @@ See [Reference - Python Client](https://predibase.github.io/lorax/reference/pyth
 
 For other ways to run LoRAX, see [Getting Started - Kubernetes](https://predibase.github.io/lorax/getting_started/kubernetes), [Getting Started - SkyPilot](https://predibase.github.io/lorax/getting_started/skypilot), and [Getting Started - Local](https://predibase.github.io/lorax/getting_started/local).
 
+### Chat via OpenAI API
+
+LoRAX supports multi-turn chat conversations combined with dynamic adapter loading through an OpenAI compatible API. Just specify any adapter as the `model` parameter.
+
+```python
+from openai import OpenAI
+
+client = OpenAI(
+    api_key="EMPTY",
+    base_url="http://127.0.0.1:8080/v1",
+)
+
+resp = client.chat.completions.create(
+    model="alignment-handbook/zephyr-7b-dpo-lora",
+    messages=[
+        {
+            "role": "system",
+            "content": "You are a friendly chatbot who always responds in the style of a pirate",
+        },
+        {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
+    ],
+    max_tokens=100,
+)
+print("Response:", resp.choices[0].message.content)
+```
+
+See [OpenAI Compatible API](https://predibase.github.io/lorax/guides/openai_api) for details.
+
 ### Next steps
 
 Here are some other interesting Mistral-7B fine-tuned models to try out:

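Note on the README example above: it sends a single user turn. As with the OpenAI Chat Completions API, which is stateless, a multi-turn conversation is carried by resending the accumulated message history on each request. The sketch below shows one way to maintain that history. `append_turn` and the placeholder reply are hypothetical and not part of LoRAX or the OpenAI SDK, and the server call is left as a comment so the snippet runs without a LoRAX endpoint.

```python
from typing import Dict, List

Message = Dict[str, str]


def append_turn(history: List[Message], role: str, content: str) -> List[Message]:
    """Return a new history list with one more message appended (hypothetical helper)."""
    if role not in {"system", "user", "assistant"}:
        raise ValueError(f"unsupported role: {role!r}")
    return history + [{"role": role, "content": content}]


history: List[Message] = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
]
history = append_turn(history, "user", "How many helicopters can a human eat in one sitting?")

# With a running LoRAX server, the reply would come from a call such as:
#   resp = client.chat.completions.create(model=adapter_id, messages=history, max_tokens=100)
#   reply = resp.choices[0].message.content
reply = "Arr, none at all, matey!"  # placeholder text, not real model output

# Record the assistant reply, then add the next user turn before the next request.
history = append_turn(history, "assistant", reply)
history = append_turn(history, "user", "What about a rowboat?")

print([m["role"] for m in history])
```

Returning a new list instead of mutating in place keeps earlier snapshots of the conversation intact, which can be convenient when retrying a request.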
diff --git a/docs/guides/openai_api.md b/docs/guides/openai_api.md
@@ -0,0 +1,125 @@
+LoRAX supports [OpenAI Chat Completions v1](https://platform.openai.com/docs/api-reference/chat/create) compatible endpoints that can be used as a drop-in replacement for the OpenAI API through the existing OpenAI SDK. It supports multi-turn
+chat conversations while retaining support for dynamic adapter loading.
+
+## Chat Completions v1
+
+Using the existing OpenAI Python SDK, set `base_url` to your LoRAX endpoint with `/v1` appended. The `api_key` can be any value, as it is unused.
+
+The `model` parameter can be set to the empty string `""` to use the base model, or any adapter ID on the HuggingFace hub.
+
+```python
+from openai import OpenAI
+
+client = OpenAI(
+    api_key="EMPTY",
+    base_url="http://127.0.0.1:8080/v1",
+)
+
+resp = client.chat.completions.create(
+    model="alignment-handbook/zephyr-7b-dpo-lora",
+    messages=[
+        {
+            "role": "system",
+            "content": "You are a friendly chatbot who always responds in the style of a pirate",
+        },
+        {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
+    ],
+    max_tokens=100,
+)
+print("Response:", resp.choices[0].message.content)
+```
+
+### Streaming
+
+The streaming API is supported with the `stream=True` parameter:
+
+```python
+stream = client.chat.completions.create(
+    model="alignment-handbook/zephyr-7b-dpo-lora",
+    messages=[
+        {
+            "role": "system",
+            "content": "You are a friendly chatbot who always responds in the style of a pirate",
+        },
+        {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
+    ],
+    max_tokens=100,
+    stream=True,
+)
+
+for chunk in stream:
+    print(chunk)
+```
+
+### REST API
+
+The REST API can be used directly in addition to the Python SDK:
+
+```bash
+curl http://127.0.0.1:8080/v1/chat/completions \
+-H "Content-Type: application/json" \
+-d '{
+  "model": "alignment-handbook/zephyr-7b-dpo-lora",
+  "messages": [
+  {
+      "role": "system",
+      "content": "You are a friendly chatbot who always responds in the style of a pirate"
+  },
+  {
+      "role": "user",
+      "content": "How many helicopters can a human eat in one sitting?"
+  }
+  ],
+  "max_tokens": 100
+}'
+```
+
+### Chat Templates
+
+Multi-turn chat conversations are supported through [HuggingFace chat templates](https://huggingface.co/docs/transformers/chat_templating).
+
+If the adapter selected with the `model` parameter has its own tokenizer and chat template, LoRAX will apply the adapter's chat template
+to the request during inference. If, however, the adapter does not have its own chat template, LoRAX will fall back to using the base model's
+chat template. If the base model has no chat template either, an error will be raised, as chat templates are required for multi-turn conversations.
+
+## (Legacy) Completions v1
+
+The legacy completions v1 API can be used as well:
+
+```python
+from openai import OpenAI
+
+client = OpenAI(
+    api_key="EMPTY",
+    base_url="http://127.0.0.1:8080/v1",
+)
+
+# synchronous completions; define adapter_id and prompt before running
+completion = client.completions.create(
+    model=adapter_id,
+    prompt=prompt,
+)
+print("Completion result:", completion.choices[0].text)
+
+# streaming completions
+completion_stream = client.completions.create(
+    model=adapter_id,
+    prompt=prompt,
+    stream=True,
+)
+
+for message in completion_stream:
+    print("Completion message:", message)
+```
+
+REST:
+
+```bash
+curl http://127.0.0.1:8080/v1/completions \
+-H "Content-Type: application/json" \
+-d '{
+"model": "",
+"prompt": "Instruct: Write a detailed analogy between mathematics and a lighthouse.\nOutput:",
+"max_tokens": 100
+}'
+```
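Note on the Chat Templates section of `docs/guides/openai_api.md`: the fallback order it describes (the adapter's template, then the base model's template, then an error) can be sketched as below. This only illustrates the documented behavior. `resolve_chat_template` and its arguments are hypothetical names, not LoRAX's actual implementation.

```python
from typing import Optional


def resolve_chat_template(
    adapter_template: Optional[str],
    base_template: Optional[str],
) -> str:
    """Pick a chat template: adapter first, then base model, else raise.

    Hypothetical helper mirroring the fallback order described in the guide.
    """
    if adapter_template:
        # The adapter ships its own tokenizer and chat template.
        return adapter_template
    if base_template:
        # No adapter template: fall back to the base model's template.
        return base_template
    # Neither is available, so a multi-turn chat request cannot be formatted.
    raise ValueError("no chat template found for the adapter or the base model")


# Example: the adapter has no template of its own, so the base template is used.
print(resolve_chat_template(None, "base-model-template"))
```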
diff --git a/docs/index.md b/docs/index.md
@@ -31,7 +31,7 @@ LoRAX (LoRA eXchange) is a framework that allows users to serve thousands of fin
 - 🏋️‍♀️ **Heterogeneous Continuous Batching:** packs requests for different adapters together into the same batch, keeping latency and throughput nearly constant with the number of concurrent adapters.
 - 🧁 **Adapter Exchange Scheduling:** asynchronously prefetches and offloads adapters between GPU and CPU memory, schedules request batching to optimize the aggregate throughput of the system.
 - 👬 **Optimized Inference:**  high throughput and low latency optimizations including tensor parallelism, pre-compiled CUDA kernels ([flash-attention](https://arxiv.org/abs/2307.08691), [paged attention](https://arxiv.org/abs/2309.06180), [SGMV](https://arxiv.org/abs/2310.18547)), quantization, token streaming.
-- 🚢  **Ready for Production** prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, and distributed tracing with Open Telemetry.
+- 🚢  **Ready for Production:** prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, distributed tracing with OpenTelemetry, and an OpenAI compatible API supporting multi-turn chat conversations.
 - 🤯 **Free for Commercial Use:** Apache 2.0 License. Enough said 😎.
 
 
@@ -119,6 +119,34 @@ See [Reference - Python Client](./reference/python_client.md) for full details.
 
 For other ways to run LoRAX, see [Getting Started - Kubernetes](./getting_started/kubernetes.md), [Getting Started - SkyPilot](./getting_started/skypilot.md), and [Getting Started - Local](./getting_started/local.md).
 
+### Chat via OpenAI API
+
+LoRAX supports multi-turn chat conversations combined with dynamic adapter loading through an OpenAI compatible API. Just specify any adapter as the `model` parameter.
+
+```python
+from openai import OpenAI
+
+client = OpenAI(
+    api_key="EMPTY",
+    base_url="http://127.0.0.1:8080/v1",
+)
+
+resp = client.chat.completions.create(
+    model="alignment-handbook/zephyr-7b-dpo-lora",
+    messages=[
+        {
+            "role": "system",
+            "content": "You are a friendly chatbot who always responds in the style of a pirate",
+        },
+        {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
+    ],
+    max_tokens=100,
+)
+print("Response:", resp.choices[0].message.content)
+```
+
+See [OpenAI Compatible API](./guides/openai_api.md) for details.
+
 ## 🙇 Acknowledgements
 
 LoRAX is built on top of HuggingFace's [text-generation-inference](https://github.com/huggingface/text-generation-inference), forked from v0.9.4 (Apache 2.0).