docs: OpenAI compatible API (#174)
tgaddair authored Jan 10, 2024
1 parent a90d443 commit 64739ad
Showing 7 changed files with 655 additions and 7 deletions.
31 changes: 30 additions & 1 deletion README.md
@@ -25,6 +25,7 @@ LoRAX (LoRA eXchange) is a framework that allows users to serve thousands of fin
- [Launch LoRAX Server](#launch-lorax-server)
- [Prompt via REST API](#prompt-via-rest-api)
- [Prompt via Python Client](#prompt-via-python-client)
- [Chat via OpenAI API](#chat-via-openai-api)
- [Next steps](#next-steps)
- [🙇 Acknowledgements](#-acknowledgements)
- [🗺️ Roadmap](#️-roadmap)
@@ -35,7 +36,7 @@ LoRAX (LoRA eXchange) is a framework that allows users to serve thousands of fin
- 🏋️‍♀️ **Heterogeneous Continuous Batching:** packs requests for different adapters together into the same batch, keeping latency and throughput nearly constant with the number of concurrent adapters.
- 🧁 **Adapter Exchange Scheduling:** asynchronously prefetches and offloads adapters between GPU and CPU memory, schedules request batching to optimize the aggregate throughput of the system.
- 👬 **Optimized Inference:** high throughput and low latency optimizations including tensor parallelism, pre-compiled CUDA kernels ([flash-attention](https://arxiv.org/abs/2307.08691), [paged attention](https://arxiv.org/abs/2309.06180), [SGMV](https://arxiv.org/abs/2310.18547)), quantization, token streaming.
- 🚢 **Ready for Production** prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, and distributed tracing with Open Telemetry.
- 🚢 **Ready for Production** prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, and distributed tracing with Open Telemetry. OpenAI compatible API supporting multi-turn chat conversations.
- 🤯 **Free for Commercial Use:** Apache 2.0 License. Enough said 😎.


@@ -134,6 +135,34 @@ See [Reference - Python Client](https://predibase.github.io/lorax/reference/pyth

For other ways to run LoRAX, see [Getting Started - Kubernetes](https://predibase.github.io/lorax/getting_started/kubernetes), [Getting Started - SkyPilot](https://predibase.github.io/lorax/getting_started/skypilot), and [Getting Started - Local](https://predibase.github.io/lorax/getting_started/local).

### Chat via OpenAI API

LoRAX supports multi-turn chat conversations combined with dynamic adapter loading through an OpenAI-compatible API. Pass any adapter ID as the `model` parameter.

```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:8080/v1",
)

resp = client.chat.completions.create(
    model="alignment-handbook/zephyr-7b-dpo-lora",
    messages=[
        {
            "role": "system",
            "content": "You are a friendly chatbot who always responds in the style of a pirate",
        },
        {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
    ],
    max_tokens=100,
)
# Chat completions return the reply under message.content
print("Response:", resp.choices[0].message.content)
```
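
In practice, a multi-turn conversation means resending the accumulated history with each request. The following is a minimal sketch of that bookkeeping, assuming the standard OpenAI `messages` format of `role`/`content` dictionaries. The helper name `add_turn` and the sample assistant reply are hypothetical and are not part of LoRAX or the OpenAI SDK.

```python
from typing import Dict, List

Message = Dict[str, str]


def add_turn(history: List[Message], role: str, content: str) -> List[Message]:
    """Return a new history with one more message appended (hypothetical helper)."""
    if role not in {"system", "user", "assistant"}:
        raise ValueError(f"unsupported role: {role}")
    return history + [{"role": role, "content": content}]


history: List[Message] = []
history = add_turn(history, "system", "You are a friendly chatbot who always responds in the style of a pirate")
history = add_turn(history, "user", "How many helicopters can a human eat in one sitting?")
# After each call to client.chat.completions.create(messages=history, ...),
# store the assistant reply so the next request carries the full conversation.
history = add_turn(history, "assistant", "Arr, none at all, matey!")  # made-up reply for illustration
history = add_turn(history, "user", "What about boats?")

print([m["role"] for m in history])
```

Because the history travels with every request, the `model` value could in principle change between turns, which is where dynamic adapter loading comes in.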

See [OpenAI Compatible API](https://predibase.github.io/lorax/guides/openai_api) for details.

### Next steps

Here are some other interesting Mistral-7B fine-tuned models to try out:
125 changes: 125 additions & 0 deletions docs/guides/openai_api.md
@@ -0,0 +1,125 @@
LoRAX exposes endpoints compatible with [OpenAI Chat Completions v1](https://platform.openai.com/docs/api-reference/completions/create), so code written against the OpenAI SDK can point at a LoRAX server as a drop-in replacement for the OpenAI API. These endpoints support multi-turn
chat conversations while retaining support for dynamic adapter loading.

## Chat Completions v1

Using the existing OpenAI Python SDK, set `base_url` to your LoRAX endpoint with `/v1` appended. The `api_key` can be any value, as it is unused.

The `model` parameter can be set to the empty string `""` to use the base model, or any adapter ID on the HuggingFace hub.

```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:8080/v1",
)

resp = client.chat.completions.create(
    model="alignment-handbook/zephyr-7b-dpo-lora",
    messages=[
        {
            "role": "system",
            "content": "You are a friendly chatbot who always responds in the style of a pirate",
        },
        {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
    ],
    max_tokens=100,
)
# Chat completions return the reply under message.content
print("Response:", resp.choices[0].message.content)
```
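
The `model` convention described above (empty string for the base model, adapter ID otherwise) can be sketched as a small payload builder. The helper name `build_chat_payload` is hypothetical and only illustrates the convention; it is not a LoRAX or OpenAI SDK function.

```python
from typing import Any, Dict, List, Optional


def build_chat_payload(
    messages: List[Dict[str, str]],
    adapter_id: Optional[str] = None,
    max_tokens: int = 100,
) -> Dict[str, Any]:
    """Build a chat completions request body (hypothetical helper).

    An adapter_id of None maps to the empty string, which selects the base model.
    """
    return {
        "model": adapter_id or "",
        "messages": messages,
        "max_tokens": max_tokens,
    }


messages = [{"role": "user", "content": "How many helicopters can a human eat in one sitting?"}]
base_payload = build_chat_payload(messages)
adapter_payload = build_chat_payload(messages, adapter_id="alignment-handbook/zephyr-7b-dpo-lora")
print(repr(base_payload["model"]), adapter_payload["model"])
```

Either dictionary could then be passed along as keyword arguments, for example `client.chat.completions.create(**adapter_payload)`.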

### Streaming

The streaming API is supported with the `stream=True` parameter:

```python
messages = client.chat.completions.create(
    model="alignment-handbook/zephyr-7b-dpo-lora",
    messages=[
        {
            "role": "system",
            "content": "You are a friendly chatbot who always responds in the style of a pirate",
        },
        {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
    ],
    max_tokens=100,
    stream=True,
)

# Each iteration yields one streamed chunk of the response
for message in messages:
    print(message)
```
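
Printing raw chunks is useful for debugging, but most callers want the assembled text. The following is a minimal sketch of that, assuming the streamed chunks follow the OpenAI SDK's chat completion chunk shape (`chunk.choices[0].delta.content`); whether every LoRAX chunk matches that shape exactly is an assumption. `collect_stream_text` and `fake_chunk` are hypothetical names, and the fake chunks exist only so the helper can run without a server.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def collect_stream_text(chunks: Iterable[Any]) -> str:
    """Concatenate the text deltas of a chat completions stream."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        # Some chunks carry no text (content is None or empty), so skip those.
        if delta.content:
            parts.append(delta.content)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in for a streamed chunk, used only to exercise the helper offline."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


demo_stream = [fake_chunk("Arr, "), fake_chunk("matey!"), fake_chunk(None)]
print(collect_stream_text(demo_stream))
```

With a live server, the stream returned by `client.chat.completions.create(..., stream=True)` would be passed in place of `demo_stream`.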

### REST API

The REST API can be used directly in addition to the Python SDK:

```bash
curl http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "alignment-handbook/zephyr-7b-dpo-lora",
        "messages": [
            {
                "role": "system",
                "content": "You are a friendly chatbot who always responds in the style of a pirate"
            },
            {
                "role": "user",
                "content": "How many helicopters can a human eat in one sitting?"
            }
        ],
        "max_tokens": 100
    }'
```
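
The same request can be issued from Python without the OpenAI SDK. This sketch uses only the standard library and builds the request without sending it; the helper name `build_chat_request` is hypothetical, and the response layout in the commented-out lines assumes the server mirrors OpenAI's chat completions response format.

```python
import json
import urllib.request


def build_chat_request(base_url: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions endpoint."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = {
    "model": "alignment-handbook/zephyr-7b-dpo-lora",
    "messages": [
        {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
    ],
    "max_tokens": 100,
}
request = build_chat_request("http://127.0.0.1:8080", payload)
print(request.get_method(), request.full_url)

# With a LoRAX server running, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     body = json.load(response)
#     print(body["choices"][0]["message"]["content"])
```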

### Chat Templates

Multi-turn chat conversations are supported through [HuggingFace chat templates](https://huggingface.co/docs/transformers/chat_templating).

If the adapter selected with the `model` parameter has its own tokenizer and chat template, LoRAX applies the adapter's chat template
to the request during inference. If the adapter does not have its own chat template, LoRAX falls back to the base model's
chat template. If the base model has none either, an error is raised, because chat templates are required for multi-turn conversations.
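
The precedence described above can be sketched as a small selection function. This only illustrates the fallback order and is not LoRAX's actual implementation; the function name `select_chat_template` and the toy template strings are hypothetical.

```python
from typing import Optional


def select_chat_template(adapter_template: Optional[str], base_template: Optional[str]) -> str:
    """Pick a chat template: adapter first, then base model, otherwise fail."""
    if adapter_template:
        return adapter_template
    if base_template:
        return base_template
    raise ValueError("no chat template available for the adapter or the base model")


# Toy template strings, for illustration only
print(select_chat_template("ADAPTER_TEMPLATE", "BASE_TEMPLATE"))
print(select_chat_template(None, "BASE_TEMPLATE"))
```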

## (Legacy) Completions v1

The legacy completions v1 API can be used as well:

```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:8080/v1",
)

# Example values; substitute your own adapter ID and prompt
adapter_id = "alignment-handbook/zephyr-7b-dpo-lora"
prompt = "Instruct: Write a detailed analogy between mathematics and a lighthouse.\nOutput:"

# synchronous completions
completion = client.completions.create(
    model=adapter_id,
    prompt=prompt,
)
print("Completion result:", completion.choices[0].text)

# streaming completions
completion_stream = client.completions.create(
    model=adapter_id,
    prompt=prompt,
    stream=True,
)

for message in completion_stream:
    print("Completion message:", message)
```

REST:

```bash
curl http://127.0.0.1:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "",
        "prompt": "Instruct: Write a detailed analogy between mathematics and a lighthouse.\nOutput:",
        "max_tokens": 100
    }'
```
30 changes: 29 additions & 1 deletion docs/index.md
@@ -31,7 +31,7 @@ LoRAX (LoRA eXchange) is a framework that allows users to serve thousands of fin
- 🏋️‍♀️ **Heterogeneous Continuous Batching:** packs requests for different adapters together into the same batch, keeping latency and throughput nearly constant with the number of concurrent adapters.
- 🧁 **Adapter Exchange Scheduling:** asynchronously prefetches and offloads adapters between GPU and CPU memory, schedules request batching to optimize the aggregate throughput of the system.
- 👬 **Optimized Inference:** high throughput and low latency optimizations including tensor parallelism, pre-compiled CUDA kernels ([flash-attention](https://arxiv.org/abs/2307.08691), [paged attention](https://arxiv.org/abs/2309.06180), [SGMV](https://arxiv.org/abs/2310.18547)), quantization, token streaming.
- 🚢 **Ready for Production** prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, and distributed tracing with Open Telemetry.
- 🚢 **Ready for Production** prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, and distributed tracing with Open Telemetry. OpenAI compatible API supporting multi-turn chat conversations.
- 🤯 **Free for Commercial Use:** Apache 2.0 License. Enough said 😎.


@@ -119,6 +119,34 @@ See [Reference - Python Client](./reference/python_client.md) for full details.

For other ways to run LoRAX, see [Getting Started - Kubernetes](./getting_started/kubernetes.md), [Getting Started - SkyPilot](./getting_started/skypilot.md), and [Getting Started - Local](./getting_started/local.md).

### Chat via OpenAI API

LoRAX supports multi-turn chat conversations combined with dynamic adapter loading through an OpenAI-compatible API. Pass any adapter ID as the `model` parameter.

```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:8080/v1",
)

resp = client.chat.completions.create(
    model="alignment-handbook/zephyr-7b-dpo-lora",
    messages=[
        {
            "role": "system",
            "content": "You are a friendly chatbot who always responds in the style of a pirate",
        },
        {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
    ],
    max_tokens=100,
)
# Chat completions return the reply under message.content
print("Response:", resp.choices[0].message.content)
```

See [OpenAI Compatible API](./guides/openai_api.md) for details.

## 🙇 Acknowledgements

LoRAX is built on top of HuggingFace's [text-generation-inference](https://github.com/huggingface/text-generation-inference), forked from v0.9.4 (Apache 2.0).