Commit

usage and qdrant sections added for accessibility, small typo fixed in /v1/retrieve section
suryyyansh committed May 7, 2024
1 parent 670080e commit b31472f
Showing 1 changed file with 36 additions and 2 deletions: README.md
@@ -349,7 +349,7 @@ If the command runs successfully, you should see output similar to the following in your terminal:
You can use `curl` to test it from a new terminal:

```bash
curl -X POST http://localhost:8080/v1/retrieve \
-H 'accept:application/json' \
-H 'Content-Type: application/json' \
-d '{"messages":[{"role":"system", "content": "You are a helpful assistant."}, {"role":"user", "content": "What is the location of Paris, France along the Seine River?"}], "model":"llama-2-chat"}'
```
@@ -511,7 +511,22 @@ To check the CLI options of the `rag-api-server` wasm app, you can run the following command:

The LlamaEdge-RAG API server requires two types of models: chat and embedding. The chat model is used for generating responses to user queries, while the embedding model is used for computing embeddings for user queries and file chunks.
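For instance, once the server is running (see below), embedding requests are served by the embedding model. A minimal sketch, assuming the server exposes the OpenAI-compatible `/v1/embeddings` endpoint on the default port 8080 (the endpoint, port, and model name here are assumptions, not taken from this README):

```bash
# Hypothetical request: compute embeddings for two text chunks
curl -X POST http://localhost:8080/v1/embeddings \
-H 'Content-Type: application/json' \
-d '{"model": "all-MiniLM-L6-v2-ggml-model-f16", "input": ["chunk one", "chunk two"]}'
```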

The server also requires a running [Qdrant](https://qdrant.tech/) service.

For the purpose of demonstration, we use the [Llama-2-7b-chat-hf-Q5_K_M.gguf](https://huggingface.co/second-state/Llama-2-7B-Chat-GGUF/resolve/main/Llama-2-7b-chat-hf-Q5_K_M.gguf) and [all-MiniLM-L6-v2-ggml-model-f16.gguf](https://huggingface.co/second-state/All-MiniLM-L6-v2-Embedding-GGUF/resolve/main/all-MiniLM-L6-v2-ggml-model-f16.gguf) models as examples. Download these models and place them in the root directory of the repository.
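One way to fetch both models, using the URLs linked above:

```bash
# Download the chat model
curl -LO https://huggingface.co/second-state/Llama-2-7B-Chat-GGUF/resolve/main/Llama-2-7b-chat-hf-Q5_K_M.gguf

# Download the embedding model
curl -LO https://huggingface.co/second-state/All-MiniLM-L6-v2-Embedding-GGUF/resolve/main/all-MiniLM-L6-v2-ggml-model-f16.gguf
```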

- Ensure the Qdrant service is running

```bash
# Pull the Qdrant docker image
docker pull qdrant/qdrant

# Create a directory to store Qdrant data
mkdir qdrant_storage

# Run Qdrant service
docker run -p 6333:6333 -p 6334:6334 -v $(pwd)/qdrant_storage:/qdrant/storage:z qdrant/qdrant
```
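
To confirm Qdrant is up before moving on, you can query its REST port (a quick sanity check; 6333 is Qdrant's default REST port):

```bash
# Should return a small JSON document reporting the Qdrant version
curl http://localhost:6333/
```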

- Start an instance of LlamaEdge-RAG API server

@@ -527,3 +542,22 @@

```bash
  --log-prompts \
  --log-stat
```
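
Once the server reports that it is listening, you can verify it from another terminal by listing the loaded models. This assumes the server exposes the OpenAI-compatible `/v1/models` endpoint on the default port 8080:

```bash
# Should list the chat and embedding models loaded at startup
curl http://localhost:8080/v1/models
```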

## Usage Example

- [Execute](#execute) the server

- Generate embeddings for [paris.txt](https://huggingface.co/datasets/gaianet/paris/raw/main/paris.txt) via the `/v1/create/rag` endpoint

```bash
curl -X POST http://127.0.0.1:8080/v1/create/rag -F "file=@paris.txt"
```
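
- (Optional) Confirm that the embeddings were stored. A quick check, assuming the server writes to a Qdrant collection named `default` (the collection name is an assumption, not taken from this README):

```bash
# Returns collection info, including the number of stored points
curl http://localhost:6333/collections/default
```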

- Ask a question

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
-H 'accept:application/json' \
-H 'Content-Type: application/json' \
-d '{"messages":[{"role":"system", "content": "You are a helpful assistant."}, {"role":"user", "content": "What is the location of Paris, France along the Seine River?"}], "model":"Llama-2-7b-chat-hf-Q5_K_M"}'
```
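
The request body follows the OpenAI chat-completions format. Assuming the server also honors the standard `stream` field, the same question can be asked with a streamed response:

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
-H 'accept:application/json' \
-H 'Content-Type: application/json' \
-d '{"messages":[{"role":"system", "content": "You are a helpful assistant."}, {"role":"user", "content": "What is the location of Paris, France along the Seine River?"}], "model":"Llama-2-7b-chat-hf-Q5_K_M", "stream": true}'
```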
