usage and qdrant sections added for accessibility, small typo fixed in /v1/retrieve section
1 parent 670080e · commit b31472f
Showing 1 changed file with 36 additions and 2 deletions.

@@ -349,7 +349,7 @@ If the command runs successfully, you should see the similar output as below in

You can use `curl` to test it on a new terminal:

```bash
-curl -X POST http://localhost:8080/v1/chat/completions \
+curl -X POST http://localhost:8080/v1/retrieve \
-H 'accept:application/json' \
-H 'Content-Type: application/json' \
-d '{"messages":[{"role":"system", "content": "You are a helpful assistant."}, {"role":"user", "content": "What is the location of Paris, France along the Seine River?"}], "model":"llama-2-chat"}'
```

@@ -511,7 +511,22 @@ To check the CLI options of the `rag-api-server` wasm app, you can run the follo

LlamaEdge-RAG API server requires two types of models: chat and embedding. The chat model is used for generating responses to user queries, while the embedding model is used for computing embeddings for user queries or file chunks.

Execution also requires the presence of a running [Qdrant](https://qdrant.tech/) service.

For the purpose of demonstration, we use the [Llama-2-7b-chat-hf-Q5_K_M.gguf](https://huggingface.co/second-state/Llama-2-7B-Chat-GGUF/resolve/main/Llama-2-7b-chat-hf-Q5_K_M.gguf) and [all-MiniLM-L6-v2-ggml-model-f16.gguf](https://huggingface.co/second-state/All-MiniLM-L6-v2-Embedding-GGUF/resolve/main/all-MiniLM-L6-v2-ggml-model-f16.gguf) models as examples. Download these models and place them in the root directory of the repository.
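
If you have `curl` available, one way to fetch both model files into the repository root is shown below; the URLs are the same ones linked above:

```bash
# Download the chat model and the embedding model into the current directory
curl -LO https://huggingface.co/second-state/Llama-2-7B-Chat-GGUF/resolve/main/Llama-2-7b-chat-hf-Q5_K_M.gguf
curl -LO https://huggingface.co/second-state/All-MiniLM-L6-v2-Embedding-GGUF/resolve/main/all-MiniLM-L6-v2-ggml-model-f16.gguf
```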

- Ensure the Qdrant service is running

```bash
# Pull the Qdrant docker image
docker pull qdrant/qdrant

# Create a directory to store Qdrant data
mkdir qdrant_storage

# Run Qdrant service, mounting the storage directory created above
docker run -p 6333:6333 -p 6334:6334 -v $(pwd)/qdrant_storage:/qdrant/storage:z qdrant/qdrant
```
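
To confirm the Qdrant container is reachable before starting the API server, you can hit its REST interface on the port mapped above; the `/collections` endpoint is part of Qdrant's standard REST API:

```bash
# Should return an HTTP 200 response with a JSON list of collections (empty on a fresh instance)
curl http://localhost:6333/collections
```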

- Start an instance of LlamaEdge-RAG API server

@@ -527,3 +542,22 @@ For the purpose of demonstration, we use the [Llama-2-7b-chat-hf-Q5_K_M.gguf](ht

--log-prompts \
--log-stat
```
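
Before trying the examples below, it can be worth confirming that the API server is up. This is a minimal sanity check, assuming the server exposes an OpenAI-compatible `/v1/models` endpoint and is listening on the default port `8080`:

```bash
# List the models registered with the running server
curl -X GET http://localhost:8080/v1/models -H 'accept:application/json'
```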

## Usage Example

- [Execute](#execute) the server

- Generate embeddings for [paris.txt](https://huggingface.co/datasets/gaianet/paris/raw/main/paris.txt) via the `/v1/create/rag` endpoint

```bash
curl -X POST http://127.0.0.1:8080/v1/create/rag -F "[email protected]"
```
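
The command above assumes `paris.txt` is present in the directory where `curl` is run. One way to fetch it first, using the link referenced above:

```bash
# Download the sample document used for this example
curl -LO https://huggingface.co/datasets/gaianet/paris/raw/main/paris.txt
```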

- Ask a question

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
-H 'accept:application/json' \
-H 'Content-Type: application/json' \
-d '{"messages":[{"role":"system", "content": "You are a helpful assistant."}, {"role":"user", "content": "What is the location of Paris, France along the Seine River?"}], "model":"Llama-2-7b-chat-hf-Q5_K_M"}'
```
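
If the embedding step succeeded, the answer should be generated with context retrieved from the Qdrant collection built from `paris.txt`, rather than from the chat model's own knowledge alone.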