Update llama.cpp docs #1326

Merged 4 commits on Jul 5, 2024
42 changes: 35 additions & 7 deletions in docs/hub/gguf-llamacpp.md
# GGUF usage with llama.cpp

Llama.cpp allows you to download and run inference on a GGUF model simply by providing the Hugging Face repo path and the file name. llama.cpp downloads the model checkpoint and automatically caches it. The location of the cache is defined by the `LLAMA_CACHE` environment variable; read more about it [here](https://github.com/ggerganov/llama.cpp/pull/7826):

```bash
./llama-cli \
  --hf-repo lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF \
  --hf-file Meta-Llama-3-8B-Instruct-Q8_0.gguf \
  -p "You are a helpful assistant" -cnv
```
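
If you want the downloaded checkpoint to live somewhere other than the default cache, you can point `LLAMA_CACHE` at a custom directory before running the command above (a minimal sketch; the path is only an example):

```bash
# Store downloaded GGUF files in a custom cache directory (path is illustrative)
export LLAMA_CACHE=~/models/llama-cache
mkdir -p "$LLAMA_CACHE"
```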

Note: You can remove `-cnv` to run a one-shot text completion instead of an interactive chat.
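
For instance, dropping `-cnv` and supplying a plain prompt runs a single completion (the prompt and the `-n 128` token limit below are only illustrative):

```bash
./llama-cli \
  --hf-repo lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF \
  --hf-file Meta-Llama-3-8B-Instruct-Q8_0.gguf \
  -p "I believe the meaning of life is" -n 128
```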

Additionally, you can invoke an OpenAI-spec chat completions endpoint directly using the llama.cpp server:

```bash
./llama-server \
  --hf-repo lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF \
  --hf-file Meta-Llama-3-8B-Instruct-Q8_0.gguf
```

After starting the server, you can call the endpoint as shown below:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "messages": [
      {
        "role": "system",
        "content": "You are an AI assistant. Your top priority is achieving user fulfilment via helping them with their requests."
      },
      {
        "role": "user",
        "content": "Write a limerick about Python exceptions"
      }
    ]
  }'
```
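
The server returns a standard OpenAI-style chat completion object. As a quick sketch (assuming `jq` is installed; this is not part of the original docs), you can print only the assistant's reply:

```bash
# Extract just the assistant message from the JSON response
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{"messages": [{"role": "user", "content": "Write a limerick about Python exceptions"}]}' \
  | jq -r '.choices[0].message.content'
```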

Replace `--hf-repo` with any valid Hugging Face hub repo name and `--hf-file` with the GGUF file name in the hub repo - off you go! 🦙

Note: Remember to build llama.cpp with `LLAMA_CURL=1` :)
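
As a rough sketch of what that build step might look like (assuming the Makefile build; the CMake equivalent is the `-DLLAMA_CURL=ON` option):

```bash
# Build the CLI and server with libcurl support so --hf-repo downloads work
make LLAMA_CURL=1 llama-cli llama-server

# Or, with CMake:
# cmake -B build -DLLAMA_CURL=ON
# cmake --build build --config Release
```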