Added chat template support to llama-run #11215
base: master
Conversation
@@ -573,21 +592,76 @@ class LlamaData {
    }
#endif

    int huggingface_dl(const std::string & model, const std::vector<std::string> headers, const std::string & bn) {
    int huggingface_dl_tmpl(const std::string & hfr, const std::vector<std::string> headers, const std::string & tn) {
In most cases, the chat template is already stored inside the gguf file and can be accessed using common_get_builtin_chat_template.
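For reference, a minimal sketch of what that lookup amounts to against the public C API; common_get_builtin_chat_template is expected to wrap essentially this GGUF-metadata read (the wrapper function shown here is illustrative, only llama_model_meta_val_str is taken from llama.h):

```cpp
// Minimal sketch: read the built-in chat template from GGUF metadata.
// "tokenizer.chat_template" is the standard GGUF key for the Jinja template.
#include <string>
#include <vector>
#include "llama.h"

static std::string get_builtin_chat_template(const llama_model * model) {
    const char * key = "tokenizer.chat_template";
    // query the required length first (negative return means the key is absent)
    int32_t len = llama_model_meta_val_str(model, key, nullptr, 0);
    if (len < 0) {
        return "";  // no template stored inside this gguf
    }
    std::vector<char> buf(len + 1, '\0');
    llama_model_meta_val_str(model, key, buf.data(), buf.size());
    return std::string(buf.data());
}
```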
Thanks for the hint! I added this to the initialize_chat_template function, where the gguf file is inspected for the chat template as well as the separate chat template file.
Yeah, so what we are trying to do here only really affects the Ollama case, where the template in the downloaded Ollama layer seems to take precedence over the one inside the gguf file.
But maybe it's just my ignorance: maybe huggingface has functionality similar to Ollama's, you discovered it, and I was just unaware :)
IMO this huggingface_dl_tmpl function is redundant, because convert_hf_to_gguf.py always copies the jinja template from tokenizer_config.json into the GGUF.
And btw, not sure if it's related, but for HF the application/vnd.ollama.image.template inside the manifest is also a Go template, not Jinja. We do have a compatibility layer to convert Jinja --> Go so that it can work on ollama. This tool allows you to debug what's inside: https://huggingface.co/spaces/ngxson/debug_ollama_manifest
I think any model that works with Ollama should be able to work with llama-run. There are gaps in functionality, let's fill those gaps. Maybe the model isn't exactly well-formed, etc. But Ollama can still run this model at the end of the day, so there's no reason llama-run can't do the same.
> I think any model that works with Ollama should be able to work with llama-run

I'm doubtful about this, as ollama has their own fork (not a clone) of llama.cpp, so they can implement changes that are not compatible with upstream llama.cpp (an example can be phi-4, as explained above).
> There are gaps in functionality, let's fill those gaps. Maybe the model isn't exactly well-formed, etc.

To be completely honest, you can achieve that goal faster and more easily by moving llama-run into a full prod-ready product instead of letting it stay as an example. To properly support jinja or go templates, there are third-party libraries that can handle that. The reason why we don't have them here in llama.cpp is because they are too big.
> But Ollama can still run this model at the end of the day, so there's no reason llama-run can't do the same.

True, but don't forget that a big part of ollama is built in Go (i.e. the template part), not C++.
I hope that things like llama-server and llama-run do eventually become full prod-ready products.
Agree. Maybe we should separate the examples that are meant to show how to use llama.cpp from programs like llama-cli, llama-server, llama-run, and possibly others like llama-perplexity and llama-bench, which are intended to be used by everybody and are not particularly useful as examples. We should only distribute these programs in the binary distributions, rather than the entire set of examples and tests. They could be moved to a tools directory.
Part of me wishes we had a llama-client that interacted with llama-server (with a CLI interface like llama-run). Although rather than C++, Python possibly makes more sense, as one could use the openai client library.
@@ -607,16 +681,34 @@ class LlamaData {
    }

        nlohmann::json manifest = nlohmann::json::parse(manifest_str);
        std::string layer;
        std::string sha_model;
        std::string sha_template;
IMO this is a bad approach, because ollama uses Go templates, not Jinja.
If you want to add this, be aware that most templates won't be detected.
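For concreteness, here is a hedged sketch of the layer lookup being discussed on top of that manifest parse; the mediaType strings follow the Ollama manifest layout mentioned in this thread, and the helper itself is illustrative rather than the code in this PR:

```cpp
// Sketch: scan the Ollama/OCI-style manifest for the model and template layers
// by mediaType and record their digests.
#include <string>
#include <nlohmann/json.hpp>

struct manifest_layers {
    std::string sha_model;     // digest of application/vnd.ollama.image.model
    std::string sha_template;  // digest of application/vnd.ollama.image.template (a Go template, not Jinja)
};

static manifest_layers find_layers(const std::string & manifest_str) {
    manifest_layers out;
    nlohmann::json manifest = nlohmann::json::parse(manifest_str);
    for (const auto & layer : manifest.value("layers", nlohmann::json::array())) {
        const std::string media_type = layer.value("mediaType", "");
        if (media_type == "application/vnd.ollama.image.model") {
            out.sha_model = layer.value("digest", "");
        } else if (media_type == "application/vnd.ollama.image.template") {
            // caveat from the review: this blob is a Go template, so it cannot be
            // fed directly into a Jinja-based chat-template path
            out.sha_template = layer.value("digest", "");
        }
    }
    return out;
}
```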
Didn't know about this. Considering it, I agree that this approach doesn't seem good.
What do you think? @ericcurtin
I'm not an expert on this, but whatever works to make "vnd.ollama.image.template" compatible with llama-run
You know the magic model to test this stuff: granite-code :)
No shame in reading ollama code either to figure out how it gets passed to llama.cpp
Force-pushed from 8dbb9e5 to c0b1b49
Force-pushed from c0b1b49 to baf2844
examples/run/run.cpp (outdated)
        return common_get_builtin_chat_template(model.get());
    }

    std::ifstream tmpl_file;
Would prefer ggml_fopen over streams, but it's not necessary; most important is to get it working first :)
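If it helps, a minimal sketch of that ggml_fopen variant (the helper name is made up; only ggml_fopen itself comes from ggml.h):

```cpp
// Sketch: read the whole chat-template file into a string with the C FILE API
// instead of std::ifstream. ggml_fopen is used because it handles UTF-8 paths
// on Windows; error handling is kept deliberately minimal.
#include <cstdio>
#include <string>
#include "ggml.h"

static bool read_chat_template_file(const std::string & path, std::string & out) {
    FILE * file = ggml_fopen(path.c_str(), "rb");
    if (file == nullptr) {
        return false;  // missing or unreadable template file
    }
    char buf[4096];
    size_t n;
    while ((n = std::fread(buf, 1, sizeof(buf), file)) > 0) {
        out.append(buf, n);
    }
    std::fclose(file);
    return true;
}
```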
    if (string_starts_with(model_, "file://") || std::filesystem::exists(model_)) {
    int resolve_model(std::string & model_, std::string & chat_template_) {
        int ret = 0;
        if (string_starts_with(model_, "file://")) {
Could we re-add "|| std::filesystem::exists(model_)"?
The logic is supposed to be: if the file exists, use it; otherwise, if we don't have a protocol specified, assume we need to pull from ollama.
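For illustration, a sketch of that resolution order; the downloader stubs only stand in for the functions in this PR, and their names and signatures here are made up:

```cpp
// Sketch of the intended resolution order: a local file wins, an explicit
// scheme picks the matching downloader, and a bare name falls back to an
// ollama-style pull. The pull_* stubs are placeholders, not the PR's code.
#include <filesystem>
#include <string>

static bool starts_with(const std::string & s, const std::string & prefix) {
    return s.rfind(prefix, 0) == 0;
}

// placeholder downloaders (illustrative signatures only)
static int pull_from_huggingface(const std::string & model) { (void) model; return 0; }
static int pull_from_ollama(const std::string & model) { (void) model; return 0; }

static int resolve_model(std::string & model_) {
    if (starts_with(model_, "file://")) {
        model_.erase(0, 7);              // strip the scheme, use the local path as-is
        return 0;
    }
    if (std::filesystem::exists(model_)) {
        return 0;                        // already on disk, nothing to download
    }
    if (starts_with(model_, "hf://")) {
        return pull_from_huggingface(model_);
    }
    // no scheme and no local file: assume we need to pull from ollama
    return pull_from_ollama(model_);
}
```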
I moved this into the download functions to check for the model and template file individually. If the model or template file is already there, we skip the download.
examples/run/run.cpp (outdated)
    int huggingface_dl_tmpl(const std::string & hfr, const std::vector<std::string> headers, const std::string & tn) {
        // if template already exists, don't download it
        struct stat info;
        if (stat(tn.c_str(), &info) == 0) {
Not a strictly required change, but we use std::filesystem::exists everywhere else in llama-run to check for the existence of files. Does stat work on Windows?
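FWIW, stat does exist on Windows (the CRT maps it to _stat), but it takes ANSI-codepage paths, so the std::filesystem check is the more portable option. A minimal sketch of the non-throwing variant (the helper name is made up):

```cpp
// Sketch: check whether the template file is already on disk without stat(),
// using the non-throwing std::filesystem overload.
#include <filesystem>
#include <string>
#include <system_error>

static bool template_already_downloaded(const std::string & tn) {
    std::error_code ec;  // swallow errors (e.g. permission issues) instead of throwing
    return std::filesystem::exists(tn, ec);
}
```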
Fixes: ggerganov#11178

The llama-run CLI currently doesn't take the chat template of a model into account. Thus, executing llama-run on a model requiring a chat template will fail. In order to solve this, the chat template is downloaded from ollama or huggingface as well and applied during the chat.

Signed-off-by: Michael Engel <[email protected]>
Force-pushed from baf2844 to a899673