Added chat template support to llama-run #11215
base: master
Conversation
@@ -573,21 +592,76 @@ class LlamaData {
    }
#endif

    int huggingface_dl(const std::string & model, const std::vector<std::string> headers, const std::string & bn) {
    int huggingface_dl_tmpl(const std::string & hfr, const std::vector<std::string> headers, const std::string & tn) {
In most cases, the chat template is already stored inside the gguf file and can be accessed using common_get_builtin_chat_template.
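For reference, a minimal sketch of what that lookup amounts to against the public C API; common_get_builtin_chat_template is expected to wrap essentially this GGUF-metadata read (the wrapper function shown here is illustrative, only llama_model_meta_val_str is taken from llama.h):

```cpp
// Minimal sketch: read the built-in chat template from GGUF metadata.
// "tokenizer.chat_template" is the standard GGUF key for the Jinja template.
#include <string>
#include <vector>
#include "llama.h"

static std::string get_builtin_chat_template(const llama_model * model) {
    const char * key = "tokenizer.chat_template";
    // query the required length first (negative return means the key is absent)
    int32_t len = llama_model_meta_val_str(model, key, nullptr, 0);
    if (len < 0) {
        return "";  // no template stored inside this gguf
    }
    std::vector<char> buf(len + 1, '\0');
    llama_model_meta_val_str(model, key, buf.data(), buf.size());
    return std::string(buf.data());
}
```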
Thanks for the hint! I added this to the initialize_chat_template function, where the gguf file is inspected for the chat template as well as the separate chat template file.
Yeah, so what we are trying to do here only really affects the Ollama case, where the template in the downloaded Ollama layer seems to take precedence over the one inside the gguf file.
But maybe it's just my ignorance: maybe huggingface has functionality similar to Ollama's, you discovered it, and I was just unaware :)
IMO this huggingface_dl_tmpl function is redundant, because convert_hf_to_gguf.py always copies the jinja template from tokenizer_config.json into the GGUF.
And btw, not sure if it's related, but for HF the application/vnd.ollama.image.template inside the manifest is also a Go template, not Jinja. We do have a compatibility layer to convert Jinja --> Go so that it can work on ollama. This tool allows you to debug what's inside: https://huggingface.co/spaces/ngxson/debug_ollama_manifest
I think any model that works with Ollama should be able to work with llama-run. There are gaps in functionality, let's fill those gaps. Maybe the model isn't exactly well-formed, etc. But Ollama can still run this model at the end of the day, so there's no reason llama-run can't do the same.
> I think any model that works with Ollama should be able to work with llama-run

I'm doubtful about this, as ollama has their own fork (not a clone) of llama.cpp, so they can implement changes that are not compatible with upstream llama.cpp (an example can be phi-4, as explained above).
> There are gaps in functionality, let's fill those gaps. Maybe the model isn't exactly well-formed, etc.

To be completely honest, you can achieve that goal faster and more easily by moving llama-run into a full prod-ready product instead of letting it stay as an example. To properly support jinja or go templates, there are third-party libraries that can handle that. The reason why we don't have them here in llama.cpp is because they are too big.
> But Ollama can still run this model at the end of the day, so there's no reason llama-run can't do the same.

True, but don't forget that a big part of ollama is built in Go (i.e. the template part), not C++.
I hope that things like llama-server and llama-run do eventually become full prod-ready products.
Agree. Maybe we should separate the examples that are meant to show how to use llama.cpp from programs like llama-cli, llama-server, llama-run, and possibly others like llama-perplexity and llama-bench, which are intended to be used by everybody and are not particularly useful as examples. We should only distribute these programs in the binary distributions, rather than the entire set of examples and tests. They could be moved to a tools directory.
Part of me wishes we had a llama-client that interacted with llama-server (with a CLI interface like llama-run). Although rather than C++, Python possibly makes more sense, as one could use the openai client library.
@@ -607,16 +681,34 @@ class LlamaData {
    }

        nlohmann::json manifest = nlohmann::json::parse(manifest_str);
        std::string layer;
        std::string sha_model;
        std::string sha_template;
IMO this is a bad approach, because ollama uses Go templates, not Jinja.
If you want to add this, be aware that most templates won't be detected.
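For concreteness, here is a hedged sketch of the layer lookup being discussed on top of that manifest parse; the mediaType strings follow the Ollama manifest layout mentioned in this thread, and the helper itself is illustrative rather than the code in this PR:

```cpp
// Sketch: scan the Ollama/OCI-style manifest for the model and template layers
// by mediaType and record their digests.
#include <string>
#include <nlohmann/json.hpp>

struct manifest_layers {
    std::string sha_model;     // digest of application/vnd.ollama.image.model
    std::string sha_template;  // digest of application/vnd.ollama.image.template (a Go template, not Jinja)
};

static manifest_layers find_layers(const std::string & manifest_str) {
    manifest_layers out;
    nlohmann::json manifest = nlohmann::json::parse(manifest_str);
    for (const auto & layer : manifest.value("layers", nlohmann::json::array())) {
        const std::string media_type = layer.value("mediaType", "");
        if (media_type == "application/vnd.ollama.image.model") {
            out.sha_model = layer.value("digest", "");
        } else if (media_type == "application/vnd.ollama.image.template") {
            // caveat from the review: this blob is a Go template, so it cannot be
            // fed directly into a Jinja-based chat-template path
            out.sha_template = layer.value("digest", "");
        }
    }
    return out;
}
```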
Didn't know about this. Considering it, I agree that this approach doesn't seem good.
What do you think? @ericcurtin
I'm not an expert on this, but whatever works to make "vnd.ollama.image.template" compatible with llama-run
You know the magic model to test this stuff: granite-code :)
No shame in reading ollama code either to figure out how it gets passed to llama.cpp
Force-pushed from 8dbb9e5 to c0b1b49
Force-pushed from c0b1b49 to baf2844
examples/run/run.cpp (outdated)
        return common_get_builtin_chat_template(model.get());
    }

    std::ifstream tmpl_file;
Would prefer ggml_fopen over streams, but it's not necessary; most important is to get it working first :)
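If it helps, a minimal sketch of that ggml_fopen variant (the helper name is made up; only ggml_fopen itself comes from ggml.h):

```cpp
// Sketch: read the whole chat-template file into a string with the C FILE API
// instead of std::ifstream. ggml_fopen is used because it handles UTF-8 paths
// on Windows; error handling is kept deliberately minimal.
#include <cstdio>
#include <string>
#include "ggml.h"

static bool read_chat_template_file(const std::string & path, std::string & out) {
    FILE * file = ggml_fopen(path.c_str(), "rb");
    if (file == nullptr) {
        return false;  // missing or unreadable template file
    }
    char buf[4096];
    size_t n;
    while ((n = std::fread(buf, 1, sizeof(buf), file)) > 0) {
        out.append(buf, n);
    }
    std::fclose(file);
    return true;
}
```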
    if (string_starts_with(model_, "file://") || std::filesystem::exists(model_)) {
    int resolve_model(std::string & model_, std::string & chat_template_) {
        int ret = 0;
        if (string_starts_with(model_, "file://")) {
Could we re-add "|| std::filesystem::exists(model_)"?
The logic is supposed to be: if the file exists, use it; otherwise, if we don't have a protocol specified, assume we need to pull from ollama.
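For illustration, a sketch of that resolution order; the downloader stubs only stand in for the functions in this PR, and their names and signatures here are made up:

```cpp
// Sketch of the intended resolution order: a local file wins, an explicit
// scheme picks the matching downloader, and a bare name falls back to an
// ollama-style pull. The pull_* stubs are placeholders, not the PR's code.
#include <filesystem>
#include <string>

static bool starts_with(const std::string & s, const std::string & prefix) {
    return s.rfind(prefix, 0) == 0;
}

// placeholder downloaders (illustrative signatures only)
static int pull_from_huggingface(const std::string & model) { (void) model; return 0; }
static int pull_from_ollama(const std::string & model) { (void) model; return 0; }

static int resolve_model(std::string & model_) {
    if (starts_with(model_, "file://")) {
        model_.erase(0, 7);              // strip the scheme, use the local path as-is
        return 0;
    }
    if (std::filesystem::exists(model_)) {
        return 0;                        // already on disk, nothing to download
    }
    if (starts_with(model_, "hf://")) {
        return pull_from_huggingface(model_);
    }
    // no scheme and no local file: assume we need to pull from ollama
    return pull_from_ollama(model_);
}
```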
I moved this into the download functions to check for the model and template file individually. If the model or template file is already there, we skip the download.
examples/run/run.cpp (outdated)
    int huggingface_dl_tmpl(const std::string & hfr, const std::vector<std::string> headers, const std::string & tn) {
        // if template already exists, don't download it
        struct stat info;
        if (stat(tn.c_str(), &info) == 0) {
Not a strictly required change, but we use std::filesystem::exists everywhere else in llama-run to check for the existence of files. Does stat work on Windows?
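FWIW, stat does exist on Windows (the CRT maps it to _stat), but it takes ANSI-codepage paths, so the std::filesystem check is the more portable option. A minimal sketch of the non-throwing variant (the helper name is made up):

```cpp
// Sketch: check whether the template file is already on disk without stat(),
// using the non-throwing std::filesystem overload.
#include <filesystem>
#include <string>
#include <system_error>

static bool template_already_downloaded(const std::string & tn) {
    std::error_code ec;  // swallow errors (e.g. permission issues) instead of throwing
    return std::filesystem::exists(tn, ec);
}
```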
Fixes: ggerganov#11178

The llama-run CLI currently doesn't take the chat template of a model into account. Thus, executing llama-run on a model requiring a chat template will fail. In order to solve this, the chat template is downloaded from ollama or huggingface as well and applied during the chat.

Signed-off-by: Michael Engel <[email protected]>
Force-pushed from baf2844 to a899673