Added chat template support to llama-run #11215

Draft · engelmi wants to merge 1 commit into master from add-chat-template-support

Conversation

@engelmi commented Jan 13, 2025

Fixes: #11178

The llama-run CLI currently doesn't take a model's chat template into account, so running llama-run on a model that requires a chat template will fail.
To solve this, the chat template is now downloaded from ollama or Hugging Face as well and applied during the chat.
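As a rough illustration of that flow (a sketch only, not the PR's actual code; the `download_blob` helper and the path arguments are hypothetical): fetch the model blob and, when the registry provides one, the chat template blob, skipping anything already cached, and fall back to the template embedded in the GGUF when no separate template is available.

```cpp
// Rough sketch of the intended flow; download_blob is a hypothetical helper,
// not a function from this PR or from llama.cpp.
#include <filesystem>
#include <string>

int download_blob(const std::string & url, const std::string & dest); // hypothetical

static int fetch_model_and_template(const std::string & model_url, const std::string & model_path,
                                    const std::string & tmpl_url,  const std::string & tmpl_path) {
    // the model blob is mandatory
    if (!std::filesystem::exists(model_path) && download_blob(model_url, model_path) != 0) {
        return 1;
    }
    // the chat template blob is optional: if it is missing or the download fails,
    // the chat template stored inside the GGUF file is used instead
    if (!tmpl_url.empty() && !std::filesystem::exists(tmpl_path)) {
        download_blob(tmpl_url, tmpl_path);
    }
    return 0;
}
```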


@@ -573,21 +592,76 @@ class LlamaData {
}
#endif

int huggingface_dl(const std::string & model, const std::vector<std::string> headers, const std::string & bn) {
int huggingface_dl_tmpl(const std::string & hfr, const std::vector<std::string> headers, const std::string & tn) {
Collaborator

In most cases, the chat template is already stored inside the gguf file and can be accessed using common_get_builtin_chat_template.
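For reference, reading that built-in template boils down to querying the GGUF metadata key `tokenizer.chat_template` through the public C API. A minimal sketch (error handling trimmed; not necessarily identical to what common_get_builtin_chat_template does internally):

```cpp
// Minimal sketch: fetch the chat template stored in the GGUF metadata.
// Returns an empty string when the model has no built-in template.
#include <string>
#include <vector>
#include "llama.h"

static std::string builtin_chat_template(const llama_model * model) {
    static const char * key = "tokenizer.chat_template";
    // first call with a null buffer to query the required length
    const int32_t len = llama_model_meta_val_str(model, key, nullptr, 0);
    if (len <= 0) {
        return "";
    }
    std::vector<char> buf(len + 1, 0);
    llama_model_meta_val_str(model, key, buf.data(), buf.size());
    return std::string(buf.data());
}
```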

Author (engelmi)

Thanks for the hint!
I added this to the initialize_chat_template function, where both the gguf file and the separate chat template file are inspected for the chat template.

Collaborator

Yeah, so what we are trying to do here only really affects the Ollama case, where the template layer Ollama downloaded seems to take precedence over the one inside the gguf file.

@ericcurtin (Collaborator) commented Jan 13, 2025

But maybe it's just my ignorance; maybe Hugging Face has functionality similar to Ollama's, you discovered it, and I was just unaware :)

Collaborator

IMO this huggingface_dl_tmpl function is redundant because convert_hf_to_gguf.py always copies the jinja template from tokenizer_config.json into the GGUF.

And btw, not sure if it's related, but for HF, the application/vnd.ollama.image.template inside the manifest is also a Go template, not Jinja. We do have a compatibility layer to convert Jinja --> Go so that it can work on ollama. This tool allows you to debug what's inside: https://huggingface.co/spaces/ngxson/debug_ollama_manifest

Collaborator

I think any model that works with Ollama should be able to work with llama-run. There are gaps in functionality; let's fill those gaps. Maybe the model isn't exactly well-formed, etc. But Ollama can still run this model at the end of the day, so there's no reason llama-run can't do the same.

@ngxson (Collaborator) commented Jan 13, 2025

> I think any model that works with Ollama should be able to work with llama-run

I doubt this, as ollama has their own fork (not a clone) of llama.cpp, so they can implement changes that are not compatible with upstream llama.cpp (an example can be phi-4, as explained above).

> There are gaps in functionality; let's fill those gaps. Maybe the model isn't exactly well-formed, etc.

To be completely honest, you can achieve that goal faster and more easily by moving llama-run into a full prod-ready product instead of letting it stay as an example. To properly support Jinja or Go templates, there are third-party libraries that can handle that. The reason we don't have them here in llama.cpp is that they are too big.

> But Ollama can still run this model at the end of the day, so there's no reason llama-run can't do the same.

True, but don't forget that a big part of ollama (i.e. the template part) is built in Go, not C++.

Collaborator

I hope that things like llama-server and llama-run do eventually become full prod-ready products.

@slaren (Collaborator) commented Jan 13, 2025

Agree. Maybe we should separate the examples that are meant to show how to use llama.cpp from programs like llama-cli, llama-server, and llama-run, and possibly others like llama-perplexity and llama-bench, which are intended to be used by everybody and are not particularly useful as examples. We should only distribute these programs in the binary distributions, rather than the entire set of examples and tests. They could be moved to a tools directory.

@ericcurtin (Collaborator) commented Jan 13, 2025

Part of me wishes we had a llama-client that interacted with llama-server (with a CLI interface like llama-run). Although rather than C++, Python possibly makes more sense, as one can use the openai client library.

@@ -607,16 +681,34 @@ class LlamaData {
}

nlohmann::json manifest = nlohmann::json::parse(manifest_str);
std::string layer;
std::string sha_model;
std::string sha_template;
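For context, the lookup this hunk is building toward walks the manifest's layers array and picks out blob digests by media type. A sketch with nlohmann::json (the model layer's media type string is an assumption here; only application/vnd.ollama.image.template is confirmed in this thread):

```cpp
// Sketch: extract the model and template digests from an ollama manifest.
#include <string>
#include <nlohmann/json.hpp>

struct manifest_digests {
    std::string sha_model;     // GGUF blob
    std::string sha_template;  // chat template blob (Go template syntax)
};

static manifest_digests parse_ollama_manifest(const std::string & manifest_str) {
    manifest_digests out;
    const nlohmann::json manifest = nlohmann::json::parse(manifest_str);
    for (const auto & layer : manifest.at("layers")) {
        const auto media_type = layer.at("mediaType").get<std::string>();
        if (media_type == "application/vnd.ollama.image.model") {           // assumed media type
            out.sha_model = layer.at("digest").get<std::string>();
        } else if (media_type == "application/vnd.ollama.image.template") { // mentioned in this thread
            out.sha_template = layer.at("digest").get<std::string>();
        }
    }
    return out;
}
```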
Collaborator

IMO this is a bad approach because ollama uses Go templates, not Jinja.

If you want to add this, be aware that most templates won't be detected.

Author (engelmi)

I didn't know about this. Considering that, I agree this approach doesn't seem good.
What do you think? @ericcurtin

Collaborator

I'm not an expert on this, but whatever works to make "vnd.ollama.image.template" compatible with llama-run

Collaborator

You know the magic model to test this stuff: granite-code :)

Collaborator

No shame in reading ollama code either to figure out how it gets passed to llama.cpp

@engelmi force-pushed the add-chat-template-support branch from 8dbb9e5 to c0b1b49 on January 13, 2025 at 15:25
examples/run/run.cpp: review thread marked outdated and resolved
@engelmi force-pushed the add-chat-template-support branch from c0b1b49 to baf2844 on January 13, 2025 at 15:38
return common_get_builtin_chat_template(model.get());
}

std::ifstream tmpl_file;
Collaborator

Would prefer ggml_fopen over streams, but it's not necessary; the most important thing is to get it working first :)
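For completeness, the stream-based variant the PR uses boils down to something like this (a sketch; a ggml_fopen-based version would be a straightforward swap):

```cpp
// Sketch: slurp the downloaded chat template file into a string.
#include <fstream>
#include <sstream>
#include <string>

static bool read_chat_template_file(const std::string & path, std::string & out) {
    std::ifstream tmpl_file(path);
    if (!tmpl_file) {
        return false; // missing/unreadable -> caller falls back to the GGUF built-in template
    }
    std::ostringstream ss;
    ss << tmpl_file.rdbuf();
    out = ss.str();
    return true;
}
```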

if (string_starts_with(model_, "file://") || std::filesystem::exists(model_)) {
int resolve_model(std::string & model_, std::string & chat_template_) {
int ret = 0;
if (string_starts_with(model_, "file://")) {
@ericcurtin (Collaborator) commented Jan 13, 2025

Could we re-add "|| std::filesystem::exists(model_)"?

The logic is supposed to be: if the file exists, use it; otherwise, if we don't have a protocol specified, assume we need to pull from ollama.
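A sketch of that precedence (standalone illustration; the real code's string_starts_with helper and its ollama download call are replaced by placeholders here):

```cpp
// Sketch of the intended resolution order:
// 1. an existing local path (with or without file://) is used as-is;
// 2. otherwise, a name without a protocol prefix is pulled from ollama.
#include <filesystem>
#include <string>

static bool starts_with(const std::string & s, const std::string & prefix) {
    return s.rfind(prefix, 0) == 0;
}

static int resolve_model_sketch(std::string & model_) {
    const bool has_file_prefix = starts_with(model_, "file://");
    if (has_file_prefix) {
        model_ = model_.substr(7); // strip the prefix, keep the local path
    }
    if (std::filesystem::exists(model_)) {
        return 0; // local file found, nothing to download
    }
    if (has_file_prefix) {
        return 1; // file:// was requested explicitly but the file is missing
    }
    // no local file and no protocol given -> pull from ollama
    // (placeholder: the real code would invoke its ollama download path here)
    return 0;
}
```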

Author (engelmi)

I moved this into the download functions to check for the model and the template file individually. If the model or template file is already there, we skip the download.

int huggingface_dl_tmpl(const std::string & hfr, const std::vector<std::string> headers, const std::string & tn) {
// if template already exists, don't download it
struct stat info;
if (stat(tn.c_str(), &info) == 0) {
@ericcurtin (Collaborator) commented Jan 13, 2025

Not a strictly required change, but we use std::filesystem::exists everywhere else in llama-run to check for the existence of files. Does stat work on Windows?
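For reference, the std::filesystem variant of that existence check is a one-liner (a sketch; it sidesteps the Windows question entirely, since <filesystem> is standard C++17):

```cpp
// Sketch: check whether the template has already been downloaded.
#include <filesystem>
#include <string>
#include <system_error>

static bool template_already_downloaded(const std::string & tn) {
    std::error_code ec;
    return std::filesystem::exists(tn, ec); // non-throwing overload; false on error
}
```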

@ochafik mentioned this pull request on Jan 14, 2025.
Fixes: ggerganov#11178

The llama-run CLI currently doesn't take the chat template of
a model into account. Thus executing llama-run on a model
requiring a chat template will fail.
In order to solve this, the chat template is being downloaded
from ollama or huggingface as well and applied during the chat.

Signed-off-by: Michael Engel <[email protected]>
@engelmi force-pushed the add-chat-template-support branch from baf2844 to a899673 on January 14, 2025 at 14:18

Successfully merging this pull request may close these issues.

Misc. bug: missing chat-template support in llama-run
4 participants