Misc. bug: model warmup doesn't work correctly for MoE models #11163

Open
cpumaxx opened this issue Jan 9, 2025 · 2 comments

Comments

@cpumaxx
Contributor

cpumaxx commented Jan 9, 2025

Name and Version

build: 4449 (8a1d9c2) with cc (Debian 13.3.0-11) 13.3.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-cli

Command line

./build/bin/llama-cli -m ds3-q8.gguf -t 128 --numa distribute -c 8192 -ngl 0 --interactive-first --chat-template deepseek3

Problem description & steps to reproduce

If I load a dense model, warmup works correctly and the whole model is loaded into the OS cache.

However, if I load a big MoE (e.g. DeepSeek V3), warmup only loads a small portion of it (93GB out of 660GB).
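For context, the warmup path in common.cpp is (roughly paraphrased from memory, not an exact copy of the source) just a single decode of a one- or two-token batch, so only whatever weights that one forward pass touches get paged in:

    // rough paraphrase of the warmup in common_init_from_params(), not an exact copy:
    // build a minimal batch out of the BOS/EOS tokens, decode it once, clear the KV cache
    std::vector<llama_token> tmp;
    tmp.push_back(llama_token_bos(model));
    tmp.push_back(llama_token_eos(model));
    if (llama_model_has_decoder(model)) {
        llama_decode(lctx, llama_batch_get_one(tmp.data(), std::min(tmp.size(), (size_t) params.n_batch)));
    }
    llama_kv_cache_clear(lctx);

For a dense model that single decode touches every weight tensor, but for an MoE it only touches the experts the router happens to pick for those one or two tokens.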

I tested this and made an inefficient brute-force patch to common.cpp:

        // patched decoder branch of the warmup block: instead of a single
        // llama_decode() call, decode 255 different one-token batches so that
        // more of the experts get pulled into the OS cache
        if (llama_model_has_decoder(model)) {
            printf("decoding warmup tokens.");
            for (int i = 1; i < 256; i++) {
                llama_decode(lctx, llama_batch_get_one(tmp.data(), std::min(tmp.size(), (size_t) params.n_batch)));
                tmp.clear();
                tmp.push_back(i);
                printf(".");
            }
        } else {
            LOG_WRN("No Decoder Present. Warmup impossible");
        }
        printf("\n");

The benefit falls off sharply as the number of llama_decode() calls grows: with 256 calls it gets 540GB of the model loaded, and 1024 calls only get it to 620GB.

I think that ideally this function would detect the number of experts and route a single token through each expert via the router (this may need something other than llama_decode() that is aware of the expert router?).

I could probably make a good PR for this with some guidance.
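To sketch the "detect the number of experts" half of that idea: the expert count is stored in the GGUF metadata under an architecture-prefixed key, so it could be read with llama_model_meta_val_str() and used to size the warmup loop. This is illustrative only: warmup_n_expert is a hypothetical helper, and the metadata keys general.architecture and <arch>.expert_count are assumed here rather than verified:

    #include <cstdio>   // snprintf
    #include <cstdlib>  // atoi
    #include "llama.h"

    // hypothetical helper: read the routed-expert count from the GGUF metadata;
    // returns 0 for dense models (no <arch>.expert_count key present)
    static int warmup_n_expert(const struct llama_model * model) {
        char arch[64] = {0};
        if (llama_model_meta_val_str(model, "general.architecture", arch, sizeof(arch)) < 0) {
            return 0;
        }
        char key[128];
        snprintf(key, sizeof(key), "%s.expert_count", arch);
        char val[32] = {0};
        if (llama_model_meta_val_str(model, key, val, sizeof(val)) < 0) {
            return 0;
        }
        return atoi(val);
    }

The warmup loop above could then run std::max(1, warmup_n_expert(model)) decodes instead of a hard-coded 256, although without something router-aware there is still no guarantee that every expert actually gets hit.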

First Bad Commit

This has never worked, as far as I know.

Relevant log output

No logging for this problem. Need to watch OS cache usage with a tool.
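One way to watch OS cache usage (a throwaway sketch, not part of llama.cpp, and just one option next to tools like free or vmstat) is to poll the Cached: line of /proc/meminfo while warmup runs:

    // standalone Linux-only sketch: print the kernel page cache size once per
    // second so the effect of model warmup on the OS cache can be watched
    #include <chrono>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <thread>

    int main() {
        while (true) {
            std::ifstream meminfo("/proc/meminfo");
            std::string line;
            while (std::getline(meminfo, line)) {
                if (line.rfind("Cached:", 0) == 0) { // line starts with "Cached:"
                    std::cout << line << std::endl;
                    break;
                }
            }
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
    }

Run it in a second terminal while llama-cli starts up.
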
@jjparady

Being new to MoE, is this how it is supposed to work, or should it load parts into memory while performing inference? If loading the whole model into memory gives better performance I would prefer that, but I was under the impression that this is how MoE is supposed to function.

@cpumaxx
Contributor Author

cpumaxx commented Jan 13, 2025

@jjparady
Your understanding is correct, but the scale you're applying it to is slightly off: an MoE will use a single expert for a single token, but each token may very well be routed to a different expert. It is likely for a non-trivial prompt that every expert will contribute some percentage of the tokens to the output. Therefore the current system of sending a single null token for inference only hits a single expert, bringing 1/x of the model into memory caches (x being the number of experts in the model).
As I said in the issue, ideally there would be a mechanism to force the use of a given expert, so that only x tokens would need to be sent for inference (again, x being the number of experts) to properly warm up the model for production inference workloads.
Right now, warmup is not very effective, and the first query will be very slow, often timing out llama-server multiple times on big MoE models before a successful token is returned.
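To put rough numbers on why decoding arbitrary tokens has diminishing returns, here is a back-of-the-envelope sketch. It assumes uniform, independent routing (which real routers do not have) and DeepSeek-V3-like figures of 256 routed experts with 8 active per token; both numbers are assumptions for illustration, not measurements:

    #include <cmath>
    #include <cstdio>
    #include <initializer_list>

    // expected fraction of one MoE layer's experts touched by at least one of
    // N warmup tokens, with k of E experts active per token (uniform routing)
    static double expected_coverage(int E, int k, int N) {
        return 1.0 - std::pow(1.0 - (double) k / E, N);
    }

    int main() {
        const int E = 256; // routed experts per layer (assumed, DeepSeek-V3-like)
        const int k = 8;   // experts activated per token (assumed)
        for (int N : {1, 16, 64, 256, 1024}) {
            std::printf("N = %4d warmup decodes -> ~%.1f%% of each layer's experts touched\n",
                        N, 100.0 * expected_coverage(E, k, N));
        }
        return 0;
    }

Under that idealized model, 256 decodes would already cover essentially every expert, so the fact that 256 real calls only got 540GB of 660GB loaded shows how non-uniform the routing actually is for a stream of arbitrary token IDs, and why forcing each expert directly would be a much more reliable warmup.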
