Being new to MoE, is this how it is supposed to work, or does it load parts into memory while performing inference? If I can get better performance by loading the whole model into memory, I would prefer that, but I was under the impression that this is how MoE is supposed to function.
@jjparady
Your understanding is correct, but the scale you're applying it to is slightly off: an MoE uses a single expert for a single token, but each token may well be routed to a different expert. For a non-trivial prompt, it is likely that every expert will contribute some percentage of the tokens to the output. The current system of sending a single null token for inference therefore only hits a single expert, bringing roughly 1/x of the model into the memory caches (x being the number of experts in the model).
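To make the routing concrete, here is a minimal, illustrative sketch of top-1 gating (not llama.cpp's actual router code; the function name is made up): each token gets its own set of router scores, so consecutive tokens can land on different experts.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative only: pick the highest-scoring expert for one token's router logits.
static size_t pick_expert(const std::vector<float> & router_logits) {
    return std::distance(router_logits.begin(),
                         std::max_element(router_logits.begin(), router_logits.end()));
}
// A prompt of many tokens produces many independent routing decisions and so
// typically touches many experts; a single-token warmup makes exactly one decision.
```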
As I said in the issue, ideally there would be a mechanism to force the use of a given expert, so that only x tokens (again, x being the number of experts) need to be sent for inference to properly warm up the model for production inference workloads.
Right now, warmup is not very effective, and the first query will be very slow, often causing llama-server to time out multiple times on big MoE models before a successful token is returned.
Name and Version
build: 4449 (8a1d9c2) with cc (Debian 13.3.0-11) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-cli
Command line
Problem description & steps to reproduce
If I load a dense model, warmup works correctly, loading the whole model into the OS cache.
However, if I load a big MoE (e.g. DeepSeek V3), warmup only loads a small portion of it (93 GB of 660 GB).
I tested this and made an inefficient brute-force patch to common.cpp:
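The patch itself isn't included above; the following is only a minimal sketch of what such a brute-force change to the warmup block in common.cpp could look like, assuming the API around this build (llama_n_vocab(model), llama_kv_cache_clear(ctx)); the call count and token choice are arbitrary.

```cpp
// Sketch only, not the actual patch: repeat the single-token warmup decode many times
// with varying token ids so the router hopefully touches more of the experts.
if (params.warmup) {
    const int n_warmup_decodes = 256;            // arbitrary; more calls touch more experts
    const int n_vocab          = llama_n_vocab(model);

    for (int i = 0; i < n_warmup_decodes; ++i) {
        llama_token tok = i % n_vocab;           // vary the token so the routing varies
        llama_decode(lctx, llama_batch_get_one(&tok, 1));
        llama_kv_cache_clear(lctx);              // keep the KV cache empty between calls
    }
    llama_synchronize(lctx);
    llama_perf_context_reset(lctx);
}
```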
The benefit falls off sharply as the number of llama_decode() calls increases: with 256 calls it gets 540 GB of the model loaded; 1024 calls only gets to 620 GB.
I think that, ideally, this function would detect the number of experts and route a single token through each expert via the router (this may need a function other than llama_decode() that is aware of the expert router).
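llama.cpp has no public API to pin the router to a specific expert, so everything marked hypothetical below is made up; this is only a sketch of the shape an expert-aware warmup could take (the expert count is already stored in the GGUF metadata, so exposing it seems feasible).

```cpp
// Hypothetical sketch: the expert-forcing calls below do not exist in llama.cpp.
// The idea is to decode one token per expert with the router pinned to that expert,
// so every expert's weights get paged into the OS cache exactly once.
static void warmup_all_experts(llama_context * ctx, const llama_model * model) {
    const int32_t n_expert = llama_model_n_expert(model); // hypothetical accessor for <arch>.expert_count
    llama_token   tok      = llama_token_bos(model);

    for (int32_t e = 0; e < n_expert; ++e) {
        llama_set_forced_expert(ctx, e);                  // hypothetical: pin the router to expert e
        llama_decode(ctx, llama_batch_get_one(&tok, 1));
        llama_kv_cache_clear(ctx);
    }
    llama_set_forced_expert(ctx, -1);                     // hypothetical: restore normal routing
}
```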
I could probably make a good PR for this with some guidance.
First Bad Commit
This has never worked, as far as I know.
Relevant log output
There is no log output for this problem; OS cache usage needs to be watched with a separate tool.