Being new to MoE, is this how it is supposed to work, or does it load parts into memory while performing inference? If I can get better performance by loading the whole model into memory, I would prefer that, but I was under the impression that this is how MoE is supposed to function.
@jjparady
Your understanding is correct, but the scale you're applying it to is slightly off: an MoE uses a single expert for a single token, but each token may well be routed to a different expert. For a non-trivial prompt, it is likely that every expert will contribute some percentage of the tokens to the output. The current system of sending a single null token for inference therefore only hits a single expert, bringing roughly 1/x of the model into the memory caches (x being the number of experts in the model).
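To make the routing concrete, here is a minimal, illustrative sketch of top-1 gating (not llama.cpp's actual router code; the function name is made up): each token gets its own set of router scores, so consecutive tokens can land on different experts.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative only: pick the highest-scoring expert for one token's router logits.
static size_t pick_expert(const std::vector<float> & router_logits) {
    return std::distance(router_logits.begin(),
                         std::max_element(router_logits.begin(), router_logits.end()));
}
// A prompt of many tokens produces many independent routing decisions and so
// typically touches many experts; a single-token warmup makes exactly one decision.
```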
As I said in the issue, ideally there would be a mechanism to force the use of a given expert, so that only x tokens (again, x being the number of experts) need to be sent for inference to properly warm up the model for production inference workloads.
Right now, warmup is not very effective, and the first query will be very slow, often causing llama-server to time out multiple times on big MoE models before a successful token is returned.
Name and Version
build: 4449 (8a1d9c2) with cc (Debian 13.3.0-11) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-cli
Command line
Problem description & steps to reproduce
If I load a dense model, warmup works correctly, loading the whole model into the OS cache.
However, if I load a big MoE (e.g. DeepSeek V3), warmup only loads a small portion of it (93 GB of 660 GB).
I tested this and made an inefficient brute-force patch to common.cpp:
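The patch itself isn't included above; the following is only a minimal sketch of what such a brute-force change to the warmup block in common.cpp could look like, assuming the API around this build (llama_n_vocab(model), llama_kv_cache_clear(ctx)); the call count and token choice are arbitrary.

```cpp
// Sketch only, not the actual patch: repeat the single-token warmup decode many times
// with varying token ids so the router hopefully touches more of the experts.
if (params.warmup) {
    const int n_warmup_decodes = 256;            // arbitrary; more calls touch more experts
    const int n_vocab          = llama_n_vocab(model);

    for (int i = 0; i < n_warmup_decodes; ++i) {
        llama_token tok = i % n_vocab;           // vary the token so the routing varies
        llama_decode(lctx, llama_batch_get_one(&tok, 1));
        llama_kv_cache_clear(lctx);              // keep the KV cache empty between calls
    }
    llama_synchronize(lctx);
    llama_perf_context_reset(lctx);
}
```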
The benefit falls off sharply as the number of llama_decode() calls increases: with 256 calls it gets 540 GB of the model loaded; 1024 calls only gets to 620 GB.
I think that, ideally, this function would detect the number of experts and route a single token through each expert via the router (this may need a function other than llama_decode() that is aware of the expert router).
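llama.cpp has no public API to pin the router to a specific expert, so everything marked hypothetical below is made up; this is only a sketch of the shape an expert-aware warmup could take (the expert count is already stored in the GGUF metadata, so exposing it seems feasible).

```cpp
// Hypothetical sketch: the expert-forcing calls below do not exist in llama.cpp.
// The idea is to decode one token per expert with the router pinned to that expert,
// so every expert's weights get paged into the OS cache exactly once.
static void warmup_all_experts(llama_context * ctx, const llama_model * model) {
    const int32_t n_expert = llama_model_n_expert(model); // hypothetical accessor for <arch>.expert_count
    llama_token   tok      = llama_token_bos(model);

    for (int32_t e = 0; e < n_expert; ++e) {
        llama_set_forced_expert(ctx, e);                  // hypothetical: pin the router to expert e
        llama_decode(ctx, llama_batch_get_one(&tok, 1));
        llama_kv_cache_clear(ctx);
    }
    llama_set_forced_expert(ctx, -1);                     // hypothetical: restore normal routing
}
```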
I could probably make a good PR for this with some guidance.
First Bad Commit
This has never worked, as far as I know.
Relevant log output
There is no log output for this problem; OS cache usage needs to be watched with a separate tool.