We might not even need to write much new code for this, I suppose. Since the models are separate, we can start (main_A + speculative) on instance_A and (main_B + speculative) on instance_B, where main_A and main_B are the first and second halves of the main model. Then we need to orchestrate the data/logic passing during each transition phase:
In the 'middle' of main model processing (A is done with the first half), we need to pass the activations from A to B, and pass whatever B has speculated so far back to A.
At the end of main model processing (B is done with the logits), we need to take the latest speculation on B, consolidate it with what has been produced on A in the meantime, pass the 'current approved tokens' to A, and start speculating on B again.
Repeat.
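To make the two transition points concrete, here is a rough single-process sketch of that loop in Python. Everything in it is a hypothetical stand-in, not llama.cpp code: `main_next` and `draft_next` are toy deterministic "models", `first_half` / `second_half` only mimic the split of the main model across the two instances, and the "activations" are just a copy of the token batch. It also assumes greedy verification (accept the longest matching prefix of the draft, then take the main model's token), which the plan above does not specify.

```python
from typing import List

def main_next(context: List[int]) -> int:
    """Toy main model (assumption): next token depends only on the last token."""
    return (context[-1] * 3 + 1) % 17

def draft_next(context: List[int]) -> int:
    """Toy speculative model (assumption): wrong whenever len(context) % 5 == 0."""
    guess = main_next(context)
    return guess if len(context) % 5 else (guess + 1) % 17

def speculate(context: List[int], n: int) -> List[int]:
    """Extend `context` by n draft tokens and return only the new tokens."""
    out: List[int] = []
    for _ in range(n):
        out.append(draft_next(context + out))
    return out

def first_half(batch: List[int]) -> List[int]:
    """Stand-in for main_A on instance_A; real activations would be tensors."""
    return list(batch)

def second_half(activations: List[int]) -> List[int]:
    """Stand-in for main_B on instance_B: predicted next token per position."""
    return [main_next(activations[: i + 1]) for i in range(len(activations))]

def verify(approved: List[int], draft: List[int], preds: List[int]) -> List[int]:
    """Greedy check: keep matching draft tokens, then add the main model's token."""
    base = len(approved) - 1
    accepted: List[int] = []
    for j, tok in enumerate(draft):
        if preds[base + j] != tok:
            break
        accepted.append(tok)
    accepted.append(preds[base + len(accepted)])
    return accepted

def generate(prompt: List[int], n_tokens: int, k: int = 3) -> List[int]:
    approved = list(prompt)
    spec: List[int] = []  # speculation that continues `approved`
    while len(approved) - len(prompt) < n_tokens:
        # Phase 1: A runs the first half; B speculates in the meantime.
        draft_in_batch = list(spec)
        activations = first_half(approved + draft_in_batch)
        spec_b = spec + speculate(approved + spec, k)
        # Transition 1: activations go A -> B, B's speculation goes back to A.
        spec_a = spec_b
        # Phase 2: B runs the second half; A keeps speculating.
        preds = second_half(activations)
        spec_a = spec_a + speculate(approved + spec_a, k)
        # Transition 2: verify, consolidate, hand the approved tokens over.
        new_tokens = verify(approved, draft_in_batch, preds)
        approved += new_tokens
        # Keep the speculation only if it agrees with what was just approved.
        if spec_a[: len(new_tokens)] == new_tokens:
            spec = spec_a[len(new_tokens):]
        else:
            spec = []
    return approved[len(prompt): len(prompt) + n_tokens]

def greedy(prompt: List[int], n_tokens: int) -> List[int]:
    """Reference: plain token-by-token decoding with the main model only."""
    ctx = list(prompt)
    for _ in range(n_tokens):
        ctx.append(main_next(ctx))
    return ctx[len(prompt):]
```

The useful property to check is that `generate([1, 2, 3], 20)` returns the same tokens as `greedy([1, 2, 3], 20)`: speculation should only change how many main-model passes are needed, never the output. In a real setup the two phases would run concurrently on the two instances and the transitions would be network transfers; here they are sequential only to keep the sketch readable.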
plan copypasta from ggerganov/llama.cpp#6853 (reply in thread):
Relevant links:
Devices I can test it on are: