server : add support for multiple responses #11142
Comments
I think having multiple sequence ids per slot will make it far more difficult to keep track of the KV cache. This will also force all sequence ids to share the same Instead, I'd suggest adding the notion of Upon receiving a request for N completions, we create N+1 tasks, but only one of them contains the
Because Whichever slot takes task 0:
For other slots that take either task 1, 2 or 3:
Nice! This seems like a reasonable way to do it.
It would be very useful to add multi-response support per slot so that a single request would be able to generate n independent completions. This functionality is useful in different situations. For example, a FIM completion can provide multiple alternative suggestions at a smaller or equal compute cost compared to running them sequentially.

I think this can be implemented by adding multiple sequence ids per slot (instead of having just one like we currently do). However, I am not yet sure how much complexity would be introduced to support this.
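
To make the "multiple sequence ids per slot" idea concrete, here is a minimal toy sketch in Python. It is not the server's code: `ToySlot`, `Cell`, and all of their methods are hypothetical names invented for illustration, and the cache is a plain list rather than a real KV cache. It only shows the bookkeeping involved, under the assumption that the prompt is stored once and tagged with every sequence id, while each completion appends cells tagged with its own id.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    """One cached position (hypothetical stand-in for a KV cache cell)."""
    pos: int
    token: str
    seq_ids: set = field(default_factory=set)


class ToySlot:
    """Hypothetical slot that owns several sequence ids instead of one."""

    def __init__(self, seq_ids):
        self.seq_ids = list(seq_ids)
        self.cells = []
        # Each sequence tracks its own length separately.
        self.n_past = {s: 0 for s in self.seq_ids}

    def add_prompt(self, tokens):
        # The prompt is stored once; every cell is tagged with all
        # sequence ids so each completion can reuse it.
        for pos, tok in enumerate(tokens):
            self.cells.append(Cell(pos, tok, set(self.seq_ids)))
        for s in self.seq_ids:
            self.n_past[s] = len(tokens)

    def append(self, seq_id, token):
        # A generated token belongs to exactly one sequence.
        self.cells.append(Cell(self.n_past[seq_id], token, {seq_id}))
        self.n_past[seq_id] += 1

    def sequence(self, seq_id):
        # Reassemble what a single sequence sees: shared prompt + own tokens.
        own = [c for c in self.cells if seq_id in c.seq_ids]
        return [c.token for c in sorted(own, key=lambda c: c.pos)]


slot = ToySlot(seq_ids=[0, 1, 2])
slot.add_prompt(["def", "add", "("])
slot.append(0, "a")
slot.append(1, "x")
slot.append(1, ",")
slot.append(2, "lhs")
```

Even in this toy version, the slot has to track a separate length per sequence and filter cells by id, which hints at the extra bookkeeping the issue is unsure about.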