Describe the bug
The decode performance of the Llama 3.2 11B (vision) model is lower than expected:
Expected: 14.8 t/s/u, 2880 ms (ttft = time to first token)
Actual: 11.4 t/s/u, 3813 ms (ttft)
More details: https://docs.google.com/spreadsheets/d/1Mdn3mBIOHYRC0ETsMJdO9dXtSVNJn_tFaR6QipbpEXU/edit?usp=sharing
To Reproduce
Steps to reproduce the behavior:
Please complete the following environment information:
I'll add that this is for N300 only; T3000 performance was as expected.
@skhorasganiTT to re-measure the perf
Hey @milank94 @tstescoTT, are you running 3.2-11b on N300 with max_num_seqs=16? (That is the batch size for which we reported the perf on N300.)
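For reference, a minimal sketch of bringing the model up with that batch size; the model ID, max_model_len, and serving mode are placeholders, not the exact tt-metal/vLLM setup used here:

```python
from vllm import LLM

# Sketch only: engine started with the batch size the N300 numbers were
# reported for. Model ID and max_model_len are illustrative placeholders.
llm = LLM(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    max_num_seqs=16,      # batch size used for the reported N300 perf
    max_model_len=4096,   # illustrative value; actual config may differ
)

# Equivalent flag when using the OpenAI-compatible server:
#   vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --max-num-seqs 16
```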
Yes, we measured with 16 concurrent requests. When more than 16 requests are sent, the vLLM backend queues the additional requests as expected.
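A minimal sketch of one way to exercise the >16 concurrent-request case against the OpenAI-compatible endpoint; the URL, model name, prompt, and request count are assumptions, not the actual test harness:

```python
import asyncio
import httpx

# Assumed local endpoint of the vLLM OpenAI-compatible server.
URL = "http://localhost:8000/v1/completions"
PAYLOAD = {
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",  # placeholder model name
    "prompt": "Summarize the attached report.",
    "max_tokens": 128,
}

async def one_request(client: httpx.AsyncClient, i: int) -> None:
    # Each request blocks until the scheduler admits and completes it.
    r = await client.post(URL, json=PAYLOAD, timeout=300)
    r.raise_for_status()
    print(f"request {i} finished")

async def main() -> None:
    async with httpx.AsyncClient() as client:
        # 24 > max_num_seqs=16, so the extra requests should sit in the
        # scheduler queue rather than being rejected.
        await asyncio.gather(*(one_request(client, i) for i in range(24)))

asyncio.run(main())
```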