Describe the bug
The decode performance of the Llama 3.2 11B (vision) model is lower than expected:
Expected: 14.8 t/s/u, 2880 ms (ttft = time to first token)
Actual: 11.4 t/s/u, 3813 ms (ttft)
More details: https://docs.google.com/spreadsheets/d/1Mdn3mBIOHYRC0ETsMJdO9dXtSVNJn_tFaR6QipbpEXU/edit?usp=sharing
To Reproduce
Steps to reproduce the behavior:
Please complete the following environment information:
I'll add that this is for N300 only; T3000 performance was as expected.
@skhorasganiTT to re-measure the perf
Hey @milank94 @tstescoTT, are you running 3.2-11b on N300 with max_num_seqs=16? (That is the batch size for which we reported the perf on N300.)
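For reference, a minimal sketch of bringing the model up with that batch size; the model ID, max_model_len, and serving mode are placeholders, not the exact tt-metal/vLLM setup used here:

```python
from vllm import LLM

# Sketch only: engine started with the batch size the N300 numbers were
# reported for. Model ID and max_model_len are illustrative placeholders.
llm = LLM(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    max_num_seqs=16,      # batch size used for the reported N300 perf
    max_model_len=4096,   # illustrative value; actual config may differ
)

# Equivalent flag when using the OpenAI-compatible server:
#   vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --max-num-seqs 16
```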
Yes, we measured with 16 concurrent requests. When more than 16 requests are sent, the vLLM backend queues the additional requests as expected.
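A minimal sketch of one way to exercise the >16 concurrent-request case against the OpenAI-compatible endpoint; the URL, model name, prompt, and request count are assumptions, not the actual test harness:

```python
import asyncio
import httpx

# Assumed local endpoint of the vLLM OpenAI-compatible server.
URL = "http://localhost:8000/v1/completions"
PAYLOAD = {
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",  # placeholder model name
    "prompt": "Summarize the attached report.",
    "max_tokens": 128,
}

async def one_request(client: httpx.AsyncClient, i: int) -> None:
    # Each request blocks until the scheduler admits and completes it.
    r = await client.post(URL, json=PAYLOAD, timeout=300)
    r.raise_for_status()
    print(f"request {i} finished")

async def main() -> None:
    async with httpx.AsyncClient() as client:
        # 24 > max_num_seqs=16, so the extra requests should sit in the
        # scheduler queue rather than being rejected.
        await asyncio.gather(*(one_request(client, i) for i in range(24)))

asyncio.run(main())
```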