
Misc. bug: Very bad performance on Qwen 2 with HIP/ROCm #11153

Open
http403 opened this issue Jan 9, 2025 · 13 comments
http403 commented Jan 9, 2025

Name and Version

$ .\llama-cli.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
version: 4450 (8d59d911)
built with  for x86_64-pc-windows-msvc

Operating systems

Windows 11 24H2 Build 26100.2605

Which llama.cpp modules do you know to be affected?

llama-bench

Command line

$ .\llama-bench.exe -m Qwen2.5-14B-Instruct-Q4_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | ROCm,RPC   |  99 |         pp512 |        915.97 ± 5.53 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | ROCm,RPC   |  99 |         tg128 |          3.12 ± 0.01 |

build: 8d59d911 (4450)

Problem description & steps to reproduce

Description

It is horrendously slow, far slower than it should be. You can get a sense of how slow it is by comparing with the Vulkan backend result below, which is supposed to be the slower backend.

Steps to reproduce

  1. Get the latest hipBLAS build from the releases page
  2. Run llama-bench.exe with a model of your choice

First Bad Commit

No response

Relevant log output

No response

Additional Information

Results with other backends and builds

Vulkan backend

$ .\llama-bench.exe -m Qwen2.5-14B-Instruct-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
ggml_vulkan: Compiling shaders............................................Done!
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan,RPC |  99 |         pp512 |        987.29 ± 0.62 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan,RPC |  99 |         tg128 |         64.03 ± 0.21 |

build: 8d59d911 (4450)

b3808-hip

$ .\llama-bench.exe -m Qwen2.5-14B-Instruct-Q4_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 ?B Q4_K - Medium         |   8.37 GiB |    14.77 B | CUDA       |  99 |         pp512 |        914.20 ± 6.28 |
| qwen2 ?B Q4_K - Medium         |   8.37 GiB |    14.77 B | CUDA       |  99 |         tg128 |          3.12 ± 0.01 |

build: 1e7b929 (1)

mystery build from https://github.com/PiDanShouRouZhouXD/Sakura_Launcher_GUI/releases/tag/v0.0.3-alpha

$ .\llama-bench.exe -m Qwen2.5-14B-Instruct-Q4_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| qwen2 ?B Q4_K - Medium         |   8.37 GiB |    14.77 B | CUDA       |  99 |         pp512 |   1611.84 ± 6.60 |
| qwen2 ?B Q4_K - Medium         |   8.37 GiB |    14.77 B | CUDA       |  99 |         tg128 |     53.41 ± 0.08 |

build: 641f5dd2 (3534)

Temporary Workaround

Do not use the HIP build. Use Vulkan instead.

http403 changed the title from "Misc. bug: Very bad performance of latest llama.cpp builds with AMD GPU" to "Misc. bug: Very bad performance of latest llama.cpp HIP builds with AMD GPU" on Jan 9, 2025
tbocek commented Jan 11, 2025

I get the following numbers with latest master (Linux):

./build/bin/llama-bench --model /mnt/models/Qwen2.5-14B-Instruct-Q4_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | ROCm       |  99 |         pp512 |       1660.75 ± 3.84 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | ROCm       |  99 |         tg128 |         51.99 ± 0.27 |

build: c05e8c9 (4462)

Maybe a Windows issue?

http403 commented Jan 17, 2025

I don't know whether it's a Windows-specific issue, as I don't have a Linux machine to test on yet. However, I compiled 3edfa7d and the result stays the same. My 7900 XTX is at 100% load, but total system power draw is only ~270 W, compared to ~530 W with Vulkan. There is a page about performance troubleshooting, but nothing about troubleshooting low GPU resource utilization.

Here's the result

build\bin\llama-bench.exe -m Qwen2.5-14B-Instruct-Q4_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | ROCm       |  99 |         pp512 |        923.24 ± 1.52 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | ROCm       |  99 |         tg128 |          3.12 ± 0.01 |

build: 3edfa7d3 (4502)

Edit: correcting GPU load observation

slaren commented Jan 17, 2025

If you can find the exact commit that introduced the issue, that would greatly increase the chances of finding a solution. You can use git bisect to help you do this.
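
A rough sketch of such a session (purely illustrative: b3534 here stands for the tag of the fast "mystery build" mentioned above, assuming that tag exists in your clone; any known-fast ref works):

git bisect start
git bisect bad HEAD          # current build with the slow tg128
git bisect good b3534        # placeholder: any older ref known to be fast
# rebuild and run llama-bench on the commit git checks out, then mark it:
git bisect good              # or: git bisect bad
# repeat until git reports the first bad commit, then clean up:
git bisect reset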

http403 commented Jan 18, 2025

Upon closer examination, the slow performance only occurs with Qwen2 models, not with Llama 2. I'm still bisecting, trying to find a commit that has reasonable performance.

http403 changed the title from "Misc. bug: Very bad performance of latest llama.cpp HIP builds with AMD GPU" to "Misc. bug: Very bad performance on Qwen 2 with HIP/ROCm" on Jan 18, 2025
http403 commented Jan 18, 2025

@slaren Bad news. I have gone all the way down to 9b75cb2, the commit that enabled Qwen2 support, and the performance is still bad. I couldn't find a commit that can be considered good yet. For now, I can only assume the implementation has been flawed from the beginning, while the mystery build somehow behaves normally.

Here are the bench results:

HIP 5.7

build-b%LLAMA_TAG%-hip%HIP_VERSION%\bin\llama-bench.exe -m "Qwen2.5-14B-Instruct-Q4_K_M.gguf"
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 ?B Q4_K - Medium         |   8.37 GiB |    14.77 B | ROCm       |  99 |        pp 512 |       1457.52 ± 1.78 |
| qwen2 ?B Q4_K - Medium         |   8.37 GiB |    14.77 B | ROCm       |  99 |        tg 128 |          3.12 ± 0.00 |

build: 9b75cb2 (1923)

HIP 6.1

build-b%LLAMA_TAG%-hip%HIP_VERSION%\bin\llama-bench.exe -m "Qwen2.5-14B-Instruct-Q4_K_M.gguf"
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 ?B Q4_K - Medium         |   8.37 GiB |    14.77 B | ROCm       |  99 |        pp 512 |        931.57 ± 1.63 |
| qwen2 ?B Q4_K - Medium         |   8.37 GiB |    14.77 B | ROCm       |  99 |        tg 128 |          3.12 ± 0.02 |

build: 9b75cb2 (1923)

HIP 6.2

build-b%LLAMA_TAG%-hip%HIP_VERSION%\bin\llama-bench.exe -m "Qwen2.5-14B-Instruct-Q4_K_M.gguf"
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 ?B Q4_K - Medium         |   8.37 GiB |    14.77 B | ROCm       |  99 |        pp 512 |        925.39 ± 4.42 |
| qwen2 ?B Q4_K - Medium         |   8.37 GiB |    14.77 B | ROCm       |  99 |        tg 128 |          3.12 ± 0.01 |

build: 9b75cb2 (1923)

slaren commented Jan 18, 2025

I guess it comes down to build settings. I believe koboldcpp has ROCm releases; you can try checking whether their builds work for you, and if so, try their build settings.
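
One rough way to compare build settings between a fast and a slow build, assuming both are existing CMake build trees (the directory names below are placeholders; on Windows, findstr /i can stand in for grep), is to dump and diff the HIP/GGML-related cache entries:

# build-fast and build-slow are placeholder paths to two configured build directories
cmake -N -LA build-fast | grep -i -E "hip|ggml|gfx" > fast.txt
cmake -N -LA build-slow | grep -i -E "hip|ggml|gfx" > slow.txt
diff fast.txt slow.txt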

http403 commented Jan 18, 2025

koboldcpp doesn't have a ROCm release, but there is a fork that does. I downloaded v1.80.3 and the result is even worse than llama.cpp, at 2.9 tok/s.

Benchmark Completed - v1.80.3.yr0-ROCm Results:
======
Flags: NoAVX2=False Threads=5 HighPriority=False Cublas_Args=['normal', '1', 'nommq'] Tensor_Split=None BlasThreads=5 BlasBatchSize=512 FlashAttention=False KvCache=0
Timestamp: 2025-01-18 23:11:31.743193+00:00
Backend: koboldcpp_hipblas.dll
Layers: 51
Model: Qwen2.5-14B-Instruct-Q4_K_M
MaxCtx: 4096
GenAmount: 100
-----
ProcessingTime: 4.601s
ProcessingSpeed: 868.51T/s
GenerationTime: 33.882s
GenerationSpeed: 2.95T/s
TotalTime: 38.483s
Output:  1 1 1 1
-----
===

FeepingCreature commented Jan 20, 2025

Same issue here on Linux with Deepseek R1 (qwen based). This is how bad it is: CPU is more than twice as fast as my 7900 XTX with 49/49 GPU layers.

Correction: it was some build issue.

jtgladiator commented:

I'm experiencing the same. Full GPU offload of qwen2.5-coder-7b-instruct onto 7900xtx. T/S is 3.1

Started with lmstudio 0.3.6 using llama.cpp. Need to check my run times.

FeepingCreature commented Jan 20, 2025

I'm getting good performance (63 tok/s on tg128) with the Vulkan backend. So I guess I'll just use that...?

http403 commented Jan 20, 2025

I'm experiencing the same. Full GPU offload of qwen2.5-coder-7b-instruct onto 7900xtx. T/S is 3.1

Started with lmstudio 0.3.6 using llama.cpp. Need to check my run times.

The HIP/ROCm build from LM Studio should be okay. However, I do experience some instability when loading certain models (Phi-4), so I default to the Vulkan backend.

http403 commented Jan 20, 2025

Same issue here on Linux with Deepseek R1 (qwen based). This is how bad it is: CPU is more than twice as fast as my 7900 XTX with 49/49 GPU layers.

@FeepingCreature Can you please post your benchmark result with llama-bench?

Another user reported that the Linux build works correctly. I'm especially interested in the commit of your build; it may help accelerate my effort to find where the bug is.

Thank you.

FeepingCreature commented Jan 20, 2025

Update: I did a clean build of llama.cpp and now it runs fine.

ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon RX 7900 XTX, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| qwen2 14B Q5_K - Medium        |   9.78 GiB |    14.77 B | ROCm,BLAS  |      16 |         pp512 |       1745.49 ± 4.22 |
| qwen2 14B Q5_K - Medium        |   9.78 GiB |    14.77 B | ROCm,BLAS  |      16 |         tg128 |         51.97 ± 0.04 |

I didn't have the ROCm libs installed correctly when I first ran cmake-gui. I wonder if it held on to some bad configuration settings.

edit: Stupid, I should have made a copy...
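
For anyone else hitting this, a fully clean reconfigure is probably the safest way to rule out stale cache entries. A minimal sketch for Linux, assuming ROCm is installed and targeting gfx1100 for the 7900 XTX (flag names vary between llama.cpp versions; older trees used GGML_HIPBLAS instead of GGML_HIP):

# wipe the old build directory so no cached settings survive
rm -rf build
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
  cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j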
