
Misc. bug: Very bad performance on Qwen 2 with HIP/ROCm #11153

Open
http403 opened this issue Jan 9, 2025 · 13 comments
http403 commented Jan 9, 2025

Name and Version

$ .\llama-cli.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
version: 4450 (8d59d911)
built with  for x86_64-pc-windows-msvc

Operating systems

Windows 11 24H2 Build 26100.2605

Which llama.cpp modules do you know to be affected?

llama-bench

Command line

$ .\llama-bench.exe -m Qwen2.5-14B-Instruct-Q4_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | ROCm,RPC   |  99 |         pp512 |        915.97 ± 5.53 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | ROCm,RPC   |  99 |         tg128 |          3.12 ± 0.01 |

build: 8d59d911 (4450)

Problem description & steps to reproduce

Description

It is horrendously slow, far slower than it should be. You can get a sense of how slow it is by comparing with the Vulkan backend result below, which is supposed to be the slower backend.

Steps to reproduce

  1. Get the latest hipBLAS build from the releases page
  2. Run llama-bench.exe with a model of your choice

First Bad Commit

No response

Relevant log output

No response

Additional Information

Results with other backends and builds

Vulkan backend

$ .\llama-bench.exe -m Qwen2.5-14B-Instruct-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
ggml_vulkan: Compiling shaders............................................Done!
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan,RPC |  99 |         pp512 |        987.29 ± 0.62 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan,RPC |  99 |         tg128 |         64.03 ± 0.21 |

build: 8d59d911 (4450)

b3808-hip

$ .\llama-bench.exe -m Qwen2.5-14B-Instruct-Q4_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 ?B Q4_K - Medium         |   8.37 GiB |    14.77 B | CUDA       |  99 |         pp512 |        914.20 ± 6.28 |
| qwen2 ?B Q4_K - Medium         |   8.37 GiB |    14.77 B | CUDA       |  99 |         tg128 |          3.12 ± 0.01 |

build: 1e7b929 (1)

mystery build from https://github.com/PiDanShouRouZhouXD/Sakura_Launcher_GUI/releases/tag/v0.0.3-alpha

$ .\llama-bench.exe -m Qwen2.5-14B-Instruct-Q4_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| qwen2 ?B Q4_K - Medium         |   8.37 GiB |    14.77 B | CUDA       |  99 |         pp512 |   1611.84 ± 6.60 |
| qwen2 ?B Q4_K - Medium         |   8.37 GiB |    14.77 B | CUDA       |  99 |         tg128 |     53.41 ± 0.08 |

build: 641f5dd2 (3534)

Temporary Workaround

Do not use the HIP build. Use Vulkan instead.

http403 changed the title from "Misc. bug: Very bad performance of latest llama.cpp builds with AMD GPU" to "Misc. bug: Very bad performance of latest llama.cpp HIP builds with AMD GPU" on Jan 9, 2025
tbocek commented Jan 11, 2025

I get the following numbers with latest master (Linux):

./build/bin/llama-bench --model /mnt/models/Qwen2.5-14B-Instruct-Q4_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | ROCm       |  99 |         pp512 |       1660.75 ± 3.84 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | ROCm       |  99 |         tg128 |         51.99 ± 0.27 |

build: c05e8c9 (4462)

Maybe a Windows issue?

http403 commented Jan 17, 2025

I don't know whether it's a Windows-specific issue, as I don't have a Linux machine to test on yet. However, I compiled 3edfa7d and the result stays the same. My 7900 XTX is at 100% load, but total system power draw is only ~270 W, compared to ~530 W with Vulkan. There is a page about performance troubleshooting, but nothing about troubleshooting low GPU resource utilization.

Here's the result

build\bin\llama-bench.exe -m Qwen2.5-14B-Instruct-Q4_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | ROCm       |  99 |         pp512 |        923.24 ± 1.52 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | ROCm       |  99 |         tg128 |          3.12 ± 0.01 |

build: 3edfa7d3 (4502)

Edit: correcting GPU load observation

slaren commented Jan 17, 2025

If you can find the exact commit that introduced the issue, that would greatly increase the chances of finding a solution. You can use git bisect to help you do this.
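
A rough sketch of such a session (purely illustrative: b3534 here stands for the tag of the fast "mystery build" mentioned above, assuming that tag exists in your clone; any known-fast ref works):

git bisect start
git bisect bad HEAD          # current build with the slow tg128
git bisect good b3534        # placeholder: any older ref known to be fast
# rebuild and run llama-bench on the commit git checks out, then mark it:
git bisect good              # or: git bisect bad
# repeat until git reports the first bad commit, then clean up:
git bisect reset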

http403 commented Jan 18, 2025

Upon closer examination, the slow performance only occurs with Qwen2 models, not with Llama 2. I'm still bisecting, trying to find a commit that has reasonable performance.

http403 changed the title from "Misc. bug: Very bad performance of latest llama.cpp HIP builds with AMD GPU" to "Misc. bug: Very bad performance on Qwen 2 with HIP/ROCm" on Jan 18, 2025
http403 commented Jan 18, 2025

@slaren Bad news. I have gone all the way down to 9b75cb2, the commit that enabled Qwen2 support, and the performance is still bad. I couldn't find a commit that can be considered good yet. For now, I can only assume the implementation has been flawed from the beginning, while the mystery build somehow behaves normally.

Here are the bench results:

HIP 5.7

build-b%LLAMA_TAG%-hip%HIP_VERSION%\bin\llama-bench.exe -m "Qwen2.5-14B-Instruct-Q4_K_M.gguf"
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 ?B Q4_K - Medium         |   8.37 GiB |    14.77 B | ROCm       |  99 |        pp 512 |       1457.52 ± 1.78 |
| qwen2 ?B Q4_K - Medium         |   8.37 GiB |    14.77 B | ROCm       |  99 |        tg 128 |          3.12 ± 0.00 |

build: 9b75cb2 (1923)

HIP 6.1

build-b%LLAMA_TAG%-hip%HIP_VERSION%\bin\llama-bench.exe -m "Qwen2.5-14B-Instruct-Q4_K_M.gguf"
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 ?B Q4_K - Medium         |   8.37 GiB |    14.77 B | ROCm       |  99 |        pp 512 |        931.57 ± 1.63 |
| qwen2 ?B Q4_K - Medium         |   8.37 GiB |    14.77 B | ROCm       |  99 |        tg 128 |          3.12 ± 0.02 |

build: 9b75cb2 (1923)

HIP 6.2

build-b%LLAMA_TAG%-hip%HIP_VERSION%\bin\llama-bench.exe -m "Qwen2.5-14B-Instruct-Q4_K_M.gguf"
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 ?B Q4_K - Medium         |   8.37 GiB |    14.77 B | ROCm       |  99 |        pp 512 |        925.39 ± 4.42 |
| qwen2 ?B Q4_K - Medium         |   8.37 GiB |    14.77 B | ROCm       |  99 |        tg 128 |          3.12 ± 0.01 |

build: 9b75cb2 (1923)

slaren commented Jan 18, 2025

I guess it comes down to build settings. I believe koboldcpp has ROCm releases; you can try checking whether their builds work for you, and if so, try their build settings.
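
One rough way to compare build settings between a fast and a slow build, assuming both are existing CMake build trees (the directory names below are placeholders; on Windows, findstr /i can stand in for grep), is to dump and diff the HIP/GGML-related cache entries:

# build-fast and build-slow are placeholder paths to two configured build directories
cmake -N -LA build-fast | grep -i -E "hip|ggml|gfx" > fast.txt
cmake -N -LA build-slow | grep -i -E "hip|ggml|gfx" > slow.txt
diff fast.txt slow.txt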

http403 commented Jan 18, 2025

koboldcpp doesn't have a ROCm release, but there is a fork that does. I downloaded v1.80.3 and the result is even worse than llama.cpp, at 2.9 tok/s.

Benchmark Completed - v1.80.3.yr0-ROCm Results:
======
Flags: NoAVX2=False Threads=5 HighPriority=False Cublas_Args=['normal', '1', 'nommq'] Tensor_Split=None BlasThreads=5 BlasBatchSize=512 FlashAttention=False KvCache=0
Timestamp: 2025-01-18 23:11:31.743193+00:00
Backend: koboldcpp_hipblas.dll
Layers: 51
Model: Qwen2.5-14B-Instruct-Q4_K_M
MaxCtx: 4096
GenAmount: 100
-----
ProcessingTime: 4.601s
ProcessingSpeed: 868.51T/s
GenerationTime: 33.882s
GenerationSpeed: 2.95T/s
TotalTime: 38.483s
Output:  1 1 1 1
-----
===

FeepingCreature commented Jan 20, 2025

Same issue here on Linux with Deepseek R1 (qwen based). This is how bad it is: CPU is more than twice as fast as my 7900 XTX with 49/49 GPU layers.

Correction: it was some build issue.

jtgladiator commented:

I'm experiencing the same. Full GPU offload of qwen2.5-coder-7b-instruct onto 7900xtx. T/S is 3.1

Started with lmstudio 0.3.6 using llama.cpp. Need to check my run times.

FeepingCreature commented Jan 20, 2025

I'm getting good performance (63 tok/s on tg128) with the Vulkan backend. So I guess I'll just use that...?

http403 commented Jan 20, 2025

I'm experiencing the same. Full GPU offload of qwen2.5-coder-7b-instruct onto 7900xtx. T/S is 3.1

Started with lmstudio 0.3.6 using llama.cpp. Need to check my run times.

The HIP/ROCm build from LM Studio should be okay. However, I do experience some instability when loading certain models (Phi-4), so I default to the Vulkan backend.

http403 commented Jan 20, 2025

Same issue here on Linux with Deepseek R1 (qwen based). This is how bad it is: CPU is more than twice as fast as my 7900 XTX with 49/49 GPU layers.

@FeepingCreature Can you please post your benchmark result with llama-bench?

Another user reported that the Linux build works correctly. I'm especially interested in the commit of your build; it may help accelerate my effort to find where the bug is.

Thank you.

FeepingCreature commented Jan 20, 2025

Update: I did a clean build of llama.cpp and now it runs fine.

ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon RX 7900 XTX, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| qwen2 14B Q5_K - Medium        |   9.78 GiB |    14.77 B | ROCm,BLAS  |      16 |         pp512 |       1745.49 ± 4.22 |
| qwen2 14B Q5_K - Medium        |   9.78 GiB |    14.77 B | ROCm,BLAS  |      16 |         tg128 |         51.97 ± 0.04 |

I didn't have the ROCm libs installed correctly when I first ran cmake-gui. I wonder if it held on to some bad configuration settings.

edit: Stupid, I should have made a copy...
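
For anyone else hitting this, a fully clean reconfigure is probably the safest way to rule out stale cache entries. A minimal sketch for Linux, assuming ROCm is installed and targeting gfx1100 for the 7900 XTX (flag names vary between llama.cpp versions; older trees used GGML_HIPBLAS instead of GGML_HIP):

# wipe the old build directory so no cached settings survive
rm -rf build
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
  cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j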
