Is there a strict requirement for GPUs that support flash_attention? #17
Comments
I solved this problem by changing "with sdpa_kernel(SDPBackend.FLASH_ATTENTION)" (line 824 of Allegro/allegro/models/transformers/block.py) to "with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=True)", which ensures flash attention is disabled.
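For reference, a minimal sketch of the change described above; the surrounding attention call is simplified to bare query/key/value tensors, so treat it as an illustration rather than the actual block.py code:

```python
import torch
import torch.nn.functional as F

# Before (block.py around line 824): flash attention was forced, which fails on
# GPUs without sm80+ support.
#   with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
#       hidden_states = F.scaled_dot_product_attention(query, key, value)

# After: explicitly disable the flash kernel and let PyTorch fall back to the
# math / memory-efficient kernels.
def attention_without_flash(query, key, value):
    with torch.backends.cuda.sdp_kernel(
        enable_flash=False, enable_math=True, enable_mem_efficient=True
    ):
        return F.scaled_dot_product_attention(query, key, value)
```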
No, there is not. Feel free to modify the attention processor.
A new problem: after changing my code as shown above, it reports that 560.82 GiB is required, and nothing changes even though enable_cpu_offload is set to True. File "/Allegro/allegro/models/transformers/block.py", line 826, in __call__
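For what it's worth, a small sketch of how CPU offload is typically enabled in diffusers-style pipelines; the loading call and checkpoint path are assumptions rather than Allegro's actual API, and the project's own enable_cpu_offload flag may be wired differently:

```python
import torch
from diffusers import DiffusionPipeline  # placeholder; Allegro ships its own pipeline class

# Hypothetical loading call -- substitute the real Allegro checkpoint/pipeline.
pipe = DiffusionPipeline.from_pretrained("<path-to-allegro-checkpoint>",
                                         torch_dtype=torch.float16)

# In diffusers-style pipelines, offload is enabled on the pipeline object and
# trades generation speed for VRAM:
pipe.enable_model_cpu_offload()         # moves whole sub-models to CPU between uses
# pipe.enable_sequential_cpu_offload()  # more aggressive, lowest VRAM, slowest
```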
I found the issue. The V100 does not support bfloat16 precision, but it doesn't throw an error; the underlying implementation probably falls back to some very expensive computations. After I switched to float16 precision, it ran successfully, using 6 GiB on a single GPU. However, generating a result takes about 4 hours, so I guess I need faster GPUs. :)
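A minimal sketch of what such a precision switch might look like; picking the dtype from the GPU's compute capability is an assumption of this sketch, not something stated in the thread:

```python
import torch

# bfloat16 needs an sm80+ GPU (Ampere or newer); the V100 is sm 7.0,
# so fall back to float16 there.
major, _ = torch.cuda.get_device_capability()
dtype = torch.bfloat16 if major >= 8 else torch.float16

# Either pass torch_dtype=dtype to from_pretrained when loading the model,
# or cast an already-loaded module/pipeline:
# model = model.to(dtype=dtype)
```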
How do you switch precision modes?
Didn't work either way, but thank you anyway. :)
@Grownz that is literally unusable, 3 hours on an RTX 3090.
I know, I pointed that out too.
@Grownz do you think that can be sped up somehow? Or do we have to wait for the RTX 5090? :D
I don't think this is due to low raw performance, but due to unsupported attention modes (to dive deeper: https://developer.nvidia.com/blog/emulating-the-attention-mechanism-in-transformer-models-with-a-fully-convolutional-network/). This might be solved via updated drivers, but since NVIDIA doesn't care much about ML on consumer hardware, I doubt there will be an immediate official solution.
@Grownz so again it is related to the shameless NVIDIA monopoly :( ty
I tried this; it works for me on an RTX 4090.
Is there a strict requirement for GPUs that support flash_attention? I tried to test on a V100, but this GPU does not support flash_attention, resulting in the error "RuntimeError: No available kernel. Aborting execution."
/Allegro/allegro/models/transformers/block.py:824: UserWarning: Memory efficient kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:723.)
hidden_states = F.scaled_dot_product_attention(
/Allegro/allegro/models/transformers/block.py:824: UserWarning: Memory Efficient attention has been runtime disabled. (Triggered internally at ../aten/src/ATen/native/transformers/sdp_utils_cpp.h:495.)
hidden_states = F.scaled_dot_product_attention(
/Allegro/allegro/models/transformers/block.py:824: UserWarning: Flash attention kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:725.)
hidden_states = F.scaled_dot_product_attention(
/Allegro/allegro/models/transformers/block.py:824: UserWarning: Flash attention only supports gpu architectures in the range [sm80, sm90]. Attempting to run on a sm 7.0 gpu. (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:201.)
hidden_states = F.scaled_dot_product_attention(
/Allegro/allegro/models/transformers/block.py:824: UserWarning: CuDNN attention kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:727.)
hidden_states = F.scaled_dot_product_attention(
/Allegro/allegro/models/transformers/block.py:824: UserWarning: The CuDNN backend needs to be enabled by setting the enviornment variable
TORCH_CUDNN_SDPA_ENABLED=1
(Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:496.)
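As a quick sanity check against warnings like these, one can print which SDPA backends PyTorch currently has enabled and the GPU's compute capability (a small sketch; the sm80-sm90 threshold comes from the warning text above):

```python
import torch

# Which scaled-dot-product-attention backends are enabled in this session:
print("flash:        ", torch.backends.cuda.flash_sdp_enabled())
print("mem_efficient:", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math:         ", torch.backends.cuda.math_sdp_enabled())

# Flash attention needs a GPU in the sm80-sm90 range; the V100 is sm 7.0.
major, minor = torch.cuda.get_device_capability()
print(f"compute capability: sm{major}{minor}")

# The cuDNN backend mentioned in the last warning is gated behind an
# environment variable: export TORCH_CUDNN_SDPA_ENABLED=1 before launching.
```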