
Is there a strict requirement for GPUs that support flash_attention? #17

Open
ChalvYongkang opened this issue Oct 23, 2024 · 15 comments

Comments

@ChalvYongkang

Is there a strict requirement for GPUs that support flash_attention? I tried to test on a V100, but this GPU does not support flash_attention, resulting in the error "RuntimeError: No available kernel. Aborting execution."

/Allegro/allegro/models/transformers/block.py:824: UserWarning: Memory efficient kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:723.)
hidden_states = F.scaled_dot_product_attention(
/Allegro/allegro/models/transformers/block.py:824: UserWarning: Memory Efficient attention has been runtime disabled. (Triggered internally at ../aten/src/ATen/native/transformers/sdp_utils_cpp.h:495.)
hidden_states = F.scaled_dot_product_attention(
/Allegro/allegro/models/transformers/block.py:824: UserWarning: Flash attention kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:725.)
hidden_states = F.scaled_dot_product_attention(
/Allegro/allegro/models/transformers/block.py:824: UserWarning: Flash attention only supports gpu architectures in the range [sm80, sm90]. Attempting to run on a sm 7.0 gpu. (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:201.)
hidden_states = F.scaled_dot_product_attention(
/Allegro/allegro/models/transformers/block.py:824: UserWarning: CuDNN attention kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:727.)
hidden_states = F.scaled_dot_product_attention(
/Allegro/allegro/models/transformers/block.py:824: UserWarning: The CuDNN backend needs to be enabled by setting the enviornment variableTORCH_CUDNN_SDPA_ENABLED=1 (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:496.)
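For reference, a quick check along the lines of the warnings above (an illustrative snippet, not part of the Allegro code) shows whether the GPU can use the flash kernel at all:

```python
import torch

# Flash attention requires compute capability in the range sm80-sm90,
# so a V100 (sm70) cannot use it. Print the capability and which SDPA
# backends torch currently has enabled.
major, minor = torch.cuda.get_device_capability()
print(f"compute capability: sm{major}{minor}")
print("flash SDP enabled:        ", torch.backends.cuda.flash_sdp_enabled())
print("mem-efficient SDP enabled:", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math SDP enabled:         ", torch.backends.cuda.math_sdp_enabled())
```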

@ChalvYongkang
Author

ChalvYongkang commented Oct 23, 2024

I solved this problem by changing "with sdpa_kernel(SDPBackend.FLASH_ATTENTION)" (line 824 of Allegro/allegro/models/transformers/block.py) to "with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=True)", which ensures flash attention stays disabled.
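Roughly, the change looks like this (a sketch only; the variable names and the attention call are illustrative, not the exact Allegro code):

```python
import torch
import torch.nn.functional as F

# Original: the block pins SDPA to the flash kernel, which pre-sm80 GPUs
# (e.g. V100, sm70) cannot provide:
#   with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
#       hidden_states = F.scaled_dot_product_attention(query, key, value)

# Workaround: disable the flash backend and allow the math and
# memory-efficient backends instead.
def sdpa_without_flash(query, key, value):
    # torch.backends.cuda.sdp_kernel is deprecated in newer PyTorch releases
    # in favor of torch.nn.attention.sdpa_kernel, but it still works here.
    with torch.backends.cuda.sdp_kernel(
        enable_flash=False,
        enable_math=True,
        enable_mem_efficient=True,
    ):
        return F.scaled_dot_product_attention(query, key, value)
```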

@nightsnack
Contributor

No, there is not. Feel free to modify the attention processor.

@ChalvYongkang
Author

ChalvYongkang commented Oct 23, 2024

A new problem: after changing my code as shown above, it now tries to allocate 560.82 GiB during the test. And nothing changes even though enable_cpu_offload is set to True.

File "/Allegro/allegro/models/transformers/block.py", line 826, in __call__
hidden_states = F.scaled_dot_product_attention(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 560.82 GiB. GPU 0 has a total capacity of 31.74 GiB of which 26.35 GiB is free. Process 2048906 has 5.38 GiB memory in use. Of the allocated memory 4.81 GiB is allocated by PyTorch, and 218.90 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

@nightsnack
Contributor

What? 560 GB? It seems something weird is happening on the V100. I remember I tested xformers on A100 and the memory cost stayed the same.
We don't have a V100, and I'm afraid there's nothing I can do about it, unfortunately.

@ChalvYongkang
Author

ChalvYongkang commented Oct 23, 2024

I found the issue. The V100 does not support bfloat16 precision, but it doesn't throw an error; the underlying implementation might be falling back to some very expensive computation path. After I switched to float16 precision, it ran successfully, using 6 GiB on a single GPU. However, generating a result takes about 4 hours, so I guess I need faster GPUs. :)
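A quick way to sanity-check this up front (illustrative snippet; the helper's behavior varies slightly across PyTorch versions):

```python
import torch

# Native bfloat16 support starts with Ampere (sm80); a V100 is sm70, so
# bf16 math there runs emulated and can be drastically slower. Note that
# some PyTorch versions also count emulated support as True here.
print(torch.cuda.is_bf16_supported())
```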

@Grownz

Grownz commented Oct 23, 2024

How do you switch precision modes?

@ChalvYongkang
Copy link
Author

ChalvYongkang commented Oct 23, 2024

How do you switch precision modes?
Just change line 13, "dtype=torch.bfloat16", in single_inference.py to "dtype=torch.float16".
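In other words, something like this (a sketch only; the comments are just for context, and only the dtype assignment matters):

```python
import torch

# single_inference.py, line 13: the dtype chosen here is what the whole
# pipeline runs in.
# dtype = torch.bfloat16   # original: needs Ampere (sm80)+ for native support
dtype = torch.float16      # works on pre-Ampere GPUs such as the V100
```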

@Grownz

Grownz commented Oct 23, 2024

didn't work either way, but thank you anyways :)

@Grownz

Grownz commented Oct 23, 2024

I solved this problem by changing "with sdpa_kernel(SDPBackend.FLASH_ATTENTION)" (line 824 of Allegro/allegro/models/transformers/block.py) to "with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=True)", which ensures flash attention stays disabled.

This did work, but it is brutally slow (RTX 3090).

ai-anchorite added a commit to pinokiofactory/Allegro-txt2vid that referenced this issue Oct 25, 2024
@FurkanGozukara

@Grownz that is literally unusable

3 hours on rtx 3090

@Grownz

Grownz commented Oct 28, 2024

@Grownz that is literally unusable

3 hours on rtx 3090

I know, I pointed that out, too.

@FurkanGozukara

@Grownz do you think that can be sped up somehow? Or do we have to wait for the RTX 5090 :D

@Grownz

Grownz commented Oct 28, 2024

I don't think this is due to low raw performance, but to unsupported attention modes (to dive deeper: https://developer.nvidia.com/blog/emulating-the-attention-mechanism-in-transformer-models-with-a-fully-convolutional-network/). This might be solved via updated drivers, but since NVIDIA doesn't care much about ML on consumer hardware, I doubt there will be an immediate official solution.

@FurkanGozukara

@Grownz so again it comes down to NVIDIA's shameless monopoly :( ty

@YuanXiaoYaoZiZai

I solved this problem by changing "with sdpa_kernel(SDPBackend.FLASH_ATTENTION)" (line 824 of Allegro/allegro/models/transformers/block.py) to "with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=True)", which ensures flash attention stays disabled.

I tried this; it works for me on an RTX 4090.
