Hi,

https://github.com/PixArt-alpha/PixArt-alpha/blob/master/diffusion/model/nets/PixArt_blocks.py#L53

The following snippet is taken from the MHCA block:

```python
attn_bias = None
if mask is not None:
    attn_bias = xformers.ops.fmha.BlockDiagonalMask.from_seqlens([N] * B, mask)
x = xformers.ops.memory_efficient_attention(q, k, v, p=self.attn_drop.p, attn_bias=attn_bias)
```
Here `mask` is a list with one entry per sample, giving the length of the non-zero (valid) part of each sample's mask. Reading the `memory_efficient_attention` code, it doesn't seem like the `BlockDiagonalMask`-type bias is actually used for the provided input:

https://github.com/facebookresearch/xformers/blob/f6637120b58c4b3626b18234f8c0c74c561b8d01/xformers/ops/fmha/__init__.py#L156
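For reference, this is my understanding of what `from_seqlens` builds (the toy lengths below are just for illustration, not the model's actual sizes): a block-diagonal additive bias that is 0 inside each sample's query/key block and -inf everywhere else.

```python
# Toy illustration (lengths assumed): from_seqlens builds an additive bias
# that is 0 inside each sample's block and -inf outside of it.
import torch
from xformers.ops import fmha

bias = fmha.BlockDiagonalMask.from_seqlens(q_seqlen=[2, 3], kv_seqlen=[2, 1])
dense = bias.materialize(shape=(5, 3))  # (sum of q lens, sum of kv lens)
print(dense)
# tensor([[0.,   0.,   -inf],
#         [0.,   0.,   -inf],
#         [-inf, -inf, 0.],
#         [-inf, -inf, 0.],
#         [-inf, -inf, 0.]])
```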
I suppose I must be missing something in understanding the operation.
Any guidance would be very helpful! Let me know if you need additional details about my environment.
The following minified Python code yields the same results (with and without the bias) for the kind of mask being used.
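A sketch of that repro (the batch size, sequence lengths, and head shapes below are assumptions, not the exact values I used):

```python
# Repro sketch with assumed shapes. With B = 1 the block-diagonal mask has a
# single block covering all queries and keys, so the bias excludes nothing
# and both calls return the same output.
import torch
import xformers.ops
from xformers.ops import fmha

torch.manual_seed(0)

B, N = 1, 120   # batch size and query tokens per sample (assumed)
L = 77          # non-zero (valid) condition tokens per sample (assumed)
H, D = 8, 64    # heads and head dim (assumed)

# Packed layout used by the MHCA block: batch dim 1, samples concatenated.
q = torch.randn(1, B * N, H, D, device="cuda", dtype=torch.float16)
k = torch.randn(1, B * L, H, D, device="cuda", dtype=torch.float16)
v = torch.randn(1, B * L, H, D, device="cuda", dtype=torch.float16)

mask = [L] * B  # per-sample valid token counts, as passed in the block
attn_bias = fmha.BlockDiagonalMask.from_seqlens([N] * B, mask)

out_biased = xformers.ops.memory_efficient_attention(q, k, v, attn_bias=attn_bias)
out_plain = xformers.ops.memory_efficient_attention(q, k, v)

print((out_biased - out_plain).abs().max())  # ~0 for this single-block mask
```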