
Does this work for mt5 architecture? #4

Open
hrsmanian opened this issue Jun 26, 2024 · 7 comments

Comments

@hrsmanian

hrsmanian commented Jun 26, 2024

Hi,
First of all, great work. I am a big proponent of Flan-T5 and use it in my projects. For multilingual use cases, the mT5 and bigscience/mt0 models provide a good baseline and are truly multilingual. Does Flash Attention work with the mT5 architecture? It seems only T5 is supported at the moment.

https://huggingface.co/bigscience/mt0-large, which is based on mT5, is the model I am looking at.

Thanks for the great work

@Ingvarstep
Contributor

@hrsmanian, thank you for your interest in the project. At first glance it should work, but I haven't run comprehensive tests. If you try it, let me know whether it works or not.

@mariothedev

@Ingvarstep - does it work?

@hrsmanian
Author

It has a different base class in Transformers. So while it is able to load, I am not sure the output is correct.

@Ingvarstep
Contributor

I have tested https://huggingface.co/bigscience/mt0-base; it works and produces the same outputs as transformers' MT5ForConditionalGeneration. Even if the classes are different, what matters for PyTorch is that the weight keys match.
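For anyone who wants to reproduce this check, a minimal sketch along these lines should work. The `FlashT5ForConditionalGeneration` import is hypothetical and stands in for whatever class this repository actually exports; only the transformers side is shown running.

```python
# Sketch: compare the flash-attention T5 implementation against the reference
# transformers MT5ForConditionalGeneration on bigscience/mt0-base.
import torch
from transformers import AutoTokenizer, MT5ForConditionalGeneration

model_name = "bigscience/mt0-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Reference implementation from transformers
reference = MT5ForConditionalGeneration.from_pretrained(model_name).eval()

# Hypothetical: the flash-attention variant from this repo, loaded from the
# same checkpoint (the weight keys have to match for this to work).
# from this_repo import FlashT5ForConditionalGeneration
# candidate = FlashT5ForConditionalGeneration.from_pretrained(model_name).eval()

text = "Translate to English: Je t'aime."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    ref_out = reference.generate(**inputs, max_new_tokens=20)
    # cand_out = candidate.generate(**inputs, max_new_tokens=20)

print(tokenizer.decode(ref_out[0], skip_special_tokens=True))
# Identical decoded text from both models suggests the weights loaded correctly
# and the attention rewrite preserves the model's behavior.
```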

@hrsmanian
Author

Thanks for checking. Did you observe any speedup? I did not observe any in my tests.
Also, since the classes and tokenizer are different, would it not be better to have a separate implementation for mT5?

@Ingvarstep
Contributor

For generation, the attention mechanism is not a bottleneck, especially for short sequences. You can see a speedup at sequence lengths of 4k+. Regarding a separate class, I don't see any reason for it right now.
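To get a feel for where that crossover appears, here is a rough micro-benchmark of the attention op itself, not this repo's full model, just an illustration of why short sequences show little benefit (requires a CUDA GPU):

```python
# Illustration: time naive attention vs. PyTorch's fused
# scaled_dot_product_attention at short and long sequence lengths.
# Flash-style kernels pay off as sequence length grows, which is why
# speedups mostly show up at 4k+ tokens.
import time
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

def bench(fn, q, k, v, iters=20):
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(q, k, v)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

device, dtype = "cuda", torch.float16
for seq_len in (512, 4096):
    q, k, v = (torch.randn(1, 8, seq_len, 64, device=device, dtype=dtype) for _ in range(3))
    t_naive = bench(naive_attention, q, k, v)
    t_fused = bench(F.scaled_dot_product_attention, q, k, v)
    print(f"seq_len={seq_len}: naive {t_naive * 1e3:.2f} ms, fused {t_fused * 1e3:.2f} ms")
```

Note that T5/mT5 attention also carries a relative position bias, which real implementations have to handle; the sketch above ignores it and only shows the scaling trend.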

@Ingvarstep
Contributor

But Flash Attention is clearly beneficial for training.
