
Does this work for mt5 architecture? #4

Open
hrsmanian opened this issue Jun 26, 2024 · 7 comments

Comments

@hrsmanian

hrsmanian commented Jun 26, 2024

Hi,
First of all, great work. I am a big proponent of Flan-T5 and use it in my projects. For multilingual use cases, the mT5 and bigscience/mt0 models provide a good baseline and are truly multilingual. Does Flash Attention work with the mT5 architecture? It seems only T5 is supported at the moment.

https://huggingface.co/bigscience/mt0-large, which is based on mT5, is the model I am looking at.

Thanks for the great work

@Ingvarstep
Contributor

@hrsmanian, thank you for your interest in the project. At first glance it should work, but I haven't run comprehensive tests. If you try it, let me know whether it works or not.

@mariothedev

@Ingvarstep - does it work?

@hrsmanian
Author

It has a different base class in Transformers. So while it is able to load, I am not sure the output is correct.

@Ingvarstep
Contributor

I have tested https://huggingface.co/bigscience/mt0-base; it works and produces the same outputs as transformers' MT5ForConditionalGeneration. Even if the classes are different, what matters for PyTorch is that the weight keys match.
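For anyone who wants to reproduce this check, a minimal sketch along these lines should work. The `FlashT5ForConditionalGeneration` import is hypothetical and stands in for whatever class this repository actually exports; only the transformers side is shown running.

```python
# Sketch: compare the flash-attention T5 implementation against the reference
# transformers MT5ForConditionalGeneration on bigscience/mt0-base.
import torch
from transformers import AutoTokenizer, MT5ForConditionalGeneration

model_name = "bigscience/mt0-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Reference implementation from transformers
reference = MT5ForConditionalGeneration.from_pretrained(model_name).eval()

# Hypothetical: the flash-attention variant from this repo, loaded from the
# same checkpoint (the weight keys have to match for this to work).
# from this_repo import FlashT5ForConditionalGeneration
# candidate = FlashT5ForConditionalGeneration.from_pretrained(model_name).eval()

text = "Translate to English: Je t'aime."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    ref_out = reference.generate(**inputs, max_new_tokens=20)
    # cand_out = candidate.generate(**inputs, max_new_tokens=20)

print(tokenizer.decode(ref_out[0], skip_special_tokens=True))
# Identical decoded text from both models suggests the weights loaded correctly
# and the attention rewrite preserves the model's behavior.
```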

@hrsmanian
Author

Thanks for checking. Did you observe any speedup? I did not observe any in my tests.
Also, since the classes and tokenizer are different, would it not be better to have a separate implementation for mT5?

@Ingvarstep
Contributor

For generation, the attention mechanism is not a bottleneck, especially for short sequences. You can see a speedup at sequence lengths of 4k+. Regarding a separate class, I don't see any reason for it right now.
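To get a feel for where that crossover appears, here is a rough micro-benchmark of the attention op itself, not this repo's full model, just an illustration of why short sequences show little benefit (requires a CUDA GPU):

```python
# Illustration: time naive attention vs. PyTorch's fused
# scaled_dot_product_attention at short and long sequence lengths.
# Flash-style kernels pay off as sequence length grows, which is why
# speedups mostly show up at 4k+ tokens.
import time
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

def bench(fn, q, k, v, iters=20):
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(q, k, v)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

device, dtype = "cuda", torch.float16
for seq_len in (512, 4096):
    q, k, v = (torch.randn(1, 8, seq_len, 64, device=device, dtype=dtype) for _ in range(3))
    t_naive = bench(naive_attention, q, k, v)
    t_fused = bench(F.scaled_dot_product_attention, q, k, v)
    print(f"seq_len={seq_len}: naive {t_naive * 1e3:.2f} ms, fused {t_fused * 1e3:.2f} ms")
```

Note that T5/mT5 attention also carries a relative position bias, which real implementations have to handle; the sketch above ignores it and only shows the scaling trend.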

@Ingvarstep
Contributor

But Flash Attention is clearly beneficial for training.
