Does this work for the mT5 architecture? #4
Hi,
First of all, great work. I am a big proponent of Flan-T5 and use it in my projects. For multilingual use cases, mT5 and the bigscience/mt0 models provide a good baseline and are truly multilingual. Does Flash Attention work on the mT5 architecture? It seems only T5 is supported right now.
https://huggingface.co/bigscience/mt0-large is the model I am looking at, which is based on mT5.
Thanks for the great work.

Comments
@hrsmanian, thank you for your interest in the project. At first glance it should work, but I haven't run comprehensive tests. If you test it, let me know whether it works or not.
@Ingvarstep - does it work?
It has a different base class in Transformers. So while it is able to load, I am not sure whether the output is correct.
I have tested https://huggingface.co/bigscience/mt0-base and it works, producing the same outputs as transformers' MT5ForConditionalGeneration. Even though the classes are different, what matters for PyTorch is that the weight keys match.
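For anyone who wants to reproduce that check, a minimal sketch could look like the one below. It uses transformers' T5ForConditionalGeneration as a stand-in for this repo's T5-based class (that substitution is my assumption; swap in the actual class you are testing) and simply verifies that the weight keys and forward-pass logits line up for bigscience/mt0-base.

```python
# Sketch of the equivalence check described above. Assumes transformers is
# installed; T5ForConditionalGeneration stands in for the repo's T5-based
# class and may warn about loading an mt5-type config.
import torch
from transformers import AutoTokenizer, MT5ForConditionalGeneration, T5ForConditionalGeneration

name = "bigscience/mt0-base"
tok = AutoTokenizer.from_pretrained(name)
ref = MT5ForConditionalGeneration.from_pretrained(name).eval()  # reference class
alt = T5ForConditionalGeneration.from_pretrained(name).eval()   # T5-style class

# 1) Same weight keys: this is what makes the checkpoint loadable at all.
assert set(ref.state_dict()) == set(alt.state_dict())

# 2) Same logits on identical inputs.
enc = tok("Translate to German: Hello", return_tensors="pt")
dec = tok("Hallo", return_tensors="pt").input_ids
with torch.no_grad():
    ref_logits = ref(**enc, decoder_input_ids=dec).logits
    alt_logits = alt(**enc, decoder_input_ids=dec).logits
print("max abs diff:", (ref_logits - alt_logits).abs().max().item())
```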
Thanks for checking. Did you observe any speedup? I did not observe any speedup myself.
For generation, the attention mechanism is not a bottleneck, especially for short sequences. You can see a speedup at sequence lengths of 4k+. Regarding a separate class, I don't see any reason for it right now.
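To put a rough number on that, here is a small, self-contained timing sketch (not from this repo): it compares a naive attention forward pass against torch.nn.functional.scaled_dot_product_attention, which can dispatch to a FlashAttention kernel on supported GPUs. Absolute numbers depend on hardware, but the gap typically only opens up at the longer sequence lengths.

```python
# Illustrative micro-benchmark, not part of this repo: naive attention vs.
# torch's scaled_dot_product_attention (FlashAttention-backed on capable GPUs).
import time
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # Materializes the full (seq x seq) score matrix, unlike FlashAttention.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

for seq_len in (512, 4096):
    q = torch.randn(1, 8, seq_len, 64, device=device, dtype=dtype)
    k, v = torch.randn_like(q), torch.randn_like(q)
    for label, fn in (("naive", naive_attention),
                      ("sdpa", F.scaled_dot_product_attention)):
        fn(q, k, v)                       # warm-up
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(10):
            fn(q, k, v)
        if device == "cuda":
            torch.cuda.synchronize()
        elapsed = (time.perf_counter() - start) / 10
        print(f"seq_len={seq_len:5d}  {label:6s}  {elapsed * 1000:.2f} ms")
```

On a typical GPU the two paths are close at 512 tokens but usually diverge noticeably at 4k; for training, not materializing the score matrix also saves memory, which is part of why the benefit shows up there as well.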
But Flash Attention is clearly beneficial for training.