A collection of tricks to simplify and speed up transformer models:
- Slim attention (cut your context memory in half without loss of accuracy): [podcast], [paper], [notebook]
- Flash normalization: [podcast], [paper], [code]
- Precomputing the first layer: [podcast], [paper]
- Removing weights from skipless transformers: [podcast], [paper], [notebook]
- Approximate attention [work in progress]: [podcast], [paper]
Many of these tricks follow a recent trend of removing parts from neural networks, such as RMSNorm’s removal of mean centering from LayerNorm, T5’s removal of bias parameters, NoPE’s removal of positional encoding, GPT’s removal of the encoder stack, and of course the transformer’s revolutionary removal of recurrent layers. Specifically, our FlashNorm removes the weights from RMSNorm and merges them into the next linear layer, and slim attention removes the entire V-cache from the context memory of MHA transformers.
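Both removals rest on simple algebraic identities. The NumPy sketch below illustrates them under stated assumptions: FlashNorm folds the RMSNorm weights into the following linear layer, and slim attention reconstructs V from K via V = K (W_K⁻¹ W_V), which assumes square, invertible key projections (i.e., MHA rather than MQA/GQA). Variable names are illustrative and are not part of the package’s API.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                           # model dimension
x = rng.standard_normal(d)       # one token activation

def rmsnorm(v, eps=1e-6):
    # weightless RMSNorm: normalize by the root-mean-square of v
    return v / np.sqrt(np.mean(v**2) + eps)

# --- FlashNorm: merge the RMSNorm weights into the next linear layer ---
g = rng.standard_normal(d)       # RMSNorm scaling weights
W = rng.standard_normal((d, d))  # next linear layer: y = W @ x_normed

y_ref   = W @ (rmsnorm(x) * g)   # standard weighted RMSNorm, then linear layer
W_fused = W * g                  # scale column j of W by g[j]
y_fused = W_fused @ rmsnorm(x)   # weightless RMSNorm, then fused linear layer
assert np.allclose(y_ref, y_fused)

# --- Slim attention: drop the V-cache and recover V from K ---
n   = 8                              # number of cached tokens
X   = rng.standard_normal((n, d))    # token activations
W_K = rng.standard_normal((d, d))    # key projection (square and invertible for MHA)
W_V = rng.standard_normal((d, d))    # value projection

K     = X @ W_K                          # only K is kept in the context memory
V_ref = X @ W_V                          # what a standard V-cache would hold
V_rec = K @ (np.linalg.inv(W_K) @ W_V)   # V reconstructed from K on the fly
assert np.allclose(V_ref, V_rec)

print("FlashNorm fusion and slim-attention V reconstruction match the baselines")
```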
Install the transformer tricks package with pip:
pip install transformer-tricks
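After installation, converting an existing model to FlashNorm might look roughly like the sketch below. The function name `flashify_repo` and the model ID are assumptions for illustration and may not match the package’s actual API; see the notebooks linked above for the authoritative usage.

```python
# Hypothetical usage sketch: the function name below is an assumption,
# not a confirmed part of the transformer-tricks API.
import transformer_tricks as tt

# Convert a HuggingFace model repo to FlashNorm (assumed function name and model ID)
tt.flashify_repo('HuggingFaceTB/SmolLM-135M')
```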
Please give us a ⭐ if you like this repo, thanks!