Releases · predibase/lorax
v0.5: CUDA graph compilation
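The headline feature of this release is CUDA graph compilation, which cuts per-step kernel launch overhead during decoding. As a minimal sketch of the underlying technique (illustrative only, not LoRAX's actual integration), PyTorch's CUDA graph API captures a forward pass once and replays it against static buffers:

```python
import torch

# Minimal CUDA graph capture/replay sketch (requires a CUDA GPU).
model = torch.nn.Linear(64, 64).cuda()
static_x = torch.randn(8, 64, device="cuda")

# Warm up on a side stream before capture, per the PyTorch docs.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_x)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into a graph, then replay it: new inputs are
# copied into the same static buffer, avoiding per-op kernel launches.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_y = model(static_x)

static_x.copy_(torch.randn(8, 64, device="cuda"))
g.replay()  # static_y now holds the output for the new input
```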
🎉 Enhancements
🐛 Bugfixes
- Fixed deadlock in sgmv_shrink kernel caused by imbalanced segments by @tgaddair in #156
- Fixed loading adapter from absolute s3 path by @tgaddair in #161
📝 Docs
- Update client docs with new endpoint source by @abidwael in #126
- Update client docs with new endpoint source by @abidwael in #146
🔧 Maintenance
- Reduce Docker size by removing duplicate torch install by @tgaddair in #144
- remove CACHE_MANAGER in flash_causal_lm.py by @michaelfeil in #157
New Contributors
- @michaelfeil made their first contribution in #157
Full Changelog: v0.4.1...v0.5.0
v0.4.1
v0.4.0
🎉 Enhancements
- Mixtral by @flozi00 in #122
- Added Phi by @tgaddair in #132
- add support for H100s by @thelinuxkid in #111
- upgrade to py 3.10 by @flozi00 in #121
- Add predibase as a source for adapters by @magdyksaleh in #125 (see the client sketch after this list)
- enh: Add SOCI indexing to allow lazy loading of LoRAX images by @gyanesh-mishra in #95
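With Predibase available as an adapter source (#125), requests can target adapters hosted there. A minimal sketch with the Python client; the adapter name is hypothetical and the `adapter_source="pbase"` identifier is an assumption:

```python
from lorax import Client  # pip install lorax-client

client = Client("http://127.0.0.1:8080")
response = client.generate(
    "What is the capital of France?",
    adapter_id="my-org/my-adapter",  # hypothetical Predibase adapter name
    adapter_source="pbase",          # assumed identifier for the Predibase source
)
print(response.generated_text)
```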
🐛 Bugfixes
- fix: Set Mistral sliding window to max position embeddings when None by @tgaddair in #128
- Fix Qwen tensor parallelism by @tgaddair in #120
- fix: Llama AWQ with GQA by @tgaddair in #114
- fix: Mixtral adapter loading wraps lm_head by @tgaddair in #131
📝 Docs
- Add Skypilot example and getting started guide by @tgaddair in #117
- docs: fix broken link by @Fluder-Paradyne in #133
- Added Mixtral and Phi to docs by @tgaddair in #134
🔧 Maintenance
- Increase default client timeout to 60s by @tgaddair in #119 (see the snippet after this list)
- Make transpose contiguous for fan-in-fan-out by @tgaddair in #129
- remove lorax env var by @geoffreyangus in #113
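The timeout bump in #119 helps with long generations; it can also be overridden per client, assuming the Python client constructor exposes a `timeout` parameter:

```python
from lorax import Client

# Request timeout in seconds; the default was raised to 60 in #119.
# `timeout` as a constructor argument is an assumption here.
client = Client("http://127.0.0.1:8080", timeout=120)
```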
New Contributors
- @gyanesh-mishra made their first contribution in #95
- @thelinuxkid made their first contribution in #111
- @Fluder-Paradyne made their first contribution in #133
Full Changelog: v0.3.0...v0.4.0
v0.3.0
What's Changed
Enhancements
- Add AWQ quantization by @flozi00 in #102 (see the sketch after this list)
- Add support for Qwen by @tgaddair in #103
- Add Flash GPT2 by @geoffreyangus in #93
- LoRAX-compatible GPT-2 by @geoffreyangus in #109
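On AWQ (#102): as a rough sketch of what weight-only 4-bit quantization stores, here is the bare skeleton in PyTorch. Real AWQ additionally chooses scales in an activation-aware way and keeps weights packed for fused kernels; everything below is illustrative:

```python
import torch

# Weight-only int4 skeleton: per-group fp scales plus rounded integer
# weights, dequantized on the fly. Not the real AWQ algorithm or kernel.
def quantize_int4(w: torch.Tensor, group_size: int = 128):
    groups = w.reshape(-1, group_size)
    scales = (groups.abs().max(dim=1, keepdim=True).values / 7.0).clamp(min=1e-8)
    q = torch.clamp((groups / scales).round(), -8, 7).to(torch.int8)
    return q, scales

def dequantize_int4(q, scales, shape):
    return (q.float() * scales).reshape(shape)

w = torch.randn(256, 256)
q, scales = quantize_int4(w)
w_hat = dequantize_int4(q, scales, w.shape)
print(f"mean abs reconstruction error: {(w - w_hat).abs().mean():.4f}")
```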
Bugfixes
- decrease the max batch total tokens manually by @flozi00 in #89
- Added --max-active-adapters to launcher by @tgaddair in #96
- fix gptq fp16 inference by @flozi00 in #104
- fix static adapter merge by @geoffreyangus in #106
Maintenance
- Update values.yaml tag to always use the latest image by @arnavgarg1 in #87
- Update chart version by @abidwael in #88
- Warn if there are unused weights in the adapter by @tgaddair in #105
- docs: Added client docs for connecting to Predibase endpoints by @tgaddair in #98
- Generalized layer types and row parallel split logic by @tgaddair in #110
- Mkdocs by @tgaddair in #112
New Contributors
- @arnavgarg1 made their first contribution in #87
Full Changelog: v0.2.1...v0.3.0
lorax-0.3.0
LoRAX is the open-source framework for serving hundreds of fine-tuned LLMs in production for the price of one.
lorax-0.2.1
LoRAX is the open-source framework for serving hundreds of fine-tuned LLMs in production for the price of one.
v0.2.1
v0.2.0
What's Changed
Enhancements
- Implement sparse SGMV by @tgaddair in #64
- Implement tensor parallel SGMV by @tgaddair in #79
- Add adapter support for all linear layers in Llama and Mistral by @tgaddair in #75
- 4 bit support by @flozi00 in #66
- Exllamav2 by @flozi00 in #60
Bugfixes
- Updated to custom SGMV kernel to fix issue with certain ranks by @tgaddair in #70
- fix: Allow using unsupported base models without adapter loading by @tgaddair in #76
Maintenance
- Add DISABLE_SGMV env var to explicitly fall back to the loop by @tgaddair in #69 (the loop's semantics are sketched after this list)
- Upgrade the README discord badge and use an invite link that doesn't expire. by @justinxzhao in #73
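For context on the SGMV work in this release (#64, #79) and the loop fallback behind DISABLE_SGMV (#69): SGMV batches heterogeneous LoRA adapters by grouping tokens into per-adapter segments. A rough reference for what the shrink half computes, written as the naive per-segment loop; names and shapes are illustrative, not the real kernel interface:

```python
import torch

def lora_shrink_loop(x, lora_a_weights, segments, indices):
    """Naive per-segment loop, roughly what the DISABLE_SGMV=1 fallback
    computes; the fused SGMV kernel does this in a single launch.
    x:              (total_tokens, hidden) batched inputs
    lora_a_weights: one (hidden, rank) LoRA A matrix per adapter
    segments:       (num_segments + 1,) token offsets into x
    indices:        (num_segments,) adapter index for each segment
    """
    out = x.new_zeros(x.shape[0], lora_a_weights[0].shape[1])
    for i, adapter in enumerate(indices):
        start, end = segments[i], segments[i + 1]
        out[start:end] = x[start:end] @ lora_a_weights[adapter]
    return out

# Two requests on adapter 0 and one on adapter 1, batched together:
x = torch.randn(6, 32)
weights = [torch.randn(32, 8), torch.randn(32, 8)]
out = lora_shrink_loop(x, weights, segments=[0, 4, 6], indices=[0, 1])
```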
New Contributors
- @justinxzhao made their first contribution in #73
Full Changelog: v0.1.2...v0.2.0
v0.1.2
v0.1.1
What's Changed
- Add Helm charts to deploy models by @abidwael in #27
- change defaults for helm chart by @noyoshi in #38
- add helm release wf by @noyoshi in #39
- Added support for YARN scaling by @tgaddair in #45
- Fixed tensor parallelism splits by @tgaddair in #47
- enh: enable CodeLlama by @geoffreyangus in #48
- Fallback when Punica is not installed by @tgaddair in #49
- add transformers gptq weights by @flozi00 in #52
- Add support for paged attention v2 and update flash attention v2 by @tgaddair in #54 (see the toy cache sketch after this list)
- Fixed adapter loading for GPTQ base models by @tgaddair in #58
- Update gha to be able to automatically push images with release tags by @magdyksaleh in #59
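On paged attention (#54): the core idea is to store the KV cache in fixed-size blocks addressed through per-sequence block tables, so a sequence's cache need not be contiguous. A toy sketch of that layout; block size, shapes, and names are illustrative, not LoRAX's actual cache:

```python
import torch

# Toy paged KV cache: fixed-size physical blocks plus a per-sequence
# block table mapping logical token positions to physical blocks.
BLOCK_SIZE = 4
num_blocks, num_heads, head_dim = 16, 2, 8
k_cache = torch.zeros(num_blocks, BLOCK_SIZE, num_heads, head_dim)

block_table = [3, 7]  # this sequence owns physical blocks 3 and 7

def write_key(pos: int, k: torch.Tensor):
    """Write the key for logical position `pos` into its physical slot."""
    block = block_table[pos // BLOCK_SIZE]
    k_cache[block, pos % BLOCK_SIZE] = k

for pos in range(6):  # fill six token positions across the two blocks
    write_key(pos, torch.randn(num_heads, head_dim))
```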
New Contributors
Full Changelog: v0.1.0...v0.1.1