Releases: predibase/lorax
lorax-0.4.0
LoRAX is the open-source framework for serving hundreds of fine-tuned LLMs in production for the price of one.
v0.12.1
🎉 Enhancements
- Add support for adapter loading in mllama by @ajtejankar in #669
- Record number of skipped tokens in the response by @tgaddair in #681
- Record TTFT and TPOT in response headers by @tgaddair in #684
- Add cli arg --speculation-max-batch-size by @tgaddair in #686
- Use --predibase-api-token parameter when downloading by @joseph-predibase in #687
- Launcher args for compile max batch size and rank by @tgaddair in #690
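For readers unfamiliar with the two latency metrics named in this list, TTFT (time to first token) and TPOT (time per output token) are commonly defined as in the sketch below. It illustrates the usual definitions only and is not LoRAX's code. The function name `latency_metrics` and its arguments are hypothetical, and the actual response header names are not stated in these notes.

```python
def latency_metrics(request_start, token_times):
    """Return (ttft, tpot) in seconds.

    request_start: timestamp at which the request was sent (hypothetical input).
    token_times:   arrival timestamps of each generated token, in order.
    """
    if not token_times:
        raise ValueError("at least one generated token is required")

    # TTFT: delay between sending the request and receiving the first token.
    ttft = token_times[0] - request_start

    # TPOT: average gap between consecutive tokens after the first one.
    n_gaps = len(token_times) - 1
    tpot = (token_times[-1] - token_times[0]) / n_gaps if n_gaps > 0 else 0.0
    return ttft, tpot


ttft, tpot = latency_metrics(0.0, [0.5, 0.75, 1.0])
```

TTFT mostly reflects queueing and prefill cost, while TPOT reflects decode speed, which is why the two are reported separately.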
🐛 Bugfixes
- Fix stella embeddings + Integration tests for lorax by @magdyksaleh in #668
- Fix lora loading and indexing bug in mllama by @ajtejankar in #682
- Set maximum grpc message receive size to 2GiB by @tgaddair in #667
- Fix frequency_penalty and presence_penalty by @tgaddair in #672
- Fix scores (remove debug code) by @tgaddair in #673
- Fix top_p to allow setting it to 1.0 by @magdyksaleh in #676
- Format fixes tool calling by @magdyksaleh in #680
- Use predibase API token when downloading pbase files by @joseph-predibase in #688
- Pbase adapter source resolution by @magdyksaleh in #689
- fix: Make logprob field optional for response Pydantic validation by @jeffreyftang in #692
🔧 Maintenance
- Only use sha tag for running int tests by @magdyksaleh in #674
- Fix int tests 2 by @magdyksaleh in #675
- Always build and push image before running IT by @arnavgarg1 in #678
- Only push main if int tests pass by @magdyksaleh in #677
- Remove bad check by @magdyksaleh in #683
Full Changelog: v0.12.0...v0.12.1
v0.12.0: Multi-LoRA prefix caching, fp8 kv cache, Mllama, function calling
🎉 Enhancements
- Prompt prefix caching for multi-LoRA by @tgaddair in #655
- Convert to Triton Punica kernels by @tgaddair in #658
- Support FP8 KV Cache by @ajtejankar in #652
- Added Mllama by @tgaddair in #619
- Flash mllama by @tgaddair in #622
- support MRL embeddings for qwen2 by @magdyksaleh in #621
- Support for Embeddings with XLM-RoBERTa and Adapters by @jfhetzer in #656
- Merge weights by @magdyksaleh in #600
- feat: Function calling with output schema enforcement by @jeffreyftang in #536
- Chunked prefill by @tgaddair in #653
- add num inputs to metrics by @magdyksaleh in #615
- Add --predibase-api-token CLI arg by @joseph-predibase in #617
- Add --disable-sgmv flag by @joseph-predibase in #639
- Enhance Structured Output Interface by @GirinMan in #644
🐛 Bugfixes
- Add done message to openai endpoints by @magdyksaleh in #618
- Fix CUDA graph compilation by @tgaddair in #627
- Fix CUDA graphs for Medusa by @tgaddair in #628
- Fix retrace message by @tgaddair in #629
- Fix prefix plumbing and BGMV compiler dimensions by @tgaddair in #631
- Fix punica kernel compilation by @tgaddair in #632
- Fix FlashInfer when not using prefix caching by @tgaddair in #633
- Fix cuda graph tracing without lora ranks by @tgaddair in #634
- Added ranks 96 and 128 to BGMV kernel by @tgaddair in #630
- Look for language model lm head by @Infernaught in #640
- Return n choices for chat completions API by @tgaddair in #638
- Fix llava_next for llama 3.2 vision cross attention states by @tgaddair in #641
- Fix compile for qwen-2.5-32b by @tgaddair in #645
- Added backwards compatible field to OpenAI json_object API by @tgaddair in #648
- Fix PREDIBASE_API_TOKEN env var being thrown away by @joseph-predibase in #654
- Fix absent fp8_kv property on llama and qwen models by @ajtejankar in #662
- Fix seqlen bug for sliding window models like Mistral v0.1 by @ajtejankar in #660
- Fix sliding window + compile bug by @ajtejankar in #666
📝 Docs
🔧 Maintenance
- upgrade poetry by @magdyksaleh in #613
- Fix deps4 by @magdyksaleh in #614
- Remove LD_PRELOAD from Docker and improve error message by @tgaddair in #623
- add label to id this as a lorax image by @noyoshi in #626
- pass correct stuff to predibase-reporter by @magdyksaleh in #635
- try using arc runner for build by @noyoshi in #646
- change runner 2 by @magdyksaleh in #650
New Contributors
- @joseph-predibase made their first contribution in #617
- @jfhetzer made their first contribution in #656
Full Changelog: v0.11.0...v0.12.0
v0.11.0: Prefix caching, VLMs, BERT (embed, NER), FP8
🎉 Enhancements
- Add prefix caching by @tgaddair in #581
- Add Llava Next (VLM) by @tgaddair in #586
- Embedder Service v0 with FlashBert by @magdyksaleh in #385
- Added eager prefill option by @tgaddair in #524
- BERT NER support by @magdyksaleh in #531
- Preload adapters during init by @tgaddair in #543
- Add support for batching to embedder models by @tgaddair in #503
- Bert to gpu by @magdyksaleh in #507
- Add distilbert by @magdyksaleh in #508
- feat: return usage in ChatCompletionStreamResponse by @GirinMan in #506
- Added Gemma2 by @tgaddair in #530
- Move kv cache allocation to router to ensure correct block allocation by @tgaddair in #545
- Tokenize inputs in router by @tgaddair in #548
- Add support for Llama 3 rotary embeddings by @tgaddair in #551
- Apply chat template in router to properly validate input length by @tgaddair in #538
- Allow eager_prefill to be set in Helm chart by @bdalal in #557
- Support FP8 for Mistral by @ajtejankar in #559
- Support FP8 for LLaMa by @ajtejankar in #562
- Support classify batch by @magdyksaleh in #577
- Adding longrope for serve Phi-3 by @huytuong010101 in #576
- Add new agnostic health endpoint by @magdyksaleh in #588
- Support FlashInfer for BERT by @tgaddair in #597
- Speed up NER inference by @magdyksaleh in #598
- Disable healthcheck tracing and add metrics to classify + classify_batch endpoints by @magdyksaleh in #603
- Added launcher args for preloaded_adapter_source and backend by @tgaddair in #604
- Parallelize tokenization for /classify_batch and remove block allocator for non-causal LMs by @tgaddair in #609
- support bge-base-en-v1.5 by @magdyksaleh in #593
🐛 Bugfixes
- Fix for the LM_HEAD issue by @ajtejankar in #475
- fix: load tokenizer/config with trust_remote_code by @thincal in #476
- Fix issue with Medusa batch load signature by @tgaddair in #492
- add missed dtypes for 8bit kv cache by @flozi00 in #490
- Fix quant cache OOM by @flozi00 in #494
- Add retries on common session errors for the client by @gyanesh-mishra in #495
- Revert AWQ to stable commit by @tgaddair in #498
- Fixed phi-3 with Su Rotary Embedding by @tgaddair in #499
- Fixed case where loaded lora adapter has no segments by @tgaddair in #510
- fix batching bug by @magdyksaleh in #513
- Fix issue with GQA initialization for Qwen2 by @arnavgarg1 in #514
- Disable fp8 kv cache for lovelace by @tgaddair in #520
- Bug fix for illegal memory access error caused when running medusa lora and plain loras in parallel. by @ajtejankar in #525
- bug : fix the type checking errors thrown by new ruff version by @ajtejankar in #533
- bug : fix Qwen-2 sliding_window config bug by @ajtejankar in #532
- Infer dtype from model config when not explicitly specified by @arnavgarg1 in #534
- Fix gemma2 by @Infernaught in #539
- Fix : compile bug causing models to error with 'lora' key not found by @ajtejankar in #547
- Fix: short circuit download, load, offload for preloaded adapters by @tgaddair in #552
- Fix the attention bug caused by upgrading vLLM by @ajtejankar in #555
- Fix LM head interaction with Medusa by @tgaddair in #567
- Fix adapter mask when using speculative decoding + LM head LoRA by @tgaddair in #570
- Fix outlines compatibility with speculative decoding by @tgaddair in #578
- Fix qwen lora by @magdyksaleh in #585
- Fix classify and classify_batch for Python client by @tgaddair in #608
- Fix ner entity merging by @magdyksaleh in #596
- Fix class ner by @magdyksaleh in #602
- Fix dependencies to address high urgency dependabot alerts by @magdyksaleh in #612
📝 Docs
- docs: update development_env.md by @eltociear in #515
- Doc updates for Medusa training by @arnavgarg1 in #544
- Add "pbase" to adapter_source docstrings by @alexsherstinsky in #583
- Add prerequisites to readme by @csabakecskemeti in #584
🔧 Maintenance
- chore: update infer.rs by @eltociear in #487
- start porting latest tgi by @flozi00 in #480
- Bump client to v0.6.1 by @tgaddair in #496
- Update Makefile-awq by @flozi00 in #493
- hqq upgrades by @flozi00 in #491
- try out an integration test workflow by @noyoshi in #516
- no warm up by @magdyksaleh in #540
- Update PyTorch, CUDA, vLLM, and Bitsandbytes by @ajtejankar in #553
- Added missing nvidia-ml-py package by @tgaddair in #558
- parse headers for errored requests by @noyoshi in #564
- handle folders for predibase by @noyoshi in #565
- enable mistral nemo by @magdyksaleh in #568
- bump version by @noyoshi in #569
- Install flashinfer in Docker by @tgaddair in #582
- feat : use --no-cache-dir flag to pip in dockerfiles to save space by @Rajpratik71 in #587
- Add missing configs by @magdyksaleh in #590
- Address rust compiler warnings by @magdyksaleh in #589
New Contributors
- @eltociear made their first contribution in #487
- @ajtejankar made their first contribution in #475
- @bdalal made their first contribution in #557
- @Rajpratik71 made their first contribution in #587
- @csabakecskemeti made their first contribution in #584
Full Changelog: v0.10.0...v0.11.0
v0.10.0: Speculative decoding adapters and SGMV + BGMV
🎉 Enhancements
- Added support for Medusa speculative decoding adapters by @tgaddair in #372
- Added Medusa adapters per request by @tgaddair in #454
- Support jointly trained Medusa + LoRA adapters by @tgaddair in #482
- Adds prompt lookup decoding (ngram speculation) by @tgaddair in #375
- Use SGMV for prefill BGMV for decode by @tgaddair in #464
- Added phi3 by @tgaddair in #445
- Added support for C4AI Command-R (cohere) by @tgaddair in #411
- Add DBRX by @tgaddair in #423
- Refactor adapter interface to support adapters other than LoRA (e.g., speculative decoding) by @tgaddair in #359
- Initializing server with an adapter sets it as the default by @tgaddair in #370
- Implement Seed Parameter Support for OpenAI-Compatible API Endpoints by @GirinMan in #374
- lorax launcher now has --default-adapter-source by @noyoshi in #419
- enh: Make client's handling of error responses more robust and user-friendly by @jeffreyftang in #418
- Support both medusa v1 and v2 by @tgaddair in #421
- use default HF HUB token when checking for base model info by @noyoshi in #428
- Added adapter_source and api_token to completions API by @tgaddair in #446
- Increase max stop sequences by @tgaddair in #453
- Support LORAX_USE_GLOBAL_HF_TOKEN by @tgaddair in #462
- Allow setting temperature=0 by @tgaddair in #467
- Merge medusa segments by @tgaddair in #471
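The prompt lookup decoding (ngram speculation) entry above refers to a drafting technique that needs no extra model: the last few tokens of the sequence are matched against earlier context, and the tokens that followed the match are proposed as a draft for the base model to verify. The following is a minimal sketch of that idea, not LoRAX's implementation. The function name `prompt_lookup_draft` and the parameters `ngram_size` and `num_draft` are hypothetical.

```python
def prompt_lookup_draft(tokens, ngram_size=3, num_draft=5):
    """Propose draft tokens by matching the trailing n-gram against earlier context.

    tokens: list of token ids (prompt plus anything generated so far).
    Returns up to num_draft token ids, or [] when no earlier match exists.
    """
    if len(tokens) <= ngram_size:
        return []

    tail = tokens[-ngram_size:]

    # Scan right to left so the most recent earlier occurrence wins.
    # The starting index excludes the trailing n-gram itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            end = start + ngram_size
            # Tokens that followed the earlier occurrence become the draft.
            return tokens[end:end + num_draft]
    return []


draft = prompt_lookup_draft([1, 2, 3, 4, 5, 6, 1, 2, 3], ngram_size=3, num_draft=2)
```

The draft is only a guess. In speculative decoding the base model checks the proposed tokens and keeps the longest prefix it agrees with, so output quality does not depend on the draft being right.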
🐛 Bugfixes
- Fix CUDA compile when using long sequence lengths by @tgaddair in #363
- Fix CUDA graph compile with speculative decoding by @tgaddair in #381
- Fix mixtral for speculative decoding by @tgaddair in #382
- Fix import of EntryNotFoundError by @tgaddair in #401
- Fix warmup when using speculative decoding by @tgaddair in #402
- fix: assign bias directly by @thincal in #398
- fix: Enable ignoring botocore ClientError during download_file by @jeffreyftang in #404
- Fix Pydantic v2 adapter_id and merged_adapters validation by @claudioMontanari in #408
- fix: Suppress pydantic warning over model_id field in DeployedModel by @jeffreyftang in #409
- Fix phi by @noyoshi in #410
- fix: Missing / in pbase endpoint by @jeffreyftang in #415
- Print correct number of key value heads on dimension assertion. by @dstripelis in #414
- Fix request variable by @Infernaught in #416
- fix: Rename _get_slice to get_slice by @tgaddair in #424
- fix: Hack for llama3 eos_token_id by @tgaddair in #427
- fix: checking the base_model_name_or_path of adapter_config and early return if null by @thincal in #431
- fix: use logits to calculate alternative tokens by @JTS22 in #425
- Fixed default pbase endpoint url by @tgaddair in #435
- fix: Downloading private adapters from HF by @tgaddair in #443
- Fix Outlines compatibility with speculative decoding by @tgaddair in #447
- fix: Handle edge case where allowed tokens are out of bounds by @tgaddair in #449
- Fix special tokens showing up in the response by @tgaddair in #450
- Fix Medusa + LoRA by @tgaddair in #455
- Ensure Llama 3 stops on all EOS tokens by @arnavgarg1 in #456
- Reuse session per class instance by @gyanesh-mishra in #468
📝 Docs
- Fix chat completion and docs by @GirinMan in #358
- Added batch processing example by @tgaddair in #386
- Medusa docs by @tgaddair in #459
- Updated supported base models in docs by @arnavgarg1 in #458
- Docs for private HF models by @tgaddair in #460
- Auth header docs by @tgaddair in #461
🔧 Maintenance
- Add CNAME file for Docs by @martindavis in #364
- Update tagging logic and add flake8 linter by @magdyksaleh in #365
- Apply black formatting by @tgaddair in #376
- Switch formatting and linting to ruff by @tgaddair in #378
- Style: change line length to 120 and enforce import sort order by @tgaddair in #383
- Bump pydantic version to >2, <3 by @claudioMontanari in #405
- refactor: set config into weights for quantization feature support more easily by @thincal in #400
- Update Predibase integration to support v2 API by @jeffreyftang in #403
- logging by @magdyksaleh in #436
- revert by @magdyksaleh in #437
- Upgrade to CUDA 12.1 and PyTorch 2.3.0 by @tgaddair in #472
- int: Bump Lorax Client to 3.9 by @gyanesh-mishra in #486
- Bump lorax client v0.6.0 by @tgaddair in #488
New Contributors
- @GirinMan made their first contribution in #358
- @martindavis made their first contribution in #364
- @thincal made their first contribution in #398
- @claudioMontanari made their first contribution in #405
- @dstripelis made their first contribution in #414
Full Changelog: v0.9.0...v0.10.0
v0.9.0
🎉 Enhancements
- Allow assigning dedicated memory reservation for adapters on GPU by @tgaddair in #303
- Enforce adapters cannot be loaded past --adapter-memory-fraction by @tgaddair in #306
- Added Qwen2 by @tgaddair in #327
- Make max_new_tokens optional, default to max_total_tokens - input_length by @tgaddair in #353
- Expose ignore_eos_token option in generate requests by @jeffreyftang in #340
- Generate to max_total_tokens during warmup by @tgaddair in #286
- Add support for returning alternative tokens by @JTS22 in #297
- feat: add repetition_penalty and top_k to openai by @huytuong010101 in #288
- Add support for LoRA adapters trained with Rank-Stabilized scaling by @arnavgarg1 in #299
- Provide more granular methods to configure the embedded S3 client. by @mitchklusty in #325
- Allow specifying base model as model param in OpenAI API by @tgaddair in #331
- Add ignore_eos_token param to completions and chat completions endpoints by @jeffreyftang in #344
- Log whether SGMV kernel is enabled by @tgaddair in #342
- Log generated tokens out to file when streaming by @magdyksaleh in #309
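The max_new_tokens default described in this list (max_total_tokens - input_length) can be illustrated with a few lines of arithmetic. This is a sketch of the stated rule only, not the router's code. The helper name `resolve_max_new_tokens` is hypothetical, and raising an error when the prompt fills the whole budget is an assumption.

```python
def resolve_max_new_tokens(max_new_tokens, max_total_tokens, input_length):
    """Return the generation budget for a request.

    If the caller sets max_new_tokens, use it as given. Otherwise fall back
    to whatever room is left after the prompt: max_total_tokens - input_length.
    """
    if max_new_tokens is not None:
        return max_new_tokens

    remaining = max_total_tokens - input_length
    if remaining <= 0:
        # Assumed behavior for this sketch: a prompt that fills the budget is an error.
        raise ValueError("input_length leaves no room for new tokens")
    return remaining


budget = resolve_max_new_tokens(None, max_total_tokens=4096, input_length=1000)
```

With a 4096-token total and a 1000-token prompt, the default budget is 3096 new tokens. An explicit value such as 64 is returned unchanged.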
🐛 Bugfixes
- Fix tensor parallelism with SGMV to use true rank of the LoRA after splitting by @tgaddair in #324
- Fix hanging caused by tqdm stderr not being printed by @tgaddair in #352
- Fix dynamic RoPE by @tgaddair in #350
- Only update cache during warmup by @tgaddair in #351
- Prevent model loading errors from appearing as flash attention import errors by @tgaddair in #328
- Make architecture compatibility check non-fatal if base model config cannot be loaded by @tgaddair in #317
- Fix Qwen2 LoRA loading by @tgaddair in #345
- Remove vec wrapping from OpenAI-compatible response by @jeffreyftang in #273
- Disallow early stopping during warmup by @tgaddair in #290
- Skip returning EOS token on finish_reason 'stop' by @jeffreyftang in #289
- Fixed static adapter loading with same arch by @tgaddair in #300
- Ensure model_id is a string when using a model from s3 by @fadebek in #291
- Fix name for adapter id by @noyoshi in #284
- Update AsyncClient with ignore_eos_token parameter by @jeffreyftang in #341
📝 Docs
- Update docs now that we no longer return a list from OpenAI-compatible endpoints by @jeffreyftang in #281
- Change guided generation to structured generation by @jeffreyftang in #302
- Clarify getting started documentation regarding port number used in pre-built Docker image. by @alexsherstinsky in #313
- Added system requirements to README by @tgaddair in #293
- Update README.md by @tgaddair in #294
🔧 Maintenance
- Split out server and router unit tests by @tgaddair in #275
- Add in response headers to streaming endpoint by @noyoshi in #282
- Propagate bearer token from header if one exists for OpenAI-compatible endpoints by @jeffreyftang in #278
- Update tokenizers to v0.15 to be consistent with server by @tgaddair in #285
- Autogen python client docs by @tgaddair in #295
- Reporting on total tokens by @noyoshi in #349
New Contributors
- @huytuong010101 made their first contribution in #288
- @fadebek made their first contribution in #291
- @JTS22 made their first contribution in #297
- @alexsherstinsky made their first contribution in #313
- @mitchklusty made their first contribution in #325
Full Changelog: v0.8.1...v0.9.0
v0.8.1: Gemma support
🎉 Enhancements
- Added Gemma by @tgaddair in #267
- Pass details param into client by @magdyksaleh in #265
🔧 Maintenance
- bump version by @magdyksaleh in #268
- Bump by @magdyksaleh in #270
Full Changelog: v0.8.0...v0.8.1
v0.8: Structured Output via Outlines
🎉 Enhancements
- Added Outlines logits processor for JSON schema validation by @tgaddair in #224
- Enable JSON guided generation via OpenAI-compatible API by @jeffreyftang in #243
- JSON schema for guided generation now optionally respects field order by @jeffreyftang in #264
- Set default adapter source by @magdyksaleh in #223
- Pad LoRA ranks to ensure compatibility with SGMV kernel by @tgaddair in #256
- Add model and adapter response headers by @magdyksaleh in #220
- Add Cors params by @magdyksaleh in #221
- Add expose headers by @magdyksaleh in #230
🐛 Bugfixes
- Properly split out model_id when retrieving adapter weights downloaded from S3 by @jeffreyftang in #246
- Fixed TIES merging to calculate sign before applying weights by @tgaddair in #239
- Update s3.py by @llama-shepard in #234
- Fix concatenate for flash batch by @tgaddair in #254
- Fixed batch merging and filtering to handle Outlines state by @tgaddair in #263
📝 Docs
- Add guide for guided generation by @jeffreyftang in #240
- Added contributing guide by @tgaddair in #226
- Update README to include model merging by @tgaddair in #225
- Updated structured output by @tgaddair in #258
- Minor corrections to development env setup instructions by @jeffreyftang in #228
🔧 Maintenance
- Upgrade docker to use rust 1.75 and ubuntu 22.04 by @tgaddair in #250
- Upgrading rust for dependency changes by @DhruvaBansal00 in #248
- fix paths on runner by @noyoshi in #242
New Contributors
- @jeffreyftang made their first contribution in #228
- @DhruvaBansal00 made their first contribution in #248
Full Changelog: v0.7.0...v0.8.0
v0.7: LoRA Merging (linear, TIES, DARE) per request
🎉 Enhancements
- Merge multiple LoRA adapters per request (linear, TIES, DARE) by @tgaddair in #212
- Eetq by @flozi00 in #195
- hqq JIT Quantization by @flozi00 in #147
- Added Bloom dynamic adapter loading by @tgaddair in #187
- Added pbase adapter_source and expose api_token in client by @tgaddair in #181
- Cloudflare R2 Source by @llama-shepard in #198
🐛 Bugfixes
- Fixed Phi for new HF format by @tgaddair in #192
- Fixed OpenAI stream response data by @tgaddair in #193
- fix: OpenAI response format by @tgaddair in #184
- Fix RoPE and YARN scaling by @tgaddair in #202
- check for base model earlier in the adapter function by @noyoshi in #196
📝 Docs
🔧 Maintenance
- Upgrade to pytorch==2.2.0 by @tgaddair in #217
- upgrade exllama kernel by @flozi00 in #209
- Add a model cache to avoid running out of storage by @magdyksaleh in #201
New Contributors
- @llama-shepard made their first contribution in #198
Full Changelog: v0.6.0...v0.7.0
v0.6: OpenAI compatible API
🎉 Enhancements
- OpenAI v1 Completions API by @tgaddair in #170
- OpenAI v1 Chat Completions API by @tgaddair in #171
- Added prompt_tokens to the response by @tgaddair in #165
🐛 Bugfixes
📝 Docs
🔧 Maintenance
- fix: Only install stanford-stk on linux by @tgaddair in #169
- added separate installation for torch by @asingh9530 in #173
New Contributors
- @asingh9530 made their first contribution in #173
Full Changelog: v0.5.0...v0.6.0