# v0.9.0
## 🎉 Enhancements
- Allow assigning dedicated memory reservation for adapters on GPU by @tgaddair in #303
- Enforce that adapters cannot be loaded past `--adapter-memory-fraction` by @tgaddair in #306
- Added Qwen2 by @tgaddair in #327
- Make `max_new_tokens` optional, defaulting to `max_total_tokens - input_length`, by @tgaddair in #353
- Expose `ignore_eos_token` option in generate requests by @jeffreyftang in #340 (see the request sketch after this list)
- Generate to `max_total_tokens` during warmup by @tgaddair in #286
- Add support for returning alternative tokens by @JTS22 in #297
- Add `repetition_penalty` and `top_k` to the OpenAI-compatible API by @huytuong010101 in #288
- Add support for LoRA adapters trained with Rank-Stabilized scaling by @arnavgarg1 in #299
- Provide more granular methods to configure the embedded S3 client by @mitchklusty in #325
- Allow specifying the base model as the `model` param in the OpenAI API by @tgaddair in #331 (see the OpenAI sketch after this list)
- Add `ignore_eos_token` param to completions and chat completions endpoints by @jeffreyftang in #344
- Log whether SGMV kernel is enabled by @tgaddair in #342
- Log generated tokens out to file when streaming by @magdyksaleh in #309
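
Taken together, #353 and #340 change how generation length is controlled: `max_new_tokens` can now be omitted and defaults to `max_total_tokens - input_length`, while `ignore_eos_token` lets a request keep generating past the EOS token. Below is a minimal sketch against the `/generate` REST endpoint, assuming a LoRAX server on `localhost:8080`; the URL, prompt, and adapter id are placeholders.

```python
import requests

# Placeholder deployment URL; adjust to your LoRAX server.
LORAX_URL = "http://localhost:8080/generate"

resp = requests.post(
    LORAX_URL,
    json={
        "inputs": "Write a haiku about GPUs.",
        "parameters": {
            # max_new_tokens is omitted on purpose: the server now
            # defaults it to max_total_tokens - input_length (#353).
            "ignore_eos_token": True,  # keep generating past EOS (#340)
            "adapter_id": "my-org/my-lora-adapter",  # placeholder adapter
        },
    },
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```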
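
The OpenAI-compatible API grew matching knobs in #331, #288, and #344: `model` may now name the base model itself rather than only an adapter, and `repetition_penalty`, `top_k`, and `ignore_eos_token` can be sent as extra parameters. A sketch using the `openai` Python package pointed at a local LoRAX deployment; the base URL and model name are placeholders.

```python
from openai import OpenAI

# Point the standard OpenAI client at LoRAX's OpenAI-compatible API.
client = OpenAI(
    api_key="EMPTY",  # placeholder; see the bearer-token item under Maintenance
    base_url="http://localhost:8080/v1",
)

completion = client.chat.completions.create(
    # The base model itself can now be passed as `model` (#331),
    # not only an adapter id.
    model="mistralai/Mistral-7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    # LoRAX-specific parameters travel via extra_body (#288, #344).
    extra_body={
        "repetition_penalty": 1.1,
        "top_k": 40,
        "ignore_eos_token": False,
    },
)
print(completion.choices[0].message.content)
```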
## 🐛 Bugfixes
- Fix tensor parallelism with SGMV to use true rank of the LoRA after splitting by @tgaddair in #324
- Fix hanging caused by tqdm stderr not being printed by @tgaddair in #352
- Fix dynamic RoPE by @tgaddair in #350
- Only update cache during warmup by @tgaddair in #351
- Prevent model loading errors from appearing as flash attention import errors by @tgaddair in #328
- Make architecture compatibility check non-fatal if base model config cannot be loaded by @tgaddair in #317
- Fix Qwen2 LoRA loading by @tgaddair in #345
- Remove vec wrapping from OpenAI-compatible response by @jeffreyftang in #273
- Disallow early stopping during warmup by @tgaddair in #290
- Skip returning EOS token on `finish_reason` 'stop' by @jeffreyftang in #289
- Fix static adapter loading with the same architecture by @tgaddair in #300
- Ensure `model_id` is a string when using a model from S3 by @fadebek in #291
- Fix name for adapter id by @noyoshi in #284
- Update `AsyncClient` with the `ignore_eos_token` parameter by @jeffreyftang in #341 (see the sketch after this list)
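
For #341, here is a short sketch of the updated async client, assuming the `lorax-client` Python package and a local server; the URL and prompt are placeholders.

```python
import asyncio

from lorax import AsyncClient


async def main():
    # Placeholder deployment URL.
    client = AsyncClient("http://localhost:8080")
    # The ignore_eos_token parameter is now forwarded by AsyncClient too (#341).
    response = await client.generate(
        "Write a haiku about GPUs.",
        max_new_tokens=64,
        ignore_eos_token=True,
    )
    print(response.generated_text)


asyncio.run(main())
```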
## 📝 Docs
- Update docs now that we no longer return a list from OpenAI-compatible endpoints by @jeffreyftang in #281
- Change guided generation to structured generation by @jeffreyftang in #302
- Clarify getting started documentation regarding the port number used in the pre-built Docker image by @alexsherstinsky in #313
- Added system requirements to README by @tgaddair in #293
- Update README.md by @tgaddair in #294
## 🔧 Maintenance
- Split out server and router unit tests by @tgaddair in #275
- Add response headers to the streaming endpoint by @noyoshi in #282
- Propagate bearer token from the request header if one exists for OpenAI-compatible endpoints by @jeffreyftang in #278 (see the sketch after this list)
- Update tokenizers to v0.15 to be consistent with server by @tgaddair in #285
- Auto-generate Python client docs by @tgaddair in #295
- Report on total tokens by @noyoshi in #349
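
For #278, the server reads the bearer token from the `Authorization` header of incoming OpenAI-compatible requests. A raw-HTTP sketch that exercises this, assuming a local deployment; the URL, token, and model name are placeholders.

```python
import requests

# Placeholder token; LoRAX propagates the bearer token it finds in the
# Authorization header of OpenAI-compatible requests (#278).
headers = {"Authorization": "Bearer my-secret-token"}

resp = requests.post(
    "http://localhost:8080/v1/completions",
    headers=headers,
    json={
        "model": "mistralai/Mistral-7B-Instruct-v0.1",  # placeholder
        "prompt": "Write a haiku about GPUs.",
        "max_tokens": 64,
    },
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```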
## New Contributors
- @huytuong010101 made their first contribution in #288
- @fadebek made their first contribution in #291
- @JTS22 made their first contribution in #297
- @alexsherstinsky made their first contribution in #313
- @mitchklusty made their first contribution in #325
**Full Changelog**: v0.8.1...v0.9.0