# v0.9.0
## 🎉 Enhancements
- Allow assigning dedicated memory reservation for adapters on GPU by @tgaddair in #303
- Enforce that adapters cannot be loaded past `--adapter-memory-fraction` by @tgaddair in #306
- Added Qwen2 by @tgaddair in #327
- Make `max_new_tokens` optional, defaulting to `max_total_tokens - input_length`, by @tgaddair in #353
- Expose `ignore_eos_token` option in generate requests by @jeffreyftang in #340 (see the request sketch after this list)
- Generate to `max_total_tokens` during warmup by @tgaddair in #286
- Add support for returning alternative tokens by @JTS22 in #297
- Add `repetition_penalty` and `top_k` to the OpenAI-compatible API by @huytuong010101 in #288
- Add support for LoRA adapters trained with Rank-Stabilized scaling by @arnavgarg1 in #299
- Provide more granular methods to configure the embedded S3 client by @mitchklusty in #325
- Allow specifying the base model as the `model` param in the OpenAI API by @tgaddair in #331 (see the OpenAI sketch after this list)
- Add `ignore_eos_token` param to completions and chat completions endpoints by @jeffreyftang in #344
- Log whether SGMV kernel is enabled by @tgaddair in #342
- Log generated tokens out to file when streaming by @magdyksaleh in #309
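
Taken together, #353 and #340 change how generation length is controlled: `max_new_tokens` can now be omitted and defaults to `max_total_tokens - input_length`, while `ignore_eos_token` lets a request keep generating past the EOS token. Below is a minimal sketch against the `/generate` REST endpoint, assuming a LoRAX server on `localhost:8080`; the URL, prompt, and adapter id are placeholders.

```python
import requests

# Placeholder deployment URL; adjust to your LoRAX server.
LORAX_URL = "http://localhost:8080/generate"

resp = requests.post(
    LORAX_URL,
    json={
        "inputs": "Write a haiku about GPUs.",
        "parameters": {
            # max_new_tokens is omitted on purpose: the server now
            # defaults it to max_total_tokens - input_length (#353).
            "ignore_eos_token": True,  # keep generating past EOS (#340)
            "adapter_id": "my-org/my-lora-adapter",  # placeholder adapter
        },
    },
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```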
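
The OpenAI-compatible API grew matching knobs in #331, #288, and #344: `model` may now name the base model itself rather than only an adapter, and `repetition_penalty`, `top_k`, and `ignore_eos_token` can be sent as extra parameters. A sketch using the `openai` Python package pointed at a local LoRAX deployment; the base URL and model name are placeholders.

```python
from openai import OpenAI

# Point the standard OpenAI client at LoRAX's OpenAI-compatible API.
client = OpenAI(
    api_key="EMPTY",  # placeholder; see the bearer-token item under Maintenance
    base_url="http://localhost:8080/v1",
)

completion = client.chat.completions.create(
    # The base model itself can now be passed as `model` (#331),
    # not only an adapter id.
    model="mistralai/Mistral-7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    # LoRAX-specific parameters travel via extra_body (#288, #344).
    extra_body={
        "repetition_penalty": 1.1,
        "top_k": 40,
        "ignore_eos_token": False,
    },
)
print(completion.choices[0].message.content)
```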
## 🐛 Bugfixes
- Fix tensor parallelism with SGMV to use true rank of the LoRA after splitting by @tgaddair in #324
- Fix hanging caused by tqdm stderr not being printed by @tgaddair in #352
- Fix dynamic RoPE by @tgaddair in #350
- Only update cache during warmup by @tgaddair in #351
- Prevent model loading errors from appearing as flash attention import errors by @tgaddair in #328
- Make architecture compatibility check non-fatal if base model config cannot be loaded by @tgaddair in #317
- Fix Qwen2 LoRA loading by @tgaddair in #345
- Remove vec wrapping from OpenAI-compatible response by @jeffreyftang in #273
- Disallow early stopping during warmup by @tgaddair in #290
- Skip returning EOS token on `finish_reason` 'stop' by @jeffreyftang in #289
- Fix static adapter loading with the same architecture by @tgaddair in #300
- Ensure `model_id` is a string when using a model from S3 by @fadebek in #291
- Fix name for adapter id by @noyoshi in #284
- Update `AsyncClient` with the `ignore_eos_token` parameter by @jeffreyftang in #341 (see the sketch after this list)
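
For #341, here is a short sketch of the updated async client, assuming the `lorax-client` Python package and a local server; the URL and prompt are placeholders.

```python
import asyncio

from lorax import AsyncClient


async def main():
    # Placeholder deployment URL.
    client = AsyncClient("http://localhost:8080")
    # The ignore_eos_token parameter is now forwarded by AsyncClient too (#341).
    response = await client.generate(
        "Write a haiku about GPUs.",
        max_new_tokens=64,
        ignore_eos_token=True,
    )
    print(response.generated_text)


asyncio.run(main())
```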
## 📝 Docs
- Update docs now that we no longer return a list from OpenAI-compatible endpoints by @jeffreyftang in #281
- Change guided generation to structured generation by @jeffreyftang in #302
- Clarify getting started documentation regarding the port number used in the pre-built Docker image by @alexsherstinsky in #313
- Added system requirements to README by @tgaddair in #293
- Update README.md by @tgaddair in #294
## 🔧 Maintenance
- Split out server and router unit tests by @tgaddair in #275
- Add response headers to the streaming endpoint by @noyoshi in #282
- Propagate bearer token from the request header if one exists for OpenAI-compatible endpoints by @jeffreyftang in #278 (see the sketch after this list)
- Update tokenizers to v0.15 to be consistent with server by @tgaddair in #285
- Auto-generate Python client docs by @tgaddair in #295
- Report on total tokens by @noyoshi in #349
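
For #278, the server reads the bearer token from the `Authorization` header of incoming OpenAI-compatible requests. A raw-HTTP sketch that exercises this, assuming a local deployment; the URL, token, and model name are placeholders.

```python
import requests

# Placeholder token; LoRAX propagates the bearer token it finds in the
# Authorization header of OpenAI-compatible requests (#278).
headers = {"Authorization": "Bearer my-secret-token"}

resp = requests.post(
    "http://localhost:8080/v1/completions",
    headers=headers,
    json={
        "model": "mistralai/Mistral-7B-Instruct-v0.1",  # placeholder
        "prompt": "Write a haiku about GPUs.",
        "max_tokens": 64,
    },
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```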
## New Contributors
- @huytuong010101 made their first contribution in #288
- @fadebek made their first contribution in #291
- @JTS22 made their first contribution in #297
- @alexsherstinsky made their first contribution in #313
- @mitchklusty made their first contribution in #325
**Full Changelog**: v0.8.1...v0.9.0