[Bug]: Cannot get past 50 RPS #6592
Comments
Hi @vutrung96, looking into this. How do you get the % complete log output?
Hi @ishaan-jaff, I was just using tqdm.
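(For illustration only: one way to get a percent-complete bar over a batch of async requests with tqdm, roughly as described above. The model name and request shape are placeholders, not taken from this thread.)

```python
# Sketch: percent-complete progress over a batch of concurrent litellm calls using tqdm.
import asyncio
import litellm
from tqdm.asyncio import tqdm_asyncio

async def one_request(i: int):
    # Placeholder request; assumes OPENAI_API_KEY is set in the environment.
    return await litellm.acompletion(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"request {i}"}],
    )

async def main(n: int = 100):
    tasks = [one_request(i) for i in range(n)]
    # tqdm_asyncio.gather works like asyncio.gather but renders a progress bar.
    return await tqdm_asyncio.gather(*tasks)

if __name__ == "__main__":
    asyncio.run(main())
```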
Hi @ishaan-jaff, any updates on this? Also facing this issue!
Hi @vutrung96 @CharlieJCJ, do you see the issue on litellm.router too? https://docs.litellm.ai/docs/routing It would help me if you could test with the litellm router too.
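(Again for illustration, not from the thread: a rough sketch of what the equivalent test through litellm.Router could look like; the model_list entry, model name, and request count are placeholder assumptions.)

```python
# Sketch: the same concurrent load test routed through litellm.Router.
import asyncio
from litellm import Router

# Placeholder deployment; add api_key/api_base to litellm_params as needed.
router = Router(
    model_list=[
        {
            "model_name": "gpt-4o-mini",
            "litellm_params": {"model": "gpt-4o-mini"},
        }
    ]
)

async def one_request(i: int):
    return await router.acompletion(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"request {i}"}],
    )

async def main(n: int = 500):
    return await asyncio.gather(*(one_request(i) for i in range(n)))

if __name__ == "__main__":
    asyncio.run(main())
```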
Hi @ishaan-jaff, Litellm uses the official OpenAI Python client.
The official OpenAI client has performance issues with high numbers of concurrent requests due to issues in httpx. The issues in httpx come from a number of factors related to anyio vs asyncio, which are addressed in the open PRs below.
We saw this when implementing litellm as the backend for our synthetic data engine. When using our own OpenAI client (with aiohttp instead of httpx) we saturate the highest rate limits (30,000 requests per minute on gpt-4o-mini tier 5). When using litellm, the performance issues cap us well under the highest rate limit, at around 200 queries per second (12,000 requests per minute).
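(For illustration of the approach being described, not curator's actual client: a direct aiohttp call to the OpenAI chat completions endpoint, with one shared session and a semaphore bounding concurrency. The URL, payload, and limits are assumptions.)

```python
# Sketch: bypassing httpx by posting to the OpenAI REST API directly with aiohttp.
import asyncio
import os
import aiohttp

URL = "https://api.openai.com/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

async def one_request(session: aiohttp.ClientSession, sem: asyncio.Semaphore, i: int):
    payload = {
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": f"request {i}"}],
    }
    async with sem:  # keep at most `concurrency` requests in flight
        async with session.post(URL, headers=HEADERS, json=payload) as resp:
            return await resp.json()

async def main(n: int = 1000, concurrency: int = 500):
    sem = asyncio.Semaphore(concurrency)
    connector = aiohttp.TCPConnector(limit=concurrency)
    async with aiohttp.ClientSession(connector=connector) as session:
        return await asyncio.gather(*(one_request(session, sem, i) for i in range(n)))

if __name__ == "__main__":
    asyncio.run(main())
```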
@RyanMarten you are right! Just ran a load test to confirm. The right is with
@RyanMarten started work on this
@RyanMarten, @vutrung96 and @CharlieJCJ, can y'all help us test this change as we start rolling it out? As of now we have just added support for non-streaming; I can let you know once streaming support is added too.
@ishaan-jaff Thanks for creating a PR for this! We can certainly help test the change 😄. I'll run a benchmarking test with it. Our use-case is non-streaming, so that shouldn't be a problem.
Here is our benchmarking using the curator request processor and viewer (with different backends). I see that this was released in https://github.com/BerriAI/litellm/releases/tag/v1.56.8, so I upgraded my litellm version to the latest:
(1) our own aiohttp backend
(2) default litellm backend
(3) litellm backend with
For some reason I'm not seeing an improvement in performance.
Hmm, that's odd. We see RPS going much higher in our testing. Do you see anything off with our implementation (I know you mentioned you also implemented aiohttp)? https://github.com/BerriAI/litellm/blob/main/litellm/llms/custom_httpx/aiohttp_handler.py#L30
Ohh, I think I know the issue: it's still getting routed to the OpenAI SDK when you pass (we route to the OpenAI SDK if the ). In my testing I was using . Will update this thread to ensure
I'll take a look! For reference, here is our aiohttp implementation: And here is how we are using litellm as a backend:
Ah yes, what you said about the routing makes sense! When the fix is in, I'll try my benchmark again and post the results 👍
Fixed here @RyanMarten: #7598. Could you test on our new release, v1.57.2? (Will be out in 12 hrs.)
@ishaan-jaff - yes, absolutely (looking out for the release).
Sorry, CI/CD is causing issues - will update here once the new release is out.
@ishaan-jaff Also curious, what software / visualization are you using for your load tests?
@RyanMarten - can you help test this: https://github.com/BerriAI/litellm/releases/tag/v1.57.3
I was using locust.
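(Purely as an illustration of the locust approach mentioned above; the host, route, payload, and API key below are assumptions, not the actual load test used here.)

```python
# locustfile.py - minimal sketch of a locust load test against an OpenAI-compatible endpoint.
# Run with: locust -f locustfile.py --host http://localhost:4000
from locust import HttpUser, task, between

class ChatCompletionsUser(HttpUser):
    wait_time = between(0.1, 0.5)

    @task
    def chat_completion(self):
        # Assumes an OpenAI-compatible /chat/completions route (e.g. a litellm proxy).
        self.client.post(
            "/chat/completions",
            json={
                "model": "gpt-4o-mini",
                "messages": [{"role": "user", "content": "hello"}],
            },
            headers={"Authorization": "Bearer sk-1234"},
        )
```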
I'm getting this error now which I wasn't getting before. I think this is an issue on our side; let me test.
Ah this is because we do a test call with
fails with an unintuitive error message
What I can do is just switch this call to use acompletion as well.
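(The exact test call isn't preserved in this scrape; as a hypothetical illustration only, swapping a blocking call for its async counterpart could look like this.)

```python
# Hypothetical illustration; the original test call is not shown in this thread.
import litellm

# Before: a blocking call made once during setup/validation.
# resp = litellm.completion(model="gpt-4o-mini", messages=[{"role": "user", "content": "ping"}])

# After: the same check awaited through the async client path.
async def validation_call():
    return await litellm.acompletion(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "ping"}],
    )
```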
OK now I'm running into an issue in the main loop where
@vutrung96 could you take a look at this since you wrote the custom event loop handling?
What happened?
I have OpenAI tier 5 usage, which should give me 30,000 RPM = 500 RPS with "gpt-4o-mini". However, I struggle to get past 50 RPS.
The minimal replication:
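(The original snippet was not captured in this scrape; below is only a sketch of the pattern being described, namely many concurrent litellm.acompletion calls gathered at once and timed to report items/second. The request count and max_tokens are assumptions.)

```python
# Sketch: fire N concurrent litellm.acompletion calls and report observed throughput.
import asyncio
import time
import litellm

async def one_request(i: int):
    return await litellm.acompletion(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"request {i}"}],
        max_tokens=16,
    )

async def main(n: int = 1000):
    start = time.perf_counter()
    await asyncio.gather(*(one_request(i) for i in range(n)))
    elapsed = time.perf_counter() - start
    print(f"{n / elapsed:.1f} items/second")

if __name__ == "__main__":
    asyncio.run(main())
```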
I only get 50 items/second as opposed to ~500 items/second when sending raw HTTP requests.
Relevant log output
Twitter / LinkedIn details
No response