Docling serve at scale #10
Comments
Also, to add to the question above regarding scaling: how do you scale this to handle hundreds of requests per second? If you're running in the cloud, do you spin up multiple containers?
Single container: the benchmark shows a BERT-Large model with automatic batching and multiprocessing. A single model process can run prediction on a batch of 16-32 requests to increase throughput. Additionally, if GPU memory allows, it can spin up extra processes to handle more requests. The requests are load balanced at the process level via the uvicorn socket. And yes, in the cloud you can also spin up multiple containers for further scale.
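Not part of the original comment, but as a rough illustration of the process-level load balancing described above, here is a minimal sketch of launching several uvicorn worker processes that share a single socket. The `app.main:app` module path is hypothetical; substitute the real FastAPI application.

```python
# Minimal sketch (assumed module path): multiple uvicorn worker processes
# bound to one socket, so incoming requests are load balanced across them
# at the process level.
import uvicorn

if __name__ == "__main__":
    uvicorn.run(
        "app.main:app",   # hypothetical FastAPI application import string
        host="0.0.0.0",
        port=8000,
        workers=4,        # one model process per worker; size to available GPU/CPU memory
    )
```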
Agree, we need both options to scale the APIs. Thanks for raising the issue and capturing the details.
Thank you @vishnoianil! If the maintainers are okay with this, I can send a PR.
Docling is a great project! Got to know about this from Spacy-layout.
This is powered by vanilla FastAPI, which is good but won't scale on its own and lacks features like dynamic batching and autoscaling. I would suggest using a library specialized for serving ML-based APIs, such as LitServe or RayServe. A rough sketch of what that could look like is below.
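For concreteness, here is a minimal sketch of a LitServe endpoint with dynamic batching. The class name and the placeholder conversion logic are illustrative, not part of docling-serve, and the exact batching options may differ across LitServe versions.

```python
# Sketch only: a batched LitServe API. The conversion step is a placeholder
# standing in for the actual document-processing call.
import litserve as ls


class DocAPI(ls.LitAPI):
    def setup(self, device):
        # Load the model/converter once per worker process.
        self.convert = lambda source: {"status": "converted", "source": source}

    def decode_request(self, request):
        # Pull the field we care about out of the JSON payload.
        return request["source"]

    def predict(self, sources):
        # With batching enabled, `sources` is a list of decoded requests
        # collected within the batch window.
        return [self.convert(s) for s in sources]

    def encode_response(self, output):
        # LitServe unbatches by default, so this receives one result per request.
        return output


if __name__ == "__main__":
    server = ls.LitServer(
        DocAPI(),
        accelerator="auto",
        max_batch_size=16,   # batch up to 16 concurrent requests
        batch_timeout=0.05,  # or flush after 50 ms, whichever comes first
    )
    server.run(port=8000)
```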