Best strategy for inference on multiple GPUs
#124
by symdec
Hello,
I have a question about serving this model for a real-time-ish, multi-user use case.
I'm running this model on a server behind a FastAPI/uvicorn web server. Right now it works with the model on a single GPU.
I want to increase serving throughput by using multiple GPUs, with one Whisper instance on each.
Do you know which technologies I could use to queue incoming HTTP requests and route them to the different instances/GPUs (with some load balancing), so as to maximize throughput and minimize latency?
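
For reference, here is a rough (hypothetical, untested) sketch of the kind of setup I have in mind: one Whisper instance per GPU inside the FastAPI process, with naive round-robin dispatch and a per-GPU lock. The model name, endpoint, and locking scheme are just placeholders, not a proposal for the final design.

```python
# Hypothetical sketch: one Whisper instance per GPU behind FastAPI,
# with round-robin dispatch and a per-GPU lock so each instance
# handles one request at a time. Assumes openai-whisper, torch,
# and fastapi are installed and GPUs are available.
import asyncio
import itertools
import tempfile

import torch
import whisper
from fastapi import FastAPI, UploadFile

app = FastAPI()

NUM_GPUS = torch.cuda.device_count()
# Load one model per GPU once at startup.
models = [whisper.load_model("base", device=f"cuda:{i}") for i in range(NUM_GPUS)]
locks = [asyncio.Lock() for _ in range(NUM_GPUS)]
rr = itertools.cycle(range(NUM_GPUS))  # naive round-robin balancer


@app.post("/transcribe")
async def transcribe(file: UploadFile):
    # Persist the upload so Whisper can read it from disk.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp.write(await file.read())
        path = tmp.name

    gpu = next(rr)  # pick the next GPU in round-robin order
    loop = asyncio.get_running_loop()
    async with locks[gpu]:  # one request at a time per instance
        # Run the blocking transcription in a worker thread so the
        # event loop keeps accepting other requests.
        result = await loop.run_in_executor(None, models[gpu].transcribe, path)
    return {"text": result["text"]}
```

I suspect something more robust (separate worker processes behind a proper queue or load balancer) would scale better than doing it all in one process, hence the question.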
Thanks in advance!