What is the right GPU to run this?
I tried using 4 × 24 GB GPUs, but inference was very slow. Can you suggest the right GPU to run it on for fast inference?
I'm having success running it on an 80 GB A100, generating about 22 tokens/s (with up to around 10 concurrent requests). It seems to work after bumping to the latest vLLM and TGI versions.
P.S. For GPU access I'm using Modal (disclaimer: I work at Modal); there are a couple of examples there (TGI, vLLM) showing how to run this quickly.
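For reference, here's a minimal offline-inference sketch with vLLM along those lines. The model repo ID and sampling settings are placeholders; substitute the model you're actually serving:

```python
from vllm import LLM, SamplingParams

# Placeholder repo ID -- replace with the model discussed in this thread.
llm = LLM(model="org/model-name")  # weights fit on a single 80 GB A100

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

# vLLM batches concurrent requests internally, which is how the
# ~10 concurrent requests mentioned above get served together.
outputs = llm.generate(["Prompt 1", "Prompt 2"], sampling_params)
for out in outputs:
    print(out.outputs[0].text)
```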
Thanks for the suggestion, quite helpful :)
I tried running on 4 × V100 (32 GB) GPUs and inference was very slow: a single request with an input length of ~1700 tokens takes 6 minutes.
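One thing worth double-checking on a multi-GPU setup like this is that tensor parallelism is actually enabled; otherwise the model may be spilling to CPU or running on one device. A sketch of what that could look like in vLLM (the model ID is a placeholder, and float16 is set explicitly because V100s don't support bfloat16):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="org/model-name",   # placeholder -- use your actual model repo ID
    tensor_parallel_size=4,   # shard the weights across the 4 V100s
    dtype="float16",          # V100 (compute capability 7.0) has no bf16
)

outputs = llm.generate(["your ~1700-token prompt here"],
                       SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```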