What is the right GPU to run this?
I tried using 4 × 24 GB GPUs, but inference was very slow. Can you suggest the right GPU to run it on for fast inference?
I'm having success running it on an 80 GB A100, generating about 22 tokens/s (with up to around 10 concurrent requests). It seems to work after bumping to the latest vLLM and TGI versions.
P.S. For GPU access I'm using Modal (disclaimer: I work at Modal); there are a couple of examples there (TGI, vLLM) showing how to run this quickly.
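For reference, here's a minimal offline-inference sketch with vLLM along those lines. The model repo ID and sampling settings are placeholders; substitute the model you're actually serving:

```python
from vllm import LLM, SamplingParams

# Placeholder repo ID -- replace with the model discussed in this thread.
llm = LLM(model="org/model-name")  # weights fit on a single 80 GB A100

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

# vLLM batches concurrent requests internally, which is how the
# ~10 concurrent requests mentioned above get served together.
outputs = llm.generate(["Prompt 1", "Prompt 2"], sampling_params)
for out in outputs:
    print(out.outputs[0].text)
```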
Thanks for the suggestion, quite helpful :)
I tried running on 4 × V100 (32 GB) GPUs and inference was very slow: a single request with an input length of ~1700 tokens takes 6 minutes.
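One thing worth double-checking on a multi-GPU setup like this is that tensor parallelism is actually enabled; otherwise the model may be spilling to CPU or running on one device. A sketch of what that could look like in vLLM (the model ID is a placeholder, and float16 is set explicitly because V100s don't support bfloat16):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="org/model-name",   # placeholder -- use your actual model repo ID
    tensor_parallel_size=4,   # shard the weights across the 4 V100s
    dtype="float16",          # V100 (compute capability 7.0) has no bf16
)

outputs = llm.generate(["your ~1700-token prompt here"],
                       SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```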