GPU requirement
#7 opened by mdeniz1
Which GPU would be enough for a good inference speed? Say, 3-4 seconds per request.
Two things:
- To fit the model comfortably into VRAM you need about 25GB of GPU memory. In my case it sits at about 20.3GB of VRAM with bfloat16 and flash_attn2 (see the loading sketch after this list):
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 3651783 C .../projects/idefics3/.venv/bin/python 20324MiB |
+-----------------------------------------------------------------------------------------+
- The request time depends on the GPU on one hand, and on the length of the answer on the other. If you want a very short answer, say 15 tokens, 3-4 seconds should not be a problem on any modern GPU that can load the model into memory. Getting a 500-token response on a single GPU in 4s might be pushing the boundary of tokens/s; I'm not sure specifically for Idefics3, but 50 tokens/s is already quite fast on a single GPU. You can cap the answer length and measure your own throughput as in the timing sketch below.
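For reference, a minimal loading sketch matching the setup above (bfloat16 plus FlashAttention-2). The checkpoint name is an assumption, taken from the public Idefics3 release; swap in whatever you are actually serving:

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor

# Assumed checkpoint; replace with your own Idefics3 variant.
MODEL_ID = "HuggingFaceM4/Idefics3-8B-Llama3"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,               # halves memory vs. float32
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    device_map="cuda:0",
)
```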
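And a rough way to check whether your GPU hits the 3-4s budget: cap `max_new_tokens` and time a single `generate` call. This is a sketch assuming the `model` and `processor` from the snippet above; the image URL and prompt are just placeholders:

```python
import time

import requests
from PIL import Image

# Placeholder test image; any local image works too.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
image = Image.open(requests.get(url, stream=True).raw)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

start = time.perf_counter()
# max_new_tokens caps the answer length, which dominates latency.
output_ids = model.generate(**inputs, max_new_tokens=15, do_sample=False)
elapsed = time.perf_counter() - start

# generate() returns prompt + new tokens, so subtract the prompt length.
new_tokens = output_ids.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} new tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.1f} tokens/s")
```

At ~50 tokens/s, 15 new tokens take well under a second of decode time, while 500 tokens would take ~10s, which is why the answer length matters more than the GPU once the model fits in memory.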