GPU requirement


Which GPU would be enough for a good inference speed? Say 3-4 seconds per request.

Two things:

  • To fit the model comfortably in VRAM you need about 25 GB of GPU memory. In my case it sits at about 20.3 GB of VRAM with bfloat16 and flash_attn2 (a minimal loading sketch follows after this list):
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   3651783      C   .../projects/idefics3/.venv/bin/python      20324MiB |
+-----------------------------------------------------------------------------------------+
  • The request time depends on the GPU on one hand and on the length of the answer on the other. If you want a very short answer, say 15 tokens, 3-4 seconds should not be a problem on any modern GPU that can load the model into memory. Getting a 500-token response on a single GPU in 4 s might be pushing the boundary of tokens/s; I'm not sure about Idefics3 specifically, but 50 tokens/s is already quite fast on a single GPU. (A rough latency estimate is sketched below.)
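For reference, here is a minimal sketch of how the model could be loaded in bfloat16 with FlashAttention-2, as mentioned above. It assumes transformers, accelerate, and flash-attn are installed; the model id HuggingFaceM4/Idefics3-8B-Llama3 is my assumption, so adjust it to the checkpoint you actually use.

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

# Assumed model id; replace with the checkpoint you are serving.
model_id = "HuggingFaceM4/Idefics3-8B-Llama3"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # halves memory vs. float32
    attn_implementation="flash_attention_2",  # requires flash-attn to be installed
    device_map="auto",                        # place weights on the available GPU (needs accelerate)
)
```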
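And a back-of-envelope latency estimate in the same spirit: total time is roughly prefill time plus generated tokens divided by decode speed. The 50 tokens/s and 0.5 s prefill figures below are illustrative assumptions, not Idefics3 measurements.

```python
# Rough single-request latency model: total ≈ prefill + new_tokens / decode_speed.
# All numbers are illustrative assumptions, not Idefics3 benchmarks.
def estimate_latency(new_tokens: int, tokens_per_s: float = 50.0, prefill_s: float = 0.5) -> float:
    return prefill_s + new_tokens / tokens_per_s

print(f"15 tokens:  ~{estimate_latency(15):.1f} s")   # ~0.8 s, well inside a 3-4 s budget
print(f"500 tokens: ~{estimate_latency(500):.1f} s")  # ~10.5 s, past a 4 s budget at 50 tok/s
```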
