GPU requirement


Which GPU would be enough for a good inference speed? Say 3-4 seconds per request.

Two things:

  • To fit the model comfortably in VRAM you need about 25 GB of GPU memory. In my case it sits at about 20.3 GB of VRAM with bfloat16 and flash_attn2 (a minimal loading sketch follows after this list):
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   3651783      C   .../projects/idefics3/.venv/bin/python      20324MiB |
+-----------------------------------------------------------------------------------------+
  • The request time depends on the GPU on one hand and on the length of the answer on the other. If you want a very short answer, say 15 tokens, 3-4 seconds should not be a problem on any modern GPU that can load the model into memory. Getting a 500-token response on a single GPU in 4 s might be pushing the boundary of tokens/s; I'm not sure about Idefics3 specifically, but 50 tokens/s is already quite fast on a single GPU. (A rough latency estimate is sketched below.)
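For reference, here is a minimal sketch of how the model could be loaded in bfloat16 with FlashAttention-2, as mentioned above. It assumes transformers, accelerate, and flash-attn are installed; the model id HuggingFaceM4/Idefics3-8B-Llama3 is my assumption, so adjust it to the checkpoint you actually use.

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

# Assumed model id; replace with the checkpoint you are serving.
model_id = "HuggingFaceM4/Idefics3-8B-Llama3"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # halves memory vs. float32
    attn_implementation="flash_attention_2",  # requires flash-attn to be installed
    device_map="auto",                        # place weights on the available GPU (needs accelerate)
)
```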
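And a back-of-envelope latency estimate in the same spirit: total time is roughly prefill time plus generated tokens divided by decode speed. The 50 tokens/s and 0.5 s prefill figures below are illustrative assumptions, not Idefics3 measurements.

```python
# Rough single-request latency model: total ≈ prefill + new_tokens / decode_speed.
# All numbers are illustrative assumptions, not Idefics3 benchmarks.
def estimate_latency(new_tokens: int, tokens_per_s: float = 50.0, prefill_s: float = 0.5) -> float:
    return prefill_s + new_tokens / tokens_per_s

print(f"15 tokens:  ~{estimate_latency(15):.1f} s")   # ~0.8 s, well inside a 3-4 s budget
print(f"500 tokens: ~{estimate_latency(500):.1f} s")  # ~10.5 s, past a 4 s budget at 50 tok/s
```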
