Loading a 4-bit quantized version of Mini-InternVL-Chat-2B-V1-5 and running inference with transformers
#5 by belofn
We're loading and running the model with "Inference with Transformers" as described at https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-2B-V1-5#inference-with-transformers, and everything works well.
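For reference, this is roughly the loading part of the model card snippet we follow (the inference itself then goes through `model.chat(...)` as shown in the model card):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the unquantized model as in the "Inference with Transformers"
# section of the model card.
path = "OpenGVLab/Mini-InternVL-Chat-2B-V1-5"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# Inference then proceeds via model.chat(tokenizer, ...) as documented.
```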
For the sake of performance, we quantized the model with AWQ and would like to keep using the same approach as above. What is the right way to load the quantized model while still using the Inference with Transformers code?
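For context, here is a minimal sketch of what we are attempting. The local directory name `./Mini-InternVL-Chat-2B-V1-5-AWQ` is hypothetical; it is where we saved the AWQ-quantized weights, and we assume its `config.json` carries the `quantization_config` written out by the quantization tooling:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical local path to our AWQ-quantized checkpoint.
quantized_path = "./Mini-InternVL-Chat-2B-V1-5-AWQ"

# Attempted loading: same call as for the original checkpoint, but pointed
# at the quantized weights and using float16 (AWQ kernels generally expect
# fp16 activations). Whether this is the intended way is exactly our question.
model = AutoModel.from_pretrained(
    quantized_path,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(quantized_path, trust_remote_code=True)
```

Is this the expected way to load the quantized checkpoint, or is additional setup required?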