Loading a 4-bit quantized version of Mini-InternVL-Chat-2B-V1-5 and running inference with transformers
#5 by belofn
We're loading and running the model with "Inference with Transformers" as described at https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-2B-V1-5#inference-with-transformers, and everything works well.
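For reference, this is roughly the loading part of the model card snippet we follow (the inference itself then goes through `model.chat(...)` as shown in the model card):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the unquantized model as in the "Inference with Transformers"
# section of the model card.
path = "OpenGVLab/Mini-InternVL-Chat-2B-V1-5"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# Inference then proceeds via model.chat(tokenizer, ...) as documented.
```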
For the sake of performance, we quantized the model with AWQ and would like to keep using the same approach as above. What is the right way to load the quantized model while still using the Inference with Transformers code?
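For context, here is a minimal sketch of what we are attempting. The local directory name `./Mini-InternVL-Chat-2B-V1-5-AWQ` is hypothetical; it is where we saved the AWQ-quantized weights, and we assume its `config.json` carries the `quantization_config` written out by the quantization tooling:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical local path to our AWQ-quantized checkpoint.
quantized_path = "./Mini-InternVL-Chat-2B-V1-5-AWQ"

# Attempted loading: same call as for the original checkpoint, but pointed
# at the quantized weights and using float16 (AWQ kernels generally expect
# fp16 activations). Whether this is the intended way is exactly our question.
model = AutoModel.from_pretrained(
    quantized_path,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(quantized_path, trust_remote_code=True)
```

Is this the expected way to load the quantized checkpoint, or is additional setup required?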