Problem: GPTQ quantization to 4-bit

#7
by hiauiarau - opened

Hi! I'm trying to quantize a model to 4-bit with GPTQ, with the main goal of preserving quality for Russian. I'm using 256 examples in the calibration dataset, covering Russian, Chinese, and English. (I've tried adding more Russian and fewer examples of the other languages, and also keeping each language equally represented.)

I'm using the auto-gptq library with CPU offloading and `damp_percent = 0.01`. I see that knowledge and response quality improve, but there's a significant increase in language switching, especially into Chinese in the middle of generations. Could you suggest anything? My V100 GPU does not support AWQ.
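For reference, here is a minimal sketch of the calibration-mix experiment described above: sampling a fixed-size multilingual calibration set with adjustable per-language proportions. The function name, pool contents, and the 50/25/25 split are illustrative assumptions, not taken from the post.

```python
import random

def build_calibration_set(pools, weights, n_samples=256, seed=0):
    """Sample a fixed-size calibration set from per-language text pools.

    pools:   dict mapping language code -> list of raw text examples
    weights: dict mapping language code -> sampling proportion (sums to 1.0)
    """
    rng = random.Random(seed)
    calib = []
    for lang, w in weights.items():
        k = round(n_samples * w)
        # Sample with replacement so small pools can still fill their quota.
        calib.extend(rng.choices(pools[lang], k=k))
    rng.shuffle(calib)
    return calib[:n_samples]

# Illustrative pools; in practice these would be real corpus excerpts.
pools = {
    "ru": ["Пример текста."] * 10,
    "zh": ["示例文本。"] * 10,
    "en": ["Example text."] * 10,
}
# One of the mixes the post experiments with: weight Russian more heavily.
weights = {"ru": 0.5, "zh": 0.25, "en": 0.25}
calib = build_calibration_set(pools, weights)
```

The resulting list of texts would then be tokenized and passed to the quantizer as calibration data; varying `weights` is one way to test whether the Chinese-heavy language switching tracks the calibration mix.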
