A new HQQ quantization that fits in <16GB of VRAM

by Chronal

After discovering @ProphetOfBostrom's HQQ quant of this model, I had to try making one myself. This version uses metadata offloading, which wasn't available before, so the model now loads in under 16GB of VRAM.
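
For anyone curious what metadata offloading looks like in practice, here's a rough sketch of the kind of hqq quantize config involved. The per-layer tags and group sizes below are illustrative guesses based on hqq's Mixtral examples, not the exact settings used for this quant:

```python
# Rough sketch of a mixed 4-bit/2-bit hqq config with metadata
# offloading; group sizes and layer tags are illustrative guesses,
# not the exact settings used for this quant.
from hqq.core.quantize import BaseQuantizeConfig

# offload_meta=True keeps the quantization metadata (scales/zeros)
# off the GPU, which is what brings VRAM usage under 16GB.
attn_config = BaseQuantizeConfig(nbits=4, group_size=64, offload_meta=True)
moe_config = BaseQuantizeConfig(nbits=2, group_size=16, offload_meta=True)

quant_config = {
    # attention projections at 4-bit
    "self_attn.q_proj": attn_config,
    "self_attn.k_proj": attn_config,
    "self_attn.v_proj": attn_config,
    "self_attn.o_proj": attn_config,
    # MoE expert weights at 2-bit
    "block_sparse_moe.experts.w1": moe_config,
    "block_sparse_moe.experts.w2": moe_config,
    "block_sparse_moe.experts.w3": moe_config,
}
```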

I haven't done any in-depth testing yet, but the quant seems to generate sensible output so far.

https://huggingface.co/Chronal/Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss_attn-4bit-moe-2bit-metaoffload-HQQ
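
If you want to try it, loading should look roughly like this. This is a minimal sketch assuming a recent hqq version with the HF engine; exact arguments may differ between releases:

```python
# Minimal loading sketch; assumes `pip install hqq` and a version
# whose HF engine exposes from_quantized for prequantized models.
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

model_id = "Chronal/Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss_attn-4bit-moe-2bit-metaoffload-HQQ"

# Downloads the prequantized weights; the offloaded metadata stays
# on the CPU, so GPU memory use stays under 16GB.
model = HQQModelForCausalLM.from_quantized(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```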
