Q4_K_M?

#1 opened by eepos

Could you make a Q4_K_M for us 24GB havers? <3

Also, could you please make an IQ4_XS?

Q4_K_M and IQ4_XS quant files have been uploaded.

Thanks @zetasepic ! I just tried Q4_K_M and it works great. Could you also upload Q3_K_M? It's smaller and seems like it might be better according to https://www.reddit.com/r/LocalLLaMA/comments/1fkm5vd/qwen25_32b_gguf_evaluation_results/

@zetasepic this has been fantastic. I can run the Q4_K_M on my 3090 fully on the card! Though I also second zhicwu's request for a Q3_K_M if it's not too much trouble. I can run the Q4 at about 27 tok/sec with an 8k context length, but if I raise the context length to 16k it no longer fits on the GPU and offloads to system RAM, dropping to 10 tok/sec. I think the Q3_K_M would be small enough to allow the full 32k context length, and if that Reddit post is accurate, I'd keep about the same quality and might even get faster inference. Thank you!
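For reference, the context-length/VRAM tradeoff described above comes down to two knobs in the llama-cpp-python bindings. Below is a minimal sketch; the GGUF filename is an assumption, so substitute the path to your local copy:

```python
# Minimal sketch of trading context length against GPU offload with
# llama-cpp-python. Filename below is a hypothetical local path.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen2.5-32B-Instruct-abliterated-v2.Q3_K_M.gguf",  # assumed filename
    n_ctx=32768,      # requested context window; a larger window needs more VRAM for the KV cache
    n_gpu_layers=-1,  # -1 offloads all layers to the GPU; lower this if VRAM runs out
)

out = llm("Hello, how are you?", max_tokens=64)
print(out["choices"][0]["text"])
```

If the full 32k window plus all layers overflows a 24GB card, reducing `n_gpu_layers` keeps generation running at the cost of speed, which is exactly the slowdown described above.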

Hi @phazei , @zhicwu
Please try the second version; Q3_K_M is included. (https://huggingface.co/zetasepic/Qwen2.5-32B-Instruct-abliterated-v2-GGUF)
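For anyone pulling a quant from the v2 repo programmatically, here is a minimal sketch using huggingface_hub; the exact GGUF filename inside the repo is an assumption based on common naming:

```python
# Minimal sketch: download one quant file from the v2 repo.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="zetasepic/Qwen2.5-32B-Instruct-abliterated-v2-GGUF",
    filename="Qwen2.5-32B-Instruct-abliterated-v2.Q3_K_M.gguf",  # assumed naming pattern
)
print(path)  # local cache path of the downloaded file
```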

Thank you, @zetasepic ! I have tested the updated model and, while I did not notice a significant difference in quality, Q3_K_M is indeed faster :D
