Q4_K_M?

#1 opened by eepos

Could you make a Q4_K_M for us 24GB havers? <3

Also, could you please make an IQ4_XS?

Q4_K_M and IQ4_XS quant files have been uploaded.

Thanks @zetasepic ! I just tried Q4_K_M and it works great. Could you also upload Q3_K_M? It's smaller and seems like it might be better according to https://www.reddit.com/r/LocalLLaMA/comments/1fkm5vd/qwen25_32b_gguf_evaluation_results/

@zetasepic this has been fantastic. I can run the Q4_K_M on my 3090 fully on the card! Though I also second zhicwu's request for a Q3_K_M if it's not too much trouble. I can run the Q4 at about 27 tok/sec with an 8k context length, but if I raise the context length to 16k it no longer fits on the GPU and offloads to system RAM, dropping to 10 tok/sec. I think the Q3_K_M would be small enough to allow the full 32k context length, and if that Reddit post is accurate, I'd keep about the same quality and might even get faster inference. Thank you!
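For reference, the context-length/VRAM tradeoff described above comes down to two knobs in the llama-cpp-python bindings. Below is a minimal sketch; the GGUF filename is an assumption, so substitute the path to your local copy:

```python
# Minimal sketch of trading context length against GPU offload with
# llama-cpp-python. Filename below is a hypothetical local path.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen2.5-32B-Instruct-abliterated-v2.Q3_K_M.gguf",  # assumed filename
    n_ctx=32768,      # requested context window; a larger window needs more VRAM for the KV cache
    n_gpu_layers=-1,  # -1 offloads all layers to the GPU; lower this if VRAM runs out
)

out = llm("Hello, how are you?", max_tokens=64)
print(out["choices"][0]["text"])
```

If the full 32k window plus all layers overflows a 24GB card, reducing `n_gpu_layers` keeps generation running at the cost of speed, which is exactly the slowdown described above.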

Hi @phazei , @zhicwu
Please try the second version; Q3_K_M is included. (https://huggingface.co/zetasepic/Qwen2.5-32B-Instruct-abliterated-v2-GGUF)
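For anyone pulling a quant from the v2 repo programmatically, here is a minimal sketch using huggingface_hub; the exact GGUF filename inside the repo is an assumption based on common naming:

```python
# Minimal sketch: download one quant file from the v2 repo.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="zetasepic/Qwen2.5-32B-Instruct-abliterated-v2-GGUF",
    filename="Qwen2.5-32B-Instruct-abliterated-v2.Q3_K_M.gguf",  # assumed naming pattern
)
print(path)  # local cache path of the downloaded file
```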

Thank you, @zetasepic ! I have tested the updated model and, while I did not notice a significant difference in quality, Q3_K_M is indeed faster :D
