Thank you so much for doing this chore. IQ2 > EXL2 @ <3bpw ?
I am writing this comment because you asked so nicely, and also because I got sick of even trying to do this sort of thing. (I will never forgive HQQ) Clearly your upload rate is better than mine. I appreciate that you don't half-ass these. The 'proper' imatrix tuning is noted and I reckon going from 32 to 512 tokens has to help a model remember how to zero-shot. I download all of them - just in case.
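For anyone curious what that imatrix step roughly looks like in practice, here's a minimal sketch assuming the stock llama.cpp imatrix/quantize binaries and made-up file names (calibration.txt and model-f16.gguf are placeholders of mine, not the uploader's actual setup):

```python
# Minimal sketch of an imatrix -> quantize pipeline (file names are hypothetical).
# The interesting knob is -c: tokens of context per imatrix chunk,
# i.e. the 32 -> 512 change mentioned above.
import subprocess

MODEL_F16 = "model-f16.gguf"   # unquantized GGUF (placeholder)
CALIB_TXT = "calibration.txt"  # whatever text you calibrate on (placeholder)

# 1) collect importance statistics at 512-token context
subprocess.run(
    ["./imatrix", "-m", MODEL_F16, "-f", CALIB_TXT, "-o", "imatrix.dat", "-c", "512"],
    check=True,
)

# 2) quantize using those statistics
subprocess.run(
    ["./quantize", "--imatrix", "imatrix.dat", MODEL_F16, "model-IQ2_XS.gguf", "IQ2_XS"],
    check=True,
)
```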
Other readers wanting advice*: IQ3-XXS is rock solid quality-wise, but it doesn't break 3 tk/s on a 3090 with a best-effort offload, even with just 6k ctx. That's on a 5800X at 5 GHz, after trying a lot of permutations for settings (rough sketch of my offload setup after the recommendations below).
IQ2-XS GGUF feels nearly as robust. I think it's better than the 2.4 BPW EXL2.
If you don't really have enough video memory for any of these, honestly, you should: get the IQ3 - with a partial offload it's not really any slower than the smaller files.
I think most people who are here because they have 20-24 GB of VRAM should: pick IQ2-XS.
If you have some many-core monster CPU or >24 GB of VRAM, you should: get the IQ3 GGUF. Even 4 BPW is overkill.
If you have an H100, you should: give it to me. Thanks.
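For the "best effort offload" I mentioned above, this is roughly the shape of it via llama-cpp-python - layer count, thread count, and file name are guesses/placeholders for a 24 GB card, not a recipe:

```python
# Rough partial-offload sketch with llama-cpp-python; tune n_gpu_layers until
# you stop hitting CUDA out-of-memory, the remaining layers run on the CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="model-IQ2_XS.gguf",  # placeholder file name
    n_gpu_layers=60,                 # partial offload: whatever fits in 24 GB
    n_ctx=6144,                      # the ~6k context mentioned above
    n_threads=8,                     # physical cores of the 5800X
)

out = llm("Once upon a time", max_tokens=64)
print(out["choices"][0]["text"])
```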
These are/(this model is) great fun - what's your perspective on how these importance-tuned GGUF weights stack up against the likes of the GPTQ derivatives (EXL2 and AWQ are the two I know of)?
Nothing on my computer works for more than 5 minutes, so I'm not running a perplexity eval, but my view:
It sure seems like IQ2-XS has only advantages over the slightly bigger EXL2.4?
The only exception I can think of is the 8-bit KV cache, which I used to use uncritically - but today I got 'bad vibes' trying 2.4 bpw with the 8-bit KV cache at a context of ~6k tokens - it just seemed unstable. Far from scientific, I know.
I'd be most interested in whatever you know about this topic - I saw a kccp repo with old Q8 KV cache code, and I have to wonder if I'm completely wrong and 8-bit KV is borderline lossless for ctx << 300000.
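To show why I suspect 8-bit KV could be borderline lossless, here's a toy numpy sketch of q8_0-style per-block quantization on fake activations - not the actual llama.cpp/exllamav2 code, just the rough math:

```python
# Toy round-trip of 8-bit, per-block quantization (q8_0-style: int8 values plus
# a per-block scale). On Gaussian-ish activations the relative error is tiny.
import numpy as np

rng = np.random.default_rng(0)
kv = rng.standard_normal((4096, 128)).astype(np.float32)  # fake K (or V) tensor

BLOCK = 32
blocks = kv.reshape(-1, BLOCK)
scale = np.abs(blocks).max(axis=1, keepdims=True) / 127.0        # per-block scale
safe = np.where(scale == 0, 1.0, scale)                          # avoid div-by-zero
q = np.clip(np.round(blocks / safe), -127, 127).astype(np.int8)  # int8 storage
dq = (q.astype(np.float32) * scale).reshape(kv.shape)            # dequantize

rel_err = np.abs(dq - kv).mean() / np.abs(kv).mean()
print(f"mean relative error: {rel_err:.3%}")  # typically well under 1% here
```

Real KV values aren't Gaussian and attention can amplify small errors over long contexts, so treat this as intuition, not proof.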
I've not yet been impressed by EXL2 70B-on-a-3090 recipes. I've no doubt it's possible (much like writing DeepSpeed ZeRO-3 might be "possible" for one person), but EXL2 doesn't get the attention llama.cpp does, and honestly there's no replacement for true platform agnosticism, so arguably it shouldn't.
I reckon ultimately it's going to be upstream witchcraft (TorchAO, Triton, unexpectedly superintelligent Julia scripts etc.) which will free us from this nightmare. Image generation is much more flexible as long as you accept that you do actually need SOME kind of GPU to produce the graphics. Just not very much of one. SD.Next is magic.
Until then, Someone (Kooten) has to stop Them (NeverSleep) from feeding their beautiful Noromaids into the jaws of q4_0 and q2_K.
*As for the IQ2-XXS - it's theoretically sound? Get it if going XXS instead of XS is a huge speedup for your particular config. I liked it fine for a 120B Goliath-shaped model. Too slow offloaded, of course, but the quality was fine at first blush.
Thank you.
Overall I think EXL2 and GGUF quants bring more flexibility than the alternatives, like more precise control of the bits per weight to fit your exact requirements, and GGUF especially, with offloading, lets you run these models on less powerful devices at high quality; imatrix, QuIP#, etc. make it even more accessible. At the moment, though, below 3-4 bpw EXL2 starts to hurt, and there seem to be some issues with the imatrix quants discussed in other threads, but it is still impressive and quite usable if that is what is available to you.
Unfortunately I do not really know about the cache; I have not encountered any obvious issues. Unless you notice a clear difference with it on/off, I would assume 2.4 bpw is a bit too lobotomized, leading to the weirdness.
There is a lot happening, a lot of improvement in all sorts of ways.