Alternative quantizations.
These are my own quantizations (updated almost daily).
The difference from normal quantizations is that I keep the output and embedding tensors at f16
and quantize the other tensors to q5_k, q6_k or q8_0.
This produces models with little or no degradation and a smaller file size.
They run at about 3-6 tokens/sec on CPU only with llama.cpp,
and obviously faster on machines with capable GPUs.
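For reference, here is a minimal sketch of how such a mixed quantization can be produced, assuming a recent llama.cpp build whose llama-quantize binary exposes the --output-tensor-type and --token-embedding-type options (the file names are hypothetical):

```python
# Sketch only: invokes llama.cpp's llama-quantize, keeping output/embedding
# tensors at f16 while quantizing everything else to q6_k.
import subprocess

subprocess.run(
    [
        "./llama-quantize",
        "--output-tensor-type", "f16",    # keep the output tensor at f16
        "--token-embedding-type", "f16",  # keep the token embeddings at f16
        "model-f16.gguf",                 # full-precision GGUF source
        "model-q6_k-f16-out.gguf",        # destination file (hypothetical name)
        "q6_k",                           # type for all remaining tensors
    ],
    check=True,
)
```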
I think the degradation is minor. When I train, I try to train quantized, so I expect the model to stay at effectively full precision even when double-quantized.
I had read that models are basically best at fp16, and I personally consider that the base model.
When I got unsloth set up, I noticed they were using the 4-bit Mistral. So if they were using that, why not use the 4-bit models as a base as well?
It's a pity they could not make GGUFs from 4-bit. Unsloth does not even want to give you a 4-bit model; you have to explicitly specify it as forced!
But since I have been using 4-bit I have not gone back. I maintain an fp16 copy onsite, but the conversion between types (only when switching down) sometimes has errors. I think there is an order to things, though.
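As a rough sketch of that unsloth flow (argument names and model IDs here are illustrative and may differ between unsloth versions):

```python
# Sketch of loading unsloth's pre-quantized 4-bit base and forcing a merged
# 4-bit save afterwards; treat names and arguments as illustrative.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-bnb-4bit",  # the 4-bit Mistral base
    max_seq_length=2048,
    load_in_4bit=True,
)

# ... attach LoRA adapters and train ...

# Saving a merged 4-bit copy has to be forced explicitly; the default
# save methods steer you towards 16-bit merges.
model.save_pretrained_merged(
    "my-model-4bit",
    tokenizer,
    save_method="merged_4bit_forced",
)
```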
So a fine-tuned model has its deltas (extra tensors) that are not entirely merged into the model, as it still recognises them as separate. When compressing, if a model is highly tuned, some of these tensors can give errors.
If the model has been merged with another model, that product reduces well to 4-bit, as the mathematics has already resolved those potential reduction errors.
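For example, a minimal sketch of fully merging adapter deltas into the base weights before any conversion (PEFT + transformers; the model and adapter names are hypothetical), so the quantizer sees one consistent set of tensors instead of base weights plus deltas:

```python
# Fold LoRA deltas into the base model, then save the merged fp16 copy
# that will later be converted and quantized.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",      # hypothetical base model
    torch_dtype=torch.float16,
)
merged = PeftModel.from_pretrained(base, "my-lora-adapter")  # attach the deltas
merged = merged.merge_and_unload()                           # fold them into the weights
merged.save_pretrained("merged-fp16")                        # convert/quantize this copy
```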
It becomes a process of elimination when testing and merging, etc., to discover where those tensor errors came from, or which quantisation errors were overcome and which were not.
So if the model does quantize to 4-bit with no errors, then there is zero loss
and zero degradation.
Hence not using an interface which hides this important verbose logging.
Maybe? That is my thinking and testing on this so far, and we are all experimenting and sharing what we learn as we go.
So it's not gospel!
Thanks, I use q4_k_m or q4_k_s.
It's good to know that we can go much lower, as this is the aim! But we also need to be able to re-expand the models when we need them.
I consider GGUF a zip-style methodology.
Also, I have some upcoming projects which require remote code to run. How do you quantize these custom models?