[llama.cpp PR#6844] Custom Quantizations
I think it might be worth exploring to find a good balance between quality and speed.
I am currently experimenting with the config below:
```
# Used for everything not specified below.
ftype=IQ4_NL
token_embd.weight=Q8_0
output.weight=Q8_0
# These are quite small, keeping them in a higher quantization to help with context.
blk.*.attn_output.weight=F16
blk.*.attn_?.weight=F16
```
Edit: It seems the config above comes out to 6.95 BPW, so I will try reducing it. It is pretty fast, though.
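For anyone wondering how such a config resolves per tensor, here is a minimal sketch (not llama.cpp code; the helper names and the exact matching semantics are assumptions based on the key=value patterns above) that maps a tensor name to a quant type using glob rules, with ftype as the fallback:

```python
# Minimal sketch: resolve per-tensor quantization overrides from a
# key=value config like the one above. Not llama.cpp code; helper
# names and matching semantics are assumptions for illustration.
import fnmatch

def load_overrides(path):
    """Parse 'pattern=TYPE' lines, skipping comments and blanks."""
    default, rules = None, []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            if key == "ftype":
                default = value             # fallback type for everything else
            else:
                rules.append((key, value))  # glob pattern -> quant type
    return default, rules

def pick_type(tensor_name, default, rules):
    """First matching pattern wins; otherwise fall back to ftype."""
    for pattern, qtype in rules:
        if fnmatch.fnmatch(tensor_name, pattern):
            return qtype
    return default

# With the config above: attention tensors stay F16, embeddings and
# output stay Q8_0, everything else falls back to IQ4_NL.
default, rules = load_overrides("quant_overrides.txt")
print(pick_type("blk.0.attn_q.weight", default, rules))    # F16
print(pick_type("blk.0.ffn_down.weight", default, rules))  # IQ4_NL
```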
I never played with customizing layers as such.
I updated llama.cpp but still only get degraded quants. Are there any tutorials or something similar for using llama.cpp with Llama 3 models? I only know the convert.py method (python convert.py ./models/myllama3merge --vocab-type bpe).
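Written out, that flow looks roughly like the sketch below (a sketch only, with placeholder paths; flag and binary names may differ between llama.cpp versions, so double-check against your build):

```python
# Sketch of the two-step flow: convert the HF model to a GGUF file,
# then quantize it. Paths and the quantize binary name are
# placeholders; adjust for your own llama.cpp checkout.
import subprocess

model_dir = "./models/myllama3merge"                   # HF-format merge
f16_gguf = "./models/myllama3merge-f16.gguf"
out_gguf = "./models/myllama3merge-Q4_K_M.gguf"

# Step 1: convert to GGUF (BPE vocab for Llama 3).
subprocess.run(
    ["python", "convert.py", model_dir, "--vocab-type", "bpe",
     "--outfile", f16_gguf],
    check=True,
)

# Step 2: quantize the converted file (positional args: in, out, type).
subprocess.run(["./quantize", f16_gguf, out_gguf, "Q4_K_M"], check=True)
```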
@WesPro I added a notice about that in the GGUF script page.
https://huggingface.co/FantasiaFoundry/GGUF-Quantization-Script
More context:
https://huggingface.co/FantasiaFoundry/GGUF-Quantization-Script/discussions/27#66361fceccadfaaeacb0cdb5
Related Discussion:
https://huggingface.co/FantasiaFoundry/GGUF-Quantization-Script/discussions/26#66317fb30a84b77d96b0c4e6
Thanks, I figured it out now... this helped me more than reading the whole issue thread on GitHub ;)
That's why we're here <3