Full GGUF quants
Full set of GGUF quants over here: https://huggingface.co/FaradayDotDev/Llama-3-Lumimaid-8B-v0.1-GGUF
Does "full" mean a set with lots missing? (e.g. IQ1_M, Q4_0, IQ4_NL... :)
But seriously, it would be better to leave out IQ1_S rather than IQ1_M, although neither is going to be important for an 8B.
Yeah, I would also like some more refined quants in the Q5-Q8 range for people with only 8-10 GB of VRAM, without the loss in quality. :(
A couple of hundred MB in size makes a difference in these cases.
Hmm, but they have Q5-Q8, and imatrix ones, too, apparently? Maybe I don't understand what you mean by refined?
Hmm... looking at them, they were made either without an imatrix or with an old version of llama.cpp, meaning they do indeed have reduced quality. Almost certainly an old version of llama.cpp that didn't have llama 3 tokenizer support.
You might be able to work around this by specifying --override-kv tokenizer.ggml.pre=str:llama3 when running llama.cpp.
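For example, something like this should work (a minimal sketch; the model filename is a placeholder, and newer llama.cpp builds name the binary llama-cli rather than main):

```bash
# Force the llama-3 pretokenizer on a GGUF whose tokenizer.ggml.pre
# field is missing or wrong.
./main -m Llama-3-Lumimaid-8B-v0.1.Q5_K_M.gguf \
       --override-kv tokenizer.ggml.pre=str:llama3 \
       -p "Hello"
```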
@Skydea https://huggingface.co/NeverSleep/Llama-3-Lumimaid-8B-v0.1-GGUF has correctly made Q5..Q8 quants.
You were responded to in the other repo, but the GGUFs at the link are fully functional and made with a current version of llama.cpp. Imatrix was used for the quants that profit from its use, rather than indiscriminately.
My apologies if I misspoke when I said ‘full set’. We have a broad range that we feel covers possible needs well.
@PacmanIncarnate the quants at https://huggingface.co/FaradayDotDev/Llama-3-Lumimaid-8B-v0.1-GGUF are broken as explained earlier. They are "fully functional" in the sense that they probably don't crash, but they have severely reduced quality because they were made with the wrong converter tool and thus carry the wrong tokenizer config (i.e. they were made with convert.py, which does not work for llama-3).
Also, an imatrix was clearly not used for all quants that benefit from it (e.g. none of the .Q quants have it applied, but almost all would benefit from it). At least, that is what @brooketh wrote in the other repo.
Just pointing out the facts, don't shoot the messenger. There is an opportunity to improve things for the benefit of all.
Seems this applies to the other llama-3 and command-r quants from faradaydotdev as well: they are all similarly broken, and only the .IQ quants were done using an imatrix.
Enough. You’re spreading false information at this point. The GGUFs we are making are fully functional and were made after the llama 3 tokenizer update.
The command-r model predates the tokenizer fix for that model type, because that fix was only merged days ago, and the model, for whatever reason, doesn’t seem to suffer for it.
And of course only the IQ files are using imatrix. That’s how that works.
Anyway, I recommend the quants by NeverSleep at this point, as they were done correctly, and the equivalent faradaydotdev ones are made without an imatrix anyway. If anybody wants correctly done .iq quants, I will happily provide them if needed.
@PacmanIncarnate and since you wrongly accuse me of spreading false information, why don't you tell us how you managed to end up with GGUFs that don't have the pretokenizer set? And where is the data that shows that the (e.g.) Q2_K quant does not benefit from an imatrix?
The answer to the first is that you used convert.py instead of convert-hf-to-gguf.py, and the answer to the second is that you just made up that claim.
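For reference, the correct conversion path looks roughly like this (a sketch; the model path and output name are placeholders):

```bash
# convert-hf-to-gguf.py writes the pretokenizer field (tokenizer.ggml.pre)
# into the GGUF; the old convert.py path does not handle llama-3 correctly.
python convert-hf-to-gguf.py /path/to/Llama-3-Lumimaid-8B-v0.1 \
       --outtype f16 --outfile lumimaid-8b-f16.gguf
```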
> Anyway, I recommend the quants by NeverSleep at this point, as they were done correctly, and the equivalent faradaydotdev ones are made without an imatrix anyway. If anybody wants correctly done .iq quants, I will happily provide them if needed.
The quants by NeverSleep are also made without an imatrix, by the way.
Yup, that's why NeverSleep's quants only have upsides and no downsides. Since you both have time to reply here and on your repo, why don't you reply to the actual criticism I made and back up your claim that I am spreading false information? Should be easy if there is any substance to it. Or are ad hominems simply easier than dealing with the facts? Deal with the criticism, not the person bringing up valid points.
And the fact, which is extremely easy to verify (just click on a quant file on the right side), is that your quants specify the default (llama 2) pretokenizer, while NeverSleep's quants correctly specify "llama-bpe" as the pretokenizer.
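You can also verify this locally from the command line (a sketch; it assumes the gguf Python package from the llama.cpp repo is installed, which provides a gguf-dump tool, and the filename is a placeholder):

```bash
# Dump the GGUF metadata and look at the pretokenizer field; a correctly
# converted llama-3 file reports tokenizer.ggml.pre = 'llama-bpe'.
gguf-dump Llama-3-Lumimaid-8B-v0.1.Q5_K_M.gguf | grep tokenizer.ggml.pre
```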
> Yup, that's why NeverSleep's quants only have upsides and no downsides. Since you both have time to reply here and on your repo, why don't you reply to the actual criticism I made and back up your claim that I am spreading false information?
I'm just wondering why you aren't criticizing NeverSleep for failing to use an imatrix in their quants, if that is really such a big deal to you? The answer, I suspect, is that you realize it's a rubbish claim.
> And the fact, which is extremely easy to verify (just click on a quant file on the right side), is that your quants specify the default (llama 2) pretokenizer, while NeverSleep's quants correctly specify "llama-bpe" as the pretokenizer.
What's interesting is that in your comment here you implied that "llama3" was the correct value for this string; now you're claiming that "llama-bpe" is correct. However, neither of those strings is what "convert-hf-to-gguf.py" actually outputs for this model, which you would know if you had taken the time to run it yourself.
llama-bpe is what convert-hf-to-gguf.py outputs, as can be seen in NeverSleep's quants. llama3 and llama-bpe select the same pretokenizer, so both are correct, and my claim that llama3 might work as a workaround stands. I never implied it is the only correct value, only that it probably improves the quants. (The imatrix itself is still broken, though, and needs to be redone; the override cannot fix an imatrix retroactively. But since the quants I was comparing either did not use an imatrix or did not record it, the workaround should help for those.)
If your copy doesn't output a pretokenizer config, then it's simply outdated and predates the pretokenizer implementation. This is extremely easy to look up. Why not actually do your research instead of spreading FUD?
Still, you haven't addressed my actual criticism, which is that your quants use the wrong pretokenizer.
Better yet, why not admit that you used convert.py (or an outdated llama.cpp), which does not work for llama-3, and be done with it?
And why would I criticise NeverSleep for not using an imatrix? I am not criticising you for that either. I am criticising you for the claim that you used an imatrix when you didn't (you claim all quants that benefit are using an imatrix; Q2_K is one that would benefit; your Q2_K was not done with an imatrix, as you admit yourself. q.e.d.).
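For completeness, producing an imatrix-enhanced k-quant such as Q2_K is just two extra steps with the stock llama.cpp tools (a sketch; the calibration text and filenames are placeholders, and newer builds rename the binaries to llama-imatrix / llama-quantize):

```bash
# 1) Compute an importance matrix from a correctly converted f16 GGUF.
./imatrix -m lumimaid-8b-f16.gguf -f calibration.txt -o lumimaid-8b.imatrix

# 2) Pass it to quantize when producing the Q2_K (or any other) quant.
./quantize --imatrix lumimaid-8b.imatrix \
           lumimaid-8b-f16.gguf lumimaid-8b.Q2_K.gguf Q2_K
```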