"Uses Q8_0 for embed and output weights." Hmm... request for explanation.

#1
opened by Danioken

Hello, a little question: how big is the difference between the Q4_K_M and Q4_K_L versions? Are the versions with "Q8_0 for embed and output weights" significantly better than the standard quants?

Maybe you have some test results? I see that some people use f16 or even f32 for these tensors; in some models it seems to help, but in others with f16 it doesn't (the models seem more chaotic)... it's hard to find any sensible information on this subject.

https://oobabooga.github.io/benchmark.html

Here I see a lot of models with f16 output tensors (including yours, which are not bad)... but I see that you no longer include models with f16 outputs. Why?
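For context on what's being compared: the "_L" variants keep the base quant mixture (here Q4_K_M) but store the token-embedding and output tensors at Q8_0, as the title says. A minimal sketch of how such a quant is produced with llama.cpp's llama-quantize tool, assuming a local llama.cpp build; the file names are hypothetical:

```python
import subprocess

# Sketch: produce a Q4_K_L-style quant with llama.cpp's llama-quantize.
# "model-f16.gguf" / "model-Q4_K_L.gguf" are placeholder file names.
# The two --*-type flags override the ggml type for those specific tensors,
# while every other tensor follows the base Q4_K_M mixture.
subprocess.run(
    [
        "./llama-quantize",
        "--token-embedding-type", "q8_0",  # embedding weights at Q8_0
        "--output-tensor-type", "q8_0",    # output (lm_head) weights at Q8_0
        "model-f16.gguf",
        "model-Q4_K_L.gguf",
        "Q4_K_M",  # base quant type for the remaining tensors
    ],
    check=True,
)
```

The same flags accept "f16" or "f32" instead of "q8_0", which is how the f16-output variants on the linked benchmark page would have been made.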

I ran some MMLU Pro tests and found that, compared to the Q8_0 embeddings, quality was often actually degraded by using fp16, and almost never improved.

The tests weren't conclusive enough for strong claims, just enough for me to say that the size increase from f16 embeddings wasn't worth it (see the rough size math after this reply), so I'm getting my hands on more compute to do much more thorough testing.

As for your first question: they should be better, but I'm not sure yet whether they're "significantly" better; I'm doubtful. Hoping that the MMLU Pro tests will give a clearer answer.
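To put the size point in numbers, here is a back-of-envelope estimate. The shapes are an assumed example (Llama-3-8B-like, vocab 128256 x hidden 4096, untied embed and output tensors); the 8.5 bits/weight for Q8_0 follows from ggml's block layout of 32 int8 weights plus one f16 scale:

```python
# Rough size of the token-embedding / output tensors at f16 vs Q8_0.
# Shapes are an assumed example; adjust for other models.
VOCAB, HIDDEN = 128_256, 4_096
params = VOCAB * HIDDEN  # embed and output tensors share this shape

# Q8_0 stores blocks of 32 int8 weights plus one f16 scale:
# 34 bytes per 32 weights = 8.5 bits/weight.
bits_per_weight = {"f16": 16.0, "q8_0": 34 * 8 / 32}

for name, bits in bits_per_weight.items():
    gib = params * bits / 8 / 2**30
    print(f"{name}: {gib:.2f} GiB per tensor ({2 * gib:.2f} GiB for embed + output)")
```

On these assumed shapes that works out to roughly 0.98 GiB per tensor at f16 versus 0.52 GiB at Q8_0, i.e. close to a gigabyte saved across the two tensors, which is the tradeoff being weighed above.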

Thank you. I'm closing the discussion.

Danioken changed discussion status to closed
