8x7B (Q3) vs 7B
Since both 8x7B (Q3) and 7B fit in 24 GB of GPU RAM, which would be more accurate? What is an easy way to test this?
Performance-wise, 8x7B (Q3) runs at 83 t/s and 7B at 129 t/s on an RTX 4090. As soon as we switch to 8x7B (Q4), it exceeds the 24 GB of GPU RAM and throughput drops to 27 t/s.
@vidyamantra
A bigger quantized model is always better than a smaller unquantized model, so use the 8x7B Q3 if you want better quality.
@YaTharThShaRma999 I don't think this is always true; we should do benchmarks!
@shroominic Here is a benchmark that may be useful to you: https://github.com/ggerganov/llama.cpp/blob/master/examples/perplexity/README.md
If you want to run it locally, you can build llama.cpp and compute perplexity (PPL) scores on the GGUF models.
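For example, here is a minimal Python sketch that drives the llama.cpp `perplexity` tool over a couple of GGUF files (the binary name, model paths, and the WikiText-2 test file are assumptions; adjust them to your build and dataset):

```python
import subprocess

# Assumed paths; adjust to your llama.cpp build, models, and test corpus.
PERPLEXITY_BIN = "./llama.cpp/perplexity"   # newer builds name this binary "llama-perplexity"
TEST_FILE = "wikitext-2-raw/wiki.test.raw"  # common choice of text for PPL comparisons
MODELS = [
    "models/mixtral-8x7b-instruct-v0.1.Q3_K_M.gguf",
    "models/mistral-7b-instruct-v0.2.Q8_0.gguf",
]

for model in MODELS:
    # -m: model file, -f: text to compute perplexity over, -ngl: layers to offload to the GPU
    result = subprocess.run(
        [PERPLEXITY_BIN, "-m", model, "-f", TEST_FILE, "-ngl", "99"],
        capture_output=True,
        text=True,
    )
    tail = (result.stdout + result.stderr).strip().splitlines()[-5:]
    print(model)
    print("\n".join(tail))  # the final "PPL = ..." estimate is printed near the end of the output
```

Lower PPL is better, and running both models over the same test file gives a quick apples-to-apples comparison, though perplexity does not always predict accuracy on downstream tasks.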
@YaTharThShaRma999 That doesn't seem true for things like MemGPT. Perhaps not true for RAG in general?
A very simple (amateurish) test: given a multiple-choice question, the model was asked to produce suitable drilled-down subject tags. A total of 109 questions were given, and the results were compared against already computed correct answers. The score reflects how accurate the drilled-down tags were (a sketch of this kind of scoring loop is shown after the table).
Model | Average Score | Correct Tags |
---|---|---|
gpt-3.5-turbo | 35.65834862 | 82/109 |
gpt-3.5-turbo-instruct | 32.0842605 | 72/109 |
mixtral-8x7b-instruct-v0.1.Q5_K_M | 25.17394495 | 59/109 |
mixtral-8x7b-instruct-v0.1.Q6_K | 23.64691589 | 60/109 |
mixtral-8x7b-instruct-v0.1.Q8_K | 23.17743119 | 59/109 |
mixtral-8x7b-instruct-v0.1.Q4_K_M | 23.06449541 | 60/109 |
mistralai_Mistral-7B-Instruct-v0.1 | 20.99638889 | 51/109 |
mixtral-8x7b-instruct-v0.1.Q3_K_M | 20.74944444 | 49/109 |
Mistral-7B-Instruct-v0.2 | 18.59256881 | 59/109 |
upstage_SOLAR-10.7B-Instruct-v1.0 | 15.88376147 | 52/109 |
microsoft_phi-2 | 4.715196078 | 14/109 |
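For reference, here is a rough sketch of what such an evaluation loop could look like (the prompt, the `query_model` callable, and the overlap-based score are hypothetical placeholders, not the exact harness used for the table above):

```python
import json

def tag_overlap_score(predicted: set[str], expected: set[str]) -> float:
    """Hypothetical score: percentage of expected tags the model produced (0-100)."""
    if not expected:
        return 0.0
    return 100.0 * len(predicted & expected) / len(expected)

def evaluate(questions: list[dict], query_model) -> tuple[float, int]:
    """questions: [{"question": str, "expected_tags": [str, ...]}, ...]
    query_model: callable that takes a prompt string and returns the model's text reply.
    Returns (average score, number of questions answered with fully correct tags).
    """
    scores, fully_correct = [], 0
    for item in questions:
        prompt = (
            "Give drilled-down subject tags for this multiple-choice question "
            "as a JSON list of strings.\n\n" + item["question"]
        )
        try:
            predicted = set(json.loads(query_model(prompt)))
        except (json.JSONDecodeError, TypeError):
            predicted = set()  # unparseable reply counts as no tags
        expected = set(item["expected_tags"])
        scores.append(tag_overlap_score(predicted, expected))
        if predicted == expected:
            fully_correct += 1
    return sum(scores) / len(scores), fully_correct
```

Something along these lines, with a larger and more varied question set, might make the ranking more robust.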
I will try to write a better test. Any pointers?