8x7B (Q3) vs 7B
Since both 8x7B (Q3) and 7B fit in 24 GB of GPU RAM, which would be more accurate? What is an easy way to test this?
Performance-wise, 8x7B (Q3) runs at 83 t/s and 7B at 129 t/s on an RTX 4090. As soon as we switch to 8x7B (Q4), it exceeds the 24 GB of GPU RAM and throughput drops to 27 t/s.
@vidyamantra
A bigger quantized model is always better than a smaller unquantized model, so use the 8x7B Q3 if you want better quality.
@YaTharThShaRma999 I don't think this is always true; we should do benchmarks!
@shroominic Here is a benchmark that may be useful to you: https://github.com/ggerganov/llama.cpp/blob/master/examples/perplexity/README.md
If you want to run it locally, you can build llama.cpp and compute perplexity (PPL) scores on the GGUF models.
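For example, here is a minimal Python sketch that drives the llama.cpp `perplexity` tool over a couple of GGUF files (the binary name, model paths, and the WikiText-2 test file are assumptions; adjust them to your build and dataset):

```python
import subprocess

# Assumed paths; adjust to your llama.cpp build, models, and test corpus.
PERPLEXITY_BIN = "./llama.cpp/perplexity"   # newer builds name this binary "llama-perplexity"
TEST_FILE = "wikitext-2-raw/wiki.test.raw"  # common choice of text for PPL comparisons
MODELS = [
    "models/mixtral-8x7b-instruct-v0.1.Q3_K_M.gguf",
    "models/mistral-7b-instruct-v0.2.Q8_0.gguf",
]

for model in MODELS:
    # -m: model file, -f: text to compute perplexity over, -ngl: layers to offload to the GPU
    result = subprocess.run(
        [PERPLEXITY_BIN, "-m", model, "-f", TEST_FILE, "-ngl", "99"],
        capture_output=True,
        text=True,
    )
    tail = (result.stdout + result.stderr).strip().splitlines()[-5:]
    print(model)
    print("\n".join(tail))  # the final "PPL = ..." estimate is printed near the end of the output
```

Lower PPL is better, and running both models over the same test file gives a quick apples-to-apples comparison, though perplexity does not always predict accuracy on downstream tasks.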
@YaTharThShaRma999 That doesn't seem true for things like MemGPT. Perhaps not true for RAG in general?
A very simple (amateurish) test: given a multiple-choice question, the model was asked to produce suitable drilled-down subject tags. A total of 109 questions were given, and the results were compared against already computed correct answers. The score reflects how accurate the drilled-down tags were (a sketch of this kind of scoring loop is shown after the table).
Model | Average Score | Correct Tags |
---|---|---|
gpt-3.5-turbo | 35.65834862 | 82/109 |
gpt-3.5-turbo-instruct | 32.0842605 | 72/109 |
mixtral-8x7b-instruct-v0.1.Q5_K_M | 25.17394495 | 59/109 |
mixtral-8x7b-instruct-v0.1.Q6_K | 23.64691589 | 60/109 |
mixtral-8x7b-instruct-v0.1.Q8_K | 23.17743119 | 59/109 |
mixtral-8x7b-instruct-v0.1.Q4_K_M | 23.06449541 | 60/109 |
mistralai_Mistral-7B-Instruct-v0.1 | 20.99638889 | 51/109 |
mixtral-8x7b-instruct-v0.1.Q3_K_M | 20.74944444 | 49/109 |
Mistral-7B-Instruct-v0.2 | 18.59256881 | 59/109 |
upstage_SOLAR-10.7B-Instruct-v1.0 | 15.88376147 | 52/109 |
microsoft_phi-2 | 4.715196078 | 14/109 |
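For reference, here is a rough sketch of what such an evaluation loop could look like (the prompt, the `query_model` callable, and the overlap-based score are hypothetical placeholders, not the exact harness used for the table above):

```python
import json

def tag_overlap_score(predicted: set[str], expected: set[str]) -> float:
    """Hypothetical score: percentage of expected tags the model produced (0-100)."""
    if not expected:
        return 0.0
    return 100.0 * len(predicted & expected) / len(expected)

def evaluate(questions: list[dict], query_model) -> tuple[float, int]:
    """questions: [{"question": str, "expected_tags": [str, ...]}, ...]
    query_model: callable that takes a prompt string and returns the model's text reply.
    Returns (average score, number of questions answered with fully correct tags).
    """
    scores, fully_correct = [], 0
    for item in questions:
        prompt = (
            "Give drilled-down subject tags for this multiple-choice question "
            "as a JSON list of strings.\n\n" + item["question"]
        )
        try:
            predicted = set(json.loads(query_model(prompt)))
        except (json.JSONDecodeError, TypeError):
            predicted = set()  # unparseable reply counts as no tags
        expected = set(item["expected_tags"])
        scores.append(tag_overlap_score(predicted, expected))
        if predicted == expected:
            fully_correct += 1
    return sum(scores) / len(scores), fully_correct
```

Something along these lines, with a larger and more varied question set, might make the ranking more robust.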
I will try to write a better test. Any pointers?