Higher PPL than Mixtral?
I ran a PPL eval and noticed that the PPL is much higher than that of the original Mixtral model on wikitext.
- LoneStriker_dolphin-2.5-mixtral-8x7b-6.0bpw-h6-exl2-2 - 4.464363098144531
- turboderp_Mixtral-8x7B-instruct-exl2_8.0bpw - 3.7087724208831774
I was wondering if this is expected.
For ref, 70b dolphin models give me PPLs just below 4:
- LoneStriker_dolphin-2.2-70b-6.0bpw-h6-exl2-2 - 3.965563297271729
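For anyone wanting to sanity-check numbers like these, here's a rough sketch of a sliding-window wikitext PPL eval. It uses plain transformers as a stand-in (the exl2 quants above would be evaluated through exllamav2's own tooling instead), and the context length / stride are assumptions, not necessarily what was used for the figures above:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # stand-in; the exl2 quants above load via exllamav2 instead
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# Concatenate the wikitext-2 test split into one long token stream
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids

ctx_len = 2048    # window length; results shift with this, so keep it fixed across models
stride = ctx_len  # non-overlapping windows
nlls, n_tokens = [], 0
for start in range(0, ids.size(1) - 1, stride):
    window = ids[:, start : start + ctx_len].to(model.device)
    if window.size(1) < 2:
        break
    with torch.no_grad():
        # labels == input_ids: transformers shifts internally and returns mean cross-entropy
        loss = model(window, labels=window).loss
    nlls.append(loss.float() * (window.size(1) - 1))
    n_tokens += window.size(1) - 1

print(f"perplexity: {torch.exp(torch.stack(nlls).sum() / n_tokens).item():.4f}")
```

PPL figures are only directly comparable when the dataset, tokenizer, context length and stride all match, which is one reason numbers from different people rarely line up exactly.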
You're comparing 6.0bpw to 8.0bpw, so yes, it's expected that the lower-bitrate quant will generally have higher perplexity. Also, with exl2, quality can vary depending on the calibration dataset used during quantization.
@HiroseKoichi, it's a 0.76 PPL jump.
If someone could share the PPL of the non-quantized version, I'd be interested to see how far it is from the original Mixtral model.
Taking a look at Turboderp's page, it looks like your test result is the outlier here, and dolphin is right in line with the expected numbers: https://huggingface.co/turboderp/Mixtral-8x7B-instruct-exl2
I'm by no means an expert on exl2 quantization, but wikitext is a popular calibration dataset for exl2 quants, and a quant calibrated on the same data it's evaluated on will measure artificially low perplexity there, which could explain why Mixtral-Instruct looks so much better. Try running it through a different dataset.
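For example (the dataset pick here is arbitrary, just something other than wikitext), the loop from the sketch above could be pointed at a slice of C4:

```python
# Same loop as above, just scored on text the quants were presumably not calibrated on.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="validation", streaming=True)
text = "\n\n".join(row["text"] for _, row in zip(range(1000), c4))  # first ~1000 documents
ids = tokenizer(text, return_tensors="pt").input_ids
# ...then reuse the sliding-window loop from the sketch above.
```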