Maybe a slerp or some other merge method will preserve the component experts better?
Just a thought. Would be great if we could get Mixtral down to 3-4 experts for lower-end hardware. Given it only activates two experts at a time, there's probably a lot of redundancy.
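For what it's worth, a slerp between two experts' weight tensors might look roughly like this. A minimal sketch in PyTorch, assuming matching shapes; the state-dict keys in the usage comment follow the Hugging Face Mixtral checkpoint layout and are only an example, not code from any particular merge tool:

```python
import torch

def slerp(w0: torch.Tensor, w1: torch.Tensor, t: float = 0.5, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors of the same shape."""
    v0 = w0.flatten().float()
    v1 = w1.flatten().float()
    # Cosine of the angle between the flattened weight vectors.
    cos_omega = torch.dot(v0, v1) / (v0.norm() * v1.norm() + eps)
    omega = torch.acos(cos_omega.clamp(-1.0, 1.0))
    if omega.abs() < 1e-4:
        # Nearly parallel vectors: fall back to plain linear interpolation.
        merged = (1.0 - t) * v0 + t * v1
    else:
        sin_omega = torch.sin(omega)
        merged = (torch.sin((1.0 - t) * omega) / sin_omega) * v0 \
               + (torch.sin(t * omega) / sin_omega) * v1
    return merged.reshape(w0.shape).to(w0.dtype)

# e.g. merging two experts' projections in one layer of a Mixtral checkpoint:
# a = sd["model.layers.0.block_sparse_moe.experts.0.w1.weight"]
# b = sd["model.layers.0.block_sparse_moe.experts.1.w1.weight"]
# merged = slerp(a, b, t=0.5)
```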
Here we are prototyping a Mixtral model built by extracting experts :)
https://huggingface.co/mmnga/Mixtral-Extraction-4x7B-Instruct-v0.1
In the conversion notebook, you can choose which experts to target.
convert_mixtral_8x7b_to_4x7b_extract.ipynb
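For anyone curious, extracting a subset of experts boils down to copying only the chosen experts' tensors plus the matching router rows into the smaller model. A rough sketch under that assumption (illustrative only, not the notebook's actual code; the expert indices are just an example, and the key names follow the Hugging Face Mixtral checkpoint layout):

```python
import torch

KEEP_EXPERTS = [0, 2, 4, 6]  # hypothetical choice of 4 experts to keep

def extract_layer_experts(state_dict: dict, layer: int, keep: list[int]) -> dict:
    """Copy the selected experts (re-indexed 0..len(keep)-1) and their router rows."""
    prefix = f"model.layers.{layer}.block_sparse_moe"
    new_sd = {}
    for new_idx, old_idx in enumerate(keep):
        for proj in ("w1", "w2", "w3"):
            old_key = f"{prefix}.experts.{old_idx}.{proj}.weight"
            new_key = f"{prefix}.experts.{new_idx}.{proj}.weight"
            new_sd[new_key] = state_dict[old_key].clone()
    # Keep only the router (gate) rows that score the selected experts.
    gate = state_dict[f"{prefix}.gate.weight"]
    new_sd[f"{prefix}.gate.weight"] = gate[keep].clone()
    return new_sd
```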
Thank you!
We changed the merge method to slerp and were able to improve the quality of the output :)
Tremendous!
Yes, I've had to use slerp and other more complicated merge methods (gradient merge, TIES merge) to preserve as much of the two models' nuances as possible. It's likely that with an MoE model you need to preserve as much of the differences between experts as possible.
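As a rough illustration of the TIES idea (trim small deltas relative to a shared base, elect a sign per parameter, then average only the agreeing deltas), here is a minimal sketch; the function name and density value are assumptions for illustration, not any library's API:

```python
import torch

def ties_merge(base: torch.Tensor, experts: list[torch.Tensor], density: float = 0.2) -> torch.Tensor:
    """TIES-style merge of several expert tensors onto a shared base tensor."""
    deltas = []
    for w in experts:
        d = (w - base).flatten().float()
        # Trim: keep only the top-k largest-magnitude entries of each delta.
        k = max(1, int(density * d.numel()))
        threshold = d.abs().kthvalue(d.numel() - k + 1).values
        deltas.append(torch.where(d.abs() >= threshold, d, torch.zeros_like(d)))
    stacked = torch.stack(deltas)
    # Elect a sign per parameter from the summed deltas, then average only
    # the contributions that agree with that sign.
    sign = torch.sign(stacked.sum(dim=0))
    agree = (torch.sign(stacked) == sign) & (stacked != 0)
    merged_delta = (stacked * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1)
    return (base.flatten().float() + merged_delta).reshape(base.shape).to(base.dtype)
```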