Maybe a slerp or some other merge method will preserve the component experts better?
Just a thought. Would be great if we could get Mixtral down to 3-4 experts for lower-end hardware. Given it only activates two experts at a time, there's probably a lot of redundancy.
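For what it's worth, a slerp between two experts' weight tensors might look roughly like this. A minimal sketch in PyTorch, assuming matching shapes; the state-dict keys in the usage comment follow the Hugging Face Mixtral checkpoint layout and are only an example, not code from any particular merge tool:

```python
import torch

def slerp(w0: torch.Tensor, w1: torch.Tensor, t: float = 0.5, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors of the same shape."""
    v0 = w0.flatten().float()
    v1 = w1.flatten().float()
    # Cosine of the angle between the flattened weight vectors.
    cos_omega = torch.dot(v0, v1) / (v0.norm() * v1.norm() + eps)
    omega = torch.acos(cos_omega.clamp(-1.0, 1.0))
    if omega.abs() < 1e-4:
        # Nearly parallel vectors: fall back to plain linear interpolation.
        merged = (1.0 - t) * v0 + t * v1
    else:
        sin_omega = torch.sin(omega)
        merged = (torch.sin((1.0 - t) * omega) / sin_omega) * v0 \
               + (torch.sin(t * omega) / sin_omega) * v1
    return merged.reshape(w0.shape).to(w0.dtype)

# e.g. merging two experts' projections in one layer of a Mixtral checkpoint:
# a = sd["model.layers.0.block_sparse_moe.experts.0.w1.weight"]
# b = sd["model.layers.0.block_sparse_moe.experts.1.w1.weight"]
# merged = slerp(a, b, t=0.5)
```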
Here we are prototyping a Mixtral model built by extracting experts :)
https://huggingface.co/mmnga/Mixtral-Extraction-4x7B-Instruct-v0.1
In the conversion notebook, you can choose which experts to target.
convert_mixtral_8x7b_to_4x7b_extract.ipynb
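For anyone curious, extracting a subset of experts boils down to copying only the chosen experts' tensors plus the matching router rows into the smaller model. A rough sketch under that assumption (illustrative only, not the notebook's actual code; the expert indices are just an example, and the key names follow the Hugging Face Mixtral checkpoint layout):

```python
import torch

KEEP_EXPERTS = [0, 2, 4, 6]  # hypothetical choice of 4 experts to keep

def extract_layer_experts(state_dict: dict, layer: int, keep: list[int]) -> dict:
    """Copy the selected experts (re-indexed 0..len(keep)-1) and their router rows."""
    prefix = f"model.layers.{layer}.block_sparse_moe"
    new_sd = {}
    for new_idx, old_idx in enumerate(keep):
        for proj in ("w1", "w2", "w3"):
            old_key = f"{prefix}.experts.{old_idx}.{proj}.weight"
            new_key = f"{prefix}.experts.{new_idx}.{proj}.weight"
            new_sd[new_key] = state_dict[old_key].clone()
    # Keep only the router (gate) rows that score the selected experts.
    gate = state_dict[f"{prefix}.gate.weight"]
    new_sd[f"{prefix}.gate.weight"] = gate[keep].clone()
    return new_sd
```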
Thank you!
We changed the merge method to slerp and were able to improve the quality of the output :)
Tremendous!
Yes, I've had to use slerp and other more complicated merge methods (gradient merge, TIES merge) to preserve as much of the two models' nuances as possible. It's likely that with an MoE model you need to preserve as much of the differences between experts as possible.
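As a rough illustration of the TIES idea (trim small deltas relative to a shared base, elect a sign per parameter, then average only the agreeing deltas), here is a minimal sketch; the function name and density value are assumptions for illustration, not any library's API:

```python
import torch

def ties_merge(base: torch.Tensor, experts: list[torch.Tensor], density: float = 0.2) -> torch.Tensor:
    """TIES-style merge of several expert tensors onto a shared base tensor."""
    deltas = []
    for w in experts:
        d = (w - base).flatten().float()
        # Trim: keep only the top-k largest-magnitude entries of each delta.
        k = max(1, int(density * d.numel()))
        threshold = d.abs().kthvalue(d.numel() - k + 1).values
        deltas.append(torch.where(d.abs() >= threshold, d, torch.zeros_like(d)))
    stacked = torch.stack(deltas)
    # Elect a sign per parameter from the summed deltas, then average only
    # the contributions that agree with that sign.
    sign = torch.sign(stacked.sum(dim=0))
    agree = (torch.sign(stacked) == sign) & (stacked != 0)
    merged_delta = (stacked * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1)
    return (base.flatten().float() + merged_delta).reshape(base.shape).to(base.dtype)
```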