<p align="center">
 <a href="https://huggingface.co/Crystalcareai/GemMoE-Medium-v0.4/blob/main/imgs/F640kPICQTWs0yRFz3hVpQ.png">
   <img src="https://huggingface.co/Crystalcareai/GemMoE-Medium-v0.4/resolve/main/imgs/F640kPICQTWs0yRFz3hVpQ.png" alt="GemMoE-Logo" border="0">
 </a>
</p>

# GemMoE: A New MoE Method via Branch Train Mix

I would like to introduce GemMoE, a new Mixture of Experts (MoE) method built on a custom implementation of Meta's Branch-Train-MiX (BTX) method. This approach, detailed in the research paper ["Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM"](https://arxiv.org/abs/2403.07816), has allowed me to create an efficient model that overcomes the limitations of previous GemMoE models.

GemMoE is a 4x8.5B MoE model: 4 experts that were trained separately and then combined using a custom fork of axolotl. This fork enabled me to freeze all experts and train only the router mechanism. The router was trained for 4 epochs on my Self-Discover-MM dataset and 2 epochs on TruthyDPO from Jon Durbin.
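To make the "frozen experts, trainable router" idea concrete, here is a toy sketch in plain Python. It is not the axolotl fork or the real training code: the four scalar "experts", the one-weight-per-expert router, the learning rate, and the data are all hypothetical stand-ins chosen only to show that the router can learn while the experts stay untouched.

```python
import math

# Hypothetical frozen experts: fixed functions that are never updated.
EXPERTS = (
    lambda x: 2.0 * x,
    lambda x: -1.0 * x,
    lambda x: x + 3.0,
    lambda x: 0.5 * x,
)

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(router_w, x):
    # Router: one logit per expert (w_i * x), turned into gate weights.
    gates = softmax([w * x for w in router_w])
    outs = [expert(x) for expert in EXPERTS]
    y = sum(g * o for g, o in zip(gates, outs))
    return y, gates, outs

def mean_loss(router_w, data):
    return sum((forward(router_w, x)[0] - t) ** 2 for x, t in data) / len(data)

def train_router(data, steps=200, lr=0.01):
    # Only router_w is updated; EXPERTS is never modified.
    router_w = [0.0] * len(EXPERTS)
    for _ in range(steps):
        grads = [0.0] * len(router_w)
        for x, target in data:
            y, gates, outs = forward(router_w, x)
            err = y - target
            for i in range(len(router_w)):
                # dy/dlogit_i = gate_i * (out_i - y), dlogit_i/dw_i = x
                grads[i] += 2.0 * err * gates[i] * (outs[i] - y) * x / len(data)
        router_w = [w - lr * g for w, g in zip(router_w, grads)]
    return router_w

# Targets match expert 0, so the router should learn to favour it.
data = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
trained = train_router(data)
print(mean_loss([0.0] * len(EXPERTS), data), mean_loss(trained, data))
```

In a real framework the same effect is usually achieved by disabling gradients on the expert parameters and passing only the router parameters to the optimizer.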

One of the main differences between GemMoE and previous versions is the use of token-level routing instead of semantic routing: the router picks experts for each token individually. This approach, similar to the one used in Mixtral, results in improved VRAM usage and competitive performance for its size.
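As a rough illustration of token-level routing, the sketch below picks the highest-scoring experts for each token and normalizes their weights with a softmax over the selected logits, which is how Mixtral-style routers are commonly described. The function name `route_tokens`, the example logits, and `top_k=2` are assumptions for illustration; this card does not state how many experts GemMoE activates per token.

```python
import math

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_tokens(router_logits, top_k=2):
    """Return, for each token, a list of (expert_index, weight) pairs.

    router_logits holds one list of per-expert scores for every token.
    """
    assignments = []
    for logits in router_logits:
        # Indices of the top_k highest-scoring experts for this token.
        top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
        # Normalize only over the selected experts so the weights sum to 1.
        weights = softmax([logits[i] for i in top])
        assignments.append(list(zip(top, weights)))
    return assignments

# Two tokens scored against four experts (made-up numbers).
example_logits = [
    [0.0, 2.0, 1.0, -1.0],
    [3.0, 0.5, 0.1, 2.0],
]
routing = route_tokens(example_logits)
for token_index, picks in enumerate(routing):
    print(token_index, picks)
```

Because only the selected experts run for a given token, the compute per token stays well below that of running all experts, even though every expert's weights are still loaded.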

## The Branch Train Mix Method

The Branch Train Mix method offers several potential benefits over traditional MoE training approaches:

1. Improved training stability and convergence
2. Reduced computational cost and memory usage
3. Enhanced model performance and generalization

By using this training procedure, GemMoE aims to achieve competitive results while maintaining a compact and efficient architecture.

## Collaboration and Open-Source Development

GemMoE builds upon the work of researchers and developers from various organizations, including Meta, Hugging Face, and the broader open-source community. I would like to thank the researchers behind the Branch Train Mix method, as well as the developers of axolotl and Jon Durbin for creating TruthyDPO.

## Future Development

As GemMoE continues to develop, I am open to collaborating with the community to further refine and improve the model. By sharing knowledge, insights, and resources, we can explore the potential of MoE architectures and advance the field of natural language processing.

I invite researchers, developers, and enthusiasts to explore GemMoE, provide feedback, and contribute to its ongoing development. Together, we can work towards creating powerful tools that benefit society as a whole.

Thank you for your interest in GemMoE, and I look forward to seeing the applications and discoveries that emerge from this project.

-Lucas