DS-MoE: Making MoE Models More Efficient and Less Memory-Intensive
Mixture-of-Experts (MoE) language models are known for reducing compute by 2 to 4 times compared to traditional dense models without sacrificing performance, which makes them especially useful when computing resources are limited. However, an MoE model typically needs 2 to 4 times more parameters to match a dense model: for example, DeepSeekMoE-16B and Qwen1.5-MoE-A2.7B each carry roughly 16B total parameters to match the performance of a 7B dense model. This large parameter count translates into larger GPU memory requirements, which makes MoE models less efficient in I/O-bounded scenarios like autoregressive generation.
Do MoE models really need to be this large to achieve high performance? Can we build an MoE model that keeps the performance while using fewer parameters and less compute? Enter DS-MoE. It matches dense models in performance while using about one-third of the compute, and only about half as many parameters as other MoE models.
The idea behind DS-MoE is to train the experts densely while pushing the model's routers to gradually ignore the experts that are unnecessary for a given token. To this end, we add a Mutual Information (MI) loss to the training objective, which balances the load of each expert across the entire batch while also encouraging each input token to concentrate its gating probability on fewer experts.
The MI loss is defined as

$$\mathcal{L}_{\mathrm{MI}} = -H(e) + H(e \mid x) = \sum_{e} p(e)\log p(e) \;-\; \frac{1}{|X|}\sum_{x \in X}\sum_{e} p(e \mid x)\log p(e \mid x),$$

where $X$ denotes the tokens in a minibatch, $e$ denotes the experts, $p(e \mid x)$ is the router's gating distribution for token $x$, and $p(e) = \frac{1}{|X|}\sum_{x \in X} p(e \mid x)$ is its average over the batch. Intuitively, maximizing $H(e)$ balances the load of each expert across the entire batch, and minimizing $H(e \mid x)$ encourages each input $x$ to concentrate its gating probability on fewer experts.
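A minimal PyTorch sketch of this objective, assuming the router produces a matrix of per-token logits over experts (the function name and tensor shapes are illustrative, not the exact DS-MoE implementation):

```python
import torch
import torch.nn.functional as F

def mi_loss(router_logits: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """Mutual-information loss sketch: -H(e) + H(e|x).

    router_logits: (num_tokens, num_experts) raw router scores for a minibatch.
    Maximizing H(e) balances expert load over the batch; minimizing H(e|x)
    pushes each token's gating distribution toward fewer experts.
    """
    # p(e|x): per-token gating distribution over experts
    p_e_given_x = F.softmax(router_logits, dim=-1)            # (T, E)
    # p(e): average gating distribution over the batch
    p_e = p_e_given_x.mean(dim=0)                             # (E,)
    # H(e): entropy of the batch-level expert distribution
    h_e = -(p_e * (p_e + eps).log()).sum()
    # H(e|x): mean per-token entropy of the gating distribution
    h_e_given_x = -(p_e_given_x * (p_e_given_x + eps).log()).sum(dim=-1).mean()
    return -h_e + h_e_given_x
```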
During inference, DS-MoE activates only the top-K experts by router score. K is either a fixed, predefined value or chosen adaptively as the number of experts whose scores exceed a certain threshold. As a result, DS-MoE performs as well as similarly sized dense models while using far fewer active parameters, as demonstrated in the table below.
| Model | HellaSwag | PIQA | WinoGrande | SciQ | Arc-e | Arc-c | Avg. Perf. | Active Params |
|---|---|---|---|---|---|---|---|---|
| Dense-3B | 40.4 | 71.4 | 58.7 | 86.0 | 59.6 | 26.1 | 57.0 | 2705M |
| SMoE-5B | 40.1 | 70.7 | 56.5 | 85.6 | 58.4 | 24.8 | 56.0 | 1212M |
| DS-MoE-3B | 39.3 | 71.6 | 57.9 | 85.6 | 57.7 | 24.9 | 56.2 | 934M |
| Dense-6B | 44.3 | 72.2 | 59.9 | 88.0 | 62.9 | 27.9 | 59.2 | 6186M |
| DS-MoE-6B | 43.5 | 73.0 | 57.9 | 86.9 | 61.9 | 27.9 | 58.5 | 1813M |
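To make the inference-time expert selection concrete, here is a sketch of the per-token choice (the function name, the 0.05 threshold, and the renormalization of the kept gating weights are assumptions for illustration, not the exact DS-MoE routine):

```python
from typing import Optional

import torch
import torch.nn.functional as F

def select_experts(router_logits: torch.Tensor,
                   k: Optional[int] = None,
                   threshold: float = 0.05):
    """Choose the active experts for a single token at inference time.

    If `k` is given, keep a fixed top-K; otherwise keep every expert whose
    gating score exceeds `threshold` (the adaptive variant). Returns the kept
    expert indices and their renormalized gating weights.
    """
    scores = F.softmax(router_logits, dim=-1)           # (num_experts,)
    if k is None:
        # Adaptive: the number of experts depends on how many clear the threshold.
        idx = (scores > threshold).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:                            # always keep at least one expert
            idx = scores.argmax().unsqueeze(0)
    else:
        idx = scores.topk(k).indices
    weights = scores[idx] / scores[idx].sum()           # renormalize over kept experts
    return idx, weights
```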
We also tested DS-MoE with vLLM to compare processing speed and memory usage against other models at the 7B performance tier. We measured how many requests and tokens each model could handle per second, using 1,000-token inputs, 1,000-token outputs, and GPU memory utilization capped at 90%.
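As a rough illustration of this setup (not the exact benchmark harness), a measurement along these lines can be written against vLLM's Python API; the model path, batch size, and prompt contents below are placeholders:

```python
import time

from vllm import LLM, SamplingParams

# Sketch of the throughput setup described above: ~1,000-token prompts,
# 1,000-token outputs, GPU memory utilization capped at 90%.
llm = LLM(model="path/to/ds-moe-6b", gpu_memory_utilization=0.9)
sampling = SamplingParams(max_tokens=1000, ignore_eos=True)

prompts = ["..."] * 64  # placeholder: a batch of ~1,000-token prompts

start = time.time()
outputs = llm.generate(prompts, sampling)
elapsed = time.time() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"requests/s: {len(prompts) / elapsed:.2f}")
print(f"tokens/s:   {generated / elapsed:.1f}")
```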
| Model | Total Params | Active Params | Model Memory | A100 Throughput (req/s) | A100 TPS (tokens/s) | H100 Throughput (req/s) | H100 TPS (tokens/s) |
|---|---|---|---|---|---|---|---|
| Dense-6B | 6.4B | 6.4B | 12.3 GiB | 1.04 | 2079.8 | 1.40 | 2808.7 |
| Mistral-7B | 7.2B | 7.2B | 13.5 GiB | 1.07 | 2140.8 | 1.52 | 3047.4 |
| DeepSeekMoE | 17.3B | 2.8B | 30.5 GiB | 1.17 | 2330.1 | 1.57 | 3144.1 |
| Qwen1.5-MoE | 16.4B | 2.7B | 26.7 GiB | 1.33 | 2665.7 | 1.81 | 3616.9 |
| DS-MoE-6B | 6.5B | 2.2B | 12.6 GiB | 2.00 | 3992.8 | 2.30 | 4603.9 |
The test shows that DS-MoE beats dense models on computational cost and sparsely trained MoEs on model memory, leading to faster processing in both computation-bounded and I/O-bounded scenarios. Note that DS-MoE-6B is not yet comparable with the other models in downstream performance, because it was trained on only 100 billion tokens (versus the trillions used for the other models). Nevertheless, DS-MoE has shown significant promise in reaching the performance of dense models given a comparable volume of training data.