QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models • arXiv:2310.16795 • Published Oct 25, 2023 • 26 upvotes
Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference • arXiv:2308.12066 • Published Aug 23, 2023 • 4 upvotes
Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference • arXiv:2303.06182 • Published Mar 10, 2023 • 1 upvote
EvoMoE: An Evolutional Mixture-of-Experts Training Framework via Dense-To-Sparse Gate • arXiv:2112.14397 • Published Dec 29, 2021 • 1 upvote
Experts Weights Averaging: A New General Training Scheme for Vision Transformers • arXiv:2308.06093 • Published Aug 11, 2023 • 2 upvotes
ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer • arXiv:2306.06446 • Published Jun 10, 2023 • 1 upvote
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints • arXiv:2212.05055 • Published Dec 9, 2022 • 5 upvotes
SMILE: Scaling Mixture-of-Experts with Efficient Bi-level Routing • arXiv:2212.05191 • Published Dec 10, 2022 • 1 upvote
Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient for Convolutional Neural Networks • arXiv:2306.04073 • Published Jun 7, 2023 • 2 upvotes
Multi-Head Adapter Routing for Cross-Task Generalization • arXiv:2211.03831 • Published Nov 7, 2022 • 2 upvotes
Improving Visual Prompt Tuning for Self-supervised Vision Transformers • arXiv:2306.05067 • Published Jun 8, 2023 • 2 upvotes
A Unified View of Long-Sequence Models towards Modeling Million-Scale Dependencies • arXiv:2302.06218 • Published Feb 13, 2023 • 1 upvote
Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception • arXiv:2305.06324 • Published May 10, 2023 • 1 upvote
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale • arXiv:2201.05596 • Published Jan 14, 2022 • 2 upvotes
Approximating Two-Layer Feedforward Networks for Efficient Transformers • arXiv:2310.10837 • Published Oct 16, 2023 • 10 upvotes
Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers • arXiv:2303.13755 • Published Mar 24, 2023 • 1 upvote
Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation • arXiv:2310.15961 • Published Oct 24, 2023 • 1 upvote
Build a Robust QA System with Transformer-based Mixture of Experts • arXiv:2204.09598 • Published Mar 20, 2022 • 1 upvote
PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts • arXiv:2305.14839 • Published May 24, 2023 • 1 upvote
A Mixture-of-Expert Approach to RL-based Dialogue Management • arXiv:2206.00059 • Published May 31, 2022 • 1 upvote
SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System • arXiv:2205.10034 • Published May 20, 2022 • 1 upvote
Eliciting and Understanding Cross-Task Skills with Task-Level Mixture-of-Experts • arXiv:2205.12701 • Published May 25, 2022 • 1 upvote
FEAMOE: Fair, Explainable and Adaptive Mixture of Experts • arXiv:2210.04995 • Published Oct 10, 2022 • 1 upvote
HMOE: Hypernetwork-based Mixture of Experts for Domain Generalization • arXiv:2211.08253 • Published Nov 15, 2022 • 1 upvote
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity • arXiv:2101.03961 • Published Jan 11, 2021 • 14 upvotes
A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training • arXiv:2303.06318 • Published Mar 11, 2023 • 1 upvote
Mixture-of-Supernets: Improving Weight-Sharing Supernet Training with Architecture-Routed Mixture-of-Experts • arXiv:2306.04845 • Published Jun 8, 2023 • 4 upvotes
AutoMoE: Heterogeneous Mixture-of-Experts with Adaptive Computation for Efficient Neural Machine Translation • arXiv:2210.07535 • Published Oct 14, 2022 • 1 upvote
Optimizing Mixture of Experts using Dynamic Recompilations • arXiv:2205.01848 • Published May 4, 2022 • 1 upvote
Towards Understanding Mixture of Experts in Deep Learning • arXiv:2208.02813 • Published Aug 4, 2022 • 1 upvote
Learning Factored Representations in a Deep Mixture of Experts • arXiv:1312.4314 • Published Dec 16, 2013 • 1 upvote
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer • arXiv:1701.06538 • Published Jan 23, 2017 • 4 upvotes
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts • arXiv:2112.06905 • Published Dec 13, 2021 • 1 upvote
Contextual Mixture of Experts: Integrating Knowledge into Predictive Modeling • arXiv:2211.00558 • Published Nov 1, 2022 • 1 upvote
Taming Sparsely Activated Transformer with Stochastic Experts • arXiv:2110.04260 • Published Oct 8, 2021 • 2 upvotes
Heterogeneous Multi-task Learning with Expert Diversity • arXiv:2106.10595 • Published Jun 20, 2021 • 1 upvote
SteloCoder: a Decoder-Only LLM for Multi-Language to Python Code Translation • arXiv:2310.15539 • Published Oct 24, 2023 • 1 upvote
LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition • arXiv:2307.13269 • Published Jul 25, 2023 • 31 upvotes
SkillNet-NLG: General-Purpose Natural Language Generation with a Sparsely Activated Approach • arXiv:2204.12184 • Published Apr 26, 2022 • 1 upvote
SkillNet-NLU: A Sparsely Activated Model for General-Purpose Natural Language Understanding • arXiv:2203.03312 • Published Mar 7, 2022 • 1 upvote
Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models • arXiv:2203.01104 • Published Mar 2, 2022 • 2 upvotes
Emergent Mixture-of-Experts: Can Dense Pre-trained Transformers Benefit from Emergent Modular Structures? • arXiv:2310.10908 • Published Oct 17, 2023 • 1 upvote
One Student Knows All Experts Know: From Sparse to Dense • arXiv:2201.10890 • Published Jan 26, 2022 • 1 upvote
HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System • arXiv:2203.14685 • Published Mar 28, 2022 • 1 upvote
Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts • arXiv:2305.18691 • Published May 30, 2023 • 1 upvote
An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training • arXiv:2306.17165 • Published Jun 29, 2023 • 1 upvote
MegaBlocks: Efficient Sparse Training with Mixture-of-Experts • arXiv:2211.15841 • Published Nov 29, 2022 • 7 upvotes
SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts • arXiv:2105.03036 • Published May 7, 2021 • 2 upvotes
Language-Routing Mixture of Experts for Multilingual and Code-Switching Speech Recognition • arXiv:2307.05956 • Published Jul 12, 2023 • 1 upvote
Sparsely-gated Mixture-of-Expert Layers for CNN Interpretability • arXiv:2204.10598 • Published Apr 22, 2022 • 2 upvotes
Efficient Large Scale Language Modeling with Mixtures of Experts • arXiv:2112.10684 • Published Dec 20, 2021 • 2 upvotes
TAME: Task Agnostic Continual Learning using Multiple Experts • arXiv:2210.03869 • Published Oct 8, 2022 • 1 upvote
Learning an evolved mixture model for task-free continual learning • arXiv:2207.05080 • Published Jul 11, 2022 • 1 upvote
Model Spider: Learning to Rank Pre-Trained Models Efficiently • arXiv:2306.03900 • Published Jun 6, 2023 • 1 upvote
Task-Specific Expert Pruning for Sparse Mixture-of-Experts • arXiv:2206.00277 • Published Jun 1, 2022 • 1 upvote
Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning • arXiv:2309.05444 • Published Sep 11, 2023 • 1 upvote
Image Super-resolution Via Latent Diffusion: A Sampling-space Mixture Of Experts And Frequency-augmented Decoder Approach • arXiv:2310.12004 • Published Oct 18, 2023 • 2 upvotes
A General Theory for Softmax Gating Multinomial Logistic Mixture of Experts • arXiv:2310.14188 • Published Oct 22, 2023 • 1 upvote
Extending Mixture of Experts Model to Investigate Heterogeneity of Trajectories: When, Where and How to Add Which Covariates • arXiv:2007.02432 • Published Jul 5, 2020 • 1 upvote
Mixture of experts models for multilevel data: modelling framework and approximation theory • arXiv:2209.15207 • Published Sep 30, 2022 • 1 upvote
ComPEFT: Compression for Communicating Parameter Efficient Updates via Sparsification and Quantization • arXiv:2311.13171 • Published Nov 22, 2023 • 1 upvote
The Information Pathways Hypothesis: Transformers are Dynamic Self-Ensembles • arXiv:2306.01705 • Published Jun 2, 2023 • 1 upvote
Scaling Expert Language Models with Unsupervised Domain Discovery • arXiv:2303.14177 • Published Mar 24, 2023 • 2 upvotes
Memory-efficient NLLB-200: Language-specific Expert Pruning of a Massively Multilingual Machine Translation Model • arXiv:2212.09811 • Published Dec 19, 2022 • 1 upvote
Exploiting Transformer Activation Sparsity with Dynamic Inference • arXiv:2310.04361 • Published Oct 6, 2023 • 1 upvote
Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness • arXiv:2310.02410 • Published Oct 3, 2023 • 1 upvote
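
Common to many of the papers in this collection is the sparse top-k routing pattern introduced by the Sparsely-Gated Mixture-of-Experts layer (arXiv:1701.06538) and reduced to top-1 routing in Switch Transformers (arXiv:2101.03961): a small router scores the experts for each token, only the k highest-scoring experts run, and their outputs are combined with the renormalized router weights. The sketch below illustrates that pattern in PyTorch as a reading aid; the class name, dimensions, and renormalization choice are illustrative assumptions, not the implementation of any listed paper.

```python
# Minimal sketch of sparse top-k MoE routing (illustrative only; names and
# sizes are assumptions, not taken from any paper in this collection).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)   # per-token gating scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                              # x: (tokens, d_model)
        logits = self.router(x)                        # (tokens, n_experts)
        weights, idx = torch.topk(logits, self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)           # renormalize over the k chosen experts
        out = torch.zeros_like(x)
        # Dense loop over experts for readability; real systems dispatch tokens sparsely.
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = idx[:, slot] == e               # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Example: route 16 tokens of width 512 through 8 experts, 2 active per token.
tokens = torch.randn(16, 512)
print(TopKMoE()(tokens).shape)  # torch.Size([16, 512])
```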