QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models • arXiv:2310.16795 • Published Oct 25, 2023 • 26 upvotes
Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference • arXiv:2308.12066 • Published Aug 23, 2023 • 4 upvotes
Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference • arXiv:2303.06182 • Published Mar 10, 2023 • 1 upvote
EvoMoE: An Evolutional Mixture-of-Experts Training Framework via Dense-To-Sparse Gate • arXiv:2112.14397 • Published Dec 29, 2021 • 1 upvote
Experts Weights Averaging: A New General Training Scheme for Vision Transformers • arXiv:2308.06093 • Published Aug 11, 2023 • 2 upvotes
ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer • arXiv:2306.06446 • Published Jun 10, 2023 • 1 upvote
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints • arXiv:2212.05055 • Published Dec 9, 2022 • 5 upvotes
SMILE: Scaling Mixture-of-Experts with Efficient Bi-level Routing • arXiv:2212.05191 • Published Dec 10, 2022 • 1 upvote
Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient for Convolutional Neural Networks • arXiv:2306.04073 • Published Jun 7, 2023 • 2 upvotes
Multi-Head Adapter Routing for Cross-Task Generalization • arXiv:2211.03831 • Published Nov 7, 2022 • 2 upvotes
Improving Visual Prompt Tuning for Self-supervised Vision Transformers • arXiv:2306.05067 • Published Jun 8, 2023 • 2 upvotes
A Unified View of Long-Sequence Models towards Modeling Million-Scale Dependencies • arXiv:2302.06218 • Published Feb 13, 2023 • 1 upvote
Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception • arXiv:2305.06324 • Published May 10, 2023 • 1 upvote
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale • arXiv:2201.05596 • Published Jan 14, 2022 • 2 upvotes
Approximating Two-Layer Feedforward Networks for Efficient Transformers • arXiv:2310.10837 • Published Oct 16, 2023 • 10 upvotes
Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers • arXiv:2303.13755 • Published Mar 24, 2023 • 1 upvote
Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation • arXiv:2310.15961 • Published Oct 24, 2023 • 1 upvote
Build a Robust QA System with Transformer-based Mixture of Experts • arXiv:2204.09598 • Published Mar 20, 2022 • 1 upvote
PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts • arXiv:2305.14839 • Published May 24, 2023 • 1 upvote
A Mixture-of-Expert Approach to RL-based Dialogue Management • arXiv:2206.00059 • Published May 31, 2022 • 1 upvote
SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System • arXiv:2205.10034 • Published May 20, 2022 • 1 upvote
Eliciting and Understanding Cross-Task Skills with Task-Level Mixture-of-Experts • arXiv:2205.12701 • Published May 25, 2022 • 1 upvote
FEAMOE: Fair, Explainable and Adaptive Mixture of Experts • arXiv:2210.04995 • Published Oct 10, 2022 • 1 upvote
HMOE: Hypernetwork-based Mixture of Experts for Domain Generalization • arXiv:2211.08253 • Published Nov 15, 2022 • 1 upvote
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity • arXiv:2101.03961 • Published Jan 11, 2021 • 14 upvotes
A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training • arXiv:2303.06318 • Published Mar 11, 2023 • 1 upvote
Mixture-of-Supernets: Improving Weight-Sharing Supernet Training with Architecture-Routed Mixture-of-Experts • arXiv:2306.04845 • Published Jun 8, 2023 • 4 upvotes
AutoMoE: Heterogeneous Mixture-of-Experts with Adaptive Computation for Efficient Neural Machine Translation • arXiv:2210.07535 • Published Oct 14, 2022 • 1 upvote
Optimizing Mixture of Experts using Dynamic Recompilations • arXiv:2205.01848 • Published May 4, 2022 • 1 upvote
Towards Understanding Mixture of Experts in Deep Learning • arXiv:2208.02813 • Published Aug 4, 2022 • 1 upvote
Learning Factored Representations in a Deep Mixture of Experts • arXiv:1312.4314 • Published Dec 16, 2013 • 1 upvote
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer • arXiv:1701.06538 • Published Jan 23, 2017 • 4 upvotes
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts • arXiv:2112.06905 • Published Dec 13, 2021 • 1 upvote
Contextual Mixture of Experts: Integrating Knowledge into Predictive Modeling • arXiv:2211.00558 • Published Nov 1, 2022 • 1 upvote
Taming Sparsely Activated Transformer with Stochastic Experts • arXiv:2110.04260 • Published Oct 8, 2021 • 2 upvotes
Heterogeneous Multi-task Learning with Expert Diversity • arXiv:2106.10595 • Published Jun 20, 2021 • 1 upvote
SteloCoder: a Decoder-Only LLM for Multi-Language to Python Code Translation • arXiv:2310.15539 • Published Oct 24, 2023 • 1 upvote
LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition • arXiv:2307.13269 • Published Jul 25, 2023 • 31 upvotes
SkillNet-NLG: General-Purpose Natural Language Generation with a Sparsely Activated Approach • arXiv:2204.12184 • Published Apr 26, 2022 • 1 upvote
SkillNet-NLU: A Sparsely Activated Model for General-Purpose Natural Language Understanding • arXiv:2203.03312 • Published Mar 7, 2022 • 1 upvote
Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models • arXiv:2203.01104 • Published Mar 2, 2022 • 2 upvotes
Emergent Mixture-of-Experts: Can Dense Pre-trained Transformers Benefit from Emergent Modular Structures? • arXiv:2310.10908 • Published Oct 17, 2023 • 1 upvote
One Student Knows All Experts Know: From Sparse to Dense • arXiv:2201.10890 • Published Jan 26, 2022 • 1 upvote
HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System • arXiv:2203.14685 • Published Mar 28, 2022 • 1 upvote
Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts • arXiv:2305.18691 • Published May 30, 2023 • 1 upvote
An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training • arXiv:2306.17165 • Published Jun 29, 2023 • 1 upvote
MegaBlocks: Efficient Sparse Training with Mixture-of-Experts • arXiv:2211.15841 • Published Nov 29, 2022 • 7 upvotes
SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts • arXiv:2105.03036 • Published May 7, 2021 • 2 upvotes
Language-Routing Mixture of Experts for Multilingual and Code-Switching Speech Recognition • arXiv:2307.05956 • Published Jul 12, 2023 • 1 upvote
Sparsely-gated Mixture-of-Expert Layers for CNN Interpretability • arXiv:2204.10598 • Published Apr 22, 2022 • 2 upvotes
Efficient Large Scale Language Modeling with Mixtures of Experts • arXiv:2112.10684 • Published Dec 20, 2021 • 2 upvotes
TAME: Task Agnostic Continual Learning using Multiple Experts • arXiv:2210.03869 • Published Oct 8, 2022 • 1 upvote
Learning an evolved mixture model for task-free continual learning • arXiv:2207.05080 • Published Jul 11, 2022 • 1 upvote
Model Spider: Learning to Rank Pre-Trained Models Efficiently • arXiv:2306.03900 • Published Jun 6, 2023 • 1 upvote
Task-Specific Expert Pruning for Sparse Mixture-of-Experts • arXiv:2206.00277 • Published Jun 1, 2022 • 1 upvote
Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning • arXiv:2309.05444 • Published Sep 11, 2023 • 1 upvote
Image Super-resolution Via Latent Diffusion: A Sampling-space Mixture Of Experts And Frequency-augmented Decoder Approach • arXiv:2310.12004 • Published Oct 18, 2023 • 2 upvotes
A General Theory for Softmax Gating Multinomial Logistic Mixture of Experts • arXiv:2310.14188 • Published Oct 22, 2023 • 1 upvote
Extending Mixture of Experts Model to Investigate Heterogeneity of Trajectories: When, Where and How to Add Which Covariates • arXiv:2007.02432 • Published Jul 5, 2020 • 1 upvote
Mixture of experts models for multilevel data: modelling framework and approximation theory • arXiv:2209.15207 • Published Sep 30, 2022 • 1 upvote
ComPEFT: Compression for Communicating Parameter Efficient Updates via Sparsification and Quantization • arXiv:2311.13171 • Published Nov 22, 2023 • 1 upvote
The Information Pathways Hypothesis: Transformers are Dynamic Self-Ensembles • arXiv:2306.01705 • Published Jun 2, 2023 • 1 upvote
Scaling Expert Language Models with Unsupervised Domain Discovery • arXiv:2303.14177 • Published Mar 24, 2023 • 2 upvotes
Memory-efficient NLLB-200: Language-specific Expert Pruning of a Massively Multilingual Machine Translation Model • arXiv:2212.09811 • Published Dec 19, 2022 • 1 upvote
Exploiting Transformer Activation Sparsity with Dynamic Inference • arXiv:2310.04361 • Published Oct 6, 2023 • 1 upvote
Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness • arXiv:2310.02410 • Published Oct 3, 2023 • 1 upvote
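
Common to many of the papers in this collection is the sparse top-k routing pattern introduced by the Sparsely-Gated Mixture-of-Experts layer (arXiv:1701.06538) and reduced to top-1 routing in Switch Transformers (arXiv:2101.03961): a small router scores the experts for each token, only the k highest-scoring experts run, and their outputs are combined with the renormalized router weights. The sketch below illustrates that pattern in PyTorch as a reading aid; the class name, dimensions, and renormalization choice are illustrative assumptions, not the implementation of any listed paper.

```python
# Minimal sketch of sparse top-k MoE routing (illustrative only; names and
# sizes are assumptions, not taken from any paper in this collection).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)   # per-token gating scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                              # x: (tokens, d_model)
        logits = self.router(x)                        # (tokens, n_experts)
        weights, idx = torch.topk(logits, self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)           # renormalize over the k chosen experts
        out = torch.zeros_like(x)
        # Dense loop over experts for readability; real systems dispatch tokens sparsely.
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = idx[:, slot] == e               # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Example: route 16 tokens of width 512 through 8 experts, 2 active per token.
tokens = torch.randn(16, 512)
print(TopKMoE()(tokens).shape)  # torch.Size([16, 512])
```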