Text Generation
Transformers
Inference Endpoints
mikecovlee committed on
Commit 1828627
1 Parent(s): a3911ab

Update README.md

Files changed (1)
  1. README.md +9 -15
README.md CHANGED
@@ -10,27 +10,21 @@ pipeline_tag: text-generation
---
# MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts

- [![arXiv](https://img.shields.io/badge/arXiv-2404.15159-b31b1b.svg)](https://arxiv.org/abs/2404.15159)
- [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/mixlora-enhancing-large-language-models-fine/question-answering-on-social-iqa)](https://paperswithcode.com/sota/question-answering-on-social-iqa?p=mixlora-enhancing-large-language-models-fine)
- [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/mixlora-enhancing-large-language-models-fine/question-answering-on-piqa)](https://paperswithcode.com/sota/question-answering-on-piqa?p=mixlora-enhancing-large-language-models-fine)
-
<div align="left"><img src="MixLoRA.png" width="60%"></div>

- Large Language Models (LLMs) have showcased exceptional performance across a wide array of Natural Language Processing (NLP) tasks. Fine-tuning techniques are commonly utilized to tailor pre-trained models to specific applications. While methods like LoRA have effectively tackled GPU memory constraints during fine-tuning, their applicability is often restricted. On the other hand, Mixture-of-Experts (MoE) models, such as Mixtral 8x7B, demonstrate remarkable performance while maintaining a reduced parameter count. However, the resource requirements of these models pose challenges, particularly for consumer-grade GPUs.
-
- To address this challenge, we propose MixLoRA, an innovative approach aimed at constructing a resource-efficient sparse MoE model. MixLoRA inserts multiple LoRA-based experts within the feed-forward network block of a frozen pre-trained dense model through fine-tuning, employing a top-k routing strategy. Unlike other LoRA MoE methods, MixLoRA enhances model performance by utilizing independently configurable attention-layer LoRA adapters, supporting LoRA and its variants for the construction of experts, and applying an auxiliary load balance loss to address the imbalance problem of the router.
-
- In experiments, MixLoRA achieves commendable performance across all evaluation metrics in both single-task and multi-task learning scenarios. Implemented within the m-LoRA framework, MixLoRA enables parallel fine-tuning, inference, and evaluation of multiple mixture-of-experts models on a single 24GB consumer-grade GPU without quantization, thereby reducing GPU memory consumption by 41% and latency during the training process by 17%.
-
- | PEFT Method | # Params (%) | ARC-e | ARC-c | BoolQ | OBQA | PIQA | AVG. |
- |-------------|--------------|-------|-------|-------|------|------|------|
- | LoRA        | 2.6%         | 73.8  | 50.9  | 62.2  | 80.4 | 69.9 | 67.4 |
- | DoRA        | 2.6%         | 76.5  | 59.8  | 71.7  | 80.6 | 78.8 | 73.5 |
- | **MixLoRA** | 2.6%         | 76.5  | 58.1  | 73.8  | 84.4 | 82.6 | 75.1 |
- | **MixDoRA** | 2.6%         | 78.3  | 59.6  | 74.2  | 84.4 | 83.6 | 76.0 |
+ Fine-tuning Large Language Models (LLMs) is a common practice to adapt pre-trained models for specific applications. While methods like LoRA effectively address GPU memory constraints during fine-tuning, their performance often falls short, especially in multi-task scenarios. In contrast, Mixture-of-Experts (MoE) models, such as Mixtral 8x7B, demonstrate remarkable performance in multi-task learning scenarios while maintaining a reduced parameter count. However, the resource requirements of these MoEs remain challenging, particularly for consumer-grade GPUs with less than 24GB of memory. To tackle these challenges, we propose MixLoRA, an approach for constructing a resource-efficient sparse MoE model based on LoRA. The figure above shows the architecture of the MixLoRA transformer block. MixLoRA inserts multiple LoRA-based experts within the feed-forward network block of a frozen pre-trained dense model and employs a commonly used top-k router. Unlike other LoRA-based MoE methods, MixLoRA enhances model performance by utilizing independent attention-layer LoRA adapters. Additionally, an auxiliary load balance loss is employed to address the imbalance problem of the router. Our evaluations show that MixLoRA improves accuracy by about 9% compared to state-of-the-art PEFT methods in multi-task learning scenarios.
+
+ | PEFT Method | # Params (%) | ARC-e | ARC-c | BoolQ | OBQA | PIQA | SIQA | HellaS | WinoG | AVG. |
+ |-------------|--------------|-------|-------|-------|------|------|------|--------|-------|------|
+ | LoRA        | 2.9%         | 73.8  | 50.9  | 62.2  | 80.4 | 82.1 | 69.9 | 88.4   | 66.8  | 71.8 |
+ | DoRA        | 2.9%         | 76.5  | 59.8  | 71.7  | 80.6 | 82.7 | 74.1 | 89.6   | 67.3  | 75.3 |
+ | **MixLoRA** | 2.9%         | 77.7  | 58.1  | 72.7  | 81.6 | 83.2 | 78.0 | 93.1   | 76.8  | **77.6** |
+ | **MixDoRA** | 2.9%         | 77.5  | 58.2  | 72.6  | 80.9 | 82.2 | 80.4 | 90.6   | 83.4  | **78.2** |

The table above presents the performance of MixLoRA and compares these results with those obtained by fine-tuning with LoRA and DoRA. The results demonstrate that the language model with MixLoRA achieves commendable performance across all evaluation tasks. All methods are fine-tuned and evaluated with [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) on m-LoRA, with all metrics reported as accuracy.

+ You can check the full experimental results, including other pre-trained models such as Gemma 2B, LLaMA3 8B, and LLaMA2 13B, together with detailed performance metrics, in our preprint paper: [Li D, Ma Y, Wang N, et al. MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts. arXiv preprint arXiv:2404.15159, 2024.](https://arxiv.org/abs/2404.15159)
+
## How to Use

  Please visit our GitHub repository: https://github.com/TUDB-Labs/MixLoRA
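
The paragraph added in this commit describes the MixLoRA block: LoRA-based experts inserted into the feed-forward network of a frozen dense model, a top-k router, and an auxiliary load-balance loss. The PyTorch sketch below is a minimal, illustrative rendering of that idea under simplifying assumptions, not the authors' implementation (see the GitHub repository above for the real code): it uses a plain two-projection FFN instead of LLaMA's gated variant, assumes a Switch-Transformer-style balance term, and introduces hypothetical names such as `LoRAAdapter` and `MixLoRAStyleFFN`.

```python
# Minimal, illustrative sketch of a MixLoRA-style FFN block (NOT the authors' code).
# Assumptions: a frozen dense FFN shared by all experts, one pair of LoRA adapters
# per expert, a softmax top-k router, and a Switch-style auxiliary load-balance loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAAdapter(nn.Module):
    """Low-rank delta (alpha / r) * B(A(x)), added on top of a frozen projection."""

    def __init__(self, in_dim: int, out_dim: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.a = nn.Linear(in_dim, r, bias=False)
        self.b = nn.Linear(r, out_dim, bias=False)
        nn.init.zeros_(self.b.weight)  # adapters start as a no-op
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.b(self.a(x)) * self.scaling


class MixLoRAStyleFFN(nn.Module):
    """Frozen dense FFN + LoRA experts selected per token by a top-k router."""

    def __init__(self, hidden: int, ffn: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.up = nn.Linear(hidden, ffn)    # frozen pre-trained projections
        self.down = nn.Linear(ffn, hidden)
        for p in list(self.up.parameters()) + list(self.down.parameters()):
            p.requires_grad_(False)
        self.router = nn.Linear(hidden, n_experts, bias=False)
        self.up_experts = nn.ModuleList([LoRAAdapter(hidden, ffn) for _ in range(n_experts)])
        self.down_experts = nn.ModuleList([LoRAAdapter(ffn, hidden) for _ in range(n_experts)])
        self.n_experts, self.top_k = n_experts, top_k

    def forward(self, x: torch.Tensor):
        tokens = x.reshape(-1, x.shape[-1])                    # (N, hidden)
        probs = F.softmax(self.router(tokens), dim=-1)         # (N, E)
        weights, chosen = probs.topk(self.top_k, dim=-1)       # (N, k)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over top-k

        out = torch.zeros_like(tokens)
        for e in range(self.n_experts):
            hit = (chosen == e).float()                        # (N, k)
            rows = hit.any(dim=-1).nonzero(as_tuple=True)[0] if hit.bool().any() else hit.new_empty(0).long()
            if rows.numel() == 0:
                continue
            w = (weights * hit)[rows].sum(dim=-1, keepdim=True)
            h = tokens[rows]
            inner = F.silu(self.up(h) + self.up_experts[e](h))  # frozen FFN + expert delta
            out[rows] += w * (self.down(inner) + self.down_experts[e](inner))

        # Auxiliary load-balance loss (assumed Switch-Transformer-style form):
        # n_experts * sum_e (fraction of dispatched tokens to e) * (mean router prob of e).
        dispatch_frac = F.one_hot(chosen, num_classes=self.n_experts).float().mean(dim=(0, 1))
        prob_frac = probs.mean(dim=0)
        aux_loss = self.n_experts * torch.sum(dispatch_frac * prob_frac)
        return out.reshape_as(x), aux_loss


if __name__ == "__main__":
    block = MixLoRAStyleFFN(hidden=256, ffn=1024, n_experts=4, top_k=2)
    y, aux = block(torch.randn(2, 16, 256))
    print(y.shape, float(aux))  # torch.Size([2, 16, 256]) and a scalar balance term
```

Because every expert adds only low-rank deltas on top of shared, frozen FFN weights, the extra trainable parameter count stays close to plain LoRA, which is the property the model card highlights for fitting on a single consumer-grade GPU.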