YikangS committed on
Commit
445846c
1 Parent(s): f1bc1f4

update model config and readme

README.md CHANGED
@@ -1,4 +1,5 @@
1
  ---
 
2
  base_model: jetmoe/jetmoe-8b
3
  tags:
4
  - alignment-handbook
@@ -15,58 +16,118 @@ model-index:
15
  results: []
16
  ---
17
 
18
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
19
- should probably proofread and complete it, then remove this comment. -->
 
 
 
20
 
21
- # jetmoe-8b-sft
22
 
23
- This model is a fine-tuned version of [jetmoe-8b](https://huggingface.co/jetmoe/jetmoe-8b) on the HuggingFaceH4/ultrachat_200k, the HuggingFaceH4/airoboros-3.2, the HuggingFaceH4/Code-Feedback, the HuggingFaceH4/orca-math-word-problems-200k, the HuggingFaceH4/SystemChat and the HuggingFaceH4/capybara datasets.
24
- It achieves the following results on the evaluation set:
25
- - Loss: 0.9952
26
 
27
- ## Model description
28
 
29
- More information needed
 
 
30
 
31
- ## Intended uses & limitations
32
 
33
- More information needed
34
 
35
- ## Training and evaluation data
36
 
37
- More information needed
38
 
39
- ## Training procedure
40
 
41
- ### Training hyperparameters
42
 
43
- The following hyperparameters were used during training:
44
- - learning_rate: 2e-05
45
- - train_batch_size: 4
46
- - eval_batch_size: 8
47
- - seed: 42
48
- - distributed_type: multi-GPU
49
- - num_devices: 8
50
- - gradient_accumulation_steps: 4
51
- - total_train_batch_size: 128
52
- - total_eval_batch_size: 64
53
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
54
- - lr_scheduler_type: cosine
55
- - lr_scheduler_warmup_ratio: 0.1
56
- - num_epochs: 3
57
 
58
- ### Training results
 
59
 
60
- | Training Loss | Epoch | Step | Validation Loss |
61
- |:-------------:|:-----:|:----:|:---------------:|
62
- | 1.2458 | 1.0 | 2049 | 0.9776 |
63
- | 1.1966 | 2.0 | 4099 | 0.9756 |
64
- | 1.1073 | 3.0 | 6147 | 0.9952 |
65
 
66
 
67
- ### Framework versions
68
 
69
- - Transformers 4.39.0.dev0
70
- - Pytorch 2.1.2
71
- - Datasets 2.14.6
72
- - Tokenizers 0.15.2
 
1
  ---
2
+ license: apache-2.0
3
  base_model: jetmoe/jetmoe-8b
4
  tags:
5
  - alignment-handbook
 
16
  results: []
17
  ---
18
 
19
+ <div align="center">
20
+ <div>&nbsp;</div>
21
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/641de0213239b631552713e4/ieHnwuczidNNoGRA_FN2y.png" width="500"/>
22
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/641de0213239b631552713e4/UOsk9_zcbHpCCy6kmryYM.png" width="530"/>
23
+ </div>
24
 
25
+ # JetMoE: Reaching LLaMA2 Performance with 0.1M Dollars
26
 
27
+ ## Key Messages
 
 
28
 
29
+ 1. JetMoE-8B was **trained for less than $0.1 million**<sup>1</sup> **yet outperforms LLaMA2-7B from Meta AI**, which has multi-billion-dollar training resources. LLM training can be **much cheaper than previously thought**.
30
 
31
+ 2. JetMoE-8B is **fully open-sourced and academia-friendly** because:
32
+ - It **only uses public datasets** for training, and the code is open-sourced. No proprietary resources are needed.
33
+ - It **can be finetuned with a very limited compute budget** (e.g., a consumer-grade GPU) that most labs can afford.
34
 
35
+ 3. JetMoE-8B **has only 2.2B active parameters** during inference, which drastically lowers the computational cost. Compared to a model with similar inference computation, like Gemma-2B, JetMoE-8B achieves consistently better performance.
36
 
37
+ <sup>1</sup> We used a 96×H100 GPU cluster for 2 weeks, which cost ~$0.08 million.
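The footnote's figures can be turned into GPU-hours with simple arithmetic. The sketch below uses only the numbers stated in the footnote (96 GPUs, 2 weeks, roughly $0.08 million); the per-GPU-hour rate it prints is derived from those numbers and is not a price quoted anywhere in the README.

```python
# Figures from the footnote: 96 H100 GPUs for 2 weeks, about $0.08 million.
NUM_GPUS = 96
DAYS = 14
TOTAL_COST_USD = 80_000  # approximate, per the footnote

# Total GPU-hours consumed by the run.
gpu_hours = NUM_GPUS * DAYS * 24

# Implied hourly rate: a derived value, not a figure stated in the README.
implied_rate = TOTAL_COST_USD / gpu_hours

print(f"{gpu_hours} GPU-hours, about ${implied_rate:.2f} per GPU-hour")
```

This works out to 32,256 GPU-hours, or about $2.48 per GPU-hour under the stated approximation.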
38
 
39
+ Website: [https://research.myshell.ai/jetmoe](https://research.myshell.ai/jetmoe)
40
 
41
+ HuggingFace: [https://huggingface.co/jetmoe/jetmoe-8b](https://huggingface.co/jetmoe/jetmoe-8b)
42
 
43
+ Online Demo on Lepton AI: [https://www.lepton.ai/playground/chat?model=jetmoe-8b-chat](https://www.lepton.ai/playground/chat?model=jetmoe-8b-chat)
44
 
45
+ ## Authors
46
 
47
+ The project is contributed by [Yikang Shen](https://scholar.google.com.hk/citations?user=qff5rRYAAAAJ), [Zhen Guo](https://zguo0525.github.io/), [Tianle Cai](https://www.tianle.website/#/) and [Zengyi Qin](https://www.qinzy.tech/). For technical inquiries, please contact [Yikang Shen](https://scholar.google.com.hk/citations?user=qff5rRYAAAAJ). For media and collaboration inquiries, please contact [Zengyi Qin](https://www.qinzy.tech/).
48
 
49
+ ## Collaboration
50
+ **If you have great ideas but need more resources (GPU, data, funding, etc.)**, you are welcome to contact **MyShell.ai** via [Zengyi Qin](https://www.qinzy.tech/). **MyShell.ai** is open to collaborations and actively supports high-quality open-source projects.
51
 
52
+ ## Benchmarks
53
+ We use the same evaluation methodology as the Open LLM Leaderboard. For the MBPP code benchmark, we use the same evaluation methodology as the LLaMA2 and DeepseekMoE papers. The results are shown below:
 
 
 
54
 
55
+ |Model|Active Params|Training Tokens|Open LLM Leaderboard Avg|ARC|Hellaswag|MMLU|TruthfulQA|WinoGrande|GSM8k|MBPP|HumanEval|
56
+ |---|---|---|---|---|---|---|---|---|---|---|---|
57
+ |Shot||||25|10|5|0|5|5|3|0|
58
+ |Metric||||acc_norm|acc_norm|acc|mc2|acc|acc|Pass@1|Pass@1|
59
+ |LLaMA2-7B|7B|2T|51.0|53.1|78.6|46.9|38.8|74|14.5|20.8|12.8|
60
+ |LLaMA-13B|13B|1T|51.4|**56.2**|**80.9**|47.7|39.5|**76.2**|7.6|22.0|15.8|
61
+ |DeepseekMoE-16B|2.8B|2T|51.1|53.2|79.8|46.3|36.1|73.7|17.3|34.0|**25.0**|
62
+ |Gemma-2B|2B|2T|46.4|48.4|71.8|41.8|33.1|66.3|16.9|28.0|24.4|
63
+ |JetMoE-8B|2.2B|1.25T|**53.0**|48.7|80.5|**49.2**|**41.7**|70.2|**27.8**|**34.2**|14.6|
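The "Open LLM Leaderboard Avg" column appears to be the unweighted mean of the six leaderboard tasks (ARC, Hellaswag, MMLU, TruthfulQA, WinoGrande, GSM8k), with MBPP and HumanEval excluded. That reading is an inference from the table rather than something the README states, so the sketch below simply recomputes the mean for each row and compares it with the reported value. The helper name `leaderboard_avg` is hypothetical.

```python
# Each entry: (reported average, [ARC, Hellaswag, MMLU, TruthfulQA, WinoGrande, GSM8k]).
# All numbers are copied from the benchmark table.
scores = {
    "LLaMA2-7B": (51.0, [53.1, 78.6, 46.9, 38.8, 74.0, 14.5]),
    "LLaMA-13B": (51.4, [56.2, 80.9, 47.7, 39.5, 76.2, 7.6]),
    "DeepseekMoE-16B": (51.1, [53.2, 79.8, 46.3, 36.1, 73.7, 17.3]),
    "Gemma-2B": (46.4, [48.4, 71.8, 41.8, 33.1, 66.3, 16.9]),
    "JetMoE-8B": (53.0, [48.7, 80.5, 49.2, 41.7, 70.2, 27.8]),
}


def leaderboard_avg(task_scores):
    """Unweighted mean of the six task scores (assumed definition of the Avg column)."""
    return sum(task_scores) / len(task_scores)


for name, (reported, tasks) in scores.items():
    computed = leaderboard_avg(tasks)
    print(f"{name}: reported {reported}, recomputed {computed:.2f}")
```

Every row agrees with its reported average to within rounding, which supports the assumed definition.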
64
 
65
+ | Model | MT-Bench Score |
66
+ |---------------------|-----------|
67
+ | GPT-4 | 9.014 |
68
+ | GPT-3.5-turbo | 7.995 |
69
+ | Claude-v1 | 7.923 |
70
+ | **JetMoE-8B-chat** | **6.681** |
71
+ | Llama-2-13b-chat | 6.650 |
72
+ | Vicuna-13b-v1.3 | 6.413 |
73
+ | Wizardlm-13b | 6.353 |
74
+ | Llama-2-7b-chat | 6.269 |
75
+
76
+ To our surprise, despite the lower training cost and computation, JetMoE-8B achieves a higher Open LLM Leaderboard average than LLaMA2-7B, LLaMA-13B, and DeepseekMoE-16B. Compared to a model with similar training and inference computation, like Gemma-2B, JetMoE-8B achieves better performance.
77
+
78
+ ## Model Usage
79
+ To load the models, you need to install [this package](https://github.com/myshell-ai/JetMoE). From the root of a local clone of that repository, run:
80
+ ```
81
+ pip install -e .
82
+ ```
83
+
84
+ Then you can load the model with the following code:
85
+ ```python
86
+ from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, AutoModelForSequenceClassification
87
+ from jetmoe import JetMoEForCausalLM, JetMoEConfig, JetMoEForSequenceClassification
88
+
89
+ AutoConfig.register("jetmoe", JetMoEConfig)
90
+ AutoModelForCausalLM.register(JetMoEConfig, JetMoEForCausalLM)
91
+ AutoModelForSequenceClassification.register(JetMoEConfig, JetMoEForSequenceClassification)
92
+
93
+ tokenizer = AutoTokenizer.from_pretrained('jetmoe/jetmoe-8b')
94
+ model = AutoModelForCausalLM.from_pretrained('jetmoe/jetmoe-8b')
95
+ ```
96
+
97
+ ## Model Details
98
+ JetMoE-8B has 24 blocks.
99
+ Each block has two MoE layers: Mixture of Attention heads (MoA) and Mixture of MLP Experts (MoE).
100
+ Each MoA and MoE layer has 8 experts, and 2 experts are activated for each input token.
101
+ It has 8 billion parameters in total and 2.2B active parameters.
102
+ JetMoE-8B is trained on 1.25T tokens from publicly available datasets, with a learning rate of 5.0 × 10<sup>-4</sup> and a global batch size of 4M tokens.
103
+
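The "8 experts, 2 activated per token" description corresponds to top-k routing. The following is a minimal, dependency-free sketch of that idea for a single token, not the JetMoE implementation: the function names, the toy experts, and the choice to normalize weights with a softmax over only the selected logits are all assumptions made for illustration.

```python
import math
import random

NUM_EXPERTS = 8  # experts per MoA/MoE layer, per the README
TOP_K = 2        # experts activated per token, per the README


def top_k_route(router_logits, k=TOP_K):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights are a softmax over the selected logits only (an assumption).
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits):
    """Weighted sum of the selected experts' outputs for one token vector x."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(router_logits):
        y = experts[idx](x)  # only the selected experts are evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Toy experts (hypothetical): expert i scales its input by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(NUM_EXPERTS)]

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
print(top_k_route(logits))
print(moe_forward([1.0, 2.0], experts, logits))
```

Because only 2 of the 8 experts run for each token, the compute per token scales with the active parameters rather than the total parameter count, which is the distinction the 8B versus 2.2B figures draw.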
104
+ <figure>
105
+ <center>
106
+ <img src="images/jetmoe_architecture.png" width="40%">
107
+ <figcaption>JetMoE Architecture</figcaption>
108
+ </center>
109
+ </figure>
110
+
111
+ ## Training Details
112
+ Our training recipe follows [MiniCPM](https://shengdinghu.notion.site/MiniCPM-Unveiling-the-Potential-of-End-side-Large-Language-Models-d4d3a8c426424654a4e80e42a711cb20?pvs=4)'s two-phase training method. Phase 1 uses a constant learning rate with linear warmup and is trained on 1 trillion tokens from large-scale open-source pretraining datasets, including RefinedWeb, Pile, GitHub data, etc. Phase 2 uses exponential learning rate decay and is trained on 250 billion tokens from phase 1 datasets and extra high-quality open-source datasets.
113
+
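The two-phase schedule described above can be sketched as a function of the optimizer step. The peak learning rate, batch size, and token counts come from the README; the warmup length and the final learning-rate ratio are not stated there and are hypothetical placeholders, as is the reading of "4M tokens" as exactly 4,000,000.

```python
PEAK_LR = 5.0e-4             # from the README
BATCH_TOKENS = 4_000_000     # "4M tokens", taken here as exactly 4e6 (assumption)

PHASE1_STEPS = 1_000_000_000_000 // BATCH_TOKENS  # 1T tokens
PHASE2_STEPS = 250_000_000_000 // BATCH_TOKENS    # 250B tokens

WARMUP_STEPS = 2_000         # hypothetical: not given in the README
FINAL_LR_RATIO = 0.1         # hypothetical: not given in the README


def learning_rate(step):
    """Learning rate at a given optimizer step under the assumed schedule."""
    if step < WARMUP_STEPS:
        # Phase 1, linear warmup up to the peak learning rate.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    if step < PHASE1_STEPS:
        # Phase 1, constant learning rate.
        return PEAK_LR
    # Phase 2, exponential decay from PEAK_LR toward PEAK_LR * FINAL_LR_RATIO.
    progress = min(step - PHASE1_STEPS, PHASE2_STEPS) / PHASE2_STEPS
    return PEAK_LR * FINAL_LR_RATIO ** progress


for step in (0, WARMUP_STEPS, PHASE1_STEPS, PHASE1_STEPS + PHASE2_STEPS):
    print(step, learning_rate(step))
```

Under these assumptions, phase 1 spans 250,000 steps and phase 2 spans 62,500 steps.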
114
+ <figure>
115
+ <center>
116
+ <img src="images/Phase1_data.png" width="60%">
117
+ <img src="images/Phase2_data.png" width="60%">
118
+ </center>
119
+ </figure>
120
+
121
+ ## Technical Report
122
+ For more details, please refer to the JetMoE Technical Report (Coming Soon).
123
+
124
+ ## JetMoE Model Index
125
+ |Model|Index|
126
+ |---|---|
127
+ |JetMoE-8B-Base| [Link](https://huggingface.co/jetmoe/jetmoe-8B) |
128
+ |JetMoE-8B-SFT| [Link](https://huggingface.co/jetmoe/jetmoe-8B-sft) |
129
+ |JetMoE-8B-Chat| [Link](https://huggingface.co/jetmoe/jetmoe-8B-chat) |
130
+
131
+ ## Acknowledgement
132
+ We express our gratitude to [Shengding Hu](https://shengdinghu.github.io/) for his valuable advice on the Phase 2 data mixture. We also express our gratitude to [Exabits](https://www.exabits.ai/) for their assistance in setting up the GPU clusters, and to [Lepton AI](https://www.lepton.ai/) for their support in setting up the chat demo.
133
 
 
 
 
 
config.json CHANGED
@@ -1,92 +1,30 @@
1
  {
2
- "_attn_implementation_internal": "flash_attention_2",
3
- "_commit_hash": null,
4
- "_name_or_path": "jetmoe/jetmoe-8b-sft",
5
  "activation_function": "silu",
6
- "add_cross_attention": false,
7
  "architectures": [
8
  "JetMoEForCausalLM"
9
  ],
10
- "auto_map": {
11
- "AutoConfig": "configuration_jetmoe.JetMoEConfig",
12
- "AutoModelForCausalLM": "modeling_jetmoe.JetMoEForCausalLM"
13
- },
14
- "aux_loss_coef": 0.01,
15
- "bad_words_ids": null,
16
- "begin_suppress_tokens": null,
17
- "bias": true,
18
  "bos_token_id": 1,
19
- "chunk_size_feed_forward": 0,
20
- "cross_attention_hidden_size": null,
21
- "decoder_start_token_id": null,
22
- "diversity_penalty": 0.0,
23
- "do_sample": false,
24
- "early_stopping": false,
25
- "encoder_no_repeat_ngram_size": 0,
26
  "eos_token_id": 2,
27
- "exponential_decay_length_penalty": null,
28
  "ffn_hidden_size": 5632,
29
- "finetuning_task": null,
30
- "forced_bos_token_id": null,
31
- "forced_eos_token_id": null,
32
- "glu": true,
33
- "id2label": {
34
- "0": "LABEL_0",
35
- "1": "LABEL_1"
36
- },
37
- "initializer_range": 0.01,
38
- "is_decoder": false,
39
- "is_encoder_decoder": false,
40
  "kv_channels": 128,
41
- "label2id": {
42
- "LABEL_0": 0,
43
- "LABEL_1": 1
44
- },
45
  "layer_norm_epsilon": 1e-05,
46
  "length_penalty": 1.0,
47
- "max_length": 20,
48
- "min_length": 0,
49
- "model_type": "jetmoe",
50
  "moe_num_experts": 8,
51
  "moe_top_k": 2,
52
- "n_embd": 2048,
53
- "n_head": 16,
54
- "n_layer": 24,
55
  "n_positions": 4096,
56
- "no_repeat_ngram_size": 0,
57
- "num_beam_groups": 1,
58
- "num_beams": 1,
59
  "num_key_value_heads": 16,
60
  "num_layers": 24,
61
- "num_return_sequences": 1,
62
- "output_attentions": false,
63
- "output_hidden_states": false,
64
- "output_scores": false,
65
- "pad_token_id": null,
66
- "prefix": null,
67
- "problem_type": null,
68
- "pruned_heads": {},
69
- "remove_invalid_values": false,
70
- "repetition_penalty": 1.0,
71
- "return_dict": true,
72
- "return_dict_in_generate": false,
73
  "rms_norm_eps": 1e-05,
74
  "rope_theta": 10000.0,
75
  "rotary_percent": 1.0,
76
- "sep_token_id": null,
77
- "suppress_tokens": null,
78
- "task_specific_params": null,
79
- "temperature": 1.0,
80
- "tf_legacy_loss": false,
81
- "tie_encoder_decoder": false,
82
  "tie_word_embeddings": true,
83
- "tokenizer_class": null,
84
- "top_k": 50,
85
- "top_p": 1.0,
86
- "torchscript": false,
87
  "transformers_version": null,
88
- "typical_p": 1.0,
89
- "use_bfloat16": false,
90
  "use_cache": true,
91
- "vocab_size": 32000
 
92
  }
 
1
  {
2
+ "_name_or_path": "jetmoe/jetmoe-8b-chat",
3
+ "model_type": "jetmoe",
 
4
  "activation_function": "silu",
 
5
  "architectures": [
6
  "JetMoEForCausalLM"
7
  ],
 
 
 
 
 
 
 
 
8
  "bos_token_id": 1,
 
 
 
 
 
 
 
9
  "eos_token_id": 2,
 
10
  "ffn_hidden_size": 5632,
 
 
 
 
 
 
 
 
 
 
 
11
  "kv_channels": 128,
 
 
 
 
12
  "layer_norm_epsilon": 1e-05,
13
  "length_penalty": 1.0,
 
 
 
14
  "moe_num_experts": 8,
15
  "moe_top_k": 2,
16
+ "hidden_size": 2048,
17
+ "num_hidden_layers": 24,
 
18
  "n_positions": 4096,
19
+ "num_attention_heads": 32,
 
 
20
  "num_key_value_heads": 16,
21
  "num_layers": 24,
 
 
 
 
 
 
 
 
 
 
 
 
22
  "rms_norm_eps": 1e-05,
23
  "rope_theta": 10000.0,
24
  "rotary_percent": 1.0,
 
 
 
 
 
 
25
  "tie_word_embeddings": true,
 
 
 
 
26
  "transformers_version": null,
 
 
27
  "use_cache": true,
28
+ "vocab_size": 32000,
29
+ "glu": true
30
  }
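The renamed keys in the new `config.json` (`hidden_size`, `num_hidden_layers`, `num_attention_heads`) can be sanity-checked against the keys that were kept. The sketch below copies a subset of the new values from the diff and tests a few relationships that hold for them. These relationships are observations about this particular file, not documented invariants of the JetMoE code, and `check_config` is a hypothetical helper.

```python
import json

# Subset of the new config.json, with values copied from the diff above.
NEW_CONFIG = json.loads("""
{
  "model_type": "jetmoe",
  "hidden_size": 2048,
  "num_hidden_layers": 24,
  "num_layers": 24,
  "num_attention_heads": 32,
  "num_key_value_heads": 16,
  "kv_channels": 128,
  "moe_num_experts": 8,
  "moe_top_k": 2,
  "vocab_size": 32000
}
""")


def check_config(cfg):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    # Both layer-count keys are present in the new file and should agree.
    if cfg["num_hidden_layers"] != cfg["num_layers"]:
        problems.append("num_hidden_layers and num_layers disagree")
    # Observed relationship: 128 channels * 16 KV heads = 2048 hidden units.
    if cfg["kv_channels"] * cfg["num_key_value_heads"] != cfg["hidden_size"]:
        problems.append("kv_channels * num_key_value_heads != hidden_size")
    # Top-k routing cannot select more experts than exist.
    if not 0 < cfg["moe_top_k"] <= cfg["moe_num_experts"]:
        problems.append("moe_top_k must be between 1 and moe_num_experts")
    return problems


print(check_config(NEW_CONFIG) or "config looks consistent")
```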
images/Phase1_data.png ADDED
images/Phase2_data.png ADDED
images/jetmoe_architecture.png ADDED