facat committed on
Commit 47aff61
1 parent: 01f1a78

Update README.md

Files changed (1)
  1. README.md +133 -97
README.md CHANGED
@@ -3,130 +3,166 @@ license: other
  license_name: yi-license
  license_link: LICENSE
  widget:
- - example_title: Yi-34B-Chat
  text: hi
  output:
  text: ' Hello! How can I assist you today?'
- - example_title: Yi-34B
- text: >-
- There's a place where time stands still. A place of breathtaking wonder,
- but also
- output:
- text: >2-
- an eerie sense that something is just not right…
- Between the two worlds lies The Forgotten Kingdom - home to creatures
- long since thought extinct and ancient magic so strong it defies belief!
- Only here can you find what has been lost for centuries: An Elixir Of
- Life which will restore youth and vitality if only those who seek its
- power are brave enough to face up against all manner of dangers lurking
- in this mysterious land! But beware; some say there may even exist
- powerful entities beyond our comprehension whose intentions towards
- humanity remain unclear at best — they might want nothing more than
- destruction itself rather than anything else from their quest after
- immortality (and maybe someone should tell them about modern medicine)?
- In any event though, one thing remains true regardless: whether or not
- success comes easy depends entirely upon how much effort we put into
- conquering whatever challenges lie ahead along with having faith deep
- down inside ourselves too ;) So let's get started now shall we?
  pipeline_tag: text-generation
  ---

- .
 
- ## Introduction

- **SUS-Chat** is powered by SUSTech x IDEA-CCNL, based on `01-ai/Yi-34B`

- ## News

- <details open>
- <summary>🎯 <b>2023/11/23</b>: The chat models are open to the public.</summary>

- This release contains two chat models based on previously released base models, two 8-bit models quantized with GPTQ, and two 4-bit models quantized with AWQ.

- - `Yi-34B-Chat`
- - `Yi-34B-Chat-4bits`
- - `Yi-34B-Chat-8bits`
- - `Yi-6B-Chat`
- - `Yi-6B-Chat-4bits`
- - `Yi-6B-Chat-8bits`

- You can try some of them interactively at:

- - [HuggingFace](https://huggingface.co/spaces/01-ai/Yi-34B-Chat)
- - [Replicate](https://replicate.com/01-ai)
- </details>

- <details open>
- <summary>🔔 <b>2023/11/23</b>: The Yi Series Models Community License Agreement is updated to v2.1.</summary>
- </details>

- <details>
- <summary>🔥 <b>2023/11/08</b>: Invited test of the Yi-34B chat model.</summary>

- Application form:

- - [English](https://cn.mikecrm.com/l91ODJf)
- - [Chinese](https://cn.mikecrm.com/gnEZjiQ)

- </details>

- <details>
- <summary>🎯 <b>2023/11/05</b>: The base models of <code>Yi-6B-200K</code> and <code>Yi-34B-200K</code>.</summary>

- This release contains two base models with the same parameter sizes as the previous
- release, except that the context window is extended to 200K.

- </details>

- <details>
- <summary>🎯 <b>2023/11/02</b>: The base models of <code>Yi-6B</code> and <code>Yi-34B</code>.</summary>

- The first public release contains two bilingual (English/Chinese) base models
- with parameter sizes of 6B and 34B. Both are trained with a 4K
- sequence length and can be extended to 32K at inference time.

- </details>

- ## Model Performance

- ### Base Model Performance

- | Model | MMLU | CMMLU | C-Eval | GAOKAO | BBH | Common-sense Reasoning | Reading Comprehension | Math & Code |
- | :------------ | :------: | :------: | :------: | :------: | :------: | :--------------------: | :-------------------: | :---------: |
- | | 5-shot | 5-shot | 5-shot | 0-shot | 3-shot@1 | - | - | - |
- | LLaMA2-34B | 62.6 | - | - | - | 44.1 | 69.9 | 68.0 | 26.0 |
- | LLaMA2-70B | 68.9 | 53.3 | - | 49.8 | 51.2 | 71.9 | 69.4 | 36.8 |
- | Baichuan2-13B | 59.2 | 62.0 | 58.1 | 54.3 | 48.8 | 64.3 | 62.4 | 23.0 |
- | Qwen-14B | 66.3 | 71.0 | 72.1 | 62.5 | 53.4 | 73.3 | 72.5 | **39.8** |
- | Skywork-13B | 62.1 | 61.8 | 60.6 | 68.1 | 41.7 | 72.4 | 61.4 | 24.9 |
- | InternLM-20B | 62.1 | 59.0 | 58.8 | 45.5 | 52.5 | 78.3 | - | 30.4 |
- | Aquila-34B | 67.8 | 71.4 | 63.1 | - | - | - | - | - |
- | Falcon-180B | 70.4 | 58.0 | 57.8 | 59.0 | 54.0 | 77.3 | 68.8 | 34.0 |
- | Yi-6B | 63.2 | 75.5 | 72.0 | 72.2 | 42.8 | 72.3 | 68.7 | 19.8 |
- | Yi-6B-200K | 64.0 | 75.3 | 73.5 | 73.9 | 42.0 | 72.0 | 69.1 | 19.0 |
- | **Yi-34B** | **76.3** | **83.7** | 81.4 | 82.8 | **54.3** | **80.1** | 76.4 | 37.1 |
- | Yi-34B-200K | 76.1 | 83.6 | **81.9** | **83.4** | 52.7 | 79.7 | **76.6** | 36.3 |

- While benchmarking open-source models, we have observed a disparity between the
- results generated by our pipeline and those reported in public sources (e.g.
- OpenCompass). Upon conducting a more in-depth investigation of this difference,
- we have discovered that various models may employ different prompts,
- post-processing strategies, and sampling techniques, potentially resulting in
- significant variations in the outcomes. Our prompt and post-processing strategy
- remains consistent with the original benchmark, and greedy decoding is employed
- during evaluation without any post-processing of the generated content. For
- scores that were not reported by the original authors (including scores reported
- with different settings), we try to obtain results with our pipeline.

- To evaluate the model's capability extensively, we adopted the methodology
- outlined in Llama 2. Specifically, we included PIQA, SIQA, HellaSwag, WinoGrande,
- ARC, OBQA, and CSQA to assess common-sense reasoning. SQuAD, QuAC, and BoolQ
- were incorporated to evaluate reading comprehension. CSQA was exclusively tested
- using a 7-shot setup, while all other tests were conducted with a 0-shot
- configuration. Additionally, we introduced GSM8K (8-shot@1), MATH (4-shot@1),
- HumanEval (0-shot@1), and MBPP (3-shot@1) under the category "Math & Code". Due
- to technical constraints, we did not test Falcon-180B on QuAC and OBQA; its score
- is derived by averaging the scores on the remaining tasks. Since the scores for
- these two tasks are generally lower than the average, we believe that
- Falcon-180B's performance was not underestimated.
  license_name: yi-license
  license_link: LICENSE
  widget:
+ - example_title: SUS-Chat
  text: hi
  output:
  text: ' Hello! How can I assist you today?'
+
  pipeline_tag: text-generation
  ---
+ # 🐗SUS-Chat: Instruction tuning done right
+
+ <div align="center">
+
+ <p align="center">
+ <img width="200px" src="https://github.com/SUSTech-IDEA/SUS-Chat/raw/main/assets/sustech.svg?sanitize=true">
+ </p>
+
+ <div style="display: inline-block;">
+
+ <a rel="noopener nofollow" href="https://github.com/SUSTech-IDEA/SUS-Chat/issues">
+ <img src="https://img.shields.io/github/issues/SUSTech-IDEA/SUS-Chat?logo=github" style="margin: 0 0;">
+ </a>
+
+ </div>
+
+ <div style="display: inline-block;">
+
+ <a href="https://huggingface.co/SUSTech">
+ <img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-SUSTech-blue" style="margin: 0 0;">
+ </a>
+
+ </div>
+
+ <div style="display: inline-block;">
+
+ <a rel="noopener nofollow" href="https://www.modelscope.cn/organization/sustc/">
+ <img src="https://img.shields.io/badge/ModelScope-sustec-blue" style="margin: 0 0;">
+ </a>
+
+ </div>
+
+ <div style="display: inline-block;">
+
+ <a rel="noopener nofollow" href="https://github.com/SUSTech-IDEA/SUS-Chat/blob/main/LICENSE">
+ <img src="https://img.shields.io/badge/Code_License-Apache_2.0-lightblue" style="margin: 0 0;">
+ </a>
+
+ </div>
+
+ <div style="display: inline-block;">
+
+ <a rel="noopener nofollow" href="https://github.com/SUSTech-IDEA/SUS-Chat/blob/main/MODEL_LICENSE_AGREEMENT.txt">
+ <img src="https://img.shields.io/badge/Model_License-Model_Agreement-lightblue" style="margin: 0 0;">
+ </a>
+
+ </div>
+
+ <div style="display: inline-block;">
+
+ <a rel="noopener nofollow" href="mailto:[email protected]">
+ <img src="https://img.shields.io/badge/✉️[email protected]" style="margin: 0 0;">
+ </a>
+
+ </div>
+
+ </div>
+ # Introduction
+
+ <img src="https://hackmd.io/_uploads/S1dXCTIHp.png" id="fig-sus"
+ alt="Figure 1: DALL·E 2023-12-01 11.03.28 - An imposing, majestic wild boar combined with elements of a futuristic transformer robot. The boar itself should be intricately blended with these tra" />
+
+ **SUS-Chat** is a 34B bilingual Chinese-English dialogue model, jointly released by the Southern University of Science and Technology (SUSTech) and the International Digital Economy Academy (IDEA). The SUS-Chat-34B model was fine-tuned on millions of high-quality, multilingual instruction examples. While preserving the strong language abilities of the base model, this high-quality instruction tuning improves how the model responds to human instructions, and the model excels at imitating human reasoning through chain-of-thought.
+
+ It surpasses all models of the same size on almost every benchmark, better meets the practical needs of complex multilingual tasks, and remains highly competitive even against larger models, achieving state-of-the-art results in our comprehensive evaluation.
+
+ SUS-Chat is strong evidence that, with the right instruction tuning, academic institutions can obtain better performance from open-source datasets and models without increasing model parameters. This narrows the gap between academia and industry on large language models and opens new possibilities for collaboration between them.
+
+ # Performance
+
+ To better evaluate the performance of the SUS-Chat-34B model, we evaluated it on multiple benchmarks and open-sourced our evaluation framework, [TLEM](https://huggingface.co/spaces/SUSTech/tlem), so that other researchers can reproduce and compare the results.
+
+ In TLEM we use several benchmarks, including MMLU, CMMLU, C-Eval, BBH, GSM-8K, and MATH, which focus on measuring the model's knowledge and reasoning abilities. On these metrics the SUS-Chat-34B model achieves state-of-the-art performance. We additionally used [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness) to test SUS-Chat and comparable models on winogrande, hellaswag, arc, and truthful-qa, which measure common-sense reasoning and hallucination.
+
+ Overall, the SUS-Chat-34B model significantly outperforms models of similar scale and achieves state-of-the-art comprehensive performance.
+ | model | mmlu-chat | cmmlu-chat | ceval-chat | gsm8k | BBH | MATH | winogrande | arc | hellaswag | truthfulqa | average |
+ |:------------------|----------:|-----------:|-----------:|------:|------:|------:|-----------:|------:|----------:|-----------:|--------:|
+ | GPT-4 | 83 | 71 | 69.9 | 91.4 | 86.7 | 45.8 | 87.5 | 94.5 | 91.4 | nan | 80.1333 |
+ | SUS-Chat-34B | 77.35 | 78.68 | 82.42 | 80.06 | 67.62 | 28.8 | 81.22 | 81.54 | 83.79 | 57.47 | 71.895 |
+ | Qwen-72B-Chat | 74.52 | 77.02 | 77.22 | 76.57 | 72.63 | 35.9 | 80.58 | 81.29 | 87.02 | 50.64 | 71.339 |
+ | DeepSeek-67B-Chat | 69.43 | 48.51 | 59.7 | 74.45 | 69.73 | 29.56 | 76.09 | 82.1 | 86.06 | 56.37 | 65.2 |
+ | OrionStar-34B | 68.51 | 66.88 | 65.13 | 54.36 | 62.88 | 12.8 | 77.27 | 80.19 | 84.54 | 53.24 | 62.58 |
+ | Yi-34B-Chat | 66.96 | 55.16 | 77.16 | 63.76 | 61.54 | 10.02 | 76.64 | 70.66 | 82.29 | 54.57 | 61.876 |
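The `average` column appears to be a plain arithmetic mean over the benchmark columns, skipping missing (`nan`) entries; a quick sanity check against the table, with the values copied from the rows above:

```python
def row_average(scores):
    """Arithmetic mean over the available benchmark scores, skipping missing entries."""
    present = [s for s in scores if s is not None]
    return sum(present) / len(present)

# Values copied from the table above; None marks GPT-4's missing (nan) truthfulqa score.
sus_chat = [77.35, 78.68, 82.42, 80.06, 67.62, 28.8, 81.22, 81.54, 83.79, 57.47]
gpt4 = [83, 71, 69.9, 91.4, 86.7, 45.8, 87.5, 94.5, 91.4, None]

print(round(row_average(sus_chat), 3))  # 71.895, matching the table
print(round(row_average(gpt4), 4))      # 80.1333, matching the table
```

Note that GPT-4's average is taken over only nine benchmarks, so it is not strictly comparable to the other rows.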
+ <img src="assets/radar.png" id="fig-bench" alt="Figure 2: Benchmark" />
+
+ # Usage
+
+ SUS-Chat-34B is a standard LLaMA-architecture model, so it is used in the same way, and with the same development environment, as most other open-source models. Multi-turn dialogue can be run as follows:
+ ``` python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+
+ def chat_template(messages):
+     # Render the conversation into the "### Human: ... ### Assistant: " prompt format.
+     history = ""
+     for message in messages:
+         match message:
+             case {"role": "user", "content": content}:
+                 history += f"### Human: {content}\n\n### Assistant: "
+             case {"role": "assistant", "content": content}:
+                 history += content
+     return history
+
+
+ model_path = "SUSTech/SUS-Chat-34B"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_path, device_map="auto", torch_dtype="auto"
+ ).eval()
+
+ messages = [{"role": "user", "content": "hi"}]
+
+ input_ids = tokenizer.encode(chat_template(messages), return_tensors="pt").to("cuda")
+ output_ids = model.generate(input_ids)
+ response = tokenizer.decode(
+     output_ids[0][input_ids.shape[1] :], skip_special_tokens=True
+ )
+
+ messages.append({"role": "assistant", "content": response})
+
+ # Second round
+ messages.append({"role": "user", "content": "What is the capital of China?"})
+
+ input_ids = tokenizer.encode(chat_template(messages), return_tensors="pt").to("cuda")
+ output_ids = model.generate(input_ids)
+ response = tokenizer.decode(
+     output_ids[0][input_ids.shape[1] :], skip_special_tokens=True
+ )
+
+ messages.append({"role": "assistant", "content": response})
+ ```
 
+ # Limitations
+
+ SUS-Chat has only undergone supervised fine-tuning and has not yet been trained with human preference learning. As a result, it may produce unreasonable responses in some situations and amplify existing problems of language models, including hallucination, non-determinism, and error accumulation. To obtain performance better suited to downstream tasks, we recommend adjusting the generation configuration parameters accordingly.
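As one way to act on that recommendation, decoding can be constrained via keyword arguments to `model.generate(...)` from the usage example above; the specific values below are illustrative assumptions, not tuned recommendations:

```python
# Illustrative generation settings (assumed values, not tuned recommendations).
# Pass them to the usage example as: model.generate(input_ids, **gen_kwargs)
gen_kwargs = {
    "max_new_tokens": 512,      # cap response length to limit error accumulation
    "do_sample": False,         # greedy decoding for deterministic output
    "repetition_penalty": 1.1,  # mildly discourage degenerate repetition
}
print(gen_kwargs)
```

For tasks that benefit from variety, `do_sample=True` with a moderate `temperature` would trade the determinism away again, so the choice depends on the downstream task.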
 
+ # Disclaimer
+
+ We use a data-compliance checking algorithm during training to ensure, to the best of our ability, that the trained model is compliant. Because the data are complex and language-model usage scenarios are diverse, we cannot guarantee that the model will generate correct and reasonable output in every situation. Please be aware that the model still carries the risk of producing problematic output. We accept no responsibility for any risks or issues arising from misuse, misguidance, illegal use, related misinformation, or any associated data-security problems.
+
+ # License
+
+ The model is fully open for academic research and free commercial use, subject to the [license](https://github.com/SUSTech-IDEA/SUS-Chat/blob/main/MODEL_LICENSE_AGREEMENT.txt) from 01.AI.