SUSTech/SUS-Chat-34B

@@ -3,130 +3,166 @@ license: other
 license_name: yi-license
 license_link: LICENSE
 widget:
-  - example_title: Yi-34B-Chat
     text: hi
     output:
       text: ' Hello! How can I assist you today?'
-  - example_title: Yi-34B
-    text: >-
-      There's a place where time stands still. A place of breath taking wonder,
-      but also
-    output:
-      text: >2-
-         an eerie sense that something is just not right…
-        Between the two worlds lies The Forgotten Kingdom - home to creatures
-        long since thought extinct and ancient magic so strong it defies belief!
-        Only here can you find what has been lost for centuries: An Elixir Of
-        Life which will restore youth and vitality if only those who seek its
-        power are brave enough to face up against all manner of dangers lurking
-        in this mysterious land! But beware; some say there may even exist
-        powerful entities beyond our comprehension whose intentions towards
-        humanity remain unclear at best ---- they might want nothing more than
-        destruction itself rather then anything else from their quest after
-        immortality (and maybe someone should tell them about modern medicine)?
-        In any event though – one thing remains true regardless : whether or not
-        success comes easy depends entirely upon how much effort we put into
-        conquering whatever challenges lie ahead along with having faith deep
-        down inside ourselves too ;) So let’s get started now shall We?
 pipeline_tag: text-generation
 ---
-.
-## Introduction
-**SUS-Chat** is powered by SUSTech x IDEA-CCNL, based on `01-ai/Yi-34B`
-## News
-<details open>
-<summary>🎯 <b>2023/11/23</b>: The chat models are open to public.</summary>
-This release contains two chat models based on previous released base models, two 8-bits models quantized by GPTQ, two 4-bits models quantized by AWQ.
-- `Yi-34B-Chat`
-- `Yi-34B-Chat-4bits`
-- `Yi-34B-Chat-8bits`
-- `Yi-6B-Chat`
-- `Yi-6B-Chat-4bits`
-- `Yi-6B-Chat-8bits`
-You can try some of them interactively at:
-- [HuggingFace](https://huggingface.co/spaces/01-ai/Yi-34B-Chat)
-- [Replicate](https://replicate.com/01-ai)
-</details>
-<details open>
-<summary>🔔 <b>2023/11/23</b>: The Yi Series Models Community License Agreement is updated to v2.1.</summary>
-</details>
-<details>
-<summary>🔥 <b>2023/11/08</b>: Invited test of Yi-34B chat model.</summary>
-Application form:
-- [English](https://cn.mikecrm.com/l91ODJf)
-- [Chinese](https://cn.mikecrm.com/gnEZjiQ)
-</details>
-<details>
-<summary>🎯 <b>2023/11/05</b>: The base model of <code>Yi-6B-200K</code> and <code>Yi-34B-200K</code>.</summary>
-This release contains two base models with the same parameter sizes of previous
-release, except that the context window is extended to 200K.
-</details>
-<details>
-<summary>🎯 <b>2023/11/02</b>: The base model of <code>Yi-6B</code> and <code>Yi-34B</code>.</summary>
-The first public release contains two bilingual (English/Chinese) base models
-with the parameter sizes of 6B and 34B.  Both of them are trained with 4K
-sequence length and can be extended to 32K during inference time.
-</details>
-## Model Performance
-### Base Model Performance
-| Model         |   MMLU   |  CMMLU   |  C-Eval  |  GAOKAO  |   BBH    | Common-sense Reasoning | Reading Comprehension | Math & Code |
-| :------------ | :------: | :------: | :------: | :------: | :------: | :--------------------: | :-------------------: | :---------: |
-|               |  5-shot  |  5-shot  |  5-shot  |  0-shot  | 3-shot@1 |           -            |           -           |      -      |
-| LLaMA2-34B    |   62.6   |    -     |    -     |    -     |   44.1   |          69.9          |         68.0          |    26.0     |
-| LLaMA2-70B    |   68.9   |   53.3   |    -     |   49.8   |   51.2   |          71.9          |         69.4          |    36.8     |
-| Baichuan2-13B |   59.2   |   62.0   |   58.1   |   54.3   |   48.8   |          64.3          |         62.4          |    23.0     |
-| Qwen-14B      |   66.3   |   71.0   |   72.1   |   62.5   |   53.4   |          73.3          |         72.5          |  **39.8**   |
-| Skywork-13B   |   62.1   |   61.8   |   60.6   |   68.1   |   41.7   |          72.4          |         61.4          |    24.9     |
-| InternLM-20B  |   62.1   |   59.0   |   58.8   |   45.5   |   52.5   |          78.3          |           -           |    30.4     |
-| Aquila-34B    |   67.8   |   71.4   |   63.1   |    -     |    -     |           -            |           -           |      -      |
-| Falcon-180B   |   70.4   |   58.0   |   57.8   |   59.0   |   54.0   |          77.3          |         68.8          |    34.0     |
-| Yi-6B         |   63.2   |   75.5   |   72.0   |   72.2   |   42.8   |          72.3          |         68.7          |    19.8     |
-| Yi-6B-200K    |   64.0   |   75.3   |   73.5   |   73.9   |   42.0   |          72.0          |         69.1          |    19.0     |
-| **Yi-34B**    | **76.3** | **83.7** |   81.4   |   82.8   | **54.3** |        **80.1**        |         76.4          |    37.1     |
-| Yi-34B-200K   |   76.1   |   83.6   | **81.9** | **83.4** |   52.7   |          79.7          |       **76.6**        |    36.3     |
-While benchmarking open-source models, we have observed a disparity between the
-results generated by our pipeline and those reported in public sources (e.g.
-OpenCompass). Upon conducting a more in-depth investigation of this difference,
-we have discovered that various models may employ different prompts,
-post-processing strategies, and sampling techniques, potentially resulting in
-significant variations in the outcomes. Our prompt and post-processing strategy
-remains consistent with the original benchmark, and greedy decoding is employed
-during evaluation without any post-processing for the generated content. For
-scores that were not reported by the original authors (including scores reported
-with different settings), we try to get results with our pipeline.
-To evaluate the model's capability extensively, we adopted the methodology
-outlined in Llama2. Specifically, we included PIQA, SIQA, HellaSwag, WinoGrande,
-ARC, OBQA, and CSQA to assess common sense reasoning. SquAD, QuAC, and BoolQ
-were incorporated to evaluate reading comprehension. CSQA was exclusively tested
-using a 7-shot setup, while all other tests were conducted with a 0-shot
-configuration. Additionally, we introduced GSM8K (8-shot@1), MATH (4-shot@1),
-HumanEval (0-shot@1), and MBPP (3-shot@1) under the category "Math & Code". Due
-to technical constraints, we did not test Falcon-180 on QuAC and OBQA; the score
-is derived by averaging the scores on the remaining tasks. Since the scores for
-these two tasks are generally lower than the average, we believe that
-Falcon-180B's performance was not underestimated.

 license_name: yi-license
 license_link: LICENSE
 widget:
+  - example_title: SUS-Chat
     text: hi
     output:
       text: ' Hello! How can I assist you today?'
 pipeline_tag: text-generation
 ---
+# 🐗SUS-Chat: Instruction tuning done right
+<div align="center">
+<p align="center">
+<img width="200px" src="https://github.com/SUSTech-IDEA/SUS-Chat/raw/main/assets/sustech.svg?sanitize=true">
+</p>
+<div style="display: inline-block;">
+<a rel="noopener nofollow" href="https://github.com/SUSTech-IDEA/SUS-Chat/issues">
+<img src="https://img.shields.io/github/issues/SUSTech-IDEA/SUS-Chat?logo=github" style="margin: 0 0;">
+</a>
+</div>
+<div style="display: inline-block;">
+<a href="https://huggingface.co/SUSTech">
+<img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-SUSTech-blue" style="margin: 0 0;">
+</a>
+</div>
+<div style="display: inline-block;">
+<a rel="noopener nofollow" href="https://www.modelscope.cn/organization/sustc/">
+<img src="https://img.shields.io/badge/ModelScope-sustec-blue" style="margin: 0 0;">
+</a>
+</div>
+<div style="display: inline-block;">
+<a rel="noopener nofollow" href="https://github.com/SUSTech-IDEA/SUS-Chat/blob/main/LICENSE">
+<img src="https://img.shields.io/badge/Code_License-Apache_2.0-lightblue" style="margin: 0 0;">
+</a>
+</div>
+<div style="display: inline-block;">
+<a rel="noopener nofollow" href="https://github.com/SUSTech-IDEA/SUS-Chat/blob/main/MODEL_LICENSE_AGREEMENT.txt">
+<img src="https://img.shields.io/badge/Model_License-Model_Agreement-lightblue" style="margin: 0 0;">
+</a>
+</div>
+<div style="display: inline-block;">
+<a rel="noopener nofollow" href="mailto:[email protected]">
+<img src="https://img.shields.io/badge/✉️[email protected]" style="margin: 0 0;">
+</a>
+</div>
+</div>
+# Introduction
+<img src="https://hackmd.io/_uploads/S1dXCTIHp.png" id="fig-sus"
+alt="Figure 1: DALL·E 2023-12-01 11.03.28 - An imposing, majestic wild boar combined with elements of a futuristic transformer robot. The boar itself should be intricately blended with these tra" />
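+**SUS-Chat** is a 34B bilingual Chinese-English chat model, jointly released by the Southern University of Science and Technology (SUSTech) and the International Digital Economy Academy (IDEA). SUS-Chat-34B was fine-tuned on millions of high-quality, multilingual instruction examples. While retaining the strong language capabilities of the base model, this instruction fine-tuning improves how the model responds to human instructions, and the model excels at imitating human reasoning through chain-of-thought.
+It outperforms all models of the same size on almost all benchmarks and is better suited to the practical demands of complex multilingual tasks. Compared with larger models, SUS-Chat-34B remains highly competitive and achieves state-of-the-art performance in our comprehensive evaluation.
+SUS-Chat shows that, with the right instruction fine-tuning, academic institutions can obtain better performance from open-source datasets and models without increasing the number of model parameters.
+This narrows the gap between academia and industry in large language models and opens new possibilities for collaboration between them.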
+# Performance
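+To better assess the performance of SUS-Chat-34B, we evaluated it on multiple benchmarks and open-sourced the evaluation framework [TLEM](https://huggingface.co/spaces/SUSTech/tlem) so that other researchers can reproduce and compare results.
+In TLEM we use several benchmarks, including MMLU, CMMLU, C-Eval, BBH, GSM-8K, and MATH, which focus on measuring the model's knowledge and reasoning ability. On these metrics SUS-Chat-34B achieves state-of-the-art performance.
+We additionally used [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness) to test SUS-Chat and comparable models on winogrande, hellaswag, arc, and truthful-qa, which measure common-sense reasoning and hallucination.
+Overall, SUS-Chat-34B is significantly ahead of models of the same scale and achieves state-of-the-art overall performance.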
+| model             | mmlu-chat | cmmlu-chat | ceval-chat | gsm8k |   BBH |  MATH | winogrande |   arc | hellaswag | truthfulqa | average |
+|:------------------|----------:|-----------:|-----------:|------:|------:|------:|-----------:|------:|----------:|-----------:|--------:|
+| GPT-4             |        83 |         71 |       69.9 |  91.4 |  86.7 |  45.8 |       87.5 |  94.5 |      91.4 |        nan | 80.1333 |
+| SUS-Chat-34B      |     77.35 |      78.68 |      82.42 | 80.06 | 67.62 |  28.8 |      81.22 | 81.54 |     83.79 |      57.47 |  71.895 |
+| Qwen-72B-Chat     |     74.52 |      77.02 |      77.22 | 76.57 | 72.63 |  35.9 |      80.58 | 81.29 |     87.02 |      50.64 |  71.339 |
+| DeepSeek-67B-Chat |     69.43 |      48.51 |       59.7 | 74.45 | 69.73 | 29.56 |      76.09 |  82.1 |     86.06 |      56.37 |    65.2 |
+| OrionStar-34B     |     68.51 |      66.88 |      65.13 | 54.36 | 62.88 |  12.8 |      77.27 | 80.19 |     84.54 |      53.24 |   62.58 |
+| Yi-34B-Chat       |     66.96 |      55.16 |      77.16 | 63.76 | 61.54 | 10.02 |      76.64 | 70.66 |     82.29 |      54.57 |  61.876 |
+<img src="assets/radar.png" id="fig-bench" alt="Figure 2: Benchmark" />
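The `average` column is consistent with an unweighted mean over the benchmarks for which a score is available; for GPT-4 the missing `truthfulqa` entry (`nan`) appears to be skipped. A minimal Python sketch that reproduces two rows of the table under that assumption (the helper name `row_average` is illustrative and not part of TLEM):

```python
import math

# Scores copied from the table above, in column order:
# mmlu-chat, cmmlu-chat, ceval-chat, gsm8k, BBH, MATH,
# winogrande, arc, hellaswag, truthfulqa
scores = {
    "GPT-4": [83, 71, 69.9, 91.4, 86.7, 45.8, 87.5, 94.5, 91.4, math.nan],
    "SUS-Chat-34B": [77.35, 78.68, 82.42, 80.06, 67.62, 28.8, 81.22, 81.54, 83.79, 57.47],
}


def row_average(values):
    """Mean of the available scores, ignoring missing (NaN) entries."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present)


for name, values in scores.items():
    print(f"{name}: {row_average(values):.4f}")
```

Because GPT-4 is averaged over nine benchmarks and the other models over ten, its average is not strictly comparable with the rest of the column.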
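+# Usage
+SUS-Chat-34B is a standard LLaMA model, so its usage and development environment are the same as for most other open-source models. Multi-turn dialogue can be run as follows: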
+``` python
+# Requires Python 3.10+ for the match statement.
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+
+def chat_template(messages):
+    """Render the conversation in the "### Human / ### Assistant" prompt format."""
+    history = ""
+    for message in messages:
+        match message:
+            # The role is "user", matching the messages built below.
+            case {"role": "user", "content": content}:
+                history += f"### Human: {content}\n\n### Assistant: "
+            case {"role": "assistant", "content": content}:
+                history += content
+    return history
+
+
+model_path = "SUSTech/SUS-Chat-34B"
+tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
+model = AutoModelForCausalLM.from_pretrained(
+    model_path, device_map="auto", torch_dtype="auto"
+).eval()
+
+# First round
+messages = [{"role": "user", "content": "hi"}]
+input_ids = tokenizer.encode(chat_template(messages), return_tensors="pt").to(model.device)
+output_ids = model.generate(input_ids)
+# Decode only the newly generated tokens, skipping the prompt.
+response = tokenizer.decode(
+    output_ids[0][input_ids.shape[1] :], skip_special_tokens=True
+)
+messages.append({"role": "assistant", "content": response})
+
+# Second round
+messages.append({"role": "user", "content": "What is the capital of China?"})
+input_ids = tokenizer.encode(chat_template(messages), return_tensors="pt").to(model.device)
+output_ids = model.generate(input_ids)
+response = tokenizer.decode(
+    output_ids[0][input_ids.shape[1] :], skip_special_tokens=True
+)
+messages.append({"role": "assistant", "content": response})
+```
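To inspect the prompt string without loading the 34B weights, the format built by `chat_template` can be reproduced in a few lines. This is a sketch that assumes the roles are `user` and `assistant`, as in the `messages` list above; `build_prompt` is a hypothetical name, and it uses plain `if`/`elif` so that it also runs on Python versions older than 3.10.

```python
def build_prompt(messages):
    """Render messages into the "### Human: ... ### Assistant: " format."""
    prompt = ""
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "user":
            # A user turn ends with the assistant tag so the model continues from it.
            prompt += f"### Human: {content}\n\n### Assistant: "
        elif role == "assistant":
            # Earlier assistant replies are appended verbatim.
            prompt += content
        else:
            # Assumption of this sketch: fail loudly on unknown roles.
            raise ValueError(f"unsupported role: {role!r}")
    return prompt


messages = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "Hello! How can I assist you today?"},
    {"role": "user", "content": "What is the capital of China?"},
]
print(build_prompt(messages))
```

Since the prompt always ends with `### Assistant: `, slicing the generated ids at `input_ids.shape[1]` leaves only the new reply. Raising on an unknown role is a choice made for this sketch, so that a mistyped role name cannot silently yield an empty prompt.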
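+# Limitations
+SUS-Chat has only undergone supervised fine-tuning and has not yet been trained with human preference learning. As a result, it may produce unreasonable responses in some situations and amplify existing problems of language models,
+including hallucination, non-determinism, and cumulative error.
+To obtain performance better suited to downstream tasks, we recommend adjusting the generation configuration parameters accordingly.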
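As a rough illustration of adjusting generation parameters, the keyword arguments below can be passed to `model.generate` from the `transformers` library. The parameter names are standard `generate` options; the values are assumptions chosen for illustration and are not recommendations from the SUS-Chat authors.

```python
# Illustrative values only; the model card does not specify recommended settings.
generation_kwargs = {
    "max_new_tokens": 512,       # upper bound on the length of the reply
    "do_sample": True,           # sample instead of greedy decoding
    "temperature": 0.7,          # lower values make output more deterministic
    "top_p": 0.9,                # nucleus sampling threshold
    "repetition_penalty": 1.05,  # mild penalty against repeated tokens
}

# With the model and input_ids from the usage example:
# output_ids = model.generate(input_ids, **generation_kwargs)
```

For tasks that need repeatable output, setting `do_sample` to `False` selects greedy decoding, in which case `temperature` and `top_p` are not used.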
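+# Disclaimer
+We use data compliance checking algorithms during training to ensure, as far as we can, the compliance of the trained model. Because the data is complex and the usage scenarios of language models are diverse, we cannot guarantee that the model will generate correct and reasonable output in all cases. Please be aware that the model still carries a risk of producing problematic output. We accept no responsibility for any risks or issues arising from misuse, misguidance, illegal use, and related misinformation, or from associated data security problems.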
+# License
+This model is fully open for academic research and free commercial use, subject to the [license](https://github.com/SUSTech-IDEA/SUS-Chat/blob/main/MODEL_LICENSE_AGREEMENT.txt) from 01.AI (零一万物).