YC-Chen committed on
Commit 277e69e • 1 Parent(s): 29e7be5

Update README.md

Files changed (1)
  1. README.md +43 -33
README.md CHANGED
@@ -7,7 +7,6 @@ language:
 
 # Model Card for Breeze-7B-Instruct-v0.1
 
-
 Breeze-7B is a language model that builds upon the foundation of [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1), specifically enhanced for Traditional Chinese.
 
 [Breeze-7B-Base-v0.1](https://huggingface.co/MediaTek-Research/Breeze-7B-Base-v0.1) introduces an expanded vocabulary with an additional 30,000 Traditional Chinese tokens and
@@ -67,7 +66,7 @@ Breeze-7B-Instruct-64k-v0.1 can solve tasks such as question answering and summa
 We use the code revised from [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate **TMMLU+**, **DRCD**, **Table**, and **MMLU**.
 
 
-| Models | | TMMLU+ (ACC) | DRCD (EM) | Table (ACC) | MMLU (ACC) |
+| Models | | ↑ TMMLU+ (ACC) | DRCD (EM) | Table (ACC) | MMLU (ACC) |
 |--------|--|----------------|-------------|-------------|------------|
 | | | TC, Knowledge | TC, Reasoning | TC, Reasoning | EN, Knowledge |
 | | | 5 shot | 3 shot | 5 shot | 5 shot |
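A scoring run in the style of this table can be sketched with the upstream harness's Python API. This is a sketch only, not the authors' revised code: the task name `tmmluplus` and the model arguments are assumptions, and the DRCD and Table tasks presumably exist only in the revised fork.

```python
# Sketch using upstream lm-evaluation-harness (v0.4+), not the revised fork
# the README refers to. "tmmluplus" is an assumed task name.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=MediaTek-Research/Breeze-7B-Base-v0.1,dtype=bfloat16",
    tasks=["tmmluplus"],
    num_fewshot=5,  # the 5-shot setting reported for TMMLU+ above
)
print(results["results"])
```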
@@ -83,14 +82,14 @@ Breeze-7B-Instruct-64k-v0.1 can solve tasks such as question answering and summa
 
 **Category ACC of TMMLU+ (5 shot)**
 
-| Models | STEM | Social Science | Humanities | Other |
-|--------|------|----------------|------------|-------|
-| Yi-34B | 56.03 | 73.06 | 61.12 | 62.19 |
-| Qwen-14B | 46.51 | 58.20 | 51.12 | 49.38 |
-| Yi-6B | 41.14 | 57.77 | 50.22 | 49.39 |
-| Qwen-7B | 28.25 | 47.80 | 43.14 | 42.17 |
-| **Breeze-7B-Base-v0.1** | 35.74 | 46.08 | 40.29 | 39.27 |
-| Mistral-7B-v0.1 | 33.01 | 42.23 | 35.86 | 37.63 |
+| Models | STEM | Social Science | Humanities | Other | ↑ AVG |
+|--------|------|----------------|------------|-------|-------|
+| Yi-34B | 56.03 | 73.06 | 61.12 | 62.19 | 63.10 |
+| Qwen-14B | 46.51 | 58.20 | 51.12 | 49.38 | 51.30 |
+| Yi-6B | 41.14 | 57.77 | 50.22 | 49.39 | 49.63 |
+| Qwen-7B | 28.25 | 47.80 | 43.14 | 42.17 | 42.84 |
+| **Breeze-7B-Base-v0.1** | 35.74 | 46.08 | 40.29 | 39.27 | 40.35 |
+| Mistral-7B-v0.1 | 33.01 | 42.23 | 35.86 | 37.63 | 36.93 |
 
 
 
@@ -105,7 +104,7 @@ Breeze-7B-Instruct-64k-v0.1 can solve tasks such as question answering and summa
 We use the code revised from [fastchat llm_judge](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge) to evaluate **MT-Bench-tw** and **MT-Bench**.
 
 
-| Models | | MT-Bench-tw (Score) | TMMLU+ (ACC) | TMMLU+ (ACC) | DRCD (EM) | Table (ACC) | MT-Bench (Score) | MMLU (ACC) | MMLU (ACC) |
+| Models | | ↑ MT-Bench-tw (Score) | TMMLU+ (ACC) | TMMLU+ (ACC) | DRCD (EM) | Table (ACC) | MT-Bench (Score) | MMLU (ACC) | MMLU (ACC) |
 |--------|--|-----------------------|--------------|--------------|-----------|-------------|------------------|------------|------------|
 | | | TC, Chat | TC, Knowledge | TC, Knowledge | TC, Reasoning | TC, Reasoning | EN, Chat | EN, Knowledge | EN, Knowledge |
 | | | 0 shot | 0 shot | 5 shot | 3 shot | 0 shot | 0 shot | 0 shot | 5 shot |
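The MT-Bench scores follow FastChat's three-step generate/judge/report flow. A minimal sketch of the upstream pipeline, run from `fastchat/llm_judge`; MT-Bench-tw itself ships only with the authors' revised copy, the flags are taken from the upstream llm_judge README, and the judge step assumes `OPENAI_API_KEY` is set for GPT-4.

```python
# Sketch of the upstream FastChat llm_judge flow (not the MT-Bench-tw fork).
import subprocess

model_id = "breeze-7b-instruct"  # arbitrary label for the answer files

# 1) Generate the model's answers to the benchmark questions.
subprocess.run(["python", "gen_model_answer.py",
                "--model-path", "MediaTek-Research/Breeze-7B-Instruct-v0.1",
                "--model-id", model_id], check=True)

# 2) Score each answer with the GPT-4 judge.
subprocess.run(["python", "gen_judgment.py", "--model-list", model_id], check=True)

# 3) Print per-category and average scores.
subprocess.run(["python", "show_result.py", "--model-list", model_id], check=True)
```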
@@ -123,8 +122,8 @@ Breeze-7B-Instruct-64k-v0.1 can solve tasks such as question answering and summa
 
 **Category Score of MT-Bench-tw (0 shot)**
 
-| Models | STEM | Extraction | Reasoning | Math | Coding | Roleplay | Writing | Humanities | Average |
-|--------|------|------------|-----------|------|--------|----------|---------|------------|---------|
+| Models | STEM | Extraction | Reasoning | Math | Coding | Roleplay | Writing | Humanities | ↑ AVG |
+|--------|------|------------|-----------|------|--------|----------|---------|------------|-------|
 | gpt-3.5-turbo | | | | | | | | | |
 | Yi-34B-Chat | | | | | | | | | |
 | Qwen-14B-Chat | | | | | | | | | |
@@ -137,17 +136,17 @@ Breeze-7B-Instruct-64k-v0.1 can solve tasks such as question answering and summa
 
 **Category ACC of TMMLU+ (0 shot)**
 
-| Model | STEM | Social Science | Humanities | Other | Average |
+| Model | STEM | Social Science | Humanities | Other | ↑ AVG |
 |-------|------|----------------|------------|-------|-------|
-| gpt-3.5-turbo | 41.56 | 46.72 | 36.73 | 42.03 | |
-| Yi-34B-Chat | 47.65 | 64.25 | 52.73 | 54.91 | |
-| Qwen-14B-Chat | 43.83 | 55.00 | 48.55 | 46.22 | |
-| **Breeze-7B-Instruct-v0.1** | 37.41 | 46.81 | 42.06 | 40.16 | |
-| **Breeze-7B-Instruct-64k-v0.1** | 37.88 | 46.35 | 40.31 | 39.40 | |
-| Qwen-7B-Chat | 35.44 | 46.22 | 38.35 | 40.06 | |
-| Yi-6B-Chat | 37.80 | 51.74 | 45.36 | 44.25 | |
-| Taiwan-LLM-13B-v2.0-chat | 27.74 | 33.69 | 27.03 | 29.43 | |
-| Taiwan-LLM-7B-v2.1-chat | 25.58 | 31.76 | 27.36 | 27.61 | |
+| Yi-34B-Chat | 47.65 | 64.25 | 52.73 | 54.91 | 54.87 |
+| Qwen-14B-Chat | 43.83 | 55.00 | 48.55 | 46.22 | 48.41 |
+| Yi-6B-Chat | 37.80 | 51.74 | 45.36 | 44.25 | 44.79 |
+| gpt-3.5-turbo | 41.56 | 46.72 | 36.73 | 42.03 | 41.76 |
+| **Breeze-7B-Instruct-v0.1** | 37.41 | 46.81 | 42.06 | 40.16 | 41.61 |
+| **Breeze-7B-Instruct-64k-v0.1** | 37.88 | 46.35 | 40.31 | 39.40 | 40.99 |
+| Qwen-7B-Chat | 35.44 | 46.22 | 38.35 | 40.06 | 40.02 |
+| Taiwan-LLM-13B-v2.0-chat | 27.74 | 33.69 | 27.03 | 29.43 | 29.47 |
+| Taiwan-LLM-7B-v2.1-chat | 25.58 | 31.76 | 27.36 | 27.61 | 28.08 |
 
 
 
@@ -155,17 +154,17 @@ Breeze-7B-Instruct-64k-v0.1 can solve tasks such as question answering and summa
 In this test, we use the first 700 characters of the [web article](https://health.udn.com/health/story/5976/7699252?from=udn_ch1005_main_index) as the input and ask the model to write the same article again.
 All inferences run on 2 RTX A6000 GPUs (using `vllm`, with a tensor-parallel size of 2).
 
-| Models | Inference Time (sec) | Estimated Max Input Length (Char) |
+| Models | ↓ Inference Time (sec) | Estimated Max Input Length (Char) |
 |--------|------------------------|-----------------------------------|
-| Yi-6B | 10.62 | 5.2k |
-| **Breeze-7B-Instruct-v0.1** | 10.74 | 11.1k |
-| **Breeze-7B-Instruct-64k-v0.1** | 10.74 | 88.8k |
-| Qwen-7B | 10.86 | 9.8k |
-| Qwen-14B | 18.89 | 9.8k |
-| Mistral-7B-v0.1 | 20.48 | 5.1k |
-| Taiwan-LLM-7B-v2.1-base | 26.26 | 2.2k |
-| Taiwan-LLM-13B-v2.0-base | 36.80 | 2.2k |
-| Yi-34B | 43.71 | 4.5k |
+| Yi-6B | 10.62 | 5.2k |
+| **Breeze-7B-Instruct-v0.1** | 10.74 | 11.1k |
+| **Breeze-7B-Instruct-64k-v0.1** | 10.74 | 88.8k |
+| Qwen-7B | 10.86 | 9.8k |
+| Qwen-14B | 18.89 | 9.8k |
+| Mistral-7B-v0.1 | 20.48 | 5.1k |
+| Taiwan-LLM-7B-v2.1-base | 26.26 | 2.2k |
+| Taiwan-LLM-13B-v2.0-base | 36.80 | 2.2k |
+| Yi-34B | 43.71 | 4.5k |
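The timing setup described above maps onto vLLM's offline API roughly as follows. This is a sketch only: the exact prompt wording and decoding parameters of the test are not given here, so those are assumptions.

```python
# Sketch of the speed test: Breeze served by vLLM across 2 GPUs
# (tensor_parallel_size=2), timing a single long generation.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="MediaTek-Research/Breeze-7B-Instruct-v0.1",
          tensor_parallel_size=2)  # one weight shard per RTX A6000

prompt = "..."  # first 700 characters of the article + the rewrite request
params = SamplingParams(temperature=0.0, max_tokens=1024)  # assumed settings

start = time.time()
out = llm.generate([prompt], params)
print(f"inference time: {time.time() - start:.2f} sec")
print(out[0].outputs[0].text)
```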
 
 ## Long-context Performance
 
@@ -209,3 +208,14 @@ The suggested default `SYS_PROMPT` is
 ```txt
 You are a helpful AI assistant built by MediaTek Research. The user you are helping speaks Traditional Chinese and comes from Taiwan.
 ```
+
+## Citation
+
+```
+@article{breeze7b2024,
+  title={},
+  author={},
+  journal={arXiv},
+  year={2024}
+}
+```
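A minimal generation sketch using the suggested `SYS_PROMPT` quoted in the last hunk. It assumes the Mistral-style `[INST]` template that Breeze-7B inherits from its base model; check the full model card for the exact prompt format.

```python
# Sketch: prepend the suggested SYS_PROMPT, assuming a Mistral-style template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

SYS_PROMPT = ("You are a helpful AI assistant built by MediaTek Research. "
              "The user you are helping speaks Traditional Chinese and comes from Taiwan.")

name = "MediaTek-Research/Breeze-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.bfloat16, device_map="auto")

# Example query: "Please introduce Taiwan's night market culture."
prompt = f"{SYS_PROMPT} [INST] 請介紹台灣的夜市文化。 [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```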