Update README.md
README.md
CHANGED
@@ -100,7 +100,7 @@ For more information, please refer to our [Github repo](https://github.com/QwenL
 The details of the model architecture of Qwen-7B-Chat are listed as follows

 | Hyperparameter | Value |
-
+|:------|:------|
 | n_layers | 32 |
 | n_heads | 32 |
 | d_model | 4096 |
@@ -139,7 +139,7 @@ Note: Due to rounding errors caused by hardware and framework, differences in re
 We demonstrate the zero-shot accuracy of Qwen-7B-Chat on C-Eval validation set

 | Model | Avg. Acc. |
-
+|:--------------|:------:|
 | LLaMA2-7B-Chat | 31.9 |
 | LLaMA2-13B-Chat | 40.6 |
 | Chinese-Alpaca-2-7B | 41.3 |
@@ -154,7 +154,7 @@ On the C-Eval test set, the zero-shot accuracy results of the Qwen-7B-Chat model are as follows:
 The zero-shot accuracy of Qwen-7B-Chat on the C-Eval test set is provided below:

 | Model | Avg. | STEM | Social Sciences | Humanities | Others |
-
+|:--------------|:------:|:------:|:------:|:------:|:------:|
 | Chinese-Alpaca-Plus-13B | 41.5 | 36.6 | 49.7 | 43.1 | 41.2 |
 | Chinese-Alpaca-2-7B | 40.3 | - | - | - | - |
 | ChatGLM2-6B-Chat | 50.1 | 46.4 | 60.4 | 50.6 | 46.9 |
@@ -175,7 +175,7 @@ The zero-shot accuracy of Qwen-7B-Chat on MMLU is provided below.
 The performance of Qwen-7B-Chat still ranks among the top of human-aligned models of comparable size.

 | Model | Avg. Acc. |
-
+|:--------------|:------:|
 | ChatGLM2-6B-Chat | 45.5 |
 | LLaMA2-7B-Chat | 47.0 |
 | InternLM-7B-Chat | 50.8 |
@@ -190,7 +190,7 @@ On [HumanEval](https://github.com/openai/human-eval), Qwen-7B-Chat's zero-shot Pas
 The zero-shot Pass@1 of Qwen-7B-Chat on [HumanEval](https://github.com/openai/human-eval) is demonstrated below

 | Model | Pass@1 |
-
+|:--------------|:------:|
 | LLaMA2-7B-Chat | 12.2 |
 | InternLM-7B-Chat | 14.0 |
 | Baichuan-13B-Chat | 16.5 |
@@ -204,7 +204,7 @@ The zero-shot Pass@1 of Qwen-7B-Chat on [HumanEval](https://github.com/openai/hu
 The accuracy of Qwen-7B-Chat on GSM8K is shown below

 | Model | Zero-shot Acc. | 4-shot Acc. |
-
+|:--------------|:------:|:------:|
 | ChatGLM2-6B-Chat | - | 28.0 |
 | LLaMA2-7B-Chat | 20.4 | 28.2 |
 | LLaMA2-13B-Chat | 29.4 | 36.7 |
@@ -224,7 +224,7 @@ We introduce NTK-aware interpolation, LogN attention scaling to extend the conte
 **(To use these tricks, please set `use_dynamic_ntk` and `use_long_attn` to true in config.json.)**

 | Model | VCSUM (zh) |
-
+|:----------------|:-------:|
 | GPT-3.5-Turbo-16k | 16.0 |
 | LLaMA2-7B-Chat | 0.2 |
 | InternLM-7B-Chat | 13.0 |
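As a reference for the two flags named in this hunk, below is a minimal sketch of enabling them at load time instead of editing `config.json` by hand. The flag names come from the README text above; the loading calls and the `Qwen/Qwen-7B-Chat` checkpoint id are assumptions, not part of this diff:

```python
from transformers import AutoConfig, AutoModelForCausalLM

checkpoint = "Qwen/Qwen-7B-Chat"  # assumed repo id

# Flip the long-context flags on the loaded config before building the model.
config = AutoConfig.from_pretrained(checkpoint, trust_remote_code=True)
config.use_dynamic_ntk = True  # NTK-aware interpolation
config.use_long_attn = True    # LogN attention scaling (name as given in the README)

model = AutoModelForCausalLM.from_pretrained(
    checkpoint, config=config, trust_remote_code=True
)
```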
@@ -240,7 +240,7 @@ We introduce NTK-aware interpolation, LogN attention scaling to extend the conte
 Qwen-7B-Chat supports calling plugins/tools/APIs through [ReAct Prompting](https://arxiv.org/abs/2210.03629). ReAct is also one of the main approaches used by the [LangChain](https://python.langchain.com/) framework. In the soon-to-be-released evaluation benchmark for assessing tool usage capabilities, Qwen-7B-Chat's performance is as follows:

 | Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
-
+|:-----------------|:----------------------:|:---------------------:|:---------------------:|
 | GPT-4 | 95% | **0.90** | 15% |
 | GPT-3.5 | 85% | 0.88 | 75% |
 | **Qwen-7B-Chat** | **99%** | 0.89 | **8.5%** |
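To make the ReAct setup above concrete, here is a hypothetical sketch of the prompt shape such tool calling relies on. The exact template Qwen-7B-Chat expects is in the ReAct examples referenced in the next hunk; the `search` tool and its parameters here are invented for illustration:

```python
# A generic ReAct-style prompt skeleton (illustrative only; not Qwen's exact template).
# The model interleaves Thought/Action/Action Input steps; the caller executes each
# action and feeds the result back as an Observation until a Final Answer appears.
TOOL_DESC = 'search: useful for looking up facts. Input: {"query": "<text>"}'

REACT_TEMPLATE = (
    "Answer the following question. You have access to the following tools:\n\n"
    f"{TOOL_DESC}\n\n"
    "Use the following format:\n"
    "Question: the input question\n"
    "Thought: reason about what to do next\n"
    "Action: the tool to use, one of [search]\n"
    "Action Input: the input to the tool\n"
    "Observation: the result of the action\n"
    "... (Thought/Action/Action Input/Observation can repeat)\n"
    "Final Answer: the final answer to the question\n\n"
    "Question: "
)

prompt = REACT_TEMPLATE + "How long is the Great Wall?"
# `prompt` is sent to the chat model; generation is stopped at "Observation:".
```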
@@ -263,7 +263,7 @@ For how to write and use prompts for ReAct Prompting, please refer to [the ReAct
 Qwen-7B-Chat also has the capability to be used as a [HuggingFace Agent](https://huggingface.co/docs/transformers/transformers_agents). Its performance on the run-mode benchmark provided by HuggingFace is as follows:

 | Model | Tool Selection↑ | Tool Used↑ | Code↑ |
-
+|:-|:-:|:-:|:-:|
 |GPT-4 | **100** | **100** | **97.41** |
 |GPT-3.5 | 95.37 | 96.30 | 87.04 |
 |StarCoder-15.5B | 87.04 | 87.96 | 68.89 |
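For readers unfamiliar with the agent setup being benchmarked here, the sketch below shows one plausible way to run a local model through the transformers agents interface of that era. The `LocalAgent` usage is our assumption about that API (see the linked docs), and the checkpoint id and task string are invented:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, LocalAgent

checkpoint = "Qwen/Qwen-7B-Chat"  # assumed repo id
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, trust_remote_code=True, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)

# The agent selects a tool, fills in its input, and writes the glue code itself.
agent = LocalAgent(model, tokenizer)
agent.run("Summarize the following `text`.", text="Qwen-7B-Chat is a chat model ...")
```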
@@ -303,7 +303,7 @@ model = AutoModelForCausalLM.from_pretrained(
 With this method, you can load Qwen-7B-Chat in `NF4` and `Int8`, which saves memory. We provide the related model-performance statistics below. We find that quantization degrades effectiveness slightly but significantly improves inference efficiency and reduces memory cost.

 | Precision | MMLU | Memory |
-
+| :---------| :-------: | :-----: |
 | BF16 | 56.7 | 16.2G |
 | Int8 | 52.8 | 10.1G |
 | NF4 | 48.9 | 7.4G |
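For context on how the `NF4`/`Int8` rows above are typically produced, here is a minimal sketch of quantized loading via bitsandbytes. The diff only shows the start of the `from_pretrained(` call, so the `BitsAndBytesConfig` usage and checkpoint id below are assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

checkpoint = "Qwen/Qwen-7B-Chat"  # assumed repo id

# NF4 (4-bit) loading; for Int8, use BitsAndBytesConfig(load_in_8bit=True) instead.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",
)
```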