alexmarques committed
Commit e8e0ea8
Parent: 422a385

Update README.md

Files changed (1): README.md (+41 -17)
README.md CHANGED
@@ -130,10 +130,11 @@ model.save_pretrained("Meta-Llama-3.1-70B-Instruct-quantized.w8a16")
 
 The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande and TruthfulQA.
 Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
-This version of the lm-evaluation-harness includes versions of ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-70B-Instruct-evals).
+This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-70B-Instruct-evals).
 
 ### Accuracy
 
+#### Open LLM Leaderboard evaluation scores
 <table>
 <tr>
 <td><strong>Benchmark</strong>
@@ -148,21 +149,31 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge and
 <tr>
 <td>MMLU (5-shot)
 </td>
-<td>82.21
+<td>83.88
 </td>
-<td>82.12
+<td>81.07
 </td>
-<td>99.9%
+<td>96.6%
+</td>
+</tr>
+<tr>
+<td>MMLU (CoT, 0-shot)
+</td>
+<td>85.74
+</td>
+<td>83.29
+</td>
+<td>97.1%
 </td>
 </tr>
 <tr>
 <td>ARC Challenge (0-shot)
 </td>
-<td>95.05
+<td>93.26
 </td>
-<td>93.60
+<td>91.98
 </td>
-<td>98.5%
+<td>98.6%
 </td>
 </tr>
 <tr>
@@ -208,11 +219,11 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge and
 <tr>
 <td><strong>Average</strong>
 </td>
-<td><strong>83.60</strong>
+<td><strong>83.89</strong>
 </td>
-<td><strong>82.66</strong>
+<td><strong>82.54</strong>
 </td>
-<td><strong>99.1%</strong>
+<td><strong>98.4%</strong>
 </td>
 </tr>
 </table>
@@ -225,17 +236,30 @@ The results were obtained using the following commands:
 ```
 lm_eval \
 --model vllm \
---model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=4 \
---tasks mmlu \
+--model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16",dtype=auto,add_bos_token=True,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
+--tasks mmlu_llama_3.1_instruct \
+--fewshot_as_multiturn \
+--apply_chat_template \
 --num_fewshot 5 \
 --batch_size auto
 ```
 
+#### MMLU-CoT
+```
+lm_eval \
+--model vllm \
+--model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16",dtype=auto,add_bos_token=True,max_model_len=4064,max_gen_toks=1024,tensor_parallel_size=1 \
+--tasks mmlu_cot_0shot_llama_3.1_instruct \
+--apply_chat_template \
+--num_fewshot 0 \
+--batch_size auto
+```
+
 #### ARC-Challenge
 ```
 lm_eval \
 --model vllm \
---model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=4 \
+--model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16",dtype=auto,add_bos_token=True,max_model_len=3940,max_gen_toks=100,tensor_parallel_size=1 \
 --tasks arc_challenge_llama_3.1_instruct \
 --apply_chat_template \
 --num_fewshot 0 \
@@ -246,7 +270,7 @@ lm_eval \
 ```
 lm_eval \
 --model vllm \
---model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=4 \
+--model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16",dtype=auto,add_bos_token=True,max_model_len=4096,max_gen_toks=1024,tensor_parallel_size=1 \
 --tasks gsm8k_cot_llama_3.1_instruct \
 --fewshot_as_multiturn \
 --apply_chat_template \
@@ -258,7 +282,7 @@ lm_eval \
 ```
 lm_eval \
 --model vllm \
---model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=4 \
+--model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
 --tasks hellaswag \
 --num_fewshot 10 \
 --batch_size auto
@@ -268,7 +292,7 @@ lm_eval \
 ```
 lm_eval \
 --model vllm \
---model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=4 \
+--model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
 --tasks winogrande \
 --num_fewshot 5 \
 --batch_size auto
@@ -278,7 +302,7 @@ lm_eval \
 ```
 lm_eval \
 --model vllm \
---model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=4 \
+--model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
 --tasks truthfulqa \
 --num_fewshot 0 \
 --batch_size auto
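
Note: the recovery column updated in this commit is the quantized model's score divided by the unquantized baseline's score. The table's column headers fall outside the hunks shown, but the numbers are consistent with the first score column being the unquantized Meta-Llama-3.1-70B-Instruct baseline and the second being the w8a16 model. A minimal sketch of that arithmetic, using values from the updated table (the `recovery` helper is illustrative, not part of the commit):

```python
# Recovery = quantized score / baseline score, expressed as a percentage.
def recovery(baseline: float, quantized: float) -> float:
    return 100.0 * quantized / baseline

# (baseline, quantized) pairs copied from the updated table above.
scores = {
    "MMLU (5-shot)": (83.88, 81.07),
    "MMLU (CoT, 0-shot)": (85.74, 83.29),
    "ARC Challenge (0-shot)": (93.26, 91.98),
    "Average": (83.89, 82.54),
}

for benchmark, (baseline, quantized) in scores.items():
    print(f"{benchmark}: {recovery(baseline, quantized):.1f}%")
# MMLU (5-shot): 96.6%
# MMLU (CoT, 0-shot): 97.1%
# ARC Challenge (0-shot): 98.6%
# Average: 98.4%
```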
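
All updated commands load the model through vLLM with `tensor_parallel_size=1` (previously 4). For a quick sanity check outside the harness, roughly the same settings can be reproduced with vLLM's offline-inference API. This is a sketch, not part of the commit: the prompt and sampling settings are illustrative, and sufficient GPU memory for the 70B w8a16 checkpoint is assumed.

```python
from vllm import LLM, SamplingParams

# Mirror the model_args used by the lm_eval commands above.
llm = LLM(
    model="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16",
    dtype="auto",
    max_model_len=4096,
    tensor_parallel_size=1,
)

# Illustrative greedy decoding; lm_eval configures generation per task.
sampling = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["The capital of France is"], sampling)
print(outputs[0].outputs[0].text)
```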