[WIP] Add evaluation results to model card metadata
This is a work in progress. The goal is to list evaluation results in the model card metadata, especially the results from the Open LLM Leaderboard. This PR has **not** been created automatically.
#### Pending questions:
1. Should we report all metrics for each task (especially the `_stderr` ones), or only the one displayed in the Open LLM Leaderboard?
2. Are the dataset `type`/`name`/`config`/`split`/`num_few_shot` values accurate in the suggested changes?
3. How should we report the MMLU results? There are 57 different `hendrycksTest` datasets, for a total of 228 metrics. 😵 One possible aggregated form is sketched right after this list.
4. How should we report the MT-Bench results? (Asking since they are reported in the model card but not in the metadata.)
5. How should we report the AlpacaEval results? (Same situation as MT-Bench.)
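For question 3, one option would be a single aggregated entry instead of 57 `hendrycksTest` records. A hypothetical sketch, assuming `cais/mmlu` with the `all` config as the dataset reference; the `value` is a placeholder, not a real score:

```yaml
# MMLU (5-Shot) (Open LLM Leaderboard)
# Hypothetical aggregated entry: one record instead of 57 hendrycksTest ones.
- task:
    type: text-generation
    name: Text Generation
  dataset:
    name: MMLU (5-Shot)
    type: cais/mmlu
    config: all
    split: test
    args:
      num_few_shot: 5
  metrics:
  - type: acc
    name: accuracy (average over the 57 subsets)
    value: 0.0  # placeholder - to be replaced by the leaderboard average
  source:
    name: Open LLM Leaderboard
    url: https://huggingface.co/datasets/open-llm-leaderboard/details_HuggingFaceH4__zephyr-7b-beta_public
```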
Related thread: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/370#65663f60589e212284db2ffc.
Related PR in the Hub docs: https://github.com/huggingface/hub-docs/pull/1144.
cc @clefourrier @julien-c @lewtun @Weyaxi
The suggested changes to `README.md`:

@@ -1,9 +1,6 @@
 ---
 tags:
 - generated_from_trainer
-model-index:
-- name: zephyr-7b-beta
-  results: []
 license: mit
 datasets:
 - HuggingFaceH4/ultrachat_200k
@@ -16,8 +13,161 @@ widget:
   output:
     text: "Arr! 'Tis a puzzlin' matter, me hearty! A llama on yer lawn be a rare sight, but I've got a plan that might help ye get rid of 'im. Ye'll need to gather some carrots and hay, and then lure the llama away with the promise of a tasty treat. Once he's gone, ye can clean up yer lawn and enjoy the peace and quiet once again. But beware, me hearty, for there may be more llamas where that one came from! Arr!"
 pipeline_tag: text-generation
+model-index:
+- name: zephyr-7b-beta
+  results:
+
+  # AI2 Reasoning Challenge (25-Shot) (Open LLM Leaderboard)
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: AI2 Reasoning Challenge (25-Shot)
+      type: ai2_arc
+      config: ARC-Challenge
+      split: test
+      args:
+        num_few_shot: 25
+    metrics:
+    - type: acc
+      name: accuracy
+      value: 0.590443686006826
+    - type: acc_stderr
+      value: 0.014370358632472437
+    - type: acc_norm
+      name: normalized accuracy
+      value: 0.6203071672354948
+    - type: acc_norm_stderr
+      value: 0.01418211986697487
+    source:
+      name: Open LLM Leaderboard
+      url: https://huggingface.co/datasets/open-llm-leaderboard/details_HuggingFaceH4__zephyr-7b-beta_public
+
+  # HellaSwag (10-shot) (Open LLM Leaderboard)
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: HellaSwag (10-Shot)
+      type: Rowan/hellaswag
+      split: test # or validation?
+      args:
+        num_few_shot: 10
+    metrics:
+    - type: acc
+      name: accuracy
+      value: 0.6491734714200359
+    - type: acc_stderr
+      value: 0.004762534245488399
+    - type: acc_norm
+      name: normalized accuracy
+      value: 0.8435570603465445
+    - type: acc_norm_stderr
+      value: 0.003625323221166244
+    source:
+      name: Open LLM Leaderboard
+      url: https://huggingface.co/datasets/open-llm-leaderboard/details_HuggingFaceH4__zephyr-7b-beta_public
+
+  # DROP (3-shot) (Open LLM Leaderboard)
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: Drop (3-Shot)
+      type: drop
+      split: test
+      args:
+        num_few_shot: 3
+    metrics:
+    - type: em
+      name: exact match
+      value: 0.004928691275167785
+    - type: em_stderr
+      value: 0.0007171872517059793
+    - type: f1
+      name: f1 score
+      value: 0.09662437080536909
+    - type: f1_stderr
+      value: 0.0018807376338089597
+    source:
+      name: Open LLM Leaderboard
+      url: https://huggingface.co/datasets/open-llm-leaderboard/details_HuggingFaceH4__zephyr-7b-beta_public
+
+  # TruthfulQA (0-shot) (Open LLM Leaderboard)
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: TruthfulQA (0-shot)
+      type: truthful_qa
+      config: multiple_choice
+      split: validation
+      args:
+        num_few_shot: 0
+    metrics:
+    - type: mc1
+      value: 0.40636474908200737
+    - type: mc1_stderr
+      value: 0.017193835812093893
+    - type: mc2
+      value: 0.5744916942762855
+    - type: mc2_stderr
+      value: 0.015742095840959796
+    source:
+      name: Open LLM Leaderboard
+      url: https://huggingface.co/datasets/open-llm-leaderboard/details_HuggingFaceH4__zephyr-7b-beta_public
+
+  # GSM8k (5-shot) (Open LLM Leaderboard)
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: GSM8k (5-shot)
+      type: gsm8k
+      split: test
+      args:
+        num_few_shot: 5
+    metrics:
+    - type: acc
+      name: accuracy
+      value: 0.12736921910538287
+    - type: acc_stderr
+      value: 0.009183110326737829
+    source:
+      name: Open LLM Leaderboard
+      url: https://huggingface.co/datasets/open-llm-leaderboard/details_HuggingFaceH4__zephyr-7b-beta_public
+
+  # MMLU (5-Shot) (Open LLM Leaderboard)
+  # ???
+
+  # AlpacaEval (taken from model card)
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: AlpacaEval
+      type: unknown
+    metrics:
+    - type: unknown
+      name: win rate
+      value: 0.9060
+    source:
+      url: https://tatsu-lab.github.io/alpaca_eval/
+
+  # MT-Bench (taken from model card)
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: MT-Bench
+      type: unknown
+    metrics:
+    - type: unknown
+      name: score
+      value: 7.34
+    source:
+      url: https://huggingface.co/spaces/lmsys/mt-bench
 ---
-
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment. -->
 
@@ -86,12 +236,9 @@ Here's how you can run the model using the `pipeline()` function from 🤗 Trans
 # Install transformers from source - only needed for versions <= v4.34
 # pip install git+https://github.com/huggingface/transformers.git
 # pip install accelerate
-
 import torch
 from transformers import pipeline
-
 pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta", torch_dtype=torch.bfloat16, device_map="auto")
-
 # We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
 messages = [
     {
@@ -149,12 +296,8 @@ The following hyperparameters were used during training:
 - lr_scheduler_type: linear
 - lr_scheduler_warmup_ratio: 0.1
 - num_epochs: 3.0
-
 ### Training results
-
 The table below shows the full set of DPO training metrics:
-
-
 | Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
 |:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
 | 0.6284 | 0.05 | 100 | 0.6098 | 0.0425 | -0.1872 | 0.7344 | 0.2297 | -258.8416 | -253.8099 | -2.7976 | -2.8234 |
@@ -215,19 +358,13 @@ The table below shows the full set of DPO training metrics:
 | 0.0077 | 2.89 | 5600 | 0.7520 | -4.5586 | -8.3485 | 0.7969 | 3.7899 | -340.4545 | -299.8206 | -2.3078 | -2.3517 |
 | 0.0094 | 2.94 | 5700 | 0.7527 | -4.5542 | -8.3509 | 0.7812 | 3.7967 | -340.4790 | -299.7773 | -2.3062 | -2.3510 |
 | 0.0054 | 2.99 | 5800 | 0.7520 | -4.5169 | -8.3079 | 0.7812 | 3.7911 | -340.0493 | -299.4038 | -2.3081 | -2.3530 |
-
-
 ### Framework versions
-
 - Transformers 4.35.0.dev0
 - Pytorch 2.0.1+cu118
 - Datasets 2.12.0
 - Tokenizers 0.14.0
-
 ## Citation
-
 If you find Zephyr-7B-β is useful in your work, please cite it with:
-
 ```
 @misc{tunstall2023zephyr,
   title={Zephyr: Direct Distillation of LM Alignment},
@@ -240,7 +377,6 @@ If you find Zephyr-7B-β is useful in your work, please cite it with:
 ```
 # [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
 Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_HuggingFaceH4__zephyr-7b-beta)
-
 | Metric | Value |
 |-----------------------|---------------------------|
 | Avg. | 52.15 |