Added Eval-Scores
README.md (CHANGED)
@@ -105,9 +105,70 @@ print(outputs[0]["generated_text"])
```

## 🏆 Evaluation Scores

### Nous

| Model |AGIEval|TruthfulQA|Bigbench|
|----------------------------------------------------------------------------------------------------------------|------:|---------:|-------:|
|[Llama3-8B-SuperNova-Spectrum-dare_ties](https://huggingface.co/yuvraj17/Llama3-8B-SuperNova-Spectrum-dare_ties)| 38.32| 57.15| 43.91|

### AGIEval
| Task |Version| Metric |Value| |Stderr|
|------------------------------|------:|--------|----:|---|-----:|
|agieval_aqua_rat | 0|acc |20.47|± | 2.54|
| | |acc_norm|18.50|± | 2.44|
|agieval_logiqa_en | 0|acc |35.94|± | 1.88|
| | |acc_norm|35.64|± | 1.88|
|agieval_lsat_ar | 0|acc |21.74|± | 2.73|
| | |acc_norm|20.00|± | 2.64|
|agieval_lsat_lr | 0|acc |41.37|± | 2.18|
| | |acc_norm|40.98|± | 2.18|
|agieval_lsat_rc | 0|acc |59.11|± | 3.00|
| | |acc_norm|56.13|± | 3.03|
|agieval_sat_en | 0|acc |63.59|± | 3.36|
| | |acc_norm|60.19|± | 3.42|
|agieval_sat_en_without_passage| 0|acc |40.29|± | 3.43|
| | |acc_norm|37.38|± | 3.38|
|agieval_sat_math | 0|acc |38.64|± | 3.29|
| | |acc_norm|37.73|± | 3.28|

Average: 38.32%

### TruthfulQA
| Task |Version|Metric|Value| |Stderr|
|-------------|------:|------|----:|---|-----:|
|truthfulqa_mc| 1|mc1 |38.43|± | 1.7|
| | |mc2 |57.15|± | 1.5|

Average: 57.15%

### Bigbench
| Task |Version| Metric |Value| |Stderr|
|------------------------------------------------|------:|---------------------|----:|---|-----:|
|bigbench_causal_judgement | 0|multiple_choice_grade|58.42|± | 3.59|
|bigbench_date_understanding | 0|multiple_choice_grade|70.73|± | 2.37|
|bigbench_disambiguation_qa | 0|multiple_choice_grade|30.23|± | 2.86|
|bigbench_geometric_shapes | 0|multiple_choice_grade|47.35|± | 2.64|
| | |exact_str_match | 0.00|± | 0.00|
|bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|29.00|± | 2.03|
|bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|21.00|± | 1.54|
|bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|51.33|± | 2.89|
|bigbench_movie_recommendation | 0|multiple_choice_grade|33.20|± | 2.11|
|bigbench_navigate | 0|multiple_choice_grade|55.40|± | 1.57|
|bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|66.35|± | 1.06|
|bigbench_ruin_names | 0|multiple_choice_grade|45.76|± | 2.36|
|bigbench_salient_translation_error_detection | 0|multiple_choice_grade|28.26|± | 1.43|
|bigbench_snarks | 0|multiple_choice_grade|62.43|± | 3.61|
|bigbench_sports_understanding | 0|multiple_choice_grade|50.30|± | 1.59|
|bigbench_temporal_sequences | 0|multiple_choice_grade|48.00|± | 1.58|
|bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|23.60|± | 1.20|
|bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|17.66|± | 0.91|
|bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|51.33|± | 2.89|

Average: 43.91%
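
The tables above follow the Nous-style task and metric layout produced by evaluation tools such as [LLM AutoEval](https://github.com/mlabonne/llm-autoeval). For a rough spot-check of individual numbers, the snippet below is a minimal sketch using EleutherAI's `lm-evaluation-harness` Python API; it assumes a recent `lm-eval` (>= 0.4) install, and its task names and few-shot settings may not exactly match the configuration used for these tables, so small deviations are expected.

```python
# Minimal spot-check sketch (assumption: `pip install lm-eval accelerate`, lm-eval >= 0.4).
# The current harness' task names / few-shot settings may differ from the Nous-style
# setup used for the tables above, so the numbers will only be approximate.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args=(
        "pretrained=yuvraj17/Llama3-8B-SuperNova-Spectrum-dare_ties,"
        "dtype=bfloat16"
    ),
    tasks=["truthfulqa_mc2"],  # add AGIEval / BIG-bench tasks if your harness version ships them
    batch_size=8,
)

# results["results"] maps each task name to its metric dictionary
for task, metrics in results["results"].items():
    print(task, metrics)
```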

## Special thanks & Reference
- Maxime Labonne for the easy-to-use [Merging LLMs with MergeKit](https://github.com/mlabonne/llm-course/blob/main/Mergekit.ipynb) Colab notebook, the accompanying [blog post](https://towardsdatascience.com/merge-large-language-models-with-mergekit-2118fb392b54), and the [LLM AutoEval](https://github.com/mlabonne/llm-autoeval) notebook
- Authors of [Mergekit](https://github.com/arcee-ai/mergekit)