cstr committed on
Commit
5f651ca
1 Parent(s): f880b55

Update README.md

Files changed (1)
  1. README.md +58 -0
README.md CHANGED
@@ -37,6 +37,64 @@ Detailed results can be found [here](https://huggingface.co/datasets/open-llm-le
 |MuSR (0-shot) |11.15|
 |MMLU-PRO (5-shot) |30.34|
 
+ | Model |AGIEval|TruthfulQA|Bigbench|
+ |--------------------------------------------------------------------------------|------:|---------:|-------:|
+ |[llama3.1-8b-spaetzle-v90](https://huggingface.co/cstr/llama3.1-8b-spaetzle-v90)| 42.05| 57.2| 44.75|
+
+ ### AGIEval
+ | Task |Version| Metric |Value| |Stderr|
+ |------------------------------|------:|--------|----:|---|-----:|
+ |agieval_aqua_rat | 0|acc |24.02|± | 2.69|
+ | | |acc_norm|23.62|± | 2.67|
+ |agieval_logiqa_en | 0|acc |40.09|± | 1.92|
+ | | |acc_norm|39.78|± | 1.92|
+ |agieval_lsat_ar | 0|acc |22.17|± | 2.75|
+ | | |acc_norm|21.74|± | 2.73|
+ |agieval_lsat_lr | 0|acc |50.39|± | 2.22|
+ | | |acc_norm|45.29|± | 2.21|
+ |agieval_lsat_rc | 0|acc |64.31|± | 2.93|
+ | | |acc_norm|58.36|± | 3.01|
+ |agieval_sat_en | 0|acc |81.07|± | 2.74|
+ | | |acc_norm|73.79|± | 3.07|
+ |agieval_sat_en_without_passage| 0|acc |45.15|± | 3.48|
+ | | |acc_norm|38.83|± | 3.40|
+ |agieval_sat_math | 0|acc |40.91|± | 3.32|
+ | | |acc_norm|35.00|± | 3.22|
+
+ Average: 42.05%
+
+ ### TruthfulQA
+ | Task |Version|Metric|Value| |Stderr|
+ |-------------|------:|------|----:|---|-----:|
+ |truthfulqa_mc| 1|mc1 |39.66|± | 1.71|
+ | | |mc2 |57.20|± | 1.51|
+
+ Average: 57.2%
+
+ ### Bigbench
+ | Task |Version| Metric |Value| |Stderr|
+ |------------------------------------------------|------:|---------------------|----:|---|-----:|
+ |bigbench_causal_judgement | 0|multiple_choice_grade|58.42|± | 3.59|
+ |bigbench_date_understanding | 0|multiple_choice_grade|70.46|± | 2.38|
+ |bigbench_disambiguation_qa | 0|multiple_choice_grade|31.40|± | 2.89|
+ |bigbench_geometric_shapes | 0|multiple_choice_grade|33.43|± | 2.49|
+ | | |exact_str_match | 0.00|± | 0.00|
+ |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|30.00|± | 2.05|
+ |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|24.29|± | 1.62|
+ |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|56.00|± | 2.87|
+ |bigbench_movie_recommendation | 0|multiple_choice_grade|38.20|± | 2.18|
+ |bigbench_navigate | 0|multiple_choice_grade|50.20|± | 1.58|
+ |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|69.50|± | 1.03|
+ |bigbench_ruin_names | 0|multiple_choice_grade|54.46|± | 2.36|
+ |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|32.77|± | 1.49|
+ |bigbench_snarks | 0|multiple_choice_grade|65.19|± | 3.55|
+ |bigbench_sports_understanding | 0|multiple_choice_grade|50.30|± | 1.59|
+ |bigbench_temporal_sequences | 0|multiple_choice_grade|45.70|± | 1.58|
+ |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|22.08|± | 1.17|
+ |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|17.03|± | 0.90|
+ |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|56.00|± | 2.87|
+
+ Average: 44.75%
 
 # merge tree
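
As a cross-check, the per-category averages added above (42.05, 57.2, 44.75) can be reproduced from the per-task tables. Below is a minimal Python sketch, assuming the averaged metric per category is `acc_norm` for AGIEval, `mc2` for TruthfulQA, and `multiple_choice_grade` for Bigbench; that metric choice is an inference from which column matches the published averages, not something stated in the commit.

```python
# Sketch: recompute the category averages from the per-task values in the tables above.
# Assumption: AGIEval averages acc_norm, TruthfulQA uses mc2, Bigbench averages
# multiple_choice_grade (inferred because these match the reported 42.05 / 57.2 / 44.75).
from statistics import mean

agieval_acc_norm = [23.62, 39.78, 21.74, 45.29, 58.36, 73.79, 38.83, 35.00]
truthfulqa_mc2 = [57.20]
bigbench_mcg = [
    58.42, 70.46, 31.40, 33.43, 30.00, 24.29, 56.00, 38.20, 50.20,
    69.50, 54.46, 32.77, 65.19, 50.30, 45.70, 22.08, 17.03, 56.00,
]

print(f"AGIEval:    {mean(agieval_acc_norm):.2f}")  # 42.05
print(f"TruthfulQA: {mean(truthfulqa_mc2):.2f}")    # 57.20
print(f"Bigbench:   {mean(bigbench_mcg):.2f}")      # 44.75
```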