INFO: 2024-10-17 21:30:14,019: llmtf.base.evaluator: Starting eval on ['darumeru/multiq']
INFO: 2024-10-17 21:30:14,019: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [147077]
INFO: 2024-10-17 21:30:14,019: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-17 21:30:20,481: llmtf.base.darumeru/MultiQ: Loading Dataset: 6.46s
INFO: 2024-10-17 21:35:59,593: llmtf.base.darumeru/MultiQ: Processing Dataset: 339.11s
INFO: 2024-10-17 21:35:59,593: llmtf.base.darumeru/MultiQ: Results for darumeru/MultiQ:
INFO: 2024-10-17 21:35:59,594: llmtf.base.darumeru/MultiQ: {'f1': 0.3346248767848689, 'em': 0.22275334608030592}
INFO: 2024-10-17 21:35:59,599: llmtf.base.evaluator: Ended eval
INFO: 2024-10-17 21:35:59,599: llmtf.base.evaluator: 
mean	darumeru/MultiQ
0.279	0.279
INFO: 2024-10-17 21:36:08,809: llmtf.base.evaluator: Starting eval on ['darumeru/parus']
INFO: 2024-10-17 21:36:08,810: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [147077]
INFO: 2024-10-17 21:36:08,810: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-17 21:36:12,969: llmtf.base.darumeru/PARus: Loading Dataset: 4.16s
INFO: 2024-10-17 21:36:18,316: llmtf.base.darumeru/PARus: Processing Dataset: 5.35s
INFO: 2024-10-17 21:36:18,317: llmtf.base.darumeru/PARus: Results for darumeru/PARus:
INFO: 2024-10-17 21:36:18,327: llmtf.base.darumeru/PARus: {'acc': 0.7}
INFO: 2024-10-17 21:36:18,327: llmtf.base.evaluator: Ended eval
INFO: 2024-10-17 21:36:18,328: llmtf.base.evaluator: 
mean	darumeru/MultiQ	darumeru/PARus
0.489	0.279	0.700
INFO: 2024-10-17 21:36:27,550: llmtf.base.evaluator: Starting eval on ['darumeru/rcb']
INFO: 2024-10-17 21:36:27,550: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [147077]
INFO: 2024-10-17 21:36:27,551: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-17 21:36:31,450: llmtf.base.darumeru/RCB: Loading Dataset: 3.90s
INFO: 2024-10-17 21:36:38,683: llmtf.base.darumeru/RCB: Processing Dataset: 7.23s
INFO: 2024-10-17 21:36:38,683: llmtf.base.darumeru/RCB: Results for darumeru/RCB:
INFO: 2024-10-17 21:36:38,686: llmtf.base.darumeru/RCB: {'acc': 0.5454545454545454, 'f1_macro': 0.49090309951702227}
INFO: 2024-10-17 21:36:38,687: llmtf.base.evaluator: Ended eval
INFO: 2024-10-17 21:36:38,688: llmtf.base.evaluator: 
mean	darumeru/MultiQ	darumeru/PARus	darumeru/RCB
0.499	0.279	0.700	0.518
INFO: 2024-10-17 21:36:48,734: llmtf.base.evaluator: Starting eval on ['darumeru/ruopenbookqa']
INFO: 2024-10-17 21:36:48,735: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [147077]
INFO: 2024-10-17 21:36:48,735: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-17 21:36:54,900: llmtf.base.darumeru/ruOpenBookQA: Loading Dataset: 6.17s
INFO: 2024-10-17 21:38:00,519: llmtf.base.darumeru/ruOpenBookQA: Processing Dataset: 65.62s
INFO: 2024-10-17 21:38:00,520: llmtf.base.darumeru/ruOpenBookQA: Results for darumeru/ruOpenBookQA:
INFO: 2024-10-17 21:38:00,532: llmtf.base.darumeru/ruOpenBookQA: {'acc': 0.7302405498281787, 'f1_macro': 0.7304546157096631}
INFO: 2024-10-17 21:38:00,541: llmtf.base.evaluator: Ended eval
INFO: 2024-10-17 21:38:00,542: llmtf.base.evaluator: 
mean	darumeru/MultiQ	darumeru/PARus	darumeru/RCB	darumeru/ruOpenBookQA
0.557	0.279	0.700	0.518	0.730
INFO: 2024-10-17 21:38:09,745: llmtf.base.evaluator: Starting eval on ['darumeru/ruworldtree']
INFO: 2024-10-17 21:38:09,745: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [147077]
INFO: 2024-10-17 21:38:09,745: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-17 21:38:14,102: llmtf.base.darumeru/ruWorldTree: Loading Dataset: 4.36s
INFO: 2024-10-17 21:38:16,932: llmtf.base.darumeru/ruWorldTree: Processing Dataset: 2.83s
INFO: 2024-10-17 21:38:16,933: llmtf.base.darumeru/ruWorldTree: Results for darumeru/ruWorldTree:
INFO: 2024-10-17 21:38:16,936: llmtf.base.darumeru/ruWorldTree: {'acc': 0.9047619047619048, 'f1_macro': 0.9043404138496471}
INFO: 2024-10-17 21:38:16,936: llmtf.base.evaluator: Ended eval
INFO: 2024-10-17 21:38:16,937: llmtf.base.evaluator: 
mean	darumeru/MultiQ	darumeru/PARus	darumeru/RCB	darumeru/ruOpenBookQA	darumeru/ruWorldTree
0.626	0.279	0.700	0.518	0.730	0.905
INFO: 2024-10-17 21:38:26,077: llmtf.base.evaluator: Starting eval on ['darumeru/rwsd']
INFO: 2024-10-17 21:38:26,077: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [147077]
INFO: 2024-10-17 21:38:26,077: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-17 21:38:30,781: llmtf.base.darumeru/RWSD: Loading Dataset: 4.70s
INFO: 2024-10-17 21:38:36,497: llmtf.base.darumeru/RWSD: Processing Dataset: 5.72s
INFO: 2024-10-17 21:38:36,497: llmtf.base.darumeru/RWSD: Results for darumeru/RWSD:
INFO: 2024-10-17 21:38:36,498: llmtf.base.darumeru/RWSD: {'acc': 0.6029411764705882}
INFO: 2024-10-17 21:38:36,499: llmtf.base.evaluator: Ended eval
INFO: 2024-10-17 21:38:36,500: llmtf.base.evaluator: 
mean	darumeru/MultiQ	darumeru/PARus	darumeru/RCB	darumeru/RWSD	darumeru/ruOpenBookQA	darumeru/ruWorldTree
0.622	0.279	0.700	0.518	0.603	0.730	0.905
INFO: 2024-10-17 21:38:45,688: llmtf.base.evaluator: Starting eval on ['daru/treewayextractive']
INFO: 2024-10-17 21:38:45,688: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [147077]
INFO: 2024-10-17 21:38:45,688: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-17 21:39:02,002: llmtf.base.daru/treewayextractive: Loading Dataset: 16.31s
INFO: 2024-10-17 21:42:05,777: llmtf.base.daru/treewayextractive: Processing Dataset: 183.77s
INFO: 2024-10-17 21:42:05,777: llmtf.base.daru/treewayextractive: Results for daru/treewayextractive:
INFO: 2024-10-17 21:42:06,010: llmtf.base.daru/treewayextractive: {'r-prec': 0.3917218614718615}
INFO: 2024-10-17 21:42:06,052: llmtf.base.evaluator: Ended eval
INFO: 2024-10-17 21:42:06,054: llmtf.base.evaluator: 
mean	daru/treewayextractive	darumeru/MultiQ	darumeru/PARus	darumeru/RCB	darumeru/RWSD	darumeru/ruOpenBookQA	darumeru/ruWorldTree
0.589	0.392	0.279	0.700	0.518	0.603	0.730	0.905
INFO: 2024-10-17 21:42:15,170: llmtf.base.evaluator: Starting eval on ['nlpcoreteam/rummlu']
INFO: 2024-10-17 21:42:15,170: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [147077]
INFO: 2024-10-17 21:42:15,170: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-17 21:46:47,282: llmtf.base.nlpcoreteam/ruMMLU: Loading Dataset: 272.11s
INFO: 2024-10-17 21:56:29,398: llmtf.base.nlpcoreteam/ruMMLU: Processing Dataset: 582.12s
INFO: 2024-10-17 21:56:29,399: llmtf.base.nlpcoreteam/ruMMLU: Results for nlpcoreteam/ruMMLU:
INFO: 2024-10-17 21:56:29,464: llmtf.base.nlpcoreteam/ruMMLU:                                        metric
subject                                      
abstract_algebra                     0.340000
anatomy                              0.414815
astronomy                            0.611842
business_ethics                      0.610000
clinical_knowledge                   0.554717
college_biology                      0.548611
college_chemistry                    0.380000
college_computer_science             0.450000
college_mathematics                  0.400000
college_medicine                     0.526012
college_physics                      0.470588
computer_security                    0.620000
conceptual_physics                   0.565957
econometrics                         0.377193
electrical_engineering               0.537931
elementary_mathematics               0.529101
formal_logic                         0.365079
global_facts                         0.360000
high_school_biology                  0.664516
high_school_chemistry                0.487685
high_school_computer_science         0.700000
high_school_european_history         0.751515
high_school_geography                0.722222
high_school_government_and_politics  0.564767
high_school_macroeconomics           0.528205
high_school_mathematics              0.433333
high_school_microeconomics           0.533613
high_school_physics                  0.403974
high_school_psychology               0.713761
high_school_statistics               0.523148
high_school_us_history               0.661765
high_school_world_history            0.717300
human_aging                          0.587444
human_sexuality                      0.618321
international_law                    0.735537
jurisprudence                        0.666667
logical_fallacies                    0.564417
machine_learning                     0.392857
management                           0.650485
marketing                            0.752137
medical_genetics                     0.580000
miscellaneous                        0.632184
moral_disputes                       0.583815
moral_scenarios                      0.299441
nutrition                            0.637255
philosophy                           0.617363
prehistory                           0.561728
professional_accounting              0.386525
professional_law                     0.377445
professional_medicine                0.481618
professional_psychology              0.516340
public_relations                     0.500000
security_studies                     0.648980
sociology                            0.756219
us_foreign_policy                    0.720000
virology                             0.439759
world_religions                      0.719298
INFO: 2024-10-17 21:56:29,473: llmtf.base.nlpcoreteam/ruMMLU:                                    metric
subject                                  
STEM                             0.503308
humanities                       0.586259
other (business, health, misc.)  0.543782
social sciences                  0.599968
INFO: 2024-10-17 21:56:29,478: llmtf.base.nlpcoreteam/ruMMLU: {'acc': 0.5583294528508019}
INFO: 2024-10-17 21:56:29,516: llmtf.base.evaluator: Ended eval
INFO: 2024-10-17 21:56:29,518: llmtf.base.evaluator: 
mean	daru/treewayextractive	darumeru/MultiQ	darumeru/PARus	darumeru/RCB	darumeru/RWSD	darumeru/ruOpenBookQA	darumeru/ruWorldTree	nlpcoreteam/ruMMLU
0.586	0.392	0.279	0.700	0.518	0.603	0.730	0.905	0.558
INFO: 2024-10-17 21:56:39,535: llmtf.base.evaluator: Starting eval on ['nlpcoreteam/enmmlu']
INFO: 2024-10-17 21:56:39,536: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [147077]
INFO: 2024-10-17 21:56:39,536: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-17 21:58:54,966: llmtf.base.nlpcoreteam/enMMLU: Loading Dataset: 135.43s
INFO: 2024-10-17 22:08:04,419: llmtf.base.nlpcoreteam/enMMLU: Processing Dataset: 549.45s
INFO: 2024-10-17 22:08:04,426: llmtf.base.nlpcoreteam/enMMLU: Results for nlpcoreteam/enMMLU:
INFO: 2024-10-17 22:08:04,492: llmtf.base.nlpcoreteam/enMMLU:                                        metric
subject                                      
abstract_algebra                     0.380000
anatomy                              0.637037
astronomy                            0.717105
business_ethics                      0.700000
clinical_knowledge                   0.705660
college_biology                      0.715278
college_chemistry                    0.470000
college_computer_science             0.580000
college_mathematics                  0.330000
college_medicine                     0.664740
college_physics                      0.509804
computer_security                    0.740000
conceptual_physics                   0.642553
econometrics                         0.508772
electrical_engineering               0.600000
elementary_mathematics               0.547619
formal_logic                         0.412698
global_facts                         0.360000
high_school_biology                  0.783871
high_school_chemistry                0.581281
high_school_computer_science         0.710000
high_school_european_history         0.800000
high_school_geography                0.757576
high_school_government_and_politics  0.854922
high_school_macroeconomics           0.679487
high_school_mathematics              0.455556
high_school_microeconomics           0.773109
high_school_physics                  0.437086
high_school_psychology               0.844037
high_school_statistics               0.652778
high_school_us_history               0.833333
high_school_world_history            0.843882
human_aging                          0.677130
human_sexuality                      0.786260
international_law                    0.768595
jurisprudence                        0.814815
logical_fallacies                    0.803681
machine_learning                     0.446429
management                           0.786408
marketing                            0.858974
medical_genetics                     0.760000
miscellaneous                        0.795658
moral_disputes                       0.667630
moral_scenarios                      0.311732
nutrition                            0.732026
philosophy                           0.704180
prehistory                           0.712963
professional_accounting              0.503546
professional_law                     0.457627
professional_medicine                0.658088
professional_psychology              0.668301
public_relations                     0.709091
security_studies                     0.697959
sociology                            0.800995
us_foreign_policy                    0.800000
virology                             0.506024
world_religions                      0.801170
INFO: 2024-10-17 22:08:04,506: llmtf.base.nlpcoreteam/enMMLU:                                    metric
subject                                  
STEM                             0.572187
humanities                       0.687100
other (business, health, misc.)  0.667521
social sciences                  0.740042
INFO: 2024-10-17 22:08:04,511: llmtf.base.nlpcoreteam/enMMLU: {'acc': 0.6667125709237595}
INFO: 2024-10-17 22:08:04,554: llmtf.base.evaluator: Ended eval
INFO: 2024-10-17 22:08:04,556: llmtf.base.evaluator: 
mean	daru/treewayextractive	darumeru/MultiQ	darumeru/PARus	darumeru/RCB	darumeru/RWSD	darumeru/ruOpenBookQA	darumeru/ruWorldTree	nlpcoreteam/enMMLU	nlpcoreteam/ruMMLU
0.595	0.392	0.279	0.700	0.518	0.603	0.730	0.905	0.667	0.558
INFO: 2024-10-17 22:08:14,512: llmtf.base.evaluator: Starting eval on ['daru/treewayabstractive']
INFO: 2024-10-17 22:08:14,513: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [147077]
INFO: 2024-10-17 22:08:14,513: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-17 22:08:18,791: llmtf.base.daru/treewayabstractive: Loading Dataset: 4.28s
INFO: 2024-10-17 22:11:46,260: llmtf.base.daru/treewayabstractive: Processing Dataset: 207.47s
INFO: 2024-10-17 22:11:46,260: llmtf.base.daru/treewayabstractive: Results for daru/treewayabstractive:
INFO: 2024-10-17 22:11:46,261: llmtf.base.daru/treewayabstractive: {'rouge1': 0.33109987599556284, 'rouge2': 0.11202889150257295}
INFO: 2024-10-17 22:11:46,262: llmtf.base.evaluator: Ended eval
INFO: 2024-10-17 22:11:46,263: llmtf.base.evaluator: 
mean	daru/treewayabstractive	daru/treewayextractive	darumeru/MultiQ	darumeru/PARus	darumeru/RCB	darumeru/RWSD	darumeru/ruOpenBookQA	darumeru/ruWorldTree	nlpcoreteam/enMMLU	nlpcoreteam/ruMMLU
0.557	0.222	0.392	0.279	0.700	0.518	0.603	0.730	0.905	0.667	0.558
INFO: 2024-10-17 22:11:55,717: llmtf.base.evaluator: Starting eval on ['darumeru/cp_para_ru']
INFO: 2024-10-17 22:11:55,717: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [147077]
INFO: 2024-10-17 22:11:55,717: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-17 22:11:59,846: llmtf.base.darumeru/cp_para_ru: Loading Dataset: 4.13s
INFO: 2024-10-17 22:14:29,975: llmtf.base.darumeru/cp_para_ru: Processing Dataset: 150.13s
INFO: 2024-10-17 22:14:29,975: llmtf.base.darumeru/cp_para_ru: Results for darumeru/cp_para_ru:
INFO: 2024-10-17 22:14:29,976: llmtf.base.darumeru/cp_para_ru: {'symbol_per_token': 3.993754090002875, 'len': 0.9986883734384026, 'lcs': 0.98}
INFO: 2024-10-17 22:14:29,977: llmtf.base.evaluator: Ended eval
INFO: 2024-10-17 22:14:29,977: llmtf.base.evaluator: 
mean	daru/treewayabstractive	daru/treewayextractive	darumeru/MultiQ	darumeru/PARus	darumeru/RCB	darumeru/RWSD	darumeru/cp_para_ru	darumeru/ruOpenBookQA	darumeru/ruWorldTree	nlpcoreteam/enMMLU	nlpcoreteam/ruMMLU
0.596	0.222	0.392	0.279	0.700	0.518	0.603	0.980	0.730	0.905	0.667	0.558