INFO: 2024-10-17 07:12:53,947: llmtf.base.evaluator: Starting eval on ['darumeru/multiq']
INFO: 2024-10-17 07:12:53,947: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [131508]
INFO: 2024-10-17 07:12:53,947: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-17 07:13:01,539: llmtf.base.darumeru/MultiQ: Loading Dataset: 7.59s
INFO: 2024-10-17 07:18:20,829: llmtf.base.darumeru/MultiQ: Processing Dataset: 319.29s
INFO: 2024-10-17 07:18:20,829: llmtf.base.darumeru/MultiQ: Results for darumeru/MultiQ:
INFO: 2024-10-17 07:18:20,830: llmtf.base.darumeru/MultiQ: {'f1': 0.3485719410941241, 'em': 0.24282982791587}
INFO: 2024-10-17 07:18:20,835: llmtf.base.evaluator: Ended eval
INFO: 2024-10-17 07:18:20,835: llmtf.base.evaluator: 
mean	darumeru/MultiQ
0.296	0.296
INFO: 2024-10-17 07:18:30,261: llmtf.base.evaluator: Starting eval on ['darumeru/parus']
INFO: 2024-10-17 07:18:30,261: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [131508]
INFO: 2024-10-17 07:18:30,261: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-17 07:18:34,809: llmtf.base.darumeru/PARus: Loading Dataset: 4.55s
INFO: 2024-10-17 07:18:39,184: llmtf.base.darumeru/PARus: Processing Dataset: 4.37s
INFO: 2024-10-17 07:18:39,184: llmtf.base.darumeru/PARus: Results for darumeru/PARus:
INFO: 2024-10-17 07:18:39,194: llmtf.base.darumeru/PARus: {'acc': 0.68}
INFO: 2024-10-17 07:18:39,194: llmtf.base.evaluator: Ended eval
INFO: 2024-10-17 07:18:39,195: llmtf.base.evaluator: 
mean	darumeru/MultiQ	darumeru/PARus
0.488	0.296	0.680
INFO: 2024-10-17 07:18:48,257: llmtf.base.evaluator: Starting eval on ['darumeru/rcb']
INFO: 2024-10-17 07:18:48,258: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [131508]
INFO: 2024-10-17 07:18:48,258: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-17 07:18:52,169: llmtf.base.darumeru/RCB: Loading Dataset: 3.91s
INFO: 2024-10-17 07:18:57,742: llmtf.base.darumeru/RCB: Processing Dataset: 5.57s
INFO: 2024-10-17 07:18:57,742: llmtf.base.darumeru/RCB: Results for darumeru/RCB:
INFO: 2024-10-17 07:18:57,745: llmtf.base.darumeru/RCB: {'acc': 0.5272727272727272, 'f1_macro': 0.47584611730940257}
INFO: 2024-10-17 07:18:57,746: llmtf.base.evaluator: Ended eval
INFO: 2024-10-17 07:18:57,747: llmtf.base.evaluator: 
mean	darumeru/MultiQ	darumeru/PARus	darumeru/RCB
0.492	0.296	0.680	0.502
INFO: 2024-10-17 07:19:07,388: llmtf.base.evaluator: Starting eval on ['darumeru/ruopenbookqa']
INFO: 2024-10-17 07:19:07,388: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [131508]
INFO: 2024-10-17 07:19:07,388: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-17 07:19:13,124: llmtf.base.darumeru/ruOpenBookQA: Loading Dataset: 5.74s
INFO: 2024-10-17 07:20:12,666: llmtf.base.darumeru/ruOpenBookQA: Processing Dataset: 59.54s
INFO: 2024-10-17 07:20:12,666: llmtf.base.darumeru/ruOpenBookQA: Results for darumeru/ruOpenBookQA:
INFO: 2024-10-17 07:20:12,678: llmtf.base.darumeru/ruOpenBookQA: {'acc': 0.7207903780068728, 'f1_macro': 0.7206838429510474}
INFO: 2024-10-17 07:20:12,689: llmtf.base.evaluator: Ended eval
INFO: 2024-10-17 07:20:12,690: llmtf.base.evaluator: 
mean	darumeru/MultiQ	darumeru/PARus	darumeru/RCB	darumeru/ruOpenBookQA
0.549	0.296	0.680	0.502	0.721
INFO: 2024-10-17 07:20:21,945: llmtf.base.evaluator: Starting eval on ['darumeru/ruworldtree']
INFO: 2024-10-17 07:20:21,945: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [131508]
INFO: 2024-10-17 07:20:21,945: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-17 07:20:25,640: llmtf.base.darumeru/ruWorldTree: Loading Dataset: 3.69s
INFO: 2024-10-17 07:20:28,309: llmtf.base.darumeru/ruWorldTree: Processing Dataset: 2.67s
INFO: 2024-10-17 07:20:28,310: llmtf.base.darumeru/ruWorldTree: Results for darumeru/ruWorldTree:
INFO: 2024-10-17 07:20:28,312: llmtf.base.darumeru/ruWorldTree: {'acc': 0.8952380952380953, 'f1_macro': 0.8944916936662219}
INFO: 2024-10-17 07:20:28,313: llmtf.base.evaluator: Ended eval
INFO: 2024-10-17 07:20:28,314: llmtf.base.evaluator: 
mean	darumeru/MultiQ	darumeru/PARus	darumeru/RCB	darumeru/ruOpenBookQA	darumeru/ruWorldTree
0.619	0.296	0.680	0.502	0.721	0.895
INFO: 2024-10-17 07:20:37,966: llmtf.base.evaluator: Starting eval on ['darumeru/rwsd']
INFO: 2024-10-17 07:20:37,967: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [131508]
INFO: 2024-10-17 07:20:37,967: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-17 07:20:42,582: llmtf.base.darumeru/RWSD: Loading Dataset: 4.62s
INFO: 2024-10-17 07:20:47,988: llmtf.base.darumeru/RWSD: Processing Dataset: 5.41s
INFO: 2024-10-17 07:20:47,988: llmtf.base.darumeru/RWSD: Results for darumeru/RWSD:
INFO: 2024-10-17 07:20:47,989: llmtf.base.darumeru/RWSD: {'acc': 0.5343137254901961}
INFO: 2024-10-17 07:20:47,989: llmtf.base.evaluator: Ended eval
INFO: 2024-10-17 07:20:47,990: llmtf.base.evaluator: 
mean	darumeru/MultiQ	darumeru/PARus	darumeru/RCB	darumeru/RWSD	darumeru/ruOpenBookQA	darumeru/ruWorldTree
0.605	0.296	0.680	0.502	0.534	0.721	0.895
INFO: 2024-10-17 07:20:57,317: llmtf.base.evaluator: Starting eval on ['daru/treewayextractive']
INFO: 2024-10-17 07:20:57,317: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [131508]
INFO: 2024-10-17 07:20:57,317: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-17 07:21:13,664: llmtf.base.daru/treewayextractive: Loading Dataset: 16.35s
INFO: 2024-10-17 07:24:01,803: llmtf.base.daru/treewayextractive: Processing Dataset: 168.14s
INFO: 2024-10-17 07:24:01,803: llmtf.base.daru/treewayextractive: Results for daru/treewayextractive:
INFO: 2024-10-17 07:24:02,038: llmtf.base.daru/treewayextractive: {'r-prec': 0.3983020202020202}
INFO: 2024-10-17 07:24:02,084: llmtf.base.evaluator: Ended eval
INFO: 2024-10-17 07:24:02,085: llmtf.base.evaluator: 
mean	daru/treewayextractive	darumeru/MultiQ	darumeru/PARus	darumeru/RCB	darumeru/RWSD	darumeru/ruOpenBookQA	darumeru/ruWorldTree
0.575	0.398	0.296	0.680	0.502	0.534	0.721	0.895
INFO: 2024-10-17 07:24:11,344: llmtf.base.evaluator: Starting eval on ['nlpcoreteam/rummlu']
INFO: 2024-10-17 07:24:11,345: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [131508]
INFO: 2024-10-17 07:24:11,345: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-17 07:29:12,497: llmtf.base.nlpcoreteam/ruMMLU: Loading Dataset: 301.15s
INFO: 2024-10-17 07:35:18,210: llmtf.base.nlpcoreteam/ruMMLU: Processing Dataset: 365.71s
INFO: 2024-10-17 07:35:18,210: llmtf.base.nlpcoreteam/ruMMLU: Results for nlpcoreteam/ruMMLU:
INFO: 2024-10-17 07:35:18,279: llmtf.base.nlpcoreteam/ruMMLU:                                        metric
subject                                      
abstract_algebra                     0.330000
anatomy                              0.422222
astronomy                            0.625000
business_ethics                      0.580000
clinical_knowledge                   0.592453
college_biology                      0.506944
college_chemistry                    0.340000
college_computer_science             0.540000
college_mathematics                  0.370000
college_medicine                     0.549133
college_physics                      0.431373
computer_security                    0.570000
conceptual_physics                   0.536170
econometrics                         0.385965
electrical_engineering               0.531034
elementary_mathematics               0.515873
formal_logic                         0.333333
global_facts                         0.390000
high_school_biology                  0.670968
high_school_chemistry                0.487685
high_school_computer_science         0.660000
high_school_european_history         0.733333
high_school_geography                0.696970
high_school_government_and_politics  0.569948
high_school_macroeconomics           0.523077
high_school_mathematics              0.429630
high_school_microeconomics           0.521008
high_school_physics                  0.443709
high_school_psychology               0.706422
high_school_statistics               0.523148
high_school_us_history               0.642157
high_school_world_history            0.729958
human_aging                          0.587444
human_sexuality                      0.641221
international_law                    0.694215
jurisprudence                        0.638889
logical_fallacies                    0.533742
machine_learning                     0.419643
management                           0.650485
marketing                            0.726496
medical_genetics                     0.550000
miscellaneous                        0.629630
moral_disputes                       0.575145
moral_scenarios                      0.248045
nutrition                            0.614379
philosophy                           0.643087
prehistory                           0.546296
professional_accounting              0.358156
professional_law                     0.373533
professional_medicine                0.500000
professional_psychology              0.495098
public_relations                     0.500000
security_studies                     0.665306
sociology                            0.701493
us_foreign_policy                    0.700000
virology                             0.433735
world_religions                      0.672515
INFO: 2024-10-17 07:35:18,289: llmtf.base.nlpcoreteam/ruMMLU:                                    metric
subject                                  
STEM                             0.496176
humanities                       0.566481
other (business, health, misc.)  0.541724
social sciences                  0.592209
INFO: 2024-10-17 07:35:18,294: llmtf.base.nlpcoreteam/ruMMLU: {'acc': 0.549147460511024}
INFO: 2024-10-17 07:35:18,341: llmtf.base.evaluator: Ended eval
INFO: 2024-10-17 07:35:18,343: llmtf.base.evaluator: 
mean	daru/treewayextractive	darumeru/MultiQ	darumeru/PARus	darumeru/RCB	darumeru/RWSD	darumeru/ruOpenBookQA	darumeru/ruWorldTree	nlpcoreteam/ruMMLU
0.572	0.398	0.296	0.680	0.502	0.534	0.721	0.895	0.549
INFO: 2024-10-17 07:35:27,953: llmtf.base.evaluator: Starting eval on ['nlpcoreteam/enmmlu']
INFO: 2024-10-17 07:35:27,953: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [131508]
INFO: 2024-10-17 07:35:27,953: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-17 07:37:30,758: llmtf.base.nlpcoreteam/enMMLU: Loading Dataset: 122.80s
INFO: 2024-10-17 07:43:10,625: llmtf.base.nlpcoreteam/enMMLU: Processing Dataset: 339.87s
INFO: 2024-10-17 07:43:10,626: llmtf.base.nlpcoreteam/enMMLU: Results for nlpcoreteam/enMMLU:
INFO: 2024-10-17 07:43:10,691: llmtf.base.nlpcoreteam/enMMLU:                                        metric
subject                                      
abstract_algebra                     0.370000
anatomy                              0.622222
astronomy                            0.697368
business_ethics                      0.670000
clinical_knowledge                   0.709434
college_biology                      0.701389
college_chemistry                    0.450000
college_computer_science             0.570000
college_mathematics                  0.360000
college_medicine                     0.670520
college_physics                      0.480392
computer_security                    0.720000
conceptual_physics                   0.655319
econometrics                         0.500000
electrical_engineering               0.565517
elementary_mathematics               0.539683
formal_logic                         0.357143
global_facts                         0.370000
high_school_biology                  0.800000
high_school_chemistry                0.561576
high_school_computer_science         0.670000
high_school_european_history         0.763636
high_school_geography                0.772727
high_school_government_and_politics  0.849741
high_school_macroeconomics           0.679487
high_school_mathematics              0.440741
high_school_microeconomics           0.756303
high_school_physics                  0.450331
high_school_psychology               0.849541
high_school_statistics               0.643519
high_school_us_history               0.813725
high_school_world_history            0.835443
human_aging                          0.695067
human_sexuality                      0.763359
international_law                    0.768595
jurisprudence                        0.787037
logical_fallacies                    0.779141
machine_learning                     0.464286
management                           0.805825
marketing                            0.884615
medical_genetics                     0.750000
miscellaneous                        0.784163
moral_disputes                       0.650289
moral_scenarios                      0.270391
nutrition                            0.718954
philosophy                           0.717042
prehistory                           0.737654
professional_accounting              0.496454
professional_law                     0.458931
professional_medicine                0.672794
professional_psychology              0.668301
public_relations                     0.681818
security_studies                     0.718367
sociology                            0.810945
us_foreign_policy                    0.790000
virology                             0.487952
world_religions                      0.812865
INFO: 2024-10-17 07:43:10,700: llmtf.base.nlpcoreteam/enMMLU:                                    metric
subject                                  
STEM                             0.563340
humanities                       0.673223
other (business, health, misc.)  0.667000
social sciences                  0.736716
INFO: 2024-10-17 07:43:10,705: llmtf.base.nlpcoreteam/enMMLU: {'acc': 0.6600696360837558}
INFO: 2024-10-17 07:43:10,741: llmtf.base.evaluator: Ended eval
INFO: 2024-10-17 07:43:10,743: llmtf.base.evaluator: 
mean	daru/treewayextractive	darumeru/MultiQ	darumeru/PARus	darumeru/RCB	darumeru/RWSD	darumeru/ruOpenBookQA	darumeru/ruWorldTree	nlpcoreteam/enMMLU	nlpcoreteam/ruMMLU
0.582	0.398	0.296	0.680	0.502	0.534	0.721	0.895	0.660	0.549
INFO: 2024-10-17 07:43:20,115: llmtf.base.evaluator: Starting eval on ['daru/treewayabstractive']
INFO: 2024-10-17 07:43:20,115: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [131508]
INFO: 2024-10-17 07:43:20,115: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-17 07:43:24,372: llmtf.base.daru/treewayabstractive: Loading Dataset: 4.26s
INFO: 2024-10-17 07:47:01,407: llmtf.base.daru/treewayabstractive: Processing Dataset: 217.03s
INFO: 2024-10-17 07:47:01,407: llmtf.base.daru/treewayabstractive: Results for daru/treewayabstractive:
INFO: 2024-10-17 07:47:01,408: llmtf.base.daru/treewayabstractive: {'rouge1': 0.32720307606797727, 'rouge2': 0.10857945570692258}
INFO: 2024-10-17 07:47:01,409: llmtf.base.evaluator: Ended eval
INFO: 2024-10-17 07:47:01,410: llmtf.base.evaluator: 
mean	daru/treewayabstractive	daru/treewayextractive	darumeru/MultiQ	darumeru/PARus	darumeru/RCB	darumeru/RWSD	darumeru/ruOpenBookQA	darumeru/ruWorldTree	nlpcoreteam/enMMLU	nlpcoreteam/ruMMLU
0.545	0.218	0.398	0.296	0.680	0.502	0.534	0.721	0.895	0.660	0.549
INFO: 2024-10-17 07:47:10,811: llmtf.base.evaluator: Starting eval on ['darumeru/cp_para_ru']
INFO: 2024-10-17 07:47:10,811: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [131508]
INFO: 2024-10-17 07:47:10,811: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-17 07:47:15,676: llmtf.base.darumeru/cp_para_ru: Loading Dataset: 4.86s
INFO: 2024-10-17 07:49:51,029: llmtf.base.darumeru/cp_para_ru: Processing Dataset: 155.35s
INFO: 2024-10-17 07:49:51,030: llmtf.base.darumeru/cp_para_ru: Results for darumeru/cp_para_ru:
INFO: 2024-10-17 07:49:51,031: llmtf.base.darumeru/cp_para_ru: {'symbol_per_token': 3.76859951568896, 'len': 0.9950709951674359, 'lcs': 0.9}
INFO: 2024-10-17 07:49:51,031: llmtf.base.evaluator: Ended eval
INFO: 2024-10-17 07:49:51,032: llmtf.base.evaluator: 
mean	daru/treewayabstractive	daru/treewayextractive	darumeru/MultiQ	darumeru/PARus	darumeru/RCB	darumeru/RWSD	darumeru/cp_para_ru	darumeru/ruOpenBookQA	darumeru/ruWorldTree	nlpcoreteam/enMMLU	nlpcoreteam/ruMMLU
0.578	0.218	0.398	0.296	0.680	0.502	0.534	0.900	0.721	0.895	0.660	0.549
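
Note on the summary rows: the figures above appear consistent with each task score being the unweighted mean of that task's logged metrics, and the "mean" column being the unweighted mean of the task scores. This is inferred from the numbers in this log, not confirmed against the llmtf source. The sketch below recomputes the final row under that assumption; for darumeru/cp_para_ru it assumes only 'lcs' feeds the summary, since the logged 0.900 matches that metric alone. The names task_metrics, task_scores and overall are illustrative.

```python
from statistics import mean

# Per-task metrics copied from the "Results for ..." lines of this log.
task_metrics = {
    "daru/treewayabstractive": {"rouge1": 0.32720307606797727,
                                "rouge2": 0.10857945570692258},
    "daru/treewayextractive": {"r-prec": 0.3983020202020202},
    "darumeru/MultiQ": {"f1": 0.3485719410941241, "em": 0.24282982791587},
    "darumeru/PARus": {"acc": 0.68},
    "darumeru/RCB": {"acc": 0.5272727272727272,
                     "f1_macro": 0.47584611730940257},
    "darumeru/RWSD": {"acc": 0.5343137254901961},
    # Assumption: only 'lcs' contributes; 'symbol_per_token' and 'len' are left out.
    "darumeru/cp_para_ru": {"lcs": 0.9},
    "darumeru/ruOpenBookQA": {"acc": 0.7207903780068728,
                              "f1_macro": 0.7206838429510474},
    "darumeru/ruWorldTree": {"acc": 0.8952380952380953,
                             "f1_macro": 0.8944916936662219},
    "nlpcoreteam/enMMLU": {"acc": 0.6600696360837558},
    "nlpcoreteam/ruMMLU": {"acc": 0.549147460511024},
}

# Assumed rule: task score = unweighted mean of the task's metrics.
task_scores = {task: mean(list(metrics.values()))
               for task, metrics in task_metrics.items()}

# Assumed rule: overall mean = unweighted mean of the task scores.
overall = mean(list(task_scores.values()))

for task, score in task_scores.items():
    print(f"{task}\t{score:.3f}")
print(f"mean\t{overall:.3f}")
```

Under these assumptions the recomputed values round to the same three-decimal figures as the last summary row.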