INFO: 2024-10-28 13:15:15,094: llmtf.base.evaluator: Starting eval on ['darumeru/multiq']
INFO: 2024-10-28 13:15:15,094: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [147077]
INFO: 2024-10-28 13:15:15,094: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-28 13:15:16,695: llmtf.base.evaluator: Starting eval on ['darumeru/parus']
INFO: 2024-10-28 13:15:16,695: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [147077]
INFO: 2024-10-28 13:15:16,695: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-28 13:15:18,943: llmtf.base.darumeru/PARus: Loading Dataset: 2.25s
INFO: 2024-10-28 13:15:19,297: llmtf.base.darumeru/MultiQ: Loading Dataset: 4.20s
INFO: 2024-10-28 13:15:22,318: llmtf.base.darumeru/PARus: Processing Dataset: 3.37s
INFO: 2024-10-28 13:15:22,318: llmtf.base.darumeru/PARus: Results for darumeru/PARus:
INFO: 2024-10-28 13:15:22,329: llmtf.base.darumeru/PARus: {'acc': 0.78}
INFO: 2024-10-28 13:15:22,330: llmtf.base.evaluator: Ended eval
INFO: 2024-10-28 13:15:22,330: llmtf.base.evaluator: 
mean	darumeru/PARus
0.780	0.780
INFO: 2024-10-28 13:15:30,304: llmtf.base.evaluator: Starting eval on ['darumeru/ruopenbookqa']
INFO: 2024-10-28 13:15:30,304: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [147077]
INFO: 2024-10-28 13:15:30,304: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-28 13:15:33,637: llmtf.base.darumeru/ruOpenBookQA: Loading Dataset: 3.33s
INFO: 2024-10-28 13:16:05,173: llmtf.base.darumeru/ruOpenBookQA: Processing Dataset: 31.54s
INFO: 2024-10-28 13:16:05,173: llmtf.base.darumeru/ruOpenBookQA: Results for darumeru/ruOpenBookQA:
INFO: 2024-10-28 13:16:05,184: llmtf.base.darumeru/ruOpenBookQA: {'acc': 0.8256013745704467, 'f1_macro': 0.8262484506706507}
INFO: 2024-10-28 13:16:05,191: llmtf.base.evaluator: Ended eval
INFO: 2024-10-28 13:16:05,192: llmtf.base.evaluator: 
mean	darumeru/PARus	darumeru/ruOpenBookQA
0.803	0.780	0.826
INFO: 2024-10-28 13:16:13,923: llmtf.base.evaluator: Starting eval on ['darumeru/rwsd']
INFO: 2024-10-28 13:16:13,923: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [147077]
INFO: 2024-10-28 13:16:13,923: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-28 13:16:16,429: llmtf.base.darumeru/RWSD: Loading Dataset: 2.51s
INFO: 2024-10-28 13:16:22,246: llmtf.base.darumeru/RWSD: Processing Dataset: 5.82s
INFO: 2024-10-28 13:16:22,246: llmtf.base.darumeru/RWSD: Results for darumeru/RWSD:
INFO: 2024-10-28 13:16:22,247: llmtf.base.darumeru/RWSD: {'acc': 0.5441176470588235}
INFO: 2024-10-28 13:16:22,248: llmtf.base.evaluator: Ended eval
INFO: 2024-10-28 13:16:22,249: llmtf.base.evaluator: 
mean	darumeru/PARus	darumeru/RWSD	darumeru/ruOpenBookQA
0.717	0.780	0.544	0.826
INFO: 2024-10-28 13:16:31,348: llmtf.base.evaluator: Starting eval on ['nlpcoreteam/rummlu']
INFO: 2024-10-28 13:16:31,348: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [147077]
INFO: 2024-10-28 13:16:31,348: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-28 13:18:38,554: llmtf.base.nlpcoreteam/ruMMLU: Loading Dataset: 127.21s
INFO: 2024-10-28 13:20:06,478: llmtf.base.darumeru/MultiQ: Processing Dataset: 287.18s
INFO: 2024-10-28 13:20:06,479: llmtf.base.darumeru/MultiQ: Results for darumeru/MultiQ:
INFO: 2024-10-28 13:20:06,480: llmtf.base.darumeru/MultiQ: {'f1': 0.2503859074384594, 'em': 0.14531548757170173}
INFO: 2024-10-28 13:20:06,488: llmtf.base.evaluator: Ended eval
INFO: 2024-10-28 13:20:06,489: llmtf.base.evaluator: 
mean	darumeru/MultiQ	darumeru/PARus	darumeru/RWSD	darumeru/ruOpenBookQA
0.587	0.198	0.780	0.544	0.826
INFO: 2024-10-28 13:20:15,334: llmtf.base.evaluator: Starting eval on ['darumeru/rcb']
INFO: 2024-10-28 13:20:15,335: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [147077]
INFO: 2024-10-28 13:20:15,335: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-28 13:20:18,179: llmtf.base.darumeru/RCB: Loading Dataset: 2.84s
INFO: 2024-10-28 13:20:23,505: llmtf.base.darumeru/RCB: Processing Dataset: 5.33s
INFO: 2024-10-28 13:20:23,506: llmtf.base.darumeru/RCB: Results for darumeru/RCB:
INFO: 2024-10-28 13:20:23,510: llmtf.base.darumeru/RCB: {'acc': 0.5863636363636363, 'f1_macro': 0.5032640286161413}
INFO: 2024-10-28 13:20:23,511: llmtf.base.evaluator: Ended eval
INFO: 2024-10-28 13:20:23,512: llmtf.base.evaluator: 
mean	darumeru/MultiQ	darumeru/PARus	darumeru/RCB	darumeru/RWSD	darumeru/ruOpenBookQA
0.579	0.198	0.780	0.545	0.544	0.826
INFO: 2024-10-28 13:20:32,046: llmtf.base.evaluator: Starting eval on ['darumeru/ruworldtree']
INFO: 2024-10-28 13:20:32,046: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [147077]
INFO: 2024-10-28 13:20:32,046: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-28 13:20:34,403: llmtf.base.darumeru/ruWorldTree: Loading Dataset: 2.36s
INFO: 2024-10-28 13:20:36,969: llmtf.base.darumeru/ruWorldTree: Processing Dataset: 2.57s
INFO: 2024-10-28 13:20:36,969: llmtf.base.darumeru/ruWorldTree: Results for darumeru/ruWorldTree:
INFO: 2024-10-28 13:20:36,972: llmtf.base.darumeru/ruWorldTree: {'acc': 0.9047619047619048, 'f1_macro': 0.9038817229146561}
INFO: 2024-10-28 13:20:36,972: llmtf.base.evaluator: Ended eval
INFO: 2024-10-28 13:20:36,972: llmtf.base.evaluator: 
mean	darumeru/MultiQ	darumeru/PARus	darumeru/RCB	darumeru/RWSD	darumeru/ruOpenBookQA	darumeru/ruWorldTree
0.633	0.198	0.780	0.545	0.544	0.826	0.904
INFO: 2024-10-28 13:20:45,488: llmtf.base.evaluator: Starting eval on ['daru/treewayextractive']
INFO: 2024-10-28 13:20:45,488: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [147077]
INFO: 2024-10-28 13:20:45,488: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-28 13:20:59,443: llmtf.base.daru/treewayextractive: Loading Dataset: 13.95s
INFO: 2024-10-28 13:23:49,533: llmtf.base.nlpcoreteam/ruMMLU: Processing Dataset: 310.98s
INFO: 2024-10-28 13:23:49,533: llmtf.base.nlpcoreteam/ruMMLU: Results for nlpcoreteam/ruMMLU:
INFO: 2024-10-28 13:23:49,597: llmtf.base.nlpcoreteam/ruMMLU:                                        metric
subject                                      
abstract_algebra                     0.480000
anatomy                              0.562963
astronomy                            0.769737
business_ethics                      0.640000
clinical_knowledge                   0.671698
college_biology                      0.673611
college_chemistry                    0.470000
college_computer_science             0.670000
college_mathematics                  0.440000
college_medicine                     0.589595
college_physics                      0.480392
computer_security                    0.720000
conceptual_physics                   0.655319
econometrics                         0.482456
electrical_engineering               0.606897
elementary_mathematics               0.616402
formal_logic                         0.428571
global_facts                         0.370000
high_school_biology                  0.809677
high_school_chemistry                0.571429
high_school_computer_science         0.770000
high_school_european_history         0.751515
high_school_geography                0.782828
high_school_government_and_politics  0.725389
high_school_macroeconomics           0.658974
high_school_mathematics              0.525926
high_school_microeconomics           0.705882
high_school_physics                  0.463576
high_school_psychology               0.796330
high_school_statistics               0.606481
high_school_us_history               0.779412
high_school_world_history            0.801688
human_aging                          0.632287
human_sexuality                      0.717557
international_law                    0.743802
jurisprudence                        0.675926
logical_fallacies                    0.662577
machine_learning                     0.482143
management                           0.747573
marketing                            0.816239
medical_genetics                     0.650000
miscellaneous                        0.711367
moral_disputes                       0.627168
moral_scenarios                      0.244693
nutrition                            0.689542
philosophy                           0.646302
prehistory                           0.660494
professional_accounting              0.439716
professional_law                     0.411343
professional_medicine                0.613971
professional_psychology              0.591503
public_relations                     0.545455
security_studies                     0.665306
sociology                            0.736318
us_foreign_policy                    0.800000
virology                             0.500000
world_religions                      0.760234
INFO: 2024-10-28 13:23:49,606: llmtf.base.nlpcoreteam/ruMMLU:                                    metric
subject                                  
STEM                             0.600644
humanities                       0.630286
other (business, health, misc.)  0.616782
social sciences                  0.684000
INFO: 2024-10-28 13:23:49,611: llmtf.base.nlpcoreteam/ruMMLU: {'acc': 0.6329281396317665}
INFO: 2024-10-28 13:23:49,646: llmtf.base.evaluator: Ended eval
INFO: 2024-10-28 13:23:49,648: llmtf.base.evaluator: 
mean	darumeru/MultiQ	darumeru/PARus	darumeru/RCB	darumeru/RWSD	darumeru/ruOpenBookQA	darumeru/ruWorldTree	nlpcoreteam/ruMMLU
0.633	0.198	0.780	0.545	0.544	0.826	0.904	0.633
INFO: 2024-10-28 13:23:57,887: llmtf.base.evaluator: Starting eval on ['daru/treewayabstractive']
INFO: 2024-10-28 13:23:57,887: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [147077]
INFO: 2024-10-28 13:23:57,887: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-28 13:24:02,221: llmtf.base.daru/treewayabstractive: Loading Dataset: 4.33s
INFO: 2024-10-28 13:26:15,188: llmtf.base.daru/treewayextractive: Processing Dataset: 315.74s
INFO: 2024-10-28 13:26:15,188: llmtf.base.daru/treewayextractive: Results for daru/treewayextractive:
INFO: 2024-10-28 13:26:15,447: llmtf.base.daru/treewayextractive: {'r-prec': 0.40380281385281386}
INFO: 2024-10-28 13:26:15,501: llmtf.base.evaluator: Ended eval
INFO: 2024-10-28 13:26:15,503: llmtf.base.evaluator: 
mean	daru/treewayextractive	darumeru/MultiQ	darumeru/PARus	darumeru/RCB	darumeru/RWSD	darumeru/ruOpenBookQA	darumeru/ruWorldTree	nlpcoreteam/ruMMLU
0.604	0.404	0.198	0.780	0.545	0.544	0.826	0.904	0.633
INFO: 2024-10-28 13:26:24,206: llmtf.base.evaluator: Starting eval on ['nlpcoreteam/enmmlu']
INFO: 2024-10-28 13:26:24,207: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [147077]
INFO: 2024-10-28 13:26:24,207: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-28 13:26:48,154: llmtf.base.daru/treewayabstractive: Processing Dataset: 165.93s
INFO: 2024-10-28 13:26:48,154: llmtf.base.daru/treewayabstractive: Results for daru/treewayabstractive:
INFO: 2024-10-28 13:26:48,155: llmtf.base.daru/treewayabstractive: {'rouge1': 0.3489002151166006, 'rouge2': 0.12404569962254197}
INFO: 2024-10-28 13:26:48,156: llmtf.base.evaluator: Ended eval
INFO: 2024-10-28 13:26:48,157: llmtf.base.evaluator: 
mean	daru/treewayabstractive	daru/treewayextractive	darumeru/MultiQ	darumeru/PARus	darumeru/RCB	darumeru/RWSD	darumeru/ruOpenBookQA	darumeru/ruWorldTree	nlpcoreteam/ruMMLU
0.563	0.236	0.404	0.198	0.780	0.545	0.544	0.826	0.904	0.633
INFO: 2024-10-28 13:28:23,832: llmtf.base.nlpcoreteam/enMMLU: Loading Dataset: 119.62s
INFO: 2024-10-28 13:33:05,781: llmtf.base.nlpcoreteam/enMMLU: Processing Dataset: 281.95s
INFO: 2024-10-28 13:33:05,781: llmtf.base.nlpcoreteam/enMMLU: Results for nlpcoreteam/enMMLU:
INFO: 2024-10-28 13:33:05,844: llmtf.base.nlpcoreteam/enMMLU:                                        metric
subject                                      
abstract_algebra                     0.450000
anatomy                              0.725926
astronomy                            0.861842
business_ethics                      0.750000
clinical_knowledge                   0.762264
college_biology                      0.854167
college_chemistry                    0.510000
college_computer_science             0.720000
college_mathematics                  0.470000
college_medicine                     0.699422
college_physics                      0.509804
computer_security                    0.770000
conceptual_physics                   0.706383
econometrics                         0.605263
electrical_engineering               0.696552
elementary_mathematics               0.666667
formal_logic                         0.492063
global_facts                         0.420000
high_school_biology                  0.861290
high_school_chemistry                0.620690
high_school_computer_science         0.840000
high_school_european_history         0.824242
high_school_geography                0.873737
high_school_government_and_politics  0.927461
high_school_macroeconomics           0.761538
high_school_mathematics              0.566667
high_school_microeconomics           0.873950
high_school_physics                  0.582781
high_school_psychology               0.888073
high_school_statistics               0.708333
high_school_us_history               0.838235
high_school_world_history            0.860759
human_aging                          0.762332
human_sexuality                      0.786260
international_law                    0.809917
jurisprudence                        0.796296
logical_fallacies                    0.828221
machine_learning                     0.526786
management                           0.854369
marketing                            0.914530
medical_genetics                     0.810000
miscellaneous                        0.848020
moral_disputes                       0.736994
moral_scenarios                      0.459218
nutrition                            0.797386
philosophy                           0.723473
prehistory                           0.805556
professional_accounting              0.556738
professional_law                     0.507823
professional_medicine                0.742647
professional_psychology              0.750000
public_relations                     0.636364
security_studies                     0.759184
sociology                            0.845771
us_foreign_policy                    0.850000
virology                             0.506024
world_religions                      0.853801
INFO: 2024-10-28 13:33:05,852: llmtf.base.nlpcoreteam/enMMLU:                                    metric
subject                                  
STEM                             0.662331
humanities                       0.733585
other (business, health, misc.)  0.724976
social sciences                  0.796467
INFO: 2024-10-28 13:33:05,857: llmtf.base.nlpcoreteam/enMMLU: {'acc': 0.7293395108036221}
INFO: 2024-10-28 13:33:05,908: llmtf.base.evaluator: Ended eval
INFO: 2024-10-28 13:33:05,910: llmtf.base.evaluator: 
mean	daru/treewayabstractive	daru/treewayextractive	darumeru/MultiQ	darumeru/PARus	darumeru/RCB	darumeru/RWSD	darumeru/ruOpenBookQA	darumeru/ruWorldTree	nlpcoreteam/enMMLU	nlpcoreteam/ruMMLU
0.580	0.236	0.404	0.198	0.780	0.545	0.544	0.826	0.904	0.729	0.633
INFO: 2024-10-28 13:33:14,562: llmtf.base.evaluator: Starting eval on ['darumeru/cp_para_ru']
INFO: 2024-10-28 13:33:14,562: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [147077]
INFO: 2024-10-28 13:33:14,562: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
INFO: 2024-10-28 13:33:17,057: llmtf.base.darumeru/cp_para_ru: Loading Dataset: 2.49s
INFO: 2024-10-28 13:35:21,669: llmtf.base.darumeru/cp_para_ru: Processing Dataset: 124.61s
INFO: 2024-10-28 13:35:21,670: llmtf.base.darumeru/cp_para_ru: Results for darumeru/cp_para_ru:
INFO: 2024-10-28 13:35:21,670: llmtf.base.darumeru/cp_para_ru: {'symbol_per_token': 3.9953318595732386, 'len': 0.9990656928305265, 'lcs': 1.0}
INFO: 2024-10-28 13:35:21,671: llmtf.base.evaluator: Ended eval
INFO: 2024-10-28 13:35:21,672: llmtf.base.evaluator: 
mean	daru/treewayabstractive	daru/treewayextractive	darumeru/MultiQ	darumeru/PARus	darumeru/RCB	darumeru/RWSD	darumeru/cp_para_ru	darumeru/ruOpenBookQA	darumeru/ruWorldTree	nlpcoreteam/enMMLU	nlpcoreteam/ruMMLU
0.618	0.236	0.404	0.198	0.780	0.545	0.544	1.000	0.826	0.904	0.729	0.633
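The `mean` column in the final summary row can be cross-checked against the per-task columns. The sketch below is not part of llmtf. It assumes `mean` is the unweighted average of the task scores as printed (already rounded to three decimals), and `parse_summary` is a hypothetical helper name.

```python
def parse_summary(header_line: str, value_line: str) -> dict[str, float]:
    """Pair a tab-separated header row with its tab-separated value row."""
    names = header_line.strip().split("\t")
    values = [float(v) for v in value_line.strip().split("\t")]
    if len(names) != len(values):
        raise ValueError("header and value rows have different lengths")
    return dict(zip(names, values))


# Final summary rows copied from the log (tab-separated in the original).
header = "\t".join([
    "mean", "daru/treewayabstractive", "daru/treewayextractive",
    "darumeru/MultiQ", "darumeru/PARus", "darumeru/RCB", "darumeru/RWSD",
    "darumeru/cp_para_ru", "darumeru/ruOpenBookQA", "darumeru/ruWorldTree",
    "nlpcoreteam/enMMLU", "nlpcoreteam/ruMMLU",
])
row = "\t".join([
    "0.618", "0.236", "0.404", "0.198", "0.780", "0.545", "0.544",
    "1.000", "0.826", "0.904", "0.729", "0.633",
])

summary = parse_summary(header, row)
logged_mean = summary.pop("mean")  # remaining entries are per-task scores

# Assumption: unweighted average over the 11 task columns.
recomputed = sum(summary.values()) / len(summary)
print(f"{recomputed:.3f}", logged_mean)  # prints "0.618 0.618"
```

Because the inputs are the rounded values shown in the table, agreement is only expected to three decimals.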