OpenELM-1_1B-DPO-full-max-second-reward

This model was trained from scratch on an unknown dataset. On the evaluation set it achieves the following results, which match the final logged step (2800) in the training results table:

Loss: 1.4829
Rewards/chosen: -12.4375
Rewards/rejected: -12.875
Rewards/accuracies: 0.5371
Rewards/margins: 0.4414
Logps/rejected: -1576.0
Logps/chosen: -1560.0
Logits/rejected: 10.8125
Logits/chosen: 8.8125
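
The reward figures above are linked by simple arithmetic, sketched below in plain Python. The sketch assumes the metrics follow the convention of DPO trainers such as TRL's DPOTrainer with its default sigmoid loss, where the margin is the chosen reward minus the rejected reward and the per-pair loss is -log(sigmoid(margin)). The card does not say which trainer or loss variant was used, so that convention is an assumption, and the helper name `sigmoid_dpo_loss` is purely illustrative.

```python
import math

# Evaluation metrics copied from the list above.
rewards_chosen = -12.4375
rewards_rejected = -12.875
reported_margin = 0.4414
reported_loss = 1.4829

# Margin implied by the two mean rewards.
margin_from_means = rewards_chosen - rewards_rejected


def sigmoid_dpo_loss(margin: float) -> float:
    """Per-pair loss -log(sigmoid(margin)); ASSUMES the sigmoid DPO loss."""
    # log(1 + exp(-m)) is algebraically equal to -log(sigmoid(m)).
    return math.log1p(math.exp(-margin))


# Loss that a single pair sitting exactly at the mean margin would incur.
loss_at_mean_margin = sigmoid_dpo_loss(reported_margin)

print(f"margin from means:   {margin_from_means:.4f}")
print(f"reported margin:     {reported_margin:.4f}")
print(f"loss at mean margin: {loss_at_mean_margin:.4f}")
print(f"reported loss:       {reported_loss:.4f}")
```

The margin recomputed from the two means is close to, but not identical to, the reported 0.4414, which would be consistent with per-batch averaging or reduced-precision logging. Under the same assumption, the loss at the mean margin is about 0.50, well below the reported 1.4829. Since this loss is convex in the margin, the average loss can never be lower than the loss at the average margin, so the gap suggests that per-pair margins are spread widely around their mean.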

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
train_batch_size: 8
eval_batch_size: 16
seed: 42
distributed_type: multi-GPU
num_devices: 4
gradient_accumulation_steps: 2
total_train_batch_size: 64
total_eval_batch_size: 64
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 3
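
The batch-size totals and the learning-rate schedule in this list can be checked with a few lines of Python. This is a minimal sketch, not the training code: `estimated_total_steps` is an assumption inferred from the results table (step 2800 falls at epoch 2.9319 of 3), and `lr_at` is a hypothetical helper that reproduces the usual linear-warmup-then-cosine-decay shape rather than any specific library function.

```python
import math

# Values copied from the hyperparameter list above.
learning_rate = 5e-05
train_batch_size = 8   # per device
eval_batch_size = 16   # per device
num_devices = 4
gradient_accumulation_steps = 2
warmup_ratio = 0.1

# Effective batch sizes: per-device size times devices (times accumulation for training).
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices

# ASSUMPTION: the card does not state the total number of optimizer steps.
# About 955 steps per epoch (2800 / 2.9319) over 3 epochs gives roughly 2865.
estimated_total_steps = 2865
warmup_steps = math.ceil(warmup_ratio * estimated_total_steps)


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (hypothetical helper)."""
    if step < warmup_steps:
        # Linear warmup from 0 up to the peak learning rate.
        return learning_rate * step / max(1, warmup_steps)
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, estimated_total_steps - warmup_steps)
    return learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))


print("total_train_batch_size:", total_train_batch_size)
print("total_eval_batch_size:", total_eval_batch_size)
print("warmup_steps (estimated):", warmup_steps)
for step in (0, warmup_steps, estimated_total_steps // 2, estimated_total_steps):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```

Both totals come out to 64, matching the values listed above. The warmup length and the per-step learning rates depend on the estimated step count and should be read as approximate.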

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6927	0.1047	100	0.6950	-0.2334	-0.2539	0.5254	0.0201	-314.0	-342.0	-13.125	-13.25
0.6759	0.2094	200	0.7065	-0.6484	-0.7617	0.5488	0.1123	-366.0	-384.0	-11.6875	-11.9375
0.6912	0.3141	300	0.7235	-1.1484	-1.2344	0.5527	0.0845	-412.0	-434.0	-14.0	-14.0625
0.7002	0.4188	400	0.7412	-1.2734	-1.2578	0.4883	-0.0128	-414.0	-446.0	-13.5	-13.5
0.6819	0.5236	500	0.7542	-1.75	-1.7656	0.4961	0.0173	-466.0	-492.0	-12.125	-12.3125
0.7065	0.6283	600	0.7290	-1.9297	-1.9453	0.5039	0.0159	-482.0	-512.0	-12.1875	-12.375
0.6892	0.7330	700	0.7298	-2.1094	-2.1719	0.5117	0.0518	-506.0	-532.0	-11.75	-11.8125
0.7117	0.8377	800	0.7436	-2.25	-2.2812	0.4961	0.0247	-516.0	-544.0	-8.5625	-8.875
0.6835	0.9424	900	0.7565	-2.1562	-2.1875	0.5137	0.0284	-508.0	-536.0	-7.8125	-8.1875
0.2775	1.0471	1000	0.9428	-4.0938	-4.125	0.5137	0.0229	-700.0	-728.0	-10.75	-11.1875
0.2471	1.1518	1100	0.9772	-5.6562	-5.75	0.5234	0.0986	-864.0	-884.0	-3.9844	-4.8438
0.2465	1.2565	1200	0.9777	-5.125	-5.2188	0.5254	0.0688	-808.0	-832.0	-4.1562	-5.0312
0.2601	1.3613	1300	0.9855	-6.5	-6.6875	0.5488	0.1846	-956.0	-968.0	0.3164	-0.7695
0.2404	1.4660	1400	0.9077	-6.8438	-7.0938	0.5293	0.2520	-1000.0	-1004.0	2.0312	0.6367
0.2371	1.5707	1500	0.9027	-5.8438	-6.0625	0.5508	0.2061	-896.0	-904.0	1.4141	0.0143
0.2329	1.6754	1600	0.9480	-6.7812	-7.0312	0.5488	0.2617	-992.0	-996.0	2.0312	0.5664
0.231	1.7801	1700	0.8705	-6.2812	-6.5625	0.5527	0.2598	-944.0	-948.0	-1.6484	-2.7031
0.2045	1.8848	1800	0.9315	-7.4375	-7.7188	0.5625	0.3086	-1064.0	-1064.0	-1.3906	-2.5
0.2467	1.9895	1900	0.8831	-7.0625	-7.375	0.5586	0.3145	-1024.0	-1024.0	0.2656	-0.9961
0.0377	2.0942	2000	1.3504	-10.6875	-11.0625	0.5371	0.3652	-1392.0	-1384.0	6.25	4.5625
0.0265	2.1990	2100	1.5050	-11.5	-11.8125	0.5566	0.3320	-1472.0	-1472.0	8.1875	6.375
0.0363	2.3037	2200	1.4563	-11.625	-11.9375	0.5312	0.3398	-1480.0	-1480.0	8.9375	7.1562
0.0292	2.4084	2300	1.5373	-12.125	-12.5	0.5449	0.3535	-1536.0	-1528.0	9.6875	7.7812
0.0491	2.5131	2400	1.4556	-12.0625	-12.5	0.5410	0.4355	-1536.0	-1528.0	9.8125	7.875
0.0324	2.6178	2500	1.4875	-12.5	-12.9375	0.5391	0.4414	-1584.0	-1568.0	10.5	8.5625
0.0247	2.7225	2600	1.4541	-12.0625	-12.5	0.5410	0.4336	-1536.0	-1528.0	10.25	8.3125
0.0335	2.8272	2700	1.4734	-12.3125	-12.75	0.5371	0.4434	-1568.0	-1552.0	10.6875	8.75
0.0263	2.9319	2800	1.4829	-12.4375	-12.875	0.5371	0.4414	-1576.0	-1560.0	10.8125	8.8125
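
In the table, training loss drops sharply at each epoch boundary (around steps 1000 and 2000) while validation loss rises, so the final row is not the one with the lowest validation loss. The Python sketch below copies four columns from the table and picks out the rows with the lowest validation loss and the highest reward accuracy. It is only an inspection aid under the assumption that these two columns are the ones of interest; the card does not describe how a checkpoint was selected, and the headline metrics match the last row.

```python
# (step, training loss, validation loss, rewards/accuracies), copied from the table above.
rows = [
    (100, 0.6927, 0.6950, 0.5254), (200, 0.6759, 0.7065, 0.5488),
    (300, 0.6912, 0.7235, 0.5527), (400, 0.7002, 0.7412, 0.4883),
    (500, 0.6819, 0.7542, 0.4961), (600, 0.7065, 0.7290, 0.5039),
    (700, 0.6892, 0.7298, 0.5117), (800, 0.7117, 0.7436, 0.4961),
    (900, 0.6835, 0.7565, 0.5137), (1000, 0.2775, 0.9428, 0.5137),
    (1100, 0.2471, 0.9772, 0.5234), (1200, 0.2465, 0.9777, 0.5254),
    (1300, 0.2601, 0.9855, 0.5488), (1400, 0.2404, 0.9077, 0.5293),
    (1500, 0.2371, 0.9027, 0.5508), (1600, 0.2329, 0.9480, 0.5488),
    (1700, 0.231, 0.8705, 0.5527), (1800, 0.2045, 0.9315, 0.5625),
    (1900, 0.2467, 0.8831, 0.5586), (2000, 0.0377, 1.3504, 0.5371),
    (2100, 0.0265, 1.5050, 0.5566), (2200, 0.0363, 1.4563, 0.5312),
    (2300, 0.0292, 1.5373, 0.5449), (2400, 0.0491, 1.4556, 0.5410),
    (2500, 0.0324, 1.4875, 0.5391), (2600, 0.0247, 1.4541, 0.5410),
    (2700, 0.0335, 1.4734, 0.5371), (2800, 0.0263, 1.4829, 0.5371),
]

# Row with the lowest validation loss.
best_by_val_loss = min(rows, key=lambda r: r[2])
# Row with the highest reward accuracy.
best_by_accuracy = max(rows, key=lambda r: r[3])
# Gap between validation and training loss at the last logged step.
final_step, final_train, final_val, _ = rows[-1]
final_gap = final_val - final_train

print("lowest validation loss:", best_by_val_loss)
print("highest reward accuracy:", best_by_accuracy)
print(f"validation minus training loss at step {final_step}: {final_gap:.4f}")
```

On these numbers the lowest validation loss is at step 100 and the highest reward accuracy at step 1800, while the gap between validation and training loss is largest in the third epoch.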

Framework versions

Transformers 4.44.2
Pytorch 2.3.0
Datasets 3.0.0
Tokenizers 0.19.1

CharlesLi/OpenELM-1_1B-DPO-full-max-second-reward
