llama3_8b_instruct_dpo_bwgenerator_v2
This model is a fine-tuned version of NanQiangHF/llama3_8b_instruct_bwgenerator on an unknown dataset. It achieves the following results on the evaluation set:
- Loss: 0.3494
- Rewards/chosen: -0.5156
- Rewards/rejected: -2.0278
- Rewards/accuracies: 0.8713
- Rewards/margins: 1.5122
- Logps/rejected: -88.0817
- Logps/chosen: -43.7343
- Logits/rejected: 0.7079
- Logits/chosen: 0.1945
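For context on the metrics above: in DPO, the implicit reward of a response is the beta-scaled log-ratio between the policy and the reference model, and the margin is the difference between the chosen and rejected rewards. A sketch of the standard definitions (the beta value and reference model are not stated in this card and are assumed from the usual DPO setup):

```latex
r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}, \qquad
\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\big( r(x, y_{\mathrm{chosen}}) - r(x, y_{\mathrm{rejected}}) \big)
```

Under these definitions, Rewards/margins is Rewards/chosen minus Rewards/rejected (here -0.5156 - (-2.0278) = 1.5122), and Rewards/accuracies is the fraction of evaluation pairs for which the chosen reward exceeds the rejected reward.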
Model description
More information needed
Intended uses & limitations
More information needed
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-06
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1
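The card does not state the training stack, but these hyperparameters map directly onto TRL's DPOTrainer; below is a minimal sketch under that assumption. The dataset path, output directory, and adapter settings are placeholders (they are not documented here), and the exact DPOConfig/DPOTrainer signature varies across TRL versions.

```python
# Hypothetical reconstruction of the training setup; not the authors' script.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "NanQiangHF/llama3_8b_instruct_bwgenerator"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# The training dataset is unknown; any preference dataset with
# "prompt" / "chosen" / "rejected" columns fits the DPOTrainer interface.
train_dataset = load_dataset("path/to/preference_dataset", split="train")  # placeholder

args = DPOConfig(
    output_dir="llama3_8b_instruct_dpo_bwgenerator_v2",
    learning_rate=2e-6,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=1,
    lr_scheduler_type="linear",
    seed=42,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,  # `processing_class=` in newer TRL releases
    peft_config=LoraConfig(task_type="CAUSAL_LM"),  # adapter hyperparameters not given in the card
)
trainer.train()
```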
Training results
Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
---|---|---|---|---|---|---|---|---|---|---|---|
0.5461 | 0.0719 | 1000 | 0.4574 | -0.0823 | -0.9594 | 0.8261 | 0.8771 | -77.3979 | -39.4010 | 0.6931 | 0.1837 |
0.426 | 0.1438 | 2000 | 0.3856 | -0.3308 | -1.6338 | 0.8454 | 1.3030 | -84.1417 | -41.8860 | 0.7041 | 0.1914 |
0.3758 | 0.2157 | 3000 | 0.3593 | -0.4540 | -1.9108 | 0.8652 | 1.4567 | -86.9117 | -43.1185 | 0.7065 | 0.1933 |
0.3611 | 0.2876 | 4000 | 0.3515 | -0.5039 | -2.0063 | 0.8687 | 1.5024 | -87.8675 | -43.6177 | 0.7088 | 0.1952 |
0.3438 | 0.3595 | 5000 | 0.3502 | -0.5107 | -2.0200 | 0.8681 | 1.5093 | -88.0041 | -43.6858 | 0.7085 | 0.1951 |
0.357 | 0.4313 | 6000 | 0.3487 | -0.5159 | -2.0325 | 0.8668 | 1.5166 | -88.1288 | -43.7373 | 0.7092 | 0.1955 |
0.3562 | 0.5032 | 7000 | 0.3496 | -0.5151 | -2.0278 | 0.8707 | 1.5127 | -88.0820 | -43.7290 | 0.7093 | 0.1956 |
0.3597 | 0.5751 | 8000 | 0.3493 | -0.5179 | -2.0304 | 0.8707 | 1.5125 | -88.1081 | -43.7570 | 0.7092 | 0.1956 |
0.3437 | 0.6470 | 9000 | 0.3492 | -0.5132 | -2.0264 | 0.8691 | 1.5132 | -88.0680 | -43.7105 | 0.7109 | 0.1971 |
0.3544 | 0.7189 | 10000 | 0.3488 | -0.5160 | -2.0301 | 0.8704 | 1.5142 | -88.1054 | -43.7379 | 0.7089 | 0.1953 |
0.3451 | 0.7908 | 11000 | 0.3498 | -0.5116 | -2.0235 | 0.8694 | 1.5119 | -88.0395 | -43.6945 | 0.7089 | 0.1951 |
0.3543 | 0.8627 | 12000 | 0.3485 | -0.5155 | -2.0306 | 0.8687 | 1.5151 | -88.1099 | -43.7334 | 0.7091 | 0.1955 |
0.3609 | 0.9346 | 13000 | 0.3494 | -0.5156 | -2.0278 | 0.8713 | 1.5122 | -88.0817 | -43.7343 | 0.7079 | 0.1945 |
Framework versions
- PEFT 0.10.0
- Transformers 4.44.0
- Pytorch 2.3.0+cu121
- Datasets 3.0.0
- Tokenizers 0.19.1
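Since the published artifact is a PEFT adapter (PEFT 0.10.0 above), here is a minimal loading sketch, assuming the adapter repo id matches this card's title and the base model is the one named at the top of the card; the prompt format is not documented and is left as a placeholder.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "NanQiangHF/llama3_8b_instruct_bwgenerator"
adapter_id = "NanQiangHF/llama3_8b_instruct_dpo_bwgenerator_v2"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)  # attach the DPO-tuned adapter
model.eval()

prompt = "..."  # expected prompt format is not documented in this card
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```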