---
license: other
base_model: deepseek-ai/deepseek-llm-7b-chat
tags:
- alignment-handbook
- trl
- dpo
- generated_from_trainer
datasets:
- self-generate/ds_chat_original_cn_mining_oj_iter0-binarized
- self-generate/ds_chat_original_cn_mining_sandbox_iter0-binarized
- self-generate/ds_chat_original_cn_rl_oj_iter0-binarized
model-index:
- name: ds_chat_sppo_hard_new_iter0_2024-09-14-21.15
  results: []
---

[Visualize in Weights & Biases](https://ml.byteintl.net/experiment/tracking/detail?Id=project_20240915_20321b8f&selectedTrial=run_20240915_971b4903)

# ds_chat_sppo_hard_new_iter0_2024-09-14-21.15

This model is a fine-tuned version of [deepseek-ai/deepseek-llm-7b-chat](https://huggingface.co/deepseek-ai/deepseek-llm-7b-chat) on the self-generate/ds_chat_original_cn_mining_oj_iter0-binarized, self-generate/ds_chat_original_cn_mining_sandbox_iter0-binarized, and self-generate/ds_chat_original_cn_rl_oj_iter0-binarized datasets.
It achieves the following results on the evaluation set:
- Loss: 0.4951
- Rewards/chosen: 0.0190
- Rewards/rejected: -0.0009
- Rewards/accuracies: 0.3684
- Rewards/margins: 0.0199
- Logps/rejected: -63.9738
- Logps/chosen: -121.2440
- Logits/rejected: 1.7159
- Logits/chosen: 1.6562
- Debug/policy Chosen Logits: 1.6562
- Debug/policy Rejected Logits: 1.7159
- Debug/policy Chosen Logps: -121.2440
- Debug/policy Rejected Logps: -63.9738
- Debug/reference Chosen Logps: -123.1481
- Debug/reference Rejected Logps: -63.8871

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (a hedged sketch of how they map onto a TRL training config follows the list):
- learning_rate: 1e-07
- train_batch_size: 8
- eval_batch_size: 4
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- total_train_batch_size: 64
- total_eval_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- lr_scheduler_warmup_steps: 100
- num_epochs: 8.0
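A minimal sketch of how these settings could map onto TRL's `DPOConfig`/`DPOTrainer`, in the spirit of the alignment-handbook recipe; this is a reconstruction, not the actual training script. Two values are inferences rather than recorded settings: the model name suggests TRL's `sppo_hard` loss, and the logged rewards are consistent with `beta = 0.01`, since Rewards/chosen ≈ 0.01 × (Debug/policy Chosen Logps − Debug/reference Chosen Logps) = 0.01 × (−121.2440 + 123.1481) ≈ 0.0190.

```python
# Hypothetical reconstruction of the training setup. `loss_type` and `beta`
# are inferred (see above), the "train" split names are assumed, and the
# datasets are assumed to already be in the prompt/chosen/rejected format
# that DPOTrainer expects.
from datasets import concatenate_datasets, load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "deepseek-ai/deepseek-llm-7b-chat"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

train_dataset = concatenate_datasets([
    load_dataset(name, split="train")
    for name in [
        "self-generate/ds_chat_original_cn_mining_oj_iter0-binarized",
        "self-generate/ds_chat_original_cn_mining_sandbox_iter0-binarized",
        "self-generate/ds_chat_original_cn_rl_oj_iter0-binarized",
    ]
])

config = DPOConfig(
    output_dir="ds_chat_sppo_hard_new_iter0",
    loss_type="sppo_hard",          # assumption, inferred from the model name
    beta=0.01,                      # inferred: logged rewards = beta * (policy logp - reference logp)
    learning_rate=1e-7,
    per_device_train_batch_size=8,  # x 8 GPUs -> total train batch size 64
    per_device_eval_batch_size=4,   # x 8 GPUs -> total eval batch size 32
    num_train_epochs=8.0,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    warmup_steps=100,               # a nonzero warmup_steps overrides warmup_ratio
    seed=42,
)

trainer = DPOTrainer(
    model,
    ref_model=None,  # TRL clones the policy as a frozen reference model
    args=config,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```

With per-device batch size 8 on 8 GPUs, no gradient accumulation is needed to reach the reported total train batch size of 64.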
### Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen | Debug/policy Chosen Logits | Debug/policy Rejected Logits | Debug/policy Chosen Logps | Debug/policy Rejected Logps | Debug/reference Chosen Logps | Debug/reference Rejected Logps |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 0.4997 | 0.3623 | 100 | 0.4979 | 0.0051 | -0.0005 | 0.3421 | 0.0056 | -63.9373 | -122.6352 | 1.7236 | 1.6612 | 1.6612 | 1.7236 | -122.6352 | -63.9373 | -123.1481 | -63.8871 |
| 0.5018 | 0.7246 | 200 | 0.4996 | 0.0156 | 0.0052 | 0.3421 | 0.0104 | -63.3698 | -121.5860 | 1.7403 | 1.6799 | 1.6799 | 1.7403 | -121.5860 | -63.3698 | -123.1481 | -63.8871 |
| 0.4991 | 1.0870 | 300 | 0.4987 | 0.0190 | 0.0068 | 0.3158 | 0.0123 | -63.2120 | -121.2448 | 1.7605 | 1.7000 | 1.7000 | 1.7605 | -121.2448 | -63.2120 | -123.1481 | -63.8871 |
| 0.5007 | 1.4493 | 400 | 0.4975 | 0.0176 | 0.0038 | 0.2895 | 0.0139 | -63.5094 | -121.3837 | 1.7412 | 1.6815 | 1.6815 | 1.7412 | -121.3837 | -63.5094 | -123.1481 | -63.8871 |
| 0.5006 | 1.8116 | 500 | 0.4966 | 0.0132 | 0.0019 | 0.3553 | 0.0113 | -63.6979 | -121.8322 | 1.7278 | 1.6669 | 1.6669 | 1.7278 | -121.8322 | -63.6979 | -123.1481 | -63.8871 |
| 0.4944 | 2.1739 | 600 | 0.4969 | 0.0196 | 0.0035 | 0.3421 | 0.0160 | -63.5333 | -121.1920 | 1.7400 | 1.6805 | 1.6805 | 1.7400 | -121.1920 | -63.5333 | -123.1481 | -63.8871 |
| 0.4988 | 2.5362 | 700 | 0.4959 | 0.0175 | 0.0032 | 0.3553 | 0.0143 | -63.5656 | -121.4005 | 1.7441 | 1.6843 | 1.6843 | 1.7441 | -121.4005 | -63.5656 | -123.1481 | -63.8871 |
| 0.4975 | 2.8986 | 800 | 0.4967 | 0.0221 | 0.0072 | 0.3553 | 0.0150 | -63.1701 | -120.9358 | 1.7439 | 1.6851 | 1.6851 | 1.7439 | -120.9358 | -63.1701 | -123.1481 | -63.8871 |
| 0.495 | 3.2609 | 900 | 0.4955 | 0.0202 | 0.0021 | 0.3421 | 0.0180 | -63.6741 | -121.1320 | 1.7492 | 1.6875 | 1.6875 | 1.7492 | -121.1320 | -63.6741 | -123.1481 | -63.8871 |
| 0.4961 | 3.6232 | 1000 | 0.4958 | 0.0210 | 0.0019 | 0.3421 | 0.0191 | -63.6937 | -121.0436 | 1.7449 | 1.6854 | 1.6854 | 1.7449 | -121.0436 | -63.6937 | -123.1481 | -63.8871 |
| 0.4979 | 3.9855 | 1100 | 0.4952 | 0.0160 | -0.0011 | 0.3816 | 0.0171 | -63.9974 | -121.5451 | 1.7309 | 1.6720 | 1.6720 | 1.7309 | -121.5451 | -63.9974 | -123.1481 | -63.8871 |
| 0.4985 | 4.3478 | 1200 | 0.4958 | 0.0157 | 0.0002 | 0.3289 | 0.0154 | -63.8621 | -121.5809 | 1.7273 | 1.6675 | 1.6675 | 1.7273 | -121.5809 | -63.8621 | -123.1481 | -63.8871 |
| 0.4977 | 4.7101 | 1300 | 0.4968 | 0.0195 | 0.0012 | 0.3158 | 0.0182 | -63.7631 | -121.2019 | 1.7106 | 1.6512 | 1.6512 | 1.7106 | -121.2019 | -63.7631 | -123.1481 | -63.8871 |
| 0.4966 | 5.0725 | 1400 | 0.4958 | 0.0186 | 0.0002 | 0.3289 | 0.0184 | -63.8648 | -121.2832 | 1.7173 | 1.6585 | 1.6585 | 1.7173 | -121.2832 | -63.8648 | -123.1481 | -63.8871 |
| 0.4935 | 5.4348 | 1500 | 0.4958 | 0.0160 | 0.0005 | 0.2632 | 0.0155 | -63.8391 | -121.5465 | 1.7152 | 1.6570 | 1.6570 | 1.7152 | -121.5465 | -63.8391 | -123.1481 | -63.8871 |
| 0.4975 | 5.7971 | 1600 | 0.4963 | 0.0197 | 0.0018 | 0.3026 | 0.0179 | -63.7076 | -121.1778 | 1.7160 | 1.6571 | 1.6571 | 1.7160 | -121.1778 | -63.7076 | -123.1481 | -63.8871 |
| 0.4934 | 6.1594 | 1700 | 0.4958 | 0.0142 | -0.0019 | 0.3553 | 0.0162 | -64.0808 | -121.7252 | 1.7082 | 1.6502 | 1.6502 | 1.7082 | -121.7252 | -64.0808 | -123.1481 | -63.8871 |
| 0.4956 | 6.5217 | 1800 | 0.4957 | 0.0210 | 0.0005 | 0.3421 | 0.0205 | -63.8361 | -121.0436 | 1.7185 | 1.6581 | 1.6581 | 1.7185 | -121.0436 | -63.8361 | -123.1481 | -63.8871 |
| 0.496 | 6.8841 | 1900 | 0.4958 | 0.0212 | 0.0018 | 0.2895 | 0.0194 | -63.7090 | -121.0307 | 1.7158 | 1.6582 | 1.6582 | 1.7158 | -121.0307 | -63.7090 | -123.1481 | -63.8871 |
| 0.495 | 7.2464 | 2000 | 0.4953 | 0.0175 | 0.0019 | 0.3289 | 0.0156 | -63.6983 | -121.4027 | 1.7189 | 1.6600 | 1.6600 | 1.7189 | -121.4027 | -63.6983 | -123.1481 | -63.8871 |
| 0.4967 | 7.6087 | 2100 | 0.4958 | 0.0202 | -0.0001 | 0.2895 | 0.0203 | -63.8998 | -121.1321 | 1.7188 | 1.6592 | 1.6592 | 1.7188 | -121.1321 | -63.8998 | -123.1481 | -63.8871 |
| 0.4948 | 7.9710 | 2200 | 0.4951 | 0.0190 | -0.0009 | 0.3684 | 0.0199 | -63.9738 | -121.2440 | 1.7159 | 1.6562 | 1.6562 | 1.7159 | -121.2440 | -63.9738 | -123.1481 | -63.8871 |

### Framework versions

- Transformers 4.42.0
- Pytorch 2.3.0+cu121
- Datasets 2.14.6
- Tokenizers 0.19.1
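## Example usage

A hedged inference sketch against the pinned framework versions above. The checkpoint id below is a placeholder (this card does not state where the weights are published), and the prompt and generation settings are illustrative only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: replace with the local path or Hub id of this checkpoint.
model_id = "ds_chat_sppo_hard_new_iter0_2024-09-14-21.15"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# deepseek-llm-7b-chat ships a chat template, which the fine-tune inherits.
messages = [
    {"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```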