metadata
license: other
base_model: deepseek-ai/deepseek-llm-7b-chat
tags:
- alignment-handbook
- trl
- dpo
- generated_from_trainer
datasets:
- self-generate/ds_chat_original_cn_mining_oj_iter0-binarized
- self-generate/ds_chat_original_cn_mining_sandbox_iter0-binarized
- self-generate/ds_chat_original_cn_rl_oj_iter0-binarized
model-index:
- name: ds_chat_sppo_hard_iter0_2024-09-14-21.15
  results: []
ds_chat_sppo_hard_iter0_2024-09-14-21.15
This model is a fine-tuned version of deepseek-ai/deepseek-llm-7b-chat, trained on three datasets: self-generate/ds_chat_original_cn_mining_oj_iter0-binarized, self-generate/ds_chat_original_cn_mining_sandbox_iter0-binarized, and self-generate/ds_chat_original_cn_rl_oj_iter0-binarized. At the final checkpoint (step 2200), it achieves the following results on the evaluation set:
- Loss: 4952.6191
- Rewards/chosen: 0.0173
- Rewards/rejected: 0.0003
- Rewards/accuracies: 0.2763
- Rewards/margins: 0.0170
- Logps/rejected: -63.8573
- Logps/chosen: -121.4135
- Logits/rejected: 1.7167
- Logits/chosen: 1.6591
- Debug/policy Chosen Logits: 1.6591
- Debug/policy Rejected Logits: 1.7167
- Debug/policy Chosen Logps: -121.4135
- Debug/policy Rejected Logps: -63.8573
- Debug/reference Chosen Logps: -123.1481
- Debug/reference Rejected Logps: -63.8871
- Debug/sppo Chosen Reward In Loss: 1.7345
- Debug/sppo Rej Reward In Loss: 0.0297
- Debug/sppo Chosen Loss: 2393.8552
- Debug/sppo Reject Loss: 2503.0667
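Several of the metrics above are related to one another, which allows a quick sanity check. The sketch below recomputes the log-ratio and reward values from the listed log-probabilities. It assumes a reward scale of `beta = 0.01`; that value is not stated in this card and is only inferred from the ratio between `Rewards/chosen` and `Debug/sppo Chosen Reward In Loss`. The listed values are batch means rounded to four decimals, so the recomputed numbers are expected to agree only up to rounding.

```python
# Final-checkpoint evaluation values copied from the list above.
policy_chosen_logps = -121.4135
policy_rejected_logps = -63.8573
reference_chosen_logps = -123.1481
reference_rejected_logps = -63.8871

# ASSUMPTION: beta is not given in the card; 0.01 is inferred from the
# reported numbers (Rewards/chosen divided by the chosen log-ratio).
beta = 0.01

# Log-ratio of policy to reference model for each response type.
chosen_logratio = policy_chosen_logps - reference_chosen_logps
rejected_logratio = policy_rejected_logps - reference_rejected_logps

# Scaled rewards and their margin.
chosen_reward = beta * chosen_logratio
rejected_reward = beta * rejected_logratio
margin = chosen_reward - rejected_reward

print(f"chosen log-ratio:   {chosen_logratio:.4f} (card: 1.7345)")
print(f"rejected log-ratio: {rejected_logratio:.4f} (card: 0.0297)")
print(f"Rewards/chosen:     {chosen_reward:.4f} (card: 0.0173)")
print(f"Rewards/rejected:   {rejected_reward:.4f} (card: 0.0003)")
print(f"Rewards/margins:    {margin:.4f} (card: 0.0170)")
```

Note that the reference log-probabilities are identical in every row of the training results, which is consistent with a frozen reference model evaluated on a fixed evaluation set.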
Model description
More information needed
Intended uses & limitations
More information needed
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-07
- train_batch_size: 8
- eval_batch_size: 4
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- total_train_batch_size: 64
- total_eval_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- lr_scheduler_warmup_steps: 100
- num_epochs: 8.0
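The batch-size and schedule settings above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not the training code itself. It assumes no gradient accumulation (consistent with 8 × 8 = 64), a total of 2208 optimizer steps (inferred from the training results, where 100 steps correspond to about 0.3623 epochs, i.e. roughly 276 steps per epoch), and that the explicit `lr_scheduler_warmup_steps: 100` takes precedence over `lr_scheduler_warmup_ratio: 0.1`, as is the case in Transformers when the step count is greater than zero. The function `linear_lr` is a hypothetical helper written for illustration.

```python
# Hyperparameters copied from the list above.
learning_rate = 1e-07
train_batch_size = 8
eval_batch_size = 4
num_devices = 8
warmup_steps = 100
num_epochs = 8

# ASSUMPTION: about 276 optimizer steps per epoch, inferred from the
# training results (step 100 is reported at epoch 0.3623).
steps_per_epoch = 276
total_steps = steps_per_epoch * num_epochs

# Effective batch sizes across all devices (no gradient accumulation assumed).
total_train_batch_size = train_batch_size * num_devices
total_eval_batch_size = eval_batch_size * num_devices


def linear_lr(step: int) -> float:
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return learning_rate * remaining / max(1, total_steps - warmup_steps)


print("total_train_batch_size:", total_train_batch_size)
print("total_eval_batch_size:", total_eval_batch_size)
for step in (0, 50, 100, 1154, total_steps):
    print(f"step {step:>4}: lr = {linear_lr(step):.3e}")
```

Under these assumptions the learning rate peaks at 1e-07 at step 100 and decays linearly to zero by the last step.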
Training results
Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen | Debug/policy Chosen Logits | Debug/policy Rejected Logits | Debug/policy Chosen Logps | Debug/policy Rejected Logps | Debug/reference Chosen Logps | Debug/reference Rejected Logps | Debug/sppo Chosen Reward In Loss | Debug/sppo Rej Reward In Loss | Debug/sppo Chosen Loss | Debug/sppo Reject Loss |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4997.2328 | 0.3623 | 100 | 4979.2803 | 0.0044 | -0.0011 | 0.3289 | 0.0055 | -64.0002 | -122.7075 | 1.7218 | 1.6606 | 1.6606 | 1.7218 | -122.7075 | -64.0002 | -123.1481 | -63.8871 | 0.4405 | -0.1132 | 2463.0359 | 2489.7429 |
5010.2789 | 0.7246 | 200 | 4991.7910 | 0.0172 | 0.0053 | 0.3289 | 0.0119 | -63.3570 | -121.4287 | 1.7384 | 1.6785 | 1.6785 | 1.7384 | -121.4287 | -63.3570 | -123.1481 | -63.8871 | 1.7193 | 0.5301 | 2393.0474 | 2574.4319 |
4985.9242 | 1.0870 | 300 | 4983.2910 | 0.0172 | 0.0045 | 0.3026 | 0.0128 | -63.4403 | -121.4232 | 1.7425 | 1.6831 | 1.6831 | 1.7425 | -121.4232 | -63.4403 | -123.1481 | -63.8871 | 1.7249 | 0.4468 | 2390.1040 | 2560.6348 |
5008.7777 | 1.4493 | 400 | 4973.2788 | 0.0150 | 0.0051 | 0.3289 | 0.0099 | -63.3768 | -121.6436 | 1.7315 | 1.6724 | 1.6724 | 1.7315 | -121.6436 | -63.3768 | -123.1481 | -63.8871 | 1.5044 | 0.5103 | 2394.0208 | 2569.0662 |
5014.366 | 1.8116 | 500 | 4963.4956 | 0.0126 | 0.0012 | 0.2895 | 0.0114 | -63.7654 | -121.8871 | 1.7289 | 1.6691 | 1.6691 | 1.7289 | -121.8871 | -63.7654 | -123.1481 | -63.8871 | 1.2610 | 0.1216 | 2407.8684 | 2513.1951 |
4949.5211 | 2.1739 | 600 | 4968.5161 | 0.0164 | 0.0024 | 0.2895 | 0.0140 | -63.6428 | -121.5044 | 1.7287 | 1.6694 | 1.6694 | 1.7287 | -121.5044 | -63.6428 | -123.1481 | -63.8871 | 1.6436 | 0.2443 | 2388.6809 | 2529.8535 |
4995.5281 | 2.5362 | 700 | 4965.4644 | 0.0172 | 0.0029 | 0.3684 | 0.0143 | -63.5985 | -121.4247 | 1.7321 | 1.6727 | 1.6727 | 1.7321 | -121.4247 | -63.5985 | -123.1481 | -63.8871 | 1.7233 | 0.2886 | 2388.6721 | 2533.9565 |
4969.6547 | 2.8986 | 800 | 4971.4702 | 0.0216 | 0.0059 | 0.3684 | 0.0157 | -63.2935 | -120.9840 | 1.7477 | 1.6868 | 1.6868 | 1.7477 | -120.9840 | -63.2935 | -123.1481 | -63.8871 | 2.1640 | 0.5935 | 2372.7554 | 2588.5020 |
4953.4711 | 3.2609 | 900 | 4955.8784 | 0.0187 | 0.0036 | 0.3026 | 0.0152 | -63.5316 | -121.2758 | 1.7427 | 1.6827 | 1.6827 | 1.7427 | -121.2758 | -63.5316 | -123.1481 | -63.8871 | 1.8722 | 0.3555 | 2372.4011 | 2545.3831 |
4961.9289 | 3.6232 | 1000 | 4967.9907 | 0.0209 | 0.0059 | 0.3026 | 0.0150 | -63.3005 | -121.0624 | 1.7481 | 1.6892 | 1.6892 | 1.7481 | -121.0624 | -63.3005 | -123.1481 | -63.8871 | 2.0856 | 0.5865 | 2372.6270 | 2586.2114 |
4979.5078 | 3.9855 | 1100 | 4955.5312 | 0.0142 | 0.0005 | 0.3158 | 0.0138 | -63.8419 | -121.7271 | 1.7192 | 1.6605 | 1.6605 | 1.7192 | -121.7271 | -63.8419 | -123.1481 | -63.8871 | 1.4210 | 0.0452 | 2399.0156 | 2504.7903 |
4991.6695 | 4.3478 | 1200 | 4958.5435 | 0.0144 | 0.0012 | 0.3026 | 0.0133 | -63.7715 | -121.7064 | 1.7235 | 1.6634 | 1.6634 | 1.7235 | -121.7064 | -63.7715 | -123.1481 | -63.8871 | 1.4416 | 0.1155 | 2397.6743 | 2512.6323 |
4979.216 | 4.7101 | 1300 | 4964.5874 | 0.0206 | 0.0044 | 0.2763 | 0.0162 | -63.4478 | -121.0882 | 1.7125 | 1.6538 | 1.6538 | 1.7125 | -121.0882 | -63.4478 | -123.1481 | -63.8871 | 2.0598 | 0.4393 | 2382.0825 | 2559.9192 |
4971.2352 | 5.0725 | 1400 | 4960.7969 | 0.0177 | -0.0003 | 0.3158 | 0.0180 | -63.9134 | -121.3772 | 1.7162 | 1.6581 | 1.6581 | 1.7162 | -121.3772 | -63.9134 | -123.1481 | -63.8871 | 1.7709 | -0.0264 | 2388.1853 | 2497.3933 |
4934.9098 | 5.4348 | 1500 | 4958.9351 | 0.0189 | 0.0008 | 0.3026 | 0.0181 | -63.8062 | -121.2587 | 1.7177 | 1.6574 | 1.6574 | 1.7177 | -121.2587 | -63.8062 | -123.1481 | -63.8871 | 1.8893 | 0.0808 | 2388.3345 | 2508.9651 |
4983.5867 | 5.7971 | 1600 | 4956.6689 | 0.0176 | 0.0012 | 0.2763 | 0.0164 | -63.7669 | -121.3872 | 1.7142 | 1.6548 | 1.6548 | 1.7142 | -121.3872 | -63.7669 | -123.1481 | -63.8871 | 1.7609 | 0.1201 | 2383.7910 | 2513.2532 |
4934.0355 | 6.1594 | 1700 | 4958.1274 | 0.0174 | -0.0002 | 0.25 | 0.0175 | -63.9030 | -121.4107 | 1.7053 | 1.6455 | 1.6455 | 1.7053 | -121.4107 | -63.9030 | -123.1481 | -63.8871 | 1.7373 | -0.0159 | 2402.3301 | 2498.4744 |
4962.0086 | 6.5217 | 1800 | 4966.0581 | 0.0219 | 0.0012 | 0.3289 | 0.0207 | -63.7644 | -120.9535 | 1.7137 | 1.6560 | 1.6560 | 1.7137 | -120.9535 | -63.7644 | -123.1481 | -63.8871 | 2.1945 | 0.1226 | 2383.6753 | 2514.6057 |
4963.9734 | 6.8841 | 1900 | 4958.1865 | 0.0215 | 0.0013 | 0.3026 | 0.0202 | -63.7605 | -120.9998 | 1.7137 | 1.6550 | 1.6550 | 1.7137 | -120.9998 | -63.7605 | -123.1481 | -63.8871 | 2.1483 | 0.1265 | 2384.9424 | 2514.3125 |
4951.3387 | 7.2464 | 2000 | 4958.5044 | 0.0208 | 0.0020 | 0.3158 | 0.0189 | -63.6920 | -121.0652 | 1.7131 | 1.6545 | 1.6545 | 1.7131 | -121.0652 | -63.6920 | -123.1481 | -63.8871 | 2.0829 | 0.1950 | 2385.3457 | 2523.0876 |
4969.7758 | 7.6087 | 2100 | 4950.9175 | 0.0165 | -0.0004 | 0.3421 | 0.0169 | -63.9299 | -121.4973 | 1.7156 | 1.6569 | 1.6569 | 1.7156 | -121.4973 | -63.9299 | -123.1481 | -63.8871 | 1.6508 | -0.0429 | 2386.9766 | 2495.8533 |
4946.4094 | 7.9710 | 2200 | 4952.6191 | 0.0173 | 0.0003 | 0.2763 | 0.0170 | -63.8573 | -121.4135 | 1.7167 | 1.6591 | 1.6591 | 1.7167 | -121.4135 | -63.8573 | -123.1481 | -63.8871 | 1.7345 | 0.0297 | 2393.8552 | 2503.0667 |
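To compare checkpoints, the table can be reduced to a few columns and scanned programmatically. The sketch below copies the Step, Validation Loss, and Rewards/margins columns from the table above and reports the step with the lowest validation loss and the step with the widest reward margin. This is only an illustration of how to read the table; a lower validation loss or a wider margin does not by itself establish which checkpoint performs best downstream.

```python
# (step, validation loss, rewards/margins) copied from the table above.
rows = [
    (100, 4979.2803, 0.0055),
    (200, 4991.7910, 0.0119),
    (300, 4983.2910, 0.0128),
    (400, 4973.2788, 0.0099),
    (500, 4963.4956, 0.0114),
    (600, 4968.5161, 0.0140),
    (700, 4965.4644, 0.0143),
    (800, 4971.4702, 0.0157),
    (900, 4955.8784, 0.0152),
    (1000, 4967.9907, 0.0150),
    (1100, 4955.5312, 0.0138),
    (1200, 4958.5435, 0.0133),
    (1300, 4964.5874, 0.0162),
    (1400, 4960.7969, 0.0180),
    (1500, 4958.9351, 0.0181),
    (1600, 4956.6689, 0.0164),
    (1700, 4958.1274, 0.0175),
    (1800, 4966.0581, 0.0207),
    (1900, 4958.1865, 0.0202),
    (2000, 4958.5044, 0.0189),
    (2100, 4950.9175, 0.0169),
    (2200, 4952.6191, 0.0170),
]

# Row with the smallest validation loss.
lowest_loss = min(rows, key=lambda row: row[1])
# Row with the largest reward margin.
widest_margin = max(rows, key=lambda row: row[2])

print(f"lowest validation loss: {lowest_loss[1]} at step {lowest_loss[0]}")
print(f"widest reward margin:   {widest_margin[2]} at step {widest_margin[0]}")
```

On these rows, the lowest validation loss (4950.9175) is at step 2100 and the widest margin (0.0207) is at step 1800, while the headline metrics in this card are from the final evaluation at step 2200.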
Framework versions
- Transformers 4.42.0
- Pytorch 2.3.0+cu121
- Datasets 2.14.6
- Tokenizers 0.19.1
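When reproducing the run, it can help to compare the local environment against the versions listed above. The following is a minimal sketch using only the standard library. The helper names `base_version` and `check_versions` are hypothetical, and the comparison deliberately ignores local build suffixes such as `+cu121`, since the suffix reported for PyTorch depends on how it was installed.

```python
from importlib import metadata

# Versions copied from the list above; keys are the PyPI distribution names
# (PyTorch is distributed as "torch").
EXPECTED = {
    "transformers": "4.42.0",
    "torch": "2.3.0+cu121",
    "datasets": "2.14.6",
    "tokenizers": "0.19.1",
}


def base_version(version: str) -> str:
    """Drop a local build suffix such as '+cu121'."""
    return version.split("+", 1)[0]


def check_versions(expected=EXPECTED):
    """Return a status string per package: ok, mismatch, or not installed."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = "not installed"
            continue
        if base_version(installed) == base_version(wanted):
            report[package] = "ok"
        else:
            report[package] = f"mismatch: installed {installed}, card lists {wanted}"
    return report


for package, status in check_versions().items():
    print(f"{package}: {status}")
```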