ds_chat_sppo_hard_iter0_2024-09-15-01.39
This model is a fine-tuned version of deepseek-ai/deepseek-llm-7b-chat on the self-generate/ds_chat_original_cn_mining_oj_iter0-binarized, self-generate/ds_chat_original_cn_mining_sandbox_iter0-binarized, and self-generate/ds_chat_original_cn_rl_oj_iter0-binarized datasets. It achieves the following results on the evaluation set (a sketch relating these metrics to each other follows the list):
- Loss: 4624.1011
- Rewards/chosen: 0.0051
- Rewards/rejected: -0.0370
- Rewards/accuracies: 0.5789
- Rewards/margins: 0.0421
- Logps/rejected: -263.3607
- Logps/chosen: -252.4096
- Logits/rejected: 1.4404
- Logits/chosen: 1.3959
- Debug/policy Chosen Logits: 1.3959
- Debug/policy Rejected Logits: 1.4404
- Debug/policy Chosen Logps: -252.4096
- Debug/policy Rejected Logps: -263.3607
- Debug/reference Chosen Logps: -252.9185
- Debug/reference Rejected Logps: -259.6586
- Debug/sppo Chosen Reward In Loss: 0.5089
- Debug/sppo Rej Reward In Loss: -3.7021
- Debug/sppo Chosen Loss: 2526.5620
- Debug/sppo Reject Loss: 2309.3242
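The reward columns above appear to follow the DPO-style convention reward = β · (log p_policy − log p_ref): with β = 0.01 (a value inferred by fitting these numbers, not stated anywhere in this card), the reported log-probabilities reproduce the reported rewards. A minimal Python sketch of that relationship, using the final evaluation values:

```python
# Sketch: how the reward metrics above appear to relate to the logged
# log-probabilities. beta = 0.01 is inferred from these numbers; it is
# not stated in this card, so treat it as an assumption.

policy_chosen_logps = -252.4096      # Debug/policy Chosen Logps
policy_rejected_logps = -263.3607    # Debug/policy Rejected Logps
ref_chosen_logps = -252.9185         # Debug/reference Chosen Logps
ref_rejected_logps = -259.6586       # Debug/reference Rejected Logps

beta = 0.01  # assumed KL coefficient

# Policy/reference log-ratios (the "... Reward In Loss" columns).
chosen_logratio = policy_chosen_logps - ref_chosen_logps        # 0.5089
rejected_logratio = policy_rejected_logps - ref_rejected_logps  # -3.7021

# Implicit rewards and margin, matching the columns above.
rewards_chosen = beta * chosen_logratio             # ~0.0051
rewards_rejected = beta * rejected_logratio         # ~-0.0370
rewards_margin = rewards_chosen - rewards_rejected  # ~0.0421

# The per-side "sppo" losses are roughly consistent with SPPO's
# hard-label squared objective, which pushes the chosen log-ratio
# toward +1/(2*beta) and the rejected one toward -1/(2*beta). The eval
# numbers do not factor exactly because they average squares over
# examples rather than squaring the averages used here.
approx_chosen_loss = (chosen_logratio - 1 / (2 * beta)) ** 2    # ~2449 vs 2526.6 reported
approx_reject_loss = (rejected_logratio + 1 / (2 * beta)) ** 2  # ~2143 vs 2309.3 reported
```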
Model description
More information needed
Intended uses & limitations
More information needed
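In lieu of documented usage, the sketch below shows standard transformers inference, assuming the checkpoint is published under the repo id shown in the model tree at the bottom of this card and that it reuses the base model's chat template.

```python
# Minimal inference sketch. The repo id is taken from the model tree at
# the bottom of this card and may not match the published checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yiran-wang3/ds_chat_sppo_hard_iter0_nomask_linear_schedule"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The base model is a chat model, so apply its chat template.
messages = [{"role": "user", "content": "Write a function that reverses a string."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```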
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training (a configuration sketch follows the list):
- learning_rate: 1e-07
- train_batch_size: 8
- eval_batch_size: 4
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- total_train_batch_size: 64
- total_eval_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- lr_scheduler_warmup_steps: 100
- num_epochs: 8.0
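For orientation, the list above maps onto a transformers TrainingArguments roughly as follows. This is a sketch of the listed values only; the actual run used an SPPO trainer whose extra options (e.g. β, loss type) are not documented in this card.

```python
# Sketch mirroring the listed hyperparameters; not the actual recipe.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="ds_chat_sppo_hard_iter0",  # hypothetical name
    learning_rate=1e-07,
    per_device_train_batch_size=8,   # x 8 GPUs -> total_train_batch_size 64
    per_device_eval_batch_size=4,    # x 8 GPUs -> total_eval_batch_size 32
    seed=42,
    num_train_epochs=8.0,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    warmup_steps=100,  # in transformers, warmup_steps takes precedence over warmup_ratio
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-08,
)
```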
Training results
Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen | Debug/policy Chosen Logits | Debug/policy Rejected Logits | Debug/policy Chosen Logps | Debug/policy Rejected Logps | Debug/reference Chosen Logps | Debug/reference Rejected Logps | Debug/sppo Chosen Reward In Loss | Debug/sppo Rej Reward In Loss | Debug/sppo Chosen Loss | Debug/sppo Reject Loss |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4975.3273 | 0.3623 | 100 | 4981.6489 | -0.0033 | -0.0038 | 0.4605 | 0.0004 | -260.0373 | -253.2532 | 1.7010 | 1.6372 | 1.6372 | 1.7010 | -253.2532 | -260.0373 | -252.9185 | -259.6586 | -0.3347 | -0.3786 | 2534.3679 | 2463.3860 |
4930.2141 | 0.7246 | 200 | 4924.0649 | -0.0013 | -0.0060 | 0.5789 | 0.0047 | -260.2596 | -253.0476 | 1.6680 | 1.6070 | 1.6070 | 1.6680 | -253.0476 | -260.2596 | -252.9185 | -259.6586 | -0.1291 | -0.6009 | 2514.6309 | 2444.3210 |
4841.2859 | 1.0870 | 300 | 4866.0864 | -0.0095 | -0.0185 | 0.5395 | 0.0089 | -261.5047 | -253.8716 | 1.6500 | 1.5926 | 1.5926 | 1.6500 | -253.8716 | -261.5047 | -252.9185 | -259.6586 | -0.9531 | -1.8460 | 2603.5461 | 2331.7520 |
4822.266 | 1.4493 | 400 | 4827.9761 | -0.0173 | -0.0295 | 0.5395 | 0.0122 | -262.6080 | -254.6497 | 1.6162 | 1.5603 | 1.5603 | 1.6162 | -254.6497 | -262.6080 | -252.9185 | -259.6586 | -1.7313 | -2.9494 | 2692.5408 | 2243.4092 |
4715.0469 | 1.8116 | 500 | 4771.2051 | -0.0007 | -0.0176 | 0.4868 | 0.0169 | -261.4219 | -252.9887 | 1.5898 | 1.5341 | 1.5341 | 1.5898 | -252.9887 | -261.4219 | -252.9185 | -259.6586 | -0.0703 | -1.7633 | 2529.2981 | 2376.3818 |
4665.2648 | 2.1739 | 600 | 4749.7798 | 0.0008 | -0.0212 | 0.5395 | 0.0220 | -261.7789 | -252.8382 | 1.5688 | 1.5147 | 1.5147 | 1.5688 | -252.8382 | -261.7789 | -252.9185 | -259.6586 | 0.0803 | -2.1202 | 2515.5928 | 2344.7095 |
4625.0359 | 2.5362 | 700 | 5035.4683 | 0.0876 | 0.0697 | 0.6447 | 0.0179 | -252.6841 | -244.1548 | 1.5685 | 1.5098 | 1.5098 | 1.5685 | -244.1548 | -252.6841 | -252.9185 | -259.6586 | 8.7637 | 6.9746 | 1714.2816 | 3259.7661 |
4637.3375 | 2.8986 | 800 | 4705.7749 | -0.0031 | -0.0319 | 0.5921 | 0.0287 | -262.8461 | -253.2311 | 1.5294 | 1.4773 | 1.4773 | 1.5294 | -253.2311 | -262.8461 | -252.9185 | -259.6586 | -0.3127 | -3.1874 | 2569.7046 | 2272.2061 |
4550.082 | 3.2609 | 900 | 4687.2900 | -0.0001 | -0.0318 | 0.5921 | 0.0317 | -262.8345 | -252.9287 | 1.5160 | 1.4652 | 1.4652 | 1.5160 | -252.9287 | -262.8345 | -252.9185 | -259.6586 | -0.0102 | -3.1759 | 2544.3586 | 2288.0042 |
4612.343 | 3.6232 | 1000 | 4670.3667 | 0.0005 | -0.0323 | 0.5658 | 0.0328 | -262.8906 | -252.8681 | 1.5061 | 1.4569 | 1.4569 | 1.5061 | -252.8681 | -262.8906 | -252.9185 | -259.6586 | 0.0504 | -3.2320 | 2546.7378 | 2296.4641 |
4579.3098 | 3.9855 | 1100 | 4676.5903 | -0.0058 | -0.0391 | 0.5263 | 0.0333 | -263.5656 | -253.4963 | 1.5062 | 1.4565 | 1.4565 | 1.5062 | -253.4963 | -263.5656 | -252.9185 | -259.6586 | -0.5778 | -3.9070 | 2616.4526 | 2253.1421 |
4461.193 | 4.3478 | 1200 | 4657.2646 | 0.0038 | -0.0339 | 0.6053 | 0.0377 | -263.0466 | -252.5387 | 1.4919 | 1.4449 | 1.4449 | 1.4919 | -252.5387 | -263.0466 | -252.9185 | -259.6586 | 0.3798 | -3.3879 | 2517.6655 | 2292.2590 |
4688.9563 | 4.7101 | 1300 | 4654.3955 | -0.0002 | -0.0373 | 0.5658 | 0.0371 | -263.3885 | -252.9360 | 1.4725 | 1.4244 | 1.4244 | 1.4725 | -252.9360 | -263.3885 | -252.9185 | -259.6586 | -0.0175 | -3.7298 | 2567.2290 | 2285.4812 |
4572.3969 | 5.0725 | 1400 | 4650.5352 | -0.0014 | -0.0398 | 0.5789 | 0.0384 | -263.6363 | -253.0607 | 1.4663 | 1.4206 | 1.4206 | 1.4663 | -253.0607 | -263.6363 | -252.9185 | -259.6586 | -0.1422 | -3.9776 | 2580.2542 | 2263.7637 |
4497.8313 | 5.4348 | 1500 | 4637.4077 | 0.0039 | -0.0371 | 0.5658 | 0.0410 | -263.3676 | -252.5313 | 1.4566 | 1.4118 | 1.4118 | 1.4566 | -252.5313 | -263.3676 | -252.9185 | -259.6586 | 0.3872 | -3.7090 | 2528.2339 | 2293.6980 |
4573.9879 | 5.7971 | 1600 | 4628.5752 | 0.0069 | -0.0333 | 0.5921 | 0.0402 | -262.9847 | -252.2267 | 1.4558 | 1.4099 | 1.4099 | 1.4558 | -252.2267 | -262.9847 | -252.9185 | -259.6586 | 0.6917 | -3.3261 | 2501.1956 | 2325.0657 |
4493.7113 | 6.1594 | 1700 | 4615.8252 | 0.0106 | -0.0325 | 0.5921 | 0.0431 | -262.9095 | -251.8597 | 1.4488 | 1.4028 | 1.4028 | 1.4488 | -251.8597 | -262.9095 | -252.9185 | -259.6586 | 1.0587 | -3.2509 | 2467.5171 | 2344.7961 |
4579.916 | 6.5217 | 1800 | 4618.2861 | 0.0059 | -0.0377 | 0.5789 | 0.0436 | -263.4273 | -252.3270 | 1.4455 | 1.4013 | 1.4013 | 1.4455 | -252.3270 | -263.4273 | -252.9185 | -259.6586 | 0.5915 | -3.7687 | 2516.5059 | 2301.5999 |
4682.2398 | 6.8841 | 1900 | 4613.9302 | 0.0060 | -0.0385 | 0.6184 | 0.0445 | -263.5052 | -252.3165 | 1.4429 | 1.3991 | 1.3991 | 1.4429 | -252.3165 | -263.5052 | -252.9185 | -259.6586 | 0.6019 | -3.8466 | 2513.9785 | 2293.4380 |
4497.943 | 7.2464 | 2000 | 4617.7402 | 0.0049 | -0.0368 | 0.6053 | 0.0417 | -263.3337 | -252.4285 | 1.4409 | 1.3966 | 1.3966 | 1.4409 | -252.4285 | -263.3337 | -252.9185 | -259.6586 | 0.4900 | -3.6751 | 2527.1399 | 2309.4104 |
4470.4805 | 7.6087 | 2100 | 4616.2676 | 0.0083 | -0.0372 | 0.6053 | 0.0455 | -263.3792 | -252.0898 | 1.4419 | 1.3983 | 1.3983 | 1.4419 | -252.0898 | -263.3792 | -252.9185 | -259.6586 | 0.8286 | -3.7205 | 2493.6099 | 2304.2241 |
4514.8016 | 7.9710 | 2200 | 4624.1011 | 0.0051 | -0.0370 | 0.5789 | 0.0421 | -263.3607 | -252.4096 | 1.4404 | 1.3959 | 1.3959 | 1.4404 | -252.4096 | -263.3607 | -252.9185 | -259.6586 | 0.5089 | -3.7021 | 2526.5620 | 2309.3242 |
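As a consistency check on the table (derived from the logged values, not stated in the card), step 100 falls at epoch 0.3623, implying roughly 276 optimizer steps per epoch:

```python
# Derived consistency check on the epoch/step columns above.
steps_logged, epoch_at_log = 100, 0.3623
steps_per_epoch = steps_logged / epoch_at_log            # ~276 optimizer steps per epoch
total_train_batch_size = 64                              # from the hyperparameters above
approx_pairs = round(steps_per_epoch) * total_train_batch_size  # ~17,664 preference pairs
approx_total_steps = round(steps_per_epoch * 8.0)        # ~2208; the last logged step is 2200
print(steps_per_epoch, approx_pairs, approx_total_steps)
```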
Framework versions
- Transformers 4.42.0
- Pytorch 2.3.0+cu121
- Datasets 2.14.6
- Tokenizers 0.19.1
Model tree for yiran-wang3/ds_chat_sppo_hard_iter0_nomask_linear_schedule
- Base model: deepseek-ai/deepseek-llm-7b-chat