metadata
license: other
base_model: deepseek-ai/deepseek-llm-7b-chat
tags:
  - alignment-handbook
  - trl
  - dpo
  - generated_from_trainer
datasets:
  - self-generate/ds_chat_original_cn_mining_oj_iter0-binarized
  - self-generate/ds_chat_original_cn_mining_sandbox_iter0-binarized
  - self-generate/ds_chat_original_cn_rl_oj_iter0-binarized
model-index:
  - name: ds_chat_sppo_hard_iter0_2024-09-14-21.15
    results: []


ds_chat_sppo_hard_iter0_2024-09-14-21.15

This model is a fine-tuned version of deepseek-ai/deepseek-llm-7b-chat, trained on three datasets: self-generate/ds_chat_original_cn_mining_oj_iter0-binarized, self-generate/ds_chat_original_cn_mining_sandbox_iter0-binarized, and self-generate/ds_chat_original_cn_rl_oj_iter0-binarized. At the final logged checkpoint (step 2200) it achieves the following results on the evaluation set:

  • Loss: 4952.6191
  • Rewards/chosen: 0.0173
  • Rewards/rejected: 0.0003
  • Rewards/accuracies: 0.2763
  • Rewards/margins: 0.0170
  • Logps/rejected: -63.8573
  • Logps/chosen: -121.4135
  • Logits/rejected: 1.7167
  • Logits/chosen: 1.6591
  • Debug/policy Chosen Logits: 1.6591
  • Debug/policy Rejected Logits: 1.7167
  • Debug/policy Chosen Logps: -121.4135
  • Debug/policy Rejected Logps: -63.8573
  • Debug/reference Chosen Logps: -123.1481
  • Debug/reference Rejected Logps: -63.8871
  • Debug/sppo Chosen Reward In Loss: 1.7345
  • Debug/sppo Rej Reward In Loss: 0.0297
  • Debug/sppo Chosen Loss: 2393.8552
  • Debug/sppo Reject Loss: 2503.0667
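
The reward figures above are consistent with a DPO-style implicit reward, that is, a scale factor times the difference between the policy and reference log-probabilities. The following is a minimal sketch that re-derives them from the listed log-probabilities. The scale `beta = 0.01` is not reported in this card; it is an assumption inferred from the numbers, so treat the result as a consistency check rather than a statement of the training configuration.

```python
# Values copied from the evaluation results listed above.
policy_chosen_logps = -121.4135
policy_rejected_logps = -63.8573
reference_chosen_logps = -123.1481
reference_rejected_logps = -63.8871

# ASSUMPTION: beta is not stated in the card. 0.01 is inferred because it
# reproduces Rewards/chosen and Rewards/rejected after rounding.
beta = 0.01

# Log-ratio of policy to reference for each response type. These agree, up to
# rounding, with "Debug/sppo Chosen Reward In Loss" and
# "Debug/sppo Rej Reward In Loss".
chosen_log_ratio = policy_chosen_logps - reference_chosen_logps
rejected_log_ratio = policy_rejected_logps - reference_rejected_logps

reward_chosen = beta * chosen_log_ratio
reward_rejected = beta * rejected_log_ratio
reward_margin = reward_chosen - reward_rejected

print(f"chosen log-ratio:   {chosen_log_ratio:.4f}")
print(f"rejected log-ratio: {rejected_log_ratio:.4f}")
print(f"reward (chosen):    {reward_chosen:.4f}")
print(f"reward (rejected):  {reward_rejected:.4f}")
print(f"reward margin:      {reward_margin:.4f}")
```

Under this assumption the computed rewards and margin match the reported Rewards/chosen, Rewards/rejected, and Rewards/margins to within rounding of the averaged log-probabilities.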

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-07
  • train_batch_size: 8
  • eval_batch_size: 4
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 8
  • total_train_batch_size: 64
  • total_eval_batch_size: 32
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.1
  • lr_scheduler_warmup_steps: 100
  • num_epochs: 8.0
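
The batch-size and scheduler settings above can be sketched as follows. Two values here are assumptions rather than facts from the card: `total_steps = 2208` is inferred from the results table (about 276 optimizer steps per epoch over 8 epochs), and the warmup length is taken to be 100 steps because, in Transformers' `TrainingArguments`, a positive `warmup_steps` takes precedence over `warmup_ratio`. The function name `linear_lr` is hypothetical.

```python
# Hyperparameters copied from the list above.
learning_rate = 1e-7
train_batch_size = 8      # per device
num_devices = 8
warmup_steps = 100

# ASSUMPTION: inferred from the results table (100 steps ~ 0.3623 epoch,
# i.e. about 276 steps per epoch, times 8 epochs). Not stated in the card.
total_steps = 2208

# 8 * 8 = 64 matches total_train_batch_size, which implies no gradient
# accumulation (accumulation steps = 1).
effective_batch_size = train_batch_size * num_devices


def linear_lr(step, peak_lr=learning_rate, warmup=warmup_steps, total=total_steps):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    remaining = max(0, total - step)
    return peak_lr * remaining / max(1, total - warmup)


for step in (0, 50, 100, 1154, 2208):
    print(step, linear_lr(step))
```

With these assumptions the learning rate peaks at step 100 and falls to half its peak at step 1154, midway through the decay phase.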

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen | Debug/policy Chosen Logits | Debug/policy Rejected Logits | Debug/policy Chosen Logps | Debug/policy Rejected Logps | Debug/reference Chosen Logps | Debug/reference Rejected Logps | Debug/sppo Chosen Reward In Loss | Debug/sppo Rej Reward In Loss | Debug/sppo Chosen Loss | Debug/sppo Reject Loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4997.2328 | 0.3623 | 100 | 4979.2803 | 0.0044 | -0.0011 | 0.3289 | 0.0055 | -64.0002 | -122.7075 | 1.7218 | 1.6606 | 1.6606 | 1.7218 | -122.7075 | -64.0002 | -123.1481 | -63.8871 | 0.4405 | -0.1132 | 2463.0359 | 2489.7429 |
| 5010.2789 | 0.7246 | 200 | 4991.7910 | 0.0172 | 0.0053 | 0.3289 | 0.0119 | -63.3570 | -121.4287 | 1.7384 | 1.6785 | 1.6785 | 1.7384 | -121.4287 | -63.3570 | -123.1481 | -63.8871 | 1.7193 | 0.5301 | 2393.0474 | 2574.4319 |
| 4985.9242 | 1.0870 | 300 | 4983.2910 | 0.0172 | 0.0045 | 0.3026 | 0.0128 | -63.4403 | -121.4232 | 1.7425 | 1.6831 | 1.6831 | 1.7425 | -121.4232 | -63.4403 | -123.1481 | -63.8871 | 1.7249 | 0.4468 | 2390.1040 | 2560.6348 |
| 5008.7777 | 1.4493 | 400 | 4973.2788 | 0.0150 | 0.0051 | 0.3289 | 0.0099 | -63.3768 | -121.6436 | 1.7315 | 1.6724 | 1.6724 | 1.7315 | -121.6436 | -63.3768 | -123.1481 | -63.8871 | 1.5044 | 0.5103 | 2394.0208 | 2569.0662 |
| 5014.366 | 1.8116 | 500 | 4963.4956 | 0.0126 | 0.0012 | 0.2895 | 0.0114 | -63.7654 | -121.8871 | 1.7289 | 1.6691 | 1.6691 | 1.7289 | -121.8871 | -63.7654 | -123.1481 | -63.8871 | 1.2610 | 0.1216 | 2407.8684 | 2513.1951 |
| 4949.5211 | 2.1739 | 600 | 4968.5161 | 0.0164 | 0.0024 | 0.2895 | 0.0140 | -63.6428 | -121.5044 | 1.7287 | 1.6694 | 1.6694 | 1.7287 | -121.5044 | -63.6428 | -123.1481 | -63.8871 | 1.6436 | 0.2443 | 2388.6809 | 2529.8535 |
| 4995.5281 | 2.5362 | 700 | 4965.4644 | 0.0172 | 0.0029 | 0.3684 | 0.0143 | -63.5985 | -121.4247 | 1.7321 | 1.6727 | 1.6727 | 1.7321 | -121.4247 | -63.5985 | -123.1481 | -63.8871 | 1.7233 | 0.2886 | 2388.6721 | 2533.9565 |
| 4969.6547 | 2.8986 | 800 | 4971.4702 | 0.0216 | 0.0059 | 0.3684 | 0.0157 | -63.2935 | -120.9840 | 1.7477 | 1.6868 | 1.6868 | 1.7477 | -120.9840 | -63.2935 | -123.1481 | -63.8871 | 2.1640 | 0.5935 | 2372.7554 | 2588.5020 |
| 4953.4711 | 3.2609 | 900 | 4955.8784 | 0.0187 | 0.0036 | 0.3026 | 0.0152 | -63.5316 | -121.2758 | 1.7427 | 1.6827 | 1.6827 | 1.7427 | -121.2758 | -63.5316 | -123.1481 | -63.8871 | 1.8722 | 0.3555 | 2372.4011 | 2545.3831 |
| 4961.9289 | 3.6232 | 1000 | 4967.9907 | 0.0209 | 0.0059 | 0.3026 | 0.0150 | -63.3005 | -121.0624 | 1.7481 | 1.6892 | 1.6892 | 1.7481 | -121.0624 | -63.3005 | -123.1481 | -63.8871 | 2.0856 | 0.5865 | 2372.6270 | 2586.2114 |
| 4979.5078 | 3.9855 | 1100 | 4955.5312 | 0.0142 | 0.0005 | 0.3158 | 0.0138 | -63.8419 | -121.7271 | 1.7192 | 1.6605 | 1.6605 | 1.7192 | -121.7271 | -63.8419 | -123.1481 | -63.8871 | 1.4210 | 0.0452 | 2399.0156 | 2504.7903 |
| 4991.6695 | 4.3478 | 1200 | 4958.5435 | 0.0144 | 0.0012 | 0.3026 | 0.0133 | -63.7715 | -121.7064 | 1.7235 | 1.6634 | 1.6634 | 1.7235 | -121.7064 | -63.7715 | -123.1481 | -63.8871 | 1.4416 | 0.1155 | 2397.6743 | 2512.6323 |
| 4979.216 | 4.7101 | 1300 | 4964.5874 | 0.0206 | 0.0044 | 0.2763 | 0.0162 | -63.4478 | -121.0882 | 1.7125 | 1.6538 | 1.6538 | 1.7125 | -121.0882 | -63.4478 | -123.1481 | -63.8871 | 2.0598 | 0.4393 | 2382.0825 | 2559.9192 |
| 4971.2352 | 5.0725 | 1400 | 4960.7969 | 0.0177 | -0.0003 | 0.3158 | 0.0180 | -63.9134 | -121.3772 | 1.7162 | 1.6581 | 1.6581 | 1.7162 | -121.3772 | -63.9134 | -123.1481 | -63.8871 | 1.7709 | -0.0264 | 2388.1853 | 2497.3933 |
| 4934.9098 | 5.4348 | 1500 | 4958.9351 | 0.0189 | 0.0008 | 0.3026 | 0.0181 | -63.8062 | -121.2587 | 1.7177 | 1.6574 | 1.6574 | 1.7177 | -121.2587 | -63.8062 | -123.1481 | -63.8871 | 1.8893 | 0.0808 | 2388.3345 | 2508.9651 |
| 4983.5867 | 5.7971 | 1600 | 4956.6689 | 0.0176 | 0.0012 | 0.2763 | 0.0164 | -63.7669 | -121.3872 | 1.7142 | 1.6548 | 1.6548 | 1.7142 | -121.3872 | -63.7669 | -123.1481 | -63.8871 | 1.7609 | 0.1201 | 2383.7910 | 2513.2532 |
| 4934.0355 | 6.1594 | 1700 | 4958.1274 | 0.0174 | -0.0002 | 0.25 | 0.0175 | -63.9030 | -121.4107 | 1.7053 | 1.6455 | 1.6455 | 1.7053 | -121.4107 | -63.9030 | -123.1481 | -63.8871 | 1.7373 | -0.0159 | 2402.3301 | 2498.4744 |
| 4962.0086 | 6.5217 | 1800 | 4966.0581 | 0.0219 | 0.0012 | 0.3289 | 0.0207 | -63.7644 | -120.9535 | 1.7137 | 1.6560 | 1.6560 | 1.7137 | -120.9535 | -63.7644 | -123.1481 | -63.8871 | 2.1945 | 0.1226 | 2383.6753 | 2514.6057 |
| 4963.9734 | 6.8841 | 1900 | 4958.1865 | 0.0215 | 0.0013 | 0.3026 | 0.0202 | -63.7605 | -120.9998 | 1.7137 | 1.6550 | 1.6550 | 1.7137 | -120.9998 | -63.7605 | -123.1481 | -63.8871 | 2.1483 | 0.1265 | 2384.9424 | 2514.3125 |
| 4951.3387 | 7.2464 | 2000 | 4958.5044 | 0.0208 | 0.0020 | 0.3158 | 0.0189 | -63.6920 | -121.0652 | 1.7131 | 1.6545 | 1.6545 | 1.7131 | -121.0652 | -63.6920 | -123.1481 | -63.8871 | 2.0829 | 0.1950 | 2385.3457 | 2523.0876 |
| 4969.7758 | 7.6087 | 2100 | 4950.9175 | 0.0165 | -0.0004 | 0.3421 | 0.0169 | -63.9299 | -121.4973 | 1.7156 | 1.6569 | 1.6569 | 1.7156 | -121.4973 | -63.9299 | -123.1481 | -63.8871 | 1.6508 | -0.0429 | 2386.9766 | 2495.8533 |
| 4946.4094 | 7.9710 | 2200 | 4952.6191 | 0.0173 | 0.0003 | 0.2763 | 0.0170 | -63.8573 | -121.4135 | 1.7167 | 1.6591 | 1.6591 | 1.7167 | -121.4135 | -63.8573 | -123.1481 | -63.8871 | 1.7345 | 0.0297 | 2393.8552 | 2503.0667 |
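
The headline evaluation numbers correspond to the last logged step (2200), which is not necessarily the step with the lowest validation loss. As a small sketch, the (step, validation loss) pairs below are copied from the training results, and the code picks out the minimum. Whether validation loss is the right selection criterion for this model is not addressed by the card and is left open here.

```python
# (step, validation loss) pairs copied from the training results.
validation_loss_by_step = [
    (100, 4979.2803), (200, 4991.7910), (300, 4983.2910), (400, 4973.2788),
    (500, 4963.4956), (600, 4968.5161), (700, 4965.4644), (800, 4971.4702),
    (900, 4955.8784), (1000, 4967.9907), (1100, 4955.5312), (1200, 4958.5435),
    (1300, 4964.5874), (1400, 4960.7969), (1500, 4958.9351), (1600, 4956.6689),
    (1700, 4958.1274), (1800, 4966.0581), (1900, 4958.1865), (2000, 4958.5044),
    (2100, 4950.9175), (2200, 4952.6191),
]

# Step with the lowest validation loss, and the final logged step for comparison.
best_step, best_loss = min(validation_loss_by_step, key=lambda pair: pair[1])
final_step, final_loss = validation_loss_by_step[-1]

print(f"lowest validation loss: {best_loss} at step {best_step}")
print(f"final validation loss:  {final_loss} at step {final_step}")
```

On these numbers the lowest validation loss occurs at step 2100, one evaluation before the final checkpoint.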

Framework versions

  • Transformers 4.42.0
  • Pytorch 2.3.0+cu121
  • Datasets 2.14.6
  • Tokenizers 0.19.1