metadata
license: other
base_model: deepseek-ai/deepseek-llm-7b-chat
tags:
  - alignment-handbook
  - trl
  - dpo
  - generated_from_trainer
datasets:
  - self-generate/ds_chat_original_cn_mining_oj_iter0-binarized
  - self-generate/ds_chat_original_cn_mining_sandbox_iter0-binarized
  - self-generate/ds_chat_original_cn_rl_oj_iter0-binarized
model-index:
  - name: ds_chat_sppo_hard_new_iter0_2024-09-15-01.40
    results: []


ds_chat_sppo_hard_new_iter0_2024-09-15-01.40

This model is a fine-tuned version of deepseek-ai/deepseek-llm-7b-chat, trained on three datasets: self-generate/ds_chat_original_cn_mining_oj_iter0-binarized, self-generate/ds_chat_original_cn_mining_sandbox_iter0-binarized, and self-generate/ds_chat_original_cn_rl_oj_iter0-binarized. At the final checkpoint (step 2200) it achieves the following results on the evaluation set:

  • Loss: 0.4619
  • Rewards/chosen: 0.0067
  • Rewards/rejected: -0.0352
  • Rewards/accuracies: 0.5921
  • Rewards/margins: 0.0419
  • Logps/rejected: -263.1805
  • Logps/chosen: -252.2534
  • Logits/rejected: 1.4436
  • Logits/chosen: 1.3993
  • Debug/policy Chosen Logits: 1.3993
  • Debug/policy Rejected Logits: 1.4436
  • Debug/policy Chosen Logps: -252.2534
  • Debug/policy Rejected Logps: -263.1805
  • Debug/reference Chosen Logps: -252.9185
  • Debug/reference Rejected Logps: -259.6586
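
The reward figures above can be cross-checked against the log-probabilities. The sketch below assumes the usual TRL DPO convention, in which the implicit reward is `beta * (policy_logp - reference_logp)`. The value `beta = 0.01` is not stated in this card; it is inferred here because it reproduces the reported rewards to four decimals. Because the formula is linear, it can be applied directly to the averaged values listed above.

```python
# Assumption: beta is not given in the card; 0.01 is inferred from the numbers.
BETA = 0.01

# Averaged evaluation values copied from the list above.
policy_chosen_logps = -252.2534
policy_rejected_logps = -263.1805
reference_chosen_logps = -252.9185
reference_rejected_logps = -259.6586


def implicit_reward(policy_logp: float, reference_logp: float, beta: float = BETA) -> float:
    """Implicit reward: scaled log-ratio of policy to reference."""
    return beta * (policy_logp - reference_logp)


reward_chosen = implicit_reward(policy_chosen_logps, reference_chosen_logps)
reward_rejected = implicit_reward(policy_rejected_logps, reference_rejected_logps)
margin = reward_chosen - reward_rejected

print(f"Rewards/chosen:   {reward_chosen:.4f}")    # 0.0067
print(f"Rewards/rejected: {reward_rejected:.4f}")  # -0.0352
print(f"Rewards/margins:  {margin:.4f}")           # 0.0419
```

Under this reading, the policy has moved only slightly from the reference: the chosen log-probability rose by about 0.67 nats and the rejected one fell by about 3.5 nats on average.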

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-07
  • train_batch_size: 8
  • eval_batch_size: 4
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 8
  • total_train_batch_size: 64
  • total_eval_batch_size: 32
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.1
  • lr_scheduler_warmup_steps: 100
  • num_epochs: 8.0
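
The batch-size and schedule settings above can be sanity-checked with a little arithmetic. The following is a minimal sketch, not the training script itself. It assumes a gradient accumulation factor of 1 (implied by 8 × 8 = 64), about 276 optimizer steps per epoch (inferred from step 100 falling at epoch 0.3623 in the training results), and that the explicit `lr_scheduler_warmup_steps: 100` takes precedence over the warmup ratio, as a positive `warmup_steps` does in the Hugging Face `Trainer`. The function `linear_lr` is a hypothetical helper that mirrors a linear warmup followed by linear decay.

```python
# Values copied from the hyperparameter list.
train_batch_size = 8      # per device
num_devices = 8
learning_rate = 1e-7
warmup_steps = 100
num_epochs = 8

# Assumption: gradient accumulation of 1, since 8 * 8 already equals 64.
gradient_accumulation_steps = 1
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps

# Assumption: steps per epoch inferred from "step 100 == epoch 0.3623".
steps_per_epoch = round(100 / 0.3623)
total_steps = steps_per_epoch * num_epochs


def linear_lr(step: int, base_lr: float = learning_rate,
              warmup: int = warmup_steps, total: int = total_steps) -> float:
    """Linear warmup to base_lr, then linear decay to zero at `total`."""
    if step < warmup:
        return base_lr * step / max(1, warmup)
    return base_lr * max(0, total - step) / max(1, total - warmup)


print(total_train_batch_size)  # 64
print(steps_per_epoch)         # 276
print(total_steps)             # 2208
for step in (0, 50, 100, 1154, 2208):
    print(step, linear_lr(step))
```

With these assumptions the training set holds roughly 276 × 64 ≈ 17.6k preference pairs, and the last logged step (2200) sits just short of the estimated 2208 total steps.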

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen | Debug/policy Chosen Logits | Debug/policy Rejected Logits | Debug/policy Chosen Logps | Debug/policy Rejected Logps | Debug/reference Chosen Logps | Debug/reference Rejected Logps |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.4973 | 0.3623 | 100 | 0.4977 | -0.0056 | -0.0071 | 0.5132 | 0.0014 | -260.3654 | -253.4812 | 1.6987 | 1.6372 | 1.6372 | 1.6987 | -253.4812 | -260.3654 | -252.9185 | -259.6586 |
| 0.4917 | 0.7246 | 200 | 0.4919 | -0.0069 | -0.0126 | 0.5395 | 0.0058 | -260.9230 | -253.6065 | 1.6704 | 1.6087 | 1.6087 | 1.6704 | -253.6065 | -260.9230 | -252.9185 | -259.6586 |
| 0.4837 | 1.0870 | 300 | 0.4862 | -0.0085 | -0.0167 | 0.5789 | 0.0082 | -261.3287 | -253.7711 | 1.6490 | 1.5905 | 1.5905 | 1.6490 | -253.7711 | -261.3287 | -252.9185 | -259.6586 |
| 0.4821 | 1.4493 | 400 | 0.4822 | -0.0046 | -0.0173 | 0.5132 | 0.0127 | -261.3844 | -253.3754 | 1.6131 | 1.5560 | 1.5560 | 1.6131 | -253.3754 | -261.3844 | -252.9185 | -259.6586 |
| 0.4724 | 1.8116 | 500 | 0.4773 | -0.0010 | -0.0181 | 0.4737 | 0.0171 | -261.4722 | -253.0200 | 1.5870 | 1.5328 | 1.5328 | 1.5870 | -253.0200 | -261.4722 | -252.9185 | -259.6586 |
| 0.4677 | 2.1739 | 600 | 0.4750 | -0.0007 | -0.0218 | 0.5132 | 0.0212 | -261.8435 | -252.9872 | 1.5701 | 1.5167 | 1.5167 | 1.5701 | -252.9872 | -261.8435 | -252.9185 | -259.6586 |
| 0.4625 | 2.5362 | 700 | 0.5077 | 0.0917 | 0.0741 | 0.6447 | 0.0176 | -252.2495 | -243.7507 | 1.5700 | 1.5133 | 1.5133 | 1.5700 | -243.7507 | -252.2495 | -252.9185 | -259.6586 |
| 0.465 | 2.8986 | 800 | 0.4709 | -0.0024 | -0.0313 | 0.5658 | 0.0289 | -262.7887 | -253.1583 | 1.5298 | 1.4781 | 1.4781 | 1.5298 | -253.1583 | -262.7887 | -252.9185 | -259.6586 |
| 0.4551 | 3.2609 | 900 | 0.4689 | -0.0039 | -0.0344 | 0.5658 | 0.0304 | -263.0977 | -253.3132 | 1.5177 | 1.4670 | 1.4670 | 1.5177 | -253.3132 | -263.0977 | -252.9185 | -259.6586 |
| 0.4614 | 3.6232 | 1000 | 0.4687 | -0.0108 | -0.0450 | 0.5789 | 0.0342 | -264.1606 | -253.9997 | 1.5075 | 1.4592 | 1.4592 | 1.5075 | -253.9997 | -264.1606 | -252.9185 | -259.6586 |
| 0.4579 | 3.9855 | 1100 | 0.4668 | 0.0012 | -0.0346 | 0.5789 | 0.0358 | -263.1156 | -252.7994 | 1.5016 | 1.4527 | 1.4527 | 1.5016 | -252.7994 | -263.1156 | -252.9185 | -259.6586 |
| 0.4466 | 4.3478 | 1200 | 0.4663 | 0.0006 | -0.0344 | 0.5526 | 0.0349 | -263.0953 | -252.8606 | 1.4940 | 1.4448 | 1.4448 | 1.4940 | -252.8606 | -263.0953 | -252.9185 | -259.6586 |
| 0.4696 | 4.7101 | 1300 | 0.4644 | 0.0027 | -0.0346 | 0.5921 | 0.0373 | -263.1194 | -252.6523 | 1.4687 | 1.4226 | 1.4226 | 1.4687 | -252.6523 | -263.1194 | -252.9185 | -259.6586 |
| 0.4571 | 5.0725 | 1400 | 0.4643 | -0.0002 | -0.0394 | 0.5789 | 0.0392 | -263.5992 | -252.9413 | 1.4644 | 1.4177 | 1.4177 | 1.4644 | -252.9413 | -263.5992 | -252.9185 | -259.6586 |
| 0.45 | 5.4348 | 1500 | 0.4637 | 0.0047 | -0.0343 | 0.5789 | 0.0390 | -263.0912 | -252.4461 | 1.4551 | 1.4102 | 1.4102 | 1.4551 | -252.4461 | -263.0912 | -252.9185 | -259.6586 |
| 0.4561 | 5.7971 | 1600 | 0.4627 | 0.0063 | -0.0340 | 0.5921 | 0.0403 | -263.0588 | -252.2838 | 1.4579 | 1.4127 | 1.4127 | 1.4579 | -252.2838 | -263.0588 | -252.9185 | -259.6586 |
| 0.4505 | 6.1594 | 1700 | 0.4616 | 0.0094 | -0.0319 | 0.6316 | 0.0413 | -262.8479 | -251.9740 | 1.4445 | 1.4000 | 1.4000 | 1.4445 | -251.9740 | -262.8479 | -252.9185 | -259.6586 |
| 0.4563 | 6.5217 | 1800 | 0.4613 | 0.0084 | -0.0356 | 0.6053 | 0.0440 | -263.2198 | -252.0771 | 1.4420 | 1.3981 | 1.3981 | 1.4420 | -252.0771 | -263.2198 | -252.9185 | -259.6586 |
| 0.4675 | 6.8841 | 1900 | 0.4616 | 0.0069 | -0.0366 | 0.6053 | 0.0435 | -263.3192 | -252.2319 | 1.4424 | 1.3959 | 1.3959 | 1.4424 | -252.2319 | -263.3192 | -252.9185 | -259.6586 |
| 0.4502 | 7.2464 | 2000 | 0.4619 | 0.0071 | -0.0364 | 0.5789 | 0.0435 | -263.2976 | -252.2066 | 1.4432 | 1.3985 | 1.3985 | 1.4432 | -252.2066 | -263.2976 | -252.9185 | -259.6586 |
| 0.4473 | 7.6087 | 2100 | 0.4623 | 0.0028 | -0.0403 | 0.5921 | 0.0431 | -263.6902 | -252.6375 | 1.4423 | 1.3964 | 1.3964 | 1.4423 | -252.6375 | -263.6902 | -252.9185 | -259.6586 |
| 0.4508 | 7.9710 | 2200 | 0.4619 | 0.0067 | -0.0352 | 0.5921 | 0.0419 | -263.1805 | -252.2534 | 1.4436 | 1.3993 | 1.3993 | 1.4436 | -252.2534 | -263.1805 | -252.9185 | -259.6586 |
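
The evaluation log above can be scanned for the best checkpoint and for irregular evaluations. This is a small sketch over three columns copied from the training results (step, validation loss, reward margin). The 0.01 jump threshold is an arbitrary choice for illustration, not something taken from the training setup.

```python
# (step, validation_loss, rewards_margin) copied from the training results.
eval_log = [
    (100, 0.4977, 0.0014), (200, 0.4919, 0.0058), (300, 0.4862, 0.0082),
    (400, 0.4822, 0.0127), (500, 0.4773, 0.0171), (600, 0.4750, 0.0212),
    (700, 0.5077, 0.0176), (800, 0.4709, 0.0289), (900, 0.4689, 0.0304),
    (1000, 0.4687, 0.0342), (1100, 0.4668, 0.0358), (1200, 0.4663, 0.0349),
    (1300, 0.4644, 0.0373), (1400, 0.4643, 0.0392), (1500, 0.4637, 0.0390),
    (1600, 0.4627, 0.0403), (1700, 0.4616, 0.0413), (1800, 0.4613, 0.0440),
    (1900, 0.4616, 0.0435), (2000, 0.4619, 0.0435), (2100, 0.4623, 0.0431),
    (2200, 0.4619, 0.0419),
]

# Checkpoint with the lowest validation loss.
best_step, best_loss, _ = min(eval_log, key=lambda row: row[1])

# Checkpoint with the largest chosen-vs-rejected reward margin.
best_margin_step, _, best_margin = max(eval_log, key=lambda row: row[2])

# Evaluations whose loss rose by more than the (illustrative) threshold
# relative to the previous evaluation.
JUMP_THRESHOLD = 0.01
loss_jumps = [
    curr[0]
    for prev, curr in zip(eval_log, eval_log[1:])
    if curr[1] - prev[1] > JUMP_THRESHOLD
]

print(best_step, best_loss)           # 1800 0.4613
print(best_margin_step, best_margin)  # 1800 0.044
print(loss_jumps)                     # [700]
```

Step 1800 has both the lowest validation loss and the largest margin, slightly ahead of the final step 2200. Step 700 stands out as the only evaluation where the loss rises sharply, and it coincides with the policy log-probabilities in that row shifting by roughly 9 to 10 nats before returning to trend at step 800.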

Framework versions

  • Transformers 4.42.0
  • Pytorch 2.3.0+cu121
  • Datasets 2.14.6
  • Tokenizers 0.19.1