---
base_model:
  - werty1248/Mistral-Nemo-NT-Ko-12B-sft
datasets:
  - zake7749/kyara-chinese-preference-rl-dpo-s0-30K
  - sionic/ko-dpo-mix-7k-trl-style
  - kuotient/orca-math-korean-dpo-pairs
  - HuggingFaceH4/ultrafeedback_binarized
language:
  - en
  - ko
  - ja
  - zh
license: apache-2.0
---

# Mistral-Nemo-NT-Ko-12B-dpo

## Description

Mistral-Nemo-NT-Ko-12B-dpo is a lightly DPO-trained version of werty1248/Mistral-Nemo-NT-Ko-12B-sft.

According to the Hermes 3 technical report, DPO made only negligible performance improvements to their model. I therefore followed the same approach described in the report and applied DPO using LoRA (an illustrative training sketch follows the hyperparameter list below).

- LoRA r = 32
- LoRA alpha = 16
- learning rate = 3e-6
- NEFTune noise alpha = 5
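
For illustration only, this is roughly what the LoRA DPO setup above looks like when written with TRL and peft. The actual training was done with axolotl (full config under Training Details below); the dataset path `werty1248/NT-dpo` comes from that config, while the TRL argument names and wiring here are assumptions rather than the exact training code.

```python
# Illustrative sketch only: the real run used axolotl (see the config under Training Details).
# Hyperparameters and the dataset path mirror that config; the TRL/peft wiring is an assumption.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "werty1248/Mistral-Nemo-NT-Ko-12B-sft"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

peft_config = LoraConfig(            # lora_r / lora_alpha / lora_dropout / lora_target_linear
    r=32,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

args = DPOConfig(
    output_dir="nt-ko-dpo",
    beta=0.1,                        # dpo_beta
    learning_rate=3e-6,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    max_length=8192,
    neftune_noise_alpha=5,
    bf16=True,
    logging_steps=1,
)

# Expects prompt/chosen/rejected preference pairs.
train_dataset = load_dataset("werty1248/NT-dpo", split="train")

trainer = DPOTrainer(
    model=model,
    ref_model=None,                  # with a LoRA adapter, the frozen base model serves as the reference
    args=args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,             # newer TRL releases call this argument processing_class
    peft_config=peft_config,
)
trainer.train()
```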

The datasets used are listed in the Dataset section below.

I've been looking for native Korean/Japanese DPO datasets, but haven't found anything I'm personally satisfied with in terms of quantity and quality.

From each dataset, I sampled a subset based on the scores given by a reward model, and ended up using about 13K samples per language for training. A rough sketch of this kind of score-based filtering follows.
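
The reward model and the selection threshold are not specified in this card, so the snippet below is only a sketch of what score-based subsampling can look like: `OpenAssistant/reward-model-deberta-v3-large-v2` is a stand-in scorer, `min_margin` an assumed threshold, and the pairs are presumed to have already been flattened to prompt/chosen/rejected strings.

```python
# Hedged sketch of reward-score-based subsampling; the actual reward model and
# margin threshold are not stated in this card, so both are stand-ins here.
import torch
from datasets import Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rm_name = "OpenAssistant/reward-model-deberta-v3-large-v2"  # stand-in preference scorer
rm_tokenizer = AutoTokenizer.from_pretrained(rm_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(rm_name).eval()

@torch.no_grad()
def reward(prompt: str, response: str) -> float:
    """Scalar preference score for a (prompt, response) pair."""
    inputs = rm_tokenizer(prompt, response, return_tensors="pt", truncation=True)
    return reward_model(**inputs).logits[0].item()

def keep_pair(example: dict, min_margin: float = 1.0) -> bool:
    """Keep pairs whose chosen answer clearly out-scores the rejected one."""
    margin = reward(example["prompt"], example["chosen"]) - reward(example["prompt"], example["rejected"])
    return margin >= min_margin  # min_margin is an assumed threshold

# `pairs` stands in for the source datasets, flattened to prompt/chosen/rejected strings.
pairs = Dataset.from_dict({
    "prompt": ["What is 2 + 2?"],
    "chosen": ["2 + 2 = 4."],
    "rejected": ["2 + 2 = 5."],
})
filtered = pairs.filter(keep_pair)
```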

## Features

- The base model supports a context length of 128K, while this model was fine-tuned with an 8K context size.
- This model works well in multi-turn conversations and tends to strongly reflect the earlier turns of the conversation.

## Evaluation

### LogicKor

#### Cot-1-shot

| Model | Method | Reasoning | Math | Writing | Coding | Understanding | Grammar | Single-turn | Multi-turn | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| Mistral-Nemo-NT-Ko-12B-sft | cot-1-shot | 7.36 | 6.57 | 8.71 | 8.57 | 9.57 | 6.43 | 7.81 | 7.93 | 7.87 |
| Mistral-Nemo-NT-Ko-12B-dpo | cot-1-shot | 6.79 | 6.43 | 9.43 | 9.79 | 9.43 | 5.29 | 7.71 | 8.00 | 7.86 |
| Mistral Nemo | cot-1-shot | 5.43 | 6.86 | 6.07 | 7.57 | 5.86 | 7.57 | 7.50 | 5.62 | 6.56 |

#### 1-shot

| Model | Method | Reasoning | Math | Writing | Coding | Understanding | Grammar | Single-turn | Multi-turn | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| Mistral-Nemo-NT-Ko-12B-dpo | 1-shot | 8.14 | 5.50 | 9.36 | 8.57 | 9.50 | 4.71 | 7.38 | 7.88 | 7.63 |
| Mistral-Nemo-NT-Ko-12B-sft | 1-shot | 9.00 | 5.71 | 7.93 | 8.29 | 7.93 | 5.21 | 7.29 | 7.40 | 7.35 |
| Mistral Nemo | 1-shot | 5.00 | 6.50 | 6.86 | 8.07 | 7.64 | 8.43 | 7.60 | 6.57 | 7.08 |

#### Default

| Model | Method | Reasoning | Math | Writing | Coding | Understanding | Grammar | Single-turn | Multi-turn | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| Mistral-Nemo-NT-Ko-12B-dpo | default | 6.21 | 5.79 | 8.00 | 8.36 | 9.43 | 5.43 | 7.17 | 7.24 | 7.20 |
| Mistral-Nemo-NT-Ko-12B-sft | default | 6.00 | 4.93 | 5.43 | 7.14 | 9.71 | 4.00 | 6.45 | 5.95 | 6.20 |
| Mistral Nemo | default | 0.43 | 7.64 | 6.21 | 7.14 | 6.79 | 7.21 | 6.26 | 5.55 | 5.90 |

### Language-Confusion

| Model | Language | Monolingual-LPR | Monolingual-WPR | Crosslingual-LPR | Crosslingual-WPR |
|---|---|---|---|---|---|
| Mistral-Nemo-NT-Ko-12B-dpo | ko | 100.00% | 97.96% | 85.63% | 96.93% |
| Mistral-Nemo-NT-Ko-12B-sft | ko | 100.00% | 99.00% | 87.51% | 96.96% |
| Mistral-Nemo-Instruct-2407 | ko | 90.72% | 93.18% | 46.75% | 92.84% |
| Meta-Llama-3.1-8B-Instruct | ko | 99.00% | 96.97% | 91.45% | 93.01% |
| gemma-2-9b-it | ko | 100.00% | 98.00% | 87.93% | 95.58% |
| Mistral-Nemo-NT-Ko-12B-dpo | zh | 99.00% | 99.50% | 80.52% | 97.51% |
| Mistral-Nemo-Instruct-2407 | zh | 97.50% | 98.98% | 53.43% | 93.58% |
| Mistral-Nemo-NT-Ko-12B-dpo | ja | 100.00% | 100.00% | 86.89% | 95.41% |
| Mistral-Nemo-Instruct-2407 | ja | 94.00% | 98.94% | 50.27% | 96.05% |

## Template

```
<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```

I trained Mistral-Nemo-NT-Ko-12B with a variety of system prompts drawn from dozens of datasets, so you can chat with or without your own system prompt. A minimal inference sketch follows.
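
The snippet below assumes the repo id matches this card and that the tokenizer ships the ChatML chat template shown above; otherwise, format the prompt manually. Generation settings are illustrative.

```python
# Minimal chat example; assumes the tokenizer's chat_template matches the ChatML format above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "werty1248/Mistral-Nemo-NT-Ko-12B-dpo"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "안녕하세요! 간단히 자기소개 해주세요."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```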

## Dataset

- zake7749/kyara-chinese-preference-rl-dpo-s0-30K
- sionic/ko-dpo-mix-7k-trl-style
- kuotient/orca-math-korean-dpo-pairs
- HuggingFaceH4/ultrafeedback_binarized

## Training Details

- GPU: 2× A100
- epochs: 1
- total batch size: 32 (2 GPUs × micro-batch size 1 × 16 gradient-accumulation steps)
- learning rate: 3e-6
- neftune_noise_alpha: 5
See axolotl config

axolotl version: 0.4.1

```yaml
base_model: werty1248/Mistral-Nemo-NT-Ko-12B-sft
model_type: MistralForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

adapter: lora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:

dpo_beta: 0.1
rl: dpo

datasets:
  - path: werty1248/NT-dpo
    split: train
    type: chatml.prompt_pairs

dataset_prepared_path: /workspace/data/prepared_datasets
output_dir: /workspace/data
save_steps: 500

sequence_len: 8192
sample_packing: false
pad_to_sequence_len: true
gradient_accumulation_steps: 16
micro_batch_size: 1
num_epochs: 1
optimizer: rmsprop
weight_decay: 0.0
learning_rate: 0.000003
lr_scheduler: linear
neftune_noise_alpha: 5

train_on_inputs: false
group_by_length: false

#wandb_project:
#wandb_entity:
#wandb_watch:
#wandb_name:
#wandb_log_model:

bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
flash_attention: true
warmup_steps: 9

eval_steps:
val_set_size: 0
early_stopping_patience:
logging_steps: 1

special_tokens:
  pad_token: <pad>
```

- reward margin

  *(Figure: reward margin over training steps.)*