---
base_model:
- werty1248/Mistral-Nemo-NT-Ko-12B-sft
datasets:
- zake7749/kyara-chinese-preference-rl-dpo-s0-30K
- sionic/ko-dpo-mix-7k-trl-style
- kuotient/orca-math-korean-dpo-pairs
- HuggingFaceH4/ultrafeedback_binarized
language:
- en
- ko
- ja
- zh
license: apache-2.0
---

# Mistral-Nemo-NT-Ko-12B-dpo

## Description

**Mistral-Nemo-NT-Ko-12B-dpo** is a shallowly DPO-trained version of [*werty1248/Mistral-Nemo-NT-Ko-12B-sft*](https://huggingface.co/werty1248/Mistral-Nemo-NT-Ko-12B-sft).

According to the [Hermes 3 Tech Report](https://nousresearch.com/wp-content/uploads/2024/08/Hermes-3-Technical-Report.pdf), DPO yielded only negligible performance improvements for their model. I therefore followed the approach described in the report and applied DPO with LoRA.

- LoRA r = 32
- LoRA alpha = 16
- lr = 3e-6
- neftune alpha = 5

The datasets used are as follows:

- (En) [HuggingFaceH4/ultrafeedback_binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized)
- (Ko, translated from En) [sionic/ko-dpo-mix-7k-translation-exclude](https://huggingface.co/datasets/sionic/ko-dpo-mix-7k-translation-exclude)
- (Ko, translated from En) [kuotient/orca-math-korean-dpo-pairs](https://huggingface.co/datasets/kuotient/orca-math-korean-dpo-pairs)
- (Zh) [zake7749/kyara-chinese-preference-rl-dpo-s0-30K](https://huggingface.co/datasets/zake7749/kyara-chinese-preference-rl-dpo-s0-30K)

I have been looking for native Korean and Japanese DPO datasets, but have not found any that satisfy me in terms of quantity and quality.

From each dataset, I sampled a subset based on the score given by the reward model. In the end, I used about 13K training samples per language.

## Features

- The base model supports a context length of 128K, while I fine-tuned this model with an 8K context size.
- This model works well for **multi-turn conversations** and tends to strongly reflect the previous conversation.
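Since the model is aimed at multi-turn conversations, it may help to see how a conversation history could be rendered in the ChatML layout given in the Template section. The following is a minimal sketch, not the author's code: the helper name `build_chatml_prompt`, the sample conversation, and the exact newline placement (the usual ChatML convention) are assumptions.

```python
from typing import Dict, List

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def build_chatml_prompt(messages: List[Dict[str, str]]) -> str:
    """Render {"role", "content"} messages as a ChatML prompt string.

    Hypothetical helper: the prompt ends with an open assistant turn
    so that the model generates the next reply.
    """
    parts = [f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages]
    parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


# Illustrative multi-turn history (made-up content).
conversation = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Recommend a Korean dish."},
    {"role": "assistant", "content": "Try bibimbap."},
    {"role": "user", "content": "How do I make it?"},
]
prompt = build_chatml_prompt(conversation)
print(prompt)
```

In practice, the tokenizer's own chat template (if the repository ships one) is the safer choice; the sketch only makes the layout explicit.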
# Evaluation

### LogicKor

*Cot-1-shot*

| Model | Method | Reasoning | Math | Writing | Coding | Understanding | Grammar | Single-turn | Multi-turn | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mistral-Nemo-NT-Ko-12B-sft | cot-1-shot | 7.36 | 6.57 | 8.71 | 8.57 | 9.57 | 6.43 | 7.81 | 7.93 | **7.87** |
| **Mistral-Nemo-NT-Ko-12B-dpo** | cot-1-shot | 6.79 | 6.43 | 9.43 | 9.79 | 9.43 | 5.29 | 7.71 | 8.00 | **7.86** |
| Mistral Nemo | cot-1-shot | 5.43 | 6.86 | 6.07 | 7.57 | 5.86 | 7.57 | 7.50 | 5.62 | 6.56 |

*1-shot*

| Model | Method | Reasoning | Math | Writing | Coding | Understanding | Grammar | Single-turn | Multi-turn | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Mistral-Nemo-NT-Ko-12B-dpo** | 1-shot | 8.14 | 5.50 | 9.36 | 8.57 | 9.50 | 4.71 | 7.38 | 7.88 | **7.63** |
| Mistral-Nemo-NT-Ko-12B-sft | 1-shot | 9.00 | 5.71 | 7.93 | 8.29 | 7.93 | 5.21 | 7.29 | 7.40 | 7.35 |
| Mistral Nemo | 1-shot | 5.00 | 6.50 | 6.86 | 8.07 | 7.64 | 8.43 | 7.60 | 6.57 | 7.08 |

*Default*

| Model | Method | Reasoning | Math | Writing | Coding | Understanding | Grammar | Single-turn | Multi-turn | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Mistral-Nemo-NT-Ko-12B-dpo** | default | 6.21 | 5.79 | 8.00 | 8.36 | 9.43 | 5.43 | 7.17 | 7.24 | **7.20** |
| Mistral-Nemo-NT-Ko-12B-sft | default | 6.00 | 4.93 | 5.43 | 7.14 | 9.71 | 4.00 | 6.45 | 5.95 | 6.20 |
| Mistral Nemo | default | 0.43 | 7.64 | 6.21 | 7.14 | 6.79 | 7.21 | 6.26 | 5.55 | 5.90 |

### Language-Confusion

| Model | Language | Monolingual-LPR | Monolingual-WPR | Crosslingual-LPR | Crosslingual-WPR |
| --- | --- | --- | --- | --- | --- |
| Mistral-Nemo-NT-Ko-12B-dpo | ko | 100.00% | 97.96% | **85.63%** | 96.93% |
| Mistral-Nemo-NT-Ko-12B-sft | ko | 100.00% | 99.00% | **87.51%** | 96.96% |
| Mistral-Nemo-Instruct-2407 | ko | 90.72% | 93.18% | 46.75% | 92.84% |
| Meta-Llama-3.1-8B-Instruct | ko | 99.00% | 96.97% | 91.45% | 93.01% |
| gemma-2-9b-it | ko | 100.00% | 98.00% | 87.93% | 95.58% |
| --- | --- | --- | --- | --- | --- |
| Mistral-Nemo-NT-Ko-12B-dpo | zh | 99.00% | 99.50% | **80.52%** | 97.51% |
| Mistral-Nemo-Instruct-2407 | zh | 97.50% | 98.98% | 53.43% | 93.58% |
| --- | --- | --- | --- | --- | --- |
| Mistral-Nemo-NT-Ko-12B-dpo | ja | 100.00% | 100.00% | **86.89%** | 95.41% |
| Mistral-Nemo-Instruct-2407 | ja | 94.00% | 98.94% | 50.27% | 96.05% |

## Template

```
<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```

*I trained Mistral-Nemo-NT-Ko-12B with various system prompts from dozens of datasets. You can chat with or without your own system prompt.*

# Dataset

- zake7749/kyara-chinese-preference-rl-dpo-s0-30K
- sionic/ko-dpo-mix-7k-trl-style
- kuotient/orca-math-korean-dpo-pairs
- HuggingFaceH4/ultrafeedback_binarized

# Training Details

- GPU: 2xA100
- epoch: 1
- total batch size: 32
- learning rate: 3e-6
- neftune_noise_alpha: 5
See axolotl config

axolotl version: `0.4.1`

```yaml
base_model: werty1248/Mistral-Nemo-NT-Ko-12B-sft
model_type: MistralForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

adapter: lora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:

dpo_beta: 0.1
rl: dpo
datasets:
  - path: werty1248/NT-dpo
    split: train
    type: chatml.prompt_pairs

dataset_prepared_path: /workspace/data/prepared_datasets
output_dir: /workspace/data
save_steps: 500

sequence_len: 8192
sample_packing: false
pad_to_sequence_len: true

gradient_accumulation_steps: 16
micro_batch_size: 1
num_epochs: 1
optimizer: rmsprop
weight_decay: 0.0
learning_rate: 0.000003
lr_scheduler: linear
neftune_noise_alpha: 5

train_on_inputs: false
group_by_length: false

#wandb_project:
#wandb_entity:
#wandb_watch:
#wandb_name:
#wandb_log_model:

bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
flash_attention: true

warmup_steps: 9

eval_steps:
val_set_size: 0
early_stopping_patience:
logging_steps: 1
special_tokens:
  pad_token:
```
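As a quick consistency check, the "total batch size: 32" listed under Training Details can be reproduced from the config values. The sketch below assumes plain data parallelism across both GPUs and standard LoRA scaling (`alpha / r`, i.e. no rank-stabilized variant); neither assumption is stated in the config itself.

```python
# Values copied from the Training Details list and the axolotl config.
num_gpus = 2                      # "GPU: 2xA100"
micro_batch_size = 1              # micro_batch_size
gradient_accumulation_steps = 16  # gradient_accumulation_steps
lora_r = 32                       # lora_r
lora_alpha = 16                   # lora_alpha

# Samples contributing to one optimizer step, assuming data parallelism.
total_batch_size = num_gpus * micro_batch_size * gradient_accumulation_steps

# Standard LoRA multiplies the low-rank update by alpha / r (assumption: no rsLoRA).
lora_scale = lora_alpha / lora_r

print(total_batch_size)  # 32
print(lora_scale)        # 0.5
```

A scale of 0.5 means the adapter update is damped relative to the common `alpha = 2 * r` choice, which fits the "shallow" DPO described above.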

- reward margin ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6629154d55d7c289634b8c5d/5m2K7azV5ZhGGZqWJZNWX.png)
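
For readers unfamiliar with the plotted quantity: in DPO, the implicit reward of a response is `beta` times the difference between the policy's and the reference model's log-probability of that response, and the reward margin is the chosen reward minus the rejected reward. The sketch below illustrates this with `beta = 0.1` from the config. The function name and the log-probability values are made up for illustration, and it is an assumption that the plot shows this TRL-style margin metric.

```python
def dpo_reward_margin(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # dpo_beta in the config
) -> float:
    """Return the DPO reward margin for one preference pair (hypothetical helper)."""
    # Implicit rewards: scaled log-ratio between policy and reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # A positive margin means the policy prefers the chosen response
    # more strongly than the reference model does.
    return chosen_reward - rejected_reward


# Made-up sequence log-probabilities for a single pair.
margin = dpo_reward_margin(
    policy_chosen_logp=-120.0,
    policy_rejected_logp=-150.0,
    ref_chosen_logp=-125.0,
    ref_rejected_logp=-140.0,
)
print(margin)
```

A margin that rises during training indicates that the policy is separating chosen from rejected responses relative to the SFT reference.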