Mistral-Nemo-NT-Ko-12B-dpo / README.md

werty1248

Update README.md

89a67dc verified 6 days ago

preview code

raw

history blame

No virus

6.2 kB

	---
	base_model:
	- werty1248/Mistral-Nemo-NT-Ko-12B-sft
	datasets:
	- zake7749/kyara-chinese-preference-rl-dpo-s0-30K
	- sionic/ko-dpo-mix-7k-trl-style
	- kuotient/orca-math-korean-dpo-pairs
	- HuggingFaceH4/ultrafeedback_binarized
	language:
	- en
	- ko
	- ja
	- zh
	license: apache-2.0
	---
	# Mistral-Nemo-NT-Ko-12B-dpo

	## Description

	Mistral-Nemo-NT-Ko-12B-dpo is a shallowly DPO-trained version of [werty1248/Mistral-Nemo-NT-Ko-12B-sft](https://huggingface.co/werty1248/Mistral-Nemo-NT-Ko-12B-sft).

	According to the [Hermes 3 Tech Report](https://nousresearch.com/wp-content/uploads/2024/08/Hermes-3-Technical-Report.pdf), DPO made negligible performance improvements in their model. Therefore, I followed the same approach described in the report and applied DPO using LoRA.
	- LoRA r = 32
	- Lora alpha = 16
	- lr = 3e-6
	- neftune alpha = 5

	The datasets used are as follows:

	- (En) [HuggingFaceH4/ultrafeedback_binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized)
	- (Ko, translated from En) [sionic/ko-dpo-mix-7k-translation-exclude](https://huggingface.co/datasets/sionic/ko-dpo-mix-7k-translation-exclude)
	- (Ko, translated from En) [kuotient/orca-math-korean-dpo-pairs](https://huggingface.co/datasets/kuotient/orca-math-korean-dpo-pairs)
	- (Zh) [zake7749/kyara-chinese-preference-rl-dpo-s0-30K](https://huggingface.co/datasets/zake7749/kyara-chinese-preference-rl-dpo-s0-30K)

	I've been looking for native Korean/Japanese DPO datasets, but haven't found anything that I'm personally satisfied with(Quantity/Quality).

	From each dataset, I sampled a subset based on the score given by the reward model. In the end, I used about 13K samples for training for each language.

	## Features

	- The base model supports a context length of 128K, while I fine-tuned this model with an 8K context size.

	- This model works well for multi-turn conversations, and tends to strongly reflect the previous conversation.

	# Evaluation

	### LogicKor

	Cot-1-shot
	\| 모델 \| 방법 \| 추론 \| 수학 \| 글쓰기 \| 코딩 \| 이해 \| 문법 \| 싱글턴 \| 멀티턴 \| 총점 \|
	\| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \|
	\|Mistral-Nemo-NT-Ko-12B-sft\| cot-1-shot \|7.36 \| 6.57 \| 8.71 \| 8.57 \| 9.57 \| 6.43 \| 7.81 \| 7.93 \| 7.87 \|
	\|Mistral-Nemo-NT-Ko-12B-dpo\| cot-1-shot \| 6.79 \| 6.43 \| 9.43 \| 9.79 \| 9.43 \| 5.29 \| 7.71 \| 8.00 \| 7.86 \|
	\| Mistral Nemo \| cot-1-shot \| 5.43 \| 6.86 \| 6.07 \| 7.57 \| 5.86 \| 7.57 \| 7.50 \| 5.62 \|6.56\|

	1-shot
	\| 모델 \| 방법 \| 추론 \| 수학 \| 글쓰기 \| 코딩 \| 이해 \| 문법 \| 싱글턴 \| 멀티턴 \| 총점 \|
	\| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \|
	\|Mistral-Nemo-NT-Ko-12B-dpo\| 1-shot \| 8.14 \| 5.50 \| 9.36 \| 8.57 \| 9.50 \| 4.71 \| 7.38 \| 7.88 \| 7.63 \|
	\|Mistral-Nemo-NT-Ko-12B-sft\| 1-shot \| 9.00 \| 5.71 \| 7.93 \| 8.29 \| 7.93 \| 5.21 \| 7.29 \| 7.40 \| 7.35 \|
	\| Mistral Nemo \| 1-shot \| 5.00 \| 6.50 \| 6.86 \| 8.07 \| 7.64 \| 8.43 \| 7.60 \| 6.57 \|7.08\|

	Default
	\| 모델 \| 방법 \| 추론 \| 수학 \| 글쓰기 \| 코딩 \| 이해 \| 문법 \| 싱글턴 \| 멀티턴 \| 총점 \|
	\| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \|
	\|Mistral-Nemo-NT-Ko-12B-dpo\| default \| 6.21 \| 5.79 \| 8.00 \| 8.36 \| 9.43 \| 5.43 \| 7.17 \| 7.24 \| 7.20 \|
	\|Mistral-Nemo-NT-Ko-12B-sft\| default \| 6.00 \| 4.93 \| 5.43 \| 7.14 \| 9.71 \| 4.00 \| 6.45 \| 5.95 \| 6.20 \|
	\| Mistral Nemo \| default \| 0.43 \| 7.64 \| 6.21 \| 7.14 \| 6.79 \| 7.21 \| 6.26 \| 5.55 \|5.90\|

	### Language-Confusion

	\| Model \| Language \| Monolingual-LPR \| Monolingual-WPR \| Crosslingual-LPR \| Crosslingual-WPR \|
	\| --- \| --- \| --- \| --- \| --- \| --- \|
	\|Mistral-Nemo-NT-Ko-12B-dpo\| ko \| 100.00% \| 97.96% \| 85.63% \| 96.93% \|
	\|Mistral-Nemo-NT-Ko-12B-sft\| ko \| 100.00% \| 99.00% \| 87.51% \| 96.96% \|
	\|Mistral-Nemo-Instruct-2407 \| ko \| 90.72% \| 93.18% \| 46.75% \| 92.84% \|
	\|Meta-Llama-3.1-8B-Instruct \| ko \| 99.00% \| 96.97% \| 91.45% \| 93.01% \|
	\|gemma-2-9b-it \| ko \| 100.00% \| 98.00% \| 87.93% \| 95.58% \|
	\| --- \| --- \| --- \| --- \| --- \| --- \|
	\|Mistral-Nemo-NT-Ko-12B-dpo\| zh \| 99.00% \| 99.50% \| 80.52% \| 97.51% \|
	\|Mistral-Nemo-Instruct-2407 \| zh \| 97.50% \| 98.98% \| 53.43% \| 93.58% \|
	\| --- \| --- \| --- \| --- \| --- \| --- \|
	\|Mistral-Nemo-NT-Ko-12B-dpo\| ja \| 100.00% \| 100.00% \| 86.89% \| 95.41% \|
	\|Mistral-Nemo-Instruct-2407 \| ja \| 94.00% \| 98.94% \| 50.27% \| 96.05% \|

	## Template

	```
	<\|im_start\|>system
	You are a helpful AI assistant.<\|im_end\|>
	<\|im_start\|>user
	{prompt}<\|im_end\|>
	<\|im_start\|>assistant
	```

	I trained Mistral-Nemo-NT-Ko-12B with various system prompt from dozens of dataset. You can chat with/without your system prompt.

	# Dataset

	- zake7749/kyara-chinese-preference-rl-dpo-s0-30K
	- sionic/ko-dpo-mix-7k-trl-style
	- kuotient/orca-math-korean-dpo-pairs
	- HuggingFaceH4/ultrafeedback_binarized

	# Training Details

	- GPU: 2xA100
	- epoch: 1
	- total batch size: 32
	- learning rate: 3e-6
	- neftune_noise_alpha: 5



	<details><summary>See axolotl config</summary>

	axolotl version: `0.4.1`
	```yaml
	base_model: werty1248/Mistral-Nemo-NT-Ko-12B-sft
	model_type: MistralForCausalLM
	tokenizer_type: AutoTokenizer

	load_in_8bit: false
	load_in_4bit: false
	strict: false

	adapter: lora
	lora_model_dir:
	lora_r: 32
	lora_alpha: 16
	lora_dropout: 0.05
	lora_target_linear: true
	lora_fan_in_fan_out:

	dpo_beta: 0.1
	rl: dpo

	datasets:
	- path: werty1248/NT-dpo
	split: train
	type: chatml.prompt_pairs

	dataset_prepared_path: /workspace/data/prepared_datasets
	output_dir: /workspace/data
	save_steps: 500

	sequence_len: 8192
	sample_packing: false
	pad_to_sequence_len: true
	gradient_accumulation_steps: 16
	micro_batch_size: 1
	num_epochs: 1
	optimizer: rmsprop
	weight_decay: 0.0
	learning_rate: 0.000003
	lr_scheduler: linear
	neftune_noise_alpha: 5

	train_on_inputs: false
	group_by_length: false

	#wandb_project:
	#wandb_entity:
	#wandb_watch:
	#wandb_name:
	#wandb_log_model:

	bf16: true
	fp16: false
	tf32: false

	gradient_checkpointing: true
	flash_attention: true
	warmup_steps: 9

	eval_steps:
	val_set_size: 0
	early_stopping_patience:
	logging_steps: 1

	special_tokens:
	pad_token: <pad>
	```

	</details><br>


	- reward margin

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/6629154d55d7c289634b8c5d/5m2K7azV5ZhGGZqWJZNWX.png)