metadata

library_name: transformers
datasets:
  - jondurbin/truthy-dpo-v0.1

MBX-7B-v3-DPO

This model is a DPO fine-tune of flemmingmiguel/MBX-7B-v3, trained on the jondurbin/truthy-dpo-v0.1 dataset.

Code Example

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("macadeliccc/MBX-7B-v3-DPO")
model = AutoModelForCausalLM.from_pretrained("macadeliccc/MBX-7B-v3-DPO")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Can you write me a creative haiku?"}
]
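# Apply the tokenizer's chat template and append the assistant turn marker
gen_input = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

# Generate a reply and decode only the newly generated tokens
output = model.generate(gen_input, max_new_tokens=64)
print(tokenizer.decode(output[0][gen_input.shape[-1]:], skip_special_tokens=True))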

GGUF

Available here

Evaluations

EQ-Bench

----Benchmark Complete----
2024-01-30 15:22:18
Time taken: 145.9 mins
Prompt Format: ChatML
Model: macadeliccc/MBX-7B-v3-DPO
Score (v2): 74.32
Parseable: 166.0
---------------
Batch completed
Time taken: 145.9 mins
---------------
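
The EQ-Bench log above reports the prompt format as ChatML. As a rough illustration of that layout, the sketch below builds a ChatML prompt string by hand. The helper name `to_chatml` is hypothetical and not part of any library, and the source does not state whether this model's own tokenizer chat template emits exactly this text, so `tokenizer.apply_chat_template` remains the safer route in practice.

```python
# Minimal sketch of the ChatML layout (assumption: standard ChatML markers).
# `to_chatml` is a hypothetical helper written for illustration only.

def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Can you write me a creative haiku?"},
]
prompt = to_chatml(messages)
print(prompt)
```

Each turn is wrapped in `<|im_start|>` and `<|im_end|>` markers, and the trailing open assistant turn is what cues the model to respond.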

Model	AGIEval	GPT4All	TruthfulQA	Bigbench	Average
MBX-7B-v3-DPO	45.16	77.73	74.62	48.83	61.58

AGIEval

Task	Version	Metric	Value		Stderr
agieval_aqua_rat	0	acc	27.95	±	2.82
		acc_norm	26.77	±	2.78
agieval_logiqa_en	0	acc	41.01	±	1.93
		acc_norm	40.55	±	1.93
agieval_lsat_ar	0	acc	25.65	±	2.89
		acc_norm	23.91	±	2.82
agieval_lsat_lr	0	acc	50.78	±	2.22
		acc_norm	52.94	±	2.21
agieval_lsat_rc	0	acc	66.54	±	2.88
		acc_norm	65.80	±	2.90
agieval_sat_en	0	acc	77.67	±	2.91
		acc_norm	77.67	±	2.91
agieval_sat_en_without_passage	0	acc	43.20	±	3.46
		acc_norm	43.20	±	3.46
agieval_sat_math	0	acc	32.27	±	3.16
		acc_norm	30.45	±	3.11

Average: 45.16%

GPT4All

Task	Version	Metric	Value		Stderr
arc_challenge	0	acc	68.43	±	1.36
		acc_norm	68.34	±	1.36
arc_easy	0	acc	87.54	±	0.68
		acc_norm	82.11	±	0.79
boolq	1	acc	88.20	±	0.56
hellaswag	0	acc	69.76	±	0.46
		acc_norm	87.40	±	0.33
openbookqa	0	acc	40.20	±	2.19
		acc_norm	49.60	±	2.24
piqa	0	acc	83.68	±	0.86
		acc_norm	85.36	±	0.82
winogrande	0	acc	83.11	±	1.05

Average: 77.73%

TruthfulQA

Task	Version	Metric	Value		Stderr
truthfulqa_mc	1	mc1	58.87	±	1.72
		mc2	74.62	±	1.44

Average: 74.62%

Bigbench

Task	Version	Metric	Value		Stderr
bigbench_causal_judgement	0	multiple_choice_grade	60.00	±	3.56
bigbench_date_understanding	0	multiple_choice_grade	63.14	±	2.51
bigbench_disambiguation_qa	0	multiple_choice_grade	47.67	±	3.12
bigbench_geometric_shapes	0	multiple_choice_grade	22.56	±	2.21
		exact_str_match	0.84	±	0.48
bigbench_logical_deduction_five_objects	0	multiple_choice_grade	33.20	±	2.11
bigbench_logical_deduction_seven_objects	0	multiple_choice_grade	23.00	±	1.59
bigbench_logical_deduction_three_objects	0	multiple_choice_grade	59.67	±	2.84
bigbench_movie_recommendation	0	multiple_choice_grade	47.40	±	2.24
bigbench_navigate	0	multiple_choice_grade	56.10	±	1.57
bigbench_reasoning_about_colored_objects	0	multiple_choice_grade	71.25	±	1.01
bigbench_ruin_names	0	multiple_choice_grade	56.47	±	2.35
bigbench_salient_translation_error_detection	0	multiple_choice_grade	35.27	±	1.51
bigbench_snarks	0	multiple_choice_grade	73.48	±	3.29
bigbench_sports_understanding	0	multiple_choice_grade	75.46	±	1.37
bigbench_temporal_sequences	0	multiple_choice_grade	52.10	±	1.58
bigbench_tracking_shuffled_objects_five_objects	0	multiple_choice_grade	22.64	±	1.18
bigbench_tracking_shuffled_objects_seven_objects	0	multiple_choice_grade	19.83	±	0.95
bigbench_tracking_shuffled_objects_three_objects	0	multiple_choice_grade	59.67	±	2.84

Average: 48.83%

Average score: 61.58%

Elapsed time: 02:37:39
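
The per-suite averages above are not explained in the card, but they appear to be plain means over one metric per task: acc_norm where it is reported (acc otherwise) for AGIEval and GPT4All, mc2 for TruthfulQA, and multiple_choice_grade for Bigbench. The sketch below recomputes them from the table values under that assumption. The variable names are illustrative, and because the tables are rounded to two decimals, the overall mean can only be expected to match the reported 61.58 to within rounding.

```python
# Recompute the reported averages from the rounded table values.
# Assumption: one metric per task, as described in the lead-in.
from statistics import mean

suites = {
    # acc_norm for each AGIEval task
    "AGIEval": [26.77, 40.55, 23.91, 52.94, 65.80, 77.67, 43.20, 30.45],
    # acc_norm where reported; acc for boolq and winogrande
    "GPT4All": [68.34, 82.11, 88.20, 87.40, 49.60, 85.36, 83.11],
    # mc2 only
    "TruthfulQA": [74.62],
    # multiple_choice_grade for each Bigbench task
    "Bigbench": [60.00, 63.14, 47.67, 22.56, 33.20, 23.00, 59.67, 47.40, 56.10,
                 71.25, 56.47, 35.27, 73.48, 75.46, 52.10, 22.64, 19.83, 59.67],
}

suite_means = {name: mean(values) for name, values in suites.items()}
overall = mean(suite_means.values())

for name, value in suite_means.items():
    print(f"{name}: {value:.2f}")
print(f"Overall (from rounded inputs): {overall:.4f}")
```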