Script to reproduce MT-Bench
Congrats on your fine-tuned Llama-3-70B model. There is a section in your README mentioning MT-Bench, especially the multi-turn results:
> Note: While the Open LLM Leaderboard shows other performant Llama-3 fine-tuned models, we observe that these models typically regress in performance and struggle in a multi-turn chat setting such as MT-Bench. We present the below comparison with a Llama3 finetune from the leaderboard.

| Model | First Turn | Second Turn | Average |
|---|---|---|---|
| tenyx/Llama3-TenyxChat-70B | 8.12 | 8.18 | 8.15 |
| meta-llama/Meta-Llama-3-70B-Instruct | 8.05 | 7.87 | 7.96 |
| MaziyarPanahi/Llama-3-70B-Instruct-DPO-v0.4 | 8.05 | 7.82 | 7.93 |
Could you please share the script you used for this evaluation? I would like to check whether the prompt template and eos_token were respected during the eval, since my models use ChatML.
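For context, this is the kind of check I have in mind: a minimal sketch using transformers' `apply_chat_template`. The model ID and the ChatML markers below are just what my fine-tunes use; other models may differ.

```python
from transformers import AutoTokenizer

# One of my ChatML fine-tunes, used here only as an example.
tokenizer = AutoTokenizer.from_pretrained(
    "MaziyarPanahi/Llama-3-70B-Instruct-DPO-v0.4"
)

messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help?"},
    {"role": "user", "content": "Tell me a joke."},
]

# Render the conversation with the tokenizer's own chat template so we can
# inspect exactly what the eval harness should have fed the model.
rendered = tokenizer.apply_chat_template(messages, tokenize=False)
print(rendered)
print("eos_token:", tokenizer.eos_token)

# For a ChatML model, turns should be wrapped in <|im_start|>/<|im_end|>;
# if the harness used a different template or eos_token, MT-Bench scores
# (especially second-turn) can degrade.
assert "<|im_start|>" in rendered and "<|im_end|>" in rendered
```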
Thanks and congrats again! :)
@MaziyarPanahi
-- Thanks, and congrats on your fine-tunes as well 🤗. We used the code from lm-sys/FastChat. Note that to use gpt-4-0125 as the judge, you would need to integrate this PR; the reasoning and the repo owners' comments on it are in the PR.
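For completeness, the standard MT-Bench flow lives in FastChat's `fastchat/llm_judge` directory and looks roughly like this (the model ID here is illustrative; see the repo's README for the authoritative steps):

```bash
cd FastChat/fastchat/llm_judge

# 1) Generate the model's answers to the 80 MT-Bench questions.
python gen_model_answer.py \
    --model-path tenyx/Llama3-TenyxChat-70B \
    --model-id Llama3-TenyxChat-70B

# 2) Grade the answers with the GPT-4 judge (requires OPENAI_API_KEY).
#    Using gpt-4-0125 as the judge requires the PR mentioned above.
python gen_judgment.py \
    --model-list Llama3-TenyxChat-70B \
    --parallel 2

# 3) Print first-turn, second-turn, and average scores.
python show_result.py --model-list Llama3-TenyxChat-70B
```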
Thank you @sarath-shekkizhar for sharing the script; I appreciate it. I'll use it for my next fine-tunes.
PS: Please, keep up the good work! 🤗❤️