Script to reproduce MT-Bench
Congrats on your fine-tuned Llama-3-70B model. There is a section in your README mentioning MT-Bench, especially the multi-turn results:
> Note: While the Open LLM Leaderboard shows other performant Llama-3 fine-tuned models, we observe that these models typically regress in performance and struggle in a multi-turn chat setting such as MT-Bench. We present the below comparison with a Llama3 finetune from the leaderboard.

| Model | First Turn | Second Turn | Average |
|---|---|---|---|
| tenyx/Llama3-TenyxChat-70B | 8.12 | 8.18 | 8.15 |
| meta-llama/Meta-Llama-3-70B-Instruct | 8.05 | 7.87 | 7.96 |
| MaziyarPanahi/Llama-3-70B-Instruct-DPO-v0.4 | 8.05 | 7.82 | 7.93 |
Could you please share the script you used for this evaluation? I would like to check whether the prompt template and eos_token were respected during the eval, since my models use ChatML.
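For context, this is the kind of check I have in mind: a minimal sketch using transformers' `apply_chat_template`. The model ID and the ChatML markers below are just what my fine-tunes use; other models may differ.

```python
from transformers import AutoTokenizer

# One of my ChatML fine-tunes, used here only as an example.
tokenizer = AutoTokenizer.from_pretrained(
    "MaziyarPanahi/Llama-3-70B-Instruct-DPO-v0.4"
)

messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help?"},
    {"role": "user", "content": "Tell me a joke."},
]

# Render the conversation with the tokenizer's own chat template so we can
# inspect exactly what the eval harness should have fed the model.
rendered = tokenizer.apply_chat_template(messages, tokenize=False)
print(rendered)
print("eos_token:", tokenizer.eos_token)

# For a ChatML model, turns should be wrapped in <|im_start|>/<|im_end|>;
# if the harness used a different template or eos_token, MT-Bench scores
# (especially second-turn) can degrade.
assert "<|im_start|>" in rendered and "<|im_end|>" in rendered
```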
Thanks and congrats again! :)
@MaziyarPanahi
-- Thanks, and congrats on your fine-tunes as well 🤗. We used the code from lm-sys/FastChat. Note that to use gpt-4-0125 as the judge, you would need to integrate this PR; the reasoning and the repo owners' comments on it are in the PR.
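For completeness, the standard MT-Bench flow lives in FastChat's `fastchat/llm_judge` directory and looks roughly like this (the model ID here is illustrative; see the repo's README for the authoritative steps):

```bash
cd FastChat/fastchat/llm_judge

# 1) Generate the model's answers to the 80 MT-Bench questions.
python gen_model_answer.py \
    --model-path tenyx/Llama3-TenyxChat-70B \
    --model-id Llama3-TenyxChat-70B

# 2) Grade the answers with the GPT-4 judge (requires OPENAI_API_KEY).
#    Using gpt-4-0125 as the judge requires the PR mentioned above.
python gen_judgment.py \
    --model-list Llama3-TenyxChat-70B \
    --parallel 2

# 3) Print first-turn, second-turn, and average scores.
python show_result.py --model-list Llama3-TenyxChat-70B
```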
Thank you @sarath-shekkizhar for sharing the script; I appreciate it. I'll use it for my next fine-tunes.
PS: Please, keep up the good work! 🤗❤️