Inconsistent Eval Results with Openchat 3.5?

#7
by banghua - opened

Hi,

Thank you for the great work! I'm curious why the reported evaluation results are so different from those reported for OpenChat 3.5 (https://huggingface.co/openchat/openchat_3.5). Interestingly, the OpenChat 3.5 card also compares against OpenHermes 2.5 and claims to be better, and the scores reported for HumanEval, TruthfulQA, etc. do not match between the two.

OpenChat uses a proprietary, undisclosed method to generate its evaluation results and does not use LM Eval Harness, so I can't say how their benchmark scores were derived.
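
For comparison, a typical LM Eval Harness run looks roughly like this (a sketch only, not necessarily the exact settings behind my reported scores; the task names and few-shot count are assumptions and differ between harness versions):

# illustrative only: task names and few-shot settings are examples, not the exact reported configuration
lm_eval --model hf --model_args pretrained=teknium/OpenHermes-2.5-Mistral-7B --tasks truthfulqa_mc2,arc_challenge --num_fewshot 0 --batch_size 8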

OpenChat evaluation results are generated using zero-shot and few-shot CoT. For reproduction instructions, see here:

python -m ochat.evaluation.run_eval --model-type chatml_mistral --model teknium/OpenHermes-2.5-Mistral-7B
teknium changed discussion status to closed
