Inconsistent Eval Results with Openchat 3.5?

#7
by banghua - opened

Hi,

Thank you for the great work! I'm curious why the reported evaluation results are so different from those reported for OpenChat 3.5 (https://huggingface.co/openchat/openchat_3.5). Interestingly, the OpenChat 3.5 card also compares against OpenHermes 2.5 and claims to be better, and the scores reported for HumanEval, TruthfulQA, etc. do not match between the two.

OpenChat uses a proprietary, undisclosed method to generate its evaluation results and does not use LM Eval Harness, so I can't say how their benchmark scores were derived.
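
For comparison, a typical LM Eval Harness run looks roughly like this (a sketch only, not necessarily the exact settings behind my reported scores; the task names and few-shot count are assumptions and differ between harness versions):

# illustrative only: task names and few-shot settings are examples, not the exact reported configuration
lm_eval --model hf --model_args pretrained=teknium/OpenHermes-2.5-Mistral-7B --tasks truthfulqa_mc2,arc_challenge --num_fewshot 0 --batch_size 8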

OpenChat evaluation results are generated using zero-shot and few-shot CoT. For reproduction instructions, see here:

python -m ochat.evaluation.run_eval --model-type chatml_mistral --model teknium/OpenHermes-2.5-Mistral-7B
teknium changed discussion status to closed
