Inconsistent Eval Results with Openchat 3.5?
Hi,
Thank you for the great work! I'm curious why the reported evaluation results are so different from those reported for Openchat 3.5 (https://huggingface.co/openchat/openchat_3.5). Interestingly, Openchat 3.5 also compared against OpenHermes 2.5 and claimed to be better, yet the scores reported for HumanEval, TruthfulQA, etc. don't match on either side.
Openchat uses a proprietary, undisclosed method to generate its evaluation results and does not use LM Eval Harness, so I can't say how their benchmark scores were derived.
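If you want to cross-check the numbers yourself, a minimal sketch of reproducing a score with LM Eval Harness (assuming lm-eval >= 0.4 and its Python API; the task list and settings below are illustrative, not the exact configuration used for the reported numbers) could look like:

```python
# Hedged sketch: re-running a benchmark with LM Eval Harness (lm-eval >= 0.4).
# The task and batch size are example choices, not the settings behind the
# reported scores on either model card.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # HuggingFace transformers backend
    model_args="pretrained=teknium/OpenHermes-2.5-Mistral-7B,dtype=bfloat16",
    tasks=["truthfulqa_mc2"],  # e.g. TruthfulQA multiple-choice
    batch_size=8,
)

# Per-task metrics (accuracy, etc.) live under the "results" key.
print(results["results"])
```

Comparing numbers only really makes sense when both sides use the same harness version, prompt format, and few-shot settings, which is why scores produced by different pipelines often diverge.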