Surprisingly low score of Rhea-72b-v0.5 and TW3-JRGL-v2

#875
by CombinHorizon - opened

Both models are based on Qwen (v1) 72B (llama-ified). They are large models and used to be top performers on the old leaderboard and elsewhere (TW3-JRGL-v2 never completed a run before the old leaderboard was shut down).
Could you look into their outputs? Is there any apparent reason for the low scores?

Request files:
https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/davidkim205/Rhea-72b-v0.5_eval_request_False_bfloat16_Original.json
https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/paloalma/TW3-JRGL-v2_eval_request_False_bfloat16_Original.json

Result files:
https://huggingface.co/datasets/open-llm-leaderboard/results/blob/main/davidkim205/Rhea-72b-v0.5/results_2024-07-26T07-16-56.959001.json
https://huggingface.co/datasets/open-llm-leaderboard/results/blob/main/paloalma/TW3-JRGL-v2/results_2024-07-20T13-10-35.823593.json

Are there any concrete data or text-output results available? There has been some guessing and speculation, but can those hypotheses actually be tested?

Rhea-72b-v0.5 is based on Smaug-72B-v0.1, and the corresponding models scored much higher:

- abacusai/Smaug-72B-v0.1: 29.56, vs 4.02 for Rhea-72b-v0.5
- paloalma/Le_Triomphant-ECE-TW3: 31.66, vs 4.57 for TW3-JRGL-v2

Open LLM Leaderboard org

Hi!

To inspect this, you'll need to find the details repos for these models (for example, the one for the Rhea model; the naming pattern is consistent) and do a manual inspection of the outputs to see what's happening.
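
A minimal sketch of how that manual inspection could look, assuming the details repo follows the usual `open-llm-leaderboard/<org>__<model>-details` naming pattern; the available config and split names are listed at runtime rather than hard-coded, since those are assumptions:

```python
# A rough way to eyeball the raw generations: the details repo id below follows
# the leaderboard's usual <org>__<model>-details naming, but treat the config
# and split names as assumptions and pick from what is actually listed.
from datasets import get_dataset_config_names, load_dataset

repo_id = "open-llm-leaderboard/davidkim205__Rhea-72b-v0.5-details"

# List the evals that have per-sample details available.
configs = get_dataset_config_names(repo_id)
print(configs)

# Load the samples for one eval and print a few rows to check for truncated
# outputs, missing EOS handling, or answers in a format the parser rejects.
details = load_dataset(repo_id, configs[0])
split = list(details.keys())[0]
for row in details[split].select(range(3)):
    print(row)
```

Looking at a handful of raw generations usually makes it obvious whether the answers are empty, truncated, or formatted in a way the scorer cannot parse.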

Is the score you are reporting the average? If yes, it's likely a conversion problem. If it's for a specific eval, it's likely a problem with a bad EOS token or equivalent.
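
To tell these two cases apart, a quick sketch: pull the results JSON linked above and print the per-task scores. The file path comes from the result links in the question; the `"results"` key layout (task name mapped to metrics) is an assumption, so adjust if the file differs. If every task is near zero, a conversion issue is plausible; if only certain evals collapse, a stop-token or prompt-format problem is more likely.

```python
# Pull the published results JSON (path from the links above) and print the
# per-task scores; the "results" key layout is an assumption, adjust if needed.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="open-llm-leaderboard/results",
    repo_type="dataset",
    filename="davidkim205/Rhea-72b-v0.5/results_2024-07-26T07-16-56.959001.json",
)

with open(path) as f:
    data = json.load(f)

# If every task is low, suspect a checkpoint/conversion issue; if only some
# evals collapse, suspect a bad EOS/stop token or prompt-format mismatch.
for task, metrics in data.get("results", {}).items():
    print(task, metrics)
```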

Good luck investigating!

clefourrier changed discussion status to closed
