Surprisingly low score of Rhea-72b-v0.5 and TW3-JRGL-v2
Both are based on Qwen (v1) 72B (llama-ified). These are large models that used to be top performers on the old leaderboard and on other leaderboards (TW3-JRGL-v2 didn't run successfully before the old leaderboard was shut down).
Could you look into their outputs? Is there any reason for the low scores?
Request files:
https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/davidkim205/Rhea-72b-v0.5_eval_request_False_bfloat16_Original.json
https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/paloalma/TW3-JRGL-v2_eval_request_False_bfloat16_Original.json
Result files:
https://huggingface.co/datasets/open-llm-leaderboard/results/blob/main/davidkim205/Rhea-72b-v0.5/results_2024-07-26T07-16-56.959001.json
https://huggingface.co/datasets/open-llm-leaderboard/results/blob/main/paloalma/TW3-JRGL-v2/results_2024-07-20T13-10-35.823593.json
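For reference, the per-task numbers inside those result files can be pulled and printed with something like the sketch below (untested; it only assumes the `huggingface_hub` client and the exact filenames linked above):

```python
import json

from huggingface_hub import hf_hub_download

# The two result files linked above, stored in the open-llm-leaderboard/results dataset repo.
RESULT_FILES = {
    "Rhea-72b-v0.5": "davidkim205/Rhea-72b-v0.5/results_2024-07-26T07-16-56.959001.json",
    "TW3-JRGL-v2": "paloalma/TW3-JRGL-v2/results_2024-07-20T13-10-35.823593.json",
}

for model, filename in RESULT_FILES.items():
    local_path = hf_hub_download(
        repo_id="open-llm-leaderboard/results",
        filename=filename,
        repo_type="dataset",
    )
    with open(local_path) as f:
        data = json.load(f)
    print(f"\n=== {model} ===")
    # Dump whatever per-task metrics the file contains, to see which evals collapsed.
    for task, metrics in data.get("results", {}).items():
        print(task, metrics)
```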
Are there any concrete data or text-output results available? There are some guesses and speculation, but can these be tested?
Rhea-72b-v0.5 is based on Smaug-72B-v0.1, and the related models performed much better:
abacusai/Smaug-72B-v0.1: 29.56, vs 4.02 for Rhea-72b-v0.5
paloalma/Le_Triomphant-ECE-TW3: 31.66, vs 4.57 for TW3-JRGL-v2
Hi!
To inspect this, you'll need to look at the details repos of these models (example for the Rhea model; the naming pattern is consistent) and do a manual inspection of the outputs to see what's happening.
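If it helps, here's a rough, untested sketch of that manual inspection. The details repo id below is an assumption based on the usual `<org>__<model>-details` naming, so adjust it if the actual repo differs:

```python
import pandas as pd
from huggingface_hub import hf_hub_download, list_repo_files

# Assumed details repo for the Rhea model (usual <org>__<model>-details pattern);
# swap in the real repo id if it differs.
details_repo = "open-llm-leaderboard/davidkim205__Rhea-72b-v0.5-details"

# List the per-task detail files and pick one to inspect manually.
files = [f for f in list_repo_files(details_repo, repo_type="dataset") if f.endswith(".parquet")]
print("\n".join(files))

local_path = hf_hub_download(details_repo, filename=files[0], repo_type="dataset")
df = pd.read_parquet(local_path)

# Eyeball the raw prompts and generations: truncated outputs, a bad EOS token,
# or answers that never match the expected format would all explain a near-zero score.
print(df.head())
```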
Is the score you are reporting the average? If yes, it's likely a conversion problem. If it's for a specific eval, it's likely a problem with a bad EOS token or equivalent.
Good luck investigating!