Mistake in GAIA's scoring function
I believe there's a mistake, or at least a logical inconsistency, in the GAIA scoring function listed both here https://huggingface.co/spaces/gaia-benchmark/leaderboard/blob/main/scorer.py and in the supplementary materials of your ICLR paper here https://openreview.net/forum?id=fibxvahvs3
When the following ground truth and model answer are fed to the scoring function, it rates the model answer as correct, both on equivalence and on being a valid grammatical sentence.
- ground truth: "The seagull glided peacefully to my chair."
- model answer: "THESE A GULL GLIDED PEACEFULLY TO MY CHAIR"
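
To make the collision concrete, here is a minimal sketch. It assumes the scorer normalizes strings by removing all whitespace, lowercasing, and stripping punctuation before an exact-match comparison. `normalize_sketch` is a hypothetical stand-in written for this example, not a copy of the code in scorer.py.

```python
import re
import string


def normalize_sketch(text: str) -> str:
    """Hypothetical stand-in for the scorer's string normalization."""
    # Assumption: whitespace is removed entirely rather than collapsed.
    no_spaces = re.sub(r"\s", "", text)
    # Assumption: case and ASCII punctuation are ignored.
    return no_spaces.lower().translate(str.maketrans("", "", string.punctuation))


ground_truth = "The seagull glided peacefully to my chair."
model_answer = "THESE A GULL GLIDED PEACEFULLY TO MY CHAIR"

# Both strings reduce to the same run of letters, so an exact match succeeds.
print(normalize_sketch(ground_truth))
print(normalize_sketch(ground_truth) == normalize_sketch(model_answer))
```

Under those assumptions, the word boundaries are discarded before the comparison, so the two answers cannot be told apart.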
Hi, thanks for your interest!
Given the dataset that we have, and the fact that we require answers to be given as sequences of words rather than sentences, the point that you raised is not an issue.
To get an exact match on our specific examples (even with incorrect spacing), you'd still need to have correctly understood both the prompt and the expected answer.
Hey Clementine. Thanks for responding.
This is an interesting choice, given that this particular question from your validation set asks GPT precisely to identify where the spaces go in a sequence of letters. With spaces removed, that sequence always matches the correct answer directly.
Appreciate your response.
Hm, can you give me the link to the question you are referring to? I might have misunderstood your comment
The question is on the second page of the validation set: https://huggingface.co/datasets/gaia-benchmark/GAIA/viewer/2023_all/validation?p=1&row=124
The correct answer is "The seagull glided peacefully to my chair", but the raw input of the problem, if submitted to the scoring function, would also score as correct even though it isn't, which defeats the purpose of this task.
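
For illustration only, here is one way a comparison could keep word boundaries so that this kind of answer no longer passes. This is a sketch of my own and not something the maintainers have proposed. `normalize_keep_words` is a hypothetical helper, and it assumes that ignoring case and ASCII punctuation is still desired.

```python
import string


def normalize_keep_words(text: str) -> list[str]:
    """Hypothetical normalization that preserves word boundaries."""
    # Drop ASCII punctuation and case, as a lenient scorer might.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    # split() collapses runs of whitespace but keeps the words separate.
    return cleaned.split()


ground_truth = "The seagull glided peacefully to my chair."
wrong_spacing = "THESE A GULL GLIDED PEACEFULLY TO MY CHAIR"
right_spacing = "the seagull  glided peacefully to my chair"

print(normalize_keep_words(ground_truth) == normalize_keep_words(wrong_spacing))
print(normalize_keep_words(ground_truth) == normalize_keep_words(right_spacing))
```

With this variant the mis-segmented answer differs from the ground truth at the first word, while extra spaces, case, and the trailing period are still tolerated.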
Gotcha, super good point - I had completely missed this specific sample when we designed the scoring function!
I'm pinging people internally and reopening this. We'll keep you posted.
Great. Thank you.