
Mistake in gaia's scoring function.

#10

by amedhat - opened Feb 13

Discussion

amedhat

Feb 13 • edited Feb 13

I believe there's a mistake, or a logical inconsistency, in the GAIA scoring function listed both here https://huggingface.co/spaces/gaia-benchmark/leaderboard/blob/main/scorer.py and in the supplementary materials of your ICLR paper here https://openreview.net/forum?id=fibxvahvs3

When the following ground truth and model answer are fed to the scoring function, it rates the model answer as correct, although it is neither equivalent to the ground truth nor a valid grammatical sentence.

ground truth: "The seagull glided peacefully to my chair.",
model answer: "THESE A GULL GLIDED PEACEFULLY TO MY CHAIR"
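
To make the report concrete, here is a minimal Python sketch of the kind of normalization that would produce this result. It assumes the scorer compares strings after removing all whitespace and punctuation and lowercasing. `normalize_answer` and `is_match` are hypothetical names used for illustration, not the functions in scorer.py.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Hypothetical normalizer (assumption): drop all whitespace and
    ASCII punctuation, then lowercase."""
    no_spaces = re.sub(r"\s", "", text)
    return no_spaces.lower().translate(str.maketrans("", "", string.punctuation))


def is_match(model_answer: str, ground_truth: str) -> bool:
    """Exact match on the normalized strings."""
    return normalize_answer(model_answer) == normalize_answer(ground_truth)


ground_truth = "The seagull glided peacefully to my chair."
model_answer = "THESE A GULL GLIDED PEACEFULLY TO MY CHAIR"

# Both strings collapse to the same run of letters, so the spacing
# difference ("THESE A GULL" vs "The seagull") is invisible to the check.
print(normalize_answer(ground_truth))
print(is_match(model_answer, ground_truth))
```

Under that assumption, any placement of spaces in the same letter sequence would be accepted.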

amedhat changed discussion title from Mistake in gaia scoring function listed in ICLR to Mistake in gaia's scoring function. Feb 13

clefourrier

GAIA org Feb 14

Hi, thanks for your interest!
Given the dataset that we have, and the fact that we require answers to be given as sequences of words rather than sentences, the point you raised is not an issue.
To get an exact match on our specific examples (even with incorrect spacing), you'd still need to have correctly understood both the prompt and the expected answer.

clefourrier changed discussion status to closed Feb 14

amedhat

Feb 14

Hey Clementine. Thanks for responding.

This is an interesting choice, given that this particular question from your validation set is precisely about asking GPT to identify where the spaces go in a sequence that, once spaces are removed, would always match the correct answer.

Appreciate your response.

clefourrier

GAIA org Feb 14

Hm, can you give me the link to the question you are referring to? I might have misunderstood your comment.

amedhat

Feb 14

The question is on the second page of the validation set: https://huggingface.co/datasets/gaia-benchmark/GAIA/viewer/2023_all/validation?p=1&row=124

The correct answer is "The seagull glided peacefully to my chair", but the raw input of the problem, if submitted to the scoring function, would also score as correct even though it isn’t, which defeats the purpose of this task.
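
For illustration only, here is one possible way a comparison could stay tolerant of case and punctuation while still checking word boundaries. This is a sketch under my own assumptions, not something from scorer.py or proposed by the maintainers. `normalize_keep_words` and `is_strict_match` are hypothetical names.

```python
import string


def normalize_keep_words(text: str) -> str:
    """Hypothetical normalizer (assumption): lowercase, drop ASCII
    punctuation, and collapse whitespace runs to single spaces so that
    word boundaries are preserved."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())


def is_strict_match(model_answer: str, ground_truth: str) -> bool:
    """Exact match on normalized strings that still contain spaces."""
    return normalize_keep_words(model_answer) == normalize_keep_words(ground_truth)


ground_truth = "The seagull glided peacefully to my chair."

# Wrong word boundaries are rejected.
print(is_strict_match("THESE A GULL GLIDED PEACEFULLY TO MY CHAIR", ground_truth))

# Case, trailing punctuation, and extra spaces are still tolerated.
print(is_strict_match("the  seagull glided peacefully to my chair", ground_truth))
```

With a check like this, a submission would only pass if it placed the spaces correctly, which is what the question is testing.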

clefourrier

GAIA org Feb 14

Gotcha, super good point - I had completely missed this specific sample when we designed the scoring function!

I'm pinging people internally and reopening; we'll keep you posted.

clefourrier changed discussion status to open Feb 14

amedhat

Feb 14

Great. Thank you.
