Eval models for data contamination?
Well, I think it's fair to say that data contamination is destroying the reliability of the leaderboard.
I just ran some experiments that test Llama-2, Baichuan2, and Yi on clean and contaminated test sets.
The results are kinda surprising: models seem to score about 10 pts higher on the dirty data.
All values are accuracy.

| dataset | version | mode | baichuan2-7b-base-hf: clean | baichuan2-7b-base-hf: input contaminated | baichuan2-7b-base-hf: input-and-label contaminated | qwen-7b-hf: clean | qwen-7b-hf: input contaminated | qwen-7b-hf: input-and-label contaminated | llama_30b_autogptq: clean | llama_30b_autogptq: input contaminated | llama_30b_autogptq: input-and-label contaminated |
|---|---|---|---|---|---|---|---|---|---|---|---|
| mmlu | - | ppl | 56.76 | 44.69 | 54.93 | 58.74 | 48.67 | 58.28 | 57.46 | 45.72 | 57.16 |
| hellaswag | 47bff9 | ppl | 66.87 | 57.14 | 70.97 | 86.42 | 89.29 | 90.88 | 76.71 | 57.14 | 82.37 |
I just added this feature to OpenCompass. Is anyone interested in this proposal? I could do a more comprehensive analysis of data contamination.
Check out my implementation and reports here: https://github.com/liyucheng09/Contamination_Detector.
Hi @liyucheng,
Thanks for your comment!
Can you detail your methodology a bit?
@clefourrier Hi Clémentine, sorry for the late reply.
I have discussed my approach with Edward Benching in an interview.
Basically, I check whether each benchmark example is present in Common Crawl and classify it into one of three categories:
- Clean.
- Input-only contamination: the input (question/passage) appears in Common Crawl, but not the label/answer.
- Input-and-label contamination: the contamination gives away both the input and the label.
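
A minimal sketch of the three-way classification, under loud assumptions: a small in-memory list of documents stands in for Common Crawl, matching is plain normalized substring search, and the label only counts as leaked if it appears in the same document as the input. The names `Contamination`, `normalize`, and `classify` are hypothetical and are not taken from OpenCompass or the Contamination_Detector repo.

```python
from enum import Enum


class Contamination(Enum):
    CLEAN = "clean"
    INPUT_ONLY = "input contaminated"
    INPUT_AND_LABEL = "input-and-label contaminated"


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not block a match."""
    return " ".join(text.lower().split())


def classify(example_input: str, label: str, documents: list[str]) -> Contamination:
    """Classify one benchmark example against a toy corpus (assumed stand-in for Common Crawl)."""
    needle_input = normalize(example_input)
    needle_label = normalize(label)
    input_found = False
    for doc in documents:
        haystack = normalize(doc)
        if needle_input in haystack:
            # The input leaked; check whether the same document also gives away the label.
            if needle_label in haystack:
                return Contamination.INPUT_AND_LABEL
            input_found = True
    return Contamination.INPUT_ONLY if input_found else Contamination.CLEAN


# Toy corpus, made up purely for illustration.
corpus = [
    "Q: What is the capital of France? A: Paris",
    "What is 2+2? Discuss with your classmates.",
]
print(classify("What is the capital of France?", "Paris", corpus).value)
print(classify("What is 2+2?", "4", corpus).value)
print(classify("Who wrote Hamlet?", "Shakespeare", corpus).value)
```

Exact substring matching is a deliberate simplification: very short labels (a single letter or digit) can match by accident, so a real check would likely need stricter matching around the answer.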
We then calculate metrics separately on each subset to get an impression of how much contamination affects the scores.
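
Computing the metric per subset can be sketched as follows. This is an illustration only: it assumes each evaluated example is already available as a `(category, is_correct)` pair, and `accuracy_by_category` is a hypothetical helper, not an OpenCompass function.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return accuracy (in percent) per contamination category.

    `records` is an iterable of (category, is_correct) pairs, where
    `category` is any hashable label and `is_correct` is a bool.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, is_correct in records:
        total[category] += 1
        correct[category] += int(is_correct)
    return {category: 100.0 * correct[category] / total[category] for category in total}


# Made-up predictions, purely for illustration.
records = [
    ("clean", True),
    ("clean", False),
    ("input contaminated", True),
    ("input-and-label contaminated", True),
    ("input-and-label contaminated", True),
]
for category, accuracy in accuracy_by_category(records).items():
    print(f"{category}: {accuracy:.2f}")
```

Since the predictions come from the normal evaluation run, splitting them by category afterwards adds only this bookkeeping step.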
Following the practice in OpenCompass, this is convenient and requires no extra compute.
I have done checks for six popular QA benchmarks so far; see here.
Sorry I missed this answer for so long! I'm adding this idea to our contamination detection backlog