open-llm-leaderboard/open_llm_leaderboard · Eval models for data contamination?

Jan 25

Well, I think it's fair to say that data contamination is destroying the reliability of the leaderboard.

I just ran an experiment that tests Llama-2, Baichuan2, and Yi on clean and contaminated test sets.

The results are kind of surprising: models seem to score 10 points higher on the dirty dataset.

dataset    version  mode  model                 accuracy - clean  accuracy - input contaminated  accuracy - input-and-label contaminated
---------  -------  ----  --------------------  ----------------  -----------------------------  ---------------------------------------
mmlu       -        ppl   baichuan2-7b-base-hf  56.76             44.69                          54.93
mmlu       -        ppl   qwen-7b-hf            58.74             48.67                          58.28
mmlu       -        ppl   llama_30b_autogptq    57.46             45.72                          57.16
hellaswag  47bff9   ppl   baichuan2-7b-base-hf  66.87             57.14                          70.97
hellaswag  47bff9   ppl   qwen-7b-hf            86.42             89.29                          90.88
hellaswag  47bff9   ppl   llama_30b_autogptq    76.71             57.14                          82.37

I just added this feature to OpenCompass. Is anyone interested in this proposal? I could do a more comprehensive analysis of data contamination.

Check out my implementation and reports here: https://github.com/liyucheng09/Contamination_Detector.

clefourrier

Open LLM Leaderboard org Jan 30

Hi @liyucheng ,
Thanks for your comment!
Can you detail your methodology a bit?

liyucheng

Feb 9

edited Feb 9

@clefourrier Hi Clémentine, sorry for the late reply.

I have discussed my approach with Edward Benching in an interview.

Basically, I check whether benchmark examples are present in Common Crawl and classify them into three categories:

Clean.
Input-only contamination: the input (question/passage) appears in Common Crawl, but not the label/answer.
Input-and-label contamination: the contamination gives away both the input and the label.

We then calculate metrics separately on each category to get an impression of how much contamination affects the scores.

Following the practice in OpenCompass, this is convenient and requires no extra compute.
I have done checks for six popular QA benchmarks so far; see here.
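
The three-way split and per-category scoring described above can be sketched roughly as follows. This is only an illustration, not the code from Contamination_Detector or OpenCompass: the `input_found` and `label_found` flags are hypothetical placeholders for whatever the Common Crawl lookup returns, the function names are made up, and the toy examples are invented data.

```python
from collections import defaultdict

def classify(input_found: bool, label_found: bool) -> str:
    """Map two (hypothetical) Common Crawl match flags to a category."""
    # Assumption: an example whose input was not found counts as clean,
    # regardless of whether its label text appears somewhere.
    if not input_found:
        return "clean"
    if label_found:
        return "input-and-label contaminated"
    return "input contaminated"

def accuracy_by_category(examples):
    """Return accuracy (in percent) computed separately per category."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for ex in examples:
        category = classify(ex["input_found"], ex["label_found"])
        totals[category] += 1
        correct[category] += int(ex["correct"])
    return {c: 100.0 * correct[c] / totals[c] for c in totals}

# Invented toy data: each record is one benchmark example with the
# model's correctness and the two match flags.
toy_examples = [
    {"input_found": False, "label_found": False, "correct": True},
    {"input_found": False, "label_found": False, "correct": False},
    {"input_found": True, "label_found": False, "correct": False},
    {"input_found": True, "label_found": True, "correct": True},
]

for category, acc in accuracy_by_category(toy_examples).items():
    print(f"{category}: {acc:.2f}")
```

Because the split only regroups predictions the model has already made, the per-category numbers come from a single evaluation run, which matches the "no extra compute" point above.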

clefourrier changed discussion status to closed Jun 21

clefourrier

Open LLM Leaderboard org Jun 21

Sorry I missed this answer for so long! I'm adding this idea to our contamination detection backlog.