Eval models for data contamination?

#561
by liyucheng - opened

Well, I think it's fair to say that data contamination is undermining the reliability of the leaderboard.

I just ran an experiment testing Baichuan2, Qwen, and Llama on clean and contaminated test sets.

The results are kind of surprising: models seem to score about 10 points higher on contaminated data.

| dataset | version | mode | model | accuracy (clean) | accuracy (input contaminated) | accuracy (input-and-label contaminated) |
|---|---|---|---|---|---|---|
| mmlu | - | ppl | baichuan2-7b-base-hf | 56.76 | 44.69 | 54.93 |
| mmlu | - | ppl | qwen-7b-hf | 58.74 | 48.67 | 58.28 |
| mmlu | - | ppl | llama_30b_autogptq | 57.46 | 45.72 | 57.16 |
| hellaswag | 47bff9 | ppl | baichuan2-7b-base-hf | 66.87 | 57.14 | 70.97 |
| hellaswag | 47bff9 | ppl | qwen-7b-hf | 86.42 | 89.29 | 90.88 |
| hellaswag | 47bff9 | ppl | llama_30b_autogptq | 76.71 | 57.14 | 82.37 |

I just added this feature to OpenCompass. I wonder if anyone is interested in this proposal? I could do a more comprehensive analysis of data contamination.

Check out my implementation and reports here: https://github.com/liyucheng09/Contamination_Detector.

Open LLM Leaderboard org

Hi @liyucheng ,
Thanks for your comment!
Can you detail your methodology a bit?

@clefourrier Hi Clémentine, sorry for the late reply.

I discussed my approach with Edward Beeching in an interview.

Basically, I check each benchmark example's presence in Common Crawl and classify it into one of three categories:

  1. Clean.
  2. Input-only contamination: the input (question/passage) appears in Common Crawl, but not the label/answer.
  3. Input-and-label contamination: both the input and the label appear in Common Crawl.
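
For clarity, here's a minimal sketch of that classification logic. The `appears_in_common_crawl` helper is a hypothetical placeholder (not my actual implementation); in practice it would be backed by a web search API or a Common Crawl index lookup:

```python
def appears_in_common_crawl(text: str) -> bool:
    """Hypothetical helper (an assumption, not the real implementation):
    return True if `text` can be found in Common Crawl, e.g. via a web
    search API or a Common Crawl index lookup."""
    raise NotImplementedError

def classify_contamination(question: str, answer: str) -> str:
    """Label a benchmark example as one of the three categories above."""
    if not appears_in_common_crawl(question):
        return "clean"
    # The input is online; check whether the answer leaks alongside it.
    if appears_in_common_crawl(question + " " + answer):
        return "input-and-label contaminated"
    return "input contaminated"
```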

We then calculate metrics separately on each category to get an impression of each benchmark's contamination level.
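
As a rough illustration (not the exact OpenCompass code), the per-category accuracy can be computed like this:

```python
from collections import defaultdict

def accuracy_by_category(results):
    """Compute accuracy per contamination category.

    `results` is an iterable of (category, is_correct) pairs,
    e.g. ("clean", True). Returns accuracy in percent per category."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in results:
        total[category] += 1
        correct[category] += int(is_correct)
    return {cat: 100.0 * correct[cat] / total[cat] for cat in total}
```

A model that scores much higher on the input-and-label subset than on the clean subset has likely memorized those test items.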

Following the existing practice in OpenCompass, it's convenient and requires no extra compute.
I have run checks on six popular QA benchmarks so far; see here.

clefourrier changed discussion status to closed
Open LLM Leaderboard org

Sorry I missed this answer for so long! I'm adding this idea to our contamination detection backlog.
