pminervini
committed
Merge branch 'main' of https://huggingface.co/spaces/pminervini/hallucinations-leaderboard into main
src/display/about.py CHANGED (+37 -3)
@@ -42,9 +42,43 @@ For all these evaluations, a higher score is a better score.
 You can find details on the input/outputs for the models in the `details` of each model, that you can access by clicking the 📄 emoji after the model name
 
 # Reproducibility
-
-
-
+To reproduce our results, here are the commands you can run, using [this script](https://huggingface.co/spaces/hallucinations-leaderboard/leaderboard/blob/main/backend-cli.py): `python backend-cli.py`.
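
As a minimal end-to-end sketch (the `pip install` step assumes the Space lists its dependencies in a `requirements.txt` at the root, per the usual Spaces convention):

```bash
# Clone the leaderboard Space; backend-cli.py sits at its root
git clone https://huggingface.co/spaces/hallucinations-leaderboard/leaderboard
cd leaderboard

# Install the Space's dependencies (requirements.txt at the root is assumed)
pip install -r requirements.txt

# Run the evaluation script
python backend-cli.py
```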
+
+Alternatively, if you're interested in evaluating a specific task with a particular model, you can use [this script](https://github.com/EleutherAI/lm-evaluation-harness/tree/b281b0921b636bc36ad05c0b0b0763bd6dd43463) of the Eleuther AI Harness:
+`python main.py --model=hf-causal-experimental --model_args="pretrained=<your_model>,revision=<your_model_revision>"`
+` --tasks=<task_list> --num_fewshot=<n_few_shot> --batch_size=1 --output_path=<output_path>` (Note that you may need to add tasks from [here](https://huggingface.co/spaces/hallucinations-leaderboard/leaderboard/tree/main/src/backend/tasks) to [this folder](https://github.com/EleutherAI/lm-evaluation-harness/tree/b281b0921b636bc36ad05c0b0b0763bd6dd43463/lm_eval/tasks))
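
As an illustration, here is one concrete instantiation of the command above; the model name, revision, and output path are placeholders chosen for this sketch:

```bash
# Illustrative single-task run: NQ Open (64-shot, per the task list below)
# on a placeholder model; swap in your own model, revision, and output path
python main.py \
  --model=hf-causal-experimental \
  --model_args="pretrained=mistralai/Mistral-7B-v0.1,revision=main" \
  --tasks=nq_open \
  --num_fewshot=64 \
  --batch_size=1 \
  --output_path=results/nq_open.json
```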
+
+The total batch size we get for models which fit on one A100 node is 8 (8 GPUs * 1). If you don't use parallelism, adapt your batch size to fit. You can expect results to vary slightly for different batch sizes because of padding.
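
For instance, to approximate the 8-GPU effective batch size on a single GPU (a sketch; whether a given model fits at this batch size depends on its size and your memory):

```bash
# Single-GPU run with no parallelism: --batch_size raised to match the
# 8 * 1 total above; placeholders as in the command template
python main.py --model=hf-causal-experimental \
  --model_args="pretrained=<your_model>,revision=<your_model_revision>" \
  --tasks=<task_list> --num_fewshot=<n_few_shot> \
  --batch_size=8 --output_path=<output_path>
```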
+
+
+The tasks and few-shot parameters are as follows (see the example run after the list):
+- NQ Open: 64-shot, *nq_open* (`exact_match`)
+- NQ Open 8: 8-shot, *nq8* (`exact_match`)
+- TriviaQA: 64-shot, *triviaqa* (`exact_match`)
+- TriviaQA 8: 8-shot, *tqa8* (`exact_match`)
+- TruthfulQA MC1: 0-shot, *truthfulqa_mc1* (`acc`)
+- TruthfulQA MC2: 0-shot, *truthfulqa_mc2* (`acc`)
+- HaluEval QA: 0-shot, *halueval_qa* (`em`)
+- HaluEval Summ: 0-shot, *halueval_summarization* (`em`)
+- HaluEval Dial: 0-shot, *halueval_dialogue* (`em`)
+- XSum: 2-shot, *xsum* (`rougeLsum`)
+- CNN/DM: 2-shot, *cnndm* (`rougeLsum`)
+- MemoTrap: 0-shot, *memo-trap* (`acc`)
+- IFEval: 0-shot, *ifeval* (`prompt_level_strict_acc`)
+- SelfCheckGPT: 0-shot, *selfcheckgpt* (``)
+- FEVER: 16-shot, *fever10* (`acc`)
+- SQuADv2: 4-shot, *squadv2* (`squad_v2`)
+- TrueFalse: 8-shot, *truefalse_cieacf* (`acc`)
+- FaithDial: 8-shot, *faithdial_hallu* (`acc`)
+- RACE: 0-shot, *race* (`acc`)
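
To run several of these tasks in sequence with their matching few-shot settings, a small shell loop over (task, n-shot) pairs is one option; a sketch, with the pairs taken from the list above and the model and paths left as placeholders:

```bash
# (task, num_fewshot) pairs taken from the task list above
for spec in "nq_open 64" "triviaqa 64" "truthfulqa_mc1 0" "halueval_qa 0" "xsum 2"; do
  set -- $spec  # split the pair into $1 (task) and $2 (n-shot)
  python main.py \
    --model=hf-causal-experimental \
    --model_args="pretrained=<your_model>,revision=<your_model_revision>" \
    --tasks="$1" \
    --num_fewshot="$2" \
    --batch_size=1 \
    --output_path="results/$1.json"
done
```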
+
+## Icons
+- {ModelType.PT.to_str(" : ")} model: new, base models, trained on a given corpus
+- {ModelType.FT.to_str(" : ")} model: pretrained models fine-tuned on more data
+Specific fine-tune subcategories (more adapted to chat):
+- {ModelType.IFT.to_str(" : ")} model: instruction fine-tunes, which are models fine-tuned specifically on datasets of task instructions
+- {ModelType.RL.to_str(" : ")} model: reinforcement fine-tunes, which usually change the model loss a bit with an added policy.
+If there is no icon, we have not uploaded the information on the model yet; feel free to open an issue with the model information!
 """
 
 FAQ_TEXT = """