Metrics for hallucination detection in summarization
The metric for summarization currently reports aggregated ROUGE scores, which measure n-gram overlap between the generated and the reference summary. However, overlap alone does not necessarily capture hallucinations (see the quick illustration below the list).
More recent work has used QA- and NLI-based methods to detect hallucinations in abstractive summarization, and we can also report scores using these approaches:
- QAFactEval (https://aclanthology.org/2022.naacl-main.187/)
- TrueTeacher (https://aclanthology.org/2023.emnlp-main.127/)
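For context, here is a minimal sketch of what the current overlap-based metric reports, assuming the harness wraps the `evaluate` library's ROUGE implementation (the example strings are made up):

```python
# Minimal sketch of the current overlap-based metric, assuming the harness
# uses the `evaluate` library's ROUGE implementation.
import evaluate

rouge = evaluate.load("rouge")

predictions = ["The president visited Paris on Monday."]  # generated summary
references = ["The president visited Paris."]             # reference summary

# Aggregated F1 scores: rouge1, rouge2, rougeL, rougeLsum.
scores = rouge.compute(predictions=predictions, references=references)
print(scores)
# Scores stay high even if "on Monday" is completely unsupported by the
# source document -- overlap alone does not flag the hallucinated detail.
```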
TrueTeacher is available on the Hugging Face Hub as google/t5_11b_trueteacher_and_anli.
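A rough sketch of how we could score a (document, summary) pair with that checkpoint. My understanding from the model card is that the input format is `premise: <document> hypothesis: <summary>` and the model generates "1" for factually consistent and "0" otherwise, so please double-check that before wiring it into the harness:

```python
# Sketch only: input format and output labels assumed from the model card.
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "google/t5_11b_trueteacher_and_anli"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

document = "The cat sat on the mat."     # source document (illustrative)
summary = "A cat was sitting on a mat."  # generated summary to check

input_ids = tokenizer(
    f"premise: {document} hypothesis: {summary}",
    return_tensors="pt",
    truncation=True,
).input_ids
outputs = model.generate(input_ids, max_new_tokens=2)
label = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(label)  # expected: "1" (consistent) or "0" (inconsistent)
```

In practice we would probably want the probability of the "1" token rather than a decoded label, but generation keeps the sketch simple.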
Yeah, one problem is that I'm not sure the harness unloads the model before starting the evaluation (I can check), so the 11B model might not fit into memory -- let's see!
Is there anything smaller we can use, @zorik @rohitsaxena?
Another work, SCALE (EMNLP 2023), supports relatively smaller models; FLAN-T5-large is one of the recommended models for best results.
https://github.com/asappresearch/scale-score
Note: empirically, I found its performance is not on par with TrueTeacher's.
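For completeness, the core idea (as I understand it) is to read off how much probability mass FLAN-T5 puts on "Yes" vs "No" for an entailment-style question about each summary claim. The snippet below is only a sketch of that idea with a prompt I made up; it is not the scale-score library's actual API:

```python
# Sketch of a FLAN-T5-based consistency score; prompt wording is illustrative,
# not the one used by the scale-score library.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "google/flan-t5-large"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

def consistency_score(document: str, summary_sentence: str) -> float:
    prompt = (
        f"{document}\n\nQuestion: Does the text above support the claim "
        f'"{summary_sentence}"? Yes or No?'
    )
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        # Logits for the first decoded token.
        logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, -1]
    yes_id = tokenizer("Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer("No", add_special_tokens=False).input_ids[0]
    probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()  # probability mass on "Yes" relative to "No"

print(consistency_score("The cat sat on the mat.", "A cat was on a mat."))
```

One would then aggregate such per-sentence scores over the summary (e.g. by averaging or taking the minimum).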