- MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities — arXiv:2408.00765, published Aug 1
- Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent — arXiv:2407.21646, published Jul 31
- LLM-DetectAIve: a Tool for Fine-Grained Machine-Generated Text Detection — arXiv:2408.04284, published Aug 8
- Training Language Models on the Knowledge Graph: Insights on Hallucinations and Their Detectability — arXiv:2408.07852, published Aug 14
- GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering — arXiv:2409.06595, published Sep 10
- MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications — arXiv:2409.07314, published Sep 11
- Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse — arXiv:2409.11242, published Sep 17
- TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles — arXiv:2410.05262, published Oct 7