PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation
Abstract
We introduce a novel benchmark for evaluating the role-playing capabilities of language models. Our approach leverages language models themselves to emulate users in dynamic, multi-turn conversations and to assess the resulting dialogues. The framework consists of three main components: a player model assuming a specific character role, an interrogator model simulating user behavior, and a judge model evaluating conversation quality. We conducted experiments comparing automated evaluations with human annotations to validate our approach, demonstrating strong correlations across multiple criteria. This work provides a foundation for a robust and dynamic evaluation of model capabilities in interactive scenarios.
Community
Hey!
Here is the benchmark page: https://ilyagusev.github.io/ping_pong_bench/en_v2
And the GitHub repo: https://github.com/IlyaGusev/ping_pong_bench/
I hope the benchmark will be helpful to both RP model developers and users.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- IQA-EVAL: Automatic Evaluation of Human-Model Interactive Question Answering (2024)
- Identity-Driven Hierarchical Role-Playing Agents (2024)
- BEYOND DIALOGUE: A Profile-Dialogue Alignment Framework Towards General Role-Playing Language Model (2024)
- WildFeedback: Aligning LLMs With In-situ User Interactions And Feedback (2024)
- The Oscars of AI Theater: A Survey on Role-Playing with Language Models (2024)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper