Stick To Your Role! Leaderboard

{{ table_html|safe }}
Motivation

Benchmarks usually compare models with MANY QUESTIONS from A SINGLE MINIMAL CONTEXT, e.g. as multiple-choice questions. This kind of evaluation says little about LLMs' behavior in deployment, where they are exposed to new contexts (especially given the highly context-dependent nature of LLMs). We argue that CONTEXT-DEPENDENCE can be seen as a PROPERTY of LLMs: a dimension of LLM comparison alongside others like size, speed, or knowledge. We evaluate LLMs by asking the SAME QUESTIONS from MANY DIFFERENT CONTEXTS.

LLMs are often used to simulate personas and populations. We study the coherence of simulated populations across different contexts (conversations on different topics). To do that, we leverage psychological methodology for studying the interpersonal stability of personal value expression in those simulated populations. We adopt the Schwartz Theory of Basic Personal Values, which defines 10 values: Self-Direction, Stimulation, Hedonism, Achievement, Power, Security, Conformity, Tradition, Benevolence, and Universalism. To evaluate their expression, we use the associated questionnaires: the PVQ-40 and the SVS.
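For intuition, here is a minimal Python sketch of how such a questionnaire can be scored: each item keys to one of the 10 values, a value's score is the mean rating of its items, and scores are centered on the respondent's mean rating (the MRAT correction commonly recommended for Schwartz questionnaires). The item-to-value mapping below is illustrative, not the real PVQ-40 key.

    from statistics import mean

    # Illustrative item-to-value key (NOT the real PVQ-40 key): each
    # questionnaire item is assigned to exactly one of the 10 values.
    ITEM_TO_VALUE = {
        1: "Self-Direction", 2: "Power", 3: "Universalism",
        4: "Achievement", 5: "Security",  # ... one entry per item
    }

    def score_profile(responses: dict[int, int]) -> dict[str, float]:
        """Map raw Likert responses (item -> 1..6 rating) to a 10-value
        profile: the mean rating of each value's items, centered on the
        respondent's mean rating over all items (MRAT correction)."""
        mrat = mean(responses.values())
        per_value: dict[str, list[int]] = {}
        for item, rating in responses.items():
            per_value.setdefault(ITEM_TO_VALUE[item], []).append(rating)
        return {value: mean(ratings) - mrat for value, ratings in per_value.items()}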

Administering a questionnaire in context to a simulated persona

To evaluate stability at the population level, we first need to be able to evaluate the value profile expressed by a simulated individual in a specific context (conversation topic). To do that, we use the following procedure (sketched in code after the list):

  1. The Tested model is instructed to simulate a persona
  2. A separate model instance, the Interlocutor, is instructed to simulate a “human using a chatbot”
  3. A conversation topic is induced by manually setting the Interlocutor’s first message (e.g. “Tell me a joke”)
  4. A conversation is simulated
  5. A question from the questionnaire is set as the Interlocutor’s last message, and the Tested model’s response is recorded (this is repeated for every item in the questionnaire)
  6. The questionnaire is scored to obtain scores for the 10 personal values
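The sketch below shows one way steps 1-5 can be wired together. It is an illustration, not the project's actual implementation: tested and interlocutor stand for any two chat-model callables, and the system prompts are placeholders.

    from typing import Callable

    Chat = list[tuple[str, str]]        # (role, content) message pairs
    Reply = Callable[[str, Chat], str]  # (system prompt, history) -> next message

    def administer(tested: Reply, interlocutor: Reply,
                   persona: str, topic_message: str,
                   items: list[str], n_turns: int = 3) -> list[str]:
        """Run steps 1-5 for one persona in one context; returns the
        Tested model's raw answer to every questionnaire item."""
        tested_system = f"You are {persona}. Stay in character."  # step 1 (placeholder prompt)
        interloc_system = "Simulate a human using a chatbot."     # step 2 (placeholder prompt)
        history: Chat = [("user", topic_message)]                 # step 3: induce the topic
        for _ in range(n_turns):                                  # step 4: simulate the conversation
            history.append(("assistant", tested(tested_system, history)))
            history.append(("user", interlocutor(interloc_system, history)))
        answers = []
        for item in items:                                        # step 5: probe item by item
            probe = history + [("user", item)]                    # item as the Interlocutor's last message
            answers.append(tested(tested_system, probe))
        return answers                                            # scored in step 6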
Contexts

We aim to score the expressed value profile of each simulated persona in different contexts. More precisely, a population (50 personas) is evaluated in a context chunk (50 topics: one per persona). The population in one context chunk is then compared to the same population in another context chunk. Here are the considered context chunks:
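Concretely, evaluating a population in one chunk could look like the following sketch, which builds on the administer and score_profile sketches above; parse is a hypothetical helper that turns a free-text answer into a 1..6 Likert rating (a nontrivial step in practice).

    def evaluate_chunk(tested: Reply, interlocutor: Reply,
                       parse: Callable[[str], int],
                       personas: list[str], topics: list[str],
                       items: list[str]) -> list[dict[str, float]]:
        """Evaluate a population in one context chunk: persona i converses
        on topic i and then answers the questionnaire. Returns one
        10-value profile per persona."""
        assert len(personas) == len(topics)  # one topic per persona
        profiles = []
        for persona, topic in zip(personas, topics):
            answers = administer(tested, interlocutor, persona, topic, items)
            responses = {i + 1: parse(a) for i, a in enumerate(answers)}
            profiles.append(score_profile(responses))
        return profiles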

Metrics

We evaluate the following metrics (+ denotes higher is better; - denotes lower is better):
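As one example of the kind of stability metric involved, an interpersonal (rank-order) stability score can be computed by Spearman-correlating personas' per-value scores between two context chunks and averaging over the 10 values. This is a sketch under that assumption, using the index-aligned profiles produced by evaluate_chunk above.

    from scipy.stats import spearmanr

    def rank_order_stability(chunk_a: list[dict[str, float]],
                             chunk_b: list[dict[str, float]]) -> float:
        """Interpersonal (rank-order) stability of a population between
        two context chunks: for each value, Spearman-correlate personas'
        scores in chunk A with their scores in chunk B (profiles are
        index-aligned by persona), then average over the 10 values."""
        rhos = []
        for value in chunk_a[0]:
            a = [profile[value] for profile in chunk_a]
            b = [profile[value] for profile in chunk_b]
            rhos.append(spearmanr(a, b).correlation)
        return sum(rhos) / len(rhos)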


If you found this project useful, please cite our related paper:

@article{kovavc2024stick,
  title={Stick to your Role! Stability of Personal Values Expressed in Large Language Models},
  author={Kova{\v{c}}, Grgur and Portelas, R{\'e}my and Sawayama, Masataka and Dominey, Peter Ford and Oudeyer, Pierre-Yves},
  journal={arXiv preprint arXiv:2402.14846},
  year={2024}
}