Stick To Your Role! Leaderboard
{{ table_html|safe }}
Motivation
Benchmarks usually compare models with MANY QUESTIONS from A SINGLE MINIMAL CONTEXT, e.g. as multiple-choice questions.
This kind of evaluation provides little information about LLMs' behavior in deployment, where they are exposed to new contexts (especially given LLMs' highly context-dependent nature).
We argue that CONTEXT-DEPENDENCE can be seen as a PROPERTY of LLMs: a dimension of LLM comparison alongside others like size, speed, or knowledge.
We evaluate LLMs by asking the SAME QUESTIONS from MANY DIFFERENT CONTEXTS.
LLMs are often used to simulate personas and populations.
We study the coherence of simulated populations over different contexts (conversations on different topics).
To do that, we leverage psychological methodology to study the interpersonal stability of personal value expression in those simulated populations.
We adopt the Schwartz Theory of Basic Personal Values, which defines 10 values: Self-Direction, Stimulation, Hedonism, Achievement, Power, Security, Conformity, Tradition, Benevolence, and Universalism.
To evaluate their expression, we use the associated questionnaires: PVQ-40 and SVS.
Administering a questionnaire in context to a simulated persona
To evaluate stability at the population level, we need to be able to score the value profile expressed by a simulated individual in a specific context (conversation topic). To do that, we use the following procedure:
- The Tested model is instructed to simulate a persona
- A separate model instance - the Interlocutor - is instructed to simulate a “human using a chatbot”
- A conversation topic is induced by manually setting the Interlocutor's first message (e.g. “Tell me a joke”)
- A conversation is simulated
- A question from the questionnaire is set as the Interlocutor's last message, and the Tested model's response is recorded (this is repeated for every item in the questionnaire)
- The questionnaire is scored to obtain scores for the 10 personal values
Contexts
We aim to score the expressed value profile of each simulated persona in different contexts.
More precisely, a population (50 personas) is evaluated in a context chunk (50 topics: one per persona).
The population in one context chunk is then compared to the same population in another context chunk.
Here are the considered context chunks:
- no_conv : no conversation is simulated; the questions from the PVQ-40 questionnaire are given directly
- no_conv_svs : no conversation is simulated; the questions from the SVS questionnaire are given directly
- chunk_0 - chunk_4 : 50 reddit posts are used as the initial Interlocutor model messages (one per persona). chunk_0 contains the longest posts, chunk_4 the shortest
- chess : "1. e4" is given as the initial message to all personas, but for each persona the Interlocutor model is instructed to simulate a different persona (instead of a human user)
- grammar : like chess, but "Can you check this sentence for grammar? \n Whilst Jane was waiting to meet hers friend their nose started bleeding." is given as the initial message.
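The chunks above can be summarized as a configuration table; the dict below is a hypothetical illustration (the chunk names mirror the list, but the field names are our own invention):

```python
# Illustrative configuration of the context chunks described above.
CONTEXT_CHUNKS = {
    "no_conv":     {"conversation": False, "questionnaire": "PVQ-40"},
    "no_conv_svs": {"conversation": False, "questionnaire": "SVS"},
    # chunk_0 holds the longest reddit posts, chunk_4 the shortest.
    **{f"chunk_{i}": {"conversation": True,
                      "first_message": "reddit post",
                      "interlocutor_role": "human user"} for i in range(5)},
    "chess":   {"conversation": True, "first_message": "1. e4",
                "interlocutor_role": "persona"},
    "grammar": {"conversation": True,
                "first_message": "Can you check this sentence for grammar? ...",
                "interlocutor_role": "persona"},
}
print(len(CONTEXT_CHUNKS))  # 9 context chunks in total
```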
Metrics
We evaluate the following metrics (+ denotes higher is better; - denotes lower is better):
- RO Stability (+) - Average Rank-Order stability between each pair of context chunks. More details are given in the per-model pages (e.g. gpt-4o-0513) and in our paper
- Rank Distance (-) - Average distance between the theoretical and the observed order of values in a circular space. More details are given in the per-model pages (e.g. gpt-4o-0513)
- CFI (+) - a common Validity metric
- SRMR (-) - a common Validity metric
- RMSEA (-) - a common Validity metric
- Cronbach alpha (+) - a common Reliability metric
- Ordinal (Win rate) (+) - each context pair and each metric is treated as a game between models; the metric shows the average win rate over all such games
- Cardinal (Score) (+) - the average over all context pairs and metrics (with lower-is-better metrics inverted)
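To make the headline metric concrete, here is a minimal sketch of RO Stability, assuming it is the Spearman correlation of the personas' per-value scores between two context chunks, averaged over all chunk pairs and values (the paper's exact computation may differ; `rankdata`, `spearman`, and `ro_stability` are our own helper names):

```python
from itertools import combinations

def rankdata(xs):
    # Average (1-based) ranks, with ties sharing their mean rank.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    # Spearman correlation = Pearson correlation of the ranks.
    rx, ry = rankdata(x), rankdata(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

def ro_stability(scores):
    """scores: {chunk_name: {value_name: [one score per persona]}}.

    For every pair of context chunks and every value, correlate the
    personas' scores across the two chunks; average all correlations.
    """
    corrs = []
    for a, b in combinations(list(scores), 2):
        for v in scores[a]:
            corrs.append(spearman(scores[a][v], scores[b][v]))
    return sum(corrs) / len(corrs)
```

A stability of 1.0 means the personas keep the same relative ordering on every value across contexts; values near 0 mean the ordering is essentially reshuffled by a change of topic.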
If you found this project useful, please cite our related paper:
@article{kovavc2024stick,
  title={Stick to your Role! Stability of Personal Values Expressed in Large Language Models},
  author={Kova{\v{c}}, Grgur and Portelas, R{\'e}my and Sawayama, Masataka and Dominey, Peter Ford and Oudeyer, Pierre-Yves},
  journal={arXiv preprint arXiv:2402.14846},
  year={2024}
}