|
<!DOCTYPE html> |
|
<html lang="en"> |
|
<head> |
|
<meta charset="UTF-8"> |
|
<meta name="viewport" content="width=device-width, initial-scale=1.0"> |
|
<title>Stick To Your Role! Leaderboard</title> |
|
|
|
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap@5.1.3/dist/css/bootstrap.min.css">
|
|
|
<link rel="stylesheet" href="https://cdn.datatables.net/1.11.5/css/dataTables.bootstrap5.min.css"> |
|
|
|
<style> |
|
body { |
|
background-color: #f8f9fa; |
|
font-family: 'Arial', sans-serif; |
|
} |
|
.container { |
|
max-width: 1200px; |
|
margin: auto; |
|
padding: 20px; |
|
background: #fff; |
|
border-radius: 8px; |
|
box-shadow: 0 4px 8px rgba(0,0,0,0.1); |
|
} |
|
h1 { |
|
color: #333; |
|
text-align: center; |
|
} |
|
h2 { |
|
color: #333; |
|
margin-top: 30px; |
|
text-align: center; |
|
} |
|
.table-responsive { |
|
margin-top: 20px; |
|
} |
|
table { |
|
border-collapse: separate; |
|
border-spacing: 0; |
|
font-size: 14px; |
|
width: 100%; |
|
border: none; |
|
} |
|
table thead th { |
|
background-color: #610b5d; |
|
color: white; |
|
border: 1px solid #dee2e6; |
|
text-align: left; |
|
} |
|
table tbody tr { |
|
background-color: #fff; |
|
box-shadow: 0 2px 4px rgba(0,0,0,0.1); |
|
} |
|
table tbody tr:hover { |
|
background-color: #f1f1f1; |
|
} |
|
table td, table th { |
|
padding: 10px; |
|
border: 1px solid #dee2e6; |
|
} |
|
table th:first-child { |
|
border-top-left-radius: 10px; |
|
} |
|
table th:last-child { |
|
border-top-right-radius: 10px; |
|
} |
|
.section{ |
|
padding-top: 19px; |
|
text-align: left; |
|
} |
|
|
|
.section p { |
|
padding-left: 150px; |
|
padding-right: 150px; |
|
text-indent: 2em; |
|
margin: auto; |
|
margin-bottom: 10px; |
|
text-align: left; |
|
} |
|
|
|
.section ol,ul { |
|
padding-left: 150px; |
|
padding-right: 150px; |
|
margin: auto; |
|
margin-bottom: 20px; |
|
margin-left: 50px; |
|
text-align: left; |
|
margin-top: 0px; |
|
} |
|
|
|
.citation-section { |
|
width: 100%; |
|
margin-top: 50px; |
|
text-align: center; |
|
} |
|
.citation-box { |
|
background-color: #f8f9fa; |
|
border: 1px solid #dee2e6; |
|
border-radius: 8px; |
|
padding: 10px; |
|
margin-top: 5px; |
|
font-size: 15px; |
|
text-align: left; |
|
font-family: 'Courier New', Courier, monospace; |
|
white-space: pre; |
|
} |
|
|
|
.image-container { |
|
width: 100%; |
|
margin-bottom: 40px; |
|
} |
|
.image-container img { |
|
width: 90%; |
|
max-width: 650px; |
|
height: auto; |
|
display: block; |
|
margin: auto; |
|
} |
|
.section-title { |
|
font-size: 24px; |
|
font-weight: bold; |
|
text-align: center; |
|
margin-bottom: 40px; |
|
padding: 20px; |
|
background-color: #610b5d; |
|
color: #fff; |
|
border-radius: 15px; |
|
} |
|
.back-button { |
|
text-align: center; |
|
margin-top: 50px; |
|
} |
|
.custom-button { |
|
background-color: #610b5d; |
|
color: #fff; |
|
border-radius: 15px; |
|
padding: 10px 20px; |
|
font-size: 18px; |
|
text-decoration: none; |
|
} |
|
.custom-button:hover { |
|
background-color: #812b7d; |
|
color: #fff; |
|
} |
|
</style> |
|
</head> |
|
<body> |
|
<div class="container"> |
|
<h1 class="mt-5">Stick To Your Role! Leaderboard</h1> |
|
<div class="table-responsive"> |
|
|
|
{{ table_html|safe }} |
|
</div> |
|
<div class="section"> |
|
<div class="section-title">Motivation</div> |
|
<p> |
|
Benchmarks usually compare models with <b>MANY QUESTIONS</b> asked from <b>A SINGLE MINIMAL CONTEXT</b>, e.g. as multiple-choice questions.
This kind of evaluation says little about LLMs' behavior in deployment, where they are exposed to new contexts (especially given the highly context-dependent nature of LLMs).
We argue that <b>CONTEXT-DEPENDENCE</b> can be seen as a <b>PROPERTY of LLMs</b>: a dimension of LLM comparison alongside others such as size, speed, or knowledge.
We evaluate LLMs by asking the <b>SAME QUESTIONS</b> from <b>MANY DIFFERENT CONTEXTS</b>.
|
</p> |
|
<p> |
|
LLMs are often used to simulate personas and populations.
We study the coherence of simulated populations across different contexts (conversations on different topics).
To do so, we leverage psychological methodology to study the interpersonal stability of personal value expression in those simulated populations.
We adopt the Schwartz Theory of Basic Personal Values, which defines 10 values: Self-Direction, Stimulation, Hedonism, Achievement, Power, Security, Conformity, Tradition, Benevolence, and Universalism.
To evaluate their expression, we use the associated questionnaires: PVQ-40 and SVS.
|
</p> |
|
</div> |
|
<div class="section"> |
|
<div class="section-title">Administering a questionnaire in context to a simulated persona</div> |
|
<p>To evaluate stability at the population level, we need to be able to score the <b>value profile</b> expressed by a <b>simulated individual</b> in a <b>specific context</b> (conversation topic). To do that we use the following procedure:</p>
|
<ol> |
|
<li> The Tested model is instructed to simulate a persona</li>
<li> A separate model instance - the Interlocutor - is instructed to simulate a “human using a chatbot”</li>
<li> A conversation topic is induced by manually setting the Interlocutor’s first message (e.g. “Tell me a joke”)</li>
<li> A conversation is simulated</li>
<li> A question from the questionnaire is set as the Interlocutor’s last message, and the Tested model’s response is recorded (this is repeated for every item in the questionnaire)</li>
<li> The questionnaire is scored to obtain scores for the 10 personal values</li>
|
</ol> |
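The scoring in the last step can be sketched as follows. This is an illustrative sketch only: the item-to-value key below is made up, not the real PVQ-40 scoring key, and real scoring additionally applies corrections (see the paper for the actual procedure).

```javascript
// Illustrative sketch of step 6: turning questionnaire responses into
// per-value scores. ITEM_KEY is a HYPOTHETICAL item->value mapping,
// not the actual PVQ-40 key.
const ITEM_KEY = { 1: 'Power', 2: 'Power', 3: 'Achievement' };

function scoreQuestionnaire(responses) {
  // responses: { itemNumber: Likert rating }
  const sums = {}, counts = {};
  for (const [item, rating] of Object.entries(responses)) {
    const value = ITEM_KEY[item];
    if (!value) continue; // ignore items not in the key
    sums[value] = (sums[value] || 0) + rating;
    counts[value] = (counts[value] || 0) + 1;
  }
  // Score for each value = mean rating over its items
  const scores = {};
  for (const value of Object.keys(sums)) {
    scores[value] = sums[value] / counts[value];
  }
  return scores;
}
```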
|
<div class="image-container"> |
|
<a href="{{ url_for('static', filename='figures/admin_questionnaire.svg') }}" target="_blank"> |
|
<img src="{{ url_for('static', filename='figures/admin_questionnaire.svg') }}" alt="Structure"> |
|
</a> |
|
</div> |
|
</div> |
|
<div class="section"> |
|
<div class="section-title">Contexts</div> |
|
<p> |
|
We aim to score the expressed value profile of each simulated persona in different contexts.
More precisely, a population (50 personas) is evaluated in a context chunk (50 topics: one per persona).
The population in one context chunk is then compared to the same population in another context chunk.
Here are the considered context chunks:
|
</p> |
|
<ul> |
|
<li> <b> no_conv </b>: no conversation is simulated; the questions from the PVQ-40 questionnaire are given directly </li>
|
<li> <b> no_conv_svs </b>: no conversation is simulated; the questions from the SVS questionnaire are given directly </li>
|
<li> <b> chunk_0-chunk_4 </b>: 50 Reddit posts are used as the initial Interlocutor messages (one per persona); chunk_0 contains the longest posts, chunk_4 the shortest </li>
|
<li> <b> chess </b>: "1. e4" is given as the initial message to all personas, but for each persona the Interlocutor model is instructed to simulate a different persona (instead of a human user) </li> |
|
<li> <b> grammar </b>: like chess, but "Can you check this sentence for grammar? \n Whilst Jane was waiting to meet hers friend their nose started bleeding." is given as the initial message </li>
|
</ul> |
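Since the stability metrics below are averaged over every pair of context chunks, the nine chunks listed above yield 9 × 8 / 2 = 36 pairwise comparisons. A minimal sketch of the pair enumeration:

```javascript
// Enumerate every unordered pair of context chunks; stability is
// computed per pair and then averaged over all pairs.
const CHUNKS = ['no_conv', 'no_conv_svs', 'chunk_0', 'chunk_1',
                'chunk_2', 'chunk_3', 'chunk_4', 'chess', 'grammar'];

function chunkPairs(chunks) {
  const pairs = [];
  for (let i = 0; i < chunks.length; i++) {
    for (let j = i + 1; j < chunks.length; j++) {
      pairs.push([chunks[i], chunks[j]]);
    }
  }
  return pairs;
}
```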
|
</div> |
|
<div class="section"> |
|
<div class="section-title">Metrics</div> |
|
<p>We evaluate the following metrics (+ denotes that higher is better; - denotes that lower is better):</p>
|
<ul> |
|
<li> <b> RO Stability (+) </b> - Average Rank-Order stability between each pair of context chunks. More details are given in the per-model pages (e.g. <a href="model/gpt-4o-0513">gpt-4o-0513</a>) and in our <a href="https://arxiv.org/abs/2402.14846">paper</a> </li> |
|
<li> <b> Rank Distance (-) </b> - Average distance between the theoretical and the observed order of values in a circular space. More details are given in the per-model pages (e.g. <a href="model/gpt-4o-0513">gpt-4o-0513</a>) </li> |
|
<li> <b> CFI (+) </b> - a common Validity metric </li> |
|
<li> <b> SRMR (-) </b> - a common Validity metric </li> |
|
<li> <b> RMSEA (-) </b> - a common Validity metric </li> |
|
<li> <b> Cronbach alpha (+) </b> - a common Reliability metric </li> |
|
<li> <b> Ordinal (Win rate) (+) </b> - each context pair and each metric is considered as a game between models; the metric shows the average win rate over all such games</li>
|
<li> <b> Cardinal (Score) (+) </b> - the average over all context pairs and metrics (with descending metrics inverted so that higher is better) </li>
|
</ul> |
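The Cardinal aggregation can be sketched as follows. This is a simplified illustration under stated assumptions: the metric grouping matches the list above, but the inversion scheme (1 - x, assuming values normalized to [0, 1]) is a placeholder for whatever normalization the actual scoring uses.

```javascript
// Illustrative sketch of the Cardinal (Score) aggregation: average a
// model's per-metric results, inverting descending metrics so that
// every term points the same way. The "1 - value" inversion assumes
// metrics normalized to [0, 1]; this is a simplifying assumption.
const ASCENDING  = new Set(['RO Stability', 'CFI', 'Cronbach alpha']); // higher is better
const DESCENDING = new Set(['Rank Distance', 'SRMR', 'RMSEA']);        // lower is better

function cardinalScore(metricValues) {
  // metricValues: { metricName: value in [0, 1] }
  let total = 0, n = 0;
  for (const [name, value] of Object.entries(metricValues)) {
    if (ASCENDING.has(name)) total += value;
    else if (DESCENDING.has(name)) total += 1 - value; // invert descending metrics
    else continue; // ignore unknown metrics
    n += 1;
  }
  return n ? total / n : 0;
}
```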
|
</div> |
|
<div class="back-button"> |
|
<a href="{{ url_for('index') }}" class="custom-button mt-3">Back</a> |
|
</div> |
|
<div class="citation-section"> |
|
<p>If you found this project useful, please cite our related paper:</p> |
|
<div class="citation-box" id="citation-text"> |
|
@article{kovavc2024stick, |
|
title={Stick to your Role! Stability of Personal Values Expressed in Large Language Models}, |
|
author={Kova{\v{c}}, Grgur and Portelas, R{\'e}my and Sawayama, Masataka and Dominey, Peter Ford and Oudeyer, Pierre-Yves}, |
|
journal={arXiv preprint arXiv:2402.14846}, |
|
year={2024} |
|
} |
|
</div> |
|
</div> |
|
</div> |
|
|
|
|
|
<script src="https://code.jquery.com/jquery-3.6.0.min.js"></script> |
|
|
|
<script src="https://cdn.jsdelivr.net/npm/bootstrap@5.1.3/dist/js/bootstrap.bundle.min.js"></script>
|
|
|
<script src="https://cdn.datatables.net/1.11.5/js/jquery.dataTables.min.js"></script> |
|
<script src="https://cdn.datatables.net/1.11.5/js/dataTables.bootstrap5.min.js"></script> |
|
|
|
<script> |
|
$(document).ready(function() {
    const table = $('table').DataTable({
        "paging": false,
        "info": false,
        "columnDefs": [
            // The first column is a rank counter, so it should be
            // neither sortable nor searchable.
            { "orderable": false, "targets": 0 },
            { "searchable": false, "targets": 0 }
        ],
        // Sort by the third column in descending order by default.
        "order": [[ 2, 'desc' ]],
        "drawCallback": function(settings) {
            // Renumber the rank column after every sort/filter so it
            // always reads 1..N in the currently displayed order.
            var api = this.api();
            api.column(0, {order: 'applied'}).nodes().each(function(cell, i) {
                cell.innerHTML = i + 1;
            });
        }
    });
});
|
|
|
</script> |
|
</body> |
|
</html> |
|
|