<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Stick To Your Role! Leaderboard</title>
<!-- Include Bootstrap CSS for styling -->
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/5.1.3/css/bootstrap.min.css">
<!-- Include DataTables CSS -->
<link rel="stylesheet" href="https://cdn.datatables.net/1.11.5/css/dataTables.bootstrap5.min.css">
<!-- Custom CSS for additional styling -->
<style>
body {
background-color: #f8f9fa;
font-family: 'Arial', sans-serif;
}
.container {
max-width: 1200px; /* Limit the width of the container */
margin: auto; /* Center the container */
padding: 20px; /* Add some padding */
background: #fff;
border-radius: 8px;
box-shadow: 0 4px 8px rgba(0,0,0,0.1);
}
h1 {
color: #333;
text-align: center;
}
h2 {
color: #333;
margin-top: 30px;
text-align: center;
}
.table-responsive {
margin-top: 20px;
}
table {
border-collapse: separate;
border-spacing: 0;
font-size: 14px; /* Reduce the font size */
width: 100%;
border: none; /* Remove any default border */
}
table thead th {
background-color: #610b5d;
color: white;
border: 1px solid #dee2e6;
text-align: left;
}
table tbody tr {
background-color: #fff;
box-shadow: 0 2px 4px rgba(0,0,0,0.1);
}
table tbody tr:hover {
background-color: #f1f1f1;
}
table td, table th {
padding: 10px; /* Reduce padding */
border: 1px solid #dee2e6;
}
table th:first-child {
border-top-left-radius: 10px;
}
table th:last-child {
border-top-right-radius: 10px;
}
.section{
padding-top: 19px;
text-align: left;
}
.section p {
padding-left: 150px;
padding-right: 150px;
text-indent: 2em;
margin: auto;
margin-bottom: 10px;
text-align: left;
}
.section ol, .section ul {
padding-left: 150px;
padding-right: 150px;
margin: auto;
margin-bottom: 20px;
margin-left: 50px;
text-align: left;
margin-top: 0px;
}
.citation-section {
width: 100%;
margin-top: 50px;
text-align: center;
}
.citation-box {
background-color: #f8f9fa;
border: 1px solid #dee2e6;
border-radius: 8px;
padding: 10px;
margin-top: 5px;
font-size: 15px;
text-align: left;
font-family: 'Courier New', Courier, monospace;
white-space: pre;
}
.image-container {
width: 100%;
margin-bottom: 40px;
}
.image-container img {
width: 90%;
max-width: 650px;
height: auto;
display: block;
margin: auto;
}
.section-title {
font-size: 24px;
font-weight: bold;
text-align: center;
margin-bottom: 40px;
padding: 20px; /* Add padding for more margin around text */
background-color: #610b5d;
color: #fff; /* Ensure text is readable on dark background */
border-radius: 15px; /* Rounded edges */
}
.back-button {
text-align: center;
margin-top: 50px;
}
.custom-button {
background-color: #610b5d;
color: #fff; /* Set white text color */
border-radius: 15px; /* Rounded edges */
padding: 10px 20px; /* Padding for the button */
font-size: 18px; /* Increase font size */
text-decoration: none; /* Remove underline */
}
.custom-button:hover {
background-color: #812b7d;
color: #fff;
}
</style>
</head>
<body>
<div class="container">
<h1 class="mt-5">Stick To Your Role! Leaderboard</h1>
<div class="table-responsive">
<!-- Render the table HTML here -->
{{ table_html|safe }}
</div>
<div class="section">
<div class="section-title">Motivation</div>
<p>
Benchmarks usually compare models with <b>MANY QUESTIONS</b> asked from <b>A SINGLE MINIMAL CONTEXT</b>, e.g. as multiple-choice questions.
This kind of evaluation says little about how LLMs behave in deployment, where they are exposed to new contexts (especially given the highly context-dependent nature of LLMs).
We argue that <b>CONTEXT-DEPENDENCE</b> can be seen as a <b>PROPERTY of LLMs</b>: a dimension of LLM comparison alongside others like size, speed, or knowledge.
We evaluate LLMs by asking the <b> SAME QUESTIONS </b> from <b> MANY DIFFERENT CONTEXTS </b>.
</p>
<p>
LLMs are often used to simulate personas and populations.
We study the coherence of simulated populations over different contexts (conversations on different topics).
To do so, we leverage psychological methodology to study the interpersonal stability of personal value expression in those simulated populations.
We adopt the Schwartz Theory of Basic Personal Values, which defines 10 values: Self-Direction, Stimulation, Hedonism, Achievement, Power, Security, Conformity, Tradition, Benevolence, and Universalism.
To evaluate their expression, we use the associated questionnaires: PVQ-40 and SVS.
</p>
</div>
<div class="section">
<div class="section-title">Administering a questionnaire in context to a simulated persona</div>
<p>To evaluate stability at the population level, we need to be able to evaluate the <b>value profile</b> expressed by a <b>simulated individual</b> in a <b>specific context</b> (conversation topic). To do so, we use the following procedure (a simplified code sketch is given below the figure):</p>
<ol>
<li> The Tested model is instructed to simulate a persona</li>
<li> A separate model instance - the Interlocutor - is instructed to simulate a “human using a chatbot”</li>
<li> A conversation topic is induced by manually setting the Interlocutor's first message (e.g. "Tell me a joke")</li>
<li> A conversation is simulated</li>
<li> A question from the questionnaire is set as the Interlocutor's final message, and the Tested model's response is recorded (this is repeated for every item in the questionnaire)</li>
<li> The questionnaire is scored to obtain scores for the 10 personal values</li>
</ol>
<div class="image-container">
<a href="{{ url_for('static', filename='figures/admin_questionnaire.svg') }}" target="_blank">
<img src="{{ url_for('static', filename='figures/admin_questionnaire.svg') }}" alt="Structure">
</a>
</div>
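<p>
Below is a minimal sketch of this loop in Python. It is illustrative only, not the leaderboard's implementation: the chat interface (a <b>Generate</b> callable mapping a list of messages to a reply), the system prompts, and the number of turns are placeholder assumptions; only the overall structure mirrors the procedure above.
</p>
<div class="citation-box">
# Minimal sketch of the questionnaire-administration loop described above.
# Illustrative only: the chat interface, prompts, and turn count are placeholders.
from typing import Callable, Dict, List

Message = Dict[str, str]                    # {"role": ..., "content": ...}
Generate = Callable[[List[Message]], str]   # any chat model: messages -> reply


def administer_questionnaire(
    tested_model: Generate,
    interlocutor_model: Generate,
    persona: str,                  # persona the Tested model should simulate
    topic_message: str,            # the Interlocutor's first message, e.g. "Tell me a joke"
    questionnaire_items: List[str],
    n_turns: int = 3,              # length of the simulated conversation
) -> List[str]:
    tested_system = f"You are {persona}. Stay in character."
    interlocutor_system = "Simulate a human user chatting with a chatbot."

    # Steps 1-4: simulate a conversation on the induced topic
    conversation: List[Message] = [{"role": "user", "content": topic_message}]
    for _ in range(n_turns):
        reply = tested_model([{"role": "system", "content": tested_system}] + conversation)
        conversation.append({"role": "assistant", "content": reply})
        # The Interlocutor sees the dialogue with the roles swapped
        swapped = [{"role": "user" if m["role"] == "assistant" else "assistant",
                    "content": m["content"]} for m in conversation]
        user_msg = interlocutor_model([{"role": "system", "content": interlocutor_system}] + swapped)
        conversation.append({"role": "user", "content": user_msg})

    # Step 5: append each questionnaire item as the Interlocutor's final message
    answers = []
    for item in questionnaire_items:
        context = conversation + [{"role": "user", "content": item}]
        answers.append(tested_model([{"role": "system", "content": tested_system}] + context))

    # Step 6: scoring the answers into the 10 value scores is done separately
    return answers
</div>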
</div>
<div class="section">
<div class="section-title">Contexts</div>
<p>
We aim to score the expressed value profile of each simulated persona in different contexts.
More precisely, a population (50 personas) is evaluated in a context chunk (50 topics: one per persona).
The population in one context chunk is then compared to the same population in another context chunk.
The considered context chunks are listed below (a toy sketch of how a chunk pairs personas with topics follows the list):
</p>
<ul>
<li> <b> no_conv </b>: no conversation is simulated; the questions from the PVQ-40 questionnaire are given directly </li>
<li> <b> no_conv_svs </b>: no conversation is simulated; the questions from the SVS questionnaire are given directly </li>
<li> <b> chunk_0 to chunk_4 </b>: 50 Reddit posts are used as the Interlocutor model's initial messages (one per persona); chunk_0 contains the longest posts and chunk_4 the shortest </li>
<li> <b> chess </b>: "1. e4" is given as the initial message for all personas, but for each persona the Interlocutor model is instructed to simulate a different persona (instead of a human user) </li>
<li> <b> grammar </b>: like chess, but "Can you check this sentence for grammar? \n Whilst Jane was waiting to meet hers friend their nose started bleeding." is given as the initial message </li>
</ul>
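<p>
As an illustration only (not the leaderboard's actual data format), a context chunk can be thought of as a mapping from each of the 50 personas to one conversation-opening message; all names and messages below are placeholders.
</p>
<div class="citation-box">
# Toy sketch of the context-chunk structure (placeholder names and messages).
personas = [f"persona_{i}" for i in range(50)]

context_chunks = {
    # questionnaire administered directly, without a simulated conversation
    "no_conv": {p: None for p in personas},
    # one Reddit post per persona (chunk_0 holds the longest posts)
    "chunk_0": {p: f"[reddit post {i}]" for i, p in enumerate(personas)},
    # the same opening move for every persona; the Interlocutor simulates another persona
    "chess": {p: "1. e4" for p in personas},
    "grammar": {p: "Can you check this sentence for grammar? ..." for p in personas},
}
</div>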
</div>
<div class="section">
<div class="section-title">Metrics</div>
<p>We evaluate the following metrics (+ denotes that higher is better; - denotes that lower is better). A simplified sketch of the RO Stability computation is given after the list.</p>
<ul>
<li> <b> RO Stability (+) </b> - Average Rank-Order stability between each pair of context chunks. More details are given in the per-model pages (e.g. <a href="model/gpt-4o-0513">gpt-4o-0513</a>) and in our <a href="https://arxiv.org/abs/2402.14846">paper</a> </li>
<li> <b> Rank Distance (-) </b> - Average distance between the theoretical and the observed order of values in a circular space. More details are given in the per-model pages (e.g. <a href="model/gpt-4o-0513">gpt-4o-0513</a>) </li>
<li> <b> CFI (+) </b> - a common Validity metric </li>
<li> <b> SRMR (-) </b> - a common Validity metric </li>
<li> <b> RMSEA (-) </b> - a common Validity metric </li>
<li> <b> Cronbach alpha (+) </b> - a common Reliability metric </li>
<li> <b> Ordinal (Win rate) (+) </b> - each context pair and each metric is treated as a game between models; this metric shows the average win rate over all such games</li>
<li> <b> Cardinal (Score) (+) </b> - the average over all context pairs and metrics (metrics for which lower is better are inverted) </li>
</ul>
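<p>
The sketch below shows one simple way to compute RO Stability between two context chunks; it approximates the procedure, and the exact definition is given in the paper. It assumes the scores are stored as arrays of shape (number of personas, 10 values) and uses a Spearman correlation as the rank-order correlation; both are assumptions made for illustration.
</p>
<div class="citation-box">
# Rough sketch of Rank-Order stability between context chunks (see the paper
# for the exact definition). For each of the 10 values, correlate the personas'
# scores across the two chunks, then average the 10 correlations.
from itertools import combinations

import numpy as np
from scipy.stats import spearmanr

N_VALUES = 10  # the 10 Schwartz values


def ro_stability(chunk_a: np.ndarray, chunk_b: np.ndarray) -> float:
    """chunk_a, chunk_b: arrays of shape (n_personas, N_VALUES) with value scores."""
    corrs = [spearmanr(chunk_a[:, v], chunk_b[:, v]).correlation for v in range(N_VALUES)]
    return float(np.mean(corrs))


def average_ro_stability(chunks: list) -> float:
    """Average RO stability over every pair of context chunks."""
    pairs = combinations(range(len(chunks)), 2)
    return float(np.mean([ro_stability(chunks[i], chunks[j]) for i, j in pairs]))
</div>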
</div>
<div class="back-button">
<a href="{{ url_for('index') }}" class="custom-button mt-3">Back</a>
</div>
<div class="citation-section">
<p>If you found this project useful, please cite our related paper:</p>
<div class="citation-box" id="citation-text">
@article{kovavc2024stick,
title={Stick to your Role! Stability of Personal Values Expressed in Large Language Models},
author={Kova{\v{c}}, Grgur and Portelas, R{\'e}my and Sawayama, Masataka and Dominey, Peter Ford and Oudeyer, Pierre-Yves},
journal={arXiv preprint arXiv:2402.14846},
year={2024}
}
</div>
</div>
</div>
<!-- Include jQuery -->
<script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>
<!-- Include Bootstrap JS -->
<script src="https://stackpath.bootstrapcdn.com/bootstrap/5.1.3/js/bootstrap.bundle.min.js"></script>
<!-- Include DataTables JS -->
<script src="https://cdn.datatables.net/1.11.5/js/jquery.dataTables.min.js"></script>
<script src="https://cdn.datatables.net/1.11.5/js/dataTables.bootstrap5.min.js"></script>
<!-- Initialize DataTables -->
<script>
$(document).ready(function() {
const table = $('table').DataTable({
"paging": false,
"info": false,
"columnDefs": [
{ "orderable": false, "targets": 0 },
{ "searchable": false, "targets": 0 }
],
"order": [[ 2, 'desc' ]],
"drawCallback": function(settings) {
var api = this.api();
api.column(0, {order:'applied'}).nodes().each(function(cell, i) {
cell.innerHTML = i + 1;
});
}
});
});
</script>
</body>
</html>