<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Stick To Your Role! Leaderboard</title>
<!-- Include Bootstrap CSS for styling -->
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap@5.1.3/dist/css/bootstrap.min.css">
<!-- Include DataTables CSS -->
<link rel="stylesheet" href="https://cdn.datatables.net/1.11.5/css/dataTables.bootstrap5.min.css">
<!-- Custom CSS for additional styling -->
<style>
body {
background-color: #f8f9fa;
font-family: 'Arial', sans-serif;
}
.container {
max-width: 1200px; /* Limit the width of the container */
margin: auto; /* Center the container */
padding: 20px; /* Add some padding */
background: #fff;
border-radius: 8px;
box-shadow: 0 4px 8px rgba(0,0,0,0.1);
}
h1 {
color: #333;
text-align: center;
}
h2 {
color: #333;
margin-top: 30px;
text-align: center;
}
.table-responsive {
margin-top: 20px;
}
table {
border-collapse: separate;
border-spacing: 0;
font-size: 14px; /* Reduce the font size */
width: 100%;
border: none; /* Remove any default border */
}
table thead th {
background-color: #610b5d;
color: white;
border: 1px solid #dee2e6;
text-align: left;
}
table tbody tr {
background-color: #fff;
box-shadow: 0 2px 4px rgba(0,0,0,0.1);
}
table tbody tr:hover {
background-color: #f1f1f1;
}
table td, table th {
padding: 10px; /* Reduce padding */
border: 1px solid #dee2e6;
}
table th:first-child {
border-top-left-radius: 10px;
}
table th:last-child {
border-top-right-radius: 10px;
}
.section{
padding-top: 19px;
text-align: left;
}
.section p {
padding-left: 150px;
padding-right: 150px;
text-indent: 2em;
margin: auto;
margin-bottom: 10px;
text-align: left;
}
.section ol,ul {
padding-left: 150px;
padding-right: 150px;
margin: auto;
margin-bottom: 20px;
margin-left: 50px;
text-align: left;
margin-top: 0px;
}
.citation-section {
width: 100%;
margin-top: 50px;
text-align: center;
}
.citation-box {
background-color: #f8f9fa;
border: 1px solid #dee2e6;
border-radius: 8px;
padding: 10px;
margin-top: 5px;
font-size: 15px;
text-align: left;
font-family: 'Courier New', Courier, monospace;
white-space: pre;
}
.image-container {
width: 100%;
margin-bottom: 40px;
}
.image-container img {
width: 90%;
max-width: 650px;
height: auto;
display: block;
margin: auto;
}
.section-title {
font-size: 24px;
font-weight: bold;
text-align: center;
margin-bottom: 40px;
padding: 20px; /* Add padding for more margin around text */
background-color: #610b5d;
color: #fff; /* Ensure text is readable on dark background */
border-radius: 15px; /* Rounded edges */
}
.back-button {
text-align: center;
margin-top: 50px;
}
.custom-button {
background-color: #610b5d;
color: #fff; /* Set white text color */
border-radius: 15px; /* Rounded edges */
padding: 10px 20px; /* Padding for the button */
font-size: 18px; /* Increase font size */
text-decoration: none; /* Remove underline */
}
.custom-button:hover {
background-color: #812b7d;
color: #fff;
}
</style>
</head>
<body>
<div class="container">
<h1 class="mt-5">Stick To Your Role! Leaderboard</h1>
<div class="table-responsive">
<!-- Render the table HTML here -->
{{ table_html|safe }}
</div>
<div class="section">
<div class="section-title">Motivation</div>
<p>
Benchmarks usually compare models with <b>MANY QUESTIONS</b> asked from <b>A SINGLE MINIMAL CONTEXT</b>, e.g. as multiple-choice questions.
This kind of evaluation tells us little about LLMs' behavior in deployment, where they are exposed to new contexts (especially given LLMs' highly context-dependent nature).
We argue that <b>CONTEXT-DEPENDENCE</b> can be seen as a <b>PROPERTY of LLMs</b>: a dimension of LLM comparison alongside others such as size, speed, or knowledge.
We evaluate LLMs by asking the <b>SAME QUESTIONS</b> from <b>MANY DIFFERENT CONTEXTS</b>.
</p>
<p>
LLMs are often used to simulate personas and populations.
We study the coherence of simulated populations across different contexts (conversations on different topics).
To do so, we apply psychological methodology to study the interpersonal stability of personal value expression in those simulated populations.
We adopt the Schwartz Theory of Basic Personal Values, which defines 10 values: Self-Direction, Stimulation, Hedonism, Achievement, Power, Security, Conformity, Tradition, Benevolence, and Universalism.
To evaluate their expression, we use the associated questionnaires: PVQ-40 and SVS.
</p>
</div>
<div class="section">
<div class="section-title">Administering a questionnaire in context to a simulated persona</div>
<p>To evaluate stability at the population level, we need to be able to score the <b>value profile</b> expressed by a <b>simulated individual</b> in a <b>specific context</b> (conversation topic). To do that, we use the following procedure:</p>
<ol>
<li> The Tested model is instructed to simulate a persona</li>
<li> A separate model instance - the Interlocutor - is instructed to simulate a “human using a chatbot”</li>
<li> A conversation topic is induced by manually setting the Interlocutor’s first message (e.g. “Tell me a joke”)</li>
<li> A conversation is simulated</li>
<li> A question from the questionnaire is set as the Interlocutor’s last message, and the Tested model’s response is recorded (this is repeated for every item in the questionnaire)</li>
<li> The questionnaire is scored to obtain scores for the 10 personal values</li>
</ol>
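The six steps above can be sketched in Python as follows. This is an illustrative sketch only: the callables <code>tested_reply</code>, <code>interlocutor_reply</code>, and <code>score_fn</code>, and the constant <code>N_TURNS</code>, are assumptions for illustration, not the project's actual API.

```python
N_TURNS = 2  # number of simulated exchange rounds (illustrative choice)

def administer(tested_reply, interlocutor_reply, topic, questionnaire, score_fn):
    """Run one simulated persona through one context and score its value profile.

    tested_reply(history) / interlocutor_reply(history) -> str (next message)
    score_fn(answers) -> dict mapping the 10 personal values to scores
    """
    history = [("interlocutor", topic)]          # step 3: induce the topic
    for _ in range(N_TURNS):                     # step 4: simulate a conversation
        history.append(("tested", tested_reply(history)))
        history.append(("interlocutor", interlocutor_reply(history)))
    answers = []
    for item in questionnaire:                   # step 5: ask each item in context
        answers.append(tested_reply(history + [("interlocutor", item)]))
    return score_fn(answers)                     # step 6: score the questionnaire
```

Steps 1 and 2 (instructing each model instance to simulate its persona) are assumed to happen inside the two reply callables.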
<div class="image-container">
<a href="{{ url_for('static', filename='figures/admin_questionnaire.svg') }}" target="_blank">
<img src="{{ url_for('static', filename='figures/admin_questionnaire.svg') }}" alt="Structure">
</a>
</div>
</div>
<div class="section">
<div class="section-title">Contexts</div>
<p>
We aim to score the expressed value profile of each simulated persona in different contexts.
More precisely, a population (50 personas) is evaluated in a context chunk (50 topics: one per persona).
The population in one context chunk is then compared to the same population in another context chunk.
Here are the considered context chunks:
</p>
<ul>
<li> <b> no_conv </b>: no conversation is simulated; the questions from the PVQ-40 questionnaire are given directly </li>
<li> <b> no_conv_svs </b>: no conversation is simulated; the questions from the SVS questionnaire are given directly </li>
<li> <b> chunk_0 - chunk_4 </b>: 50 Reddit posts are used as the Interlocutor model's initial messages (one per persona); chunk_0 contains the longest posts, chunk_4 the shortest </li>
<li> <b> chess </b>: "1. e4" is given as the initial message to all personas, but for each persona the Interlocutor model is instructed to simulate a different persona (instead of a human user) </li>
<li> <b> grammar </b>: like chess, but "Can you check this sentence for grammar? \n Whilst Jane was waiting to meet hers friend their nose started bleeding." is given as the initial message </li>
</ul>
</div>
<div class="section">
<div class="section-title">Metrics</div>
<p>We evaluate the following metrics (+ denotes that higher is better; - denotes that lower is better):</p>
<ul>
<li> <b> RO Stability (+) </b> - Average Rank-Order stability between each pair of context chunks. More details are given in the per-model pages (e.g. <a href="model/gpt-4o-0513">gpt-4o-0513</a>) and in our <a href="https://arxiv.org/abs/2402.14846">paper</a> </li>
<li> <b> Rank Distance (-) </b> - Average distance between the theoretical and the observed order of values in a circular space. More details are given in the per-model pages (e.g. <a href="model/gpt-4o-0513">gpt-4o-0513</a>) </li>
<li> <b> CFI (+) </b> - a common Validity metric </li>
<li> <b> SRMR (-) </b> - a common Validity metric </li>
<li> <b> RMSEA (-) </b> - a common Validity metric </li>
<li> <b> Cronbach alpha (+) </b> - a common Reliability metric </li>
<li> <b> Ordinal (Win rate) (+) </b> - each (context pair, metric) combination is treated as a game between models; this metric shows a model's average win rate over all such games</li>
<li> <b> Cardinal (Score) (+) </b> - the average over all context pairs and metrics (with lower-is-better metrics inverted) </li>
</ul>
</div>
<div class="back-button">
<a href="{{ url_for('index') }}" class="custom-button mt-3">Back</a>
</div>
<div class="citation-section">
<p>If you found this project useful, please cite our related paper:</p>
<div class="citation-box" id="citation-text">
@article{kovavc2024stick,
title={Stick to your Role! Stability of Personal Values Expressed in Large Language Models},
author={Kova{\v{c}}, Grgur and Portelas, R{\'e}my and Sawayama, Masataka and Dominey, Peter Ford and Oudeyer, Pierre-Yves},
journal={arXiv preprint arXiv:2402.14846},
year={2024}
}
</div>
</div>
</div>
<!-- Include jQuery -->
<script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>
<!-- Include Bootstrap JS -->
<script src="https://cdn.jsdelivr.net/npm/bootstrap@5.1.3/dist/js/bootstrap.bundle.min.js"></script>
<!-- Include DataTables JS -->
<script src="https://cdn.datatables.net/1.11.5/js/jquery.dataTables.min.js"></script>
<script src="https://cdn.datatables.net/1.11.5/js/dataTables.bootstrap5.min.js"></script>
<!-- Initialize DataTables -->
<script>
$(document).ready(function() {
const table = $('table').DataTable({
"paging": false,
"info": false,
"columnDefs": [
{ "orderable": false, "targets": 0 },
{ "searchable": false, "targets": 0 }
],
"order": [[ 2, 'desc' ]],
"drawCallback": function(settings) {
var api = this.api();
api.column(0, {order:'applied'}).nodes().each(function(cell, i) {
cell.innerHTML = i + 1;
});
}
});
});
</script>
</body>
</html>