BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack
Abstract
In recent years, the input context sizes of large language models (LLMs) have increased dramatically. However, existing evaluation methods have not kept pace, failing to comprehensively assess the efficiency of models in handling long contexts. To bridge this gap, we introduce the BABILong benchmark, designed to test language models' ability to reason across facts distributed in extremely long documents. BABILong includes a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets. These tasks are challenging on their own, and even more demanding when the required facts are scattered across long natural text. Our evaluations show that popular LLMs effectively utilize only 10-20% of the context and their performance declines sharply with increased reasoning complexity. Among alternatives to in-context reasoning, Retrieval-Augmented Generation methods achieve a modest 60% accuracy on single-fact question answering, independent of context length. Among context extension methods, the highest performance is demonstrated by recurrent memory transformers, enabling the processing of lengths up to 11 million tokens. The BABILong benchmark is extendable to any length to support the evaluation of new upcoming models with increased capabilities, and we provide splits up to 1 million token lengths.
Community
GitHub: https://github.com/booydar/babilong
BABILong on HF Datasets: https://huggingface.co/datasets/RMT-team/babilong
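The dataset can be pulled directly from the Hub with the `datasets` library. Below is a minimal sketch; the config name ("4k") and split name ("qa1") are assumptions based on the length-based layout described here, so check the dataset card for the exact identifiers.

```python
# Minimal sketch: loading one BABILong split from the Hugging Face Hub.
# The config name ("4k") and split name ("qa1") are assumed; see the dataset card.
from datasets import load_dataset

babilong = load_dataset("RMT-team/babilong", "4k")  # contexts of roughly 4K tokens (assumed config name)
qa1 = babilong["qa1"]                               # single-supporting-fact task (assumed split name)

sample = qa1[0]
print(sample.keys())  # inspect the available fields (long context, question, target answer, etc.)
```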
One of our main findings is that the simple commonsense reasoning required by BABILong is still a challenge for current long-context models.
Even models that claim to support 128K tokens experience degradation beyond 10% of their input capacity. RAG methods do not help, while fine-tuning of small-scale models (RMT 137M and Mamba 130M) shows that the tasks are solvable. Values represent average accuracy over the QA1-QA5 tasks from BABILong.
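For reference, the averaged metric described above can be reproduced with a loop like the following sketch. Here `generate_answer` is a hypothetical placeholder for the model under test, and the example field names (`input`, `question`, `target`) are assumptions to be verified against the dataset card.

```python
# Sketch of the reported metric: average accuracy over QA1-QA5 at one context length.
# `generate_answer` is a hypothetical model call; field names are assumptions.
from datasets import load_dataset

def generate_answer(context: str, question: str) -> str:
    """Hypothetical stand-in for the evaluated model or API: returns a short answer string."""
    raise NotImplementedError

def babilong_accuracy(config: str = "64k", tasks=("qa1", "qa2", "qa3", "qa4", "qa5")) -> float:
    data = load_dataset("RMT-team/babilong", config)
    per_task = []
    for task in tasks:
        hits = 0
        for ex in data[task]:
            pred = generate_answer(ex["input"], ex["question"])      # assumed field names
            hits += int(ex["target"].strip().lower() in pred.strip().lower())
        per_task.append(hits / len(data[task]))
    return sum(per_task) / len(per_task)  # average accuracy over the selected tasks
```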
Q: How effectively do LLMs use the context window in QA tasks?
A: LLMs struggle to answer questions about facts in texts larger than 10,000 tokens.
The plots demonstrate how the performance of selected leading models deteriorates with increasing context size. For single-supporting-fact questions (QA1), the majority of models perform well up to 4,000 tokens. However, when a correct response requires combining two (QA2) or three (QA3) facts, LLMs fail to achieve satisfactory accuracy.
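The jump in difficulty from QA1 to QA2/QA3 comes from the number of supporting facts that must be located and combined inside the haystack. The sketch below illustrates the general construction idea (task facts interleaved with unrelated background text); the helper and the filler text are purely illustrative, not the benchmark's actual generation pipeline.

```python
# Illustrative only: build a haystack by scattering supporting facts among filler sentences.
import random

def scatter_facts(facts: list[str], question: str, filler_sentences: list[str], n_filler: int = 50) -> str:
    """Interleave task-relevant facts with background sentences, so the model
    must locate all supporting facts before answering."""
    haystack = random.sample(filler_sentences, k=min(n_filler, len(filler_sentences)))
    for fact in facts:
        haystack.insert(random.randrange(len(haystack) + 1), fact)
    return " ".join(haystack) + f"\nQuestion: {question}"

# A QA2-style item requires two supporting facts to be combined:
facts = ["Mary went to the kitchen.", "Mary picked up the apple."]
context = scatter_facts(facts, "Where is the apple?", ["The weather was unremarkable."] * 200)
```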