Hyper-multi-step: The Truth Behind Difficult Long-context Tasks
Abstract
Long-context language models (LCLMs), characterized by their extensive context windows, are becoming increasingly popular. Meanwhile, many long-context benchmarks present challenging tasks that even the most advanced LCLMs struggle to complete. However, the underlying sources of difficulty in these challenging long-context tasks have seldom been studied. To bridge this gap, we conduct experiments showing that their difficulty stems primarily from two basic issues: "multi-matching retrieval," which requires the simultaneous retrieval of multiple items, and "logic-based retrieval," which necessitates logical judgment within the retrieval criterion. These two problems, while seemingly straightforward, actually exceed the capabilities of LCLMs because they prove to be hyper-multi-step (demanding numerous steps to solve) in nature. This finding explains why LLMs struggle with more advanced long-context tasks and provides a more accurate perspective for rethinking solutions to them.
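To make the two task types concrete, below is a minimal Python sketch that synthesizes toy instances of multi-matching retrieval and logic-based retrieval over a key-value context. The key naming, value ranges, and question wording are illustrative assumptions, not the format of the paper's released benchmark (linked below).

```python
import random

# Toy construction of the two task types named in the abstract.
# The key names, value ranges, and question wording are illustrative
# assumptions, not the format of the paper's released benchmark.

def make_kv_context(n_items: int, seed: int = 0):
    """Build a long context of synthetic key-value statements."""
    rng = random.Random(seed)
    pairs = [(f"key-{i:04d}", rng.randint(0, 999)) for i in range(n_items)]
    context = "\n".join(f"The value of {k} is {v}." for k, v in pairs)
    return pairs, context

def multi_matching_query(pairs, target_value: int):
    """Multi-matching retrieval: every key whose value equals the target must be returned."""
    question = f"List ALL keys whose value is {target_value}."
    gold = [k for k, v in pairs if v == target_value]
    return question, gold

def logic_based_query(pairs, threshold: int):
    """Logic-based retrieval: the criterion is a logical comparison, not an exact match."""
    question = f"Which key has the largest value that is greater than {threshold}?"
    above = [(k, v) for k, v in pairs if v > threshold]
    gold = max(above, key=lambda kv: kv[1])[0] if above else None
    return question, gold

if __name__ == "__main__":
    pairs, context = make_kv_context(500)
    q1, gold1 = multi_matching_query(pairs, target_value=pairs[0][1])
    q2, gold2 = logic_based_query(pairs, threshold=950)
    # Each question would be appended to `context`, sent to an LCLM,
    # and the model's answer compared against the gold labels.
    print(q1, "->", gold1)
    print(q2, "->", gold2)
```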
Community
Our code is publicly available at https://github.com/yuyijiong/hard_retrieval_for_llm
The dataset is at https://huggingface.co/datasets/yuyijiong/difficult_retrieval
This paper reveals a tough fact:
A long-context language model can never directly address advanced long-context tasks well, such as repo-level code generation or filtering tabular data. This is because LLMs are inherently unable to complete a large number of reasoning steps within a limited generation length, which advanced long-context tasks often require but simple long-context tasks like needle-in-a-haystack do not.
When doing retrieval, LLMs are actually searching for "relevant" items, not "logically corresponding" ones.
A model that is good at both math and retrieval still cannot directly solve a math + retrieval task, unless it spends much more effort at test time.
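A rough back-of-the-envelope sketch of why such tasks become hyper-multi-step: a chain of thought that faithfully checks every candidate item grows linearly with the context size, so at some scale it must exceed any fixed generation budget. The tokens-per-check cost and the generation budget below are assumed numbers for illustration, not measurements from the paper.

```python
# Rough illustration of the "hyper-multi-step" scaling argument above.
# The tokens-per-check cost and the generation budget are assumptions
# chosen for illustration, not numbers measured in the paper.

def explicit_cot_tokens(n_items: int, tokens_per_check: int = 15) -> int:
    """A chain of thought that faithfully examines every candidate item
    grows linearly with the number of items in the context."""
    return n_items * tokens_per_check

GENERATION_BUDGET = 4_096  # a typical max new-token limit (assumption)

for n_items in (100, 1_000, 10_000):
    needed = explicit_cot_tokens(n_items)
    verdict = "exceeds" if needed > GENERATION_BUDGET else "fits in"
    print(f"{n_items:>6} items -> ~{needed:>7} reasoning tokens, "
          f"{verdict} a {GENERATION_BUDGET}-token budget")
```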
This is a well-sequenced paper.
A minute before reading "Can these issues be further decomposed into simple solvable components?" my brain had the same thought. More or less, "Can direct retrieval of 100+ KVs be decomposed?"
With that being said, the logic and multi-retrieval issues are (from my perspective) different types/categories of mathematical problems. Would an LLM fine-tuned for math problems, such as Qwen2.5-Math-1.5B/7B/72B, perform better than a non-fine-tuned counterpart such as Qwen2.5-1.5B/7B/72B, thanks to having been trained on mathematical problem-solving data?
One point: When I see the word "delve" (source: "To delve deeper into why LLMs struggle...") I think, "Did an LLM edit or write this paper?" However, that is my bias as an American-English reader, and this word may be intrinsic to your culture, so I do not think you need to remove it.
Thank you for your contribution to science and engineering!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- RAD-Bench: Evaluating Large Language Models Capabilities in Retrieval Augmented Dialogues (2024)
- Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation (2024)
- Multilingual Evaluation of Long Context Retrieval and Reasoning (2024)
- ALR$^2$: A Retrieve-then-Reason Framework for Long-context Question Answering (2024)
- You Only Use Reactive Attention Slice For Long Context Retrieval (2024)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot recommend