arxiv:2410.04422

Hyper-multi-step: The Truth Behind Difficult Long-context Tasks

Published on Oct 6
· Submitted by yuyijiong on Oct 9

Abstract

Long-context language models (LCLMs), characterized by their extensive context windows, are becoming increasingly popular. Meanwhile, many long-context benchmarks present challenging tasks that even the most advanced LCLMs struggle to complete. However, the underlying sources of difficulty in these challenging long-context tasks have seldom been studied. To bridge this gap, we conduct experiments indicating that their difficulty stems primarily from two basic issues: "multi-matching retrieval," which requires the simultaneous retrieval of multiple items, and "logic-based retrieval," which necessitates logical judgment within the retrieval criteria. These two problems, while seemingly straightforward, actually exceed the capabilities of LCLMs because they are proven to be hyper-multi-step (demanding numerous steps to solve) in nature. This finding could explain why LLMs struggle with more advanced long-context tasks, providing a more accurate perspective for rethinking solutions to them.
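To make the two problem types concrete, here is a minimal Python sketch that builds toy prompts over a synthetic key-value haystack. It is only an illustration of the definitions above; the haystack format, sizes, and query wording are assumptions, not the paper's actual benchmark.

```python
import random
import string

def rand_key(n=8):
    """Random lowercase/digit identifier used as a synthetic key."""
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=n))

# Synthetic long context: many "key: value" lines, like a long JSON dump.
pairs = {rand_key(): random.randint(0, 999) for _ in range(500)}

# Plant several keys sharing one value, so a query can have multiple answers.
shared_value = 42
for k in random.sample(list(pairs), 5):
    pairs[k] = shared_value

context = "\n".join(f"{k}: {v}" for k, v in pairs.items())

# 1) Multi-matching retrieval: every matching item must be returned at once.
multi_matching_prompt = (
    f"{context}\n\n"
    f"List every key whose value is exactly {shared_value}."
)

# 2) Logic-based retrieval: the criterion is a logical judgment (a comparison),
#    not a literal string that appears verbatim in the query.
threshold = 900
logic_based_prompt = (
    f"{context}\n\n"
    f"Which keys have a value greater than {threshold}?"
)
```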

Community

Paper author Paper submitter · edited 28 days ago

This paper reveals a tough fact 🤕 that:
A long-context language model can never directly address advanced long-context tasks well 😵, such as repo-level code generation or filtering tabular data. This is because LLMs are inherently unable to complete a large number of reasoning steps within a limited generation length 😩, which is often a necessity for advanced long-context tasks 🔢, but not for simple long-context tasks like needle-in-a-haystack 😃.
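A toy way to see the step-count point (my own sketch, not the paper's code): even a simple logic-based query hides one elementary judgment per context item, so the number of steps grows linearly with the haystack size and quickly exceeds what fits in a bounded generation.

```python
# Toy illustration (not the paper's code): count the elementary judgments a
# logic-based retrieval query implicitly requires over a synthetic haystack.
items = [(f"key_{i:04d}", (i * 7) % 1000) for i in range(500)]

def solve_logic_query(items, threshold):
    steps = 0
    hits = []
    for key, value in items:
        steps += 1                 # one comparison per context item
        if value > threshold:
            hits.append(key)
    return hits, steps

hits, steps = solve_logic_query(items, threshold=900)
print(f"{len(hits)} matching keys, but {steps} comparisons were implicitly needed")
```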

When doing retrieval, LLMs actually search for "relevant" items, not "logically corresponding" ones.

A model good at both math and retrieval still cannot directly solve a math + retrieval task unless it spends much more effort at test time 🧐.

This is a well-sequenced paper.

A minute before reading "Can these issues be further decomposed into simple solvable components?", I had the same thought, more or less: "Can direct retrieval of 100+ KVs be decomposed?"

That said, the logic and multi-retrieval issues are (from my perspective) different types of mathematical problems. Would an LLM fine-tuned on math problems, such as Qwen2.5-Math-1.5B/7B/72B, perform better than a non-fine-tuned counterpart such as Qwen2.5-1.5B/7B/72B, since it has been trained on mathematical problem-solving data?

One point: when I see the word "delve" (source: "To delve deeper into why LLMs struggle..."), I think, "Did an LLM edit or write this paper?" However, that is my bias as an American-English reader, and this word may be entirely natural in your own writing, so I do not think you need to remove it.

Thank you for your contribution to science and engineering!

Models citing this paper 0

Datasets citing this paper 1

Spaces citing this paper 0

Collections including this paper 2