arxiv:2402.17463

Training-Free Long-Context Scaling of Large Language Models

Published on Feb 27
· Submitted by akhaliq on Feb 28

Abstract

The ability of Large Language Models (LLMs) to process and generate coherent text is markedly weakened when the number of input tokens exceeds their pretraining length. Given the expensive overhead of finetuning large-scale models with longer sequences, we propose Dual Chunk Attention (DCA), which enables Llama2 70B to support context windows of more than 100k tokens without continual training. By decomposing the attention computation for long sequences into chunk-based modules, DCA manages to effectively capture the relative positional information of tokens within the same chunk (Intra-Chunk) and across distinct chunks (Inter-Chunk), as well as integrates seamlessly with Flash Attention. In addition to its impressive extrapolation capability, DCA achieves performance on practical long-context tasks that is comparable to or even better than that of finetuned models. When compared with proprietary models, our training-free 70B model attains 94% of the performance of gpt-3.5-16k, indicating it is a viable open-source alternative. All code and data used in this work are released at https://github.com/HKUNLP/ChunkLlama.
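The chunk-based decomposition described above can be illustrated with a toy position-index computation. The sketch below is not the authors' implementation (their released code is at the ChunkLlama repository linked above); the function name, the successive-chunk handling, and the exact index choices are simplified assumptions. The idea it illustrates: within a chunk, attention sees the true relative distance; across distant chunks, the distance is capped so it never exceeds the pretraining window.

```python
# Illustrative sketch (NOT the paper's exact scheme): chunk-based
# relative position indices in the spirit of Dual Chunk Attention.
# Assumptions: chunk size `s`, pretraining position budget `c`;
# the successive-chunk rule here is a simplification.

def dca_relative_positions(n, s, c):
    """Return an n x n lower-triangular matrix M where M[q][k] is the
    relative position used when query q attends to key k (causal)."""
    M = [[None] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):
            if q // s == k // s:
                # intra-chunk: keep the true relative distance
                M[q][k] = q - k
            elif q // s == k // s + 1:
                # successive chunks: true distance, capped at the budget
                M[q][k] = min(q - k, c - 1)
            else:
                # distant chunks: a single capped index
                M[q][k] = c - 1
    return M
```

Under this scheme every relative position stays below `c`, which is why a model pretrained with window `c` never sees an out-of-distribution position index, regardless of total sequence length `n`.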

Community

Could it be that such an attention mechanism mostly works because instruction-following GPTs use attention only as a redundant helper pattern for their feed-forward nets? https://github.com/jessevig/bertviz/issues/128

Paper author

That's a very interesting point! Actually, I believe the main bottleneck for long-context LLMs has become the inference cost (memory, speed, FLOPs), rather than context extrapolation. For instance, models like Gemini with a 1M context window are challenging to run on current GPUs. Assuming they are attention-based models, the computational cost in the attention layer can likely be further reduced.

