ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
Abstract
Self-attention is an essential component of large language models (LLMs) but a significant source of inference latency for long sequences. In multi-tenant LLM serving scenarios, the compute and memory cost of self-attention can be reduced by exploiting the probability that multiple LLM requests share system prompts in their prefixes. In this paper, we introduce ChunkAttention, a prefix-aware self-attention module that detects matching prompt prefixes across multiple requests and shares their key/value tensors in memory at runtime, improving the memory utilization of the KV cache. This is achieved by breaking monolithic key/value tensors into smaller chunks and structuring them into an auxiliary prefix tree. On top of the prefix-tree-based KV cache, we design an efficient self-attention kernel with a two-phase partition algorithm that improves data locality during self-attention computation in the presence of shared system prompts. Experiments show that ChunkAttention speeds up the self-attention kernel by 3.2-4.8× compared to the state-of-the-art implementation, with system prompt lengths ranging from 1024 to 4096 tokens.
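To make the prefix-tree-based KV cache concrete, here is a minimal Python sketch of the idea described in the abstract. The chunk size, class names, and tensor layout are assumptions made for illustration only; the paper's actual implementation is an optimized attention kernel, not this toy data structure.

```python
import numpy as np

CHUNK_SIZE = 64  # tokens per KV chunk (illustrative choice, not the paper's value)

class ChunkNode:
    """One node of the prefix tree: a fixed-size chunk of tokens plus its K/V tensors."""
    def __init__(self, token_ids, k, v):
        self.token_ids = tuple(token_ids)   # tokens this chunk covers
        self.k = k                          # [chunk_len, d] key tensor for these tokens
        self.v = v                          # [chunk_len, d] value tensor
        self.children = {}                  # next chunk's token tuple -> ChunkNode

class PrefixTreeKVCache:
    """KV chunks shared across requests whose prompts agree on a prefix."""
    def __init__(self):
        self.root = ChunkNode((), None, None)

    def insert(self, token_ids, k, v):
        """Insert a prompt's K/V, reusing chunks already stored for a matching prefix."""
        node, path = self.root, []
        for start in range(0, len(token_ids), CHUNK_SIZE):
            chunk = tuple(token_ids[start:start + CHUNK_SIZE])
            if chunk in node.children:            # shared prefix: reuse the existing chunk
                node = node.children[chunk]
            else:                                 # divergent suffix: allocate a new chunk
                child = ChunkNode(chunk,
                                  k[start:start + CHUNK_SIZE],
                                  v[start:start + CHUNK_SIZE])
                node.children[chunk] = child
                node = child
            path.append(node)
        return path                               # per-request chunk list used by attention
```

In this toy version, sharing happens at chunk granularity: two requests reuse a chunk's K/V only if they agree on that entire chunk of tokens, which is why long shared system prompts at the start of requests are the main beneficiaries.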
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Hydragen: High-Throughput LLM Inference with Shared Prefixes (2024)
- RelayAttention for Efficient Large Language Model Serving with Long System Prompts (2024)
- APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding (2024)
- KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache (2024)
- DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference (2024)
Glad to see ChunkAttention featured. This is a revised version of the ICLR 2024 submission. The reviews raised a few concerns: 1) evidence of shared prefix lengths, and 2) more experiments. We addressed both and released the current version. Feel free to provide any feedback.
https://openreview.net/forum?id=9k27IITeAZ
The code still needs to go through the open-source release process; it will be available on GitHub soon.
Is the core idea here:
- There are many scenarios where you might end up sharing (potentially very) long prefixes that you need to prefill (e.g. system prompt, exemplars, maybe even an instruction manual with some of the 100K - 1M models).
- Under this regime, it's helpful to be able to cache common shared prefixes as KV cache chunks so they don't have to be recomputed during expensive prefill in, e.g., disaggregated/batched serving regimes
- This paper proposes an approach to identify and store these common prefixes
That seems like a super useful thing, especially if these chunks could be loaded offline (e.g. downloading a static precomputed dictionary of KV cache chunks for the most common shared prefixes, or for your entire giant system prompt); see the lookup sketch below.
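If that reading is right, the lookup side could look roughly like the sketch below, which continues the hypothetical `PrefixTreeKVCache` example after the abstract. `longest_cached_prefix` and `compute_kv` are made-up names for illustration, not the paper's API.

```python
def longest_cached_prefix(cache, token_ids, chunk_size=64):
    """Walk the prefix tree chunk by chunk and return (matched_len, chunk_path):
    the K/V chunks already cached for this request's longest shared prefix."""
    node, path, matched = cache.root, [], 0
    for start in range(0, len(token_ids), chunk_size):
        chunk = tuple(token_ids[start:start + chunk_size])
        child = node.children.get(chunk)
        if child is None:                  # prefix diverges here; the rest needs prefill
            break
        node, matched = child, matched + len(chunk)
        path.append(child)
    return matched, path

# Hypothetical usage during prefill:
# matched, shared_chunks = longest_cached_prefix(cache, prompt_ids)
# k_new, v_new = compute_kv(prompt_ids[matched:])   # only the uncached suffix
# Attention then reads shared_chunks (reused) plus the newly computed suffix K/V.
```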