ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
Abstract
Self-attention is an essential component of large language models (LLMs) but a significant source of inference latency for long sequences. In multi-tenant LLM serving scenarios, the compute and memory cost of self-attention can be reduced by exploiting the probability that multiple LLM requests share system prompts in their prefixes. In this paper, we introduce ChunkAttention, a prefix-aware self-attention module that detects matching prompt prefixes across multiple requests and shares their key/value tensors in memory at runtime, improving the memory utilization of the KV cache. This is achieved by breaking monolithic key/value tensors into smaller chunks and structuring them into an auxiliary prefix tree. On top of the prefix-tree-based KV cache, we design an efficient self-attention kernel with a two-phase partition algorithm that improves data locality during self-attention computation in the presence of shared system prompts. Experiments show that ChunkAttention speeds up the self-attention kernel by 3.2-4.8× compared to the state-of-the-art implementation, with system prompt lengths ranging from 1024 to 4096 tokens.
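To make the prefix-tree-based KV cache concrete, here is a minimal Python sketch of the idea described in the abstract. The chunk size, class names, and tensor layout are assumptions made for illustration only; the paper's actual implementation is an optimized attention kernel, not this toy data structure.

```python
import numpy as np

CHUNK_SIZE = 64  # tokens per KV chunk (illustrative choice, not the paper's value)

class ChunkNode:
    """One node of the prefix tree: a fixed-size chunk of tokens plus its K/V tensors."""
    def __init__(self, token_ids, k, v):
        self.token_ids = tuple(token_ids)   # tokens this chunk covers
        self.k = k                          # [chunk_len, d] key tensor for these tokens
        self.v = v                          # [chunk_len, d] value tensor
        self.children = {}                  # next chunk's token tuple -> ChunkNode

class PrefixTreeKVCache:
    """KV chunks shared across requests whose prompts agree on a prefix."""
    def __init__(self):
        self.root = ChunkNode((), None, None)

    def insert(self, token_ids, k, v):
        """Insert a prompt's K/V, reusing chunks already stored for a matching prefix."""
        node, path = self.root, []
        for start in range(0, len(token_ids), CHUNK_SIZE):
            chunk = tuple(token_ids[start:start + CHUNK_SIZE])
            if chunk in node.children:            # shared prefix: reuse the existing chunk
                node = node.children[chunk]
            else:                                 # divergent suffix: allocate a new chunk
                child = ChunkNode(chunk,
                                  k[start:start + CHUNK_SIZE],
                                  v[start:start + CHUNK_SIZE])
                node.children[chunk] = child
                node = child
            path.append(node)
        return path                               # per-request chunk list used by attention
```

In this toy version, sharing happens at chunk granularity: two requests reuse a chunk's K/V only if they agree on that entire chunk of tokens, which is why long shared system prompts at the start of requests are the main beneficiaries.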
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Hydragen: High-Throughput LLM Inference with Shared Prefixes (2024)
- RelayAttention for Efficient Large Language Model Serving with Long System Prompts (2024)
- APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding (2024)
- KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache (2024)
- DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference (2024)
Glad to see ChunkAttention featured. This is a revised version of the ICLR 2024 submission. The reviews raised a few concerns: 1) evidence of shared prefix lengths, and 2) more experiments. We addressed both and released the current version. Feel free to provide any feedback.
https://openreview.net/forum?id=9k27IITeAZ
The code still needs to go through the open-source release process; it will be available on GitHub soon.
Is the core idea here:
- There are many scenarios where you might end up sharing (potentially very) long prefixes that you need to prefill (e.g. system prompt, exemplars, maybe even an instruction manual with some of the 100K - 1M models).
- Under this regime, it's helpful to be able to cache common shared prefixes as KV cache chunks so they don't have to be recomputed during expensive prefill in, e.g., disaggregated/batched serving regimes
- This paper proposes an approach to identify and store these common prefixes
That seems like a super useful thing, especially if these chunks could be loaded offline (e.g. downloading a static precomputed dictionary of KV cache chunks for the most common shared prefixes, or for your entire giant system prompt); see the lookup sketch below.
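If that reading is right, the lookup side could look roughly like the sketch below, which continues the hypothetical `PrefixTreeKVCache` example after the abstract. `longest_cached_prefix` and `compute_kv` are made-up names for illustration, not the paper's API.

```python
def longest_cached_prefix(cache, token_ids, chunk_size=64):
    """Walk the prefix tree chunk by chunk and return (matched_len, chunk_path):
    the K/V chunks already cached for this request's longest shared prefix."""
    node, path, matched = cache.root, [], 0
    for start in range(0, len(token_ids), chunk_size):
        chunk = tuple(token_ids[start:start + chunk_size])
        child = node.children.get(chunk)
        if child is None:                  # prefix diverges here; the rest needs prefill
            break
        node, matched = child, matched + len(chunk)
        path.append(child)
    return matched, path

# Hypothetical usage during prefill:
# matched, shared_chunks = longest_cached_prefix(cache, prompt_ids)
# k_new, v_new = compute_kv(prompt_ids[matched:])   # only the uncached suffix
# Attention then reads shared_chunks (reused) plus the newly computed suffix K/V.
```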