KV Cache Compression for Inference Efficiency in LLMs: A Review
August 08, 2025 · The Cartographer · Proceedings of the 4th International Conference on Artificial Intelligence and Intelligent Information Processing
"No code URL or promise found in abstract"
"Title-pattern auto-detect: KV Cache Compression for Inference Efficiency in LLMs: A Review"
Evidence collected by the PWNC Scanner
Authors
Yanyu Liu, Jingying Fu, Sixiang Liu, Yitian Zou, You Fu, Jiehan Zhou, Shouhua Zhang
arXiv ID
2508.06297
Category
cs.DC: Distributed Computing
Citations
1
Venue
Proceedings of the 4th International Conference on Artificial Intelligence and Intelligent Information Processing
Last Checked
4 days ago
Abstract
With the rapid advancement of large language models (LLMs), the context length for inference has been continuously increasing, leading to an exponential growth in the demand for Key-Value (KV) caching. This has resulted in a significant memory bottleneck, limiting the inference efficiency and scalability of the models. Therefore, optimizing the KV cache during inference is crucial for enhancing performance and efficiency. This review systematically examines current KV cache optimization techniques, including compression strategies such as selective token strategies, quantization, and attention compression. We evaluate the effectiveness, trade-offs, and application scenarios of these methods, providing a comprehensive analysis of their impact on memory usage and inference speed. We focus on identifying the limitations and challenges of existing methods, such as compatibility issues with different models and tasks. Additionally, this review highlights future research directions, including hybrid optimization techniques, adaptive dynamic strategies, and software-hardware co-design. These approaches aim to improve inference efficiency and promote the practical application of large language models.
Similar Papers
In the same crypt: Distributed Computing
Reproducing GW150914: the first observation of gravitational waves from a binary black hole merger
R.I.P. 👻 Ghosted
MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems
R.I.P. 👻 Ghosted
Adaptive Federated Learning in Resource Constrained Edge Computing Systems
R.I.P. 👻 Ghosted
Edge Intelligence: Paving the Last Mile of Artificial Intelligence with Edge Computing
R.I.P. 👻 Ghosted