FairKV: Balancing Per-Head KV Cache for Fast Multi-GPU Inference

February 19, 2025 · Declared Dead · 🏛 arXiv.org

"Paper promises code 'coming soon'"

Evidence collected by the PWNC Scanner

Authors: Bingzhe Zhao, Ke Cheng, Aomufei Yuan, Yuxuan Tian, Ruiguang Zhong, Chengchen Hu, Tong Yang, Lian Yu
arXiv ID: 2502.15804
Category: cs.DC: Distributed Computing
Cross-listed: cs.AI
Citations: 1
Venue: arXiv.org
Last Checked: 1 month ago

Abstract

KV cache techniques in Transformer models aim to reduce redundant computations at the expense of substantially increased memory usage, making KV cache compression an important and popular research topic. Recently, state-of-the-art KV cache compression methods implement imbalanced, per-head allocation algorithms that dynamically adjust the KV cache budget for each attention head, achieving excellent performance in single-GPU scenarios. However, we observe that such imbalanced compression leads to significant load imbalance when deploying multi-GPU inference, as some GPUs become overburdened while others remain underutilized. In this paper, we propose FairKV, a method designed to ensure fair memory usage among attention heads in systems employing imbalanced KV cache compression. The core technique of FairKV is Fair-Copying, which replicates a small subset of memory-intensive attention heads across GPUs using data parallelism to mitigate load imbalance. Our experiments on popular models, including the LLaMA 70b and Mistral 24b models, demonstrate that FairKV increases throughput by 1.66x compared to standard tensor parallelism inference. Our code will be released as open source upon acceptance.
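Since the paper's code is not available, the following is only a toy sketch of the idea the abstract describes, not the authors' implementation. It assumes hypothetical per-head KV budgets, a contiguous head-to-GPU split standing in for standard tensor parallelism, and a simple rule that replicates the heaviest heads on every GPU and divides their load evenly (as data parallelism over the batch would). The function names and numbers are illustrative.

```python
from typing import List


def tensor_parallel_loads(budgets: List[float], num_gpus: int) -> List[float]:
    """Per-GPU KV load when heads are split into contiguous, equal-sized groups.

    Assumes len(budgets) is divisible by num_gpus.
    """
    heads_per_gpu = len(budgets) // num_gpus
    return [
        sum(budgets[g * heads_per_gpu:(g + 1) * heads_per_gpu])
        for g in range(num_gpus)
    ]


def replicated_loads(
    budgets: List[float], num_gpus: int, num_replicated: int
) -> List[float]:
    """Per-GPU KV load after replicating the heaviest heads on every GPU.

    A replicated head serves 1/num_gpus of the batch on each GPU, so its
    budget is divided evenly. All other heads keep their original placement.
    This is an assumed, simplified rule, not the paper's Fair-Copying algorithm.
    """
    heads_per_gpu = len(budgets) // num_gpus
    by_weight = sorted(range(len(budgets)), key=lambda h: budgets[h], reverse=True)
    heaviest = set(by_weight[:num_replicated])

    loads = [0.0] * num_gpus
    for head, budget in enumerate(budgets):
        if head in heaviest:
            for g in range(num_gpus):
                loads[g] += budget / num_gpus
        else:
            loads[head // heads_per_gpu] += budget
    return loads


# Hypothetical budgets: one head retains far more KV entries than the rest.
budgets = [800, 100, 100, 100, 100, 100, 100, 100]
baseline = tensor_parallel_loads(budgets, num_gpus=2)
balanced = replicated_loads(budgets, num_gpus=2, num_replicated=1)
print("tensor parallel:", baseline, "max =", max(baseline))
print("one head replicated:", balanced, "max =", max(balanced))
```

In this toy setting the slowest GPU determines step time, so lowering the maximum per-GPU load is the quantity of interest. The trade-off is that each replicated head's weights are stored on every GPU.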

📜 Similar Papers

In the same crypt — Distributed Computing

R.I.P. 👻 Ghosted

TensorFlow: A system for large-scale machine learning

Martín Abadi, Paul Barham, ... (+20 more)

cs.DC 🏛 OSDI 📚 19.3K cites 9 years ago

R.I.P. 👻 Ghosted

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

Martín Abadi, Ashish Agarwal, ... (+38 more)

cs.DC 🏛 arXiv 📚 11.6K cites 10 years ago

R.I.P. 👻 Ghosted

Hyperledger Fabric: A Distributed Operating System for Permissioned Blockchains

Elli Androulaki, Artem Barger, ... (+19 more)

cs.DC 🏛 European Conference on Computer Systems 📚 4.0K cites 8 years ago

R.I.P. 👻 Ghosted

Reproducing GW150914: the first observation of gravitational waves from a binary black hole merger

Duncan A. Brown, Karan Vahi, ... (+3 more)

cs.DC 🏛 Computing in science & engineering (Print) 📚 2.3K cites 5 years ago

R.I.P. 👻 Ghosted

MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems

Tianqi Chen, Mu Li, ... (+8 more)

cs.DC 🏛 arXiv 📚 2.3K cites 10 years ago

R.I.P. 👻 Ghosted

Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems

Vasimuddin Md, Sanchit Misra, ... (+2 more)

cs.DC 🏛 IPDPS 📚 2.3K cites 6 years ago

Died the same way — ⏳ Coming Soon™

R.I.P. ⏳ Coming Soon™

Exploring Simple Siamese Representation Learning

Xinlei Chen, Kaiming He

cs.CV 🏛 CVPR 📚 4.8K cites 5 years ago

R.I.P. ⏳ Coming Soon™

An Analysis of Scale Invariance in Object Detection - SNIP

Bharat Singh, Larry S. Davis

cs.CV 🏛 CVPR 📚 795 cites 8 years ago

R.I.P. ⏳ Coming Soon™

Class-balanced Grouping and Sampling for Point Cloud 3D Object Detection

Benjin Zhu, Zhengkai Jiang, ... (+3 more)

cs.CV 🏛 arXiv 📚 556 cites 6 years ago

R.I.P. ⏳ Coming Soon™

FSRNet: End-to-End Learning Face Super-Resolution with Facial Priors

Yu Chen, Ying Tai, ... (+3 more)

cs.CV 🏛 CVPR 📚 542 cites 8 years ago