MKA: Memory-Keyed Attention for Efficient Long-Context Reasoning

March 21, 2026 · Grace Period · 🏛 the ACM Computing Frontiers 2026 Conference and the ICML 2025 Long Context Modeling Workshop

⏳ Grace Period
This paper is less than 90 days old. We give authors time to release their code before passing judgment.
Authors: Dong Liu, Yanxuan Yu, Ben Lengerich, Ying Nian Wu
arXiv ID: 2603.20586
Category: cs.LG (Machine Learning)
Cross-listed: cs.AI
Citations: 0
Venue: the ACM Computing Frontiers 2026 Conference and the ICML 2025 Long Context Modeling Workshop
Abstract
As long-context language modeling becomes increasingly important, the cost of maintaining and attending to large Key/Value (KV) caches grows rapidly, becoming a major bottleneck in both training and inference. While prior works such as Multi-Query Attention (MQA) and Multi-Latent Attention (MLA) reduce memory by sharing or compressing KV features, they often trade off representation quality or incur runtime overhead. We propose Memory-Keyed Attention (MKA), a hierarchical attention mechanism that integrates multi-level KV caches (local, session, and long-term) and learns to route attention across them dynamically. We further introduce Route-Fused MKA (FastMKA), a broadcast-routed variant that fuses memory sources before attention computation for improved efficiency. Experiments on different sequence lengths show that FastMKA achieves a favorable accuracy-efficiency trade-off: comparable perplexity to MLA while achieving up to 5x faster training throughput and 1.8x lower evaluation latency. These results highlight MKA as a practical and extensible framework for efficient long-context attention.
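The abstract contrasts two ideas: routing attention across several memory levels (MKA) and fusing the memory sources before a single attention pass (FastMKA). The sketch below is only an illustration of that contrast in plain Python, not the authors' implementation. Everything in it is an assumption made for the example: the function names, the per-level gate vectors used as a router, the weighted-sum fusion rule, and the requirement that all three levels hold the same number of slots.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Single-query scaled dot-product attention over one memory."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

def route_weights(query, gate_vectors):
    """Hypothetical router: one gate vector per level, softmax over levels."""
    return softmax([dot(query, g) for g in gate_vectors])

def routed_attention(query, memories, gate_vectors):
    """Attend to each level separately, then mix the per-level outputs."""
    lam = route_weights(query, gate_vectors)
    outputs = [attend(query, keys, values) for keys, values in memories]
    dim = len(outputs[0])
    return [sum(l * out[j] for l, out in zip(lam, outputs)) for j in range(dim)]

def fused_attention(query, memories, gate_vectors):
    """Mix the levels into one key/value set first, then attend once.

    Assumes every level has the same number of slots and feature size.
    """
    lam = route_weights(query, gate_vectors)
    slots = len(memories[0][0])

    def mix(idx):  # idx 0 selects keys, idx 1 selects values
        width = len(memories[0][idx][0])
        return [[sum(l * mem[idx][i][j] for l, mem in zip(lam, memories))
                 for j in range(width)] for i in range(slots)]

    return attend(query, mix(0), mix(1))

# Toy data: each memory level is a (keys, values) pair with two slots.
query = [1.0, 0.5]
local = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
session = ([[0.5, 0.5], [1.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
long_term = ([[0.0, 0.0], [2.0, -1.0]], [[5.0, 5.0], [-1.0, 1.0]])
memories = [local, session, long_term]
gates = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # illustrative, not learned

print("routed:", routed_attention(query, memories, gates))
print("fused: ", fused_attention(query, memories, gates))
```

In this toy setup the routed variant runs one attention pass per memory level, while the fused variant runs a single pass over the mixed keys and values. That difference is one plausible reading of where the efficiency gain described in the abstract could come from. The two variants generally give different outputs, and they coincide when all levels hold identical contents.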