Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU
April 16, 2026 · Grace Period
Authors
Jevin Jiang, Ying Chen, Blake A. Hechtman, Fenghui Zhang, Yarong Mu
arXiv ID
2604.15464
Category
cs.PF: Performance
Cross-listed
cs.AI, cs.LG
Citations
0
Abstract
Large Language Model (LLM) deployment is increasingly shifting to cost-efficient accelerators like Google's Tensor Processing Units (TPUs), prioritizing both performance and total cost of ownership (TCO). However, existing LLM inference kernels and serving systems remain largely GPU-centric, and there is no well-established approach for efficiently mapping LLM workloads onto TPU architectures, particularly under the dynamic and ragged execution patterns common in modern serving. In this paper, we present Ragged Paged Attention (RPA), a high-performance and flexible attention kernel for TPUs, implemented using Pallas and Mosaic. RPA addresses these challenges through three key techniques: (1) fine-grained tiling to enable efficient dynamic slicing over ragged memory, (2) a custom software pipeline that fuses KV cache updates with attention computation, and (3) a distribution-aware compilation strategy that generates specialized kernels for decode, prefill, and mixed workloads. Evaluated on Llama 3 8B on TPU7x, RPA achieves up to 86% memory bandwidth utilization (MBU) in decode and 73% model FLOPs utilization (MFU) in prefill. Integrated as the primary TPU backend in vLLM and SGLang, RPA provides a production-grade foundation for efficient TPU inference and offers practical insights into kernel design.
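For readers unfamiliar with the terms, the following is a minimal reference sketch of what "ragged" and "paged" attention mean semantically. It is not the authors' kernel: it uses no Pallas, no tiling, no pipelining, and no fused KV cache update. It assumes a single attention head, causal masking, and a hypothetical data layout; the function name `ragged_paged_attention_ref` and all of its parameter names are illustrative assumptions, not taken from the paper. "Ragged" here means that sequences with different query lengths (decode with one token, prefill with many) are concatenated into one batch; "paged" means that each sequence's KV cache lives in fixed-size pages located through a page table.

```python
import math


def ragged_paged_attention_ref(q, kv_pages, page_table, cu_q_lens, kv_lens, page_size):
    """Reference semantics only (hypothetical layout, single head, causal).

    q          : query vectors of all sequences, concatenated (the ragged batch)
    kv_pages   : kv_pages[p][s] = (key_vec, value_vec) for slot s of physical page p
    page_table : page_table[i] = physical page ids of sequence i, in logical order
    cu_q_lens  : cumulative query lengths; sequence i owns q[cu_q_lens[i]:cu_q_lens[i+1]]
    kv_lens    : kv_lens[i] = number of valid KV tokens of sequence i (new tokens included)
    """
    head_dim = len(q[0])
    scale = 1.0 / math.sqrt(head_dim)
    out = []
    for i, kv_len in enumerate(kv_lens):
        q_start, q_end = cu_q_lens[i], cu_q_lens[i + 1]
        q_len = q_end - q_start
        for j in range(q_len):
            # Absolute position of this query token; it attends to positions <= pos.
            pos = kv_len - q_len + j
            scores, values = [], []
            for t in range(pos + 1):
                # Translate the logical token index into (physical page, slot).
                page = page_table[i][t // page_size]
                k, v = kv_pages[page][t % page_size]
                scores.append(scale * sum(a * b for a, b in zip(q[q_start + j], k)))
                values.append(v)
            # Numerically stable softmax over the visible KV tokens.
            m = max(scores)
            weights = [math.exp(s - m) for s in scores]
            total = sum(weights)
            out.append([sum(w * v[d] for w, v in zip(weights, values)) / total
                        for d in range(head_dim)])
    return out


# Mixed batch sharing one page pool: sequence 0 is a decode step
# (1 query, 3 KV tokens); sequence 1 is a prefill (2 queries, 2 KV tokens).
page_size = 2
kv_pages = [
    [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])],  # page 0
    [([1.0, 1.0], [2.0, 2.0]), ([0.0, 0.0], [0.0, 0.0])],  # page 1 (slot 1 unused)
    [([0.0, 0.0], [5.0, 5.0]), ([0.0, 0.0], [7.0, 7.0])],  # page 2
]
page_table = [[0, 1], [2]]
q = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
cu_q_lens = [0, 1, 3]
kv_lens = [3, 2]
out = ragged_paged_attention_ref(q, kv_pages, page_table, cu_q_lens, kv_lens, page_size)
```

The result has one output vector per query token, in the same concatenated order as `q`. The decode query averages its three cached values, the first prefill token sees only itself, and the second prefill token sees both tokens of its sequence. A production kernel such as the one the abstract describes would compute the same quantity while avoiding the per-token page lookups shown here.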