GPU Performance Portability needs Autotuning

April 30, 2025 · Declared Dead · 🏛 arXiv.org

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Burkhard Ringlein, Thomas Parnell, Radu Stoica
arXiv ID: 2505.03780
Category: cs.AR (Hardware Architecture)
Cross-listed: cs.AI, cs.PL
Citations: 2
Venue: arXiv.org
Last Checked: 3 months ago

Abstract

As LLMs grow in complexity, achieving state-of-the-art performance requires tight co-design across algorithms, software, and hardware. Today's reliance on a single dominant platform limits portability, creates vendor lock-in, and raises barriers for new AI hardware. In this work, we make the case for combining just-in-time (JIT) compilation with comprehensive kernel parameter autotuning to enable portable LLM inference with state-of-the-art performance without code changes. Focusing on performance-critical LLM kernels, we demonstrate that this approach explores up to 15x more kernel parameter configurations, produces significantly more diverse code across multiple dimensions, and even outperforms vendor-optimized implementations by up to 230%, all while reducing kernel code size by 70x and eliminating manual code optimizations. Our results highlight autotuning as a promising path to unlocking model portability across GPU vendors.

