A Few Fit Most: Improving Performance Portability of SGEMM on GPUs using Multi-Versioning
July 21, 2025 · Declared Dead · arXiv.org
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Robert Hochgraf, Sreepathi Pai
arXiv ID
2507.15277
Category
cs.PL: Programming Languages
Citations
0
Venue
arXiv.org
Last Checked
4 months ago
Abstract
Hand-optimizing linear algebra kernels for different GPU devices and applications is complex and labor-intensive. Instead, many developers use automatic performance tuning (autotuning) to achieve high performance on a variety of devices. However, autotuning "overfits", and must be redone if any part of the environment changes, such as if the device or input characteristics change. In most non-trivial cases, a single compute kernel cannot maintain near-optimal performance across all environments. Changing the kernel to specialize it to the current execution environment is possible, but on GPUs, runtime tuning and compilation can be expensive. In this work, we use multi-versioning -- producing several variants of the same code -- as a way to generate performance portable code. We describe a framework called portability tuning that can automatically generate multi-versioned code whose performance is portable, requiring no retuning. We evaluate our framework on a dataset of execution times for GEMM kernels from the CLBlast linear algebra library. We find our portability tuning techniques outperform CLBlast's default kernels -- often approaching within 10% of the theoretical maximum performance -- despite CLBlast using autotuning techniques. Further, we find that our generated programs generalize well to new and unseen devices, matching the performance of autotuning without ever portability tuning for those devices.
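The core idea in the abstract, that a few kernel variants can cover most environments, can be illustrated with a small selection problem. The sketch below is not the authors' portability tuning algorithm: it is a brute-force search over a made-up table of execution times. All variant names, timings, and function names (`TIMES`, `mean_slowdown`, `select_variants`, `dispatch`) are hypothetical. It picks the `k` variants that minimize the mean slowdown relative to the per-environment best, assuming each environment runs the fastest of the chosen variants.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

# Hypothetical execution times in ms: variant name -> one time per environment.
# The numbers are invented for illustration and are not taken from the paper.
TIMES: Dict[str, List[float]] = {
    "v_small_tile": [1.0, 4.0, 3.0, 1.2],
    "v_large_tile": [3.0, 1.0, 1.1, 3.5],
    "v_balanced":   [1.5, 1.5, 1.5, 1.5],
    "v_vectorized": [2.0, 1.2, 1.0, 2.5],
}


def mean_slowdown(chosen: Sequence[str], times: Dict[str, List[float]]) -> float:
    """Mean over environments of (best chosen time) / (best time of any variant)."""
    n_env = len(next(iter(times.values())))
    total = 0.0
    for env in range(n_env):
        best_overall = min(t[env] for t in times.values())
        best_chosen = min(times[v][env] for v in chosen)
        total += best_chosen / best_overall
    return total / n_env


def select_variants(times: Dict[str, List[float]], k: int) -> Tuple[str, ...]:
    """Exhaustively search all size-k subsets; only feasible for small tables."""
    return min(combinations(times, k), key=lambda c: mean_slowdown(c, times))


def dispatch(chosen: Sequence[str], times: Dict[str, List[float]], env: int) -> str:
    """Pick the fastest of the shipped variants for a given environment index."""
    return min(chosen, key=lambda v: times[v][env])


if __name__ == "__main__":
    for k in (1, 2):
        chosen = select_variants(TIMES, k)
        print(k, chosen, round(mean_slowdown(chosen, TIMES), 3))
```

With this toy table, a single variant leaves a large gap to the per-environment optimum, while two variants come close to it. Exhaustive search scales poorly with the number of variants, so a real system would likely need a heuristic or an optimization solver instead.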
Similar Papers
In the same crypt: Programming Languages
R.I.P.
👻
Ghosted
Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions
R.I.P.
👻
Ghosted
Glow: Graph Lowering Compiler Techniques for Neural Networks
R.I.P.
👻
Ghosted
Learnable Programming: Blocks and Beyond
R.I.P.
👻
Ghosted
Scenic: A Language for Scenario Specification and Scene Generation
R.I.P.
👻
Ghosted
Vandal: A Scalable Security Analysis Framework for Smart Contracts
Died the same way: 👻 Ghosted
R.I.P.
👻
Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P.
👻
Ghosted
In-Datacenter Performance Analysis of a Tensor Processing Unit
R.I.P.
👻
Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P.
👻
Ghosted