UPMEM Unleashed: Software Secrets for Speed
October 03, 2025 · Declared Dead · arXiv.org
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Krystian Chmielewski, Jarosław Ławnicki, Uladzislau Lukyanau, Tadeusz Kobus, Maciej Maciejewski
arXiv ID
2510.15927
Category
cs.AR: Hardware Architecture
Cross-listed
cs.DC, cs.PF
Citations
0
Venue
arXiv.org
Last Checked
3 months ago
Abstract
Developing kernels for Processing-In-Memory (PIM) platforms poses unique challenges in data management and parallel programming on limited processing units. Although software development kits (SDKs) for PIM, such as the UPMEM SDK, provide essential tools, these emerging platforms still leave significant room for performance optimization. In this paper, we reveal surprising inefficiencies in the UPMEM software stack and experiment with non-standard programming techniques. By making simple modifications to the assembly generated by the UPMEM compiler, we achieve speedups of 1.6-2x in integer addition and 1.4-5.9x in integer multiplication, depending on the data type. We also demonstrate that bit-serial processing of low-precision data is a viable option for UPMEM: in INT4 bit-serial dot-product calculation, UPMEM can achieve over 2.7x speedup over the baseline. Minor API extensions for PIM allocation that account for the non-uniform memory access (NUMA) architecture of the server further improve the consistency and throughput of host-PIM data transfers by up to 2.9x. Finally, we show that, when the matrix is preloaded into PIM, our optimized kernels outperform a dual-socket CPU server by over 3x for INT8 generalized matrix-vector multiplication (GEMV) and by 10x for INT4 GEMV. Our optimized INT8 GEMV kernel outperforms the baseline by 3.5x.
Similar Papers
In the same crypt: Hardware Architecture
R.I.P. · Ghosted
Corona: System Implications of Emerging Nanophotonic Technology
R.I.P. · Ghosted
A scalable multi-core architecture with heterogeneous memory structures for Dynamic Neuromorphic Asynchronous Processors (DYNAPs)
R.I.P. · Ghosted
SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning
R.I.P. · Ghosted
Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks
R.I.P. · Ghosted
SpArch: Efficient Architecture for Sparse Matrix Multiplication
Died the same way: Ghosted
R.I.P. · Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P. · Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P. · Ghosted
Explanation in Artificial Intelligence: Insights from the Social Sciences
R.I.P. · Ghosted