UPMEM Unleashed: Software Secrets for Speed

October 03, 2025 · Declared Dead · 🏛 arXiv.org

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors Krystian Chmielewski, Jarosław Ławnicki, Uladzislau Lukyanau, Tadeusz Kobus, Maciej Maciejewski arXiv ID 2510.15927 Category cs.AR: Hardware Architecture Cross-listed cs.DC, cs.PF Citations 0 Venue arXiv.org Last Checked 3 months ago

Abstract

Developing kernels for Processing-In-Memory (PIM) platforms poses unique challenges in data management and parallel programming on limited processing units. Although software development kits (SDKs) for PIM, such as the UPMEM SDK, provide essential tools, these emerging platforms still leave significant room for performance optimization. In this paper, we reveal surprising inefficiencies in UPMEM software stack and play with non-standard programming techniques. By making simple modifications to the assembly generated by the UPMEM compiler, we achieve speedups of 1.6-2x in integer addition and 1.4-5.9x in integer multiplication, depending on the data type. We also demonstrate that bit-serial processing of low precision data is a viable option for UPMEM: in INT4 bit-serial dot-product calculation, UPMEM can achieve over 2.7x speedup over the baseline. Minor API extensions for PIM allocation that account for the non-uniform memory access (NUMA) architecture of the server further improve the consistency and throughput of host-PIM data transfers by up to 2.9x. Finally, we show that, when the matrix is preloaded into PIM, our optimized kernels outperform a dual-socket CPU server by over 3x for INT8 generalized matrix-vector multiplication (GEMV) and by 10x for INT4 GEMV. Our optimized INT8 GEMV kernel outperforms the baseline 3.5x.