Sandwich: Separating Prefill-Decode Compilation for Efficient CPU LLM Serving

May 19, 2025 · Declared Dead · 🏛 arXiv.org

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Juntao Zhao, Jiuru Li, Chuan Wu
arXiv ID: 2507.18454
Category: cs.AR (Hardware Architecture)
Cross-listed: cs.AI, cs.DC, cs.PL
Citations: 1
Venue: arXiv.org
Last Checked: 3 months ago

Abstract

Utilizing CPUs to serve large language models (LLMs) is a resource-friendly alternative to GPU serving. Existing CPU-based solutions ignore workload differences between the prefill and the decode phases of LLM inference, applying a static per-NUMA (Non-Uniform Memory Access) node model partition and utilizing vendor libraries for operator-level execution, which is suboptimal. We propose Sandwich, a hardware-centric CPU-based LLM serving engine that uses different execution plans for the prefill and decode phases and optimizes them separately. We evaluate Sandwich across diverse baselines and datasets on five CPU platforms, including x86 with AVX-2 and AVX-512, as well as ARM with NEON. Sandwich achieves an average 2.01x throughput improvement and 90% satisfactory time-to-first-token (TTFT) and time-per-output-token (TPOT) latencies with up to 3.40x lower requirements in single sequence serving, and significant improvement in Goodput in continuous-batching serving. The GEMM kernels generated by Sandwich outperform representative vendor kernels and other dynamic shape solutions, achieving performance comparable to static compilers with three orders of magnitude less kernel tuning costs.

📜 Similar Papers

In the same crypt — Hardware Architecture

R.I.P. 👻 Ghosted

In-Datacenter Performance Analysis of a Tensor Processing Unit

Norman P. Jouppi, Cliff Young, ... (+73 more)

cs.AR 🏛 ISCA 📚 5.1K cites 9 years ago

R.I.P. 👻 Ghosted

Corona: System Implications of Emerging Nanophotonic Technology

Dana Vantrease, Robert Schreiber, ... (+8 more)

cs.AR 🏛 ISCA 📚 710 cites 2 years ago

R.I.P. 👻 Ghosted

A scalable multi-core architecture with heterogeneous memory structures for Dynamic Neuromorphic Asynchronous Processors (DYNAPs)

Saber Moradi, Ning Qiao, ... (+2 more)

cs.AR 🏛 IEEE TBCS 📚 544 cites 8 years ago

R.I.P. 👻 Ghosted

SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning

Hanrui Wang, Zhekai Zhang, Song Han

cs.AR 🏛 ISCA 📚 503 cites 5 years ago

R.I.P. 👻 Ghosted

Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks

Charles Eckert, Xiaowei Wang, ... (+6 more)

cs.AR 🏛 ISCA 📚 373 cites 8 years ago

R.I.P. 👻 Ghosted

SpArch: Efficient Architecture for Sparse Matrix Multiplication

Zhekai Zhang, Hanrui Wang, ... (+2 more)

cs.AR 🏛 ISCA 📚 274 cites 6 years ago

Died the same way — 👻 Ghosted

R.I.P. 👻 Ghosted

Federated Learning: Strategies for Improving Communication Efficiency

Jakub Konečný, H. Brendan McMahan, ... (+4 more)

cs.LG 🏛 arXiv 📚 5.2K cites 9 years ago

R.I.P. 👻 Ghosted

Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning

Hoo-Chang Shin, Holger R. Roth, ... (+7 more)

cs.CV 🏛 IEEE TMI 📚 4.9K cites 10 years ago

R.I.P. 👻 Ghosted

Explanation in Artificial Intelligence: Insights from the Social Sciences

Tim Miller

cs.AI 🏛 AI 📚 4.9K cites 8 years ago

R.I.P. 👻 Ghosted

Equality of Opportunity in Supervised Learning

Moritz Hardt, Eric Price, Nathan Srebro

cs.LG 🏛 NeurIPS 📚 4.9K cites 9 years ago