π
π
The Cartographer
SpecMoE: A Fast and Efficient Mixture-of-Experts Inference via Self-Assisted Speculative Decoding
April 11, 2026 Β· Grace Period Β· + Add venue
Authors
Jehyeon Bang, Eunyeong Cho, Ranggi Hwang, Jinha Chung, Minsoo Rhu
arXiv ID
2604.10152
Category
cs.AI: Artificial Intelligence
Cross-listed
cs.LG
Citations
0
Abstract
The Mixture-of-Experts (MoE) architecture has emerged as a promising approach to mitigate the rising computational costs of large language models (LLMs) by selectively activating parameters. However, its high memory requirements and sub-optimal parameter efficiency pose significant challenges for efficient deployment. Although CPU-offloaded MoE inference systems have been proposed in the literature, they offer limited efficiency, particularly for large batch sizes. In this work, we propose SpecMoE, a memory-efficient MoE inference system based on our self-assisted speculative decoding algorithm. SpecMoE demonstrates the effectiveness of applying speculative decoding to MoE inference without requiring additional model training or fine-tuning. Our system improves inference throughput by up to $4.30\times$, while significantly reducing bandwidth requirements of both memory and interconnect on memory-constrained systems.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
π Similar Papers
In the same crypt β Artificial Intelligence
R.I.P.
π»
Ghosted
Explanation in Artificial Intelligence: Insights from the Social Sciences
R.I.P.
π»
Ghosted
Federated Machine Learning: Concept and Applications
R.I.P.
π»
Ghosted
Counterfactual Explanations without Opening the Black Box: Automated Decisions and the GDPR
R.I.P.
π»
Ghosted
DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks
R.I.P.
π»
Ghosted