Accelerating Frontier MoE Training with 3D Integrated Optics
September 09, 2025 Β· Declared Dead Β· π IEEE Symposium on High-Performance Interconnects
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Mikhail Bernadskiy, Peter Carson, Thomas Graham, Taylor Groves, Ho John Lee, Eric Yeh
arXiv ID
2510.15893
Category
cs.AR: Hardware Architecture
Cross-listed
cs.AI,
cs.DC,
cs.LG
Citations
1
Venue
IEEE Symposium on High-Performance Interconnects
Last Checked
3 months ago
Abstract
The unabated growth in AI workload demands is driving the need for concerted advances in compute, memory, and interconnect performance. As traditional semiconductor scaling slows, high-speed interconnects have emerged as the new scaling engine, enabling the creation of larger logical GPUs by linking many GPUs into a single, low-latency, high-bandwidth compute domain. While initial scale-up fabrics leveraged copper interconnects for their power and cost advantages, the maximum reach of passive electrical interconnects (approximately 1 meter) effectively limits the scale-up domain to within a single rack. The advent of 3D-stacked optics and logic offers a transformative, power-efficient scale-up solution for connecting hundreds of GPU packages (thousands of GPUs) across multiple data center racks. This work explores the design tradeoffs of scale-up technologies and demonstrates how frontier LLMs necessitate novel photonic solutions to achieve aggressive power and performance targets. We model the benefits of 3D CPO (Passage) enabled GPUs and switches within the scale-up domain when training Frontier Mixture of Experts (MoE) models exceeding one trillion parameters. Our results show that the substantial increases in bandwidth and radix enabled by 3D CPO allow for an 8X increase in scale-up capability. This affords new opportunities for multi-dimensional parallelism within the scale-up domain and results in a 2.7X reduction in time-to-train, unlocking unprecedented model scaling.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
π Similar Papers
In the same crypt β Hardware Architecture
R.I.P.
π»
Ghosted
R.I.P.
π»
Ghosted
Corona: System Implications of Emerging Nanophotonic Technology
R.I.P.
π»
Ghosted
A scalable multi-core architecture with heterogeneous memory structures for Dynamic Neuromorphic Asynchronous Processors (DYNAPs)
R.I.P.
π»
Ghosted
SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning
R.I.P.
π»
Ghosted
Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks
R.I.P.
π»
Ghosted
SpArch: Efficient Architecture for Sparse Matrix Multiplication
Died the same way β π» Ghosted
R.I.P.
π»
Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P.
π»
Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P.
π»
Ghosted
Explanation in Artificial Intelligence: Insights from the Social Sciences
R.I.P.
π»
Ghosted