COMET: A Framework for Modeling Compound Operation Dataflows with Explicit Collectives

August 30, 2025 Β· Declared Dead Β· πŸ› arXiv.org

πŸ‘» CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors Shubham Negi, Manik Singhal, Aayush Ankit, Sudeep Bhoja, Kaushik Roy arXiv ID 2509.00599 Category cs.AR: Hardware Architecture Cross-listed cs.DC Citations 2 Venue arXiv.org Last Checked 3 months ago
Abstract
Modern machine learning accelerators are designed to efficiently execute deep neural networks (DNNs) by optimizing data movement, memory hierarchy, and compute throughput. However, emerging DNN models such as large language models, state space models increasingly rely on compound operations-structured compositions of multiple basic operations-which introduce new challenges for dataflow optimization and minimizing off-chip memory traffic. Moreover, as model size continues to grow, deployment across spatially distributed compute clusters becomes essential, requiring frequent and complex collective communication. Existing dataflow optimization frameworks and performance models either focus on single operations or lack explicit modeling of collective communication cost, limiting their applicability to modern workloads. To address these limitations, we propose, a framework for modeling and optimizing dataflow for compound operations on machine learning accelerators. COMET introduces a novel representation that explicitly models collective communication across spatial clusters, along with latency and energy cost models that account for both GEMM and non-GEMM operation level dependencies within compound operations. We demonstrate COMET's capabilities to analyze and optimize dataflows for compound operations such as GEMM--Softmax, GEMM--LayerNorm, and self-attention, across both edge and cloud accelerator configurations. Our collective-aware modeling enables exploration of a broader mapping space, leading to improved performance and energy efficiency. Specifically, our optimized dataflows achieve up to 1.42$\times$ speedup for GEMM-Softmax, 3.46$\times$ for GEMM-LayerNorm and 1.82$\times$ for self-attention compared to unfused baselines.
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

πŸ“œ Similar Papers

In the same crypt β€” Hardware Architecture

Died the same way β€” πŸ‘» Ghosted