TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives
March 26, 2025 Β· Declared Dead Β· π Conference on Machine Learning and Systems
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Size Zheng, Jin Fang, Xuegui Zheng, Qi Hou, Wenlei Bao, Ningxin Zheng, Ziheng Jiang, Dongyang Wang, Jianxi Ye, Haibin Lin, Li-Wen Chang, Xin Liu
arXiv ID
2503.20313
Category
cs.DC: Distributed Computing
Citations
15
Venue
Conference on Machine Learning and Systems
Last Checked
4 months ago
Abstract
Large deep learning models have achieved state-of-the-art performance in a wide range of tasks. These models often necessitate distributed systems for efficient training and inference. The fundamental building blocks for distributed model execution are intra-layer parallel operators. The most effective approach to enhancing the performance of intra-layer parallel operators involves overlapping computation with communication. The overlapping can be achieved through either operator decomposition or kernel fusion. While decomposing operators is straightforward to implement, it often results in suboptimal performance. On the other hand, fusing communication kernels with compute kernels demands significant expertise and is error-prone. In this paper, we propose TileLink to enable efficient compilation and generation of overlapped compute-communication kernels. TileLink is composed of frontend and backend. In the frontend, TileLink decouples the design space of communication and computation, linking these two parts via tile-centric primitives. In the backend, TileLink translates these primitives into low-level communication instructions, integrating the communication and computation components to achieve overlapped execution. In experiments, TileLink achieves from $1.17\times$ to $20.76\times$ speedup to non-overlapping baseline and achieves performance comparable to state-of-the-art overlapping libraries on GPUs.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
π Similar Papers
In the same crypt β Distributed Computing
R.I.P.
π»
Ghosted
R.I.P.
π»
Ghosted
Reproducing GW150914: the first observation of gravitational waves from a binary black hole merger
R.I.P.
π»
Ghosted
MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems
R.I.P.
π»
Ghosted
Adaptive Federated Learning in Resource Constrained Edge Computing Systems
R.I.P.
π»
Ghosted
Edge Intelligence: Paving the Last Mile of Artificial Intelligence with Edge Computing
R.I.P.
π»
Ghosted
iFogSim: A Toolkit for Modeling and Simulation of Resource Management Techniques in Internet of Things, Edge and Fog Computing Environments
Died the same way β π» Ghosted
R.I.P.
π»
Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P.
π»
Ghosted
In-Datacenter Performance Analysis of a Tensor Processing Unit
R.I.P.
π»
Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P.
π»
Ghosted