Resource Allocation and Workload Scheduling for Large-Scale Distributed Deep Learning: A Survey
June 12, 2024 · The Cartographer · arXiv.org
"No code URL or promise found in abstract"
"Title-pattern auto-detect: Resource Allocation and Workload Scheduling for Large-Scale Distributed Deep Learning: A Survey"
Evidence collected by the PWNC Scanner
Authors
Feng Liang, Zhen Zhang, Haifeng Lu, Chengming Li, Victor C. M. Leung, Yanyi Guo, Xiping Hu
arXiv ID
2406.08115
Category
cs.DC: Distributed Computing
Cross-listed
cs.AI
Citations
11
Venue
arXiv.org
Last Checked
3 days ago
Abstract
With rapidly increasing distributed deep learning workloads in large-scale data centers, efficient distributed deep learning framework strategies for resource allocation and workload scheduling have become the key to high-performance deep learning. The large-scale environment with large volumes of datasets, models, and computational and communication resources raises various unique challenges for resource allocation and workload scheduling in distributed deep learning, such as scheduling complexity, resource and workload heterogeneity, and fault tolerance. To uncover these challenges and corresponding solutions, this survey reviews the literature, mainly from 2019 to 2024, on efficient resource allocation and workload scheduling strategies for large-scale distributed DL. We explore these strategies by focusing on various resource types, scheduling granularity levels, and performance goals during distributed training and inference processes. We highlight critical challenges for each topic and discuss key insights of existing technologies. To illustrate practical large-scale resource allocation and workload scheduling in real distributed deep learning scenarios, we use a case study of training large language models. This survey aims to encourage computer science, artificial intelligence, and communications researchers to understand recent advances and explore future research directions for efficient framework strategies for large-scale distributed deep learning.
Similar Papers
In the same crypt: Distributed Computing
Reproducing GW150914: the first observation of gravitational waves from a binary black hole merger (R.I.P., Ghosted)
MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems (R.I.P., Ghosted)
Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems (R.I.P., Ghosted)
Adaptive Federated Learning in Resource Constrained Edge Computing Systems (R.I.P., Ghosted)