Enabling Compute-Communication Overlap in Distributed Deep Learning Training Platforms

June 30, 2020 ยท Declared Dead ยท ๐Ÿ› International Symposium on Computer Architecture

๐Ÿ‘ป CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors Saeed Rashidi, Matthew Denton, Srinivas Sridharan, Sudarshan Srinivasan, Amoghavarsha Suresh, Jade Ni, Tushar Krishna arXiv ID 2007.00156 Category cs.AR: Hardware Architecture Cross-listed cs.DC Citations 55 Venue International Symposium on Computer Architecture Last Checked 2 months ago
Abstract
Deep Learning (DL) training platforms are built by interconnecting multiple DL accelerators (e.g., GPU/TPU) via fast, customized interconnects with 100s of gigabytes (GBs) of bandwidth. However, as we identify in this work, driving this bandwidth is quite challenging. This is because there is a pernicious balance between using the accelerator's compute and memory for both DL computations and communication. This work makes two key contributions. First, via real system measurements and detailed modeling, we provide an understanding of compute and memory bandwidth demands for DL compute and comms. Second, we propose a novel DL collective communication accelerator called Accelerator Collectives Engine (ACE) that sits alongside the compute and networking engines at the accelerator endpoint. ACE frees up the endpoint's compute and memory resources for DL compute, which in turn reduces the required memory BW by 3.5X on average to drive the same network BW compared to state-of-the-art baselines. For modern DL workloads and different network sizes, ACE, on average, increases the effective network bandwidth utilization by 1.44X (up to 2.67X), resulting in an average of 1.41X (up to 1.51X), 1.12X (up to 1.17X), and 1.13X (up to 1.19X) speedup in iteration time for ResNet-50, GNMT and DLRM when compared to the best baseline configuration, respectively.
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

๐Ÿ“œ Similar Papers

In the same crypt โ€” Hardware Architecture

Died the same way โ€” ๐Ÿ‘ป Ghosted