Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem

May 22, 2023 · Declared Dead · 🏛 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication

👻 CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Behnaz Arzani, Siva Kesava Reddy Kakarla, Miguel Castro, Srikanth Kandula, Saeed Maleki, Luke Marshall
arXiv ID: 2305.13479
Category: cs.NI (Networking & Internet)
Citations: 49
Venue: Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication
Last Checked: 3 months ago
Abstract
We show that the communication schedulers proposed in recent work for ML collectives do not scale to the increasing problem sizes that arise from training larger models. These works also often produce suboptimal schedules. We make a connection with similar problems in traffic engineering and propose a new method, TECCL, that finds better quality schedules (e.g., finishes collectives faster and/or while sending fewer bytes) and does so more quickly on larger topologies. We present results on many different GPU topologies that show substantial improvement over the state-of-the-art.
