Kub: Enabling Elastic HPC Workloads on Containerized Environments
October 14, 2024 · Declared Dead · Symposium on Computer Architecture and High Performance Computing
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Daniel Medeiros, Jacob Wahlgren, Gabin Schieffer, Ivy Peng
arXiv ID
2410.10655
Category
cs.DC: Distributed Computing
Citations
7
Venue
Symposium on Computer Architecture and High Performance Computing
Last Checked
4 months ago
Abstract
The conventional model of resource allocation in HPC systems is static. Thus, a job cannot leverage newly available resources in the system or release underutilized resources during the execution. In this paper, we present Kub, a methodology that enables elastic execution of HPC workloads on Kubernetes so that the resources allocated to a job can be dynamically scaled during the execution. One main optimization of our method is to maximize the reuse of the originally allocated resources so that the disruption to the running job can be minimized. The scaling procedure is coordinated among nodes through remote procedure calls on Kubernetes for deploying workloads in the cloud. We evaluate our approach using one synthetic benchmark and two production-level MPI-based HPC applications -- GROMACS and CM1. Our results demonstrate that the benefits of adapting the allocated resources depend on the workload characteristics. In the tested cases, a properly chosen scaling point for increasing resources during execution achieved up to 2x speedup. Also, the overhead of checkpointing and data reshuffling significantly influences the selection of optimal scaling points and requires application-specific knowledge.
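The abstract's closing point, that checkpointing and data-reshuffling overhead determines whether a given scaling point is worthwhile, can be illustrated with a toy cost model. The sketch below is not the method from the paper: the function names, the single lumped `overhead` term, and the assumption that the remaining work speeds up in proportion to the node count (scaled by a hypothetical `efficiency` factor) are all illustrative assumptions.

```python
def total_runtime(base_runtime, scale_at, overhead, old_nodes, new_nodes, efficiency=1.0):
    """Estimate wall time (seconds) when a job is rescaled partway through.

    Toy model, not taken from the paper:
      base_runtime: runtime on old_nodes if the job is never rescaled
      scale_at:     elapsed time at which the job is rescaled
      overhead:     lumped cost of checkpoint, data reshuffle, and restart
      efficiency:   assumed parallel efficiency of the new allocation (0..1]
    """
    if not 0 <= scale_at <= base_runtime:
        raise ValueError("scale_at must lie within the base runtime")
    remaining = base_runtime - scale_at
    # Assumed speedup of the remaining work after the node count changes.
    speedup = (new_nodes / old_nodes) * efficiency
    return scale_at + overhead + remaining / speedup


def scaling_pays_off(base_runtime, scale_at, overhead, old_nodes, new_nodes, efficiency=1.0):
    """Return True if rescaling at scale_at finishes earlier than not rescaling."""
    rescaled = total_runtime(base_runtime, scale_at, overhead,
                             old_nodes, new_nodes, efficiency)
    return rescaled < base_runtime


if __name__ == "__main__":
    # Doubling nodes halfway through a 1000 s job with 100 s of overhead.
    print(total_runtime(1000, 500, 100, 4, 8))
    # Doubling nodes with only 100 s of work left and 100 s of overhead.
    print(scaling_pays_off(1000, 900, 100, 4, 8))
```

Under these assumptions, a late scaling point leaves too little remaining work to amortize the overhead, which is one way to read the abstract's remark that the choice of scaling point needs application-specific knowledge.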
Similar Papers
In the same crypt: Distributed Computing
R.I.P. 👻 Ghosted
R.I.P. 👻 Ghosted
Reproducing GW150914: the first observation of gravitational waves from a binary black hole merger
R.I.P. 👻 Ghosted
MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems
R.I.P. 👻 Ghosted
Adaptive Federated Learning in Resource Constrained Edge Computing Systems
R.I.P. 👻 Ghosted
Edge Intelligence: Paving the Last Mile of Artificial Intelligence with Edge Computing
R.I.P. 👻 Ghosted
iFogSim: A Toolkit for Modeling and Simulation of Resource Management Techniques in Internet of Things, Edge and Fog Computing Environments
Died the same way: 👻 Ghosted
R.I.P. 👻 Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P. 👻 Ghosted
In-Datacenter Performance Analysis of a Tensor Processing Unit
R.I.P. 👻 Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P. 👻 Ghosted