Extending the Nested Parallel Model to the Nested Dataflow Model with Provably Efficient Schedulers
February 15, 2016 Β· Declared Dead Β· π ACM Symposium on Parallelism in Algorithms and Architectures
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
David Dinh, Harsha Vardhan Simhadri, Yuan Tang
arXiv ID
1602.04552
Category
cs.DC: Distributed Computing
Citations
24
Venue
ACM Symposium on Parallelism in Algorithms and Architectures
Last Checked
4 months ago
Abstract
The nested parallel (a.k.a. fork-join) model is widely used for writing parallel programs. However, the two composition constructs, i.e. "$\parallel$" (parallel) and "$;$" (serial), are insufficient in expressing "partial dependencies" or "partial parallelism" in a program. We propose a new dataflow composition construct "$\leadsto$" to express partial dependencies in algorithms in a processor- and cache-oblivious way, thus extending the Nested Parallel (NP) model to the \emph{Nested Dataflow} (ND) model. We redesign several divide-and-conquer algorithms ranging from dense linear algebra to dynamic-programming in the ND model and prove that they all have optimal span while retaining optimal cache complexity. We propose the design of runtime schedulers that map ND programs to multicore processors with multiple levels of possibly shared caches (i.e, Parallel Memory Hierarchies) and provide theoretical guarantees on their ability to preserve locality and load balance. For this, we adapt space-bounded (SB) schedulers for the ND model. We show that our algorithms have increased "parallelizability" in the ND model, and that SB schedulers can use the extra parallelizability to achieve asymptotically optimal bounds on cache misses and running time on a greater number of processors than in the NP model. The running time for the algorithms in this paper is $O\left(\frac{\sum_{i=0}^{h-1} Q^{*}({\mathsf t};Ο\cdot M_i)\cdot C_i}{p}\right)$, where $Q^{*}$ is the cache complexity of task ${\mathsf t}$, $C_i$ is the cost of cache miss at level-$i$ cache which is of size $M_i$, $Ο\in(0,1)$ is a constant, and $p$ is the number of processors in an $h$-level cache hierarchy.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
π Similar Papers
In the same crypt β Distributed Computing
R.I.P.
π»
Ghosted
R.I.P.
π»
Ghosted
Reproducing GW150914: the first observation of gravitational waves from a binary black hole merger
R.I.P.
π»
Ghosted
MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems
R.I.P.
π»
Ghosted
Adaptive Federated Learning in Resource Constrained Edge Computing Systems
R.I.P.
π»
Ghosted
Edge Intelligence: Paving the Last Mile of Artificial Intelligence with Edge Computing
R.I.P.
π»
Ghosted
iFogSim: A Toolkit for Modeling and Simulation of Resource Management Techniques in Internet of Things, Edge and Fog Computing Environments
Died the same way β π» Ghosted
R.I.P.
π»
Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P.
π»
Ghosted
In-Datacenter Performance Analysis of a Tensor Processing Unit
R.I.P.
π»
Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P.
π»
Ghosted