GFormer: Accelerating Large Language Models with Optimized Transformers on Gaudi Processors
December 19, 2024 Β· Declared Dead Β· π arXiv.org
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Chengming Zhang, Xinheng Ding, Baixi Sun, Xiaodong Yu, Weijian Zheng, Zhen Xie, Dingwen Tao
arXiv ID
2412.19829
Category
cs.AR: Hardware Architecture
Cross-listed
cs.LG
Citations
0
Venue
arXiv.org
Last Checked
3 months ago
Abstract
Heterogeneous hardware like Gaudi processor has been developed to enhance computations, especially matrix operations for Transformer-based large language models (LLMs) for generative AI tasks. However, our analysis indicates that Transformers are not fully optimized on such emerging hardware, primarily due to inadequate optimizations in non-matrix computational kernels like Softmax and in heterogeneous resource utilization, particularly when processing long sequences. To address these issues, we propose an integrated approach (called GFormer) that merges sparse and linear attention mechanisms. GFormer aims to maximize the computational capabilities of the Gaudi processor's Matrix Multiplication Engine (MME) and Tensor Processing Cores (TPC) without compromising model quality. GFormer includes a windowed self-attention kernel and an efficient outer product kernel for causal linear attention, aiming to optimize LLM inference on Gaudi processors. Evaluation shows that GFormer significantly improves efficiency and model performance across various tasks on the Gaudi processor and outperforms state-of-the-art GPUs.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
π Similar Papers
In the same crypt β Hardware Architecture
R.I.P.
π»
Ghosted
R.I.P.
π»
Ghosted
Corona: System Implications of Emerging Nanophotonic Technology
R.I.P.
π»
Ghosted
A scalable multi-core architecture with heterogeneous memory structures for Dynamic Neuromorphic Asynchronous Processors (DYNAPs)
R.I.P.
π»
Ghosted
SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning
R.I.P.
π»
Ghosted
Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks
R.I.P.
π»
Ghosted
SpArch: Efficient Architecture for Sparse Matrix Multiplication
Died the same way β π» Ghosted
R.I.P.
π»
Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P.
π»
Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P.
π»
Ghosted
Explanation in Artificial Intelligence: Insights from the Social Sciences
R.I.P.
π»
Ghosted