How Far Does BERT Look At:Distance-based Clustering and Analysis of BERT$'$s Attention
November 02, 2020 ยท Declared Dead ยท ๐ International Conference on Computational Linguistics
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Yue Guan, Jingwen Leng, Chao Li, Quan Chen, Minyi Guo
arXiv ID
2011.00943
Category
cs.CL: Computation & Language
Citations
19
Venue
International Conference on Computational Linguistics
Last Checked
4 months ago
Abstract
Recent research on the multi-head attention mechanism, especially that in pre-trained models such as BERT, has shown us heuristics and clues in analyzing various aspects of the mechanism. As most of the research focus on probing tasks or hidden states, previous works have found some primitive patterns of attention head behavior by heuristic analytical methods, but a more systematic analysis specific on the attention patterns still remains primitive. In this work, we clearly cluster the attention heatmaps into significantly different patterns through unsupervised clustering on top of a set of proposed features, which corroborates with previous observations. We further study their corresponding functions through analytical study. In addition, our proposed features can be used to explain and calibrate different attention heads in Transformer models.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
๐ Similar Papers
In the same crypt โ Computation & Language
๐
๐
Old Age
๐
๐
Old Age
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
๐
๐
Old Age
XLNet: Generalized Autoregressive Pretraining for Language Understanding
๐ฎ
๐ฎ
The Ethereal
Effective Approaches to Attention-based Neural Machine Translation
๐
๐
Old Age
A large annotated corpus for learning natural language inference
๐
๐
Old Age
HellaSwag: Can a Machine Really Finish Your Sentence?
Died the same way โ ๐ป Ghosted
R.I.P.
๐ป
Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P.
๐ป
Ghosted
In-Datacenter Performance Analysis of a Tensor Processing Unit
R.I.P.
๐ป
Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P.
๐ป
Ghosted