Old Age
Addressing Token Uniformity in Transformers via Singular Value Transformation
August 24, 2022 · Entered Twilight · Conference on Uncertainty in Artificial Intelligence
Repo contents: .gitignore, .vscode, LICENSE, README.md, examples, intro_pic.png, src, sts_results.png, unsupervisedSTS, utils
Authors
Hanqi Yan, Lin Gui, Wenjie Li, Yulan He
arXiv ID
2208.11790
Category
cs.CL: Computation & Language
Citations
16
Venue
Conference on Uncertainty in Artificial Intelligence
Repository
https://github.com/hanqi-qi/tokenUni.git
⭐ 9
Last Checked
1 month ago
Abstract
Token uniformity is commonly observed in transformer-based models, in which different tokens share a large proportion of similar information after going through multiple stacked self-attention layers in a transformer. In this paper, we propose to use the distribution of singular values of the outputs of each transformer layer to characterise the phenomenon of token uniformity, and empirically illustrate that a less skewed singular value distribution can alleviate the 'token uniformity' problem. Based on our observations, we define several desirable properties of singular value distributions and propose a novel transformation function for updating the singular values. We show that, apart from alleviating token uniformity, the transformation function should preserve the local neighbourhood structure in the original embedding space. Our proposed singular value transformation function is applied to a range of transformer-based language models such as BERT, ALBERT, RoBERTa and DistilBERT, and improved performance is observed in semantic textual similarity evaluation and a range of GLUE tasks. Our source code is available at https://github.com/hanqi-qi/tokenUni.git.
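The abstract's diagnosis-and-fix recipe can be sketched in a few lines of NumPy: compute the singular values of a layer's token-output matrix, measure how skewed that spectrum is, and rescale the singular values to flatten it before reconstructing the outputs. The power-law rescaling below is only an illustrative stand-in, not the paper's proposed transformation function; `flatten_spectrum`, `singular_value_skew`, and the `alpha` parameter are hypothetical names for this sketch.

```python
import numpy as np

def singular_value_skew(H):
    """Skewness of the (normalised) singular value spectrum of H.

    A highly skewed spectrum (one dominant singular value) indicates that
    token representations collapse toward a shared direction, i.e. token
    uniformity."""
    s = np.linalg.svd(H, compute_uv=False)
    s = s / s.sum()  # treat the spectrum as a distribution
    return float(((s - s.mean()) ** 3).mean() / (s.std() ** 3))

def flatten_spectrum(H, alpha=0.5):
    """Illustrative singular value transformation: raise each singular value
    to a power alpha < 1, which compresses the gap between the dominant and
    the tail singular values, then reconstruct the token outputs."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    s_new = s ** alpha
    s_new = s_new * (s.max() / s_new.max())  # keep the leading scale fixed
    return U @ np.diag(s_new) @ Vt

# Synthetic "near-uniform" token matrix: 32 tokens that all share one
# dominant direction, plus small per-token noise.
rng = np.random.default_rng(0)
H = np.outer(np.ones(32), rng.normal(size=64)) + 0.1 * rng.normal(size=(32, 64))

print(singular_value_skew(H))                    # skewed spectrum
print(singular_value_skew(flatten_spectrum(H)))  # flatter after transformation
```

The transformed matrix keeps the same shape and leading singular direction, so downstream layers still receive outputs on the original scale; only the relative weight of the tail directions grows.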
Similar Papers
In the same crypt: Computation & Language
Old Age · BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
👻 Ghosted · Language Models are Few-Shot Learners
👻 Ghosted · RoBERTa: A Robustly Optimized BERT Pretraining Approach
👻 Ghosted · BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension