Improving Contrastive Learning of Sentence Embeddings with Case-Augmented Positives and Retrieved Negatives

June 06, 2022 ยท Entered Twilight ยท ๐Ÿ› Annual International ACM SIGIR Conference on Research and Development in Information Retrieval

๐Ÿ’ค TWILIGHT: Eternal Rest
Repo abandoned since publication

Repo contents: LICENSE, README.md, data, examples, models, requirements.txt, scripts, training, utils

Authors Wei Wang, Liangzhu Ge, Jingqiao Zhang, Cheng Yang arXiv ID 2206.02457 Category cs.CL: Computation & Language Cross-listed cs.IR Citations 26 Venue Annual International ACM SIGIR Conference on Research and Development in Information Retrieval Repository https://github.com/alibaba/SimCSE-with-CARDS โญ 16 Last Checked 1 month ago
Abstract
Following SimCSE, contrastive learning based methods have achieved the state-of-the-art (SOTA) performance in learning sentence embeddings. However, the unsupervised contrastive learning methods still lag far behind the supervised counterparts. We attribute this to the quality of positive and negative samples, and aim to improve both. Specifically, for positive samples, we propose switch-case augmentation to flip the case of the first letter of randomly selected words in a sentence. This is to counteract the intrinsic bias of pre-trained token embeddings to frequency, word cases and subwords. For negative samples, we sample hard negatives from the whole dataset based on a pre-trained language model. Combining the above two methods with SimCSE, our proposed Contrastive learning with Augmented and Retrieved Data for Sentence embedding (CARDS) method significantly surpasses the current SOTA on STS benchmarks in the unsupervised setting.
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

๐Ÿ“œ Similar Papers

In the same crypt โ€” Computation & Language

๐ŸŒ… ๐ŸŒ… Old Age

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, ... (+6 more)

cs.CL ๐Ÿ› NeurIPS ๐Ÿ“š 166.0K cites 8 years ago