Old Age
Improving Contrastive Learning of Sentence Embeddings with Case-Augmented Positives and Retrieved Negatives
June 06, 2022 · Entered Twilight · Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Repo contents: LICENSE, README.md, data, examples, models, requirements.txt, scripts, training, utils
Authors
Wei Wang, Liangzhu Ge, Jingqiao Zhang, Cheng Yang
arXiv ID
2206.02457
Category
cs.CL: Computation & Language
Cross-listed
cs.IR
Citations
26
Venue
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Repository
https://github.com/alibaba/SimCSE-with-CARDS
⭐ 16
Last Checked
1 month ago
Abstract
Following SimCSE, contrastive learning based methods have achieved state-of-the-art (SOTA) performance in learning sentence embeddings. However, unsupervised contrastive learning methods still lag far behind their supervised counterparts. We attribute this to the quality of positive and negative samples, and aim to improve both. Specifically, for positive samples, we propose switch-case augmentation, which flips the case of the first letter of randomly selected words in a sentence. This counteracts the intrinsic bias of pre-trained token embeddings toward word frequency, case, and subwords. For negative samples, we retrieve hard negatives from the whole dataset based on a pre-trained language model. Combining these two methods with SimCSE, our proposed Contrastive learning with Augmented and Retrieved Data for Sentence embedding (CARDS) method significantly surpasses the current SOTA on STS benchmarks in the unsupervised setting.
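For illustration, here is a minimal sketch of the two ideas described in the abstract: per-word switch-case augmentation and hard-negative retrieval by nearest-neighbor search over pre-computed, L2-normalized sentence embeddings. The names `switch_case_augment`, `retrieve_hard_negatives`, `flip_prob`, and `top_k` are illustrative assumptions, not the identifiers used in the linked repository.

```python
import random
import numpy as np

def switch_case_augment(sentence, flip_prob=0.15, rng=None):
    """Switch-case augmentation: flip the case of the first letter of
    randomly selected words to build a positive view of the sentence.
    `flip_prob` is an assumed hyperparameter name, not the repo's."""
    rng = rng or random.Random()
    out = []
    for word in sentence.split():
        if word and word[0].isalpha() and rng.random() < flip_prob:
            word = word[0].swapcase() + word[1:]
        out.append(word)
    return " ".join(out)

def retrieve_hard_negatives(embeddings, query_index, top_k=1):
    """Hard-negative retrieval: return indices of the sentences most similar
    to the query under a pre-trained encoder, excluding the query itself.
    `embeddings` is an (N, d) array of L2-normalized sentence embeddings."""
    sims = embeddings @ embeddings[query_index]
    sims[query_index] = -np.inf  # never retrieve the query sentence itself
    return np.argsort(-sims)[:top_k].tolist()

if __name__ == "__main__":
    rng = random.Random(0)
    print(switch_case_augment("contrastive learning of sentence embeddings",
                              flip_prob=0.5, rng=rng))
    embs = np.random.default_rng(0).standard_normal((100, 32))
    embs /= np.linalg.norm(embs, axis=1, keepdims=True)
    print(retrieve_hard_negatives(embs, query_index=3, top_k=2))
```

In SimCSE-style training, the augmented sentence would serve as the positive for its original, and the retrieved nearest neighbors as additional hard negatives, per the description in the abstract.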
Similar Papers
In the same crypt: Computation & Language
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding · R.I.P. · Ghosted
Language Models are Few-Shot Learners · R.I.P. · Ghosted
RoBERTa: A Robustly Optimized BERT Pretraining Approach · R.I.P. · Ghosted
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension · R.I.P. · Ghosted