
InfoCSE: Information-aggregated Contrastive Learning of Sentence
Embeddings
Xing Wu1,2,3, Chaochen Gao1,2∗, Zijia Lin3, Jizhong Han1, Zhongyuan Wang3, Songlin Hu1,2†
1Institute of Information Engineering, Chinese Academy of Sciences
2School of Cyber Security, University of Chinese Academy of Sciences
3Kuaishou Technology
{gaochaochen,zangliangjun,hanjizhong,husonglin}@iie.ac.cn
{wuxing,wangzhongyuan}@kuaishou.com, linzijia07@tsinghua.org.cn
Abstract
Contrastive learning has been extensively studied in sentence embedding learning, which assumes that the embeddings of different views of the same sentence should be closer than those of different sentences. The constraint brought by this assumption is weak, and a good sentence representation should also be able to reconstruct the original sentence fragments. Therefore, this paper proposes an information-aggregated contrastive learning framework for learning unsupervised sentence embeddings, termed InfoCSE. InfoCSE forces the representation at the [CLS] position to aggregate denser sentence information by introducing an auxiliary masked language model (MLM) task and a well-designed network. We evaluate the proposed InfoCSE on several benchmark datasets w.r.t. the semantic textual similarity (STS) task. Experimental results show that InfoCSE outperforms SimCSE by an average Spearman correlation of 2.60% on BERT-base and 1.77% on BERT-large, achieving state-of-the-art results among unsupervised sentence representation learning methods. Our code is available at github.com/caskcsg/sentemb/tree/main/InfoCSE.
1 Introduction
Sentence embeddings aim to capture rich semantic information to be applied in many downstream tasks (Zhang et al.; Wu et al.; Liu et al., 2021). Recently, researchers have started to use contrastive learning to learn better unsupervised sentence embeddings (Gao et al., 2021; Yan et al., 2021; Wu et al., 2021a; Zhou et al., 2022; Wu et al., 2021b; Chuang et al., 2022). Contrastive learning methods assume that effective sentence embeddings should bring similar sentences closer and push dissimilar sentences farther apart. These methods use various data augmentation methods to randomly generate different views for each sentence and constrain one sentence to be semantically more similar to its augmented counterpart than to any other sentence.

∗The first two authors contributed equally.
†Corresponding author.
Model                 STS-B
SimCSE-BERTbase       86.2
w/ MLM
  λ = 0.01            85.7
  λ = 0.1             85.7
  λ = 1               85.1

Table 1: Table from SimCSE (Gao et al., 2021). The masked language model (MLM) objective brings a consistent drop to the SimCSE model in semantic textual similarity tasks. “w/” means “with”; λ is the balance hyperparameter for the MLM loss.
SimCSE (Gao et al., 2021) is the representative work on contrastive sentence embeddings, which uses dropout as minimal data augmentation. SimCSE encodes the same sentence twice into embeddings to obtain “positive pairs” and takes other sentence embeddings in the same mini-batch as “negatives”. There have been many improvements to SimCSE, including enhancing positive and negative sample construction methods (Wu et al., 2021a), alleviating the influence of improper mini-batch negatives (Zhou et al., 2022), and learning sentence representations that are aware of direct surface-level augmentations (Chuang et al., 2022).
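As a rough illustration of the SimCSE-style objective described above (a minimal sketch, not the authors' released implementation), the snippet below assumes a HuggingFace-style BERT encoder kept in training mode so that dropout is active; the function name, the [CLS] pooling choice, and the temperature value are illustrative assumptions.

import torch
import torch.nn.functional as F

def simcse_style_loss(encoder, input_ids, attention_mask, temperature=0.05):
    # Encode the same batch twice; with dropout active (train mode), the two
    # passes yield slightly different [CLS] embeddings, forming positive pairs.
    z1 = encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
    z2 = encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
    # Pairwise cosine similarities between the two views, scaled by a temperature.
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature
    # The i-th sentence's positive is the i-th embedding from the second pass;
    # all other sentences in the mini-batch serve as in-batch negatives.
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)

In this formulation, enlarging the mini-batch provides more negatives, and the temperature controls how sharply the model is penalized for ranking negatives close to the positive.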
Although SimCSE and its variants have achieved good results and can learn sentence embeddings that distinguish different sentences, this is not enough to indicate that the sentence embeddings already capture the semantics of sentences well. If a sentence embedding is sufficiently equivalent to the semantics of the sentence, it should also be able to reconstruct the original sentence to a large extent (Lu et al., 2021). However, as shown in Table 1 from (Gao et al., 2021), experiments show that “the masked language model objective brings a consistent drop in semantic textual similarity tasks” in SimCSE. This is due to the gradients of the MLM