InfoCSE: Information-aggregated Contrastive Learning of Sentence
Embeddings
Xing Wu1,2,3, Chaochen Gao1,2, Zijia Lin3, Jizhong Han1, Zhongyuan Wang3, Songlin Hu1,2†
1Institute of Information Engineering, Chinese Academy of Sciences
2School of Cyber Security, University of Chinese Academy of Sciences
3Kuaishou Technology
{gaochaochen,zangliangjun,hanjizhong,husonglin}@iie.ac.cn
{wuxing,wangzhongyuan}@kuaishou.com, linzijia07@tsinghua.org.cn
Abstract
Contrastive learning has been extensively stud-
ied in sentence embedding learning, which
assumes that the embeddings of different
views of the same sentence should be closer to each other. The
constraint brought by this assumption is weak,
and a good sentence representation should also
be able to reconstruct the original sentence
fragments. Therefore, this paper proposes an
information-aggregated contrastive learning
framework for learning unsupervised sentence
embeddings, termed InfoCSE. InfoCSE forces
the representation of the [CLS] position to
aggregate denser sentence information by
introducing an additional masked language
model (MLM) task and a well-designed network.
We evaluate the proposed InfoCSE on sev-
eral benchmark datasets w.r.t. the semantic
text similarity (STS) task. Experimental
results show that InfoCSE outperforms
SimCSE by an average Spearman correlation
of 2.60% on BERT-base, and 1.77% on
BERT-large, achieving state-of-the-art results
among unsupervised sentence representation
learning methods. Our code is available at
github.com/caskcsg/sentemb/tree/main/InfoCSE.
1 Introduction
Sentence embeddings aim to capture rich seman-
tic information to be applied in many downstream
tasks (Zhang et al.;Wu et al.;Liu et al.,2021). Re-
cently, researchers have started to use contrastive
learning to learn better unsupervised sentence em-
beddings (Gao et al.,2021;Yan et al.,2021;Wu
et al.,2021a;Zhou et al.,2022;Wu et al.,2021b;
Chuang et al.,2022). Contrastive learning methods
assume that effective sentence embeddings should
bring similar sentences closer and dissimilar sen-
tences farther.

The first two authors contribute equally. † Corresponding author.

Table 1: Table from SimCSE (Gao et al., 2021). The masked language model (MLM) objective brings a consistent drop to the SimCSE model in semantic textual similarity tasks. “w/” means “with”; λ is the balance hyperparameter for the MLM loss.

Model                   STS-B
SimCSE-BERTbase         86.2
  w/ MLM, λ = 0.01      85.7
  w/ MLM, λ = 0.1       85.7
  w/ MLM, λ = 1         85.1

These methods use various data augmentation methods to randomly generate different
views for each sentence and constrain one sentence
semantically to be more similar to its augmented
counterpart than any other sentence. SimCSE (Gao
et al.,2021) is the representative work of con-
trastive sentence embedding, which uses dropout
as minimal data augmentation. SimCSE en-
codes the same sentence twice into embeddings
to obtain “positive pairs" and takes other sentence
embeddings in the same mini-batch as “negatives".
There have been many improvements to SimCSE,
including enhancing positive and negative sample
building methods (Wu et al.,2021a), alleviating the
influence of improper mini-batch negatives (Zhou
et al.,2022), and learning sentence representations
that are aware of direct surface-level augmentations
(Chuang et al.,2022).
Although SimCSE and its variants have achieved
good results and can learn sentence embeddings
that can distinguish different sentences, this is not
enough to indicate that sentence embeddings already
contain the semantics of sentences well. If a sen-
tence embedding is sufficiently equivalent to the
semantics of the sentence, it should also be able to
reconstruct the original sentence to a large extent
(Lu et al.,2021). However, as shown in Table 1
from (Gao et al.,2021), experiments show that “the
masked language model objective brings a consis-
tent drop in semantic textual similarity tasks" in
SimCSE.

Figure 1: Comparison of InfoCSE and SimCSE structures. SimCSE learns sentence representations through contrastive learning on the [CLS] output embeddings of the BERT model. In addition to contrastive learning, InfoCSE designs an auxiliary network for sentence reconstruction with the [CLS] embeddings, enabling it to learn better sentence representations.

This is because the gradients of the MLM objective will easily over-update the parameters of the encoder network, thus causing disturbance to the contrastive learning task. Therefore, it is not easy to incorporate the sentence reconstruction task into contrastive sentence embedding learning.
To improve contrastive sentence embedding
learning with the sentence reconstruction task,
we propose an information-aggregated contrastive
learning framework, termed InfoCSE. The previ-
ous work (Gao et al.,2021) shares the encoder
when jointly optimizing the contrastive learning
objective and the MLM objective. Unlike (Gao et al.,
2021), we design an auxiliary network to optimize
the MLM objective, as shown in Figure 1-(b). The
auxiliary network is an 8-layer transformer network
consisting of a frozen lower six layers and an ad-
ditional two layers. The auxiliary network takes
two inputs. One is the sentence embedding of the
original text, and the other is the masked text. The
sentence embedding is a vector of the [CLS] posi-
tion encoded by 12 layers of BERT, abbreviated as
12-cls. The lower six Transformer layers encode
the masked text to output hidden states at each posi-
tion. We collectively refer to non-[CLS] positions’
representations as 6-hiddens. Then, we feed the
concatenation of the 12-cls and 6-hiddens into the
additional two layers to perform token prediction
for the masked positions. Such a design brings two
benefits.
First, since the auxiliary network only contains 8
Transformer layers and the frozen lower six lay-
ers cannot be optimized, the sentence reconstruc-
tion ability is limited. Also, the non-[CLS]
embeddings are the outputs of the 6th Trans-
former layer, with insufficient semantic infor-
mation learned. So the MLM task is forced to
rely more on the 12-cls embedding, encouraging
the 12-cls embedding to encode richer semantic
information.
Second, the gradient update of the auxiliary network
is only back-propagated to the BERT network
through the 12-cls embedding. Compared to
performing the MLM task directly on BERT,
the effect of gradient updates using the 12-cls em-
bedding will be much smaller and will not cause
large perturbations to the contrastive learning
task.
Therefore, under the InfoCSE framework, the learned 12-cls embedding can be distinguished from other sentence embeddings through contrastive learning and can also reconstruct the original sentence through auxiliary MLM training, while avoiding the disadvantage that the gradient of the MLM objective over-updates the parameters of the encoder network.
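As a concrete illustration of this gradient isolation, the short sketch below (our own, with assumed names such as aux_lower_layers; not the authors' released code) freezes a copy of BERT's lower six Transformer layers so that the auxiliary MLM loss can reach BERT's parameters only through the 12-cls embedding.

import copy
from transformers import AutoModel

bert = AutoModel.from_pretrained("bert-base-uncased")

# Frozen copy of the lower six Transformer layers for the auxiliary branch;
# with requires_grad disabled, the auxiliary MLM loss cannot update them,
# so its only trainable path back into BERT is the 12-cls embedding.
aux_lower_layers = copy.deepcopy(bert.encoder.layer[:6])
for p in aux_lower_layers.parameters():
    p.requires_grad = False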
Experiments on the semantic text similarity
(STS) tasks show that InfoCSE outperforms Sim-
CSE by an average Spearman correlation of 2.60%
and 1.77% with BERT-base and BERT-large, re-
spectively. InfoCSE also significantly outperforms
SimCSE on the open-domain information retrieval
benchmark BEIR. We also conduct a set of abla-
tion studies to illustrate the soundness of the InfoCSE
design.
2 Backgrounds
Definitions Let’s define some symbols first. Suppose we have a set of sentences x ∈ X, and a BERT model Enc with 12 Transformer layers. For a sentence x of length l, we append a special [CLS] token to it, and then feed it into BERT for encoding. The output of each layer is a list of l+1 vectors, also called hidden states. We use H to denote the last layer’s hidden states and h to refer to the vector at the [CLS] position. In addition, we use M to represent the hidden states of the 6th layer, and M_{>0} to represent the hidden states of the 6th layer except the one at the [CLS] position.
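To make these symbols concrete, the following minimal sketch (our own illustration, not the authors' code) shows how H, h, M and M_{>0} could be read off a HuggingFace BERT model with output_hidden_states enabled; note that the standard tokenizer also appends a [SEP] token, which the notation above leaves implicit.

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("An example sentence x.", return_tensors="pt")
outputs = enc(**inputs, output_hidden_states=True)

# hidden_states is a tuple of 13 tensors: the embedding output followed by the
# outputs of the 12 Transformer layers, each of shape (batch, seq_len, 768).
hidden_states = outputs.hidden_states

H = hidden_states[-1]       # last layer's hidden states
h = H[:, 0]                 # vector at the [CLS] position
M = hidden_states[6]        # hidden states after the 6th layer
M_gt0 = M[:, 1:]            # 6th-layer hidden states, excluding the [CLS] position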
SimCSE As shown in subfigure (a) of Figure 1, we plot the structure diagram of SimCSE-BERTbase. The model’s output is the [CLS] position’s vector, which is used as the semantic representation of the sentence. SimCSE uses the same sentence to construct semantically related positive pairs ⟨x, x^+⟩, i.e., x^+ = x. Specifically, SimCSE uses dropout as the minimum data augmentation, feeding the same input x twice to the encoder with different dropout masks z and z^+, and outputs the hidden states of the last layer:

H = \mathrm{Enc}(x, z), \quad H^{+} = \mathrm{Enc}(x, z^{+}) \quad (1)

A pooler layer Pooler is further applied to the hidden states as follows:

h = \mathrm{Pooler}(H), \quad h^{+} = \mathrm{Pooler}(H^{+}) \quad (2)

Then, for a mini-batch B, the contrastive learning objective w.r.t. x is formulated as:

\mathcal{L}_{cl} = -\log \frac{\exp(\mathrm{sim}(h, h^{+})/\tau)}{\sum_{h' \in B} \exp(\mathrm{sim}(h, h'^{+})/\tau)} \quad (3)

where τ is a temperature hyperparameter and sim(h, h^+) is the similarity metric, which is typically the cosine similarity function.
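As an illustration of Eq. (3), the sketch below computes the contrastive loss for one mini-batch. It is a simplified rendition under our own assumptions (an encoder kept in training mode so dropout is active, and a pooler implementing Eq. (2)); it is not SimCSE's official implementation.

import torch
import torch.nn.functional as F

def simcse_loss(encoder, pooler, batch, temperature=0.05):
    # Two forward passes over the same batch; different dropout masks play
    # the role of z and z+ (the encoder must be in training mode).
    h = pooler(encoder(**batch).last_hidden_state)       # (B, d)
    h_plus = pooler(encoder(**batch).last_hidden_state)  # (B, d)

    # Pairwise cosine similarities sim(h_i, h_j^+), scaled by the temperature.
    sim = F.cosine_similarity(h.unsqueeze(1), h_plus.unsqueeze(0), dim=-1)
    sim = sim / temperature                               # (B, B)

    # For the i-th sentence the positive is on the diagonal; all other
    # in-batch embeddings serve as negatives, which mirrors Eq. (3).
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)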
BERT’s MLM objective MLM randomly masks out a subset of input tokens and requires the model to predict them. Given a sentence x, following BERT (Devlin et al., 2018), we randomly replace 15% of the tokens with [MASK] and get a masked sentence x̂. Then x̂ will be fed into BERT, and the hidden states of the last layer H will be projected through a matrix W to predict the original token at each masked position. The process uses the cross-entropy loss function CE for optimization:

\mathcal{L}_{mlm} = \sum_{j \in \mathrm{masked}} \mathrm{CE}(H_{j} W, \hat{x}_{j}) \quad (4)

where masked denotes the set of masked positions.
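For concreteness, the sketch below gives a simplified rendering of Eq. (4); the 15% masking is applied uniformly and the projection is a single matrix W, so BERT's 80/10/10 replacement rule, special-token handling, and the extra transform layer of the real MLM head are deliberately omitted.

import torch
import torch.nn.functional as F

def mlm_loss(encoder, W, input_ids, attention_mask, mask_token_id, mask_prob=0.15):
    labels = input_ids.clone()
    # Choose roughly 15% of the positions to mask (special tokens not excluded here).
    masked = torch.rand(input_ids.shape, device=input_ids.device) < mask_prob
    labels[~masked] = -100                   # only masked positions contribute to the loss

    x_hat = input_ids.clone()
    x_hat[masked] = mask_token_id            # the masked sentence \hat{x}

    # Last-layer hidden states H of the masked sentence.
    H = encoder(input_ids=x_hat, attention_mask=attention_mask).last_hidden_state
    logits = H @ W                           # project to the vocabulary, (B, L, |V|)
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), ignore_index=-100)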
3 InfoCSE: Information-aggregated
Contrastive Learning
In this section, we first introduce the MLM pre-
training of the auxiliary network, and then describe
how to jointly train the contrastive learning ob-
jective and the MLM objective with the auxiliary
network.
3.1 Pre-training of The Auxiliary Network
As shown in Figure 1-(b), the auxiliary network
is an 8-layer transformer network consisting of a
lower six layers and an additional two layers. In
the pre-training phase of the auxiliary network, the
auxiliary network and the BERT network share the
lower six layers. We optimize two MLM objectives
simultaneously using the same input. The first is
BERT’s MLM objective, which we have already
covered. The second is the MLM objective of the
auxiliary network. We concatenate the vector at the [CLS] position of BERT’s 12th layer (h) and the hidden states of the 6th layer except for the [CLS] position (M_{>0}):

\widetilde{H} = [h, M_{>0}]

Then \widetilde{H} is fed into the additional two layers, and the output hidden states \widetilde{H}' will be used to calculate the cross-entropy loss:

\mathcal{L}_{aux} = \sum_{j \in \mathrm{masked}} \mathrm{CE}(\widetilde{H}'_{j} W, \hat{x}_{j}) \quad (5)

Therefore, the pre-training loss of the entire auxiliary network is defined as the sum of the two MLM losses:

\mathcal{L}_{pretrain} = \mathcal{L}_{aux} + \mathcal{L}_{mlm} \quad (6)

The output projection matrix W is shared between the two MLM losses.
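Putting Eq. (4)-(6) together, the sketch below shows one way the pre-training loss could be computed. It is our simplified reading of the description above, not the released implementation: extra_layers stands for the two additional Transformer layers (any module mapping a (batch, length, hidden) tensor to the same shape), W is the projection shared by both losses, and labels marks unmasked positions with -100.

import torch
import torch.nn.functional as F

def pretrain_loss(encoder, extra_layers, W, masked_ids, attention_mask, labels):
    # One forward pass over the masked sentence; at this stage the auxiliary
    # network shares BERT's lower six layers, so the 6th-layer states are reused.
    out = encoder(input_ids=masked_ids, attention_mask=attention_mask,
                  output_hidden_states=True)
    H12 = out.hidden_states[-1]              # 12th-layer hidden states
    h = H12[:, :1]                           # [CLS] vector, kept as (B, 1, d)
    M_gt0 = out.hidden_states[6][:, 1:]      # 6th-layer states without [CLS]

    # BERT's own MLM loss, Eq. (4).
    logits_mlm = H12 @ W
    l_mlm = F.cross_entropy(logits_mlm.view(-1, logits_mlm.size(-1)),
                            labels.view(-1), ignore_index=-100)

    # Auxiliary MLM loss, Eq. (5): [h, M_{>0}] through the two extra layers,
    # then the shared projection W.
    H_aux = extra_layers(torch.cat([h, M_gt0], dim=1))
    logits_aux = H_aux @ W
    l_aux = F.cross_entropy(logits_aux.view(-1, logits_aux.size(-1)),
                            labels.view(-1), ignore_index=-100)

    return l_aux + l_mlm                     # Eq. (6)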
3.2 Joint Training of MLM and Contrastive
Learning
When jointly training the contrastive learning and
MLM objectives, the auxiliary network no longer