InfoCSE: Information-aggregated Contrastive Learning of Sentence
Embeddings
Xing Wu1,2,3, Chaochen Gao1,2, Zijia Lin3, Jizhong Han1, Zhongyuan Wang3, Songlin Hu1,2†
1Institute of Information Engineering, Chinese Academy of Sciences
2School of Cyber Security, University of Chinese Academy of Sciences
3Kuaishou Technology
{gaochaochen,zangliangjun,hanjizhong,husonglin}@iie.ac.cn
{wuxing,wangzhongyuan}@kuaishou.com, linzijia07@tsinghua.org.cn
Abstract
Contrastive learning has been extensively stud-
ied in sentence embedding learning, which
assumes that the embeddings of different
views of the same sentence should be closer to each other. The
constraint brought by this assumption is weak,
and a good sentence representation should also
be able to reconstruct the original sentence
fragments. Therefore, this paper proposes an
information-aggregated contrastive learning
framework for learning unsupervised sentence
embeddings, termed InfoCSE. InfoCSE forces
the representation of the [CLS] position to
aggregate denser sentence information by
introducing an additional masked language
model (MLM) task and a well-designed network.
We evaluate the proposed InfoCSE on sev-
eral benchmark datasets w.r.t. the semantic
text similarity (STS) task. Experimental
results show that InfoCSE outperforms
SimCSE by an average Spearman correlation
of 2.60% on BERT-base, and 1.77% on
BERT-large, achieving state-of-the-art results
among unsupervised sentence representation
learning methods. Our code is available at
github.com/caskcsg/sentemb/tree/main/InfoCSE.
1 Introduction
Sentence embeddings aim to capture rich seman-
tic information to be applied in many downstream
tasks (Zhang et al.;Wu et al.;Liu et al.,2021). Re-
cently, researchers have started to use contrastive
learning to learn better unsupervised sentence em-
beddings (Gao et al.,2021;Yan et al.,2021;Wu
et al.,2021a;Zhou et al.,2022;Wu et al.,2021b;
Chuang et al.,2022). Contrastive learning methods
assume that effective sentence embeddings should
bring similar sentences closer and dissimilar sen-
tences farther.

The first two authors contribute equally. † Corresponding author.

Table 1: Table from SimCSE (Gao et al., 2021). The masked language model (MLM) objective brings a consistent drop to the SimCSE model in semantic textual similarity tasks. “w/” means “with”; λ is the balance hyperparameter for the MLM loss.

Model                   STS-B
SimCSE-BERTbase         86.2
  w/ MLM, λ = 0.01      85.7
  w/ MLM, λ = 0.1       85.7
  w/ MLM, λ = 1         85.1

These methods use various data augmentation methods to randomly generate different
views for each sentence and constrain one sentence
semantically to be more similar to its augmented
counterpart than any other sentence. SimCSE (Gao
et al.,2021) is the representative work of con-
trastive sentence embedding, which uses dropout
as minimal data augmentation. SimCSE en-
codes the same sentence twice into embeddings
to obtain “positive pairs" and takes other sentence
embeddings in the same mini-batch as “negatives".
There have been many improvements to SimCSE,
including enhancing positive and negative sample
building methods (Wu et al.,2021a), alleviating the
influence of improper mini-batch negatives (Zhou
et al.,2022), and learning sentence representations
that are aware of direct surface-level augmentations
(Chuang et al.,2022).
Although SimCSE and its variants have achieved
good results and can learn sentence embeddings
that can distinguish different sentences, this is not
enough to indicate that sentence embeddings already
contain the semantics of sentences well. If a sen-
tence embedding is sufficiently equivalent to the
semantics of the sentence, it should also be able to
reconstruct the original sentence to a large extent
(Lu et al.,2021). However, as shown in Table 1
from (Gao et al.,2021), experiments show that “the
masked language model objective brings a consis-
tent drop in semantic textual similarity tasks" in
SimCSE.

Figure 1: Comparison of InfoCSE and SimCSE structures. SimCSE learns sentence representations through contrastive learning on the [CLS] output embeddings of the BERT model. In addition to contrastive learning, InfoCSE designs an auxiliary network for sentence reconstruction with the [CLS] embeddings, enabling it to learn better sentence representations.

This is because the gradients of the MLM objective will easily over-update the parameters of the encoder network, thus causing disturbance to the contrastive learning task. Therefore, it is not easy to incorporate the sentence reconstruction task into contrastive sentence embedding learning.
To improve contrastive sentence embedding
learning with the sentence reconstruction task,
we propose an information-aggregated contrastive
learning framework, termed InfoCSE. The previ-
ous work (Gao et al.,2021) shares the encoder
when jointly optimizing the contrastive learning
objective and the MLM objective. Unlike (Gao et al.,
2021), we design an auxiliary network to optimize
the MLM objective, as shown in Figure 1-(b). The
auxiliary network is an 8-layer transformer network
consisting of a frozen lower six layers and an ad-
ditional two layers. The auxiliary network takes
two inputs. One is the sentence embedding of the
original text, and the other is the masked text. The
sentence embedding is a vector of the [CLS] posi-
tion encoded by 12 layers of BERT, abbreviated as
12-cls. The lower six Transformer layers encode
the masked text to output hidden states at each posi-
tion. We collectively refer to non-[CLS] positions’
representations as 6-hiddens. Then, we feed the
concatenation of the 12-cls and 6-hiddens into the
additional two layers to perform token prediction
for the masked positions. Such a design brings two
benefits.
First, since the auxiliary network only contains 8
Transformer layers and the frozen lower six lay-
ers cannot be optimized, the sentence reconstruc-
tion ability is limited. Also, the non-[CLS]
embeddings are the outputs of the 6th Trans-
former layer, with insufficient semantic infor-
mation learned. So the MLM task is forced to
rely more on the 12-cls embedding, encouraging
the 12-cls embedding to encode richer semantic
information.
Second, the gradient update of the auxiliary network
is only back-propagated to the BERT network
through the 12-cls embedding. Compared to
performing the MLM task directly on BERT,
the effect of gradient updates using the 12-cls em-
bedding will be much smaller and will not cause
large perturbations to the contrastive learning
task.
Therefore, under the InfoCSE framework, the learned 12-cls embedding can be distinguished from other sentence embeddings through contrastive learning and can also reconstruct the original sentence through auxiliary MLM training, while avoiding the disadvantage that the gradient of the MLM objective over-updates the parameters of the encoder network.
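As a concrete illustration of this gradient isolation, the short sketch below (our own, with assumed names such as aux_lower_layers; not the authors' released code) freezes a copy of BERT's lower six Transformer layers so that the auxiliary MLM loss can reach BERT's parameters only through the 12-cls embedding.

import copy
from transformers import AutoModel

bert = AutoModel.from_pretrained("bert-base-uncased")

# Frozen copy of the lower six Transformer layers for the auxiliary branch;
# with requires_grad disabled, the auxiliary MLM loss cannot update them,
# so its only trainable path back into BERT is the 12-cls embedding.
aux_lower_layers = copy.deepcopy(bert.encoder.layer[:6])
for p in aux_lower_layers.parameters():
    p.requires_grad = False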
Experiments on the semantic text similarity
(STS) tasks show that InfoCSE outperforms Sim-
CSE by an average Spearman correlation of 2.60%
and 1.77% with BERT-base and BERT-large, re-
spectively. InfoCSE also significantly outperforms
SimCSE on the open-domain information retrieval
benchmark BEIR. We also conduct a set of abla-
tion studies to illustrate the soundness of the InfoCSE
design.
2 Backgrounds
Definitions Let’s define some symbols first. Suppose we have a set of sentences x ∈ X, and a BERT model Enc with 12 Transformer layers. For a sentence x of length l, we append a special [CLS] token to it, and then feed it into BERT for encoding. The output of each layer is a list of l+1 vectors, also called hidden states. We use H to denote the last layer’s hidden states and h to refer to the vector at the [CLS] position. In addition, we use M to represent the hidden states of the 6th layer, and M_{>0} to represent the hidden states of the 6th layer except the one at the [CLS] position.
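To make these symbols concrete, the following minimal sketch (our own illustration, not the authors' code) shows how H, h, M and M_{>0} could be read off a HuggingFace BERT model with output_hidden_states enabled; note that the standard tokenizer also appends a [SEP] token, which the notation above leaves implicit.

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("An example sentence x.", return_tensors="pt")
outputs = enc(**inputs, output_hidden_states=True)

# hidden_states is a tuple of 13 tensors: the embedding output followed by the
# outputs of the 12 Transformer layers, each of shape (batch, seq_len, 768).
hidden_states = outputs.hidden_states

H = hidden_states[-1]       # last layer's hidden states
h = H[:, 0]                 # vector at the [CLS] position
M = hidden_states[6]        # hidden states after the 6th layer
M_gt0 = M[:, 1:]            # 6th-layer hidden states, excluding the [CLS] position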
SimCSE As shown in subfigure (a) of Figure 1, we plot the structure diagram of SimCSE-BERTbase. The model’s output is the [CLS] position’s vector, which is used as the semantic representation of the sentence. SimCSE uses the same sentence to construct semantically related positive pairs ⟨x, x^+⟩, i.e., x^+ = x. Specifically, SimCSE uses dropout as the minimum data augmentation, feeding the same input x twice to the encoder with different dropout masks z and z^+, and outputs the hidden states of the last layer:

H = \mathrm{Enc}(x, z), \quad H^{+} = \mathrm{Enc}(x, z^{+}) \quad (1)

A pooler layer Pooler is further applied to the hidden states as follows:

h = \mathrm{Pooler}(H), \quad h^{+} = \mathrm{Pooler}(H^{+}) \quad (2)

Then, for a mini-batch B, the contrastive learning objective w.r.t. x is formulated as:

\mathcal{L}_{cl} = -\log \frac{\exp(\mathrm{sim}(h, h^{+})/\tau)}{\sum_{h' \in B} \exp(\mathrm{sim}(h, h'^{+})/\tau)} \quad (3)

where τ is a temperature hyperparameter and sim(h, h^+) is the similarity metric, which is typically the cosine similarity function.
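As an illustration of Eq. (3), the sketch below computes the contrastive loss for one mini-batch. It is a simplified rendition under our own assumptions (an encoder kept in training mode so dropout is active, and a pooler implementing Eq. (2)); it is not SimCSE's official implementation.

import torch
import torch.nn.functional as F

def simcse_loss(encoder, pooler, batch, temperature=0.05):
    # Two forward passes over the same batch; different dropout masks play
    # the role of z and z+ (the encoder must be in training mode).
    h = pooler(encoder(**batch).last_hidden_state)       # (B, d)
    h_plus = pooler(encoder(**batch).last_hidden_state)  # (B, d)

    # Pairwise cosine similarities sim(h_i, h_j^+), scaled by the temperature.
    sim = F.cosine_similarity(h.unsqueeze(1), h_plus.unsqueeze(0), dim=-1)
    sim = sim / temperature                               # (B, B)

    # For the i-th sentence the positive is on the diagonal; all other
    # in-batch embeddings serve as negatives, which mirrors Eq. (3).
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)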
BERT’s MLM objective MLM randomly masks out a subset of input tokens and requires the model to predict them. Given a sentence x, following BERT (Devlin et al., 2018), we randomly replace 15% of the tokens with [MASK] and get a masked sentence x̂. Then x̂ will be fed into BERT, and the hidden states of the last layer H will be projected through a matrix W to predict the original token at each masked position. The process uses the cross-entropy loss function CE for optimization:

\mathcal{L}_{mlm} = \sum_{j \in \mathrm{masked}} \mathrm{CE}(H_{j} W, \hat{x}_{j}) \quad (4)

where masked denotes the set of masked positions.
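For concreteness, the sketch below gives a simplified rendering of Eq. (4); the 15% masking is applied uniformly and the projection is a single matrix W, so BERT's 80/10/10 replacement rule, special-token handling, and the extra transform layer of the real MLM head are deliberately omitted.

import torch
import torch.nn.functional as F

def mlm_loss(encoder, W, input_ids, attention_mask, mask_token_id, mask_prob=0.15):
    labels = input_ids.clone()
    # Choose roughly 15% of the positions to mask (special tokens not excluded here).
    masked = torch.rand(input_ids.shape, device=input_ids.device) < mask_prob
    labels[~masked] = -100                   # only masked positions contribute to the loss

    x_hat = input_ids.clone()
    x_hat[masked] = mask_token_id            # the masked sentence \hat{x}

    # Last-layer hidden states H of the masked sentence.
    H = encoder(input_ids=x_hat, attention_mask=attention_mask).last_hidden_state
    logits = H @ W                           # project to the vocabulary, (B, L, |V|)
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), ignore_index=-100)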
3 InfoCSE: Information-aggregated
Contrastive Learning
In this section, we first introduce the MLM pre-
training of the auxiliary network, and then describe
how to jointly train the contrastive learning ob-
jective and the MLM objective with the auxiliary
network.
3.1 Pre-training of The Auxiliary Network
As shown in Figure 1-(b), the auxiliary network
is an 8-layer transformer network consisting of a
lower six layers and an additional two layers. In
the pre-training phase of the auxiliary network, the
auxiliary network and the BERT network share the
lower six layers. We optimize two MLM objectives
simultaneously using the same input. The first is
BERT’s MLM objective, which we have already
covered. The second is the MLM objective of the
auxiliary network. We concatenate the vector at the [CLS] position of BERT’s 12th layer (h) and the hidden states of the 6th layer except for the [CLS] position (M_{>0}):

\widetilde{H} = [h, M_{>0}]

Then \widetilde{H} is fed into the additional two layers, and the output hidden states \widetilde{H}' will be used to calculate the cross-entropy loss:

\mathcal{L}_{aux} = \sum_{j \in \mathrm{masked}} \mathrm{CE}(\widetilde{H}'_{j} W, \hat{x}_{j}) \quad (5)

Therefore, the pre-training loss of the entire auxiliary network is defined as the sum of the two MLM losses:

\mathcal{L}_{pretrain} = \mathcal{L}_{aux} + \mathcal{L}_{mlm} \quad (6)

The output projection matrix W is shared between the two MLM losses.
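Putting Eq. (4)-(6) together, the sketch below shows one way the pre-training loss could be computed. It is our simplified reading of the description above, not the released implementation: extra_layers stands for the two additional Transformer layers (any module mapping a (batch, length, hidden) tensor to the same shape), W is the projection shared by both losses, and labels marks unmasked positions with -100.

import torch
import torch.nn.functional as F

def pretrain_loss(encoder, extra_layers, W, masked_ids, attention_mask, labels):
    # One forward pass over the masked sentence; at this stage the auxiliary
    # network shares BERT's lower six layers, so the 6th-layer states are reused.
    out = encoder(input_ids=masked_ids, attention_mask=attention_mask,
                  output_hidden_states=True)
    H12 = out.hidden_states[-1]              # 12th-layer hidden states
    h = H12[:, :1]                           # [CLS] vector, kept as (B, 1, d)
    M_gt0 = out.hidden_states[6][:, 1:]      # 6th-layer states without [CLS]

    # BERT's own MLM loss, Eq. (4).
    logits_mlm = H12 @ W
    l_mlm = F.cross_entropy(logits_mlm.view(-1, logits_mlm.size(-1)),
                            labels.view(-1), ignore_index=-100)

    # Auxiliary MLM loss, Eq. (5): [h, M_{>0}] through the two extra layers,
    # then the shared projection W.
    H_aux = extra_layers(torch.cat([h, M_gt0], dim=1))
    logits_aux = H_aux @ W
    l_aux = F.cross_entropy(logits_aux.view(-1, logits_aux.size(-1)),
                            labels.view(-1), ignore_index=-100)

    return l_aux + l_mlm                     # Eq. (6)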
3.2 Joint Training of MLM and Contrastive
Learning
When jointly training the contrastive learning and
MLM objectives, the auxiliary network no longer