the semantics of the word, while ignoring the modeling of semantic differences between sentence pairs.
De-attention (Tay et al., 2019) and Sparsegen (Martins and Astudillo, 2016) have shown that equipping the attention mechanism with a more flexible structure enables models to generate more powerful representations.
In this paper, we also focus on enhancing the attention mechanism in transformer-based pre-trained models to better integrate difference information between sentence pairs. We hypothesize that paying more attention to fine-grained semantic differences and explicitly modeling the difference and affinity vectors together will further improve the performance of pre-trained models. Therefore, two questions arise naturally:
Q1: How can the vanilla attention mechanism be equipped with the ability to model the semantics of fine-grained differences between a sentence pair?
Vanilla attention, also called affinity attention here, pays little attention to the fine-grained differences between sentence pairs, which may lead to erroneous predictions on SSM tasks. An intuitive solution to this problem is to subtract representation vectors to capture their semantic differences. In this paper, we propose a dual attention module consisting of a difference attention that accompanies the affinity attention. The difference attention uses subtraction-based cross-attention to aggregate word- and phrase-level interaction differences. Meanwhile, to fully exploit the difference information, we use a dual channel to inject it into the multi-head attention of the transformer, obtaining semantic representations that describe affinity and difference respectively.
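To make the contrast between the two channels concrete, the following is a minimal PyTorch sketch of an affinity (dot-product) attention and a subtraction-based difference attention. The exact scoring function (here the scaled L1 norm of the element-wise difference), the single-head setting, and the tensor shapes are our illustrative assumptions rather than DABERT's precise formulation.

```python
import torch
import torch.nn.functional as F

def affinity_attention(q, k, v):
    # Standard scaled dot-product attention: the "affinity" channel.
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5   # (Lq, Lk)
    return F.softmax(scores, dim=-1) @ v

def difference_attention(q, k, v):
    # Subtraction-based cross-attention: the "difference" channel.
    # Each (query, key) pair is scored by the magnitude of its element-wise
    # difference, so tokens that differ most from the other sentence
    # receive the largest weights.
    diff = q.unsqueeze(-2) - k.unsqueeze(-3)                # (Lq, Lk, d)
    scores = diff.abs().sum(-1) / q.size(-1) ** 0.5         # (Lq, Lk)
    return F.softmax(scores, dim=-1) @ v

# Toy usage: sentence A has 5 tokens, sentence B has 7, hidden size 16.
q = torch.randn(5, 16)
k = v = torch.randn(7, 16)
h_aff = affinity_attention(q, k, v)      # (5, 16) affinity representation
h_diff = difference_attention(q, k, v)   # (5, 16) difference representation
```

Weighting by the size of the difference, rather than by dot-product similarity, makes tokens that are hard to align with the other sentence stand out in the resulting representation, which is the intuition behind the difference channel.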
Q2: How can the two types of semantic representations be fused into a unified representation?
A hard fusion of the two signals through extra structure may impair the representation ability of the pre-trained model, and how to inject this information into the pre-trained model softly remains a difficult issue. In this paper, we propose an Adaptive Fusion module, which uses additional attention over the difference and affinity features to generate vectors describing sentence matching details. It first inter-aligns the two signals through distinct attentions to capture their semantic interactions, and then uses gated fusion to adaptively merge the difference features. The resulting vectors are further scaled by another fuse-gate module to reduce the damage to the pre-trained model caused by injecting difference information. The final output vectors thus better describe the matching details of sentence pairs.
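The following is a minimal sketch of such an adaptive fusion step, assuming the guide-attention is a standard attention that aligns the difference signal against the affinity signal and that both gates are sigmoid-activated linear layers; the concrete layer shapes and gate forms are illustrative assumptions, not DABERT's exact architecture.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    def __init__(self, d):
        super().__init__()
        # Guide-attention aligning the difference signal to the affinity signal.
        self.align = nn.MultiheadAttention(d, num_heads=1, batch_first=True)
        self.gate = nn.Linear(2 * d, d)       # gated fusion of the two signals
        self.fuse_gate = nn.Linear(2 * d, d)  # scales the injected difference

    def forward(self, h_aff, h_diff):
        # Inter-align the two signals to capture their semantic interactions.
        h_diff_aligned, _ = self.align(h_diff, h_aff, h_aff)
        # Gated fusion: decide per dimension how much difference to keep.
        g = torch.sigmoid(self.gate(torch.cat([h_aff, h_diff_aligned], dim=-1)))
        fused = g * h_aff + (1 - g) * h_diff_aligned
        # Fuse-gate: limit how strongly the fused signal perturbs the
        # pre-trained representation.
        s = torch.sigmoid(self.fuse_gate(torch.cat([h_aff, fused], dim=-1)))
        return h_aff + s * fused

# Toy usage: batch of 2 sentence pairs, 5 tokens each, hidden size 16.
fusion = AdaptiveFusion(16)
out = fusion(torch.randn(2, 5, 16), torch.randn(2, 5, 16))   # (2, 5, 16)
```

In this sketch the residual form h_aff + s * fused keeps the pre-trained affinity representation as the backbone and lets the gate decide how much difference information to inject, mirroring the soft-injection motivation above.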
Our main contributions are threefold:
• We point out that explicitly modeling fine-grained difference semantics between sentence pairs can effectively benefit sentence semantic matching tasks, and we propose a novel dual-attention-enhanced mechanism based on BERT.
• Our proposed DABERT model uses a dual-channel attention to focus separately on the affinity and difference features of sentence pairs, and adopts a soft-integrated regulation mechanism to adaptively aggregate the two features, so that the generated vectors better describe the matching details of sentence pairs.
• To verify the effectiveness of DABERT, we conduct experiments on 10 semantic matching datasets and several noise-perturbed datasets that test the model's robustness. The results show that DABERT achieves an absolute improvement of over 2% compared with pure BERT and outperforms other BERT-based models that use more advanced techniques and external data.
2 Approach
Our proposed DABERT is a modification of the original transformer architecture, as shown in Figure 2. It contains two submodules. (1) Dual Attention Module, which uses a dual-channel mechanism in multi-head attention to match words between two sentences. Each channel uses a different attention head to calculate affinity and difference scores separately, obtaining two representations that measure affinity and difference information respectively. (2) Adaptive Fusion Module, which fuses the representations obtained by dual attention. It first uses guide-attention to align the two signals, then applies multiple gate modules to fuse them, and finally outputs a vector containing more fine-grained matching details. In the following sections, we explain each component in detail.
2.1 Dual Attention Module
In this module, we use two distinct attention func-
tions, namely affinity attention and difference at-
tention, to compare the affinities and differences
of vectors between two sentences. The input of
the dual attention module is a triple of
K, Q, V ∈