DABERT: Dual Attention Enhanced BERT for Semantic Matching
Sirui Wang1,*, Di Liang1,2,†, Jian Song†, Yuntao Li†, Wei Wu†
Tsinghua University, Beijing, China*
Meituan Inc., Beijing, China†
{wangsirui, liangdi04, songjian20, liyuntao04, wuwei30}@meituan.com
1 These authors contributed equally to this work.
2 Corresponding author.
Abstract
Transformer-based pre-trained language models such as BERT have achieved remarkable results in semantic sentence matching. However, existing models still lack sufficient ability to capture subtle differences: minor noise such as word addition, deletion, or modification in a sentence may flip the prediction. To alleviate this problem, we propose a novel Dual Attention Enhanced BERT (DABERT) to strengthen BERT's ability to capture fine-grained differences in sentence pairs. DABERT comprises (1) a Dual Attention module, which measures soft word matches by introducing a new dual-channel alignment mechanism to model affinity and difference attention, and (2) an Adaptive Fusion module, which uses attention to learn how to aggregate the difference and affinity features and generates a vector describing the matching details of the sentence pair. We conduct extensive experiments on well-studied semantic matching and robustness test datasets, and the experimental results show the effectiveness of our proposed method.
1 Introduction
Semantic Sentence Matching (SSM) is a fundamental NLP task. The goal of SSM is to compare two sentences and identify their semantic relationship. In paraphrase identification, SSM is used to determine whether two sentences are paraphrases of each other (Madnani et al., 2012). In natural language inference, SSM is utilized to judge whether a hypothesis sentence can be inferred from a premise sentence (Bowman et al., 2015). In answer sentence selection, SSM is employed to assess the relevance between query-answer pairs and rank all candidate answers.
Figure 1: Example sentences with similar text but different semantics. S1 and S2 are a sentence pair (e.g., S1: "The secretaries knew the students." S2: "The secretaries knew the students slept.").

Across the rich history of semantic sentence matching research, there have been two main
streams of studies for solving this problem. One is to utilize a sentence encoder to convert sentences into low-dimensional vectors in the latent space, and apply a parameterized function to learn the matching score between them (Reimers and Gurevych, 2019). Another paradigm adopts an attention mechanism to calculate scores between tokens from the two sentences, and then aggregates the matching scores to make a sentence-level decision (Chen et al., 2016; Tay et al., 2017). In recent years, pre-trained models such as BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) have become much more popular and achieved outstanding performance in SSM. Recent work also attempts to enhance the performance of BERT by injecting knowledge into it, such as SemBERT (Zhang et al., 2020), UER-BERT (Xia et al., 2021), Syntax-BERT (Bai et al., 2021), and so on.
Although previous studies have provided some insights, those models do not perform well in distinguishing sentence pairs with high literal similarity but different semantics. Figure 1 demonstrates several cases suffering from this problem. Although the sentence pairs in the figure are semantically different, they are too literally similar for pre-trained language models to distinguish accurately. This could be caused by the self-attention architecture itself: self-attention focuses on using the context of a word to understand the semantics of that word, while ignoring the modeling of semantic differences between sentence pairs. De-attention (Tay et al., 2019) and Sparsegen (Martins and Astudillo, 2016) have shown that equipping the attention mechanism with a more flexible structure allows models to generate more powerful representations. In this paper, we likewise focus on enhancing the attention mechanism in transformer-based pre-trained models to better integrate difference information between sentence pairs. We hypothesize that paying more attention to fine-grained semantic differences, and explicitly modeling the difference and affinity vectors together, will further improve the performance of pre-trained models. Therefore, two questions arise naturally:
Q1: How can we equip the vanilla attention mechanism with the ability to model the semantics of fine-grained differences between a sentence pair?
Vanilla attention, also called affinity attention, pays little attention to the fine-grained differences between sentence pairs, which may lead to erroneous predictions in SSM tasks. An intuitive solution to this problem is to subtract representation vectors to harvest their semantic differences. In this paper, we propose a dual attention module that includes a difference attention alongside the affinity attention. The difference attention uses subtraction-based cross-attention to aggregate word- and phrase-level interaction differences. Meanwhile, to fully utilize the difference information, we use a dual channel to inject the difference information into the multi-head attention of the transformer, obtaining semantic representations that describe affinity and difference respectively.
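To make the dual-channel idea concrete, the sketch below is our own illustrative code, not the paper's released implementation; the scoring vector `w_diff` and the exact form of the subtraction-based score are assumptions. It contrasts a standard scaled dot-product affinity channel with a difference channel that scores each token pair from the element-wise subtraction of their projected vectors.

```python
import torch
import torch.nn.functional as F

def dual_channel_attention(q, k, v, w_diff):
    """Toy single-head sketch: an affinity channel (standard scaled
    dot-product attention) and a difference channel that scores each
    token pair from the element-wise subtraction of query and key.

    q, k, v: (batch, seq_len, d) projected hidden states
    w_diff:  (d, 1) assumed learned scoring vector for the difference channel
    """
    d = q.size(-1)

    # Affinity channel: how well token i aligns with token j.
    affinity_scores = q @ k.transpose(-2, -1) / d ** 0.5        # (b, n, n)
    h_affinity = F.softmax(affinity_scores, dim=-1) @ v

    # Difference channel: score pairs by their subtraction, so tokens
    # that differ from their aligned counterparts are emphasised.
    diff = q.unsqueeze(2) - k.unsqueeze(1)                       # (b, n, n, d)
    diff_scores = (diff @ w_diff).squeeze(-1) / d ** 0.5         # (b, n, n)
    h_difference = F.softmax(diff_scores, dim=-1) @ v

    return h_affinity, h_difference
```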
Q2: How can we fuse the two types of semantic representations into a unified representation?
A hard fusion of the two signals through an extra structure may break the representation ability of the pre-trained model, and how to inject this information softly into the pre-trained model remains a hard issue. In this paper, we propose an Adaptive Fusion module, which uses an additional attention to learn from the difference and affinity features and to generate vectors describing sentence matching details. It first inter-aligns the two signals through distinct attentions to capture semantic interactions, and then uses gated fusion to adaptively fuse the difference features. The generated vectors are further scaled by another fuse-gate module to reduce the damage to the pre-trained model caused by the injection of difference information. The final output vectors can better describe the matching details of sentence pairs.
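As a rough picture of how such a fusion could work, the module below is an illustrative sketch rather than DABERT's exact formulation; the single-head alignment step and the two sigmoid gates are assumptions standing in for the paper's guide-attention and gate modules.

```python
import torch
import torch.nn as nn

class AdaptiveFusionSketch(nn.Module):
    """Illustrative gated fusion of affinity and difference features."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # Stand-in for the paper's guide-attention alignment step.
        self.align = nn.MultiheadAttention(hidden_size, num_heads=1,
                                           batch_first=True)
        self.mix_gate = nn.Linear(2 * hidden_size, hidden_size)
        self.fuse_gate = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, h_affinity, h_difference):
        # Align the difference signal against the affinity signal.
        aligned_diff, _ = self.align(h_affinity, h_difference, h_difference)

        # Gate 1: per-dimension mix of affinity and aligned difference.
        g = torch.sigmoid(
            self.mix_gate(torch.cat([h_affinity, aligned_diff], dim=-1)))
        mixed = g * h_affinity + (1.0 - g) * aligned_diff

        # Gate 2 ("fuse gate"): limit how far the fused vector drifts from
        # the original representation, protecting pre-trained features.
        f = torch.sigmoid(
            self.fuse_gate(torch.cat([h_affinity, mixed], dim=-1)))
        return f * mixed + (1.0 - f) * h_affinity
```

The second gate interpolates back toward the original affinity representation, which is one simple way to limit how much the injected difference signal can perturb the pre-trained features.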
Our main contributions are threefold:
• We point out that explicitly modeling fine-grained difference semantics between sentence pairs can effectively benefit sentence semantic matching tasks, and we propose a novel dual attention enhanced mechanism based on BERT.
• Our proposed DABERT model uses a dual-channel attention to separately focus on the affinity and difference features in sentence pairs, and adopts a soft-integrated regulation mechanism to adaptively aggregate these two features, so that the generated vectors can better describe the matching details of sentence pairs.
• To verify the effectiveness of DABERT, we conduct experiments on 10 semantic matching datasets and several noise-perturbed datasets that test the model's robustness. The results show that DABERT achieves an absolute improvement of over 2% compared with vanilla BERT and outperforms other BERT-based models that use more advanced techniques and external data.
2 Approach
Our proposed DABERT is a modification of the original transformer structure, as shown in Figure 2. Two submodules are included in this new structure: (1) the Dual Attention Module, which uses a dual-channel mechanism in multi-head attention to match words between the two sentences; each channel uses a different attention head to calculate affinity and difference scores separately, obtaining two representations that measure affinity and difference information respectively; and (2) the Adaptive Fusion Module, which fuses the representations obtained by dual attention; it first uses guide-attention to align the two signals, then applies multiple gate modules to fuse them, and finally outputs a vector containing more fine-grained matching details. In the following sections, we explain each component in detail.
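For orientation, a hypothetical wiring of the two submodules inside one modified attention block might look like the following. It reuses the `dual_channel_attention` and `AdaptiveFusionSketch` sketches from the introduction; all names are invented for illustration, and DABERT's actual layer modifies BERT's multi-head attention internals rather than wrapping a single head.

```python
import torch
import torch.nn as nn

class DABERTLayerSketch(nn.Module):
    """Hypothetical wiring of dual attention followed by adaptive fusion."""

    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        self.k_proj = nn.Linear(hidden_size, hidden_size)
        self.v_proj = nn.Linear(hidden_size, hidden_size)
        self.w_diff = nn.Parameter(torch.randn(hidden_size, 1) * 0.02)
        self.fusion = AdaptiveFusionSketch(hidden_size)

    def forward(self, hidden_states):
        # (1) Dual attention: affinity and difference representations.
        q = self.q_proj(hidden_states)
        k = self.k_proj(hidden_states)
        v = self.v_proj(hidden_states)
        h_aff, h_diff = dual_channel_attention(q, k, v, self.w_diff)
        # (2) Adaptive fusion: one vector per token carrying
        # fine-grained matching details, same shape as the input.
        return self.fusion(h_aff, h_diff)
```

With an input of shape (batch, seq_len, hidden_size), the output keeps the same shape, so such a block could in principle slot back into the surrounding transformer layer.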
2.1 Dual Attention Module
In this module, we use two distinct attention functions, namely affinity attention and difference attention, to compare the affinities and differences of vectors between two sentences. The input of the dual attention module is a triple of (K, Q, V).