the semantics of the word, while ignoring the modeling of semantic differences between sentence pairs.
De-attention (Tay et al., 2019) and Sparsegen (Martins and Astudillo, 2016) have shown that equipping the attention mechanism with a more flexible structure enables models to generate more powerful representations.
In this paper, we also focus on enhancing the attention mechanism in transformer-based pre-trained models to better integrate difference information between sentence pairs. We hypothesize that paying more attention to fine-grained semantic differences and explicitly modeling the difference and affinity vectors together will further improve the performance of pre-trained models. Therefore, two questions arise naturally:
Q1: How can the vanilla attention mechanism be equipped with the ability to model the semantics of fine-grained differences between a sentence pair?
Vanilla attention, also called affinity attention here, pays little attention to the fine-grained differences between sentence pairs, which may lead to erroneous predictions on SSM tasks. An intuitive solution to this problem is to subtract representation vectors to capture their semantic differences. In this paper, we propose a dual attention module consisting of a difference attention that accompanies the affinity attention. The difference attention uses subtraction-based cross-attention to aggregate word- and phrase-level interaction differences. Meanwhile, to fully exploit the difference information, we use a dual channel to inject it into the multi-head attention of the transformer, obtaining semantic representations that describe affinity and difference respectively.
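To make the contrast between the two channels concrete, the following is a minimal PyTorch sketch of an affinity (dot-product) attention and a subtraction-based difference attention. The exact scoring function (here the scaled L1 norm of the element-wise difference), the single-head setting, and the tensor shapes are our illustrative assumptions rather than DABERT's precise formulation.

```python
import torch
import torch.nn.functional as F

def affinity_attention(q, k, v):
    # Standard scaled dot-product attention: the "affinity" channel.
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5   # (Lq, Lk)
    return F.softmax(scores, dim=-1) @ v

def difference_attention(q, k, v):
    # Subtraction-based cross-attention: the "difference" channel.
    # Each (query, key) pair is scored by the magnitude of its element-wise
    # difference, so tokens that differ most from the other sentence
    # receive the largest weights.
    diff = q.unsqueeze(-2) - k.unsqueeze(-3)                # (Lq, Lk, d)
    scores = diff.abs().sum(-1) / q.size(-1) ** 0.5         # (Lq, Lk)
    return F.softmax(scores, dim=-1) @ v

# Toy usage: sentence A has 5 tokens, sentence B has 7, hidden size 16.
q = torch.randn(5, 16)
k = v = torch.randn(7, 16)
h_aff = affinity_attention(q, k, v)      # (5, 16) affinity representation
h_diff = difference_attention(q, k, v)   # (5, 16) difference representation
```

Weighting by the size of the difference, rather than by dot-product similarity, makes tokens that are hard to align with the other sentence stand out in the resulting representation, which is the intuition behind the difference channel.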
Q2: How can the two types of semantic representations be fused into a unified representation?
A hard fusion of the two signals through extra structure may impair the representation ability of the pre-trained model, and how to inject this information into the pre-trained model softly remains a difficult issue. In this paper, we propose an Adaptive Fusion module, which uses additional attention over the difference and affinity features to generate vectors describing sentence matching details. It first inter-aligns the two signals through distinct attentions to capture their semantic interactions, and then uses gated fusion to adaptively merge the difference features. The resulting vectors are further scaled by another fuse-gate module to reduce the damage to the pre-trained model caused by injecting difference information. The final output vectors thus better describe the matching details of sentence pairs.
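The following is a minimal sketch of such an adaptive fusion step, assuming the guide-attention is a standard attention that aligns the difference signal against the affinity signal and that both gates are sigmoid-activated linear layers; the concrete layer shapes and gate forms are illustrative assumptions, not DABERT's exact architecture.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    def __init__(self, d):
        super().__init__()
        # Guide-attention aligning the difference signal to the affinity signal.
        self.align = nn.MultiheadAttention(d, num_heads=1, batch_first=True)
        self.gate = nn.Linear(2 * d, d)       # gated fusion of the two signals
        self.fuse_gate = nn.Linear(2 * d, d)  # scales the injected difference

    def forward(self, h_aff, h_diff):
        # Inter-align the two signals to capture their semantic interactions.
        h_diff_aligned, _ = self.align(h_diff, h_aff, h_aff)
        # Gated fusion: decide per dimension how much difference to keep.
        g = torch.sigmoid(self.gate(torch.cat([h_aff, h_diff_aligned], dim=-1)))
        fused = g * h_aff + (1 - g) * h_diff_aligned
        # Fuse-gate: limit how strongly the fused signal perturbs the
        # pre-trained representation.
        s = torch.sigmoid(self.fuse_gate(torch.cat([h_aff, fused], dim=-1)))
        return h_aff + s * fused

# Toy usage: batch of 2 sentence pairs, 5 tokens each, hidden size 16.
fusion = AdaptiveFusion(16)
out = fusion(torch.randn(2, 5, 16), torch.randn(2, 5, 16))   # (2, 5, 16)
```

In this sketch the residual form h_aff + s * fused keeps the pre-trained affinity representation as the backbone and lets the gate decide how much difference information to inject, mirroring the soft-injection motivation above.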
Our main contributions are threefold:
• We point out that explicitly modeling fine-grained difference semantics between sentence pairs can effectively benefit sentence semantic matching tasks, and we propose a novel dual-attention-enhanced mechanism based on BERT.
• Our proposed DABERT model uses a dual-channel attention to focus separately on the affinity and difference features of sentence pairs, and adopts a soft-integrated regulation mechanism to adaptively aggregate the two features, so that the generated vectors better describe the matching details of sentence pairs.
• To verify the effectiveness of DABERT, we conduct experiments on 10 semantic matching datasets and several noise-perturbed datasets that test the model's robustness. The results show that DABERT achieves an absolute improvement of over 2% compared with pure BERT and outperforms other BERT-based models that use more advanced techniques and external data.
2 Approach
Our proposed DABERT is a modification of the original transformer architecture, as shown in Figure 2. It contains two submodules. (1) Dual Attention Module, which uses a dual-channel mechanism in multi-head attention to match words between two sentences. Each channel uses a different attention head to calculate affinity and difference scores separately, obtaining two representations that measure affinity and difference information respectively. (2) Adaptive Fusion Module, which fuses the representations obtained by dual attention. It first uses guide-attention to align the two signals, then applies multiple gate modules to fuse them, and finally outputs a vector containing more fine-grained matching details. In the following sections, we explain each component in detail.
2.1 Dual Attention Module
In this module, we use two distinct attention func-
tions, namely affinity attention and difference at-
tention, to compare the affinities and differences
of vectors between two sentences. The input of
the dual attention module is a triple of
K, Q, V ∈